8
Deploying DeepSeek Models
In the previous chapter, we distilled and fine-tuned smaller, domain-specific models that you could run on modest hardware and within strict privacy boundaries. That work is optimized for efficiency and control at a smaller scale. This chapter takes the complementary step of deploying full-parameter DeepSeek models (V3 and R1) as dependable production services.
Deployment is the bridge from research to production. It forces concrete choices about memory footprint, throughput, and operational risk, and DeepSeek’s architectures magnify these trade-offs: V3’s Mixture-of-Experts (MoE) design stresses VRAM placement, while R1’s extended reasoning inflates token counts and time-to-first-token. The right path depends on your constraints.
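To make the memory-footprint pressure concrete, here is a rough back-of-envelope sketch of the VRAM needed just to hold full-parameter MoE weights. The parameter counts reflect DeepSeek-V3's published figures (671B total parameters, roughly 37B activated per token); the 80 GB accelerator size and the precision choices are illustrative assumptions, and the estimate ignores KV cache, activations, and framework overhead.

```python
# Back-of-envelope VRAM estimate for serving full-parameter MoE weights.
# Parameter counts are DeepSeek-V3's published figures; all other numbers
# (precisions, 80 GB accelerators) are illustrative assumptions.

def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """GiB needed to hold the weights alone (no KV cache or activations)."""
    return num_params * bytes_per_param / 1024**3

TOTAL_PARAMS = 671e9   # every expert must be resident to serve requests
ACTIVE_PARAMS = 37e9   # activated per token: shapes compute cost, not VRAM

for label, bytes_pp in [("FP8", 1.0), ("BF16", 2.0)]:
    gib = weight_memory_gib(TOTAL_PARAMS, bytes_pp)
    gpus = gib / 80  # assuming 80 GB accelerators (e.g. H100-class)
    print(f"{label}: ~{gib:,.0f} GiB of weights -> at least {gpus:.0f} x 80 GB GPUs")
```

The key point the sketch surfaces: because MoE routing activates only a fraction of the experts per token, compute scales with the ~37B active parameters, but VRAM must still accommodate all 671B, which is why expert placement across GPUs dominates V3 deployment planning.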