Chapter 17. Serving LLMs with Ollama
We’ve explored how to use transformers to download a model and put together a simple pipeline that lets you use it for inference or fine-tuning. However, I’d be remiss if I didn’t show you the open source Ollama project, which ties it all together by wrapping an LLM in an environment that you can either chat with in your terminal or run as a server that you HTTP POST to and read the output from.
Technologies like Ollama will be in the vanguard of the next generation of LLM deployments, in which models run on dedicated servers inside your data center or as dedicated processes on your computer. That will make them completely private to you.
At its core, Ollama is an open source project that simplifies the process of downloading, running, and managing LLMs on your computer. It also handles difficult nonfunctional requirements, such as memory management and model optimization, and it provides standardized interfaces for interaction, such as the ability to HTTP POST to your models.
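To make the HTTP POST idea concrete, here is a minimal sketch in Python that uses only the standard library. It assumes an Ollama server is already running on its default local port (11434) and that you have already pulled a model. The model name llama3.2 and the helper names build_request and generate are illustrative choices of mine, not anything Ollama requires.

```python
import json
import urllib.request

# Default address of a locally running Ollama server (assumes default settings).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="llama3.2", url=OLLAMA_URL):
    """Build (but do not send) an HTTP POST request for Ollama's generate endpoint.

    The model name is an assumption: substitute any model you have pulled.
    """
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one JSON object instead of a token stream
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="llama3.2", url=OLLAMA_URL, timeout=120):
    """POST the prompt to the server and return the generated text."""
    request = build_request(prompt, model, url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    # With streaming disabled, the generated text is in the "response" field.
    return body["response"]
```

With the server running, calling generate("Why is the sky blue?") should return the model’s answer as a string. Building the request separately from sending it keeps the payload easy to inspect before anything goes over the wire.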
Ollama is also a key strategic tool you should consider because it bridges the gap between cloud-based third-party services like GPT, Claude, and Gemini and locally deployed services. It is more than a local development environment: you could, for example, run it within your own data center to serve multiple internal users.
By running models locally, you can ensure the complete privacy of your data, eliminate network ...