Chapter 6. Inference APIs
You’ve already learned about AI and the many types of models, deployed some of those models locally (where possible), and tested them with queries. When it is time to put a model to use, however, you need to expose it properly, follow your organization’s best practices, and give developers an easy way to consume it.
An inference API helps solve these problems, making models accessible to all developers. This chapter explores how to expose an AI/ML model by using an inference API in Java.
What Is an Inference API?
An inference API lets developers send data over a protocol such as HTTP, gRPC, or Kafka to a server where an ML model is deployed and receive predictions or classifications in return. In practice, every time you access cloud-hosted models such as those from OpenAI or Gemini, or locally deployed models through Ollama, you do so through an inference API.
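To make the request/response shape concrete, here is a minimal sketch of an HTTP inference call using only the JDK (Java 11 or later). Everything specific in it is an assumption made for illustration: the `/predict` path, the `{"inputs":[...]}` payload, the `{"prediction":0.5}` reply, and the in-process stub server that stands in for a real model server. Real inference APIs define their own endpoints and schemas.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.stream.Collectors;

class InferenceApiSketch {

    // Serializes the features into a hypothetical payload: {"inputs":[...]}
    static String toJsonBody(double[] features) {
        String values = Arrays.stream(features)
                .mapToObj(Double::toString)
                .collect(Collectors.joining(","));
        return "{\"inputs\":[" + values + "]}";
    }

    // Builds the POST request that carries the input data to the model server.
    static HttpRequest buildRequest(URI endpoint, double[] features) {
        return HttpRequest.newBuilder(endpoint)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(toJsonBody(features)))
                .build();
    }

    // Sends the request and returns the raw response body (the prediction).
    static String predict(HttpClient client, URI endpoint, double[] features)
            throws IOException, InterruptedException {
        HttpResponse<String> response = client.send(
                buildRequest(endpoint, features),
                HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Inference failed with HTTP " + response.statusCode());
        }
        return response.body();
    }

    // Stand-in for a model server: always answers with a fixed, made-up prediction.
    static HttpServer startStubServer() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("localhost", 0), 0);
        server.createContext("/predict", exchange -> {
            exchange.getRequestBody().readAllBytes(); // drain the request body
            byte[] body = "{\"prediction\":0.5}".getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = startStubServer();
        try {
            URI endpoint = URI.create(
                    "http://localhost:" + server.getAddress().getPort() + "/predict");
            HttpClient client = HttpClient.newBuilder()
                    .version(HttpClient.Version.HTTP_1_1)
                    .build();
            System.out.println(predict(client, endpoint, new double[] {1.0, 2.5}));
        } finally {
            server.stop(0);
        }
    }
}
```

The client side (`buildRequest` and `predict`) is the part a consumer of an inference API would write. The stub server exists only so the sketch runs without an external service, and a gRPC or Kafka transport would follow the same send-data, receive-prediction pattern with different plumbing.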
Even though it is common these days to use large models trained by big corporations like Google, IBM, or Meta, mostly for LLM use cases, you might need small custom-trained models to solve one specific problem for your business. Usually, these models are developed by your organization’s data scientists, and you must write the code that runs inference against them.
For example, suppose you are working for a bank, and data scientists have trained a custom model to detect whether a credit card transaction can be considered fraud. The model is a predictive AI model in ONNX format with six input parameters ...