AI in the Box: What is Inference-as-a-Service?

This is definitely not a how-to guide, but rather my attempt to build a top-down picture of a system we can call “local model inference in production”. I am doing this by looking for the right questions and asking them. You all remember The Hitchhiker's Guide to the Galaxy, right? Having an answer does not get you anywhere if you do not have the right question.

So what does the local model inference stack actually include? Which subset of engineering building blocks and functionality deserves that name? How should it be configured for enterprise production use?

What does local inference look like in theory?

Building an open-source local AI model inference engine involves incorporating several core components, each playing a key role in ensuring that the model functions effectively and efficiently. Here is how an inference engine works in theory, at a high level:

  1. Load Model: the model loader reads and loads the trained model into memory.
  2. Preprocessing: input data is prepared, such as tokenization for text or resizing for images.
  3. Run Inference: preprocessed data is fed to the model using the inference engine, which utilizes framework-specific libraries.
  4. Optimize Execution: apply optimizations such as pruning, adding or removing layers, and quantization to speed up inference.
  5. Hardware Utilization: take advantage of hardware-specific capabilities, such as GPU acceleration, and select instances or clusters accordingly.
  6. Postprocessing: format the raw output into meaningful data, e.g., labels or bounding boxes.
  7. Interface Handling: the input-output interface handles incoming requests and returns model predictions.
  8. Log and Monitor: track the requests, performance metrics, and errors.
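
To make the steps above more concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical: the "model" is just a keyword-weight table standing in for trained weights, and the class and method names are my own. Steps 4 and 5 (optimization and hardware utilization) are left out on purpose, since they depend heavily on the framework and hardware in use.

```python
import time
from dataclasses import dataclass, field

# Hypothetical toy "model": a keyword-weight table standing in for trained weights.
TOY_WEIGHTS = {"good": 1.0, "great": 2.0, "bad": -1.0, "awful": -2.0}


@dataclass
class InferenceEngine:
    weights: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def load_model(self, weights):
        # Step 1: load the trained model into memory.
        self.weights = dict(weights)

    def preprocess(self, text):
        # Step 2: tokenize the raw input (lowercase, split on whitespace).
        return text.lower().split()

    def run(self, tokens):
        # Step 3: run inference (here: sum the per-token weights).
        return sum(self.weights.get(token, 0.0) for token in tokens)

    def postprocess(self, score):
        # Step 6: turn the raw score into a label.
        if score > 0:
            return "positive"
        if score < 0:
            return "negative"
        return "neutral"

    def handle(self, text):
        # Step 7: request/response interface. Step 8: log every request.
        start = time.perf_counter()
        label = self.postprocess(self.run(self.preprocess(text)))
        self.log.append({
            "input": text,
            "label": label,
            "latency_s": time.perf_counter() - start,
        })
        return label


engine = InferenceEngine()
engine.load_model(TOY_WEIGHTS)
print(engine.handle("A great and good day"))  # prints "positive"
```

In a real engine each method would be a separate component (model loader, tokenizer, runtime, API server, metrics exporter), but the order of the calls inside `handle` stays roughly the same.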

In terms of how far Inference-as-a-Service can be abstracted away, the companies currently doing this seem to provide two levels of local system configuration:

  • API-only experience: no GPU/CPU or hardware customization available here
  • Partial customization of the local inference system. In this case you can typically customize two blocks:
    • Container image configuration: defining images in Python or in a YAML file
    • GPU resources: defining which GPUs to use, in which clusters, and for which tasks
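
As an illustration only, the two customizable blocks could be expressed in Python along these lines. The class names, field names and default values are all assumptions of mine and do not correspond to any particular provider's API.

```python
from dataclasses import dataclass, field


@dataclass
class ImageConfig:
    # Hypothetical container image block: base image plus extra packages.
    base_image: str = "python:3.11-slim"
    pip_packages: list = field(default_factory=list)


@dataclass
class GpuConfig:
    # Hypothetical GPU block: which GPU type, how many, and in which cluster.
    gpu_type: str = "A100"
    count: int = 1
    cluster: str = "default"


@dataclass
class DeploymentConfig:
    name: str
    image: ImageConfig
    gpu: GpuConfig

    def validate(self):
        # Reject obviously broken configurations before anything is deployed.
        if self.gpu.count < 1:
            raise ValueError("gpu.count must be at least 1")
        if not self.image.base_image:
            raise ValueError("image.base_image must not be empty")
        return self


config = DeploymentConfig(
    name="summarizer",
    image=ImageConfig(pip_packages=["torch", "transformers"]),
    gpu=GpuConfig(gpu_type="A100", count=2, cluster="eu-west"),
).validate()
```

The same structure maps naturally onto a YAML file with `image` and `gpu` sections, which is the other option mentioned above.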

So why are exactly these two blocks the ones exposed for customization? Most probably because they are the components that most directly determine the model's performance and the environment it runs in. But what else can be customized?

Other customizable aspects include model deployment logic (e.g. batching strategies), pre/post-processing data pipelines (e.g., normalization, tokenization), model optimization techniques (e.g. quantization), networking and APIs (latency and security), and memory/storage configurations (sharing memory and caching), all of which further enhance the adaptability and performance of the system.
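
To show what one of these knobs does, here is a toy sketch of quantization: symmetric linear quantization of a list of floats to integers in the range [-127, 127]. It is a simplified illustration in plain Python with hypothetical function names, not how any specific framework implements it.

```python
def quantize_int8(values):
    """Symmetric linear quantization of floats to integers in [-127, 127]."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # One scale for the whole list; fall back to 1.0 if every value is zero.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(v / scale) for v in values], scale


def dequantize(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# The round trip loses at most half a quantization step per value.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, max_error <= scale / 2 + 1e-12)
```

The trade-off is the usual one: each weight now fits in one byte instead of four or eight, at the cost of a small, bounded rounding error.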

In general, where should Inference-as-a-Service begin and end?

Moving bottom-up now, let’s look at three levels of abstraction for running models locally, to see how far the abstraction can go:

  • Model Deployment: streamlined deployment process for pre-trained and fine-tuned models.
  • Inference Optimization: advanced features such as batching, caching, and distributed processing for high-throughput environments.
  • Training & Fine-Tuning: support for distributed training and fine-tuning of models across multiple nodes and frameworks (PyTorch, TensorFlow).
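
The batching and caching named under Inference Optimization can be sketched in a few lines. This is a simplified illustration: `toy_model` is a hypothetical stand-in for a real model call, and the batching here is static (fixed-size chunks) rather than the dynamic, time-windowed batching that production servers tend to use.

```python
from functools import lru_cache

# Counts how many times the (hypothetical) model is actually invoked.
calls = {"model": 0}


def toy_model(batch):
    # Stand-in for a real model: returns the length of each prompt.
    calls["model"] += 1
    return [len(prompt) for prompt in batch]


def make_batches(items, max_batch_size):
    """Split a list of requests into chunks of at most max_batch_size."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    return [items[i:i + max_batch_size]
            for i in range(0, len(items), max_batch_size)]


def predict_many(prompts, max_batch_size=4):
    # Batching: one model call per chunk instead of one per prompt.
    results = []
    for batch in make_batches(prompts, max_batch_size):
        results.extend(toy_model(batch))
    return results


@lru_cache(maxsize=1024)
def cached_predict(prompt):
    # Caching: repeated prompts are answered without calling the model again.
    return toy_model([prompt])[0]


print(predict_many(["a", "bb", "ccc"], max_batch_size=2))  # prints [1, 2, 3]
```

With `max_batch_size=2`, three prompts cost two model calls instead of three, and calling `cached_predict` twice with the same prompt costs only one.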

For me the question here is: which level(s) of abstraction should Inference-as-a-Service cover? How does the answer depend on a company's tech and infrastructure stack, its data privacy requirements, its need for composability and control, and the regulations of its jurisdiction?

What if we go deeper in the infrastructure?

Further questions come up when we dive into how those levels of abstraction can be organized in terms of the infrastructure.

  1. Inference Server Strategy: build a custom inference server or rely on existing solutions?
  2. Model Storage: should we integrate with public model repositories (e.g., Hugging Face) or build private storage?
  3. Model Transport & Optimization: what protocols should we support for transporting and optimizing models?
  4. Model Reference for Implementation: which models should we support by default (e.g., Llama3, GPT-based models)?
  5. Hardware Requirements: what hardware should be our primary focus for model inference and training?
  6. Kubernetes: do we focus on Kubernetes as a necessary technology for implementing the platform?

Building a use-case: what can the boundaries be?

Defining boundaries is always hard. If you build a local model inference engine, which lines should be drawn, and where? As I understand it, such an engine has two generic levels: everything related to model handling and everything related to data handling.

Model handling:

  • Storage in a private repository
  • Fine-tuning
  • Serving

Data handling:

  • Tools for data scientists and ML engineers, e.g. notebooks
  • Data pipeline support
  • Data management, e.g. external dataset management, data augmentation

An effective local inference engine can provide both levels as an AI-in-the-box. On the other hand, this may turn out to be too complicated (for example, by locking in the company that uses it), in which case only the model handling level needs to be abstracted away. This is one of the big questions the inference-as-a-service niche is working on now.
