Glaive

Batch Inference API

A distributed system for running batch inference operations across GPU providers.

Go · Python · Bash · vLLM · Kubernetes · Temporal

To generate datasets with millions of rows cost-effectively and on schedule, I needed to build a distributed inference system that could orchestrate GPUs across multiple cloud providers. The Batch Inference API powered Glaive's synthetic data generation at scale.

The previous inference pipeline had two problems: cost and throughput. We relied on a single provider that abstracted away the underlying hardware and charged a significant markup. It also had scaling limitations: new GPUs took a long time to provision and weren't always available.

The solution needed to (a) run inference on any GPU provider, (b) manage the hardware directly, and (c) scale horizontally quickly.
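
Requirement (a) effectively calls for a thin abstraction over heterogeneous GPU backends. A minimal sketch of what such an interface could look like in Go, with hypothetical type and method names (the real implementation is proprietary and not shown here):

// Hypothetical sketch of a provider abstraction; names and signatures are
// illustrative, not the production interfaces.
import (
	"context"
	"time"
)

// GPUSpec describes the hardware a job needs.
type GPUSpec struct {
	GPUType string // e.g. "A100-80GB"
	Count   int
	VRAMGB  int
}

// Quote is one provider's current offer for a given spec.
type Quote struct {
	Provider     string
	PricePerHour float64
	ProvisionETA time.Duration
}

// Provider is implemented once per backend: a Kubernetes cluster, raw VMs,
// a spot-instance market, and so on.
type Provider interface {
	// Quote returns current pricing and availability for the requested hardware.
	Quote(ctx context.Context, spec GPUSpec) (Quote, error)
	// Provision boots instances running the worker binary and returns their IDs.
	Provision(ctx context.Context, spec GPUSpec, count int) ([]string, error)
	// Terminate releases instances back to the provider.
	Terminate(ctx context.Context, ids []string) error
}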

How It Worked

Job Request
    ↓
Determine Hardware Needs (model size → GPU requirements)
    ↓
Query Multiple Providers (price comparison)
    ↓
Deploy Workers (cheapest option)
    ↓
    ├─→ Provider A (Kubernetes)
    ├─→ Provider B (VMs)
    └─→ Provider C (Spot Instances)
         ↓
    Workers pull prompts in batches
         ↓
    Results written back to dataset

The server receives a job and determines hardware requirements based on model size. We query multiple providers for pricing, deploy to the cheapest option, and distribute our worker binary. Workers boot, load the model, and start processing batches of prompts.
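
As a rough illustration of the sizing and price-comparison step, building on the hypothetical Provider and GPUSpec types sketched earlier. The 2-bytes-per-parameter heuristic and the 20% KV-cache headroom are assumptions made for the example, not the production sizing logic:

import (
	"context"
	"fmt"
)

// chooseDeployment estimates VRAM from model size and picks the cheapest
// provider that can supply it. Sizing heuristic (an assumption): ~2 bytes
// per parameter for fp16 weights, plus ~20% headroom for the KV cache.
func chooseDeployment(ctx context.Context, paramsBillions float64, providers []Provider) (Provider, GPUSpec, error) {
	vramGB := int(paramsBillions*2.0*1.2) + 1 // weights + headroom, rounded up
	spec := GPUSpec{
		GPUType: "A100-80GB",          // illustrative default GPU type
		VRAMGB:  vramGB,
		Count:   (vramGB + 79) / 80,   // number of 80 GB GPUs needed
	}

	var best Provider
	var bestQuote Quote
	for _, p := range providers {
		q, err := p.Quote(ctx, spec)
		if err != nil {
			continue // provider can't supply this hardware right now; skip it
		}
		if best == nil || q.PricePerHour < bestQuote.PricePerHour {
			best, bestQuote = p, q
		}
	}
	if best == nil {
		return nil, spec, fmt.Errorf("no provider could supply %d GPU(s)", spec.Count)
	}
	return best, spec, nil
}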

Jobs started with a single GPU instance. Based on throughput, we'd calculate how many instances were needed to finish by the deadline and scale accordingly. We'd roll over to cheaper instances as they became available and pass instances between jobs when possible.
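
The scaling decision itself reduces to simple arithmetic over measured throughput and the time remaining; a sketch with illustrative names:

import (
	"math"
	"time"
)

// instancesNeeded works out how many workers should be running to finish the
// remaining rows before the deadline, given the throughput measured so far.
func instancesNeeded(rowsRemaining int, rowsPerInstancePerHour float64, deadline time.Time) int {
	hoursLeft := time.Until(deadline).Hours()
	if hoursLeft <= 0 {
		hoursLeft = 1 // past or imminent deadline: scale as if one hour remains
	}
	needed := int(math.Ceil(float64(rowsRemaining) / (rowsPerInstancePerHour * hoursLeft)))
	if needed < 1 {
		needed = 1 // keep at least the single seed instance running
	}
	return needed
}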

The system was fault-tolerant by design. Workers grabbed small batches of ~250 prompts and wrote results back to a central server that tracked which rows had been generated. That state was committed to a database with regular backups to S3, so a failed worker lost only a small number of rows, and even a failure of the API itself meant regenerating only a manageable number.
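
A minimal sketch of that worker loop, with illustrative claim/submit callbacks standing in for the real (proprietary) API:

import "context"

const batchSize = 250 // small batches bound how much work a crashed worker can lose

// Batch is an illustrative shape for a claimed slice of the dataset.
type Batch struct {
	JobID   string
	RowIDs  []int64
	Prompts []string
}

// workerLoop pulls batches until the central server has nothing left to hand out.
// claim returns nil when every row has been generated; submit reports results so
// the server can mark those rows complete.
func workerLoop(
	ctx context.Context,
	claim func(ctx context.Context, n int) (*Batch, error),
	submit func(ctx context.Context, b *Batch, outputs []string) error,
	infer func(prompts []string) []string,
) error {
	for {
		b, err := claim(ctx, batchSize)
		if err != nil {
			return err
		}
		if b == nil {
			return nil // job finished
		}
		outputs := infer(b.Prompts)
		// Once submit succeeds the server records these rows as generated, so a
		// worker crash costs at most one unreported batch.
		if err := submit(ctx, b, outputs); err != nil {
			return err
		}
	}
}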

By spreading across providers, we could select the ideal hardware for each job (fastest, cheapest, or most efficient) while scaling beyond what any single provider could offer. This gave us a 10x improvement in dataset size and generation speed with greatly reduced costs. We scaled from ~50k row datasets to 20M+ rows, enabling task-specific fine-tunes that were both cheaper and higher-performing than general-purpose models.


Note: The source code for this project is proprietary and not publicly available.