How We're Helping Olio Labs Cut Their GPU Bill in Half

~53%

Reduction in total AWS spend

~$213K

Projected annual savings

$401K to $189K

Projected 12-month TCO

Overview

"We'd already moved to spot and figured our savings were tapped out. Pump found the real lever inside our inference path, model format, GPU packing, and reprocessing, and mapped a path to roughly halve our GPU bill."

David Tingley

Founder

Olio Labs runs a data processing pipeline for rodent behavioral experiments. Their system takes raw sensor data (video from USB cameras, thermal video, CO2 readings) and transforms it into structured behavioral features and summary statistics that researchers actually use. The pipeline is built on Nextflow and runs across AWS Batch, with three independent sub-workflows handling video, thermal, and CO2 inputs in parallel.

The video sub-workflow is where most of the cost lives. It runs four GPU steps on g4dn instances through AWS Batch: YOLOv11n-pose for keypoint detection, YOLOv8n for object detection, a VAME variational autoencoder for posture analysis, and an action classification model on individual cropped clips. During a typical large run, Olio spins up around 870 GPU instances in parallel to keep up with experimental throughput.

Industry

Life Sciences / ML Infrastructure

Pump services

  • Pump Save

Use Case 1

TensorRT Model Export

Olio's pipeline runs the YOLO models using native PyTorch weights. Exporting these models to TensorRT format typically yields a 2-5x inference speedup on the same hardware. Ultralytics supports the export natively with a one-line command, so the inference scripts need only minimal changes. That makes it one of the lowest-effort optimizations on the table, and the savings drop straight to the bottom line because every reduction in inference time is a reduction in instance-hours billed. Pump recommended starting with the most parallelized GPU step, benchmarking the throughput delta against the PyTorch baseline, then rolling out to the other three GPU steps.
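
As a rough illustration of the export-then-benchmark step, the sketch below uses the Ultralytics Python API (`YOLO(...).export(format="engine")`). The weights filename, the FP16 setting, and the helper names are assumptions for illustration, not Olio's actual scripts. The export itself needs the ultralytics package, TensorRT, and an NVIDIA GPU, and because TensorRT engines are generally tied to the GPU type they were built on, it would be run on the same instance class used in production.

```python
import time
from typing import Callable, Sequence


def export_to_tensorrt(weights_path: str, half: bool = True) -> str:
    """Export Ultralytics YOLO weights to a TensorRT engine and return its path.

    Imported lazily so the benchmarking helpers below work without ultralytics.
    half=True (FP16) is an assumption; validate accuracy before adopting it.
    """
    from ultralytics import YOLO

    return YOLO(weights_path).export(format="engine", half=half)


def seconds_per_item(
    infer: Callable[[object], object], items: Sequence[object], warmup: int = 2
) -> float:
    """Average wall-clock seconds per item for one inference callable."""
    if not items:
        raise ValueError("need at least one item to benchmark")
    for item in items[:warmup]:  # warm-up calls are run but not timed
        infer(item)
    start = time.perf_counter()
    for item in items:
        infer(item)
    return (time.perf_counter() - start) / len(items)


def projected_instance_hours(baseline_hours: float, speedup: float) -> float:
    """Instance-hours after a speedup, assuming billed time scales with inference time."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline_hours / speedup


# Hypothetical usage on a GPU instance (filenames are placeholders):
#   engine_path = export_to_tensorrt("yolo11n-pose.pt")
#   pt_model, trt_model = YOLO("yolo11n-pose.pt"), YOLO(engine_path)
#   speedup = seconds_per_item(pt_model, frames) / seconds_per_item(trt_model, frames)
#   print(projected_instance_hours(baseline_hours=1000.0, speedup=speedup))
```

Measuring the speedup on Olio's own clips matters more than the headline 2-5x range, since the billed savings depend on how much of each task's wall-clock time is inference rather than video decoding or I/O.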

Use Case 2

GPU Utilization Analysis and Task Packing

With 870 GPU instances spinning up per run, even modest improvements in per-GPU throughput translate to meaningful instance-count reductions. Pump asked Olio to capture nvidia-smi output during their next batch run to characterize current utilization. Where GPU and VRAM utilization are low, which is expected for nano-class models on g4dn, the path forward is to batch multiple videos per GPU task and reduce instance count proportionally. This requires changes to both the Python inference scripts and the Nextflow process definitions, but at this scale the savings are significant.
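
To make the utilization analysis concrete, here is a minimal sketch of how captured `nvidia-smi` samples could be turned into a packing estimate. The capture command in the comment uses standard `nvidia-smi` query fields. The 80 percent targets, the sample numbers, and the packing heuristic itself are assumptions for illustration, not measurements from Olio's runs or Pump's actual model.

```python
import csv
import io
import math

# Assumed capture command, run on a GPU instance for the duration of a task:
#   nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total \
#              --format=csv,noheader,nounits -l 5 > gpu_usage.csv


def summarize(csv_text: str) -> dict:
    """Parse rows of 'util %, memory used MiB, memory total MiB' into summary stats."""
    rows = [[float(v) for v in row] for row in csv.reader(io.StringIO(csv_text)) if row]
    if not rows:
        raise ValueError("no samples found")
    return {
        "mean_util_pct": sum(r[0] for r in rows) / len(rows),
        "peak_mem_mib": max(r[1] for r in rows),
        "total_mem_mib": rows[0][2],
    }


def packing_factor(summary: dict, target_util_pct: float = 80.0,
                   mem_headroom: float = 0.8) -> int:
    """Heuristic: videos per GPU task, limited by compute and by VRAM headroom."""
    by_compute = target_util_pct / max(summary["mean_util_pct"], 1.0)
    by_memory = summary["total_mem_mib"] * mem_headroom / max(summary["peak_mem_mib"], 1.0)
    return max(1, math.floor(min(by_compute, by_memory)))


def instances_needed(current_instances: int, factor: int) -> int:
    """Instance count if each GPU task handles `factor` videos instead of one."""
    return math.ceil(current_instances / factor)


# Made-up samples: about 25% GPU utilization, 2 GB peak VRAM on a 15 GB card.
sample = "20, 1500, 15360\n30, 2000, 15360\n25, 1800, 15360\n"
factor = packing_factor(summarize(sample))
print(factor, instances_needed(870, factor))
```

The linear scaling here is optimistic. Real packing gains depend on whether video decoding or data loading becomes the bottleneck once several videos share a GPU, which is why the measured `nvidia-smi` data comes first.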

Use Case 3

Incremental Reprocessing

The cost spikes on model update days are the most expensive single events in Olio's cost profile, driven entirely by the pipeline reprocessing the entire experimental catalog whenever a new model ships. The pipeline already versions models via Nextflow config parameters, so the foundation for selective reprocessing exists. The remaining work is to tag pipeline outputs with the model versions used to produce them, then skip reprocessing when an output's model version matches the current configured version. It is the highest-effort recommendation, but the biggest individual lever.
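
The tag-and-skip logic described above could look something like the following sketch. The sidecar manifest file, its naming scheme, and the model names are hypothetical; in Olio's pipeline the current versions would come from the existing Nextflow config parameters, and the tags might just as well live in object metadata or a database.

```python
import json
from pathlib import Path


def manifest_path(output_path: Path) -> Path:
    """Sidecar file next to an output (hypothetical naming scheme)."""
    return output_path.with_name(output_path.name + ".manifest.json")


def write_manifest(output_path: Path, model_versions: dict) -> Path:
    """Tag an output with the versions of the models used to produce it."""
    path = manifest_path(output_path)
    path.write_text(json.dumps({"model_versions": model_versions}, sort_keys=True))
    return path


def needs_reprocessing(output_path: Path, current_versions: dict) -> bool:
    """True if the output is missing, untagged, or built with an outdated model."""
    path = manifest_path(output_path)
    if not output_path.exists() or not path.exists():
        return True
    recorded = json.loads(path.read_text()).get("model_versions", {})
    # Only the models this output actually used are compared, so an update to
    # an unrelated model does not invalidate it.
    return any(current_versions.get(name) != version
               for name, version in recorded.items())
```

With a check like this in front of each GPU step, a model update would reprocess only the outputs that depend on the updated model rather than the full experimental catalog.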

Pump’s impact

Stacking all three levers, the projected 12-month total cost of ownership drops from $401,348 to $188,537. That is a net savings of $212,811, or about a 53 percent reduction in total AWS spend, if all three levers land at the modeled efficiency. TensorRT alone is projected to deliver around $104,000 in annual savings, GPU task packing adds another ~$49,000, and incremental reprocessing brings in roughly $60,000 more by eliminating full-catalog reprocesses on model update events.
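
The arithmetic behind these figures can be checked in a few lines. All inputs are the numbers quoted above; the per-lever values are the rounded estimates, which is why they sum to slightly more than the net savings.

```python
baseline_tco = 401_348   # projected 12-month TCO before the changes (USD)
projected_tco = 188_537  # projected 12-month TCO with all three levers (USD)

net_savings = baseline_tco - projected_tco
reduction_pct = net_savings / baseline_tco * 100

# Rounded per-lever estimates quoted in the text (USD per year).
levers = {
    "tensorrt_export": 104_000,
    "gpu_task_packing": 49_000,
    "incremental_reprocessing": 60_000,
}
lever_total = sum(levers.values())

print(f"net savings: ${net_savings:,}")
print(f"reduction: {reduction_pct:.1f}%")
print(f"sum of rounded levers: ${lever_total:,}")
```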

The team committed to the staged rollout. TensorRT export work has started with the most parallelized step, the nvidia-smi data capture is queued for the next large batch run, and incremental reprocessing has been added to the engineering roadmap. The broader lesson: for spot-heavy GPU workloads on AWS Batch, the next layer of optimization lives inside the inference path, in model format, batch size, and GPU utilization, not in traditional levers like RIs or Savings Plans.

When the easy GPU savings are gone, look inside the inference path

Olio Labs had already moved to spot pricing. The next ~50 percent is projected to come from model format, GPU packing, and smarter reprocessing.
