Running LLM Inference on Kubernetes: What It Actually Takes
Security Boulevard, Friday, June 5th, 2026
Running LLM inference on Kubernetes breaks standard web-app patterns due to GPU state, slow startups and specialized scheduling.
This Fairwinds-authored piece explains why running LLM inference on Kubernetes does not work like a typical web app. The inference pipeline has three stages, each with a different resource profile: tokenization (CPU-bound), pre-fill (highest GPU demand), and decode (memory-bound).
Standard Kubernetes patterns break for three reasons: inference pods carry large state, they can take 15-30 minutes to start while loading model weights, and in disaggregated deployments they have codependencies that default scheduling ignores. The cost of getting this wrong is high: a 27-billion-parameter model may need 24-32 GB of GPU memory just to load, and multi-GPU instances cost tens of dollars per hour.
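
As a rough illustration of where a figure like 24-32 GB can come from, the sketch below estimates memory for the model weights alone as parameter count times bytes per parameter. The precision names and byte sizes are assumptions chosen for illustration; the article does not state which precision its estimate uses, and real deployments also need room for the KV cache, activations and runtime overhead.

```python
# Back-of-the-envelope estimate of GPU memory needed for model weights only.
# The precisions and byte sizes below are illustrative assumptions,
# not figures taken from the article.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Return approximate weight memory in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    params = 27e9  # the 27-billion-parameter model from the text
    for precision in BYTES_PER_PARAM:
        print(f"{precision}: {weight_memory_gb(params, precision):.1f} GB")
    # fp16: 54.0 GB
    # int8: 27.0 GB
    # int4: 13.5 GB
```

Under these assumptions, the 24-32 GB range is roughly what about one byte per parameter would give, while 16-bit weights would need about twice as much before any serving overhead is counted.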