
Best Practices for Deploying Open-Source LLMs on Kubernetes with NVIDIA GPUs
Modern large language models (LLMs) can be self-hosted on Kubernetes clusters with NVIDIA GPUs to maintain control over data and cost. However, these models are resource-intensive, requiring careful planning of cluster setup, GPU scheduling, and performance optimizations. This report covers best practices for developers to deploy open-source LLMs on Kubernetes, including recommended models, cluster configuration, GPU resource strategies, performance tuning, serving frameworks, common challenges, and security/scalability considerations.
Recommended Open-Source LLMs for Self-Hosting
Open-source LLMs offer freely available model weights that can be deployed on your own infrastructure. Popular choices include:
- Llama 2 (Meta) – A series of 7B, 13B, and 70B-parameter models available for research and commercial use (Choosing an LLM: The 2024 getting started guide to open source LLMs | Elastic Blog). Llama 2 provides strong general-purpose performance and can be fine-tuned for various tasks.
- GPT-NeoX-20B and GPT-J-6B (EleutherAI) – Autoregressive transformer models with 20 billion and 6 billion parameters, respectively (Choosing an LLM: The 2024 getting started guide to open source LLMs | Elastic Blog). They are fully open-source GPT-3 alternatives, suitable for text generation in English and trained on the Pile dataset for broad domain coverage.
- Falcon (TII) – An Apache 2.0 licensed model available in 7B, 40B, and up to 180B parameter versions (Choosing an LLM: The 2024 getting started guide to open source LLMs | Elastic Blog). Falcon is trained on high-quality web data and offers strong multilingual and generative capabilities for enterprise use.
- BLOOM (BigScience) – A 176B-parameter multilingual LLM released by a collaboration of researchers (Choosing an LLM: The 2024 getting started guide to open source LLMs | Elastic Blog). It supports 46 languages and is useful for translation or applications requiring diverse language understanding (although the full model requires multiple high-memory GPUs).
- Dolly 2.0 (Databricks) – A 12B-parameter model derived from EleutherAI’s Pythia-12B base and fine-tuned on an instruction-following dataset. Dolly 2.0 is an open model licensed for commercial use that can follow prompts for chatbot-like interactions (Deploy LLM models on Kubernetes using OpenLLM).
- Vicuna 13B – An instruct-tuned conversational model based on LLaMA-13B, trained on ShareGPT dialogues. Vicuna achieves ~90% of ChatGPT-quality in preliminary evaluations (8 Top Open-Source LLMs for 2024 and Their Uses | DataCamp), making it a strong open-source chatbot for customer service or assistants.
These and other models like StableLM, Flan-T5, Mistral 7B, ChatGLM, StarCoder (for code) are readily available. When choosing a model, consider its license (some “open” models have research-only restrictions), the parameter count vs. available GPU memory, and the model’s suitability (e.g. multilingual, code generation) for your use case.
Kubernetes Setup and Configuration for LLM Workloads
Setting up your Kubernetes cluster for LLM inference involves enabling GPU support and configuring nodes for heavy workloads:
- GPU Drivers and Runtime: Ensure all GPU nodes have the NVIDIA driver and CUDA libraries installed, either manually or via the NVIDIA GPU Operator. These components are crucial for GPU acceleration (Deploy LLM models on Kubernetes using OpenLLM ). The GPU Operator can automate deployment of drivers, the CUDA toolkit, and the device plugin across the cluster.
- NVIDIA Device Plugin: Install the official Kubernetes device plugin for NVIDIA GPUs so that pods can request GPU resources (AI/ML in Kubernetes Best Practices: The Essentials | Wiz). The plugin advertises GPUs as schedulable resources (e.g., nvidia.com/gpu) and handles assigning GPUs to containers. Without it, the Kubernetes scheduler is not aware of GPU availability.
- Cluster Node Configuration: Use dedicated GPU node pools and label or taint them accordingly. For example, label GPU nodes with nvidia.com/gpu=true and set a pod nodeSelector or affinity rules to schedule LLM workloads on those nodes (AI/ML in Kubernetes Best Practices: The Essentials | Wiz). This prevents non-GPU workloads from landing on GPU nodes and vice versa, ensuring proper isolation.
- Container Base Image: Use images that ship the necessary CUDA libraries (e.g., NVIDIA NGC images) to avoid compatibility issues. The container’s CUDA version must be supported by the host driver – for instance, each CUDA 11.x release requires a minimum driver version per NVIDIA’s compatibility matrix (A Quick Guide to Troubleshooting Most Common LLM Issues).
- Resource Requests: Define resource requests/limits for CPU, memory, and GPUs in your pod specs. For example, requesting nvidia.com/gpu: 1 ensures the scheduler places the pod on a node with at least one free GPU (AI/ML in Kubernetes Best Practices: The Essentials | Wiz). Also allocate sufficient CPU and memory for preprocessing and the model itself (LLM containers can be memory-heavy when loading large models); a minimal pod spec illustrating these settings is sketched after this list.
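To make the above concrete, here is a minimal sketch of a pod spec that requests one GPU and targets labeled GPU nodes. The pod name, image tag, node label, and resource sizes are illustrative assumptions to adapt to your cluster.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: llm-inference                           # hypothetical name
spec:
  nodeSelector:
    nvidia.com/gpu: "true"                      # matches the node label discussed above
  tolerations:
    - key: nvidia.com/gpu                       # only needed if GPU nodes are tainted
      operator: Exists
      effect: NoSchedule
  containers:
    - name: llm-server
      image: nvcr.io/nvidia/pytorch:24.01-py3   # example CUDA-enabled base image; pick one matching your driver
      resources:
        requests:
          cpu: "4"                              # headroom for tokenization/preprocessing
          memory: 32Gi                          # loading large models is memory-heavy
          nvidia.com/gpu: 1
        limits:
          memory: 48Gi
          nvidia.com/gpu: 1                     # GPU requests and limits must be equal
```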
By preparing the cluster with proper GPU support and configuration, you create a stable foundation for running LLM inference. Verifying the setup with a simple CUDA test or nvidia-smi inside a pod can confirm that GPUs are accessible before deploying large models.
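One way to run that check is a short-lived test pod along the following lines (the pod name and CUDA image tag are examples):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test                               # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: nvidia-smi
      image: nvidia/cuda:12.2.0-base-ubuntu22.04     # example CUDA base image
      command: ["nvidia-smi"]                        # print driver version and visible GPUs, then exit
      resources:
        limits:
          nvidia.com/gpu: 1
```

If kubectl logs gpu-smoke-test shows the driver version and an attached GPU, the device plugin and scheduling are working as expected.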
GPU Scheduling and Resource Allocation Strategies
Efficient GPU scheduling is key to maximizing hardware utilization and meeting the demands of LLM workloads:
- Exclusive GPU Allocation: By default, Kubernetes schedules GPUs as indivisible resources – a container that requests 1 GPU gets exclusive use of one physical GPU (Enabling the Large Language Models Revolution: GPUs on Kubernetes). For large inference jobs (e.g., running a quantized 70B model on an 80 GB A100), this exclusivity is appropriate. You can also allocate multiple GPUs to a single pod (e.g., nvidia.com/gpu: 4) if using multi-GPU inference or model parallelism.
- Multi-Instance GPU (MIG): On NVIDIA A100/A30 GPUs, enable MIG to partition a GPU into slices (instances). For example, an A100 can be split into up to seven independent GPU instances (MIG Support in Kubernetes — Kubernetes with NVIDIA GPUs 1.0.0 documentation). Each MIG partition is exposed as a schedulable GPU resource, allowing you to run multiple smaller LLM inference pods on one physical GPU. MIG improves utilization by isolating workloads that don’t need the full GPU, securely partitioning the memory and compute for each instance (MIG Support in Kubernetes — Kubernetes with NVIDIA GPUs 1.0.0 documentation). To use MIG, configure the device plugin in MIG mode and schedule pods requesting specific MIG device types.
- Time-Slicing and MPS: Newer scheduling features allow time-sharing a GPU among pods. NVIDIA’s Kubernetes device plugin (v0.12+) supports GPU time-slicing, enabling multiple pods to use a single GPU concurrently by context-switching execution (Improving GPU Utilization in Kubernetes | NVIDIA Technical Blog). Similarly, NVIDIA’s Multi-Process Service (MPS) can be enabled to allow concurrent CUDA contexts on one GPU, which is useful for many light inference workloads (Enabling the Large Language Models Revolution: GPUs on Kubernetes). These approaches let you over-provision a GPU (fractional GPU requests) for improved throughput, at the cost of potential context-switch overhead; a sample time-slicing configuration is sketched after this list.
- Node Affinity and Anti-Affinity: Use affinity rules to schedule related pods appropriately. For example, if an LLM service consists of multiple pods (e.g., embedding model + generator), you might collocate them on the same node for performance. Conversely, use anti-affinity to spread replicas of the same model across nodes to avoid single-node bottlenecks. Always ensure that GPU resource requests are set, otherwise the scheduler could place an LLM pod on a non-GPU node by mistake, causing failures (AI/ML in Kubernetes Best Practices: The Essentials | Wiz).
- Resource Bin Packing: Aim to fully utilize each GPU. If a single LLM inference doesn’t use all the GPU’s compute (for instance, serving small queries on a 40GB A100), consider running multiple replicas or models on that GPU (via MIG or container orchestration) to increase utilization. Kubernetes won’t automatically share GPUs between pods without MIG or time-slicing enabled, so plan your deployments (or use specialized schedulers) to pack workloads onto GPUs efficiently.
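As an illustration of the time-slicing option above, the NVIDIA device plugin accepts a sharing configuration along the lines of the sketch below. The ConfigMap name, namespace, and replica count are assumptions, and the ConfigMap must be referenced from the device plugin (or the GPU Operator ClusterPolicy) to take effect; with MIG you would instead request MIG-specific resources (e.g., nvidia.com/mig-2g.10gb) in the pod spec.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config        # hypothetical name, referenced by the device plugin / GPU Operator
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4            # each physical GPU is advertised as 4 schedulable nvidia.com/gpu resources
```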
Choosing the right GPU allocation strategy depends on workload profiles. High-throughput, large-batch scenarios may use an entire GPU per model, while many smaller queries could benefit from GPU sharing. Techniques like MIG and MPS help partition GPU resources to balance utilization and cost (Enabling the Large Language Models Revolution: GPUs on Kubernetes). Always monitor GPU metrics (utilization, memory) to adjust scheduling strategies as load changes.
Performance Optimization Techniques for NVIDIA GPUs
Optimizing inference performance ensures you get the most out of expensive GPU hardware when serving LLMs:
- Lower Precision and Quantization: Running models in half-precision (FP16/BF16) or smaller integer formats can dramatically accelerate inference on NVIDIA GPUs. Quantization converts model weights to lower-bit representations (e.g., INT8 or INT4) to reduce memory and computation requirements (Unleash the full potential of LLMs: Optimize for performance with vLLM). This typically yields 2–4× speed-ups in inference with minimal impact on accuracy (often <1% loss) (Unleash the full potential of LLMs: Optimize for performance with vLLM). Use libraries like NVIDIA TensorRT, Hugging Face Optimum, or GPTQ to quantize models and leverage tensor cores for faster matrix ops.
- Optimized Inference Kernels: Use optimized libraries or runtimes tailored for transformer models. NVIDIA’s TensorRT-LLM provides optimizations like kernel fusion (combining operations to reduce launch overhead), efficient batching, and paged-attention for long sequences (Scaling LLMs with NVIDIA Triton and NVIDIA TensorRT-LLM Using Kubernetes | NVIDIA Technical Blog). By converting your model to a TensorRT engine, you can achieve lower latency and higher throughput on GPUs. These optimizations (fusion, faster memory access patterns, etc.) significantly speed up inference on large models (Scaling LLMs with NVIDIA Triton and NVIDIA TensorRT-LLM Using Kubernetes | NVIDIA Technical Blog). Similarly, frameworks like FasterTransformer or ONNX Runtime with GPU execution can provide boosts by using optimized CUDA kernels.
- Batching and Concurrent Requests: Maximize GPU utilization by processing multiple requests in parallel. LLM serving frameworks often implement dynamic batching – aggregating incoming requests so the model processes them together – which increases throughput. For example, vLLM and TGI automatically batch requests to keep the GPU busy (Unleash the full potential of LLMs: Optimize for performance with vLLM). This is especially important for smaller queries; a single request might not use all SM cores, but a batch of requests will. Adjust batch sizes (or enable auto-batching) to find a balance between latency and throughput; an example server configuration is sketched after this list.
- Token Streaming and Caching: When generating text, use techniques to avoid repeating work. Many LLM servers use a cache of attention key-values (KV cache) so that across multiple prompts or iterative generation, the model doesn’t recompute attention for previous tokens. vLLM’s PagedAttention algorithm manages the KV cache efficiently in GPU memory to enable high throughput generation without memory bloating (vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention | vLLM Blog). In practice, this means you can handle longer context or multi-turn conversations more smoothly. Enable streaming outputs (sending tokens as they’re generated) to overlap computation with network transmission, improving perceived latency for end-users.
- GPU Utilization and Parallelism: Ensure the GPU is kept busy. If using a deep learning framework like PyTorch, enable inference mode and GPU acceleration (disable gradient computations). Load the model on the GPU once and reuse the same process for many inferences to amortize initialization cost. Utilize multiple GPU streams or threads if the serving software supports it, so that while the GPU is waiting on one memory-bound task, it can compute another. Also, prefer transferring larger chunks of data less frequently – for example, move the entire input tensor in one CUDA transfer rather than many small transfers.
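To show how several of these knobs surface in practice, here is a minimal sketch of a Kubernetes Deployment serving a quantized model with vLLM’s OpenAI-compatible server. The Deployment name, image tag, model checkpoint, and flag values are illustrative assumptions; adjust them to your model and hardware.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama2-13b                          # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama2-13b
  template:
    metadata:
      labels:
        app: vllm-llama2-13b
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest         # example image; pin a version in practice
          args:
            - --model=TheBloke/Llama-2-13B-chat-AWQ   # example AWQ-quantized checkpoint
            - --quantization=awq                 # serve INT4 AWQ weights to cut memory use
            - --dtype=half                       # FP16 activations
            - --max-model-len=4096               # cap context length to bound KV-cache memory
            - --max-num-seqs=64                  # upper bound on concurrently batched sequences
            - --gpu-memory-utilization=0.90      # fraction of GPU memory for weights + KV cache
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
```

Lowering the batch and context limits favors per-request latency, while raising them favors throughput at the cost of more KV-cache memory.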
In summary, combine model optimization (quantized weights, optimized kernels) with serving-side techniques (batching, caching, parallelism) to achieve high throughput and low latency. Measure performance with profiling tools and iterate – small changes like using FP16 or increasing batch size can yield substantial improvements in tokens per second on NVIDIA GPUs.
Model Serving Frameworks for LLMs
Instead of writing a custom server, leverage existing serving frameworks that are optimized for LLM inference:
- NVIDIA Triton Inference Server (Triton) – An open-source serving system that supports multi-framework models (PyTorch, TensorFlow, ONNX, TensorRT) and multi-model deployments (Scaling LLMs with NVIDIA Triton and NVIDIA TensorRT-LLM Using Kubernetes | NVIDIA Technical Blog). Triton provides features like dynamic batching, request concurrency control, and model ensemble pipelines. It can run on GPUs or CPUs and is production-grade. For LLMs, you can convert models to TensorRT engines for maximum speed, then serve them via Triton. Triton also supports distributed deployments and autoscaling, making it suitable for enterprise use. (Use case example: serving a GPT-NeoX-20B model with GPU batching – Triton will automatically batch incoming requests to improve throughput.)
- vLLM – A high-throughput, memory-efficient LLM inference engine originally from UC Berkeley, now Linux Foundation-backed. vLLM introduces PagedAttention to optimize GPU memory usage for attention caches, achieving substantially higher throughput than standard transformer pipelines (vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention | vLLM Blog). In benchmarks, vLLM delivers up to 24× higher throughput than Hugging Face Transformers and over 3× more than previous servers like TGI under heavy load (vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention | vLLM Blog). It is easy to use (one command to serve a model) and supports popular model architectures. vLLM dynamically batches and pipelines requests, enabling it to handle many concurrent generation requests with minimal latency overhead (Unleash the full potential of LLMs: Optimize for performance with vLLM). This makes it ideal for building chatbots or real-time applications where throughput and responsiveness are critical.
- Hugging Face Text Generation Inference (TGI) – An open-source inference server in Rust/Python that is used to power HuggingFace’s own API and HuggingChat. TGI is optimized for transformer text-generation and supports many open-source models out of the box (Llama, Falcon, BLOOM, GPT-NeoX, etc.) (GitHub - huggingface/text-generation-inference: Large Language Model Text Generation Inference). It provides a simple launcher to spin up a model endpoint and implements features for production:
- Continuous batching of incoming requests to maximize throughput (GitHub - huggingface/text-generation-inference: Large Language Model Text Generation Inference),
- Tensor Parallelism to split inference across multiple GPUs (useful for speeding up large models across 2+ GPUs) (GitHub - huggingface/text-generation-inference: Large Language Model Text Generation Inference),
- Streaming SSE outputs for token streaming,
- OpenAI-compatible API (so you can query it with the same format as ChatGPT API),
- and built-in support for optimized transformer kernels (FlashAttention, etc.) and quantization methods (GitHub - huggingface/text-generation-inference: Large Language Model Text Generation Inference). TGI is highly scalable and can be integrated with Kubernetes for distributed inference; a sample Kubernetes Deployment for TGI is sketched after this list. It’s a great choice if you want a ready-to-use server that is already tuned for popular models and is actively maintained by the community.
- Other Tools: There are additional frameworks and tools depending on needs. OpenLLM (BentoML) provides a developer-friendly way to package and deploy many LLMs with a unified interface (Deploy LLM models on Kubernetes using OpenLLM). KServe (KFServing) can serve models on K8s with GPU support and has a HuggingFace model server that can be extended to LLMs. If you require custom logic or lightweight serving, you might use a web framework (FastAPI, gRPC) with the model loaded in memory – but be cautious, as you’ll need to implement your own batching and optimization in that case.
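As a sketch of what a TGI deployment might look like on Kubernetes, the manifest below shards one model across two GPUs with tensor parallelism and exposes it through a Service. The names, image tag, and model id are illustrative assumptions; a gated model such as Llama 2 would also need a Hugging Face token supplied from a Secret, and a persistent volume can be added to cache downloaded weights.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-llama2-13b                   # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi-llama2-13b
  template:
    metadata:
      labels:
        app: tgi-llama2-13b
    spec:
      containers:
        - name: tgi
          image: ghcr.io/huggingface/text-generation-inference:latest   # pin a version in practice
          args:
            - --model-id=meta-llama/Llama-2-13b-chat-hf   # example model
            - --num-shard=2                # tensor parallelism across the two requested GPUs
          ports:
            - containerPort: 80
          volumeMounts:
            - name: shm
              mountPath: /dev/shm          # sharded inference uses NCCL, which needs shared memory
          resources:
            limits:
              nvidia.com/gpu: 2            # both GPUs must be on the same node
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 1Gi
---
apiVersion: v1
kind: Service
metadata:
  name: tgi-llama2-13b
spec:
  selector:
    app: tgi-llama2-13b
  ports:
    - port: 80
      targetPort: 80
```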
Each of these frameworks abstracts away a lot of complexity (GPU management, batching, parallelism). Evaluate them based on your requirements: Triton is very flexible for multi-model and multi-framework scenarios, vLLM focuses on maximum throughput for text generation, and TGI provides a balanced, production-ready solution specifically for text generation with Transformers. In many cases, using one of these will significantly shorten development time and improve performance versus writing a custom serving stack.
Common Challenges and Troubleshooting Tips
Deploying LLMs on Kubernetes with GPUs can introduce a range of challenges. Here are common issues and tips to resolve them:
- Out-of-Memory (OOM) Errors: Large models can exceed GPU memory or even pod memory limits. If a pod crashes with CUDA OOM, first verify your GPU has enough VRAM for the model. For instance, a 13B-parameter model can require >16 GB of GPU memory (and 70B can need ~80 GB, which exceeds a single GPU). Solutions include using a GPU with more memory or activating 8-bit/4-bit quantization to shrink the model size (A Quick Guide to Troubleshooting Most Common LLM Issues). You can also reduce the context length for generation – long input prompts consume a lot of memory for the attention cache, so limit the prompt length or process long text in smaller chunks (A Quick Guide to Troubleshooting Most Common LLM Issues). If GPU memory is still an issue, consider sharding the model across two GPUs (supported by frameworks like TGI’s tensor parallelism) or offloading parts of the model to CPU memory.
- CUDA/Driver Compatibility Issues: A common stumbling block is mismatched versions between the NVIDIA driver, CUDA toolkit, and the deep learning framework (TensorFlow/PyTorch) in your container. This can lead to runtime failures where the GPU isn’t utilized at all. To troubleshoot, run nvidia-smi on the node to check the driver version and ensure your container’s CUDA version is compatible. Cross-check against NVIDIA’s compatibility matrix and the framework’s requirements (A Quick Guide to Troubleshooting Most Common LLM Issues). For example, if the node has Driver 525 (CUDA 12 capable) but your container is built with CUDA 11, it should still work (backwards compatibility), but not vice versa. If you see errors about missing CUDA symbols, it often means the container’s CUDA runtime is newer than the host driver – update the driver or use a matching container image. Using the NVIDIA GPU Operator can avoid these issues by aligning driver and container toolkit versions automatically.
- Pod Scheduling and Startup Problems: If your LLM pod stays in a Pending state, it likely means no suitable node is available – check that a GPU node is free and that you’ve requested the nvidia.com/gpu resource. If the resource is not recognized, ensure the NVIDIA device plugin is running (the GPU resource should show up in node capacity). Also verify you didn’t set resource requests beyond what any single node can satisfy (e.g., requesting 2 GPUs when each node has only 1). For multi-GPU pods, all requested GPUs must be on the same node (Kubernetes won’t split a single pod across multiple nodes), so use nodes with enough GPUs or adjust your request. If using MIG, ensure the pod requests a specific MIG device type that exists on the node (GPU Operator’s gpu-feature-discovery can label nodes with MIG profiles to help with scheduling).
- Slow Model Downloads or Initialization: Large models (tens of GBs) can take a long time to pull from remote storage (like Hugging Face) on pod startup. This can lead to readiness probes failing or long delays in scaling up. To mitigate this, consider baking the model weights into the container image (if the license allows) or using a PersistentVolume to store model files so they don’t need to download on every restart. Alternatively, use init containers to preload models or enable caching mechanisms. Ensure your pods have enough initialization time (tune readiness/liveness probe thresholds) to load the model, especially if using Triton or others that might optimize the model at startup (e.g., compiling to TensorRT can take several minutes for big models).
- Throughput or Latency Issues: If the inference latency is higher than expected, or GPU utilization is low, profile the workload. It may be processing one request at a time. In such cases, enabling batching is crucial – use the serving frameworks’ batching features or increase the client batch size. If latency is critical, adjust the max batch size to avoid queuing too long. Also check whether the model is running on the CPU due to some fallback (if CUDA isn’t available, some frameworks silently fall back to CPU and it’s easy to miss). Monitor nvidia-smi during runtime: if you see little GPU memory usage or 0% utilization, the model might not be on the GPU at all. Lastly, ensure your application isn’t saturating CPU or I/O – feeding a large model with data can become CPU-bound (e.g., tokenization of many inputs). You might need to scale up CPU requests or use faster tokenization libraries to keep the GPU fed with data.
- Logging and Debugging: Use Kubernetes events and logs to troubleshoot issues. For example, kubectl describe pod will show if a pod failed scheduling due to resource limits or if it was evicted. Application logs (stdout) can reveal Python stack traces for issues like missing model files or CUDA out-of-memory errors. Enabling debug/verbose mode in your LLM serving framework can also help pinpoint configuration issues. On the GPU side, NVIDIA’s tooling includes nvidia-smi dmon and Nsight Systems for deeper performance analysis if needed.
By anticipating these challenges, you can set up alerts or readiness checks to catch problems early. For instance, you might implement a startup probe that loads a small test prompt through the model to verify everything is working, and an automated rollback if it fails. In all cases, consulting documentation and community forums for your specific LLM and tooling can provide guidance, as many others have likely encountered similar issues.
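A minimal sketch of that startup-probe idea is shown below, combined with a persistent model cache to avoid re-downloading weights (see the slow-download bullet above). It assumes an OpenAI-compatible HTTP server on port 8000 with curl available in the image and a health endpoint at /health; the paths, thresholds, and PVC name are placeholders.

```yaml
# Fragment of a pod spec (fields under spec:); probe endpoints and thresholds are assumptions
containers:
  - name: llm-server
    # ...image, args, and resources as in the earlier examples...
    volumeMounts:
      - name: model-cache
        mountPath: /data                    # persist downloaded weights across restarts
    startupProbe:
      exec:
        command:                            # push a tiny test prompt through the model
          - sh
          - -c
          - >-
            curl -sf -X POST localhost:8000/v1/completions
            -H 'Content-Type: application/json'
            -d '{"model":"MODEL_NAME","prompt":"ping","max_tokens":1}'   # replace MODEL_NAME with the served model id
      periodSeconds: 15
      failureThreshold: 80                  # allow up to ~20 minutes for download, load, and warmup
    readinessProbe:
      httpGet:
        path: /health                       # assumes the server exposes a health endpoint
        port: 8000
      periodSeconds: 10
volumes:
  - name: model-cache
    persistentVolumeClaim:
      claimName: llm-model-cache            # hypothetical PVC holding the model files
```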
Security Considerations
Running LLM workloads in a cluster introduces security considerations at both the infrastructure and application level:
- Cluster and Pod Security: Follow Kubernetes security best practices. Apply Pod Security Standards via Pod Security Admission (the older PodSecurityPolicy API has been removed in recent Kubernetes releases) to ensure pods don’t run with unnecessarily elevated privileges (AI/ML in Kubernetes Best Practices: The Essentials | Wiz). LLM serving containers typically do not require root access – run them as a non-root user and drop Linux capabilities not needed for GPU operation; a hardening sketch follows this list. The NVIDIA device plugin and driver containers will run with high privileges, but your LLM pods should be isolated and have minimal permissions (no host network or filesystem mounts unless required).
- Network Access Control: Protect the LLM service behind appropriate network policies or API gateways. An open LLM endpoint could be abused (for example, an attacker could send extremely large inputs to consume resources). Only expose the service within your application or behind authentication if offering it externally. Use Kubernetes Network Policies to restrict which pods or external IPs can communicate with the LLM service.
- Data Privacy: Self-hosting an LLM means you control the data – ensure that sensitive data in prompts or responses is handled securely. Transmit data to the LLM over TLS if crossing node boundaries. Since all inference happens on your hardware, you avoid sending data to third-party APIs (one motivation for self-hosting). However, implement proper logging hygiene (don’t log entire confidential prompts) and consider encryption at rest if storing any conversation histories or prompt logs.
- Vulnerability Management: Use container image scanning on any images (both the base CUDA images and the model server) to detect vulnerabilities (AI/ML in Kubernetes Best Practices: The Essentials | Wiz). LLM containers often use large images with many dependencies, so regularly scan (with tools like Trivy or Clair) and update them to pick up security patches. Monitor and update the LLM framework itself – for example, security issues in libraries like huggingface/transformers or Flask (if your serving stack uses it) should be patched promptly.
- Supply Chain Trust: Only download model weights from trusted sources to avoid corrupted or malicious models, and verify checksums of model files if provided. Similarly, use official releases of frameworks (or well-vetted forks) to avoid running untrusted code. Model files distributed in pickle-based formats can execute arbitrary code when loaded, so prefer safer formats such as safetensors and apply the same zero-trust mindset to model artifacts as you would to container images.
- Multi-Tenancy Isolation: If your Kubernetes cluster serves multiple teams or models, consider dedicating specific GPU nodes to sensitive workloads. Kubernetes supports namespace-level isolation, but GPUs are a shared resource on a node. MIG can help isolate at the hardware level if multiple tenants must share one physical GPU. Also, be mindful of information leakage – while uncommon, research has shown it’s possible to infer aspects of another model’s data if sharing the same GPU (through side-channel attacks). Isolate high-security workloads on separate physical GPUs or nodes when possible.
- Monitoring and Auditing: Enable auditing on your cluster to track who schedules GPU pods or accesses the LLM service. Given the cost and sensitivity of LLM deployments, you want to know if an unauthorized job is using your GPUs. Tools like Prometheus/Grafana can alert on unusual GPU usage patterns. Additionally, integrate cluster monitoring with your SIEM to catch any signs of compromise (e.g., crypto mining containers trying to run on your GPU nodes).
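The sketch below illustrates the pod hardening and network restrictions discussed above. The labels, names, and port are assumptions tied to the earlier TGI example; adapt them to your workload.

```yaml
# Fragment of a pod template spec (fields under spec.template.spec): run the LLM container unprivileged
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  seccompProfile:
    type: RuntimeDefault
containers:
  - name: llm-server
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]              # GPU access comes from the device plugin, not extra capabilities
---
# Only pods labeled as the API tier (in the same namespace) may reach the LLM service
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-to-llm           # hypothetical name
spec:
  podSelector:
    matchLabels:
      app: tgi-llama2-13b          # the serving pods from the earlier example
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: api-gateway    # hypothetical label on allowed client pods
      ports:
        - port: 80
```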
By addressing these security areas, you can deploy LLMs with confidence that the infrastructure and data are safeguarded. The principle of least privilege, defense in depth (network, pod, and supply chain security layers), and regular audits are key to a secure LLM service in Kubernetes (AI/ML in Kubernetes Best Practices: The Essentials | Wiz).
Scalability Considerations
Scaling LLM deployments to meet demand while controlling costs is a balancing act. Kubernetes provides tools to help scale both the application and the underlying infrastructure:
- Horizontal Pod Autoscaling (HPA): Set up autoscaling for your LLM inference deployment based on appropriate metrics. Unlike typical web services that scale on CPU usage, GPU-bound workloads might require custom metrics. For example, you can scale based on the queue length of requests or GPU utilization reported by DCGM (NVIDIA’s GPU metrics exporter). In practice, teams often export a metric like “requests per second” or use Triton’s built-in metrics; these can feed an HPA to add more replicas when load increases (Scaling LLMs with NVIDIA Triton and NVIDIA TensorRT-LLM Using Kubernetes | NVIDIA Technical Blog). Ensure the HPA has some cooldown period, as loading an LLM pod is expensive – you don’t want it flapping up and down rapidly. Aim to keep each GPU busy ~70–80% at peak by adding pods as needed; a sample HPA manifest is sketched after this list.
- Cluster Autoscaler: Integrate cluster autoscaling to add/remove GPU nodes in response to those new pods. When the HPA increases replicas beyond current capacity, a Cluster Autoscaler can provision new GPU instances in the cloud (or you scale up on-prem nodes) (Deploy LLM models on Kubernetes using OpenLLM ). This elasticity is important for cost efficiency – for instance, scale down to zero GPU nodes overnight if no traffic (some workloads even scale-to-zero to completely shut off, though cold start time for LLMs must be considered). Coordinate HPA and cluster autoscaling so that new pods can actually be scheduled on new nodes promptly.
- Multi-GPU Scaling: For very large models or extreme throughput needs, you might scale within a pod by using multiple GPUs. As noted, TGI supports tensor parallelism to split a single model across GPUs (GitHub - huggingface/text-generation-inference: Large Language Model Text Generation Inference). This can reduce latency for big models (e.g., splitting a model across 2×A100 40 GB rather than one A100 80 GB can lower per-request latency). However, multi-GPU pods won’t increase overall throughput per cost as efficiently as multiple single-GPU pods (because those same GPUs could serve independent requests in parallel). Use multi-GPU inference when single-GPU latency is insufficient for a single request’s needs (e.g., real-time responses from a 175B model).
- Load Balancing: When running multiple replicas of an LLM service, use a Kubernetes Service to load-balance requests. For LLMs, you might also consider smarter routing – e.g., routing longer requests to certain pods (since they will occupy the GPU longer) or dedicating some pods to high-priority queries. Although Kubernetes Service routing is generally round-robin, you can build logic at the application level or use an API gateway to implement such routing if needed (for instance, send small queries to a smaller model).
- Stateful vs Stateless Scaling: Most LLM servers are stateless (each request independent), which simplifies horizontal scaling. One exception is if you maintain a conversation context in the server – ensure that either the context is passed in every request (making it stateless from server perspective) or use a sticky session approach. Generally, it’s easier to keep LLM inference stateless and handle state in the client or a database, so any replica can serve any request.
- Throughput vs. Latency Trade-offs: Decide on a scaling strategy based on your SLA. If you need low latency for each request, running more smaller pods (with fewer concurrent requests each) can reduce queueing delay. If you need maximum throughput and can tolerate a bit more latency, running fewer pods each handling large batches might be more efficient. Use performance testing to find the knee of the curve. For example, you might find that beyond 4 concurrent requests per GPU, latency per request grows unacceptably – that would indicate scaling out to more GPUs after 4 concurrent jobs is better.
- Cost Considerations: GPUs are expensive, so scalability isn’t just technical—it's also about cost management. Implement resource quotas or limits in your namespace to prevent accidental over-provisioning of GPUs. Use scheduling constraints to pack workloads efficiently (if one model isn’t heavily used, co-locate another light workload on the same GPU via MIG or containers, if appropriate). Monitor usage and turn off idle pods or nodes. Over time, you might employ spot instances for GPU if your workload is fault-tolerant to save money.
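As an illustration of the autoscaling setup described above, here is a sketch of an HPA driven by a custom per-pod metric. It assumes a metric such as request queue length is published through a metrics adapter (e.g., Prometheus Adapter); the metric name, Deployment name, thresholds, and windows are assumptions.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-server-hpa                     # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-llama2-13b                  # the example Deployment from earlier
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: llm_request_queue_length   # hypothetical custom metric exposed via a metrics adapter
        target:
          type: AverageValue
          averageValue: "4"                # add a replica once each pod queues more than ~4 requests
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 600      # long cooldown so expensive LLM pods don't flap
```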
By leveraging Kubernetes autoscaling mechanisms, you can dynamically adjust to workload fluctuations common in real-world applications (e.g., daytime traffic vs. nighttime). A best practice is to start with conservative limits (to avoid overwhelming the system), then gradually increase scale while observing system metrics. This ensures your LLM deployment remains responsive and cost-effective as it grows, delivering a scalable service to end-users without manual intervention in day-to-day operations.
Sources:
- Pavan Shiraguppi. “Deploy LLM models on Kubernetes using OpenLLM.” CloudRaft Blog, Aug. 14, 2023.
- Nicolas Ehrman. “AI/ML in Kubernetes Best Practices: The Essentials.” Wiz Blog, Mar. 6, 2025.
- “MIG Support in Kubernetes.” NVIDIA Docs (Kubernetes with NVIDIA GPUs), 2024.
- TrueFoundry Engineering. “Enabling the Large Language Models Revolution: GPUs on Kubernetes.” TrueFoundry Blog, Mar. 30, 2023.
- Maggie Zhang et al. “Scaling LLMs with NVIDIA Triton and NVIDIA TensorRT-LLM Using Kubernetes.” NVIDIA Technical Blog, Oct. 22, 2024.
- NVIDIA. “Improving GPU Utilization in Kubernetes.” NVIDIA Technical Blog.
- Saša Zelenović. “Unleash the full potential of LLMs: Optimize for performance with vLLM.” Red Hat Blog, Feb. 27, 2025.
- Woosuk Kwon et al. “vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention.” vLLM Blog, June 20, 2023.
- Hugging Face. “Text Generation Inference – README.” GitHub: huggingface/text-generation-inference, 2023.
- Hyperstack. “A Quick Guide to Troubleshooting Most Common LLM Issues.” Hyperstack Tutorials, 2023.
- DataCamp. “8 Top Open-Source LLMs for 2024 and Their Uses.” DataCamp Blog, Aug. 8, 2024.
- Elastic. “Choosing an LLM: The 2024 getting started guide to open source LLMs.” Elastic Blog, 2024.
- “Kubernetes Horizontal Pod Autoscaling for ML.” NVIDIA Developer Blog (developer.nvidia.com).
- Wiz Research. “Kubernetes Security for AI/ML Workloads.” Wiz Blog, 2025.