
CNCF: NetEase Games achieves 30-second LLM cold start on Kubernetes via Fluid prefetching layers


CNCF published a case study from NetEase Games on 21 May 2026 (authors Haifeng Liao and Xiang Zhang) describing how the company cut load times for 70B-class LLMs from 42 minutes (direct S3 access) to under 30 seconds using the CNCF-incubated Fluid project. The key is a Fluid prefetching layer that lets teams share cached models instead of duplicating them, combined with pre-warming and scheduling that take the model load off the critical path of a cold start. The case study is relevant to anyone running serverless LLM inference with large models on Kubernetes.


This article was generated using artificial intelligence from primary sources.

On 21 May 2026, CNCF (Cloud Native Computing Foundation) published a technical case study from NetEase Games, one of China’s largest gaming companies, describing in detail how the company reduced the time to load large LLMs into its Kubernetes serving stack from 42 minutes to under 30 seconds. The authors are Haifeng Liao and Xiang Zhang from the NetEase Games infrastructure team.

What was the initial problem they were solving?

NetEase Games uses 70B-class LLMs (Llama 3, Qwen, or similar) for several production use cases: AI NPC dialogue in games, content moderation, and automatic translation. The models are too large to keep resident in memory on every node in the cluster, so they must be loaded on demand at every scaling event or pod restart.

Loading a 70B model from S3-compatible object storage directly into GPU memory took 42 minutes. This is unacceptable for a production workload: every new pod created by a scaling event or restart cannot serve traffic for 42 minutes.
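To put the 42-minute figure in perspective, the arithmetic below estimates the average throughput it implies. This is a rough sketch under one assumption that is not in the case study: the weights are stored as FP16 (2 bytes per parameter), which gives about 140 GB for 70 billion parameters.

```python
# Back-of-the-envelope throughput for loading a 70B-parameter model.
# Assumption (not from the case study): weights stored as FP16, 2 bytes each.
PARAMS = 70e9
BYTES_PER_PARAM = 2  # FP16 assumption


def required_throughput_gb_s(load_seconds: float) -> float:
    """Average end-to-end throughput (GB/s) needed to move all weights in time."""
    total_bytes = PARAMS * BYTES_PER_PARAM
    return total_bytes / load_seconds / 1e9


baseline = required_throughput_gb_s(42 * 60)  # direct S3 access
target = required_throughput_gb_s(30)         # reported final result

print(f"42 min implies about {baseline * 1000:.0f} MB/s")
print(f"30 s requires about {target:.2f} GB/s")
```

Under this assumption the baseline works out to roughly 56 MB/s end to end, while a 30-second load of the full weights would need close to 4.7 GB/s. That gap suggests why caching data close to the GPU matters more than tuning the object-storage path alone.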

What phases of optimisation did they go through?

NetEase carried out optimisation in several phases:

Phase 1 (direct S3 access): 42 minutes. This is the baseline.

Phase 2 (Fluid distributed cache): 14 minutes. NetEase deployed the CNCF Fluid project, which shares models between nodes in the cluster through P2P transfers. Instead of each pod pulling directly from S3, new pods fetch the model from neighbouring nodes that already have it cached.

Phase 3 (Fluid with local SSD cache): 3 minutes. A local SSD caching layer keeps warm copies of the most frequently used models. When a scaling event occurs, the model is already in the node's local cache, which eliminates the network transfer.

Phase 4 (pre-warming + predictive scheduling): under 30 seconds. The system predicts when a new pod will be needed, based on historical load patterns, and pre-loads the model before the pod is actually required. Predictive scheduling then assigns new pods to nodes that already have the model in memory.
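The relative gain of each phase follows directly from the reported load times. The short calculation below uses only the figures quoted above; the final phase is reported as "under 30 seconds", so 30 is treated as an upper bound and the resulting speedup as a lower bound.

```python
# Load times reported in the case study, in seconds.
phases = [
    ("Phase 1: direct S3 access", 42 * 60),
    ("Phase 2: Fluid distributed cache", 14 * 60),
    ("Phase 3: local SSD cache", 3 * 60),
    ("Phase 4: pre-warming + predictive scheduling", 30),  # upper bound
]


def speedups(rows):
    """Return (name, speedup vs previous phase, speedup vs baseline) per phase."""
    baseline = rows[0][1]
    previous = baseline
    out = []
    for name, seconds in rows:
        out.append((name, previous / seconds, baseline / seconds))
        previous = seconds
    return out


for name, step, total in speedups(phases):
    print(f"{name}: {step:.1f}x vs previous phase, {total:.1f}x vs baseline")
```

The cumulative speedups come to 3x, 14x, and at least 84x over the baseline, with each individual phase contributing a factor of roughly 3 to 6.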

What is Fluid as a CNCF project?

Fluid is a CNCF-incubated project focused on data orchestration for Kubernetes. The primary use case is accelerating access to large datasets — whether these are LLM weights, training datasets, or scientific data. Fluid abstracts the underlying storage (S3, GCS, HDFS, NFS) and provides a uniform layer with built-in caching, prefetching, and scheduling integration.
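As an illustration of that uniform layer, the sketch below builds three Fluid objects as Python dictionaries and prints them as a JSON list that `kubectl apply -f -` could consume: a `Dataset` pointing at the backing store, a cache runtime with an SSD tier, and a `DataLoad` that requests prefetching. The case study does not say which Fluid runtime NetEase uses, so `AlluxioRuntime` here is an assumption, and the dataset name, namespace, bucket URI, cache path, and quota are all hypothetical. Credentials and endpoint options are omitted, and the field names should be checked against the installed Fluid CRD version.

```python
import json

API_VERSION = "data.fluid.io/v1alpha1"


def fluid_manifests(name, namespace, bucket_uri, ssd_path, quota):
    """Build Dataset, AlluxioRuntime and DataLoad objects as plain dicts."""
    dataset = {
        "apiVersion": API_VERSION,
        "kind": "Dataset",
        "metadata": {"name": name, "namespace": namespace},
        # The mount point is the backing store that Fluid abstracts away.
        "spec": {"mounts": [{"mountPoint": bucket_uri, "name": name}]},
    }
    runtime = {
        "apiVersion": API_VERSION,
        "kind": "AlluxioRuntime",  # assumed runtime, not named in the case study
        # Same name and namespace as the Dataset, so that Fluid binds the two.
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "replicas": 2,
            "tieredstore": {
                "levels": [{"mediumtype": "SSD", "path": ssd_path, "quota": quota}]
            },
        },
    }
    dataload = {
        "apiVersion": API_VERSION,
        "kind": "DataLoad",
        "metadata": {"name": f"{name}-warmup", "namespace": namespace},
        # Ask Fluid to pull the whole dataset into the cache ahead of use.
        "spec": {
            "dataset": {"name": name, "namespace": namespace},
            "loadMetadata": True,
            "target": [{"path": "/", "replicas": 1}],
        },
    }
    return [dataset, runtime, dataload]


items = fluid_manifests(
    name="llm-70b",                                   # hypothetical
    namespace="serving",                              # hypothetical
    bucket_uri="s3://example-bucket/models/llm-70b",  # hypothetical
    ssd_path="/mnt/ssd-cache",                        # hypothetical
    quota="200Gi",                                    # hypothetical
)
# A v1 List wraps the three objects so they can be applied in one go.
print(json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2))
```

Generating the objects from one function keeps the dataset name and namespace consistent across all three, which matters because the runtime is matched to its dataset by name.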

For the LLM use case in particular, Fluid enables:

  • Pod-level affinity — the Kubernetes scheduler can prioritise placing a pod on a node where the model is already cached
  • Asynchronous prefetch — the model can be pre-loaded before the pod needs it
  • Shared cache across teams — multiple teams can share the same model without duplicating it
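The cache-aware placement in the first point can be pictured as a filter-then-rank step. The toy example below is not Fluid's or the Kubernetes scheduler's actual code; the `Node` type, node names, and model name are hypothetical and only illustrate the idea of preferring nodes that already hold the model.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_gpus: int
    cached_models: set = field(default_factory=set)


def rank_nodes(nodes, model, gpus_needed):
    """Keep nodes with enough free GPUs, then prefer those caching the model."""
    feasible = [n for n in nodes if n.free_gpus >= gpus_needed]
    # Sort key: cache hit first, then most free GPUs, then name as tie-breaker.
    return sorted(
        feasible,
        key=lambda n: (model not in n.cached_models, -n.free_gpus, n.name),
    )


nodes = [
    Node("node-a", free_gpus=8),
    Node("node-b", free_gpus=4, cached_models={"llm-70b"}),
    Node("node-c", free_gpus=2, cached_models={"llm-70b"}),
]
ranked = rank_nodes(nodes, "llm-70b", gpus_needed=4)
print([n.name for n in ranked])  # ['node-b', 'node-a']
```

Here `node-c` is dropped because it lacks GPUs, and `node-b` outranks the emptier `node-a` because it already has the model cached. Treating the cache as a preference and not a hard requirement means a pod can still start on a cold node when no warm node has capacity.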

What does this mean for production LLM inference?

Cold start latency is a critical problem for serverless or auto-scaling LLM deployments. Industry leaders such as OpenAI and Anthropic reportedly achieve sub-second cold start times on their proprietary stacks, but this relies on custom infrastructure that the open-source community cannot easily replicate.

The NetEase case study provides a concrete blueprint that other companies can follow using open-source components (Kubernetes + Fluid + vLLM). A cold start of under 30 seconds for a 70B model is acceptable for most production workloads and is comparable to the time a scaling event takes in a typical microservice.

For CNCF, the case study validates Fluid as a production-ready tool. It is worth watching whether other LLM serving operators (Replicate, Together AI, Anyscale) adopt similar Fluid-based approaches for their own multi-tenant LLM platforms.

Frequently Asked Questions

What is Fluid in the context of CNCF projects?
Fluid is a CNCF-incubated project for orchestrating data-intensive workloads on Kubernetes. It focuses on accelerating access to large datasets through prefetching and caching layers.

By how much did NetEase Games reduce LLM cold start time?
From 42 minutes (direct S3 access) to under 30 seconds, through intermediate stages of 14 minutes and 3 minutes, using Fluid caching, prefetching, and pre-warming strategies.

What model sizes does NetEase Games use?
70B-class LLMs, corresponding to architectures such as Llama 3 70B, Qwen 2.5 72B, or similar.