Private AI Cloud — A Field Report

Abstract. This field report describes the operational shape of a private AI cloud built for regulated and sustained-inference workloads in Bangladesh. It covers cost composition, scheduling, model-serving runtimes, and the workload classes for which a private build dominates hosted GPU rental. Emphasis is on the operator-facing decisions — the choices that shape the cost curve over a three-year horizon, not the technical novelty that animates conference talks.

H100/H200: production-grade GPUs in current builds

vLLM: default open-model serving runtime

LoRA/QLoRA: cost-effective fine-tuning method

RAG: most common customer-data integration pattern

Methodology

Observations are drawn from production deployments of three sovereign AI clusters across banking, financial services and insurance (BFSI), healthcare, and government workloads, each in the 8-to-32 GPU range. Workloads include sustained inference, fine-tuning on regulated data, and retrieval-augmented generation against in-country knowledge bases. Telemetry is collected via DCGM at the GPU level and Prometheus at the cluster level; cost data is reconstructed from the finance ledger and aligned to the calendar quarter. Notes from operator post-mortems are included where they clarify a non-obvious operational decision.

Stack

The four layers operators have to industrialise

Hardware (GPU compute & interconnect): H100 / H200 / A100 · NVLink within node · 200–400 Gb/s RoCE between nodes

Platform (cluster orchestration): Kubernetes + NVIDIA GPU Operator · Slurm for batch · MIG slicing for multi-tenancy

Runtime (model & serving runtime): vLLM · Triton · TGI · TensorRT-LLM · Ray Serve

Application (data & integration): Vector store (pgvector/Qdrant) · embeddings · RAG · evaluation harness

Hardware-level findings

Three observations from the production fleet matter for capacity planning. First, NVLink-connected node-local sharding consistently outperforms cross-node sharding for the model sizes we run, so we plan fleet topology to keep model-parallel groups within a single 8-GPU node where possible. Second, memory bandwidth, not FLOPs, is the binding constraint for most inference workloads; the H200’s HBM3e bandwidth advantage shows up cleanly in throughput numbers. Third, RoCE/InfiniBand topology decisions made at procurement time are effectively unrecoverable later, so we deliberately over-provision the network at the start rather than risk under-provisioning it.
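The memory-bandwidth point can be illustrated with a back-of-envelope bound. During single-stream decoding, each generated token requires reading roughly all model weights from HBM once, so tokens per second is capped at about memory bandwidth divided by weight size. The sketch below assumes published peak bandwidths of about 3.35 TB/s (H100 SXM) and 4.8 TB/s (H200) and a hypothetical 13B-parameter model with 16-bit weights. It ignores KV-cache traffic and batching, so it is an upper bound for illustration, not a measurement from the fleet.

```python
# Back-of-envelope upper bound on single-stream decode speed.
# Assumption: every generated token reads all model weights from HBM once.

# Approximate published peak HBM bandwidth in GB/s (assumed, not fleet data).
PEAK_BANDWIDTH_GB_S = {
    "H100-SXM": 3350.0,
    "H200": 4800.0,
}


def decode_tokens_per_second_bound(
    params_billion: float, bytes_per_param: float, bandwidth_gb_s: float
) -> float:
    """Return bandwidth / weight size, an upper bound on tokens per second."""
    weight_gb = params_billion * bytes_per_param  # 1e9 params * bytes = GB
    if weight_gb <= 0:
        raise ValueError("model size must be positive")
    return bandwidth_gb_s / weight_gb


if __name__ == "__main__":
    # Hypothetical 13B-parameter model with 2-byte (FP16/BF16) weights.
    for gpu, bandwidth in PEAK_BANDWIDTH_GB_S.items():
        bound = decode_tokens_per_second_bound(13, 2, bandwidth)
        print(f"{gpu}: at most ~{bound:.0f} tokens/s per stream")
```

Under these assumptions the H200 bound is about 1.43 times the H100 bound, which is simply the ratio of the two bandwidth figures; FLOPs do not appear in the estimate at all.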

Cost composition

Where the cost of a private AI cluster build actually sits

GPU acquisition (CAPEX), amortised 3–4 yr: 51%

Power & cooling, 30–60 kW per rack: 18%

Networking (NVLink/RoCE), often underestimated: 12%

Storage (NVMe + object): 9%

Software & support: 10%

Source: Indicative composition for sovereign AI cluster build, 2025 pricing.
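One way to use this composition is to scale a GPU quote into an indicative budget for the other lines. The sketch below takes the percentages from the table; the GPU acquisition figure in the usage example is hypothetical, and the function name `implied_budget` is ours. Treat the output as a planning heuristic under the stated shares, not a quote.

```python
# Scale a GPU acquisition figure into an indicative budget using the
# cost shares from the composition above.

COST_SHARES_PCT = {
    "gpu_acquisition": 51,
    "power_cooling": 18,
    "networking": 12,
    "storage": 9,
    "software_support": 10,
}


def implied_budget(gpu_capex: float, shares: dict = COST_SHARES_PCT) -> dict:
    """Return the cost per line implied by a GPU figure and the shares."""
    if sum(shares.values()) != 100:
        raise ValueError("cost shares must sum to 100%")
    total = gpu_capex * 100 / shares["gpu_acquisition"]
    return {name: total * pct / 100 for name, pct in shares.items()}


if __name__ == "__main__":
    # Hypothetical GPU acquisition cost, in any currency unit.
    for line, cost in implied_budget(5_100_000).items():
        print(f"{line:>18}: {cost:>12,.0f}")
```

With a hypothetical GPU figure of 5,100,000 the implied total is 10,000,000, of which networking alone accounts for 1,200,000.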

Utilisation findings

73%: realised GPU utilisation in well-tuned production fleets

Below 73% usually indicates a scheduler-fairness or batching problem, not insufficient demand. Hosted GPU regions rarely report this metric.
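A minimal sketch of how the fleet figure might be checked against the 73% mark. It assumes per-GPU utilisation samples in percent, such as those reported by the DCGM field `DCGM_FI_DEV_GPU_UTIL`, have already been pulled from the metrics store; the GPU names and sample values are hypothetical, and averaging per-GPU means is one reasonable aggregation rather than the method used for the figure above.

```python
from statistics import mean

TARGET_UTILISATION_PCT = 73.0  # figure quoted in the report


def fleet_utilisation(samples_by_gpu: dict) -> float:
    """Mean of per-GPU mean utilisation, in percent."""
    per_gpu = [mean(samples) for samples in samples_by_gpu.values() if samples]
    if not per_gpu:
        raise ValueError("no utilisation samples supplied")
    return mean(per_gpu)


def needs_scheduler_review(samples_by_gpu: dict) -> bool:
    """True when the fleet sits below the target utilisation."""
    return fleet_utilisation(samples_by_gpu) < TARGET_UTILISATION_PCT


if __name__ == "__main__":
    # Hypothetical samples: one busy GPU and one poorly batched GPU.
    samples = {"gpu0": [90.0, 80.0], "gpu1": [50.0, 60.0]}
    print(f"fleet utilisation: {fleet_utilisation(samples):.1f}%")
    print("review scheduler and batching:", needs_scheduler_review(samples))
```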

Where private AI cloud dominates hosted

Three workload classes consistently favour private deployment. The first is data-bound training and fine-tuning on data that legally cannot leave the country: medical imaging, financial transaction records, citizen analytics. The second is sustained high-throughput inference on customer-facing surfaces that run 24×7 and price out of metered hyperscaler GPU within months. The third is model-quality differentiation, where the value comes from a model that has seen the customer’s own data and where leakage to a hosted API is a strategic risk, not just a compliance one. In our experience, the break-even between private and hosted sits at around 50% sustained utilisation: above that, private wins on TCO; below that, hosted wins on flexibility.
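The break-even logic can be sketched in a few lines. A private GPU costs the same per hour whether or not it is busy, while a hosted GPU is paid for only while in use, so the break-even utilisation is the ratio of the two hourly figures. The rates below are hypothetical and chosen so that the ratio lands at the 50% mentioned above; they are not taken from the fleet's ledger.

```python
def break_even_utilisation(
    private_cost_per_available_gpu_hour: float, hosted_price_per_gpu_hour: float
) -> float:
    """Utilisation at which private and hosted cost the same per GPU."""
    if hosted_price_per_gpu_hour <= 0:
        raise ValueError("hosted price must be positive")
    return private_cost_per_available_gpu_hour / hosted_price_per_gpu_hour


def cheaper_option(
    utilisation: float,
    private_cost_per_available_gpu_hour: float,
    hosted_price_per_gpu_hour: float,
) -> str:
    """Compare cost per wall-clock hour at a given sustained utilisation."""
    private_cost = private_cost_per_available_gpu_hour  # paid even when idle
    hosted_cost = hosted_price_per_gpu_hour * utilisation  # paid only when used
    return "private" if private_cost < hosted_cost else "hosted"


if __name__ == "__main__":
    # Hypothetical rates: 2.00 all-in private vs 4.00 hosted, per GPU-hour.
    print("break-even:", break_even_utilisation(2.0, 4.0))
    for u in (0.3, 0.7):
        print(f"utilisation {u:.0%}:", cheaper_option(u, 2.0, 4.0))
```

This comparison covers cost only; the flexibility advantage of hosted capacity below the break-even is not captured by the arithmetic.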

Scheduling and fairness

The operational discipline that distinguishes a working private AI cloud from a frustrating one is scheduling. A naïve cluster that lets any user grab any GPU at any time will, within weeks, contain three production inference services starved by a runaway fine-tune. The working clusters we operate enforce hard fairness with named queues, MIG slicing for multi-tenancy of inference workloads, and explicit priority bands that the platform team controls. Slurm handles the batch case; Kubernetes with the GPU Operator handles the long-running case; both share a consistent quota model.
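The combination of named queues, quotas and priority bands can be illustrated with a toy admission function. This is not Slurm or Kubernetes configuration; the `Job` type, queue names and quota values are all hypothetical, and the sketch only shows why a quota stops a large fine-tune from displacing production inference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    queue: str
    gpus: int
    priority: int  # lower number = higher priority band


def admit(jobs: list, quotas: dict, total_gpus: int) -> list:
    """Admit jobs in priority order without exceeding queue quota or cluster size."""
    used = {queue: 0 for queue in quotas}
    free = total_gpus
    admitted = []
    for job in sorted(jobs, key=lambda j: j.priority):  # stable within a band
        quota = quotas.get(job.queue, 0)  # unknown queues get no GPUs
        within_quota = used.get(job.queue, 0) + job.gpus <= quota
        if within_quota and job.gpus <= free:
            used[job.queue] = used.get(job.queue, 0) + job.gpus
            free -= job.gpus
            admitted.append(job.name)
    return admitted


if __name__ == "__main__":
    jobs = [
        Job("runaway-finetune", "research", gpus=16, priority=2),
        Job("chat-inference", "prod", gpus=4, priority=0),
        Job("search-inference", "prod", gpus=4, priority=0),
    ]
    quotas = {"prod": 8, "research": 8}  # hypothetical per-queue GPU quotas
    print(admit(jobs, quotas, total_gpus=16))
```

In this example the 16-GPU fine-tune is refused because it exceeds the research quota, while both inference services are admitted regardless of submission order.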

Model serving — the runtime decision

The choice of inference runtime is more consequential than most architects assume. vLLM’s continuous batching and paged KV-cache deliver the best throughput on most open models we have tested, often by a factor of two over a naïve transformer-based server. Triton and TensorRT-LLM win when the workload is dominated by a small set of fixed models that benefit from kernel fusion. TGI suits Hugging Face-native shops that want lower operational complexity. Ray Serve becomes relevant once the deployment shape includes pre/post-processing or multi-model pipelines. We currently default to vLLM for new deployments and migrate to TensorRT-LLM only when the throughput delta justifies the engineering cost.
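To show why continuous batching helps, the toy simulation below counts decode steps for the same requests under static and continuous batching. It is a simplified model, not vLLM code: it ignores prefill, the paged KV-cache and memory limits, and the output lengths are hypothetical. The ratio it prints illustrates the mechanism and is not evidence for the factor quoted above.

```python
import heapq


def static_batching_steps(output_lengths: list, batch_size: int) -> int:
    """Each batch runs until its longest request finishes."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    steps = 0
    for start in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(output_lengths: list, batch_size: int) -> int:
    """A finished request frees its slot for the next waiting request."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    slots = [0] * batch_size  # step at which each slot becomes free
    heapq.heapify(slots)
    finish = 0
    for length in output_lengths:
        free_at = heapq.heappop(slots)  # earliest available slot
        done_at = free_at + length
        heapq.heappush(slots, done_at)
        finish = max(finish, done_at)
    return finish


if __name__ == "__main__":
    # Hypothetical mix: mostly short answers with two long generations.
    lengths = [100, 10, 10, 10, 100, 10, 10, 10]
    print("static:", static_batching_steps(lengths, batch_size=4))
    print("continuous:", continuous_batching_steps(lengths, batch_size=4))
```

With this mix, static batching spends most of each batch waiting on one long request, while continuous batching keeps the freed slots busy.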

Discussion

The serving stack is mature: vLLM, Triton, TGI, TensorRT-LLM, and Ray Serve all run production fleets. The harder problems are operational — scheduler fairness so a runaway fine-tune cannot starve production inference; capacity planning that maps workloads to GPU SKUs and dates the next refresh; and an evaluation harness that measures model quality on a fixed gold set every release. None of these exist by default. The team shape that runs a successful private AI cloud has roughly equal parts ML engineering, platform engineering, and SRE — not the typical ML-heavy team profile that hosted-GPU deployments encourage.
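A minimal sketch of the evaluation-harness idea: score every release against the same gold set and block the release if quality regresses. Exact-match accuracy is used here only because it is simple; the metric, the `max_regression` tolerance and the sample questions are assumptions for illustration, not the harness run on the fleets described.

```python
from typing import Callable


def exact_match_accuracy(model: Callable[[str], str], gold_set: list) -> float:
    """Fraction of (prompt, expected) pairs answered exactly, ignoring case."""
    if not gold_set:
        raise ValueError("gold set is empty")
    hits = sum(
        1
        for prompt, expected in gold_set
        if model(prompt).strip().lower() == expected.strip().lower()
    )
    return hits / len(gold_set)


def release_gate(
    candidate_score: float, baseline_score: float, max_regression: float = 0.01
) -> bool:
    """Allow a release only if it does not regress beyond the tolerance."""
    return candidate_score >= baseline_score - max_regression


if __name__ == "__main__":
    # Hypothetical gold set and a stand-in "model" backed by a lookup table.
    gold = [("2+2", "4"), ("capital of France", "Paris"), ("colour of sky", "blue")]
    answers = {"2+2": "4", "capital of France": "paris "}
    score = exact_match_accuracy(lambda p: answers.get(p, ""), gold)
    print(f"accuracy: {score:.2f}, release allowed: {release_gate(score, 1.0)}")
```

Keeping the gold set fixed between releases is what makes the scores comparable; changing it resets the baseline.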

Limitations

Two limitations of this report deserve naming. First, the fleet sample is small, and the workloads observed skew toward inference rather than training; a primarily-training fleet would weight the cost composition differently and place harder demands on the network and storage tiers. Second, the cost numbers reflect 2025 pricing for current-generation GPUs and will move with the next generation; the operational discipline points should remain stable.
