Private AI Cloud — A Field Report
A research-style walkthrough of what it actually costs and takes to run modern AI workloads on sovereign infrastructure, written from the operator side of the rack.
Abstract. This field report describes the operational shape of a private AI cloud built for regulated and sustained-inference workloads in Bangladesh. It covers cost composition, scheduling, model-serving runtimes, and the workload classes for which a private build dominates hosted GPU rental. Emphasis is on the operator-facing decisions — the choices that shape the cost curve over a three-year horizon, not the technical novelty that animates conference talks.
Methodology
Observations are drawn from production deployments of three sovereign AI clusters across BFSI, healthcare, and government workloads, each in the 8-to-32 GPU range. Workloads include sustained inference, fine-tuning on regulated data, and retrieval-augmented generation against in-country knowledge bases. Telemetry is collected via DCGM at the GPU level and Prometheus at the cluster level; cost data is reconstructed from the finance ledger to match the calendar quarter. Notes from operator post-mortems are included where they clarify a non-obvious operational decision.
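As an illustration of how GPU-level telemetry can be reduced to a utilisation figure, the sketch below averages samples from a response shaped like the Prometheus HTTP API range-query result. The payload is a hypothetical stand-in, not data from the fleets described here, and the `gpu` label and the `DCGM_FI_DEV_GPU_UTIL` metric name are assumed to follow the usual dcgm-exporter conventions. The function name is ours.

```python
import json
from statistics import mean

# Hypothetical payload in the shape returned by Prometheus /api/v1/query_range
# for the dcgm-exporter metric DCGM_FI_DEV_GPU_UTIL (percent, 0-100).
sample_response = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {"metric": {"__name__": "DCGM_FI_DEV_GPU_UTIL", "gpu": "0"},
       "values": [[1700000000, "80"], [1700000060, "60"]]},
      {"metric": {"__name__": "DCGM_FI_DEV_GPU_UTIL", "gpu": "1"},
       "values": [[1700000000, "20"], [1700000060, "40"]]}
    ]
  }
}
""")

def mean_utilisation_by_gpu(response: dict) -> dict[str, float]:
    """Average the sampled utilisation (as a 0-1 fraction) for each GPU series."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    result = {}
    for series in response["data"]["result"]:
        gpu = series["metric"].get("gpu", "unknown")
        # Prometheus encodes sample values as strings: [timestamp, "value"].
        samples = [float(value) for _ts, value in series["values"]]
        result[gpu] = mean(samples) / 100.0
    return result

per_gpu = mean_utilisation_by_gpu(sample_response)
fleet_mean = mean(per_gpu.values())
print(per_gpu, round(fleet_mean, 3))
```

A real pipeline would fetch the payload over HTTP and align the sampling window with the finance quarter; that part is omitted here.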
Stack
H100 / H200 / A100 · NVLink within node · 200–400 Gb RoCE between nodes
Kubernetes + NVIDIA GPU Operator · Slurm for batch · MIG slicing for multi-tenancy
vLLM · Triton · TGI · TensorRT-LLM · Ray Serve
Vector store (pgvector/Qdrant) · embeddings · RAG · evaluation harness
Hardware-level findings
Three observations from the production fleet matter for capacity planning. First, NVLink-connected node-local sharding consistently outperforms cross-node sharding for the model sizes we run, so we plan fleet topology to keep model-parallel groups within a single 8-GPU node where possible. Second, memory bandwidth, not FLOPs, is the binding constraint for most inference workloads; the H200’s HBM3e bandwidth advantage shows up cleanly in throughput numbers. Third, RoCE/InfiniBand topology decisions made at procurement time are effectively unrecoverable later, so we deliberately over-engineer the network at the start and never under-engineer it.
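The bandwidth point can be made concrete with a back-of-envelope bound. The sketch below assumes single-stream decoding in which every generated token requires one full read of the model weights from HBM, and it ignores KV-cache traffic and compute. The peak bandwidth figures (roughly 3.35 TB/s for H100 SXM and 4.8 TB/s for H200) are vendor numbers taken as assumptions, not measurements from this fleet, and the 8-billion-parameter FP16 model is a hypothetical example.

```python
# Rough upper bound on single-stream decode throughput when weight reads
# from HBM dominate. Peak bandwidths are vendor figures (assumed), not
# measurements from the fleet described in this report.
PEAK_HBM_BANDWIDTH_BYTES_PER_S = {
    "H100-SXM": 3.35e12,
    "H200": 4.8e12,
}

def decode_tokens_per_second_bound(params: float, bytes_per_param: float,
                                   bandwidth: float) -> float:
    """Tokens/s ceiling if each token needs one full pass over the weights."""
    weight_bytes = params * bytes_per_param
    return bandwidth / weight_bytes

# Hypothetical model: 8e9 parameters stored in FP16 (2 bytes each).
bounds = {
    gpu: decode_tokens_per_second_bound(8e9, 2, bw)
    for gpu, bw in PEAK_HBM_BANDWIDTH_BYTES_PER_S.items()
}
h200_over_h100 = bounds["H200"] / bounds["H100-SXM"]

for gpu, bound in bounds.items():
    print(f"{gpu}: <= {bound:.0f} tokens/s per stream")
print(f"H200 / H100 ratio: {h200_over_h100:.2f}")
```

Under these assumptions the ceiling scales directly with bandwidth, which is why a bandwidth gain can translate into throughput even when peak FLOPs are similar. Batched serving changes the absolute numbers but not the direction of the argument.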
Cost composition
Source: Indicative composition for sovereign AI cluster build, 2025 pricing.
Utilisation findings
Where private AI cloud dominates hosted
Three workload classes consistently favour private deployment. The first is data-bound training and fine-tuning on data that legally cannot leave the country: medical imaging, financial transaction records, citizen analytics. The second is sustained high-throughput inference on customer-facing surfaces that run 24×7 and price out of metered hyperscaler GPU within months. The third is model-quality differentiation, where the value comes from a model that has seen the customer’s own data and where leakage to a hosted API is a strategic risk, not just a compliance one. In our experience the break-even between private and hosted sits at around 50% sustained utilisation: above that, private wins on TCO; below that, hosted wins on flexibility.
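The break-even logic reduces to a short calculation: the private cost per available GPU-hour, divided by the hosted rate, gives the utilisation at which the two options cost the same per used hour. The sketch below uses hypothetical inputs (a three-year TCO of 52,560 per GPU and a hosted rate of 4.00 per GPU-hour, in any one currency) chosen only so that the result lands on the roughly 50% figure quoted above. They are not figures from the fleets in this report.

```python
HOURS_PER_YEAR = 24 * 365  # 8760

def break_even_utilisation(private_tco_per_gpu: float, years: float,
                           hosted_rate_per_gpu_hour: float) -> float:
    """Utilisation at which private cost per used GPU-hour equals the hosted rate."""
    cost_per_available_hour = private_tco_per_gpu / (years * HOURS_PER_YEAR)
    return cost_per_available_hour / hosted_rate_per_gpu_hour

def private_cost_per_used_hour(private_tco_per_gpu: float, years: float,
                               utilisation: float) -> float:
    """Effective private cost per GPU-hour actually used at a given utilisation."""
    return private_tco_per_gpu / (years * HOURS_PER_YEAR * utilisation)

# Hypothetical inputs, for illustration only.
tco = 52_560.0   # three-year total cost of ownership per GPU
rate = 4.0       # hosted price per GPU-hour

u_star = break_even_utilisation(tco, 3, rate)
print(f"break-even utilisation: {u_star:.0%}")
for u in (0.3, 0.8):
    cost = private_cost_per_used_hour(tco, 3, u)
    print(f"utilisation {u:.0%}: private {cost:.2f}/h vs hosted {rate:.2f}/h")
```

The calculation treats hosted capacity as purely metered and private capacity as a fixed cost, which is a simplification; reserved-instance discounts or staffing costs would shift the break-even point.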
Scheduling and fairness
The operational discipline that distinguishes a working private AI cloud from a frustrating one is scheduling. A naïve cluster that lets any user grab any GPU at any time will, within weeks, contain three production inference services starved by a runaway fine-tune. The working clusters we operate enforce hard fairness with named queues, MIG slicing for multi-tenancy of inference workloads, and explicit priority bands that the platform team controls. Slurm handles the batch case; Kubernetes with the GPU Operator handles the long-running case; both share a consistent quota model.
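The quota model can be illustrated with a toy admission function. This is a minimal sketch of static per-queue quotas combined with priority bands, not the mechanism Slurm or Kubernetes implements. The queue names, job names, and quota values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    queue: str
    gpus: int
    priority: int  # lower number = higher priority band

def admit(pending: list[Job], quotas: dict[str, int], total_gpus: int) -> list[Job]:
    """Admit jobs in priority order without exceeding per-queue quota or cluster size."""
    used_by_queue = {queue: 0 for queue in quotas}
    free = total_gpus
    admitted = []
    # sorted() is stable, so jobs in the same band keep submission order.
    for job in sorted(pending, key=lambda j: j.priority):
        within_quota = used_by_queue[job.queue] + job.gpus <= quotas[job.queue]
        if within_quota and job.gpus <= free:
            used_by_queue[job.queue] += job.gpus
            free -= job.gpus
            admitted.append(job)
    return admitted

# Hypothetical 8-GPU node with two named queues.
quotas = {"prod-inference": 6, "fine-tune": 4}
pending = [
    Job("runaway-finetune", "fine-tune", 8, priority=2),
    Job("chat-api", "prod-inference", 2, priority=0),
    Job("search-api", "prod-inference", 2, priority=0),
    Job("small-finetune", "fine-tune", 4, priority=2),
]
admitted_names = [job.name for job in admit(pending, quotas, total_gpus=8)]
print(admitted_names)
```

In this example the 8-GPU fine-tune is rejected because it exceeds its queue's quota, even though it was submitted first, while both inference services and the smaller fine-tune are admitted.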
Model serving — the runtime decision
The choice of inference runtime is more consequential than most architects assume. vLLM’s continuous batching and paged KV-cache deliver the best throughput on most open models we have tested, often by a factor of two over a naïve transformer-based server. Triton and TensorRT-LLM win when the workload is dominated by a small set of fixed models that benefit from kernel fusion. TGI suits Hugging Face-native shops that want lower operational complexity. Ray Serve becomes relevant once the deployment shape includes pre/post-processing or multi-model pipelines. We currently default to vLLM for new deployments and migrate to TensorRT-LLM only when the throughput delta justifies the engineering cost.
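To show why continuous batching helps, the toy simulation below compares it with static batching by counting decode steps. It is not vLLM's scheduler: it treats every decode step as equal cost, ignores prefill and KV-cache memory limits, and uses hypothetical output lengths that mix long and short requests.

```python
def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """A slot is refilled from the queue as soon as its request finishes."""
    queue = list(lengths)
    running: list[int] = []  # remaining tokens per active request
    steps = 0
    while queue or running:
        # Fill any free slots before the next decode step.
        while queue and len(running) < batch_size:
            running.append(queue.pop(0))
        steps += 1  # one decode step emits one token per active request
        running = [r - 1 for r in running if r > 1]
    return steps

# Hypothetical output lengths (tokens) for eight requests, batch size 4.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
static_steps = static_batching_steps(lengths, 4)
continuous_steps = continuous_batching_steps(lengths, 4)
print(static_steps, continuous_steps)
```

With this mix the static scheduler idles three slots for most of each batch while the long request finishes, so the gap widens as output lengths become more uneven.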
Discussion
The serving stack is mature: vLLM, Triton, TGI, TensorRT-LLM, and Ray Serve all run production fleets. The harder problems are operational — scheduler fairness so a runaway fine-tune cannot starve production inference; capacity planning that maps workloads to GPU SKUs and dates the next refresh; and an evaluation harness that measures model quality on a fixed gold set every release. None of these exist by default. The team shape that runs a successful private AI cloud has roughly equal parts ML engineering, platform engineering, and SRE — not the typical ML-heavy team profile that hosted-GPU deployments encourage.
Limitations
Two limitations of this report deserve naming. First, the fleet sample is small, and the workloads observed skew toward inference rather than training; a primarily-training fleet would weight the cost composition differently and place harder demands on the network and storage tiers. Second, the cost numbers reflect 2025 pricing for current-generation GPUs and will move with the next generation; the operational discipline points should remain stable.