How to Right-Size Your GPU Infrastructure for LLM Fine-Tuning vs. Pre-Training
Your ML team provisioned a multi-node H100 cluster three months ago. It was the safe default: production-grade, widely recommended, and built for scale. Six weeks later, the CFO forwards the invoice with a single question mark in the subject line. The jobs finish in four to eight hours. The InfiniBand fabric is never saturated. The GPUs idle at 30% VRAM utilization. The interconnect that justified the cluster configuration is going entirely unused.

That scenario is not unusual. It is the default outcome when teams apply pre-training-class infrastructure assumptions to fine-tuning workloads, or source hardware without first mapping compute requirements to job type. The financial gap between a correctly specified and incorrectly specified GPU cluster is measured in hundreds of thousands of dollars annually for teams running production workloads.
This post gives infrastructure teams and ML buyers a practical framework for LLM GPU sizing: how memory and compute requirements differ between fine-tuning and pre-training, what the infrastructure layer demands for each workload, how cost drivers differ, and which deployment model fits which job profile.
Pre-Training vs. Fine-Tuning: Where Compute Requirements Diverge

Fine-tuning vs. pre-training hardware differences come down to five dimensions: parameter scale, dataset size, job duration, memory bandwidth vs. throughput priority, and interconnect demand. Understanding these differences is the foundation of any sound GPU infrastructure sizing decision for LLM training.
Why Pre-Training Is Throughput- and Interconnect-Bound
Pre-training a large language model is a distributed systems problem as much as a modeling one: it involves streaming trillions of tokens through many GPUs while synchronizing gradients and moving checkpoints across a cluster. Meta’s Llama 3 paper describes a 405B-parameter model and positions it as a large-scale foundation model built for multilinguality, coding, reasoning, and tool use (Source: Kili Technology, 2024). At this scale, network fabric, checkpointing, and cross-node coordination can matter more than raw peak FLOP count when you are deciding whether a training cluster is actually well-specified.
Why Fine-Tuning Workloads Are Routinely Over-Provisioned
LLM fine-tuning hardware requirements are a fraction of those for pre-training: a supervised fine-tuning run on a 13B parameter model with a domain-specific dataset of several million tokens typically completes in hours on a single node with sufficient VRAM. Parameter-efficient methods including LoRA and QLoRA train only a small set of adapter weights, usually well under five percent of the model total and often under one percent, cutting memory requirements, compute overhead, and job duration further. The common mistake is provisioning multi-node infrastructure with high-speed interconnect for workloads that never outgrow a single node.
The Middle Ground: Continual Pre-Training and Domain Adaptation
Continual pre-training and domain adaptation occupy the space between fine-tuning and full pre-training, and represent the category buyers most frequently miscategorize. A model trained further on a large vertical corpus, such as medical literature, legal filings, or code, may require multi-node infrastructure and sustained job duration without reaching the scale of a foundation model pre-training run. The classification test is dataset size and job duration: if the dataset is under ten billion tokens and the run completes in days rather than months, it maps closer to fine-tuning infrastructure than pre-training infrastructure.
LLM VRAM Requirements: How to Size GPU Memory for Training and Fine-Tuning
GPU memory requirements for LLM training are calculated from four components: the model parameters, optimizer states, gradient storage, and activation memory. The canonical formula is:
VRAM ≈ (parameters × bytes per parameter) + optimizer states + gradients + activations
Using half precision (bf16 or fp16) for model weights rather than fp32 halves per-parameter memory usage and is standard practice for LLM fine-tuning. For a 13B parameter model under bf16 mixed precision with AdamW: parameters occupy approximately 26 GB (13B × 2 bytes), optimizer states add roughly 104 GB (two fp32 Adam moments at 4 bytes each, or 4x the bf16 parameter size), gradients add another 26 GB, and activations depend on sequence length and batch size but typically add 20 to 40 GB. Those components sum to roughly 176 to 196 GB, and mixed-precision setups that also keep an fp32 master copy of the weights add about 52 GB more. Failing to estimate VRAM consumption before provisioning hardware is the single most common sizing gap: total VRAM for full fine-tuning of a 13B model sits in the 200 to 300 GB range, while LoRA fine-tuning of the same 13B model often fits in 24 to 48 GB per GPU.
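The formula can be turned into a small estimator. This is a rough sketch under the same assumptions as the worked 13B example (bf16 weights and gradients, two fp32 Adam moments); the function name, the 30 GB activation default, and the `master_weights` switch are illustrative choices, not part of any library.

```python
def estimate_full_ft_vram_gb(params_billion, activations_gb=30.0, master_weights=False):
    """Rough aggregate VRAM (GB) for full fine-tuning with bf16 weights and AdamW."""
    weights = params_billion * 2      # bf16 weights: 2 bytes per parameter
    gradients = params_billion * 2    # bf16 gradients: 2 bytes per parameter
    optimizer = params_billion * 8    # two fp32 Adam moments: 2 x 4 bytes per parameter
    # Optional fp32 master copy of the weights kept by many mixed-precision setups.
    master = params_billion * 4 if master_weights else 0.0
    return weights + gradients + optimizer + master + activations_gb


for size in (7, 13, 70):
    low = estimate_full_ft_vram_gb(size)
    high = estimate_full_ft_vram_gb(size, master_weights=True)
    print(f"{size}B: {low:.0f} to {high:.0f} GB")
```

Billions of parameters times bytes per parameter gives gigabytes directly, which keeps the arithmetic easy to check by hand. Activation memory is passed in rather than derived because it depends on sequence length, batch size, and whether gradient checkpointing is enabled.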
Reference benchmarks for planning:
| Model Size | Full Fine-Tuning (bf16 + Adam) | LoRA / QLoRA |
|---|---|---|
| 7B | ~120–160 GB total | 12–24 GB per GPU |
| 13B | ~200–300 GB total | 24–48 GB per GPU |
| 34B | ~500–600 GB total | 40–80 GB per GPU |
| 70B | ~1 TB+ total | 80–160 GB per GPU |
Estimates based on standard mixed-precision training assumptions. Actual requirements vary with sequence length, batch size, and parallelism strategy.
Full Fine-Tuning Memory Requirements by Model Size
VRAM requirements for 13B and 70B models differ by roughly a factor of five (200 to 300 GB versus 1 TB or more), which directly determines the number of GPUs required for full fine-tuning. For a 7B model under bf16 with AdamW, total VRAM across the cluster typically falls in the 120 to 160 GB range, achievable on two to four A100 80GB or H100 80GB GPUs. For a 70B model, full fine-tuning requires 1 TB or more of aggregate VRAM, meaning a minimum of 14 to 16 H100 80GB GPUs before accounting for activation memory at longer sequence lengths. Gradient checkpointing reduces activation memory by recomputing activations during the backward pass rather than storing them, at a cost of roughly 20 to 30 percent additional compute time, and is standard practice for large-model fine-tuning on memory-constrained clusters.
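Converting an aggregate VRAM figure into a GPU count is a ceiling division with some headroom. The sketch below assumes that about 90 percent of each GPU's memory is usable for training state; that `usable_fraction` value is a placeholder to tune, not a measured number.

```python
import math


def min_gpus(total_vram_gb, gpu_vram_gb=80, usable_fraction=0.9):
    """Smallest GPU count whose usable memory covers the aggregate requirement."""
    usable_per_gpu = gpu_vram_gb * usable_fraction  # leave headroom for fragmentation, buffers
    return math.ceil(total_vram_gb / usable_per_gpu)


print(min_gpus(120), min_gpus(160))  # 7B range from the text
print(min_gpus(1000))                # 70B at 1 TB aggregate
```

With these assumptions the 7B range lands on two to three 80 GB GPUs and the 70B case on fourteen, in line with the figures above. The count is a floor: parallelism strategies often need a GPU count that divides evenly across nodes.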
How LoRA and QLoRA Reduce VRAM Requirements
LoRA VRAM requirements are substantially lower than full fine-tuning because only the low-rank adapter matrices are trained, not the full model weights. For a 13B model, LoRA with rank 16 to 64 reduces trainable parameters from 13 billion to roughly 20 to 80 million, cutting the memory footprint of optimizer states and gradients proportionally. QLoRA memory requirements are lower still: the base model weights are quantized to 4-bit (NF4 format), reducing memory consumption by approximately 4x compared to fp16, while the LoRA adapters train in bf16. These memory savings allow fine-tuning a 70B model on a single 80 GB H100 in many configurations, enabling parameter-efficient fine-tuning with fewer GPUs and with minimal quality impact relative to full fine-tuning. The tradeoff is slightly reduced throughput due to dequantization overhead, which is acceptable for fine-tuning workloads but problematic at pre-training scale.
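The trainable-parameter reduction can be checked with simple arithmetic. The sketch below assumes a hypothetical 13B-class transformer with hidden size 5120 and 40 layers, and assumes adapters on three square projection matrices per layer; all three values are illustrative, and the count moves with whichever modules are actually adapted.

```python
def lora_trainable_params(hidden_size, num_layers, rank, adapted_matrices_per_layer):
    """Trainable parameters added by LoRA on square hidden_size x hidden_size weights."""
    # Each adapted matrix gets A (rank x hidden) and B (hidden x rank).
    per_matrix = 2 * rank * hidden_size
    return per_matrix * adapted_matrices_per_layer * num_layers


def weight_memory_gb(params_billion, bits_per_param):
    """Memory for the frozen base weights at a given precision."""
    return params_billion * bits_per_param / 8


HIDDEN, LAYERS = 5120, 40  # assumed shape of a 13B-class model
for rank in (16, 64):
    n = lora_trainable_params(HIDDEN, LAYERS, rank, adapted_matrices_per_layer=3)
    print(f"rank {rank}: {n / 1e6:.1f}M trainable parameters")

print(f"13B base weights: {weight_memory_gb(13, 16):.1f} GB at 16-bit, "
      f"{weight_memory_gb(13, 4):.1f} GB at 4-bit")
```

Under these assumptions the adapter count falls between roughly 20 and 80 million parameters, and the 4-bit base weights take about a quarter of the 16-bit footprint. Quantization constants and activation memory are ignored here, so real QLoRA runs need more than the weight figure alone.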
Multi-GPU Parallelism for LLM Training: Tensor, Pipeline, and ZeRO
Multi-GPU parallelism for LLM training covers three strategies that address different aspects of memory and compute distribution. Tensor parallelism splits individual layer computations across GPUs and generates communication overhead through all-reduce operations in every layer's forward and backward pass. Pipeline parallelism splits the model depth-wise across stages and communicates only activations at stage boundaries, which suits configurations with high compute-to-communication ratios. ZeRO (Zero Redundancy Optimizer) shards optimizer states, gradients, and parameters across the data-parallel group to eliminate multiple copies of model state and reduce per-GPU memory without splitting the model graph. ZeRO Stage 3 is the stage that shards the parameters themselves, and ZeRO's offload options extend this further by moving optimizer states to CPU memory, reducing GPU memory consumption at the cost of additional CPU-GPU transfer latency. For distributed training infrastructure for LLMs, the choice of parallelism strategy should be made before selecting the interconnect tier, because tensor parallelism in particular imposes high-bandwidth, low-latency requirements that eliminate lower-cost Ethernet configurations from consideration.
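The effect of ZeRO sharding on per-GPU memory can be sketched with the usual mixed-precision accounting: 2 bytes per parameter for bf16 weights, 2 for gradients, and 12 for fp32 optimizer state (master weights plus two Adam moments). The function below is an illustrative model-state estimate only; it ignores activations, communication buffers, and offloading.

```python
def zero_per_gpu_gb(params_billion, num_gpus, stage):
    """Approximate model-state memory per GPU (GB) under ZeRO stages 0 to 3."""
    weights = 2 * params_billion   # bf16 parameters
    grads = 2 * params_billion     # bf16 gradients
    optim = 12 * params_billion    # fp32 master weights + two Adam moments
    if stage >= 1:
        optim /= num_gpus          # Stage 1 shards optimizer states
    if stage >= 2:
        grads /= num_gpus          # Stage 2 also shards gradients
    if stage >= 3:
        weights /= num_gpus        # Stage 3 also shards parameters
    return weights + grads + optim


for stage in range(4):
    print(f"stage {stage}: {zero_per_gpu_gb(13, 8, stage):.2f} GB per GPU")
```

For a 13B model on eight GPUs, this accounting drops from 208 GB per GPU with plain data parallelism to 26 GB with Stage 3, which is why the sharding stage matters as much as the GPU count when sizing a node.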
Pre-Training Infrastructure Requirements: Interconnect, Storage, and Scale
LLM pre-training infrastructure requirements extend well beyond GPU count to the network fabric, storage layer, and operational systems that sustain a multi-month job. Teams that size GPUs correctly and underspecify these components routinely encounter performance ceilings that no additional GPU spend can fix.
Why Interconnect Bandwidth Determines Pre-Training Efficiency
At pre-training scale, the choice between InfiniBand and Ethernet is not marginal: benchmarks show Ethernet increasing training step time by anywhere from 15% in smaller configurations to roughly 10× in large-scale, all-reduce-heavy workloads. HDR InfiniBand (200 Gb/s) and NDR InfiniBand (400 Gb/s) deliver the low-latency, high-bandwidth fabric required for tensor parallelism and data-parallel all-reduce. While 100 Gb/s Ethernet can sustain data-parallel training for smaller models with less frequent synchronization, it introduces all-reduce bottlenecks at scale that substantially reduce effective GPU utilization, turning expensive clusters into expensive idling ones. When evaluating GPU infrastructure for LLM training, interconnect specifications must be treated as a first-class requirement alongside GPU model and count, not an afterthought in procurement (Sources: Arc Compute, 2025; Benquan, 2026; ApX, 2025).
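A bandwidth-only estimate shows why link speed matters for gradient synchronization. The sketch below uses the standard ring all-reduce traffic factor of 2(N-1)/N and treats each GPU as having one network link at the stated rate; it ignores latency, protocol overhead, compute overlap, and faster intra-node links, so it is a lower bound rather than a benchmark.

```python
def ring_allreduce_seconds(payload_gb, num_gpus, link_gbit_per_s):
    """Bandwidth-only time for one ring all-reduce of payload_gb across num_gpus."""
    # In a ring all-reduce each GPU sends (and receives) 2 * (N - 1) / N of the payload.
    traffic_gb = 2 * (num_gpus - 1) / num_gpus * payload_gb
    link_gb_per_s = link_gbit_per_s / 8  # gigabits per second -> gigabytes per second
    return traffic_gb / link_gb_per_s


GRADIENTS_GB = 26  # bf16 gradients of a 13B model
for name, rate in (("100 Gb/s Ethernet", 100), ("HDR 200 Gb/s", 200), ("NDR 400 Gb/s", 400)):
    t = ring_allreduce_seconds(GRADIENTS_GB, num_gpus=16, link_gbit_per_s=rate)
    print(f"{name}: {t:.2f} s per all-reduce")
```

If a training step takes only a few seconds of compute, several seconds of un-overlapped synchronization per step can dominate step time, which is the mechanism behind the utilization loss described above.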
Storage I/O Requirements for Pre-Training: What Buyers Underestimate
Pre-training clusters require storage throughput capable of sustaining continuous dataset streaming at the speed of the training pipeline, plus checkpoint write bandwidth that does not stall the cluster during saves. A typical pre-training checkpoint for a 70B model written in fp16 is approximately 140 GB for the weights alone (70B × 2 bytes); saving optimizer state for resumption makes it several times larger. At a checkpoint frequency of every 500 steps on a fast H100 cluster, that may require writing several hundred GB per hour without throttling the GPUs. Parallel file systems such as Lustre, GPFS, and WekaFS are the standard storage layer for pre-training at scale, providing the sustained throughput that large sequential reads and writes demand; NVMe-backed storage handles smaller cluster configurations. Teams that use NFS for pre-training workloads routinely encounter data loading bottlenecks: training throughput stalls when storage I/O cannot keep pace with the dataset streaming pipeline, leaving GPUs waiting for data rather than training.
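The checkpoint arithmetic can be made explicit. In the sketch below, the 3 seconds per step and the 60-second stall budget are assumed values chosen for illustration; substitute measured step time and whatever pause the training schedule can tolerate.

```python
def checkpoint_gb_per_hour(checkpoint_gb, steps_between_checkpoints, seconds_per_step):
    """Average checkpoint volume written per hour of training."""
    checkpoints_per_hour = 3600 / (steps_between_checkpoints * seconds_per_step)
    return checkpoint_gb * checkpoints_per_hour


def write_bandwidth_gb_per_s(checkpoint_gb, max_stall_seconds):
    """Sustained write bandwidth needed to finish one save within the stall budget."""
    return checkpoint_gb / max_stall_seconds


hourly = checkpoint_gb_per_hour(140, steps_between_checkpoints=500, seconds_per_step=3)
burst = write_bandwidth_gb_per_s(140, max_stall_seconds=60)
print(f"{hourly:.0f} GB written per hour, {burst:.2f} GB/s needed during each save")
```

The hourly average looks modest, but the burst figure is the one that sizes the storage tier: a synchronous save that must complete in a minute needs multiple gigabytes per second of sustained write throughput.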
Checkpoint Infrastructure and Job Continuity for Long-Running Runs
Checkpoint frequency is the primary fault tolerance mechanism for long-running pre-training jobs, determining both how much progress is exposed to loss and what recovery costs after an interruption. A cluster without a recent checkpoint loses meaningful training progress and must pay again for the compute that produced it, at rates ranging from $1.03 per hour for a spot H100 on Spheron to $12.29 per hour for an Azure H100, depending on whether the provider is a specialized GPU cloud or a hyperscaler. Spot and preemptible instances carry interruption risk that is acceptable for short fine-tuning jobs but financially damaging for pre-training runs measured in weeks. This operational reality, more than any GPU benchmark, should drive the reserved versus spot versus dedicated procurement decision for long-running pre-training infrastructure (Source: Spheron, 2026).
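One way to reason about checkpoint interval is the expected compute wasted per interruption. The sketch below assumes an interruption lands, on average, halfway between two checkpoints, treats the quoted hourly rates as per-GPU prices, and uses a hypothetical 256-GPU cluster with a 30-minute restart; all of these are assumptions for illustration.

```python
def expected_loss_usd(num_gpus, usd_per_gpu_hour, checkpoint_interval_hours, restart_hours=0.0):
    """Expected compute cost wasted by a single interruption."""
    # Average lost work is half a checkpoint interval, plus time spent restarting.
    lost_hours = checkpoint_interval_hours / 2 + restart_hours
    return num_gpus * usd_per_gpu_hour * lost_hours


for rate in (1.03, 12.29):
    loss = expected_loss_usd(256, rate, checkpoint_interval_hours=2, restart_hours=0.5)
    print(f"${rate:.2f}/GPU-hour: about ${loss:,.0f} lost per interruption")
```

Multiplying the per-interruption figure by the expected number of interruptions over a multi-week run gives a rough budget line for comparing spot against reserved capacity.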
GPU Infrastructure Cost Drivers for LLM Workloads
GPU infrastructure cost for LLM workloads is determined by four variables that buyers routinely underestimate: AI training cost per run, interconnect overhead, the underutilization penalty, and the billing model mismatch between cloud pricing and actual workload duration. Understanding these drivers before selecting a deployment model is what separates informed GPU cost optimization from GPU procurement that looks reasonable until the invoice arrives. Sourcing platforms that provide instant pricing across deployment models make this cost comparison tractable without requiring multiple vendor calls.
Cost Per Training Run: Fine-Tuning vs. Pre-Training
AI training cost per run varies by multiple orders of magnitude between fine-tuning and pre-training workloads. A LoRA fine-tuning run on a 7B model for a domain-specific task might consume four to eight GPU-hours on a single A100 or H100, costing $12 to $40 at standard cloud rates. Full fine-tuning of a 70B model runs into hundreds of GPU-hours. Pre-training a foundation model at Llama 3 scale (a 70B-parameter model trained on 15 trillion tokens) required 6.4 million H100 GPU-hours by Meta's published estimate, representing tens of millions of dollars in compute at market rates. For most enterprise teams, fine-tuning is not just the technically appropriate choice for adapting an existing model: the cost differential makes it the only viable path for any organization not operating at hyperscaler scale (Source: CoreWeave, 2025).
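The gap is easy to reproduce from the figures in this section. The hourly rates below are back-derived from the $12 to $40 range quoted for four to eight GPU-hours ($3 and $5 per GPU-hour) and are assumptions, not quoted prices.

```python
def run_cost_usd(gpu_hours, usd_per_gpu_hour):
    """Compute cost of one training run at a flat hourly GPU rate."""
    return gpu_hours * usd_per_gpu_hour


lora_low = run_cost_usd(4, 3.0)            # short LoRA run at the lower assumed rate
lora_high = run_cost_usd(8, 5.0)           # longer LoRA run at the higher assumed rate
pretrain = run_cost_usd(6_400_000, 3.0)    # 6.4M H100 GPU-hours at the lower assumed rate

print(f"LoRA 7B run: ${lora_low:.0f} to ${lora_high:.0f}")
print(f"70B pre-training: ${pretrain / 1e6:.1f}M")
print(f"Ratio vs. the larger LoRA run: {pretrain / lora_high:,.0f}x")
```

Even at the lower assumed rate, the pre-training run costs several hundred thousand times more than the fine-tuning run, which is the scale difference the sizing decision has to respect.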
The Interconnect Tax: What InfiniBand Costs in Practice
InfiniBand fabric adds meaningful cost to a GPU cluster, both in hardware, including NDR InfiniBand NICs, switches, and cabling, and in the colocation or cloud premium for InfiniBand-equipped configurations vs. standard Ethernet. On bare metal or colocation infrastructure, InfiniBand adds roughly 10 to 20 percent to total cluster cost at smaller scales, with per-node cost declining at larger cluster sizes where switch port density improves. The correct question is not whether to pay the interconnect premium but whether the workload actually requires it. Fine-tuning jobs running on single nodes or two- to four-GPU configurations do not need InfiniBand. Pre-training jobs at 32 nodes or above almost always do. Paying the interconnect premium on fine-tuning workloads is the most consistent source of unnecessary cost in GPU infrastructure for LLM training.
The Underutilization Penalty: Why Over-Provisioning Is Expensive
Sustained GPU utilization below 50% is a direct financial penalty: compute is being paid for and not used. The scenario from the opening of this post, GPUs idling at 30% VRAM utilization on a multi-node cluster, is the common outcome of provisioning for theoretical peak demand rather than actual workload profile. As a directional threshold for GPU cost optimization: above 70% sustained GPU utilization, bare metal dedicated hardware typically becomes cheaper than cloud on a cost-per-useful-compute basis. Below 30%, or for highly intermittent workloads, the cloud premium is justified because it is smaller than the cost of the idle capacity you would otherwise pay for on dedicated hardware.
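The utilization threshold follows from a simple break-even calculation. The sketch below assumes dedicated hardware is billed for every hour whether used or not, while on-demand cloud is billed only for hours consumed; the $2.10 and $3.00 hourly rates are hypothetical values picked to illustrate the 70% figure.

```python
def cost_per_useful_gpu_hour(usd_per_gpu_hour, utilization):
    """Effective price of each productive hour on always-billed hardware."""
    return usd_per_gpu_hour / utilization


def breakeven_utilization(dedicated_usd_per_hour, cloud_usd_per_hour):
    """Utilization above which dedicated hardware beats pay-per-use cloud."""
    return dedicated_usd_per_hour / cloud_usd_per_hour


DEDICATED, CLOUD = 2.10, 3.00  # hypothetical $/GPU-hour
print(f"break-even utilization: {breakeven_utilization(DEDICATED, CLOUD):.0%}")
for u in (0.3, 0.5, 0.9):
    effective = cost_per_useful_gpu_hour(DEDICATED, u)
    print(f"{u:.0%} utilized: dedicated costs ${effective:.2f} per useful hour vs ${CLOUD:.2f} cloud")
```

The break-even point is just the ratio of the two rates, so a larger cloud premium pulls the threshold down and a smaller one pushes it up.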
GPU Infrastructure Sourcing Models: Matching LLM Workloads to Deployment
GPU cloud vs. bare metal for AI workloads is not a binary choice: the correct model depends on utilization profile, job duration, team operational capacity, and budget horizon. LLM workloads span a wide range of infrastructure requirements, and sourcing models should be evaluated against the actual job profile rather than against category defaults.
When Bare Metal Dedicated Makes More Sense Than Cloud
Bare metal GPU servers make more economic sense than cloud for LLM workloads running above 70% sustained utilization over a period of weeks or months. At that utilization level, the per-hour premium of cloud GPU instances accumulates into meaningful overspend versus dedicated capacity. Bare metal also eliminates noisy-neighbor effects on shared infrastructure, where a co-tenant's workload can affect memory bandwidth availability on GPU instances that share a physical host. For teams running repeated fine-tuning jobs on a production schedule, or sustained pre-training on owned model weights, bare metal dedicated is the operationally appropriate and cost-efficient choice.
When GPU Cloud Is the Right Call
GPU cloud is the right infrastructure model for LLM workloads running below 30% utilization, for exploratory or one-off fine-tuning jobs, and for teams without the infrastructure operations headcount to manage dedicated hardware. The on-demand pricing premium is justified by the absence of idle capacity cost: you pay only for hours consumed, not for a server sitting available. For teams benchmarking multiple fine-tuning approaches, testing model configurations, or running evaluation pipelines with unpredictable scheduling, GPU cloud removes the over-provisioning risk entirely.
Colocation as a GPU Infrastructure Strategy
Colocation is a viable GPU infrastructure strategy for organizations that want to own or lease GPU hardware while outsourcing the physical facility, power, and cooling. Dense GPU compute, particularly H100 SXM configurations, draws up to 700 watts per GPU, which pushes power density to 30 to 50 kW per rack or higher for standard AI deployments. Colocation facilities that support AI workloads provide the power density, cooling infrastructure, and network connectivity that on-premise environments typically cannot match at this density. For teams running sustained pre-training or production fine-tuning at scale, colocation combines the cost efficiency of owned hardware with the infrastructure quality of a carrier-grade facility (Source: IntuitionLabs, 2026).
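Rack power density can be estimated from the per-GPU figure. The sketch below assumes eight GPUs per node and a 45 percent allowance on top of GPU power for CPUs, memory, NICs, fans, and power-supply losses; that overhead factor is a placeholder, and vendor power specifications should replace it in a real plan.

```python
def rack_power_kw(nodes_per_rack, gpus_per_node=8, gpu_watts=700, non_gpu_overhead=0.45):
    """Estimated rack power draw in kW from GPU power plus a flat overhead factor."""
    gpu_kw = nodes_per_rack * gpus_per_node * gpu_watts / 1000
    return gpu_kw * (1 + non_gpu_overhead)


for nodes in (2, 4, 6):
    print(f"{nodes} nodes per rack: about {rack_power_kw(nodes):.1f} kW")
```

Under these assumptions, four to six eight-GPU nodes per rack land in the 30 to 50 kW range, which is the figure to check against a facility's per-rack power and cooling limits.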
The Right-Sizing Decision Framework
Five questions to map your LLM workload to the correct infrastructure model.
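1. What is your target model size? Under 13B: a single node with two to four 80 GB H100s or A100s is viable for full fine-tuning. Between 13B and 34B: the reference benchmarks above imply 200 to 600 GB in total, which a single eight-GPU 80 GB node can usually hold. Above 34B: plan for multi-GPU configurations and evaluate parallelism strategy before selecting hardware.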
2. Which fine-tuning method applies? QLoRA or LoRA: a single GPU at a lower VRAM tier is sufficient in most configurations. Full fine-tuning: use the VRAM formula above to size total cluster memory before making any sourcing decision.
3. How long do jobs run and how frequently? Jobs under 24 hours, intermittent schedule: GPU cloud eliminates idle capacity cost. Jobs running continuously for weeks at high utilization: bare metal dedicated or colocation is the cost-efficient model.
4. What is your budget horizon? Short-term or variable budget: cloud on-demand. Multi-quarter committed budget: evaluate bare metal or reserved capacity, which typically reduces cost by 30 to 50 percent vs. on-demand at equivalent utilization.
5. What is your team's infrastructure operations capacity? No dedicated infrastructure ops: managed cloud or a colocation provider with managed services. Limited ops team running large models on a frequent production schedule: managed cloud or a hybrid arrangement that limits the operational surface area rather than self-managed bare metal.
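The five questions can be encoded as a small triage function. This is a simplified sketch: the function name, argument names, thresholds between the stated ranges, and the order in which the sourcing rules are checked are all assumptions, and the output is a starting point for discussion rather than a recommendation engine.

```python
def recommend(model_b, method, job_hours, continuous, committed_budget, has_ops_team):
    """Return (hardware_hint, sourcing_hint) loosely following the five questions."""
    # Questions 1 and 2: model size and fine-tuning method drive the hardware shape.
    if method in ("lora", "qlora"):
        hardware = "single GPU, lower VRAM tier"
    elif model_b < 13:
        hardware = "single node, sized with the VRAM formula"
    else:
        hardware = "multi-GPU, choose parallelism strategy first"

    # Questions 3 to 5: duration, budget horizon, and ops capacity drive sourcing.
    if job_hours < 24 and not continuous:
        sourcing = "GPU cloud (on-demand)"
    elif not has_ops_team:
        sourcing = "managed cloud or colocation with managed services"
    elif committed_budget:
        sourcing = "bare metal dedicated or colocation"
    else:
        sourcing = "GPU cloud (on-demand)"
    return hardware, sourcing


print(recommend(7, "qlora", 6, False, False, False))
print(recommend(70, "full", 24 * 30, True, True, True))
```

Checking ops capacity before budget horizon reflects the fifth question: a committed budget does not help if nobody is available to run dedicated hardware.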
How Inflect Helps Teams Source GPU Infrastructure for LLM Workloads
Inflect is a digital infrastructure marketplace where ML engineering teams and infrastructure buyers can search, compare, and receive instant pricing across bare metal servers, GPU cloud instances, and colocation facilities, without a sales call. For teams working through the GPU right-sizing decision this post covers, Inflect surfaces options across more than 6,000 facilities and providers in over 100 countries, including Equinix, Digital Realty, CoreSite, Flexential, NTT, and hundreds of others. The ability to compare deployment models at actual market prices, rather than requesting quotes from individual vendors, is what makes the cost comparison between cloud, bare metal, and colocation tractable in the timeframe a real infrastructure decision requires. Expert advisory is available at no charge to buyers, covering sourcing strategy, provider comparison, and capacity requirements for specific LLM training and fine-tuning workloads.
Ready to source GPU infrastructure for your LLM workload?
Search bare metal GPU configurations and colocation facilities with instant pricing across more
Compare H100 vs. A100 configurations at actual market rates without a sales call
Access free expert advisory to validate your infrastructure model before you commit
Find providers that support the power density and interconnect requirements your workload demands
About the Author
Haley Rogers
Content & Social Media Specialist
Haley Rogers is the Content & Social Media Specialist at Inflect, bringing over two years of experience in social media, marketing, and content strategy — including time at a fast-paced tech company before joining the Inflect team. She specializes in translating complex digital infrastructure topics into clear, engaging content, with a particular focus on blog writing and brand storytelling across channels.