GPU Memory Bandwidth: Why It Matters More Than VRAM for Inference Workloads
If your LLM is generating tokens at 20 tokens per second on an H100, adding more VRAM will not make it faster. Increasing memory bandwidth will.
That claim runs counter to how most GPU procurement conversations are framed. Buyers compare GPUs on VRAM first (32 GB, 80 GB, 141 GB), because memory capacity is the number that determines whether a model fits at all. Bandwidth is listed further down the spec sheet, measured in units that require more context to interpret, and rarely appears in the headline metrics of cloud provider marketing.

The framing is wrong because it confuses a necessary condition with a performance driver. VRAM capacity is necessary: if your model does not fit in memory, no amount of bandwidth helps. But once the model is loaded, the question is how fast it can generate tokens. At the batch sizes typical of latency-sensitive inference, the answer is almost entirely determined by how fast the GPU can move data between memory and compute units, not by how many gigabytes are available or how many FLOPS the chip can theoretically reach.
This post explains the mechanism behind that claim, walks through what it means for the GPU tiers available today, and gives infrastructure teams the specific procurement questions that translate the technical argument into buying decisions.
What Is GPU Memory Bandwidth and How Does It Differ from VRAM?
GPU memory bandwidth and VRAM capacity are two separate hardware specifications that measure different properties of a GPU's memory system, and conflating them is the most common source of misaligned GPU procurement decisions for inference workloads.
How GPU Memory Bandwidth Is Measured
GPU memory bandwidth is measured in gigabytes per second (GB/s) or terabytes per second (TB/s) and reflects the rate at which data can move between GPU memory and compute units. It is determined primarily by memory bus width and memory clock speed. Memory architecture plays a decisive role: High Bandwidth Memory (HBM), used in data center GPUs such as the H100, achieves very high throughput through a wide, stacked-die interface, while GDDR-based designs rely on narrower buses operating at higher clock speeds, resulting in lower overall bandwidth. For example, the NVIDIA H100 SXM5 delivers up to 3.35 TB/s of memory bandwidth using HBM3. Although modern GPUs include small amounts of on-chip SRAM (registers and L1/L2 cache), these are limited to megabytes of capacity; large data structures such as model weights and KV cache reside in off-chip memory. As a result, the bandwidth of HBM or GDDR memory is the primary determinant of data movement throughput and a key constraint for inference performance (Source: NVIDIA, 2023).
Why VRAM Capacity and Memory Bandwidth Are Separate Specifications
VRAM capacity (measured in GB) and memory bandwidth (measured in GB/s) are independent properties: capacity tells you how much data the GPU can hold at once, while bandwidth tells you how fast it can access that data. A GPU can have large VRAM and relatively low bandwidth (for example, older GDDR6-based cards with 48 GB of memory but under 900 GB/s), or moderate VRAM with very high bandwidth, such as the H100 SXM5 with 80 GB HBM3 at 3.35 TB/s. For inference workloads, these two specs predict different things: VRAM sets the minimum model size threshold, while bandwidth sets the serving throughput ceiling. Treating them as proxies for each other leads to GPU selections that clear the size requirement but underperform on the metric that determines serving cost.
Why Autoregressive Inference Is Memory-Bandwidth-Bound, Not Compute-Bound
Autoregressive LLM inference is memory-bandwidth-bound at low-to-moderate batch sizes because the token generation loop requires continuous streaming of model weights from HBM to compute units, and the volume of memory traffic per step far exceeds the compute work performed, leaving the GPU's arithmetic units waiting on memory rather than the reverse.
The Token Generation Loop and Why Weights Must Be Streamed Each Step
The sequential nature of autoregressive decode means each new token requires a full forward pass through the model's weight matrices before the next output token can be produced. For a single sequence at batch size 1, this means streaming the model's parameters from HBM through the memory bus on each step. Modern GPU kernels and caching mechanisms reduce this to streaming the weights relevant to each layer sequentially rather than naively reloading the entire model in full, but the fundamental constraint holds: the volume of data that must transit the memory bus each step scales with model size and is largely fixed regardless of kernel efficiency. Compute units remain idle, waiting for the next block of weights to arrive. The GPU is not compute-starved; it is data-starved.
Arithmetic Intensity: A Worked Example for LLM Decode

Arithmetic intensity is measured in FLOPs per byte and determines whether a workload's execution time is limited by compute throughput or by memory bandwidth. Token generation at batch size 1 is fundamentally a memory-bound operation: for a 70-billion-parameter model in FP16, a single decode step involves approximately 140 billion FLOPs of computation, primarily in matrix multiplication across each layer (roughly 2 FLOPs per parameter), and requires streaming approximately 140 GB of model parameter data through the memory bus (2 bytes per parameter), giving a low arithmetic intensity of roughly 1 FLOP per byte (Source: FlexGen, 2023). The ridge point of an H100 SXM5, the crossover above which a workload becomes compute-bound rather than bandwidth-bound, sits at approximately 590 FLOPs per byte (1,979 TFLOPS FP16 divided by 3.35 TB/s bandwidth) (Source: NVIDIA, 2023). The 1,979 TFLOPS figure is NVIDIA's Tensor Core peak with structured sparsity; against the dense FP16 peak of roughly 989 TFLOPS the ridge point is about 295 FLOPs per byte, which does not change the conclusion. At an arithmetic intensity of 1 FLOP per byte, a 70B decode at batch size 1 on an H100 SXM5 can use at most roughly 0.17 percent of the chip's peak compute throughput (1 divided by 590). For bandwidth-bound inference, tokens per second scales approximately with available memory bandwidth, not peak FLOPS.
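The arithmetic in this worked example can be reproduced in a few lines. The following is a minimal sketch, assuming only the figures quoted above (2 FLOPs and 2 bytes per parameter in FP16, 1,979 TFLOPS peak, 3.35 TB/s); the function and constant names are illustrative and do not come from any library.

```python
def arithmetic_intensity(flops_per_param: float = 2.0,
                         bytes_per_param: float = 2.0,
                         batch_size: int = 1) -> float:
    """FLOPs performed per byte of weights streamed in one decode step.

    The parameter count cancels out: each weight is read once per step and
    used for flops_per_param FLOPs for every sequence in the batch.
    """
    return flops_per_param * batch_size / bytes_per_param


def ridge_point(peak_flops: float, bandwidth_bytes_per_s: float) -> float:
    """Arithmetic intensity above which a workload becomes compute-bound."""
    return peak_flops / bandwidth_bytes_per_s


# Figures quoted in the text for the H100 SXM5.
PEAK_FLOPS = 1979e12   # FP16 Tensor Core peak, FLOPS
BANDWIDTH = 3.35e12    # HBM3 bandwidth, bytes per second

intensity = arithmetic_intensity()           # FP16 decode at batch size 1
ridge = ridge_point(PEAK_FLOPS, BANDWIDTH)
compute_ceiling = intensity / ridge          # usable fraction of peak FLOPS

print(f"arithmetic intensity: {intensity:.1f} FLOPs/byte")
print(f"ridge point:          {ridge:.0f} FLOPs/byte")
print(f"compute ceiling:      {compute_ceiling:.2%} of peak")
```

Swapping in a different peak FLOPS or bandwidth figure moves the ridge point, but at an intensity near 1 FLOP per byte the workload stays far to the bandwidth-bound side on any current data center GPU.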
How Batch Size Shifts the Bottleneck
Arithmetic intensity scales approximately linearly with batch size during the decode phase, because the weights streamed once per step are reused for every sequence in the batch: at batch size 8, arithmetic intensity reaches roughly 8 FLOPs per byte; at batch size 64, roughly 64 FLOPs per byte. The H100 SXM5's ridge point of approximately 590 FLOPs per byte means that inference for a 70B model does not enter the compute-bound regime until batch sizes approach several hundred simultaneous sequences. For production serving scenarios that prioritize low latency, where batch sizes typically range from 1 to 32, the workload remains firmly bandwidth-bound. High-throughput offline processing that batches many requests together is the scenario where FLOPS begin to matter more, and it is the one use case where prioritizing raw compute over bandwidth may be justified for GPU selection.
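The batch-size relationship can be sketched the same way. This is a simplified model, assuming intensity is exactly proportional to batch size and ignoring KV cache reads and activation traffic, both of which push the real crossover higher; the helper names are hypothetical.

```python
import math

# Ridge point from the H100 SXM5 figures quoted in the text, in FLOPs per byte.
RIDGE_POINT = 1979e12 / 3.35e12


def decode_intensity(batch_size: int, flops_per_param: float = 2.0,
                     bytes_per_param: float = 2.0) -> float:
    """Weights are read once per step and reused by every sequence."""
    return batch_size * flops_per_param / bytes_per_param


def crossover_batch_size(ridge: float, flops_per_param: float = 2.0,
                         bytes_per_param: float = 2.0) -> int:
    """Smallest batch size whose intensity reaches the ridge point."""
    return math.ceil(ridge * bytes_per_param / flops_per_param)


def regime(batch_size: int, ridge: float = RIDGE_POINT) -> str:
    if decode_intensity(batch_size) >= ridge:
        return "compute-bound"
    return "bandwidth-bound"


for b in (1, 8, 32, 64, 256, 1024):
    print(f"batch {b:>4}: {decode_intensity(b):6.0f} FLOPs/byte -> {regime(b)}")
print("crossover batch size:", crossover_batch_size(RIDGE_POINT))
```

Under these assumptions every batch size in the 1 to 32 range used for latency-sensitive serving lands well inside the bandwidth-bound regime.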
GPU Memory Bandwidth Benchmarks: H100, H200, MI300X, and Mid-Tier Options Compared
The primary data center GPU tiers available today span from 600 GB/s to 5.3 TB/s in memory bandwidth, and the right selection depends on whether a workload is bottlenecked by bandwidth, capacity, or compute at the batch sizes and context lengths being served.
NVIDIA H100 SXM vs. PCIe: Why the Form Factor Changes Bandwidth
The NVIDIA H100 is available in two form factors that deliver meaningfully different memory bandwidth: the SXM5 variant delivers 3.35 TB/s using HBM3, while the PCIe variant delivers 2.0 TB/s using HBM2e (Source: NVIDIA, 2023). For bandwidth-bound inference at batch size 1 with a 70B model, the bandwidth difference translates directly to a proportional difference in token generation throughput: the SXM5 has approximately 67 percent more bandwidth and will deliver approximately 67 percent more tokens per second under identical conditions. Both variants carry 80 GB of VRAM, so model capacity is identical. The form factor distinction is not a marginal spec variation; it is a performance tier separation that matters when a cloud provider lists "H100" without specifying SXM or PCIe. Best for: bandwidth-bound, latency-sensitive inference at large model sizes.
H200 and HBM3e: What the Memory Architecture Upgrade Delivers for Inference
The NVIDIA H200 delivers 4.8 TB/s of memory bandwidth using HBM3e, a 43 percent increase over the H100 SXM5's 3.35 TB/s, with VRAM capacity expanding to 141 GB from 80 GB (Source: NVIDIA, 2023). For bandwidth-bound inference workloads, the H200's high memory bandwidth translates directly to higher token throughput per GPU at the same batch sizes. The combined increase in bandwidth and capacity makes the H200 particularly relevant for large model sizes at long context lengths, where KV cache growth erodes usable VRAM headroom on H100 configurations. Best for: bandwidth-bound, latency-sensitive inference at scale, and workloads where VRAM headroom matters alongside bandwidth.
AMD MI300X: How 5.3 TB/s and 192 GB HBM3 Change the Competitive Calculus
The AMD Instinct MI300X delivers 5.3 TB/s of memory bandwidth with 192 GB of HBM3, the highest bandwidth and capacity of the GPUs compared in this post (Source: AMD, 2025). For inference buyers, those two numbers address the two binding constraints simultaneously: bandwidth above the H200's ceiling for throughput-limited serving, and enough capacity to run 70B models in FP16 on a single card with roughly 50 GB left over for KV cache. A 405B-class model in 4-bit quantization needs roughly 203 GB for weights alone, so it still exceeds one card but fits across two. The MI300X does not lead the market on raw FLOPS, which reinforces the central argument: for bandwidth-bound inference, AMD's bandwidth-first architecture creates genuine price-performance advantages that FLOPS-focused comparisons obscure. Best for: workloads that are both bandwidth- and capacity-constrained, including large model serving and long-context inference.
L40S and A10: The Mid-Market Bandwidth Reality for Cost-Sensitive Inference
The NVIDIA L40S delivers 864 GB/s of memory bandwidth with 48 GB of GDDR6 memory, while NVIDIA’s A10 delivers 600 GB/s with 24 GB of GDDR6 memory; both are well below the throughput ceiling of HBM-based data center GPUs. For smaller inference workloads, especially when model size and context length fit comfortably within available VRAM, these cards can remain commercially viable because their lower cost may outweigh their lower bandwidth. Best for: cost-sensitive inference on smaller models, moderate-context workloads, and batch processing where throughput per dollar matters more than absolute token speed (Source: NVIDIA, 2023).
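To put the tiers in this section on one scale, the bandwidth figures quoted above can be converted into theoretical decode ceilings. The sketch below assumes a 7B model in FP16 (about 14 GB of weights, chosen here only because it fits on every card listed), batch size 1, and no framework overhead; real throughput will be lower, and the dictionary and function names are illustrative.

```python
# Memory bandwidth in GB/s, as quoted in this section.
BANDWIDTH_GB_S = {
    "H100 SXM5": 3350,
    "H100 PCIe": 2000,
    "H200": 4800,
    "MI300X": 5300,
    "L40S": 864,
    "A10": 600,
}

# Assumption for illustration: 7B parameters x 2 bytes (FP16) = 14 GB.
WEIGHT_GB = 14.0


def decode_ceiling(bandwidth_gb_s: float, weight_gb: float) -> float:
    """Upper bound on tokens/s at batch size 1: one full weight read per token."""
    return bandwidth_gb_s / weight_gb


ceilings = {gpu: decode_ceiling(bw, WEIGHT_GB) for gpu, bw in BANDWIDTH_GB_S.items()}
baseline = ceilings["H100 SXM5"]

for gpu, tps in sorted(ceilings.items(), key=lambda item: -item[1]):
    print(f"{gpu:<10} {tps:6.1f} tok/s ceiling ({tps / baseline:5.0%} of H100 SXM5)")
```

Because the ceiling is a simple ratio, the relative ordering is the same for any model size that fits; only the absolute numbers change.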
When VRAM Capacity Still Matters and How to Balance Both Specs
VRAM capacity becomes the binding constraint for inference in three specific conditions: when the model plus KV cache exceeds available memory, when long context lengths cause KV cache to dominate memory usage, and when multi-GPU inference requires data distribution across interconnects with limited bandwidth.
The VRAM Floor: Model Sizes That Cannot Fit on Low-Memory GPUs
Model size sets a hard lower bound on VRAM requirements that no amount of bandwidth can compensate for: if the model weights do not fit in GPU memory, the GPU cannot serve. A model's parameter count directly determines that minimum: approximately 2 bytes per parameter in FP16, giving a 7B model roughly 14 GB, a 13B model roughly 26 GB, a 70B model roughly 140 GB, and a 405B model roughly 810 GB (Source: Frantar et al., 2022). In practice, FP16 is not always the serving precision: 8-bit quantization approximately halves VRAM requirements and 4-bit quantization reduces them by approximately 75 percent, with acceptable model quality degradation for many use cases. For larger models in the 405B class, even 4-bit quantization (roughly 203 GB) requires distributing across multiple high-VRAM GPUs. An A10's 24 GB does not fit a 70B model at any standard precision; an H100's 80 GB does not fit it in FP16 but does fit it at 8-bit (roughly 70 GB, leaving little room for KV cache) or 4-bit (roughly 35 GB); the MI300X's 192 GB fits it in FP16 with substantial KV cache headroom remaining. Capacity governs feasibility, and bandwidth governs throughput: they are not substitutes for each other.
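The bytes-per-parameter rule can be turned into a quick feasibility check. This is a rough sketch that counts weights only, assuming 2, 1, and 0.5 bytes per parameter for FP16, 8-bit, and 4-bit; it ignores KV cache, activations, and runtime overhead, so a "fits" result is a necessary condition rather than a guarantee.

```python
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

# VRAM capacities in GB, as quoted in the text.
VRAM_GB = {"A10": 24, "H100": 80, "H200": 141, "MI300X": 192}


def weight_gb(params_billion: float, precision: str) -> float:
    """Weight footprint in GB: billions of parameters x bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[precision]


def fits(params_billion: float, precision: str, vram_gb: float) -> bool:
    """True if the weights alone fit; KV cache and overhead are not counted."""
    return weight_gb(params_billion, precision) <= vram_gb


for precision in BYTES_PER_PARAM:
    size = weight_gb(70, precision)
    verdicts = ", ".join(
        f"{gpu}: {'fits' if fits(70, precision, cap) else 'no'}"
        for gpu, cap in VRAM_GB.items()
    )
    print(f"70B {precision:<4} {size:6.1f} GB -> {verdicts}")
```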
KV Cache Growth in Long-Context Inference
KV cache memory stores the key-value pairs computed in each attention layer for all previous tokens in the current sequence, and its memory usage scales with sequence length, hidden dimension size, number of layers, and numerical precision: KV cache memory in bytes equals approximately 2 times the number of layers times the hidden dimension times the sequence length times the bytes per element, where the factor of 2 accounts for separate storage of keys and values. As both the count of input tokens and the output length grow, KV cache accumulates rapidly. For a 70B-class model with Llama 2 70B's dimensions (80 layers, hidden dimension 8,192) in FP16, the formula gives roughly 2.6 MB per token, so a single sequence at a context length of 128,000 tokens reaches approximately 335 GB, more than twice the model's weight size of 140 GB (Source: Touvron et al., 2023). That figure assumes standard multi-head attention; Llama 2 70B itself uses grouped-query attention with 8 key-value heads for 64 query heads, which cuts the footprint by a factor of 8, and its native context window is 4,096 tokens, so the 128,000-token case is best read as an illustration for long-context models of similar size. Under the multi-head assumption, a single sequence's KV cache exceeds the weight size beyond roughly 53,000 tokens, and a batch of concurrent sequences reaches that point at much shorter contexts (four sequences at 32,000 tokens already need more than 330 GB), making VRAM, not bandwidth, the binding constraint even on high-bandwidth GPUs. Architectures such as multi-query and grouped-query attention significantly reduce this footprint relative to standard multi-head attention by sharing key and value projections across query heads, and are worth evaluating specifically for long-context serving. The practical implication: bandwidth-first GPU selection is correct for short-to-medium context inference, while long-context serving at scale requires evaluating the combined memory envelope of model weights plus expected KV cache load before selecting hardware.
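The KV cache formula is easy to evaluate directly. The sketch below assumes the dimensions of a 70B-class model (80 layers, hidden dimension 8,192, 64 query heads) and adds one hypothetical parameter, kv_share, for the number of query heads that share each key-value head: 1 for standard multi-head attention, 8 for grouped-query attention with 8 key-value heads, 64 for multi-query attention. Small differences from rounded figures in prose usually come from decimal versus binary units.

```python
def kv_cache_bytes(num_layers: int, hidden_dim: int, seq_len: int,
                   bytes_per_elem: int = 2, batch_size: int = 1,
                   kv_share: int = 1) -> int:
    """2 (keys and values) x layers x KV width x tokens x bytes per element.

    kv_share is the number of query heads sharing one key/value head, so the
    stored KV width is hidden_dim // kv_share.
    """
    kv_dim = hidden_dim // kv_share
    return 2 * num_layers * kv_dim * seq_len * bytes_per_elem * batch_size


# Assumed 70B-class dimensions, FP16 (2 bytes per element).
LAYERS, HIDDEN, CONTEXT = 80, 8192, 128_000

per_token = kv_cache_bytes(LAYERS, HIDDEN, seq_len=1)
mha = kv_cache_bytes(LAYERS, HIDDEN, CONTEXT)               # multi-head
gqa = kv_cache_bytes(LAYERS, HIDDEN, CONTEXT, kv_share=8)   # grouped-query
mqa = kv_cache_bytes(LAYERS, HIDDEN, CONTEXT, kv_share=64)  # multi-query

print(f"per token (MHA): {per_token / 1e6:.2f} MB")
for name, size in (("MHA", mha), ("GQA-8", gqa), ("MQA", mqa)):
    print(f"{name:<6} at {CONTEXT:,} tokens: {size / 1e9:7.1f} GB "
          f"({size / 2**30:.1f} GiB)")
```

Multiplying by batch_size shows how quickly concurrent long-context sequences consume the headroom left after the weights are loaded.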
Multi-GPU Inference: When NVLink Bandwidth Becomes the New Bottleneck
When a model is too large to fit on a single GPU and is distributed across multiple devices using tensor parallelism, the binding bandwidth constraint shifts from HBM bandwidth to GPU-to-GPU interconnect bandwidth. NVLink 4.0 on the H100 SXM5 delivers 900 GB/s of total bidirectional bandwidth per GPU across all NVLink connections; PCIe 4.0 x16 delivers approximately 64 GB/s bidirectional (Source: NVIDIA, 2022). A four-GPU SXM configuration connected via NVLink is not bandwidth-equivalent to four PCIe H100s of the same aggregate HBM bandwidth, because inter-GPU communication during attention layers becomes a bottleneck that PCIe cannot clear at the same rate. Some cloud instances list H100 GPUs but do not expose NVLink or operate below full TDP due to thermal and power constraints in the host system, reducing effective bandwidth below the published specification. Verifying interconnect topology and thermal configuration, not just per-card VRAM and bandwidth specs, is necessary when evaluating multi-GPU inference infrastructure.
How to Evaluate GPU Cloud and Bare Metal Providers on Memory Bandwidth
Evaluating GPU providers on memory bandwidth requires requesting five specific technical specifications, not just VRAM count, and applying them to the actual batch sizes and model sizes in the serving workload.
The Five Specifications to Request Before Provisioning GPU Infrastructure
The five specifications that determine whether a GPU instance will meet bandwidth-bound inference requirements are: memory bandwidth in GB/s or TB/s (not VRAM GB), HBM generation (HBM2e, HBM3, or HBM3e, since generation affects both bandwidth and sustained error correction behavior), NVLink version and topology for any multi-GPU configuration, thermal design power and whether the instance operates at full TDP or a throttled subset of it, and the driver and interconnect software stack including CUDA version and NVLink fabric manager configuration. Providers who list only VRAM and FLOPS without disclosing memory bandwidth and interconnect topology are not supplying the specifications necessary to evaluate bandwidth-bound workloads. Requesting these five specifications as part of any GPU procurement evaluation will immediately differentiate providers who understand inference workloads from those who do not.
PCIe vs. SXM: A Worked Provider Comparison
For a team serving a 70B FP16 model at batch size 1 to 4, the choice between a PCIe H100 and an SXM H100 is not a marginal spec difference: it is approximately a 67 percent throughput difference driven entirely by bandwidth. Consider two providers: Provider A offers H100 PCIe instances at 2.0 TB/s per card with no NVLink; Provider B offers H100 SXM5 instances at 3.35 TB/s per card with NVLink 4.0. At batch size 1, the theoretical token generation ceiling for a 70B FP16 model scales with memory bandwidth divided by model weight size: Provider A delivers approximately 14 tokens per second per GPU (2.0 TB/s divided by 140 GB); Provider B delivers approximately 24 tokens per second per GPU (3.35 TB/s divided by 140 GB), before accounting for deep learning framework overhead. These are normalized per-GPU figures: 140 GB of FP16 weights must in practice be sharded across at least two 80 GB cards, but the ratio between the two providers is unchanged by sharding. At batch sizes 2 through 4, the ratio holds. Provider A's pricing would need to be at least 40 percent lower per GPU to match Provider B's cost per token at equivalent latency SLAs, a threshold most PCIe instances do not clear. The comparison assumes both providers deliver the listed bandwidth consistently, which requires confirming TDP and interconnect configuration before provisioning.
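The provider comparison reduces to two ratios, which the following sketch computes. It assumes the bandwidth and weight figures from the paragraph above and ignores framework overhead; the hourly price is a hypothetical placeholder, not a quote from any provider, and the function names are illustrative.

```python
def decode_ceiling_tps(bandwidth_tb_s: float, weight_gb: float) -> float:
    """Theoretical tokens/s at batch size 1: bandwidth divided by weight size."""
    return bandwidth_tb_s * 1000.0 / weight_gb


def break_even_discount(slow_tps: float, fast_tps: float) -> float:
    """Price cut the slower option needs to match the faster one's cost per token."""
    return 1.0 - slow_tps / fast_tps


def cost_per_million_tokens(price_per_gpu_hour: float, tps: float) -> float:
    return price_per_gpu_hour / (tps * 3600.0) * 1e6


WEIGHT_GB = 140.0                                   # 70B parameters in FP16
provider_a = decode_ceiling_tps(2.0, WEIGHT_GB)     # H100 PCIe
provider_b = decode_ceiling_tps(3.35, WEIGHT_GB)    # H100 SXM5
discount = break_even_discount(provider_a, provider_b)

PRICE_B = 3.00  # hypothetical USD per GPU-hour, for illustration only
price_a_break_even = PRICE_B * (1.0 - discount)

print(f"Provider A ceiling: {provider_a:.1f} tok/s")
print(f"Provider B ceiling: {provider_b:.1f} tok/s")
print(f"break-even discount for A: {discount:.1%}")
print(f"A must charge at most ${price_a_break_even:.2f}/h if B charges ${PRICE_B:.2f}/h")
```

Replacing PRICE_B with a real quote gives the per-GPU price below which the PCIe option becomes cheaper per token, under the same no-overhead assumption.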
FAQ: GPU Memory Bandwidth and LLM Inference
Is VRAM or bandwidth more important for LLM inference?
For latency-sensitive inference at low-to-moderate batch sizes, memory bandwidth is more important than VRAM capacity for determining tokens-per-second throughput. VRAM capacity determines whether a model fits in memory at all, but once loaded, token generation rate scales with bandwidth, not with available headroom. Both matter, but they answer different questions: VRAM answers "can I run this model," and bandwidth answers "how fast."
Why is LLM inference memory-bandwidth-bound?
LLM inference is memory-bandwidth-bound because the token generation loop requires streaming model weights from HBM to compute units on each decode step, and the volume of data moved per step far exceeds the arithmetic work performed at batch sizes below several hundred. This produces an arithmetic intensity well below the GPU's ridge point, leaving compute units waiting on memory rather than the reverse.
Is H100 PCIe or SXM better for inference?
H100 SXM5 is better for bandwidth-bound inference: it delivers 3.35 TB/s of HBM3 bandwidth versus 2.0 TB/s for the PCIe variant, a 67 percent advantage that translates directly to higher token throughput at low batch sizes. Both carry 80 GB of VRAM. If a cloud provider lists H100 without specifying the form factor, request the memory bandwidth spec before provisioning (Source: NVIDIA, 2022).
How much memory bandwidth does the H100 have?
The H100 SXM5 delivers 3.35 TB/s of memory bandwidth using HBM3, and the H100 PCIe delivers 2.0 TB/s using HBM2e. The two variants share the same GPU architecture and VRAM capacity but differ materially on bandwidth because they pair the die with different memory generations (Source: NVIDIA, 2022).
Does batch size affect whether a GPU is compute-bound or memory-bound during inference?
Yes. Arithmetic intensity during the decode phase scales approximately linearly with batch size: small batch sizes produce low arithmetic intensity and bandwidth-bound execution, while large batch sizes increase arithmetic intensity and eventually shift the workload toward compute-bound territory. For a 70B model on an H100 SXM5, that crossover does not occur until batch sizes reach several hundred sequences, above the range typical of latency-sensitive serving.
What is the difference between HBM and GDDR6 for inference workloads?
HBM achieves throughput through a wide stacked-die architecture with a very wide memory bus, enabling bandwidths of 2 TB/s and above on current data center GPUs. GDDR6 achieves lower bandwidths, typically 600 to 900 GB/s for current GPUs, through a narrower interface at higher clock speeds. For large-model bandwidth-bound inference, HBM-based GPUs deliver proportionally higher token throughput. GDDR6-based GPUs like the L40S and A10 remain competitive for smaller models and cost-sensitive deployments where the bandwidth ceiling is sufficient for the model being served.
How do I compare two GPU providers that list the same VRAM?
Request memory bandwidth in TB/s or GB/s, confirm HBM generation, verify whether multi-GPU configurations use NVLink or PCIe interconnect, and ask whether instances operate at full TDP. Two H100 instances with identical 80 GB VRAM can deliver 2.0 TB/s or 3.35 TB/s depending on form factor, a 67 percent throughput difference for bandwidth-bound workloads that will not appear anywhere in a VRAM-only comparison.
Finding GPU Infrastructure With the Bandwidth Specs That Match Your Workload
Inflect is a digital infrastructure marketplace where teams sourcing GPU cloud and bare metal compute can search, compare, and receive instant pricing from providers across 6,000+ facilities in 100+ countries, with no sales call required. For inference buyers evaluating providers on bandwidth rather than VRAM, that capability matters: Inflect's search returns side-by-side provider options with the technical specifications and pricing needed to run the PCIe vs. SXM analysis described in this post, without committing to a sales process before the comparison is complete. Bare metal GPU configurations across H100, H200, and MI300X tiers are available on Inflect, where bandwidth-per-dollar comparisons are most consequential for production inference workloads. Buyers who need guidance on structuring a bandwidth-focused GPU procurement evaluation can access Inflect's expert advisory at no charge.
Start Comparing GPU Providers on the Specs That Drive Inference Performance
Search GPU cloud and bare metal options across providers with instant pricing and no sales call required:
Compare H100 PCIe vs. SXM configurations, H200 HBM3e options, and MI300X availability side by side
Access free expert advisory to evaluate bandwidth, interconnect topology, and TDP specifications before provisioning
About the Author
Haley Rogers
Content & Social Media Specialist
Haley Rogers is the Content & Social Media Specialist at Inflect, bringing over two years of experience in social media, marketing, and content strategy — including time at a fast-paced tech company before joining the Inflect team. She specializes in translating complex digital infrastructure topics into clear, engaging content, with a particular focus on blog writing and brand storytelling across channels.