The GPU that powers a language model matters more than most developers realize. The choice of hardware affects training time, inference latency, cost per token, and ultimately what models are economically feasible to run. In 2026, that hardware conversation is dominated by one name: NVIDIA. And the two architectures shaping the present and near future are Blackwell — already deployed at massive scale — and Vera Rubin, arriving later this year with numbers that seem almost implausible on paper.

This guide cuts through the marketing to give AI engineers what they actually need to know: specs, real-world performance, cost implications, and what it means for how you build and deploy AI systems.

NVIDIA Blackwell: The Current Standard

The Blackwell architecture, introduced in 2024 and fully deployed through 2025–2026, represents NVIDIA's most significant architectural leap since the A100. The flagship chips — B200 and GB200 — are what cloud providers and AI labs are running right now.

  • 208B transistors: Dual-die design with 208 billion transistors, among the most complex GPUs in volume production.
  • 5th Gen Tensor Cores: Native FP4 (4-bit) support raises inference throughput over Hopper, whose lowest native precision was FP8.
  • NVLink 5: 1.8 TB/s bidirectional bandwidth per GPU; an NVL72 rack reaches 130 TB/s aggregate.
  • Up to 288 GB HBM3e: Large memory for running trillion-parameter models. B200 carries 192 GB per GPU; Blackwell Ultra (B300/GB300) configurations reach 288 GB per GPU.
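
To put those memory figures in context, here is a minimal sketch of how much memory the weights alone occupy at each precision, and how many GPUs that implies. It counts weights only: KV cache, activations, and framework overhead are left out, so real deployments need more headroom. The function names and the 192 GB default are illustrative choices, not part of any NVIDIA tool.

```python
import math

# Bits used to store one parameter at each precision.
BITS_PER_PARAM = {"FP32": 32, "BF16": 16, "FP8": 8, "FP4": 4}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BITS_PER_PARAM[precision] / 8 / 1e9


def min_gpus_for_weights(num_params: float, precision: str,
                         gpu_memory_gb: float = 192) -> int:
    """Smallest GPU count whose combined memory can hold the weights."""
    return math.ceil(weight_memory_gb(num_params, precision) / gpu_memory_gb)


for name, params in [("70B", 70e9), ("1T", 1e12)]:
    for precision in BITS_PER_PARAM:
        gb = weight_memory_gb(params, precision)
        gpus = min_gpus_for_weights(params, precision)
        print(f"{name} @ {precision}: {gb:,.0f} GB of weights, "
              f"at least {gpus} x 192 GB GPU(s)")
```

Even at FP4, a trillion-parameter model needs about 500 GB for its weights, so it still spans several GPUs. That is why the interconnect matters as much as per-GPU memory.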

Blackwell Technical Specs: B200 vs GB200

Understanding the difference between B200 and GB200 matters for procurement decisions:

Spec | B200 (single GPU) | GB200 (Grace + Blackwell superchip)
Compute | 1× Blackwell B200 GPU | 2× Blackwell B200 GPUs + 1× Grace CPU
AI Training (FP8, with sparsity) | ~10 petaflops | ~20 petaflops (GPU portion)
AI Inference (FP4, with sparsity) | ~20 petaflops | ~40 petaflops (GPU portion)
HBM3e Memory | 192 GB | Up to 384 GB GPU + 480 GB LPDDR5X
Memory Bandwidth | 8 TB/s | 16 TB/s GPU + up to ~0.5 TB/s CPU
TDP | 1000 W | Up to 2700 W (combined)
Interconnect | NVLink 5 | NVLink 5 + NVLink-C2C (900 GB/s CPU-GPU)
Best For | Large-scale training clusters | Inference at scale, hybrid workloads

The NVL72: When 72 GPUs Become One

The most powerful Blackwell deployment is not a single GPU but the NVL72, a rack-scale system containing 72 Blackwell GPUs and 36 Grace CPUs, interconnected via NVLink 5 with 130 TB/s of aggregate bandwidth. Because every GPU can reach every other GPU at full NVLink speed, software can treat the rack as one large NVLink domain rather than as a cluster of separate servers.

What this makes possible:

  • Running trillion-parameter models with all model-parallel traffic kept on NVLink instead of slower inter-node networking
  • Training massive mixture-of-experts models with near-linear scaling
  • Serving multiple large models simultaneously with hot-swapping
  • Microsecond-latency communication between all 72 GPUs
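
The 130 TB/s aggregate figure follows from the per-GPU number, and the same number gives a feel for collective-communication cost. The sketch below is a back-of-the-envelope estimate under stated assumptions: it uses the standard bandwidth-only lower bound for an all-reduce (each GPU moves about 2(N-1)/N times the message size), takes one direction of the 1.8 TB/s bidirectional link, and ignores latency, protocol overhead, and any in-switch reduction. Function names are illustrative.

```python
NVLINK5_PER_GPU_TBPS = 1.8  # bidirectional bandwidth per GPU, as quoted above
GPUS_PER_RACK = 72


def aggregate_bandwidth_tbps(num_gpus: int, per_gpu_tbps: float) -> float:
    """Sum of per-GPU NVLink bandwidth across the rack."""
    return num_gpus * per_gpu_tbps


def allreduce_lower_bound_ms(message_gb: float, num_gpus: int,
                             per_gpu_tbps: float) -> float:
    """Bandwidth-only lower bound for all-reducing `message_gb` per GPU."""
    one_way_gb_per_s = per_gpu_tbps / 2 * 1000          # TB/s -> GB/s, one direction
    traffic_gb = 2 * (num_gpus - 1) / num_gpus * message_gb
    return traffic_gb / one_way_gb_per_s * 1000          # seconds -> milliseconds


total = aggregate_bandwidth_tbps(GPUS_PER_RACK, NVLINK5_PER_GPU_TBPS)
print(f"aggregate NVLink bandwidth: {total:.1f} TB/s")

t_ms = allreduce_lower_bound_ms(1.0, GPUS_PER_RACK, NVLINK5_PER_GPU_TBPS)
print(f"1 GB all-reduce across {GPUS_PER_RACK} GPUs: at least {t_ms:.2f} ms")
```

Measured times will be higher than this bound; the point is only that gradient- or activation-sized exchanges across the whole rack land in the millisecond range rather than tens of milliseconds.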

Who Is Actually Using NVL72?

Major cloud providers (AWS, Azure, GCP) and AI labs (OpenAI, Anthropic, Google DeepMind) are the primary NVL72 customers. The system requires liquid cooling and draws on the order of 120 kW per rack, so it is data center-scale hardware, not something you can slot into a standard air-cooled rack.

FP4: Why 4-Bit Precision Changes the Economics

One of Blackwell's most consequential features is native FP4 (4-bit floating point) support in its 5th Generation Tensor Cores. To understand why this matters, you need to understand how precision affects AI workloads:

Precision | Bits | Use Case | Throughput vs FP32
FP32 | 32 | Research, highest accuracy | 1× (baseline)
BF16 | 16 | Standard training | ~2×
FP8 | 8 | Inference (Hopper era standard) | ~4×
FP4 | 4 | Inference (Blackwell native) | ~8×

The practical result: running LLM inference on Blackwell with FP4 quantization can achieve roughly twice the throughput of running the same model on Hopper (H100) with FP8, with accuracy close to the FP8 baseline on many tasks when the model is quantized and calibrated carefully. For organizations spending millions on inference compute, that can be the difference between a workload being feasible and infeasible.
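
To make "4-bit floating point" concrete, here is a simplified NumPy simulation of block-scaled FP4 quantization. It assumes the E2M1 layout (1 sign, 2 exponent, 1 mantissa bit), whose representable magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4, and 6, and one shared scale per block of 16 values. As a simplification, the scale is kept in float32 here, whereas hardware formats store it in a narrower type. This is an illustration of the idea, not NVIDIA's implementation, and `fake_quantize_fp4` is a hypothetical helper name.

```python
import numpy as np

# Magnitudes representable by a 4-bit E2M1 float.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)


def fake_quantize_fp4(x: np.ndarray, block_size: int = 16) -> np.ndarray:
    """Round x to block-scaled FP4 and return the dequantized float32 values."""
    flat = x.astype(np.float32).ravel()
    if flat.size % block_size != 0:
        raise ValueError("x.size must be a multiple of block_size")
    blocks = flat.reshape(-1, block_size)

    # One scale per block, chosen so the largest magnitude maps to 6.0.
    max_abs = np.abs(blocks).max(axis=1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / FP4_GRID[-1], 1.0).astype(np.float32)

    scaled = blocks / scale
    # Pick the nearest grid magnitude for every element, then restore the sign.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    quantized = np.sign(scaled) * FP4_GRID[idx]
    return (quantized * scale).reshape(x.shape)


rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 64)).astype(np.float32)
restored = fake_quantize_fp4(weights)
rel_error = np.linalg.norm(weights - restored) / np.linalg.norm(weights)
print(f"relative error after FP4 round trip: {rel_error:.3f}")
```

The per-block scale is what keeps the error manageable: with only eight magnitudes available, each small group of values gets its own dynamic range instead of sharing one across the whole tensor.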

Python — FP4 Inference with TensorRT-LLM
    import tensorrt_llm
    from tensorrt_llm.quantization import QuantMode

    # Enable FP4 quantization for Blackwell inference.
    # NOTE: flag and precision names differ between TensorRT-LLM releases
    # (recent releases refer to the Blackwell format as NVFP4). Check
    # `use_fp4` and precision="fp4" against the API reference for your version.
    quant_mode = QuantMode.from_description(
        quantize_weights=True,
        quantize_activations=True,
        per_token=True,
        per_channel=True,
        use_fp4=True,  # Blackwell B200/GB200 native FP4
    )

    # Build engine with FP4
    builder = tensorrt_llm.Builder()
    builder_config = builder.create_builder_config(
        precision="fp4",
        quant_mode=quant_mode,
        max_batch_size=64,
        max_input_len=4096,
        max_output_len=2048,
    )

    # Result: ~2x throughput vs FP8 on H100
    #         ~8x throughput vs FP32 baseline

NVIDIA Vera Rubin: What's Coming Late 2026

While Blackwell is the present, Vera Rubin is what NVIDIA has already announced for H2 2026. The numbers are extraordinary even by NVIDIA's standards:

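Feature | Blackwell B200 | Rubin GPU (Announced) | Improvement
NVFP4 Inference | ~10 petaflops dense (~20 with sparsity) | 50 petaflops | ~5× (NVIDIA's figure)
Inference Throughput | Baseline | Up to 10× higher | Up to 10× (claimed)
Cost per Token | Baseline | Up to 10× lower | Up to 10× (claimed)
MoE Training Efficiency | Baseline | 4× fewer GPUs needed | 4× (claimed)
Memory | HBM3e | HBM4 | Higher bandwidth
Interconnect | NVLink 5 (1.8 TB/s) | NVLink 6 (3.6 TB/s) | 2×
CPU Pairing | Grace (72 Arm Neoverse V2 cores) | Vera (88 custom Olympus cores) | +22% cores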

Treat These Numbers Carefully

The 10× inference improvement claim is based on NVIDIA's own benchmarks using Mixture-of-Experts model architectures. Real-world improvements for dense transformer models will be smaller. Wait for independent benchmarks before making procurement decisions based on these figures.

The Vera Rubin Platform: More Than Just a GPU

Vera Rubin is not just a new GPU — it is a complete platform redesign:

  • Rubin GPU: The compute engine, targeting 50 petaflops NVFP4 per chip
  • Vera CPU: 88 custom Olympus cores, purpose-built for AI orchestration workloads
  • NVLink 6: 3.6 TB/s per GPU — double Blackwell's already-impressive 1.8 TB/s
  • HBM4 memory: Higher bandwidth than HBM3e, which mainly speeds up memory-bound work such as token-by-token decoding
  • NVLink Switch 6: Rack-scale connectivity enabling hundreds of GPUs to operate as unified compute

Should You Wait for Vera Rubin?

This is the practical question for anyone making infrastructure decisions right now:

Situation | Recommendation
Building production inference now | Use Blackwell. Don't wait: Rubin availability will be limited at launch.
Planning 2027 infrastructure | Design for Rubin compatibility but build with Blackwell.
Training frontier models | Blackwell is the right choice for current-generation models.
Cost-sensitive inference at scale | Consider waiting: a 10× cost-per-token improvement, if it holds up, is significant.
Research and experimentation | Cloud access to Blackwell via AWS/Azure/GCP is immediately available.
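
For the cost-sensitive case, the wait-or-build question can be framed as simple arithmetic. The sketch below compares staying on current hardware for a whole planning horizon against switching partway through to hardware with a lower cost per token. Every number in it is a hypothetical placeholder (monthly spend, months until switch, migration cost, and the reduction factor), and the model ignores capacity constraints, contract terms, and growth in traffic.

```python
def total_cost_stay(monthly_cost: float, horizon_months: int) -> float:
    """Cost of running on current hardware for the whole horizon."""
    return monthly_cost * horizon_months


def total_cost_switch(monthly_cost: float, horizon_months: int,
                      months_until_switch: int, cost_reduction: float,
                      migration_cost: float) -> float:
    """Cost of running on current hardware, then switching to cheaper hardware.

    cost_reduction is the factor by which monthly cost drops after the switch
    (10.0 for a 10x reduction, 3.0 for a 3x reduction).
    """
    months_after = max(horizon_months - months_until_switch, 0)
    months_before = horizon_months - months_after
    return (monthly_cost * months_before
            + monthly_cost / cost_reduction * months_after
            + migration_cost)


# Hypothetical inputs: $100k/month today, 24-month horizon, switch after 9 months.
stay = total_cost_stay(100_000, 24)
for reduction in (10.0, 3.0):
    switch = total_cost_switch(100_000, 24, 9, reduction, migration_cost=200_000)
    print(f"{reduction:.0f}x reduction: stay ${stay:,.0f} vs switch ${switch:,.0f}")
```

Running it with a more modest reduction factor alongside the headline 10× shows how sensitive the decision is to whether the vendor figure carries over to your workload.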

Beyond NVIDIA: AMD MI300X and Google TPU v5

The AI hardware market is not an NVIDIA monopoly, though NVIDIA's dominance is real:

  • AMD MI300X: Competitive on memory capacity (192 GB HBM3 in a single chip), increasingly supported by PyTorch/ROCm. Best for memory-bound workloads. Still lags on software ecosystem maturity.
  • Google TPU v5e/v5p: Highly optimized for Google's own JAX/XLA stack. Cost-competitive on Google Cloud. Limited portability if you want to run code elsewhere.
  • Cerebras CS-3: Wafer-scale chip with massive SRAM — extraordinary for specific research workloads but niche use case.
  • Groq LPU: Ultra-low latency inference chip. Excellent for real-time inference applications where token latency matters more than throughput.

Practical Advice for AI Engineers

For most teams, the decision is not which GPU to buy but which cloud provider to use. AWS (P6 instances with B200 and GB200), Azure (ND GB200 v6), and GCP (A4 with B200, A4X with GB200) all offer Blackwell-generation hardware today, alongside older Hopper-based H100/H200 instances. Start with spot/preemptible instances to reduce cost during experimentation before committing to reserved capacity.

What Blackwell Means for AI Inference Costs

The economics of AI have been shifting rapidly as hardware improves. Here is what Blackwell's FP4 throughput means in practical terms for LLM inference costs compared to the H100 era:

  • GPT-5.5 class models: Running inference on Blackwell B200 clusters costs roughly 40–50% less per token than equivalent H100 infrastructure at the same scale.
  • 70B parameter models: Can now be served on a single 192 GB Blackwell GPU with comfortable headroom (about 70 GB of weights at FP8, 35 GB at FP4), eliminating tensor parallelism overhead.
  • Mixture-of-Experts models: Blackwell's NVLink bandwidth makes MoE inference dramatically more efficient — the biggest beneficiary of the architecture.
  • Batch inference: High-throughput batch jobs see the largest cost reductions — up to 60% cheaper per token vs. H100 in optimal configurations.
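
Percentages like these come down to two inputs: what a GPU-hour costs and how many tokens it produces. A minimal sketch of that calculation follows. The prices and throughputs are placeholders chosen only to show the mechanics, not measurements or quotes from any provider.

```python
def cost_per_million_tokens(gpu_hour_usd: float, tokens_per_second: float) -> float:
    """Serving cost in USD per one million generated tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_usd / tokens_per_hour * 1_000_000


# Placeholder inputs: the newer GPU costs more per hour but serves 2x the tokens.
h100 = cost_per_million_tokens(gpu_hour_usd=3.00, tokens_per_second=1000)
b200 = cost_per_million_tokens(gpu_hour_usd=4.00, tokens_per_second=2000)
savings = 1 - b200 / h100

print(f"H100: ${h100:.3f} per 1M tokens")
print(f"B200: ${b200:.3f} per 1M tokens")
print(f"savings: {savings:.0%}")
```

The savings depend on the ratio of price to throughput, so a doubling of throughput only halves cost per token if the hourly price stays flat. Plug in your own measured throughput and negotiated pricing before relying on any headline percentage.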

Conclusion: Hardware Is Competitive Advantage

In the early years of the LLM era, the competitive advantage in AI was largely about models — who had the best architecture, the most data, the smartest researchers. In 2026, as model architectures converge and training techniques become widely understood, hardware efficiency is increasingly the differentiator. Teams that understand how to extract maximum performance from Blackwell GPUs — through FP4 quantization, optimal batch sizes, NVLink topology-aware parallelism — have a real advantage over teams that treat hardware as a commodity. And when Vera Rubin arrives, the teams who have invested in understanding the hardware layer will be the first to harness its 10× inference efficiency gains.