NVIDIA GB300 Leads H200 by 20x in Concurrent Agents per Megawatt

Taylor Wilson

Published today · About 8 min read

Independent benchmarking firm Artificial Analysis has released AA-AgentPerf, the first benchmark built for AI-agent inference. NVIDIA's GB300 NVL72 sustains 61,400 concurrent agents per megawatt versus roughly 2,600 for the prior-generation H200. That roughly 20× leap signals a structural shift in how compute is measured: from tokens per second to agents per megawatt.

Why can't the old benchmarks keep up?

Traditional inference benchmarks measure a single-turn exchange — one question in, one answer out, like a 100-meter sprint.

AI agents work differently. A task is broken into dozens or hundreds of chained steps; model calls and tool calls alternate, and context snowballs with every round.

This means that judging agent compute by "tokens per second" is like rating a marathon runner on sprint times: it measures the wrong dimension entirely.
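To make the snowballing concrete, here is a minimal sketch. It assumes, purely for illustration, that every turn appends a fixed number of tokens to the context and that the model re-reads the whole context on each call (no prefix caching). The function name and the token figures are hypothetical and are not taken from the benchmark.

```python
def tokens_processed(turns: int, tokens_per_turn: int) -> tuple[int, int]:
    """Return (final_context_size, total_input_tokens_read).

    Assumption: each turn appends `tokens_per_turn` tokens, and the model
    re-reads the full context on every call (no caching).
    """
    context = 0
    total_input = 0
    for _ in range(turns):
        context += tokens_per_turn   # the context grows every round
        total_input += context       # the whole context is read again
    return context, total_input


# One question in, one answer out.
print(tokens_processed(turns=1, tokens_per_turn=500))    # (500, 500)

# A long agent session with the same per-turn size.
print(tokens_processed(turns=200, tokens_per_turn=500))  # (100000, 10050000)
```

Under these assumptions the final context is 200 times larger than in the single-turn case, but the total input the system has to read is about 20,000 times larger, which is the kind of load a single-turn benchmark never exercises.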

What does AA-AgentPerf actually measure?

AA-AgentPerf replays real coding-agent trajectories spanning 12+ programming languages, with sessions running up to 200 turns and context exceeding 100,000 tokens.

The headline metric is "concurrent agents per megawatt." In plain terms: how many working agents can one megawatt of power keep running at once? The figure maps directly to a data center's power bill and serving density.
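As a rough illustration of the arithmetic behind such a metric, the sketch below normalizes a concurrency count to one megawatt of power draw. The function name and the example system (1,200 agents at 150 kW) are hypothetical; how Artificial Analysis measures power is not described here, so treat this as an assumption about the general shape of the calculation only.

```python
def agents_per_megawatt(concurrent_agents: float, power_kw: float) -> float:
    """Scale a measured concurrency to one megawatt (1 MW = 1,000 kW)."""
    if power_kw <= 0:
        raise ValueError("power_kw must be positive")
    return concurrent_agents * 1000.0 / power_kw


# Hypothetical system: 1,200 agents sustained while drawing 150 kW.
print(agents_per_megawatt(1200, 150))  # 8000.0
```

Normalizing by power rather than by GPU count is what lets a single card, a node, and a full rack be placed on the same scale.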

The test first fixes a service-level objective (a minimum output speed and a maximum time to first token), then ramps concurrency until the system can no longer meet it. Output quality is monitored throughout, so trading answer quality for concurrency is not allowed.
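The ramp-until-it-breaks procedure can be sketched as a simple loop. This is not the AA-AgentPerf harness: `Sample`, `max_concurrency_under_slo`, and `toy_system` are hypothetical names, the thresholds are invented, and the quality monitoring described above is left out for brevity.

```python
from typing import Callable, NamedTuple


class Sample(NamedTuple):
    output_tokens_per_s: float  # per-agent output speed at this load
    ttft_s: float               # time to first token, in seconds


def max_concurrency_under_slo(
    measure: Callable[[int], Sample],
    min_output_tokens_per_s: float,
    max_ttft_s: float,
    step: int = 1,
    limit: int = 100_000,
) -> int:
    """Ramp concurrency and return the highest level that still meets the SLO."""
    best = 0
    concurrency = step
    while concurrency <= limit:
        sample = measure(concurrency)
        too_slow = sample.output_tokens_per_s < min_output_tokens_per_s
        too_laggy = sample.ttft_s > max_ttft_s
        if too_slow or too_laggy:
            break  # the SLO is violated; the previous level is the answer
        best = concurrency
        concurrency += step
    return best


def toy_system(n: int) -> Sample:
    """Toy stand-in for a real load test: performance degrades with load."""
    return Sample(output_tokens_per_s=200.0 / (1 + n / 50), ttft_s=0.2 + 0.01 * n)


print(max_concurrency_under_slo(
    toy_system, min_output_tokens_per_s=48.0, max_ttft_s=2.0, step=10
))  # 150
```

Fixing the objective before ramping means every system is compared at the same user-visible responsiveness, so a higher score cannot be bought by letting each agent run slower.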

Where does GB300's 20× advantage come from?

It is not just a process-node upgrade. The GB300 NVL72 links 72 GPUs via NVLink into a single rack-scale unit. For an MoE (Mixture-of-Experts) model, a design that splits one large model into specialist sub-models and activates only the ones needed, this lets each expert module run across the full GPU pool instead of being bottlenecked by a single card's memory.

At the CUDA level, cross-expert communication and computation overlap, hiding coordination overhead inside useful work; TensorRT-LLM keeps efficiency stable as concurrency climbs by separating input processing from output generation.
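Why overlapping communication with computation helps can be seen in a toy timing model. The sketch below is not NVIDIA's implementation; it assumes work arrives in chunks, that each chunk needs a compute phase followed by a communication phase, and that the two phases use separate resources so one chunk's communication can run while the next chunk computes. All names and millisecond figures are hypothetical.

```python
def serial_time(compute_ms: list[int], comm_ms: list[int]) -> int:
    """No overlap: every compute and communication phase runs back to back."""
    return sum(compute_ms) + sum(comm_ms)


def overlapped_time(compute_ms: list[int], comm_ms: list[int]) -> int:
    """Two-stage pipeline: chunk i communicates while chunk i+1 computes."""
    compute_done = 0  # time at which the compute unit finishes the current chunk
    comm_done = 0     # time at which the interconnect finishes the current chunk
    for c, m in zip(compute_ms, comm_ms):
        compute_done += c
        # Communication starts once the chunk is computed AND the link is free.
        comm_done = max(comm_done, compute_done) + m
    return comm_done


compute = [4, 4, 4, 4]  # ms of compute per chunk (hypothetical)
comm = [3, 3, 3, 3]     # ms of communication per chunk (hypothetical)

print(serial_time(compute, comm))      # 28
print(overlapped_time(compute, comm))  # 19
```

In this model only the last chunk's communication remains exposed; the rest is hidden behind compute, which is the sense in which coordination overhead disappears inside useful work.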

This reflects a broader pattern: the Hopper-to-Blackwell generational jump delivers a step-function leap in concurrency, not an incremental gain. Rack-scale systems outperform single nodes in both raw throughput and energy efficiency.

What about competitors and the standard's future?

The leaderboard also includes AMD's MI355X, compared side-by-side at the single-GPU, single-node, and full-rack levels.

Artificial Analysis describes this as an "evolving frontier snapshot" — every vendor's system still has untapped optimization headroom, and scores will update as software stacks mature.

Whether AA-AgentPerf becomes an industry-accepted standard on par with MLPerf remains an open question, but the efficiency dimension it introduced — agents per megawatt — is already reshaping how the market evaluates compute systems.
