
NVFP4 Inference on Blackwell SM120 GPUs: vLLM, FlashInfer & What Worked

Bruno Machado Valerio | Published June 3, 2026 | 12 min read

New GPU architectures are often marketed as easy upgrades. For inference teams, the reality is more complicated.

This post shares generalized field notes from bringing a large NVFP4-quantized language model online on Blackwell-class SM120 GPUs. The original work came from a real deployment, but the details here have been deliberately anonymized: no customer name, no exact model name, no account IDs, no internal image tags, no benchmark dates, and no production trace identifiers.

The short version:

SM120 worked well once the stack was assembled carefully, but it was not yet a boring default path.

The stable recipe required a recent vLLM build, FlashInfer b12x support, explicit SM120 configuration, ModelOpt NVFP4 target weights, FP8 KV cache, and careful separation between the target model and the speculative drafter.

The most important lesson was not a single package version or runtime flag. It was a boundary:

The target model’s NVFP4 quantization contract must not leak into a non-NVFP4 draft model.

Claim boundary

These are generalized field notes from an anonymized deployment, published on June 3, 2026. They are not a universal SM120 recipe and should not be treated as a frozen package-version guide. The durable lessons are about the serving contract: GPU target, vLLM build, FlashInfer/CUTLASS support, ModelOpt NVFP4 checkpoint format, KV cache dtype, speculative drafter configuration, proxy/gateway behavior, tracing, and reproducible benchmarking.

Before copying any flag or image recipe, validate it against your exact GPU SKU, CUDA version, vLLM version, FlashInfer version, model checkpoint, and request path.

Safe and unsafe claims

The reusable claims are:

Large NVFP4 checkpoints can run on SM120 with a recent vLLM stack and FlashInfer b12x support.
SM120 should be configured explicitly; older architecture defaults are not enough.
ModelOpt NVFP4 target weights worked with FP8 KV cache in this serving setup.
Speculative decoding worked once the drafter was configured as a separate model identity with its own precision and layout.
The target model’s NVFP4 quantization config must not leak into a non-NVFP4 drafter.
The reproducible speculative decoding improvement was substantial, but lower than the earliest high-water marks.
Production-shaped load testing revealed stability characteristics that toy prompts would not have shown.
Benchmark tracing should be enabled before important runs, not reconstructed afterward.
Runtime health, proxy health, and gateway health must be validated separately.

The unsafe claim would be:

This stack delivers the early peak token rate as steady-state performance.

The stack briefly hit higher numbers. Those observations were real, but they were not reproducible enough to use as the headline.

Why this was not just another GPU port

A typical inference migration starts with a simple assumption: if the model runs on one recent NVIDIA GPU, it should run on the next one with only minor changes.

That assumption did not hold here.

The deployment combined several moving parts at once:

New GPU architecture
NVFP4 target weights
Large-context serving
FP8 KV cache
FlashInfer kernels
Speculative decoding
OpenAI-compatible request routing
Tool-calling and parser behavior

Each piece may work in isolation. The difficulty is getting the entire serving path to agree on kernels, model metadata, quantization layout, cache format, and request semantics.

The first practical lesson was:

Treat SM120 as a first-class target, not as a transparent replacement for older NVIDIA architectures.

Older defaults were not enough. The runtime and kernel stack needed to know explicitly that the target was compute 12.0f / 120f.

SM120 is not “all Blackwell.”

The important target here was SM120 / compute 12.0. Do not assume that every Blackwell GPU, every CUDA image, or every prebuilt kernel package targets the same architecture. Datacenter Blackwell parts use different compute capabilities than SM120-class workstation/desktop Blackwell. Before benchmarking, record the exact GPU SKU, compute capability, CUDA version, runtime image lineage, kernel package, and routing path.
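One way to record the compute capability before a run is to parse `nvidia-smi` output. The sketch below is not taken from the original deployment. It assumes a driver recent enough to support the `compute_cap` query field, and the helper names are illustrative.

```python
import subprocess


def parse_gpu_inventory(csv_text: str) -> list[tuple[str, str]]:
    """Parse 'name, compute_cap' CSV rows into (name, capability) pairs."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Split on the last comma so GPU names containing commas survive.
        name, cap = (part.strip() for part in line.rsplit(",", 1))
        gpus.append((name, cap))
    return gpus


def non_sm120(gpus: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return the GPUs whose reported compute capability is not 12.0."""
    return [(name, cap) for name, cap in gpus if cap != "12.0"]


def query_gpus() -> str:
    """Ask nvidia-smi for GPU names and compute capabilities (needs a GPU node)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

On a GPU node, `non_sm120(parse_gpu_inventory(query_gpus()))` should come back empty before any SM120 benchmark is recorded.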

Runtime choice: use the path that actually boots

We evaluated more than one serving path. Lower-level optimized runtimes are attractive for NVFP4 inference, and they may eventually become the best option as SM120 support matures.

For this deployment, though, the practical path was vLLM with a custom image.

The deciding question was not:

Which runtime should be fastest in theory?

It was:

Which runtime can load the quantized weights, select the right kernels, handle the real request shape, survive repeated probes, and remain debuggable when something fails?

For new hardware targets, that distinction matters. A theoretically faster runtime is not useful if it cannot reliably initialize the model or support the production API path.

The image recipe: make SM120 explicit

Stock release images were not sufficient. The working image started from a recent vLLM OpenAI-compatible image and added explicit FlashInfer configuration for SM120:

FROM vllm/vllm-openai:nightly

ENV FLASHINFER_CUDA_ARCH_LIST=12.0f \
    FLASHINFER_FORCE_SM=120f \
    FLASHINFER_DISABLE_VERSION_CHECK=1

Note on FLASHINFER_DISABLE_VERSION_CHECK=1: use it only when you have separately verified the wheel, cubin, and JIT cache path for your target GPU. It is not blanket advice. It silences a guardrail that exists for good reason, and is safe only when you already know the kernel set you are running against.

The image also needed recent CUTLASS and FlashInfer components, including nightly wheels, cubins, and JIT cache support for the relevant CUDA generation.
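A small preflight check can catch an image that lost its SM120 settings before the model ever loads. This is a minimal sketch: the expected values are the ones quoted in this post, and the function name and the split into errors and warnings are our own illustration, not part of FlashInfer or vLLM.

```python
import os

# Values quoted in this post; adjust for your own target.
EXPECTED_FLASHINFER_ENV = {
    "FLASHINFER_CUDA_ARCH_LIST": "12.0f",
    "FLASHINFER_FORCE_SM": "120f",
}


def flashinfer_env_report(env=None) -> tuple[list[str], list[str]]:
    """Return (errors, warnings) for the FlashInfer-related environment."""
    env = os.environ if env is None else env
    errors, warnings = [], []
    for key, expected in EXPECTED_FLASHINFER_ENV.items():
        actual = env.get(key)
        if actual != expected:
            errors.append(f"{key}={actual!r}, expected {expected!r}")
    # Disabling the version check is allowed here, but it should be a
    # conscious decision backed by separately verified wheels and cubins.
    if env.get("FLASHINFER_DISABLE_VERSION_CHECK") == "1":
        warnings.append("version check disabled: confirm wheels/cubins were verified")
    return errors, warnings
```

Running this inside the container as the production runtime user, rather than on a workstation, also covers the container-user differences that can change how an image behaves.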

The exact package versions will change quickly, so the durable lesson is not “copy this image forever.”

The durable lesson is:

Do not assume a new GPU target is included in every release image, prebuilt wheel, kernel cache, or cubin package.

If the stack silently falls back to unsupported kernels, or if the right cubins are missing, the resulting failures can look unrelated to architecture support. You may see initialization errors, version checks, shape mismatches, or runtime failures that are really symptoms of an incomplete SM120 path.

There was also a container runtime wrinkle: the same image can behave differently depending on the node runtime and container user configuration. Validate the image under the same orchestration layer, GPU plugin path, and runtime user that production will use.

A local Docker success is not the same as a production deployment success.

NVFP4 came up before speculative decoding

The first stable lane was non-speculative NVFP4 inference:

Target weights: ModelOpt NVFP4 checkpoint
Quantization: ModelOpt
KV cache: FP8
GEMM backend: FlashInfer CUTLASS NVFP4 path

A representative runtime configuration looked like this:

--enable-chunked-prefill
--enable-prefix-caching
--quantization=modelopt_fp4
--kv-cache-dtype=fp8

Confirm the correct vLLM quantization flag for your installed version. Current vLLM documentation identifies modelopt_fp4 for ModelOpt NVFP4 checkpoints; older snippets and some nightly builds may accept modelopt as a shorthand. Validate against the exact vLLM version you are running before copying this config.

The deployment also used large-context settings and bounded batch/concurrency limits appropriate for the workload. Those values should be tuned per environment rather than copied directly.
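To keep the launch command reviewable, the flags above can be assembled in one place. The sketch below only builds the argument list for `vllm serve`. The model path is a hypothetical placeholder, and context and concurrency limits are left to `extra_flags` because those values are environment-specific.

```python
import shlex


def build_vllm_command(model_path: str, extra_flags: tuple[str, ...] = ()) -> list[str]:
    """Assemble the non-speculative NVFP4 baseline command as an argv list."""
    return [
        "vllm", "serve", model_path,
        "--enable-chunked-prefill",
        "--enable-prefix-caching",
        "--quantization=modelopt_fp4",  # confirm against your vLLM version
        "--kv-cache-dtype=fp8",
        *extra_flags,  # context length and concurrency limits go here
    ]


if __name__ == "__main__":
    # "/models/target-nvfp4" is a placeholder path, not the deployed model.
    print(shlex.join(build_vllm_command("/models/target-nvfp4")))
```

Keeping the command as a list rather than a shell string makes it easy to diff between benchmark runs and to pass directly to a process launcher.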

This baseline was stable, but it did not deliver the full performance target. For isolated short generations, the non-speculative path landed in the low-30s tokens/sec range in this setup.

That made speculative decoding the next obvious lever.

Speculative decoding exposed the real bug

The most interesting failure appeared only after speculative decoding entered the picture.

The target model was NVFP4. The draft model was not.

That distinction is easy to miss.

In many speculative decoding setups, the target model and drafter are treated as closely related. They may share a tokenizer or architecture family. But they are not necessarily the same model artifact. The drafter may use a different precision, checkpoint layout, metadata contract, or weight shape.

The broken path effectively reused the target model’s NVFP4 quantization configuration for the drafter. That produced shape mismatch and assertion failures during initialization.

The fix was conceptually simple but important:

Configure the drafter as its own model identity with its own precision and layout.

The working shape was:

Target model: NVFP4 checkpoint
Draft model: separate assistant/draft checkpoint
KV cache: FP8
Draft window: small fixed number of speculative tokens
Backend: FlashInfer CUTLASS NVFP4 path

This is the most reusable technical lesson from the deployment.

Any large NVFP4 target model using speculative decoding can hit this class of issue if the drafter is not actually the same quantized architecture and layout.
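The boundary can be expressed as a guard that derives the drafter's quantization only from the drafter's own checkpoint metadata. This is a sketch of the idea, not vLLM's internal logic. `ModelContract`, the `quant_algo` metadata key, and the checkpoint paths are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelContract:
    checkpoint: str
    quantization: Optional[str]  # None means unquantized weights
    dtype: str


def assert_no_quantization_leak(drafter: ModelContract, drafter_metadata: dict) -> None:
    """Fail fast if the drafter's configured quantization is not what its
    own checkpoint declares (for example, when it was inherited from the target)."""
    declared = drafter_metadata.get("quant_algo")  # hypothetical metadata key
    if drafter.quantization != declared:
        raise ValueError(
            f"drafter {drafter.checkpoint!r} configured with "
            f"quantization={drafter.quantization!r}, but its checkpoint "
            f"declares {declared!r}"
        )


# Placeholder paths; the drafter carries its own precision and layout.
target = ModelContract("/models/target-nvfp4", "modelopt_fp4", "bfloat16")
drafter = ModelContract("/models/drafter", None, "bfloat16")
assert_no_quantization_leak(drafter, drafter_metadata={})
```

A check like this runs in milliseconds at startup, which is cheaper than discovering the mismatch as a shape assertion deep inside kernel initialization.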

Failure-mode map

The most useful way to read the rest of this post is as a set of failure modes you can recognize quickly, validate, and route to the right fix.

Symptom	Likely cause	Validation step	Fix
Init errors, version checks, shape mismatches on model load	Stock image does not target compute 12.0 / 120f	Inspect `FLASHINFER_CUDA_ARCH_LIST` and the cubin set in the image	Build an SM120-explicit image with the right FlashInfer config and verified wheels/cubins
Silent kernel fallback or degraded throughput	Wheels or cubins for SM120 are missing	Confirm presence of NVFP4 CUTLASS kernels for SM120 in the running image	Refresh wheels/cubins; do not rely on disabling version checks alone
Shape mismatch / assertion failure when speculative decoding initializes	NVFP4 quantization config leaked from target to non-NVFP4 drafter	Inspect the drafter’s precision, layout, and metadata independently from the target	Configure the drafter as a separate model identity with its own precision and layout
Direct runtime healthy, gateway returns errors	Route, endpoint picker, service registration, or model mapping missing	Call the model from runtime, internal proxy, and external gateway separately	Treat each layer as its own health surface; fix the missing routing entry
Autoscaler keeps capacity capped after a scale-up	Resource limits were patched to zero rather than removed	Re-check the GPU resource limit fields, not just nominal pod state	Remove the fields entirely instead of overwriting with an empty object
Cannot reconstruct a historical benchmark with high fidelity	Tracing was not enabled before the run	Verify both the tracing env var and the proxy success-callback config	Enable tracing before the next run; do not rely on spend logs as a replay source

Parser semantics are part of serving

Raw token generation is not enough for production readiness.

The serving path had to support the same request semantics as the application:

Chat templates
Reasoning parser behavior
Tool-call parsing
Long-context prompts
Short completions
Agent-style request bursts

A model can look healthy under direct completion tests and still fail under the production request shape.

Parser flags, request formatting, and tool-call behavior are part of the performance story because they determine whether the benchmark can run at all. The right test is not only “can the model return tokens?” It is:

Can the model serve the actual application request path correctly and repeatedly?
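One concrete form of that test is to inspect a chat completion response for a well-formed tool call. The sketch below assumes the OpenAI-compatible response shape (`choices[0].message.tool_calls`, with `function.arguments` as a JSON string). The function name and messages are illustrative.

```python
import json


def tool_call_problems(response: dict, expected_tool: str) -> list[str]:
    """Return a list of problems found in a chat completion's tool calls."""
    choices = response.get("choices") or []
    if not choices:
        return ["response has no choices"]
    message = choices[0].get("message") or {}
    calls = message.get("tool_calls") or []
    if not calls:
        return ["no tool_calls in the first choice; check parser settings"]
    problems = []
    for call in calls:
        fn = call.get("function") or {}
        if fn.get("name") != expected_tool:
            problems.append(f"unexpected tool name: {fn.get('name')!r}")
        try:
            # Arguments arrive as a JSON-encoded string in this response shape.
            json.loads(fn.get("arguments") or "")
        except (TypeError, json.JSONDecodeError):
            problems.append("tool arguments are not valid JSON")
    return problems
```

A response that returns the tool call as plain text in `content` would be flagged here, even though a token-level benchmark would count it as a success.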

The performance story changed after reproduction

Early results looked spectacular.

After speculative decoding was enabled, a few isolated runs suggested a much larger improvement over the non-speculative baseline. That would have made a great headline.

Then we tried to reproduce it.

Repeated measurements across direct runtime calls, in-cluster proxy calls, and the external gateway path converged lower.

Measurement	Observed pattern	Confidence	How to use it
Non-speculative NVFP4 baseline	Low-30s tokens/sec for isolated short generations	Medium/high for this setup	Useful baseline; not a universal SM120 number
Early speculative decoding runs	Much larger apparent improvement	Low as a headline metric	Keep as internal note only
Reproduced speculative decoding result	Roughly 2–3x over the non-speculative baseline	Higher	Use as the honest story
Concurrent short-output serving	Low-thousands aggregate tokens/sec	Workload-specific	Useful production-shaped signal
Full proxy/gateway path	Needed separate validation	High importance	Required before capacity planning

The honest conclusion became:

Speculative decoding was stable and meaningfully faster, but the reproducible improvement was closer to 2–3x than the early high-water mark.

That is still a strong result. It is just not the number we initially wanted.

A short-lived high-water mark can be real without being representative.
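A simple way to keep the headline honest is to report the median of repeated runs and track the peak separately. The sketch below shows that reporting convention. The function names are ours, and any numbers fed to it should be your own measurements.

```python
from statistics import median


def summarize(samples: list[float]) -> dict:
    """Summarize repeated tokens/sec measurements for one configuration."""
    return {
        "runs": len(samples),
        "median": median(samples),  # headline candidate
        "peak": max(samples),       # internal note only
    }


def reproducible_speedup(baseline: list[float], candidate: list[float]) -> float:
    """Speedup computed from medians, so a single lucky run cannot set it."""
    return median(candidate) / median(baseline)
```

With medians, one outlier run moves the peak but leaves the reported speedup essentially unchanged, which is the behavior the reproduction step was meant to enforce.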

Production-shaped load matters more than toy prompts

The most useful benchmark was not a toy prompt.

The representative workload had a production-like shape:

Long prompts
Short completions
Bursts of concurrent agent steps
Mixed request classes
Occasional very long-context turns
Realistic inter-request timing

Exact replay was not possible because historical prompt bodies and tool payloads had not been captured. Instead, the benchmark preserved workload shape: timing, token targets, and scenario class.

That kind of replay can answer useful operational questions:

Does the serving pod stay stable?
Does a queue build up?
How does long-context prefill affect decode throughput?
Does prefix caching help?
Does the gateway path introduce failures?
Are parser settings compatible with the application?

It cannot answer questions that require byte-identical historical prompts or exact tool payloads.
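A shape-preserving replay can be sketched as a plan built from timing, token targets, and scenario class, with synthetic prompt bodies standing in for the missing originals. The record fields and the filler-word prompt are assumptions for illustration. Real token counts depend on the tokenizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShapeRecord:
    offset_s: float          # seconds since the start of the captured window
    prompt_tokens: int       # approximate prompt size to reproduce
    max_output_tokens: int   # completion budget for the request
    scenario: str            # e.g. "agent_step" or "long_context"


def synthetic_prompt(n_tokens: int, filler: str = "lorem") -> str:
    """Crude stand-in: one filler word per target token."""
    return " ".join([filler] * n_tokens)


def build_replay_plan(records: list[ShapeRecord]) -> list[dict]:
    """Turn shape records into time-ordered synthetic requests."""
    plan = []
    for rec in sorted(records, key=lambda r: r.offset_s):
        plan.append({
            "send_at_s": rec.offset_s,
            "scenario": rec.scenario,
            "prompt": synthetic_prompt(rec.prompt_tokens),
            "max_tokens": rec.max_output_tokens,
        })
    return plan
```

Repeated filler text interacts with prefix caching very differently from real prompts, so cache-hit results from a plan like this should be read with that in mind.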

Under the tested load slice, the serving path remained stable and did not accumulate a queue. Long-context requests were still slow, but they did not break the runtime.

That is a useful production signal, even if it is not a final capacity plan.

Tracing must be enabled before the benchmark

One uncomfortable lesson was that some historical runs could not be reconstructed with high fidelity.

Spend logs and aggregate metrics were useful, but they were not enough. They could show timing, token counts, status codes, and rough throughput. They could not recover full prompt bodies, tool payloads, or external trace context.

For future runs, tracing needs to be enabled before the benchmark starts.

For OpenAI-compatible proxy stacks, that usually means verifying both pieces:

environment:
  TRACING_API_KEY: ...

proxy_settings:
  success_callback:
    - tracing_backend

The exact names vary by proxy and tracing provider, but the principle does not:

Environment variables alone may not enable traces. The proxy often needs explicit callback configuration.

Once tracing is wired correctly, future benchmark runs can be replayed and analyzed with much better fidelity.
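Both pieces can be verified by a preflight check before the benchmark starts. In this sketch, `TRACING_API_KEY`, `proxy_settings`, and `tracing_backend` are the placeholder names from the snippet above, not real product settings. Substitute the names your proxy and tracing provider actually use.

```python
def tracing_problems(
    env: dict,
    proxy_config: dict,
    key_var: str = "TRACING_API_KEY",
    callback: str = "tracing_backend",
) -> list[str]:
    """Return reasons why traces would not be captured for the next run."""
    problems = []
    if not env.get(key_var):
        problems.append(f"{key_var} is not set")
    settings = proxy_config.get("proxy_settings") or {}
    callbacks = settings.get("success_callback") or []
    if callback not in callbacks:
        problems.append(f"{callback!r} is missing from success_callback")
    return problems
```

An empty result is a necessary condition, not proof. Sending one request and confirming that a trace actually arrives in the backend is still worth doing before the run.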

Operational lessons from the serving path

The model runtime was only one part of the deployment. The rest of the serving path exposed several infrastructure lessons.

First, pin the hardware target during benchmarking.

It is easy to compare results across different GPU node sizes, memory configurations, or runtime pools and accidentally attribute the difference to quantization or kernels. Benchmark notes should record the hardware target, runtime image lineage, model revision, kernel package, and routing path.
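A benchmark manifest that refuses empty fields makes this habit hard to skip. The field names below mirror the list in the previous paragraph. The class itself is an illustrative sketch.

```python
import json
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class BenchmarkManifest:
    hardware_target: str   # exact GPU SKU and compute capability
    image_lineage: str     # base image plus any custom layers
    model_revision: str
    kernel_package: str    # FlashInfer/CUTLASS build in use
    routing_path: str      # direct runtime, internal proxy, or gateway

    def __post_init__(self) -> None:
        missing = [f.name for f in fields(self) if not getattr(self, f.name).strip()]
        if missing:
            raise ValueError(f"manifest fields must not be empty: {missing}")

    def to_json(self) -> str:
        """Serialize with sorted keys so manifests diff cleanly between runs."""
        return json.dumps(asdict(self), sort_keys=True)
```

Storing the JSON next to each result file means two runs can be compared field by field before anyone compares their throughput.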

Second, autoscaler limits can fail closed.

If GPU capacity is scaled down by patching resource limits to zero, restoring capacity may require removing those fields rather than overwriting them with an empty object. Otherwise, the autoscaler may continue to treat CPU, memory, or GPU capacity as capped.
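One plausible explanation for this behavior is merge-style patching. This post does not name the autoscaler or patch type, so the sketch below assumes JSON Merge Patch semantics (RFC 7396), where an empty object changes nothing and only an explicit null removes a field. The resource layout is illustrative.

```python
def merge_patch(target, patch):
    """Apply a JSON Merge Patch (RFC 7396) and return the patched value."""
    if not isinstance(patch, dict):
        return patch  # non-object patches replace the target wholesale
    result = dict(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)  # null means "remove this field"
        else:
            result[key] = merge_patch(result.get(key), value)
    return result


# Illustrative resource with limits previously patched to zero.
capped = {"spec": {"limits": {"cpu": "0", "memory": "0", "nvidia.com/gpu": "0"}}}

still_capped = merge_patch(capped, {"spec": {"limits": {}}})    # no-op
uncapped = merge_patch(capped, {"spec": {"limits": None}})      # field removed
```

Under these semantics `still_capped` equals `capped`, because merging an empty object leaves every existing zero in place. That matches the symptom of capacity staying capped after an apparent reset.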

Third, a healthy decode pod is not the same as a healthy external serving path.

Direct runtime calls can succeed while the gateway returns errors because a route, endpoint picker, service registration, or model mapping is missing. For production readiness, test all layers:

Direct runtime endpoint
Internal proxy or service mesh path
External gateway path
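The three layers can be probed with the same request so that a failure points at a specific layer. This is a minimal sketch. The URLs are hypothetical, the `/v1/models` path assumes an OpenAI-compatible server, and a real gateway will usually also need authentication headers.

```python
import urllib.request
from typing import Callable


def http_probe(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except OSError:  # covers URLError, HTTPError, and socket timeouts
        return False


def check_layers(
    layers: dict[str, str],
    probe: Callable[[str], bool] = http_probe,
) -> dict[str, bool]:
    """Probe each serving layer independently and report per-layer health."""
    return {name: probe(url) for name, url in layers.items()}


# Hypothetical endpoints; replace with your own.
LAYERS = {
    "runtime": "http://decode-pod:8000/v1/models",
    "proxy": "http://internal-proxy:4000/v1/models",
    "gateway": "https://gateway.example.com/v1/models",
}
```

A result such as runtime healthy, proxy healthy, gateway failing narrows the search to routing and model mapping rather than the model runtime. A listing probe only proves reachability, so follow it with a real completion request through each layer.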

A model is not truly online until the application’s actual request path works.

Practical checklist for SM120 NVFP4 serving

For teams bringing similar models to SM120, this checklist is a useful starting point.

1. Validate the full support matrix

Do not validate only the GPU or only the model. Validate the combination:

GPU architecture
CUDA version
vLLM version
FlashInfer version
CUTLASS support
Quantization format
KV cache dtype
Speculative decoding mode
Parser behavior
Gateway/proxy path

2. Make SM120 explicit

Set architecture-specific FlashInfer configuration and verify that the image includes the right wheels, cubins, and JIT cache behavior.

FLASHINFER_CUDA_ARCH_LIST=12.0f
FLASHINFER_FORCE_SM=120f
FLASHINFER_DISABLE_VERSION_CHECK=1

3. Keep target and drafter configs separate

The target model and drafter may differ in:

Precision
Quantization metadata
Checkpoint layout
Architecture variant
Tokenizer assumptions
Weight shape

Treat them as separate contracts.

4. Benchmark through every serving path

Measure direct runtime performance, internal proxy performance, and external gateway performance separately. Do not assume one represents the others.

5. Report reproducible numbers

Keep the high-water mark in internal notes, but make the repeated measurement the headline.

6. Turn on tracing before the run

Without full traces, future replay becomes shape-based rather than request-exact.

Bottom line

SM120 was not a simple port, but it became a practical serving target once the stack was assembled carefully.

The working shape was:

Recent vLLM runtime
FlashInfer b12x / SM120 support
Explicit compute 12.0f / 120f configuration
ModelOpt NVFP4 target checkpoint
FP8 KV cache
Separate assistant/draft model for speculative decoding
Production-path validation through proxy and gateway

That stack produced a stable and meaningfully faster serving lane than the non-speculative NVFP4 baseline.

The final result was not the early headline number. It was better: a reproducible, debuggable path for large-model NVFP4 inference on SM120.
