Elevata

NVFP4 Inference on Blackwell SM120 GPUs: vLLM, FlashInfer & What Worked

Bruno Machado Valerio
Published June 3, 2026 · 12 min read

New GPU architectures are often marketed as easy upgrades. For inference teams, the reality is more complicated.

This post shares generalized field notes from bringing a large NVFP4-quantized language model online on Blackwell-class SM120 GPUs. The original work came from a real deployment, but the details here have been deliberately anonymized: no customer name, no exact model name, no account IDs, no internal image tags, no benchmark dates, and no production trace identifiers.

The short version:

SM120 worked well once the stack was assembled carefully, but it was not yet a boring default path.

The stable recipe required a recent vLLM build, FlashInfer b12x support, explicit SM120 configuration, ModelOpt NVFP4 target weights, FP8 KV cache, and careful separation between the target model and the speculative drafter.

The most important lesson was not a single package version or runtime flag. It was a boundary:

The target model’s NVFP4 quantization contract must not leak into a non-NVFP4 draft model.

Claim boundary

These are generalized field notes from an anonymized deployment published on June 3, 2026. They are not a universal SM120 recipe and should not be treated as a frozen package-version guide. The durable lessons are about the serving contract: GPU target, vLLM build, FlashInfer/CUTLASS support, ModelOpt NVFP4 checkpoint format, KV cache dtype, speculative drafter configuration, proxy/gateway behavior, tracing, and reproducible benchmarking.

Before copying any flag or image recipe, validate it against your exact GPU SKU, CUDA version, vLLM version, FlashInfer version, model checkpoint, and request path.

Safe and unsafe claims

The reusable claims are:

  • Large NVFP4 checkpoints can run on SM120 with a recent vLLM stack and FlashInfer b12x support.
  • SM120 should be configured explicitly; older architecture defaults are not enough.
  • ModelOpt NVFP4 target weights worked with FP8 KV cache in this serving setup.
  • Speculative decoding worked once the drafter was configured as a separate model identity with its own precision and layout.
  • The target model’s NVFP4 quantization config must not leak into a non-NVFP4 drafter.
  • The reproducible speculative decoding improvement was substantial, but lower than the earliest high-water marks.
  • Production-shaped load testing revealed stability characteristics that toy prompts would not have shown.
  • Benchmark tracing should be enabled before important runs, not reconstructed afterward.
  • Runtime health, proxy health, and gateway health must be validated separately.

The unsafe claim would be:

This stack delivers the early peak token rate as steady-state performance.

The stack did briefly hit higher numbers. Those observations were real, but they were not reproducible enough to use as the headline.

Why this was not just another GPU port

A typical inference migration starts with a simple assumption: if the model runs on one recent NVIDIA GPU, it should run on the next one with only minor changes.

That assumption did not hold here.

The deployment combined several moving parts at once:

  • New GPU architecture
  • NVFP4 target weights
  • Large-context serving
  • FP8 KV cache
  • FlashInfer kernels
  • Speculative decoding
  • OpenAI-compatible request routing
  • Tool-calling and parser behavior

Each piece may work in isolation. The difficulty is getting the entire serving path to agree on kernels, model metadata, quantization layout, cache format, and request semantics.

The first practical lesson was:

Treat SM120 as a first-class target, not as a transparent replacement for older NVIDIA architectures.

Older defaults were not enough. The runtime and kernel stack needed to know explicitly that the target was compute 12.0f / 120f.

SM120 is not “all Blackwell.”

The important target here was SM120 / compute 12.0. Do not assume that every Blackwell GPU, every CUDA image, or every prebuilt kernel package targets the same architecture. Datacenter Blackwell parts use different compute capabilities than SM120-class workstation/desktop Blackwell. Before benchmarking, record the exact GPU SKU, compute capability, CUDA version, runtime image lineage, kernel package, and routing path.
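As a minimal sketch of that recording step, the snippet below reads the GPU name and compute capability from nvidia-smi and flags anything that is not 12.0. It assumes a driver recent enough to expose the compute_cap query field; the function names are illustrative and not part of any library.

```python
import subprocess


def parse_gpu_inventory(csv_text):
    """Parse 'name, compute_cap' rows as printed by nvidia-smi in CSV mode."""
    rows = []
    for line in csv_text.strip().splitlines():
        # Split on the last comma so a comma inside the GPU name is preserved.
        name, cap = (field.strip() for field in line.rsplit(",", 1))
        rows.append((name, cap))
    return rows


def non_sm120_gpus(rows):
    """Return the GPUs whose reported compute capability is not 12.0."""
    return [(name, cap) for name, cap in rows if cap != "12.0"]


def read_gpu_inventory():
    """Query the local node; requires nvidia-smi on PATH."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_gpu_inventory(result.stdout)
```

Running read_gpu_inventory() on every node in the benchmark pool and storing the output next to the results makes later comparisons easier to trust.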

Runtime choice: use the path that actually boots

We evaluated more than one serving path. Lower-level optimized runtimes are attractive for NVFP4 inference, and they may eventually become the best option as SM120 support matures.

For this deployment, though, the practical path was vLLM with a custom image.

The deciding question was not:

Which runtime should be fastest in theory?

It was:

Which runtime can load the quantized weights, select the right kernels, handle the real request shape, survive repeated probes, and remain debuggable when something fails?

For new hardware targets, that distinction matters. A theoretically faster runtime is not useful if it cannot reliably initialize the model or support the production API path.

The image recipe: make SM120 explicit

Stock release images were not sufficient. The working image started from a recent vLLM OpenAI-compatible image and added explicit FlashInfer configuration for SM120:

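# Recent vLLM OpenAI-compatible base image (nightly, so its contents change over time).
FROM vllm/vllm-openai:nightly

# Target SM120 (compute capability 12.0) explicitly in FlashInfer.
# The version-check override is only safe under the conditions in the note below.
ENV FLASHINFER_CUDA_ARCH_LIST=12.0f \
    FLASHINFER_FORCE_SM=120f \
    FLASHINFER_DISABLE_VERSION_CHECK=1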

Note on FLASHINFER_DISABLE_VERSION_CHECK=1: use it only when you have separately verified the wheel, cubin, and JIT cache path for your target GPU. It is not blanket advice. It silences a guardrail that exists for good reason, and is safe only when you already know the kernel set you are running against.
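One way to keep that guardrail visible is a small startup check that compares the container environment against the recipe and calls out the version-check override. This is a sketch, not part of FlashInfer or vLLM; the expected values are copied from the recipe above and the function name is hypothetical.

```python
# Values copied from the image recipe above; adjust if your recipe differs.
EXPECTED_FLASHINFER_ENV = {
    "FLASHINFER_CUDA_ARCH_LIST": "12.0f",
    "FLASHINFER_FORCE_SM": "120f",
}


def flashinfer_env_findings(env):
    """Return human-readable findings for a mapping of environment variables."""
    findings = []
    for key, want in EXPECTED_FLASHINFER_ENV.items():
        got = env.get(key)
        if got != want:
            findings.append(f"{key}={got!r}, expected {want!r}")
    # The override is allowed, but it should never pass silently.
    if env.get("FLASHINFER_DISABLE_VERSION_CHECK") == "1":
        findings.append(
            "FLASHINFER_DISABLE_VERSION_CHECK=1 is set: confirm the wheel, "
            "cubin, and JIT cache path were verified for this GPU"
        )
    return findings
```

Calling flashinfer_env_findings(os.environ) in the container entrypoint and logging the result puts the override on record for every boot.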

The image also needed recent CUTLASS and FlashInfer components, including nightly wheels, cubins, and JIT cache support for the relevant CUDA generation.

The exact package versions will change quickly, so the durable lesson is not “copy this image forever.”

The durable lesson is:

Do not assume a new GPU target is included in every release image, prebuilt wheel, kernel cache, or cubin package.

If the stack silently falls back to unsupported kernels, or if the right cubins are missing, the resulting failures can look unrelated to architecture support. You may see initialization errors, version-check failures, shape mismatches, or runtime failures that are really symptoms of an incomplete SM120 path.

There was also a container runtime wrinkle: the same image can behave differently depending on the node runtime and container user configuration. Validate the image under the same orchestration layer, GPU plugin path, and runtime user that production will use.

A local Docker success is not the same as a production deployment success.

NVFP4 came up before speculative decoding

The first stable lane was non-speculative NVFP4 inference:

  • Target weights: ModelOpt NVFP4 checkpoint
  • Quantization: ModelOpt NVFP4 (the modelopt_fp4 method in the configuration below)
  • KV cache: FP8
  • GEMM backend: FlashInfer CUTLASS NVFP4 path

A representative runtime configuration looked like this:

--enable-chunked-prefill
--enable-prefix-caching
--quantization=modelopt_fp4
--kv-cache-dtype=fp8

Confirm the correct vLLM quantization flag for your installed version. Current vLLM documentation identifies modelopt_fp4 for ModelOpt NVFP4 checkpoints; older snippets and some nightly builds may accept modelopt as a shorthand. Validate against the exact vLLM version you are running before copying this config.

The deployment also used large-context settings and bounded batch/concurrency limits appropriate for the workload. Those values should be tuned per environment rather than copied directly.

This baseline was stable, but it did not deliver the full performance target. For isolated short generations, the non-speculative path landed in the low-30s tokens/sec range in this setup.

That made speculative decoding the next obvious lever.

Speculative decoding exposed the real bug

The most interesting failure appeared only after speculative decoding entered the picture.

The target model was NVFP4. The draft model was not.

That distinction is easy to miss.

In many speculative decoding setups, the target model and drafter are treated as closely related. They may share a tokenizer or architecture family. But they are not necessarily the same model artifact. The drafter may use a different precision, checkpoint layout, metadata contract, or weight shape.

The broken path effectively reused the target model’s NVFP4 quantization configuration for the drafter. That produced shape mismatch and assertion failures during initialization.

The fix was conceptually simple but important:

Configure the drafter as its own model identity with its own precision and layout.

The working shape was:

  • Target model: NVFP4 checkpoint
  • Draft model: separate assistant/draft checkpoint
  • KV cache: FP8
  • Draft window: small fixed number of speculative tokens
  • Backend: FlashInfer CUTLASS NVFP4 path

This is the most reusable technical lesson from the deployment.

Any large NVFP4 target model using speculative decoding can hit this class of issue if the drafter does not actually share the target's quantized architecture and layout.
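A pre-flight check can catch this before the runtime does. The sketch below compares the quantization each checkpoint declares on disk. It assumes ModelOpt-style exports carry an hf_quant_config.json with a quantization.quant_algo field, and that other checkpoints may declare quantization_config in config.json; verify both assumptions against your own artifacts. The function names are hypothetical.

```python
import json
from pathlib import Path


def declared_quant_algo(checkpoint_dir):
    """Return the quantization a checkpoint declares on disk, or None."""
    root = Path(checkpoint_dir)
    # Assumption: ModelOpt exports ship hf_quant_config.json.
    modelopt_file = root / "hf_quant_config.json"
    if modelopt_file.is_file():
        data = json.loads(modelopt_file.read_text())
        return data.get("quantization", {}).get("quant_algo")
    # Assumption: other checkpoints may declare quantization in config.json.
    config_file = root / "config.json"
    if config_file.is_file():
        quant = json.loads(config_file.read_text()).get("quantization_config") or {}
        return quant.get("quant_algo") or quant.get("quant_method")
    return None


def needs_separate_drafter_config(target_dir, draft_dir):
    """True when the drafter must not inherit the target's quantization config."""
    return declared_quant_algo(target_dir) != declared_quant_algo(draft_dir)
```

If needs_separate_drafter_config returns True, the drafter's precision and layout should be set explicitly in the serving config rather than inherited from the target.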

Failure-mode map

The most useful way to read the rest of this post is as a set of failure modes you can recognize quickly, validate, and route to the right fix.

Symptom | Likely cause | Validation step | Fix
--- | --- | --- | ---
Init errors, version checks, shape mismatches on model load | Stock image does not target compute 12.0 / 120f | Inspect FLASHINFER_CUDA_ARCH_LIST and the cubin set in the image | Build an SM120-explicit image with the right FlashInfer config and verified wheels/cubins
Silent kernel fallback or degraded throughput | Wheels or cubins for SM120 are missing | Confirm presence of NVFP4 CUTLASS kernels for SM120 in the running image | Refresh wheels/cubins; do not rely on disabling version checks alone
Shape mismatch / assertion failure when speculative decoding initializes | NVFP4 quantization config leaked from target to non-NVFP4 drafter | Inspect the drafter’s precision, layout, and metadata independently from the target | Configure the drafter as a separate model identity with its own precision and layout
Direct runtime healthy, gateway returns errors | Route, endpoint picker, service registration, or model mapping missing | Call the model from runtime, internal proxy, and external gateway separately | Treat each layer as its own health surface; fix the missing routing entry
Autoscaler keeps capacity capped after a scale-up | Resource limits were patched to zero rather than removed | Re-check the GPU resource limit fields, not just nominal pod state | Remove the fields entirely instead of overwriting with an empty object
Cannot reconstruct a historical benchmark with high fidelity | Tracing was not enabled before the run | Verify both the tracing env var and the proxy success-callback config | Enable tracing before the next run; do not rely on spend logs as a replay source

Parser semantics are part of serving

Raw token generation is not enough for production readiness.

The serving path had to support the same request semantics as the application:

  • Chat templates
  • Reasoning parser behavior
  • Tool-call parsing
  • Long-context prompts
  • Short completions
  • Agent-style request bursts

A model can look healthy under direct completion tests and still fail under the production request shape.

Parser flags, request formatting, and tool-call behavior are part of the performance story because they determine whether the benchmark can run at all. The right test is not only “can the model return tokens?” It is:

Can the model serve the actual application request path correctly and repeatedly?
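For tool calling, "correctly" can be checked mechanically. The sketch below inspects an OpenAI-style chat completion response, already decoded into a dict, and reports tool calls that a client could not act on. It assumes the usual choices[0].message.tool_calls layout with JSON-encoded arguments; the function name is hypothetical.

```python
import json


def tool_call_problems(response):
    """Return a list of problems found in an OpenAI-style chat completion dict."""
    choices = response.get("choices") or []
    if not choices:
        return ["response has no choices"]
    message = choices[0].get("message") or {}
    calls = message.get("tool_calls") or []
    problems = []
    if not calls:
        problems.append("no tool_calls in the first choice")
    for call in calls:
        function = call.get("function") or {}
        name = function.get("name")
        if not name:
            problems.append("tool call without a function name")
        try:
            # Arguments arrive as a JSON-encoded string, not as an object.
            arguments = json.loads(function.get("arguments") or "")
        except (json.JSONDecodeError, TypeError):
            problems.append(f"arguments for {name!r} are not valid JSON")
            continue
        if not isinstance(arguments, dict):
            problems.append(f"arguments for {name!r} are not a JSON object")
    return problems
```

Running this on every response in a burst test turns parser regressions into a count rather than an anecdote.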

The performance story changed after reproduction

Early results looked spectacular.

After speculative decoding was enabled, a few isolated runs suggested a much larger improvement over the non-speculative baseline. That would have made a great headline.

Then we tried to reproduce it.

Repeated measurements across direct runtime calls, in-cluster proxy calls, and the external gateway path converged lower.

Measurement | Observed pattern | Confidence | How to use it
--- | --- | --- | ---
Non-speculative NVFP4 baseline | Low-30s tokens/sec for isolated short generations | Medium/high for this setup | Useful baseline; not a universal SM120 number
Early speculative decoding runs | Much larger apparent improvement | Low as a headline metric | Keep as internal note only
Reproduced speculative decoding result | Roughly 2–3x over the non-speculative baseline | Higher | Use as the honest story
Concurrent short-output serving | Low-thousands aggregate tokens/sec | Workload-specific | Useful production-shaped signal
Full proxy/gateway path | Needed separate validation | High importance | Required before capacity planning

The honest conclusion became:

Speculative decoding was stable and meaningfully faster, but the reproducible improvement was closer to 2–3x than the early high-water mark.

That is still a strong result. It is just not the number we initially wanted.

A short-lived high-water mark can be real without being representative.
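The difference between a peak and a representative number is easy to make explicit in the reporting code. The sketch below summarizes repeated runs with the median and keeps the peak as a separate field. The function name is hypothetical.

```python
from statistics import median


def summarize(baseline_tps, speculative_tps):
    """Summarize repeated tokens/sec measurements for two serving lanes."""
    base = median(baseline_tps)
    return {
        "baseline_median": base,
        # Headline number: median of repeated runs over the baseline median.
        "speedup_median": median(speculative_tps) / base,
        # High-water mark: kept for internal notes, not for the headline.
        "speedup_peak": max(speculative_tps) / base,
    }
```

With made-up inputs such as summarize([10.0, 10.0, 10.0], [24.0, 25.0, 26.0, 25.0, 60.0]), the median speedup is 2.5 while the peak is 6.0. These figures are purely illustrative and are not measurements from this deployment; they only show how one outlier run can dominate a peak-based headline.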

Production-shaped load matters more than toy prompts

The most useful benchmark was not a toy prompt.

The representative workload had a production-like shape:

  • Long prompts
  • Short completions
  • Bursts of concurrent agent steps
  • Mixed request classes
  • Occasional very long-context turns
  • Realistic inter-request timing

Exact replay was not possible because historical prompt bodies and tool payloads had not been captured. Instead, the benchmark preserved workload shape: timing, token targets, and scenario class.

That kind of replay can answer useful operational questions:

  • Does the serving pod stay stable?
  • Does a queue build up?
  • How does long-context prefill affect decode throughput?
  • Does prefix caching help?
  • Does the gateway path introduce failures?
  • Are parser settings compatible with the application?

It cannot answer questions that require byte-identical historical prompts or exact tool payloads.

Under the tested load slice, the serving path remained stable and did not accumulate a queue. Long-context requests were still slow, but they did not break the runtime.

That is a useful production signal, even if it is not a final capacity plan.
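A shape-preserving replay of this kind can be sketched as a plan built from aggregate log rows. The row keys below (ts, scenario, prompt_tokens, completion_tokens) are assumptions for illustration, not the schema of any particular proxy. The plan keeps timing, token targets, and scenario class; prompt bodies would have to be synthetic filler of the target length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplayStep:
    offset_s: float      # seconds after the first request in the slice
    scenario: str        # request class, e.g. an agent step or a long-context turn
    prompt_tokens: int   # target prompt length for the synthetic prompt
    max_tokens: int      # completion budget taken from the historical row


def build_replay_plan(log_rows):
    """Turn aggregate log rows into a time-ordered, shape-only replay plan."""
    rows = sorted(log_rows, key=lambda row: row["ts"])
    if not rows:
        return []
    start = rows[0]["ts"]
    return [
        ReplayStep(
            offset_s=row["ts"] - start,
            scenario=row["scenario"],
            prompt_tokens=row["prompt_tokens"],
            max_tokens=row["completion_tokens"],
        )
        for row in rows
    ]
```

A load driver can then sleep until each offset_s and send a filler prompt of the recorded size, which reproduces burstiness and prefill pressure without the original content.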

Tracing must be enabled before the benchmark

One uncomfortable lesson was that some historical runs could not be reconstructed with high fidelity.

Spend logs and aggregate metrics were useful, but they were not enough. They could show timing, token counts, status codes, and rough throughput. They could not recover full prompt bodies, tool payloads, or external trace context.

For future runs, tracing needs to be enabled before the benchmark starts.

For OpenAI-compatible proxy stacks, that usually means verifying both pieces:

environment:
  TRACING_API_KEY: ...

proxy_settings:
  success_callback:
    - tracing_backend

The exact names vary by proxy and tracing provider, but the principle does not:

Environment variables alone may not enable traces. The proxy often needs explicit callback configuration.

Once tracing is wired correctly, future benchmark runs can be replayed and analyzed with much better fidelity.
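A pre-benchmark gate can enforce both halves. The sketch below reuses the placeholder names from the snippet above (TRACING_API_KEY, proxy_settings.success_callback, tracing_backend); they stand in for whatever your proxy and tracing provider actually call them, and the function name is hypothetical.

```python
def tracing_gaps(env, proxy_config,
                 env_key="TRACING_API_KEY", callback="tracing_backend"):
    """Return the reasons tracing would not capture the next benchmark run."""
    gaps = []
    # Half one: the credential has to be present in the proxy's environment.
    if not env.get(env_key):
        gaps.append(f"environment variable {env_key} is not set")
    # Half two: the proxy has to be told to emit traces on success.
    settings = proxy_config.get("proxy_settings") or {}
    callbacks = settings.get("success_callback") or []
    if callback not in callbacks:
        gaps.append(f"{callback} is missing from proxy_settings.success_callback")
    return gaps
```

Refusing to start a benchmark while tracing_gaps returns anything is a cheap way to avoid another run that cannot be reconstructed.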

Operational lessons from the serving path

The model runtime was only one part of the deployment. The rest of the serving path exposed several infrastructure lessons.

First, pin the hardware target during benchmarking.

It is easy to compare results across different GPU node sizes, memory configurations, or runtime pools and accidentally attribute the difference to quantization or kernels. Benchmark notes should record the hardware target, runtime image lineage, model revision, kernel package, and routing path.

Second, autoscaler limits can fail closed.

If GPU capacity is scaled down by patching resource limits to zero, restoring capacity may require removing those fields rather than overwriting them with an empty object. Otherwise, the autoscaler may continue to treat CPU, memory, or GPU capacity as capped.
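One plausible mechanism, assuming the limits were changed with a JSON Merge Patch (RFC 7396): merging an empty object changes nothing, while a null value removes the field. The sketch below is a minimal merge-patch implementation for illustration only. The field names are hypothetical, and this post does not establish which autoscaler or patch type your environment uses.

```python
def merge_patch(target, patch):
    """Apply a JSON Merge Patch (RFC 7396) and return the patched value."""
    if not isinstance(patch, dict):
        # A non-object patch replaces the target wholesale.
        return patch
    result = dict(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            # null means "remove this field".
            result.pop(key, None)
        else:
            result[key] = merge_patch(result.get(key), value)
    return result


# Hypothetical autoscaler spec after capacity was patched to zero.
capped = {"spec": {"limits": {"cpu": "0", "memory": "0", "nvidia.com/gpu": "0"}}}

# An empty object merges nothing, so the zero limits survive.
still_capped = merge_patch(capped, {"spec": {"limits": {}}})

# A null removes the field, which lifts the cap.
restored = merge_patch(capped, {"spec": {"limits": None}})
```

Here still_capped is identical to capped, while restored no longer has a limits field. Whatever the patch mechanism, reading the resource back after the change is the reliable check.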

Third, a healthy decode pod is not the same as a healthy external serving path.

Direct runtime calls can succeed while the gateway returns errors because a route, endpoint picker, service registration, or model mapping is missing. For production readiness, test all layers:

  • Direct runtime endpoint
  • Internal proxy or service mesh path
  • External gateway path

A model is not truly online until the application’s actual request path works.
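A layered probe makes the failing hop obvious. The sketch below walks an ordered list of base URLs and reports the first layer that does not list the model on an OpenAI-compatible /v1/models endpoint. The layer names, URLs, and function names are placeholders, and a gateway that requires authentication would need headers added. It only checks model registration, so it complements a real request in the production shape rather than replacing one.

```python
import json
import urllib.request


def http_fetch(url, timeout=10.0):
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def first_failing_layer(layers, model_id, fetch=http_fetch):
    """Return the name of the first layer that does not serve model_id, else None.

    layers is an ordered list of (name, base_url) pairs, innermost first.
    """
    for name, base_url in layers:
        try:
            listing = fetch(f"{base_url}/v1/models")
            served = {entry.get("id") for entry in listing.get("data", [])}
        except Exception:
            # Connection errors, HTTP errors, and malformed bodies all count as failure.
            return name
        if model_id not in served:
            return name
    return None
```

Passing the runtime, proxy, and gateway URLs in that order means a result of "gateway" points at routing or model mapping rather than at the decode pod.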

Practical checklist for SM120 NVFP4 serving

For teams bringing similar models to SM120, this checklist is a useful starting point.

1. Validate the full support matrix

Do not validate only the GPU or only the model. Validate the combination:

  • GPU architecture
  • CUDA version
  • vLLM version
  • FlashInfer version
  • CUTLASS support
  • Quantization format
  • KV cache dtype
  • Speculative decoding mode
  • Parser behavior
  • Gateway/proxy path

2. Make SM120 explicit

Set architecture-specific FlashInfer configuration and verify that the image includes the right wheels, cubins, and JIT cache behavior.

FLASHINFER_CUDA_ARCH_LIST=12.0f
FLASHINFER_FORCE_SM=120f
FLASHINFER_DISABLE_VERSION_CHECK=1

3. Keep target and drafter configs separate

The target model and drafter may differ in:

  • Precision
  • Quantization metadata
  • Checkpoint layout
  • Architecture variant
  • Tokenizer assumptions
  • Weight shape

Treat them as separate contracts.

4. Benchmark through every serving path

Measure direct runtime performance, internal proxy performance, and external gateway performance separately. Do not assume one represents the others.

5. Report reproducible numbers

Keep the high-water mark in internal notes, but make the repeated measurement the headline.

6. Turn on tracing before the run

Without full traces, future replay becomes shape-based rather than request-exact.

Bottom line

SM120 was not a simple port, but it became a practical serving target once the stack was assembled carefully.

The working shape was:

  • Recent vLLM runtime
  • FlashInfer b12x / SM120 support
  • Explicit compute 12.0f / 120f configuration
  • ModelOpt NVFP4 target checkpoint
  • FP8 KV cache
  • Separate assistant/draft model for speculative decoding
  • Production-path validation through proxy and gateway

That stack produced a stable and meaningfully faster serving lane than the non-speculative NVFP4 baseline.

The final result was not the early headline number. It was better: a reproducible, debuggable path for large-model NVFP4 inference on SM120.
