
[NV] Qwen3.5 B200 SGLang FP4 configs#820

Open
kedarpotdar-nv wants to merge 6 commits into main from nv/qwen35-fp4

Conversation

@kedarpotdar-nv (Collaborator) commented Feb 27, 2026

Summary

Add FP4 benchmark configuration and launch script for Qwen3.5-397B-A17B on NVIDIA B200 GPUs using SGLang.

Changes

New Benchmark Config (nvidia-master.yaml)

  • Config key: qwen3.5-fp4-b200-sglang
  • Model: nvidia/Qwen3.5-397B-A17B-NVFP4
  • Image: lmsysorg/sglang:v0.5.9-cu129-amd64
  • Precision: FP4 (ModelOpt NVFP4)
  • Sequence length configurations:
    • 1k1k — TP4/EP1 (conc 4–32), TP8/EP1 (conc 4–64), TP8/EP8 (conc 128)
    • 1k8k — TP4/EP1 (conc 4–32), TP8/EP1 (conc 4–128)
    • 8k1k — TP4/EP1 (conc 4–32), TP8/EP1 (conc 4–128)
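The 1k1k search space above could be driven by a loop like the following. This is a hypothetical sketch: `launch_benchmark` stands in for the real harness (an assumption) and here just prints the point it would run.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the 1k1k sweep described above.
# launch_benchmark is a stand-in for the real harness; here it
# just prints the sweep point it would run.
launch_benchmark() { echo "TP=$TP EP=$EP CONC=$CONC"; }

for CONC in 4 8 16 32; do        # TP4/EP1, conc 4-32
  TP=4 EP=1; launch_benchmark
done
for CONC in 4 8 16 32 64; do     # TP8/EP1, conc 4-64
  TP=8 EP=1; launch_benchmark
done
TP=8 EP=8 CONC=128; launch_benchmark   # single TP8/EP8 point
```

The 1k8k and 8k1k sweeps follow the same shape, minus the TP8/EP8 point.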

New Launch Script (benchmarks/single_node/qwen3.5_fp4_b200.sh)

SGLang server configuration with:

  • --quantization modelopt_fp4 with --fp4-gemm-backend flashinfer_cutlass
  • --kv-cache-dtype fp8_e4m3
  • --attention-backend trtllm_mha / --moe-runner-backend flashinfer_trtllm
  • --enable-flashinfer-allreduce-fusion
  • --chunked-prefill-size 32768 / --max-prefill-tokens 32768
  • --disable-radix-cache
  • --mem-fraction-static 0.85
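Assembled, the server invocation would look roughly like the sketch below. The flags are taken verbatim from the list above; the `sglang.launch_server` entrypoint and the way TP/EP are threaded in are illustrative, since the real script derives them per sweep point.

```shell
#!/usr/bin/env bash
# Sketch of the SGLang server launch assembled from the flags above.
# The entrypoint invocation is illustrative; TP/EP are set per sweep point.
MODEL=nvidia/Qwen3.5-397B-A17B-NVFP4
ARGS=(
  --model-path "$MODEL"
  --quantization modelopt_fp4
  --fp4-gemm-backend flashinfer_cutlass
  --kv-cache-dtype fp8_e4m3
  --attention-backend trtllm_mha
  --moe-runner-backend flashinfer_trtllm
  --enable-flashinfer-allreduce-fusion
  --chunked-prefill-size 32768
  --max-prefill-tokens 32768
  --disable-radix-cache
  --mem-fraction-static 0.85
)
# Print rather than exec, so the command can be inspected.
echo python -m sglang.launch_server "${ARGS[@]}"
```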

Perf Changelog

  • Added entry for the new qwen3.5-fp4-b200-sglang config.

@kedarpotdar-nv (Collaborator, Author) commented:

@claude please add a PR description

@Klaud-Cold (Collaborator) commented Feb 27, 2026

Claude finished @kedarpotdar-nv's task in 40s


Adding PR Description

  • Gather context from changed files
  • Draft and update PR description

PR description has been added with a summary of all changes: new qwen3.5-fp4-b200-sglang benchmark config, launch script, and perf-changelog entry.

@kedarpotdar-nv kedarpotdar-nv self-assigned this Feb 27, 2026
@kedarpotdar-nv (Collaborator, Author) commented:

@claude please review the PR

@Klaud-Cold (Collaborator) commented Feb 27, 2026

Claude finished @kedarpotdar-nv's task in 2m 41s


Reviewing PR

  • Gather context from changed files and existing patterns
  • Review benchmark config (nvidia-master.yaml)
  • Review launch script (qwen3.5_fp4_b200.sh)
  • Review perf-changelog entry
  • Post review feedback

Review: qwen3.5-fp4-b200-sglang

Overall this is a clean addition. Config, script, and changelog all follow existing patterns. A few observations below — mostly minor, one worth discussing.


.github/configs/nvidia-master.yaml — Config

Looks good. Image (v0.5.9), model (nvidia/Qwen3.5-397B-A17B-NVFP4), and metadata fields are consistent with repo conventions. The search space is reasonable: TP4/EP1 for lower concurrencies, TP8/EP1 for wider sweeps, and a single TP8/EP8 point at conc=128 for 1k1k. No issues here.


benchmarks/single_node/qwen3.5_fp4_b200.sh — Launch Script

1. Missing --tokenizer-worker-num
The BF16 sibling script (qwen3.5_bf16_b200.sh) sets --tokenizer-worker-num 6. The new FP4 script omits this. At high concurrency (64–128), tokenization can become a bottleneck on a single worker. Was this intentionally dropped, or should it be carried over?
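If it should be carried over, the fix is a one-line addition mirroring the BF16 sibling. Sketch below; whether the BF16 worker count of 6 is also right for FP4 is an assumption.

```shell
# Sketch: carry the BF16 tokenizer setting over to the FP4 script.
# The worker count of 6 is the BF16 value; suitability for FP4 is
# an assumption worth confirming with the author.
EXTRA_ARGS=(--tokenizer-worker-num 6)
echo "${EXTRA_ARGS[@]}"
```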

2. Hardcoded --scheduler-recv-interval 30
The BF16 script dynamically sets this (10 for CONC < 16, 30 for CONC ≥ 16). The FP4 script hardcodes 30. For low-concurrency runs (conc=4), a recv interval of 30 adds unnecessary scheduling latency — the scheduler polls less frequently, which can hurt latency at low load. Consider matching the BF16 pattern:

```shell
# Match the BF16 script: poll more often at low concurrency.
if [[ $CONC -ge 16 ]]; then
  SCHEDULER_RECV_INTERVAL=30
else
  SCHEDULER_RECV_INTERVAL=10
fi
```


3. No --served-model-name
The BF16 script passes --served-model-name "Qwen/Qwen3.5-397B-A17B". This script omits it. Not a blocker for benchmarking (the benchmark client uses the actual model path), but worth noting for consistency.

4. FP4-specific flags look correct

  • --quantization modelopt_fp4 + --fp4-gemm-backend flashinfer_cutlass — correct for NVFP4
  • --kv-cache-dtype fp8_e4m3 — standard for FP4 configs
  • --attention-backend trtllm_mha — correct for Qwen (MHA, not MLA which is DeepSeek-specific)
  • --moe-runner-backend flashinfer_trtllm — correct for MoE models
  • --mem-fraction-static 0.85 — reasonable bump from BF16's 0.82 given smaller FP4 weights

5. Environment exports inherited from BF16 script
NCCL_NVLS_ENABLE=1, SGL_ENABLE_JIT_DEEPGEMM=false, SGLANG_ENABLE_FLASHINFER_GEMM=true — all look appropriate.


perf-changelog.yaml

Clean entry, well-formatted, correct PR link. No issues.


Summary

| Area | Verdict |
| --- | --- |
| Config YAML | ✅ Good |
| Launch script | 🟡 Minor — missing --tokenizer-worker-num, hardcoded scheduler interval |
| Perf changelog | ✅ Good |

The script is functional as-is. The two items flagged (tokenizer workers and dynamic scheduler interval) are nice-to-haves that would align with the BF16 sibling script and could improve low-concurrency latency. Up to you whether to address now or in a follow-up.


@kedarpotdar-nv (Collaborator, Author) commented:

@functionstackx collaborating on this existing PR to get our FP8 and FP4 configs merged upstream in sgl-cookbook. sgl-project/sgl-cookbook#168

Thanks!

@functionstackx (Contributor) commented:

> @functionstackx collaborating on this existing PR to get our FP8 and FP4 configs merged upstream in sgl-cookbook. sgl-project/sgl-cookbook#168
>
> Thanks!

Thanks! Excited to see continued focus on ensuring that end users in the ML community can benefit from these optimized flag recipes for selecting the correct MoE and attention kernel backends.

@functionstackx (Contributor) left a comment:

lgtm

@kedarpotdar-nv (Collaborator, Author) commented:

/sweep run only evals

@github-actions (Contributor) commented:

@kedarpotdar-nv Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/22507799181
Command: run only evals
Pinned ref: c738532
Approval: not required (trusted collaborator).

