
[NV] Qwen3.5 B200 SGLang FP4 configs#820

Open
kedarpotdar-nv wants to merge 6 commits into main from nv/qwen35-fp4

Conversation

@kedarpotdar-nv (Collaborator) commented Feb 27, 2026

Summary

Add FP4 benchmark configuration and launch script for Qwen3.5-397B-A17B on NVIDIA B200 GPUs using SGLang.

Changes

New Benchmark Config (nvidia-master.yaml)

  • Config key: qwen3.5-fp4-b200-sglang
  • Model: nvidia/Qwen3.5-397B-A17B-NVFP4
  • Image: lmsysorg/sglang:v0.5.9-cu129-amd64
  • Precision: FP4 (ModelOpt NVFP4)
  • Sequence length configurations:
    • 1k1k — TP4/EP1 (conc 4–32), TP8/EP1 (conc 4–64), TP8/EP8 (conc 128)
    • 1k8k — TP4/EP1 (conc 4–32), TP8/EP1 (conc 4–128)
    • 8k1k — TP4/EP1 (conc 4–32), TP8/EP1 (conc 4–128)
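The 1k1k search space above could be driven by a loop like the following. This is a hypothetical sketch: `launch_benchmark` stands in for the real harness (an assumption) and here just prints the point it would run.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the 1k1k sweep described above.
# launch_benchmark is a stand-in for the real harness; here it
# just prints the sweep point it would run.
launch_benchmark() { echo "TP=$TP EP=$EP CONC=$CONC"; }

for CONC in 4 8 16 32; do        # TP4/EP1, conc 4-32
  TP=4 EP=1; launch_benchmark
done
for CONC in 4 8 16 32 64; do     # TP8/EP1, conc 4-64
  TP=8 EP=1; launch_benchmark
done
TP=8 EP=8 CONC=128; launch_benchmark   # single TP8/EP8 point
```

The 1k8k and 8k1k sweeps follow the same shape, minus the TP8/EP8 point.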

New Launch Script (benchmarks/single_node/qwen3.5_fp4_b200.sh)

SGLang server configuration with:

  • --quantization modelopt_fp4 with --fp4-gemm-backend flashinfer_cutlass
  • --kv-cache-dtype fp8_e4m3
  • --attention-backend trtllm_mha / --moe-runner-backend flashinfer_trtllm
  • --enable-flashinfer-allreduce-fusion
  • --chunked-prefill-size 32768 / --max-prefill-tokens 32768
  • --disable-radix-cache
  • --mem-fraction-static 0.85
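Assembled, the server invocation would look roughly like the sketch below. The flags are taken verbatim from the list above; the `sglang.launch_server` entrypoint and the way TP/EP are threaded in are illustrative, since the real script derives them per sweep point.

```shell
#!/usr/bin/env bash
# Sketch of the SGLang server launch assembled from the flags above.
# The entrypoint invocation is illustrative; TP/EP are set per sweep point.
MODEL=nvidia/Qwen3.5-397B-A17B-NVFP4
ARGS=(
  --model-path "$MODEL"
  --quantization modelopt_fp4
  --fp4-gemm-backend flashinfer_cutlass
  --kv-cache-dtype fp8_e4m3
  --attention-backend trtllm_mha
  --moe-runner-backend flashinfer_trtllm
  --enable-flashinfer-allreduce-fusion
  --chunked-prefill-size 32768
  --max-prefill-tokens 32768
  --disable-radix-cache
  --mem-fraction-static 0.85
)
# Print rather than exec, so the command can be inspected.
echo python -m sglang.launch_server "${ARGS[@]}"
```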

Perf Changelog

  • Added entry for the new qwen3.5-fp4-b200-sglang config.

@kedarpotdar-nv (Collaborator, Author) commented:

@claude please add a PR description

@Klaud-Cold (Collaborator) commented Feb 27, 2026

Claude finished @kedarpotdar-nv's task in 40s


Adding PR Description

  • Gather context from changed files
  • Draft and update PR description

PR description has been added with a summary of all changes: new qwen3.5-fp4-b200-sglang benchmark config, launch script, and perf-changelog entry.

@kedarpotdar-nv kedarpotdar-nv self-assigned this Feb 27, 2026
@kedarpotdar-nv (Collaborator, Author) commented:

@claude please review the PR

@Klaud-Cold (Collaborator) commented Feb 27, 2026

Claude finished @kedarpotdar-nv's task in 2m 41s


Reviewing PR

  • Gather context from changed files and existing patterns
  • Review benchmark config (nvidia-master.yaml)
  • Review launch script (qwen3.5_fp4_b200.sh)
  • Review perf-changelog entry
  • Post review feedback

Review: qwen3.5-fp4-b200-sglang

Overall this is a clean addition. Config, script, and changelog all follow existing patterns. A few observations below — mostly minor, one worth discussing.


.github/configs/nvidia-master.yaml — Config

Looks good. Image (v0.5.9), model (nvidia/Qwen3.5-397B-A17B-NVFP4), and metadata fields are consistent with repo conventions. The search space is reasonable: TP4/EP1 for lower concurrencies, TP8/EP1 for wider sweeps, and a single TP8/EP8 point at conc=128 for 1k1k. No issues here.


benchmarks/single_node/qwen3.5_fp4_b200.sh — Launch Script

1. Missing --tokenizer-worker-num
The BF16 sibling script (qwen3.5_bf16_b200.sh) sets --tokenizer-worker-num 6. The new FP4 script omits this. At high concurrency (64–128), tokenization can become a bottleneck on a single worker. Was this intentionally dropped, or should it be carried over?
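If it should be carried over, the fix is a one-line addition mirroring the BF16 sibling. Sketch below; whether the BF16 worker count of 6 is also right for FP4 is an assumption.

```shell
# Sketch: carry the BF16 tokenizer setting over to the FP4 script.
# The worker count of 6 is the BF16 value; suitability for FP4 is
# an assumption worth confirming with the author.
EXTRA_ARGS=(--tokenizer-worker-num 6)
echo "${EXTRA_ARGS[@]}"
```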

2. Hardcoded --scheduler-recv-interval 30
The BF16 script dynamically sets this (10 for CONC < 16, 30 for CONC ≥ 16). The FP4 script hardcodes 30. For low-concurrency runs (conc=4), a recv interval of 30 adds unnecessary scheduling latency — the scheduler polls less frequently, which can hurt latency at low load. Consider matching the BF16 pattern:

```shell
# Match the BF16 script: poll more often at low concurrency.
if [[ $CONC -ge 16 ]]; then
  SCHEDULER_RECV_INTERVAL=30
else
  SCHEDULER_RECV_INTERVAL=10
fi
```


3. No --served-model-name
The BF16 script passes --served-model-name "Qwen/Qwen3.5-397B-A17B". This script omits it. Not a blocker for benchmarking (the benchmark client uses the actual model path), but worth noting for consistency.

4. FP4-specific flags look correct

  • --quantization modelopt_fp4 + --fp4-gemm-backend flashinfer_cutlass — correct for NVFP4
  • --kv-cache-dtype fp8_e4m3 — standard for FP4 configs
  • --attention-backend trtllm_mha — correct for Qwen (MHA, not MLA which is DeepSeek-specific)
  • --moe-runner-backend flashinfer_trtllm — correct for MoE models
  • --mem-fraction-static 0.85 — reasonable bump from BF16's 0.82 given smaller FP4 weights

5. Environment exports inherited from BF16 script
NCCL_NVLS_ENABLE=1, SGL_ENABLE_JIT_DEEPGEMM=false, SGLANG_ENABLE_FLASHINFER_GEMM=true — all look appropriate.


perf-changelog.yaml

Clean entry, well-formatted, correct PR link. No issues.


Summary

| Area | Verdict |
| --- | --- |
| Config YAML | ✅ Good |
| Launch script | 🟡 Minor — missing --tokenizer-worker-num, hardcoded scheduler interval |
| Perf changelog | ✅ Good |

The script is functional as-is. The two items flagged (tokenizer workers and dynamic scheduler interval) are nice-to-haves that would align with the BF16 sibling script and could improve low-concurrency latency. Up to you whether to address now or in a follow-up.


@kedarpotdar-nv (Collaborator, Author) commented:

@functionstackx collaborating on this existing PR to get our FP8 and FP4 configs merged upstream in sgl-cookbook. sgl-project/sgl-cookbook#168

Thanks!

@functionstackx (Contributor) commented:

> @functionstackx collaborating on this existing PR to get our FP8 and FP4 configs merged upstream in sgl-cookbook. sgl-project/sgl-cookbook#168
>
> Thanks!

Thanks! Excited to see continued focus on ensuring that end users in the ML community can benefit from these optimized flag recipes for selecting the correct MoE and attention kernel backends.

@functionstackx (Contributor) left a comment:

lgtm

@kedarpotdar-nv (Collaborator, Author) commented:

/sweep run only evals

@github-actions (Contributor) commented:

@kedarpotdar-nv Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/22507799181
Command: run only evals
Pinned ref: c738532
Approval: not required (trusted collaborator).

