Conversation
@claude please add a PR description
Claude finished @kedarpotdar-nv's task in 40s: Adding PR Description
PR description has been added with a summary of all changes: new |
@claude please review the PR
Claude finished @kedarpotdar-nv's task in 2m 41s: Reviewing PR
Review:
|
| Area | Verdict |
|---|---|
| Config YAML | ✅ Good |
| Launch script | 🟡 Minor — missing --tokenizer-worker-num, hardcoded scheduler interval |
| Perf changelog | ✅ Good |
The script is functional as-is. The two items flagged (tokenizer workers and dynamic scheduler interval) are nice-to-haves that would align with the BF16 sibling script and could improve low-concurrency latency. Up to you whether to address now or in a follow-up.
@functionstackx collaborating on this existing PR to get our FP8 and FP4 configs merged upstream in sgl-cookbook. sgl-project/sgl-cookbook#168 Thanks!
Thanks! Excited to see continued focus on ensuring that end users in the ML community can benefit from these optimized flag recipes for selecting the correct MoE kernel backend and attention kernel backend.
/sweep run only evals |
@kedarpotdar-nv Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/22507799181?target=https://github.com |
Summary
Add FP4 benchmark configuration and launch script for Qwen3.5-397B-A17B on NVIDIA B200 GPUs using SGLang.
Changes
New Benchmark Config (`nvidia-master.yaml`)
- `qwen3.5-fp4-b200-sglang`
- `nvidia/Qwen3.5-397B-A17B-NVFP4`
- `lmsysorg/sglang:v0.5.9-cu129-amd64`
- `1k1k`: TP4/EP1 (conc 4–32), TP8/EP1 (conc 4–64), TP8/EP8 (conc 128)
- `1k8k`: TP4/EP1 (conc 4–32), TP8/EP1 (conc 4–128)
- `8k1k`: TP4/EP1 (conc 4–32), TP8/EP1 (conc 4–128)

New Launch Script (`benchmarks/single_node/qwen3.5_fp4_b200.sh`)
SGLang server configuration with:
- `--quantization modelopt_fp4` with `--fp4-gemm-backend flashinfer_cutlass`
- `--kv-cache-dtype fp8_e4m3`
- `--attention-backend trtllm_mha` / `--moe-runner-backend flashinfer_trtllm`
- `--enable-flashinfer-allreduce-fusion`
- `--chunked-prefill-size 32768` / `--max-prefill-tokens 32768`
- `--disable-radix-cache`
- `--mem-fraction-static 0.85`

Perf Changelog
`qwen3.5-fp4-b200-sglang` config.
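
For readers who want to see how the launch-script flags listed in the description fit together, here is a minimal sketch of one possible invocation. It is not the contents of `benchmarks/single_node/qwen3.5_fp4_b200.sh`. The `TP` and `PORT` variables and their defaults are illustrative assumptions, as is the use of `python3 -m sglang.launch_server` with `--model-path`, `--tp`, and `--port`. The remaining flags are copied from the description. The sketch only prints the command, so it can be inspected on a machine without GPUs.

```shell
#!/bin/sh
# Sketch only: assembles the SGLang flags named in the PR description.
# MODEL is taken from the description; TP and PORT are assumed defaults.
MODEL="nvidia/Qwen3.5-397B-A17B-NVFP4"
TP="${TP:-8}"        # assumption: the TP8 variant of the sweep
PORT="${PORT:-8000}" # assumption: arbitrary local port

SERVER_ARGS="--model-path $MODEL --tp $TP --port $PORT \
--quantization modelopt_fp4 \
--fp4-gemm-backend flashinfer_cutlass \
--kv-cache-dtype fp8_e4m3 \
--attention-backend trtllm_mha \
--moe-runner-backend flashinfer_trtllm \
--enable-flashinfer-allreduce-fusion \
--chunked-prefill-size 32768 \
--max-prefill-tokens 32768 \
--disable-radix-cache \
--mem-fraction-static 0.85"

# Print the command rather than executing it.
LAUNCH_CMD="python3 -m sglang.launch_server $SERVER_ARGS"
echo "$LAUNCH_CMD"
```

Setting `TP=4` before running the sketch prints the TP4 variant. The expert-parallel setting used for the TP8/EP8 point is left out because the description does not show the flag the script uses for it.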