fix: Replace hard-coded precision thresholds with std-based bounds #1864
TimDettmers wants to merge 3 commits into main
Conversation
Worker agents were running the full test suite (10+ min), which is wasteful when only a small area of code changed. Updated the completion workflow to instruct agents to run only relevant test files/functions. The full suite will be run separately later.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Precision tests were flaky because thresholds were set too close to the empirical mean error, leaving insufficient margin for GPU architecture differences. For example, test_4bit_quant for fp4/blocksize=256 used a threshold of 0.2908 + 0.001 = 0.2918, but Blackwell GPUs observed values around 0.2909 — only ~5 sigma from the mean, causing sporadic failures.

Collected (mean, std) statistics from 200 samples per configuration on RTX 4090. Thresholds are now set at mean + 7*std, giving ~7 sigma of headroom for the measured GPU and enough margin to accommodate cross-architecture mean shifts (e.g., T4, Blackwell, XPU).

Changes in test_functional.py:
- test_4bit_quant: error_dict now stores (mean, std) tuples instead of bare means. Removed ad-hoc errtol/reltol special-casing for CPU fp32.
- test_gemv_4bit: Replaced the complex if/elif threshold tree (with GPU-specific carve-outs like T4 compute cap checks and XPU conditionals) with a clean per-dtype/dim-range (mean, std) table. Individual-sample std is used (not divided by sqrt(iters)) so thresholds naturally accommodate architecture-specific kernel behavior.

Changes in test_parametrize.py:
- test_replace_parameter_4bit: Same (mean, std) approach as test_4bit_quant.
- test_moe_parameter_shape: Replaced flat 0.085/0.25 bounds with measured MoE-tensor-specific (mean, std).
- test_different_blocksizes: Same (mean, std) approach as test_4bit_quant.
- test_parametrization_forward_method: Replaced flat 0.08/0.25 bounds with small-tensor-specific (mean, std); small 64x64 tensors have ~16x higher relative std than 1024x1024 due to fewer quantization blocks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
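For concreteness, here is a minimal sketch of the std-based threshold scheme the PR describes. The names (ERROR_STATS, assert_within_bound) and the table values are hypothetical stand-ins, not the PR's actual identifiers or measurements:

```python
# Minimal sketch of a mean + N_SIGMA * std threshold check.
# ERROR_STATS and assert_within_bound are hypothetical names; the
# (mean, std) values below are placeholders, not measured statistics.
N_SIGMA = 7

ERROR_STATS = {
    # (quant_type, blocksize) -> (mean, std) of the observed error
    ("fp4", 256): (0.2908, 0.00018),
    ("nf4", 64): (0.1200, 0.00009),
}

def assert_within_bound(quant_type: str, blocksize: int, abserr: float) -> None:
    mean, std = ERROR_STATS[(quant_type, blocksize)]
    threshold = mean + N_SIGMA * std
    assert abserr < threshold, f"{abserr=:.6f} exceeds {threshold=:.6f}"
```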
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
- Replace ambiguous unicode multiplication sign with ASCII x
- Apply ruff format to long assert lines
- Fix test_linear4bit.py pre-existing format violation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
TimDettmers
left a comment
PR Review: #1864 — fix: Replace hard-coded precision thresholds with std-based bounds
Test-only change (plus a docs tweak to agents/coordinator_guide.md): replaces ad-hoc hard-coded precision thresholds in test_4bit_quant, test_gemv_4bit, and four test_parametrize.py tests with statistically grounded mean + 7*std bounds. Also removes GPU-specific carve-outs (T4 compute-cap check, XPU conditional) that required maintenance for each new architecture. The test_linear4bit.py change is a cosmetic reformat of one assert statement.
Verdict: Approve. The methodology is sound, the math checks out, and CI is fully green across all platforms.
Statistical methodology assessment
The approach stores (mean, std) tuples measured from 200 samples on RTX 4090, then sets thresholds at mean + N_SIGMA * std with N_SIGMA = 7.
- 7-sigma headroom gives a one-sided false-failure probability of ~1 in 780 billion per individual assertion on the reference GPU. Even if a different GPU shifts the mean by 2 sigma, there is still ~5-sigma headroom (~1 in 3.5 million). The tail-probability arithmetic is sketched after this list.
- test_gemv_4bit uses individual-sample std (not divided by sqrt(iters)), while the test averages over 100 iterations. This makes the threshold effectively ~70 sigma for the averaged metric, which is extremely generous and eliminates the need for per-GPU special cases. This is a deliberate and correct design choice, well documented in the PR body and commit message.
- No zero-std risk: All std values are strictly positive (smallest is 2e-9 for fp32 gemv), which is expected from empirical measurements over 200 stochastic samples.
- CPU fp32 coverage: The old code had special-case errtol/reltol relaxation for CPU fp32 at large blocksizes. The new 7*std thresholds are strictly more generous than the old mean + 0.001 / mean + 0.0028 bounds in all affected configurations, so removing the CPU special case is safe.
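As a sanity check on the sigma arithmetic above, here is a short standalone computation (not part of the PR) of the one-sided normal tail probabilities:

```python
import math

def one_sided_tail(sigma: float) -> float:
    """P(Z > sigma) for a standard normal, via the complementary error function."""
    return 0.5 * math.erfc(sigma / math.sqrt(2))

print(1 / one_sided_tail(7.0))  # ~7.8e11, i.e. ~1 in 780 billion
print(1 / one_sided_tail(5.0))  # ~3.5e6,  i.e. ~1 in 3.5 million

# Averaging 100 iterations shrinks the std of the mean by sqrt(100) = 10,
# so a bound set at 7x the individual-sample std sits ~70 sigma away
# from the averaged metric.
print(7 * math.sqrt(100))  # 70.0
```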
Specific observations
- The commented-out debug print block in test_gemv_4bit (11 lines) was removed. Good cleanup.
- The test_parametrize.py changes consistently apply the same (mean, std) methodology to four tests that previously used varying ad-hoc margins (+0.001, +0.01, +0.02, flat 0.085/0.25). The small-tensor test (test_parametrization_forward_method, 64x64) correctly uses higher std values (e.g., 0.001180 vs 0.000072 for 512x256 tensors), reflecting the ~16x higher relative std from fewer quantization blocks; the scaling argument is sketched after this list.
- The coordinator_guide.md change adds guidance to run only relevant tests, not the full suite. Sensible process improvement.
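The ~16x figure follows from block counting: the tensor-level error averages over roughly independent quantization blocks, so its relative std shrinks like 1/sqrt(num_blocks). A quick check, assuming blocksize 64 (my assumption; the PR does not state the blocksize used for this comparison):

```python
import math

blocksize = 64                                 # assumed for illustration
blocks_small = (64 * 64) // blocksize          # 64 blocks in a 64x64 tensor
blocks_large = (1024 * 1024) // blocksize      # 16384 blocks in a 1024x1024 tensor

# std of the block-averaged error scales as 1/sqrt(num_blocks)
print(math.sqrt(blocks_large / blocks_small))  # 16.0 -> ~16x higher relative std
```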
No blocking issues.
- Security: Clear
- Downstream impact: None (test-only changes, no public API modifications)
- Tests: The PR is the test improvement. Methodology is well-founded.
- CI: All checks pass (L40S, T4, CPU x86/arm/macOS, Windows, lint)
- Cross-PR conflicts: File-level overlaps with several open PRs on tests/test_functional.py (#1871, #1863, #1729) and tests/test_linear4bit.py (#1866, #1865, #1863, #1861, #1860, #1859, #1858). The overlaps on test_linear4bit.py are in different test functions (this PR only reformats one assert in test_quant_storage_shard_roundtrip), so merge conflicts are unlikely. The test_functional.py overlaps with #1871 (deprecation) and #1863 could require a rebase for whichever merges second.
Summary
- Replaces ad-hoc hard-coded thresholds with statistically grounded mean + 7*std bounds
- Removes GPU-specific carve-outs in test_gemv_4bit — the std-based thresholds are generous enough to accommodate architecture differences naturally

What was wrong
The old thresholds in test_4bit_quant stored mean error values averaged over 1k samples, then added a flat tolerance (+0.001). This tolerance was too small for large blocksizes, where the per-sample standard deviation is higher. For fp4/blocksize=256, the tolerance gave only ~5.5 sigma of headroom — fine for the GPU it was measured on, but Blackwell's slightly different FP rounding shifts the mean by ~2 sigma, causing failures.

Similarly, test_gemv_4bit had a complex if/elif threshold tree with hardware-specific patches (e.g., a relaxed relerr1 threshold for T4, compute_cap == (7, 5)). Each new GPU architecture required adding another special case.

The test_parametrize.py tests inherited the same tight thresholds from test_functional.py, with ad-hoc margins (+0.01, +0.02) that varied per test.

How it was fixed
Each error metric now stores (mean, std) tuples measured from 200 samples. Thresholds are computed as mean + N_SIGMA * std with N_SIGMA = 7. This gives ~7 sigma of headroom on the measured GPU, with enough margin to absorb cross-architecture mean shifts.

For test_gemv_4bit, individual-sample std (not divided by sqrt(iters)) is used deliberately, since the test averages over 100 iterations. This makes the threshold generous for the averaged value (~70 sigma), which naturally absorbs GPU-to-GPU kernel behavior differences without needing per-GPU carve-outs.
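A back-of-envelope comparison of the old flat margin versus the new bound for fp4/blocksize=256, using figures quoted in this PR (the std is inferred from the ~5.5 sigma statement above, so treat it as approximate):

```python
mean = 0.2908       # measured mean error (quoted above)
std = 0.00018       # inferred: 0.001 flat margin / ~5.5 sigma

old_threshold = mean + 0.001     # old flat tolerance
new_threshold = mean + 7 * std   # new std-based bound

print((old_threshold - mean) / std)  # ~5.6 sigma of headroom (old)
print((new_threshold - mean) / std)  # 7.0 sigma of headroom (new)

# After a 2-sigma cross-architecture mean shift, the old scheme keeps only
# ~3.6 sigma of headroom while the new one keeps a full 5 sigma.
shifted = mean + 2 * std
print((old_threshold - shifted) / std)  # ~3.6
print((new_threshold - shifted) / std)  # 5.0
```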
Files changed

- tests/test_functional.py: test_4bit_quant, test_gemv_4bit
- tests/test_parametrize.py: test_replace_parameter_4bit, test_moe_parameter_shape, test_different_blocksizes, test_parametrization_forward_method

Test plan
- test_4bit_quant — 96/96 passed (all quant_type × blocksize × dtype × device combos)
- test_gemv_4bit — 384/384 passed (all dim × dtype × storage_type × DQ × kind combos)
- test_parametrize.py — 158/158 passed

🤖 Generated with Claude Code