
fix: Replace hard-coded precision thresholds with std-based bounds #1864

Open

TimDettmers wants to merge 3 commits into main from fix/flaky-precision-tests

Conversation

@TimDettmers (Collaborator)

Summary

  • Precision tests were flaky because thresholds were calibrated too close to the empirical mean error, leaving insufficient margin for GPU architecture differences (e.g., Blackwell's fp4/blocksize=256 rel_err of 0.2909 vs threshold 0.2918 — only ~5 sigma headroom)
  • Collected error statistics (mean, std) from 200 samples per configuration on RTX 4090, and replaced all hard-coded thresholds with mean + 7*std bounds
  • Removed GPU-specific carve-outs (T4 compute capability check, XPU conditional) from test_gemv_4bit — the std-based thresholds are generous enough to accommodate architecture differences naturally

What was wrong

The old thresholds in test_4bit_quant stored mean error values averaged over 1k samples, then added a flat tolerance (+ 0.001). This tolerance was too small for large blocksizes where the per-sample standard deviation is higher. For fp4/blocksize=256, the tolerance gave only ~5.5 sigma of headroom — fine for the GPU it was measured on, but Blackwell's slightly different FP rounding shifts the mean by ~2 sigma, causing failures.
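The margin arithmetic above can be sketched in a few lines (the std value here is an assumed illustration, not the measured statistic from the test tables):

```python
# Back-of-envelope headroom check for the old flat-tolerance scheme.
# `std` is an assumed per-sample std for fp4/blocksize=256, chosen only
# to make the arithmetic concrete.
std = 0.00018
measured_mean = 0.2908
old_threshold = measured_mean + 0.001   # old scheme: mean + flat tolerance

headroom_sigma = (old_threshold - measured_mean) / std
# headroom_sigma ≈ 5.6 — a margin that a modest cross-GPU mean shift can consume
```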

Similarly, test_gemv_4bit had a complex if/elif threshold tree with hardware-specific patches (e.g., relaxed relerr1 threshold for T4 compute_cap == (7, 5)). Each new GPU architecture required adding another special case.

The test_parametrize.py tests inherited the same tight thresholds from test_functional.py, with ad-hoc margins (+ 0.01, + 0.02) that varied per test.

How it was fixed

Each error metric now stores (mean, std) tuples measured from 200 samples. Thresholds are computed as mean + N_SIGMA * std with N_SIGMA = 7. This gives:

  • ~7 sigma headroom on the measured GPU (~1 in 780 billion chance of false failure for a one-sided bound)
  • ~5 sigma headroom on a GPU whose mean is 2 sigma higher (accommodates Blackwell, T4, etc.)
  • Clear, auditable methodology — the expected value and its variance are both visible in the code
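As a sketch, the new scheme looks like this (the identifiers and the (mean, std) values below are illustrative placeholders, not the actual entries in tests/test_functional.py):

```python
# Sketch of a std-based threshold table. Keys and statistics are
# placeholders for illustration only.
N_SIGMA = 7

# (mean, std) of the error metric, measured over repeated samples
# on a reference GPU.
ERROR_STATS = {
    ("fp4", 256): (0.2908, 0.0002),
    ("nf4", 64): (0.1356, 0.0001),
}

def threshold(quant_type: str, blocksize: int) -> float:
    """Upper bound on the error metric: measured mean + N_SIGMA * std."""
    mean, std = ERROR_STATS[(quant_type, blocksize)]
    return mean + N_SIGMA * std

# In a test body, the assertion would then read:
#   assert rel_err < threshold(quant_type, blocksize)
```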

For test_gemv_4bit, individual-sample std (not divided by sqrt(iters)) is used deliberately, since the test averages over 100 iterations. This makes the threshold generous for the averaged value (~70 sigma), which naturally absorbs GPU-to-GPU kernel behavior differences without needing per-GPU carve-outs.
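The ~70 sigma figure follows from plain arithmetic (iters = 100 and N_SIGMA = 7 are taken from the description above):

```python
import math

iters = 100   # test_gemv_4bit averages its error metric over 100 iterations
N_SIGMA = 7

# The bound uses the per-sample std sigma, but the tested quantity is a mean
# over `iters` runs, whose std is sigma / sqrt(iters). Measured in
# std-of-the-mean units, the bound therefore sits N_SIGMA * sqrt(iters) away:
effective_sigma = N_SIGMA * math.sqrt(iters)   # 70.0
```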

Files changed

File                       Tests updated
tests/test_functional.py   test_4bit_quant, test_gemv_4bit
tests/test_parametrize.py  test_replace_parameter_4bit, test_moe_parameter_shape, test_different_blocksizes, test_parametrization_forward_method

Test plan

  • test_4bit_quant — 96/96 passed (all quant_type × blocksize × dtype × device combos)
  • test_gemv_4bit — 384/384 passed (all dim × dtype × storage_type × DQ × kind combos)
  • test_parametrize.py — 158/158 passed
  • Full CI suite on Blackwell (the primary target for this fix)

🤖 Generated with Claude Code

TimDettmers and others added 2 commits February 14, 2026 01:28
Worker agents were running the full test suite (10+ min) which is
wasteful when only a small area of code changed. Updated the completion
workflow to instruct agents to run only relevant test files/functions.
The full suite will be run separately later.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Precision tests were flaky because thresholds were set too close to the
empirical mean error, leaving insufficient margin for GPU architecture
differences. For example, test_4bit_quant for fp4/blocksize=256 used a
threshold of 0.2908 + 0.001 = 0.2918, while Blackwell GPUs observed values
around 0.2909 — only ~5 sigma below the threshold — causing sporadic failures.

Collected (mean, std) statistics from 200 samples per configuration on
RTX 4090. Thresholds are now set at mean + 7*std, giving ~7 sigma of
headroom for the measured GPU and enough margin to accommodate
cross-architecture mean shifts (e.g., T4, Blackwell, XPU).

Changes in test_functional.py:
- test_4bit_quant: error_dict now stores (mean, std) tuples instead of
  bare means. Removed ad-hoc errtol/reltol special-casing for CPU fp32.
- test_gemv_4bit: Replaced complex if/elif threshold tree (with GPU-
  specific carve-outs like T4 compute cap checks and XPU conditionals)
  with a clean per-dtype/dim-range (mean, std) table. Individual-sample
  std is used (not divided by sqrt(iters)) so thresholds naturally
  accommodate architecture-specific kernel behavior.

Changes in test_parametrize.py:
- test_replace_parameter_4bit: Same (mean, std) approach as test_4bit_quant.
- test_moe_parameter_shape: Replaced flat 0.085/0.25 bounds with measured
  MoE-tensor-specific (mean, std).
- test_different_blocksizes: Same (mean, std) approach as test_4bit_quant.
- test_parametrization_forward_method: Replaced flat 0.08/0.25 bounds with
  small-tensor-specific (mean, std); small 64x64 tensors have ~16x higher
  relative std than 1024x1024 due to fewer quantization blocks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

- Replace ambiguous unicode multiplication sign with ASCII x
- Apply ruff format to long assert lines
- Fix test_linear4bit.py pre-existing format violation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@TimDettmers (Collaborator, Author) left a comment


PR Review: #1864 — fix: Replace hard-coded precision thresholds with std-based bounds

Test-only change (plus a docs tweak to agents/coordinator_guide.md): replaces ad-hoc hard-coded precision thresholds in test_4bit_quant, test_gemv_4bit, and four test_parametrize.py tests with statistically grounded mean + 7*std bounds. Also removes GPU-specific carve-outs (T4 compute-cap check, XPU conditional) that required maintenance for each new architecture. The test_linear4bit.py change is a cosmetic reformat of one assert statement.

Verdict: Approve. The methodology is sound, the math checks out, and CI is fully green across all platforms.

Statistical methodology assessment

The approach stores (mean, std) tuples measured from 200 samples on RTX 4090, then sets thresholds at mean + N_SIGMA * std with N_SIGMA = 7.

  • 7-sigma headroom gives a false-failure probability of ~1 in 780 billion per individual one-sided assertion on the reference GPU. Even if a different GPU shifts the mean by 2 sigma, there is still ~5-sigma headroom (1 in 3.5 million).
  • test_gemv_4bit uses individual-sample std (not divided by sqrt(iters)), while the test averages over 100 iterations. This makes the threshold effectively ~70 sigma for the averaged metric, which is extremely generous and eliminates the need for per-GPU special cases. This is a deliberate and correct design choice, well-documented in the PR body and commit message.
  • No zero-std risk: All std values are strictly positive (smallest is 2e-9 for fp32 gemv), which is expected from empirical measurements over 200 stochastic samples.
  • CPU fp32 coverage: The old code had special-case errtol/reltol relaxation for CPU fp32 at large blocksizes. The new 7*std thresholds are strictly more generous than the old mean + 0.001/mean + 0.0028 bounds in all affected configurations, so removing the CPU special case is safe.
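For reference, the tail probabilities behind these headroom figures can be checked with a few lines of standard-library Python (standard normal, one-sided bound):

```python
import math

def one_sided_tail(n_sigma: float) -> float:
    """P(Z > n_sigma) for a standard normal — the per-assertion
    false-failure probability of a one-sided error < mean + n_sigma * std
    bound, assuming the error metric is approximately normal."""
    return math.erfc(n_sigma / math.sqrt(2)) / 2

# one_sided_tail(5) ≈ 2.9e-7 (about 1 in 3.5 million)
# one_sided_tail(7) ≈ 1.3e-12
```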

Specific observations

  • The commented-out debug print block in test_gemv_4bit (11 lines) was removed. Good cleanup.
  • The test_parametrize.py changes consistently apply the same (mean, std) methodology to four tests that previously used varying ad-hoc margins (+0.001, +0.01, +0.02, flat 0.085/0.25). The small-tensor test (test_parametrization_forward_method, 64x64) correctly uses higher std values (e.g., 0.001180 vs 0.000072 for 512x256 tensors) reflecting the ~16x higher relative variance from fewer quantization blocks.
  • The coordinator_guide.md change adds guidance to run only relevant tests, not the full suite. Sensible process improvement.
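The ~16x figure is consistent with a simple 1/sqrt(num_blocks) scaling argument. A hypothetical sketch, assuming blocksize 64 and comparing the 64x64 and 1024x1024 tensor sizes cited in the commit message:

```python
import math

# Heuristic: if the overall error metric averages roughly independent
# per-block quantization errors, its relative std scales like
# 1 / sqrt(num_blocks). blocksize = 64 is an assumption for this sketch.
blocksize = 64

def n_blocks(rows: int, cols: int) -> int:
    """Number of quantization blocks covering a rows x cols tensor."""
    return (rows * cols) // blocksize

ratio = math.sqrt(n_blocks(1024, 1024) / n_blocks(64, 64))
# ratio == 16.0, matching the ~16x higher relative std reported for
# small 64x64 tensors
```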

No blocking issues.

  • Security: Clear
  • Downstream impact: None (test-only changes, no public API modifications)
  • Tests: The PR is the test improvement. Methodology is well-founded.
  • CI: All checks pass (L40S, T4, CPU x86/arm/macOS, Windows, lint)
  • Cross-PR conflicts: File-level overlaps with several open PRs on tests/test_functional.py (#1871, #1863, #1729) and tests/test_linear4bit.py (#1866, #1865, #1863, #1861, #1860, #1859, #1858). The overlaps on test_linear4bit.py are in different test functions (this PR only reformats one assert in test_quant_storage_shard_roundtrip), so merge conflicts are unlikely. The test_functional.py overlaps with #1871 (deprecation) and #1863 could require a rebase for whichever merges second.
