
Bug fix: disable weight quantizer rotation after weight fold during vLLM fakequant export#1143

Merged
kinjalpatel27 merged 3 commits into main from kinjal/vllm_export_rotation_fix on Apr 1, 2026

Conversation

@kinjalpatel27
Contributor

@kinjalpatel27 kinjalpatel27 commented Mar 30, 2026

What does this PR do?

Type of change: Bug fix

When export_hf_vllm_fq_checkpoint folds fake-quantized weights into the HF checkpoint, it applies the weight quantizer's full forward pass — including any Hadamard rotation — to produce fake_quant(rotate(W)). The weight quantizer is then disabled in the saved ModelOpt state so vLLM reload skips re-quantization.

However, if fold_weight is called after reload (e.g. in fakequant_worker.py), QuantModule.fold_weight checks fake_quant but not is_enabled, so it calls the disabled weight quantizer. With rotate=True, rotation runs before the
_disabled early-return in TensorQuantizer.forward, corrupting the already-folded weight with a second rotation.
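The failure mode can be sketched with pure-Python stand-ins — a 2-D rotation in place of the Hadamard rotation, and crude rounding in place of the real fake-quantizer (all names here are illustrative, not the ModelOpt API):

```python
import math

def rotate(w, theta=math.pi / 4):
    # 2-D rotation matrix as a toy stand-in for the Hadamard rotation.
    (a, b), (c, d) = w
    cs, sn = math.cos(theta), math.sin(theta)
    return [[cs * a - sn * c, cs * b - sn * d],
            [sn * a + cs * c, sn * b + cs * d]]

def fake_quant(w, step=0.25):
    # Crude uniform fake-quantization stand-in.
    return [[round(x / step) * step for x in row] for row in w]

W = [[1.0, 2.0], [3.0, 4.0]]

folded = fake_quant(rotate(W))  # what export folds into the checkpoint
corrupted = rotate(folded)      # buggy second fold re-applies the rotation

print(folded != corrupted)  # True -- the folded weight is now wrong
```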

This fix clears _rotate on weight quantizers alongside disable() before saving the ModelOpt state. Both are restored on the in-memory model after saving, so the export remains non-mutating.
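The disable-then-restore pattern the fix describes can be sketched as follows; `_StubQuantizer`, `export_with_rotation_cleared`, and the attribute names are illustrative stand-ins, not the real ModelOpt code in vllm_fakequant_hf.py:

```python
class _StubQuantizer:
    """Minimal stand-in for a weight quantizer (hypothetical)."""
    def __init__(self):
        self._rotate = True
        self.enabled = True

    def disable(self):
        self.enabled = False

    def enable(self):
        self.enabled = True

def export_with_rotation_cleared(weight_quantizers, save_state):
    """Clear _rotate alongside disable() before saving; restore both after."""
    to_restore = []
    for wq in weight_quantizers:
        to_restore.append((wq, wq._rotate))
        wq.disable()
        wq._rotate = False  # reload must not re-rotate the folded weight
    save_state()            # snapshot sees quantizers disabled, rotation off
    for wq, orig_rotate in to_restore:
        wq.enable()
        wq._rotate = orig_rotate  # in-memory model is left untouched

wq = _StubQuantizer()
snapshot = {}
export_with_rotation_cleared(
    [wq], lambda: snapshot.update(rotate=wq._rotate, enabled=wq.enabled)
)
print(snapshot)                # {'rotate': False, 'enabled': False}
print(wq._rotate, wq.enabled)  # True True
```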

Usage

Testing

cd <MODELOPT_PATH>/examples/vllm_serve

Add the following config to modelopt/torch/quantization/config.py:
NVFP4_ROTATE_CFG = {
    "quant_cfg": {
        "*weight_quantizer": {
            "num_bits": (2, 1),
            "block_sizes": {-1: 16, "type": "dynamic", "scale_bits": (4, 3)},
            "enable": True,
            "rotate": True,
        },
        "*input_quantizer": {
            "num_bits": (2, 1),
            "block_sizes": {-1: 16, "type": "dynamic", "scale_bits": (4, 3)},
            "enable": True,
            "rotate": True,
        },
        **_default_disabled_quantizer_cfg,
    },
    "algorithm": {
        "method": "max",
    },
}

python ../llm_ptq/hf_ptq.py \
    --pyt_ckpt_path Qwen/Qwen3-0.6B \
    --quant_cfg NVFP4_ROTATE_CFG \
    --dataset cnn_dailymail \
    --export_path qwen3-rotate \
    --trust_remote_code \
    --batch_size 4 \
    --calib_size 512 \
    --vllm_fakequant_export

MODELOPT_STATE_PATH=qwen3-rotate/vllm_fq_modelopt_state.pth python vllm_serve_fakequant.py qwen3-rotate/ \
    -tp 1 \
    --served-model-name Qwen3-0.6B \
    --host 0.0.0.0 --port 8001 \
    --trust-remote-code \
    --disable-custom-all-reduce \
    --enforce-eager \
    --gpu-memory-utilization 0.8 \
    --max-model-len 32768

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: N/A
  • Did you update Changelog?: N/A

Additional Information

Summary by CodeRabbit

  • Bug Fixes

    • Preserve and restore rotation settings for quantized weights during export so exported models keep correct quantization behavior.
  • Documentation

    • Updated changelog: moved vLLM fakequant reload entry to 0.44 (2026-05) with reference to the vLLM serve example.

@copy-pr-bot

copy-pr-bot bot commented Mar 30, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Mar 30, 2026

📝 Walkthrough

Walkthrough

Export snapshotting now preserves each weight quantizer's rotation state by capturing the original _rotate, replacing it with a disabled, type-preserving form while snapshotting, and restoring the original _rotate after export/reload.

Changes

Cohort / File(s) Summary
Rotation Configuration Preservation
modelopt/torch/export/plugins/vllm_fakequant_hf.py
Capture each enabled weight quantizer's original _rotate, replace it with a disabled form that preserves type (supports RotateConfig and legacy dict), snapshot state, then restore the original _rotate after completion.
Changelog
CHANGELOG.rst
Moved the "vLLM fakequant reload using ModelOpt state for HF models" entry from 0.43 to a new 0.44 entry under New Features.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title accurately describes the main change: disabling weight quantizer rotation after weight fold during vLLM fakequant export, which directly matches the core bug fix in the code changes.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Security Anti-Patterns ✅ Passed PR modifies only vllm_fakequant_hf.py and CHANGELOG.rst with safe in-memory quantizer state management using PyTorch natives; no unsafe patterns detected.



@kinjalpatel27 kinjalpatel27 marked this pull request as ready for review March 30, 2026 20:11
@kinjalpatel27 kinjalpatel27 requested a review from a team as a code owner March 30, 2026 20:11
@kinjalpatel27 kinjalpatel27 requested a review from cjluo-nv March 30, 2026 20:11
@github-actions
Contributor

github-actions bot commented Mar 30, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-04-01 20:54 UTC

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
modelopt/torch/export/plugins/vllm_fakequant_hf.py (1)

107-122: Please add a rotate=True export/reload regression test.

This only fails when export, reload, and a later fold_weight() interact, so it is easy to regress silently without an end-to-end check.

As per coding guidelines, "Maintain minimum 70% code coverage on modelopt/* modules (enforced via coverage configuration)."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py` around lines 107 - 122,
Add a regression test that exercises export → reload → fold_weight to ensure the
quantizer rotation state is preserved: create a small model using QuantModule
with a TensorQuantizer that has rotate enabled, export it using the logic in
vllm_fakequant_hf.py (so the code path that disables quantizer.rotate and stores
orig_rotate in wqs_to_restore runs), reload the exported model, call
fold_weight() on the reloaded model, and assert that weights are identical to
the pre-export folded result and that the TensorQuantizer._rotate value is
restored to its original boolean; target the test at the export/reload flow that
manipulates QuantModule, TensorQuantizer, rotate/rotate_is_enabled, and
fold_weight so regressions are caught.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py`:
- Around line 118-122: The code disables quantizers and mutates
quantizer._rotate but does not guarantee restoration on error; wrap the section
that disables quantizers and the following processing (the region that builds
wqs_to_restore and does work up to where restores currently happen) in a
try/finally so restoration always runs: in the finally iterate wqs_to_restore
and call quantizer.enable() (or reverse the earlier disable call) and restore
quantizer._rotate to orig_rotate (using quantizer.rotate_is_enabled to decide if
it was changed), and swallow/log any exceptions during restore to avoid masking
the original exception; reference the quantizer object, wqs_to_restore list,
quantizer.disable/enable, quantizer._rotate and quantizer.rotate_is_enabled so
you restore state even if an exception occurs.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8f9e9ddd-4ce2-43db-b2d3-5afccc6c397a

📥 Commits

Reviewing files that changed from the base of the PR and between f04e106 and 19563b0.

📒 Files selected for processing (1)
  • modelopt/torch/export/plugins/vllm_fakequant_hf.py

@codecov

codecov bot commented Mar 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.21%. Comparing base (c37c74f) to head (a4b8232).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1143      +/-   ##
==========================================
+ Coverage   70.20%   70.21%   +0.01%     
==========================================
  Files         230      230              
  Lines       26098    26098              
==========================================
+ Hits        18322    18325       +3     
+ Misses       7776     7773       -3     

☔ View full report in Codecov by Sentry.

Collaborator

@cjluo-nv cjluo-nv left a comment


LGTM. The bug is real — TensorQuantizer.forward() applies Hadamard rotation before the _disabled early-return, so when fold_weight is called on vLLM reload (which checks fake_quant but not is_enabled), the already-folded weight gets a second rotation. The fix correctly clears _rotate alongside disable() before saving the modelopt state, and restores both afterward to keep the export non-mutating.

One minor observation: _rotate can be a bool, dict, or RotateConfig — setting it to False works correctly for the serialized state (since rotate_is_enabled handles all three types), but preserving the original type (e.g. RotateConfig(enable=False)) would be slightly cleaner. Not a blocker.

Worth noting that the deeper root cause is QuantModule.fold_weight not guarding on is_enabled / _disabled before invoking the quantizer, but that's a broader change — this targeted fix is appropriate.
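The reviewer's type-preservation suggestion could look like the sketch below. The `RotateConfig` dataclass is an assumed stand-in for the real class (only its `enable` field is taken from the discussion), and `disable_rotate` is a hypothetical helper:

```python
from dataclasses import dataclass, replace

@dataclass
class RotateConfig:
    """Assumed stand-in for the real RotateConfig."""
    enable: bool = True

def disable_rotate(rotate):
    """Return a disabled form of `_rotate` that preserves its original type.

    Per the review, `_rotate` may be a bool, a legacy dict, or a
    RotateConfig; this sketch handles all three.
    """
    if isinstance(rotate, RotateConfig):
        return replace(rotate, enable=False)
    if isinstance(rotate, dict):
        return {**rotate, "enable": False}
    return False  # plain bool

print(disable_rotate(True))                         # False
print(disable_rotate({"enable": True, "dim": -1}))  # {'enable': False, 'dim': -1}
print(disable_rotate(RotateConfig()))               # RotateConfig(enable=False)
```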

Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
modelopt/torch/export/plugins/vllm_fakequant_hf.py (1)

128-132: Logic is correct; consider optional try/finally for robustness.

The sequence properly saves orig_rotate before mutation and correctly uses rotate_is_enabled to avoid unnecessary changes. The fix addresses the double-rotation bug effectively.

As noted in past review, wrapping lines 128-165 in a try/finally would guarantee restoration even if an exception occurs mid-export (relevant when callers catch exceptions and continue). This is optional given your deployment context but would make the function safer as reusable library code.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py` around lines 128 - 132,
Wrap the block that disables quantizers and records originals (the code that
calls quantizer.disable(), saves orig_rotate from quantizer._rotate, checks
quantizer.rotate_is_enabled and sets quantizer._rotate =
disable_rotate(quantizer), and appends to wqs_to_restore) in a try/finally so
that the finally always iterates wqs_to_restore and restores each
quantizer._rotate and re-enables any state as needed; ensure you reference the
same symbols (quantizer, orig_rotate, disable_rotate, wqs_to_restore) when
restoring to guarantee restoration even if an exception occurs during export.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6c0d7633-e5eb-4ca7-8003-e0d0f2fa9fe6

📥 Commits

Reviewing files that changed from the base of the PR and between 19563b0 and f56eecc.

📒 Files selected for processing (1)
  • modelopt/torch/export/plugins/vllm_fakequant_hf.py

Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
@kinjalpatel27 kinjalpatel27 force-pushed the kinjal/vllm_export_rotation_fix branch from 83b8ee6 to a4b8232 Compare April 1, 2026 18:00
Contributor

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
modelopt/torch/export/plugins/vllm_fakequant_hf.py (1)

129-169: ⚠️ Potential issue | 🟠 Major

Restore the temporary quantizer mutations in a finally block.

Line 128 starts mutating live quantizers, but the restore at Line 167 only runs on the happy path. If get_quantizer_state_dict(), mto.modelopt_state(), torch.save(), or save_pretrained() throws and the caller catches it, the in-memory model stays with weight quantizers disabled and _rotate cleared.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py` around lines 129 - 169,
The temporary mutations to quantizers (created via disable_rotate and tracked in
wqs_to_restore) must be undone even on exceptions: wrap the body that calls
get_quantizer_state_dict, mto.modelopt_state, torch.save, and
model.save_pretrained in a try/finally and move the loop that restores each wq
by calling wq.enable() and resetting wq._rotate to orig_rotate into the finally
block so all quantizers are restored regardless of errors.
🧹 Nitpick comments (1)
modelopt/torch/export/plugins/vllm_fakequant_hf.py (1)

32-38: Add a regression test for both RotateConfig and legacy dict _rotate values.

This path is now the compatibility boundary for exported checkpoints, but the PR still does not prove that both _rotate encodings are serialized with rotation disabled while the in-memory quantizer is restored to its original value afterward. A focused export test here would keep this from regressing silently.

As per coding guidelines "Maintain minimum 70% code coverage on modelopt/* modules (enforced via coverage configuration)".

Also applies to: 129-169

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py` around lines 32 - 38, Add
a focused regression test for disable_rotate in
modelopt/torch/export/plugins/vllm_fakequant_hf.py that covers both RotateConfig
and legacy dict encodings: create a TensorQuantizer with _rotate set to a
RotateConfig(enable=True) and another with _rotate set to a dict({'enable':
True, ...}), export them using the same export path exercised by the plugin (and
the code handling lines ~129-169), load the serialized checkpoint and assert the
stored _rotate value has enable=False in the checkpoint file (or serialized
dict) while asserting that the in-memory quantizer instances remain unchanged
(their original _rotate.enable is still True) after the export; add two tests
(e.g., test_export_rotateconfig_disabled_in_checkpoint and
test_export_legacy_rotate_dict_disabled_in_checkpoint) to enforce this behavior
and prevent regressions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: bdb63335-b658-469b-b259-d6536747ef38

📥 Commits

Reviewing files that changed from the base of the PR and between f56eecc and a4b8232.

📒 Files selected for processing (2)
  • CHANGELOG.rst
  • modelopt/torch/export/plugins/vllm_fakequant_hf.py

@kinjalpatel27 kinjalpatel27 merged commit de55e8a into main Apr 1, 2026
59 of 61 checks passed
@kinjalpatel27 kinjalpatel27 deleted the kinjal/vllm_export_rotation_fix branch April 1, 2026 20:54
