Bug fix: disable weight quantizer rotation after weight fold during vLLM fakequant export #1143
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here.
📝 Walkthrough
Export snapshotting now preserves each weight quantizer's rotation state by capturing the original `_rotate` value.
Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ Passed checks (4 passed)
Actionable comments posted: 1
🧹 Nitpick comments (1)
modelopt/torch/export/plugins/vllm_fakequant_hf.py (1)
107-122: Please add a `rotate=True` export/reload regression test. This only fails when export, reload, and a later `fold_weight()` interact, so it is easy to regress silently without an end-to-end check. As per coding guidelines, "Maintain minimum 70% code coverage on modelopt/* modules (enforced via coverage configuration)."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py` around lines 107 - 122, Add a regression test that exercises export → reload → fold_weight to ensure the quantizer rotation state is preserved: create a small model using QuantModule with a TensorQuantizer that has rotate enabled, export it using the logic in vllm_fakequant_hf.py (so the code path that disables quantizer.rotate and stores orig_rotate in wqs_to_restore runs), reload the exported model, call fold_weight() on the reloaded model, and assert that weights are identical to the pre-export folded result and that the TensorQuantizer._rotate value is restored to its original boolean; target the test at the export/reload flow that manipulates QuantModule, TensorQuantizer, rotate/rotate_is_enabled, and fold_weight so regressions are caught.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py`:
- Around line 118-122: The code disables quantizers and mutates
quantizer._rotate but does not guarantee restoration on error; wrap the section
that disables quantizers and the following processing (the region that builds
wqs_to_restore and does work up to where restores currently happen) in a
try/finally so restoration always runs: in the finally iterate wqs_to_restore
and call quantizer.enable() (or reverse the earlier disable call) and restore
quantizer._rotate to orig_rotate (using quantizer.rotate_is_enabled to decide if
it was changed), and swallow/log any exceptions during restore to avoid masking
the original exception; reference the quantizer object, wqs_to_restore list,
quantizer.disable/enable, quantizer._rotate and quantizer.rotate_is_enabled so
you restore state even if an exception occurs.
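The restore-on-error shape this comment asks for can be sketched with a toy quantizer class; the real `TensorQuantizer` API and the surrounding export code in modelopt will differ, so treat this as an illustration of the try/finally pattern only:

```python
class ToyQuantizer:
    """Toy stand-in for a weight quantizer with a rotation flag."""

    def __init__(self, rotate=True):
        self._rotate = rotate
        self._disabled = False

    def disable(self):
        self._disabled = True

    def enable(self):
        self._disabled = False


def snapshot_with_quantizers_disabled(quantizers, save_fn):
    """Disable quantizers and clear rotation while save_fn runs, then restore."""
    wqs_to_restore = []  # (quantizer, original _rotate) pairs
    try:
        for q in quantizers:
            wqs_to_restore.append((q, q._rotate))
            q.disable()
            q._rotate = False
        save_fn()  # may raise (torch.save, save_pretrained, ...)
    finally:
        # Runs on both the happy path and on exceptions, so the in-memory
        # model always returns to its original state.
        for q, orig_rotate in wqs_to_restore:
            q.enable()
            q._rotate = orig_rotate
```

Even if `save_fn` raises and a caller swallows the exception, every quantizer is re-enabled and its `_rotate` is back to the original value.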
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8f9e9ddd-4ce2-43db-b2d3-5afccc6c397a
📒 Files selected for processing (1)
modelopt/torch/export/plugins/vllm_fakequant_hf.py
Codecov Report
✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

@@ Coverage Diff @@
## main #1143 +/- ##
==========================================
+ Coverage 70.20% 70.21% +0.01%
==========================================
Files 230 230
Lines 26098 26098
==========================================
+ Hits 18322 18325 +3
+ Misses     7776     7773       -3

☔ View full report in Codecov by Sentry.
cjluo-nv left a comment
LGTM. The bug is real — TensorQuantizer.forward() applies Hadamard rotation before the _disabled early-return, so when fold_weight is called on vLLM reload (which checks fake_quant but not is_enabled), the already-folded weight gets a second rotation. The fix correctly clears _rotate alongside disable() before saving the modelopt state, and restores both afterward to keep the export non-mutating.
One minor observation: _rotate can be a bool, dict, or RotateConfig — setting it to False works correctly for the serialized state (since rotate_is_enabled handles all three types), but preserving the original type (e.g. RotateConfig(enable=False)) would be slightly cleaner. Not a blocker.
Worth noting that the deeper root cause is QuantModule.fold_weight not guarding on is_enabled / _disabled before invoking the quantizer, but that's a broader change — this targeted fix is appropriate.
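The failure mode described above can be reproduced with pure-Python stand-ins. These toy classes only mirror the control flow; they are not modelopt's real `TensorQuantizer`/`QuantModule`, and a sign flip stands in for the Hadamard rotation:

```python
class ToyQuantizer:
    def __init__(self):
        self._rotate = True
        self._disabled = False
        self.fake_quant = True

    def forward(self, w):
        if self._rotate:             # rotation runs BEFORE the disabled check
            w = [-x for x in w]      # sign flip as a stand-in "rotation"
        if self._disabled:           # early-return for disabled quantizers
            return w
        return [round(x * 4) / 4 for x in w]  # crude fake-quant


def fold_weight(w, q):
    # Mirrors a fold that checks fake_quant but not whether q is enabled.
    return q.forward(w) if q.fake_quant else w


w = [0.3, -1.1]
q = ToyQuantizer()
folded = fold_weight(w, q)       # export-time fold: fake_quant(rotate(W))

q._disabled = True               # quantizer disabled in the saved state
assert fold_weight(folded, q) != folded  # bug: fold after reload rotates again

q._rotate = False                # the fix: clear rotation alongside disable()
assert fold_weight(folded, q) == folded  # folded weight now passes through
```

With only `_disabled` set, the second fold still applies the rotation once more and corrupts the already-folded weight; clearing `_rotate` as well makes the disabled quantizer a true no-op.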
🧹 Nitpick comments (1)
modelopt/torch/export/plugins/vllm_fakequant_hf.py (1)
128-132: Logic is correct; consider an optional `try/finally` for robustness. The sequence properly saves `orig_rotate` before mutation and correctly uses `rotate_is_enabled` to avoid unnecessary changes. The fix addresses the double-rotation bug effectively. As noted in a past review, wrapping lines 128-165 in a `try/finally` would guarantee restoration even if an exception occurs mid-export (relevant when callers catch exceptions and continue). This is optional given your deployment context but would make the function safer as reusable library code.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py` around lines 128 - 132, Wrap the block that disables quantizers and records originals (the code that calls quantizer.disable(), saves orig_rotate from quantizer._rotate, checks quantizer.rotate_is_enabled and sets quantizer._rotate = disable_rotate(quantizer), and appends to wqs_to_restore) in a try/finally so that the finally always iterates wqs_to_restore and restores each quantizer._rotate and re-enables any state as needed; ensure you reference the same symbols (quantizer, orig_rotate, disable_rotate, wqs_to_restore) when restoring to guarantee restoration even if an exception occurs during export.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 6c0d7633-e5eb-4ca7-8003-e0d0f2fa9fe6
📒 Files selected for processing (1)
modelopt/torch/export/plugins/vllm_fakequant_hf.py
83b8ee6 to a4b8232
♻️ Duplicate comments (1)
modelopt/torch/export/plugins/vllm_fakequant_hf.py (1)
129-169: ⚠️ Potential issue | 🟠 Major
Restore the temporary quantizer mutations in a `finally` block. Line 128 starts mutating live quantizers, but the restore at line 167 only runs on the happy path. If `get_quantizer_state_dict()`, `mto.modelopt_state()`, `torch.save()`, or `save_pretrained()` throws and the caller catches it, the in-memory model stays with weight quantizers disabled and `_rotate` cleared.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py` around lines 129 - 169, The temporary mutations to quantizers (created via disable_rotate and tracked in wqs_to_restore) must be undone even on exceptions: wrap the body that calls get_quantizer_state_dict, mto.modelopt_state, torch.save, and model.save_pretrained in a try/finally and move the loop that restores each wq by calling wq.enable() and resetting wq._rotate to orig_rotate into the finally block so all quantizers are restored regardless of errors.
🧹 Nitpick comments (1)
modelopt/torch/export/plugins/vllm_fakequant_hf.py (1)
32-38: Add a regression test for both `RotateConfig` and legacy dict `_rotate` values. This path is now the compatibility boundary for exported checkpoints, but the PR still does not prove that both `_rotate` encodings are serialized with rotation disabled while the in-memory quantizer is restored to its original value afterward. A focused export test here would keep this from regressing silently. As per coding guidelines, "Maintain minimum 70% code coverage on modelopt/* modules (enforced via coverage configuration)".
Also applies to: 129-169
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py` around lines 32 - 38, Add a focused regression test for disable_rotate in modelopt/torch/export/plugins/vllm_fakequant_hf.py that covers both RotateConfig and legacy dict encodings: create a TensorQuantizer with _rotate set to a RotateConfig(enable=True) and another with _rotate set to a dict({'enable': True, ...}), export them using the same export path exercised by the plugin (and the code handling lines ~129-169), load the serialized checkpoint and assert the stored _rotate value has enable=False in the checkpoint file (or serialized dict) while asserting that the in-memory quantizer instances remain unchanged (their original _rotate.enable is still True) after the export; add two tests (e.g., test_export_rotateconfig_disabled_in_checkpoint and test_export_legacy_rotate_dict_disabled_in_checkpoint) to enforce this behavior and prevent regressions.
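The type-preservation cjluo-nv mentions could look like the following sketch. `RotateConfig` here is a toy dataclass stand-in, and `disable_rotate`/`rotate_is_enabled` are hypothetical helpers written for illustration, not modelopt's actual API:

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class RotateConfig:  # toy stand-in for a rotation config object
    enable: bool = True
    dim: Optional[int] = None


def rotate_is_enabled(value):
    """Check the enable flag across all three `_rotate` encodings."""
    if isinstance(value, bool):
        return value
    if isinstance(value, dict):
        return bool(value.get("enable", True))
    return value.enable  # RotateConfig


def disable_rotate(value):
    """Return a disabled copy of `_rotate`, preserving its original encoding
    (bool, legacy dict, or RotateConfig) instead of collapsing to False."""
    if isinstance(value, bool):
        return False
    if isinstance(value, dict):
        return {**value, "enable": False}
    return replace(value, enable=False)  # RotateConfig; other fields kept
```

Preserving the encoding means a reloaded checkpoint round-trips the same `_rotate` type it was exported with, instead of silently degrading a config object to a plain `False`.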
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: bdb63335-b658-469b-b259-d6536747ef38
📒 Files selected for processing (2)
CHANGELOG.rst
modelopt/torch/export/plugins/vllm_fakequant_hf.py
What does this PR do?
Type of change: Bug fix
When `export_hf_vllm_fq_checkpoint` folds fake-quantized weights into the HF checkpoint, it applies the weight quantizer's full forward pass, including any Hadamard rotation, to produce `fake_quant(rotate(W))`. The weight quantizer is then disabled in the saved ModelOpt state so vLLM reload skips re-quantization.
However, if `fold_weight` is called after reload (e.g. in `fakequant_worker.py`), `QuantModule.fold_weight` checks `fake_quant` but not `is_enabled`, so it calls the disabled weight quantizer. With `rotate=True`, rotation runs before the `_disabled` early-return in `TensorQuantizer.forward`, corrupting the already-folded weight with a second rotation.
This fix clears `_rotate` on weight quantizers alongside `disable()` before saving the ModelOpt state. Both are restored on the in-memory model after saving, so the export remains non-mutating.
Usage
Testing
Before your PR is "Ready for review"
- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- CONTRIBUTING.md: N/A

Additional Information
Summary by CodeRabbit
Bug Fixes
Documentation