
Fix vllm quantization for new vllm >= 0.17 #1146

Open

mxinO wants to merge 2 commits into main from mxin/vllm-update

Conversation

@mxinO
Contributor

@mxinO mxinO commented Mar 31, 2026

What does this PR do?

Type of change: Bug fix

Fix for vllm >= 0.17

Usage

# Add a code snippet demonstrating how to use this

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

  • Refactor
    • Improved vLLM quantization error handling with try/except guards to gracefully support missing or optional components
    • Enhanced KV-cache handling to accept both single and list/tuple formats and safely derive device/dtype
    • Added tolerance for models running without optional distributed parallel groups
    • Made attention-related quant module registration conditional, activating only when components are present
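The "tolerance for missing distributed parallel groups" bullet can be sketched as follows. This is an illustrative stand-in, not the plugin's actual code: `get_ep_group` here is a hypothetical placeholder that raises the way vLLM does when an expert-parallel (EP) group was never initialized, and the `state` dict stands in for whatever parallel state the real `create_parallel_state()` builds.

```python
# Hypothetical sketch: tolerate a missing expert-parallel (EP) group.
# Names are illustrative stand-ins, not the real modelopt/vLLM API.

def get_ep_group():
    # Stand-in for vLLM's accessor, which raises when the EP group
    # was never initialized (e.g. for non-MoE models).
    raise AssertionError("expert parallel group is not initialized")

def create_parallel_state():
    state = {"tp_group": "tp0"}  # groups that are always present
    try:
        state["ep_group"] = get_ep_group()
    except (AssertionError, RuntimeError):
        # Models without MoE layers never set up an EP group;
        # proceed without one instead of crashing.
        state["ep_group"] = None
    return state
```

The point of catching both `AssertionError` and `RuntimeError` is that different vLLM versions signal the uninitialized group differently, so quantization setup no longer hard-fails on models that simply have no EP group.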

Signed-off-by: Meng Xin <mxin@nvidia.com>
@mxinO mxinO requested a review from a team as a code owner March 31, 2026 07:33
@mxinO mxinO requested a review from Edwardf0t1 March 31, 2026 07:33
@coderabbitai
Contributor

coderabbitai bot commented Mar 31, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4924fcd4-60b3-460a-b001-56a67498f263

📥 Commits

Reviewing files that changed from the base of the PR and between 98a6a38 and 58daff1.

📒 Files selected for processing (1)
  • modelopt/torch/quantization/plugins/vllm.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • modelopt/torch/quantization/plugins/vllm.py

📝 Walkthrough

Walkthrough

Replace fragile vLLM presence checks with guarded try/except logic, conditionally define/register attention QuantModules only when symbols exist, make KV-cache handling accept lists/tuples or single tensors, and allow create_parallel_state() to proceed when EP groups are absent by catching related exceptions.
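The guarded-detection and conditional-registration pattern the walkthrough describes can be sketched like this. It is a simplified illustration under assumptions: the import path and class name mirror the PR description, but the registry helper and surrounding framework code are hypothetical, not the plugin's actual implementation.

```python
# Illustrative sketch of guarded symbol detection plus conditional
# registration. The registry helper is hypothetical; the import path
# mirrors the PR description but may not match any vLLM version exactly.

QUANT_MODULE_REGISTRY = {}

def register_quant_module(original_cls, quant_cls):
    """Map an original module class to its quantized wrapper."""
    QUANT_MODULE_REGISTRY[original_cls] = quant_cls

try:
    # Newer vLLM releases may move or drop this module/symbol, so both
    # ImportError and AttributeError are treated as "not available".
    from vllm.model_executor.layers import attention as _vllm_attention
    CrossAttention = _vllm_attention.CrossAttention
except (ImportError, AttributeError):
    CrossAttention = None

if CrossAttention is not None:
    class _QuantVLLMCrossAttention(CrossAttention):
        """Quantized variant, defined only when the base class exists."""

    register_quant_module(CrossAttention, _QuantVLLMCrossAttention)
```

Because the subclass is defined inside the `if` block, environments where the symbol is absent simply skip the registration instead of raising at import time, which is what makes the plugin tolerant of vLLM >= 0.17.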

Changes

Cohort / File(s) Summary
vLLM plugin
modelopt/torch/quantization/plugins/vllm.py
Replaced direct find_spec checks with try/except-guarded detection, treat missing vLLM attention via AttributeError/ImportError, conditionally declare/register _QuantVLLMCrossAttention and _QuantVLLMEncoderOnlyAttention, updated _get_device_dtype() to handle KV-cache as list/tuple or single object, and made create_parallel_state() tolerate missing EP groups by catching AssertionError/RuntimeError.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 66.67%, which is below the required threshold of 80.00%. Write docstrings for the functions that lack them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately reflects the main change: fixing vLLM quantization compatibility for vLLM >= 0.17, which is the core focus of all modifications.
Security Anti-Patterns ✅ Passed No instances of security anti-patterns (unsafe deserialization, eval/exec, hardcoded trust_remote_code, # nosec comments) found. Changes consist of defensive programming with try/except blocks and conditional registrations for vLLM compatibility.


@mxinO mxinO requested a review from kinjalpatel27 March 31, 2026 07:34
@github-actions
Contributor

github-actions bot commented Mar 31, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1146/

Built to branch gh-pages at 2026-03-31 07:45 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

Signed-off-by: Meng Xin <mxin@nvidia.com>
@mxinO mxinO requested a review from realAsma March 31, 2026 07:41
@codecov

codecov bot commented Mar 31, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.21%. Comparing base (74a8694) to head (58daff1).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1146   +/-   ##
=======================================
  Coverage   70.21%   70.21%           
=======================================
  Files         230      230           
  Lines       26073    26073           
=======================================
  Hits        18308    18308           
  Misses       7765     7765           

☔ View full report in Codecov by Sentry.

@mxinO mxinO enabled auto-merge (squash) March 31, 2026 08:10
Contributor

@kinjalpatel27 kinjalpatel27 left a comment


LGTM
