Remove internal lustre paths by kevalmorabia97 · Pull Request #1148 · NVIDIA/Model-Optimizer

kevalmorabia97 · 2026-03-31T16:36:16Z

Remove internal /lusre paths

Summary by CodeRabbit

Configuration Updates
- Set a concrete default container image to an NVIDIA PyTorch container for templates.
- Removed built-in default values in scripts; key environment/config variables must be provided externally.
- Replaced absolute example paths with placeholder relative paths and TODO notes requiring user customization.

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

coderabbitai · 2026-03-31T16:36:30Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9b4ce58f-61e8-4b96-9cd1-c19b9fac8d8b

📥 Commits

Reviewing files that changed from the base of the PR and between ebc6aa8 and c25c7be.

📒 Files selected for processing (3)

examples/llm_qad/configs/qwen3-30b-a3b-instruct-2507-moe_template.conf
examples/llm_qad/configs/qwen3-8b_template.conf
examples/llm_qad/sbatch_qad.sh

💤 Files with no reviewable changes (1)

examples/llm_qad/sbatch_qad.sh

✅ Files skipped from review due to trivial changes (1)

examples/llm_qad/configs/qwen3-8b_template.conf

🚧 Files skipped from review as they are similar to previous changes (1)

examples/llm_qad/configs/qwen3-30b-a3b-instruct-2507-moe_template.conf

📝 Walkthrough

Walkthrough

Set explicit default container image in two LLM QAD config templates, removed in-script default assignments from the sbatch script (forcing values to be provided externally), and replaced absolute paths with placeholder paths and TODOs in one YAML example.

Changes

Cohort / File(s)	Summary
Config templates — container image defaults `examples/llm_qad/configs/qwen3-30b-a3b-instruct-2507-moe_template.conf`, `examples/llm_qad/configs/qwen3-8b_template.conf`	Changed `export CONTAINER_IMAGE` from an empty string to `nvcr.io/nvidia/pytorch:26.01-py3`; updated inline comment to note it can be a container image or a `.sqsh` file.
Sbatch script — removed defaults `examples/llm_qad/sbatch_qad.sh`	Removed in-script default assignments for multiple path/container variables (e.g., `MLM_DIR`, `MODELOPT_DIR`, `MODELS_ROOT`, `QAD_CHECKPOINT_ROOT`, `DATACACHE_DIR`, `CONTAINER_IMAGE`, `CONTAINER_MOUNTS`, `CONTAINER_WORKDIR`); these must now be supplied via sourced config or environment.
YAML example — placeholder paths `examples/windows/torch_onnx/diffusers/qad_example/ltx2_qad.yaml`	Replaced absolute filesystem paths with `/path/to/...` placeholders and added inline `# TODO` comments for `model.model_path`, `model.load_checkpoint.text_encoder_path`, and `data.preprocessed_data_root`.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main change: removing internal lustre paths from configuration files and scripts.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Security Anti-Patterns	✅ Passed	PR modifies only configuration files (.conf, .yaml, .sh) with no Python code changes containing security anti-patterns like torch.load(), eval(), exec(), or trust_remote_code.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch kmorabia/remove-lustre

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/llm_qad/configs/qwen3-8b_template.conf`:
- Line 58: Add a required-variable validation for LOG_DIR in sbatch_qad.sh
similar to the existing checks for other vars: detect if LOG_DIR is empty at the
start of the script and print an error then exit (matching the pattern used for
other variables), so mkdir -p "${LOG_DIR}" and the srun --output=${LOG_DIR}/...
usage cannot proceed with an empty path; reference the LOG_DIR variable, the
mkdir -p invocation, and the srun --output usage when placing the new check.

In `@examples/windows/torch_onnx/diffusers/qad_example/ltx2_qad.yaml`:
- Around line 5-6: The YAML key text_encoder_path is misindented as a sibling of
load_checkpoint rather than a child, breaking the expected
model.load_checkpoint.text_encoder_path schema; fix by nesting text_encoder_path
under load_checkpoint (adjust indentation) so that the configuration key becomes
model.load_checkpoint.text_encoder_path and consumers of
load_checkpoint/text_encoder_path find the value correctly.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e867e57c-47a3-4ef0-9fcd-5b65630fdba7

📥 Commits

Reviewing files that changed from the base of the PR and between 74a8694 and ebc6aa8.

📒 Files selected for processing (4)

examples/llm_qad/configs/qwen3-30b-a3b-instruct-2507-moe_template.conf
examples/llm_qad/configs/qwen3-8b_template.conf
examples/llm_qad/sbatch_qad.sh
examples/windows/torch_onnx/diffusers/qad_example/ltx2_qad.yaml

💤 Files with no reviewable changes (1)

examples/llm_qad/sbatch_qad.sh

coderabbitai · 2026-03-31T16:39:45Z

examples/llm_qad/configs/qwen3-8b_template.conf

 export STUDENT_CONFIG_FILE="" # path to student model args script, e.g., ${MLM_DIR}/examples/post_training/modelopt/conf/Qwen/Qwen3-8B.sh
 export QAD_CHECKPOINT_ROOT="" # path to store QAD checkpoints
 export DATACACHE_DIR="" # path to data cache directory
+export LOG_DIR="" # path to log directory


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Check if sbatch_qad.sh has any validation for LOG_DIR or other required variables echo "=== Checking for LOG_DIR validation in sbatch_qad.sh ===" rg -n 'LOG_DIR' examples/llm_qad/sbatch_qad.sh echo "" echo "=== Checking for any variable validation patterns (e.g., -z checks) ===" rg -n '\[\[.*-z.*\]\]|\[\s+-z' examples/llm_qad/sbatch_qad.sh echo "" echo "=== Checking for required variable error handling ===" rg -n 'required|REQUIRED|must be set|not set' examples/llm_qad/sbatch_qad.sh

Repository: NVIDIA/Model-Optimizer

Length of output: 988

Add validation for LOG_DIR in sbatch_qad.sh before use.

The script validates other required variables (lines 63-64, 102-103) but omits validation for LOG_DIR, which is marked as REQUIRED in the config. When empty, mkdir -p "" silently does nothing (line 74) and srun --output=${LOG_DIR}/%x_... (lines 146-147) becomes --output=/%x_..., attempting to write logs to the root filesystem.

Add a validation check at the start of the script, similar to the existing patterns:

[[ -z "${LOG_DIR:-}" ]] && echo "ERROR: LOG_DIR required" && exit 1

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@examples/llm_qad/configs/qwen3-8b_template.conf` at line 58, Add a required-variable validation for LOG_DIR in sbatch_qad.sh similar to the existing checks for other vars: detect if LOG_DIR is empty at the start of the script and print an error then exit (matching the pattern used for other variables), so mkdir -p "${LOG_DIR}" and the srun --output=${LOG_DIR}/... usage cannot proceed with an empty path; reference the LOG_DIR variable, the mkdir -p invocation, and the srun --output usage when placing the new check.

coderabbitai · 2026-03-31T16:39:45Z

examples/windows/torch_onnx/diffusers/qad_example/ltx2_qad.yaml

  load_checkpoint:
-  text_encoder_path: "/lustre/fsw/portfolios/adlr/users/dhutchins/models/gemma"
+  text_encoder_path: "/path/to/gemma"  # TODO: Set your Gemma text encoder path


⚠️ Potential issue | 🔴 Critical

Fix YAML nesting under load_checkpoint to preserve schema.

text_encoder_path is currently aligned as a sibling of load_checkpoint instead of a child key. This changes the config structure and can break consumers expecting model.load_checkpoint.text_encoder_path.

Proposed fix

model: model_path: "/path/to/ltx2/checkpoint.safetensors" # TODO: Set your LTX-2 checkpoint path training_mode: "full" load_checkpoint: - text_encoder_path: "/path/to/gemma" # TODO: Set your Gemma text encoder path + text_encoder_path: "/path/to/gemma" # TODO: Set your Gemma text encoder path

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

load_checkpoint:

text_encoder_path: "/lustre/fsw/portfolios/adlr/users/dhutchins/models/gemma"

text_encoder_path: "/path/to/gemma" # TODO: Set your Gemma text encoder path

load_checkpoint:

text_encoder_path: "/path/to/gemma" # TODO: Set your Gemma text encoder path

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@examples/windows/torch_onnx/diffusers/qad_example/ltx2_qad.yaml` around lines 5 - 6, The YAML key text_encoder_path is misindented as a sibling of load_checkpoint rather than a child, breaking the expected model.load_checkpoint.text_encoder_path schema; fix by nesting text_encoder_path under load_checkpoint (adjust indentation) so that the configuration key becomes model.load_checkpoint.text_encoder_path and consumers of load_checkpoint/text_encoder_path find the value correctly.

github-actions · 2026-03-31T16:40:25Z

PR Preview Action v1.8.1
🚀 View preview at https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1148/
Built to branch `gh-pages` at 2026-04-01 08:04 UTC. Preview will be ready when the GitHub Pages deployment is complete.

codecov · 2026-03-31T16:49:59Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.18%. Comparing base (74a8694) to head (c25c7be).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1148      +/-   ##
==========================================
- Coverage   70.21%   70.18%   -0.04%     
==========================================
  Files         230      230              
  Lines       26073    26080       +7     
==========================================
- Hits        18308    18304       -4     
- Misses       7765     7776      +11

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

shengliangxu · 2026-04-01T02:15:34Z

examples/llm_qad/sbatch_qad.sh

    fi
 fi

-# === Default Paths (override in config) ===


removing seems too brutal. keeping them with some sane default value is good that people know what env vars are expected.

I just followed whatever previous approach was followed. The .conf file has many such env variables and all are defined from that file

cjluo-nv

Summary: Removes hardcoded internal lustre filesystem paths from example configs and scripts, replacing them with empty placeholders and TODO comments that users must fill in.

Issues Found:

[Correctness] sbatch_qad.sh — After removing the defaults block, critical variables like MLM_DIR, MODELOPT_DIR, CONTAINER_IMAGE, CONTAINER_MOUNTS, CONTAINER_WORKDIR, LOG_DIR, QAD_CHECKPOINT_ROOT, and DATACACHE_DIR have no validation. Unlike TP_SIZE and MBS (which use ${VAR:?ERROR: ...}), these will silently be empty if a user forgets to set them in their config. This will cause confusing failures — e.g., mkdir -p "" on line 86, or srun --container-image "" at line 160. Consider adding :? required checks for at least the critical ones (CONTAINER_IMAGE, CONTAINER_WORKDIR, MLM_DIR, LOG_DIR).
[Correctness] sbatch_qad.sh — The default LOG_DIR="${LOG_DIR:-${QAD_CHECKPOINT_ROOT}/logs_slurm}" (line 67) is also removed, but the config templates set LOG_DIR="" (empty string). Since :- treats empty string as unset, the old fallback would have worked. After this PR, if the user leaves LOG_DIR="" in their config, LOG_DIR will be empty and the mkdir and srun --output will fail. The templates should either instruct users to set it, or the script should add a fallback like LOG_DIR="${LOG_DIR:-${QAD_CHECKPOINT_ROOT}/logs_slurm}" back (without the lustre path — just the relative fallback).
[Readability] sbatch_qad.sh — The removal leaves a double blank line (lines 59-60 in the post-diff). Minor nit.

Suggestions:

Consider adding a validation block after sourcing the config that checks all required path variables are non-empty, similar to the existing pattern for STUDENT_CKPT/TEACHER_CKPT on lines 114-115.
The LOG_DIR fallback to ${QAD_CHECKPOINT_ROOT}/logs_slurm was actually a reasonable default that didn't contain any lustre-specific paths. It could be kept.

Overall Assessment: The intent is correct and the lustre paths should be removed. However, the lack of validation for now-required variables is a usability regression — users will get cryptic runtime errors instead of clear messages. Issue #2 (LOG_DIR losing its non-lustre fallback) is a minor bug.

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

Remove internal lustre paths

ebc6aa8

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

kevalmorabia97 requested review from ChenhanYu, cjluo-nv, meenchen, vishalpandya1990 and ynankani March 31, 2026 16:36

kevalmorabia97 requested review from a team as code owners March 31, 2026 16:36

coderabbitai bot reviewed Mar 31, 2026

View reviewed changes

meenchen approved these changes Mar 31, 2026

View reviewed changes

shengliangxu reviewed Apr 1, 2026

View reviewed changes

ynankani approved these changes Apr 1, 2026

View reviewed changes

cjluo-nv reviewed Apr 1, 2026

View reviewed changes

cjluo-nv approved these changes Apr 1, 2026

View reviewed changes

minor

c25c7be

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove internal lustre paths#1148

Remove internal lustre paths#1148
kevalmorabia97 wants to merge 2 commits intomainfrom
kmorabia/remove-lustre

kevalmorabia97 commented Mar 31, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Mar 31, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Mar 31, 2026

Uh oh!

coderabbitai bot Mar 31, 2026

Uh oh!

github-actions bot commented Mar 31, 2026 •

edited

Loading

Built to branch `gh-pages` at 2026-04-01 08:04 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

Uh oh!

codecov bot commented Mar 31, 2026 •

edited

Loading

Uh oh!

shengliangxu Apr 1, 2026

Uh oh!

kevalmorabia97 Apr 1, 2026

Uh oh!

cjluo-nv left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

kevalmorabia97 commented Mar 31, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Built to branch gh-pages at 2026-04-01 08:04 UTC. Preview will be ready when the GitHub Pages deployment is complete.

Uh oh!

codecov bot commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

shengliangxu Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

kevalmorabia97 Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

cjluo-nv left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

kevalmorabia97 commented Mar 31, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 31, 2026 •

edited

Loading

github-actions bot commented Mar 31, 2026 •

edited

Loading

Built to branch `gh-pages` at 2026-04-01 08:04 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

codecov bot commented Mar 31, 2026 •

edited

Loading