Skip to content

Remove internal lustre paths#1148

Open
kevalmorabia97 wants to merge 2 commits intomainfrom
kmorabia/remove-lustre
Open

Remove internal lustre paths#1148
kevalmorabia97 wants to merge 2 commits intomainfrom
kmorabia/remove-lustre

Conversation

@kevalmorabia97
Copy link
Copy Markdown
Collaborator

@kevalmorabia97 kevalmorabia97 commented Mar 31, 2026

Remove internal /lusre paths

Summary by CodeRabbit

  • Configuration Updates
    • Set a concrete default container image to an NVIDIA PyTorch container for templates.
    • Removed built-in default values in scripts; key environment/config variables must be provided externally.
    • Replaced absolute example paths with placeholder relative paths and TODO notes requiring user customization.

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 31, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9b4ce58f-61e8-4b96-9cd1-c19b9fac8d8b

📥 Commits

Reviewing files that changed from the base of the PR and between ebc6aa8 and c25c7be.

📒 Files selected for processing (3)
  • examples/llm_qad/configs/qwen3-30b-a3b-instruct-2507-moe_template.conf
  • examples/llm_qad/configs/qwen3-8b_template.conf
  • examples/llm_qad/sbatch_qad.sh
💤 Files with no reviewable changes (1)
  • examples/llm_qad/sbatch_qad.sh
✅ Files skipped from review due to trivial changes (1)
  • examples/llm_qad/configs/qwen3-8b_template.conf
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/llm_qad/configs/qwen3-30b-a3b-instruct-2507-moe_template.conf

📝 Walkthrough

Walkthrough

Set explicit default container image in two LLM QAD config templates, removed in-script default assignments from the sbatch script (forcing values to be provided externally), and replaced absolute paths with placeholder paths and TODOs in one YAML example.

Changes

Cohort / File(s) Summary
Config templates — container image defaults
examples/llm_qad/configs/qwen3-30b-a3b-instruct-2507-moe_template.conf, examples/llm_qad/configs/qwen3-8b_template.conf
Changed export CONTAINER_IMAGE from an empty string to nvcr.io/nvidia/pytorch:26.01-py3; updated inline comment to note it can be a container image or a .sqsh file.
Sbatch script — removed defaults
examples/llm_qad/sbatch_qad.sh
Removed in-script default assignments for multiple path/container variables (e.g., MLM_DIR, MODELOPT_DIR, MODELS_ROOT, QAD_CHECKPOINT_ROOT, DATACACHE_DIR, CONTAINER_IMAGE, CONTAINER_MOUNTS, CONTAINER_WORKDIR); these must now be supplied via sourced config or environment.
YAML example — placeholder paths
examples/windows/torch_onnx/diffusers/qad_example/ltx2_qad.yaml
Replaced absolute filesystem paths with /path/to/... placeholders and added inline # TODO comments for model.model_path, model.load_checkpoint.text_encoder_path, and data.preprocessed_data_root.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: removing internal lustre paths from configuration files and scripts.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Security Anti-Patterns ✅ Passed PR modifies only configuration files (.conf, .yaml, .sh) with no Python code changes containing security anti-patterns like torch.load(), eval(), exec(), or trust_remote_code.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch kmorabia/remove-lustre

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/llm_qad/configs/qwen3-8b_template.conf`:
- Line 58: Add a required-variable validation for LOG_DIR in sbatch_qad.sh
similar to the existing checks for other vars: detect if LOG_DIR is empty at the
start of the script and print an error then exit (matching the pattern used for
other variables), so mkdir -p "${LOG_DIR}" and the srun --output=${LOG_DIR}/...
usage cannot proceed with an empty path; reference the LOG_DIR variable, the
mkdir -p invocation, and the srun --output usage when placing the new check.

In `@examples/windows/torch_onnx/diffusers/qad_example/ltx2_qad.yaml`:
- Around line 5-6: The YAML key text_encoder_path is misindented as a sibling of
load_checkpoint rather than a child, breaking the expected
model.load_checkpoint.text_encoder_path schema; fix by nesting text_encoder_path
under load_checkpoint (adjust indentation) so that the configuration key becomes
model.load_checkpoint.text_encoder_path and consumers of
load_checkpoint/text_encoder_path find the value correctly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e867e57c-47a3-4ef0-9fcd-5b65630fdba7

📥 Commits

Reviewing files that changed from the base of the PR and between 74a8694 and ebc6aa8.

📒 Files selected for processing (4)
  • examples/llm_qad/configs/qwen3-30b-a3b-instruct-2507-moe_template.conf
  • examples/llm_qad/configs/qwen3-8b_template.conf
  • examples/llm_qad/sbatch_qad.sh
  • examples/windows/torch_onnx/diffusers/qad_example/ltx2_qad.yaml
💤 Files with no reviewable changes (1)
  • examples/llm_qad/sbatch_qad.sh

export STUDENT_CONFIG_FILE="" # path to student model args script, e.g., ${MLM_DIR}/examples/post_training/modelopt/conf/Qwen/Qwen3-8B.sh
export QAD_CHECKPOINT_ROOT="" # path to store QAD checkpoints
export DATACACHE_DIR="" # path to data cache directory
export LOG_DIR="" # path to log directory
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if sbatch_qad.sh has any validation for LOG_DIR or other required variables

echo "=== Checking for LOG_DIR validation in sbatch_qad.sh ==="
rg -n 'LOG_DIR' examples/llm_qad/sbatch_qad.sh

echo ""
echo "=== Checking for any variable validation patterns (e.g., -z checks) ==="
rg -n '\[\[.*-z.*\]\]|\[\s+-z' examples/llm_qad/sbatch_qad.sh

echo ""
echo "=== Checking for required variable error handling ==="
rg -n 'required|REQUIRED|must be set|not set' examples/llm_qad/sbatch_qad.sh

Repository: NVIDIA/Model-Optimizer

Length of output: 988


Add validation for LOG_DIR in sbatch_qad.sh before use.

The script validates other required variables (lines 63-64, 102-103) but omits validation for LOG_DIR, which is marked as REQUIRED in the config. When empty, mkdir -p "" silently does nothing (line 74) and srun --output=${LOG_DIR}/%x_... (lines 146-147) becomes --output=/%x_..., attempting to write logs to the root filesystem.

Add a validation check at the start of the script, similar to the existing patterns:

[[ -z "${LOG_DIR:-}" ]] && echo "ERROR: LOG_DIR required" && exit 1
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_qad/configs/qwen3-8b_template.conf` at line 58, Add a
required-variable validation for LOG_DIR in sbatch_qad.sh similar to the
existing checks for other vars: detect if LOG_DIR is empty at the start of the
script and print an error then exit (matching the pattern used for other
variables), so mkdir -p "${LOG_DIR}" and the srun --output=${LOG_DIR}/... usage
cannot proceed with an empty path; reference the LOG_DIR variable, the mkdir -p
invocation, and the srun --output usage when placing the new check.

Comment on lines 5 to +6
load_checkpoint:
text_encoder_path: "/lustre/fsw/portfolios/adlr/users/dhutchins/models/gemma"
text_encoder_path: "/path/to/gemma" # TODO: Set your Gemma text encoder path
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Fix YAML nesting under load_checkpoint to preserve schema.

text_encoder_path is currently aligned as a sibling of load_checkpoint instead of a child key. This changes the config structure and can break consumers expecting model.load_checkpoint.text_encoder_path.

Proposed fix
 model:
   model_path: "/path/to/ltx2/checkpoint.safetensors"  # TODO: Set your LTX-2 checkpoint path
   training_mode: "full"
   load_checkpoint:
-  text_encoder_path: "/path/to/gemma"  # TODO: Set your Gemma text encoder path
+    text_encoder_path: "/path/to/gemma"  # TODO: Set your Gemma text encoder path
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
load_checkpoint:
text_encoder_path: "/lustre/fsw/portfolios/adlr/users/dhutchins/models/gemma"
text_encoder_path: "/path/to/gemma" # TODO: Set your Gemma text encoder path
load_checkpoint:
text_encoder_path: "/path/to/gemma" # TODO: Set your Gemma text encoder path
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/windows/torch_onnx/diffusers/qad_example/ltx2_qad.yaml` around lines
5 - 6, The YAML key text_encoder_path is misindented as a sibling of
load_checkpoint rather than a child, breaking the expected
model.load_checkpoint.text_encoder_path schema; fix by nesting text_encoder_path
under load_checkpoint (adjust indentation) so that the configuration key becomes
model.load_checkpoint.text_encoder_path and consumers of
load_checkpoint/text_encoder_path find the value correctly.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 31, 2026

PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1148/

Built to branch gh-pages at 2026-04-01 08:04 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@codecov
Copy link
Copy Markdown

codecov bot commented Mar 31, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.18%. Comparing base (74a8694) to head (c25c7be).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1148      +/-   ##
==========================================
- Coverage   70.21%   70.18%   -0.04%     
==========================================
  Files         230      230              
  Lines       26073    26080       +7     
==========================================
- Hits        18308    18304       -4     
- Misses       7765     7776      +11     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

fi
fi

# === Default Paths (override in config) ===
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removing seems too brutal. keeping them with some sane default value is good that people know what env vars are expected.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just followed whatever previous approach was followed. The .conf file has many such env variables and all are defined from that file

Copy link
Copy Markdown
Collaborator

@cjluo-nv cjluo-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary: Removes hardcoded internal lustre filesystem paths from example configs and scripts, replacing them with empty placeholders and TODO comments that users must fill in.

Issues Found:

  1. [Correctness] sbatch_qad.sh — After removing the defaults block, critical variables like MLM_DIR, MODELOPT_DIR, CONTAINER_IMAGE, CONTAINER_MOUNTS, CONTAINER_WORKDIR, LOG_DIR, QAD_CHECKPOINT_ROOT, and DATACACHE_DIR have no validation. Unlike TP_SIZE and MBS (which use ${VAR:?ERROR: ...}), these will silently be empty if a user forgets to set them in their config. This will cause confusing failures — e.g., mkdir -p "" on line 86, or srun --container-image "" at line 160. Consider adding :? required checks for at least the critical ones (CONTAINER_IMAGE, CONTAINER_WORKDIR, MLM_DIR, LOG_DIR).

  2. [Correctness] sbatch_qad.sh — The default LOG_DIR="${LOG_DIR:-${QAD_CHECKPOINT_ROOT}/logs_slurm}" (line 67) is also removed, but the config templates set LOG_DIR="" (empty string). Since :- treats empty string as unset, the old fallback would have worked. After this PR, if the user leaves LOG_DIR="" in their config, LOG_DIR will be empty and the mkdir and srun --output will fail. The templates should either instruct users to set it, or the script should add a fallback like LOG_DIR="${LOG_DIR:-${QAD_CHECKPOINT_ROOT}/logs_slurm}" back (without the lustre path — just the relative fallback).

  3. [Readability] sbatch_qad.sh — The removal leaves a double blank line (lines 59-60 in the post-diff). Minor nit.

Suggestions:

  • Consider adding a validation block after sourcing the config that checks all required path variables are non-empty, similar to the existing pattern for STUDENT_CKPT/TEACHER_CKPT on lines 114-115.
  • The LOG_DIR fallback to ${QAD_CHECKPOINT_ROOT}/logs_slurm was actually a reasonable default that didn't contain any lustre-specific paths. It could be kept.

Overall Assessment: The intent is correct and the lustre paths should be removed. However, the lack of validation for now-required variables is a usability regression — users will get cryptic runtime errors instead of clear messages. Issue #2 (LOG_DIR losing its non-lustre fallback) is a minor bug.

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants