Conversation
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
💤 Files with no reviewable changes (1)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughSet explicit default container image in two LLM QAD config templates, removed in-script default assignments from the sbatch script (forcing values to be provided externally), and replaced absolute paths with placeholder paths and TODOs in one YAML example. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/llm_qad/configs/qwen3-8b_template.conf`:
- Line 58: Add a required-variable validation for LOG_DIR in sbatch_qad.sh
similar to the existing checks for other vars: detect if LOG_DIR is empty at the
start of the script and print an error then exit (matching the pattern used for
other variables), so mkdir -p "${LOG_DIR}" and the srun --output=${LOG_DIR}/...
usage cannot proceed with an empty path; reference the LOG_DIR variable, the
mkdir -p invocation, and the srun --output usage when placing the new check.
In `@examples/windows/torch_onnx/diffusers/qad_example/ltx2_qad.yaml`:
- Around line 5-6: The YAML key text_encoder_path is misindented as a sibling of
load_checkpoint rather than a child, breaking the expected
model.load_checkpoint.text_encoder_path schema; fix by nesting text_encoder_path
under load_checkpoint (adjust indentation) so that the configuration key becomes
model.load_checkpoint.text_encoder_path and consumers of
load_checkpoint/text_encoder_path find the value correctly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e867e57c-47a3-4ef0-9fcd-5b65630fdba7
📒 Files selected for processing (4)
examples/llm_qad/configs/qwen3-30b-a3b-instruct-2507-moe_template.confexamples/llm_qad/configs/qwen3-8b_template.confexamples/llm_qad/sbatch_qad.shexamples/windows/torch_onnx/diffusers/qad_example/ltx2_qad.yaml
💤 Files with no reviewable changes (1)
- examples/llm_qad/sbatch_qad.sh
| export STUDENT_CONFIG_FILE="" # path to student model args script, e.g., ${MLM_DIR}/examples/post_training/modelopt/conf/Qwen/Qwen3-8B.sh | ||
| export QAD_CHECKPOINT_ROOT="" # path to store QAD checkpoints | ||
| export DATACACHE_DIR="" # path to data cache directory | ||
| export LOG_DIR="" # path to log directory |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check if sbatch_qad.sh has any validation for LOG_DIR or other required variables
echo "=== Checking for LOG_DIR validation in sbatch_qad.sh ==="
rg -n 'LOG_DIR' examples/llm_qad/sbatch_qad.sh
echo ""
echo "=== Checking for any variable validation patterns (e.g., -z checks) ==="
rg -n '\[\[.*-z.*\]\]|\[\s+-z' examples/llm_qad/sbatch_qad.sh
echo ""
echo "=== Checking for required variable error handling ==="
rg -n 'required|REQUIRED|must be set|not set' examples/llm_qad/sbatch_qad.shRepository: NVIDIA/Model-Optimizer
Length of output: 988
Add validation for LOG_DIR in sbatch_qad.sh before use.
The script validates other required variables (lines 63-64, 102-103) but omits validation for LOG_DIR, which is marked as REQUIRED in the config. When empty, mkdir -p "" silently does nothing (line 74) and srun --output=${LOG_DIR}/%x_... (lines 146-147) becomes --output=/%x_..., attempting to write logs to the root filesystem.
Add a validation check at the start of the script, similar to the existing patterns:
[[ -z "${LOG_DIR:-}" ]] && echo "ERROR: LOG_DIR required" && exit 1
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/llm_qad/configs/qwen3-8b_template.conf` at line 58, Add a
required-variable validation for LOG_DIR in sbatch_qad.sh similar to the
existing checks for other vars: detect if LOG_DIR is empty at the start of the
script and print an error then exit (matching the pattern used for other
variables), so mkdir -p "${LOG_DIR}" and the srun --output=${LOG_DIR}/... usage
cannot proceed with an empty path; reference the LOG_DIR variable, the mkdir -p
invocation, and the srun --output usage when placing the new check.
| load_checkpoint: | ||
| text_encoder_path: "/lustre/fsw/portfolios/adlr/users/dhutchins/models/gemma" | ||
| text_encoder_path: "/path/to/gemma" # TODO: Set your Gemma text encoder path |
There was a problem hiding this comment.
Fix YAML nesting under load_checkpoint to preserve schema.
text_encoder_path is currently aligned as a sibling of load_checkpoint instead of a child key. This changes the config structure and can break consumers expecting model.load_checkpoint.text_encoder_path.
Proposed fix
model:
model_path: "/path/to/ltx2/checkpoint.safetensors" # TODO: Set your LTX-2 checkpoint path
training_mode: "full"
load_checkpoint:
- text_encoder_path: "/path/to/gemma" # TODO: Set your Gemma text encoder path
+ text_encoder_path: "/path/to/gemma" # TODO: Set your Gemma text encoder path📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| load_checkpoint: | |
| text_encoder_path: "/lustre/fsw/portfolios/adlr/users/dhutchins/models/gemma" | |
| text_encoder_path: "/path/to/gemma" # TODO: Set your Gemma text encoder path | |
| load_checkpoint: | |
| text_encoder_path: "/path/to/gemma" # TODO: Set your Gemma text encoder path |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/windows/torch_onnx/diffusers/qad_example/ltx2_qad.yaml` around lines
5 - 6, The YAML key text_encoder_path is misindented as a sibling of
load_checkpoint rather than a child, breaking the expected
model.load_checkpoint.text_encoder_path schema; fix by nesting text_encoder_path
under load_checkpoint (adjust indentation) so that the configuration key becomes
model.load_checkpoint.text_encoder_path and consumers of
load_checkpoint/text_encoder_path find the value correctly.
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1148 +/- ##
==========================================
- Coverage 70.21% 70.18% -0.04%
==========================================
Files 230 230
Lines 26073 26080 +7
==========================================
- Hits 18308 18304 -4
- Misses 7765 7776 +11 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| fi | ||
| fi | ||
|
|
||
| # === Default Paths (override in config) === |
There was a problem hiding this comment.
removing seems too brutal. keeping them with some sane default value is good that people know what env vars are expected.
There was a problem hiding this comment.
I just followed whatever previous approach was followed. The .conf file has many such env variables and all are defined from that file
cjluo-nv
left a comment
There was a problem hiding this comment.
Summary: Removes hardcoded internal lustre filesystem paths from example configs and scripts, replacing them with empty placeholders and TODO comments that users must fill in.
Issues Found:
-
[Correctness]
sbatch_qad.sh— After removing the defaults block, critical variables likeMLM_DIR,MODELOPT_DIR,CONTAINER_IMAGE,CONTAINER_MOUNTS,CONTAINER_WORKDIR,LOG_DIR,QAD_CHECKPOINT_ROOT, andDATACACHE_DIRhave no validation. UnlikeTP_SIZEandMBS(which use${VAR:?ERROR: ...}), these will silently be empty if a user forgets to set them in their config. This will cause confusing failures — e.g.,mkdir -p ""on line 86, orsrun --container-image ""at line 160. Consider adding:?required checks for at least the critical ones (CONTAINER_IMAGE,CONTAINER_WORKDIR,MLM_DIR,LOG_DIR). -
[Correctness]
sbatch_qad.sh— The defaultLOG_DIR="${LOG_DIR:-${QAD_CHECKPOINT_ROOT}/logs_slurm}"(line 67) is also removed, but the config templates setLOG_DIR=""(empty string). Since:-treats empty string as unset, the old fallback would have worked. After this PR, if the user leavesLOG_DIR=""in their config,LOG_DIRwill be empty and themkdirandsrun --outputwill fail. The templates should either instruct users to set it, or the script should add a fallback likeLOG_DIR="${LOG_DIR:-${QAD_CHECKPOINT_ROOT}/logs_slurm}"back (without the lustre path — just the relative fallback). -
[Readability]
sbatch_qad.sh— The removal leaves a double blank line (lines 59-60 in the post-diff). Minor nit.
Suggestions:
- Consider adding a validation block after sourcing the config that checks all required path variables are non-empty, similar to the existing pattern for
STUDENT_CKPT/TEACHER_CKPTon lines 114-115. - The
LOG_DIRfallback to${QAD_CHECKPOINT_ROOT}/logs_slurmwas actually a reasonable default that didn't contain any lustre-specific paths. It could be kept.
Overall Assessment: The intent is correct and the lustre paths should be removed. However, the lack of validation for now-required variables is a usability regression — users will get cryptic runtime errors instead of clear messages. Issue #2 (LOG_DIR losing its non-lustre fallback) is a minor bug.
Remove internal /lusre paths
Summary by CodeRabbit