Skip to content

eval: add optional setup script support for eval sessions#1673

Merged
dgageot merged 1 commit intodocker:mainfrom
dgageot:evals-setup
Feb 10, 2026
Merged

eval: add optional setup script support for eval sessions#1673
dgageot merged 1 commit intodocker:mainfrom
dgageot:evals-setup

Conversation

@dgageot
Copy link
Member

@dgageot dgageot commented Feb 10, 2026

Add a 'setup' property to EvalCriteria that allows defining a sh script to run inside the eval container before cagent exec. The script is written to a temp file, bind-mounted into the container, and executed as part of the entrypoint chain. This is useful for preparing the container environment (e.g., installing packages, initializing repos) before the agent runs.

Assisted-By: cagent

Add a 'setup' property to EvalCriteria that allows defining a sh script
to run inside the eval container before cagent exec. The script is
written to a temp file, bind-mounted into the container, and executed
as part of the entrypoint chain. This is useful for preparing the
container environment (e.g., installing packages, initializing repos)
before the agent runs.

Assisted-By: cagent
@dgageot dgageot requested a review from a team as a code owner February 10, 2026 07:26
Copy link

@docker-agent docker-agent bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

Reviewed the setup script support implementation. Found 2 medium-severity issues related to error handling and cleanup. Both are straightforward to fix and improve the robustness of the temporary file handling.

The core functionality is well-designed - the setup script is properly isolated in a temp file, mounted read-only into the container, and executed before the agent. The defer-based cleanup is correctly placed to execute after the container completes.

if setup != "" {
// Run setup script, then cagent exec with the original arguments.
args = append(args, "sh", "-c", "sh /setup.sh && exec /cagent exec --yolo --json \"$@\"", "--", "/configs/"+agentFile)
} else {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error handling issue for setup script failures

When a container exits with an error, the code cannot distinguish between setup script failure and cagent execution failure. Both result in the same error path (lines 446-453), making debugging difficult. The error message "container failed" doesn't indicate whether the failure occurred in the setup script or the cagent exec command.

Consider:

  1. Adding a marker/log line at the end of the setup script to indicate successful completion
  2. Checking for this marker to provide clearer error messages
  3. Or wrapping the setup script execution to capture its specific exit code

Example:

sh /setup.sh && echo '__SETUP_COMPLETE__' && exec /cagent exec --yolo --json "$@"

Then parse the output to determine if setup completed before reporting errors.


args = append(args,
"-v", setupFile+":/setup.sh:ro",
"--entrypoint", "/run.sh",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing error handling for file removal

The defer os.Remove(setupFile) silently ignores any errors during temporary file cleanup. While os.Remove() failures are uncommon on Unix-like systems, this is poor error handling practice. If the file cannot be removed due to permissions or file system issues, temporary files could accumulate on disk over repeated evaluations.

Consider logging cleanup errors:

defer func() {
    if err := os.Remove(setupFile); err != nil && !os.IsNotExist(err) {
        slog.Debug("failed to remove setup file", "path", setupFile, "error", err)
    }
}()

This makes cleanup failures visible for debugging without failing the evaluation.

@dgageot dgageot merged commit d7f1f6e into docker:main Feb 10, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants