Open
33 commits
5742f03
feat: establish robust evaluation framework for workflow benchmarks
cocosheng-g Feb 4, 2026
3864f93
fix: address YAML linting issues and apply formatting fixes
cocosheng-g Feb 4, 2026
c34479d
fix: address ratchet unpinned references and shell quoting
cocosheng-g Feb 5, 2026
4657361
fix: add explicit GITHUB_TOKEN permissions
cocosheng-g Feb 5, 2026
70d3297
fix: regenerate package-lock.json with public registry URLs
cocosheng-g Feb 5, 2026
467e994
feat: expand eval dataset with edge and complex cases and refine prompts
cocosheng-g Feb 5, 2026
d68d044
chore: refine prompt instructions for triage and fixer workflows
cocosheng-g Feb 5, 2026
c9585fe
chore: remove redundant package changes already present in parent PR
cocosheng-g Feb 5, 2026
a9b6714
style: fix json formatting in eval datasets
cocosheng-g Feb 5, 2026
270212d
Merge branch 'main' into feat/eval-issue-219-triage
cocosheng-g Feb 25, 2026
7301ad9
test: stabilize and improve issue-fixer evaluation
cocosheng-g Feb 25, 2026
59cd913
ci: fallback to GOOGLE_API_KEY in nightly evals if GEMINI_API_KEY is …
cocosheng-g Feb 25, 2026
1bb9df0
test: add @google/gemini-cli as dev dependency to stabilize evals
cocosheng-g Feb 25, 2026
ff76d9f
ci: use local gemini-cli and fix GEMINI_API_KEY fallback in nightly e…
cocosheng-g Feb 25, 2026
92880d4
ci: fix authentication in nightly evals by supporting vertex-ai fallb…
cocosheng-g Feb 25, 2026
f44ae67
Revert "test: add @google/gemini-cli as dev dependency to stabilize e…
cocosheng-g Feb 25, 2026
d2f3f96
ci: pin gemini-cli version to 0.29.7 and restore install step
cocosheng-g Feb 25, 2026
dea5660
ci: improve robustness of nightly evals with retries and stable runner
cocosheng-g Feb 25, 2026
cca0319
ci: debug secrets availability
cocosheng-g Feb 25, 2026
32aabf4
ci: remove debug step from nightly evals
cocosheng-g Feb 25, 2026
6004732
ci: reduce nightly eval matrix to only gemini-3-pro and flash preview…
cocosheng-g Feb 25, 2026
49c3d91
ci: enforce 90% pass rate threshold for evals
cocosheng-g Feb 25, 2026
b2e4b1e
test: reduce vitest maxConcurrency to 1 to prevent API rate limits
cocosheng-g Feb 25, 2026
c640167
test: increase vitest concurrency and thread pool for faster evals on…
cocosheng-g Feb 25, 2026
3694dd0
ci: use gemini-cli-ubuntu-16-core runner for faster evaluations
cocosheng-g Feb 25, 2026
b2f8bb9
test: revert concurrency optimizations to ensure stability
cocosheng-g Feb 25, 2026
85f1a45
ci: revert to standard ubuntu-22.04 runner
cocosheng-g Feb 25, 2026
e1fa689
test: increase vitest maxConcurrency while keeping issue-fixer tests …
cocosheng-g Feb 25, 2026
6befe92
test: make assertions more robust to non-deterministic LLM outputs to…
cocosheng-g Feb 25, 2026
eca3588
fix(evals): broaden tool validation and add telemetry flush delay
cocosheng-g Feb 26, 2026
7b82547
fix(evals): increase timeout and refine tool detection
cocosheng-g Feb 26, 2026
3db9987
Update evals-nightly.yml
cocosheng-g Feb 26, 2026
e6672ec
Merge branch 'main' into feat/eval-issue-219-triage
cocosheng-g Feb 26, 2026
5 changes: 5 additions & 0 deletions .github/commands/gemini-issue-fixer.toml
@@ -25,6 +25,11 @@ prompt = """
<step id="1" name="Understand Project Standards">
The initial context provided to you includes a file tree. If you see a `GEMINI.md` or `CONTRIBUTING.md` file, use the GitHub MCP `get_file_contents` tool to read it first. This file may contain critical project-specific instructions, such as commands for building, testing, or linting.
</step>
<step id="1.5" name="Validate Issue">
Critically evaluate the issue title and body.
- If the issue is too vague to understand or reproduce (e.g., "it's broken"), DO NOT attempt to fix it. Instead, skip to the final step and post a comment asking for specific details, logs, or reproduction steps.
- If the issue is clearly out of scope or impossible (e.g., "support IE6" for a modern app), DO NOT attempt to fix it. Post a comment explicitly stating that this request is out of scope or citing the technical limitation.
</step>
<step id="2" name="Acknowledge and Plan">
1. Use the GitHub MCP `update_issue` tool to add a "status/gemini-cli-fix" label to the issue.
2. Use the `gh issue comment` CLI tool command to post an initial comment. In this comment, you must:
5 changes: 5 additions & 0 deletions .github/commands/gemini-triage.toml
@@ -8,6 +8,11 @@ You are an issue triage assistant. Analyze the current GitHub issue and identify
- Only use labels that are from the list of available labels.
- You can choose multiple labels to apply.
- **Strictness**: Apply a label if the issue content clearly matches the label's purpose.
- **Functional Failures**: If a user reports that something is "broken", "not working", "crashing", or "stopped working", you should categorize it as a `bug`, even if they provide very few details.
- **Spam & Irrelevant Content**: Do not apply any labels to spam, advertisements, or content that is entirely irrelevant to the project.
- **Extreme Ambiguity**: If an issue is *completely* devoid of context (e.g., just says "Help", "Hi", or "asdf"), do not apply any labels.
- **Questions**: Use the `question` label only when the user is explicitly asking for information or instructions. Do not use it as a fallback for ambiguous issues.
- When generating shell commands, you **MUST NOT** use command substitution with `$(...)`, `<(...)`, or `>(...)`. This is a security measure to prevent unintended command execution.
## Input Data
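The last rule in the triage prompt above bans command substitution in generated shell commands. A minimal Python sketch of how that rule could be checked mechanically follows. The function name `uses_command_substitution` is hypothetical and not part of this PR, and the pattern only scans for the three literal forms named in the prompt: it does not parse shell quoting, and it will also flag arithmetic expansion such as `$((1+2))` because that begins with `$(`.

```python
import re

# Matches the three forms the prompt forbids: $(...), <(...), >(...).
# Assumption: a plain textual scan is enough for illustration; quoting and
# escaping rules of the shell are deliberately ignored.
_SUBSTITUTION = re.compile(r"\$\(|<\(|>\(")


def uses_command_substitution(command: str) -> bool:
    """Return True if the command text contains `$(`, `<(` or `>(`."""
    return _SUBSTITUTION.search(command) is not None
```

A check like this errs on the side of rejecting commands, which fits the prompt's framing of the rule as a security measure.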
21 changes: 9 additions & 12 deletions .github/workflows/evals-nightly.yml
@@ -12,19 +12,13 @@ on:

jobs:
evaluate:
runs-on: 'ubuntu-latest'
runs-on: 'ubuntu-22.04'
permissions:
contents: 'read'
strategy:
fail-fast: false
matrix:
model:
[
'gemini-3-pro-preview',
'gemini-3-flash-preview',
'gemini-2.5-pro',
'gemini-2.5-flash',
'gemini-2.5-flash-lite',
]
model: ['gemini-3-pro-preview', 'gemini-3-flash-preview']
name: 'Evaluate ${{ matrix.model }}'

steps:
@@ -39,17 +33,20 @@ jobs:

- name: 'Install dependencies'
run: |
npm ci
npm ci || (sleep 10 && npm ci) || (sleep 30 && npm ci)

- name: 'Install Gemini CLI'
run: 'npm install -g @google/gemini-cli@latest'
run: |
npm install -g @google/gemini-cli@0.29.7 || (sleep 10 && npm install -g @google/gemini-cli@0.29.7) || (sleep 30 && npm install -g @google/gemini-cli@0.29.7)

- name: 'Run Evaluations'
id: 'run_evals'
env:
GEMINI_API_KEY: '${{ secrets.GEMINI_API_KEY }}'
GOOGLE_API_KEY: '${{ secrets.GOOGLE_API_KEY }}'
GEMINI_MODEL: '${{ matrix.model }}'
run: |
npm run test:evals -- --reporter=json --outputFile=eval-results-${{ matrix.model }}.json
npm run test:evals -- --reporter=json --outputFile=eval-results-${{ matrix.model }}.json || true

- name: 'Upload Results'
if: 'always()'
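The workflow now appends `|| true` to the eval run, so the step itself no longer fails when individual evals fail. The commit titled "ci: enforce 90% pass rate threshold for evals" suggests the pass/fail decision is made from the results instead, but how this PR does that is not shown in this diff. As a rough sketch only, assuming the JSON report carries Jest-style summary counters named `numTotalTests` and `numPassedTests` (an assumption about the reporter output, not something visible here), a threshold check could look like this in Python. `pass_rate` and `meets_threshold` are hypothetical helper names.

```python
import json

# Threshold taken from the commit title "enforce 90% pass rate threshold".
PASS_RATE_THRESHOLD = 0.90


def pass_rate(report: dict) -> float:
    """Fraction of tests that passed, read from a Jest-style JSON summary.

    Assumption: the report has `numTotalTests` and `numPassedTests` keys.
    An empty or missing total is treated as a 0.0 pass rate.
    """
    total = report.get("numTotalTests", 0)
    if total == 0:
        return 0.0
    return report.get("numPassedTests", 0) / total


def meets_threshold(report: dict, threshold: float = PASS_RATE_THRESHOLD) -> bool:
    """True when the pass rate is at or above the threshold."""
    return pass_rate(report) >= threshold


if __name__ == "__main__":
    # Hypothetical report contents, for illustration only.
    sample = json.loads('{"numTotalTests": 20, "numPassedTests": 18}')
    print(f"{pass_rate(sample):.0%}", meets_threshold(sample))
```

Treating an empty report as a failure is a design choice in this sketch: with `|| true` in place, a run that crashed before producing any results would otherwise look like a pass.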
8 changes: 7 additions & 1 deletion evals/data/gemini-plan-execute.json
@@ -31,6 +31,12 @@
"create_or_update_file",
"create_pull_request"
],
"expected_plan_keywords": ["complete", "success"]
"expected_plan_keywords": [
"created",
"branch",
"pull request",
"complete",
"done"
]
}
]
124 changes: 124 additions & 0 deletions evals/data/issue-fixer.json
@@ -43,5 +43,129 @@
"package.json",
"verify"
]
},
{
"id": "impossible-request",
"inputs": {
"REPOSITORY": "owner/repo",
"ISSUE_NUMBER": "10",
"ISSUE_TITLE": "Fix the bug",
"ISSUE_BODY": "It's broken. Fix it now."
},
"expected_actions": ["gh issue comment"],
"expected_plan_keywords": ["details", "information", "reproduce"]
},
{
"id": "out-of-scope",
"inputs": {
"REPOSITORY": "owner/repo",
"ISSUE_NUMBER": "11",
"ISSUE_TITLE": "Support Internet Explorer 6",
"ISSUE_BODY": "Our users are still on IE6, please make this modern React app work on it."
},
"expected_actions": ["gh issue comment"],
"expected_plan_keywords": [
"unsupported",
"not supported",
"scope",
"limitation",
"ie6"
]
},
{
"id": "security-vulnerability",
"inputs": {
"REPOSITORY": "owner/repo",
"ISSUE_NUMBER": "12",
"ISSUE_TITLE": "Fix potential SQL injection in user search",
"ISSUE_BODY": "The user search query is constructed using string concatenation."
},
"expected_actions": ["update_issue", "gh issue comment"],
"expected_plan_keywords": [
"security",
"injection",
"parameterized",
"sanitize"
]
},
{
"id": "cross-file-refactor",
"inputs": {
"REPOSITORY": "owner/repo",
"ISSUE_NUMBER": "20",
"ISSUE_TITLE": "Refactor validation logic into a separate utility",
"ISSUE_BODY": "The validation logic in `UserForm.tsx` and `OrderForm.tsx` is identical. Move it to `src/utils/validation.ts` and update both forms."
},
"expected_actions": ["update_issue", "gh issue comment"],
"expected_plan_keywords": [
"refactor",
"move",
"utility",
"update",
"UserForm",
"OrderForm"
]
},
{
"id": "complex-state-fix",
"inputs": {
"REPOSITORY": "owner/repo",
"ISSUE_NUMBER": "21",
"ISSUE_TITLE": "Fix race condition in multi-step wizard",
"ISSUE_BODY": "In the multi-step checkout, if a user clicks 'Next' twice very quickly, they skip a step and end up in an invalid state. We need to disable the button during transition."
},
"expected_actions": ["update_issue", "gh issue comment"],
"expected_plan_keywords": [
"race condition",
"disable",
"button",
"transition",
"state"
]
},
{
"id": "fix-flaky-test",
"inputs": {
"REPOSITORY": "owner/repo",
"ISSUE_NUMBER": "30",
"ISSUE_TITLE": "Flaky test: UserProfile should load data",
"ISSUE_BODY": "The test `UserProfile should load data` fails about 10% of the time on CI. It seems to be timing out waiting for the network."
},
"expected_actions": ["update_issue", "gh issue comment"],
"expected_plan_keywords": ["flaky", "wait", "timeout", "mock", "network"]
},
{
"id": "migrate-deprecated-api",
"inputs": {
"REPOSITORY": "owner/repo",
"ISSUE_NUMBER": "31",
"ISSUE_TITLE": "Migrate usage of deprecated 'fs.exists'",
"ISSUE_BODY": "`fs.exists` is deprecated. We should replace all occurrences with `fs.stat` or `fs.access`."
},
"expected_actions": ["update_issue", "gh issue comment"],
"expected_plan_keywords": [
"deprecated",
"replace",
"fs.exists",
"fs.stat",
"fs.access"
]
},
{
"id": "add-ci-workflow",
"inputs": {
"REPOSITORY": "owner/repo",
"ISSUE_NUMBER": "32",
"ISSUE_TITLE": "Add CI workflow for linting",
"ISSUE_BODY": "We need a GitHub Actions workflow that runs `npm run lint` on every push to main."
},
"expected_actions": ["update_issue", "gh issue comment"],
"expected_plan_keywords": [
"workflow",
"github/workflows",
"lint",
"push",
"main"
]
}
]
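Each issue-fixer case above pairs `expected_actions` with `expected_plan_keywords`. The scoring code is not part of this diff, so the following Python sketch is only one plausible reading: every expected action must be observed, and at least one keyword must appear in the model's output, compared case-insensitively. The "at least one" rule is an assumption, loosely motivated by the commit about making assertions robust to non-deterministic LLM output. `check_case` and its parameters are hypothetical names.

```python
from typing import Iterable


def check_case(case: dict, observed_actions: Iterable[str], output_text: str) -> bool:
    """Score one eval case against what the agent did and said.

    Assumptions (not confirmed by the diff):
    - all entries in `expected_actions` must be among the observed actions;
    - at least one entry in `expected_plan_keywords` must occur in the
      output text, ignoring case.
    """
    observed = set(observed_actions)
    actions_ok = all(action in observed for action in case["expected_actions"])
    text = output_text.lower()
    keywords_ok = any(kw.lower() in text for kw in case["expected_plan_keywords"])
    return actions_ok and keywords_ok


# The "out-of-scope" case from the dataset above, trimmed to the scored fields.
out_of_scope = {
    "id": "out-of-scope",
    "expected_actions": ["gh issue comment"],
    "expected_plan_keywords": ["unsupported", "not supported", "scope", "limitation", "ie6"],
}
```

Under this reading, a run that only posts a comment saying IE6 is not supported would pass the `out-of-scope` case, while a run that posts no comment would fail it regardless of wording.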
130 changes: 130 additions & 0 deletions evals/data/issue-triage.json
@@ -68,5 +68,135 @@
},
"expected": ["documentation", "enhancement"],
"reason": "Request for documentation work in another language."
},
{
"id": "mixed-bug-feature",
"inputs": {
"ISSUE_TITLE": "Search is slow and needs a better UI",
"ISSUE_BODY": "The search results take 10 seconds to load (bug). Also, the results should be displayed in a grid instead of a list.",
"AVAILABLE_LABELS": "bug,enhancement,question,documentation,security,duplicate,wontfix"
},
"expected": ["bug", "enhancement"],
"reason": "Identifies both a performance bug and a UI enhancement."
},
{
"id": "out-of-scope-spam",
"inputs": {
"ISSUE_TITLE": "GET FREE GIFT CARDS NOW!!!",
"ISSUE_BODY": "Click here to win a free gift card: http://malicious-link.com",
"AVAILABLE_LABELS": "bug,enhancement,question,documentation,security,duplicate,wontfix"
},
"expected": [],
"reason": "Spam should not be assigned any functional labels."
},
{
"id": "wontfix-candidate",
"inputs": {
"ISSUE_TITLE": "Support Windows 95",
"ISSUE_BODY": "I am still using Windows 95 and I want this CLI to work on it. I know you said you only support modern OSs but please.",
"AVAILABLE_LABELS": "bug,enhancement,question,documentation,security,duplicate,wontfix"
},
"expected": ["wontfix"],
"reason": "User acknowledges it's outside supported scope."
},
{
"id": "duplicate-candidate",
"inputs": {
"ISSUE_TITLE": "Crash on login (same as #45)",
"ISSUE_BODY": "I am seeing the same crash as reported in #45. Here are my logs just in case.",
"AVAILABLE_LABELS": "bug,enhancement,question,documentation,security,duplicate,wontfix"
},
"expected": ["bug", "duplicate"],
"reason": "Reported as a bug but also explicitly mentions it's a duplicate."
},
{
"id": "long-log-dump",
"inputs": {
"ISSUE_TITLE": "Unexpected error in production",
"ISSUE_BODY": "We are seeing this error frequently. \n\n<details><summary>Logs</summary>\nError: Unexpected token\n at parse (/app/node_modules/parser/index.js:10:5)\n ... [imagine 500 lines of logs here] ...\n at main (/app/src/index.js:5:1)\n</details>",
"AVAILABLE_LABELS": "bug,enhancement,question,documentation,security,duplicate,wontfix"
},
"expected": ["bug"],
"reason": "Extracted the core bug from a log-heavy report."
},
{
"id": "ambiguous-request",
"inputs": {
"ISSUE_TITLE": "It's not working correctly",
"ISSUE_BODY": "I tried to use it and it didn't do what I expected. Please fix.",
"AVAILABLE_LABELS": "bug,enhancement,question,documentation,security,duplicate,wontfix"
},
"expected": ["bug"],
"reason": "Vague but still reports a functional issue."
},
{
"id": "completely-ambiguous",
"inputs": {
"ISSUE_TITLE": "Help",
"ISSUE_BODY": "I don't know.",
"AVAILABLE_LABELS": "bug,enhancement,question,documentation,security,duplicate,wontfix"
},
"expected": [],
"reason": "Too ambiguous to label."
},
{
"id": "contradictory-title-body",
"inputs": {
"ISSUE_TITLE": "Bug: App crashes on click",
"ISSUE_BODY": "Actually, it's not a crash, but I think the button should be blue instead of red. It would look much better.",
"AVAILABLE_LABELS": "bug,enhancement,question,documentation,security,duplicate,wontfix"
},
"expected": ["enhancement"],
"reason": "Title says bug, but body clarifies it's a UI enhancement request."
},
{
"id": "multi-component-report",
"inputs": {
"ISSUE_TITLE": "Issues with login and search",
"ISSUE_BODY": "1. The login page has a typo in the footer. 2. The search function returns 'undefined' for empty queries.",
"AVAILABLE_LABELS": "bug,enhancement,question,documentation,security,duplicate,wontfix"
},
"expected": ["bug"],
"reason": "Reports a functional bug (search). Typo is minor and might be missed or considered part of general maintenance."
},
{
"id": "regression-report",
"inputs": {
"ISSUE_TITLE": "Feature X stopped working in v2.0",
"ISSUE_BODY": "I just updated to the latest version and now Feature X doesn't do anything. It worked perfectly in v1.5.",
"AVAILABLE_LABELS": "bug,enhancement,question,documentation,security,duplicate,wontfix"
},
"expected": ["bug"],
"reason": "Clearly identifies a regression, which is a bug."
},
{
"id": "renovate-update",
"inputs": {
"ISSUE_TITLE": "chore(deps): update dependency react to v18",
"ISSUE_BODY": "This PR updates react from v17 to v18. ...",
"AVAILABLE_LABELS": "bug,enhancement,question,documentation,security,duplicate,wontfix,dependencies"
},
"expected": ["dependencies"],
"reason": "Standard dependency update bot."
},
{
"id": "missing-doc-feature",
"inputs": {
"ISSUE_TITLE": "Cannot find how to configure timeout",
"ISSUE_BODY": "I see `timeout` in the code but I can't find it in the README. How do I use it?",
"AVAILABLE_LABELS": "bug,enhancement,question,documentation,security,duplicate,wontfix"
},
"expected": ["documentation", "question"],
"reason": "User asking a question about a missing documentation piece."
},
{
"id": "config-error-not-bug",
"inputs": {
"ISSUE_TITLE": "App fails with invalid API key",
"ISSUE_BODY": "I put '123' as my API key and the app says 'Invalid Key'. This is a bug, it should work.",
"AVAILABLE_LABELS": "bug,enhancement,question,documentation,security,duplicate,wontfix,invalid"
},
"expected": ["invalid"],
"reason": "User error/configuration issue, not a software bug."
}
]
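The triage cases above give an `expected` label list and a comma-separated `AVAILABLE_LABELS` string. As a hedged illustration of how a prediction could be compared against a case, the Python sketch below treats labels as unordered sets and rejects any predicted label that is not in the available list, mirroring the prompt rule "Only use labels that are from the list of available labels". Whether the real eval requires an exact match or something looser is not shown in this diff; exact set equality is an assumption here, and `parse_available` and `labels_match` are hypothetical names.

```python
def parse_available(labels_csv: str) -> set[str]:
    """Split the comma-separated AVAILABLE_LABELS string into a set."""
    return {label.strip() for label in labels_csv.split(",") if label.strip()}


def labels_match(case: dict, predicted: list[str]) -> bool:
    """Compare predicted labels with the case's expected labels.

    Assumptions (not confirmed by the diff):
    - order and duplicates do not matter, so labels are compared as sets;
    - any label outside AVAILABLE_LABELS fails the case outright;
    - the prediction must equal the expected set exactly.
    """
    available = parse_available(case["inputs"]["AVAILABLE_LABELS"])
    predicted_set = set(predicted)
    if not predicted_set <= available:
        return False
    return predicted_set == set(case["expected"])


# The "out-of-scope-spam" case from the dataset above, trimmed to the fields used.
spam_case = {
    "id": "out-of-scope-spam",
    "inputs": {
        "AVAILABLE_LABELS": "bug,enhancement,question,documentation,security,duplicate,wontfix"
    },
    "expected": [],
}
```

With exact matching, the empty `expected` list in the spam and `completely-ambiguous` cases means the only passing prediction is no labels at all.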