Skip to content

fix: speed up batch queue processing by disabling cooloff and fixing retry race#3079

Open
ericallam wants to merge 1 commit intomainfrom
fix/batch-queue-processing
Open

fix: speed up batch queue processing by disabling cooloff and fixing retry race#3079
ericallam wants to merge 1 commit intomainfrom
fix/batch-queue-processing

Conversation

@ericallam
Copy link
Member

@ericallam ericallam commented Feb 17, 2026

Fix slow fair queue processing by removing spurious cooloff on concurrency blocks and fixing a race condition where retry attempt counts were not atomically updated during message re-queue.

Removed cooloff entirely from the batch queue

@changeset-bot
Copy link

changeset-bot bot commented Feb 17, 2026

🦋 Changeset detected

Latest commit: ba3a29e

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 28 packages
Name Type
@trigger.dev/redis-worker Patch
@internal/run-engine Patch
@internal/schedule-engine Patch
@trigger.dev/build Patch
@trigger.dev/core Patch
@trigger.dev/python Patch
@trigger.dev/react-hooks Patch
@trigger.dev/rsc Patch
@trigger.dev/schema-to-json Patch
@trigger.dev/sdk Patch
@trigger.dev/database Patch
@trigger.dev/otlp-importer Patch
trigger.dev Patch
d3-chat Patch
references-d3-openai-agents Patch
@internal/cache Patch
@internal/clickhouse Patch
@internal/redis Patch
@internal/replication Patch
@internal/testcontainers Patch
@internal/tracing Patch
@internal/tsql Patch
@internal/zod-worker Patch
references-nextjs-realtime Patch
references-realtime-hooks-test Patch
references-realtime-streams Patch
@internal/sdk-compat-tests Patch
references-telemetry Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@ericallam ericallam force-pushed the fix/batch-queue-processing branch from 9ea0ae2 to d275304 Compare February 17, 2026 15:49
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 17, 2026

No actionable comments were generated in the recent review. 🎉

📜 Recent review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d275304 and ba3a29e.

📒 Files selected for processing (6)
  • .changeset/fix-batch-queue-processing.md
  • apps/webapp/app/env.server.ts
  • internal-packages/run-engine/src/batch-queue/index.ts
  • packages/redis-worker/src/fair-queue/index.ts
  • packages/redis-worker/src/fair-queue/tests/fairQueue.test.ts
  • packages/redis-worker/src/fair-queue/visibility.ts
✅ Files skipped from review due to trivial changes (1)
  • .changeset/fix-batch-queue-processing.md
🚧 Files skipped from review as they are similar to previous changes (2)
  • packages/redis-worker/src/fair-queue/tests/fairQueue.test.ts
  • packages/redis-worker/src/fair-queue/index.ts
🧰 Additional context used
📓 Path-based instructions (8)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead

**/*.{ts,tsx}: Always import tasks from @trigger.dev/sdk, never use @trigger.dev/sdk/v3 or deprecated client.defineJob pattern
Every Trigger.dev task must be exported and have a unique id property with no timeouts in the run function

Files:

  • internal-packages/run-engine/src/batch-queue/index.ts
  • packages/redis-worker/src/fair-queue/visibility.ts
  • apps/webapp/app/env.server.ts
**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use function declarations instead of default exports

Import from @trigger.dev/core using subpaths only, never import from root

Files:

  • internal-packages/run-engine/src/batch-queue/index.ts
  • packages/redis-worker/src/fair-queue/visibility.ts
  • apps/webapp/app/env.server.ts
**/*.ts

📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)

**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries

Files:

  • internal-packages/run-engine/src/batch-queue/index.ts
  • packages/redis-worker/src/fair-queue/visibility.ts
  • apps/webapp/app/env.server.ts
**/*.{js,ts,jsx,tsx,json,md,yaml,yml}

📄 CodeRabbit inference engine (AGENTS.md)

Format code using Prettier before committing

Files:

  • internal-packages/run-engine/src/batch-queue/index.ts
  • packages/redis-worker/src/fair-queue/visibility.ts
  • apps/webapp/app/env.server.ts
{packages,integrations}/**/*

📄 CodeRabbit inference engine (CLAUDE.md)

Add a changeset when modifying any public package in packages/* or integrations/* using pnpm run changeset:add

Files:

  • packages/redis-worker/src/fair-queue/visibility.ts
{packages/core,apps/webapp}/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use zod for validation in packages/core and apps/webapp

Files:

  • apps/webapp/app/env.server.ts
apps/webapp/app/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)

Access all environment variables through the env export of env.server.ts instead of directly accessing process.env in the Trigger.dev webapp

Files:

  • apps/webapp/app/env.server.ts
apps/webapp/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)

apps/webapp/**/*.{ts,tsx}: When importing from @trigger.dev/core in the webapp, use subpath exports from the package.json instead of importing from the root path
Follow the Remix 2.1.0 and Express server conventions when updating the main trigger.dev webapp

Access environment variables via env export from apps/webapp/app/env.server.ts, never use process.env directly

Files:

  • apps/webapp/app/env.server.ts
🧠 Learnings (3)
📓 Common learnings
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 2870
File: apps/webapp/app/services/redisConcurrencyLimiter.server.ts:56-66
Timestamp: 2026-01-12T17:18:09.451Z
Learning: In `apps/webapp/app/services/redisConcurrencyLimiter.server.ts`, the query concurrency limiter will not be deployed with Redis Cluster mode, so multi-key operations (keyKey and globalKey in different hash slots) are acceptable and will function correctly in standalone Redis mode.
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Control concurrency using the `queue` property with `concurrencyLimit` option

Applied to files:

  • internal-packages/run-engine/src/batch-queue/index.ts
📚 Learning: 2025-11-14T16:03:06.917Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 2681
File: apps/webapp/app/services/platform.v3.server.ts:258-302
Timestamp: 2025-11-14T16:03:06.917Z
Learning: In `apps/webapp/app/services/platform.v3.server.ts`, the `getDefaultEnvironmentConcurrencyLimit` function intentionally throws an error (rather than falling back to org.maximumConcurrencyLimit) when the billing client returns undefined plan limits. This fail-fast behavior prevents users from receiving more concurrency than their plan entitles them to. The org.maximumConcurrencyLimit fallback is only for self-hosted deployments where no billing client exists.

Applied to files:

  • apps/webapp/app/env.server.ts
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
  • GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
  • GitHub Check: typecheck / typecheck
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
🔇 Additional comments (6)
apps/webapp/app/env.server.ts (1)

552-552: LGTM — default bumped from 1 to 5 to improve batch throughput.

Aligns with the PR objective. Existing deployments that already set this env var are unaffected.

internal-packages/run-engine/src/batch-queue/index.ts (1)

152-154: LGTM — cooloff disabled to prevent spurious stalls on concurrency saturation.

This directly addresses the PR objective. When all concurrency slots are occupied, the old cooloff would unnecessarily delay re-attempts to dequeue, slowing batch processing.

packages/redis-worker/src/fair-queue/visibility.ts (4)

281-312: LGTM — updatedData parameter cleanly extends the release path for atomic retry updates.

The optional parameter defaults to "" in TypeScript, which the Lua script correctly treats as "no update." This eliminates the previous race where retry attempt counts were updated via a separate HSET after the release, making the retry data update atomic with the re-enqueue.


686-699: Lua updatedData handling is correct.

The nil/empty-string guard (updatedData and updatedData ~= "") properly distinguishes "caller provided new data" from "use existing in-flight payload." Since Lua treats nil ARGV entries as absent and Redis passes missing trailing ARGV as not present, the and check is necessary and correct.


439-440: Correct: reclaimTimedOut passes empty string to preserve original payload on timeout reclaim.

This ensures timed-out messages are re-enqueued with their original data, not accidentally blanked.


820-831: All existing call sites of releaseMessage have been properly updated with the new updatedData argument. Direct calls to this.redis.releaseMessage() at lines 301 and 430 pass the argument correctly (as updatedData ?? "" and "" respectively), and the wrapper method visibilityManager.release() accepts it as an optional parameter with appropriate defaults. No compilation issues will occur.


Walkthrough

Updates in the redis-worker and related packages: removed per-queue cool-off increment when availableCapacity === 0; changed retry path to pass updated message JSON into the release/visibility Lua path instead of separate HSET; extended VisibilityManager.release and the releaseMessage Redis/Lua call to accept optional updatedData and use it to replace the in-flight payload; added a test ensuring concurrency blocks do not trigger cooloff; disabled cooloff in a FairQueue config; and bumped BATCH_CONCURRENCY_LIMIT_DEFAULT from 1 to 5 in env.server.ts. No exported/public API removals.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description is incomplete and missing several required sections from the template: no issue reference, missing checklist completion, no testing steps, and no changelog section. Add issue reference (Closes #), complete the checklist items, describe testing steps taken, and provide a formal changelog section following the template.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main changes: disabling cooloff and fixing retry race conditions during batch queue processing.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/batch-queue-processing

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai[bot]

This comment was marked as resolved.

devin-ai-integration[bot]

This comment was marked as resolved.

@ericallam ericallam force-pushed the fix/batch-queue-processing branch from d275304 to f38f2f9 Compare February 17, 2026 17:06
@ericallam ericallam force-pushed the fix/batch-queue-processing branch from f38f2f9 to ba3a29e Compare February 17, 2026 17:08
@ericallam ericallam changed the title fix: speed up batch queue processing by removing stalls and fixing retry race fix: speed up batch queue processing by disabling cooloff and fixing retry race Feb 17, 2026
@ericallam ericallam marked this pull request as ready for review February 17, 2026 18:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants