
Add retry utilities and per-task retry support#874

Merged
threepointone merged 4 commits into main from retries on Feb 15, 2026


@threepointone threepointone commented Feb 9, 2026

Summary

Adds structured retry support to the Agents SDK — a consistent, configurable system that works across schedules, queues, MCP connections, and user code.

  • this.retry(fn, options?) — retry any async operation with exponential backoff and jitter
  • queue(), schedule(), scheduleEvery() accept per-task { retry?: RetryOptions }
  • addMcpServer() accepts { retry?: RetryOptions } for connection retries
  • Class-level defaults via static options = { retry: { ... } }
  • Internal retries for workflow operations with DO-aware error detection
  • Bonus: getQueue(), getQueues(), getSchedule(), dequeue(), dequeueAll(), dequeueAllByCallback() made synchronous (they were async but only did sync SQL work)
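The sync-getter change in the last bullet can be illustrated with a small stand-in. This is a sketch only: `getQueueSync`, `QueueItem`, and the in-memory `rows` array are hypothetical and do not reflect the SDK's storage layer. It shows why callers that still `await` the result are unaffected.

```typescript
// Hypothetical row shape and in-memory store standing in for the SQL table.
type QueueItem = { id: string; callback: string };

const rows: QueueItem[] = [{ id: "a1", callback: "processJob" }];

// After the change: a plain synchronous lookup, no Promise involved.
function getQueueSync(id: string): QueueItem | undefined {
  return rows.find((row) => row.id === id);
}

// Existing callers that still `await` keep working: awaiting a
// non-Promise value yields that value (after a microtask tick).
async function legacyCaller(): Promise<string | undefined> {
  const item = await getQueueSync("a1");
  return item?.callback;
}

// New callers can drop the await entirely.
const direct = getQueueSync("a1")?.callback;
```

Both `direct` and the value resolved by `legacyCaller()` are `"processJob"`, so existing call sites do not need to change.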

Design

Full design doc in design/retries.md. Key decisions:

  • Full jitter backoff: AWS "Full Jitter" strategy. Best p99 latency, simplest implementation.
  • Internal-first primitives: tryN, jitterBackoff, isErrorRetryable, validateRetryOptions live in src/retries.ts and are not re-exported. Only RetryOptions (type) and this.retry() (method) are public API.
  • Retry options stored in SQLite — per-task retry options are persisted as JSON in a retry_options TEXT column so they survive DO hibernation.
  • shouldRetry(err, nextAttempt) only on this.retry() — functions can't be serialized to SQLite, so queue()/schedule() don't support it. The predicate receives both the error and the next attempt number for attempt-aware retry decisions.
  • Eager validation against resolved defaults: validateRetryOptions(options, defaults) runs at enqueue/schedule/retry time and resolves partial options against class-level defaults before cross-field checks. { baseDelayMs: 5000 } against default maxDelayMs: 3000 throws immediately, not hours later when the task executes. Also enforces integer maxAttempts and guards against NaN/Infinity.
  • Retry observability: queue:retry and schedule:retry events emitted before each retry attempt with callback, id, attempt number, and maxAttempts.
  • Cached _resolvedOptions — static options are computed once and reused, not rebuilt on every queue/schedule/retry call.
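For readers unfamiliar with the "Full Jitter" strategy named above, here is a minimal sketch of the idea. It is not the code in src/retries.ts: the function name `fullJitterDelay`, the zero-based `attempt` convention, and the injectable `random` parameter are assumptions made for illustration.

```typescript
// Full jitter: pick a uniformly random delay in
// [0, min(maxDelayMs, baseDelayMs * 2^attempt)).
// `attempt` is zero-based here (0 = delay before the first retry);
// that convention is an assumption of this sketch.
function fullJitterDelay(
  attempt: number,
  baseDelayMs: number,
  maxDelayMs: number,
  random: () => number = Math.random // injectable so the sketch is testable
): number {
  const ceiling = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
  return random() * ceiling;
}
```

With a 100ms base and a 3000ms cap, the ceiling grows 100, 200, 400, 800, 1600 and is clamped to 3000 from the sixth delay onward. Each actual delay is drawn uniformly below that ceiling, which spreads out retries from many callers.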

Defaults

| Setting | Default | Rationale |
| --- | --- | --- |
| maxAttempts | 3 | Enough for transient blips, not so many that a broken service blocks the agent |
| baseDelayMs | 100ms | Fast first retry for quick recovery |
| maxDelayMs | 3000ms | Cap at 3s to avoid blocking the DO event loop too long |

Workflow operations use 200ms base / 3s max. MCP connections use 500ms base / 5s max.
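The interaction between these defaults and eager validation can be sketched as follows. This is an illustrative approximation, not validateRetryOptions itself: the names `RetryOptionsSketch`, `resolveAndValidate`, and `DEFAULTS`, the exact error messages, and the rule that delays must be strictly positive are all assumptions.

```typescript
interface RetryOptionsSketch {
  maxAttempts?: number;
  baseDelayMs?: number;
  maxDelayMs?: number;
}
type ResolvedRetry = Required<RetryOptionsSketch>;

// Defaults taken from the table in this PR description.
const DEFAULTS: ResolvedRetry = { maxAttempts: 3, baseDelayMs: 100, maxDelayMs: 3000 };

function resolveAndValidate(
  options: RetryOptionsSketch = {},
  defaults: ResolvedRetry = DEFAULTS
): ResolvedRetry {
  // Resolve partial options first, so cross-field checks see final values.
  const resolved: ResolvedRetry = {
    maxAttempts: options.maxAttempts ?? defaults.maxAttempts,
    baseDelayMs: options.baseDelayMs ?? defaults.baseDelayMs,
    maxDelayMs: options.maxDelayMs ?? defaults.maxDelayMs,
  };
  // Number.isInteger rejects NaN, Infinity, and fractional values.
  if (!Number.isInteger(resolved.maxAttempts) || resolved.maxAttempts < 1) {
    throw new Error("maxAttempts must be an integer >= 1");
  }
  for (const key of ["baseDelayMs", "maxDelayMs"] as const) {
    const value = resolved[key];
    // Assumption: delays must be finite and strictly positive.
    if (!Number.isFinite(value) || value <= 0) {
      throw new Error(`${key} must be a finite number > 0`);
    }
  }
  if (resolved.baseDelayMs > resolved.maxDelayMs) {
    throw new Error("baseDelayMs must not exceed maxDelayMs");
  }
  return resolved;
}
```

Under these assumptions, `resolveAndValidate({ baseDelayMs: 5000 })` throws at call time because the resolved maxDelayMs is still 3000, while `resolveAndValidate({ baseDelayMs: 5000, maxDelayMs: 10000 })` passes.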

What's included

| Area | Files | What changed |
| --- | --- | --- |
| Core primitives | src/retries.ts | RetryOptions, tryN, jitterBackoff, isErrorRetryable, validateRetryOptions |
| Agent class | src/index.ts | this.retry(), retry on queue()/schedule(), class-level defaults, observability events, cached options, DRY helpers, sync getters |
| MCP client | src/mcp/client.ts | Retry options on addMcpServer(), persisted in server_options |
| Observability | src/observability/agent.ts | queue:retry and schedule:retry event types |
| Unit tests | src/tests/retries.test.ts | 32 tests: primitives, validation, NaN/Infinity guards, integer enforcement, validation with defaults |
| Integration tests | src/tests/retry-integration.test.ts | 18 tests: this.retry(), shouldRetry with attempt number, queue/schedule retry, eager validation, class-level defaults |
| Test agents | src/tests/agents/retry.ts | TestRetryAgent, TestRetryDefaultsAgent |
| Design doc | design/retries.md | Architecture, decisions, tradeoffs |
| User docs | docs/retries.md | Quick start, API reference, patterns, limitations |
| Doc updates | docs/scheduling.md, docs/queue.md, docs/mcp-client.md | Updated types and signatures |
| Playground demo | examples/playground/src/demos/core/ | Interactive retry demo with 3 scenarios |
| Changesets | .changeset/retry-utilities.md, .changeset/sync-getters.md | Minor (retries) + patch (sync getters) |

Known limitations

  • No dead-letter queue — failed tasks are dequeued after exhausting retries. Logged and routed through onError().
  • No circuit breaker — each task exhausts its retry budget independently.
  • Queue retries are head-of-line blocking — one failing item's retries delay all subsequent items. Use this.retry() inside the callback for independent retry.
  • Retry delays block the DO: tryN uses setTimeout between attempts. For short delays (100ms–3s) this is fine. For longer recovery, use schedule() instead.
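To make the last two limitations concrete, here is a rough sketch of a retry loop that waits with setTimeout between attempts. It is not the SDK's tryN: `retryLoop`, `LoopOptions`, and the injectable `sleep` are hypothetical names, and only the `shouldRetry(err, nextAttempt)` shape comes from this PR description.

```typescript
type ShouldRetry = (err: unknown, nextAttempt: number) => boolean;

interface LoopOptions {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  shouldRetry?: ShouldRetry;
  sleep?: (ms: number) => Promise<void>; // hypothetical hook, handy in tests
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function retryLoop<T>(fn: () => Promise<T>, opts: LoopOptions): Promise<T> {
  const sleep = opts.sleep ?? defaultSleep;
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const nextAttempt = attempt + 1;
      const exhausted = nextAttempt > opts.maxAttempts;
      const vetoed = opts.shouldRetry !== undefined && !opts.shouldRetry(err, nextAttempt);
      if (exhausted || vetoed) throw err;
      // Full-jitter delay; the caller is suspended here until it elapses,
      // which is why long delays are better expressed as a scheduled task.
      const ceiling = Math.min(opts.maxDelayMs, opts.baseDelayMs * 2 ** (attempt - 1));
      await sleep(Math.random() * ceiling);
    }
  }
}
```

Because the loop awaits each delay inline, whatever called it (a queue flush, for instance) cannot move on until the loop either succeeds or gives up. That is the head-of-line blocking described above.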

Notes for reviewers

  1. Start with design/retries.md — it covers architecture, every key decision, and tradeoffs. The code will make more sense after reading it.

  2. src/retries.ts is the core — ~160 lines. tryN is the only retry loop; everything else composes on top of it. If the primitives look right, the rest follows.

  3. _flushQueue() and the alarm handler in src/index.ts are the most impactful integration points. They use shared parseRetryOptions() and resolveRetryConfig() helpers to read retry_options from the DB row, merge with class-level defaults, and pass to tryN. Payload parsing is hoisted outside the retry loop to avoid repeated deserialization.

  4. Validation is the tightest part: validateRetryOptions accepts optional defaults so that cross-field checks work against resolved values (e.g. explicit baseDelayMs: 5000 against default maxDelayMs: 3000). tryN also validates its own inputs with Number.isFinite() guards. Both paths use consistent error messages.

  5. Test agents catch errors internally — callable methods that test error paths return { error: string } instead of throwing. This avoids unhandled promise rejections in the workerd runtime (thrown errors in @callable methods cross the RPC boundary and appear as uncaught rejections even when vitest catches them). This let us remove dangerouslyIgnoreUnhandledErrors from the vitest config.

  6. The playground demo is functional — run cd examples/playground && npm run dev and navigate to Core > Retry. Three interactive scenarios: flaky operation, shouldRetry filter, and queue with retry.

  7. Backward compatible — all new parameters are optional, the retry_options TEXT column is added via migration, and the sync getter change is safe (await on a non-Promise resolves immediately).
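The persistence path mentioned in item 3 (JSON in a retry_options TEXT column, parsed and merged with class-level defaults) can be sketched roughly as below. The helper names `serializeRetry`, `parseRetrySketch`, and `resolveRetrySketch` and the `TaskRowSketch` shape are hypothetical; the real parseRetryOptions() and resolveRetryConfig() helpers may differ.

```typescript
interface StoredRetry {
  maxAttempts?: number;
  baseDelayMs?: number;
  maxDelayMs?: number;
}
interface InMemoryRetry extends StoredRetry {
  shouldRetry?: (err: unknown, nextAttempt: number) => boolean;
}
// Hypothetical row shape; only the retry_options column name is from the PR.
interface TaskRowSketch {
  id: string;
  callback: string;
  retry_options: string | null;
}

// JSON.stringify silently omits function-valued properties, which is why
// a predicate like shouldRetry cannot survive a round trip through SQLite.
function serializeRetry(options?: InMemoryRetry): string | null {
  return options === undefined ? null : JSON.stringify(options);
}

function parseRetrySketch(row: TaskRowSketch): StoredRetry | undefined {
  if (row.retry_options === null) return undefined;
  return JSON.parse(row.retry_options) as StoredRetry;
}

// Per-task values win; anything missing falls back to class-level defaults.
function resolveRetrySketch(
  stored: StoredRetry | undefined,
  classDefaults: Required<StoredRetry>
): Required<StoredRetry> {
  return {
    maxAttempts: stored?.maxAttempts ?? classDefaults.maxAttempts,
    baseDelayMs: stored?.baseDelayMs ?? classDefaults.baseDelayMs,
    maxDelayMs: stored?.maxDelayMs ?? classDefaults.maxDelayMs,
  };
}
```

A row written before the migration would simply carry a null retry_options value and resolve to the class-level defaults, which is consistent with the backward-compatibility note in item 7.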

Test plan

  • npm run build — all packages build
  • npm run test — 172 tests pass (23 files), zero unhandled rejections
  • npm run typecheck — 42/42 projects pass
  • npm run check:exports — all 4 packages valid
  • Playground demo runs (cd examples/playground && npm run dev → Core > Retry)

Introduce a retry system across the Agents SDK: add core primitives (jitterBackoff, tryN, isErrorRetryable) and expose this.retry() plus a RetryOptions type. Persist per-task retry options for queue() and schedule()/scheduleEvery() (new retry_options DB columns) and allow MCP server connection retry config. Validate retry options eagerly and provide class-level defaults via static options; internal Durable Object-aware retries were added for workflow operations and MCP reconnection logic. Includes extensive docs and examples (playground UI + demo agent), and unit/integration tests for retries.

changeset-bot bot commented Feb 9, 2026

🦋 Changeset detected

Latest commit: b6db4f2

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 4 packages
| Name | Type |
| --- | --- |
| agents | Major |
| @cloudflare/ai-chat | Major |
| @cloudflare/codemode | Major |
| hono-agents | Major |



pkg-pr-new bot commented Feb 9, 2026


npm i https://pkg.pr.new/cloudflare/agents@874

commit: b6db4f2

threepointone and others added 2 commits February 15, 2026 16:07
Co-authored-by: Cursor <cursoragent@cursor.com>
Introduce an attempt-aware shouldRetry predicate and stronger eager validation for retry primitives. Key changes: rename internal isRetryable → shouldRetry (accepting (err, nextAttempt)), update tryN to validate finite/integer inputs and delay bounds, and surface clearer error messages. Add parseRetryOptions and resolveRetryConfig helpers, cache resolved agent options, and emit observability events (queue:retry, schedule:retry) for retry attempts >1. Update DB parsing/serialization paths to use the helpers and wire retries through resolveRetryConfig in queue/schedule handlers. Update docs and tests to reflect new signatures, validation behavior, and added integration cases.
@threepointone threepointone changed the title RFC: Retry utilities for the Agents SDK Add retry utilities and per-task retry support Feb 15, 2026
Replace incorrect quadruple backticks and remove an extra blank line in docs/scheduling.md to properly close the code block around the Self-Destructing Agents example and ensure correct rendering before the Timezone-Aware Scheduling section.
@threepointone threepointone merged commit a6ec9b0 into main Feb 15, 2026
4 checks passed
@threepointone threepointone deleted the retries branch February 15, 2026 17:09
@github-actions github-actions bot mentioned this pull request Feb 15, 2026