Add retry utilities and per-task retry support #874
Merged
threepointone merged 4 commits into main on Feb 15, 2026
Introduce a retry system across the Agents SDK: add core primitives (jitterBackoff, tryN, isErrorRetryable) and expose this.retry() plus a RetryOptions type. Persist per-task retry options for queue() and schedule()/scheduleEvery() (new retry_options DB columns) and allow MCP server connection retry config. Validate retry options eagerly and provide class-level defaults via static options; internal Durable Object-aware retries were added for workflow operations and MCP reconnection logic. Includes extensive docs and examples (playground UI + demo agent), and unit/integration tests for retries.
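Persisting per-task options in a text column implies some serialization step. The following is a minimal sketch, assuming JSON is the format; `RetryOptionsSketch`, `serializeRetryOptions`, and `deserializeRetryOptions` are hypothetical names for illustration, not the SDK's helpers.

```typescript
// Hypothetical shape: the real RetryOptions type may carry different fields.
interface RetryOptionsSketch {
  maxAttempts?: number;
  baseDelayMs?: number;
  maxDelayMs?: number;
}

// Serialize options for a TEXT column; null when the task has no retry config.
function serializeRetryOptions(options?: RetryOptionsSketch): string | null {
  return options === undefined ? null : JSON.stringify(options);
}

// Read the column back, treating NULL as "no per-task options".
function deserializeRetryOptions(
  raw: string | null
): RetryOptionsSketch | undefined {
  return raw === null ? undefined : (JSON.parse(raw) as RetryOptionsSketch);
}

const stored = serializeRetryOptions({ maxAttempts: 5, baseDelayMs: 250 });
console.log(stored); // {"maxAttempts":5,"baseDelayMs":250}

// Functions do not survive JSON serialization: the predicate is silently dropped.
const withPredicate = { maxAttempts: 2, shouldRetry: () => true };
console.log(JSON.stringify(withPredicate)); // {"maxAttempts":2}
```

The last line shows why a function-valued option cannot ride along with persisted tasks under this assumption: only plain data makes the round trip.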
🦋 Changeset detected. Latest commit: b6db4f2. The changes in this PR will be included in the next version bump. This PR includes changesets to release 4 packages.
Co-authored-by: Cursor <cursoragent@cursor.com>
Introduce an attempt-aware shouldRetry predicate and stronger eager validation for retry primitives. Key changes: rename internal isRetryable → shouldRetry (accepting (err, nextAttempt)), update tryN to validate finite/integer inputs and delay bounds, and surface clearer error messages. Add parseRetryOptions and resolveRetryConfig helpers, cache resolved agent options, and emit observability events (queue:retry, schedule:retry) for retry attempts >1. Update DB parsing/serialization paths to use the helpers and wire retries through resolveRetryConfig in queue/schedule handlers. Update docs and tests to reflect new signatures, validation behavior, and added integration cases.
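To make the attempt-aware predicate and the up-front input checks concrete, here is a rough sketch of a retry loop. It is not the SDK's `tryN`: `retryLoop`, `LoopOptions`, the injectable `sleep`, the error messages, and the full-jitter backoff formula are all assumptions made for illustration.

```typescript
// Hypothetical sketch: names, signature, and backoff strategy are assumptions.
type ShouldRetry = (err: unknown, nextAttempt: number) => boolean;

interface LoopOptions {
  baseDelayMs: number;
  maxDelayMs: number;
  shouldRetry?: ShouldRetry;
  sleep?: (ms: number) => Promise<void>; // injectable so tests need not wait
}

async function retryLoop<T>(
  maxAttempts: number,
  fn: (attempt: number) => Promise<T>,
  options: LoopOptions
): Promise<T> {
  const { baseDelayMs, maxDelayMs, shouldRetry } = options;
  const sleep =
    options.sleep ??
    ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));

  // Validate before the first attempt runs (the returned promise rejects).
  if (!Number.isInteger(maxAttempts) || maxAttempts < 1) {
    throw new Error("maxAttempts must be a positive integer");
  }
  if (!Number.isFinite(baseDelayMs) || baseDelayMs <= 0) {
    throw new Error("baseDelayMs must be a finite positive number");
  }
  if (!Number.isFinite(maxDelayMs) || maxDelayMs <= 0) {
    throw new Error("maxDelayMs must be a finite positive number");
  }
  if (baseDelayMs > maxDelayMs) {
    throw new Error("baseDelayMs must not exceed maxDelayMs");
  }

  for (let attempt = 1; ; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      const nextAttempt = attempt + 1;
      const declined =
        shouldRetry !== undefined && !shouldRetry(err, nextAttempt);
      // Give up when attempts are exhausted or the predicate declines.
      if (nextAttempt > maxAttempts || declined) {
        throw err;
      }
      // Exponential ceiling capped at maxDelayMs, with full jitter below it.
      const ceiling = Math.min(maxDelayMs, baseDelayMs * 2 ** (attempt - 1));
      await sleep(Math.random() * ceiling);
    }
  }
}

// Usage: retry only "transient" errors, and never go beyond a third attempt.
let calls = 0;
retryLoop(
  5,
  async () => {
    calls++;
    throw new Error(calls < 3 ? "transient" : "fatal");
  },
  {
    baseDelayMs: 100,
    maxDelayMs: 3000,
    sleep: async () => {},
    shouldRetry: (err, nextAttempt) =>
      err instanceof Error && err.message === "transient" && nextAttempt <= 3,
  }
).catch((err) => console.log(calls, (err as Error).message)); // logs: 3 fatal
```

Passing the next attempt number lets a single predicate combine error classification with an attempt budget, which is the behavior the commit message describes.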
Replace incorrect quadruple backticks and remove an extra blank line in docs/scheduling.md to properly close the code block around the Self-Destructing Agents example and ensure correct rendering before the Timezone-Aware Scheduling section.
Summary
Adds structured retry support to the Agents SDK — a consistent, configurable system that works across schedules, queues, MCP connections, and user code.
- `this.retry(fn, options?)`: retry any async operation with exponential backoff and jitter
- `queue()`, `schedule()`, `scheduleEvery()` accept per-task `{ retry?: RetryOptions }`
- `addMcpServer()` accepts `{ retry?: RetryOptions }` for connection retries
- Class-level defaults via `static options = { retry: { ... } }`
- `getQueue()`, `getQueues()`, `getSchedule()`, `dequeue()`, `dequeueAll()`, `dequeueAllByCallback()` made synchronous (they were `async` but only did sync SQL work)

Design
Full design doc in `design/retries.md`. Key decisions:

- `tryN`, `jitterBackoff`, `isErrorRetryable`, `validateRetryOptions` live in `src/retries.ts` and are not re-exported. Only `RetryOptions` (type) and `this.retry()` (method) are public API.
- Per-task retry options are stored in a `retry_options TEXT` column so they survive DO hibernation.
- `shouldRetry(err, nextAttempt)` is available only on `this.retry()`: functions can't be serialized to SQLite, so `queue()`/`schedule()` don't support it. The predicate receives both the error and the next attempt number for attempt-aware retry decisions.
- `validateRetryOptions(options, defaults)` runs at enqueue/schedule/retry time and resolves partial options against class-level defaults before cross-field checks. For example, `{ baseDelayMs: 5000 }` against default `maxDelayMs: 3000` throws immediately, not hours later when the task executes. It also enforces integer `maxAttempts` and guards against `NaN`/`Infinity`.
- `queue:retry` and `schedule:retry` events are emitted before each retry attempt with callback, id, attempt number, and maxAttempts.
- `_resolvedOptions`: static options are computed once and reused, not rebuilt on every queue/schedule/retry call.

Defaults
Defaults are defined for `maxAttempts`, `baseDelayMs`, and `maxDelayMs`.

Workflow operations use 200ms base / 3s max. MCP connections use 500ms base / 5s max.
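The "resolve against defaults, then cross-check" order from the design notes can be sketched as follows. This is an illustration only: `resolveAndValidate`, `RetryOptionsSketch`, and the error messages are hypothetical, and of the default values used here only `maxDelayMs: 3000` is quoted in this PR; `maxAttempts: 3` and `baseDelayMs: 100` are assumed.

```typescript
// Hypothetical shape for illustration.
interface RetryOptionsSketch {
  maxAttempts?: number;
  baseDelayMs?: number;
  maxDelayMs?: number;
}

// Assumed defaults: only maxDelayMs (3000) is quoted in this PR.
const ASSUMED_DEFAULTS: Required<RetryOptionsSketch> = {
  maxAttempts: 3,
  baseDelayMs: 100,
  maxDelayMs: 3000,
};

function resolveAndValidate(
  options: RetryOptionsSketch = {},
  defaults: Required<RetryOptionsSketch> = ASSUMED_DEFAULTS
): Required<RetryOptionsSketch> {
  // Fill in missing fields first so cross-field checks see effective values.
  const resolved = {
    maxAttempts: options.maxAttempts ?? defaults.maxAttempts,
    baseDelayMs: options.baseDelayMs ?? defaults.baseDelayMs,
    maxDelayMs: options.maxDelayMs ?? defaults.maxDelayMs,
  };

  // Number.isInteger and Number.isFinite both reject NaN and Infinity.
  if (!Number.isInteger(resolved.maxAttempts) || resolved.maxAttempts < 1) {
    throw new Error("maxAttempts must be a positive integer");
  }
  for (const key of ["baseDelayMs", "maxDelayMs"] as const) {
    if (!Number.isFinite(resolved[key]) || resolved[key] <= 0) {
      throw new Error(`${key} must be a finite positive number`);
    }
  }
  // Cross-field check against the resolved values, not the partial input.
  if (resolved.baseDelayMs > resolved.maxDelayMs) {
    throw new Error(
      `baseDelayMs (${resolved.baseDelayMs}) must not exceed maxDelayMs (${resolved.maxDelayMs})`
    );
  }
  return resolved;
}

try {
  resolveAndValidate({ baseDelayMs: 5000 });
} catch (err) {
  console.log((err as Error).message);
  // baseDelayMs (5000) must not exceed maxDelayMs (3000)
}

// The MCP-style settings quoted above (500ms base / 5s max) pass as a pair.
console.log(resolveAndValidate({ baseDelayMs: 500, maxDelayMs: 5000 }));
```

Checking the partial input on its own would accept `{ baseDelayMs: 5000 }`, since no `maxDelayMs` is present to compare against; resolving first is what surfaces the conflict at call time.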
What's included
- `src/retries.ts`: `RetryOptions`, `tryN`, `jitterBackoff`, `isErrorRetryable`, `validateRetryOptions`
- `src/index.ts`: `this.retry()`, retry on `queue()`/`schedule()`, class-level defaults, observability events, cached options, DRY helpers, sync getters
- `src/mcp/client.ts`: `addMcpServer()` retry options, persisted in `server_options`
- `src/observability/agent.ts`: `queue:retry` and `schedule:retry` event types
- `src/tests/retries.test.ts`
- `src/tests/retry-integration.test.ts`: `this.retry()`, `shouldRetry` with attempt number, queue/schedule retry, eager validation, class-level defaults
- `src/tests/agents/retry.ts`: `TestRetryAgent`, `TestRetryDefaultsAgent`
- `design/retries.md`
- `docs/retries.md`
- `docs/scheduling.md`, `docs/queue.md`, `docs/mcp-client.md`
- `examples/playground/src/demos/core/`
- `.changeset/retry-utilities.md`, `.changeset/sync-getters.md`

Known limitations
- [...] `onError()`. [...] `this.retry()` inside the callback for independent retry.
- `tryN` uses `setTimeout` between attempts. For short delays (100ms–3s) this is fine. For longer recovery, use `schedule()` instead.

Notes for reviewers
- Start with `design/retries.md`: it covers architecture, every key decision, and tradeoffs. The code will make more sense after reading it.
- `src/retries.ts` is the core, at ~160 lines. `tryN` is the only retry loop; everything else composes on top of it. If the primitives look right, the rest follows.
- `_flushQueue()` and the alarm handler in `src/index.ts` are the most impactful integration points. They use shared `parseRetryOptions()` and `resolveRetryConfig()` helpers to read `retry_options` from the DB row, merge with class-level defaults, and pass to `tryN`. Payload parsing is hoisted outside the retry loop to avoid repeated deserialization.
- Validation is the tightest part: `validateRetryOptions` accepts optional `defaults` so that cross-field checks work against resolved values (e.g. explicit `baseDelayMs: 5000` against default `maxDelayMs: 3000`). `tryN` also validates its own inputs with `Number.isFinite()` guards. Both paths use consistent error messages.
- Test agents catch errors internally: callable methods that test error paths return `{ error: string }` instead of throwing. This avoids unhandled promise rejections in the workerd runtime (thrown errors in `@callable` methods cross the RPC boundary and appear as uncaught rejections even when vitest catches them). This let us remove `dangerouslyIgnoreUnhandledErrors` from the vitest config.
- The playground demo is functional: run `cd examples/playground && npm run dev` and navigate to Core > Retry. Three interactive scenarios: flaky operation, shouldRetry filter, and queue with retry.
- Backward compatible: all new parameters are optional, the `retry_options TEXT` column is added via migration, and the sync getter change is safe (`await` on a non-Promise resolves immediately).

Test plan

- `npm run build`: all packages build
- `npm run test`: 172 tests pass (23 files), zero unhandled rejections
- `npm run typecheck`: 42/42 projects pass
- `npm run check:exports`: all 4 packages valid
- Playground demo (`cd examples/playground && npm run dev` → Core > Retry)
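The backward-compatibility claim about the sync getters rests on standard `await` semantics, which a short sketch can confirm. `getQueueSync` and `legacyCaller` are hypothetical stand-ins, not SDK functions.

```typescript
// Hypothetical stand-in for a getter that used to be async and is now sync.
function getQueueSync(id: string): { id: string } | undefined {
  return id === "known" ? { id } : undefined;
}

async function legacyCaller(): Promise<string | undefined> {
  // Call sites that still `await` keep working: awaiting a non-Promise
  // value simply yields that value on a later microtask.
  const item = await getQueueSync("known");
  return item?.id;
}

legacyCaller().then((id) => console.log(id)); // known
```

Callers that chained `.then()` directly on the old return value would be the exception, since a plain object has no `then` method; `await`-based call sites are unaffected.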