Ignore stale readable callbacks after replica sync handoff by sarthakaggarwal97 · Pull Request #3348 · valkey-io/valkey

sarthakaggarwal97 · 2026-03-10T16:14:30Z

I noticed this crash in yesterday's daily run: https://github.com/valkey-io/valkey/actions/runs/22880939835/job/66383301850#step:10:7530

30553:S 10 Mar 2026 01:02:11.216 - Error accepting cluster node connection: error:0A000126:SSL routines::unexpected eof while reading
30553:S 10 Mar 2026 01:02:11.233 # syncWithPrimary(): state machine error, state should be RECEIVE_PSYNC but is 13
30553:S 10 Mar 2026 01:02:11.233 - Accepting cluster node connection from 127.0.0.1:37896
30553:S 10 Mar 2026 01:02:11.241 - Error accepting cluster node connection: error:0A000126:SSL routines::unexpected eof while reading
30553:S 10 Mar 2026 01:02:11.241 * Replica main thread creating Bio thread to save RDB to disk


=== VALKEY BUG REPORT START: Cut & paste starting from here ===
30553:S 10 Mar 2026 01:02:11.241 # valkey 255.255.255 crashed by signal: 11, si_code: 1
30553:S 10 Mar 2026 01:02:11.241 # Accessing address: (nil)
30553:S 10 Mar 2026 01:02:11.241 # Crashed running the instruction at: 0x55579c

------ STACK TRACE ------
EIP:
/__w/valkey/valkey/src/valkey-server 127.0.0.1:21217 [cluster](receiveRDBinBioThreadSingleChannel+0x2c)[0x55579c]

During sync, the main connection starts with syncWithPrimary() as its read handler. After the PSYNC reply is processed, the code swaps that handler to the RDB transfer path and sets replication to REPL_STATE_TRANSFER.

The race could be that a readable event for the old handler can already be queued before the handler swap is completed. When that stale callback later fires, syncWithPrimary() runs one more time even though the replica is already in the
transfer phase.

The patch fixes two things -

Ignore the late syncWithPrimary() callbacks once replication is already inREPL_STATE_TRANSFER or REPL_STATE_CONNECTED
Skip the BIO-thread RDB handoff if the transfer connection is already gone

Passing CI Run 200 times - https://github.com/sarthakaggarwal97/valkey/actions/runs/22888409606/job/66406230369

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>

codecov · 2026-03-10T16:45:46Z

Codecov Report

❌ Patch coverage is 33.33333% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.11%. Comparing base (1e2ca5e) to head (56f5f2b).
⚠️ Report is 26 commits behind head on unstable.

Files with missing lines	Patch %	Lines
src/replication.c	33.33%	2 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #3348      +/-   ##
============================================
+ Coverage     74.98%   75.11%   +0.12%     
============================================
  Files           129      129              
  Lines         71633    71636       +3     
============================================
+ Hits          53715    53810      +95     
+ Misses        17918    17826      -92

Files with missing lines	Coverage Δ
src/replication.c	`86.13% <33.33%> (+0.01%)`	⬆️

... and 22 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

sarthakaggarwal97 · 2026-03-10T19:41:02Z

I was chatting with @nitaicaro offline for this PR, and ideally syncWithPrimary shouldn't have been fired twice technically. @alon-arenberg would you like to take a look at this? @nitaicaro suggested that you might have some context on this one.

nitaicaro · 2026-03-10T19:52:08Z

@alon-arenberg - some more context - looking at the TLS + ae interaction, it looks like tlsProcessPendingData iterates the pending list and calls tlsHandleEvent(conn, AE_READABLE) directly, outside the ae fired-events loop. if during that handler execution, the read handler changes, is it possible that the next event which was already queued for that same fd might still cause the old handler to execute?

sarthakaggarwal97 and others added 2 commits March 10, 2026 09:05

Fix a race in replica full sync

6f0db21

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>

Merge branch 'valkey-io:unstable' into ignore-stale-replica-transfer

56f5f2b

github-actions bot assigned sarthakaggarwal97 Mar 10, 2026

sarthakaggarwal97 requested review from enjoy-binbin and nitaicaro March 10, 2026 16:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ignore stale readable callbacks after replica sync handoff#3348

Ignore stale readable callbacks after replica sync handoff#3348
sarthakaggarwal97 wants to merge 2 commits intovalkey-io:unstablefrom
sarthakaggarwal97:ignore-stale-replica-transfer

sarthakaggarwal97 commented Mar 10, 2026 •

edited

Loading

Uh oh!

codecov bot commented Mar 10, 2026 •

edited

Loading

Uh oh!

sarthakaggarwal97 commented Mar 10, 2026

Uh oh!

nitaicaro commented Mar 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

sarthakaggarwal97 commented Mar 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Mar 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

sarthakaggarwal97 commented Mar 10, 2026

Uh oh!

nitaicaro commented Mar 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

sarthakaggarwal97 commented Mar 10, 2026 •

edited

Loading

codecov bot commented Mar 10, 2026 •

edited

Loading