Skip to content

Ignore stale readable callbacks after replica sync handoff#3348

Open
sarthakaggarwal97 wants to merge 2 commits intovalkey-io:unstablefrom
sarthakaggarwal97:ignore-stale-replica-transfer
Open

Ignore stale readable callbacks after replica sync handoff#3348
sarthakaggarwal97 wants to merge 2 commits intovalkey-io:unstablefrom
sarthakaggarwal97:ignore-stale-replica-transfer

Conversation

@sarthakaggarwal97
Copy link
Contributor

@sarthakaggarwal97 sarthakaggarwal97 commented Mar 10, 2026

I noticed this crash in yesterday's daily run: https://github.com/valkey-io/valkey/actions/runs/22880939835/job/66383301850#step:10:7530

30553:S 10 Mar 2026 01:02:11.216 - Error accepting cluster node connection: error:0A000126:SSL routines::unexpected eof while reading
30553:S 10 Mar 2026 01:02:11.233 # syncWithPrimary(): state machine error, state should be RECEIVE_PSYNC but is 13
30553:S 10 Mar 2026 01:02:11.233 - Accepting cluster node connection from 127.0.0.1:37896
30553:S 10 Mar 2026 01:02:11.241 - Error accepting cluster node connection: error:0A000126:SSL routines::unexpected eof while reading
30553:S 10 Mar 2026 01:02:11.241 * Replica main thread creating Bio thread to save RDB to disk


=== VALKEY BUG REPORT START: Cut & paste starting from here ===
30553:S 10 Mar 2026 01:02:11.241 # valkey 255.255.255 crashed by signal: 11, si_code: 1
30553:S 10 Mar 2026 01:02:11.241 # Accessing address: (nil)
30553:S 10 Mar 2026 01:02:11.241 # Crashed running the instruction at: 0x55579c

------ STACK TRACE ------
EIP:
/__w/valkey/valkey/src/valkey-server 127.0.0.1:21217 [cluster](receiveRDBinBioThreadSingleChannel+0x2c)[0x55579c]

During sync, the main connection starts with syncWithPrimary() as its read handler. After the PSYNC reply is processed, the code swaps that handler to the RDB transfer path and sets replication to REPL_STATE_TRANSFER.

The race could be that a readable event for the old handler can already be queued before the handler swap is completed. When that stale callback later fires, syncWithPrimary() runs one more time even though the replica is already in the
transfer phase.

The patch fixes two things -

  1. Ignore the late syncWithPrimary() callbacks once replication is already inREPL_STATE_TRANSFER or REPL_STATE_CONNECTED
  2. Skip the BIO-thread RDB handoff if the transfer connection is already gone

Passing CI Run 200 times - https://github.com/sarthakaggarwal97/valkey/actions/runs/22888409606/job/66406230369

@codecov
Copy link

codecov bot commented Mar 10, 2026

Codecov Report

❌ Patch coverage is 33.33333% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.11%. Comparing base (1e2ca5e) to head (56f5f2b).
⚠️ Report is 26 commits behind head on unstable.

Files with missing lines Patch % Lines
src/replication.c 33.33% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #3348      +/-   ##
============================================
+ Coverage     74.98%   75.11%   +0.12%     
============================================
  Files           129      129              
  Lines         71633    71636       +3     
============================================
+ Hits          53715    53810      +95     
+ Misses        17918    17826      -92     
Files with missing lines Coverage Δ
src/replication.c 86.13% <33.33%> (+0.01%) ⬆️

... and 22 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@sarthakaggarwal97
Copy link
Contributor Author

I was chatting with @nitaicaro offline for this PR, and ideally syncWithPrimary shouldn't have been fired twice technically. @alon-arenberg would you like to take a look at this? @nitaicaro suggested that you might have some context on this one.

@nitaicaro
Copy link
Contributor

@alon-arenberg - some more context - looking at the TLS + ae interaction, it looks like tlsProcessPendingData iterates the pending list and calls tlsHandleEvent(conn, AE_READABLE) directly, outside the ae fired-events loop. if during that handler execution, the read handler changes, is it possible that the next event which was already queued for that same fd might still cause the old handler to execute?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants