SERVER-121686 Fix config server crash on config.mongos delete #1635
Open
amit777 wants to merge 18 commits into mongodb:v7.0
GitOrigin-RevId: 3b47fcedd563b91860793a46fa25aa579af16adc
GitOrigin-RevId: 84ade39605f660946ddce1a489f638cbb55ec5db
GitOrigin-RevId: cfc346c59d2cc3dbc903507fc642e22e3097f362
…6060) GitOrigin-RevId: da5d24e95be0a4ee771a64250acefb7f6f274940
… (#46064) GitOrigin-RevId: 571cbfdf0e715e55e894fd2b01b7dd1720e89151
… messages (#44559) (#46087) GitOrigin-RevId: 0e4cd91f1d5631be49c32020ddc8225a421603ae
…ion find on sharded collection [v7.0.29-hotfix] (#46095) GitOrigin-RevId: 6c9ac5d9044aecaa83c25611fb400902e0522424
…quired at startup (#46084) GitOrigin-RevId: 59eeb10163d765b4fc96c8854526733e44fd9af5
… by the system (#44791) (#46107) GitOrigin-RevId: 702ca7465a53d58fdc3084c26913c9b0090d3dd4
…ation where projection is only _id exclusion (#44589) (#46115) GitOrigin-RevId: 624a0e2bcda69d1e0a2d5952b2279348530ce2b8
GitOrigin-RevId: 7d78592b5120b34fe3dc1042233ba68e03124679
GitOrigin-RevId: d2a57fa864dd90b0769344b061c8b5ca9c6a2dd0
…tfix] (#46224) GitOrigin-RevId: 5a5a037b9d402856e492b984c811a3b08065bd1e
…259) GitOrigin-RevId: d2be1d8bf48d6ef6dc90543fd45ce5b6c4ade070
…etwork buffers (#46253) GitOrigin-RevId: b2287245e1227d0ac70dc459ac776beaa0a24b6e
…rationSessionInfo` (#46286) GitOrigin-RevId: 415cc13e900a82a2e00e4f4417dc7159a883e975
GitOrigin-RevId: 67480f41dfa5802ce14af5c95bd0e9826d3b2131
Remove fatal invariant(erased) in QueryAnalysisCoordinator::onSamplerDelete that crashes the config server when a delete targets a sampler not in the coordinator's in-memory map. This happens when the sampler's ping time was too old to be loaded during onStartup, or during oplog replay/recovery.

The crash causes an unrecoverable crash loop: all config server replicas abort when applying the same oplog entry, and startup recovery replays the same delete, retriggering the invariant failure.

Fix:
- Remove the invariant(erased) assertion. A missing sampler is not a fatal condition, and erase() on a missing key is a safe no-op.
- Add try/catch around all config.mongos commit handlers in the op observer to prevent non-critical QueryAnalysisCoordinator errors from aborting the server during oplog application.
- Add test for deleting an untracked sampler.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
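The "safe no-op" claim in the commit message can be illustrated with a minimal standalone sketch. It is not the MongoDB implementation: `SamplerRegistry`, its members, and the use of `std::map` are hypothetical stand-ins for the coordinator's in-memory map. It shows only that `erase()` by key returns the number of elements removed, and returns 0 without error when the key is absent.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for the coordinator's in-memory sampler map.
// Names and types here are assumptions for illustration only.
class SamplerRegistry {
public:
    // Track a sampler by name (for example, a mongos host:port string).
    void onSamplerInsert(const std::string& name) {
        _samplers[name] = true;
    }

    // Remove a sampler if it is tracked. std::map::erase(key) returns the
    // number of elements removed, so an untracked name yields 0 and leaves
    // the map unchanged instead of failing.
    bool onSamplerDelete(const std::string& name) {
        const std::size_t erased = _samplers.erase(name);
        return erased == 1;
    }

    std::size_t size() const {
        return _samplers.size();
    }

private:
    std::map<std::string, bool> _samplers;
};
```

In this sketch the caller can log or ignore a `false` return; treating it as fatal is what turns a harmless delete into a crash.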
Summary

Fixes a P1 blocker where deleting documents from the `config.mongos` collection crashes the config server with an invariant failure in `QueryAnalysisCoordinator::onSamplerDelete`, causing an unrecoverable crash loop across all config server replicas.

Root Cause

In `onSamplerDelete()`, the code asserts `invariant(erased)` after calling `_samplers.erase()`. This crashes when the deleted sampler was never loaded into the coordinator's in-memory map during `onStartup()`, which only loads samplers with recent ping times. During oplog replay/recovery, the same delete is replayed, retriggering the crash on every restart.

Fix

- Remove `invariant(erased)` in `QueryAnalysisCoordinator::onSamplerDelete`. A sampler not being in the map is not a fatal condition, and `erase()` on a missing key is a safe no-op.
- Add try/catch around all `config.mongos` commit handlers in `QueryAnalysisOpObserver` to prevent non-critical `QueryAnalysisCoordinator` errors from aborting the server during oplog application.

Impact

- The `config.mongos` collection is informational only; deletes should never crash the server.

Test plan

- `RemoveUntrackedSamplerOnDeleteDoesNotCrash` verifies that deleting a sampler not in the coordinator's map succeeds without crashing.
- `RemoveSamplersOnDelete` test continues to pass (happy path).
- Delete documents from `config.mongos`, confirm no crash.

🤖 Generated with Claude Code
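The try/catch guard around the commit handlers could look roughly like the standalone sketch below. It makes several assumptions: `runCommitHandlerSafely` is a hypothetical helper, not a function from the MongoDB codebase, and errors go to `std::cerr` rather than the server's logging facilities. The guard only contains errors raised as C++ exceptions. An assertion that aborts the process, as an invariant failure does, is not catchable this way, which is why the assertion itself also has to be removed.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <iostream>
#include <stdexcept>

// Hypothetical helper: run a non-critical commit handler and swallow any
// exception so it cannot propagate into the caller. Returns true when the
// handler completed, false when an error was caught and ignored.
bool runCommitHandlerSafely(const std::function<void()>& handler) {
    try {
        handler();
        return true;
    } catch (const std::exception& ex) {
        // Report and continue; the handler's work is treated as best effort.
        std::cerr << "Ignoring commit handler error: " << ex.what() << '\n';
        return false;
    } catch (...) {
        std::cerr << "Ignoring unknown commit handler error\n";
        return false;
    }
}
```

Swallowing errors is only appropriate here because the bookkeeping behind these handlers is described as informational; a handler whose failure would leave durable state inconsistent should not be wrapped like this.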