SERVER-121686 Fix config server crash on config.mongos delete #1635

Open
amit777 wants to merge 18 commits into mongodb:v7.0 from amit777:fix/SERVER-121686-config-mongos-delete-crash

amit777 commented Mar 15, 2026

Summary

Fixes a P1 blocker where deleting documents from the config.mongos collection crashes the config server with an invariant failure in QueryAnalysisCoordinator::onSamplerDelete, causing an unrecoverable crash loop across all config server replicas.

Root Cause

In onSamplerDelete(), the code asserts invariant(erased) after calling _samplers.erase(). The invariant fails when the deleted sampler was never loaded into the coordinator's in-memory map, which can happen because onStartup() only loads samplers with recent ping times. Because the same delete is applied again during oplog replay/recovery, the invariant failure retriggers on every restart.
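
To illustrate the failure mode, here is a minimal standalone sketch. It is not the MongoDB source: `MongosDoc`, `ToyCoordinator`, `deleteWouldTripInvariant`, and the 60-second cutoff are hypothetical names and values chosen for illustration. It only models the idea that a map populated from recently pinged samplers returns 0 from erase() for a stale sampler, which is the condition an `invariant(erased)` check would treat as fatal.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a config.mongos document: a name plus its last ping time.
struct MongosDoc {
    std::string name;
    std::chrono::seconds lastPing;
};

// Toy model of an in-memory sampler map (not the real QueryAnalysisCoordinator).
class ToyCoordinator {
public:
    // Models startup loading: only samplers pinged within maxAge of now are tracked.
    void onStartup(const std::vector<MongosDoc>& docs,
                   std::chrono::seconds now,
                   std::chrono::seconds maxAge) {
        for (const auto& doc : docs) {
            if (now - doc.lastPing <= maxAge) {
                _samplers.emplace(doc.name, doc.lastPing);
            }
        }
    }

    // Models the erase step of a delete: returns how many entries were removed.
    std::size_t eraseSampler(const std::string& name) {
        return _samplers.erase(name);
    }

    std::size_t size() const {
        return _samplers.size();
    }

private:
    std::map<std::string, std::chrono::seconds> _samplers;
};

// True when an invariant(erased)-style check would have failed for this delete.
bool deleteWouldTripInvariant(ToyCoordinator& coordinator, const std::string& name) {
    return coordinator.eraseSampler(name) == 0;
}

// Small demonstration: one fresh sampler is loaded, one stale sampler is skipped.
void demoRootCause() {
    ToyCoordinator coordinator;
    const std::vector<MongosDoc> docs = {
        {"mongos-fresh", std::chrono::seconds(990)},
        {"mongos-stale", std::chrono::seconds(10)},
    };
    coordinator.onStartup(docs, std::chrono::seconds(1000), std::chrono::seconds(60));

    assert(coordinator.size() == 1);                                // only the fresh one is tracked
    assert(deleteWouldTripInvariant(coordinator, "mongos-stale"));  // never loaded, erase removes 0
    assert(!deleteWouldTripInvariant(coordinator, "mongos-fresh")); // tracked, erase removes 1
}
```

In this model, deleting "mongos-stale" is the case that corresponds to the crash described above: the document exists in the collection, but the in-memory map never knew about it.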

Fix

  • Remove invariant(erased) in QueryAnalysisCoordinator::onSamplerDelete — a sampler not being in the map is not a fatal condition; erase() on a missing key is a safe no-op
  • Add try/catch around all config.mongos commit handlers in QueryAnalysisOpObserver to prevent non-critical QueryAnalysisCoordinator errors from aborting the server during oplog application
  • Add test for deleting an untracked sampler to prevent regression
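
The first two bullets can be sketched in a standalone form. This is an illustration under stated assumptions, not the actual patch: `TolerantCoordinator` and `runCommitHandlerSafely` are hypothetical names, and the sketch catches `std::exception` where the server would deal with its own exception types. Note that a try/catch only guards against thrown exceptions; an invariant failure aborts the process and cannot be caught, so in this reading it is the removal of the invariant that addresses the crash, with the try/catch as additional defense.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <functional>
#include <set>
#include <stdexcept>
#include <string>

// Toy coordinator whose delete handler tolerates untracked samplers.
class TolerantCoordinator {
public:
    void addSampler(const std::string& name) {
        _samplers.insert(name);
    }

    // erase() on a missing key removes nothing and is safe; the result is
    // returned so a caller could log the untracked case instead of aborting.
    bool onSamplerDelete(const std::string& name) {
        return _samplers.erase(name) > 0;
    }

    std::size_t size() const {
        return _samplers.size();
    }

private:
    std::set<std::string> _samplers;
};

// Runs a commit handler and converts a thrown exception into a false return,
// optionally reporting the message through errorOut. Hypothetical helper.
bool runCommitHandlerSafely(const std::function<void()>& handler,
                            std::string* errorOut = nullptr) {
    try {
        handler();
        return true;
    } catch (const std::exception& ex) {
        if (errorOut != nullptr) {
            *errorOut = ex.what();
        }
        return false;
    }
}

// Small demonstration of both behaviors.
void demoFix() {
    TolerantCoordinator coordinator;
    coordinator.addSampler("mongos-0");

    // Deleting an untracked sampler is reported, not fatal.
    assert(!coordinator.onSamplerDelete("mongos-unknown"));
    assert(coordinator.size() == 1);

    // A throwing handler is contained by the wrapper.
    std::string error;
    const bool ok = runCommitHandlerSafely(
        [] { throw std::runtime_error("coordinator error"); }, &error);
    assert(!ok);
    assert(error == "coordinator error");
}
```

Returning a bool from the delete handler is a design choice of this sketch: it keeps the "sampler was not tracked" signal available for logging without making it a fatal condition.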

Impact

  • Affected versions: 7.0.25, 7.0.30 (confirmed)
  • The crash causes complete config server outage requiring manual oplog surgery to recover
  • The config.mongos collection is informational only — deletes should never crash the server

Test plan

  • New unit test RemoveUntrackedSamplerOnDeleteDoesNotCrash verifies that deleting a sampler not in the coordinator's map succeeds without crashing
  • Existing RemoveSamplersOnDelete test continues to pass (happy path)
  • Manual verification: deploy sharded cluster, delete from config.mongos, confirm no crash

🤖 Generated with Claude Code

zackwintermdb and others added 18 commits December 22, 2025 19:50
GitOrigin-RevId: 3b47fcedd563b91860793a46fa25aa579af16adc
GitOrigin-RevId: 84ade39605f660946ddce1a489f638cbb55ec5db
GitOrigin-RevId: cfc346c59d2cc3dbc903507fc642e22e3097f362
…6060)

GitOrigin-RevId: da5d24e95be0a4ee771a64250acefb7f6f274940
… (#46064)

GitOrigin-RevId: 571cbfdf0e715e55e894fd2b01b7dd1720e89151
… messages (#44559) (#46087)

GitOrigin-RevId: 0e4cd91f1d5631be49c32020ddc8225a421603ae
…ion find on sharded collection [v7.0.29-hotfix] (#46095)

GitOrigin-RevId: 6c9ac5d9044aecaa83c25611fb400902e0522424
…quired at startup (#46084)

GitOrigin-RevId: 59eeb10163d765b4fc96c8854526733e44fd9af5
… by the system (#44791) (#46107)

GitOrigin-RevId: 702ca7465a53d58fdc3084c26913c9b0090d3dd4
…ation where projection is only _id exclusion (#44589) (#46115)

GitOrigin-RevId: 624a0e2bcda69d1e0a2d5952b2279348530ce2b8
GitOrigin-RevId: 7d78592b5120b34fe3dc1042233ba68e03124679
GitOrigin-RevId: d2a57fa864dd90b0769344b061c8b5ca9c6a2dd0
…tfix] (#46224)

GitOrigin-RevId: 5a5a037b9d402856e492b984c811a3b08065bd1e
…259)

GitOrigin-RevId: d2be1d8bf48d6ef6dc90543fd45ce5b6c4ade070
…etwork buffers (#46253)

GitOrigin-RevId: b2287245e1227d0ac70dc459ac776beaa0a24b6e
…rationSessionInfo` (#46286)

GitOrigin-RevId: 415cc13e900a82a2e00e4f4417dc7159a883e975
GitOrigin-RevId: 67480f41dfa5802ce14af5c95bd0e9826d3b2131
Remove fatal invariant(erased) in QueryAnalysisCoordinator::onSamplerDelete
that crashes the config server when a delete targets a sampler not in the
coordinator's in-memory map. This happens when the sampler's ping time was
too old to be loaded during onStartup, or during oplog replay/recovery.

The crash causes an unrecoverable crash loop: all config server replicas
abort when applying the same oplog entry, and startup recovery replays the
same delete, retriggering the invariant failure.

Fix:
- Remove invariant(erased) assertion — a missing sampler is not a fatal
  condition; erase() on a missing key is a safe no-op.
- Add try/catch around all config.mongos commit handlers in the op observer
  to prevent non-critical QueryAnalysisCoordinator errors from aborting the
  server during oplog application.
- Add test for deleting an untracked sampler.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>