metrics: guard MetricsCollector with a mutex to prevent transient scrape gaps #160

@mjudeikis-bot

Description

Background

Discovered during review of #159 (Fix metrics).

The MetricsCollector.Collect() method calls Reset() on its gauge vectors before re-populating them. This opens a brief window during a Prometheus scrape in which the condition/count metrics temporarily disappear, which can cause alerts to fire spuriously for operators with high scrape frequencies or tight alerting rules.

As noted by @xrstf in the review, the correct fix is to guard the collector with a sync.RWMutex:

  • Hold a write lock during the Reset() + repopulate sequence inside updateObjectCounts and updateCacheServerCounts
  • Hold a read lock in Collect() while metrics are being scraped

This ensures that Prometheus scrapes see a consistent, atomic snapshot and never observe a partially-reset state.

Proposed Fix

type MetricsCollector struct {
    mu sync.RWMutex // guards the gauge vectors below
    // ... existing fields
}

func (c *MetricsCollector) updateObjectCounts(...) {
    c.mu.Lock() // writer: Reset() + repopulate happen atomically
    defer c.mu.Unlock()
    // Reset + repopulate
}

func (c *MetricsCollector) Collect(ch chan<- prometheus.Metric) {
    c.mu.RLock() // reader: scrape sees a consistent snapshot
    defer c.mu.RUnlock()
    // collect
}

Scope

This is intentionally out of scope for #159, which focuses on the Set(1) → accumulate correctness fix. It is tracked separately here as a good first issue.

/kind bug
/good-first-issue
