Metrics endpoint is not available on listeners that don't have active runners #3784

@velkovb

Description

Checks

Controller Version

0.9.3

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Go to your Prometheus endpoint.
2. Go to Targets and find the targets for your listener pods.
3. All listeners that don't have running jobs appear as down.
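
The same check can be scripted against Prometheus' `/api/v1/targets` HTTP API instead of the UI. The following is a minimal sketch, not part of the original report: the job name `arc-listeners`, the pod names, the IPs, and the error text in the sample payload are all made up for illustration, and the payload is trimmed to the fields the function reads.

```python
import json

# Hypothetical, trimmed response from GET /api/v1/targets.
# Job name, pod names, IPs and error text are invented for this sketch.
SAMPLE_RESPONSE = """
{
  "status": "success",
  "data": {
    "activeTargets": [
      {"labels": {"job": "arc-listeners", "pod": "warm-set-listener"},
       "scrapeUrl": "http://10.0.0.11:8081/metrics",
       "health": "up", "lastError": ""},
      {"labels": {"job": "arc-listeners", "pod": "idle-set-listener"},
       "scrapeUrl": "http://10.0.0.12:8081/metrics",
       "health": "down", "lastError": "(example error text)"}
    ]
  }
}
"""


def down_targets(response_text, job):
    """Return (pod, scrape URL, last error) for targets of `job` that are not up."""
    payload = json.loads(response_text)
    targets = payload["data"]["activeTargets"]
    return [
        (t["labels"].get("pod", "?"), t["scrapeUrl"], t["lastError"])
        for t in targets
        if t["labels"].get("job") == job and t["health"] != "up"
    ]


if __name__ == "__main__":
    for pod, url, err in down_targets(SAMPLE_RESPONSE, "arc-listeners"):
        print(f"{pod}\t{url}\t{err}")
```

Against a real cluster, the sample string would be replaced by the body returned from `<prometheus-url>/api/v1/targets`.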

Describe the bug

We have 24 runner types and deploy 24 different scale sets. Only a few of them have warm runners (minRunners) enabled, and only those show up in Prometheus as live targets. The ones that don't have runners appear as down. Not sure if this could be related to https://github.com/actions/actions-runner-controller/pull/3445/files

This leads to an issue with the gha_registered_runners metric, as it doesn't properly go to 0. For example, if we have 20 warm runners and scale them down to 0 for the night, gha_registered_runners stays at 20 because the metrics endpoint never starts reporting the 0 value.

P.S. We observed another issue caused by this. When a scale set configured with 0 min runners receives a job request, it spins up new runner pods but doesn't activate the metrics endpoint, which leads to missing metrics.

Describe the expected behavior

All listeners should report metrics even if there are no active jobs on them.
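
One way to verify this per listener is to fetch its metrics endpoint directly (port 8081 and path /metrics in the values below) and look for a gha_registered_runners sample. This is a rough sketch under assumptions: the label name `name`, the `idle-set` value, and the HELP text in the sample are invented, and the parser assumes samples carry no trailing timestamp.

```python
import urllib.request

# Hypothetical scrape body; label names and HELP text are invented for this sketch.
SAMPLE_METRICS = """\
# HELP gha_registered_runners (example help text)
# TYPE gha_registered_runners gauge
gha_registered_runners{name="idle-set"} 0
gha_registered_runners_other 3
"""


def fetch_metrics(url, timeout=5.0):
    """GET a listener metrics URL, e.g. http://<listener-pod-ip>:8081/metrics."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def metric_samples(text, name):
    """Return (series, value) pairs for metric `name` from exposition-format text.

    Assumes each sample line is `series value` with no trailing timestamp.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        # Match the bare metric name or the name followed by a label set only.
        if series == name or series.startswith(name + "{"):
            samples.append((series, float(value)))
    return samples


if __name__ == "__main__":
    print(metric_samples(SAMPLE_METRICS, "gha_registered_runners"))
```

An idle listener that behaves as expected would return a sample with value 0 here, rather than refusing the connection or returning nothing.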

Additional Context

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: role
          operator: In
          values:
          - spot
fullnameOverride: gha-runner-scale-set-controller
metrics:
  controllerManagerAddr: :8080
  listenerAddr: :8081
  listenerEndpoint: /metrics
priorityClassName: system-cluster-critical
replicaCount: 1
resources:
  limits:
    memory: 512Mi
  requests:
    cpu: 200m
    memory: 512Mi
tolerations:
- effect: NoSchedule
  key: pot
  operator: Equal
  value: "true"

Controller Logs

https://gist.github.com/velkovb/eccd211ad8776ce4f63d55c61b954879

Runner Pod Logs

no runners involved here

Labels

bug (Something isn't working), gha-runner-scale-set (Related to the gha-runner-scale-set mode), needs triage (Requires review from the maintainers)
