Removing Node Pool from Cluster Breaks Ingress Controller #649

Description

@dacox

Hi everyone,

We're currently on 1.11.6-gke.11. We've had some issues with gce-ingress in the past, and after a recent outage we are trying to dig into the root cause.

We currently have a handful of Node Pools and use an Ingress to map traffic to two services in our cluster. This created a Cloud Load Balancer with two Backend Services, each with its own health check. Both Backend Services point to the same Instance Group, which is also created by the controller (am I getting this right?).
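For context, a minimal sketch of what such an Ingress looks like; the service names, paths, and ports here are illustrative placeholders, not our actual manifests (`extensions/v1beta1` was the Ingress API group current on GKE 1.11):

```yaml
# Hypothetical Ingress mapping two services; names/paths are placeholders.
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: myingress
spec:
  rules:
    - http:
        paths:
          - path: /api/*
            backend:
              serviceName: service-a   # illustrative name
              servicePort: 80
          - path: /web/*
            backend:
              serviceName: service-b   # illustrative name
              servicePort: 80
```

The gce-ingress controller turns each referenced Service into one Backend Service plus health check on the Cloud Load Balancer, which matches the two-Backend-Service setup described above.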

As I understand it, because our Services have externalTrafficPolicy=Cluster (the default), both Backend Services show all nodes in the Cluster as healthy, not just the nodes actually running our Services' Pods.
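A sketch of where that setting lives, with hypothetical names and ports. Under `Cluster`, kube-proxy on every node forwards load-balancer traffic to the backing Pods (possibly on another node), so the LB health check succeeds on every node; under `Local`, only nodes with local Pods pass the check:

```yaml
# Illustrative Service; name, selector, and ports are placeholders.
apiVersion: v1
kind: Service
metadata:
  name: service-a
spec:
  type: NodePort
  externalTrafficPolicy: Cluster  # default; "Local" limits to nodes with Pods
  selector:
    app: service-a
  ports:
    - port: 80
      targetPort: 8080
```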

We recently removed a Node Pool from the Cluster. To do this, we cordoned each node in the pool and then drained them. A few hours later we removed the node pool.
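The drain procedure we followed can be sketched roughly like this; the node pool label value is a placeholder, and the flags are those current on 1.11-era kubectl (the GKE node pool label `cloud.google.com/gke-nodepool` is standard):

```shell
# Sketch of the cordon-then-drain procedure; "old-pool" is hypothetical.
# First mark every node in the pool unschedulable...
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=old-pool -o name); do
  kubectl cordon "$node"
done

# ...then evict the Pods running on them.
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=old-pool -o name); do
  kubectl drain "$node" --ignore-daemonsets --delete-local-data
done
```

After the Pods were rescheduled elsewhere, we deleted the node pool itself a few hours later.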

Immediately upon deleting the pool, we experienced a large amount of downtime. It did not seem to be resolved when the last node went offline. At one point, we saw that both Backend Services reported 0/0 as healthy.

After about half an hour, we were able to put out the fire by running `kubectl delete ingress myingress` and then re-creating the Ingress.

We now understand that we would get some 502s while the nodes were being deleted, due to our externalTrafficPolicy. What we are having a hard time grappling with is why the outage did not recover until we re-created the Ingress.
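For anyone hitting the same thing, one way to inspect the situation during such an outage is to ask the load balancer directly which instances each Backend Service considers healthy; the backend service name below is a made-up example of the `k8s-be-<nodeport>--<uid>` pattern the controller generates:

```shell
# List the controller-created backend services, then query one of them.
gcloud compute backend-services list --global
gcloud compute backend-services get-health k8s-be-30000--abcdef12 --global
```

In our case this is where we saw both Backend Services report 0/0 healthy, even after the deleted nodes were gone.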

Cheers!

Metadata

Labels

kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
