Removing Node Pool from Cluster Breaks Ingress Controller #649

Description

@dacox

Hi everyone,

We're currently on 1.11.6-gke.11. We've had some issues with gce-ingress in the past, and after a recent outage we are trying to dig into the root cause.

We currently have a handful of Node Pools and use an Ingress to map traffic to two services in our cluster. This created a Cloud Load Balancer with two Backend Services, each with its own health check. Both Backend Services point to the same Instance Group, which is also created by the controller (am I getting this right?).
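For context, a minimal sketch of what such an Ingress looks like; the service names, paths, and ports here are illustrative placeholders, not our actual manifests (`extensions/v1beta1` was the Ingress API group current on GKE 1.11):

```yaml
# Hypothetical Ingress mapping two services; names/paths are placeholders.
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: myingress
spec:
  rules:
    - http:
        paths:
          - path: /api/*
            backend:
              serviceName: service-a   # illustrative name
              servicePort: 80
          - path: /web/*
            backend:
              serviceName: service-b   # illustrative name
              servicePort: 80
```

The gce-ingress controller turns each referenced Service into one Backend Service plus health check on the Cloud Load Balancer, which matches the two-Backend-Service setup described above.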

As I understand it, because our Services have externalTrafficPolicy=Cluster (the default), both Backend Services show all nodes in the Cluster as healthy, not just the nodes actually running our Services' Pods.
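A sketch of where that setting lives, with hypothetical names and ports. Under `Cluster`, kube-proxy on every node forwards load-balancer traffic to the backing Pods (possibly on another node), so the LB health check succeeds on every node; under `Local`, only nodes with local Pods pass the check:

```yaml
# Illustrative Service; name, selector, and ports are placeholders.
apiVersion: v1
kind: Service
metadata:
  name: service-a
spec:
  type: NodePort
  externalTrafficPolicy: Cluster  # default; "Local" limits to nodes with Pods
  selector:
    app: service-a
  ports:
    - port: 80
      targetPort: 8080
```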

We recently removed a Node Pool from the Cluster. To do this, we cordoned each node in the pool and then drained them. A few hours later we removed the node pool.
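The drain procedure we followed can be sketched roughly like this; the node pool label value is a placeholder, and the flags are those current on 1.11-era kubectl (the GKE node pool label `cloud.google.com/gke-nodepool` is standard):

```shell
# Sketch of the cordon-then-drain procedure; "old-pool" is hypothetical.
# First mark every node in the pool unschedulable...
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=old-pool -o name); do
  kubectl cordon "$node"
done

# ...then evict the Pods running on them.
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=old-pool -o name); do
  kubectl drain "$node" --ignore-daemonsets --delete-local-data
done
```

After the Pods were rescheduled elsewhere, we deleted the node pool itself a few hours later.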

Immediately upon deleting the pool, we experienced a large amount of downtime. It did not seem to be resolved when the last node went offline. At one point, we saw that both Backend Services reported 0/0 as healthy.

After about half an hour, we were able to put out the fire by running `kubectl delete ingress myingress` and then re-creating the Ingress.

We now understand that we would get some 502s while the nodes were being deleted, due to our externalTrafficPolicy. What we are having a hard time grappling with is why the outage did not recover until we re-created the Ingress.
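For anyone hitting the same thing, one way to inspect the situation during such an outage is to ask the load balancer directly which instances each Backend Service considers healthy; the backend service name below is a made-up example of the `k8s-be-<nodeport>--<uid>` pattern the controller generates:

```shell
# List the controller-created backend services, then query one of them.
gcloud compute backend-services list --global
gcloud compute backend-services get-health k8s-be-30000--abcdef12 --global
```

In our case this is where we saw both Backend Services report 0/0 healthy, even after the deleted nodes were gone.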

Cheers!

Metadata

Labels

kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
