EKS Cluster Upgrades Without Drama
- Priyankar Prasad
- December 23, 2025
- 6 mins
- Infrastructure
- aws eks kubernetes terraform
Introduction
EKS upgrades are rarely “just bump the version”. Most outages happen in the gaps: version skew, managed add-ons drifting, node rollouts that evict too aggressively, and workloads that were never tested against the next Kubernetes minor.
This post lays out a repeatable approach to upgrading EKS with minimal risk, using:
- Managed add-ons
- A deliberate version skew policy
- Safe nodegroup upgrades
- PDBs + disruption controls
It’s opinionated, but it’s designed to be practical.
1. Know what changes with the EKS upgrade
An EKS “cluster version upgrade” impacts at least four layers:
- Control plane — EKS-managed Kubernetes API server, etcd, etc.
- Worker nodes — AMI, kubelet, container runtime, CNI interactions
- Core add-ons — VPC CNI, CoreDNS, kube-proxy, CSI drivers
- Your workloads — APIs, admission controllers, CRDs, controllers, Pod security
Most incidents are triggered by the last two layers (add-ons and your workloads), not the control plane itself.
2. Establish a version skew policy (and enforce it)
Kubernetes supports some skew between components, but you should treat skew as a temporary state.
A safe, simple policy
- Upgrade one minor version at a time (e.g. 1.29 → 1.30 → 1.31)
- Keep kubelet(nodes) within one minor of the control plane
- Upgrade add-ons in a controlled order (next section)
- Maintain a “supported matrix” for cluster add-ons and critical controllers
Practical rule
Don’t let nodegroups lag for weeks after the control plane upgrade. Skew is where you get weird networking and DNS behavior that’s painful to diagnose.
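One way to make the skew policy enforceable is to pin the control plane version in code, so a version bump is a reviewed change rather than a console click. A minimal Terraform sketch, with the cluster name, role, and subnet variables as placeholders:

```hcl
# Hypothetical names: "prod" cluster, var.cluster_role_arn, var.private_subnet_ids.
resource "aws_eks_cluster" "prod" {
  name     = "prod"
  role_arn = var.cluster_role_arn

  # Pin the minor version explicitly. Bump exactly one minor per change
  # (e.g. "1.29" -> "1.30") and roll nodegroups before the next bump.
  version = "1.30"

  vpc_config {
    subnet_ids = var.private_subnet_ids
  }
}
```

With the version in Git, the skew policy becomes reviewable: a PR that jumps two minors, or one that bumps the control plane while nodegroups are still behind, is visible at review time.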
3. Add-on strategy: managed add-ons are worth it
EKS managed add-ons reduce toil, but if you leave them on “whatever latest”, you introduce uncontrolled change.
Important Add-ons
- Amazon VPC CNI — networking
- CoreDNS — cluster DNS
- kube-proxy — service networking
- EBS CSI driver — storage (and EFS CSI if you use it)
Recommended approach
- Pin add-on versions (in Terraform) and update intentionally.
- Prefer managed add-ons for the AWS-owned components above unless you have a strong reason not to.
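Pinning a managed add-on in Terraform can look like the sketch below. The version string is an example only; list valid versions for your cluster with `aws eks describe-addon-versions`, and the cluster resource name is a placeholder:

```hcl
resource "aws_eks_addon" "coredns" {
  cluster_name = aws_eks_cluster.prod.name  # assumes a cluster resource named "prod"
  addon_name   = "coredns"

  # Pin an explicit version and update it deliberately, never "latest".
  # Example value only; check `aws eks describe-addon-versions` for your cluster.
  addon_version = "v1.11.3-eksbuild.1"

  # Take ownership of drifted fields when applying the pinned version.
  resolve_conflicts_on_update = "OVERWRITE"
}
```

The same pattern applies to the VPC CNI, kube-proxy, and CSI driver add-ons: one `aws_eks_addon` resource each, with the version bumped in its own commit so rollback is a Git revert.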
Upgrade order (common stable sequence)
- VPC CNI
- CoreDNS
- kube-proxy
- CSI drivers (EBS/EFS)
Why this order works: CNI and DNS issues cause the biggest blast radius. kube-proxy compatibility matters once you start moving kubelets.
Tip: Do add-on upgrades in non-prod first and keep notes. These are repeatable changes.
4. Pre-flight checks: catch most failures before the upgrade
Before upgrading anything, run checks that expose “future breakage”:
A) Kubernetes API deprecations
Identify removed APIs your workloads still use. This is especially important for older clusters or legacy manifests/Helm charts.
B) CRDs and controllers
CRDs often outlive their controllers. Ensure:
- Controllers are compatible with the target Kubernetes version
- Any admission webhooks and policy engines are tested
C) Pod Disruption Budgets and eviction behavior
During node upgrades, your workloads will be evicted. If PDBs are too strict, your rollout stalls. If they’re too loose, availability suffers.
Validate:
- PDBs exist for critical apps
- PDB `minAvailable` is realistic given replica counts
- You have enough capacity to reschedule pods during drain
D) Cluster autoscaler / Karpenter
If you use Karpenter or Cluster Autoscaler:
- Ensure it supports the target cluster version
- Validate node provisioning logic (taints, labels, instance types)
Tools that help: EKS Upgrade Insights and kubent
- EKS Upgrade Insights (AWS Console or CLI) — Shows upgrade readiness for your cluster: incompatible add-ons, deprecated APIs in use, and version compatibility for managed add-ons. Run it when you pick a target version so you see control-plane and add-on issues in one place.
- kubent (Kube No Trouble) — Scans the cluster for resources still using deprecated or removed Kubernetes APIs. Run `kubent` against your cluster and fix any reported resources before upgrading. It directly addresses check (A) and is quick to add to a pre-upgrade pipeline.
5. Make node upgrades boring: blue/green nodegroups
Control plane upgrades are usually quick and low impact. Node upgrades are where you can hurt yourself.
Preferred: blue/green nodegroups
Instead of “in-place” upgrades, create a new nodegroup with:
- New AMI
- Desired kubelet version
- Same labels/taints (or intentionally different for migration)
Then migrate:
- Cordon/drain old nodes gradually
- Observe workload health
- Scale down the old group when stable
This gives you clean rollback: if something fails, scale the old group back up and stop draining.
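In Terraform, the blue/green pattern means adding a second nodegroup resource alongside the existing one rather than editing it in place. A sketch, with names, sizes, and variables illustrative only:

```hcl
# A "green" nodegroup created alongside the existing "blue" one.
# Resource names, sizes, and variables are placeholders.
resource "aws_eks_node_group" "green" {
  cluster_name    = aws_eks_cluster.prod.name
  node_group_name = "green-1-30"
  node_role_arn   = var.node_role_arn
  subnet_ids      = var.private_subnet_ids

  # Match the target control plane minor; optionally pin the exact AMI release.
  version         = "1.30"
  release_version = var.ami_release_version

  scaling_config {
    desired_size = 3
    min_size     = 3
    max_size     = 6
  }

  # Same labels (and taints, if any) as the old group so workloads
  # reschedule transparently as old nodes are drained.
  labels = {
    nodegroup = "green"
  }
}
```

Rollback is then just scaling: keep the old group's resource in code until the green group has been stable long enough, then delete it in a separate change.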
Draining discipline that avoids outages
- Drain in small batches (1 node at a time for critical clusters)
- Respect PDBs
- Set sane timeouts
- Ensure daemonsets are handled correctly
- Watch for “stuck terminating” pods (often due to finalizers or PDB deadlocks)
6. PDBs and disruption budgets: how to not self-DoS during drains
A stable node rollout requires surge capacity. You need enough spare resources to reschedule pods:
- Add temporary capacity (scale nodegroup up) or
- Use a surge nodegroup or
- Allow Karpenter to provision extra nodes during drain
Practical guidance for PDBs
- For critical services, aim for ≥ 2 replicas and `minAvailable: 1` (or a percentage)
- For single-replica workloads: a PDB doesn't create availability; add replicas or accept downtime
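Staying in Terraform, the two-replica case can be sketched with the Kubernetes provider's PDB resource (names and namespace are placeholders):

```hcl
resource "kubernetes_pod_disruption_budget_v1" "api" {
  metadata {
    name      = "api"
    namespace = "prod"
  }

  spec {
    # With 2 replicas, min_available = 1 lets a drain evict one pod at a
    # time while always keeping one serving. With 1 replica, the same PDB
    # would simply block the drain.
    min_available = 1

    selector {
      match_labels = {
        app = "api"
      }
    }
  }
}
```

A percentage (e.g. `"50%"`) scales better for deployments whose replica counts change with autoscaling.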
Also check:
- Topology spread constraints and anti-affinity rules (these can block rescheduling if your cluster is small)
- Pod priority classes (ensure important workloads get scheduled first)
7. Observability during upgrades: the signals that matter
During upgrades, dashboards should answer:
- Are pods scheduling successfully?
- Are we seeing DNS failures?
- Are we seeing elevated 5xx or latency?
- Are nodes healthy and draining cleanly?
High-signal things to watch
- CoreDNS errors / latency
- VPC CNI errors, IP exhaustion signals
- Pending pods count (scheduling pressure)
- Node not-ready events
- Ingress 5xx / p95 latency
- Application-specific error rate and throughput
Tip: Have a dedicated upgrade dashboard ready. Don’t rely on hunting logs mid-upgrade.
8. Rollback strategy: know what’s reversible
Not all parts are equally reversible:
| Component | Reversibility |
|---|---|
| Nodegroup blue/green | Highly reversible (scale old up, new down) |
| Workload manifest changes | Reversible via Git revert |
| Add-on version changes | Usually reversible (but test downgrade paths) |
| Control plane version | Generally not something you casually roll back |
This is why the plan emphasizes pre-flight checks, add-on testing in non-prod, and blue/green nodes: they keep a cheap rollback path open for everything except the control plane itself.
A sample “upgrade runbook” you can reuse
Here’s a condensed checklist.
1. Pre-flight
- [ ] Confirm target version support for: Argo CD, ingress controller, cert-manager, CSI, autoscaling components
- [ ] Scan for deprecated APIs
- [ ] Validate PDBs and replica counts
- [ ] Ensure capacity headroom (or surge plan)
- [ ] Confirm backup/restore posture for critical stateful systems
2. Execute
- [ ] Upgrade managed add-ons (pinned versions)
- [ ] Upgrade platform controllers/CRDs via GitOps
- [ ] Upgrade EKS control plane
- [ ] Create new nodegroup(s)
- [ ] Drain old nodes gradually with monitoring gates
- [ ] Validate SLOs and core signals throughout
3. Close
- [ ] Decommission old nodegroups
- [ ] Update documentation: versions, issues, fixes
- [ ] Add regression checks for next upgrade cycle
Closing: upgrades become routine when you treat them like a product
The clusters that upgrade smoothly aren’t “lucky”. They have:
- A version skew policy
- Pinned add-on versions
- A repeatable node migration pattern