EKS Cluster Upgrades Without Drama
- Priyankar Prasad
- December 23, 2025
- 6 mins
- Infrastructure
- aws eks kubernetes terraform
Introduction
EKS upgrades are rarely “just bump the version”. Most outages happen in the gaps: version skew, managed add-ons drifting, node rollouts that evict too aggressively, and workloads that were never tested against the next Kubernetes minor.
This post lays out a repeatable approach to upgrading EKS with minimal risk, using:
- Managed add-ons
- A deliberate version skew policy
- Safe nodegroup upgrades
- PDBs + disruption controls
It’s opinionated, but it’s designed to be practical.
1. Know what changes with the EKS upgrade
An EKS “cluster version upgrade” impacts at least four layers:
- Control plane — EKS-managed Kubernetes API server, etcd, etc.
- Worker nodes — AMI, kubelet, container runtime, CNI interactions
- Core add-ons — VPC CNI, CoreDNS, kube-proxy, CSI drivers
- Your workloads — APIs, admission controllers, CRDs, controllers, Pod security
Most incidents are triggered by the last two layers (add-ons and your workloads), not the control plane itself.
2. Establish a version skew policy (and enforce it)
Kubernetes supports some skew between components, but you should treat skew as a temporary state.
A safe, simple policy
- Upgrade one minor version at a time (e.g. 1.29 → 1.30 → 1.31)
- Keep kubelet(nodes) within one minor of the control plane
- Upgrade add-ons in a controlled order (next section)
- Maintain a “supported matrix” for cluster add-ons and critical controllers
Practical rule
Don’t let nodegroups lag for weeks after the control plane upgrade. Skew is where you get weird networking and DNS behavior that’s painful to diagnose.
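One way to make the skew policy enforceable is to pin the control plane version in code, so a version bump is a reviewed change rather than a console click. A minimal Terraform sketch, with the cluster name, role, and subnet variables as placeholders:

```hcl
# Hypothetical names: "prod" cluster, var.cluster_role_arn, var.private_subnet_ids.
resource "aws_eks_cluster" "prod" {
  name     = "prod"
  role_arn = var.cluster_role_arn

  # Pin the minor version explicitly. Bump exactly one minor per change
  # (e.g. "1.29" -> "1.30") and roll nodegroups before the next bump.
  version = "1.30"

  vpc_config {
    subnet_ids = var.private_subnet_ids
  }
}
```

With the version in Git, the skew policy becomes reviewable: a PR that jumps two minors, or one that bumps the control plane while nodegroups are still behind, is visible at review time.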
3. Add-on strategy: managed add-ons are worth it
EKS managed add-ons reduce toil, but if you leave them on “whatever latest”, you introduce uncontrolled change.
Important Add-ons
- Amazon VPC CNI — networking
- CoreDNS — cluster DNS
- kube-proxy — service networking
- EBS CSI driver — storage (and EFS CSI if you use it)
Recommended approach
- Pin add-on versions (in Terraform) and update intentionally.
- Prefer managed add-ons for the AWS-owned components above unless you have a strong reason not to.
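Pinning a managed add-on in Terraform can look like the sketch below. The version string is an example only; list valid versions for your cluster with `aws eks describe-addon-versions`, and the cluster resource name is a placeholder:

```hcl
resource "aws_eks_addon" "coredns" {
  cluster_name = aws_eks_cluster.prod.name  # assumes a cluster resource named "prod"
  addon_name   = "coredns"

  # Pin an explicit version and update it deliberately, never "latest".
  # Example value only; check `aws eks describe-addon-versions` for your cluster.
  addon_version = "v1.11.3-eksbuild.1"

  # Take ownership of drifted fields when applying the pinned version.
  resolve_conflicts_on_update = "OVERWRITE"
}
```

The same pattern applies to the VPC CNI, kube-proxy, and CSI driver add-ons: one `aws_eks_addon` resource each, with the version bumped in its own commit so rollback is a Git revert.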
Upgrade order (common stable sequence)
- VPC CNI
- CoreDNS
- kube-proxy
- CSI drivers (EBS/EFS)
Why this order works: CNI and DNS issues cause the biggest blast radius. kube-proxy compatibility matters once you start moving kubelets.
Tip: Do add-on upgrades in non-prod first and keep notes. These are repeatable changes.
4. Pre-flight checks: catch most failures before the upgrade
Before upgrading anything, run checks that expose “future breakage”:
A) Kubernetes API deprecations
Identify removed APIs your workloads still use. This is especially important for older clusters or legacy manifests/Helm charts.
B) CRDs and controllers
CRDs often outlive their controllers. Ensure:
- Controllers are compatible with the target Kubernetes version
- Any admission webhooks and policy engines are tested
C) Pod Disruption Budgets and eviction behavior
During node upgrades, your workloads will be evicted. If PDBs are too strict, your rollout stalls. If they’re too loose, availability suffers.
Validate:
- PDBs exist for critical apps
- PDB `minAvailable` is realistic given replica counts
- You have enough capacity to reschedule pods during drain
D) Cluster autoscaler / Karpenter
If you use Karpenter or Cluster Autoscaler:
- Ensure it supports the target cluster version
- Validate node provisioning logic (taints, labels, instance types)
Tools that help: EKS Upgrade Insights and kubent
- EKS Upgrade Insights (AWS Console or CLI) — Shows upgrade readiness for your cluster: incompatible add-ons, deprecated APIs in use, and version compatibility for managed add-ons. Run it when you pick a target version so you see control-plane and add-on issues in one place.
- kubent (Kube No Trouble) — Scans the cluster for resources still using deprecated or removed Kubernetes APIs. Run `kubent` against your cluster and fix any reported resources before upgrading. It directly addresses check (A) and is quick to add to a pre-upgrade pipeline.
5. Make node upgrades boring: blue/green nodegroups
Control plane upgrades are usually quick and low impact. Node upgrades are where you can hurt yourself.
Preferred: blue/green nodegroups
Instead of “in-place” upgrades, create a new nodegroup with:
- New AMI
- Desired kubelet version
- Same labels/taints (or intentionally different for migration)
Then migrate:
- Cordon/drain old nodes gradually
- Observe workload health
- Scale down the old group when stable
This gives you clean rollback: if something fails, scale the old group back up and stop draining.
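In Terraform, the blue/green pattern means adding a second nodegroup resource alongside the existing one rather than editing it in place. A sketch, with names, sizes, and variables illustrative only:

```hcl
# A "green" nodegroup created alongside the existing "blue" one.
# Resource names, sizes, and variables are placeholders.
resource "aws_eks_node_group" "green" {
  cluster_name    = aws_eks_cluster.prod.name
  node_group_name = "green-1-30"
  node_role_arn   = var.node_role_arn
  subnet_ids      = var.private_subnet_ids

  # Match the target control plane minor; optionally pin the exact AMI release.
  version         = "1.30"
  release_version = var.ami_release_version

  scaling_config {
    desired_size = 3
    min_size     = 3
    max_size     = 6
  }

  # Same labels (and taints, if any) as the old group so workloads
  # reschedule transparently as old nodes are drained.
  labels = {
    nodegroup = "green"
  }
}
```

Rollback is then just scaling: keep the old group's resource in code until the green group has been stable long enough, then delete it in a separate change.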
Draining discipline that avoids outages
- Drain in small batches (1 node at a time for critical clusters)
- Respect PDBs
- Set sane timeouts
- Ensure daemonsets are handled correctly
- Watch for “stuck terminating” pods (often due to finalizers or PDB deadlocks)
6. PDBs and disruption budgets: how to not self-DoS during drains
A stable node rollout requires surge capacity. You need enough spare resources to reschedule pods:
- Add temporary capacity (scale nodegroup up) or
- Use a surge nodegroup or
- Allow Karpenter to provision extra nodes during drain
Practical guidance for PDBs
- For critical services, aim for ≥ 2 replicas and `minAvailable: 1` (or a percentage)
- For single-replica workloads: a PDB doesn't create availability; add replicas or accept downtime
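Staying in Terraform, the two-replica case can be sketched with the Kubernetes provider's PDB resource (names and namespace are placeholders):

```hcl
resource "kubernetes_pod_disruption_budget_v1" "api" {
  metadata {
    name      = "api"
    namespace = "prod"
  }

  spec {
    # With 2 replicas, min_available = 1 lets a drain evict one pod at a
    # time while always keeping one serving. With 1 replica, the same PDB
    # would simply block the drain.
    min_available = 1

    selector {
      match_labels = {
        app = "api"
      }
    }
  }
}
```

A percentage (e.g. `"50%"`) scales better for deployments whose replica counts change with autoscaling.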
Also check:
- Topology spread constraints and anti-affinity rules (these can block rescheduling if your cluster is small)
- Pod priority classes (ensure important workloads get scheduled first)
7. Observability during upgrades: the signals that matter
During upgrades, dashboards should answer:
- Are pods scheduling successfully?
- Are we seeing DNS failures?
- Are we seeing elevated 5xx or latency?
- Are nodes healthy and draining cleanly?
High-signal things to watch
- CoreDNS errors / latency
- VPC CNI errors, IP exhaustion signals
- Pending pods count (scheduling pressure)
- Node not-ready events
- Ingress 5xx / p95 latency
- Application-specific error rate and throughput
Tip: Have a dedicated upgrade dashboard ready. Don’t rely on hunting logs mid-upgrade.
8. Rollback strategy: know what’s reversible
Not all parts are equally reversible:
| Component | Reversibility |
|---|---|
| Nodegroup blue/green | Highly reversible (scale old up, new down) |
| Workload manifest changes | Reversible via Git revert |
| Add-on version changes | Usually reversible (but test downgrade paths) |
| Control plane version | Generally not something you casually roll back |
This is why the plan emphasizes pre-flight checks, add-on testing in non-prod, and blue/green nodes: they keep a cheap rollback path open for everything except the control plane itself.
A sample “upgrade runbook” you can reuse
Here’s a condensed checklist.
1. Pre-flight
- [ ] Confirm target version support for: Argo CD, ingress controller, cert-manager, CSI, autoscaling components
- [ ] Scan for deprecated APIs
- [ ] Validate PDBs and replica counts
- [ ] Ensure capacity headroom (or surge plan)
- [ ] Confirm backup/restore posture for critical stateful systems
2. Execute
- [ ] Upgrade managed add-ons (pinned versions)
- [ ] Upgrade platform controllers/CRDs via GitOps
- [ ] Upgrade EKS control plane
- [ ] Create new nodegroup(s)
- [ ] Drain old nodes gradually with monitoring gates
- [ ] Validate SLOs and core signals throughout
3. Close
- [ ] Decommission old nodegroups
- [ ] Update documentation: versions, issues, fixes
- [ ] Add regression checks for next upgrade cycle
Closing: upgrades become routine when you treat them like a product
The clusters that upgrade smoothly aren’t “lucky”. They have:
- A version skew policy
- Pinned add-on versions
- A repeatable node migration pattern