EKS Cluster Upgrades Without Drama

Introduction

EKS upgrades are rarely “just bump the version”. Most outages happen in the gaps: version skew, managed add-ons drifting, node rollouts that evict too aggressively, and workloads that were never tested against the next Kubernetes minor.

This post lays out a repeatable approach to upgrading EKS with minimal risk, using:

  • Managed add-ons
  • A deliberate version skew policy
  • Safe nodegroup upgrades
  • PDBs + disruption controls

It’s opinionated, but it’s designed to be practical.

1. Know what changes with the EKS upgrade

An EKS “cluster version upgrade” impacts at least four layers:

  1. Control plane — EKS-managed Kubernetes API server, etcd, etc.
  2. Worker nodes — AMI, kubelet, container runtime, CNI interactions
  3. Core add-ons — VPC CNI, CoreDNS, kube-proxy, CSI drivers
  4. Your workloads — APIs, admission controllers, CRDs, controllers, Pod security

Most incidents are triggered by #3 and #4, not the control plane itself.

2. Establish a version skew policy (and enforce it)

Kubernetes supports some skew between components, but you should treat skew as a temporary state.

A safe, simple policy

  • Upgrade one minor version at a time (e.g. 1.29 → 1.30 → 1.31)
  • Keep kubelets (nodes) within one minor version of the control plane
  • Upgrade add-ons in a controlled order (next section)
  • Maintain a “supported matrix” for cluster add-ons and critical controllers

Practical rule

Don’t let nodegroups lag for weeks after the control plane upgrade. Skew is where you get weird networking and DNS behavior that’s painful to diagnose.
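A quick way to see your current skew (assuming kubectl access and the AWS CLI; the cluster name my-cluster is a placeholder):

```shell
# Control plane version as reported by EKS
aws eks describe-cluster --name my-cluster \
  --query 'cluster.version' --output text

# kubelet version per node -- flag anything more than one minor behind
kubectl get nodes -o custom-columns='NODE:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion'
```

Running this before and after each upgrade step makes skew visible instead of something you discover during an incident.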

3. Add-on strategy: managed add-ons are worth it

EKS managed add-ons reduce toil, but if you leave them on “latest”, you introduce uncontrolled change.

Important add-ons

  • Amazon VPC CNI — networking
  • CoreDNS — cluster DNS
  • kube-proxy — service networking
  • EBS CSI driver — storage (and EFS CSI if you use it)

Two rules of thumb:

  • Pin add-on versions (in Terraform) and update intentionally.
  • Prefer managed add-ons for the AWS-owned components above unless you have a strong reason not to.
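Whether you pin in Terraform or with the CLI, the workflow is the same: discover the versions compatible with your target Kubernetes version, then set one explicitly. A sketch with the AWS CLI (cluster name and the add-on version shown are placeholders; check the discovery output for real values):

```shell
# List VPC CNI versions compatible with the target Kubernetes version
aws eks describe-addon-versions \
  --addon-name vpc-cni \
  --kubernetes-version 1.31 \
  --query 'addons[].addonVersions[].addonVersion'

# Pin the add-on to an explicit version; PRESERVE keeps local config changes
aws eks update-addon \
  --cluster-name my-cluster \
  --addon-name vpc-cni \
  --addon-version v1.18.3-eksbuild.1 \
  --resolve-conflicts PRESERVE
```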

Upgrade order (common stable sequence)

  1. VPC CNI
  2. CoreDNS
  3. kube-proxy
  4. CSI drivers (EBS/EFS)

Why this order works: CNI and DNS issues cause the biggest blast radius. kube-proxy compatibility matters once you start moving kubelets.

Tip: Do add-on upgrades in non-prod first and keep notes. These are repeatable changes.

4. Pre-flight checks: catch most failures before the upgrade

Before upgrading anything, run checks that expose “future breakage”:

A) Kubernetes API deprecations

Identify removed APIs your workloads still use. This is especially important for older clusters or legacy manifests/Helm charts.

B) CRDs and controllers

CRDs often outlive their controllers. Ensure:

  • Controllers are compatible with the target Kubernetes version
  • Any admission webhooks and policy engines are tested

C) Pod Disruption Budgets and eviction behavior

During node upgrades, your workloads will be evicted. If PDBs are too strict, your rollout stalls. If they’re too loose, availability suffers.

Validate:

  • PDBs exist for critical apps
  • PDB minAvailable is realistic given replica counts
  • You have enough capacity to reschedule pods during drain
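Two quick checks that surface most PDB problems (the second assumes jq is installed):

```shell
# PDBs across the cluster: ALLOWED DISRUPTIONS of 0 means drains will stall there
kubectl get pdb -A

# Single-replica Deployments -- a PDB can't protect these during drains
kubectl get deploy -A -o json | jq -r \
  '.items[] | select(.spec.replicas == 1) | "\(.metadata.namespace)/\(.metadata.name)"'
```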

D) Cluster autoscaler / Karpenter

If you use Karpenter or Cluster Autoscaler:

  • Ensure it supports the target cluster version
  • Validate node provisioning logic (taints, labels, instance types)

Tools that help: EKS Upgrade Insights and kubent

  • EKS Upgrade Insights (AWS Console or CLI) — Shows upgrade readiness for your cluster: incompatible add-ons, deprecated APIs in use, and version compatibility for managed add-ons. Run it when you pick a target version so you see control-plane and add-on issues in one place.
  • kubent (Kubernetes No Trouble) — Scans the cluster for resources still using deprecated or removed Kubernetes APIs. Run kubent against your cluster and fix any reported resources before upgrading. It directly addresses check (A) and is quick to add to a pre-upgrade pipeline.
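A minimal kubent invocation for a pre-upgrade pipeline (assuming kubent is installed and your kubeconfig points at the cluster; the target version is a placeholder):

```shell
# Scan the live cluster for deprecated/removed APIs relative to the target version;
# --exit-error returns a non-zero exit code when findings exist, which is CI-friendly
kubent --target-version 1.31 --exit-error
```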

5. Make node upgrades boring: blue/green nodegroups

Control plane upgrades are usually quick and low impact. Node upgrades are where you can hurt yourself.

Preferred: blue/green nodegroups

Instead of “in-place” upgrades, create a new nodegroup with:

  • New AMI
  • Desired kubelet version
  • Same labels/taints (or intentionally different for migration)

Then migrate:

  1. Cordon/drain old nodes gradually
  2. Observe workload health
  3. Scale down the old group when stable

This gives you clean rollback: if something fails, scale the old group back up and stop draining.
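The migration step can be sketched with kubectl, using the eks.amazonaws.com/nodegroup label that EKS applies to managed nodes (the nodegroup name blue is a placeholder):

```shell
# Stop new pods from landing on the old nodegroup
for node in $(kubectl get nodes -l eks.amazonaws.com/nodegroup=blue -o name); do
  kubectl cordon "$node"
done

# Rollback is the same loop with: kubectl uncordon "$node"
```

Cordoning first, then draining node by node, keeps the rollback path open until you scale the old group down.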

Draining discipline that avoids outages

  • Drain in small batches (1 node at a time for critical clusters)
  • Respect PDBs
  • Set sane timeouts
  • Ensure daemonsets are handled correctly
  • Watch for “stuck terminating” pods (often due to finalizers or PDB deadlocks)
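Those rules translate into a conservative drain invocation like this (the node name is a placeholder; tune grace period and timeout to your workloads):

```shell
# Drain one node with conservative settings; repeat per node,
# with health checks between nodes
kubectl drain ip-10-0-12-34.ec2.internal \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=60 \
  --timeout=10m
```

If the drain times out, investigate rather than force it: a stuck pod usually means a finalizer or a PDB with zero allowed disruptions.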

6. PDBs and disruption budgets: how to not self-DoS during drains

A stable node rollout requires surge capacity. You need enough spare resources to reschedule pods:

  • Add temporary capacity (scale nodegroup up) or
  • Use a surge nodegroup or
  • Allow Karpenter to provision extra nodes during drain

Practical guidance for PDBs

  • For critical services, aim for ≥ 2 replicas and minAvailable: 1 (or a percentage)
  • For single-replica workloads, a PDB can’t create availability: either add replicas or accept downtime during drains

Also check:

  • Topology spread constraints and anti-affinity rules (these can block rescheduling if your cluster is small)
  • Pod priority classes (ensure important workloads get scheduled first)
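A minimal PDB for a multi-replica service might look like this (names, namespace, and labels are placeholders):

```shell
kubectl apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-pdb
  namespace: prod
spec:
  minAvailable: 1        # with >= 2 replicas, drains can always evict one pod
  selector:
    matchLabels:
      app: checkout
EOF
```

minAvailable: 1 with two or more replicas is the sweet spot: drains make progress, and the service never drops to zero.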

7. Observability during upgrades: the signals that matter

During upgrades, dashboards should answer:

  • Are pods scheduling successfully?
  • Are we seeing DNS failures?
  • Are we seeing elevated 5xx or latency?
  • Are nodes healthy and draining cleanly?

High-signal things to watch

  • CoreDNS errors / latency
  • VPC CNI errors, IP exhaustion signals
  • Pending pods count (scheduling pressure)
  • Node not-ready events
  • Ingress 5xx / p95 latency
  • Application-specific error rate and throughput

Tip: Have a dedicated upgrade dashboard ready. Don’t rely on hunting logs mid-upgrade.

8. Rollback strategy: know what’s reversible

Not all parts are equally reversible:

  • Nodegroup blue/green — highly reversible (scale the old group up, the new one down)
  • Workload manifest changes — reversible via Git revert
  • Add-on version changes — usually reversible (but test downgrade paths)
  • Control plane version — generally not something you can casually roll back

This is why the plan emphasizes pre-flight checks, add-on testing in non-prod, and blue/green nodegroups: together they give you rollback leverage where it matters most.

A sample “upgrade runbook” you can reuse

Here’s a condensed checklist.

1. Pre-flight
- [ ] Confirm target version support for: Argo CD, ingress controller, cert-manager, CSI, autoscaling components
- [ ] Scan for deprecated APIs
- [ ] Validate PDBs and replica counts
- [ ] Ensure capacity headroom (or surge plan)
- [ ] Confirm backup/restore posture for critical stateful systems

2. Execute
- [ ] Upgrade managed add-ons (pinned versions)
- [ ] Upgrade platform controllers/CRDs via GitOps
- [ ] Upgrade EKS control plane
- [ ] Create new nodegroup(s)
- [ ] Drain old nodes gradually with monitoring gates
- [ ] Validate SLOs and core signals throughout

3. Close
- [ ] Decommission old nodegroups
- [ ] Update documentation: versions, issues, fixes
- [ ] Add regression checks for next upgrade cycle

Closing: upgrades become routine when you treat them like a product

The clusters that upgrade smoothly aren’t “lucky”. They have:

  • A version skew policy
  • Pinned add-on versions
  • A repeatable node migration pattern
