Overview
AKS upgrades can be risky if done manually — a wrong sequence, missing a node pool, or skipping workload validation can cause real downtime. This post walks through the GitHub Actions pipeline I built at Awan Infotech to handle AKS version upgrades safely and repeatably.
The Problem
When you manage multiple AKS clusters across environments, keeping Kubernetes versions current is non-trivial. Microsoft drops support for old minor versions, and manual upgrades are error-prone — especially when you have stateful workloads, Persistent Volumes, and custom ingress configurations.
The goals for this automation were:
- Upgrade the control plane first, then node pools in order
- Validate all Pods are healthy before proceeding
- Send Teams notifications at every stage
- Support rollback if validation fails
Pipeline Structure
The workflow is split into three jobs: pre-check, upgrade, and validate.
name: AKS Zero-Downtime Upgrade
on:
workflow_dispatch:
inputs:
cluster_name:
required: true
resource_group:
required: true
target_version:
required: true
jobs:
pre-check:
runs-on: ubuntu-latest
steps:
- name: Azure Login
uses: azure/login@v1
with:
creds: ${{ secrets.AZURE_CREDENTIALS }}
- name: Check current version
run: |
az aks show \
--name ${{ inputs.cluster_name }} \
--resource-group ${{ inputs.resource_group }} \
--query kubernetesVersion -o tsv
Control Plane Upgrade
The control plane upgrade runs first. Azure handles this gracefully — the API server is temporarily unavailable during the upgrade window, but running workloads are unaffected.
- name: Upgrade control plane
run: |
az aks upgrade \
--name ${{ inputs.cluster_name }} \
--resource-group ${{ inputs.resource_group }} \
--kubernetes-version ${{ inputs.target_version }} \
--control-plane-only \
--yes
Node Pool Upgrade with Surge
Node pools are upgraded next using a max-surge setting of 1. This ensures one extra node is provisioned before an old node is drained, keeping capacity stable throughout.
- name: Upgrade node pools
run: |
POOLS=$(az aks nodepool list \
--cluster-name ${{ inputs.cluster_name }} \
--resource-group ${{ inputs.resource_group }} \
--query "[].name" -o tsv)
for pool in $POOLS; do
az aks nodepool upgrade \
--cluster-name ${{ inputs.cluster_name }} \
--resource-group ${{ inputs.resource_group }} \
--name $pool \
--kubernetes-version ${{ inputs.target_version }} \
--max-surge 1 \
--no-wait
done
Workload Validation
After the upgrade, the pipeline waits for all Deployments and StatefulSets to report healthy before marking the run as successful.
validate:
needs: upgrade
runs-on: ubuntu-latest
steps:
- name: Set AKS context
uses: azure/aks-set-context@v3
with:
cluster-name: ${{ inputs.cluster_name }}
resource-group: ${{ inputs.resource_group }}
- name: Check all pods running
run: |
kubectl wait --for=condition=available \
deployment --all \
--namespace default \
--timeout=300s
Results
After deploying this pipeline across our AKS clusters, we reduced upgrade-related incidents to zero over three upgrade cycles. The automated validation step caught one case where a Deployment failed to roll out after a node pool upgrade — the pipeline halted and alerted the team before users were affected.
The key insight: treat AKS upgrades the same as application deployments — automate, validate, and have a clear rollback path.
Next Steps
- Add Prometheus-based health checks (not just
kubectl wait) - Integrate with Azure Policy to block deploys if version is out of support
- Add multi-cluster fan-out for upgrading staging before production automatically