Zero-Downtime AKS Upgrades with GitHub Actions

Overview

AKS upgrades can be risky if done manually — a wrong sequence, missing a node pool, or skipping workload validation can cause real downtime. This post walks through the GitHub Actions pipeline I built at Awan Infotech to handle AKS version upgrades safely and repeatably.

The Problem

When you manage multiple AKS clusters across environments, keeping Kubernetes versions current is non-trivial. Microsoft drops support for old minor versions, and manual upgrades are error-prone — especially when you have stateful workloads, Persistent Volumes, and custom ingress configurations.

The goals for this automation were:

Upgrade the control plane first, then node pools in order
Validate all Pods are healthy before proceeding
Send Teams notifications at every stage
Support rollback if validation fails

Pipeline Structure

The workflow is split into three jobs: pre-check, upgrade, and validate.

name: AKS Zero-Downtime Upgrade

on:
  workflow_dispatch:
    inputs:
      cluster_name:
        required: true
      resource_group:
        required: true
      target_version:
        required: true

jobs:
  pre-check:
    runs-on: ubuntu-latest
    steps:
      - name: Azure Login
        uses: azure/login@v1
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}

      - name: Check current version
        run: |
          az aks show \
            --name ${{ inputs.cluster_name }} \
            --resource-group ${{ inputs.resource_group }} \
            --query kubernetesVersion -o tsv

Control Plane Upgrade

The control plane upgrade runs first. Azure handles this gracefully — the API server is temporarily unavailable during the upgrade window, but running workloads are unaffected.

      - name: Upgrade control plane
        run: |
          az aks upgrade \
            --name ${{ inputs.cluster_name }} \
            --resource-group ${{ inputs.resource_group }} \
            --kubernetes-version ${{ inputs.target_version }} \
            --control-plane-only \
            --yes

Node Pool Upgrade with Surge

Node pools are upgraded next using a max-surge setting of 1. This ensures one extra node is provisioned before an old node is drained, keeping capacity stable throughout.

      - name: Upgrade node pools
        run: |
          POOLS=$(az aks nodepool list \
            --cluster-name ${{ inputs.cluster_name }} \
            --resource-group ${{ inputs.resource_group }} \
            --query "[].name" -o tsv)

          for pool in $POOLS; do
            az aks nodepool upgrade \
              --cluster-name ${{ inputs.cluster_name }} \
              --resource-group ${{ inputs.resource_group }} \
              --name $pool \
              --kubernetes-version ${{ inputs.target_version }} \
              --max-surge 1 \
              --no-wait
          done

Workload Validation

After the upgrade, the pipeline waits for all Deployments and StatefulSets to report healthy before marking the run as successful.

  validate:
    needs: upgrade
    runs-on: ubuntu-latest
    steps:
      - name: Set AKS context
        uses: azure/aks-set-context@v3
        with:
          cluster-name: ${{ inputs.cluster_name }}
          resource-group: ${{ inputs.resource_group }}

      - name: Check all pods running
        run: |
          kubectl wait --for=condition=available \
            deployment --all \
            --namespace default \
            --timeout=300s

Results

After deploying this pipeline across our AKS clusters, we reduced upgrade-related incidents to zero over three upgrade cycles. The automated validation step caught one case where a Deployment failed to roll out after a node pool upgrade — the pipeline halted and alerted the team before users were affected.

The key insight: treat AKS upgrades the same as application deployments — automate, validate, and have a clear rollback path.

Next Steps

Add Prometheus-based health checks (not just kubectl wait)
Integrate with Azure Policy to block deploys if version is out of support
Add multi-cluster fan-out for upgrading staging before production automatically