Overview

AKS upgrades can be risky if done manually — a wrong sequence, missing a node pool, or skipping workload validation can cause real downtime. This post walks through the GitHub Actions pipeline I built at Awan Infotech to handle AKS version upgrades safely and repeatably.

The Problem

When you manage multiple AKS clusters across environments, keeping Kubernetes versions current is non-trivial. Microsoft drops support for old minor versions, and manual upgrades are error-prone — especially when you have stateful workloads, Persistent Volumes, and custom ingress configurations.

The goals for this automation were:

Pipeline Structure

The workflow is split into three jobs: pre-check, upgrade, and validate.

name: AKS Zero-Downtime Upgrade

on:
  workflow_dispatch:
    inputs:
      cluster_name:
        required: true
      resource_group:
        required: true
      target_version:
        required: true

jobs:
  pre-check:
    runs-on: ubuntu-latest
    steps:
      - name: Azure Login
        uses: azure/login@v1
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}

      - name: Check current version
        run: |
          az aks show \
            --name ${{ inputs.cluster_name }} \
            --resource-group ${{ inputs.resource_group }} \
            --query kubernetesVersion -o tsv

Control Plane Upgrade

The control plane upgrade runs first. Azure handles this gracefully — the API server is temporarily unavailable during the upgrade window, but running workloads are unaffected.

      - name: Upgrade control plane
        run: |
          az aks upgrade \
            --name ${{ inputs.cluster_name }} \
            --resource-group ${{ inputs.resource_group }} \
            --kubernetes-version ${{ inputs.target_version }} \
            --control-plane-only \
            --yes

Node Pool Upgrade with Surge

Node pools are upgraded next using a max-surge setting of 1. This ensures one extra node is provisioned before an old node is drained, keeping capacity stable throughout.

      - name: Upgrade node pools
        run: |
          POOLS=$(az aks nodepool list \
            --cluster-name ${{ inputs.cluster_name }} \
            --resource-group ${{ inputs.resource_group }} \
            --query "[].name" -o tsv)

          for pool in $POOLS; do
            az aks nodepool upgrade \
              --cluster-name ${{ inputs.cluster_name }} \
              --resource-group ${{ inputs.resource_group }} \
              --name $pool \
              --kubernetes-version ${{ inputs.target_version }} \
              --max-surge 1 \
              --no-wait
          done

Workload Validation

After the upgrade, the pipeline waits for all Deployments and StatefulSets to report healthy before marking the run as successful.

  validate:
    needs: upgrade
    runs-on: ubuntu-latest
    steps:
      - name: Set AKS context
        uses: azure/aks-set-context@v3
        with:
          cluster-name: ${{ inputs.cluster_name }}
          resource-group: ${{ inputs.resource_group }}

      - name: Check all pods running
        run: |
          kubectl wait --for=condition=available \
            deployment --all \
            --namespace default \
            --timeout=300s

Results

After deploying this pipeline across our AKS clusters, we reduced upgrade-related incidents to zero over three upgrade cycles. The automated validation step caught one case where a Deployment failed to roll out after a node pool upgrade — the pipeline halted and alerted the team before users were affected.

The key insight: treat AKS upgrades the same as application deployments — automate, validate, and have a clear rollback path.

Next Steps