AKS can't pull from ACR registry when using custom vnet + subnet

I have an AKS cluster that needs to pull from an ACR created alongside it. Pretty standard stuff. It works with solutions like this when I don't specify any subnet for the default_node_pool, but I want all my node pools to live in a custom vnet/subnet pair (possibly even different subnets for different node pools; I sketch what I mean below the main.tf).

When I change this, I get a 401 Unauthorized whenever the cluster tries to pull an image from the registry.

I'll paste a minimal main.tf for reference.

resource "azurerm_virtual_network" "vnet" {
  name                = "${var.project}-vnet"
  location            = data.azurerm_resource_group.rg.location
  resource_group_name = data.azurerm_resource_group.rg.name
  address_space       = ["10.160.0.0/20"]
}

resource "azurerm_subnet" "cluster_subnet" {
  name                 = "${var.project}-cluster-subnet"
  resource_group_name  = data.azurerm_resource_group.rg.name
  virtual_network_name = azurerm_virtual_network.vnet.name
  address_prefixes     = ["10.160.0.0/22"]
}

resource "azurerm_kubernetes_cluster" "aks" {
  name                = "${var.project}-cluster"
  location            = data.azurerm_resource_group.rg.location
  resource_group_name = data.azurerm_resource_group.rg.name

  dns_prefix = var.project

  default_node_pool {
    name                        = "default"
    node_count                  = 1
    vm_size                     = "Standard_D2_v2"
    zones                       = ["3"]
    temporary_name_for_rotation = "fallback"
    vnet_subnet_id              = azurerm_subnet.cluster_subnet.id
    # pod_subnet_id               = azurerm_subnet.cluster_subnet.id
  }

  network_profile {
    network_plugin = "azure"
    network_policy = "calico"
  }

  azure_active_directory_role_based_access_control {
    managed = true
  }

  # using managed identity to assign roles just to this whole cluster (e.g. image pull permissions)
  identity {
    type = "SystemAssigned"
  }
}

resource "azurerm_container_registry" "acr" {
  name                = "${var.project}9090registry"
  location            = data.azurerm_resource_group.rg.location
  resource_group_name = data.azurerm_resource_group.rg.name
  sku                 = "Basic"
  admin_enabled       = false
}

# permission to pull images from the registry
resource "azurerm_role_assignment" "kubweb_to_acr" {
  principal_id                     = azurerm_kubernetes_cluster.aks.kubelet_identity[0].object_id
  scope                            = azurerm_container_registry.acr.id
  role_definition_name             = "AcrPull"
  skip_service_principal_aad_check = true
}

This does not work. I've tried manually setting the subnet's network security group and making sure it allowed outbound access to Anywhere, but to no avail (not pasting it here since it would be too long). If possible, I'd like to solve this without creating another SP with a username and password and then having to put the secret into every namespace.
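
To sketch what I mean by different subnets per node pool, an additional pool would look roughly like this (only a sketch; the subnet and pool names are placeholders I haven't deployed yet):

resource "azurerm_subnet" "workload_subnet" {
  name                 = "${var.project}-workload-subnet"
  resource_group_name  = data.azurerm_resource_group.rg.name
  virtual_network_name = azurerm_virtual_network.vnet.name
  address_prefixes     = ["10.160.4.0/22"]
}

resource "azurerm_kubernetes_cluster_node_pool" "workload" {
  name                  = "workload"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.aks.id
  vm_size               = "Standard_D2_v2"
  node_count            = 1
  # each extra pool would point at its own subnet in the same vnet
  vnet_subnet_id        = azurerm_subnet.workload_subnet.id
}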

Has anyone solved this?

There are 2 best solutions below

Roman

The issue you're facing is most likely that your AKS cluster can't authenticate with your ACR registry. When using a custom VNet and subnet, you need to make sure that the cluster's identity has the proper authorization to pull images from the registry.

To resolve this issue, you can try the following solutions (a Terraform sketch for pulling out the IDs that these commands expect follows the list):

  1. Make sure that the AcrPull role assignment exists for your AKS cluster's kubelet identity at the container registry scope. You can check this by running the following command:

       az role assignment list --assignee <your AKS cluster's identity principal ID> --scope <your ACR registry ID> --query "[?roleDefinitionName=='AcrPull']"
    

    If the role assignment doesn't exist, create it by configuring Container Registry integration for the AKS cluster. For more information on checking assignments, see List Azure role assignments using the Azure portal.

  2. Make sure that the secret of the service principal that's associated with your AKS cluster isn't expired (this only applies if the cluster uses a service principal rather than a managed identity). To check the expiration date of your service principal, run the following commands:

       az account show
       az ad sp show --id <your service principal's client ID> --query "passwordCredentials[0].endDate" 
    

    If the secret is expired, update the credentials for the AKS cluster.

  3. Make sure that the container registry role assignment refers to the correct service principal. To check the service principal that’s used by the AKS cluster, run the following command:

       az aks show --resource-group <your resource group name> --name <your AKS cluster name> --query "servicePrincipalProfile.clientId" 
    

    To check the service principal that’s referenced by the container registry role assignment, run the following command:

       az role assignment list --assignee <your service principal's client ID> --scope <your ACR registry ID> --query "[?roleDefinitionName=='AcrPull']"
    

    Compare the two service principals. If they don’t match, integrate the AKS cluster with the container registry again.

  4. Make sure that the kubelet identity is assigned to the AKS VMSS. To find the kubelet identity of your AKS cluster, run the following command:

       az aks show --resource-group <your resource group name> --name <your AKS cluster name> --query "identityProfile.kubeletidentity.objectId"
    

    Then, you can list the identities of the AKS VMSS by opening the VMSS from the node resource group and selecting Identity > User assigned in the Azure portal or by running the following command:

       az vmss identity show --resource-group <your node resource group name> --name <your VMSS name>
    

    If the kubelet identity of your AKS cluster isn’t assigned to the AKS VMSS, assign it back.
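
Since you're already managing the cluster and registry with Terraform, a convenient way to get the IDs that these commands expect is to expose them as outputs. This is only a sketch; the output names are arbitrary:

    # sketch: surface the IDs referenced by the az commands above (output names are arbitrary)
    output "kubelet_identity_object_id" {
      value = azurerm_kubernetes_cluster.aks.kubelet_identity[0].object_id
    }

    output "acr_id" {
      value = azurerm_container_registry.acr.id
    }

You can then read them with terraform output kubelet_identity_object_id and terraform output acr_id and paste them into the checks above.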

Venkat V

You say that AKS can't pull from the ACR registry when using a custom vnet + subnet, and that when you change this you get a 401 Unauthorized whenever the cluster tries to pull an image.

The cause of the 401 Unauthorized error is that an AKS cluster needs an identity, either a managed identity or a service principal, with the proper authorization to pull images from the container registry. Without that authorization you get exactly this "401 Unauthorized" error.

Also make sure that the account running Terraform has the permissions needed to assign the AcrPull role to that identity.

Here is an updated Terraform script that creates the AKS cluster and uses different subnets for the node pool.

To set pod_subnet_id on the node pool, you need to create a second subnet: the node subnet (vnet_subnet_id) and the pod subnet (pod_subnet_id) can't be the same subnet. (The same pattern extends to additional node pools; see the sketch after the script.)

    provider "azurerm" {
      features {}
    }
    data "azurerm_resource_group" "rg" {
      name = "venkat"
    }
    resource "azurerm_virtual_network" "vnet" {
      name                = "vnet-aks"
      location            = data.azurerm_resource_group.rg.location
      resource_group_name = data.azurerm_resource_group.rg.name
      address_space       = ["10.160.0.0/20"]
    }
    
    resource "azurerm_subnet" "cluster_subnet" {
      name                 = "venkatsubnet"
      resource_group_name  = data.azurerm_resource_group.rg.name
      virtual_network_name = azurerm_virtual_network.vnet.name
      address_prefixes     = ["10.160.0.0/22"]
    }
    
    resource "azurerm_subnet" "cluster_subnet1" {
      name                 = "venkatsubnet1"
      resource_group_name  = data.azurerm_resource_group.rg.name
      virtual_network_name = azurerm_virtual_network.vnet.name
      address_prefixes     = ["10.160.4.0/22"]  
    }
    
    resource "azurerm_kubernetes_cluster" "aks" {
      name                = "aks-cluster-demo"
      location            = data.azurerm_resource_group.rg.location
      resource_group_name = data.azurerm_resource_group.rg.name
    
      dns_prefix = "venkat"
    
      default_node_pool {
        name                        = "default"
        node_count                  = 1
        vm_size                     = "Standard_D2_v2"
        zones                       = ["3"]
        temporary_name_for_rotation = "fallback"
        vnet_subnet_id              = azurerm_subnet.cluster_subnet.id
        pod_subnet_id               = azurerm_subnet.cluster_subnet1.id
      }
    
      network_profile {
        network_plugin = "azure"
        network_policy = "calico"
      }
    
      azure_active_directory_role_based_access_control {
        managed = true
      }
    
      # using managed identity to assign roles just to this whole cluster (e.g. image pull permissions)
      identity {
        type = "SystemAssigned"
      }
    }
    
    resource "azurerm_container_registry" "acr" {
      name                = "venkatregistry"
      location            = data.azurerm_resource_group.rg.location
      resource_group_name = data.azurerm_resource_group.rg.name
      sku                 = "Basic"
      admin_enabled       = false
    }
    
    # permission to pull images from the registry
    resource "azurerm_role_assignment" "kubweb_to_acr" {
      principal_id                     = azurerm_kubernetes_cluster.aks.kubelet_identity[0].object_id
      scope                            = azurerm_container_registry.acr.id
      role_definition_name             = "AcrPull"
      skip_service_principal_aad_check = true
    }
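
If you later want different subnets for different node pools, the same pattern extends to additional node pools, each with its own node subnet and pod subnet. A rough sketch (the subnet and pool names below are made up):

    resource "azurerm_subnet" "pool2_nodes" {
      name                 = "venkatsubnet2"
      resource_group_name  = data.azurerm_resource_group.rg.name
      virtual_network_name = azurerm_virtual_network.vnet.name
      address_prefixes     = ["10.160.8.0/22"]
    }

    resource "azurerm_subnet" "pool2_pods" {
      name                 = "venkatsubnet3"
      resource_group_name  = data.azurerm_resource_group.rg.name
      virtual_network_name = azurerm_virtual_network.vnet.name
      address_prefixes     = ["10.160.12.0/22"]
    }

    resource "azurerm_kubernetes_cluster_node_pool" "pool2" {
      name                  = "pool2"
      kubernetes_cluster_id = azurerm_kubernetes_cluster.aks.id
      vm_size               = "Standard_D2_v2"
      node_count            = 1
      vnet_subnet_id        = azurerm_subnet.pool2_nodes.id
      pod_subnet_id         = azurerm_subnet.pool2_pods.id
    }

Each extra subnet just has to fit inside the vnet's 10.160.0.0/20 address space without overlapping the others.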


Reference: Cause 1: 401 Unauthorized error