When using the Antrea CNI, it takes care of IP address management for each of the pods deployed onto the worker nodes. It achieves this with an OVS bridge named br-int on each node in the Tanzu Kubernetes cluster. The OVS bridge also has a tunnel port that creates an overlay tunnel to the other nodes in the cluster to enable inter-pod communication.
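If you want to see this bridge yourself, you can run ovs-vsctl from the antrea-ovs container of any antrea-agent pod (the agent pod name below is a placeholder; list the real ones with kubectl get pods -n kube-system -l app=antrea):

```
# Show the OVS bridges (br-int) and their ports, including the tunnel port,
# on the node where this antrea-agent pod runs. Pod name is a placeholder.
kubectl exec -n kube-system antrea-agent-xxxxx -c antrea-ovs -- ovs-vsctl show
```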
Each node in the Tanzu Kubernetes cluster is assigned its own unique /24 subnet (/24 is the default node CIDR mask size). This can be changed by adding --node-cidr-mask-size to /etc/kubernetes/manifests/kube-controller-manager.yaml on the control plane node. For more information review this document
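For illustration only, the relevant flags in kube-controller-manager.yaml would look something like this (the CIDR values here are examples, not the defaults):

```
# Excerpt from /etc/kubernetes/manifests/kube-controller-manager.yaml (example values)
spec:
  containers:
  - command:
    - kube-controller-manager
    - --allocate-node-cidrs=true        # hand out a per-node pod CIDR
    - --cluster-cidr=192.168.0.0/20     # overall pod CIDR for the cluster
    - --node-cidr-mask-size=24          # size of each node's slice (default: 24)
```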
The /24 subnets are carved out of the pod CIDR that is defined in the YAML file you use to deploy the Tanzu Kubernetes cluster.
For example:
apiVersion: run.tanzu.vmware.com/v1alpha1
kind: TanzuKubernetesCluster
metadata:
  name: antrea-tkc
  namespace: gs-dev
spec:
  distribution:
    version: v1.20.2
  topology:
    controlPlane:
      count: 3
      class: best-effort-small
      storageClass: k8s-storage-profile
    workers:
      count: 10
      class: best-effort-small
      storageClass: k8s-storage-profile
  settings:
    network:
      cni:
        name: antrea
      services:
        cidrBlocks: ["193.1.0.0/24"]
      pods:
        cidrBlocks: ["192.168.0.0/22"]
This document does a great job of explaining how the Antrea CNI works.
I recently ran into an issue where some of the Antrea agents failed with CrashLoopBackOff:
root@debian:~# k get pods -A
NAMESPACE     NAME                                    READY   STATUS             RESTARTS   AGE
kube-system   antrea-agent-42ft4                      1/2     CrashLoopBackOff   8          19m
kube-system   antrea-agent-48kfg                      1/2     CrashLoopBackOff   9          98s
kube-system   antrea-agent-n5ftp                      2/2     Running            0          21m
kube-system   antrea-agent-rr576                      1/2     CrashLoopBackOff   3          2m2s
kube-system   antrea-controller-6d498b5b54-zbsjr      1/1     Running            0          21m
kube-system   antrea-resource-init-5774f96d79-cqxn8   1/1     Running            0          21m
Let's look at the antrea-agent logs.
First, log in to the TKC cluster:
root@debian:~# kubectl get secret antrea-tkc-kubeconfig -o jsonpath='{.data.value}' | base64 -d > antrea-tkc-kubeconfig
root@debian:~# alias kf="kubectl --kubeconfig=antrea-tkc-kubeconfig"
Then review the logs:
root@debian:~# kf logs antrea-agent-42ft4 -c antrea-agent -n kube-system
...
I1028 15:21:18.334228 1 ovs_client.go:67] Connecting to OVSDB at address /var/run/openvswitch/db.sock
I1028 15:21:18.334503 1 agent.go:205] Setting up node network
I1028 15:21:18.347790 1 agent.go:603] Setting Node MTU=1450
E1028 15:21:18.347935 1 agent.go:637] Spec.PodCIDR is empty for Node antrea-tkc-fail-workers-jgm2s-5459797484-nmdwh. Please make sure --allocate-node-cidrs is enabled for kube-controller-manager and --cluster-cidr specifies a sufficient CIDR range
F1028 15:21:18.352062 1 main.go:58] Error running agent: error initializing agent: CIDR string is empty for node antrea-tkc-workers-jgm2s-5459797484-nmdwh
goroutine 1 [running]:
k8s.io/klog.stacks(0xc0005eb400, 0xc0003cc480, 0xa6, 0x215)
/tmp/gopath/pkg/mod/k8s.io/klog@v1.0.0/klog.go:875 +0xb9
k8s.io/klog.(*loggingT).output(0x2d678e0, 0xc000000003, 0xc000356230, 0x2c9c935, 0x7, 0x3a, 0x0)
/tmp/gopath/pkg/mod/k8s.io/klog@v1.0.0/klog.go:826 +0x35f
k8s.io/klog.(*loggingT).printf(0x2d678e0, 0x3, 0x1e4c8d2, 0x17, 0xc000409d58, 0x1, 0x1)
/tmp/gopath/pkg/mod/k8s.io/klog@v1.0.0/klog.go:707 +0x153
k8s.io/klog.Fatalf(...)
/tmp/gopath/pkg/mod/k8s.io/klog@v1.0.0/klog.go:1276
main.newAgentCommand.func1(0xc000051400, 0xc0000fe280, 0x0, 0x8)
/usr/src/github.com/vmware-tanzu/antrea/cmd/antrea-agent/main.go:58 +0x215
github.com/spf13/cobra.(*Command).execute(0xc000051400, 0xc00004e0a0, 0x8, 0x8, 0xc000051400, 0xc00004e0a0)
/tmp/gopath/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830 +0x2c2
github.com/spf13/cobra.(*Command).ExecuteC(0xc000051400, 0x0, 0x0, 0x0)
/tmp/gopath/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914 +0x30b
github.com/spf13/cobra.(*Command).Execute(...)
/tmp/gopath/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864
main.main()
/usr/src/github.com/vmware-tanzu/antrea/cmd/antrea-agent/main.go:37 +0x52
Looks like the node is not being assigned a pod CIDR: "error initializing agent: CIDR string is empty for node antrea-tkc-workers-jgm2s-5459797484-nmdwh"
Let's look at the kube-controller-manager logs as well:
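You can confirm which nodes are missing a pod CIDR by printing each node's spec.podCIDR (this uses the kf alias defined above):

```
# Print each node's allocated pod CIDR; nodes where allocation failed
# will show <none> in the POD-CIDR column.
kf get nodes -o custom-columns='NODE:.metadata.name,POD-CIDR:.spec.podCIDR'
```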
root@debian:~# kf logs kube-controller-manager-antrea-tkc-fail-control-plane-fg99l -n kube-system
...
I1028 15:24:35.102335 1 event.go:291] "Event occurred" object="antrea-tkc-fail-workers-jgm2s-5459797484-nmdwh" kind="Node" apiVersion="v1" type="Normal" reason="CIDRNotAvailable" message="Node antrea-tkc-fail-workers-jgm2s-5459797484-nmdwh status is now: CIDRNotAvailable"
E1028 15:24:55.113314 1 controller_utils.go:260] Error while processing Node Add/Delete: failed to allocate cidr from cluster cidr at idx:0: CIDR allocation failed; there are no remaining CIDRs left to allocate in the accepted range
I1028 15:24:55.113792 1 event.go:291] "Event occurred" object="antrea-tkc-fail-control-plane-fg99l" kind="Node" apiVersion="v1" type="Normal" reason="CIDRNotAvailable" message="Node antrea-tkc-fail-control-plane-fg99l status is now: CIDRNotAvailable"
E1028 15:26:34.766300 1 controller_utils.go:260] Error while processing Node Add/Delete: failed to allocate cidr from cluster cidr at idx:0: CIDR allocation failed; there are no remaining CIDRs left to allocate in the accepted range
I1028 15:26:34.766720 1 event.go:291] "Event occurred" object="antrea-tkc-fail-workers-jgm2s-67566c48b4-hng6w" kind="Node" apiVersion="v1" type="Normal" reason="CIDRNotAvailable" message="Node antrea-tkc-fail-workers-jgm2s-67566c48b4-hng6w status is now: CIDRNotAvailable"
So why is kube-controller-manager complaining that there are no CIDRs remaining?
Reviewing the YAML file again:
apiVersion: run.tanzu.vmware.com/v1alpha1
kind: TanzuKubernetesCluster
metadata:
  name: antrea-tkc
  namespace: gs-dev
spec:
  distribution:
    version: v1.20.2
  topology:
    controlPlane:
      count: 3
      class: best-effort-small
      storageClass: k8s-storage-profile
    workers:
      count: 10
      class: best-effort-small
      storageClass: k8s-storage-profile
  settings:
    network:
      cni:
        name: antrea
      services:
        cidrBlocks: ["193.1.0.0/24"]
      pods:
        cidrBlocks: ["192.168.0.0/22"]
The pod CIDR here is 192.168.0.0/22.
The usable IP range for a /22 subnet is 192.168.0.1-192.168.3.254.
Since each node is handed a /24 slice of the /22 pod CIDR, there are at most four /24 subnets available:
192.168.0.0/24
192.168.1.0/24
192.168.2.0/24
192.168.3.0/24
The number of worker nodes I defined above is 10, plus 3 control plane nodes. This caused the issue: the controller needed 13 /24 subnets, one per node, but had only 4 to hand out.
To resolve the issue I changed the pod CIDR to 192.168.0.0/20, which yields 16 /24 subnets, more than enough for all 13 nodes.
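The subnet arithmetic behind the fix is easy to sanity-check: the number of /24 node subnets available from a pod CIDR with prefix length N is 2^(24-N):

```shell
# Number of /24 node subnets that fit in a /22 vs a /20 pod CIDR.
subnets_22=$(( 1 << (24 - 22) ))
subnets_20=$(( 1 << (24 - 20) ))
echo "/22 pod CIDR -> ${subnets_22} node subnets"   # 4: too few for 13 nodes
echo "/20 pod CIDR -> ${subnets_20} node subnets"   # 16: enough for 3 control plane + 10 workers
```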