When using the Antrea CNI, it takes care of IP address management for each of the pods deployed onto the worker nodes. It achieves this with an OVS bridge named br-int on each node in the Tanzu Kubernetes cluster. The OVS bridge also has a tunnel port that creates an overlay tunnel to the other nodes in the cluster to enable inter-pod communication.
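If you want to see this bridge yourself, you can run ovs-vsctl from the antrea-ovs container of any antrea-agent pod (the agent pod name below is a placeholder; list the real ones with kubectl get pods -n kube-system -l app=antrea):

```
# Show the OVS bridges (br-int) and their ports, including the tunnel port,
# on the node where this antrea-agent pod runs. Pod name is a placeholder.
kubectl exec -n kube-system antrea-agent-xxxxx -c antrea-ovs -- ovs-vsctl show
```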
Each node in the Tanzu Kubernetes cluster is assigned its own unique /24 subnet (/24 is the default node CIDR mask size). This can be changed by adding --node-cidr-mask-size to /etc/kubernetes/manifests/kube-controller-manager.yaml on the control plane node. For more information review this document
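For illustration only, the relevant flags in kube-controller-manager.yaml would look something like this (the CIDR values here are examples, not the defaults):

```
# Excerpt from /etc/kubernetes/manifests/kube-controller-manager.yaml (example values)
spec:
  containers:
  - command:
    - kube-controller-manager
    - --allocate-node-cidrs=true        # hand out a per-node pod CIDR
    - --cluster-cidr=192.168.0.0/20     # overall pod CIDR for the cluster
    - --node-cidr-mask-size=24          # size of each node's slice (default: 24)
```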
The /24 subnets are carved out of the pod CIDR that is defined in the YAML file you use to deploy the Tanzu Kubernetes cluster.
For example:
apiVersion: run.tanzu.vmware.com/v1alpha1
kind: TanzuKubernetesCluster
metadata:
  name: antrea-tkc
  namespace: gs-dev
spec:
  distribution:
    version: v1.20.2
  topology:
    controlPlane:
      count: 3
      class: best-effort-small
      storageClass: k8s-storage-profile
    workers:
      count: 10
      class: best-effort-small
      storageClass: k8s-storage-profile
  settings:
    network:
      cni:
        name: antrea
      services:
        cidrBlocks: ["193.1.0.0/24"]
      pods:
        cidrBlocks: ["192.168.0.0/22"]
This document does a great job of explaining how the Antrea CNI works.
I recently ran into an issue where some of the Antrea agents failed with CrashLoopBackOff:
root@debian:~# k get pods -A
NAMESPACE     NAME                                    READY   STATUS             RESTARTS   AGE
kube-system   antrea-agent-42ft4                      1/2     CrashLoopBackOff   8          19m
kube-system   antrea-agent-48kfg                      1/2     CrashLoopBackOff   9          98s
kube-system   antrea-agent-n5ftp                      2/2     Running            0          21m
kube-system   antrea-agent-rr576                      1/2     CrashLoopBackOff   3          2m2s
kube-system   antrea-controller-6d498b5b54-zbsjr      1/1     Running            0          21m
kube-system   antrea-resource-init-5774f96d79-cqxn8   1/1     Running            0          21m
Let's look at the antrea-agent logs.
First, log in to the TKC cluster:
root@debian:~# kubectl get secret antrea-tkc-kubeconfig -o jsonpath='{.data.value}' | base64 -d > antrea-tkc-kubeconfig
root@debian:~# alias kf="kubectl --kubeconfig=antrea-tkc-kubeconfig"
Then review the logs:
root@debian:~# kf logs antrea-agent-42ft4 -c antrea-agent -n kube-system
...
I1028 15:21:18.334228 1 ovs_client.go:67] Connecting to OVSDB at address /var/run/openvswitch/db.sock
I1028 15:21:18.334503 1 agent.go:205] Setting up node network
I1028 15:21:18.347790 1 agent.go:603] Setting Node MTU=1450
E1028 15:21:18.347935 1 agent.go:637] Spec.PodCIDR is empty for Node antrea-tkc-fail-workers-jgm2s-5459797484-nmdwh. Please make sure --allocate-node-cidrs is enabled for kube-controller-manager and --cluster-cidr specifies a sufficient CIDR range
F1028 15:21:18.352062 1 main.go:58] Error running agent: error initializing agent: CIDR string is empty for node antrea-tkc-workers-jgm2s-5459797484-nmdwh
goroutine 1 [running]:
k8s.io/klog.stacks(0xc0005eb400, 0xc0003cc480, 0xa6, 0x215)
/tmp/gopath/pkg/mod/k8s.io/klog@v1.0.0/klog.go:875 +0xb9
k8s.io/klog.(*loggingT).output(0x2d678e0, 0xc000000003, 0xc000356230, 0x2c9c935, 0x7, 0x3a, 0x0)
/tmp/gopath/pkg/mod/k8s.io/klog@v1.0.0/klog.go:826 +0x35f
k8s.io/klog.(*loggingT).printf(0x2d678e0, 0x3, 0x1e4c8d2, 0x17, 0xc000409d58, 0x1, 0x1)
/tmp/gopath/pkg/mod/k8s.io/klog@v1.0.0/klog.go:707 +0x153
k8s.io/klog.Fatalf(...)
/tmp/gopath/pkg/mod/k8s.io/klog@v1.0.0/klog.go:1276
main.newAgentCommand.func1(0xc000051400, 0xc0000fe280, 0x0, 0x8)
/usr/src/github.com/vmware-tanzu/antrea/cmd/antrea-agent/main.go:58 +0x215
github.com/spf13/cobra.(*Command).execute(0xc000051400, 0xc00004e0a0, 0x8, 0x8, 0xc000051400, 0xc00004e0a0)
/tmp/gopath/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830 +0x2c2
github.com/spf13/cobra.(*Command).ExecuteC(0xc000051400, 0x0, 0x0, 0x0)
/tmp/gopath/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914 +0x30b
github.com/spf13/cobra.(*Command).Execute(...)
/tmp/gopath/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864
main.main()
/usr/src/github.com/vmware-tanzu/antrea/cmd/antrea-agent/main.go:37 +0x52
Looks like the node is not being assigned a pod CIDR: "error initializing agent: CIDR string is empty for node antrea-tkc-workers-jgm2s-5459797484-nmdwh"
Let's look at the kube-controller-manager logs as well:
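You can confirm which nodes are missing a pod CIDR by printing each node's spec.podCIDR (this uses the kf alias defined above):

```
# Print each node's allocated pod CIDR; nodes where allocation failed
# will show <none> in the POD-CIDR column.
kf get nodes -o custom-columns='NODE:.metadata.name,POD-CIDR:.spec.podCIDR'
```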
root@debian:~# kf logs kube-controller-manager-antrea-tkc-fail-control-plane-fg99l -n kube-system
...
I1028 15:24:35.102335 1 event.go:291] "Event occurred" object="antrea-tkc-fail-workers-jgm2s-5459797484-nmdwh" kind="Node" apiVersion="v1" type="Normal" reason="CIDRNotAvailable" message="Node antrea-tkc-fail-workers-jgm2s-5459797484-nmdwh status is now: CIDRNotAvailable"
E1028 15:24:55.113314 1 controller_utils.go:260] Error while processing Node Add/Delete: failed to allocate cidr from cluster cidr at idx:0: CIDR allocation failed; there are no remaining CIDRs left to allocate in the accepted range
I1028 15:24:55.113792 1 event.go:291] "Event occurred" object="antrea-tkc-fail-control-plane-fg99l" kind="Node" apiVersion="v1" type="Normal" reason="CIDRNotAvailable" message="Node antrea-tkc-fail-control-plane-fg99l status is now: CIDRNotAvailable"
E1028 15:26:34.766300 1 controller_utils.go:260] Error while processing Node Add/Delete: failed to allocate cidr from cluster cidr at idx:0: CIDR allocation failed; there are no remaining CIDRs left to allocate in the accepted range
I1028 15:26:34.766720 1 event.go:291] "Event occurred" object="antrea-tkc-fail-workers-jgm2s-67566c48b4-hng6w" kind="Node" apiVersion="v1" type="Normal" reason="CIDRNotAvailable" message="Node antrea-tkc-fail-workers-jgm2s-67566c48b4-hng6w status is now: CIDRNotAvailable"
So why is kube-controller-manager complaining that there are no CIDRs remaining?
Reviewing the YAML file again:
apiVersion: run.tanzu.vmware.com/v1alpha1
kind: TanzuKubernetesCluster
metadata:
  name: antrea-tkc
  namespace: gs-dev
spec:
  distribution:
    version: v1.20.2
  topology:
    controlPlane:
      count: 3
      class: best-effort-small
      storageClass: k8s-storage-profile
    workers:
      count: 10
      class: best-effort-small
      storageClass: k8s-storage-profile
  settings:
    network:
      cni:
        name: antrea
      services:
        cidrBlocks: ["193.1.0.0/24"]
      pods:
        cidrBlocks: ["192.168.0.0/22"]
The pod CIDR here is 192.168.0.0/22.
The usable IP range for a /22 subnet is 192.168.0.1-192.168.3.254.
Since each node is handed a /24 slice of the /22 pod CIDR, there are at most four /24 subnets available:
192.168.0.0/24
192.168.1.0/24
192.168.2.0/24
192.168.3.0/24
The number of worker nodes I defined above is 10, plus 3 control plane nodes. This caused the issue: the controller needed 13 /24 subnets, one per node, but had only 4 to hand out.
To resolve the issue I changed the pod CIDR to 192.168.0.0/20, which yields 16 /24 subnets, more than enough for all 13 nodes.
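The subnet arithmetic behind the fix is easy to sanity-check: the number of /24 node subnets available from a pod CIDR with prefix length N is 2^(24-N):

```shell
# Number of /24 node subnets that fit in a /22 vs a /20 pod CIDR.
subnets_22=$(( 1 << (24 - 22) ))
subnets_20=$(( 1 << (24 - 20) ))
echo "/22 pod CIDR -> ${subnets_22} node subnets"   # 4: too few for 13 nodes
echo "/20 pod CIDR -> ${subnets_20} node subnets"   # 16: enough for 3 control plane + 10 workers
```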