If the config status is stuck in the “Configuring” state, one of the first things to check is the wcpsvc log on the vCenter appliance:
/var/log/vmware/wcp/wcpsvc.log
Interestingly, I ran into an issue where the logs were complaining about authorization. You will probably see the following events repeating in a loop:
2021-05-30T11:48:11.077Z error wcp [kubelifecycle/spherelet.go:923] [opID=domain-c8-host-28] **Failed to get Kubernetes cluster node list: Unauthorized**
2021-05-30T11:48:11.078Z error wcp [kubelifecycle/node\_controller.go:1059] [opID=domain-c8-host-28] Intent nodeReadyIntent, step scanNodeForReadyState for cluster domain-c8 node host-28 returned error Unauthorized
2021-05-30T11:48:11.078Z debug wcp [kubelifecycle/node\_controller.go:911] [opID=domain-c8-host-28] For node host-28, setting configStatus from ERROR to ERROR
2021-05-30T11:48:11.078Z debug wcp [kubelifecycle/node\_controller.go:912] [opID=domain-c8-host-28] For node host-28, setting configStatusMessages from ([]namespace\_management.ClustersMessage)[{Severity:(namespace\_management.ClustersMessageSeverityEnum)ERROR Details:(\*std.LocalizableMessage){Id:(string)vcenter.wcp.systemerror DefaultMessage:(string)A general system error occurred. Args Params:(map[string]std.LocalizationParam)<nil> Localized:(\*string)<nil>}}] to ([]namespace\_management.ClustersMessage)[{Severity:(namespace\_management.ClustersMessageSeverityEnum)ERROR Details:(\*std.LocalizableMessage){Id:(string)vcenter.wcp.systemerror DefaultMessage:(string)**A general system error occurred. Args:([]string)[Unauthorized]** Params:(map[string]std.LocalizationParam)<nil>}}]
2021-05-30T11:48:11.078Z error wcp [kubelifecycle/node\_controller.go:422] [opID=domain-c8-host-28] Failed to realize node {nodeID:host-28 clusterID:domain-c8} state. **Err Unauthorized**. Will retry.
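To confirm that the same node keeps failing with the same error, it can help to count the “Unauthorized” errors per opID. The following is a minimal sketch, assuming the log lines follow the layout shown in the excerpt above; the function names `count_unauthorized` and `scan_file` are my own and not part of any VMware tooling.

```python
import re
from collections import Counter

# Matches the line layout seen in the excerpt above (an assumption about the
# general wcpsvc.log format):
# <timestamp> <level> wcp [<source file>:<line>] [opID=<id>] <message>
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<level>\w+)\s+wcp\s+"
    r"\[(?P<src>[^\]]+)\]\s+\[opID=(?P<opid>[^\]]+)\]\s+(?P<msg>.*)$"
)


def count_unauthorized(lines):
    """Count error-level lines that mention 'Unauthorized', grouped by opID."""
    hits = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not follow the expected layout
        if match.group("level") == "error" and "Unauthorized" in match.group("msg"):
            hits[match.group("opid")] += 1
    return hits


def scan_file(path="/var/log/vmware/wcp/wcpsvc.log"):
    """Print the per-opID counts for the given log file."""
    with open(path, errors="replace") as handle:
        for opid, count in count_unauthorized(handle).most_common():
            print(f"{opid}: {count} Unauthorized errors")
```

Calling `scan_file()` on the appliance prints one line per opID. A count that keeps growing for the same host between runs suggests the retry loop described here.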
It turned out the authorization errors were caused by the kube-apiserver being unable to resolve the vCenter’s FQDN for SSO. When SSO fails, “Unauthorized” events show up in the logs.
There are two DNS entries that you configure when enabling workload management:
- Management Network DNS Server
- Workload Network DNS Server
I used the same DNS server for both the management and workload networks in my lab environment. A misconfiguration on my T0 edge prevented the workload network from reaching the DNS server on the management network.
After resolving the configuration issues on the edge, I was able to enable workload management.
How do you know if the workload network is unable to reach the DNS Server?
- Log in to NSX-T.
- Go to “Plan and Troubleshoot” -> “Traceflow”.
- Under Source, use the following settings:
  Type: Port/Interface
  Attachment: Virtual Interface (VIF)
  Port: “vnetif- vif-cim-XXXX”
- Under Destination, use the following settings:
  Type: IP-Mac
  Layer: Layer 3
  IP Address: (type in the DNS server’s IP address)
- In the overarching panel between source and destination, use the following settings:
  IP Address: IPv4
  Traffic Type: Unicast
  Protocol Type: ICMP
- Click Trace.
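Traceflow with ICMP only shows whether packets can reach the DNS server. It does not show whether the server actually answers queries. If you have shell access to a machine on the workload network, a rough complementary check is to send a DNS query straight to that server. The sketch below uses only the Python standard library and builds a plain A-record query by hand. The function names, the server IP, and the hostname in the usage note are placeholders of mine, not values from this environment.

```python
import socket
import struct

QUERY_ID = 0x1234  # arbitrary ID, used to match the reply to our query


def build_query(hostname, query_id=QUERY_ID):
    """Build a minimal DNS query packet for an A record (RFC 1035 layout)."""
    # Header: ID, flags (recursion desired), 1 question, 0 other records
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def reply_ok(reply, query_id=QUERY_ID):
    """True if the reply matches our ID, has RCODE 0 and at least one answer."""
    if len(reply) < 12:
        return False
    reply_id, flags, _, ancount, _, _ = struct.unpack("!HHHHHH", reply[:12])
    return reply_id == query_id and (flags & 0x000F) == 0 and ancount > 0


def dns_server_answers(server_ip, hostname, timeout=3.0):
    """Send the query over UDP port 53; False on timeout or a bad reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(hostname), (server_ip, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable-network errors
            return False
    return reply_ok(reply)
```

For example, `dns_server_answers("192.0.2.53", "vcenter.example.com")` returns `False` if the query times out or the server has no A record for that name. Both values are placeholders, so substitute your workload network DNS server and your vCenter FQDN.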