Contents
1. Troubleshooting process
Today a colleague reported that the API server of one of our k8s clusters was returning the following error:
Failed to create new replica set "recommend-alg-service-74c6bc97cd": Get https://10.13.96.12:6443/api/v1/namespaces/saas-ec-tomcat-pl/resourcequotas: x509: certificate has expired or is not yet valid
We then checked the API server logs on the API server node and found a large number of errors there as well:
I0510 17:43:56.889617 790992 reflector.go:211] Listing and watching *v1.MutatingWebhookConfiguration from k8s.io/client-go/informers/factory.go:135
I0510 17:43:56.892900 790992 log.go:172] http: TLS handshake error from 10.13.96.11:36528: remote error: tls: bad certificate
E0510 17:43:56.892945 790992 reflector.go:178] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.MutatingWebhookConfiguration: Get https://10.13.96.11:6443/apis/admissionregistration.k8s.io/v1/mutatingwebhookconfigurations?resourceVersion=331916875: x509: certificate has expired or is not yet valid
I0510 17:43:57.087301 790992 reflector.go:211] Listing and watching *v1.ClusterRole from k8s.io/client-go/informers/factory.go:135
I0510 17:43:57.090560 790992 log.go:172] http: TLS handshake error from 10.13.96.11:36530: remote error: tls: bad certificate
E0510 17:43:57.090625 790992 reflector.go:178] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.ClusterRole: Get https://10.13.96.11:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?resourceVersion=397554978: x509: certificate has expired or is not yet valid
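Oddly, not every request failed with a certificate error. The logs showed that the vast majority of requests to the API server returned 200, and only a small number returned 500.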
The errors above also show that the failing connections came from 10.13.96.11:36528, and that IP belongs to the API server itself.
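To confirm which clients the failing handshakes come from, the log can be grouped by source IP. The following is a minimal sketch, assuming the log lines have the same shape as the excerpt above; the function name `handshake_error_sources` and the embedded sample are illustrative only.

```python
import re
from collections import Counter

# Matches the "TLS handshake error from <ip>:<port>" lines seen in the API server log.
HANDSHAKE_RE = re.compile(
    r"http: TLS handshake error from (?P<ip>\d+\.\d+\.\d+\.\d+):(?P<port>\d+):"
)


def handshake_error_sources(lines):
    """Count TLS handshake errors per client IP (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = HANDSHAKE_RE.search(line)
        if match:
            counts[match.group("ip")] += 1
    return counts


# Two lines copied from the log excerpt above, used as sample input.
SAMPLE = [
    "I0510 17:43:56.892900 790992 log.go:172] http: TLS handshake error "
    "from 10.13.96.11:36528: remote error: tls: bad certificate",
    "I0510 17:43:57.090560 790992 log.go:172] http: TLS handshake error "
    "from 10.13.96.11:36530: remote error: tls: bad certificate",
]

for ip, count in handshake_error_sources(SAMPLE).most_common():
    print(ip, count)
```

On a real node the same function can be fed the full log file instead of `SAMPLE`. If a single IP dominates and it is the API server's own address, that supports the observation above.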
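In the end we narrowed the problem down to the following:
- The API server gets a certificate-expired error when it calls itself, while requests from other components succeed.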
We also checked the certificates used by every k8s cluster component on the servers, and none of them had expired.
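A check like this can be scripted. The sketch below is one possible way to do it, not necessarily how we did it: it shells out to the `openssl x509 -noout -enddate` command and compares the reported date with the current time. The directory `/etc/kubernetes/pki` is an assumption (adjust it to wherever this cluster stores its certificates), and the helper names are hypothetical.

```python
import glob
import ssl
import subprocess
from datetime import datetime, timezone


def parse_not_after(openssl_line):
    """Parse openssl output of the form 'notAfter=<Mon DD HH:MM:SS YYYY> GMT' into a UTC datetime."""
    _, _, value = openssl_line.strip().partition("=")
    return datetime.fromtimestamp(ssl.cert_time_to_seconds(value), tz=timezone.utc)


def is_expired(openssl_line, now=None):
    """Return True if the notAfter date in openssl_line is earlier than now."""
    now = now or datetime.now(timezone.utc)
    return now > parse_not_after(openssl_line)


def cert_not_after_line(path):
    """Ask the openssl CLI for the expiry date of a PEM certificate file."""
    result = subprocess.run(
        ["openssl", "x509", "-noout", "-enddate", "-in", path],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Assumed certificate directory; not taken from the incident itself.
    for path in sorted(glob.glob("/etc/kubernetes/pki/*.crt")):
        line = cert_not_after_line(path)
        status = "EXPIRED" if is_expired(line) else "ok"
        print(f"{status:8s}{parse_not_after(line):%Y-%m-%d %H:%M:%S} UTC  {path}")
```

Note that this only inspects the files on disk. It says nothing about a certificate that a long-running process loaded into memory when it started.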
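Out of options, we restarted the API server service on the off chance it would help, and everything went back to normal afterwards. Restarting really does fix 99% of problems.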