目录

1. 故障处理过程

今天接到同事反馈发现有一套k8s apiserver集群出现如下报错:

Failed to create new replica set "recommend-alg-service-74c6bc97cd": Get https://10.13.96.12:6443/api/v1/namespaces/saas-ec-tomcat-pl/resourcequotas: x509: certificate has expired or is not yet valid

随后去api server节点上查询api server日志,发现也有大量报错:

I0510 17:43:56.889617  790992 reflector.go:211] Listing and watching *v1.MutatingWebhookConfiguration from k8s.io/client-go/informers/factory.go:135
I0510 17:43:56.892900  790992 log.go:172] http: TLS handshake error from 10.13.96.11:36528: remote error: tls: bad certificate
E0510 17:43:56.892945  790992 reflector.go:178] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.MutatingWebhookConfiguration: Get https://10.13.96.11:6443/apis/admissionregistration.k8s.io/v1/mutatingwebhookconfigurations?resourceVersion=331916875: x509: certificate has expired or is not yet valid
I0510 17:43:57.087301  790992 reflector.go:211] Listing and watching *v1.ClusterRole from k8s.io/client-go/informers/factory.go:135
I0510 17:43:57.090560  790992 log.go:172] http: TLS handshake error from 10.13.96.11:36530: remote error: tls: bad certificate
E0510 17:43:57.090625  790992 reflector.go:178] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.ClusterRole: Get https://10.13.96.11:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?resourceVersion=397554978: x509: certificate has expired or is not yet valid

让人奇怪的是,并不是所有请求都报证书错误,通过日志发现,大量的请求到api server都是200的,出现500的为少数。

同时我们通过上面的报错发现,报错的访问来源是10.13.96.11:36528,这个IP是api server自己。

最后我们将错误定位到以下范围:

  • api server自己访问自己报证书过期的错误,而其它组件访问都是正常的。

同时我们检查了服务器上的所有k8s 集群组件使用的证书,并没有过期。

在无计可施时,怀着试试看的想法,我们将api server服务重启了,发现重启后恢复正常。果然是重启能解决99%的问题。

阅读全文

kubernetes源码分支:1.18

先说结论,kube-apiserver启动时:

  • –admission-control参数带的插件将是apiserver启动的插件,不包括默认插件
  • –admission-control和–enable-admission-plugins –disable-admission-plugins不能同时使用
  • –enable-admission-plugins参数不需要按加载顺序填写
  • 不使用–admission-control参数时,api server会同时启动默认插件
  • –enable-admission-plugins参数启用时,api server会同时启动默认插件,除非使用了–disable-admission-plugins显示的关闭某个插件
  • –enable-admission-plugins和–disable-admission-plugins如果同时填写了某一个插件,这个插件将会被加载

阅读全文

作者的图片

阿辉

容器技术及容器集群等分布式系统研究

容器平台负责人

上海