CoreDNS Part 8: Health Checks

Last updated: July 28, 2022

This article covers how to use CoreDNS's two built-in health-check plugins, health and ready, and the scenarios each is suited for.

1. The health plugin

By default, the health plugin serves a health-status endpoint at the /health path on port 8080. When CoreDNS is running normally it returns an HTTP 200 status code with OK as the body.

[root@coredns-10-31-53-1 conf]# curl -v http://10.31.53.1:8080/health
* About to connect() to 10.31.53.1 port 8080 (#0)
* Trying 10.31.53.1...
* Connected to 10.31.53.1 (10.31.53.1) port 8080 (#0)
> GET /health HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 10.31.53.1:8080
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Thu, 28 Jul 2022 03:52:56 GMT
< Content-Length: 2
< Content-Type: text/plain; charset=utf-8
<
* Connection #0 to host 10.31.53.1 left intact
OK
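That endpoint comes from simply enabling the plugin in a server block. A minimal Corefile sketch, assuming a plain forwarding setup (the forward upstream here is only a placeholder):

.:53 {
    # health enables the liveness endpoint, by default on :8080/health
    health
    forward . /etc/resolv.conf
}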

A notable extra feature of the health plugin is lameduck. Its effect is to delay process shutdown by the configured duration: with lameduck 10s configured, CoreDNS will wait 10 seconds after receiving the command to exit before the process actually terminates.

health [ADDRESS] {
    lameduck DURATION
}

One important caveat: if lameduck is used in multiple server blocks, the durations add up. For example, with lameduck 10s set in 10 server blocks, CoreDNS will delay shutdown by 10 × 10 = 100 seconds after receiving the command to exit, as illustrated by the sketch below.
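A hypothetical Corefile illustrating this stacking behaviour. The zones example.org and example.net and the upstream are placeholders, and the two health listeners must use different addresses; with both blocks configured, shutdown would be delayed by roughly 10s + 10s = 20s.

example.org:53 {
    health :8080 {
        lameduck 10s
    }
    forward . /etc/resolv.conf
}

example.net:53 {
    health :8081 {
        lameduck 10s
    }
    forward . /etc/resolv.conf
}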

There is also a minor issue: with the health plugin enabled, the health port accumulates a fairly large number of TIME_WAIT connections. The current suspicion is that the plugin periodically requests its own port as a self-check, which is what produces these TIME_WAIT connections.

[root@coredns-10-31-53-1 conf]# netstat -nt | grep 8080 | grep -c TIME_WAIT
61
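One way to check whether these sockets really come from local self-checks is to list them with ss and look at the peer addresses (a sketch, assuming a reasonably recent iproute2):

# list TIME_WAIT sockets that involve port 8080, with local and peer addresses
ss -tan state time-wait '( sport = :8080 or dport = :8080 )'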

2. The ready plugin

The ready plugin is similar to the health plugin. By default it reports the CoreDNS server's readiness at the /ready path on port 8181, and under normal conditions it likewise returns an HTTP 200 status code with OK as the body.

[root@coredns-10-31-53-1 conf]# curl -v http://10.31.53.1:8181/ready
* About to connect() to 10.31.53.1 port 8181 (#0)
* Trying 10.31.53.1...
* Connected to 10.31.53.1 (10.31.53.1) port 8181 (#0)
> GET /ready HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 10.31.53.1:8181
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Thu, 28 Jul 2022 03:53:25 GMT
< Content-Length: 2
< Content-Type: text/plain; charset=utf-8
<
* Connection #0 to host 10.31.53.1 left intact
OK
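In practice ready is paired with a plugin that actually reports readiness; in this environment that is the kubernetes plugin. A rough sketch, where the zone and the plugin arguments are assumptions that depend on the deployment:

.:53 {
    health
    ready
    # kubernetes is one of the plugins that signals its readiness to the ready plugin
    kubernetes cluster.local
    forward . /etc/resolv.conf
}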

When the configuration of one of the components in the CoreDNS service is in an abnormal state, the endpoint instead returns an HTTP 503 status code with the name of the problematic component in the body.

[root@coredns-10-31-53-1 conf]# curl -vv http://10.31.53.1:8181/ready
* About to connect() to 10.31.53.1 port 8181 (#0)
* Trying 10.31.53.1...
* Connected to 10.31.53.1 (10.31.53.1) port 8181 (#0)
> GET /ready HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 10.31.53.1:8181
> Accept: */*
>
< HTTP/1.1 503 Service Unavailable
< Date: Thu, 28 Jul 2022 03:51:44 GMT
< Content-Length: 10
< Content-Type: text/plain; charset=utf-8
<
* Connection #0 to host 10.31.53.1 left intact
kubernetes

Meanwhile, the health plugin's endpoint still returns a 200 status code and OK:

[root@coredns-10-31-53-1 conf]# curl -v http://10.31.53.1:8080/health
* About to connect() to 10.31.53.1 port 8080 (#0)
* Trying 10.31.53.1...
* Connected to 10.31.53.1 (10.31.53.1) port 8080 (#0)
> GET /health HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 10.31.53.1:8080
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Thu, 28 Jul 2022 03:59:45 GMT
< Content-Length: 2
< Content-Type: text/plain; charset=utf-8
<
* Connection #0 to host 10.31.53.1 left intact
OK

From the systemd service status it is easy to see that CoreDNS itself is running, but the kubernetes plugin is not working properly. This illustrates the difference well: the health plugin only cares about the running state of the CoreDNS process itself, while the ready plugin additionally cares about whether the individual components are working properly.

[root@coredns-10-31-53-1 conf]# systemctl status coredns
● coredns.service - CoreDNS
Loaded: loaded (/usr/lib/systemd/system/coredns.service; enabled; vendor preset: disabled)
Active: active (running) since Thu 2022-07-28 11:52:50 CST; 8min ago
Docs: https://coredns.io/manual/toc/
Main PID: 14478 (coredns)
Tasks: 13
Memory: 23.8M
CGroup: /system.slice/coredns.service
└─14478 /home/coredns/sbin/coredns -dns.port=53 -conf /home/coredns/conf/corefile

Jul 28 11:52:50 coredns-10-31-53-1.tinychen.io coredns[14478]: [INFO] plugin/reload: Running configuration MD5 = e3edb2bb003af1e51a1b82bfaebba8f4
Jul 28 11:52:50 coredns-10-31-53-1.tinychen.io coredns[14478]: CoreDNS-1.8.6
Jul 28 11:52:50 coredns-10-31-53-1.tinychen.io coredns[14478]: linux/amd64, go1.17.1, 13a9191
Jul 28 11:52:50 coredns-10-31-53-1.tinychen.io coredns[14478]: [INFO] 127.0.0.1:53443 - 17600 "HINFO IN 6988510158354025264.1665891352749413348.cali-cluster.tclocal. udp 78 false 512" NXDOMAIN qr,aa,rd 192 0.000385901s
Jul 28 11:57:05 coredns-10-31-53-1.tinychen.io coredns[14478]: [INFO] Reloading
Jul 28 11:57:10 coredns-10-31-53-1.tinychen.io coredns[14478]: [WARNING] plugin/kubernetes: starting server with unsynced Kubernetes API
Jul 28 11:57:10 coredns-10-31-53-1.tinychen.io coredns[14478]: [INFO] 127.0.0.1:41957 - 46173 "HINFO IN 3749714491109172199.3469953470964448055.cali-cluster.tclocal. udp 78 false 512" SERVFAIL qr,aa,rd 192 0.00012492s
Jul 28 11:57:10 coredns-10-31-53-1.tinychen.io coredns[14478]: [INFO] plugin/reload: Running configuration MD5 = 2365432f92773a3434ec9ab810392378
Jul 28 11:57:10 coredns-10-31-53-1.tinychen.io coredns[14478]: [INFO] Reloading complete
Jul 28 11:59:49 coredns-10-31-53-1.tinychen.io coredns[14478]: [INFO] plugin/ready: Still waiting on: "kubernetes"

3. Summary

From the comparison above, either plugin is sufficient if all you need is to check the state of the process itself. In the CoreDNS deployment that ships with a default Kubernetes cluster, however, the configuration shows that the two serve different purposes: the health plugin is used for the livenessProbe, which determines whether the pod is still running normally or needs to be destroyed and recreated, while the ready plugin is used for the readinessProbe, which determines whether CoreDNS is ready to serve traffic.

livenessProbe:
  failureThreshold: 5
  httpGet:
    path: /health
    port: 8080
    scheme: HTTP
  initialDelaySeconds: 60
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 5

readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /ready
    port: 8181
    scheme: HTTP
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
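
For reference, the Corefile shipped with a default kubeadm-based deployment enables both plugins roughly as follows (a sketch; the exact contents vary with the Kubernetes and CoreDNS version):

.:53 {
    errors
    health {
        lameduck 5s
    }
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
        ttl 30
    }
    prometheus :9153
    forward . /etc/resolv.conf {
        max_concurrent 1000
    }
    cache 30
    loop
    reload
    loadbalance
}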

For more on configuring liveness and readiness probes, see the official Kubernetes documentation:

The kubelet uses liveness probes to know when to restart a container. For example, liveness probes could catch a deadlock, where an application is running, but unable to make progress. Restarting a container in such a state can help to make the application more available despite bugs.

The kubelet uses readiness probes to know when a container is ready to start accepting traffic. A Pod is considered ready when all of its containers are ready. One use of this signal is to control which Pods are used as backends for Services. When a Pod is not ready, it is removed from Service load balancers.

The kubelet uses startup probes to know when a container application has started. If such a probe is configured, it disables liveness and readiness checks until it succeeds, making sure those probes don’t interfere with the application startup. This can be used to adopt liveness checks on slow starting containers, avoiding them getting killed by the kubelet before they are up and running.