How it works

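A Pod is normally terminated gracefully as follows: the kubelet asks the container runtime to stop the containers in the Pod, first sending a TERM signal (also known as SIGTERM) to the main process of each container, with a grace-period timeout taken from terminationGracePeriodSeconds. The container runtime handles these stop requests asynchronously, and the order in which they are processed is not guaranteed. Many container runtimes respect the STOPSIGNAL value defined in the container image and, if it differs from TERM, send that signal instead. Once the grace period has expired, the container runtime sends KILL to any remaining processes, and the Pod is then removed from the API server. If the kubelet or the container runtime's management service is restarted while waiting for processes to terminate, the cluster retries from the start and gives the Pod the full grace period again.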

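The termination flow normally follows these rules:

If one of the Pod's containers defines a preStop hook and terminationGracePeriodSeconds in the Pod spec is not set to 0, the kubelet starts by running that hook inside the container. terminationGracePeriodSeconds defaults to 30 seconds.

  • If the Pod defines no preStop hook, SIGTERM is sent right away, and any process still running when the grace period ends is killed with SIGKILL (kill -9). This applies whether terminationGracePeriodSeconds is set explicitly or left at the 30-second default.

  • If the preStop hook is still running when the grace period ends, the kubelet requests a small, one-off grace period extension of 2 seconds, so with the default the Pod is deleted after 30 + 2 seconds.

  • If the preStop hook is configured to run longer than terminationGracePeriodSeconds, terminationGracePeriodSeconds still applies.

  • The kubelet sends SIGTERM to the process with PID 1 in each container.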

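  • The Pod is shown as Terminating from the moment deletion is requested and is removed from Service endpoints, so it stops receiving new traffic (a Service only routes to Running, ready Pods).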

  • The Pod either finishes on its own or is forcibly terminated once terminationGracePeriodSeconds + 2 seconds is reached.
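The timing rules above can be sketched as a small calculation. This is a simplified model, not kubelet code: the names Timeline, shutdownTimeline and minWindowSeconds are made up for illustration, it assumes a grace period greater than 0, and it assumes the 2-second window also applies when the hook finishes with less than 2 seconds left.

```go
package main

import "fmt"

// minWindowSeconds is the short window still granted once the grace
// period has been used up (the "+2" in the rules above).
const minWindowSeconds = 2

// Timeline gives offsets in seconds from the moment deletion starts.
type Timeline struct {
	TermAt int // when SIGTERM reaches the container's PID 1
	KillAt int // when the remaining processes get SIGKILL
}

// shutdownTimeline models a Pod with graceSeconds > 0 whose preStop hook
// would run for preStopSeconds (0 means no hook is defined).
func shutdownTimeline(graceSeconds, preStopSeconds int) Timeline {
	// The hook is cut off when the grace period runs out.
	termAt := preStopSeconds
	if termAt > graceSeconds {
		termAt = graceSeconds
	}
	// Whatever is left of the grace period is the window for SIGTERM
	// handling, but never less than minWindowSeconds (an assumption).
	window := graceSeconds - termAt
	if window < minWindowSeconds {
		window = minWindowSeconds
	}
	return Timeline{TermAt: termAt, KillAt: termAt + window}
}

func main() {
	cases := []struct{ grace, preStop int }{
		{30, 0},  // no preStop hook
		{30, 10}, // hook finishes early
		{30, 40}, // a 40 s hook with the default 30 s grace period
	}
	for _, c := range cases {
		t := shutdownTimeline(c.grace, c.preStop)
		fmt.Printf("grace=%ds preStop=%ds -> SIGTERM at %ds, SIGKILL at %ds\n",
			c.grace, c.preStop, t.TermAt, t.KillAt)
	}
}
```

For a 40-second hook with the default 30-second grace period, the model puts SIGTERM at 30 seconds and SIGKILL at 32 seconds, which is the 30 + 2 case described above.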

Demo

The Dockerfile used

dockerfile
FROM golang:1.22.12-bookworm AS builder
# MAINTAINER is deprecated; a label carries the same information.
LABEL maintainer="seiya"
WORKDIR /app
COPY ./main.go /app
COPY 1.sh /app/
ENV GOPROXY=https://goproxy.cn,direct
RUN apt-get update && apt-get install -y libx11-dev dumb-init
# Build a static binary so it can run in the slim runtime image.
RUN CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -ldflags "-s -w" -o signal main.go && \
    chmod +x signal

#FROM gcr.io/distroless/base-debian12 AS runner
FROM amd64/debian:12-slim AS runner
WORKDIR /app
#COPY --from=builder /usr/bin/dumb-init /usr/bin/dumb-init
COPY --from=builder /app/1.sh /app/
COPY --from=builder /app/signal /usr/sbin/
RUN apt-get update && apt-get install -y dumb-init busybox
VOLUME ["/app"]
ENTRYPOINT ["dumb-init","--single-child", "--"]
CMD ["/app/1.sh"]

The Deployment manifest used

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: test
  name: test-signal
  namespace: debug
spec:
  selector:
    matchLabels:
      app: test-signal
  template:
    metadata:
      labels:
        app: test-signal
    spec:
      containers:
        - image: cylonchau/signal-test:6.2
          imagePullPolicy: Always
          name: test-signal
          lifecycle:
            preStop:
              exec:
                command:
                  - /home/busybox
                  - sleep
                  - '40'
      restartPolicy: Always

The startup script used

bash
#!/usr/bin/busybox sh

# On SIGTERM (kill -15), send SIGTERM to child_pid, then wait for it to exit
trap 'kill -TERM $child_pid; wait $child_pid' TERM

# Start the real process
/usr/sbin/signal &
child_pid=$!

# Wait for the child process to exit
wait $child_pid

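When a redeploy is triggered, the container restarts immediately: dumb-init passes the signal to the script, but the script does not pass it on to the child process, and once the parent process exits the container exits with it.

The process is killed immediately. In the output, "kill -15 进程退出" means "kill -15, process exiting" and "等待进程完成" means "waiting for processes to finish".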

bash
process 5 : 16
process 2 : 14
process 4 : 16
process 1 : 13
process 6 : 18
process 3 : 15
process 8 : 20
process 7 : 19
process 9 : 21
process 5 : 17
process 0 : 12
kill -15 进程退出
等待进程完成

Exit trace in the normal case, without the script

bash
process 1 : 1
2025-02-11T03:09:39.724905088Z process 9 : 9
2025-02-11T03:09:39.724909372Z process 2 : 2
2025-02-11T03:09:39.724913167Z waiting signal...
2025-02-11T03:09:39.724916314Z process 6 : 6
2025-02-11T03:09:39.724920035Z process 7 : 7
process 0 : 0

...

2025-02-11T03:09:56.736808435Z process 1 : 18
2025-02-11T03:09:56.736822881Z process 5 : 22
2025-02-11T03:09:56.736826547Z process 2 : 19
kill -15 进程退出
等待进程完成
process 5 : 23
2025-02-11T03:09:57.737872349Z process 3 : 21
2025-02-11T03:09:57.737880369Z process 0 : 18
2025-02-11T03:09:57.737885600Z process 9 : 27
process 8 : 26
2025-02-11T03:09:57.737896163Z process 7 : 25
2025-02-11T03:09:57.737901236Z process 4 : 22
2025-02-11T03:09:57.737907145Z process 6 : 24
2025-02-11T03:09:57.737912198Z process 1 : 19
2025-02-11T03:09:57.737916878Z process 2 : 20
process 2 : 21
2025-02-11T03:09:58.738805260Z process 1 : 20
2025-02-11T03:09:58.738812505Z process 5 : 24
2025-02-11T03:09:58.738824443Z process 3 : 22
2025-02-11T03:09:58.738829906Z process 0 : 19
2025-02-11T03:09:58.738834665Z process 9 : 28
process 8 : 27
2025-02-11T03:09:58.738845432Z process 7 : 26
2025-02-11T03:09:58.738850652Z process 4 : 23
2025-02-11T03:09:58.738872801Z process 6 : 25
process 4 : 24
2025-02-11T03:09:59.739946678Z process 1 : 21
2025-02-11T03:09:59.739954313Z process 5 : 25
2025-02-11T03:09:59.739959821Z process 3 : 23
process 0 : 20
2025-02-11T03:09:59.739971261Z process 9 : 29
2025-02-11T03:09:59.739977126Z process 8 : 28
2025-02-11T03:09:59.739982936Z process 7 : 27
2025-02-11T03:09:59.739988609Z process 6 : 26
2025-02-11T03:09:59.739994269Z process 2 : 22
process 7 : 28
process 4 : 25
2025-02-11T03:10:00.740901309Z process 1 : 22
2025-02-11T03:10:00.740904915Z process 5 : 26
2025-02-11T03:10:00.740908607Z process 3 : 24
2025-02-11T03:10:00.740912412Z process 0 : 21
...

If the command is launched from a script, a trap must be added

bash
#!/usr/bin/busybox sh

# On SIGTERM (kill -15), send SIGTERM to child_pid, then wait for it to exit
trap 'kill -TERM $child_pid; wait $child_pid' TERM

# Start the real process
/usr/sbin/signal &
child_pid=$!

# Wait for the child process to exit
wait $child_pid
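The same forward-and-wait pattern as the trap script can be written as a small Go wrapper, which may help when the container image has no shell. This is a minimal sketch under assumptions: runForwarding and the binary name forward are hypothetical, and only SIGTERM and SIGINT are relayed.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

// runForwarding starts the command, relays SIGTERM and SIGINT to it, and
// returns once the child has exited (the equivalent of trap + wait).
func runForwarding(name string, args ...string) error {
	// Register the handler before starting the child so that no signal
	// can arrive in between and kill the wrapper itself.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)
	defer signal.Stop(sigs)

	cmd := exec.Command(name, args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Start(); err != nil {
		return err
	}

	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	for {
		select {
		case s := <-sigs:
			// Pass the signal on; the child decides how to shut down.
			_ = cmd.Process.Signal(s)
		case err := <-done:
			return err
		}
	}
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: forward <command> [args...]")
		return
	}
	if err := runForwarding(os.Args[1], os.Args[2:]...); err != nil {
		fmt.Fprintln(os.Stderr, "child exited:", err)
		os.Exit(1)
	}
}
```

In the image from this post it would be invoked as forward /usr/sbin/signal. When the script does nothing after starting the program, a simpler option is to end it with exec /usr/sbin/signal, so the program replaces the shell and receives the signal directly.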

Then test again

bash
2025-02-11T09:02:23.499272204Z process 5 : 444
kill -15 进程退出
等待进程完成
process 2 : 442
2025-02-11T09:02:24.500288953Z process 6 : 446
2025-02-11T09:02:24.500296520Z process 9 : 449
2025-02-11T09:02:24.500303369Z process 1 : 441
2025-02-11T09:02:24.500309809Z process 0 : 440
2025-02-11T09:02:24.500316118Z process 8 : 448
process 4 : 444
2025-02-11T09:02:24.500329508Z process 5 : 445
2025-02-11T09:02:24.500336311Z process 3 : 443
2025-02-11T09:02:24.500342658Z process 7 : 447
process 5 : 446
2025-02-11T09:02:25.501432488Z process 3 : 444
process 4 : 445
2025-02-11T09:02:25.501462805Z process 7 : 448
2025-02-11T09:02:25.501469120Z process 1 : 442
2025-02-11T09:02:25.501474984Z process 6 : 447
2025-02-11T09:02:25.501481006Z process 9 : 450
process 0 : 441
2025-02-11T09:02:25.501493018Z process 2 : 443
2025-02-11T09:02:25.501498026Z process 8 : 449
process 9 : 451
2025-02-11T09:02:26.502288615Z process 5 : 447
2025-02-11T09:02:26.502296794Z process 3 : 445
process 4 : 446
2025-02-11T09:02:26.502309015Z process 1 : 443
2025-02-11T09:02:26.502315239Z process 6 : 448
2025-02-11T09:02:26.502321715Z process 8 : 450
2025-02-11T09:02:26.502327547Z process 2 : 444
2025-02-11T09:02:26.502333990Z process 0 : 442
2025-02-11T09:02:26.502340285Z process 7 : 449
process 2 : 445
2025-02-11T09:02:27.503276894Z process 9 : 452
2025-02-11T09:02:27.503284666Z process 5 : 448
2025-02-11T09:02:27.503290863Z process 3 : 446
process 4 : 447
2025-02-11T09:02:27.503303407Z process 1 : 444
process 6 : 449
2025-02-11T09:02:27.503315512Z process 7 : 450
2025-02-11T09:02:27.503322049Z process 8 : 451
process 0 : 443
process 6 : 450
2025-02-11T09:02:28.504232236Z process 2 : 446
process 9 : 453
2025-02-11T09:02:28.504248187Z process 5 : 449
2025-02-11T09:02:28.504255023Z process 3 : 447
process 4 : 448
2025-02-11T09:02:28.504267497Z process 1 : 445
2025-02-11T09:02:28.504274473Z process 0 : 444
process 7 : 451
2025-02-11T09:02:28.504286872Z process 8 : 452
process 1 : 446