This article was published on Cylon's Collection (Cylon的收藏册); please credit the original link when reposting.

Troubleshooting Record

Ceph version: Octopus

The first problem was that the business-side client could not mount CephFS. The kernel log showed bad authorize reply, so my initial guess was that the Ceph keyring had been replaced.

2019-01-30 17:26:58 localhost kernel: libceph: mds0 10.80.20.100:6801 bad authorize reply
2019-01-30 17:26:58 localhost kernel: libceph: mds0 10.80.20.100:6801 bad authorize reply
2019-01-30 17:26:58 localhost kernel: libceph: mds0 10.80.20.100:6801 bad authorize reply
2019-01-30 17:26:58 localhost kernel: libceph: mds0 10.80.20.100:6801 bad authorize reply
2019-01-30 17:26:58 localhost kernel: libceph: mds0 10.80.20.100:6801 bad authorize reply
2019-01-30 17:26:58 localhost kernel: libceph: mds0 10.80.20.100:6801 bad authorize reply
2019-01-30 17:26:58 localhost kernel: libceph: mds0 10.80.20.100:6801 bad authorize reply
2019-01-30 17:26:58 localhost kernel: libceph: mds0 10.80.20.100:6801 bad authorize reply

After ruling out the keyring, I tried mounting CephFS manually and got Input/output error; at that point it was clearly a cluster-side problem.

$ mount -t ceph 10.80.20.100:6789:/tmp /tmp/ceph -o secret=AQCoW0dgQk4qGhAAwayKv70OSyyWB3XpZ1JLYQ==,name=cephuser
mount error 5 = Input/output error

Because the log showed bad authorize reply, I first assumed an authentication error, but re-authenticating produced the same message. Checking ceph status then revealed that the cluster was unhealthy: besides the errors below, one OSD was also down.

$ ceph health detail
HEALTH_WARN 1 host fail cephadm check; 1 MDSs report slow metadata IOs; 1 MDSs report slow requests; clock skew detected on mon.localhost; Degraded data redundancy: 39560/118680 objects degraded (33.333%), 201 pgs degraded, 255 pgs undersized; 4 daemons have recently crashed
[WRN] CEPHADM_HOST_CHECK_FAIL: 1 hosts fail cephadm check
    host localhost failed check: ['podman|docker (/bin/docker) is present', 'systemctl is present', 'lvcreate is present', "No time sync service is running; checked for ['chrony.service', 'chronyd.service', 'systemd-timesyncd.service', 'ntpd.service', 'ntp.service']", 'ERROR: No time synchronization is active']
[WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
    mds.cephfs.localhost.mhlzaj(mds.0): 20 slow metadata IOs are blocked > 30 secs, oldest blocked for 4830 secs
[WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
    mds.cephfs.localhost.mhlzaj(mds.0): 14 slow metadata IOs are blocked > 30 secs
[WRN] MON_CLOCK_SKEW: clock skew detected on mon.localhost, mon.localhost1
    mon.localhost clock skew 29357.8s > max 0.05s (latency 0.0132089s)
    mon.localhost1 clock skew 29357.8s > max 0.05s (latency 0.0117421s)
[WRN] PG_DEGRADED: Degraded data redundancy: 39501/118680 objects degraded (33.333%), 189 pgs degraded, 241 pgs undersized
    pg 1.0 is stuck undersized for 22m, current state active+undersized+degraded, last acting [1]
    pg 2.0 is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 2.1 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 2.2 is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 2.3 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 2.4 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 2.5 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 2.6 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 2.7 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 2.8 is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 2.c is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 2.d is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 2.e is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 2.f is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 2.10 is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 2.11 is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 2.12 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 2.13 is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 2.14 is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 2.15 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 2.16 is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 2.17 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 2.18 is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 2.19 is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 2.1a is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 2.1b is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 3.0 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 3.1 is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 3.2 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 3.3 is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 3.4 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 3.5 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 3.6 is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 3.7 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 3.9 is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 3.c is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 3.d is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 3.e is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 3.f is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 3.10 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 3.11 is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 3.12 is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 3.13 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 3.14 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 3.15 is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 3.16 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 3.17 is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 3.18 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 3.19 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 3.1a is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
[WRN] RECENT_CRASH: 4 daemons have recently crashed
    osd.4 crashed on host xxxxxx at 20xx-0x-xxT04xx:xx:xx.xxxxxxz
    client.xxx.xxxx.hostname.xxxx crashed on host hostname at 20xx-0x-xxT04xx:xx:xx.xxxxxxz
  • CEPHADM_HOST_CHECK_FAIL: one or more hosts failed the basic cephadm host check, which verifies that (1) the host is reachable and cephadm can be executed on it, and (2) the host satisfies basic prerequisites, such as a working container runtime (podman or docker) and working time synchronization. If this check fails, cephadm will be unable to manage services on that host.

  • MDS_SLOW_METADATA_IO: one or more MDS daemons report slow metadata I/Os, typically because the metadata pool cannot complete writes promptly (here, due to the degraded PGs).

  • MDS_SLOW_REQUEST: one or more MDS daemons report that N client requests are blocked.

  • MON_CLOCK_SKEW: the clocks on the hosts running ceph-mon are not well synchronized. This health check is raised when the cluster detects a clock skew larger than mon_clock_drift_allowed.

  • PG_DEGRADED: the health of one or more PGs is compromised. A common cause is an OSD failing or going offline, which leaves its PGs in a degraded state: data redundancy is reduced, and cluster performance may also drop.
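As a quick triage aid, the "last acting" sets in the PG_DEGRADED output can be tallied to see which OSDs still hold the degraded PGs. A minimal sketch over a few sample lines in the shape of the output above (the sample file and counts are illustrative, not the real incident data):

```shell
# Sample lines shaped like the `ceph health detail` PG output above
# (illustrative only, not the real incident data)
cat > /tmp/pg_sample.txt <<'EOF'
pg 2.0 is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
pg 2.1 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
pg 2.2 is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
pg 2.3 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
EOF

# Tally stuck PGs per surviving acting OSD: if every PG lists only osd.0 or
# osd.1, the missing replicas all lived on the single OSD that is down.
awk '/stuck undersized/ {gsub(/[\[\]]/, "", $NF); count[$NF]++}
     END {for (osd in count) printf "osd.%s: %d pgs\n", osd, count[osd]}' \
    /tmp/pg_sample.txt | sort
```

With a replica count of 2 surviving everywhere, a single down OSD explains the uniform 33.333% degradation reported in the summary line.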

The first step was to restart chronyd, which fixed the time synchronization problem (machines in this server room often run into chronyd service failures). Next I restarted the down OSD and let Ceph rebalance; once that finished, the errors below remained.
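The remediation above can be sketched as the commands below. The daemon ids are placeholders (on a cephadm/Octopus host the OSD may instead run inside a container managed by a ceph-&lt;fsid&gt;@osd.N unit), and a dry-run guard makes the block only echo what it would do:

```shell
# Dry-run guard: set DRY_RUN=0 to actually execute (daemon ids are placeholders).
DRY_RUN=${DRY_RUN:-1}
run() {
  if [ "$DRY_RUN" = 1 ]; then echo "would run: $*"; else "$@"; fi
}

run systemctl restart chronyd       # fix the time-sync failure behind MON_CLOCK_SKEW
run chronyc tracking                # verify the clock is converging again
run systemctl restart ceph-osd@4    # placeholder: bring the down OSD (osd.4) back
run ceph -w                         # watch recovery/rebalance progress
```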

There were still two symptoms:

  • cephfs: no such file or directory
  • the ceph orch command still hung with no response
$ ceph health detail
HEALTH_WARN 1 host fail cephadm check; 1 MDSs report slow metadata IOs; 1 MDSs report slow requests; clock skew detected on mon.localhost; Degraded data 
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs cephfs is degraded
[WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
     mds.cephfs.localhost.mhlzaj(mds.0): 20 slow metadata IOs are blocked > 30 secs, oldest blocked for 930 secs
[WRN] RECENT_CRASH: 4 daemons have recently crashed
    osd.4 crashed on host xxxxxx at 20xx-0x-xxT04xx:xx:xx.xxxxxxz
    client.xxx.xxxx.hostname.xxxx crashed on host hostname at 20xx-0x-xxT04xx:xx:xx.xxxxxxz
[WRN] SLOW_OPS: 2 slow ops, oldest one blocked for 474 sec, mon.localhost has slow ops

A quick search turned up that orch is an MGR module:

The orchestrator is a MGR module, have you checked if the containers are up and running [1]

So I logged in to the corresponding Ceph node, ran docker restart on the ceph-mgr container, and also restarted the MDS daemon.
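That recovery step, sketched as dry-run shell (the guard only echoes commands; the container and unit names are placeholders, so find the real mgr container with docker ps first):

```shell
# Dry-run guard: set DRY_RUN=0 to actually execute (names are placeholders).
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "would run: $*"; else "$@"; fi; }

run docker restart ceph-mgr-localhost       # kick the mgr so the orch module responds again
run systemctl restart ceph-mds@localhost    # restart the MDS to clear stuck metadata IOs
run ceph orch status                        # confirm the orchestrator answers now
```

Restarting the mgr via docker works here because under cephadm the mgr daemon runs inside a container.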

$ ceph health detail
[WRN] RECENT_CRASH: 4 daemons have recently crashed
    osd.4 crashed on host xxxxxx at 20xx-0x-xxT04xx:xx:xx.xxxxxxz
    client.xxx.xxxx.hostname.xxxx crashed on host hostname at 20xx-0x-xxT04xx:xx:xx.xxxxxxz

After that the cluster returned to healthy and CephFS recovered.

Summary

Having not worked on Ceph problems for a long time, my troubleshooting was rusty in the following ways:

  • When the mount failed, I did not check the Ceph cluster status right away, and instead spent a long time staring at client-side logs.
  • I was not familiar with Ceph health-check codes; knowing them would have pinpointed the problem immediately instead of losing two hours.

Reference

[1] ceph orch status hangs forever

[2] HEALTH CHECKS

[3] CEPHFS HEALTH MESSAGES


Link: A Ceph cluster failure troubleshooting record

License: This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license (CC BY-NC-SA 4.0).