This article was published on Cylon's Collection (Cylon的收藏册); please credit the original link when reposting.

Troubleshooting Record

Ceph version: Octopus

The first problem was that the business-side client could not mount CephFS. The kernel log showed bad authorize reply, so my initial guess was that the Ceph keyring had been replaced.

2019-01-30 17:26:58 localhost kernel: libceph: mds0 10.80.20.100:6801 bad authorize reply
2019-01-30 17:26:58 localhost kernel: libceph: mds0 10.80.20.100:6801 bad authorize reply
2019-01-30 17:26:58 localhost kernel: libceph: mds0 10.80.20.100:6801 bad authorize reply
2019-01-30 17:26:58 localhost kernel: libceph: mds0 10.80.20.100:6801 bad authorize reply
2019-01-30 17:26:58 localhost kernel: libceph: mds0 10.80.20.100:6801 bad authorize reply
2019-01-30 17:26:58 localhost kernel: libceph: mds0 10.80.20.100:6801 bad authorize reply
2019-01-30 17:26:58 localhost kernel: libceph: mds0 10.80.20.100:6801 bad authorize reply
2019-01-30 17:26:58 localhost kernel: libceph: mds0 10.80.20.100:6801 bad authorize reply

After ruling out the keyring, I tried mounting CephFS manually and got Input/output error; at that point it was clearly a cluster-side problem.

$ mount -t ceph 10.80.20.100:6789:/tmp /tmp/ceph -o secret=AQCoW0dgQk4qGhAAwayKv70OSyyWB3XpZ1JLYQ==,name=cephuser
mount error 5 = Input/output error

Because the log showed bad authorize reply, I first assumed an authentication error, but re-authenticating produced the same message. Checking ceph status then revealed that the cluster was unhealthy: besides the errors below, one OSD was also down.

$ ceph health detail
HEALTH_WARN 1 host fail cephadm check; 1 MDSs report slow metadata IOs; 1 MDSs report slow requests; clock skew detected on mon.localhost; Degraded data redundancy: 39560/118680 objects degraded (33.333%), 201 pgs degraded, 255 pgs undersized; 4 daemons have recently crashed
[WRN] CEPHADM_HOST_CHECK_FAIL: 1 hosts fail cephadm check
    host localhost failed check: ['podman|docker (/bin/docker) is present', 'systemctl is present', 'lvcreate is present', "No time sync service is running; checked for ['chrony.service', 'chronyd.service', 'systemd-timesyncd.service', 'ntpd.service', 'ntp.service']", 'ERROR: No time synchronization is active']
[WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
    mds.cephfs.localhost.mhlzaj(mds.0): 20 slow metadata IOs are blocked > 30 secs, oldest blocked for 4830 secs
[WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
    mds.cephfs.localhost.mhlzaj(mds.0): 14 slow metadata IOs are blocked > 30 secs
[WRN] MON_CLOCK_SKEW: clock skew detected on mon.localhost, mon.localhost1
    mon.localhost clock skew 29357.8s > max 0.05s (latency 0.0132089s)
    mon.localhost1 clock skew 29357.8s > max 0.05s (latency 0.0117421s)
[WRN] PG_DEGRADED: Degraded data redundancy: 39501/118680 objects degraded (33.333%), 189 pgs degraded, 241 pgs undersized
    pg 1.0 is stuck undersized for 22m, current state active+undersized+degraded, last acting [1]
    pg 2.0 is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 2.1 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 2.2 is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 2.3 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 2.4 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 2.5 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 2.6 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 2.7 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 2.8 is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 2.c is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 2.d is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 2.e is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 2.f is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 2.10 is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 2.11 is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 2.12 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 2.13 is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 2.14 is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 2.15 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 2.16 is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 2.17 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 2.18 is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 2.19 is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 2.1a is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 2.1b is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 3.0 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 3.1 is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 3.2 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 3.3 is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 3.4 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 3.5 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 3.6 is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 3.7 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 3.9 is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 3.c is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 3.d is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 3.e is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 3.f is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 3.10 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 3.11 is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 3.12 is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 3.13 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 3.14 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 3.15 is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 3.16 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 3.17 is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
    pg 3.18 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 3.19 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
    pg 3.1a is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
[WRN] RECENT_CRASH: 4 daemons have recently crashed
    osd.4 crashed on host xxxxxx at 20xx-0x-xxT04xx:xx:xx.xxxxxxz
    client.xxx.xxxx.hostname.xxxx crashed on host hostname at 20xx-0x-xxT04xx:xx:xx.xxxxxxz
  • CEPHADM_HOST_CHECK_FAIL: one or more hosts failed the basic cephadm host check, which verifies that (1) the host is reachable and cephadm can be executed on it, and (2) the host satisfies basic prerequisites, such as a working container runtime (podman or docker) and working time synchronization. If this check fails, cephadm will be unable to manage services on that host.

  • MDS_SLOW_METADATA_IO: one or more MDS daemons report slow metadata I/Os, typically because the metadata pool cannot complete writes promptly (here, due to the degraded PGs).

  • MDS_SLOW_REQUEST: one or more MDS daemons report that N client requests are blocked.

  • MON_CLOCK_SKEW: the clocks on the hosts running ceph-mon are not well synchronized. This health check is raised when the cluster detects a clock skew larger than mon_clock_drift_allowed.

  • PG_DEGRADED: the health of one or more PGs is compromised. A common cause is an OSD failing or going offline, which leaves its PGs in a degraded state: data redundancy is reduced, and cluster performance may also drop.
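As a quick triage aid, the "last acting" sets in the PG_DEGRADED output can be tallied to see which OSDs still hold the degraded PGs. A minimal sketch over a few sample lines in the shape of the output above (the sample file and counts are illustrative, not the real incident data):

```shell
# Sample lines shaped like the `ceph health detail` PG output above
# (illustrative only, not the real incident data)
cat > /tmp/pg_sample.txt <<'EOF'
pg 2.0 is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
pg 2.1 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
pg 2.2 is stuck undersized for 4d, current state active+undersized+degraded, last acting [0]
pg 2.3 is stuck undersized for 4d, current state active+undersized+degraded, last acting [1]
EOF

# Tally stuck PGs per surviving acting OSD: if every PG lists only osd.0 or
# osd.1, the missing replicas all lived on the single OSD that is down.
awk '/stuck undersized/ {gsub(/[\[\]]/, "", $NF); count[$NF]++}
     END {for (osd in count) printf "osd.%s: %d pgs\n", osd, count[osd]}' \
    /tmp/pg_sample.txt | sort
```

With a replica count of 2 surviving everywhere, a single down OSD explains the uniform 33.333% degradation reported in the summary line.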

The first step was to restart chronyd, which fixed the time synchronization problem (machines in this server room often run into chronyd service failures). Next I restarted the down OSD and let Ceph rebalance; once that finished, the errors below remained.
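The remediation above can be sketched as the commands below. The daemon ids are placeholders (on a cephadm/Octopus host the OSD may instead run inside a container managed by a ceph-&lt;fsid&gt;@osd.N unit), and a dry-run guard makes the block only echo what it would do:

```shell
# Dry-run guard: set DRY_RUN=0 to actually execute (daemon ids are placeholders).
DRY_RUN=${DRY_RUN:-1}
run() {
  if [ "$DRY_RUN" = 1 ]; then echo "would run: $*"; else "$@"; fi
}

run systemctl restart chronyd       # fix the time-sync failure behind MON_CLOCK_SKEW
run chronyc tracking                # verify the clock is converging again
run systemctl restart ceph-osd@4    # placeholder: bring the down OSD (osd.4) back
run ceph -w                         # watch recovery/rebalance progress
```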

There were still two symptoms:

  • cephfs: no such file or directory
  • the ceph orch command still hung with no response
$ ceph health detail
HEALTH_WARN 1 host fail cephadm check; 1 MDSs report slow metadata IOs; 1 MDSs report slow requests; clock skew detected on mon.localhost; Degraded data 
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs cephfs is degraded
[WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
     mds.cephfs.localhost.mhlzaj(mds.0): 20 slow metadata IOs are blocked > 30 secs, oldest blocked for 930 secs
[WRN] RECENT_CRASH: 4 daemons have recently crashed
    osd.4 crashed on host xxxxxx at 20xx-0x-xxT04xx:xx:xx.xxxxxxz
    client.xxx.xxxx.hostname.xxxx crashed on host hostname at 20xx-0x-xxT04xx:xx:xx.xxxxxxz
[WRN] SLOW_OPS: 2 slow ops, oldest one blocked for 474 sec, mon.localhost has slow ops

A quick search turned up that orch is an MGR module:

The orchestrator is a MGR module, have you checked if the containers are up and running [1]

So I logged in to the corresponding Ceph node, ran docker restart on the ceph-mgr container, and also restarted the MDS daemon.
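That recovery step, sketched as dry-run shell (the guard only echoes commands; the container and unit names are placeholders, so find the real mgr container with docker ps first):

```shell
# Dry-run guard: set DRY_RUN=0 to actually execute (names are placeholders).
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "would run: $*"; else "$@"; fi; }

run docker restart ceph-mgr-localhost       # kick the mgr so the orch module responds again
run systemctl restart ceph-mds@localhost    # restart the MDS to clear stuck metadata IOs
run ceph orch status                        # confirm the orchestrator answers now
```

Restarting the mgr via docker works here because under cephadm the mgr daemon runs inside a container.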

$ ceph health detail
[WRN] RECENT_CRASH: 4 daemons have recently crashed
    osd.4 crashed on host xxxxxx at 20xx-0x-xxT04xx:xx:xx.xxxxxxz
    client.xxx.xxxx.hostname.xxxx crashed on host hostname at 20xx-0x-xxT04xx:xx:xx.xxxxxxz

After that the cluster returned to healthy and CephFS recovered.

Summary

Having not worked on Ceph problems for a long time, my troubleshooting was rusty in the following ways:

  • When the mount failed, I did not check the Ceph cluster status right away, and instead spent a long time staring at client-side logs.
  • I was not familiar with Ceph health-check codes; knowing them would have pinpointed the problem immediately instead of losing two hours.

Reference

[1] ceph orch status hangs forever

[2] HEALTH CHECKS

[3] CEPHFS HEALTH MESSAGES


Link: A Ceph cluster failure troubleshooting record

License: This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license (CC BY-NC-SA 4.0).