Fixing a Rancher Startup Failure Caused by an etcd NOSPACE Alarm

Symptoms

Rancher fails to start. The Rancher logs show the cluster repeating the same error:

Waiting on etcd startup: status 503

The message shows that etcd is the component blocking cluster startup. Open a shell inside the Rancher container and check etcd there:

etcdctl check datascale

{"level":"warn","ts":"2022-12-26T06:56:07.062Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-315441f2-5ef2-4474-91b4-249484ee17de/127.0.0.1:2379","attempt":0,"error":"rpc error: code = ResourceExhausted desc = etcdserver: mvcc: database space exceeded"}
 10000 / 10000 Boooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00% 1s
FAIL: too many errors
FAIL: ERROR(etcdserver: mvcc: database space exceeded) -> 10000

etcdctl check perf

FAIL: too many errors
FAIL: ERROR(etcdserver: mvcc: database space exceeded) -> 8994
FAIL: Throughput too low: 1 writes/s
PASS: Slowest request took 0.000000s
PASS: Stddev is NaNs
FAIL

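Both checks fail with "too many errors", and every error is "etcdserver: mvcc: database space exceeded".
Next, print the endpoint status: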

etcdctl endpoint status --write-out table

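The endpoint status confirms the cause: the etcd database has exceeded its space quota. etcd has raised a NOSPACE alarm, and while that alarm is active it rejects writes, which produces the errors shown above.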
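To check how close the database is to its space quota, the following minimal sketch pulls dbSize out of the JSON form of the endpoint status and reports it as a percentage of the quota. The sample JSON, the function names extract_db_size and db_usage_percent, and the 2 GiB quota (etcd's default when --quota-backend-bytes is not set) are assumptions for illustration. Substitute the output and quota of your own node.

```shell
#!/bin/sh
# Hypothetical sample (values are made up). On a real node, capture it with:
#   etcdctl --endpoints=http://127.0.0.1:2379 endpoint status --write-out=json
sample_status='[{"Endpoint":"http://127.0.0.1:2379","Status":{"header":{"revision":1500000},"dbSize":2147487744}}]'

# Assumed quota: 2 GiB, the etcd default when --quota-backend-bytes is unset.
quota_bytes=2147483648

# Print the first "dbSize" value found in the JSON text passed as $1.
extract_db_size() {
    printf '%s\n' "$1" | grep -Eo '"dbSize":[0-9]+' | head -n 1 | grep -Eo '[0-9]+'
}

# Print dbSize ($1) as an integer percentage of the quota ($2).
db_usage_percent() {
    echo $(( $1 * 100 / $2 ))
}

db_size=$(extract_db_size "$sample_status")
usage=$(db_usage_percent "$db_size" "$quota_bytes")
echo "dbSize=${db_size} bytes, ${usage}% of quota"
```

A value at or above 100% is consistent with the NOSPACE alarm. This is a text filter rather than a real JSON parser, so it only suits the compact single-endpoint output assumed here.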

Compact and defragment to reclaim space

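The official etcd documentation describes the fix: compact the etcd key-value history, defragment the database to release the freed space, and then disarm the alarm.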

# Use the v3 API
export ETCDCTL_API=3
# List active alarms
etcdctl --endpoints=http://127.0.0.1:2379 alarm list
# Output observed while the alarm is active:
# memberID:10276657743932975437 alarm:NOSPACE

# Get the current revision
rev=$(etcdctl --endpoints=http://127.0.0.1:2379 endpoint status --write-out="json" | grep -Eo '"revision":[0-9]*' | grep -Eo '[0-9].*')
# Compact away all revisions older than the current one
etcdctl --endpoints=http://127.0.0.1:2379 compact "$rev"
# Defragment to release the space freed by compaction
etcdctl --endpoints=http://127.0.0.1:2379 defrag
# Disarm the alarm so that etcd accepts writes again
etcdctl --endpoints=http://127.0.0.1:2379 alarm disarm
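The revision lookup in the commands above is a two-stage text filter, not a JSON parse. The sketch below runs the same filter (written with grep -E, which is equivalent to egrep) against a hypothetical sample of the status JSON, then checks that exactly one numeric revision came out before it would be passed to compact. The sample values and the helper name is_single_revision are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical sample of `endpoint status --write-out=json` output (values are made up).
sample_status='[{"Endpoint":"http://127.0.0.1:2379","Status":{"header":{"cluster_id":1,"member_id":2,"revision":1500000,"raft_term":3},"dbSize":2147487744}}]'

# Stage 1 keeps only the "revision":<digits> fragment.
# Stage 2 keeps everything from the first digit onward, leaving the bare number.
rev=$(printf '%s\n' "$sample_status" | grep -Eo '"revision":[0-9]*' | grep -Eo '[0-9].*')

# Succeed only if $1 is a non-empty string made of digits alone.
# An empty result or several lines (e.g. several endpoints) fails the check.
is_single_revision() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

if is_single_revision "$rev"; then
    echo "revision to compact to: ${rev}"
else
    echo "could not extract a single revision; not compacting" >&2
fi
```

The guard is worth having because compact with an empty or multi-line argument fails, and the later defrag and alarm disarm steps would then run without any space having been reclaimed.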

After compaction and defragmentation, the database size is as follows: