Fixing a Rancher startup failure caused by an etcd NOSPACE alarm (database space exceeded)


Symptoms

Rancher fails to start. The Rancher logs show the cluster repeating this error:

Waiting on etcd startup: status 503

etcd is clearly what is blocking cluster startup, so enter the Rancher container and run etcd's built-in checks:

etcdctl check datascale
{"level":"warn","ts":"2022-12-26T06:56:07.062Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-315441f2-5ef2-4474-91b4-249484ee17de/127.0.0.1:2379","attempt":0,"error":"rpc error: code = ResourceExhausted desc = etcdserver: mvcc: database space exceeded"}
 10000 / 10000 Boooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00% 1s
FAIL: too many errors
FAIL: ERROR(etcdserver: mvcc: database space exceeded) -> 10000
etcdctl check perf
FAIL: too many errors
FAIL: ERROR(etcdserver: mvcc: database space exceeded) -> 8994
FAIL: Throughput too low: 1 writes/s
PASS: Slowest request took 0.000000s
PASS: Stddev is NaNs
FAIL
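
The failure signature above is easy to test for in a script. The following is a minimal sketch, assuming the error text matches the log lines shown; the function name `is_space_exceeded` is hypothetical and not part of etcdctl.

```shell
#!/bin/sh
# Return 0 if the given text contains etcd's quota-exceeded error, 1 otherwise.
# is_space_exceeded is a hypothetical helper name, not an etcdctl feature.
is_space_exceeded() {
    printf '%s\n' "$1" | grep -q 'mvcc: database space exceeded'
}

# Possible usage: capture the output of a failing check and test it.
# out=$(etcdctl check perf 2>&1)
# is_space_exceeded "$out" && echo "etcd backend quota exceeded"
```

Matching on the error text rather than the exit code keeps the check usable on saved Rancher logs as well as live command output.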

Both checks fail with "too many errors", and every error is "database space exceeded".
Next, print the endpoint status:

etcdctl endpoint status --write-out table

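The endpoint status shows that the database has hit its storage quota. When that happens etcd raises a NOSPACE alarm and rejects further writes until space is reclaimed and the alarm is cleared.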
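
To see how close the database is to the quota, the size can be read from the JSON form of the same status command. This is a rough sketch: it assumes the output contains a `dbSize` field in bytes and that the quota is etcd's default of 2 GiB (adjust `QUOTA_BYTES` if `--quota-backend-bytes` was set). The helper names are hypothetical.

```shell
#!/bin/sh
# Assumed quota: etcd's default of 2 GiB. Override if --quota-backend-bytes is set.
QUOTA_BYTES=${QUOTA_BYTES:-2147483648}

# Extract the first dbSize value (bytes) from `endpoint status --write-out=json` output.
extract_db_size() {
    printf '%s\n' "$1" | grep -Eo '"dbSize":[0-9]+' | head -n 1 | grep -Eo '[0-9]+'
}

# Print database usage as an integer percentage of the quota.
db_usage_percent() {
    size=$(extract_db_size "$1")
    [ -n "$size" ] || return 1
    echo $(( $size * 100 / $QUOTA_BYTES ))
}

# Possible usage:
# json=$(etcdctl --endpoints=http://127.0.0.1:2379 endpoint status --write-out=json)
# echo "etcd database is at $(db_usage_percent "$json")% of quota"
```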

Compact and defragment to reclaim space

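The official etcd documentation describes the fix: compact etcd's key history, defragment the database to release the freed space, then clear the alarm.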

# Use the v3 API
export ETCDCTL_API=3
# List active alarms
etcdctl --endpoints=http://127.0.0.1:2379 alarm list
# Example output:
# memberID:10276657743932975437 alarm:NOSPACE

# Get the current revision
rev=$(etcdctl --endpoints=http://127.0.0.1:2379 endpoint status --write-out="json" | grep -Eo '"revision":[0-9]*' | grep -Eo '[0-9].*')
# Compact away all older revisions
etcdctl --endpoints=http://127.0.0.1:2379 compact "$rev"
# Defragment to release the freed space
etcdctl --endpoints=http://127.0.0.1:2379 defrag
# Clear the alarm
etcdctl --endpoints=http://127.0.0.1:2379 alarm disarm
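
The same steps can be wrapped in a function that stops at the first failure, so the alarm is never disarmed after a failed compaction or defragmentation. This is a sketch rather than a hardened tool: `reclaim_space`, `extract_revision`, and the `ENDPOINT` and `ETCDCTL` variables are hypothetical names, and the endpoint defaults to the one used above.

```shell
#!/bin/sh
# Endpoint and binary default to the values used in this article; both can be overridden.
ENDPOINT=${ENDPOINT:-http://127.0.0.1:2379}
ETCDCTL=${ETCDCTL:-etcdctl}
export ETCDCTL_API=3

# Pull the first "revision" number out of the endpoint status JSON.
extract_revision() {
    printf '%s\n' "$1" | grep -Eo '"revision":[0-9]+' | head -n 1 | grep -Eo '[0-9]+'
}

# Compact, defragment, then disarm; return non-zero at the first failing step.
reclaim_space() {
    status_json=$("$ETCDCTL" --endpoints="$ENDPOINT" endpoint status --write-out=json) || return 1
    rev=$(extract_revision "$status_json")
    [ -n "$rev" ] || { echo "could not determine current revision" >&2; return 1; }
    "$ETCDCTL" --endpoints="$ENDPOINT" compact "$rev" || return 1
    "$ETCDCTL" --endpoints="$ENDPOINT" defrag || return 1
    "$ETCDCTL" --endpoints="$ENDPOINT" alarm disarm
}

# Call it explicitly once the functions are loaded:
# reclaim_space
```

Nothing runs on load, so the file can be sourced and inspected before `reclaim_space` is invoked inside the Rancher container.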

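Once compaction and defragmentation finish, run etcdctl endpoint status --write-out table again to confirm that the database size has dropped.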

Note: compacting etcd also reduces its memory footprint and improves performance, so it is worth compacting proactively rather than waiting for a failure.
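
One way to compact proactively is etcd's built-in auto-compaction. The settings below are a configuration sketch with illustrative values that do not come from this incident. How to pass them to the etcd instance that Rancher runs depends on the installation and is not covered here. Auto-compaction only trims key history; defrag is still needed to return space to the filesystem.

```shell
# etcd accepts these environment variables as equivalents of its command-line flags.
# All values below are illustrative assumptions, not recommendations.
export ETCD_AUTO_COMPACTION_MODE=periodic     # same as --auto-compaction-mode
export ETCD_AUTO_COMPACTION_RETENTION=1h      # same as --auto-compaction-retention
export ETCD_QUOTA_BACKEND_BYTES=4294967296    # same as --quota-backend-bytes (4 GiB here)
```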

References

Notice: 初心 | All rights reserved | Original content unless otherwise noted | This site is licensed under BY-NC-SA
