[ETCD] etcd Cluster Recovery Procedures
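After a server power loss and reboot, one node of the etcd cluster came up in an abnormal state because its member ID could not be parsed. This document describes two recovery options for rebuilding the cluster quickly.
- In production, take regular backups anyway: they reduce data loss and make recovery faster.
- For etcd backup and restore, see the blog post: 【Kubernetes】K8s 之 ETCD - 恢复备份_etcd恢复-CSDN博客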
1. Symptoms
1.1 Abnormal cluster state
kubectl exec -n infras api-etcd-0 -- etcdctl member list
# Output:
cddcbff3a9557fa, unstarted, , http://api-etcd-1.api-etcd-headless.infras.svc.cluster.local:2380, , false
195ef927ecf53854, started, api-etcd-0, http://api-etcd-0.api-etcd-headless.infras.svc.cluster.local:2380, http://api-etcd-0.api-etcd-headless.infras.svc.cluster.local:2379,http://api-etcd.infras.svc.cluster.local:2379, false
270884f43442e3f1, started, api-etcd-2, http://api-etcd-2.api-etcd-headless.infras.svc.cluster.local:2380, http://api-etcd-2.api-etcd-headless.infras.svc.cluster.local:2379,http://api-etcd.infras.svc.cluster.local:2379, false
# Symptom: the entry for api-etcd-1 is incomplete (no member name, no client URLs) and its status is unstarted
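The ID of the broken member is needed later for `etcdctl member remove`, so it can help to pull it out of the listing mechanically instead of copying it by hand. The following is a minimal sketch that assumes the default comma-separated output format shown above; the function name `unstarted_member_ids` is made up for this example.

```shell
# Print the IDs of all members whose status column is "unstarted".
# Reads the default (comma-separated) output of `etcdctl member list` on stdin.
# Fields are separated by ", " (comma plus space); the client URL list uses
# bare commas, so it stays inside a single field.
unstarted_member_ids() {
  awk -F', ' '$2 == "unstarted" { print $1 }'
}
```

Possible usage: `kubectl exec -n infras api-etcd-0 -- etcdctl member list | unstarted_member_ids`. With the listing above this would print `cddcbff3a9557fa`.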
1.2 Pod log errors
kubectl logs -n infras api-etcd-1
# Key error messages:
# etcd 09:01:54.15 INFO ==> Updating member in existing cluster
# Error: bad member ID arg (strconv.ParseUint: parsing "": invalid syntax), expecting ID in Hex
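2. Root cause
- When the failed node (api-etcd-1) tries to rejoin the cluster, member ID parsing fails. The log shows an empty string (parsing "") being passed where a hexadecimal member ID is expected during the "Updating member in existing cluster" step, which suggests the node no longer had its own member ID available after the restart.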
3. Solutions
3.1 Option 1
- Re-add the node to the cluster
# Step 1: remove the abnormal member
kubectl exec -n infras api-etcd-0 -- etcdctl member remove <abnormal-member-ID>
# Step 2: add the member again
kubectl exec -n infras api-etcd-0 -- etcdctl member add api-etcd-1 \
--peer-urls=http://api-etcd-1.api-etcd-headless.infras.svc.cluster.local:2380
# Example output ----------------
# Added member named api-etcd-1 with ID xxxxxx to cluster
# ETCD_NAME="api-etcd-1"
# ETCD_INITIAL_CLUSTER="api-etcd-1=http://api-etcd-1.api-etcd-headless.infras.svc.cluster.local:2380,api-etcd-0=http://api-etcd-0.api-etcd-headless.infras.svc.cluster.local:2380,api-etcd-2=http://api-etcd-2.api-etcd-headless.infras.svc.cluster.local:2380"
# ETCD_INITIAL_CLUSTER_STATE="existing"
# ------------------------
# Step 3: restart the Pod
kubectl delete pod -n infras api-etcd-1 --force --grace-period=0
# Step 4: verify the cluster state
kubectl exec -n infras api-etcd-0 -- etcdctl member list -w table
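Step 4 can also be checked in a loop rather than by eye. The sketch below polls the member list until every member reports `started`. The helper names, the attempt count, and the delay are illustrative assumptions, and the parsing relies on the default comma-separated output of `etcdctl member list` (not the `-w table` format).

```shell
# Succeeds only if stdin holds at least one member line and every
# member's status column is "started".
all_members_started() {
  awk -F', ' '
    NF > 1 { total++; if ($2 != "started") pending++ }
    END    { exit !(total > 0 && pending == 0) }
  '
}

# Poll the member list until all members are started.
# $1 = maximum attempts (default 30), $2 = delay in seconds (default 5).
wait_for_members() {
  attempts="${1:-30}"
  delay="${2:-5}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if kubectl exec -n infras api-etcd-0 -- etcdctl member list | all_members_started; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

For example, `wait_for_members 60 5 && echo "cluster recovered"` would wait up to about five minutes after the Pod restart.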
3.2 Option 2
- Single-node recovery
# Back up the data
## Method 1: back up with etcdctl
kubectl exec -n infras api-etcd-0 -- etcdctl snapshot save /bitnami/etcd/snapshot.db
## Method 2: back up the persistent volume's data directory directly
kubectl exec -n infras api-etcd-0 -- tar -czf /tmp/etcd-backup.tar.gz /bitnami/etcd/data
# Switch the StatefulSet to single-node mode
# ---------------
env:
  - name: ETCD_INITIAL_CLUSTER_STATE
    value: "existing"
  - name: ETCD_INITIAL_CLUSTER
    value: "" # leave empty; detected automatically by the cluster
  - name: ETCD_FORCE_NEW_CLUSTER
    value: "true" # force startup as a single-node cluster
  - name: ETCD_DISASTER_RECOVERY
    value: "yes" # enable disaster recovery mode
# -----------------
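Instead of editing the manifest by hand, the two recovery flags could be toggled from the command line. This is only a sketch: it assumes the variables are set directly on the StatefulSet named `api-etcd` (not injected by a Helm values file that would overwrite them later), and the `set_recovery_mode` helper and the `KUBECTL` override are hypothetical names introduced here. `ETCD_INITIAL_CLUSTER_STATE` and `ETCD_INITIAL_CLUSTER` are left untouched because they have the same values in both modes.

```shell
# KUBECTL is an assumption of this sketch: set KUBECTL=echo for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Toggle the recovery-related environment variables on the StatefulSet.
# $1 = "on"  -> ETCD_FORCE_NEW_CLUSTER=true,  ETCD_DISASTER_RECOVERY=yes
# $1 = "off" -> ETCD_FORCE_NEW_CLUSTER=false, ETCD_DISASTER_RECOVERY=no
set_recovery_mode() {
  case "$1" in
    on)  force=true;  dr=yes ;;
    off) force=false; dr=no ;;
    *)   echo "usage: set_recovery_mode on|off" >&2; return 2 ;;
  esac
  $KUBECTL set env statefulset/api-etcd -n infras \
    ETCD_FORCE_NEW_CLUSTER="$force" ETCD_DISASTER_RECOVERY="$dr"
}
```

`set_recovery_mode on` corresponds to the single-node configuration, and `set_recovery_mode off` to the configuration used when the multi-node cluster is rebuilt.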
# Start etcd as a single node
# Scale down to 0
kubectl scale statefulset -n infras api-etcd --replicas=0
# Scale up to 1
kubectl scale statefulset -n infras api-etcd --replicas=1
# Verify that there is only one member
kubectl exec -n infras api-etcd-0 -- etcdctl member list
# Wipe the data of the other nodes
# Scale down to 0 again
kubectl scale sts api-etcd -n infras --replicas=0
# Manually clear the persistent volume data of api-etcd-1 and api-etcd-2
# Rebuild the multi-node cluster
# Restore the StatefulSet configuration
# -----------------
env:
  - name: ETCD_INITIAL_CLUSTER_STATE
    value: "existing"
  - name: ETCD_INITIAL_CLUSTER
    value: "" # leave empty
  - name: ETCD_FORCE_NEW_CLUSTER
    value: "false" # turn off force-new-cluster
  - name: ETCD_DISASTER_RECOVERY
    value: "no" # turn off disaster recovery mode
# -----------------
# Start api-etcd-0
kubectl scale sts api-etcd -n infras --replicas=1
# Once etcd-0 is running normally, add etcd-1
kubectl exec -n infras api-etcd-0 -- \
etcdctl member add api-etcd-1 \
--peer-urls=http://api-etcd-1.api-etcd-headless.infras.svc.cluster.local:2380
# Verify the member list
kubectl exec -n infras api-etcd-0 -- etcdctl member list
# Scale up to 2 replicas
kubectl scale sts api-etcd -n infras --replicas=2
# Once etcd-1 is running normally, add etcd-2
kubectl exec -n infras api-etcd-0 -- \
etcdctl member add api-etcd-2 \
--peer-urls=http://api-etcd-2.api-etcd-headless.infras.svc.cluster.local:2380
# Scale up to 3 replicas
kubectl scale sts api-etcd -n infras --replicas=3
# Note: always follow the order "add the member first → then scale up the replicas"
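The "add member first, then scale" rule can be encoded in a small helper so the two commands are never run in the wrong order. This is a sketch under the naming used in this document (namespace `infras`, StatefulSet `api-etcd`, headless service `api-etcd-headless`); the function `add_then_scale` and the `KUBECTL` override are hypothetical and exist only for this example.

```shell
# KUBECTL is an assumption of this sketch: set KUBECTL=echo for a dry run.
KUBECTL="${KUBECTL:-kubectl}"
NS=infras
STS=api-etcd

# Register member <index> with the running cluster first, then raise the
# replica count to index+1 so the new Pod starts as an already-known member.
# $1 = ordinal of the Pod to add (1 for api-etcd-1, 2 for api-etcd-2).
add_then_scale() {
  idx="$1"
  name="${STS}-${idx}"
  peer="http://${name}.${STS}-headless.${NS}.svc.cluster.local:2380"
  $KUBECTL exec -n "$NS" "${STS}-0" -- etcdctl member add "$name" --peer-urls="$peer" || return 1
  $KUBECTL scale sts "$STS" -n "$NS" --replicas="$((idx + 1))"
}
```

Running `add_then_scale 1`, waiting until api-etcd-1 is healthy, and then running `add_then_scale 2` mirrors the manual sequence above. The helper deliberately does not wait between the two members; that check stays with the operator.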
4. Common commands
# Check cluster state
## List members (table format)
kubectl exec -n infras api-etcd-0 -- etcdctl member list -w table
## Show endpoint status for the whole cluster (including the leader)
kubectl exec -n infras api-etcd-0 -- etcdctl endpoint status --cluster -w table
## Check node health
kubectl exec -n infras api-etcd-0 -- etcdctl endpoint health
# Member management
## Remove a member
kubectl exec -n infras api-etcd-0 -- etcdctl member remove <member-ID>
## Add a member
kubectl exec -n infras api-etcd-0 -- etcdctl member add <member-name> \
--peer-urls=<peer-url>
## Transfer leadership
kubectl exec -n infras api-etcd-0 -- etcdctl move-leader <target-member-ID>
# Data management
## Create a snapshot backup
kubectl exec -n infras api-etcd-0 -- etcdctl snapshot save /bitnami/etcd/snapshot.db
## Restore from a snapshot (without --data-dir, the data is restored into ./default.etcd in the working directory)
kubectl exec -n infras api-etcd-0 -- etcdctl snapshot restore /bitnami/etcd/snapshot.db
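For the regular backups recommended at the top of this document, a fixed file name like `snapshot.db` gets overwritten on every run. The sketch below writes timestamped snapshots and then prints each snapshot's metadata as a quick sanity check. The helper names and the `KUBECTL` override are assumptions of this example. `etcdctl snapshot status` is available in etcd 3.x, although newer releases point to `etcdutl` for this task, so check which tool your image ships.

```shell
# KUBECTL is an assumption of this sketch: set KUBECTL=echo for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Build a timestamped snapshot path under /bitnami/etcd (UTC, sortable).
snapshot_path() {
  printf '/bitnami/etcd/snapshot-%s.db\n' "$(date -u +%Y%m%dT%H%M%SZ)"
}

# Save a snapshot, then print its hash, revision, key count, and size.
backup_etcd() {
  path="$(snapshot_path)"
  $KUBECTL exec -n infras api-etcd-0 -- etcdctl snapshot save "$path" || return 1
  $KUBECTL exec -n infras api-etcd-0 -- etcdctl snapshot status "$path" -w table
}
```

The snapshot still lives on the Pod's own volume, so for a real backup it would have to be copied somewhere else afterwards, for example with `kubectl cp`.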