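After a server power loss and reboot, one node of the etcd cluster came up in an abnormal state because its member ID could not be parsed. This document describes two recovery procedures for rebuilding the cluster quickly.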

1. Symptoms

1.1. Abnormal cluster state

kubectl exec -n infras api-etcd-0 -- etcdctl member list

# Output:
cddcbff3a9557fa, unstarted, , http://api-etcd-1.api-etcd-headless.infras.svc.cluster.local:2380, , false
195ef927ecf53854, started, api-etcd-0, http://api-etcd-0.api-etcd-headless.infras.svc.cluster.local:2380, http://api-etcd-0.api-etcd-headless.infras.svc.cluster.local:2379,http://api-etcd.infras.svc.cluster.local:2379, false
270884f43442e3f1, started, api-etcd-2, http://api-etcd-2.api-etcd-headless.infras.svc.cluster.local:2380, http://api-etcd-2.api-etcd-headless.infras.svc.cluster.local:2379,http://api-etcd.infras.svc.cluster.local:2379, false

# Symptom: the api-etcd-1 entry is incomplete; it has no member name or client URLs, and its status is unstarted
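
To pick out the faulty member without reading the list by eye, the output above can be filtered on its status column. The following is a minimal sketch, assuming the default comma-separated output of `etcdctl member list` as shown above; the function name `unstarted_member_ids` is hypothetical and only used for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the ID of every member whose status is "unstarted".
# Reads the default (comma-separated) output of `etcdctl member list` on stdin.
# Fields: ID, status, name, peer URLs, client URLs, is-learner.
unstarted_member_ids() {
  awk -F',' '{
    gsub(/^[ \t]+|[ \t]+$/, "", $2)      # trim the status field
    if ($2 == "unstarted") {
      gsub(/[ \t]/, "", $1)              # trim the ID field
      print $1
    }
  }'
}
```

Usage: `kubectl exec -n infras api-etcd-0 -- etcdctl member list | unstarted_member_ids`. For the output shown above, only the ID of the api-etcd-1 entry matches.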

1.2. Pod log errors

kubectl logs -n infras api-etcd-1

# Key error messages:
# etcd 09:01:54.15 INFO  ==> Updating member in existing cluster
# Error: bad member ID arg (strconv.ParseUint: parsing "": invalid syntax), expecting ID in Hex

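2. Root cause

  • When the failed node (api-etcd-1) tried to rejoin the cluster, its member ID could not be parsed: the log shows an empty string ("") being passed where etcdctl expects a hexadecimal member ID.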
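
The parse error can be reproduced in miniature: an empty string is not a valid hexadecimal ID. The sketch below shows the kind of guard that would catch this before calling etcdctl. The function `is_member_id` is a hypothetical helper for illustration and is not part of the image's startup script.

```shell
#!/usr/bin/env bash
# Hypothetical guard: succeed only if the argument is a non-empty hex string,
# which is the form etcdctl expects for a member ID.
is_member_id() {
  case "$1" in
    ''|*[!0-9a-fA-F]*) return 1 ;;   # empty, or contains a non-hex character
    *) return 0 ;;
  esac
}
```

For example, `is_member_id "cddcbff3a9557fa"` succeeds, while `is_member_id ""` fails, which corresponds to the `parsing ""` case in the log.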

3. Solutions

3.1. Option 1

  • Rejoin the node
# Step 1: remove the faulty member
kubectl exec -n infras api-etcd-0 -- etcdctl member remove <faulty-member-ID>

# Step 2: add the member back
kubectl exec -n infras api-etcd-0 -- etcdctl member add api-etcd-1 \
  --peer-urls=http://api-etcd-1.api-etcd-headless.infras.svc.cluster.local:2380
# Example output ----------------
# Added member named api-etcd-1 with ID xxxxxx to cluster

# ETCD_NAME="api-etcd-1"
# ETCD_INITIAL_CLUSTER="api-etcd-1=http://api-etcd-1.api-etcd-headless.infras.svc.cluster.local:2380,api-etcd-0=http://api-etcd-0.api-etcd-headless.infras.svc.cluster.local:2380,api-etcd-2=http://api-etcd-2.api-etcd-headless.infras.svc.cluster.local:2380"
# ETCD_INITIAL_CLUSTER_STATE="existing"
# ------------------------

# Step 3: restart the Pod
kubectl delete pod -n infras api-etcd-1 --force --grace-period=0

# Step 4: verify the cluster state
kubectl exec -n infras api-etcd-0 -- etcdctl member list -w table
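
The verification in step 4 can also be scripted. This is a minimal sketch, assuming the default comma-separated output of `etcdctl member list` (not the `-w table` form); the function name `all_members_started` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical check: succeed only if stdin lists exactly $1 members
# and every one of them has status "started".
all_members_started() {
  awk -F',' -v want="$1" '
    NF >= 2 {
      gsub(/^[ \t]+|[ \t]+$/, "", $2)    # trim the status field
      total++
      if ($2 == "started") ok++
    }
    END { exit !((total + 0 == want + 0) && (ok + 0 == want + 0)) }
  '
}
```

Usage: `kubectl exec -n infras api-etcd-0 -- etcdctl member list | all_members_started 3 && echo "all 3 members started"`. Note that "started" only reflects the membership record; `etcdctl endpoint health` is still the check for actual health.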

3.2. Option 2

  • Single-node recovery
# Back up the data
## Method 1: back up with etcdctl
kubectl exec -n infras api-etcd-0 -- etcdctl snapshot save /bitnami/etcd/snapshot.db
## Method 2: back up the persistent volume's data directory directly
kubectl exec -n infras api-etcd-0 -- tar -czf /tmp/etcd-backup.tar.gz /bitnami/etcd/data

# Switch the StatefulSet to single-node mode
# ---------------
env:
  - name: ETCD_INITIAL_CLUSTER_STATE
    value: "existing"
  - name: ETCD_INITIAL_CLUSTER
    value: ""               # left empty so that membership is detected automatically
  - name: ETCD_FORCE_NEW_CLUSTER
    value: "true"           # force startup as a single-node cluster
  - name: ETCD_DISASTER_RECOVERY
    value: "yes"            # enable disaster recovery mode
# -----------------

# Start etcd as a single node
# Scale down to 0
kubectl scale statefulset -n infras api-etcd --replicas=0
# Scale up to 1
kubectl scale statefulset -n infras api-etcd --replicas=1
# Verify that there is only one member
kubectl exec -n infras api-etcd-0 -- etcdctl member list


# Clear the data of the other nodes
# Scale down to 0 again
kubectl scale sts api-etcd -n infras --replicas=0
# Manually clear the persistent volume data of api-etcd-1 and api-etcd-2

# Restore the multi-node cluster
# Restore the StatefulSet configuration
# -----------------
env:
  - name: ETCD_INITIAL_CLUSTER_STATE
    value: "existing"
  - name: ETCD_INITIAL_CLUSTER
    value: ""               # left empty
  - name: ETCD_FORCE_NEW_CLUSTER
    value: "false"          # turn off forced new cluster
  - name: ETCD_DISASTER_RECOVERY
    value: "no"             # turn off disaster recovery mode
# -----------------

# Start api-etcd-0
kubectl scale sts api-etcd -n infras --replicas=1
# Once etcd-0 is running normally, add etcd-1
kubectl exec -n infras api-etcd-0 -- \
  etcdctl member add api-etcd-1 \
  --peer-urls=http://api-etcd-1.api-etcd-headless.infras.svc.cluster.local:2380
# Verify the member list
kubectl exec -n infras api-etcd-0 -- etcdctl member list
# Scale to 2 replicas
kubectl scale sts api-etcd -n infras --replicas=2

# Once etcd-1 is running normally, add etcd-2
kubectl exec -n infras api-etcd-0 -- \
  etcdctl member add api-etcd-2 \
  --peer-urls=http://api-etcd-2.api-etcd-headless.infras.svc.cluster.local:2380
# Scale to 3 replicas
kubectl scale sts api-etcd -n infras --replicas=3

# Note: always follow the order "add the member first, then scale up the replicas", so that each new Pod is already registered as a member when it starts
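
The "add the member, then scale" order above can be captured in a small wrapper so the two commands are never issued out of sequence. This is a sketch only: the helpers `run` and `rejoin_member` and the `DRY_RUN` switch are hypothetical, and the namespace, StatefulSet name, and headless service name are taken from the commands above.

```shell
#!/usr/bin/env bash
# Names taken from the commands in this document.
NS=infras
STS=api-etcd
HEADLESS="api-etcd-headless.${NS}.svc.cluster.local"

# run: execute a command, or only print it when DRY_RUN=1 (hypothetical switch).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# rejoin_member N: register api-etcd-N as a member first,
# then scale the StatefulSet to N+1 replicas so that its Pod starts.
rejoin_member() {
  idx="$1"
  run kubectl exec -n "$NS" "${STS}-0" -- etcdctl member add "${STS}-${idx}" \
    "--peer-urls=http://${STS}-${idx}.${HEADLESS}:2380"
  run kubectl scale sts "$STS" -n "$NS" "--replicas=$((idx + 1))"
}
```

Running `DRY_RUN=1 rejoin_member 1` prints the two commands without executing them, which is a cheap way to review them first. The wrapper does not wait for the new member to become healthy; that check still has to happen before calling `rejoin_member 2`.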

4. Common commands

# Check cluster status

## List members (table format)
kubectl exec -n infras api-etcd-0 -- etcdctl member list -w table
## Show endpoint status for the whole cluster (including leader information)
kubectl exec -n infras api-etcd-0 -- etcdctl endpoint status --cluster -w table
## Check node health
kubectl exec -n infras api-etcd-0 -- etcdctl endpoint health


# Member management

## Remove a member
kubectl exec -n infras api-etcd-0 -- etcdctl member remove <member-ID>
## Add a member
kubectl exec -n infras api-etcd-0 -- etcdctl member add <member-name> \
  --peer-urls=<peer-url>
## Transfer leadership
kubectl exec -n infras api-etcd-0 -- etcdctl move-leader <target-member-ID>


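# Data management

## Create a snapshot backup
kubectl exec -n infras api-etcd-0 -- etcdctl snapshot save /bitnami/etcd/snapshot.db
## Restore from a snapshot (writes a new data directory; use --data-dir to choose its location)
kubectl exec -n infras api-etcd-0 -- etcdctl snapshot restore /bitnami/etcd/snapshot.db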