PG流复制集群-CLup一键创建

一、简介和工作原理

Keepalived 是一种利用 VRRP 协议实现服务器集群高可用的软件,通过实时监测节点状态并在故障发生时自动进行流量切换,确保服务始终稳定运行,避免因单点故障导致业务中断。

1. VRRP协议

想象一个场景,有两台路由器像“双胞胎值班员”一样提供相同服务,一台是“主值班员”,另一台是“候补值班员”。它们共用一个“虚拟电话号码”(VIP),客户只知道拨打这个号码。 正常工作时:“主值班员”负责接听电话(持有VIP),同时每隔一段时间就通过内部广播(地址224.0.0.18)向“候补值班员”喊一句“我还活着”。 故障发生时:如果“候补值班员”长时间没听到“主值班员”的喊声,就会认为对方“掉线了”。 自动切换:此时,“候补值班员”会根据预设的优先级顺序自动晋升为“新主值班员”,立刻接管虚拟电话号码,让客户的所有呼叫无缝转移到自己身上。整个过程中,客户完全不知情,服务就像从未中断过一样,这就是Keepalived通过VRRP协议实现的高可用“隐形交接术”。

2. Keepalived的三大模块

Keepalived主要由三个模块构成:

  1. Core模块:作为核心,负责主进程的启动、维护及全局配置文件的加载和解析。
  2. Check模块:负责健康检查,支持对负载均衡器后端的真实服务器进行TCP检查、HTTP_GET检查等,确保只有健康的服务器才接收流量。
  3. VRRP模块:这是实现VRRP协议的功能模块,负责处理主备节点间的通信和状态切换。

二、安装Keepalived

1、前期准备

安装前我们需要创建数据库Keepalived,并且创建表探测表sr_delay,后续Keepalived探测,刷新sr_delay表的last_alive字段为当前探测时间。这张表用来判断主备延迟情况,数据库故障切换时会查询用到这张表。

create user keepalived password 'keepalived' CONNECTION LIMIT 4 ;
create database keepalived owner keepalived;
\c keepalived keepalived
create table sr_delay(id int4, last_alive timestamp(0) without time zone);

#表sr_delay只允许写入一条记录,并且不允许删除此表数据,通过触发器实现。创建触发器函数,如下所示:

CREATE FUNCTION cannt_delete ()
RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
RAISE EXCEPTION 'Table sr_delay can not delete !';
 END;
$$;

#创建触发器:

CREATE TRIGGER trigger_sr_delay_del
BEFORE DELETE ON sr_delay
FOR EACH ROW EXECUTE PROCEDURE cannt_delete() ;
 
CREATE TRIGGER trigger_sr_delay_tru BEFORE TRUNCATE ON sr_delay FOR STATEMENT EXECUTE PROCEDURE cannt_delete() ;

#插入数据:

INSERT INTO sr_delay VALUES(1,now()) ;

后续Keepalived会每隔指定时间探测PostgreSQL数据库存活,并且以Keepalived用户登录Keepalived数据库刷新这张表, 配置主备库pg_hba.conf,增加如下内容:

echo "
host    keepalived    keepalived        10.197.167.121/20      md5
host    keepalived    keepalived        10.197.167.122/20      md5" >> pg_hba.conf 

pg_hba.conf配置文件一般在PGDATA里面

2、安装keepalived

方法一,yum安装(快速上手)

yum install keepalived -y

会在/etc/keepalived/生成一个keepalived.conf配置文件

安装后,可以使用systemctl命令来管理服务(现在还没编写配置文件启动不起来):

systemctl start keepalived.service    # 启动
systemctl enable keepalived.service   # 设置开机自启
systemctl status keepalived.service   # 查看状态

方法二,源码编译安装

安装依赖包

yum install -y gcc openssl-devel libnl3-devel libnfnetlink-devel net-snmp-devel curl make

下载安装编译

# 进入常用源码目录,下载(请替换为最新稳定版链接)
cd /usr/local/src/
curl -O http://keepalived.org/software/keepalived-2.2.4.tar.gz
# 或使用 wget https://www.keepalived.org/software/keepalived-2.2.4.tar.gz

# 解压并进入目录
tar xvf keepalived-2.2.4.tar.gz
cd keepalived-2.2.4

# 配置、编译并安装
./configure --prefix=/usr/local/keepalived
make && make install

配置系统服务

# 将启动脚本复制到系统目录
cp /usr/local/src/keepalived-2.2.4/keepalived/keepalived.service /usr/lib/systemd/system/
# 重新加载systemd配置
systemctl daemon-reload

三、配置Keepalived实现PG主备高可用

  • 主节点(Master):物理IP为 192.168.18.11
  • 备节点(Backup):物理IP为 192.168.18.12
  • 虚拟IP(VIP)192.168.18.100

主节点配置

keepalived配置文件/etc/keepalived/keepalived.conf

! Configuration File for keepalived
global_defs {
        router_id pg_xc1 #主备机必须不一样
        script_user postgres #如果是涉及到pg数据库的相关操作,则需要选择pg用户
        enable_script_security
}
vrrp_script check_pg_alived {
        script "/etc/keepalived/check_pg.sh" #检查pg数据库状态的脚本,主要靠这个判断是否进行主备切换
        interval 5
        weight 30 #很重要,判断主备切换的权重
        fall 3 # require 3 failures for KO
}
vrrp_instance VI_1 {
        state BACKUP #BACKUP 必须大写
        nopreempt #no preempt为非抢占模式
        interface ens10 #vip使用的网络接口
        virtual_router_id 10 #主备必须一样
        priority 100 #设置节点的优先级,跟weight值组合使用,决定选择哪个为主
        advert_int 1
        authentication {
        auth_type PASS
        auth_pass keepalived
        }
        track_script {
                check_pg_alived
        }
        virtual_ipaddress {
                192.168.18.100 #vip
        }
        notify_master "/etc/keepalived/failover.sh" #当节点成为master时,执行的脚本
        notify_fault "/etc/keepalived/fault.sh" #节点出现故障,执行的脚本
}
~

其中interface的值通过ip a可以查看网络接口是什么

脚本编写

1、脚本要放在/etc/keepalived目录下,否则无法切换,可能是基于安全的原因。

2、脚本的权限很重要,权限太多执行不成功,一般为755

check_pg.sh配置文件/etc/keepalived/check_pg.sh

#!/bin/bash
export PGDATABASE=keepalived
export PGPORT=5432
export PGUSER=keepalived
export PGHOME=/usr/pgsql-16
export PATH=$PGHOME/bin:$PATH:$HOME/bin
export PGMIP=192.168.18.11
LOGFILE=/etc/keepalived/log/pg_status.log
SQL1='SELECT pg_is_in_recovery from pg_is_in_recovery();'
SQL2='update sr_delay set last_alive = now() where id =1;'
SQL3='SELECT 1;'
db_role=`echo $SQL1 | $PGHOME/bin/psql -h $PGMIP -p 5432 -d keepalived -U keepalived -At -w`
if [ $db_role == 't' ];
then
echo -e `date +"%F %T"` "Attention1:the current database is standby DB!" >> $LOGFILE
exit 0
fi
# 备库不检查存活,主库更新状态
echo $SQL3 | $PGHOME/bin/psql -h $PGMIP -p $PGPORT -d $PGDATABASE -U $PGUSER -At -w
if [ $? -eq 0 ] ;
then
echo $SQL2 | $PGHOME/bin/psql -h $PGMIP -p $PGPORT -d $PGDATABASE -U $PGUSER -At -w
echo -e `date +"%F %T"` "Success: update the master sr_delay successed!" >> $LOGFILE
exit 0
else
echo -e `date +"%F %T"` "Error:Is the server is running?" >> $LOGFILE
exit 1
fi

切换脚本failover.sh,主库宕机后,keepalived调用执行切换脚本/etc/keepalived/failover.sh

#!/bin/bash
export PGPORT=5432
export PGUSER=keepalived
export PG_OS_USER=keepalived
export PGDATA=/data/pgdata
export PGDBNAME=keepalived
export LANG=zh_CN.UTF-8
export PGPATH=/usr/pgsql-16/bin
export PATH=$PATH:$PGPATH
export PGMIP=192.168.18.11
LOGFILE=/etc/keepalived/log/failover.log
# master-to-slave delay
sr_allowed_delay_time=10
SQL1='select pg_is_in_recovery from pg_is_in_recovery();'
SQL2="select last_alive as delay_time from sr_delay where now()-last_alive < interval '60';"
db_role=`echo $SQL1 | psql -At -h $PGMIP -p $PGPORT -U $PGUSER -d $PGDBNAME -w`
db_sr_delaytime=`echo $SQL2 | psql -h $PGMIP -p $PGPORT -d $PGDBNAME -U $PGUSER -At -w`
SWITCH_COMMAND='pg_ctl promote -D $PGDATA'
# slave switchover to master if delay large than specical second
if [ $db_role == 'f' ];then
echo -e `date +"%F %T"` "Attention:The current postgreSQL DB is master database,cannot switched!" >> $LOGFILE
exit 0
fi
if [ $db_sr_delaytime -gt 0 ];then
echo -e `date +"%F %T"` "Attention:The current master database is health,the standby DB cannot switched!" >> $LOGFILE
exit 0
fi
echo $db_sr_delaytime
if [ -z $db_sr_delaytime ];then
echo -e `date +"%F %T"` "Attention:The current database is statndby,ready to switch master database!" >> $LOGFILE
pg_ctl promote -D $PGDATA
sed -i 's/primary_conninfo/#primary_conninfo/' $PGDATA/postgresql.auto.conf
pg_ctl restart -D $PGDATA
elif [ $? eq 0 ];then
echo -e `date +"%F %T"` "success:The current standby database successed to switched the primary PG database !" >> $LOGFILE
exit 0
else
echo -e `date +"%F %T"` "Error: the standby database failed to switch the primary PG database !,pelease checked it!" >> $LOGFILE
exit 1
fi

脚本fault.sh,keepalived 进入错误状态时执行的脚本。

#!/bin/bash
LOGGFILE=/etc/keepalived/log/pg_db_fault.log
PGPORT=5432
PGMIP=192.168.18.11
echo -e `date +"%F %T"` "Error:Because of the priamry DB happend some unknown problem,So turn off the PostgreSQL Database!" >> $LOGFILE
export PGPID="`netstat -anp|grep $PGPORT |awk '{printf $7}'|cut -d/ -f1`"
service keepalived stop
kill -9 $PGPID
if [ $? eq 0 ];then
echo -e `date +"%F %T"` "Error:Because of the priamry DB happend some unknown problem,So turn off the PostgreSQL Database!" >> $LOGFILE
service keepalived stop
exit 1
fi

备节点配置

备库配置:

1、把master的keepalived.conf文件和sh脚本文件复制到备机。

2、修改keepalived.conf的router_id为pg_xc2,priority为90即可。

四、keepakived启动及状态查看

1、MASTER启动keepalived 服务器

# service keepalived start

2、MASTER查看keepalived 状态

# systemctl status keepalived

[root@clup-host-2422 keepalived]# systemctl status keepalived.service 
● keepalived.service - LVS and VRRP High Availability Monitor
   Loaded: loaded (/usr/lib/systemd/system/keepalived.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2026-05-26 14:47:43 CST; 22min ago
  Process: 64028 ExecStart=/usr/sbin/keepalived $KEEPALIVED_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 64029 (keepalived)
    Tasks: 2 (limit: 11145)
   Memory: 1.6M
   CGroup: /system.slice/keepalived.service
           ├─64029 /usr/sbin/keepalived -D
           └─64030 /usr/sbin/keepalived -D

3、查看后台日志

# tail -f /var/log/messages

4、查看虚拟ip(ifconfig看不到虚拟IP):

# ip a

[postgres@clup-host-2423 pgdata]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:c5:a7:7a brd ff:ff:ff:ff:ff:ff
    altname enp0s10
    inet 192.168.18.11/20 brd 10.197.175.255 scope global dynamic noprefixroute ens10
       valid_lft 3333sec preferred_lft 3333sec
    inet 192.168.18.100/32 scope global ens10
       valid_lft forever preferred_lft forever

五、主备切换

进入主库把主库停了

pg_ctl stop -D $PGDATA

查看keepalived日志tail -f /var/log/messages

May 26 15:28:49 clup-host-2422 Keepalived[71251]: Starting VRRP child process, pid=71252
May 26 15:28:49 clup-host-2422 systemd[1]: Started LVS and VRRP High Availability Monitor.
May 26 15:28:49 clup-host-2422 Keepalived_vrrp[71252]: Registering Kernel netlink reflector
May 26 15:28:49 clup-host-2422 Keepalived_vrrp[71252]: Registering Kernel netlink command channel
May 26 15:28:49 clup-host-2422 Keepalived_vrrp[71252]: Opening file '/etc/keepalived/keepalived.conf'.
May 26 15:28:49 clup-host-2422 Keepalived_vrrp[71252]: (/etc/keepalived/keepalived.conf: Line 22) Truncating auth_pass to 8 characters
May 26 15:28:49 clup-host-2422 Keepalived_vrrp[71252]: Assigned address 192.168.18.11 for interface ens10
May 26 15:28:49 clup-host-2422 Keepalived_vrrp[71252]: Registering gratuitous ARP shared channel
May 26 15:28:49 clup-host-2422 Keepalived_vrrp[71252]: (VI_1) removing VIPs.
May 26 15:28:49 clup-host-2422 Keepalived_vrrp[71252]: (VI_1) Entering BACKUP STATE (init)
May 26 15:28:49 clup-host-2422 Keepalived_vrrp[71252]: VRRP sockpool: [ifindex(  2), family(IPv4), proto(112), fd(11,12)]
May 26 15:28:49 clup-host-2422 Keepalived_vrrp[71252]: VRRP_Script(check_pg_alived) succeeded
May 26 15:28:49 clup-host-2422 Keepalived_vrrp[71252]: (VI_1) Changing effective priority from 70 to 100
May 26 15:29:44 clup-host-2422 su[71403]: (to postgres) root on pts/1
May 26 15:29:59 clup-host-2422 Keepalived_vrrp[71252]: Script `check_pg_alived` now returning 1
May 26 15:30:05 clup-host-2422 systemd[1]: Starting system activity accounting tool...
May 26 15:30:05 clup-host-2422 systemd[1]: sysstat-collect.service: Succeeded.
May 26 15:30:05 clup-host-2422 systemd[1]: Started system activity accounting tool.
May 26 15:30:09 clup-host-2422 Keepalived_vrrp[71252]: VRRP_Script(check_pg_alived) failed (exited with status 1)
May 26 15:30:09 clup-host-2422 Keepalived_vrrp[71252]: (VI_1) Changing effective priority from 100 to 70

VRRP_Script(check_pg_alived) failed (exited with status 1)

可以看到检测到pg未存活,然后优先权-权重变低小于备库的优先权,VIP漂移到备库,然后我们在主备库查看一下vip在哪

# 主库
[postgres@clup-host-2422 ~]$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:c5:a7:79 brd ff:ff:ff:ff:ff:ff
    altname enp0s10
    inet 192.168.18.11/20 brd 10.197.175.255 scope global dynamic noprefixroute ens10
       valid_lft 1776sec preferred_lft 1776sec

# 备库
[root@clup-host-2423 keepalived]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:c5:a7:7a brd ff:ff:ff:ff:ff:ff
    altname enp0s10
    inet 192.168.18.12/20 brd 10.197.175.255 scope global dynamic noprefixroute ens10
       valid_lft 1939sec preferred_lft 1939sec
    inet 192.168.18.100/32 scope global ens10
       valid_lft forever preferred_lft forever

可以看到此时第三方主机通过vip连接到的是新的主库数据库(说明vip和备库切换到主库成功)

备库切换到主库,需要额外操作完成所有切换(或者在切换脚本实现)可以在pg配置文件修改实现

也可以在切换脚本中直接实现切换,更新中ing....

问题

问题一、

因为我设置的初始优先权是100,权重是30,在我关闭keepalived的时候给我加上权重算为130的优先权(100+20【(VI_1) Changing effective priority from 100 to 130】),启动后看到是130-30变成100的优先权,还是比备库高,当时我的备库的优先权是90,导致漂移失败了,后面把优先权变成70就成功了

问题二、执行脚本出错

这个脚本使用psql -At -h localhost -p $PGPORT -U $PGUSER -d $PGDBNAME -w

因为这个-w禁止密码提醒,在设.pgpass的时候放在的目录是postgres家目录下,启动keepalived的时候没有成功用到.pgpass,发现是要用到root下,所以把文件复制到/root

.pgpass文件

localhost:5432:keepalived:keepalived:keepalived
192.168.18.11:5432:keepalived:keepalived:keepalived
192.168.18.12:5432:keepalived:keepalived:keepalived

Keepalived主备竞选规则

Keepalived的主备角色是可以改变的,其改变策略是:

1、如果weight值设置为正整数,当主库的vrrp_script脚本执行成功,其权值为priority+weight,而备库的权值为priority+weight的值。那么此时主库的和大于备库的和。

2、如果weight值设置为正整数,当主库的vrrp_script脚本执行失败,其权值为priority设置的值;而备库的权值为priority+weight,如果其和大于主库的priority,则备库会变成主库。

注意weight设置的值必须要大于主备priority之间的差。比如:主库priority设为100,备库priority设为80,weight则要设置大于100-80=20的差。

通过观察/var/log/messages可以看到权值的变化。

后续更新自动切换脚本

CLup简介https://www.csudata.com/clup/manual

Logo

openEuler 是由开放原子开源基金会孵化的全场景开源操作系统项目,面向数字基础设施四大核心场景(服务器、云计算、边缘计算、嵌入式),全面支持 ARM、x86、RISC-V、loongArch、PowerPC、SW-64 等多样性计算架构

更多推荐