虽然PostgreSQL没有明确指出 primary 和 standby 节点的角色可以相互切换。(即standby promote成primary ,primary demote成standby .不需要重建集群)
但是经过实际测试,是可以实现的。
以下测试数据库版本9.0.2 . 低版本可能有一定差异。(比如有一个版本改进了standby shutdown后自动获取checkpoint的功能,最初的9.0是没有的,没有这个功能重启后需要用到一些已经APPLY的WAL,带来比较麻烦的问题。有兴趣的朋友可以参考我以前写的BLOG)
首先来看几个和XLOG相关的东西,要确保安全的SWITCHOVER,必须确保一下提及的数据在primary和standby切换前是一致的。
1. primary 和 standby 关闭后使用pg_controldata输出的checkpoint location (-m immediate关闭的话这个值不可信,因此必须确保都是-m fast 或者 smart模式关闭的,fast和smart关闭都是干净的关闭模式,起来时不需要recovery)
具体的操作步骤:
1. on primary
pg_ctl stop -m fast $PGDATA
确认返回成功后
查看pg_controldata输出
Latest checkpoint location: 1B/6C000020
Latest checkpoint's REDO location: 1B/6C000020
2. on standby
pg_ctl stop -m fast $PGDATA
确认返回成功后
查看pg_controldata输出
Latest checkpoint location: 1B/6C000020
Latest checkpoint's REDO location: 1B/6C000020
3. 确保standby和primary 的pg_controldata checkpoint的输出结果一致。就可以切换standby 和 primary了。
4. 启动standby,并且touch一个在recovery.conf中的trigger文件,使standby promote.为primary库
5. 修改原primary库的配置文件 并新建recovery.conf,需要注意的是新增一条
recovery_target_timeline = 'latest' ,否则会报类似如下错误timeline 6 of the primary does not match recovery target timeline 5。(其他配置略,清参考我以前写的BLOG。)
6. 启动原primary , 此时切换完成,primary 变standby ,standby 变primary ,而且可以再次切换。
切换完成后查看pg_controldata的输出,TimeLineID 将变大。
TimeLineID递增后,在新的PRIMARY 库会出现类似如下进程
postgres: archiver process last was 00000005.history
而TimeLineID可能就等于5.
与此同时,新的standby TimeLineID还是老的,因为还没有更新过来。
要更新过来的话,要么等CHECKPOINT,要么关闭主节点,然后再关闭副节点。起来之后就同步了。
下面来分析一下,一个archive和stream都打开的主副数据库集群.
1. 主节点在干净的关闭数据库时干了些啥?
对于一个STREAM复制的环境,主节点上面有一个WAL_SENDER进程,干净关闭数据库时,postmaster对这个进程的处理方法和普通的SERVER PROCESS不一样,数据库关闭时首先发信号给server process关闭,然后写shutdown checkpoint ,(有兴趣的朋友可以看一看CHECKPOINT的源码,CHECKPOINT分为好几种),然后postmaster告知wal_sender shutdown_checkpoint,并且把截止这个CHECKPOINT的XLOG发送给standby上面的wal_receiver进程。
从而主节点干净的关闭数据库时,standby节点可以接收到shutdown_checkpoint,达到一致的目的。
原文引用:
Therefore postmaster treats walsenders like the pgarch process,
and instructs them to terminate at PM_SHUTDOWN_2 phase, after all regular
backends have died and bgwriter has written the shutdown checkpoint.
2. 副节点在干净的关闭数据库时干了些啥?
standby节点干净的关闭数据库,直接关闭,目前的版本不尝试且不等待正在从archive ,pg_xlog,或wal_sender恢复完成。但是已经接收到的CHECKPOINT将被写入持久化存储,因此关闭后pg_controldata与主节点一致.
干净的关闭后的primary 和standby 库的pg_controldata输入示例 :
primary :
postgres@digoal-172-16-3-39-> pg_controldata
pg_control version number: 903
Catalog version number: 201008051
Database system identifier: 5556352067747738614
Database cluster state: shut down
pg_control last modified: Wed 11 May 2011 08:52:38 PM CST
Latest checkpoint location: 1B/70000020
Prior checkpoint location: 1B/6C000078
Latest checkpoint's REDO location: 1B/70000020
Latest checkpoint's TimeLineID: 5
Latest checkpoint's NextXID: 0/4369
Latest checkpoint's NextOID: 2074455
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 655
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 0
Time of latest checkpoint: Wed 11 May 2011 08:52:38 PM CST
Minimum recovery ending location: 0/0
Backup start location: 0/0
Current wal_level setting: hot_standby
Current max_connections setting: 2000
Current max_prepared_xacts setting: 50
Current max_locks_per_xact setting: 64
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 1048576
WAL block size: 8192
Bytes per WAL segment: 67108864
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
standby :
postgres@digoal-172-16-3-33-> pg_controldata
pg_control version number: 903
Catalog version number: 201008051
Database system identifier: 5556352067747738614
Database cluster state: shut down in recovery
pg_control last modified: Wed 11 May 2011 08:52:52 PM CST
Latest checkpoint location: 1B/70000020
Prior checkpoint location: 1B/6C000020
Latest checkpoint's REDO location: 1B/70000020
Latest checkpoint's TimeLineID: 5
Latest checkpoint's NextXID: 0/4369
Latest checkpoint's NextOID: 2074455
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 655
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 0
Time of latest checkpoint: Wed 11 May 2011 08:52:38 PM CST
Minimum recovery ending location: 1B/6C000020
Backup start location: 0/0
Current wal_level setting: hot_standby
Current max_connections setting: 2000
Current max_prepared_xacts setting: 50
Current max_locks_per_xact setting: 64
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 1048576
WAL block size: 8192
Bytes per WAL segment: 67108864
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
对于一个因异常而FAILOVER的库,因为failover时不能确保primary 和 standby 的checkpoint一致,所以,在primary数据库服务器修复后,不能直接转换为standby,而需要一个checkpoint小于standby切换为primary时的standby库的checkpoint的基础备份,并且需要足够的archive log来做apply,
因此failover后主备集群的重新建立的速度或简易程度,大部分决定于平时对standby和archive log的备份。
# 补充
# 2011-06-09 今天对一个生产系统做了切换,版本是9.0.3
# 切换后,起主节点的时候一直处于等待状态
postgres: startup process waiting for 00000001000001200000001C
# 日志里面报错如下:
cp: cannot stat `/opt/pg_arch/00000002.history': No such file or directory
# 00000002.history这个文件来自于新的primary节点pg_xlog目录,拷贝到/opt/pg_arch/00000002.history,另外
拷贝00000001000001200000001C到/opt/pg_arch/.
# 拷贝完后重启老的primary节点.
2011-06-09 14:31:57.095 CST,,,1357,,4df068dd.54d,1,,2011-06-09 14:31:57 CST,,0,LOG,00000,"database system was shut down in recovery at 2011-06-09 14:31:53 CST",,,,,,,,,""
2011-06-09 14:31:57.099 CST,,,1357,,4df068dd.54d,2,,2011-06-09 14:31:57 CST,,0,LOG,00000,"restored log file ""00000002.history"" from archive",,,,,,,,,""
2011-06-09 14:31:57.105 CST,,,1357,,4df068dd.54d,3,,2011-06-09 14:31:57 CST,,0,LOG,00000,"restored log file ""00000002.history"" from archive",,,,,,,,,""
2011-06-09 14:31:57.105 CST,,,1357,,4df068dd.54d,4,,2011-06-09 14:31:57 CST,,0,LOG,00000,"entering standby mode",,,,,,,,,""
2011-06-09 14:31:57.275 CST,,,1357,,4df068dd.54d,5,,2011-06-09 14:31:57 CST,,0,LOG,00000,"restored log file ""00000001000001200000001C"" from archive",,,,,,,,,""
2011-06-09 14:31:58.120 CST,,,1357,,4df068dd.54d,6,,2011-06-09 14:31:57 CST,1/0,0,LOG,00000,"consistent recovery state reached at 120/70000078",,,,,,,,,""
2011-06-09 14:31:58.120 CST,,,1357,,4df068dd.54d,7,,2011-06-09 14:31:57 CST,1/0,0,LOG,00000,"invalid record length at 120/70000078",,,,,,,,,""
2011-06-09 14:31:58.121 CST,,,1355,,4df068dc.54b,1,,2011-06-09 14:31:56 CST,,0,LOG,00000,"database system is ready to accept read only connections",,,,,,,,,""
2011-06-09 14:31:58.126 CST,,,1365,,4df068de.555,1,,2011-06-09 14:31:58 CST,,0,LOG,00000,"streaming replication successfully connected to primary",,,,,,,,,""
2011-06-09 14:31:59.320 CST,,,1357,,4df068dd.54d,8,,2011-06-09 14:31:57 CST,1/0,0,LOG,00000,"redo starts at 120/70000078",,,,,,,,,""
成功的切换为standby节点.