原文转自:http://blog.csdn.net/tianlesoftware/article/details/6256542
在之前的两篇文章里都对oracle Data Guard的Failover 进行了说明,但是没有个系统的说明,所以在这篇把DG的Failover 做个系统的说明。
物理Data Guard 下Failover 时Redo 的处理问题
http://blog.csdn.net/tianlesoftware/archive/2010/11/05/5989638.aspx
Oracle Data Guard Linux 平台 Physical Standby 搭建实例
http://blog.csdn.net/tianlesoftware/archive/2010/04/30/5547565.aspx
Failover 是失败切换。 这种情况下切换对redo 的处理,就显的很重要。如果处理好,就不会有数据丢失。 否则就会有数据丢失。
在Oracle 11g里,Data Guard 切换多了一个新的功能:flush redo。
Flush 能把没有发送的redo 从主库传送到standby库。 只要主库能启动到mount 状态,那么Flush 就可以把没有发送的归档和current online redo 发送到备库。
Flush语法:
SQL> ALTER SYSTEM FLUSH REDO TO target_db_name;
这里的target_db_name 是我们在主库的db_unique_name 名称。 也就是在tnsnames.ora 文件配置的。 Flush 会将未发送的redo 从主库传到备库,并且等待redo在standby 库上apply 之后返回成功。 所以只要Flush成功,那么Failover 就没有主句丢失。
如果说我们的Primary 已经不能启动到mount 状态,那么就只能按照之前的方法来。 Oracle 10g 下就是这么操作的。
一. 正常的Failover
1.1 检查Gap
sql> select thread#, low_sequence#, high_sequence# from v$archive_gap;
如果有,将对应的归档文件copy到备库,在注册它
sql>alter database register physical logfile 'filespec1';
注意: 如果有Gap存在,并且没有解决。 那么是不能正常的进行一个Failover。 只能进行一个强制的Failover。 这种情况下会有数据丢失。
sql> alter database activate physical standby database;
1.2 解决gap问题后,进行切换
1.2.1 取消Apply
SQL> recover managed standby database cancel;
1.2.2 结束Apply
(1)在oracle 10gR2 或之后的版本:如果在备用库上有备用库日志文件
SQL> alter database recover managed standby database finish; -- [force|wait|nowait]
在执行这个命令的时候,如果主库和备库之间的网络中断了。 那么备库的RFS进程就会等待网络的连接,直到TCP超时。 因此在这种情况下,我们就需要加上Foce 关键字。
(2)在oracle 10gR2之前的版本:没有备库日志文件
SQL> alter database recover managed standby database finish skip standby logfile;
注意:如果执行了这条命令,就不能在进行recover standby database;
1.2.3 将备库切换成主库
SQL> alter database commit to switchover to primary;
SQL> shutdown immediate;
SQL> startup
二. 强行切换(激活)
2.1 使用条件
当我们正常切换的时候,提示我们需要介质恢复的时候,就需要使用强行激活standby 库。 如:
SQL> ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY;
ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY
*
ERROR at line 1:
ORA-16139: media recovery required
2.2 强行激活下的 redo 问题
在这里需要说明一点,就是我们在主库commit 之后,然后shutdown abort, 这时候,主库的online redo 会自动的写入备库的最后一个归档文件里(大小会发生变化)。我们在恢复的时候需要对备库的最后一个归档文件进行重新的注册。
sql>alter database register physical logfile 'filespec1';
如果说,主库OS是整个宕机了。 这个时候,online redo 是不会发送到备库。所以我们需要手工的将主库的所有online redo copy到备库。 然后进行recover。
步骤如下:
SQL> alter database recover managed standby database cancel;
Database altered.
SQL> recover standby database until cancel;
ORA-00279: change 509016 generated at 11/05/2010 11:40:27 needed for thread 1
ORA-00289: suggestion : /u01/archive/1_17_734225750.dbf
ORA-00280: change 509016 for thread 1 is in sequence #17
-- 默认情况下会提示需要归档17, 实际上这个序列为17的归档还没有生成,我们忽略它,使用我们刚才copy过来的redo 日志来恢复。
Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
/u01/app/oracle/oradata/orcl/redo01.log -- 注意, 这个位置是我手动写的
Log applied.
Media recovery complete.
这里一次就搞定了。 实际上有三个redo,如果不确定使用哪个redo的,只能一个一个试。
当我们使用了recover standby database until cancel之后,只能使用强制激活备库,如果使用正常模式,会提示我们需要:
ORA-16139: media recovery required
2.3 强制激活备库:
sql> alter database recover managed standby database cancel;
sql> recover standby database until cancel;
sql>alter database activate standby database;
sql>shutdown immediate;
sql>startup
三. Switchover
3.1 主库操作:
(1)查看状态:
sql>select switchover_status from v$database;
(2)切换
sql> alter database commit to switchover to physical standby with session shutdown;
SQL> shutdown immediate;
SQL> startup;
SQL> alter database mount standby database;
SQL> recover managed standby database disconnect;
3.2 备库操作:
SQL> alter database commit to switchover to primary with session shutdown;
SQL> shutdown immediate
SQL> startup
四. 对Failover 过程的研究
4.1 Failover 日志
Thu Mar 17 15:01:47 2011
alter database activate standby database
Thu Mar 17 15:01:47 2011
ALTER DATABASE ACTIVATE [PHYSICAL] STANDBY DATABASE (dave)
--我们切换的时候,命令写全命令,db 自动补全了
RESETLOGS after complete recovery through change 1255060
Resetting resetlogs activation ID 808909668 (0x3036fb64)
-- resetlogs了. 这就以为着产生一个新的incarnation。 online redo 会被清空
Online log /u01/app/oracle/oradata/dave/redo01.log: Thread 1 Group 1 was previously cleared
Online log /u01/app/oracle/oradata/dave/redo02.log: Thread 1 Group 2 was previously cleared
Online log /u01/app/oracle/oradata/dave/redo03.log: Thread 1 Group 3 was previously cleared
Standby became primary SCN: 1255058
Thu Mar 17 15:01:48 2011
Setting recovery target incarnation to 3
--修改incarnation版本
Thu Mar 17 15:01:48 2011
Converting standby mount to primary mount.
--将standby 转成 primary
Thu Mar 17 15:01:48 2011
ACTIVATE STANDBY: Complete - Database mounted as primary (dave)
Completed: alter database activate standby database
--完成active
Thu Mar 17 15:01:59 2011
Shutting down instance: further logons disabled
--关闭实例
Thu Mar 17 15:01:59 2011
Stopping background process CJQ0
Thu Mar 17 15:01:59 2011
Stopping background process MMNL
Thu Mar 17 15:01:59 2011
Stopping background process MMON
Thu Mar 17 15:01:59 2011
Shutting down instance (immediate)
License high water mark = 7
Thu Mar 17 15:01:59 2011
Stopping Job queue slave processes, flags = 7
Thu Mar 17 15:01:59 2011
Job queue slave processes stopped
All dispatchers and shared servers shutdown
Thu Mar 17 15:02:35 2011
ARC1: Archival disabled due to shutdown: 1089
Shutting down archive processes
Archiving is disabled
Thu Mar 17 15:02:45 2011
ARCH shutting down
ARC0: Archival stopped
Thu Mar 17 15:02:50 2011
ARCH shutting down
ARC1: Archival stopped
Thu Mar 17 15:07:04 2011
SHUTDOWN: Active processes prevent shutdown operation
Thu Mar 17 15:07:50 2011
ALTER DATABASE CLOSE NORMAL
Thu Mar 17 15:07:50 2011
ORA-1109 signalled during: ALTER DATABASE CLOSE NORMAL...
Thu Mar 17 15:07:50 2011
ALTER DATABASE DISMOUNT
Completed: ALTER DATABASE DISMOUNT
ARCH: Archival disabled due to shutdown: 1089
Shutting down archive processes
Archiving is disabled
Archive process shutdown avoided: 0 active
ARCH: Archival disabled due to shutdown: 1089
Shutting down archive processes
Archiving is disabled
Archive process shutdown avoided: 0 active
Thu Mar 17 15:08:13 2011
Starting ORACLE instance (normal)
--开始重新启动实例
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Picked latch-free SCN scheme 2
Autotune of undo retention is turned on.
IMODE=BR
ILAT =18
LICENSE_MAX_USERS = 0
SYS auditing is disabled
ksdpec: called for event 13740 prior to event group initialization
Starting up ORACLE RDBMS Version: 10.2.0.4.0.
System parameters with non-default values:
processes = 150
__shared_pool_size = 113246208
__large_pool_size = 4194304
__java_pool_size = 25165824
__streams_pool_size = 0
nls_territory = AMERICA
sga_target = 247463936
control_files = /u01/app/oracle/oradata/dave/control01.ctl, /u01/app/oracle/oradata/dave/control02.ctl, /u01/app/oracle/oradata/dave/control03.ctl
db_block_size = 8192
__db_cache_size = 100663296
compatible = 10.2.0.1.0
log_archive_config = dg_config=(dave_pd,dave_st)
log_archive_dest_1 = location=/u01/archivelog valid_for=(all_logfiles,all_roles) db_unique_name=dave_st
log_archive_dest_2 = service=dave_pd reopen=120 lgwr async valid_for=(online_logfiles,primary_role) db_unique_name=dave_pd
log_archive_dest_state_1 = ENABLE
log_archive_dest_state_2 = ENABLE
standby_archive_dest = /u01/archivelog
fal_client = dave_st
fal_server = dave_pd
db_file_multiblock_read_count= 16
standby_file_management = AUTO
undo_management = AUTO
undo_tablespace = UNDOTBS1
remote_login_passwordfile= EXCLUSIVE
db_domain =
dispatchers = (PROTOCOL=TCP) (SERVICE=daveXDB)
job_queue_processes = 10
background_dump_dest = /u01/app/oracle/admin/dave/bdump
user_dump_dest = /u01/app/oracle/admin/dave/udump
core_dump_dest = /u01/app/oracle/admin/dave/cdump
audit_file_dest = /u01/app/oracle/admin/dave/adump
db_name = dave
db_unique_name = dave_st
open_cursors = 300
pga_aggregate_target = 81788928
PMON started with pid=2, OS id=5909
PSP0 started with pid=3, OS id=5911
MMAN started with pid=4, OS id=5913
DBW0 started with pid=5, OS id=5915
LGWR started with pid=6, OS id=5917
CKPT started with pid=7, OS id=5919
SMON started with pid=8, OS id=5921
RECO started with pid=9, OS id=5923
CJQ0 started with pid=10, OS id=5925
MMON started with pid=11, OS id=5927
Thu Mar 17 15:08:14 2011
starting up 1 dispatcher(s) for network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'...
MMNL started with pid=12, OS id=5929
Thu Mar 17 15:08:14 2011
starting up 1 shared server(s) ...
Thu Mar 17 15:08:15 2011
ALTER DATABASE MOUNT
Thu Mar 17 15:08:19 2011
Setting recovery target incarnation to 3
Thu Mar 17 15:08:19 2011
Successful mount of redo thread 1, with mount id 808884895
Thu Mar 17 15:08:19 2011
Database mounted in Exclusive Mode
Completed: ALTER DATABASE MOUNT
Thu Mar 17 15:08:19 2011
ALTER DATABASE OPEN
Thu Mar 17 15:08:19 2011
Assigning activation ID 808884895 (0x30369a9f)
LGWR: STARTING ARCH PROCESSES
ARC0 started with pid=16, OS id=5937
Thu Mar 17 15:08:19 2011
ARC0: Archival started
ARC1: Archival started
LGWR: STARTING ARCH PROCESSES COMPLETE
ARC1 started with pid=17, OS id=5939
LNS1 started with pid=18, OS id=5941
Thu Mar 17 15:08:22 2011
Thread 1 advanced to log sequence 2 (thread open)
Thu Mar 17 15:08:23 2011
ARC0: STARTING ARCH PROCESSES
Thu Mar 17 15:08:23 2011
ARC1: Becoming the 'no FAL' ARCH
ARC1: Becoming the 'no SRL' ARCH
Thu Mar 17 15:08:23 2011
Thread 1 opened at log sequence 2
Current log# 2 seq# 2 mem# 0: /u01/app/oracle/oradata/dave/redo02.log
Successful open of redo thread 1
Thu Mar 17 15:08:23 2011
******************************************************************
LGWR: Setting 'active' archival for destination LOG_ARCHIVE_DEST_2
******************************************************************
Thu Mar 17 15:08:23 2011
ARC1: LGWR is actively archiving destination LOG_ARCHIVE_DEST_2
Thu Mar 17 15:08:23 2011
ARC2: Archival started
ARC0: STARTING ARCH PROCESSES COMPLETE
ARC0: Becoming the heartbeat ARCH
ARC2 started with pid=19, OS id=5943
Thu Mar 17 15:08:23 2011
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Thu Mar 17 15:08:23 2011
SMON: enabling cache recovery
Thu Mar 17 15:08:24 2011
Successfully onlined Undo Tablespace 1.
Dictionary check beginning
Dictionary check complete
Thu Mar 17 15:08:24 2011
SMON: enabling tx recovery
Thu Mar 17 15:08:24 2011
Database Characterset is ZHS16GBK
Opening with internal Resource Manager plan
where NUMA PG = 1, CPUs = 1
replication_dependency_tracking turned off (no async multimaster replication found)
Starting background process QMNC
QMNC started with pid=20, OS id=5945
Thu Mar 17 15:08:26 2011
LOGSTDBY: Validating controlfile with logical metadata
Thu Mar 17 15:08:26 2011
LOGSTDBY: Validation complete
Completed: ALTER DATABASE OPEN
4.2 对Failover 的补充说明
在4.1 中看了Failover 的整个过程,DB 会进行一次resetlogs。 这个是个很有意思的过程。
(1) resetlogs 会产生一个新的incarnation。 这个会影响我们的RMAN 恢复。 我们查看一下:
RMAN> list incarnation;
List of Database Incarnations
DB Key Inc Key DB Name DB ID STATUS Reset SCN Reset Time
------- ------- -------- ---------------- --- ---------- ----------
1 1 DAVE 808637274 PARENT 1 30-JUN-05
2 2 DAVE 808637274 PARENT 446075 14-MAR-11
3 3 DAVE 808637274 CURRENT 1255061 17-MAR-11
这个时候,我们只能恢复incarnation 为3之内的信息,如果要恢复到其他版本的信息,要保证对应备份集存在的同时,在使用reset database incarnation to 3或者其他的版本。 之后在恢复。
(2)看下归档日志
先看备库:
SQL> select max(sequence#) from v$archived_log;
MAX(SEQUENCE#)
--------------
6
SQL> select sequence#,applied from v$archived_log;
SEQUENCE# APP
---------- ---
3 YES
2 YES
5 YES
4 YES
6 YES
这个是重新开始的,没有什么问题。
我们看下主库:
SQL> select max(sequence#) from v$archived_log;
MAX(SEQUENCE#)
--------------
88
SQL> select sequence#,applied from v$archived_log;
SEQUENCE# APP
---------- ---
4 YES
3 YES
5 YES
6 YES
7 YES
8 YES
......
82 YES
83 YES
84 YES
85 YES
86 YES
87 YES
88 YES
2 NO
2 YES --注意这部分,有重新开始了。
3 NO
3 YES
4 NO
5 NO
5 YES
4 YES
6 NO
6 YES
因为resetlogs 会重置sequence#。 将其设置为1. 所以这里又重新开始了。 但是SCN 不会重置。 我们查看一下:
SQL> select sequence#,first_change#,next_change# from v$log_history;
SEQUENCE# FIRST_CHANGE# NEXT_CHANGE#
---------- ------------- ------------
1 446075 451208
2 451208 483347
3 483347 485272
4 485272 485277
5 485277 486119
.....
83 1227667 1229252
84 1229252 1252272
85 1252272 1252277
86 1252277 1252293
87 1252293 1252294
88 1252294 1253301
1 1255061 1255062
--sequence# 重新开始了,但是SCN 还是继续增加的。
2 1255062 1257639
3 1257639 1257644
4 1257644 1265602
5 1265602 1265607
6 1265607 1265913
94 rows selected.
SQL>
所以,这种情况下,查看同步情况还是有点不直观。 但是V$LOG_HISTORY 和V$ARCHIVED_LOG显示的log 历史信息是从控制文件中取得的,所以说,如果要删除以前的记录,只有重建控制文件了。
(3)重建控制文件
SQL> shutdown immediate
SQL> startup nomount;
SQL>create controlfile reuse database dave noresetlogs archivelog
LOGFILE
GROUP 1 '/u01/app/oracle/oradata/dave/redo01.log',
GROUP 2 '/u01/app/oracle/oradata/dave/redo02.log',
GROUP 3 '/u01/app/oracle/oradata/dave/redo03.log'
DATAFILE
'/u01/app/oracle/oradata/dave/sysaux01.dbf',
'/u01/app/oracle/oradata/dave/system01.dbf',
'/u01/app/oracle/oradata/dave/undotbs01.dbf',
'/u01/app/oracle/oradata/dave/users01.dbf'
CHARACTER SET ZHS16GBK;
--注意,使用的是noresetlogs,如果使用resetlogs,DG 就需要重新搭建了。
SQL> ALTER TABLESPACE TEMP ADD TEMPFILE '/u01/app/oracle/oradata/dave/temp01.dbf' size 100M;
Tablespace altered.
--添加临时表空间,在重建控制文件的时候,不能添加TEMP表空间,只能在控制文件重建好之后,在添加temp 表空间。
更多信息参考:
Oracle 控制文件
http://blog.csdn.net/tianlesoftware/archive/2009/12/09/4974440.aspx
(4)在次验证归档信息
主库:
SQL> select max(sequence#) from v$archived_log;
MAX(SEQUENCE#)
--------------
9
SQL> alter system switch logfile;
System altered.
SQL> select max(sequence#) from v$archived_log;
MAX(SEQUENCE#)
--------------
10
SQL> select sequence#,applied from v$archived_log;
SEQUENCE# APP
---------- ---
8 NO
7 NO
9 NO
9 YES
10 YES
10 NO
6 rows selected.
SQL> select sequence#,first_change#,next_change# from v$log_history;
SEQUENCE# FIRST_CHANGE# NEXT_CHANGE#
---------- ------------- ------------
9 1267828 1268212
10 1268212 1274690
从这个结果来看,重建控制文件之后,之前的所有的有关归档的信息都会被删除。
备库:
SQL> select max(sequence#) from v$archived_log;
MAX(SEQUENCE#)
--------------
10
SQL> select sequence#,applied from v$archived_log;
SEQUENCE# APP
---------- ---
3 YES
2 YES
5 YES
4 YES
6 YES
7 YES
8 YES
9 YES
10 YES
9 rows selected.
说明:
我这里是测试环境,所以重建控制文件测试一下,如果是生产环境,小心操作。