下周要为新员工介绍Oracle数据库,为了让课程更接地气,准备了虚拟机环境,用于实验和练习,在此过程中出现了两个ORA-600的错误,偶然中又有必然,记录于此。
操作过程:
1. 我在MAC上创建完成虚拟机环境,未关闭虚拟机操作系统。
2. 用移动硬盘,拷贝了次环境。
3. DELL笔记本中打开VMWare,引用移动硬盘中的环境。首先提示了这个选项,我选择的是“我已移动该虚拟机”,
两者区别:
“我已移动该虚拟机”:网卡MAC地址会保持不变,但若复制本地,同时开机在一个vmnet可能造成冲突。
”我已复制该虚拟机“:网卡MAC地址会变化,即虚拟机网卡物理地址,是新生成。
4. 打开虚拟机后,需要启动Oracle数据库,
SQL> startup
ORACLE instance started.
...
Database mounted.
ORA-00600: internal error code, arguments: [kcratr_scan_lastbwr], [], [], [],[], [], [], [], [], [], [], []
执行startup命令的时候,提示了ORA-00600错误,参数名称是kcratr_scan_lastbwr。
放到以前,此时第一个念头,必定是重新拷贝一份,是不是就能解决了。的确,有可能就解决了,但这就是“知其然,不知其所以然”,不是一名严谨的技术人员,解决问题的方法,于是乎需要探究一下。
ORA-00600是Oracle中非常著名的一个错误号,同时可能是一个会让你非常头疼的一个错误号,类似于Java语言中抛出的异常,
The ORA-600 error is the generic internal error number for Oracle program exceptions. It indicates that a process has encountered a low-level, unexpected condition.
他的通用提示信息如下,
ORA 600 "internal error code, arguments: [%s], [%s],[%s], [%s], [%s]"
其中,第一个参数是一个内部信息数字或字符串类型。这个参数,以及数据库版本号,非常重要,用这些信息,可以明确问题根本原因,以及对系统的潜在影响。其他参数,则用来提供进一步的信息(例如内部变量的值等)。非常详细的堆栈信息,则会记录于ORA-600的trace文件中。在Java语言中,抛出的异常,通常也会有一些提示信息、堆栈信息。
这篇文章《ORA-600/ORA-7445/ORA-700 Error Look-up Tool (文档 ID 153788.1)》提供了针对这几种Oracle错误的快捷查找工具,
回到上面描述的问题,查看alert日志,记录的错误信息如下所示,
ALTER DATABASE OPEN
Beginning crash recovery of 1 threads
Started redo scan
Hex dump of (file 3, block 144) in trace file /u01/app/oracle/diag/rdbms/bisal/BISAL/trace/BISAL_ora_3392.trc
Reading datafile '/u01/app/oracle/oradata/BISAL/undotbs01.dbf' for corruption at rdba: 0x00c00090 (file 3, block 144)
Reread (file 3, block 144) found same corrupt data (logically corrupt)
Write verification failed for File 3 Block 144 (rdba 0xc00090)
Errors in file /u01/app/oracle/diag/rdbms/bisal/BISAL/trace/BISAL_ora_3392.trc (incident=2554):
ORA-00600: internal error code, arguments: [kcratr_scan_lastbwr], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/bisal/BISAL/incident/incdir_2554/BISAL_ora_3392_i2554.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Aborting crash recovery due to error 600
Errors in file /u01/app/oracle/diag/rdbms/bisal/BISAL/trace/BISAL_ora_3392.trc:
ORA-00600: internal error code, arguments: [kcratr_scan_lastbwr], [], [], [], [], [], [], [], [], [], [], []
Errors in file /u01/app/oracle/diag/rdbms/bisal/BISAL/trace/BISAL_ora_3392.trc:
ORA-00600: internal error code, arguments: [kcratr_scan_lastbwr], [], [], [], [], [], [], [], [], [], [], []
ORA-600 signalled during: ALTER DATABASE OPEN...
Dumping diagnostic data in directory=[cdmp_20170901062027], requested by (instance=1, osid=3392), summary=[incident=2554].
Fri Sep 01 06:21:24 2017
Sweep [inc][2554]: completed
Sweep [inc2][2554]: completed
提示open数据库阶段,读取undotbs01.dbf回滚表空间数据文件,3号文件144号块出现了逻辑坏块。
MOS这篇文章《Bug 9584943 - Crash / recovery failure due to lost write even if mirror has a good image (文档 ID 9584943.8)》介绍了这种ORA-00600错误,
指出即使镜像拷贝正常,但由于缺失一部分写入,导致实例恢复失败。读取错误的文件头,能毁坏一个正常的镜像拷贝。这种情况下,可能会抛出这两种错误,ORA-600 [kcratr_scan_lostwrt]或者ORA-600 [kcratr_scan_lastbwr]。
回想一下,从MAC中拷贝出的虚拟机,是未关闭状态,但从DELL打开移动硬盘的镜像,则是关闭状态,需要重新开机,换句话说,这就是异常断电的场景,此时log buffer中的redo信息未必来得及触发写出条件,即持久化至在线重做日志,当重新开启数据库的时候,由于不是正常关闭数据库,因此需要执行实例恢复,我们知道,实例恢复包括两个阶段,一是利用日志文件中的redo信息,进行交易的前滚操作,恢复到异常断电时刻的状态,此时数据库包含已提交和未提交两种类型的改变向量,二是利用UNDO表空间中的数据进行交易回滚操作,阶段一中除了恢复出了已提交的交易,还恢复出了未提交的事务,此时就会执行rollback操作,回滚所有未提交的事务,此刻数据库中仅包含已提交的事务,状态一致,才能继续open数据库。
上述kcratr_scan_lastbwr则是因为执行实例恢复,第二个阶段,即读取回滚数据的时候,发现了逻辑坏块,因此无法执行,导致open失败。
《ORA-600 [kcratr_scan_lastbwr] (文档 ID 1267231.1)》这篇文章则介绍了这种ORA-00600错误的解决方案,即需要做数据库的逻辑恢复,
数据库启动到mount状态,执行recover database操作,然后open数据库。
按此操作执行,
SQL> startup mount;
ORACLE instance started.
...
SQL> recover database;
Media recovery complete.
...
SQL> alter database open;
alter database open
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [kcratr_nab_less_than_odr], [1],
[7], [160384], [160414], [], [], [], [], [], [], []
open的时候,又报错了,还是ORA-00600,但这次参数不同,是kcratr_nab_less_than_odr。
查看alert日志,
alter database open
Beginning crash recovery of 1 threads
Started redo scan
Completed redo scan
read 47 KB redo, 0 data blocks need recovery
Errors in file /u01/app/oracle/diag/rdbms/bisal/BISAL/trace/BISAL_ora_3464.trc (incident=3755):
ORA-00600: internal error code, arguments: [kcratr_nab_less_than_odr], [1], [7], [160384], [160414], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/bisal/BISAL/incident/incdir_3755/BISAL_ora_3464_i3755.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Aborting crash recovery due to error 600
Errors in file /u01/app/oracle/diag/rdbms/bisal/BISAL/trace/BISAL_ora_3464.trc:
ORA-00600: internal error code, arguments: [kcratr_nab_less_than_odr], [1], [7], [160384], [160414], [], [], [], [], [], [], []
Errors in file /u01/app/oracle/diag/rdbms/bisal/BISAL/trace/BISAL_ora_3464.trc:
ORA-00600: internal error code, arguments: [kcratr_nab_less_than_odr], [1], [7], [160384], [160414], [], [], [], [], [], [], []
ORA-600 signalled during: alter database open...
Fri Sep 01 06:26:24 2017
Dumping diagnostic data in directory=[cdmp_20170901062624], requested by (instance=1, osid=3464), summary=[incident=3755].
Fri Sep 01 06:26:32 2017
Sweep [inc][3755]: completed
Sweep [inc2][3755]: completed
再次检索MOS,《Alter database open fails with ORA-00600 kcratr_nab_less_than_odr (文档 ID 1296264.1)》这篇文章,详细介绍了kcratr_nab_less_than_odr错误信息,首先是现象,
After Power Fail Alter database open fails with
ORA-00600: internal error code, arguments: [kcratr_nab_less_than_odr]# The Database can't open at this Point. In the corresponding Tracefile we can find this Error Callstack:
dbkedDefDump(): Starting incident default dumps (flags=0x2, level=3, mask=0x0)
----- Current SQL Statement for this session (sql_id=1h50ks4ncswfn) -----
ALTER DATABASE OPEN----- Call Stack Trace -----
ksedst1 <- ksedst <- dbkedDefDump <- ksedmp <- dbgexPhaseII <- dbgexProcessError <- dbgePostErrorKGE <- kgeasnmierr <- kcratr_odr_check <- kcratr <- kctrec <- kcvcrv <- kcfopd <- adbdrv <- opiexe <- opiosq0 <- kpoal8 <- opiodr <- ttcpip <- opitsk <- opiino <- opiodr <- opidrv <- sou2o <- opimai_real <- ssthrdmain <- main <- start
指出异常断电后,open数据库则会报这个错误。此时数据库无法open,并且trace文件中会记录以上调用栈。
给出了一些原因,
There was a power failure causing logical corruption in controlfile
This Problem is caused by Storage Problem of the Database Files.
The Subsystem (eg. SAN) crashed while the Database was open.
The Database then crashed because the Database Files were not accessible anymore.
This caused a lost Write into the Online RedoLogs and/or causing logical corruption in controlfile so Instance Recovery is not possible and raising the ORA-600.
指出是由于异常断电,导致控制文件,出现了逻辑损坏。诱因可能是以下几种之一:
1. 数据库文件的存储问题。
2. 数据库open状态下,例如SAN子系统崩溃。
3. 由于数据库文件不可访问,导致数据库崩溃。
4. 在线重做日志写入丢失,或者控制文件出现逻辑损坏,以至于实例恢复不可执行,也会出现这种ORA-00600错误信息。
根据以上信息,我的问题,是由4这个诱因引起的,由于开机拷贝,导致在线重做日志,并未完成所有持久化,导致open数据库,无法执行实例恢复操作。
解决方案一:
Do cancel based reocvery, and apply 'current online redolog' manually
做一次不完全恢复,并手工应用当前的在线重做日志。
首先检索控制文件名称,以及当前的日志文件,
SQL> Show parameter control_files
NAME control_files
TYPE string
VALUE /u01/app/oracle/oradata/BISAL/control01.ctl, /u01/app/oracle/oradata/BISAL/control02.ctl
SQL> select a.member, a.group#, b.status from v$logfile a ,v$log b where a.group#=b.group# and b.status='CURRENT' ;
MEMBER /u01/app/oracle/oradata/BISAL/redo01a.log
GROUP# 1
STATUS CURRENT
执行shutdown abort,
SQL> shutdown abort
ORACLE instance shut down.
操作系统中备份控制文件,接着将数据库,启动mount状态,执行不完全恢复,输入当前使用的在线重做日志文件名信息,完成介质恢复,
SQL> startup mount;
ORACLE instance started.
SQL> recover database using backup controlfile until cancel ;
ORA-00279: change 243562 generated at 08/30/2017 08:10:54 needed for thread 1
ORA-00289: suggestion :
/u01/app/oracle/fast_recovery_area/BISAL/archivelog/2017_09_01/o1_mf_1_7_%u_.arc
ORA-00280: change 243562 for thread 1 is in sequence #7
Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
/u01/app/oracle/oradata/BISAL/redo01a.log
Log applied.
Media recovery complete.
从alert日志看,
ALTER DATABASE MOUNT
Successful mount of redo thread 1, with mount id 286455927
Database mounted in Exclusive Mode
Lost write protection disabled
Completed: ALTER DATABASE MOUNT
Fri Sep 01 06:36:57 2017
ALTER DATABASE RECOVER database using backup controlfile until cancel
Media Recovery Start
Serial Media Recovery started
ORA-279 signalled during: ALTER DATABASE RECOVER database using backup controlfile until cancel ...
Fri Sep 01 06:37:16 2017
ALTER DATABASE RECOVER LOGFILE '/u01/app/oracle/oradata/BISAL/redo01a.log'
Media Recovery Log /u01/app/oracle/oradata/BISAL/redo01a.log
Incomplete recovery applied all redo ever generated.
Recovery completed through change 243563 time 08/30/2017 08:10:54
Media Recovery Complete (BISAL)
Completed: ALTER DATABASE RECOVER LOGFILE '/u01/app/oracle/oradata/BISAL/redo01a.log'
以resetlogs方式,open数据库,
SQL> alter database open resetlogs;
Database altered.
从alert日志看,
alter database open resetlogs
RESETLOGS after complete recovery through change 243563
Clearing online redo logfile 1 /u01/app/oracle/oradata/BISAL/redo01a.log
Clearing online log 1 of thread 1 sequence number 7
Clearing online redo logfile 1 complete
Clearing online redo logfile 2 /u01/app/oracle/oradata/BISAL/redo02a.log
Clearing online log 2 of thread 1 sequence number 5
Clearing online redo logfile 2 complete
Clearing online redo logfile 3 /u01/app/oracle/oradata/BISAL/redo03a.log
Clearing online log 3 of thread 1 sequence number 6
Clearing online redo logfile 3 complete
Resetting resetlogs activation ID 286366553 (0x11119b59)
Online log /u01/app/oracle/oradata/BISAL/redo01a.log: Thread 1 Group 1 was previously cleared
Online log /u01/app/oracle/oradata/BISAL/redo02a.log: Thread 1 Group 2 was previously cleared
Online log /u01/app/oracle/oradata/BISAL/redo03a.log: Thread 1 Group 3 was previously cleared
Fri Sep 01 06:39:57 2017
Setting recovery target incarnation to 2
Fri Sep 01 06:39:57 2017
Assigning activation ID 286455927 (0x1112f877)
Thread 1 opened at log sequence 1
Current log# 1 seq# 1 mem# 0: /u01/app/oracle/oradata/BISAL/redo01a.log
Successful open of redo thread 1
Fri Sep 01 06:39:57 2017
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Fri Sep 01 06:39:57 2017
SMON: enabling cache recovery
[3636] Successfully onlined Undo Tablespace 2.
Undo initialization finished serial:0 start:688494 end:689074 diff:580 (5 seconds)
Dictionary check beginning
Dictionary check complete
Verifying file header compatibility for 11g tablespace encryption..
Verifying 11g file header compatibility for tablespace encryption completed
SMON: enabling tx recovery
Database Characterset is ZHS16GBK
No Resource Manager plan active
replication_dependency_tracking turned off (no async multimaster replication found)
Starting background process QMNC
Fri Sep 01 06:40:01 2017
QMNC started with pid=20, OS id=3664
LOGSTDBY: Validating controlfile with logical metadata
LOGSTDBY: Validation complete
Completed: alter database open resetlogs
Fri Sep 01 06:40:03 2017
db_recovery_file_dest_size of 2048 MB is 0.00% used. This is a
user-specified limit on the amount of space that will be used by this
database for recovery-related files, and does not reflect the amount of
space available in the underlying filesystem or ASM diskgroup.
Fri Sep 01 06:40:04 2017
Starting background process CJQ0
Fri Sep 01 06:40:04 2017
CJQ0 started with pid=22, OS id=3681
此时数据库已打开,可以执行一次正常关闭、打开操作,进行确认,
SQL> shutdown immediate;
SQL> startup;
解决方案二:
Recreate the controlfile using the Controlfile recreation script
使用控制文件重建脚本,重建控制文件。
没有实操,直接展示一下文章的操作,
$ rman target /
rman> spool log to '/tmp/rman.log';
rman> list backup ;
rman> exit# Keep this log handy
Go to sqlplus
SQL> Show parameter control_files
Keep this location handy.
SQL> oradebug setmypid
SQL> Alter session set tracefile_identifier='controlfilerecreate' ;
SQL> Alter database backup controlfile to trace noresetlogs;
SQL> oradebug tracefile_name ; --> This command will give the path and name of the trace fileGo to this location ,Open this trace file and select the controlfile recreation script with NOResetlogs option
SQL> Shutdown immediate;
Rename the existing controlfile to <originalname>_old ---> This is Important as we need to have a backup of existing controlfile since we plan to recreate it
SQL> Startup nomount
Now run the Controlfile recreation script with NO Resetlogs option.
SQL> Alter database open ;
For database version 10g and above
Once database is opened you can recatalog the rman backup information present in the list /tmp/rman.log using
Rman> Catalog start with '<location of backupiece>' ;
If backups are on tape, and you are not using a catalog, backups can be cataloged using information in:
How to Catalog Tape Backup Pieces (Doc ID 550082.1)
主要过程,是在RMAN中使用Alter database backup controlfile to trace noresetlogs,创建控制文件重建脚本,数据库启动至nomount状态,执行脚本,以非resetlogs方式,open数据库。
无论用上述何种方法,数据库open了,此时就应该做一次热备,下次再碰见这种情况,一旦上述两种方法不起作用,或者备份集不全,则可以从最近的一次备份,进行数据库的restore和recover,
Once the database has been opened using the Option a or Option b, it is recommended to take a hot backup of the database. The same steps are applicable to RAC if all instance are down with same error.
If both Options (a|b) fails, or you do not have the full set of files, then you have to restore and recover the Database from a recent Backup.
总结:
1. 对于虚拟机环境的拷贝,建议需要关闭操作系统,再拷贝文件,避免这种异常断电的场景。
2. ORA-00600是Oracle中的一种通用错误号,和普通ORA报错不同,可能会需要根据堆栈信息,才能进一步定位问题,MOS有工具可以方便检索,但终究还是需要靠积累和学习,才能从容面对更多的错误,对于我这样的小白来说,需要继续学习。
如果您觉得此篇文章对您有帮助,欢迎关注微信公众号:bisal的个人杂货铺,您的支持是对我最大的鼓励!共同学习,共同进步:)