2016年10月10日,客户一预警系统发生会话数飙高,系统响应极慢,后来确诊根源是归档空间满,引起所有redo耗尽,导致会话堆积,下面是处理过程。
操作系统:HP-UX B.11.31 U ia64
数据库版本:ORACLE 10.2.0.5 RAC
按照常规处理思路,首先查看RAC数据库的告警日志:
实例1的告警日志
Mon Oct 10 19:24:48 EAT 2016
ORACLE Instance orcl1 - Can not allocate log, archival required
.
.
Mon Oct 10 19:26:28 EAT 2016
ORACLE Instance orcl1 - Can not allocate log, archival required
.
.
Mon Oct 10 20:06:21 EAT 2016
Errors in file /oracle/app/oracle/admin/orcl/bdump/orcl1_arc0_22630.trc:
ORA-19504: failed to create file "+SFILEARCHDG01"
ORA-17502: ksfdcre:4 Failed to create file +SFILEARCHDG01
ORA-15041: diskgroup space exhausted
实例2的告警日志
Mon Oct 10 20:12:52 EAT 2016
Errors in file /oracle/app/oracle/admin/orcl/bdump/orcl2_arc1_8962.trc:
ORA-19504: failed to create file "+SFILEARCHDG01"
ORA-17502: ksfdcre:4 Failed to create file +SFILEARCHDG01
ORA-15041: diskgroup space exhausted
ARC1: Error 19504 Creating archive log file to '+SFILEARCHDG01'
ARCH: Archival stopped, error occurred. Will continue retrying
Mon Oct 10 20:12:52 EAT 2016
ORACLE Instance orcl2 - Archival Error
Mon Oct 10 20:12:52 EAT 2016
ORA-16038: log 12 sequence# 77053 cannot be archived
ORA-19504: failed to create file ""
ORA-00312: online log 12 thread 2: '+ORCLDG1/orcl/onlinelog/group_12.291.819723519'
ORA-00312: online log 12 thread 2: '+ORCLDG2/orcl/onlinelog/group_12.266.819723541'
从实例1登录数据库查看ASM磁盘组使用情况
orcl1:[/oracle/app/oracle/admin/orcl/bdump]$asmcmd
ASMCMD> lsdg
State Type Rebal Unbal Sector Block AU Total_MB Free_MB Req_mir_free_MB Usable_file_MB Offline_disks Name
MOUNTED NORMAL N N 1024 4096 1048576 1024000 678 102400 -50861 0 ORCLDG1/
MOUNTED NORMAL N N 1024 4096 1048576 1024000 1389 102400 -50505 0 ORCLDG2/
MOUNTED EXTERN N N 1024 4096 1048576 512000 100210 0 100210 0 ORCLDG3/
MOUNTED NORMAL N N 1024 4096 1048576 307200 9659 102400 -46370 0 ORCLSYSDG01/
MOUNTED NORMAL N N 1024 4096 1048576 204800 1376 0 688 0 SFILEARCHDG01/
通过查看告警日志和ASM磁盘使用情况及2个节点个告警日志:可以判断是归档空间满、归档异常,redo组耗光后实例hang住了。
如果查看数据库实例1的等待事件,发现等待事件 enq: TX - row lock contention已经比较严重:
SQL> SELECT INST_ID, EVENT, COUNT(*) CNT, SUM(WAIT_TIME) WT_TM
2 FROM GV$SESSION_WAIT
3 GROUP BY INST_ID, EVENT
4 ORDER BY INST_ID, CNT DESC;
INST_ID EVENT CNT WT_TM
---------- ---------------------------------------- ---------- ----------
1 SQL*Net message from client 265 0
1 enq: TX - row lock contention 132 0
1 db file sequential read 107 35
1 jobq slave wait 19 0
1 rdbms ipc message 15 4
1 read by other session 7 0
1 wait for a undo record 5 0
1 gcs remote message 4 0
1 enq: HW - contention 2 0
1 log file sequential read 2 0
1 db file scattered read 2 0
1 buffer busy waits 2 0
1 latch: cache buffers chains 2 -2
1 enq: US - contention 2 0
1 PX Deq: Txn Recovery Start 2 0
按等待事件查看数据库中会话、执行sql的sql_id相关信息,发现等待事件enq: TX - row lock contention全是sql_id为90frgddxc2my0的sql引起的,并且有200多个积压的会话。
SQL> select * from (
2 select a.inst_id,
3 a.sid,
4 a.sql_id,
5 a.event,
6 a.status,
7 connect_by_isleaf as isleaf,
8 sys_connect_by_path(sid,'<-') tree,
9 level as tree_level
10 from gv$session a
11 start with a.blocking_session is not null and event like '%enq: TX%'
12 connect by nocycle a.sid = prior a.blocking_Session
13 )
14 where isleaf =1
15 order by tree_level asc;
INST_ID SID SQL_ID EVENT STATUS ISLEAF TREE TREE_LEVEL
---------- ---------- ------------- ---------------------------------------- -------- ---------- -------------------------------------------------------------------------------- ----------
1 1892 90frgddxc2my0 enq: TX - row lock contention ACTIVE 1 <-1892 1
SQL> select count(*)
2 from gv$session
3 where event='enq: TX - row lock contention'
4 and inst_id=1
5 and status='ACTIVE';
COUNT(*)
----------
210
经确定90frgddxc2my0是一个update语句
SQL> select sql_fulltext from v$sql where sql_id='90frgddxc2my0';
SQL_FULLTEXT
--------------------------------------------------------------------------------
UPDATE INTF.OMGALARM A SET A.ALARMACTION = :B3 , A.VERSION = A.RECEIVETIME, A.RE
查看集群数据库的日志组状态,发现实例2的redo组已经全部用上。
SQL> select * from v$log;
GROUP# THREAD# SEQUENCE# BYTES MEMBERS ARC STATUS FIRST_CHANGE# FIRST_TIME
---------- ---------- ---------- ---------- ---------- --- ---------------- ------------- ------------
1 1 58663 1073741824 2 NO ACTIVE 1.5122E+13 10-OCT-16
2 1 58662 1073741824 2 NO INACTIVE 1.5122E+13 10-OCT-16
3 1 58661 1073741824 2 NO INACTIVE 1.5122E+13 10-OCT-16
4 1 58660 1073741824 2 NO INACTIVE 1.5122E+13 10-OCT-16
5 1 58665 1073741824 2 NO CURRENT 1.5122E+13 10-OCT-16
6 1 58664 1073741824 2 NO ACTIVE 1.5122E+13 10-OCT-16
7 2 77116 1073741824 2 NO ACTIVE 1.5122E+13 11-OCT-16
8 2 77117 1073741824 2 NO ACTIVE 1.5122E+13 11-OCT-16
9 2 77118 1073741824 2 NO ACTIVE 1.5122E+13 11-OCT-16
10 2 77114 1073741824 2 YES ACTIVE 1.5122E+13 10-OCT-16
11 2 77115 1073741824 2 NO ACTIVE 1.5122E+13 10-OCT-16
12 2 77119 1073741824 2 NO CURRENT 1.5122E+13 11-OCT-16
到此,通过查询到的信息并结合告警日志,可以肯定:大量并发dml操作,引发归档暴增,耗光归档空间,使用完redo组,导致会话积压,进而集群hang住,应用系统响应缓慢。
通过观察1节点的告警日志,还发现2个实例的redo切换很频繁,1分钟一次:
节点1
Mon Oct 10 20:05:19 EAT 2016
Thread 1 cannot allocate new log, sequence 58490
All online logs needed archiving
Current log# 1 seq# 58489 mem# 0: +ORCLSYSDG01/orcl/onlinelog/group_1.262.819722231
Current log# 1 seq# 58489 mem# 1: +ORCLGDG02/orcl/onlinelog/group_1.264.819722239
Mon Oct 10 20:05:48 EAT 2016
Thread 1 advanced to log sequence 58490 (LGWR switch)
Current log# 5 seq# 58490 mem# 0: +ORCLGDG01/orcl/onlinelog/group_5.293.819723461
Current log# 5 seq# 58490 mem# 1: +ORCLGDG02/orcl/onlinelog/group_5.268.819723471
Mon Oct 10 20:06:18 EAT 2016
ORACLE Instance hnrmdb1 - Can not allocate log, archival required
Mon Oct 10 20:06:18 EAT 2016
Thread 1 cannot allocate new log, sequence 58491
All online logs needed archiving
Current log# 5 seq# 58490 mem# 0: +ORCLGDG01/orcl/onlinelog/group_5.293.819723461
Current log# 5 seq# 58490 mem# 1: +ORCLGDG02/orcl/onlinelog/group_5.268.819723471
Mon Oct 10 20:06:20 EAT 2016
Thread 1 advanced to log sequence 58491 (LGWR switch)
Current log# 6 seq# 58491 mem# 0: +ORCLGDG01/orcl/onlinelog/group_6.292.819723491
Current log# 6 seq# 58491 mem# 1: +ORCLGDG02/orcl/onlinelog/group_6.267.819723505
采取的措施是,清理归档空间并增加2个实例的redo组数。
增加redo组:
SQL> select * from v$logfile;
GROUP# STATUS TYPE MEMBER IS_
---------- ------- ------- ------------------------------------------------------------- ---
7 ONLINE +ORCLDG01/orcl/onlinelog/group_7.275.819721599 NO
7 ONLINE +ORCLDG02/orcl/onlinelog/group_7.270.819721609 NO
12 ONLINE +ORCLDG01/orcl/onlinelog/group_12.291.819723519 NO
12 ONLINE +ORCLDG02/orcl/onlinelog/group_12.266.819723541 NO
11 ONLINE +ORCLDG01/orcl/onlinelog/group_11.297.819722567 NO
11 ONLINE +ORCLDG02/orcl/onlinelog/group_11.272.819722583 NO
10 ONLINE +ORCLDG01/orcl/onlinelog/group_10.288.819722545 NO
10 ONLINE +ORCLDG02/orcl/onlinelog/group_10.261.819722555 NO
9 ONLINE +ORCLDG01/orcl/onlinelog/group_9.266.819722429 NO
9 ONLINE +ORCLDG02/orcl/onlinelog/group_9.262.819722437 NO
8 ONLINE +ORCLDG01/orcl/onlinelog/group_8.265.819721623 NO
8 ONLINE +ORCLDG02/orcl/onlinelog/group_8.269.819721631 NO
4 ONLINE +ORCLDG01/orcl/onlinelog/group_4.294.819723443 NO
4 ONLINE +ORCLDG02/orcl/onlinelog/group_4.265.819723451 NO
3 ONLINE +ORCLDG01/orcl/onlinelog/group_3.279.819722327 NO
3 ONLINE +ORCLDG02/orcl/onlinelog/group_3.271.819722335 NO
2 ONLINE +ORCLDG01/orcl/onlinelog/group_2.267.819722257 NO
2 ONLINE +ORCLDG02/orcl/onlinelog/group_2.263.819722265 NO
1 ONLINE +ORCLDG01/orcl/onlinelog/group_1.262.819722231 NO
1 ONLINE +ORCLDG02/orcl/onlinelog/group_1.264.819722239 NO
6 ONLINE +ORCLDG01/orcl/onlinelog/group_6.292.819723491 NO
6 ONLINE +ORCLDG02/orcl/onlinelog/group_6.267.819723505 NO
5 ONLINE +ORCLDG01/orcl/onlinelog/group_5.293.819723461 NO
5 ONLINE +ORCLDG02/orcl/onlinelog/group_5.268.819723471 NO
SQL> alter database add logfile '+ORCLDG03' size 1073741824;
Database altered.
SQL> alter database add logfile thread 2 '+ORCLDG03' size 1073741824;
Database altered.
SQL> select * from v$log;
GROUP# THREAD# SEQUENCE# BYTES MEMBERS ARC STATUS FIRST_CHANGE# FIRST_TIME
---------- ---------- ---------- ---------- ---------- --- ---------------- ------------- ------------
1 1 58682 1073741824 2 NO ACTIVE 1.5122E+13 11-OCT-16
2 1 58679 1073741824 2 NO ACTIVE 1.5122E+13 11-OCT-16
3 1 58680 1073741824 2 NO ACTIVE 1.5122E+13 11-OCT-16
4 1 58678 1073741824 2 YES INACTIVE 1.5122E+13 11-OCT-16
5 1 58677 1073741824 2 YES INACTIVE 1.5122E+13 11-OCT-16
6 1 58681 1073741824 2 NO ACTIVE 1.5122E+13 11-OCT-16
7 2 77128 1073741824 2 NO INACTIVE 1.5122E+13 11-OCT-16
8 2 77129 1073741824 2 NO INACTIVE 1.5122E+13 11-OCT-16
9 2 77130 1073741824 2 NO ACTIVE 1.5122E+13 11-OCT-16
10 2 77126 1073741824 2 NO INACTIVE 1.5122E+13 11-OCT-16
11 2 77127 1073741824 2 NO INACTIVE 1.5122E+13 11-OCT-16
12 2 77125 1073741824 2 NO INACTIVE 1.5122E+13 11-OCT-16
13 1 58683 1073741824 1 NO CURRENT 1.5122E+13 11-OCT-16
14 2 77131 1073741824 1 NO CURRENT 1.5122E+13 11-OCT-16
增加完redo,查看积压的会话已经有下降了:
SQL> select count(1) from v$session ;
COUNT(1)
----------
852
SQL> select count(1) from v$session where status='ACTIVE';
COUNT(1)
----------
525
SQL> /
COUNT(1)
----------
525
SQL> /
COUNT(1)
----------
523
清理归档空间:
orcl1:[/oracle/app/oracle/admin/orcl/bdump]$asmcmd
ASMCMD> lsdg
State Type Rebal Unbal Sector Block AU Total_MB Free_MB Req_mir_free_MB Usable_file_MB Offline_disks Name
MOUNTED NORMAL N N 1024 4096 1048576 1024000 678 102400 -50861 0 ORCLDG01/
MOUNTED NORMAL N N 1024 4096 1048576 1024000 1389 102400 -50505 0 ORCLDG02/
MOUNTED EXTERN N N 1024 4096 1048576 512000 98144 0 98144 0 ORCLDG03/
MOUNTED NORMAL N N 1024 4096 1048576 307200 9659 102400 -46370 0 ORCLSYSDG01/
MOUNTED NORMAL N N 1024 4096 1048576 204800 13248 0 6624 0 SFILEARCHDG01/
ASMCMD> cd SFILEARCHDG01
ASMCMD> ls
HNRMDB/
ASMCMD> cd ORCL
ASMCMD> ls
ARCHIVELOG/
ASMCMD> cd ARCHIVELOG
ASMCMD> ls
2016_10_10/
2016_10_11/
ASMCMD> rm -r 2016_10_10
You may delete multiple files and/or directories.
Are you sure? (y/n) y
ORA-15032: not all alterations performed
ORA-15173: entry 'thread_1_seq_58606.367.924907269' does not exist in directory '2016_10_10' (DBD ERROR: OCIStmtExecute)
ASMCMD> lsdg
State Type Rebal Unbal Sector Block AU Total_MB Free_MB Req_mir_free_MB Usable_file_MB Offline_disks Name
MOUNTED NORMAL N N 1024 4096 1048576 1024000 678 102400 -50861 0 ORCLDG01/
MOUNTED NORMAL N N 1024 4096 1048576 1024000 1389 102400 -50505 0 ORCLDG02/
MOUNTED EXTERN N N 1024 4096 1048576 512000 98144 0 98144 0 ORCLDG03/
MOUNTED NORMAL N N 1024 4096 1048576 307200 9659 102400 -46370 0 ORCLSYSDG01/
MOUNTED NORMAL N N 1024 4096 1048576 204800 114264 0 57132 0 SFILEARCHDG01/
ASMCMD> exit
清理完归档空间,数据库状态恢复正常:
SQL> select count(*)
2 from gv$session
3 where event='enq: TX - row lock contention'
4 and inst_id=1
5 and status='ACTIVE'
6 ;
COUNT(*)
----------
117
SQL> /
COUNT(*)
----------
117
SQL> /
COUNT(*)
----------
43
SQL> /
COUNT(*)
----------
7
SQL> /
COUNT(*)
----------
0
SQL> select count(*) from v$session where status='ACTIVE';
COUNT(*)
----------
68
SQL> select count(*)
2 from gv$session
3 where event='enq: TX - row lock contention'
4 and inst_id=1
5 and status='ACTIVE'
6 ;
COUNT(*)
----------
0
本次故障处理过程中,发现大量如下信息:
Wed Oct 12 20:19:41 EAT 2016
SUCCESS: diskgroup SFILEARCHDG01 was mounted
Wed Oct 12 20:19:46 EAT 2016
SUCCESS: diskgroup SFILEARCHDG01 was dismounted
经求证是10g ASM rac才会出现的,oracle采取的策略是在归档目录不适用的时候对其下线,在归档进程需要切换归档的时候,才会将归档的卷组上线。