At around 6:00 PM on May 27, 2015, node 1 of the e-channel (电渠) 3g2 ORACLE RAC database began reporting large numbers of direct path write wait events. Sessions piled up until node 1 was almost unusable and the application was affected. The handling process was as follows:
Environment:
Operating system: HP-UX
Database version: 10.2.0.5
Fault description: At around 9:00 AM on the 27th an application developer ran an UPDATE statement and found that it had not completed after a long time. The session was located in the database and killed, but the corresponding process could still be seen at the OS level, and an attempt to kill it with kill -9 also failed. From then until about 6:00 PM sessions kept piling up, until the database server could no longer bear the backlog (24,800 sessions). At around 7:00 PM, after discussion with the business side, it was decided to restart the database. shutdown immediate would not complete, so shutdown abort was used, but the instance then failed on startup because many of the UPDATE-related processes were still hanging in the background, so the host server was rebooted. Node 1 came back normally, but node 2 then began showing the same symptoms as node 1. To handle the fault properly and avoid repeated reboots that would not resolve it, node 2 was left as-is to preserve the scene until troubleshooting was complete.
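For reference, the usual way to tie a hung database session to its server process before resorting to OS-level kills is sketched below; the &sid / &serial# values are placeholders, not figures from this incident:

select s.sid, s.serial#, s.username, s.status, p.spid as os_pid
  from v$session s, v$process p
 where s.paddr = p.addr
   and s.sid = &sid;                        -- placeholder session id

alter system kill session '&sid,&serial#' immediate;   -- kill inside Oracle first
-- only if the server process survives is kill -9 <os_pid> attempted at the OS level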
Fault diagnosis then proceeded as follows:
Querying the sample history in dba_hist_active_sess_history shows that sessions began piling up at around 10:20 AM:
select sample_time,count(*) from dba_hist_active_sess_history where dbid=3965601902 and
snap_id>49716 and snap_id
and sample_time>to_date('20150527 10:18:00','yyyymmdd hh24:mi:ss') group by sample_time
order by sample_time;
SAMPLE_TIME COUNT(*)
---------------------------------------- ----------
27-MAY-15 10.18.04.924 AM 4
27-MAY-15 10.18.12.694 AM 3
27-MAY-15 10.18.15.024 AM 3
27-MAY-15 10.18.22.794 AM 1
27-MAY-15 10.18.25.124 AM 4
27-MAY-15 10.18.35.224 AM 3
27-MAY-15 10.18.45.324 AM 1
27-MAY-15 10.18.55.434 AM 2
27-MAY-15 10.19.03.204 AM 1
27-MAY-15 10.19.13.304 AM 1
27-MAY-15 10.19.23.404 AM 2
27-MAY-15 10.19.35.834 AM 2
27-MAY-15 10.19.43.614 AM 1
27-MAY-15 10.19.45.934 AM 1
27-MAY-15 10.19.53.714 AM 2
27-MAY-15 10.19.56.044 AM 2
27-MAY-15 10.20.03.814 AM 1
27-MAY-15 10.20.06.144 AM 1
27-MAY-15 10.20.13.914 AM 1
27-MAY-15 10.20.16.244 AM 1
27-MAY-15 10.20.24.014 AM 3
27-MAY-15 10.20.26.343 AM 2
27-MAY-15 10.20.34.114 AM 1
27-MAY-15 10.20.36.443 AM 3
27-MAY-15 10.20.44.224 AM 1
27-MAY-15 10.20.46.553 AM 1
27-MAY-15 10.20.54.324 AM 9
27-MAY-15 10.21.04.424 AM 13
27-MAY-15 10.21.06.753 AM 7
27-MAY-15 10.21.14.525 AM 21
27-MAY-15 10.21.16.853 AM 17
27-MAY-15 10.21.24.624 AM 32
27-MAY-15 10.21.26.953 AM 1
27-MAY-15 10.21.34.814 AM 50
27-MAY-15 10.21.37.053 AM 2
27-MAY-15 10.21.44.914 AM 49
27-MAY-15 10.21.47.163 AM 2
27-MAY-15 10.21.55.014 AM 61
27-MAY-15 10.21.57.263 AM 2
27-MAY-15 10.22.05.116 AM 69
27-MAY-15 10.22.07.363 AM 12
27-MAY-15 10.22.15.216 AM 77
27-MAY-15 10.22.17.463 AM 12
27-MAY-15 10.22.25.316 AM 87
27-MAY-15 10.22.27.562 AM 13
27-MAY-15 10.22.35.426 AM 112
27-MAY-15 10.22.37.662 AM 47
27-MAY-15 10.22.45.526 AM 121
27-MAY-15 10.22.47.772 AM 7
27-MAY-15 10.22.55.626 AM 129
27-MAY-15 10.22.57.872 AM 10
The table above shows that the number of active sessions captured in each sample (dba_hist_active_sess_history retains roughly one ASH sample every 10 seconds per instance) starts to climb sharply from about 10:21 AM. Breaking the same sample history down by wait event gives the following:
select sample_time,event,count(*) from dba_hist_active_sess_history where dbid=3965601902 and
snap_id>49716 and snap_id
and sample_time>to_date('20150527 10:20:00','yyyymmdd hh24:mi:ss') group by sample_time,event
order by sample_time;
SAMPLE_TIME EVENT COUNT(*)
---------------------------------------- ---------------------------------------- ----------
27-MAY-15 10.20.03.814 AM 1
27-MAY-15 10.20.06.144 AM db file sequential read 1
27-MAY-15 10.20.13.914 AM 1
27-MAY-15 10.20.16.244 AM 1
27-MAY-15 10.20.24.014 AM db file sequential read 1
27-MAY-15 10.20.24.014 AM 2
27-MAY-15 10.20.26.343 AM 2
27-MAY-15 10.20.34.114 AM db file sequential read 1
27-MAY-15 10.20.36.443 AM 3
27-MAY-15 10.20.44.224 AM 1
27-MAY-15 10.20.46.553 AM 1
27-MAY-15 10.20.54.324 AM db file sequential read 1
27-MAY-15 10.20.54.324 AM direct path write 8
27-MAY-15 10.21.04.424 AM db file sequential read 1
27-MAY-15 10.21.04.424 AM direct path write 10
27-MAY-15 10.21.04.424 AM enq: TX - row lock contention 1
27-MAY-15 10.21.04.424 AM 1
27-MAY-15 10.21.06.753 AM db file sequential read 1
27-MAY-15 10.21.06.753 AM 6
27-MAY-15 10.21.14.525 AM db file sequential read 1
27-MAY-15 10.21.14.525 AM direct path write 17
27-MAY-15 10.21.14.525 AM enq: TX - row lock contention 2
27-MAY-15 10.21.14.525 AM 1
27-MAY-15 10.21.16.853 AM gc current block 2-way 1
27-MAY-15 10.21.16.853 AM gc current grant 2-way 1
27-MAY-15 10.21.16.853 AM kjbdrmcvtq lmon drm quiesce: ping completion 1
27-MAY-15 10.21.16.853 AM log file sync 8
27-MAY-15 10.21.16.853 AM wait for scn ack 1
27-MAY-15 10.21.16.853 AM 5
27-MAY-15 10.21.24.624 AM db file sequential read 1
27-MAY-15 10.21.24.624 AM direct path write 26
27-MAY-15 10.21.24.624 AM enq: TX - row lock contention 3
27-MAY-15 10.21.24.624 AM gcs drm freeze in enter server mode 1
27-MAY-15 10.21.24.624 AM 1
27-MAY-15 10.21.26.953 AM 1
27-MAY-15 10.21.34.814 AM db file sequential read 1
27-MAY-15 10.21.34.814 AM direct path write 33
27-MAY-15 10.21.34.814 AM enq: TX - row lock contention 4
27-MAY-15 10.21.34.814 AM gcs drm freeze in enter server mode 1
27-MAY-15 10.21.34.814 AM row cache lock 10
27-MAY-15 10.21.34.814 AM 1
27-MAY-15 10.21.37.053 AM db file sequential read 1
27-MAY-15 10.21.37.053 AM 1
27-MAY-15 10.21.44.914 AM db file sequential read 1
27-MAY-15 10.21.44.914 AM direct path write 42
27-MAY-15 10.21.44.914 AM enq: TX - row lock contention 4
27-MAY-15 10.21.44.914 AM enq: US - contention 1
27-MAY-15 10.21.44.914 AM gcs drm freeze in enter server mode 1
27-MAY-15 10.21.47.163 AM log file parallel write 1
27-MAY-15 10.21.47.163 AM 1
27-MAY-15 10.21.55.014 AM db file sequential read 1
27-MAY-15 10.21.55.014 AM direct path write 42
27-MAY-15 10.21.55.014 AM enq: TX - row lock contention 5
27-MAY-15 10.21.55.014 AM gc current block 2-way 1
27-MAY-15 10.21.55.014 AM gc current grant 2-way 2
27-MAY-15 10.21.55.014 AM kjbdrmcvtq lmon drm quiesce: ping completion 1
27-MAY-15 10.21.55.014 AM log file sync 3
27-MAY-15 10.21.55.014 AM wait for scn ack 1
27-MAY-15 10.21.55.014 AM 5
27-MAY-15 10.21.57.263 AM db file parallel write 1
27-MAY-15 10.21.57.263 AM 1
As the table shows, direct path write waits appear from about 10:20 AM and are quickly joined by enq: TX - row lock contention; the direct path write waits keep growing and drive the session backlog.
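For direct path write the wait parameters identify the file and blocks being written, which can be confirmed from v$event_name (a generic lookup, not part of the original notes):

select name, parameter1, parameter2, parameter3
  from v$event_name
 where name = 'direct path write';
-- expected: parameter1 = file number, parameter2 = first dba, parameter3 = block cnt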
Next, break down the direct path write waits by instance and by P1 (the file number) to see where the waits are concentrated:
select instance_number,p1,count(*) from dba_hist_active_sess_history where dbid=3965601902 and
snap_id>49716 and snap_id
and event='direct path write'
and sample_time>to_date('20150527 10:20:00','yyyymmdd hh24:mi:ss')
group by instance_number,p1 order by 3 ;
INSTANCE_NUMBER P1 COUNT(*)
--------------- ---------- ----------
1 461 1
1 494 1
1 372 1
1 495 1
2 581 1
2 285 1
1 479 1
2 495 1
2 317 1
1 587 1
1 543 1
2 316 1
2 438 1
2 498 1
2 497 1
2 499 2
1 520 2
2 437 2
1 466 2
2 496 2
1 556 2
1 439 2
1 438 2
1 316 2
1 498 2
2 439 3
1 474 3
1 436 3
1 519 3
1 496 4
1 475 4
1 497 4
2 475 4
2 494 4
2 436 4
1 499 5
1 437 6
2 474 7
1 589 59128
2 588 84346
The table above shows that both RAC nodes are waiting on direct path write, and that the vast majority of the waits are against files 588 and 589. Next, map file ids 588 and 589 to their physical devices:
SQL> select file_name,tablespace_name from dba_data_files where file_id=&file;
Enter value for file: 588
old 1: select file_name,tablespace_name from dba_data_files where file_id=&file
new 1: select file_name,tablespace_name from dba_data_files where file_id=588
FILE_NAME TABLESPACE_NAME
--------------------------------- ------------------------------
/dev/essdb2vg5/rdb2vg5_10_lv001 TBS_TRADE_D10
SQL> /
Enter value for file: 589
old 1: select file_name,tablespace_name from dba_data_files where file_id=&file
new 1: select file_name,tablespace_name from dba_data_files where file_id=589
FILE_NAME TABLESPACE_NAME
--------------------------------- ------------------------------
/dev/essdb2vg5/rdb2vg5_10_lv002 TBS_TRADE_D09
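As a side check (not recorded in the original notes), the segments stored in the two affected files can be listed from dba_extents; this is what ties the files to the photo-column partitions mentioned below:

select distinct owner, segment_name, partition_name, segment_type
  from dba_extents
 where file_id in (588, 589);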
The system maintenance owner confirmed that these two raw devices sit on the four disks added the previous evening: the disks were first paired into two RAID 0 sets, the two RAID 0 sets were then mirrored into a RAID 1, and the resulting array was carved into forty 10 GB raw devices that were used to extend the database tablespaces. The on-site hardware engineer reported that two of the four disks that went in were old disks, and the storage engineer then confirmed that the newly added RAID 1 group was locked. With the RAID group locked, the underlying disks could not be written, which is why the database ran into direct path write waits and the sessions piled up.
Follow-up handling: because this is an e-commerce platform and the business is highly sensitive about its data, a backup was considered first. The two problematic raw devices belong to two different tablespaces, which hold the photo-column partitions of identically named partitioned tables under two different schemas. The first step was to test whether the data could still be read, using the statement below; the read failed, so every form of backup (exp, expdp, or rman) was ruled out. The business required service to be restored by 9:00 AM on the 28th, and after repeated attempts it was already past 10:00 PM on the 27th; a full rman restore would have taken an unpredictable amount of time, so recovery of the affected datafiles from archived logs was chosen instead.
select /*+ parallel(t 8)*/* from "UCR_TRADE_03"."TF_F_IDCARD" partition(PAR_TF_B_IDCARD_VERIFY_8) t;
Archive-based recovery procedure: the storage engineer was asked to prepare two raw devices of the same size as the failed ones, with ownership and permissions set for the database user. In the database the two problematic datafiles were taken offline and redirected (re-created) onto the new raw devices; the archived logs generated on node 2 since the tablespaces were extended at about 6:00 that morning were then copied to node 1, and the datafiles were recovered.
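To work out which archived logs from node 2 (thread 2) had to be copied over, a query of the following form can be used (a sketch that assumes the 6:00 AM cut-over time described above):

select thread#, sequence#, name, first_time
  from v$archived_log
 where thread# = 2
   and first_time >= to_date('20150527 06:00:00','yyyymmdd hh24:mi:ss')
 order by sequence#;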
The new raw devices are as follows:
/dev/essdb2vg7/rdb2vg7_10_lv001
/dev/essdb2vg7/rdb2vg7_10_lv002
Take the two affected datafiles offline, then redirect them to the new raw devices.
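The offline commands themselves were not captured in the original notes; a minimal sketch of that step, assuming the standard syntax and an archivelog-mode database:

SQL> alter database datafile 588 offline;
SQL> alter database datafile 589 offline;

Then re-create the datafiles on the new raw devices: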
SQL> alter database create datafile 588 as '/dev/essdb2vg7/rdb2vg7_10_lv001';
SQL> alter database create datafile 589 as '/dev/essdb2vg7/rdb2vg7_10_lv002';
Recover the datafiles from the archived logs:
SQL> recover datafile '/dev/essdb2vg7/rdb2vg7_10_lv001';
SQL> recover datafile '/dev/essdb2vg7/rdb2vg7_10_lv002';
After recovery completes, bring the datafiles back online:
SQL> alter database datafile '/dev/essdb2vg7/rdb2vg7_10_lv001' online;
SQL> alter database datafile '/dev/essdb2vg7/rdb2vg7_10_lv002' online;
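A quick verification (not recorded in the original notes) is to confirm that the two files are back online and that nothing still needs recovery:

SQL> select file#, name, status from v$datafile where file# in (588, 589);
SQL> select * from v$recover_file;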
If necessary, fall back to the failure-point configuration:
Shut down the database:
shutdown immediate;
startup mount;
alter database rename file '/dev/essdb2vg5/rdb2vg5_10_lv038' to '/dev/essdb2vg5/rdb2vg5_10_lv001';
alter database rename file '/dev/essdb2vg5/rdb2vg5_10_lv039' to '/dev/essdb2vg5/rdb2vg5_10_lv002';
alter database open;
Fortunately, the archive-based recovery of the datafiles succeeded.