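An old friend contacted me about a customer whose database had failed: ASM could not mount its disk group and reported two missing disks. He asked whether it could be recovered. The system sits on an internal network, so the analysis below is pieced together from the fragments of information he was able to send over.
ASM alert log errors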
ERROR: diskgroup DGROUP1 was not mounted
Fri Aug 12 16:03:12 EAT 2016
SQL> alter diskgroup DGROUP1 mount
Fri Aug 12 16:03:12 EAT 2016
NOTE: cache registered group DGROUP1 number=1 incarn=0xf6781b5c
Fri Aug 12 16:03:12 EAT 2016
NOTE: Hbeat: instance first (grp 1)
Fri Aug 12 16:03:16 EAT 2016
NOTE: start heartbeating (grp 1)
Fri Aug 12 16:03:16 EAT 2016
NOTE: cache dismounting group 1/0xF6781B5C (DGROUP1)
NOTE: dbwr not being msg'd to dismount
ERROR: diskgroup DGROUP1 was not mounted
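Mounting the ASM disk group manually from the foreground fails with ORA-15042
The ORA-15042 error makes it clear that the disk group cannot be mounted because ASM disks 15 and 16 are missing. The best way to recover ASM is to find those two disks. We analyzed the disks that are still visible with kfed and found that ASM disk 14 corresponds to disk160 and ASM disk 17 corresponds to disk163, so the first suspicion was that disk161 and disk162 were the faulty ones. The data center checked the hardware and found no alerts at all.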
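The kfed check above can be partly scripted. The following is a minimal sketch, assuming that `kfed read <device>` prints the disk header one field per line and that the ASM disk number appears in the `kfdhdb.dsknum` field; the helper name `asm_disk_number` is hypothetical, and the exact output layout may differ between Oracle versions.

```shell
# Extract the ASM disk number from `kfed read <device>` output on stdin.
# Assumed line layout: "kfdhdb.dsknum:   14 ; 0x024: 0x000e"
asm_disk_number() {
    awk -F'[:;]' '/^kfdhdb\.dsknum/ { gsub(/[ \t]/, "", $2); print $2; exit }'
}

# Usage on the affected host (device name taken from the text above):
#   kfed read /dev/rdisk/disk160 | asm_disk_number
```

Running this over every raw device that is still visible gives the ASM-disk-to-OS-disk mapping, and the gaps in that mapping point at the missing disks.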
OS-level analysis
(Records unrelated to the conclusion are omitted.)
ls -l /dev/rdisk
crw-rw---- 1 oracle dba 13 0x000070 Jan 1 2016 disk160
crw-rw---- 1 oracle dba 13 0x000073 Jan 1 2016 disk163
ls -l /dev/disk
brw-r----- 1 bin sys 1 0x000070 Jan 13 2015 disk160
brw-r----- 1 bin sys 1 0x000071 Jan 13 2015 disk161
brw-r----- 1 bin sys 1 0x000072 Jan 13 2015 disk162
brw-r----- 1 bin sys 1 0x000073 Jan 13 2015 disk163
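The two listings can also be compared mechanically instead of by eye. This is a minimal sketch, assuming the block and raw device files share the same base name (as disk160 and disk163 do above); the function name `missing_raw_devices` is hypothetical.

```shell
# Print every diskNNN name that exists in the block-device directory
# but has no counterpart in the raw (character) device directory.
missing_raw_devices() {
    block_dir=$1
    raw_dir=$2
    for path in "$block_dir"/disk*; do
        [ -e "$path" ] || continue          # the glob matched nothing
        name=${path##*/}                    # strip the directory part
        [ -e "$raw_dir/$name" ] || printf '%s\n' "$name"
    done
}

# Usage on the affected host:
#   missing_raw_devices /dev/disk /dev/rdisk
```

On a host in the state shown above, the names printed would be the block devices that have lost their raw counterpart.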
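Comparing the two listings shows that on HP-UX all four disks are present under /dev/disk, but disk161 and disk162 are missing under /dev/rdisk. We continue the analysis with ioscan.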
ioscan -fNnkC disk
disk 160 64000/0xfa00/0x70 esdisk CLAIMED DEVICE HP OPEN-V
/dev/disk/disk160 /dev/rdisk/disk160
disk 161 64000/0xfa00/0x71 esdisk CLAIMED DEVICE HP OPEN-V
/dev/disk/disk161
disk 162 64000/0xfa00/0x72 esdisk CLAIMED DEVICE HP OPEN-V
/dev/disk/disk162
disk 163 64000/0xfa00/0x73 esdisk CLAIMED DEVICE HP OPEN-V
/dev/disk/disk163 /dev/rdisk/disk163
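The same check can be run against the ioscan output, which has the advantage of also giving the hardware path of each affected disk. A minimal sketch, assuming the output layout shown above (a line starting with `disk`, followed by one or more lines listing device files); the function name `disks_without_raw_dsf` is hypothetical.

```shell
# Read `ioscan -fNnkC disk` style output on stdin and print the instance
# number and hardware path of every disk that lists no /dev/rdisk file.
disks_without_raw_dsf() {
    awk '
        function report() {
            if (inst != "" && !has_raw) print inst, hwpath
        }
        # A new disk record starts: report the previous one, then reset.
        $1 == "disk" { report(); inst = $2; hwpath = $3; has_raw = 0; next }
        # Any device-file line mentioning /dev/rdisk/ marks the disk as fine.
        /\/dev\/rdisk\// { has_raw = 1 }
        END { report() }
    '
}

# Usage on the affected host:
#   ioscan -fNnkC disk | disks_without_raw_dsf
```

The hardware paths it prints are the values that a later repair step needs.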
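At this point we can be fairly sure that the device files under /dev/rdisk have gone missing. To dig further: the rdisk names are the aggregated (multipath) device files, so we next check whether the per-path device files that they aggregate are intact.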
ioscan -m dsf
/dev/rdisk/disk160 /dev/rdsk/c29t12d4
/dev/rdsk/c28t12d4
/dev/rdisk/disk163 /dev/rdsk/c29t12d7
/dev/rdsk/c28t12d7
ls -l /dev/rdsk
crw-r----- 1 bin sys 188 0x1dc000 Apr 22 2014 c29t12d0
crw-r----- 1 bin sys 188 0x1dc100 Apr 22 2014 c29t12d1
crw-r----- 1 bin sys 188 0x1dc300 Jan 13 2015 c29t12d3
crw-r----- 1 bin sys 188 0x1dc400 Jan 13 2015 c29t12d4
crw-r----- 1 bin sys 188 0x1dc500 Jan 13 2015 c29t12d5
crw-r----- 1 bin sys 188 0x1dc600 Jan 13 2015 c29t12d6
crw-r----- 1 bin sys 188 0x1dc700 Jan 13 2015 c29t12d7
crw-r----- 1 bin sys 188 0x1cc100 Apr 22 2014 c28t12d1
crw-r----- 1 bin sys 188 0x1cc300 Jan 13 2015 c28t12d3
crw-r----- 1 bin sys 188 0x1cc400 Jan 13 2015 c28t12d4
crw-r----- 1 bin sys 188 0x1cc500 Jan 13 2015 c28t12d5
crw-r----- 1 bin sys 188 0x1cc600 Jan 13 2015 c28t12d6
crw-r----- 1 bin sys 188 0x1cc700 Jan 13 2015 c28t12d7
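Since disk160 maps to the d4 paths and disk163 maps to the d7 paths, we can reasonably infer that /dev/rdsk/c28t12d5, /dev/rdsk/c28t12d6, /dev/rdsk/c29t12d5 and /dev/rdsk/c29t12d6 are the pre-aggregation device files behind /dev/rdisk/disk161 and disk162. In other words, our judgment is that only the character devices under /dev/rdisk are affected; everything else is normal.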
Repairing the problem with system commands
insf -e -H 64000/0xfa00/0x71
insf -e -H 64000/0xfa00/0x72
[Screenshot: hp-asm-disk]
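/dev/rdisk/disk161 and /dev/rdisk/disk162 are now visible again, so as a first assessment the device files are back to normal at the OS level. Next, fix the permissions and ownership of the disks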
chmod 660 /dev/rdisk/disk161
chmod 660 /dev/rdisk/disk162
chown oracle:dba /dev/rdisk/disk161
chown oracle:dba /dev/rdisk/disk162
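Before starting ASM it is worth confirming that each recreated device really is a character device with the expected mode and ownership. A minimal sketch of such a check; the function name `check_raw_device` is hypothetical, and the expected values (crw-rw----, oracle, dba) are the ones shown for disk160 and disk163 earlier in this post.

```shell
# Verify that a device file is a character device with the expected
# mode string, owner and group. Prints a one-line verdict and returns
# non-zero on any mismatch.
check_raw_device() {
    dev=$1
    want_mode=$2
    want_owner=$3
    want_group=$4
    if [ ! -c "$dev" ]; then
        echo "${dev}: missing or not a character device"
        return 1
    fi
    # ls -l fields: $1 = mode, $2 = link count, $3 = owner, $4 = group
    set -- $(ls -lL "$dev")
    case $1 in
        "$want_mode"*) ;;   # tolerate a trailing ACL marker such as "+"
        *) echo "${dev}: mode $1, expected ${want_mode}"; return 1 ;;
    esac
    if [ "$3" != "$want_owner" ] || [ "$4" != "$want_group" ]; then
        echo "${dev}: owner $3:$4, expected ${want_owner}:${want_group}"
        return 1
    fi
    echo "${dev}: ok"
}

# Usage on the affected host:
#   check_raw_device /dev/rdisk/disk161 crw-rw---- oracle dba
#   check_raw_device /dev/rdisk/disk162 crw-rw---- oracle dba
```

If either call reports a mismatch, the chmod and chown steps above should be repeated before the disk group mount is attempted.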
Start ASM normally, mount the disk group, and open the database
[Screenshot: asm-mount]
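This recovery came down to diagnosing and fixing the problem at the operating-system level, which brought the database back completely with zero data loss. There are similar recovery cases.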