PMON failed to acquire latch, see PMON dump

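A few days ago, monitoring on one of our Oracle databases (Oracle Database 10g Release 10.2.0.4.0 - 64bit Production) reported the error "PMON failed to acquire latch, see PMON dump", and connections to the database failed briefly. The relevant entries in the alert log were: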

Tue Dec 20 09:13:16 2016
PMON failed to acquire latch, see PMON dump
Tue Dec 20 09:14:16 2016
PMON failed to acquire latch, see PMON dump
Tue Dec 20 09:15:55 2016
PMON failed to acquire latch, see PMON dump
Tue Dec 20 09:17:15 2016
PMON failed to acquire latch, see PMON dump
Tue Dec 20 09:17:24 2016
WARNING: inbound connection timed out (ORA-3136)
Tue Dec 20 09:18:23 2016
PMON failed to acquire latch, see PMON dump
Tue Dec 20 09:19:24 2016
PMON failed to acquire latch, see PMON dump
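
To see how often and over what period the warning was logged, an alert log excerpt like the one above can be scanned with a small script. The following is only a sketch: the function name `pmon_latch_warnings` is made up for illustration, and it assumes the 10g alert log layout shown here, where each timestamp sits on its own line ahead of the message.

```python
from datetime import datetime

PMON_MSG = "PMON failed to acquire latch, see PMON dump"
TS_FORMAT = "%a %b %d %H:%M:%S %Y"  # e.g. "Tue Dec 20 09:13:16 2016"


def pmon_latch_warnings(alert_lines):
    """Return the timestamps at which the PMON latch warning was logged."""
    hits = []
    current_ts = None
    for raw in alert_lines:
        line = raw.strip()
        try:
            # Assumption: the timestamp is on its own line before the message.
            current_ts = datetime.strptime(line, TS_FORMAT)
            continue
        except ValueError:
            pass  # not a timestamp line, treat it as a message line
        if line == PMON_MSG and current_ts is not None:
            hits.append(current_ts)
    return hits


# A few lines copied from the alert log excerpt above.
sample = """\
Tue Dec 20 09:13:16 2016
PMON failed to acquire latch, see PMON dump
Tue Dec 20 09:14:16 2016
PMON failed to acquire latch, see PMON dump
Tue Dec 20 09:17:24 2016
WARNING: inbound connection timed out (ORA-3136)
""".splitlines()

hits = pmon_latch_warnings(sample)
print(len(hits), "warnings between", hits[0], "and", hits[-1])
```

Run against the full alert log, the first and last timestamps give the window to use for the AWR and ASH reports.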

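The trace file that was generated, epps_pmon_4988.trc, contains more detail. It shows that the PMON process could not acquire the 'Child shared pool' latch because it was held by the process with pid = 19 ospid=5022, and that process (ospid 5022) is a Dispatcher process. The process state dump shows the same process also holding a 'Child library cache' latch.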

*** 2016-12-20 09:14:16.575
PMON unable to acquire latch  600edfa8 Child shared pool level=7 child#=1 
        Location from where latch is held: kghfrunp: alloc: session dur: 
        Context saved from call: 0
        state=busy, wlstate=free
    waiters [orapid (seconds since: put on list, posted, alive check)]:
     33 (3, 1482196555, 3)
     10 (3, 1482196555, 3)
     25 (3, 1482196555, 3)
     13 (3, 1482196555, 3)
     waiter count=4
    gotten 861091119 times wait, failed first 7114074 sleeps 1392223
    gotten 0 times nowait, failed: 0
  possible holder pid = 19 ospid=5022
----------------------------------------
SO: 0x40979aec8, type: 2, owner: (nil), flag: INIT/-/-/0x00
  (process) Oracle pid=19, calls cur/top: (nil)/0x409c92608, flag: (80) DISPATCHER
            int error: 0, call error: 0, sess error: 0, txn error 0
  (post info) last post received: 0 0 236
              last post received-location: kmcpdp
              last process to post me: 4097a64a0 106 64
              last post sent: 0 0 229
              last post sent-location: kmcmbf: not KMCVCFTOS
              last process posted by me: 4097a64a0 106 64
  (latch info) wait_event=0 bits=a0
    holding    (efd=4) 600edfa8 Child shared pool level=7 child#=1 
        Location from where latch is held: kghfrunp: alloc: session dur: 
        Context saved from call: 0
        state=busy, wlstate=free
        waiters [orapid (seconds since: put on list, posted, alive check)]:
         33 (3, 1482196555, 3)
         10 (3, 1482196555, 3)
         25 (3, 1482196555, 3)
         13 (3, 1482196555, 3)
         waiter count=4
    holding    (efd=4) 3fff78210 Child library cache level=5 child#=2 
        Location from where latch is held: kghfrunp: clatch: wait: 
        Context saved from call: 0
        state=busy, wlstate=free
        waiters [orapid (seconds since: put on list, posted, alive check)]:
         15 (3, 1482196555, 3)
         17 (3, 1482196555, 3)
         12 (3, 1482196555, 3)
         waiter count=3
    Process Group: DEFAULT, pseudo proc: 0x4098bc190
    O/S info: user: oracle, term: UNKNOWN, ospid: 5022 
    OSD pid info: Unix process pid: 5022, image: oracle@xx.xxx.xxx.com (D007)
    Short stack dump: 
ksdxfstk()+32<-ksdxcb()+1547<-sspuser()+111<-__restore_rt()+0<-kghfrunp()+1506<-kghfnd()+1389<-kghalo()+587<-kmnsbm()+578<-nsb
al()+428<-nsbalc()+123<-nsdo()+17278<-nsopen()+2315<-nsanswer()+512<-kmnans()+37<-kmdahd()+385<-kmdmai()+5220<-kmmrdp()+564<-o
pirip()+1193<-opidrv()+582<-sou2o()+114<-opimai_real()+317<-main()+116<-__libc_start_main()+244<-_start()+41
Dump of memory from 0x0000000409747C68 to 0x0000000409747E70
409747C60                   00000001 00000000          [........]
409747C70 FE9BEE10 00000003 0000003A 0003129B  [........:.......]
409747C80 FEA7D5D0 00000003 0000003A 0003129B  [........:.......]
409747C90 FE9DAD30 00000003 0000003A 0003129B  [0.......:.......]
        Repeat 2 times
409747CC0 FEAB01F0 00000003 0000003A 0003129B  [........:.......]
409747CD0 FE9DAD30 00000003 0000003A 0003129B  [0.......:.......]
409747CE0 FEA44E70 00000003 0000003A 0003129B  [pN......:.......]
409747CF0 FEAA6FF0 00000003 0000003A 0003129B  [.o......:.......]
409747D00 FEAB8AD0 00000003 0000003A 0003129B  [........:.......]
409747D10 FEA14FF0 00000003 0000003A 0003129B  [.O......:.......]
409747D20 FE9A77F0 00000003 0000003A 0003129B  [.w......:.......]
        Repeat 1 times
409747D40 FEA3CEB0 00000003 0000003A 0003129B  [........:.......]
        Repeat 1 times
409747D60 FE9C64B0 00000003 0000003A 0003129B  [.d......:.......]
        Repeat 1 times
409747D80 FEA062B0 00000003 0000003A 0003129B  [.b......:.......]
        Repeat 3 times
409747DC0 FEAA6FF0 00000003 0000003A 0003129B  [.o......:.......]
409747DD0 FEA8F9D0 00000003 0000003A 0003129B  [........:.......]
409747DE0 FE9F7570 00000003 0000003A 0003129B  [pu......:.......]
409747DF0 FEA91530 00000003 0000003A 0003129B  [0.......:.......]
409747E00 FE9BEE10 00000003 0000003A 0003129B  [........:.......]
409747E10 FE9BB750 00000003 0000003A 0003129B  [P.......:.......]
409747E20 FEA90C10 00000003 0000003A 0003129B  [........:.......]
409747E30 FEA8B9F0 00000003 0000003A 0003129B  [........:.......]
409747E40 FE9C5270 00000003 0000003A 0003129B  [pR......:.......]
409747E50 FEAE12B0 00000003 0000003A 0003129B  [........:.......]
409747E60 FE9C5270 00000003 0000003A 0003129B  [pR......:.......]
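
The key facts in a PMON dump like this one (which latch is blocked, who is waiting, and which process is the suspected holder) can be pulled out mechanically. The sketch below is an illustration only: the helper name `summarize_pmon_dump` and the regular expressions are my own, written against the line layout visible in this particular trace, and other versions may format these lines differently.

```python
import re

# Patterns written against the trace excerpt above (assumed layout).
LATCH_RE = re.compile(
    r"PMON unable to acquire latch\s+(?P<addr>[0-9a-f]+)\s+(?P<name>.+?)\s+"
    r"level=(?P<level>\d+)\s+child#=(?P<child>\d+)"
)
HOLDER_RE = re.compile(r"possible holder pid = (?P<pid>\d+) ospid=(?P<ospid>\d+)")
WAITER_RE = re.compile(r"^\s*(\d+) \(\d+, \d+, \d+\)\s*$")


def summarize_pmon_dump(trace_lines):
    """Collect the blocked latch, its waiters and the suspected holder."""
    summary = {"latch": None, "waiters": [], "holder": None}
    for line in trace_lines:
        m = LATCH_RE.search(line)
        if m:
            summary["latch"] = m.groupdict()
            continue
        m = HOLDER_RE.search(line)
        if m:
            summary["holder"] = {k: int(v) for k, v in m.groupdict().items()}
            break  # the process state dump that follows repeats the waiters
        m = WAITER_RE.match(line)
        if m and summary["latch"] is not None:
            summary["waiters"].append(int(m.group(1)))
    return summary


# Lines copied from the top of the trace excerpt above.
sample = """\
PMON unable to acquire latch  600edfa8 Child shared pool level=7 child#=1 
        state=busy, wlstate=free
    waiters [orapid (seconds since: put on list, posted, alive check)]:
     33 (3, 1482196555, 3)
     10 (3, 1482196555, 3)
     25 (3, 1482196555, 3)
     13 (3, 1482196555, 3)
     waiter count=4
  possible holder pid = 19 ospid=5022
""".splitlines()

summary = summarize_pmon_dump(sample)
print(summary["latch"]["name"], summary["waiters"], summary["holder"])
```

The `ospid` value is the one to look up at the operating system level or in the `(process)` section of the dump to identify what kind of process is holding the latch.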

 

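Because the problem was not noticed while it was happening, no HangAnalyze traces were collected, so digging deeper for the root cause is now difficult. A snapshot was taken manually at 9:26, so the 9:00 ~ 9:26 snapshot interval covers the period in which the problem occurred. The AWR report for that interval shows latch: library cache and latch: shared pool as the top wait events.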

 

 

During the problem period, the database server itself was fairly idle.

 

I also generated an ASH report for 20-Dec-16 09:11:16 to 20-Dec-16 09:21:16. It likewise shows latch: library cache and latch: shared pool as the top wait events, but Avg Active Sessions is very low.

 

This made a bug look like the most likely cause, so I searched Oracle MetaLink for related bugs and found the following material:

 

Bug 7039896 Spin under kghquiesce_regular_extent holding shared pool latch with AMM

Bug 6488694 - DATABASE HUNG WITH PMON FAILED TO ACQUIRE LATCH MESSAGE

Note 7039896.8 - Bug 7039896 - Spin under kghquiesce_regular_extent holding shared pool latch with AMM

"Pmon Failed To Acquire Latch" Messages in Alert Log - Database Hung (Doc ID 468740.1)

 

 

Hang (Involving Shared Resource)

A process may hold a shared resource a lot longer than normally expected leading to many other processes having to wait for that resource. Such a resource could be a lock, a library cache pin, a latch etc.. The overall symptom is that typically numerous processes all appear to be stuck, although some processes may continue unhindered if they do not need the blocked resource.

Hang (Process Spins)

A process enters a tight CPU loop so appears to hang but is actually consuming CPU.

Latch Contention

This issue can result in latch contention within the database.

Waits for "latch: shared pool"

 

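Our database is Oracle Database 10g Release 10.2.0.4.0 - 64bit Production, so it is affected by Bug 7039896, and the symptoms match well. One difference is that the MMAN process is not involved, and V$SGA_RESIZE_OPS shows no component growing or shrinking during that period. The symptoms also closely resemble another related bug, but again the trace file shows nothing pointing to the MMAN process. This is the first time the problem has occurred, and it has not recurred in the days since, which makes me more convinced that a bug is the cause. We will of course need to schedule time to apply the patch for Bug 7039896.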
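
The V$SGA_RESIZE_OPS check mentioned above can be scripted so it is easy to repeat for any time window. This is a minimal sketch under stated assumptions: `resize_ops_in_window` is a hypothetical helper, `cursor` is assumed to be a cursor from an Oracle DB-API driver that accepts named binds passed as a dictionary, and the window shown is simply the first and last warning times from the alert log.

```python
from datetime import datetime

# Any resize operation that overlaps the window is returned.
RESIZE_OPS_SQL = """
SELECT component, oper_type, oper_mode,
       initial_size, final_size, status, start_time, end_time
  FROM v$sga_resize_ops
 WHERE start_time <= :window_end
   AND end_time   >= :window_start
 ORDER BY start_time
"""


def resize_ops_in_window(cursor, window_start, window_end):
    """Return SGA resize operations overlapping [window_start, window_end].

    `cursor` is assumed to come from an Oracle DB-API driver; the
    connection setup is deliberately left out of this sketch.
    """
    cursor.execute(
        RESIZE_OPS_SQL,
        {"window_start": window_start, "window_end": window_end},
    )
    return cursor.fetchall()


# Window taken from the first and last PMON warnings in the alert log.
WINDOW_START = datetime(2016, 12, 20, 9, 13, 16)
WINDOW_END = datetime(2016, 12, 20, 9, 19, 24)
```

An empty result for the window is consistent with the observation that no SGA component was being grown or shrunk while the latch was held.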

 

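While researching this problem, I also came across a detailed official document on how to handle and diagnose 'PMON failed to acquire latch, see PMON dump'. I had intended to reproduce it here, but it is better kept as a PDF file, which can be downloaded from the link below if needed.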

SRDC - How to Collect Standard Information for Issues Where 'PMON failed to acquire latch, see PMON dump' Warnings are Seen in the Alert Log (Doc ID 1951971.1)

 

 

Time: 2024-08-04 21:05:20
