[20170707]cursor: pin S wait on X(10G).txt
--生产系统遇到1个bug,版本:
EXAM@xxx> @ &r/ver1
PORT_STRING VERSION BANNER
------------------- ---------- ----------------------------------------------------------------
x86_64/Linux 2.4.xx 10.2.0.4.0 Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bi
--//我今天才介入解决问题,就是调用存储过程某个update语句时,应用hang住了.我看等待事件cursor: pin S wait on X很容易联想到
--//bug.
http://www.dba-oracle.com/t_cursor_pin_wait_on_x.htm
Question: My AWR report shows "cursor: pin S wait on X" responsible for over 90% of my database processing time. How do
I fix the "cursor: pin S wait on X" event?
Answer: The "cursor: pin S wait on X" is related to the new "mutex" option. Oracle 10g release 2 and beyond replaced
some latch mechanisms with the mutex approach, claiming that they are faster and more efficient than traditional locking
mechanisms. To improve cursor execution speed and hard pare time within the library cache, mutexes replace library
cache latches and library cache pins. Oracle claims that mutexes are faster and use less CPU, which is important for
CPU-bound database where large data buffers remove I/O as a primary source of contention.
Beware, Oracle bug 5184776 can affect 11g databases that deploy the mutex latching, causing excessive 'cursor pin s wait
on x' wait events.
If you suspect that you have a bug, you can revert to the pre-mutex method with the consent of Oracle Support using
these commands that bounce your database into the older latching mode, which may result in less deadlocks:
alter system set "_kks_use_mutex_pin"=false scope=spfile;
shutdown immediate
startup
NOTE: This hidden parameter should only be used with the consent of Oracle support, and this parameter should NOT ever
be set in Oracle 12c
The cursor pin S wait on X wait event is mostly related to mutex and hard parse and it happens when a session waits for
this event when it is requesting a shared mutex pin and another session is holding an exclusive mutex pin on the same
cursor object.
These are the parameters associated with this mutex wait event:
P1 Hash value of cursor
P2 Mutex value (top 2 bytes contains SID holding mutex in exclusive mode, and bottom two bytes usually hold the value 0)
P3 Mutex where (an internal code locator) OR'd with Mutex Sleeps
--// 以下根据作者的执行的脚本分析:
SELECT s.inst_id AS inst
,s.sid AS blocked_sid
,s.username AS blocked_user
,sa.sql_id AS blocked_sql_id
,TRUNC (s.p2 / 4294967296) AS blocking_sid
,b.username AS blocking_user
,b.sql_id AS blocking_sql_id
FROM gv$session s
JOIN gv$sqlarea sa ON sa.hash_value = s.p1
JOIN gv$session b
ON TRUNC (s.p2 / 4294967296) = b.sid AND s.inst_id = b.inst_id
JOIN gv$sqlarea sa2 ON b.sql_id = sa2.sql_id
WHERE s.event = 'cursor: pin S wait on X';
Record View
As of: 2017/7/7 10:30:03
INST: 1
BLOCKED_SID: 1577
BLOCKED_USER: EXAM
BLOCKED_SQL_ID: guv6x80vk33wb
BLOCKING_SID: 1579
BLOCKING_USER: EXAM
BLOCKING_SQL_ID: guv6x80vk33wb
--//sid=1579 阻塞了 sid=1577.
select sql_id,loaded_versions,executions,loads,invalidations,parse_calls
from gv$sql
where inst_id=1 and sql_id='guv6x80vk33wb';
SQL_ID LOADED_VERSIONS EXECUTIONS LOADS INVALIDATIONS PARSE_CALLS
------------- --------------- ---------- ---------- ------------- -----------
guv6x80vk33wb 0 0 1 1 0
--//注:我前面做了alter system flush shared_pool;,这里看到的EXECUTIONS次数是0.
--//另外我在刷新前执行kill -9 os进程(当前正在执行这条sql语句的会话),我当时以为这样问题就可以解决.脚本如下:
select 'kill -9 '||spid,a.* from V$PROCESS a where addr in (
SELECT paddr
FROM v$session
WHERE wait_class <> 'Idle' AND sql_id = 'guv6x80vk33wb'
);
--//查看sid=1579的情况:
select * from v$session where sid=1579;
Record View
As of: 2017/7/7 10:32:19
SADDR: 000000007C90B688
SID: 1579
SERIAL#: 34307
AUDSID: 54134981
PADDR: 000000007D545808
USER#: 58
USERNAME: EXAM
COMMAND: 6
OWNERID: 2147483644
TADDR:
LOCKWAIT:
STATUS: KILLED
SERVER: DEDICATED
SCHEMA#: 58
SCHEMANAME: EXAM
OSUSER: SYSTEM
PROCESS: 3324:10116
MACHINE: WORKGROUP\BROKER
TERMINAL: BROKER
PROGRAM: Kodak.GCGateway.Adapter.Service.exe
TYPE: USER
SQL_ADDRESS: 00000000733A0DF8
SQL_HASH_VALUE: 924946315
SQL_ID: guv6x80vk33wb
SQL_CHILD_NUMBER: 3
PREV_SQL_ADDR: 0000000074233B68
PREV_HASH_VALUE: 152532753
PREV_SQL_ID: 7wp2a204jfxsj
PREV_CHILD_NUMBER: 5
PLSQL_ENTRY_OBJECT_ID: 116247
PLSQL_ENTRY_SUBPROGRAM_ID: 1
PLSQL_OBJECT_ID:
PLSQL_SUBPROGRAM_ID:
MODULE: Kodak.GCGateway.Adapter.Service.exe
MODULE_HASH: 163112698
ACTION:
ACTION_HASH: 0
CLIENT_INFO: 192.168.29.13
FIXED_TABLE_SEQUENCE: 183286255
ROW_WAIT_OBJ#: -1
ROW_WAIT_FILE#: 0
ROW_WAIT_BLOCK#: 0
ROW_WAIT_ROW#: 0
LOGON_TIME: 2017/6/28 13:03:40
LAST_CALL_ET: 768474
PDML_ENABLED: NO
FAILOVER_TYPE: NONE
FAILOVER_METHOD: NONE
FAILED_OVER: NO
RESOURCE_CONSUMER_GROUP:
PDML_STATUS: DISABLED
PDDL_STATUS: ENABLED
PQ_STATUS: ENABLED
CURRENT_QUEUE_DURATION: 0
CLIENT_IDENTIFIER:
BLOCKING_SESSION_STATUS: UNKNOWN
BLOCKING_INSTANCE:
BLOCKING_SESSION:
SEQ#: 31
EVENT#: 254
EVENT: single-task message
P1TEXT:
P1: 0
P1RAW: 00
P2TEXT:
P2: 0
P2RAW: 00
P3TEXT:
P3: 0
P3RAW: 00
WAIT_CLASS_ID: 2723168908
WAIT_CLASS#: 6
WAIT_CLASS: Idle
WAIT_TIME: 0
SECONDS_IN_WAIT: 768474
STATE: WAITING
SERVICE_NAME: mid
SQL_TRACE: DISABLED
SQL_TRACE_WAITS: FALSE
SQL_TRACE_BINDS: FALSE
--//昏,这个漏网了,因为WAIT_CLASS=Idle,而且很奇怪STATUS=KILLED.如果前面执行条件没有wait_class <> 'Idle',估计问题早解决了.
--//另外从时间看2017/6/28 13:03:40,正好也是网络出现环路的时间,正好也可以相互印证.
--//再次说明做事情要认真小心....
SELECT * FROM V$PROCESS WHERE addr IN (SELECT paddr FROM v$session WHERE sid = 1579);
Record View
As of: 2017/7/7 10:34:31
ADDR: 000000007D545808
PID: 20
SPID: 6086
USERNAME: oracle
SERIAL#: 73
TERMINAL: UNKNOWN
PROGRAM: oracle@qyymid
TRACEID:
BACKGROUND:
LATCHWAIT:
LATCHSPIN:
PGA_USED_MEM: 556053
PGA_ALLOC_MEM: 779293
PGA_FREEABLE_MEM: 0
PGA_MAX_MEM: 779293
--//删除进程号6086.
$ kill -9 6086
--//再测试ok,问题消失.顺便贴分析时的awr报表的前面部分:
WORKLOAD REPOSITORY report for
DB Name DB Id Instance Inst Num Release RAC Host
------------ ----------- ------------ -------- ----------- --- ------------
MID 2812601929 mid 1 10.2.0.4.0 NO qyymid
Snap Id Snap Time Sessions Curs/Sess
--------- ------------------- -------- ---------
Begin Snap: 44914 07-Jul-17 09:00:11 50 3.8
End Snap: 44915 07-Jul-17 09:29:29 57 4.7
Elapsed: 29.30 (mins)
DB Time: 1.46 (mins)
--//很奇怪这个版本看到DB Time很小,仅仅1.46 (mins)
Cache Sizes
~~~~~~~~~~~ Begin End
---------- ----------
Buffer Cache: 136M 136M Std Block Size: 8K
Shared Pool Size: 320M 320M Log Buffer: 6,152K
--//不是我配的(数据库运行在虚拟机器上的),明显Buffer Cache小了.不过这个服务器做的事情很少.
Load Profile
~~~~~~~~~~~~ Per Second Per Transaction
--------------- ---------------
Redo size: 1,698.35 16,406.59
Logical reads: 477.12 4,609.12
Block changes: 22.35 215.91
Physical reads: 0.30 2.87
Physical writes: 0.66 6.38
User calls: 13.05 126.08
Parses: 102.20 987.24
Hard parses: 0.27 2.58
Sorts: 178.27 1,722.16
Logons: 2.18 21.09
Executes: 103.94 1,004.12
Transactions: 0.10
% Blocks changed per Read: 4.68 Recursive Call %: 98.56
Rollback per transaction %: 4.40 Rows per Sort: 1.86
Instance Efficiency Percentages (Target 100%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Buffer Nowait %: 100.00 Redo NoWait %: 100.00
Buffer Hit %: 99.96 In-memory Sort %: 100.00
Library Hit %: 99.61 Soft Parse %: 99.74
Execute to Parse %: 1.68 Latch Hit %: 100.00
Parse CPU to Parse Elapsd %: 30.64 % Non-Parse CPU: 85.20
Shared Pool Statistics Begin End
------ ------
Memory Usage %: 93.17 91.29
% SQL with executions>1: 82.12 79.33
% Memory for SQL w/exec>1: 83.60 80.99
Top 5 Timed Events Avg %Total
~~~~~~~~~~~~~~~~~~ wait Call
Event Waits Time (s) (ms) Time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
cursor: pin S wait on X 1,786,725 19,351 11 ###### Concurrenc
--//很奇怪,这个版本为什么不把cursor: pin S wait on X计入db time时间.
CPU time 56 63.8
log file sync 356 24 66 26.9 Commit
control file parallel write 582 21 36 23.7 System I/O
log file parallel write 749 20 27 22.6 System I/O
-------------------------------------------------------------
--//查询有问题sql_id=guv6x80vk33wb,也无法在awr报表中体现出来.