2016年10月27日下午,测试同事说测试数据库连接不上了,让我们DBA查看问题并解决一下。
操作系统:Red Hat Enterprise Linux Server release 6.6 (Santiago)
数据库版本:
[oracle@se31 ~]$ sqlplus / as sysdba
SQL*Plus: Release 11.2.0.4.0 Production on Fri Oct 28 08:59:04 2016
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
SQL> select * from v$version;
BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
PL/SQL Release 11.2.0.4.0 - Production
CORE 11.2.0.4.0 Production
TNS for Linux: Version 11.2.0.4.0 - Production
NLSRTL Version 11.2.0.4.0 - Production
刚登陆上数据库服务器,发现服务器的IO已经被耗光:
查看数据库告警日志,发现最大进程数告警:
Thu Oct 27 14:24:51 2016
ORA-00020: ???讴???)
ORA-20 errors will not be written to the alert log for
the next minute. Please look at trace files to see all
the ORA-20 errors.
Process P198 submission failed with error = 20
Thu Oct 27 14:25:58 2016
ORA-00020: maximum number of processes (300) exceeded
ORA-20 errors will not be written to the alert log for
the next minute. Please look at trace files to see all
the ORA-20 errors.
数据库的最大进程数设置确实是300,但是测试库是针对测试用的,最大进程数设置稍微小点容易暴露项目得问题,并且平时测试没有出现过类似情况。
SQL> show parameter processes
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
aq_tm_processes integer 1
db_writer_processes integer 3
gcs_server_processes integer 0
global_txn_processes integer 1
job_queue_processes integer 1000
log_archive_max_processes integer 4
processes integer 300
SQL>
先排除最大进程数,继续查看数据库中的非空闲等待事件:
SQL> select event,count(*) from v$session where wait_class<>'Idle' group by event;
EVENT COUNT(*)
---------------------------------------------------------- ----------
direct path read temp 5
latch free 23
SQL*Net message to client 2
然后,按照等待事件查看相关会话执行的sql语句:
SQL> select sid from v$session where event='&event_name';
Enter value for event_name: latch free
old 1: select sid from v$session where event='&event_name'
new 1: select sid from v$session where event='latch free'
SID
-----------
11
26
52
56
67
113
120
139
189
244
245
265
267
291
315
326
363
402
414
423
434
447
22 rows selected.
SQL> set linesize 1000
col username for a15
col osuser for a15
select sid,wait_class_id,user#,username,lockwait,status,osuser,sql_id,PREV_SQL_ID from v$session where
wait_class_id= 1893977003 and sid=&sid;SQL> SQL> SQL> 2
Enter value for sid: 11
old 2: wait_class_id= 1893977003 and sid=&sid
new 2: wait_class_id= 1893977003 and sid=11
SID WAIT_CLASS_ID USER# USERNAME LOCKWAIT STATUS OSUSER SQL_ID PREV_SQL_ID
----------- ------------- ---------- --------------- ---------------- ---------- --------------- ------------- -------------
11 1893977003 156 BMI_NANCHONG ACTIVE ASP.NET v4.0 c8m5td2tmpcxg
查看 c8m5td2tmpcxg的sql语句:
SQL> select sql_fulltext from v$sqlarea where sql_id='c8m5td2tmpcxg';
select *
from (select row_.*, rownum NumRow
from (select b.table_par as tablepar,
a.bill_no as billcode,
b.item_date as itemdate,
b.item_id as itemid,
b.item_name as name,
b.numbers,
b.price,
b.costs,
b.drug_spec as specification s,
b.package_unit as itemunit,
b.reject_money as rejectmone y,
b.reject_num as rejectnumber,
b.reject_reson as rejectreso n,
a.admission_number as admiss ionnumber,
b.physician_name as doctorna me,
b.deptname as deptname,
a.patient_name as patientnam e,
d.dosage_name as dosagename,
nvl(d.varchar01, zs.varchar01) as packages,
nvl(d.manu_name, zs.manu_name) as manuname,
a.PATIENT_ID as BillPatientI d,
a.ADMISSION_DATE as BillAdmi ssonDate,
a.DISCHARGE_DATE as BillDisc hargeDate,
b.hospital_remark as hospitalremark,
zdr.region_name as region_name
from dw_billdetail b
inner join dw_bill a
on b.pid = a.hisid
and b.table_par = a.table_par
left join dw_zd_drug d
on b. item_id = d.item_id
left join dw_zd_service zs
on b.item_id = zs.item_id
left join d w_zd_region zdr
on zdr.region_id = a.bmi_code
where 1 = 1
AND a.andit_manu_statu
s IN (:paramBillStatus0,
:paramBillStatus1,
:paramBillStatus2,
:paramBillStatus3)
and b.reject_money > 0
and a.hospital_id = :paramHosptialID4
and b.table_par >= :stabPar5
and b.table_par <= :etabPar6
order by b.pid, b.item_date) row_
where rownum <= 50)
where NumRow > 0
查看c8m5td2tmpcxg的子游标及执行计划情况:
SQL> select child_number,executions,buffer_gets,is_bind_sensitive BS,
is_bind_aware BA,is_shareable SH,plan_hash_value
from v$sql
where sql_id='&sql_id'; 2 3 4
Enter value for sql_id: c8m5td2tmpcxg
old 4: where sql_id='&sql_id'
new 4: where sql_id='c8m5td2tmpcxg'
CHILD_NUMBER EXECUTIONS BUFFER_GETS B B S PLAN_HASH_VALUE
------------ ---------- ----------- - - - ---------------
0 18 669286643 Y N Y 3571375093
1 0 3013153881 Y N Y 3571375093
2 1 380 Y N Y 1661143992
3 0 78051 Y N Y 1661143992
5 0 870616787 N N Y 3571375093
经过查看只有0、1、3能看到执行计划,
select * from table(dbms_xplan.display_cursor('c8m5td2tmpcxg',0,'typical'));
select * from table(dbms_xplan.display_cursor('c8m5td2tmpcxg',1,'typical'))
select * from table(dbms_xplan.display_cursor('c8m5td2tmpcxg',2,'typical'))
其余的子游标对应的执行计划已经查不到了:
SQL> c/2/3
1* select * from table(dbms_xplan.display_cursor('c8m5td5tmpcxg',3,'typical'))
SQL> /
PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID: c8m5td5tmpcxg, child number: 3 cannot be found
SQL> c/3/5
1* select * from table(dbms_xplan.display_cursor('c8m5td5tmpcxg',5,'typical'))
SQL> /
PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID: c8m5td5tmpcxg, child number: 5 cannot be found
另外一个等待事件的会话发起的sql语句还是:c8m5td5tmpcxg
SQL> select sid from v$session where event='&event_name';
Enter value for event_name: direct path read temp
old 1: select sid from v$session where event='&event_name'
new 1: select sid from v$session where event='direct path read temp'
SID
-----------
28
113
139
215
315
326
355
414
8 rows selected.
SQL> select wait_time,wait_class_id,state from v$session_wait where sid=&sid;
Enter value for sid: 28
old 1: select wait_time,wait_class_id,state from v$session_wait where sid=&sid
new 1: select wait_time,wait_class_id,state from v$session_wait where sid=28
WAIT_TIME WAIT_CLASS_ID STATE
---------- ------------- -------------------
1 1740759767 WAITED KNOWN TIME
SQL> set linesize 1000
col username for a15
col osuser for a15
select sid,wait_class_id,user#,username,lockwait,status,osuser,sql_id,PREV_SQL_ID from v$session where
wait_class_id= 1740759767 and sid=&sid;SQL> SQL> SQL> 2
Enter value for sid: 28
old 2: wait_class_id= 1740759767 and sid=&sid
new 2: wait_class_id= 1740759767 and sid=28
SID WAIT_CLASS_ID USER# USERNAME LOCKWAIT STATUS OSUSER SQL_ID PREV_SQL_ID
----------- ------------- ---------- --------------- ---------------- ---------- --------------- ------------- -------------
28 1740759767 156 BMI_NANCHONG ACTIVE ASP.NET v4.0 c8m5td2tmpcxg
令人奇怪的是,这条sql语句上并没有并行相关的Hints,但是其执行计划里却体现出了并行执行QN,查看方案下的表并没有发现有开启并行的表:
SQL> select table_name,degree from user_tables where table_name in('DW_BILL','DW_BILLDETAIL','DW_ZD_DRUG','DW_ZD_SERVICE','DW_ZD_REGION');
TABLE_NAME DEGREE
---------------------------------------- --------------------
DW_BILL 1
DW_BILLDETAIL 1
DW_ZD_DRUG 1
DW_ZD_REGION 1
DW_ZD_SERVICE 1
经过与开发、测试同事了解,应用ASP.NET端的部署没有开启并行设置,然后查看了这几张表下的索引,果然索引下均有并行。
看来是查询执行时,走了索引,触发了并行,耗尽了数据库的进程数,导致数据库对新连接无响应;接下来的处理措施是:查杀消耗CPU高的会话,取消掉索引的并行:
按等待事件类型查杀会话:
SQL> select 'alter system kill session ',''''||a.sid||','||a.serial#||''';'
from v$session a where event='latch free'; 2
'ALTERSYSTEMKILLSESSION' ''''||A.SID||','||A.SERIAL#||''';'
-------------------------- ------------------------------------------------------------------------------------
alter system kill session '11,85';
alter system kill session '26,12795';
alter system kill session '52,43';
alter system kill session '56,3';
alter system kill session '67,9473';
alter system kill session '100,5';
alter system kill session '120,37';
alter system kill session '189,67';
alter system kill session '244,19499';
alter system kill session '245,9039';
alter system kill session '265,13217';
alter system kill session '267,237';
alter system kill session '291,53';
alter system kill session '363,8251';
alter system kill session '402,39799';
alter system kill session '423,2479';
alter system kill session '434,1';
alter system kill session '447,7877';
alter system kill session '452,33';
19 rows selected.
SQL> select event,count(*) from v$session where wait_class<>'Idle' group by event;
EVENT COUNT(*)
---------------------------------------------------------- ----------
direct path read temp 8
SQL*Net message to client 1
SQL> select 'alter system kill session ',''''||a.sid||','||a.serial#||''';'
from v$session a where event='&event_name'; 2
Enter value for event_name: direct path read temp
old 2: from v$session a where event='&event_name'
new 2: from v$session a where event='direct path read temp'
'ALTERSYSTEMKILLSESSION' ''''||A.SID||','||A.SERIAL#||''';'
-------------------------- ------------------------------------------------------------------------------------
alter system kill session '28,5109';
alter system kill session '113,3';
alter system kill session '139,15';
alter system kill session '215,3';
alter system kill session '315,5';
alter system kill session '326,6017';
alter system kill session '355,33';
alter system kill session '414,3';
8 rows selected.
杀完latch free、direct path read temp等待会话,数据库服务器的CPU恢复正常:
接下来取消索引的并行:
由于修改sql相关对象的并行度,相当于其相关对象发生了DDL变更,会引起sql引擎对在相关对象上执行的sql语句重新解析,做完索引的并行度取消,
sql语句恢复正常,服务器没有发生并行使用过多的进程的情况:
c8m5td2tmpcxg果然被重新硬解析,原来的5个子游标变成了现在的3个:
SQL> select child_number,executions,buffer_gets,is_bind_sensitive BS,
is_bind_aware BA,is_shareable SH,plan_hash_value
from v$sql
where sql_id='&sql_id'; 2 3 4
Enter value for sql_id: c8m5td2tmpcxg
old 4: where sql_id='&sql_id'
new 4: where sql_id='c8m5td2tmpcxg'
CHILD_NUMBER EXECUTIONS BUFFER_GETS B B S PLAN_HASH_VALUE
------------ ---------- ----------- - - - ---------------
1 0 3013153881 Y N Y 3571375093
2 6 238503 Y N Y 3286212919
3 0 78051 Y N Y 1661143992
sql语句的执行计划恢复正常,没有了并行:
如果你有不同见解或更好的解决思路,欢迎留言讨论!