故障现象:
某物流公司新上一套软件系统,数据库CPU占用率基本维持在80%以上,业务高峰期数据库宕机。
知识预览:
LGWR什么时候开始工作的,触发他的动作?
A. 重做日志缓冲区的已使用的空间达到三分之一时
B. 当dbwn进程向磁盘写入已修改的缓冲区的时候
C. 每隔3秒钟
D. 用户提交事务处理时的一条提交记录(经常commit会及时刷新重做日志缓冲区空间)
LGWR什么时候开始工作的?
A. 重做日志缓冲区的已使用的空间达到三分之一时
B. 当DBWn进程向磁盘写入已修改的缓冲区的时候
C. 每隔3秒钟
D. 用户提交事务处理时
什么情况下会产生Log file sync等待事件?
1.用户进程发起commit
2.用户进程通知LGWR写日志
3.LGWR接收到请求开始写
4.LGWR写完成
5.LGWR通知用户进程写完成
6.用户进程获得通知,继续做其他事
1,2阶段的时间,主要是用户提交等待时间,如果系统CPU资源充足,在这个阶段一般不会出现大的延迟。前台进程提交到LGWR后,就开始等待log file sync。
2,3阶段的时间,主要是LGWR为了获取CPU资源,等待cpu调度的时间.如果系统CPU资源充足,一般不会出现大的延迟。
3,4阶段的时间,主要是真正的物理IO时间,LGWR通知OS把log buffer的内容写入到磁盘,然后LGWR进入睡眠,这个时间正常情况下的延迟占整个log file sync的大部分时间。
4,5阶段的时间,OS调度lgwr 重新获得cpu资源,lgwr post前台进程写完成。lgwr可能会post很多前台进程
5,6阶段的时间,前台进程接受到lgwr的通知,返回cpu运行队列,处理其他事物(log file sync结束)。
Log file sync等待事件有哪些优化方案?
尽量使用快速磁盘,不要把redo log file存放在raid 5的磁盘上;
加大日志缓冲区(log buffer);
使用批量提交,减少提交的次数;
部分经常提交的事务设置为异步提交;
安装最新版本数据库避免bug;
环境:
| Platform | CPUs | Cores | Sockets | Memory(GB) |
iZ28vfyim2vZ | Microsoft Windows x86 64-bit | 8 | 8 | 1 | 16.00 |
数据库运行在一台8核16G内存Windows上;这台服务器上同时还运行着web应用。
数据库基本信息:
DB Name | DB Id | Instance | Inst num | Startup Time | Release | RAC |
PMDB | 128847122 | pmdb | 1 | 27-Feb -15 21:02 | 11.2.0.1.0 | NO |
AWR数据库性能采样报告:
| Snap Id | Snap Time | Sessions | Cursors/Session |
Begin Snap: | 6603 | 3-Mar-15 14:30:15 | 80 | 2.2 |
End Snap: | 6604 | 3-Mar-15 15:30:20 | 82 | 3.0 |
Elapsed: |
| 239.56 (mins) |
|
|
DB Time: |
| 5205.71(mins) |
|
|
问题时间段60分钟的AWR数据库性能采样报告,DB Time为5205分钟,CPU数为8核,平均每个CPU忙于执行任务的时间为5205/8=650mins分钟,而数据库处理任务花费的时间为239mins,忙于执行任务时间远远大于处理任务花费的时间。
TOP5事件:
Event | Waits | Time(s) | Avg wait (ms) | % DB time | Wait Class |
DB CPU |
| 22,574 |
| 42.59 |
|
log file sync | 25,313 | 310 | 12 | 0.58 | Commit |
db file sequential read | 10,645 | 24 | 2 | 0.05 | User I/O |
direct path read | 9,191 | 22 | 2 | 0.04 | User I/O |
latch:cache buffers chains | 127,566 | 13,136 | 103 | 4.21 | Concurrency |
Top 5 Event可以看出,DB CPU消耗超高,占 42.59% 的事件比例;
而日志文件同步等待事件单次等待达到12毫秒,这表明IO可能存在迟缓;
并发latch:cache buffers chains事件平均等待时间也超高,这个事件,很大程序上与逻辑读有关,说明可能逻辑读比较大。
负载概要信息:
| Per Second | Per Transaction | Per Exec | Per Call |
DB Time(s): | 14.7 | 3.0 | 0.07 | 0.02 |
DB CPU(s): | 6.3 | 1.3 | 0.03 | 0.01 |
Redo size: | 55,291.8 | 11,421.6 |
|
|
Logical reads: | 662,383.2 | 136,828.5 |
|
|
Block changes: | 189.9 | 39.2 |
|
|
Physical reads: | 7.0 | 1.5 |
|
|
Physical writes: | 19.9 | 4.1 |
|
|
User calls: | 883.5 | 182.5 |
|
|
Parses: | 21.1 | 4.4 |
|
|
Hard parses: | 3.4 | 0.7 |
|
|
W/A MB processed: | 1,170.6 | 241.8 |
|
|
Logons: | 1.2 | 0.3 |
|
|
Executes: | 214.9 | 44.4 |
|
|
Rollbacks: | 0.0 | 0.0 |
|
|
Transactions: | 5.0 |
|
|
|
数据库每秒仅仅折合5个事务,而此前的日志文件同步等待事件较高,这说明IO资源可能被其他操作占用。除了事务之外,应该就是查询,也就是说,可能有大量低效查询消耗尽了IO资源。
主机CPU:
Load Average Begin | Load Average End | %User | %System | %WIO | %Idle |
|
| 88.3 | 3.8 |
| 8.0 |
从主机层面来看,CPU空闲只有8.0%,表明目前cpu基本满负荷运转了。
实例CPU:
%Total CPU | %Busy CPU | %DB time waiting for CPU (Resource Manager) |
83.6 | 90.8 | 0.0 |
从数据库占用CPU资源来看,数据库消耗了主机cpu资源的83.6%,说明大量的cpu资源都是由oracle消耗了。
内存统计信息:
| Begin | End |
Host Mem (MB): | 16,383.6 | 16,383.6 |
SGA use (MB): | 2,144.0 | 2,144.0 |
PGA use (MB): | 1,018.7 | 1,018.7 |
% Host Mem used for SGA+PGA: | 19.30 | 19.30 |
主机内存16G,其中SGA:2G、PGA:1G,内存小了,数据在Buffer cache中缓存的就少,前面我们注意到逻辑读超高,这将导致物理读,从而解释了IO资源迟缓的问题。
SQL逻辑读:
Buffer Gets | Executions | Gets per Exec | %Total | Elapsed Time (s) | %CPU | %IO | SQL Id | SQL Module | SQL Text |
1,171,388,996 | 90,764 | 12,905.88 | 46.05 | 30,514.20 | 44.7 | 0 | 6t0q3nw8b2b68 | JDBC Thin Client | select * from ( select row_.*,... |
554,429,813 | 42,299 | 13,107.40 | 32.22 | 9,923.23 | 40.9 | 0 | btn84j9rhuf4h | JDBC Thin Client | select * from ( select row_.*,... |
503,809,511 | 127,506 | 3,951.26 | 11.10 | 7,669.52 | 37.3 | 0 | cp9wdanzmzj4q | JDBC Thin Client | select * from (select JOBID, S... |
30,497,069 | 2,331 | 13,083.26 | 1.28 | 409.63 | 42.2 | 0 | fqjyuzjgm89qv | JDBC Thin Client | select * from ( select trunc(P... |
total列表明这个sql执行时间占据db time的比例,第一个这个比值达到46.05%,显然这个SQL导致的逻辑读,消耗了大量的IO资源。
段逻辑读:
Owner | Tablespace Name | Object Name | Subobject Name | Obj. Type | Logical Reads | %Total |
PMSOFT | PMTABLESPACE | PM_EXTRACT_GOODS |
| TABLE | 2,145,895,024 | 89.21 |
PMSOFT | PMTABLESPACE_IDX | PM_GOODS_LETTER_LL_PK |
| INDEX | 35,438,144 | 1.48 |
PMSOFT | PMTABLESPACE | PM_GOODS_HANDLE_INFO |
| TABLE | 18,438,384 | 0.77 |
PMSOFT | PMTABLESPACE_IDX | PM_TRACE_GOODS_GOODID |
| INDEX | 17,049,488 | 0.71 |
PMSOFT | USER | PM_LINE_INFO |
| TABLE | 5,766,448 | 0.24 |
在"Segments by Logical Reads"部分可以看到,逻辑读集中在一个对象上,第一位的表占用89.21%的逻辑读。
问题汇总及解决方法:
1、数据库安装在Windows上
数据库迁移到Linux上,增加稳定性。
2、数据库与WEB应用部署在同一台服务器上
数据库和WEB应用分开部署,减少其它应用对服务器的压力。
3、数据库版本11.0.2.1较低
数据库版本升级成11.0.2.4,减少软件Bug造成的影响。
4、数据库使用默认实例,采用默认配置参数
加大内存、日志缓冲区、闪回恢复区、增加日志组数。
5、表PM_EXTRACT_GOODS贡献的逻辑读异常大
解决逻辑读问题。
以上解决步骤不再详述,下面看一下如何降低逻辑读。
如何降低逻辑读?
1、复查应用程序,减少查询?
经过分析,这张表是主要业务表,没法减少查询。
2、查看相关SQL的执行计划,然后进行优化?
我就是用的这个,当然还有很多其它的方式,可以确定是否是因为热点块问题造成的、增加DBWn进程个数、创建反向索引、小buffer cache等等。
查看问题SQL执行计划:
select * from ( select row_.*, rownum rownum_ from ( select letter.bill_no bill_no, letter.id receive_id, letter.goods_desc, letter.receiver_name, letter.receiver_phone, letter.sender_name, letter.sender_phone, letter.hm_npay, letter.hm_paid, letter.insur_coverage, letter.insur_cost, letter.ag_fund, TO_CHAR(letter.createdate, 'yyyy-MM-dd HH24:MI:SS') bill_date, letter.fdept_id, letter.fdept_code, letter.fdept_name, letter.tdept_id, letter.tdept_code, letter.tdept_name, letter.ticket_no, bigg.remark, bigg.les_num, bigg.id , TO_CHAR( bigg.handle_time, 'yyyy-MM-dd HH24:MI:SS') handle_time from pm_goods_letter letter left join (select * from pm_extract_goods gin2 where (gin2.receive_id, gin2.handle_time) in ( select gin.receive_id, max(gin.handle_time) from pm_extract_goods gin group by gin.receive_id)) bigg on bigg.receive_id=letter.id where 1=1 AND (letter.fdept_id =:1 or letter.tdept_id=:2) ) row_ ) where rownum_ > 0 and rownum_ <= 20
执行计划如下:
通过执行计划,我们能够发现,可能是索引缺失导致了全表的扫描。
尝试解决问题:
在表PM_EXTRACT_GOODS的receive_id和handle_time字段建一个组合索引,问题解决。
优化前后对比
数据库基本信息
优化前,数据库安装在Windows server 2008 R2下,8核,内存16G,数据库版本11.2.0.1.0
Host Name | Platform | CPUs | Cores | Sockets | Memory(GB) |
iZ28vfyim2vZ | Microsoft Windows x86 64-bit | 8 | 8 | 1 | 16.00 |
DB Name | DB Id | Instance | Inst num | Startup Time | Release | RAC |
PMDB | 128847122 | pmdb | 1 | 27-Feb -15 21:02 | 11.2.0.1.0 | NO |
优化后,数据库安装在CentOS6.5下,16核,内存64G,数据库版本11.2.0.4.0
Host Name | Platform | CPUs | Cores | Sockets | Memory(GB) |
iZ28od1kf25Z | Linux x86 64-bit | 16 | 16 | 1 | 62.90 |
DB Name | DB Id | Instance | Inst num | Startup Time | Release | RAC |
PMDB | 135174046 | pmdb | 1 | 05-Mar-15 00:03 | 11.2.0.4.0 | NO |
AWR数据库性能采样报告
优化前
| Snap Id | Snap Time | Sessions | Cursors/Session |
Begin Snap: | 6603 | 3-Mar-15 14:30:15 | 80 | 2.2 |
End Snap: | 6604 | 3-Mar-15 15:30:20 | 82 | 3.0 |
Elapsed: |
| 239.56 (mins) |
|
|
DB Time: |
| 5205.71(mins) |
|
|
优化后
| Snap Id | Snap Time | Sessions | Cursors/Session |
Begin Snap: | 36 | 06-Mar-15 14:30:05 | 49 | 3.7 |
End Snap: | 37 | 06-Mar-15 15:30:11 | 49 | 3.7 |
Elapsed: |
| 180.10 (mins) |
|
|
DB Time: |
| 23.35 (mins) |
|
|
TOP5事件
优化前
Event | Waits | Time(s) | Avg wait (ms) | % DB time | Wait Class |
DB CPU |
| 22,574 |
| 42.59 |
|
log file sync | 25,313 | 310 | 12 | 0.58 | Commit |
db file sequential read | 10,645 | 24 | 2 | 0.05 | User I/O |
direct path read | 9,191 | 22 | 2 | 0.04 | User I/O |
latch:cache buffers chains | 127,566 | 13,136 | 103 | 4.21 | Concurrency |
优化后
Event | Waits | Time(s) | Avg wait (ms) | % DB time | Wait Class |
DB CPU |
| 1,267.9 |
| 20.5 |
|
reliable message | 2,281 | 1.2 | 0 | 0.0 | Other |
SQL*Net more data from client | 27,707 | 0.6 | 0 | 0.0 | Network |
SQL*Net message to client | 205,173 | 0.3 | 0 | 0.0 | Network |
db file sequential read | 3,612 | 0.1 | 0 | 0.0 | User I/O |
负载概要信息
优化前
| Per Second | Per Transaction | Per Exec | Per Call |
DB Time(s): | 14.7 | 3.0 | 0.07 | 0.02 |
DB CPU(s): | 6.3 | 1.3 | 0.03 | 0.01 |
Redo size: | 55,291.8 | 11,421.6 |
|
|
Logical reads: | 662,383.2 | 136,828.5 |
|
|
Block changes: | 189.9 | 39.2 |
|
|
Physical reads: | 7.0 | 1.5 |
|
|
Physical writes: | 19.9 | 4.1 |
|
|
User calls: | 883.5 | 182.5 |
|
|
Parses: | 21.1 | 4.4 |
|
|
Hard parses: | 3.4 | 0.7 |
|
|
W/A MB processed: | 1,170.6 | 241.8 |
|
|
Logons: | 1.2 | 0.3 |
|
|
Executes: | 214.9 | 44.4 |
|
|
Rollbacks: | 0.0 | 0.0 |
|
|
Transactions: | 5.0 |
|
|
|
优化后
| Per Second | Per Transaction | Per Exec | Per Call |
DB Time(s): | 0.1 | 0 | 0.00 | 0.00 |
DB CPU(s): | 0.1 | 0 | 0.00 | 0.00 |
Redo size: | 7,989.4 | 1,900.1 |
|
|
Logical reads: | 23,833.5 | 5,668.3 |
|
|
Block changes: | 46.5 | 11.1 |
|
|
Physical reads: | 7.0 | 1.5 |
|
|
Physical writes: | 4.4 | 1.1 |
|
|
Read IO requests: | 62.2 | 14.8 |
|
|
Write IO requests: | 2.7 | 0.6 |
| |
Read IO (MB): | 57.2 | 13.6 |
| |
Write IO (MB): | 0.0 | 0.0 |
| |
User calls: | 51.8 | 12.3 |
|
|
Parses: | 22.5 | 5.3 |
|
|
Hard parses: | 0.7 | 0.2 |
|
|
SQL Work Area (MB): | 6.0 | 1.4 |
|
|
Logons: | 0.2 | 0.1 |
|
|
Executes: | 32.4 | 7.7 |
|
|
Rollbacks: | 0.0 | 0.0 |
|
|
Transactions: | 4.2 |
|
|
|
主机CPU
优化前,CPU空闲率只有8%
Load Average Begin | Load Average End | %User | %System | %WIO | %Idle |
|
| 88.3 | 3.8 |
| 8.0 |
优化后,从主机层面来看,优化后CPU空闲占到99%
CPUs | Cores | Sockets | Load Average Begin | Load Average End | %User | %System | %WIO | %Idle |
16 | 16 | 1 | 0.31 | 0.15 | 0.7 | 0.3 | 0.1 | 99.0 |
实例CPU
优化前
%Total CPU | %Busy CPU | %DB time waiting for CPU (Resource Manager) |
83.6 | 90.8 | 0.0 |
优化后
%Total CPU | %Busy CPU | %DB time waiting for CPU (Resource Manager) |
0.8 | 77.5 | 0.0 |
从数据库占用CPU资源来看,优化后数据库只消耗了主机cpu资源的0.8%
内存统计信息
优化前
| Begin | End |
Host Mem (MB): | 16,383.6 | 16,383.6 |
SGA use (MB): | 2,144.0 | 2,144.0 |
PGA use (MB): | 1,018.7 | 1,018.7 |
% Host Mem used for SGA+PGA: | 19.30 | 19.30 |
优化后
| Begin | End |
Host Mem (MB): | 64,411.3 | 64,411.3 |
SGA use (MB): | 20,480.0 | 20,480.0 |
PGA use (MB): | 247.6 | 267.2 |
% Host Mem used for SGA+PGA: | 3.56 | 3.59 |
SQL逻辑读
优化前, %total列表明这个sql执行时间占据db time的比例,第一个这个比值达到46.05%,显然这个SQL导致的逻辑读,消耗了大量的IO资源
Buffer Gets | Executions | Gets per Exec | %Total | Elapsed Time (s) | %CPU | %IO | SQL Id | SQL Module | SQL Text |
1,171,388,996 | 90,764 | 12,905.88 | 46.05 | 30,514.20 | 44.7 | 0 | 6t0q3nw8b2b68 | JDBC Thin Client | select * from ( select row_.*,... |
554,429,813 | 42,299 | 13,107.40 | 32.22 | 9,923.23 | 40.9 | 0 | btn84j9rhuf4h | JDBC Thin Client | select * from ( select row_.*,... |
503,809,511 | 127,506 | 3,951.26 | 11.10 | 7,669.52 | 37.3 | 0 | cp9wdanzmzj4q | JDBC Thin Client | select * from (select JOBID, S... |
30,497,069 | 2,331 | 13,083.26 | 1.28 | 409.63 | 42.2 | 0 | fqjyuzjgm89qv | JDBC Thin Client | select * from ( select trunc(P... |
优化后, %total列表明这个sql执行时间占据db time的比例,第一个这个比值只有5.39%
Buffer Gets | Executions | Gets per Exec | %Total | Elapsed Time (s) | %CPU | %IO | SQL Id | SQL Module | SQL Text |
270,368,217 | 20,198 | 13,385.89 | 5.39 | 1,023.63 | 34.7 | 0 | bqxs5y957g75m | JDBC Thin Client | select count(*) from PM_GOODS_... |
211,213,752 | 1,544 | 136,796.47 | 4.21 | 890.64 | 28.2 | 0 | 6x7279fvzt5pk | JDBC Thin Client | select * from ( select row_.*,... |
90,328,606 | 5,188 | 17,411.07 | 1.30 | 457.81 | 37.3 | 0 | d577budk4k7aq | JDBC Thin Client | select letter.bill_no bill_no,... |
81,834,742 | 5,698 | 14,362.01 | 1.13 | 1.63 | 39.2 | 0 | 740wmdm31jyzp | JDBC Thin Client | select pickup_total.name tihuo... |
段逻辑读
优化前
Owner | Tablespace Name | Object Name | Subobject Name | Obj. Type | Logical Reads | %Total |
PMSOFT | PMTABLESPACE | PM_EXTRACT_GOODS |
| TABLE | 2,145,895,024 | 89.21 |
PMSOFT | PMTABLESPACE_IDX | PM_GOODS_LETTER_LL_PK |
| INDEX | 35,438,144 | 1.48 |
PMSOFT | PMTABLESPACE | PM_GOODS_HANDLE_INFO |
| TABLE | 18,438,384 | 0.77 |
PMSOFT | PMTABLESPACE_IDX | PM_TRACE_GOODS_GOODID |
| INDEX | 17,049,488 | 0.71 |
PMSOFT | USER | PM_LINE_INFO |
| TABLE | 5,766,448 | 0.24 |
优化后
Owner | Tablespace Name | Object Name | Subobject Name | Obj. Type | Logical Reads | %Total |
PMSOFT | PMTABLESPACE | PM_MID_DAOHUO |
| TABLE | 47,626,288 | 9.21 |
PMSOFT | PMTABLESPACE | PM_GOODS_LETTER |
| TABLE PARTITION | 40,368,688 | 1.50 |
PMSOFT | PMTABLESPACE | PM_TRACE_GOODS |
| TABLE PARTITION | 10,344,832 | 0.97 |
PMSOFT | PMTABLESPACE_IDX | VOICE_SEND |
| TABLE | 10,603,568 | 0.56 |
PMSOFT | PMTABLESPACE | PM_GOODS_LETTER |
| TABLE PARTITION | 5,302,848 | 0.34 |
案例总结
1、了解日志文件同步事件优化方案
2、合理配置数据库参数
3、根据sql执行计划判断执行计划是否正确
2、正确使用索引。
案例警示
即使一个小的问题,也可能造成严重的事故。
将问题消灭在源头。
发生事故后,冷静分析,寻找最佳方案。
已有 0人发表留言,猛击->> 这里<<-参与讨论
ITeye推荐