跳过正文

记一次 1000+ 每秒 UPDATE 引发的 Redo 风暴:AWR 定位 log file switch checkpoint incomplete

Greatfinish
作者
Greatfinish
记录 Oracle、PostgreSQL、达梦、Linux、存储与生产环境故障处理经验。

一、问题现象
#

12 月 8 日上午 11 点多,监控告警提示生产数据库磁盘使用率超过 90%。初步判断:磁盘使用突增,需要关注数据库整体负载与日志增长情况。

二、问题排查
#

2.1 操作系统磁盘空间
#

查看 OS 空间与 Oracle 目录占用,发现归档/恢复区相关目录增长较快(fast_recovery_area 约 43G)。

[root@md-db-out ~]# df -h
Filesystem                   Size  Used Avail Use% Mounted on
/dev/mapper/vg_root-lv_root  461G  399G   39G  92% /
tmpfs                         32G  160K   32G   1% /dev/shm
/dev/sda1                    485M   39M  421M   9% /boot
[root@md-db-out ~]# su - oracle
[oracle@md-db-out ~]$ cd /u01/app/oracle/
[oracle@md-db-out oracle]$ du -h --max-depth=1
4.0K    ./checkpoints
55G     ./diag
4.6G    ./admin
70G     ./oradata
43G     ./fast_recovery_area
1.8M    ./cfgtoollogs
4.4G    ./product
177G    .
[oracle@md-db-out oracle]$ 

虽然oradata目录不小但故障前没有添加过表空间数据文件,所以这次告警更像是“归档/恢复区增长过快”触发

2.2 归档日志切换频率异常
#

归档日志增长过快后,进一步统计日志切换频率,发现上午 9 点开始切换次数从 11 次 → 755 次(提升近 68 倍)。

[oracle@md-db-out oracle]$ sqlplus / as sysdba

SQL*Plus: Release 11.2.0.4.0 Production on Mon Dec 8 11:27:01 2025

Copyright (c) 1982, 2013, Oracle.  All rights reserved.


Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> set lines 300
SQL> column h0 format 999
column h1 format 999
column h2 format 999
column h3 format 999
column h4 format 999
column h5 format 999
column h6 format 999
column h7 format 999
column h8 format 999
column h9 format 999
column h10 format 999
column h11 format 999
column h12 format 999
column h13 format 999
column h14 format 999
column h15 format 999
column h16 format 999
column h17 format 999
column h18 format 999
column h19 format 999
column h20 format 999
column h21 format 999
column h22 format 999
column h23 format 999
column avg format 999.99
column day format a6

SELECT TRUNC (first_time) "Date", TO_CHAR (first_time, 'Dy') "Day", COUNT (1) "Total",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '00', 1, 0)) h0,
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '01', 1, 0)) "h1",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '02', 1, 0)) "h2",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '03', 1, 0)) "h3",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '04', 1, 0)) "h4",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '05', 1, 0)) "h5",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '06', 1, 0)) "h6",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '07', 1, 0)) "h7",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '08', 1, 0)) "h8",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '09', 1, 0)) "h9",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '10', 1, 0)) "h10",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '11', 1, 0)) "h11",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '12', 1, 0)) "h12",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '13', 1, 0)) "h13",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '14', 1, 0)) "h14",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '15', 1, 0)) "h15",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '16', 1, 0)) "h16",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '17', 1, 0)) "h17",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '18', 1, 0)) "h18",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '19', 1, 0)) "h19",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '20', 1, 0)) "h20",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '21', 1, 0)) "h21",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '22', 1, 0)) "h22",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '23', 1, 0)) "h23", to_char(ROUND (COUNT (1) / 24, 2),'fm99999999990.00') "Avg"
FROM gv$log_history
WHERE first_time >= trunc(SYSDATE) - 7
and thread# = inst_id
GROUP BY TRUNC (first_time), TO_CHAR (first_time, 'Dy')
ORDER BY 1 DESC;
SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL> SQL>   2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30
Date                Day         Total   H0   h1   h2   h3   h4   h5   h6   h7   h8   h9  h10  h11  h12  h13  h14  h15  h16  h17  h18  h19  h20  h21  h22  h23 Avg
------------------- ------ ---------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---------------
2025-12-08 00:00:00 Mon          1823   48    6    8    7    8    7    8    7   11  755  884   74    0    0    0    0    0    0    0    0    0    0    0    0 75.96
2025-12-07 00:00:00 Sun           480   38    8    9    8    6    8    9    7    9    9    9    8    8    8    9    8    9    8    8    9    7    7   10  261 20.00
2025-12-06 00:00:00 Sat           520   32    8    8    7    7   10    9    9    9    9   42    9    9    9    8    9    9    9    8    9    9    8    8  266 21.67
2025-12-05 00:00:00 Fri           568   33    9   11    8    7    9    9    9    9   19   19   18   13   11   15   17   23   16   11   10    7    9   10  266 23.67
2025-12-04 00:00:00 Thu           581   42    9   10    9    8    9    9    8   12   24   18   17   11   24   16   18   20   14   12    9    9    8   11  254 24.21
2025-12-03 00:00:00 Wed           585   55    8   10    8    7   10    9   10   12   14   19   20   11   13   19   20   20   17   11   10    8    9   10  255 24.38
2025-12-02 00:00:00 Tue           603   48    9   10    7    7   10    8    8   12   31   18   21   11   12   22   32   21   27   20   14    8   10   10  227 25.13
2025-12-01 00:00:00 Mon           554   51    7    7    6    7    7    8    7   10   14   18   17   11   13   16   18   17   17   14    8    8    9   11  253 23.08

8 rows selected.

SQL> 

三、AWR报告分析
#

拉取10:00~11:00的AWR报告

SQL> @$ORACLE_HOME/rdbms/admin/awrrpt.sql

3.1 整体负载概况
#

DB Time:29,845s ≈ 497 分钟,AAS ≈ 29,845 / (59×60) ≈ 8.4 个活跃会话

DB CPU ≈ 6,541s → 每秒约 1.84 CPU 秒,用满 16 核只用了 约 11–12% 的 CPU

OS 负载:%busy ~ 11.8%,%idle ~ 88%,%iowait ~ 5.4%

20251210load1.jpg
20251210load2.jpg

结论:CPU 很空闲,系统总体是“等待型”而不是“CPU 打满型”。

3.2 Top Wait Events
#

查看Top Wait 事件,找到真正卡在哪里。Top 10 Foreground Events:

log file switch (checkpoint incomplete)

  • Waits:12,509

  • Total Wait Time:≈ 12,288s(占 DB time ~41%

  • Avg wait:982 ms / 次

log file sync

  • Waits:4,112,201
  • Total Wait Time:≈ 6,929s(占 DB time ~23%
  • Avg wait:2 ms / 次
  • 大约 1,160 次 commit/s 左右

Wait Class 汇总也印证了这一点:

  • Configuration:45%(主要就是各种 log file switch、SQ contention 等
  • Commit:23%(log file sync
  • DB CPU:22%
  • Concurrency:8.6%
  • User I/O:7%

结论:绝大部分时间都耗在 redo / checkpoint 配置和并发控制上。

20251210topwait1.jpg

3.3 定位 SQL:高频 UPDATE
#

在 “SQL ordered by Elapsed Time” 中,你定位到 SQL ID:0vq0s6rm8fawn。它在 60 分钟内执行 3,763,703 次,约 1000+ 次/秒 UPDATE,直接带来大量 redo / 归档与频繁 log switch。

20251210sqle1.jpg

SQL如下:

UPDATE SOA_EX_DRUG SET DRUG_ID=:1 , CMM_NAME=:2 , CMMDT_NM=:3 , MED_CTGY_TYPE=:4 , NORM=:5 , MIN_PCKNG_CMPNY=:6 , APRV_NO=:7 , PRDCT_PLT=:8 , status=:9 , CNTRY_NAME=:10 , DEVICES=:11 , PRV_APRV_NO=:12 , CMM_NAME_PY=:13 , PRDCT_CREDIT=:14 , PRDCT_ID=:15 , PACK_QUANTITY=:16 , MGT_CAT_NAME=:17 , PROD_CAT_NAME=:18 , REGIST_FORM_NAME=:19 , CLASS_SUBCATALOG_V17_NAME=:20 , LEVEL1_PROD_CAT_V17_NAME=:21 , LEVEL2_PROD_CAT_V17_NAME=:22 , SPU_FLAG=:23 WHERE ID=:24

进一步用 dba_hist_sqlstat 验证:12/5 之前 executions 仅 2 万级,到了 12/8 突增到 300 万级(增长 167 倍)。

SQL> set linesize 300
SQL> select a.snap_id,b.begin_interval_time,a.executions_delta from dba_hist_sqlstat a,dba_hist_snapshot b where a.sql_id='0vq0s6rm8fawn' and a.snap_id=b.snap_id order by a.snap_id;

   SNAP_ID BEGIN_INTERVAL_TIME                                                              EXECUTIONS_DELTA
---------- -------------------------------------------------------------------------------- ----------------
     64968 02-DEC-25 02.00.04.126 PM                                                                   22446
     64969 02-DEC-25 03.00.06.304 PM                                                                   35605
     64989 03-DEC-25 11.00.53.713 AM                                                                   25956
     65011 04-DEC-25 09.00.47.148 AM                                                                   22293
     65018 04-DEC-25 04.00.05.309 PM                                                                   15736
     65040 05-DEC-25 02.00.01.717 PM                                                                    5283
     65042 05-DEC-25 04.00.06.122 PM                                                                   51054
     65107 08-DEC-25 09.00.51.667 AM                                                                 3144734
     65108 08-DEC-25 10.00.56.171 AM                                                                 3763703
     65109 08-DEC-25 11.00.03.946 AM                                                                  318640
     65110 08-DEC-25 12.00.35.037 PM                                                                 3151397
     65111 08-DEC-25 01.00.43.053 PM                                                                 3999423
     65112 08-DEC-25 02.00.29.114 PM                                                                 2629413

13 rows selected.

SQL>

一直到下午14点多查看oratop还是显示SQL Id:0vq0s6rm8fawn频繁执行。造成大量log file switch (checkpoint incomplete)等待。

oratop20251208.png

结论:高频的更新SQL语句,导致log file switch (checkpoint incomplete)一小时内 12,509 次,平均每次接近 1 秒

四、问题解决
#

4.1 应用侧
#

第一步:与开发沟通,把高频 SQL 告知对方,推动应用修改。

应用修改后,日志切换频率正常。如下:

SQL> SELECT TRUNC (first_time) "Date", TO_CHAR (first_time, 'Dy') "Day", COUNT (1) "Total",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '00', 1, 0)) h0,
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '01', 1, 0)) "h1",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '02', 1, 0)) "h2",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '03', 1, 0)) "h3",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '04', 1, 0)) "h4",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '05', 1, 0)) "h5",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '06', 1, 0)) "h6",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '07', 1, 0)) "h7",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '08', 1, 0)) "h8",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '09', 1, 0)) "h9",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '10', 1, 0)) "h10",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '11', 1, 0)) "h11",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '12', 1, 0)) "h12",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '13', 1, 0)) "h13",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '14', 1, 0)) "h14",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '15', 1, 0)) "h15",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '16', 1, 0)) "h16",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '17', 1, 0)) "h17",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '18', 1, 0)) "h18",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '19', 1, 0)) "h19",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '20', 1, 0)) "h20",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '21', 1, 0)) "h21",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '22', 1, 0)) "h22",
SUM (DECODE (TO_CHAR (first_time, 'hh24'), '23', 1, 0)) "h23", to_char(ROUND (COUNT (1) / 24, 2),'fm99999999990.00') "Avg"
FROM gv$log_history
WHERE first_time >= trunc(SYSDATE) - 7
and thread# = inst_id
GROUP BY TRUNC (first_time), TO_CHAR (first_time, 'Dy')
ORDER BY 1 DESC;  2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30

Date                Day         Total   H0   h1   h2   h3   h4   h5   h6   h7   h8   h9  h10  h11  h12  h13  h14  h15  h16  h17  h18  h19  h20  h21  h22  h23 Avg
------------------- ------ ---------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---------------
2025-12-10 00:00:00 Wed           333   56    9   10    7    7    8    7    9   18   22   20   16   13   12   16   18   19   16   10    7    8    7    9    9 13.88
2025-12-09 00:00:00 Tue           607    0    0    0    0    0    0    0    0   67   72   16   15   10   12   17   19   44   27   11   10    7    8    7  265 25.29
2025-12-08 00:00:00 Mon          4364   48    6    8    7    8    7    8    7   11  755  884   83  725  884  622   17   17   14    9    9    7    7   10  211 181.83
2025-12-07 00:00:00 Sun           480   38    8    9    8    6    8    9    7    9    9    9    8    8    8    9    8    9    8    8    9    7    7   10  261 20.00
2025-12-06 00:00:00 Sat           520   32    8    8    7    7   10    9    9    9    9   42    9    9    9    8    9    9    9    8    9    9    8    8  266 21.67
2025-12-05 00:00:00 Fri           568   33    9   11    8    7    9    9    9    9   19   19   18   13   11   15   17   23   16   11   10    7    9   10  266 23.67
2025-12-04 00:00:00 Thu           581   42    9   10    9    8    9    9    8   12   24   18   17   11   24   16   18   20   14   12    9    9    8   11  254 24.21
2025-12-03 00:00:00 Wed           585   55    8   10    8    7   10    9   10   12   14   19   20   11   13   19   20   20   17   11   10    8    9   10  255 24.38
SQL>

4.2 数据库侧:扩大 redo log size(从 50MB 扩到 2GB)
#

当时的 redo 配置:3 组、每组 50MB(默认)。然后新增 3 组 2GB(group 4/5/6),切换后删除旧组(1/2/3)。检查 checkpoint / mttr 参数,查看 FAST_START_MTTR_TARGET / LOG_CHECKPOINT_INTERVAL / LOG_CHECKPOINT_TIMEOUT避免设置得太小,导致频繁 checkpoint。

-- 1) 查看现状
SQL> select group#, bytes/1024/1024 mb, status from v$log;

    GROUP#         MB STATUS
---------- ---------- ----------------
         1         50 CURRENT
         2         50 INACTIVE
         3         50 INACTIVE

SQL> col member for a80
SQL> select group#, member from v$logfile;

    GROUP# MEMBER
---------- --------------------------------------------------------------------------------
         1 /u01/app/oracle/oradata/mddb/redo01.log
         2 /u01/app/oracle/oradata/mddb/redo02.log
         3 /u01/app/oracle/oradata/mddb/redo03.log

-- 2) 新增大 redo
SQL> alter database add logfile group 4 '/u01/app/oracle/oradata/mddb/redo04.log' size 2g;

Database altered.

SQL> alter database add logfile group 5 '/u01/app/oracle/oradata/mddb/redo05.log' size 2g;

Database altered.

SQL> alter database add logfile group 6 '/u01/app/oracle/oradata/mddb/redo06.log' size 2g;

Database altered.

-- 3) 手工切换,确保旧组不再 ACTIVE/CURRENT,再删除
SQL> alter system switch logfile;

System altered.

SQL> /

System altered.

SQL> select group#, bytes/1024/1024 mb, status from v$log;

    GROUP#         MB STATUS
---------- ---------- ----------------
         1         50 INACTIVE
         2         50 INACTIVE
         3         50 INACTIVE
         4       2048 INACTIVE
         5       2048 ACTIVE
         6       2048 CURRENT

6 rows selected.

SQL> alter database drop logfile group 1;

Database altered.

SQL> alter database drop logfile group 2;

Database altered.

SQL> alter database drop logfile group 3;

Database altered.

-- 4) 验证
SQL> select group#, bytes/1024/1024 mb, status from v$log;

    GROUP#         MB STATUS
---------- ---------- ----------------
         4       2048 INACTIVE
         5       2048 ACTIVE
         6       2048 CURRENT

-- 5) 检查参数,为默认,系统自动触发/控制 checkpoint
SQL> show parameter fast_start_mttr_target

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
fast_start_mttr_target               integer     0
SQL> show parameter log_check

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
log_checkpoint_interval              integer     0
log_checkpoint_timeout               integer     1800
log_checkpoints_to_alert             boolean     FALSE
SQL>

五、总结
#

12 月 8 日 11 点生产库磁盘使用率超过 90%,排查发现主要由归档日志快速增长导致(/u01 下 fast_recovery_area 占用明显)。进一步查看日志切换,9 点后归档切换从 11 次飙升到 755 次。拉取 10:00–11:00 AWR 后确认系统属于“等待型”,主要等待集中在 log file switch (checkpoint incomplete)(12,509 次,约 41% DB time)和 log file sync(4,112,201 次,约 23% DB time)。同期 Top SQL 中 0vq0s6rm8fawn 在 60 分钟内执行 3,763,703 次(约 1000+ 次/秒),并且相较 12/5 执行次数突增 167 倍,是导致 redo/归档暴涨的关键诱因。处置上先通知开发处理高频 SQL,同时将 redo log 从 50MB 3 组扩容为 2GB 3 组并完成切换与清理,缓解日志切换与 checkpoint 压力。