一、故障现象#
2026 年 5 月 13 日下午,WMS 系统数据库在 16:50–17:06 左右出现短时间卡顿,应用侧报错:
ORA-03135: connection lost contact
Process ID: 0
Session ID: 0
Serial number: 0现场检查发现,应用服务器到数据库服务器的网络连通性正常:
ping 正常
telnet 数据库监听端口正常问题持续几分钟后自动恢复,数据库实例没有重启。由于是故障发生后第二天才介入,数据库服务器当时没有安装 OSWatcher / oswbb 这类操作系统历史采集工具,无法完整回放当时 CPU、内存、I/O、进程级资源使用情况。因此本次排查主要依赖以下信息:
1. Oracle alert 日志
2. 故障时间段 ASH 历史会话等待信息
3. Oracle 内存参数
4. v$sga_resize_ops 中的 SGA resize 记录
5. sysstat/sar 保留下来的 swap 历史数据二、alert 日志关键时间线#
alert 日志中,故障窗口内有三类重要信息。
1. 连接超时#
16:50 以后,alert 日志开始出现连接超时:
2026-05-13T16:35:12.506429+08:00
Thread 1 advanced to log sequence 58602 (LGWR switch)
Current log# 4 seq# 58602 mem# 0: /home/oracle/oracle/app/oradata/WMSPROD/redo04.log
2026-05-13T16:35:12.585976+08:00
TT03 (PID:6908): SRL selected for T-1.S-58602 for LAD:2
2026-05-13T16:35:13.192971+08:00
ARC1 (PID:17839): Archived Log entry 60125 added for T-1.S-58601 ID 0x50fa7345 LAD:1
2026-05-13T16:50:52.780199+08:00
***********************************************************************
Fatal NI connect error 12170.
VERSION INFORMATION:
TNS for Linux: Version 18.0.0.0.0 - Production
Oracle Bequeath NT Protocol Adapter for Linux: Version 18.0.0.0.0 - Production
TCP/IP NT Protocol Adapter for Linux: Version 18.0.0.0.0 - Production
Version 18.3.0.0.0
Time: 13-5月 -2026 16:50:52
Tracing not turned on.
Tns error struct:
ns main err code: 12535
TNS-12535: TNS: 操作超时
ns secondary err code: 12606
nt main err code: 0
nt secondary err code: 0
nt OS err code: 0
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=172.16.76.126)(PORT=50927))
2026-05-13T16:50:52.802481+08:00
WARNING: inbound connection timed out (ORA-3136)
2026-05-13T16:51:05.680349+08:00
***********************************************************************
Fatal NI connect error 12170.
VERSION INFORMATION:
TNS for Linux: Version 18.0.0.0.0 - Production
Oracle Bequeath NT Protocol Adapter for Linux: Version 18.0.0.0.0 - Production
TCP/IP NT Protocol Adapter for Linux: Version 18.0.0.0.0 - Production
Version 18.3.0.0.0
Time: 13-5月 -2026 16:51:05
Tracing not turned on.
Tns error struct:
2026-05-13T16:51:05.680946+08:00
***********************************************************************
ns main err code: 12535
Fatal NI connect error 12170.
TNS-12535: TNS: 操作超时
VERSION INFORMATION:
TNS for Linux: Version 18.0.0.0.0 - Production
Oracle Bequeath NT Protocol Adapter for Linux: Version 18.0.0.0.0 - Production
TCP/IP NT Protocol Adapter for Linux: Version 18.0.0.0.0 - Production
ns secondary err code: 12606
nt main err code: 0
Version 18.3.0.0.0
nt secondary err code: 0
nt OS err code: 0
Time: 13-5月 -2026 16:51:05
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=172.16.76.125)(PORT=32992))
2026-05-13T16:51:05.681668+08:00
WARNING: inbound connection timed out (ORA-3136)
Tracing not turned on.
Tns error struct:
ns main err code: 12535
TNS-12535: TNS: 操作超时
ns secondary err code: 12606
nt main err code: 0
nt secondary err code: 0
nt OS err code: 0
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=172.16.76.125)(PORT=33012))
2026-05-13T16:51:05.682861+08:00
WARNING: inbound connection timed out (ORA-3136)
2026-05-13T16:51:06.750191+08:00
***********************************************************************
2026-05-13T16:51:06.750193+08:00
***********************************************************************
Fatal NI connect error 12170.
Fatal NI connect error 12170.
VERSION INFORMATION:
TNS for Linux: Version 18.0.0.0.0 - Production
Oracle Bequeath NT Protocol Adapter for Linux: Version 18.0.0.0.0 - Production
TCP/IP NT Protocol Adapter for Linux: Version 18.0.0.0.0 - Production
VERSION INFORMATION:
TNS for Linux: Version 18.0.0.0.0 - Production
Oracle Bequeath NT Protocol Adapter for Linux: Version 18.0.0.0.0 - Production
TCP/IP NT Protocol Adapter for Linux: Version 18.0.0.0.0 - Production
Version 18.3.0.0.0
Version 18.3.0.0.0
Time: 13-5月 -2026 16:51:06
Time: 13-5月 -2026 16:51:06
Tracing not turned on.
Tracing not turned on.
Tns error struct:
Tns error struct:
ns main err code: 12535
ns main err code: 12535
TNS-12535: TNS: 操作超时
TNS-12535: TNS: 操作超时
ns secondary err code: 12606
ns secondary err code: 12606
nt main err code: 0
nt main err code: 0
nt secondary err code: 0
nt secondary err code: 0
nt OS err code: 0
nt OS err code: 0
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=172.16.76.125)(PORT=33032))
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=172.16.76.125)(PORT=33054))
2026-05-13T16:51:06.751531+08:00
WARNING: inbound connection timed out (ORA-3136)
2026-05-13T16:51:06.751552+08:00
WARNING: inbound connection timed out (ORA-3136)
2026-05-13T16:51:06.760173+08:00
***********************************************************************这些错误与应用侧 ORA-03135 时间基本吻合。应用端看到的是连接丢失,数据库服务端看到的是入站连接处理超时。
这类错误说明:
TCP 连接或连接请求能到达数据库主机,但数据库端无法及时完成入站连接处理。
因此,排查方向从网络层转向数据库内部等待、后台进程状态、内存管理和操作系统内存压力。
2. Hang Manager 检测到数据库内部 hang#
alert 日志中最关键的信息是16:56:24~16:59:35 DIA0 Hang Manager 报出的 hang chain:
***********************************************************************
Fatal NI connect error 12170.
VERSION INFORMATION:
TNS for Linux: Version 18.0.0.0.0 - Production
Oracle Bequeath NT Protocol Adapter for Linux: Version 18.0.0.0.0 - Production
TCP/IP NT Protocol Adapter for Linux: Version 18.0.0.0.0 - Production
Version 18.3.0.0.0
Time: 13-5月 -2026 16:54:47
Tracing not turned on.
Tns error struct:
ns main err code: 12535
TNS-12535: TNS: 操作超时
ns secondary err code: 12606
nt main err code: 0
nt secondary err code: 0
nt OS err code: 0
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.44.150)(PORT=55992))
2026-05-13T16:54:47.544739+08:00
WARNING: inbound connection timed out (ORA-3136)
2026-05-13T16:56:24.151586+08:00
DIA0 Critical Database Process As Root: Hang ID 280 blocks 59 sessions
Final blocker is session ID 827 serial# 50907 OSPID 17153 on Instance 1
No resolution will be attempted by Hang Manager
2026-05-13T16:56:50.127317+08:00
DIA0 Critical Database Process Blocked: Hang ID 280 blocks 59 sessions
Final blocker is session ID 827 serial# 50907 OSPID 17153 on Instance 1
No resolution will be attempted by Hang Manager
2026-05-13T16:57:24.327262+08:00
DIA0 Critical Database Process Blocked: Hang ID 280 blocks 93 sessions
Final blocker is session ID 827 serial# 50907 OSPID 17153 on Instance 1
No resolution will be attempted by Hang Manager
2026-05-13T16:59:35.273959+08:00
DIA0 Critical Database Process Blocked: Hang ID 280 blocks 354 sessions
Final blocker is session ID 827 serial# 50907 OSPID 17153 on Instance 1
No resolution will be attempted by Hang Manager
2026-05-13T17:00:34.444892+08:00这说明当时数据库内部已经形成明显阻塞链,最终阻塞源是:
SID : 827
SERIAL# : 50907
OSPID : 17153
Instance : 1并且阻塞影响范围从59个会话扩大到 354 个会话。
3. alert 日志提示 heavy swapping#
17:01:37 左右,alert 日志出现:
2026-05-13T17:00:36.331741+08:00
WARNING: inbound connection timed out (ORA-3136)
TNS-12535: TNS: 操作超时
ns secondary err code: 12606
nt main err code: 0
nt secondary err code: 0
nt OS err code: 0
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=172.16.76.125)(PORT=33604))
2026-05-13T17:00:36.332656+08:00
WARNING: inbound connection timed out (ORA-3136)
2026-05-13T17:00:40.371253+08:00
ALTER SYSTEM ARCHIVE LOG
2026-05-13T17:00:40.378963+08:00
Thread 1 cannot allocate new log, sequence 58603
Private strand flush not complete
Current log# 4 seq# 58602 mem# 0: /home/oracle/oracle/app/oradata/WMSPROD/redo04.log
2026-05-13T17:00:40.554161+08:00
Thread 1 advanced to log sequence 58603 (LGWR switch)
Current log# 5 seq# 58603 mem# 0: /home/oracle/oracle/app/oradata/WMSPROD/redo05.log
2026-05-13T17:00:40.677566+08:00
TT03 (PID:6908): SRL selected for T-1.S-58603 for LAD:2
2026-05-13T17:00:41.181641+08:00
NET (PID:3977): Archived Log entry 60127 added for T-1.S-58602 ID 0x50fa7345 LAD:1
2026-05-13T17:01:37.540212+08:00
WARNING: Heavy swapping observed on system in last 5 mins.
Heavy swapping can lead to timeouts, poor performance, and instance eviction.
2026-05-13T17:04:53.657779+08:00
Resize operation completed for file# 3, old size 8396800K, new size 8407040K
2026-05-13T17:42:48.649406+08:00
Thread 1 cannot allocate new log, sequence 58604
Private strand flush not complete
Current log# 5 seq# 58603 mem# 0: /home/oracle/oracle/app/oradata/WMSPROD/redo05.log
2026-05-13T17:42:51.719239+08:00
Thread 1 advanced to log sequence 58604 (LGWR switch)
Current log# 6 seq# 58604 mem# 0: /home/oracle/oracle/app/oradata/WMSPROD/redo06.logWARNING: Heavy swapping observed on system in last 5 mins.
Heavy swapping can lead to timeouts, poor performance, and instance eviction.
这条信息说明 Oracle 自身已经检测到操作系统最近 5 分钟存在严重换页。虽然没有 OSWatcher,但这条告警已经提示操作系统内存压力很可能与本次故障有关。
4. 初步结论#
通过告警日志文件,判断数据库主机在 5月13日 16:50–17:01 左右出现短时间严重卡顿/阻塞/内存换页,导致大量客户端连接超时,应用侧表现为 ORA-03135。
| 时间 | 关键现象 | 判断 |
|---|---|---|
| 16:50:52 开始 | Fatal NI connect error 12170、TNS-12535: 操作超时 | 客户端连接 Oracle 网络层超时 |
| 16:50:52–17:00:36 | 大量 WARNING: inbound connection timed out (ORA-3136) | 数据库服务端无法及时完成客户端连接握手 |
| 16:56:24–16:59:35 | DIA0 Hang Manager 报 Hang ID 280,阻塞会话从 59 增长到 354 | 数据库内部出现短时间严重 hang/blocking |
| 17:01:37 | WARNING: Heavy swapping observed on system in last 5 mins | 操作系统发生重度 swap,可能导致超时、性能下降、甚至实例驱逐风险 |
三、ASH 分析#
1. 查当时所有 shared pool latch 等待的来源#
根据alter日志DIA0 Hang Manager 报出的 hang chain先查询 ASH 5月13日 16:45–17:05 查最终阻塞会话当时在干什么
set lines 300 pages 1000
col sample_time for a20
col username for a20
col machine for a40
col program for a50
col module for a40
col action for a40
col event for a40
col sql_id for a15
col top_level_sql_id for a15
col p1text for a25
col p2text for a25
col p3text for a25
select to_char(ash.sample_time,'yyyy-mm-dd hh24:mi:ss') sample_time,
ash.session_id sid,
ash.session_serial# serial#,
u.username,
ash.machine,
ash.program,
ash.module,
ash.action,
ash.sql_id,
ash.sql_child_number,
ash.sql_exec_id,
to_char(ash.sql_exec_start,'yyyy-mm-dd hh24:mi:ss') sql_exec_start,
ash.top_level_sql_id,
ash.session_state,
ash.event,
ash.wait_class,
ash.p1text,
ash.p1,
ash.p2text,
ash.p2,
ash.p3text,
ash.p3
from dba_hist_active_sess_history ash
left join dba_users u
on ash.user_id = u.user_id
where ash.sample_time between
to_timestamp('2026-05-13 16:45:00','yyyy-mm-dd hh24:mi:ss')
and to_timestamp('2026-05-13 17:05:00','yyyy-mm-dd hh24:mi:ss')
and ash.session_id = 827
and ash.session_serial# = 50907
order by ash.sample_time;查询结果如下:
SAMPLE_TIME SID SERIAL# USERNAME MACHINE PROGRAM MODULE ACTION SQL_ID SQL_CHILD_NUMBER SQL_EXEC_ID
-------------------- ---------- ---------- -------------------- ---------------------------------------- -------------------------------------------------- ---------------------------------------- ---------------------------------------- --------------- ---------------- -----------
SQL_EXEC_START
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
TOP_LEVEL_SQL_I SESSION_STATE EVENT WAIT_CLASS P1TEXT P1 P2TEXT P2 P3TEXT P3
--------------- ---------------------------- ---------------------------------------- ---------------------------------------------------------------- ------------------------- ---------- ------------------------- ---------- ------------------------- ----------
2026-05-13 16:48:51 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:49:01 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:49:11 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:49:21 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:49:31 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:49:41 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:49:51 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:50:01 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:50:11 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:50:21 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:50:31 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:50:41 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:50:51 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:51:01 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:51:11 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:51:21 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:51:31 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:51:41 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:51:51 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:52:01 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:52:11 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:52:22 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:52:33 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:52:43 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:52:53 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:53:03 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:53:13 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:53:23 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:53:33 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:53:43 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:53:53 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:54:03 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:54:13 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:54:23 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:54:33 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:54:43 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:54:53 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:55:03 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:55:13 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:55:23 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:55:33 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:55:43 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:55:53 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:56:03 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:56:13 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:56:23 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:56:33 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:56:43 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:56:53 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:57:03 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:57:13 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:57:23 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:57:33 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:57:43 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:57:53 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:58:03 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:58:13 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:58:24 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:58:34 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:58:44 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:58:54 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:59:05 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:59:15 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:59:25 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:59:35 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:59:45 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 16:59:55 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 17:00:05 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 17:00:15 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 17:00:25 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 17:00:35 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
2026-05-13 17:00:45 827 50907 SYS bogon oracle@bogon (MMAN) 0
ON CPU timeout 300 0 0
已选择 72 行。
SQL>2. 关键发现#
ASH 查询显示:
SID : 827
SERIAL# : 50907
USERNAME : SYS
MACHINE : bogon
PROGRAM : oracle@bogon (MMAN)
SESSION_STATE : ON CPU
SQL_ID : 空
时间范围 : 2026-05-13 16:48:51 ~ 17:00:45也就是说,告警日志里提到的最终 blocker:
Final blocker is session ID 827 serial# 50907 OSPID 17153实际对应的是:
oracle@bogon (MMAN)MMAN 是 Oracle 的 Memory Manager 内存管理后台进程,负责 SGA 自动内存管理、shared pool / buffer cache / large pool 等内存组件调整。你的文件里 827 会话从 16:48:51 到 17:00:45 一直被 ASH 采样为 ON CPU,没有 SQL_ID,程序名明确是 oracle@bogon (MMAN)。
这和告警日志完全对得上:16:56:24 开始 DIA0 Hang Manager 报 Hang ID 280,最终 blocker 就是 SID 827 / SERIAL# 50907 / OSPID 17153,到 16:59:35 已经阻塞 354 个会话。
最终阻塞源是 MMAN是 Oracle 的 Memory Manager 后台进程,负责 SGA 自动内存管理和各组件之间的内存调整。这个发现将问题方向转到了 Oracle 自动内存管理。
四、内存参数分析:AMM 关闭,ASMM 开启,但 shared_pool_size 为 0#
检查数据库内存参数:
set lines 300 pages 1000
col name for a35
col value for a30
col display_value for a40
select name, value, display_value, isdefault
from v$parameter
where name in (
'memory_target',
'memory_max_target',
'sga_target',
'sga_max_size',
'shared_pool_size',
'shared_pool_reserved_size',
'pga_aggregate_target',
'pga_aggregate_limit',
'use_large_pages'
)
order by name;现场结果:
| 参数 | 值 | 说明 |
|---|---|---|
memory_target | 0 | AMM 未开启 |
memory_max_target | 0 | AMM 未开启 |
sga_target | 35G | ASMM 开启 |
sga_max_size | 35G | SGA 最大值 35G |
shared_pool_size | 0 | shared pool 未设置固定下限 |
shared_pool_reserved_size | 约 185MB | 默认值 |
pga_aggregate_target | 7G | PGA 目标值 |
pga_aggregate_limit | 20G | PGA 上限 |
use_large_pages | TRUE | 使用 HugePages |
从参数可以判断:
AMM 未开启:
memory_target = 0
memory_max_target = 0
ASMM 已开启:
sga_target = 35G
sga_max_size = 35G
shared pool 未设置明确下限:
shared_pool_size = 0这表示 Oracle 可以在 sga_target=35G 内部自动调整 shared pool、buffer cache、large pool 等组件大小。
这个配置并不是绝对错误,但对于 WMS 这类连接数多、SQL 解析频繁、library cache 活跃的系统来说,如果 shared_pool_size=0,shared pool 完全交给 Oracle 自动调整,在业务高峰可能存在风险。
五、关键证据:故障窗口内发生了 SGA resize#
继续查询 v$sga_resize_ops:
set lines 300 pages 1000
col start_time for a20
col end_time for a20
col component for a35
col oper_type for a20
col oper_mode for a20
col status for a15
select to_char(start_time,'yyyy-mm-dd hh24:mi:ss') start_time,
to_char(end_time,'yyyy-mm-dd hh24:mi:ss') end_time,
component,
oper_type,
oper_mode,
initial_size/1024/1024 initial_mb,
target_size/1024/1024 target_mb,
final_size/1024/1024 final_mb,
status
from v$sga_resize_ops
where start_time between
to_date('2026-05-13 16:30:00','yyyy-mm-dd hh24:mi:ss')
and to_date('2026-05-13 17:10:00','yyyy-mm-dd hh24:mi:ss')
order by start_time;结果如下:
| START_TIME | END_TIME | COMPONENT | OPER_TYPE | OPER_MODE | INITIAL_MB | TARGET_MB | FINAL_MB | STATUS |
|---|---|---|---|---|---|---|---|---|
| 2026-05-13 16:48:47 | 2026-05-13 17:00:50 | DEFAULT buffer cache | GROW | DEFERRED | 25344 | 25472 | 25472 | COMPLETE |
| 2026-05-13 16:48:47 | 2026-05-13 17:00:50 | shared pool | SHRINK | DEFERRED | 7680 | 7552 | 7552 | COMPLETE |
这两条记录非常关键:
shared pool : 7680 MB -> 7552 MB,缩小 128 MB
buffer cache : 25344 MB -> 25472 MB,增大 128 MB
开始时间 : 16:48:47
结束时间 : 17:00:50
执行进程 : MMAN与 ASH 中 MMAN 从 16:48:51 到 17:00:45 持续处于 ON CPU 的时间完全吻合。
正常情况下,SGA resize 不应持续 12 分钟,更不应引发 300 多个会话等待 latch: shared pool。因此可以判断,这次 ASMM 自动调整是故障链路中的核心触发点。
六、OS 层 sar 验证:故障窗口确实存在 swap-in#
由于没有部署 OSWatcher,最初无法回放操作系统详细状态。但后续检查发现,故障当天的 sysstat 采集文件仍然存在:
ls -lh /var/log/sa/sa13结果:
-rw-r--r-- 1 root root 1.2M 5月 13 23:50 /var/log/sa/sa13继续查看故障窗口内的 swap 活动:
sar -W -f /var/log/sa/sa13 -s 16:30:00 -e 17:10:00结果:
Linux 3.10.0-957.el7.x86_64 (bogon) 2026年05月13日 _x86_64_ (32 CPU)
16时30分01秒 pswpin/s pswpout/s
16时40分01秒 5.89 0.20
16时50分01秒 34.98 0.59
17时00分01秒 430.11 0.77
平均时间: 156.93 0.52字段含义:
| 字段 | 含义 |
|---|---|
pswpin/s | 每秒从 swap 读回内存的页数 |
pswpout/s | 每秒写入 swap 的内存页数 |
从结果看,故障窗口内 swap-in 明显升高:
| 时间 | pswpin/s | pswpout/s | 判断 |
|---|---|---|---|
| 16:40:01 | 5.89 | 0.20 | 已有少量 swap-in |
| 16:50:01 | 34.98 | 0.59 | swap-in 升高 |
| 17:00:01 | 430.11 | 0.77 | 明显异常,故障核心窗口 |
| 平均 | 156.93 | 0.52 | 故障窗口平均 swap-in 偏高 |
pswpin/s=430.11 表示在对应采样周期内,系统平均每秒从 swap 读回约 430 个内存页。如果按 Linux 常见 4KB page size 粗略估算:
430.11 pages/s × 4KB ≈ 1.68 MB/s这说明故障核心窗口内,数据库服务器存在明显 swap-in。对于数据库系统来说,swap-in 会带来明显延迟,尤其当 Oracle 后台进程或关键前台进程访问到已经被换出的内存页时,可能导致响应变慢、持有 latch 时间变长、连接处理超时等问题。
这与 alert 日志中的 heavy swapping 完全吻合:
WARNING: Heavy swapping observed on system in last 5 mins.因此,OS 层 sar 数据补充证明:
本次故障不只是 Oracle 内部 shared pool resize 卡顿,还叠加了 OS 层 swap-in。swap 活动很可能放大了 MMAN 持有 shared pool latch 的时间,导致大量会话等待和连接超时。
七、故障链路表格#
| 阶段 | 时间点 / 时间段 | 证据来源 | 关键现象 | 分析判断 |
|---|---|---|---|---|
| 1 | 故障前 | 参数检查 | memory_target=0、sga_target=35G、shared_pool_size=0 | AMM 关闭,ASMM 开启,shared pool 未设置明确下限 |
| 2 | 16:48:47 | v$sga_resize_ops | shared pool SHRINK 7680MB -> 7552MB | ASMM 开始自动缩小 shared pool |
| 3 | 16:48:47 | v$sga_resize_ops | DEFAULT buffer cache GROW 25344MB -> 25472MB | Oracle 将 shared pool 释放的 128MB 转给 buffer cache |
| 4 | 16:48:51–17:00:45 | ASH | SID 827 / SERIAL# 50907 为 oracle@bogon (MMAN),持续 ON CPU | 执行 resize 的 MMAN 长时间活跃 |
| 5 | 16:48:51 起 | ASH | 大量会话等待 latch: shared pool,阻塞源指向 SID 827 | MMAN 成为 shared pool latch 阻塞源 |
| 6 | 16:49–16:56 | ASH | cursor: pin S wait on X、cursor: pin S、library cache: mutex X 增多 | shared pool / library cache / cursor 争用开始扩散 |
| 7 | 16:50 以后 | alert 日志 | Fatal NI connect error 12170、TNS-12535、ORA-3136 | 数据库端无法及时完成入站连接处理 |
| 8 | 16:56:24 | alert 日志 | Hang ID 280 blocks 59 sessions,final blocker 为 SID 827 / OSPID 17153 | Hang Manager 检测到数据库内部 hang |
| 9 | 16:59:35 | alert 日志 | Hang ID 280 blocks 354 sessions | 阻塞影响面扩大到 300 多个会话 |
| 10 | 17:00:01 | sar -W | pswpin/s=430.11 | OS 层发生明显 swap-in,可能放大数据库卡顿 |
| 11 | 17:00:50 | v$sga_resize_ops | SGA resize 状态 COMPLETE | shared pool shrink / buffer cache grow 操作完成 |
| 12 | 17:01:37 | alert 日志 | WARNING: Heavy swapping observed on system in last 5 mins | Oracle 检测到最近 5 分钟系统 heavy swapping |
| 13 | 17:00 以后 | ASH / 业务反馈 | shared pool latch 阻塞逐步下降,业务逐步恢复 | MMAN 释放阻塞后,数据库恢复处理能力 |
| 14 | 应用侧表现 | 应用日志 | ORA-03135,且 Process ID / Session ID / Serial# 均为 0 | 新连接未完成有效会话建立即超时断开 |
八、解决问题#
1. 给 shared pool 设置明确下限#
当前 shared_pool_size=0,建议设置明确下限,避免 Oracle 在业务高峰自动 shrink shared pool。
根据本次故障前 shared pool 为 7680MB,建议先设置为 8G:
alter system set shared_pool_size=8G scope=spfile;然后安排业务低峰窗口重启数据库生效。
如果必须在线调整,也可以使用:
alter system set shared_pool_size=8G scope=both;但在线调整本身可能触发新的 SGA resize,不建议在业务高峰执行。
2. 保留 ASMM,但控制关键内存组件下限#
当前 AMM 已关闭,这是生产库较常见的方向。但在保留 ASMM 的情况下,建议对关键组件设置合理下限:
shared_pool_size
large_pool_size
java_pool_size
db_cache_size对于 WMS 这类连接多、游标多、SQL 解析频繁的系统,shared pool 不建议完全交给 Oracle 自动调整。
3. 检查物理内存是否足够#
当前数据库参数:
sga_target = 35G
pga_aggregate_target = 7G
pga_aggregate_limit = 20GSGA 和 PGA target 合计已经约 42G。如果加上操作系统、Oracle 进程私有内存、监听、Agent、备份、监控等开销,物理内存不足时可能发生 swap。
建议检查:
free -g
cat /proc/meminfo | egrep "MemTotal|MemFree|MemAvailable|SwapTotal|SwapFree|HugePages|Commit"
sar -r -f /var/log/sa/sa13 -s 16:30:00 -e 17:10:00
sar -q -f /var/log/sa/sa13 -s 16:30:00 -e 17:10:00重点关注:
MemAvailable
SwapFree
pswpin/s
pswpout/s
%commit
runq-sz4. 排查 shared pool 压力来源#
检查是否存在大量硬解析、SQL 未绑定变量、子游标过多、对象频繁失效等情况:
set lines 300 pages 1000
col sql_text for a100
select sql_id,
count(*) child_cnt,
sum(executions) executions,
sum(parse_calls) parse_calls,
sum(loads) loads,
sum(invalidations) invalidations,
max(version_count) version_count,
substr(max(sql_text),1,100) sql_text
from v$sql
group by sql_id
having count(*) > 20
order by child_cnt desc;针对故障期间 ASH 中出现较多的 SQL_ID,可以继续查询 SQL 文本:
set lines 300 pages 1000 long 100000
col sql_id for a15
col sql_text for a200
select sql_id,
dbms_lob.substr(sql_text,4000,1) sql_text
from dba_hist_sqltext
where sql_id in (
'3fkbgd5c88nbn',
'77gs44zhuh7jn',
'd9q2q82g1tuzx',
'8gmzzh25xx5ys',
'9uapzrwn9k1vq',
'a4c1g6rpjbcpu'
);5. 监控 SGA resize#
建议建立定期巡检,关注 shared pool 是否频繁 shrink/grow:
set lines 300 pages 1000
col start_time for a20
col end_time for a20
col component for a30
col oper_type for a15
col status for a15
select to_char(start_time,'yyyy-mm-dd hh24:mi:ss') start_time,
to_char(end_time,'yyyy-mm-dd hh24:mi:ss') end_time,
component,
oper_type,
oper_mode,
initial_size/1024/1024 initial_mb,
target_size/1024/1024 target_mb,
final_size/1024/1024 final_mb,
status
from v$sga_resize_ops
where start_time >= sysdate - 7
and component in ('shared pool','DEFAULT buffer cache')
order by start_time;如果 shared pool 经常出现 SHRINK/GROW,说明 ASMM 对 shared pool 调整频繁,需要进一步固定 shared pool 下限或优化 SQL 解析压力。
6. 部署 OSWatcher 或等效采集#
本次虽然通过 /var/log/sa/sa13 找到了 swap 证据,但 sar 默认 10 分钟粒度较粗,不能完整还原故障期间每秒级资源状态。
建议部署:
OSWatcher / oswbb
或至少定时采集:
vmstat
iostat
sar
top
ps建议至少保留 30 天历史数据,便于故障追溯。
九、总结#
本次故障不是网络中断,也不是普通业务行锁,而是 Oracle 18c 在关闭 AMM、开启 ASMM 的配置下,由于 shared_pool_size=0,shared pool 没有设置明确下限。故障窗口内 Oracle 自动触发 SGA resize,将 shared pool 缩小并把内存转给 buffer cache。该操作由 MMAN 后台进程执行,期间 MMAN 成为 shared pool latch 阻塞源,导致大量业务会话等待 latch: shared pool、cursor: pin S wait on X、library cache: mutex X 等事件。与此同时,OS 层出现明显 swap-in,进一步放大了数据库卡顿,最终导致新连接无法及时完成登录握手,应用侧表现为 ORA-03135。
