跳过正文

Oracle 18c WMS 数据库卡顿故障处理实战:一次 ASMM 调整 shared pool 引发的 ORA-03135

Greatfinish
作者
Greatfinish
记录 Oracle、PostgreSQL、达梦、Linux、存储与生产环境故障处理经验。

一、故障现象
#

2026 年 5 月 13 日下午,WMS 系统数据库在 16:50–17:06 左右出现短时间卡顿,应用侧报错:

ORA-03135: connection lost contact
Process ID: 0
Session ID: 0
Serial number: 0

现场检查发现,应用服务器到数据库服务器的网络连通性正常:

ping 正常
telnet 数据库监听端口正常

问题持续几分钟后自动恢复,数据库实例没有重启。由于是故障发生后第二天才介入,数据库服务器当时没有安装 OSWatcher / oswbb 这类操作系统历史采集工具,无法完整回放当时 CPU、内存、I/O、进程级资源使用情况。因此本次排查主要依赖以下信息:

1. Oracle alert 日志
2. 故障时间段 ASH 历史会话等待信息
3. Oracle 内存参数
4. v$sga_resize_ops 中的 SGA resize 记录
5. sysstat/sar 保留下来的 swap 历史数据

二、alert 日志关键时间线
#

alert 日志中,故障窗口内有三类重要信息。

1. 连接超时
#

16:50 以后,alert 日志开始出现连接超时:

2026-05-13T16:35:12.506429+08:00
Thread 1 advanced to log sequence 58602 (LGWR switch)
  Current log# 4 seq# 58602 mem# 0: /home/oracle/oracle/app/oradata/WMSPROD/redo04.log
2026-05-13T16:35:12.585976+08:00
TT03 (PID:6908): SRL selected for T-1.S-58602 for LAD:2
2026-05-13T16:35:13.192971+08:00
ARC1 (PID:17839): Archived Log entry 60125 added for T-1.S-58601 ID 0x50fa7345 LAD:1
2026-05-13T16:50:52.780199+08:00


***********************************************************************

Fatal NI connect error 12170.

  VERSION INFORMATION:
	TNS for Linux: Version 18.0.0.0.0 - Production
	Oracle Bequeath NT Protocol Adapter for Linux: Version 18.0.0.0.0 - Production
	TCP/IP NT Protocol Adapter for Linux: Version 18.0.0.0.0 - Production
  Version 18.3.0.0.0
  Time: 13-5 -2026 16:50:52
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12535
    
TNS-12535: TNS: 操作超时
    ns secondary err code: 12606
    nt main err code: 0
    nt secondary err code: 0
    nt OS err code: 0
  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=172.16.76.126)(PORT=50927))
2026-05-13T16:50:52.802481+08:00
WARNING: inbound connection timed out (ORA-3136)
2026-05-13T16:51:05.680349+08:00


***********************************************************************

Fatal NI connect error 12170.

  VERSION INFORMATION:
	TNS for Linux: Version 18.0.0.0.0 - Production
	Oracle Bequeath NT Protocol Adapter for Linux: Version 18.0.0.0.0 - Production
	TCP/IP NT Protocol Adapter for Linux: Version 18.0.0.0.0 - Production
  Version 18.3.0.0.0
  Time: 13-5 -2026 16:51:05
  Tracing not turned on.
  Tns error struct:
2026-05-13T16:51:05.680946+08:00


***********************************************************************
    ns main err code: 12535
    

Fatal NI connect error 12170.
TNS-12535: TNS: 操作超时

  VERSION INFORMATION:
	TNS for Linux: Version 18.0.0.0.0 - Production
	Oracle Bequeath NT Protocol Adapter for Linux: Version 18.0.0.0.0 - Production
	TCP/IP NT Protocol Adapter for Linux: Version 18.0.0.0.0 - Production
    ns secondary err code: 12606
    nt main err code: 0
  Version 18.3.0.0.0
    nt secondary err code: 0
    nt OS err code: 0
  Time: 13-5 -2026 16:51:05
  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=172.16.76.125)(PORT=32992))
2026-05-13T16:51:05.681668+08:00
WARNING: inbound connection timed out (ORA-3136)
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12535
    
TNS-12535: TNS: 操作超时
    ns secondary err code: 12606
    nt main err code: 0
    nt secondary err code: 0
    nt OS err code: 0
  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=172.16.76.125)(PORT=33012))
2026-05-13T16:51:05.682861+08:00
WARNING: inbound connection timed out (ORA-3136)
2026-05-13T16:51:06.750191+08:00


***********************************************************************
2026-05-13T16:51:06.750193+08:00


***********************************************************************

Fatal NI connect error 12170.

Fatal NI connect error 12170.

  VERSION INFORMATION:
	TNS for Linux: Version 18.0.0.0.0 - Production
	Oracle Bequeath NT Protocol Adapter for Linux: Version 18.0.0.0.0 - Production
	TCP/IP NT Protocol Adapter for Linux: Version 18.0.0.0.0 - Production

  VERSION INFORMATION:
	TNS for Linux: Version 18.0.0.0.0 - Production
	Oracle Bequeath NT Protocol Adapter for Linux: Version 18.0.0.0.0 - Production
	TCP/IP NT Protocol Adapter for Linux: Version 18.0.0.0.0 - Production
  Version 18.3.0.0.0
  Version 18.3.0.0.0
  Time: 13-5 -2026 16:51:06
  Time: 13-5 -2026 16:51:06
  Tracing not turned on.
  Tracing not turned on.
  Tns error struct:
  Tns error struct:
    ns main err code: 12535
    ns main err code: 12535
    
    
TNS-12535: TNS: 操作超时
TNS-12535: TNS: 操作超时
    ns secondary err code: 12606
    ns secondary err code: 12606
    nt main err code: 0
    nt main err code: 0
    nt secondary err code: 0
    nt secondary err code: 0
    nt OS err code: 0
    nt OS err code: 0
  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=172.16.76.125)(PORT=33032))
  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=172.16.76.125)(PORT=33054))
2026-05-13T16:51:06.751531+08:00
WARNING: inbound connection timed out (ORA-3136)
2026-05-13T16:51:06.751552+08:00
WARNING: inbound connection timed out (ORA-3136)
2026-05-13T16:51:06.760173+08:00


***********************************************************************

这些错误与应用侧 ORA-03135 时间基本吻合。应用端看到的是连接丢失,数据库服务端看到的是入站连接处理超时。

这类错误说明:

TCP 连接或连接请求能到达数据库主机,但数据库端无法及时完成入站连接处理。

因此,排查方向从网络层转向数据库内部等待、后台进程状态、内存管理和操作系统内存压力。

2. Hang Manager 检测到数据库内部 hang
#

alert 日志中最关键的信息是16:56:24~16:59:35 DIA0 Hang Manager 报出的 hang chain:

***********************************************************************

Fatal NI connect error 12170.

  VERSION INFORMATION:
	TNS for Linux: Version 18.0.0.0.0 - Production
	Oracle Bequeath NT Protocol Adapter for Linux: Version 18.0.0.0.0 - Production
	TCP/IP NT Protocol Adapter for Linux: Version 18.0.0.0.0 - Production
  Version 18.3.0.0.0
  Time: 13-5 -2026 16:54:47
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12535
    
TNS-12535: TNS: 操作超时
    ns secondary err code: 12606
    nt main err code: 0
    nt secondary err code: 0
    nt OS err code: 0
  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.44.150)(PORT=55992))
2026-05-13T16:54:47.544739+08:00
WARNING: inbound connection timed out (ORA-3136)
2026-05-13T16:56:24.151586+08:00
DIA0 Critical Database Process As Root: Hang ID 280 blocks 59 sessions
     Final blocker is session ID 827 serial# 50907 OSPID 17153 on Instance 1
     No resolution will be attempted by Hang Manager
2026-05-13T16:56:50.127317+08:00
DIA0 Critical Database Process Blocked: Hang ID 280 blocks 59 sessions
     Final blocker is session ID 827 serial# 50907 OSPID 17153 on Instance 1
     No resolution will be attempted by Hang Manager
2026-05-13T16:57:24.327262+08:00
DIA0 Critical Database Process Blocked: Hang ID 280 blocks 93 sessions
     Final blocker is session ID 827 serial# 50907 OSPID 17153 on Instance 1
     No resolution will be attempted by Hang Manager
2026-05-13T16:59:35.273959+08:00
DIA0 Critical Database Process Blocked: Hang ID 280 blocks 354 sessions
     Final blocker is session ID 827 serial# 50907 OSPID 17153 on Instance 1
     No resolution will be attempted by Hang Manager
2026-05-13T17:00:34.444892+08:00

这说明当时数据库内部已经形成明显阻塞链,最终阻塞源是:

SID      : 827
SERIAL#  : 50907
OSPID    : 17153
Instance : 1

并且阻塞影响范围从59个会话扩大到 354 个会话。

3. alert 日志提示 heavy swapping
#

17:01:37 左右,alert 日志出现:

2026-05-13T17:00:36.331741+08:00
WARNING: inbound connection timed out (ORA-3136)
    
TNS-12535: TNS: 操作超时
    ns secondary err code: 12606
    nt main err code: 0
    nt secondary err code: 0
    nt OS err code: 0
  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=172.16.76.125)(PORT=33604))
2026-05-13T17:00:36.332656+08:00
WARNING: inbound connection timed out (ORA-3136)
2026-05-13T17:00:40.371253+08:00
ALTER SYSTEM ARCHIVE LOG
2026-05-13T17:00:40.378963+08:00
Thread 1 cannot allocate new log, sequence 58603
Private strand flush not complete
  Current log# 4 seq# 58602 mem# 0: /home/oracle/oracle/app/oradata/WMSPROD/redo04.log
2026-05-13T17:00:40.554161+08:00
Thread 1 advanced to log sequence 58603 (LGWR switch)
  Current log# 5 seq# 58603 mem# 0: /home/oracle/oracle/app/oradata/WMSPROD/redo05.log
2026-05-13T17:00:40.677566+08:00
TT03 (PID:6908): SRL selected for T-1.S-58603 for LAD:2
2026-05-13T17:00:41.181641+08:00
NET  (PID:3977): Archived Log entry 60127 added for T-1.S-58602 ID 0x50fa7345 LAD:1
2026-05-13T17:01:37.540212+08:00
WARNING: Heavy swapping observed on system in last 5 mins.
Heavy swapping can lead to timeouts, poor performance, and instance eviction.
2026-05-13T17:04:53.657779+08:00
Resize operation completed for file# 3, old size 8396800K, new size 8407040K
2026-05-13T17:42:48.649406+08:00
Thread 1 cannot allocate new log, sequence 58604
Private strand flush not complete
  Current log# 5 seq# 58603 mem# 0: /home/oracle/oracle/app/oradata/WMSPROD/redo05.log
2026-05-13T17:42:51.719239+08:00
Thread 1 advanced to log sequence 58604 (LGWR switch)
  Current log# 6 seq# 58604 mem# 0: /home/oracle/oracle/app/oradata/WMSPROD/redo06.log

WARNING: Heavy swapping observed on system in last 5 mins.

Heavy swapping can lead to timeouts, poor performance, and instance eviction.

这条信息说明 Oracle 自身已经检测到操作系统最近 5 分钟存在严重换页。虽然没有 OSWatcher,但这条告警已经提示操作系统内存压力很可能与本次故障有关。

4. 初步结论
#

通过告警日志文件,判断数据库主机在 5月13日 16:50–17:01 左右出现短时间严重卡顿/阻塞/内存换页,导致大量客户端连接超时,应用侧表现为 ORA-03135。

时间关键现象判断
16:50:52 开始Fatal NI connect error 12170TNS-12535: 操作超时客户端连接 Oracle 网络层超时
16:50:52–17:00:36大量 WARNING: inbound connection timed out (ORA-3136)数据库服务端无法及时完成客户端连接握手
16:56:24–16:59:35DIA0 Hang Manager 报 Hang ID 280,阻塞会话从 59 增长到 354数据库内部出现短时间严重 hang/blocking
17:01:37WARNING: Heavy swapping observed on system in last 5 mins操作系统发生重度 swap,可能导致超时、性能下降、甚至实例驱逐风险

三、ASH 分析
#

1. 查当时所有 shared pool latch 等待的来源
#

根据alter日志DIA0 Hang Manager 报出的 hang chain先查询 ASH 5月13日 16:45–17:05 查最终阻塞会话当时在干什么

set lines 300 pages 1000
col sample_time for a20
col username for a20
col machine for a40
col program for a50
col module for a40
col action for a40
col event for a40
col sql_id for a15
col top_level_sql_id for a15
col p1text for a25
col p2text for a25
col p3text for a25

select to_char(ash.sample_time,'yyyy-mm-dd hh24:mi:ss') sample_time,
       ash.session_id sid,
       ash.session_serial# serial#,
       u.username,
       ash.machine,
       ash.program,
       ash.module,
       ash.action,
       ash.sql_id,
       ash.sql_child_number,
       ash.sql_exec_id,
       to_char(ash.sql_exec_start,'yyyy-mm-dd hh24:mi:ss') sql_exec_start,
       ash.top_level_sql_id,
       ash.session_state,
       ash.event,
       ash.wait_class,
       ash.p1text,
       ash.p1,
       ash.p2text,
       ash.p2,
       ash.p3text,
       ash.p3
from dba_hist_active_sess_history ash
left join dba_users u
  on ash.user_id = u.user_id
where ash.sample_time between
      to_timestamp('2026-05-13 16:45:00','yyyy-mm-dd hh24:mi:ss')
  and to_timestamp('2026-05-13 17:05:00','yyyy-mm-dd hh24:mi:ss')
  and ash.session_id = 827
  and ash.session_serial# = 50907
order by ash.sample_time;

查询结果如下:

SAMPLE_TIME                 SID    SERIAL# USERNAME             MACHINE                                  PROGRAM              MODULE                                    ACTION                                   SQL_ID          SQL_CHILD_NUMBER SQL_EXEC_ID
-------------------- ---------- ---------- -------------------- ---------------------------------------- -------------------------------------------------- ---------------------------------------- ---------------------------------------- --------------- ---------------- -----------
SQL_EXEC_START
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
TOP_LEVEL_SQL_I SESSION_STATE                EVENT                                    WAIT_CLASS                                 P1TEXT                             P1 P2TEXT                            P2 P3TEXT                            P3
--------------- ---------------------------- ---------------------------------------- ---------------------------------------------------------------- ------------------------- ---------- ------------------------- ---------- ------------------------- ----------
2026-05-13 16:48:51         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:49:01         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:49:11         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:49:21         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:49:31         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:49:41         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:49:51         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:50:01         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:50:11         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:50:21         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:50:31         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:50:41         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:50:51         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:51:01         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:51:11         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:51:21         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:51:31         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:51:41         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:51:51         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:52:01         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:52:11         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:52:22         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:52:33         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:52:43         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:52:53         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:53:03         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:53:13         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:53:23         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:53:33         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:53:43         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:53:53         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:54:03         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:54:13         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:54:23         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:54:33         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:54:43         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:54:53         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:55:03         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:55:13         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:55:23         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:55:33         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:55:43         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:55:53         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:56:03         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:56:13         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:56:23         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:56:33         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:56:43         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:56:53         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:57:03         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:57:13         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:57:23         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:57:33         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:57:43         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:57:53         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:58:03         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:58:13         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:58:24         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:58:34         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:58:44         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:58:54         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:59:05         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:59:15         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:59:25         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:59:35         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:59:45         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 16:59:55         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 17:00:05         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 17:00:15         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 17:00:25         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 17:00:35         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0

2026-05-13 17:00:45         827      50907 SYS                  bogon                                    oracle@bogon (MMAN)                                                                                                                              0

                ON CPU                                                                                                           timeout                           300                                    0                                    0


已选择 72 行。

SQL>

2. 关键发现
#

ASH 查询显示:

SID        : 827
SERIAL#    : 50907
USERNAME   : SYS
MACHINE    : bogon
PROGRAM    : oracle@bogon (MMAN)
SESSION_STATE : ON CPU
SQL_ID     : 
时间范围   : 2026-05-13 16:48:51 ~ 17:00:45

也就是说,告警日志里提到的最终 blocker:

Final blocker is session ID 827 serial# 50907 OSPID 17153

实际对应的是:

oracle@bogon (MMAN)

MMAN 是 Oracle 的 Memory Manager 内存管理后台进程,负责 SGA 自动内存管理、shared pool / buffer cache / large pool 等内存组件调整。你的文件里 827 会话从 16:48:51 到 17:00:45 一直被 ASH 采样为 ON CPU,没有 SQL_ID,程序名明确是 oracle@bogon (MMAN)

这和告警日志完全对得上:16:56:24 开始 DIA0 Hang Manager 报 Hang ID 280,最终 blocker 就是 SID 827 / SERIAL# 50907 / OSPID 17153,到 16:59:35 已经阻塞 354 个会话。

最终阻塞源是 MMAN是 Oracle 的 Memory Manager 后台进程,负责 SGA 自动内存管理和各组件之间的内存调整。这个发现将问题方向转到了 Oracle 自动内存管理。


四、内存参数分析:AMM 关闭,ASMM 开启,但 shared_pool_size 为 0
#

检查数据库内存参数:

set lines 300 pages 1000
col name for a35
col value for a30
col display_value for a40

select name, value, display_value, isdefault
from v$parameter
where name in (
  'memory_target',
  'memory_max_target',
  'sga_target',
  'sga_max_size',
  'shared_pool_size',
  'shared_pool_reserved_size',
  'pga_aggregate_target',
  'pga_aggregate_limit',
  'use_large_pages'
)
order by name;

现场结果:

参数说明
memory_target0AMM 未开启
memory_max_target0AMM 未开启
sga_target35GASMM 开启
sga_max_size35GSGA 最大值 35G
shared_pool_size0shared pool 未设置固定下限
shared_pool_reserved_size约 185MB默认值
pga_aggregate_target7GPGA 目标值
pga_aggregate_limit20GPGA 上限
use_large_pagesTRUE使用 HugePages

从参数可以判断:

AMM 未开启:
memory_target = 0
memory_max_target = 0

ASMM 已开启:
sga_target = 35G
sga_max_size = 35G

shared pool 未设置明确下限:
shared_pool_size = 0

这表示 Oracle 可以在 sga_target=35G 内部自动调整 shared pool、buffer cache、large pool 等组件大小。

这个配置并不是绝对错误,但对于 WMS 这类连接数多、SQL 解析频繁、library cache 活跃的系统来说,如果 shared_pool_size=0,shared pool 完全交给 Oracle 自动调整,在业务高峰可能存在风险。


五、关键证据:故障窗口内发生了 SGA resize
#

继续查询 v$sga_resize_ops

set lines 300 pages 1000
col start_time for a20
col end_time for a20
col component for a35
col oper_type for a20
col oper_mode for a20
col status for a15

select to_char(start_time,'yyyy-mm-dd hh24:mi:ss') start_time,
       to_char(end_time,'yyyy-mm-dd hh24:mi:ss') end_time,
       component,
       oper_type,
       oper_mode,
       initial_size/1024/1024 initial_mb,
       target_size/1024/1024 target_mb,
       final_size/1024/1024 final_mb,
       status
from v$sga_resize_ops
where start_time between
      to_date('2026-05-13 16:30:00','yyyy-mm-dd hh24:mi:ss')
  and to_date('2026-05-13 17:10:00','yyyy-mm-dd hh24:mi:ss')
order by start_time;

结果如下:

START_TIMEEND_TIMECOMPONENTOPER_TYPEOPER_MODEINITIAL_MBTARGET_MBFINAL_MBSTATUS
2026-05-13 16:48:472026-05-13 17:00:50DEFAULT buffer cacheGROWDEFERRED253442547225472COMPLETE
2026-05-13 16:48:472026-05-13 17:00:50shared poolSHRINKDEFERRED768075527552COMPLETE

这两条记录非常关键:

shared pool      : 7680 MB -> 7552 MB,缩小 128 MB
buffer cache     : 25344 MB -> 25472 MB,增大 128 MB
开始时间         : 16:48:47
结束时间         : 17:00:50
执行进程         : MMAN

与 ASH 中 MMAN 从 16:48:51 到 17:00:45 持续处于 ON CPU 的时间完全吻合。

正常情况下,SGA resize 不应持续 12 分钟,更不应引发 300 多个会话等待 latch: shared pool。因此可以判断,这次 ASMM 自动调整是故障链路中的核心触发点。


六、OS 层 sar 验证:故障窗口确实存在 swap-in
#

由于没有部署 OSWatcher,最初无法回放操作系统详细状态。但后续检查发现,故障当天的 sysstat 采集文件仍然存在:

ls -lh /var/log/sa/sa13

结果:

-rw-r--r-- 1 root root 1.2M 5月 13 23:50 /var/log/sa/sa13

继续查看故障窗口内的 swap 活动:

sar -W -f /var/log/sa/sa13 -s 16:30:00 -e 17:10:00

结果:

Linux 3.10.0-957.el7.x86_64 (bogon)  2026年05月13日  _x86_64_  (32 CPU)

16时30分01秒  pswpin/s pswpout/s
16时40分01秒      5.89      0.20
16时50分01秒     34.98      0.59
17时00分01秒    430.11      0.77
平均时间:       156.93      0.52

字段含义:

字段含义
pswpin/s每秒从 swap 读回内存的页数
pswpout/s每秒写入 swap 的内存页数

从结果看,故障窗口内 swap-in 明显升高:

时间pswpin/spswpout/s判断
16:40:015.890.20已有少量 swap-in
16:50:0134.980.59swap-in 升高
17:00:01430.110.77明显异常,故障核心窗口
平均156.930.52故障窗口平均 swap-in 偏高

pswpin/s=430.11 表示在对应采样周期内,系统平均每秒从 swap 读回约 430 个内存页。如果按 Linux 常见 4KB page size 粗略估算:

430.11 pages/s × 4KB ≈ 1.68 MB/s

这说明故障核心窗口内,数据库服务器存在明显 swap-in。对于数据库系统来说,swap-in 会带来明显延迟,尤其当 Oracle 后台进程或关键前台进程访问到已经被换出的内存页时,可能导致响应变慢、持有 latch 时间变长、连接处理超时等问题。

这与 alert 日志中的 heavy swapping 完全吻合:

WARNING: Heavy swapping observed on system in last 5 mins.

因此,OS 层 sar 数据补充证明:

本次故障不只是 Oracle 内部 shared pool resize 卡顿,还叠加了 OS 层 swap-in。swap 活动很可能放大了 MMAN 持有 shared pool latch 的时间,导致大量会话等待和连接超时。


七、故障链路表格
#

阶段时间点 / 时间段证据来源关键现象分析判断
1故障前参数检查memory_target=0sga_target=35Gshared_pool_size=0AMM 关闭,ASMM 开启,shared pool 未设置明确下限
216:48:47v$sga_resize_opsshared pool SHRINK 7680MB -> 7552MBASMM 开始自动缩小 shared pool
316:48:47v$sga_resize_opsDEFAULT buffer cache GROW 25344MB -> 25472MBOracle 将 shared pool 释放的 128MB 转给 buffer cache
416:48:51–17:00:45ASHSID 827 / SERIAL# 50907oracle@bogon (MMAN),持续 ON CPU执行 resize 的 MMAN 长时间活跃
516:48:51 起ASH大量会话等待 latch: shared pool,阻塞源指向 SID 827MMAN 成为 shared pool latch 阻塞源
616:49–16:56ASHcursor: pin S wait on Xcursor: pin Slibrary cache: mutex X 增多shared pool / library cache / cursor 争用开始扩散
716:50 以后alert 日志Fatal NI connect error 12170TNS-12535ORA-3136数据库端无法及时完成入站连接处理
816:56:24alert 日志Hang ID 280 blocks 59 sessions,final blocker 为 SID 827 / OSPID 17153Hang Manager 检测到数据库内部 hang
916:59:35alert 日志Hang ID 280 blocks 354 sessions阻塞影响面扩大到 300 多个会话
1017:00:01sar -Wpswpin/s=430.11OS 层发生明显 swap-in,可能放大数据库卡顿
1117:00:50v$sga_resize_opsSGA resize 状态 COMPLETEshared pool shrink / buffer cache grow 操作完成
1217:01:37alert 日志WARNING: Heavy swapping observed on system in last 5 minsOracle 检测到最近 5 分钟系统 heavy swapping
1317:00 以后ASH / 业务反馈shared pool latch 阻塞逐步下降,业务逐步恢复MMAN 释放阻塞后,数据库恢复处理能力
14应用侧表现应用日志ORA-03135,且 Process ID / Session ID / Serial# 均为 0新连接未完成有效会话建立即超时断开

八、解决问题
#

1. 给 shared pool 设置明确下限
#

当前 shared_pool_size=0,建议设置明确下限,避免 Oracle 在业务高峰自动 shrink shared pool。

根据本次故障前 shared pool 为 7680MB,建议先设置为 8G:

alter system set shared_pool_size=8G scope=spfile;

然后安排业务低峰窗口重启数据库生效。

如果必须在线调整,也可以使用:

alter system set shared_pool_size=8G scope=both;

但在线调整本身可能触发新的 SGA resize,不建议在业务高峰执行。

2. 保留 ASMM,但控制关键内存组件下限
#

当前 AMM 已关闭,这是生产库较常见的方向。但在保留 ASMM 的情况下,建议对关键组件设置合理下限:

shared_pool_size
large_pool_size
java_pool_size
db_cache_size

对于 WMS 这类连接多、游标多、SQL 解析频繁的系统,shared pool 不建议完全交给 Oracle 自动调整。

3. 检查物理内存是否足够
#

当前数据库参数:

sga_target = 35G
pga_aggregate_target = 7G
pga_aggregate_limit = 20G

SGA 和 PGA target 合计已经约 42G。如果加上操作系统、Oracle 进程私有内存、监听、Agent、备份、监控等开销,物理内存不足时可能发生 swap。

建议检查:

free -g

cat /proc/meminfo | egrep "MemTotal|MemFree|MemAvailable|SwapTotal|SwapFree|HugePages|Commit"

sar -r -f /var/log/sa/sa13 -s 16:30:00 -e 17:10:00

sar -q -f /var/log/sa/sa13 -s 16:30:00 -e 17:10:00

重点关注:

MemAvailable
SwapFree
pswpin/s
pswpout/s
%commit
runq-sz

4. 排查 shared pool 压力来源
#

检查是否存在大量硬解析、SQL 未绑定变量、子游标过多、对象频繁失效等情况:

set lines 300 pages 1000
col sql_text for a100

select sql_id,
       count(*) child_cnt,
       sum(executions) executions,
       sum(parse_calls) parse_calls,
       sum(loads) loads,
       sum(invalidations) invalidations,
       max(version_count) version_count,
       substr(max(sql_text),1,100) sql_text
from v$sql
group by sql_id
having count(*) > 20
order by child_cnt desc;

针对故障期间 ASH 中出现较多的 SQL_ID,可以继续查询 SQL 文本:

set lines 300 pages 1000 long 100000
col sql_id for a15
col sql_text for a200

select sql_id,
       dbms_lob.substr(sql_text,4000,1) sql_text
from dba_hist_sqltext
where sql_id in (
'3fkbgd5c88nbn',
'77gs44zhuh7jn',
'd9q2q82g1tuzx',
'8gmzzh25xx5ys',
'9uapzrwn9k1vq',
'a4c1g6rpjbcpu'
);

5. 监控 SGA resize
#

建议建立定期巡检,关注 shared pool 是否频繁 shrink/grow:

set lines 300 pages 1000
col start_time for a20
col end_time for a20
col component for a30
col oper_type for a15
col status for a15

select to_char(start_time,'yyyy-mm-dd hh24:mi:ss') start_time,
       to_char(end_time,'yyyy-mm-dd hh24:mi:ss') end_time,
       component,
       oper_type,
       oper_mode,
       initial_size/1024/1024 initial_mb,
       target_size/1024/1024 target_mb,
       final_size/1024/1024 final_mb,
       status
from v$sga_resize_ops
where start_time >= sysdate - 7
  and component in ('shared pool','DEFAULT buffer cache')
order by start_time;

如果 shared pool 经常出现 SHRINK/GROW,说明 ASMM 对 shared pool 调整频繁,需要进一步固定 shared pool 下限或优化 SQL 解析压力。

6. 部署 OSWatcher 或等效采集
#

本次虽然通过 /var/log/sa/sa13 找到了 swap 证据,但 sar 默认 10 分钟粒度较粗,不能完整还原故障期间每秒级资源状态。

建议部署:

OSWatcher / oswbb
或至少定时采集:
vmstat
iostat
sar
top
ps

建议至少保留 30 天历史数据,便于故障追溯。


九、总结
#

本次故障不是网络中断,也不是普通业务行锁,而是 Oracle 18c 在关闭 AMM、开启 ASMM 的配置下,由于 shared_pool_size=0,shared pool 没有设置明确下限。故障窗口内 Oracle 自动触发 SGA resize,将 shared pool 缩小并把内存转给 buffer cache。该操作由 MMAN 后台进程执行,期间 MMAN 成为 shared pool latch 阻塞源,导致大量业务会话等待 latch: shared poolcursor: pin S wait on Xlibrary cache: mutex X 等事件。与此同时,OS 层出现明显 swap-in,进一步放大了数据库卡顿,最终导致新连接无法及时完成登录握手,应用侧表现为 ORA-03135。