一、故障背景#
生产数据库为 Oracle 11.2.0.4 RAC 环境。业务侧对表:
GKSD6.SCM_LOSSINOUT_APPLY_HDR新增了一个字段:
BPMID字段增加后,相关业务包没有完整重新编译和验证,导致部分 PL/SQL 包体失效。后续业务高并发访问时,大量会话调用 GKSD6.SCM_PUBLIC、GKSD6.SCM_PURBILL、FOP_SERVICE 等包,最终引发数据库严重卡顿。
故障期间,oratop 显示数据库存在大量等待:
library cache pin
library cache lock
library cache: mutex X
enq: TX - row lock contention其中 GKSD6.SCM_PUBLIC PACKAGE BODY 和 GKSD6.SCM_PURBILL PACKAGE BODY 最终被确认处于 INVALID 状态;同时大量会话正在调用 SCM_PUBLIC 相关函数和过程,形成 library cache 争用。
二、故障现象#
业务反馈数据库很卡,应用响应缓慢,部分操作长时间无返回。
通过 oratop 可以看到两个明显现象:
1. 大量 enq: TX - row lock contention
2. 大量 library cache pin / library cache lock表面上看是行锁问题,例如大量会话等待:
enq: TX - row lock contention但进一步分析发现,真正的问题不是单纯行锁,而是:
部分业务会话持有 DML 行锁
↓
这些会话又卡在 library cache pin
↓
事务无法提交或回滚
↓
后续 UPDATE / INSERT 全部等待行锁
↓
数据库整体表现为严重卡顿三、核心故障链路#
本次故障链路可以概括为:
表 SCM_LOSSINOUT_APPLY_HDR 新增 BPMID 字段
↓
依赖该表或相关视图/包的 PL/SQL 对象失效
↓
GKSD6.SCM_PUBLIC PACKAGE BODY INVALID
GKSD6.SCM_PURBILL PACKAGE BODY INVALID
↓
业务高并发调用 SCM_PUBLIC / SCM_PURBILL / FOP_SERVICE
↓
Oracle 尝试运行时校验或自动编译失效对象
↓
大量 library cache pin / library cache lock
↓
部分会话已持有 DML 锁,无法继续执行完成
↓
引发 enq: TX - row lock contention
↓
数据库整体卡顿四、为什么新增字段会导致包体失效#
正常情况下,单纯给表增加一个普通字段,不一定会导致所有包失效。但在以下场景中,表结构变更很容易引发包体失效或运行时异常:
| 场景 | 说明 |
|---|---|
包中使用 %ROWTYPE | 表结构变化后,依赖包需要重新编译 |
包中使用 select * into record | 字段数量变化可能影响逻辑 |
| 包依赖相关视图 | 表字段变化后视图或包依赖链失效 |
| 包体中引用固定字段列表 | 新字段加入后,相关 insert/select/update 逻辑未同步 |
| 发布过程中只改表,未编译包 | 对象状态遗留为 INVALID |
| 高峰期首次调用失效包 | 触发 library cache 争用 |
本次问题中,SCM_PUBLIC 的依赖对象非常多,并且依赖链中包含大量业务表、函数、包、视图等对象。检查结果显示,SCM_PUBLIC 与 SCM_PURBILL 包体均处于 INVALID 状态,而 DBA_ERRORS 查询 SCM_PUBLIC 没有直接错误记录,这种情况经常出现在对象依赖失效、运行时重编译被阻塞、或编译过程被其他会话 pin 住的场景。
五、判断过程#
1. 先看数据库主要等待事件#
select inst_id,
event,
count(*) cnt,
max(seconds_in_wait) max_wait_sec
from gv$session
where event in ('library cache pin',
'library cache lock',
'library cache: mutex X',
'enq: TX - row lock contention')
group by inst_id, event
order by inst_id, cnt desc;如果同时出现:
library cache pin / lock 很多
enq: TX - row lock contention 很多要警惕这不是普通行锁,而是 library cache 等待导致持锁会话无法完成。
2. 查行锁 blocker#
set lines 260 pages 1000
col blocker for a20
col holder_user for a15
col holder_event for a35
col holder_sql for a13
col sample_sql for a80
with w as (
select blocking_instance as holder_inst,
blocking_session as holder_sid,
count(*) as waiters,
min(seconds_in_wait) as min_wait_sec,
max(seconds_in_wait) as max_wait_sec
from gv$session
where event = 'enq: TX - row lock contention'
and blocking_session is not null
group by blocking_instance, blocking_session
)
select w.holder_inst || ',' || w.holder_sid as blocker,
hs.serial#,
hp.spid,
hs.username as holder_user,
hs.status,
hs.event as holder_event,
hs.sql_id as holder_sql,
w.waiters,
w.min_wait_sec,
w.max_wait_sec,
hs.last_call_et,
substr(q.sql_text, 1, 80) as sample_sql
from w
join gv$session hs
on hs.inst_id = w.holder_inst
and hs.sid = w.holder_sid
left join gv$process hp
on hp.inst_id = hs.inst_id
and hp.addr = hs.paddr
left join gv$sqlarea q
on q.inst_id = hs.inst_id
and q.sql_id = hs.sql_id
order by w.waiters desc, w.max_wait_sec desc;本次发现大量 blocker 是:
FOP_CMS 用户
fop_service.fop_service_unop
library cache pin说明这些会话不是普通慢 SQL,而是已经被 library cache pin 卡住,同时又持有业务表 DML 锁。
3. 查对象是否失效#
set lines 200 pages 1000
col owner for a20
col object_name for a35
col object_type for a20
col status for a10
select owner,
object_name,
object_type,
status,
created,
last_ddl_time
from dba_objects
where upper(object_name) in (
'FOP_SERVICE',
'FIN_SALINV',
'BPM_XIAOSHOUSP_CC',
'SCM_PUBLIC',
'SCM_PURBILL'
)
order by owner, object_name, object_type;本次关键发现:
GKSD6.SCM_PUBLIC PACKAGE BODY INVALID
GKSD6.SCM_PURBILL PACKAGE BODY INVALID这基本确认了方向:不是普通行锁,而是包体失效后,高并发调用引发 library cache 争用。
4. 查编译错误#
set lines 260 pages 1000
col owner for a15
col name for a35
col type for a20
col text for a160
select owner,
name,
type,
line,
position,
text
from dba_errors
where owner = 'GKSD6'
and name in ('SCM_PUBLIC','SCM_PURBILL')
order by name, sequence;本次 SCM_PUBLIC 查询没有明显编译错误记录,说明问题不一定是语法错误,而可能是依赖对象状态异常、运行时重编译冲突,或者编译时被其他会话 pin 住。
5. 查包依赖对象#
set lines 260 pages 1000
col owner for a15
col name for a35
col type for a20
col referenced_owner for a20
col referenced_name for a40
col referenced_type for a20
select owner,
name,
type,
referenced_owner,
referenced_name,
referenced_type
from dba_dependencies
where owner = 'GKSD6'
and name in ('SCM_PUBLIC','SCM_PURBILL')
order by referenced_owner, referenced_name;如果依赖对象中包含刚刚变更过的表,例如:
SCM_LOSSINOUT_APPLY_HDR就要重点怀疑表结构变更导致依赖包失效。
六、处理过程#
1. 先阻断新业务连接#
因为当时大量应用会话正在调用 GKSD6.SCM_PUBLIC,直接编译很容易一直等待 library cache pin。
生产中建议优先用锁用户方式临时截流:
alter user GKSD6 account lock;
alter user FOP_CMS account lock;
alter user TREASURY_CMS account lock;注意:
锁用户只能阻止新连接;
不会断开已有会话;
不会自动释放已有行锁。2. 清理已有持锁和 library cache 等待会话#
生成 kill 语句:
set lines 300 pages 1000
col kill_cmd for a90
col username for a15
col event for a35
col sql_text for a100
select s.inst_id,
s.sid,
s.serial#,
p.spid,
s.username,
s.status,
s.event,
s.sql_id,
s.last_call_et,
substr(q.sql_text,1,100) as sql_text,
'alter system kill session ''' ||
s.sid || ',' || s.serial# || ',@' || s.inst_id ||
''' immediate;' as kill_cmd
from gv$session s
left join gv$process p
on s.inst_id = p.inst_id
and s.paddr = p.addr
left join gv$sqlarea q
on s.inst_id = q.inst_id
and s.sql_id = q.sql_id
where s.type = 'USER'
and s.username in ('GKSD6','FOP_CMS','TREASURY_CMS')
and (
s.event in ('library cache pin',
'library cache lock',
'library cache: mutex X',
'enq: TX - row lock contention')
or lower(q.sql_text) like '%scm_public%'
or lower(q.sql_text) like '%fop_service.fop_service_unop%'
or lower(q.sql_text) like '%scm_purbill%'
or lower(q.sql_text) like '%fin_salinv%'
)
order by s.last_call_et desc;优先处理:
1. 正在阻塞编译的 final blocking session
2. 持有 TX 行锁的 holder
3. library cache pin / lock 时间最长的业务会话
4. 正在调用 SCM_PUBLIC / SCM_PURBILL / FOP_SERVICE 的会话3. 编译失效包#
先编译依赖包:
alter package GKSD6.SCM_PURBILL compile body;
show errors package body GKSD6.SCM_PURBILL;再编译核心公共包:
alter package GKSD6.SCM_PUBLIC compile body;
show errors package body GKSD6.SCM_PUBLIC;本次处理过程中,编译 SCM_PUBLIC 时被一个业务会话阻塞:
SYS 会话正在执行:
alter package GKSD6.SCM_PUBLIC compile body
等待:
library cache pin
blocking session:
INST 1 / SID 705查询发现 1,705 是 GKSD6 业务会话,正在执行:
INSERT INTO pub_collection_dtl ...并等待:
latch: row cache objects处理该 blocker:
alter system kill session '705,5' immediate;之后 SCM_PUBLIC 编译成功。
4. 确认对象状态#
set lines 200 pages 1000
col owner for a15
col object_name for a35
col object_type for a20
col status for a10
col last_ddl_time for a20
select owner,
object_name,
object_type,
status,
to_char(last_ddl_time,'yyyy-mm-dd hh24:mi:ss') last_ddl_time
from dba_objects
where owner = 'GKSD6'
and object_name in ('SCM_PUBLIC','SCM_PURBILL')
order by object_name, object_type;最终结果:
GKSD6.SCM_PUBLIC PACKAGE VALID
GKSD6.SCM_PUBLIC PACKAGE BODY VALID
GKSD6.SCM_PURBILL PACKAGE VALID
GKSD6.SCM_PURBILL PACKAGE BODY VALID说明核心问题已解决。
5. 恢复应用用户#
alter user GKSD6 account unlock;
alter user FOP_CMS account unlock;
alter user TREASURY_CMS account unlock;七、故障恢复后验证#
1. 查看等待事件是否下降#
select inst_id,
event,
count(*) cnt,
max(seconds_in_wait) max_wait_sec
from gv$session
where event in ('library cache pin',
'library cache lock',
'library cache: mutex X',
'enq: TX - row lock contention')
group by inst_id, event
order by inst_id, cnt desc;2. 查看是否仍有行锁 blocker#
with w as (
select blocking_instance as holder_inst,
blocking_session as holder_sid,
count(*) as waiters
from gv$session
where event = 'enq: TX - row lock contention'
and blocking_session is not null
group by blocking_instance, blocking_session
)
select w.holder_inst,
w.holder_sid,
hs.serial#,
hs.username,
hs.status,
hs.event,
hs.sql_id,
w.waiters
from w
join gv$session hs
on hs.inst_id = w.holder_inst
and hs.sid = w.holder_sid
order by w.waiters desc;3. 查看是否还有 invalid 对象#
select owner,
object_name,
object_type,
status,
last_ddl_time
from dba_objects
where owner in ('GKSD6','FOP_CMS','TREASURY_CMS','SAVEFEGKSD6')
and status <> 'VALID'
order by owner, object_type, object_name;八、故障根因总结#
本次故障的根因可以总结为:
业务表 SCM_LOSSINOUT_APPLY_HDR 新增 BPMID 字段后,
依赖该表或相关依赖链的 PL/SQL 包未完成重新编译,
导致 GKSD6.SCM_PUBLIC 和 GKSD6.SCM_PURBILL 包体失效。
业务高峰期大量会话并发调用失效包,
触发 library cache pin / lock 争用。
部分业务过程在持有 DML 锁后卡住,
导致后续 UPDATE / INSERT 出现 enq: TX - row lock contention,
最终表现为数据库整体卡顿。九、经验教训#
1. 表结构变更后必须检查 invalid 对象#
任何生产表结构变更后,都要执行:
select owner,
object_name,
object_type,
status,
last_ddl_time
from dba_objects
where owner in ('GKSD6','FOP_CMS','TREASURY_CMS','SAVEFEGKSD6')
and status <> 'VALID'
order by owner, object_type, object_name;不能只看 DDL 是否执行成功。
2. 不能让业务高峰期首次调用触发自动编译#
失效对象如果不提前编译,Oracle 可能在业务首次调用时尝试自动编译。
低并发时可能无感知,高并发时就可能出现:
library cache pin
library cache lock
library cache: mutex X严重时会放大成全局业务卡顿。
3. 发布流程中应增加强制编译步骤#
表结构变更后建议执行:
alter package GKSD6.SCM_PURBILL compile body;
alter package GKSD6.SCM_PUBLIC compile body;或者对 schema 执行统一编译:
exec dbms_utility.compile_schema(schema => 'GKSD6', compile_all => false);生产环境建议优先在低峰期执行,并提前评估影响。
4. 编译前要先控制业务入口#
如果已经发生大量调用,不建议直接反复编译。
正确顺序是:
1. 阻断新连接
2. 清理已有 pin 对象的会话
3. 编译 invalid 对象
4. 验证对象状态
5. 恢复业务连接5. 不要只杀行锁 waiter#
本次表面有大量:
enq: TX - row lock contention但真正根因是:
library cache pin / lock所以只杀等待行锁的 waiter 没意义。要找到持锁 holder,以及阻塞编译或 library cache 的 blocker。
十、建议形成标准变更检查清单#
以后生产表结构变更后,建议固定执行以下检查:
-- 1. 查 invalid 对象
select owner, object_name, object_type, status, last_ddl_time
from dba_objects
where owner in ('GKSD6','FOP_CMS','TREASURY_CMS','SAVEFEGKSD6')
and status <> 'VALID'
order by owner, object_type, object_name;
-- 2. 查变更表影响了哪些对象
select owner,
name,
type,
referenced_owner,
referenced_name,
referenced_type
from dba_dependencies
where referenced_owner = 'GKSD6'
and referenced_name = 'SCM_LOSSINOUT_APPLY_HDR'
order by owner, name, type;
-- 3. 编译核心业务包
alter package GKSD6.SCM_PURBILL compile body;
alter package GKSD6.SCM_PUBLIC compile body;
-- 4. 查编译错误
select owner, name, type, line, position, text
from dba_errors
where owner = 'GKSD6'
order by name, sequence;
-- 5. 再次确认 invalid 对象
select owner, object_name, object_type, status, last_ddl_time
from dba_objects
where owner in ('GKSD6','FOP_CMS','TREASURY_CMS','SAVEFEGKSD6')
and status <> 'VALID'
order by owner, object_type, object_name;十一、文章结论#
本次故障不是单纯的行锁问题,而是一次典型的 表结构变更后依赖包体失效,业务高并发触发 library cache 争用,最终放大为行锁阻塞 的生产事故。
核心处理思路是:
先定位 invalid 对象
再确认 library cache 等待
再阻断新业务连接
再清理已有 pin/锁会话
最后按依赖顺序重新编译包体最终通过编译:
GKSD6.SCM_PURBILL PACKAGE BODY
GKSD6.SCM_PUBLIC PACKAGE BODY并清理阻塞编译的业务会话后,两个包体恢复 VALID,故障解除。
