跳过正文

Oracle 表结构变更引发包体失效、library cache pin 和行锁阻塞故障处理复盘

Greatfinish
作者
Greatfinish
记录 Oracle、PostgreSQL、达梦、Linux、存储与生产环境故障处理经验。
目录

一、故障背景
#

生产数据库为 Oracle 11.2.0.4 RAC 环境。业务侧对表:

GKSD6.SCM_LOSSINOUT_APPLY_HDR

新增了一个字段:

BPMID

字段增加后,相关业务包没有完整重新编译和验证,导致部分 PL/SQL 包体失效。后续业务高并发访问时,大量会话调用 GKSD6.SCM_PUBLICGKSD6.SCM_PURBILLFOP_SERVICE 等包,最终引发数据库严重卡顿。

故障期间,oratop 显示数据库存在大量等待:

library cache pin
library cache lock
library cache: mutex X
enq: TX - row lock contention

其中 GKSD6.SCM_PUBLIC PACKAGE BODYGKSD6.SCM_PURBILL PACKAGE BODY 最终被确认处于 INVALID 状态;同时大量会话正在调用 SCM_PUBLIC 相关函数和过程,形成 library cache 争用。


二、故障现象
#

业务反馈数据库很卡,应用响应缓慢,部分操作长时间无返回。

通过 oratop 可以看到两个明显现象:

1. 大量 enq: TX - row lock contention
2. 大量 library cache pin / library cache lock

表面上看是行锁问题,例如大量会话等待:

enq: TX - row lock contention

但进一步分析发现,真正的问题不是单纯行锁,而是:

部分业务会话持有 DML 行锁
        
这些会话又卡在 library cache pin
        
事务无法提交或回滚
        
后续 UPDATE / INSERT 全部等待行锁
        
数据库整体表现为严重卡顿

三、核心故障链路
#

本次故障链路可以概括为:

 SCM_LOSSINOUT_APPLY_HDR 新增 BPMID 字段
        
依赖该表或相关视图/包的 PL/SQL 对象失效
        
GKSD6.SCM_PUBLIC PACKAGE BODY INVALID
GKSD6.SCM_PURBILL PACKAGE BODY INVALID
        
业务高并发调用 SCM_PUBLIC / SCM_PURBILL / FOP_SERVICE
        
Oracle 尝试运行时校验或自动编译失效对象
        
大量 library cache pin / library cache lock
        
部分会话已持有 DML 锁,无法继续执行完成
        
引发 enq: TX - row lock contention
        
数据库整体卡顿

四、为什么新增字段会导致包体失效
#

正常情况下,单纯给表增加一个普通字段,不一定会导致所有包失效。但在以下场景中,表结构变更很容易引发包体失效或运行时异常:

场景说明
包中使用 %ROWTYPE表结构变化后,依赖包需要重新编译
包中使用 select * into record字段数量变化可能影响逻辑
包依赖相关视图表字段变化后视图或包依赖链失效
包体中引用固定字段列表新字段加入后,相关 insert/select/update 逻辑未同步
发布过程中只改表,未编译包对象状态遗留为 INVALID
高峰期首次调用失效包触发 library cache 争用

本次问题中,SCM_PUBLIC 的依赖对象非常多,并且依赖链中包含大量业务表、函数、包、视图等对象。检查结果显示,SCM_PUBLICSCM_PURBILL 包体均处于 INVALID 状态,而 DBA_ERRORS 查询 SCM_PUBLIC 没有直接错误记录,这种情况经常出现在对象依赖失效、运行时重编译被阻塞、或编译过程被其他会话 pin 住的场景。


五、判断过程
#

1. 先看数据库主要等待事件
#

select inst_id,
       event,
       count(*) cnt,
       max(seconds_in_wait) max_wait_sec
from gv$session
where event in ('library cache pin',
                'library cache lock',
                'library cache: mutex X',
                'enq: TX - row lock contention')
group by inst_id, event
order by inst_id, cnt desc;

如果同时出现:

library cache pin / lock 很多
enq: TX - row lock contention 很多

要警惕这不是普通行锁,而是 library cache 等待导致持锁会话无法完成


2. 查行锁 blocker
#

set lines 260 pages 1000
col blocker for a20
col holder_user for a15
col holder_event for a35
col holder_sql for a13
col sample_sql for a80

with w as (
    select blocking_instance as holder_inst,
           blocking_session  as holder_sid,
           count(*) as waiters,
           min(seconds_in_wait) as min_wait_sec,
           max(seconds_in_wait) as max_wait_sec
    from gv$session
    where event = 'enq: TX - row lock contention'
      and blocking_session is not null
    group by blocking_instance, blocking_session
)
select w.holder_inst || ',' || w.holder_sid as blocker,
       hs.serial#,
       hp.spid,
       hs.username as holder_user,
       hs.status,
       hs.event as holder_event,
       hs.sql_id as holder_sql,
       w.waiters,
       w.min_wait_sec,
       w.max_wait_sec,
       hs.last_call_et,
       substr(q.sql_text, 1, 80) as sample_sql
from w
join gv$session hs
  on hs.inst_id = w.holder_inst
 and hs.sid     = w.holder_sid
left join gv$process hp
  on hp.inst_id = hs.inst_id
 and hp.addr    = hs.paddr
left join gv$sqlarea q
  on q.inst_id = hs.inst_id
 and q.sql_id  = hs.sql_id
order by w.waiters desc, w.max_wait_sec desc;

本次发现大量 blocker 是:

FOP_CMS 用户
fop_service.fop_service_unop
library cache pin

说明这些会话不是普通慢 SQL,而是已经被 library cache pin 卡住,同时又持有业务表 DML 锁。


3. 查对象是否失效
#

set lines 200 pages 1000
col owner for a20
col object_name for a35
col object_type for a20
col status for a10

select owner,
       object_name,
       object_type,
       status,
       created,
       last_ddl_time
from dba_objects
where upper(object_name) in (
    'FOP_SERVICE',
    'FIN_SALINV',
    'BPM_XIAOSHOUSP_CC',
    'SCM_PUBLIC',
    'SCM_PURBILL'
)
order by owner, object_name, object_type;

本次关键发现:

GKSD6.SCM_PUBLIC   PACKAGE BODY   INVALID
GKSD6.SCM_PURBILL  PACKAGE BODY   INVALID

这基本确认了方向:不是普通行锁,而是包体失效后,高并发调用引发 library cache 争用。


4. 查编译错误
#

set lines 260 pages 1000
col owner for a15
col name for a35
col type for a20
col text for a160

select owner,
       name,
       type,
       line,
       position,
       text
from dba_errors
where owner = 'GKSD6'
  and name in ('SCM_PUBLIC','SCM_PURBILL')
order by name, sequence;

本次 SCM_PUBLIC 查询没有明显编译错误记录,说明问题不一定是语法错误,而可能是依赖对象状态异常、运行时重编译冲突,或者编译时被其他会话 pin 住。


5. 查包依赖对象
#

set lines 260 pages 1000
col owner for a15
col name for a35
col type for a20
col referenced_owner for a20
col referenced_name for a40
col referenced_type for a20

select owner,
       name,
       type,
       referenced_owner,
       referenced_name,
       referenced_type
from dba_dependencies
where owner = 'GKSD6'
  and name in ('SCM_PUBLIC','SCM_PURBILL')
order by referenced_owner, referenced_name;

如果依赖对象中包含刚刚变更过的表,例如:

SCM_LOSSINOUT_APPLY_HDR

就要重点怀疑表结构变更导致依赖包失效。


六、处理过程
#

1. 先阻断新业务连接
#

因为当时大量应用会话正在调用 GKSD6.SCM_PUBLIC,直接编译很容易一直等待 library cache pin

生产中建议优先用锁用户方式临时截流:

alter user GKSD6 account lock;
alter user FOP_CMS account lock;
alter user TREASURY_CMS account lock;

注意:

锁用户只能阻止新连接;
不会断开已有会话;
不会自动释放已有行锁。

2. 清理已有持锁和 library cache 等待会话
#

生成 kill 语句:

set lines 300 pages 1000
col kill_cmd for a90
col username for a15
col event for a35
col sql_text for a100

select s.inst_id,
       s.sid,
       s.serial#,
       p.spid,
       s.username,
       s.status,
       s.event,
       s.sql_id,
       s.last_call_et,
       substr(q.sql_text,1,100) as sql_text,
       'alter system kill session ''' ||
       s.sid || ',' || s.serial# || ',@' || s.inst_id ||
       ''' immediate;' as kill_cmd
from gv$session s
left join gv$process p
  on s.inst_id = p.inst_id
 and s.paddr = p.addr
left join gv$sqlarea q
  on s.inst_id = q.inst_id
 and s.sql_id = q.sql_id
where s.type = 'USER'
  and s.username in ('GKSD6','FOP_CMS','TREASURY_CMS')
  and (
        s.event in ('library cache pin',
                    'library cache lock',
                    'library cache: mutex X',
                    'enq: TX - row lock contention')
        or lower(q.sql_text) like '%scm_public%'
        or lower(q.sql_text) like '%fop_service.fop_service_unop%'
        or lower(q.sql_text) like '%scm_purbill%'
        or lower(q.sql_text) like '%fin_salinv%'
      )
order by s.last_call_et desc;

优先处理:

1. 正在阻塞编译的 final blocking session
2. 持有 TX 行锁的 holder
3. library cache pin / lock 时间最长的业务会话
4. 正在调用 SCM_PUBLIC / SCM_PURBILL / FOP_SERVICE 的会话

3. 编译失效包
#

先编译依赖包:

alter package GKSD6.SCM_PURBILL compile body;
show errors package body GKSD6.SCM_PURBILL;

再编译核心公共包:

alter package GKSD6.SCM_PUBLIC compile body;
show errors package body GKSD6.SCM_PUBLIC;

本次处理过程中,编译 SCM_PUBLIC 时被一个业务会话阻塞:

SYS 会话正在执行:
alter package GKSD6.SCM_PUBLIC compile body

等待:
library cache pin

blocking session
INST 1 / SID 705

查询发现 1,705GKSD6 业务会话,正在执行:

INSERT INTO pub_collection_dtl ...

并等待:

latch: row cache objects

处理该 blocker:

alter system kill session '705,5' immediate;

之后 SCM_PUBLIC 编译成功。


4. 确认对象状态
#

set lines 200 pages 1000
col owner for a15
col object_name for a35
col object_type for a20
col status for a10
col last_ddl_time for a20

select owner,
       object_name,
       object_type,
       status,
       to_char(last_ddl_time,'yyyy-mm-dd hh24:mi:ss') last_ddl_time
from dba_objects
where owner = 'GKSD6'
  and object_name in ('SCM_PUBLIC','SCM_PURBILL')
order by object_name, object_type;

最终结果:

GKSD6.SCM_PUBLIC   PACKAGE       VALID
GKSD6.SCM_PUBLIC   PACKAGE BODY  VALID
GKSD6.SCM_PURBILL  PACKAGE       VALID
GKSD6.SCM_PURBILL  PACKAGE BODY  VALID

说明核心问题已解决。


5. 恢复应用用户
#

alter user GKSD6 account unlock;
alter user FOP_CMS account unlock;
alter user TREASURY_CMS account unlock;

七、故障恢复后验证
#

1. 查看等待事件是否下降
#

select inst_id,
       event,
       count(*) cnt,
       max(seconds_in_wait) max_wait_sec
from gv$session
where event in ('library cache pin',
                'library cache lock',
                'library cache: mutex X',
                'enq: TX - row lock contention')
group by inst_id, event
order by inst_id, cnt desc;

2. 查看是否仍有行锁 blocker
#

with w as (
    select blocking_instance as holder_inst,
           blocking_session  as holder_sid,
           count(*) as waiters
    from gv$session
    where event = 'enq: TX - row lock contention'
      and blocking_session is not null
    group by blocking_instance, blocking_session
)
select w.holder_inst,
       w.holder_sid,
       hs.serial#,
       hs.username,
       hs.status,
       hs.event,
       hs.sql_id,
       w.waiters
from w
join gv$session hs
  on hs.inst_id = w.holder_inst
 and hs.sid     = w.holder_sid
order by w.waiters desc;

3. 查看是否还有 invalid 对象
#

select owner,
       object_name,
       object_type,
       status,
       last_ddl_time
from dba_objects
where owner in ('GKSD6','FOP_CMS','TREASURY_CMS','SAVEFEGKSD6')
  and status <> 'VALID'
order by owner, object_type, object_name;

八、故障根因总结
#

本次故障的根因可以总结为:

业务表 SCM_LOSSINOUT_APPLY_HDR 新增 BPMID 字段后,
依赖该表或相关依赖链的 PL/SQL 包未完成重新编译,
导致 GKSD6.SCM_PUBLIC  GKSD6.SCM_PURBILL 包体失效。

业务高峰期大量会话并发调用失效包,
触发 library cache pin / lock 争用。

部分业务过程在持有 DML 锁后卡住,
导致后续 UPDATE / INSERT 出现 enq: TX - row lock contention
最终表现为数据库整体卡顿。

九、经验教训
#

1. 表结构变更后必须检查 invalid 对象
#

任何生产表结构变更后,都要执行:

select owner,
       object_name,
       object_type,
       status,
       last_ddl_time
from dba_objects
where owner in ('GKSD6','FOP_CMS','TREASURY_CMS','SAVEFEGKSD6')
  and status <> 'VALID'
order by owner, object_type, object_name;

不能只看 DDL 是否执行成功。


2. 不能让业务高峰期首次调用触发自动编译
#

失效对象如果不提前编译,Oracle 可能在业务首次调用时尝试自动编译。

低并发时可能无感知,高并发时就可能出现:

library cache pin
library cache lock
library cache: mutex X

严重时会放大成全局业务卡顿。


3. 发布流程中应增加强制编译步骤
#

表结构变更后建议执行:

alter package GKSD6.SCM_PURBILL compile body;
alter package GKSD6.SCM_PUBLIC compile body;

或者对 schema 执行统一编译:

exec dbms_utility.compile_schema(schema => 'GKSD6', compile_all => false);

生产环境建议优先在低峰期执行,并提前评估影响。


4. 编译前要先控制业务入口
#

如果已经发生大量调用,不建议直接反复编译。

正确顺序是:

1. 阻断新连接
2. 清理已有 pin 对象的会话
3. 编译 invalid 对象
4. 验证对象状态
5. 恢复业务连接

5. 不要只杀行锁 waiter
#

本次表面有大量:

enq: TX - row lock contention

但真正根因是:

library cache pin / lock

所以只杀等待行锁的 waiter 没意义。要找到持锁 holder,以及阻塞编译或 library cache 的 blocker。


十、建议形成标准变更检查清单
#

以后生产表结构变更后,建议固定执行以下检查:

-- 1. 查 invalid 对象
select owner, object_name, object_type, status, last_ddl_time
from dba_objects
where owner in ('GKSD6','FOP_CMS','TREASURY_CMS','SAVEFEGKSD6')
  and status <> 'VALID'
order by owner, object_type, object_name;

-- 2. 查变更表影响了哪些对象
select owner,
       name,
       type,
       referenced_owner,
       referenced_name,
       referenced_type
from dba_dependencies
where referenced_owner = 'GKSD6'
  and referenced_name = 'SCM_LOSSINOUT_APPLY_HDR'
order by owner, name, type;

-- 3. 编译核心业务包
alter package GKSD6.SCM_PURBILL compile body;
alter package GKSD6.SCM_PUBLIC compile body;

-- 4. 查编译错误
select owner, name, type, line, position, text
from dba_errors
where owner = 'GKSD6'
order by name, sequence;

-- 5. 再次确认 invalid 对象
select owner, object_name, object_type, status, last_ddl_time
from dba_objects
where owner in ('GKSD6','FOP_CMS','TREASURY_CMS','SAVEFEGKSD6')
  and status <> 'VALID'
order by owner, object_type, object_name;

十一、文章结论
#

本次故障不是单纯的行锁问题,而是一次典型的 表结构变更后依赖包体失效,业务高并发触发 library cache 争用,最终放大为行锁阻塞 的生产事故。

核心处理思路是:

先定位 invalid 对象
再确认 library cache 等待
再阻断新业务连接
再清理已有 pin/锁会话
最后按依赖顺序重新编译包体

最终通过编译:

GKSD6.SCM_PURBILL PACKAGE BODY
GKSD6.SCM_PUBLIC PACKAGE BODY

并清理阻塞编译的业务会话后,两个包体恢复 VALID,故障解除。