跳过正文

PostgreSQL 锁排查处理流程

Greatfinish
作者
Greatfinish
记录 Oracle、PostgreSQL、达梦、Linux、存储与生产环境故障处理经验。

1. 先确认当前连接数据库
#

PostgreSQL 的 information_schema、表结构、锁对象都和当前数据库有关。先确认库:

SELECT current_database();

如果不在目标库,先切换:

\c fxzxdb

2. 查看当前是否有锁等待
#

这是第一步,先看谁在等锁:

SELECT
    pid,
    usename,
    application_name,
    client_addr,
    state,
    wait_event_type,
    wait_event,
    now() - query_start AS query_age,
    left(query, 300) AS query
FROM pg_stat_activity
WHERE datname = current_database()
  AND wait_event_type = 'Lock'
ORDER BY query_start;

重点看这些字段:

pid              等锁会话
application_name 客户端来源,例如 DBeaverETL 程序
client_addr      客户端 IP
state            active / idle in transaction
wait_event_type  Lock 表示正在等锁
wait_event       relation 表示等表级锁
query_age        等了多久
query            正在执行什么 SQL

你这次就查到:

ALTER TABLE public.t_ods_fxzx_tsxm_fgs_target
ALTER COLUMN prc TYPE numeric USING prc::numeric;

状态是:

wait_event_type = Lock
wait_event      = relation

说明这个 DDL 正在等表级锁。

3. 查询某个 PID 被谁阻塞
#

假设等待锁的 PID 是 29739

SELECT
    a.pid,
    pg_blocking_pids(a.pid) AS blocking_pids,
    a.usename,
    a.application_name,
    a.client_addr,
    a.state,
    a.wait_event_type,
    a.wait_event,
    now() - a.query_start AS query_age,
    left(a.query, 300) AS query
FROM pg_stat_activity a
WHERE a.pid = 29739;

这次现场查出来:

29739 | blocking_pids = {22204,28355}

说明 297392220428355 阻塞。

4. 查询阻塞源详情
#

拿到 blocking_pids 后,继续查阻塞源:

SELECT
    pid,
    usename,
    application_name,
    client_addr,
    state,
    wait_event_type,
    wait_event,
    now() - xact_start AS xact_age,
    now() - query_start AS query_age,
    left(query, 500) AS query
FROM pg_stat_activity
WHERE pid = ANY(pg_blocking_pids(29739));

重点判断:

state = idle in transaction
wait_event_type = Client
wait_event = ClientRead
xact_age 很长

这类会话一般不是正在执行 SQL,而是事务开着没提交/没回滚,客户端也没继续动作。 它们会继续持有锁,导致 DDL 或后续查询排队。

你这次的阻塞源就是:

28355 | idle in transaction | xact_age 17:43:44
22204 | idle in transaction | xact_age 01:34:08

5. 查看某些 PID 持有什么锁
#

如果要进一步确认持锁对象,可以查 pg_locks

SELECT
    l.pid,
    a.usename,
    a.client_addr,
    a.application_name,
    a.state,
    l.locktype,
    l.mode,
    l.granted,
    l.relation::regclass AS relation_name,
    l.transactionid,
    l.virtualxid,
    now() - a.xact_start AS xact_age,
    left(a.query, 300) AS query
FROM pg_locks l
JOIN pg_stat_activity a ON l.pid = a.pid
WHERE a.datname = current_database()
  AND l.pid IN (22204, 28355, 29739)
ORDER BY l.pid, l.granted DESC, relation_name;

常见锁模式:

AccessShareLock       普通 SELECT 常见
RowExclusiveLock      INSERT / UPDATE / DELETE 常见
ShareLock             部分索引、约束操作可能出现
AccessExclusiveLock   ALTER TABLE / DROP / TRUNCATE  DDL 常见,最重

DDL 等待时,重点看是否在等:

AccessExclusiveLock

ALTER TABLE 通常需要 AccessExclusiveLock,这个锁会和普通 SELECT 的 AccessShareLock 冲突,所以一个长事务 SELECT 没结束,也可能挡住 DDL。

6. 查完整阻塞关系
#

可以查“谁阻塞谁”:

SELECT
    blocked.pid AS blocked_pid,
    blocked.usename AS blocked_user,
    blocked.application_name AS blocked_app,
    blocked.client_addr AS blocked_client,
    now() - blocked.query_start AS blocked_duration,
    left(blocked.query, 300) AS blocked_query,
    blocking.pid AS blocking_pid,
    blocking.usename AS blocking_user,
    blocking.application_name AS blocking_app,
    blocking.client_addr AS blocking_client,
    blocking.state AS blocking_state,
    now() - blocking.xact_start AS blocking_xact_age,
    left(blocking.query, 300) AS blocking_query
FROM pg_stat_activity blocked
JOIN pg_locks blocked_locks
  ON blocked_locks.pid = blocked.pid
JOIN pg_locks blocking_locks
  ON blocking_locks.locktype = blocked_locks.locktype
 AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
 AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
 AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
 AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
 AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
 AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
 AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
 AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
 AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
 AND blocking_locks.pid <> blocked_locks.pid
JOIN pg_stat_activity blocking
  ON blocking.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted
  AND blocking_locks.granted
  AND blocked.datname = current_database()
ORDER BY blocked.query_start;

这个 SQL 适合输出明细,但现场处理时更推荐先用:

pg_blocking_pids(pid)

因为更直接。

7. 判断是否可以终止阻塞源
#

优先处理这类会话:

state = idle in transaction
wait_event_type = Client
wait_event = ClientRead
xact_age 很长
client_addr 是应用服务器或 ETL 服务器
query 是元数据查询、SAVEPOINTBEGIN 后长时间未结束

这次现场的 967 / 31726 / 22204 / 28355 都属于类似情况。 idle in transaction 的危险点是:SQL 已经不跑了,但事务没结束,锁不会释放。

8. 终止阻塞源
#

先取消正在执行的 SQL:

SELECT pg_cancel_backend(22204);
SELECT pg_cancel_backend(28355);

但如果状态是 idle in transactionpg_cancel_backend() 往往没用,因为它没有正在运行的 SQL。 这时需要终止会话,让事务回滚并释放锁:

SELECT pid, pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE pid IN (22204, 28355);

也可以直接对某个等待会话的阻塞源执行:

SELECT pid, pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE pid = ANY(pg_blocking_pids(29739));

注意:pg_terminate_backend() 会断开连接,并回滚该会话未提交事务。生产环境建议先确认来源、业务影响,再执行。

9. 终止后复查锁是否释放
#

SELECT
    pid,
    usename,
    application_name,
    client_addr,
    state,
    wait_event_type,
    wait_event,
    now() - query_start AS query_age,
    left(query, 300) AS query
FROM pg_stat_activity
WHERE datname = current_database()
  AND wait_event_type = 'Lock'
ORDER BY query_start;

如果返回:

(0 rows)

说明当前锁等待已解除。

10. 确认 DDL 是否执行成功
#

如果是加字段:

SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name = 't_ods_fxzx_tsxm_hx_sku_kpi'
  AND column_name = 'prc';

或者:

\d+ public.t_ods_fxzx_tsxm_hx_sku_kpi

如果字段不存在,说明原 DDL 没有成功提交,需要重新执行。

11. 重新执行 DDL 时加超时保护
#

不要直接执行裸 DDL,建议加 lock_timeout

SET lock_timeout = '5s';

ALTER TABLE public.t_ods_fxzx_tsxm_hx_sku_kpi
ADD COLUMN IF NOT EXISTS prc varchar NULL;

如果拿不到锁,5 秒后失败,不会长时间排队拖死后续查询。

如果是改字段类型,例如:

ALTER TABLE public.t_ods_fxzx_tsxm_fgs_target
ALTER COLUMN prc TYPE numeric USING prc::numeric;

建议先校验数据:

SELECT prc, count(*)
FROM public.t_ods_fxzx_tsxm_fgs_target
WHERE prc IS NOT NULL
  AND btrim(prc) <> ''
  AND btrim(prc) !~ '^[+-]?([0-9]+(\.[0-9]+)?|\.[0-9]+)$'
GROUP BY prc
ORDER BY count(*) DESC
LIMIT 50;

因为 varcharnumeric 时,只要有空字符串、中文、逗号金额、N/A- 等脏值,DDL 就会失败。

12. 事后预防参数
#

建议给 ETL 用户加空闲事务超时:

ALTER ROLE etl_dev IN DATABASE fxzxdb
SET idle_in_transaction_session_timeout = '10min';

建议加锁等待超时:

ALTER ROLE etl_dev IN DATABASE fxzxdb
SET lock_timeout = '30s';

建议加 SQL 执行超时:

ALTER ROLE etl_dev IN DATABASE fxzxdb
SET statement_timeout = '30min';

这些参数对新连接生效,已有连接需要断开重连。

标准处理顺序总结
#

1. 确认当前数据库
2.  wait_event_type = 'Lock' 的会话
3.  pg_blocking_pids(pid) 找阻塞源
4. 查阻塞源状态、事务时长、客户端 IPSQL
5. 判断是否 idle in transaction
6. 必要时 pg_cancel_backend()
7. idle in transaction 无效时 pg_terminate_backend()
8. 复查是否还有 Lock 等待
9. 确认 DDL 是否成功
10. 重新执行 DDL 时加 lock_timeout
11. 事后设置 idle_in_transaction_session_timeout / lock_timeout

这次案例的核心经验是:不要只盯着被阻塞的 DDL,要找根阻塞源。真正该处理的是长时间 idle in transaction 的会话,而不是第一时间杀正在等待的 DDL。