1. 先确认当前连接数据库#
PostgreSQL 的 information_schema、表结构、锁对象都和当前数据库有关。先确认库:
SELECT current_database();如果不在目标库,先切换:
\c fxzxdb2. 查看当前是否有锁等待#
这是第一步,先看谁在等锁:
SELECT
pid,
usename,
application_name,
client_addr,
state,
wait_event_type,
wait_event,
now() - query_start AS query_age,
left(query, 300) AS query
FROM pg_stat_activity
WHERE datname = current_database()
AND wait_event_type = 'Lock'
ORDER BY query_start;重点看这些字段:
pid 等锁会话
application_name 客户端来源,例如 DBeaver、ETL 程序
client_addr 客户端 IP
state active / idle in transaction
wait_event_type Lock 表示正在等锁
wait_event relation 表示等表级锁
query_age 等了多久
query 正在执行什么 SQL你这次就查到:
ALTER TABLE public.t_ods_fxzx_tsxm_fgs_target
ALTER COLUMN prc TYPE numeric USING prc::numeric;状态是:
wait_event_type = Lock
wait_event = relation说明这个 DDL 正在等表级锁。
3. 查询某个 PID 被谁阻塞#
假设等待锁的 PID 是 29739:
SELECT
a.pid,
pg_blocking_pids(a.pid) AS blocking_pids,
a.usename,
a.application_name,
a.client_addr,
a.state,
a.wait_event_type,
a.wait_event,
now() - a.query_start AS query_age,
left(a.query, 300) AS query
FROM pg_stat_activity a
WHERE a.pid = 29739;这次现场查出来:
29739 | blocking_pids = {22204,28355}说明 29739 被 22204 和 28355 阻塞。
4. 查询阻塞源详情#
拿到 blocking_pids 后,继续查阻塞源:
SELECT
pid,
usename,
application_name,
client_addr,
state,
wait_event_type,
wait_event,
now() - xact_start AS xact_age,
now() - query_start AS query_age,
left(query, 500) AS query
FROM pg_stat_activity
WHERE pid = ANY(pg_blocking_pids(29739));重点判断:
state = idle in transaction
wait_event_type = Client
wait_event = ClientRead
xact_age 很长这类会话一般不是正在执行 SQL,而是事务开着没提交/没回滚,客户端也没继续动作。 它们会继续持有锁,导致 DDL 或后续查询排队。
你这次的阻塞源就是:
28355 | idle in transaction | xact_age 17:43:44
22204 | idle in transaction | xact_age 01:34:085. 查看某些 PID 持有什么锁#
如果要进一步确认持锁对象,可以查 pg_locks:
SELECT
l.pid,
a.usename,
a.client_addr,
a.application_name,
a.state,
l.locktype,
l.mode,
l.granted,
l.relation::regclass AS relation_name,
l.transactionid,
l.virtualxid,
now() - a.xact_start AS xact_age,
left(a.query, 300) AS query
FROM pg_locks l
JOIN pg_stat_activity a ON l.pid = a.pid
WHERE a.datname = current_database()
AND l.pid IN (22204, 28355, 29739)
ORDER BY l.pid, l.granted DESC, relation_name;常见锁模式:
AccessShareLock 普通 SELECT 常见
RowExclusiveLock INSERT / UPDATE / DELETE 常见
ShareLock 部分索引、约束操作可能出现
AccessExclusiveLock ALTER TABLE / DROP / TRUNCATE 等 DDL 常见,最重DDL 等待时,重点看是否在等:
AccessExclusiveLockALTER TABLE 通常需要 AccessExclusiveLock,这个锁会和普通 SELECT 的 AccessShareLock 冲突,所以一个长事务 SELECT 没结束,也可能挡住 DDL。
6. 查完整阻塞关系#
可以查“谁阻塞谁”:
SELECT
blocked.pid AS blocked_pid,
blocked.usename AS blocked_user,
blocked.application_name AS blocked_app,
blocked.client_addr AS blocked_client,
now() - blocked.query_start AS blocked_duration,
left(blocked.query, 300) AS blocked_query,
blocking.pid AS blocking_pid,
blocking.usename AS blocking_user,
blocking.application_name AS blocking_app,
blocking.client_addr AS blocking_client,
blocking.state AS blocking_state,
now() - blocking.xact_start AS blocking_xact_age,
left(blocking.query, 300) AS blocking_query
FROM pg_stat_activity blocked
JOIN pg_locks blocked_locks
ON blocked_locks.pid = blocked.pid
JOIN pg_locks blocking_locks
ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
AND blocking_locks.pid <> blocked_locks.pid
JOIN pg_stat_activity blocking
ON blocking.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted
AND blocking_locks.granted
AND blocked.datname = current_database()
ORDER BY blocked.query_start;这个 SQL 适合输出明细,但现场处理时更推荐先用:
pg_blocking_pids(pid)因为更直接。
7. 判断是否可以终止阻塞源#
优先处理这类会话:
state = idle in transaction
wait_event_type = Client
wait_event = ClientRead
xact_age 很长
client_addr 是应用服务器或 ETL 服务器
query 是元数据查询、SAVEPOINT、BEGIN 后长时间未结束这次现场的 967 / 31726 / 22204 / 28355 都属于类似情况。
idle in transaction 的危险点是:SQL 已经不跑了,但事务没结束,锁不会释放。
8. 终止阻塞源#
先取消正在执行的 SQL:
SELECT pg_cancel_backend(22204);
SELECT pg_cancel_backend(28355);但如果状态是 idle in transaction,pg_cancel_backend() 往往没用,因为它没有正在运行的 SQL。
这时需要终止会话,让事务回滚并释放锁:
SELECT pid, pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE pid IN (22204, 28355);也可以直接对某个等待会话的阻塞源执行:
SELECT pid, pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE pid = ANY(pg_blocking_pids(29739));注意:pg_terminate_backend() 会断开连接,并回滚该会话未提交事务。生产环境建议先确认来源、业务影响,再执行。
9. 终止后复查锁是否释放#
SELECT
pid,
usename,
application_name,
client_addr,
state,
wait_event_type,
wait_event,
now() - query_start AS query_age,
left(query, 300) AS query
FROM pg_stat_activity
WHERE datname = current_database()
AND wait_event_type = 'Lock'
ORDER BY query_start;如果返回:
(0 rows)说明当前锁等待已解除。
10. 确认 DDL 是否执行成功#
如果是加字段:
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_schema = 'public'
AND table_name = 't_ods_fxzx_tsxm_hx_sku_kpi'
AND column_name = 'prc';或者:
\d+ public.t_ods_fxzx_tsxm_hx_sku_kpi如果字段不存在,说明原 DDL 没有成功提交,需要重新执行。
11. 重新执行 DDL 时加超时保护#
不要直接执行裸 DDL,建议加 lock_timeout:
SET lock_timeout = '5s';
ALTER TABLE public.t_ods_fxzx_tsxm_hx_sku_kpi
ADD COLUMN IF NOT EXISTS prc varchar NULL;如果拿不到锁,5 秒后失败,不会长时间排队拖死后续查询。
如果是改字段类型,例如:
ALTER TABLE public.t_ods_fxzx_tsxm_fgs_target
ALTER COLUMN prc TYPE numeric USING prc::numeric;建议先校验数据:
SELECT prc, count(*)
FROM public.t_ods_fxzx_tsxm_fgs_target
WHERE prc IS NOT NULL
AND btrim(prc) <> ''
AND btrim(prc) !~ '^[+-]?([0-9]+(\.[0-9]+)?|\.[0-9]+)$'
GROUP BY prc
ORDER BY count(*) DESC
LIMIT 50;因为 varchar 转 numeric 时,只要有空字符串、中文、逗号金额、N/A、- 等脏值,DDL 就会失败。
12. 事后预防参数#
建议给 ETL 用户加空闲事务超时:
ALTER ROLE etl_dev IN DATABASE fxzxdb
SET idle_in_transaction_session_timeout = '10min';建议加锁等待超时:
ALTER ROLE etl_dev IN DATABASE fxzxdb
SET lock_timeout = '30s';建议加 SQL 执行超时:
ALTER ROLE etl_dev IN DATABASE fxzxdb
SET statement_timeout = '30min';这些参数对新连接生效,已有连接需要断开重连。
标准处理顺序总结#
1. 确认当前数据库
2. 查 wait_event_type = 'Lock' 的会话
3. 用 pg_blocking_pids(pid) 找阻塞源
4. 查阻塞源状态、事务时长、客户端 IP、SQL
5. 判断是否 idle in transaction
6. 必要时 pg_cancel_backend()
7. idle in transaction 无效时 pg_terminate_backend()
8. 复查是否还有 Lock 等待
9. 确认 DDL 是否成功
10. 重新执行 DDL 时加 lock_timeout
11. 事后设置 idle_in_transaction_session_timeout / lock_timeout这次案例的核心经验是:不要只盯着被阻塞的 DDL,要找根阻塞源。真正该处理的是长时间 idle in transaction 的会话,而不是第一时间杀正在等待的 DDL。
