高并发场景下,多个事务同时操作同一批数据很容易产生锁竞争,出现锁等待甚至会话阻塞的情况,轻则导致接口响应变慢,重则引发业务超时。要快速解决这类问题,核心是先找到阻塞的源头会话和对应的锁信息。

锁等待与阻塞会话的基础概念
锁等待指的是一个事务需要获取某把锁,但是这把锁被其他事务持有,当前事务只能进入等待状态。如果持有锁的事务长时间不释放,等待的事务就会一直阻塞,形成阻塞会话链。常见的锁类型包括行锁、表锁、意向锁等,不同数据库的实现略有差异,但排查思路基本一致。
通用排查思路
排查SQL高并发下的锁等待和阻塞会话,通常可以按照以下步骤进行:
- 先查看当前数据库的活跃会话,找到处于等待状态的会话
- 根据等待会话的信息,找到阻塞它的上游会话
- 分析上游会话正在执行的SQL和持有的锁,判断阻塞原因
- 根据业务场景决定是终止阻塞会话还是优化对应SQL
MySQL中的排查方法
查看当前锁等待信息
MySQL 5.7及以上版本可以通过performance_schema库下的表查询锁相关信息,首先开启对应的监控项:
-- 开启锁监控相关配置,默认可能是关闭的 UPDATE performance_schema.setup_instruments SET ENABLED = 'YES', TIMED = 'YES' WHERE NAME LIKE 'wait/lock/table/sql/%'; UPDATE performance_schema.setup_consumers SET ENABLED = 'YES' WHERE NAME LIKE '%lock%';
之后可以通过data_locks表查看当前所有持有的锁和等待的锁:
-- 查询当前锁信息,包含持有锁和等待锁的会话
SELECT
ENGINE_LOCK_ID,
ENGINE_TRANSACTION_ID,
THREAD_ID,
EVENT_ID,
OBJECT_SCHEMA,
OBJECT_NAME,
INDEX_NAME,
LOCK_TYPE,
LOCK_MODE,
LOCK_STATUS,
LOCK_DATA
FROM performance_schema.data_locks;
定位阻塞会话
通过data_lock_waits表可以直接找到锁等待的对应关系,确定哪个会话阻塞了哪个会话:
-- 查询锁等待关系,找到阻塞源头
SELECT
REQUESTING_ENGINE_TRANSACTION_ID AS 等待事务ID,
REQUESTING_THREAD_ID AS 等待线程ID,
REQUESTING_EVENT_ID AS 等待事件ID,
BLOCKING_ENGINE_TRANSACTION_ID AS 阻塞事务ID,
BLOCKING_THREAD_ID AS 阻塞线程ID,
BLOCKING_EVENT_ID AS 阻塞事件ID
FROM performance_schema.data_lock_waits;
拿到阻塞线程ID后,可以到information_schema库的PROCESSLIST表找到对应的会话信息,查看正在执行的SQL:
-- 根据线程ID查询会话详情和执行的SQL
SELECT
ID,
USER,
HOST,
DB,
COMMAND,
TIME,
STATE,
INFO
FROM information_schema.PROCESSLIST
WHERE ID IN (阻塞线程ID, 等待线程ID);
PostgreSQL中的排查方法
查看锁等待状态
PostgreSQL可以通过系统视图pg_locks和pg_stat_activity联合查询锁等待和阻塞会话信息:
-- 查询锁等待和对应的会话信息
SELECT
a.pid AS 等待会话ID,
a.usename AS 用户名,
a.datname AS 数据库名,
a.query AS 等待执行的SQL,
a.state AS 会话状态,
a.query_start AS 查询开始时间,
l.locktype AS 锁类型,
l.mode AS 锁模式,
l.granted AS 是否已获取锁
FROM pg_locks l
JOIN pg_stat_activity a ON l.pid = a.pid
WHERE l.granted = false;
定位阻塞源头
要找到阻塞上述等待会话的源头,可以再次关联pg_locks和pg_stat_activity,找到持有对应锁的会话:
-- 查询阻塞等待会话的源头会话
SELECT
blocked.pid AS 被阻塞会话ID,
blocked.query AS 被阻塞SQL,
blocking.pid AS 阻塞会话ID,
blocking.query AS 阻塞会话执行的SQL,
blocking.state AS 阻塞会话状态,
blocking.query_start AS 阻塞SQL开始时间
FROM pg_locks blocked_locks
JOIN pg_stat_activity blocked ON blocked_locks.pid = blocked.pid
JOIN pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
JOIN pg_stat_activity blocking ON blocking_locks.pid = blocking.pid
WHERE blocked_locks.granted = false
AND blocking_locks.granted = true;
阻塞会话的优化建议
找到阻塞源头后,可以根据实际情况处理:
- 如果是测试环境或者阻塞时间很长影响业务,可以先终止阻塞会话,MySQL使用
KILL 会话ID;,PostgreSQL使用SELECT pg_terminate_backend(会话ID); - 如果是业务SQL导致的长事务持有锁,需要优化SQL,比如避免大事务、给查询字段加合适的索引减少锁范围、调整事务隔离级别
- 高并发场景尽量缩短事务执行时间,避免事务中执行非数据库操作,减少锁持有时间
注意:终止生产环境的会话前一定要确认对应SQL的影响,避免误终止重要业务的事务导致数据不一致。