跳过正文

Oracle ADG 同步状态与延迟巡检 SQL SOP

Greatfinish
作者
Greatfinish
记录 Oracle、PostgreSQL、达梦、Linux、存储与生产环境故障处理经验。
目录

适用范围:Oracle Data Guard / Active Data Guard 日常巡检、故障判断、恢复后验证。

适用版本:Oracle 11g / 12c / 18c / 19c 物理备库。

一、文档目标
#

本文档用于检查 Oracle ADG 是否同步正常、是否存在传输延迟、应用延迟、归档缺口,以及主库到备库的日志传输目标是否异常。

巡检最终应输出以下结论之一:

结论含义
正常无 archive gap,transport lag=0 或秒级,apply lag=0 或秒级,MRP/RFS 正常
有传输延迟主库 redo 未及时传到备库,transport lag 较大
有应用延迟redo 已到备库,但 MRP 应用慢,apply lag 较大
有归档缺口v$archive_gap 有记录,备库缺少中间归档
MRP 未启动没有 MRP0,或备库只读但未应用
主库传输异常主库 v$archive_dest / v$archive_dest_status 有 ERROR

二、总体判断逻辑
#

ADG 巡检不能只看一个视图,要从以下几个维度综合判断:

检查维度关键视图正常状态
备库角色v$databaseDATABASE_ROLE=PHYSICAL STANDBY
备库打开模式v$databaseMOUNTEDREAD ONLY WITH APPLY
日志接收v$managed_standby / v$dataguard_processRFS,且对应主库 LGWR/ARCH 连接
日志应用v$managed_standby / v$dataguard_processMRP0,状态为 APPLYING_LOGWAIT_FOR_LOG
传输延迟v$dataguard_statstransport lag 为 0 或秒级
应用延迟v$dataguard_statsapply lag 为 0 或秒级
归档缺口v$archive_gapno rows selected
sequence 差距v$archived_log每个 thread#seq_lag=0 或短时间可追平
主库传输目标v$archive_dest / v$archive_dest_statusSTATUS=VALIDERROR 为空

注意:Active Data Guard 实时应用场景下,v$archived_log.applied 可能略滞后于 standby redo log 的实时应用位置,因此需要结合 v$managed_standbyv$dataguard_statsv$archive_gap 一起判断。

三、备库侧巡检 SQL
#

巡检前 SQL*Plus 格式设置,建议每次巡检前先执行:

set lines 300 pages 1000
set long 20000
set trimspool on
set tab off

col database_role for a20
col open_mode for a25
col protection_mode for a25
col switchover_status for a25

col process for a20
col status for a20
col client_process for a20
col thread# for 999
col sequence# for 999999999
col block# for 999999999

col name for a30
col value for a30
col unit for a25
col time_computed for a30
col datum_time for a30

col dest_name for a20
col type for a10
col database_mode for a20
col recovery_mode for a20
col synchronization_status for a20
col synchronized for a10
col gap_status for a20
col db_unique_name for a20

col dest_name for a30
col destination for a80
col error for a100

以下 SQL 优先在备库执行。

3.1 检查备库角色和打开状态
#

select database_role,
       open_mode,
       protection_mode,
       switchover_status
from v$database;

正常参考:

DATABASE_ROLE        OPEN_MODE                 PROTECTION_MODE           SWITCHOVER_STATUS
-------------------- ------------------------- ------------------------- -------------------------
PHYSICAL STANDBY     MOUNTED                   MAXIMUM PERFORMANCE       NOT ALLOWED

或者:

PHYSICAL STANDBY     READ ONLY WITH APPLY      MAXIMUM PERFORMANCE       NOT ALLOWED

判断标准:

字段正常/异常判断
DATABASE_ROLE正常应为 PHYSICAL STANDBY
OPEN_MODE=MOUNTED备库挂载状态,可以正常应用日志
OPEN_MODE=READ ONLY WITH APPLYADG 可读且实时应用,正常
OPEN_MODE=READ ONLY只读打开但未应用,需要检查 MRP 是否启动
OPEN_MODE=READ WRITE不是正常物理备库状态,需要确认是否已 failover/switchover

3.2 检查 RFS / MRP 进程状态
#

select process,
       status,
       thread#,
       sequence#,
       block#,
       client_process
from v$managed_standby
order by process, thread#;

正常参考:

PROCESS            STATUS                   THREAD#  SEQUENCE#     BLOCK# CLIENT_PROCESS
------------------ ------------------------ ------- ---------- ---------- ----------------
MRP0               APPLYING_LOG                   2      62398      27749 N/A
RFS                IDLE                           1      96120     239800 LGWR
RFS                IDLE                           2      62398      27754 LGWR

判断标准:

进程/状态含义判断
RFS 存在备库正在接收主库 redo正常
CLIENT_PROCESS=LGWR主库 LGWR 直接传输 redo正常
CLIENT_PROCESS=Archival归档方式传输正常
MRP0 APPLYING_LOG备库正在应用日志正常
MRP0 WAIT_FOR_LOG已追平,等待新日志正常
没有 MRP0日志应用未启动异常
MRP0 WAIT_FOR_GAP缺少归档日志异常
MRP0 的 sequence# / block# 长时间不变可能卡住或 I/O 慢需排查

注意:RFS IDLE 不一定异常,可能表示当前没有新的 redo 或已经接收完当前 redo。

3.3 检查 ADG 传输延迟和应用延迟
#

select name,
       value,
       unit,
       time_computed,
       datum_time
from v$dataguard_stats
where name in ('transport lag',
               'apply lag',
               'apply finish time',
               'estimated startup time');

正常参考:

NAME                           VALUE                          TIME_COMPUTED                  DATUM_TIME
------------------------------ ------------------------------ ------------------------------ ------------------------------
transport lag                  +00 00:00:00                   05/15/2026 21:10:06            05/15/2026 21:10:06
apply lag                      +00 00:00:00                   05/15/2026 21:10:06            05/15/2026 21:10:06
apply finish time              +00 00:00:00.000               05/15/2026 21:10:06

判断标准:

指标含义正常判断
transport lag主库 redo 传到备库的延迟0 或秒级
apply lag备库应用 redo 的延迟0 或秒级
apply finish time预计追平还需要多久0 或很短
datum_time统计数据采样时间应接近当前时间

异常示例:

apply lag  +92 07:52:39

说明备库应用延迟已经超过 92 天,属于严重不同步。

注意:如果存在 archive gap,transport lagapply finish time 可能不准确,应优先以 v$archive_gap 为准。

3.4 检查是否存在归档缺口
#

select * from v$archive_gap;

正常结果:

no rows selected

异常示例:

THREAD# LOW_SEQUENCE# HIGH_SEQUENCE#     CON_ID
------- ------------- -------------- ----------
      1         88228          94296          1
      2         56221          60831          1

含义:

Thread缺失范围说明
thread 188228 ~ 94296缺少 thread 1 的中间归档
thread 256221 ~ 60831缺少 thread 2 的中间归档

只要 v$archive_gap 有记录,就不能认为 ADG 同步正常。

3.5 检查每个 thread 的接收和应用差距
#

select thread#,
       max(sequence#) as max_received_seq,
       max(case when applied in ('YES','IN-MEMORY') then sequence# end) as max_applied_seq,
       max(sequence#) -
       max(case when applied in ('YES','IN-MEMORY') then sequence# end) as seq_lag
from v$archived_log
where resetlogs_change# = (select resetlogs_change# from v$database)
group by thread#
order by thread#;

正常参考:

THREAD# MAX_RECEIVED_SEQ MAX_APPLIED_SEQ    SEQ_LAG
------- ---------------- --------------- ----------
      1            96119           96119          0
      2            62397           62397          0

判断标准:

seq_lag判断
0已追平
1~2可能是实时应用/归档标记滞后,需要结合 v$managed_standby 判断
持续增长应用延迟或归档未连续应用
某个 thread 差距大RAC 某个线程存在传输或应用问题

3.6 检查最近应用到的时间点
#

select thread#,
       max(sequence#) as last_applied_seq,
       to_char(max(next_time), 'yyyy-mm-dd hh24:mi:ss') as last_applied_time
from v$archived_log
where applied in ('YES','IN-MEMORY')
  and resetlogs_change# = (select resetlogs_change# from v$database)
group by thread#
order by thread#;

输出示例:

THREAD# LAST_APPLIED_SEQ LAST_APPLIED_TIME
------- ---------------- --------------------
      1            96119 2026-05-15 21:09:44
      2            62397 2026-05-15 21:09:44

用途:

  • 确认备库数据大致应用到哪个业务时间点;
  • apply lag 异常时,用于判断落后时间范围;
  • RAC 环境下分别确认每个 thread 的应用时间。

3.7 查看最近归档日志应用明细
#

select thread#,
       sequence#,
       applied,
       to_char(first_time, 'yyyy-mm-dd hh24:mi:ss') as first_time,
       to_char(next_time, 'yyyy-mm-dd hh24:mi:ss') as next_time,
       to_char(completion_time, 'yyyy-mm-dd hh24:mi:ss') as completion_time
from (
    select a.*,
           row_number() over(partition by thread# order by sequence# desc) rn
    from v$archived_log a
    where resetlogs_change# = (select resetlogs_change# from v$database)
)
where rn <= 10
order by thread#, sequence# desc;

用途:

  • 查看每个 thread 最近 10 个归档日志是否已应用;
  • 检查 APPLIED 是否为 YESIN-MEMORY
  • 辅助判断某个 thread 是否落后。

四、主库侧巡检 SQL
#

巡检前 SQL*Plus 格式设置,建议每次巡检前先执行:

set lines 300 pages 1000
set long 20000
set trimspool on
set tab off

col database_role for a20
col open_mode for a25
col protection_mode for a25
col switchover_status for a25

col process for a20
col status for a20
col client_process for a20
col thread# for 999
col sequence# for 999999999
col block# for 999999999

col name for a30
col value for a30
col unit for a25
col time_computed for a30
col datum_time for a30

col dest_name for a20
col type for a10
col database_mode for a20
col recovery_mode for a20
col synchronization_status for a20
col synchronized for a10
col gap_status for a20
col db_unique_name for a20

col dest_name for a30
col destination for a80
col error for a100

以下 SQL 在主库执行,用于确认主库到备库的日志传输目标是否正常。

4.1 检查主库归档目标状态
#

select dest_id,
       dest_name,
       status,
       type,
       database_mode,
       recovery_mode,
       db_unique_name,
       synchronization_status,
       synchronized,
       gap_status,
       error
from v$archive_dest_status
where status <> 'INACTIVE'
  and type = 'PHYSICAL'
order by dest_id;

正常参考:

DEST_ID STATUS  TARGET  DATABASE_MODE    RECOVERY_MODE             ERROR
------- ------- ------- ---------------- ------------------------- -----
      2 VALID   STANDBY OPEN_READ-ONLY   MANAGED REAL TIME APPLY

判断标准:

字段正常判断
STATUSVALID
TARGETSTANDBY
RECOVERY_MODEMANAGED REAL TIME APPLYMANAGED
ERROR为空

5.2 检查主库归档目标配置和错误信息
#

select dest_id,
       dest_name,
       status,
       destination,
       error
from v$archive_dest
where status <> 'INACTIVE'
  and destination is not null
order by dest_id;

正常参考:

DEST_ID DEST_NAME             STATUS   DESTINATION       ERROR
------- --------------------- -------- ----------------- -----
	  1 LOG_ARCHIVE_DEST_1    VALID    USE_DB_RECOVERY_FILE_DEST
	  2 LOG_ARCHIVE_DEST_2    VALID    standby_tns

如果 ERROR 不为空,需要优先处理主库日志传输问题。

5.3 主库查看当前归档 sequence
#

select thread#,
       max(sequence#) as current_archived_seq
from v$archived_log
where resetlogs_change# = (select resetlogs_change# from v$database)
group by thread#
order by thread#;

用途:

  • 和备库 max_received_seq 对比;
  • 判断主库已经生成到哪个归档 sequence;
  • RAC 环境下必须分别看每个 thread#

5.4 主库查看当前 online redo sequence
#

select thread#,
       group#,
       sequence#,
       status,
       archived
from v$log
order by thread#, group#;

用途:

  • 查看主库当前正在写哪个 online redo;
  • 和备库 RFS/MRP 当前 sequence 对比;
  • 判断实时应用是否接近当前日志。

五、一键汇总巡检 SQL
#

以下 SQL 建议保存为 adg_check.sql,在备库执行。

set lines 300 pages 1000
set trimspool on
set tab off

prompt ======================================================================
prompt 1. DATABASE ROLE / OPEN MODE
prompt ======================================================================
col database_role for a20
col open_mode for a25
col protection_mode for a25
col switchover_status for a25
select database_role,
       open_mode,
       protection_mode,
       switchover_status
from v$database;

prompt ======================================================================
prompt 2. MANAGED STANDBY PROCESS
prompt ======================================================================
col process for a18
col status for a24
col client_process for a16
select process,
       status,
       thread#,
       sequence#,
       block#,
       client_process
from v$managed_standby
order by process, thread#;

prompt ======================================================================
prompt 3. DATAGUARD LAG
prompt ======================================================================
col name for a30
col value for a30
col unit for a25
col time_computed for a30
col datum_time for a30
select name,
       value,
       unit,
       time_computed,
       datum_time
from v$dataguard_stats
where name in ('transport lag',
               'apply lag',
               'apply finish time',
               'estimated startup time');

prompt ======================================================================
prompt 4. ARCHIVE GAP
prompt ======================================================================
select * from v$archive_gap;

prompt ======================================================================
prompt 5. RECEIVED/APPLIED SEQUENCE LAG BY THREAD
prompt ======================================================================
col max_received_seq for 999999999
col max_applied_seq for 999999999
col seq_lag for 999999999
select thread#,
       max(sequence#) as max_received_seq,
       max(case when applied in ('YES','IN-MEMORY') then sequence# end) as max_applied_seq,
       max(sequence#) -
       max(case when applied in ('YES','IN-MEMORY') then sequence# end) as seq_lag
from v$archived_log
where resetlogs_change# = (select resetlogs_change# from v$database)
group by thread#
order by thread#;

prompt ======================================================================
prompt 6. LAST APPLIED TIME BY THREAD
prompt ======================================================================
col last_applied_time for a20
select thread#,
       max(sequence#) as last_applied_seq,
       to_char(max(next_time), 'yyyy-mm-dd hh24:mi:ss') as last_applied_time
from v$archived_log
where applied in ('YES','IN-MEMORY')
  and resetlogs_change# = (select resetlogs_change# from v$database)
group by thread#
order by thread#;

执行方式:

sqlplus / as sysdba @adg_check.sql

如需保存日志:

spool /tmp/adg_check_&&_DATE.log
@adg_check.sql
spool off

六、异常场景判断与处理 SOP
#

6.1 场景一:v$archive_gap 有记录
#

异常表现:

select * from v$archive_gap;

输出类似:

THREAD# LOW_SEQUENCE# HIGH_SEQUENCE#
------- ------------- --------------
      1         88228          94296
      2         56221          60831

判断:

备库缺少中间归档,MRP 无法连续恢复到最新位置。

处理路径:

  1. 在主库确认缺失归档是否还存在;
  2. 如果归档存在,复制到备库并注册;
  3. 如果归档不存在或缺口太大,使用 RMAN 增量推进备库;
  4. 重启 MRP;
  5. 重新验证 v$archive_gapapply lagseq_lag

主库查询缺失归档是否存在:

select thread#,
       sequence#,
       name,
       deleted,
       status,
       to_char(first_time, 'yyyy-mm-dd hh24:mi:ss') first_time,
       to_char(next_time, 'yyyy-mm-dd hh24:mi:ss') next_time
from v$archived_log
where resetlogs_change# = (select resetlogs_change# from v$database)
  and (
       (thread# = 1 and sequence# between 88228 and 94296)
    or (thread# = 2 and sequence# between 56221 and 60831)
  )
order by thread#, sequence#;

备库注册归档:

alter database recover managed standby database cancel;

alter database register physical logfile '/u01/arch_gap/thread1/1_88228_xxx.arc';
alter database register physical logfile '/u01/arch_gap/thread2/2_56221_xxx.arc';

alter database recover managed standby database using current logfile disconnect from session;

批量生成注册脚本:

find /u01/arch_gap -type f | sort | awk '{print "alter database register physical logfile '\''" $0 "'\'';"}' > /tmp/register_arch.sql
sqlplus / as sysdba @/tmp/register_arch.sql

6.2 场景二:归档缺口太大或归档已丢失
#

适用情况:

  • 缺失归档数量很多;
  • 主库 FRA 已清理;
  • RMAN 备份中恢复归档成本过高;
  • 手工补归档时间太长。

推荐处理:使用 RMAN 增量推进备库。

方式 A:18c/19c 从主库服务恢复备库
#

备库执行:

rman target /
recover standby database from service '主库TNS服务名';

执行前确认备库能连接主库:

tnsping 主库TNS服务名
sqlplus sys/密码@主库TNS服务名 as sysdba

方式 B:传统 RMAN incremental from SCN
#

备库查询最低 SCN:

select min(checkpoint_change#) as standby_min_scn
from v$datafile_header;

主库基于该 SCN 做增量备份:

run {
  allocate channel c1 device type disk;
  allocate channel c2 device type disk;
  backup incremental from scn <standby_min_scn> database
    format '/u01/standby_incr/standby_incr_%U.bkp';
  release channel c1;
  release channel c2;
}

backup current controlfile for standby format '/u01/standby_incr/standby_control_%U.bkp';

拷贝到备库:

scp /u01/standby_incr/* oracle@备库主机:/u01/standby_incr/

备库恢复:

rman target /
startup force nomount;
restore standby controlfile from '/u01/standby_incr/standby_control_xxx.bkp';
alter database mount;
catalog start with '/u01/standby_incr/';
recover database noredo;

恢复完成后启动 MRP:

alter database recover managed standby database using current logfile disconnect from session;

6.3 场景三:没有 MRP0
#

异常表现:

v$managed_standby 中没有 MRP0

处理:

alter database recover managed standby database using current logfile disconnect from session;

如果不使用实时应用:

alter database recover managed standby database disconnect from session;

再次检查:

select process,
       status,
       thread#,
       sequence#,
       block#,
       client_process
from v$managed_standby
order by process, thread#;

正常应看到:

MRP0 APPLYING_LOG

或:

MRP0 WAIT_FOR_LOG

6.4 场景四:主库传输目标报错
#

主库检查:

select dest_id,
       dest_name,
       status,
       destination,
       error
from v$archive_dest
where target = 'STANDBY'
order by dest_id;

如果 ERROR 不为空,按以下方向排查:

问题方向检查内容
网络主备监听、端口、主机名解析、防火墙
TNStnspingtnsnames.ora、service_name
密码文件主备 SYS 密码、orapw 文件是否一致
归档目标log_archive_dest_n 配置是否正确
备库状态备库实例是否启动到 mount/open 状态
FRA/磁盘空间主库或备库归档目录是否满

常用测试:

tnsping standby_tns
sqlplus sys/密码@standby_tns as sysdba

6.5 场景五:传输正常但应用慢
#

现象:

  • RFS 正常;
  • transport lag 不大;
  • apply lag 持续增长;
  • max_received_seq > max_applied_seq

处理方向:

  1. 检查备库 I/O 性能;
  2. 检查备库 CPU / 内存压力;
  3. 检查是否有大量 redo 需要应用;
  4. 检查 standby redo log 是否合理;
  5. 检查 alert 日志是否有 MRP 报错。

操作系统检查:

top
vmstat 1 10
iostat -x 1 10

重点关注:

指标异常判断
%util接近 100%,磁盘忙
await明显升高,I/O 响应慢
%iowait长时间较高,系统等待 I/O
MRP 进程 CPU长时间高,说明正在追日志;长时间无变化则可能卡住

七、恢复后验证 SOP
#

ADG 异常处理完成后,必须连续验证以下项目。

7.1 验证无归档缺口
#

select * from v$archive_gap;

正常:

no rows selected

7.2 验证延迟归零
#

select name,
       value,
       time_computed,
       datum_time
from v$dataguard_stats
where name in ('transport lag','apply lag','apply finish time');

正常:

transport lag      +00 00:00:00
apply lag          +00 00:00:00
apply finish time  +00 00:00:00.000

7.3 验证每个 thread 追平
#

select thread#,
       max(sequence#) as max_received_seq,
       max(case when applied in ('YES','IN-MEMORY') then sequence# end) as max_applied_seq,
       max(sequence#) -
       max(case when applied in ('YES','IN-MEMORY') then sequence# end) as seq_lag
from v$archived_log
where resetlogs_change# = (select resetlogs_change# from v$database)
group by thread#
order by thread#;

正常:

THREAD# MAX_RECEIVED_SEQ MAX_APPLIED_SEQ SEQ_LAG
------- ---------------- --------------- -------
      1            96119           96119       0
      2            62397           62397       0

7.4 验证 MRP/RFS 状态
#

select process,
       status,
       thread#,
       sequence#,
       block#,
       client_process
from v$managed_standby
order by process, thread#;

正常:

MRP0 APPLYING_LOG  WAIT_FOR_LOG
RFS  IDLE / RECEIVINGCLIENT_PROCESS  LGWR  Archival

7.5 建议观察周期
#

只有持续满足以下条件,才能认为恢复稳定:

  1. v$archive_gap = no rows selected
  2. transport lag = 0 或秒级
  3. apply lag = 0 或秒级
  4. 每个 thread# 的 seq_lag = 0 或短时间可追平
  5. MRP0/RFS 状态正常
  6. 主库传输目标无 ERROR

八、常用命令速查
#

8.1 启动实时应用
#

alter database recover managed standby database using current logfile disconnect from session;

8.2 停止日志应用
#

alter database recover managed standby database cancel;

8.3 启动非实时应用
#

alter database recover managed standby database disconnect from session;

8.4 主库强制归档当前日志
#

alter system archive log current;

8.5 主库切换日志
#

alter system switch logfile;

8.6 备库注册单个归档
#

alter database register physical logfile '/path/to/archive.arc';

8.7 查看 alert 日志
#

adrci
show homes
set home diag/rdbms/<db_unique_name>/<instance_name>
show alert -tail 200

或:

tail -200f $ORACLE_BASE/diag/rdbms/*/*/trace/alert_*.log