OEL6.2 EXT4 filesystemio_options=SETALL造成archivelog坏块

联系:QQ(5163721)

标题:OEL6.2 EXT4 filesystemio_options=SETALL造成archivelog坏块

作者:Lunar©版权所有[文章允许转载,但必须以链接方式注明源地址,否则追究法律责任.]

又踩了个坑……

今天同事告诉我前天调整一个数据库的参数,重启后,备库总报错:

Errors in file /u01/oracle/app/diag/rdbms/lunar1db3/lunar1db3/trace/lunar1db3_rfs_26241.trc:
ORA-00272: error writing archive log /u01/oradata/arch/1_775047_804791836.arc
ORA-00354: corrupt redo log block header
RFS[3]: Possible network disconnect with primary database
Errors in file /u01/oracle/app/diag/rdbms/lunar1db3/lunar1db3/trace/lunar1db3_rfs_26237.trc:
ORA-00272: error writing archive log /u01/oradata/arch/1_775048_804791836.arc
ORA-00354: corrupt redo log block header
Errors in file /u01/oracle/app/diag/rdbms/lunar1db3/lunar1db3/trace/lunar1db3_rfs_26239.trc:
ORA-00272: error writing archive log /u01/oradata/arch/1_775046_804791836.arc
ORA-00354: corrupt redo log block header
RFS[1]: Possible network disconnect with primary database
RFS[2]: Possible network disconnect with primary database
Tue Apr 14 20:29:08 2015
RFS[4]: Assigned to RFS process 26243
RFS[4]: Identified database type as 'physical standby': Client is ARCH pid 21348
RFS[4]: Opened log for thread 1 sequence 775051 dbid 148164954 branch 804791836
Tue Apr 14 20:29:09 2015
RFS[5]: Assigned to RFS process 26245
RFS[5]: Identified database type as 'physical standby': Client is ARCH pid 21354
Tue Apr 14 20:29:09 2015
RFS[6]: Assigned to RFS process 26247
RFS[6]: Identified database type as 'physical standby': Client is ARCH pid 21356
CORRUPTION DETECTED: In redo blocks starting at block 1count 2048 for thread 1 sequence 775051
RFS[5]: Opened log for thread 1 sequence 775049 dbid 148164954 branch 804791836
RFS[6]: Opened log for thread 1 sequence 775050 dbid 148164954 branch 804791836
CORRUPTION DETECTED: In redo blocks starting at block 1count 2048 for thread 1 sequence 775049
CORRUPTION DETECTED: In redo blocks starting at block 1count 2048 for thread 1 sequence 775050
Errors in file /u01/oracle/app/diag/rdbms/lunar1db3/lunar1db3/trace/lunar1db3_rfs_26243.trc:
ORA-00272: error writing archive log /u01/oradata/arch/1_775051_804791836.arc
ORA-00354: corrupt redo log block header
Errors in file /u01/oracle/app/diag/rdbms/lunar1db3/lunar1db3/trace/lunar1db3_rfs_26245.trc:
ORA-00272: error writing archive log /u01/oradata/arch/1_775049_804791836.arc
ORA-00354: corrupt redo log block header
Errors in file /u01/oracle/app/diag/rdbms/lunar1db3/lunar1db3/trace/lunar1db3_rfs_26247.trc:
ORA-00272: error writing archive log /u01/oradata/arch/1_775050_804791836.arc
ORA-00354: corrupt redo log block header
RFS[4]: Possible network disconnect with primary database
RFS[5]: Possible network disconnect with primary database
RFS[6]: Possible network disconnect with primary database
Tue Apr 14 20:29:12 2015
RFS[7]: Assigned to RFS process 26249
RFS[7]: Identified database type as 'physical standby': Client is ARCH pid 21348
RFS[7]: Opened log for thread 1 sequence 775052 dbid 148164954 branch 804791836
CORRUPTION DETECTED: In redo blocks starting at block 1count 2048 for thread 1 sequence 775052
Errors in file /u01/oracle/app/diag/rdbms/lunar1db3/lunar1db3/trace/lunar1db3_rfs_26249.trc:
ORA-00272: error writing archive log /u01/oradata/arch/1_775052_804791836.arc
ORA-00354: corrupt redo log block header
RFS[7]: Possible network disconnect with primary database

初分析,这个错误有点怪异,有redo 头损坏,有“Possible network disconnect with primary database”
尝试clear online redo log和standby redo log,没用
尝试在主库重建控制文件,然后再直接重启备库,还是上面的错误。
冷静下来,感觉不对劲,检查主库和备库,发现主库归档日志是按照sequence顺序生成
备库则是断断续续的,有的可以从主库传过来,有的传不过来
手工传过来,APPLY还是报错:

SYS@ lunar1db3> recover standby database;
ORA-00279: change 6194107977528 generated at 04/13/2015 12:28:54 needed for thread 1
ORA-00289: suggestion : /u01/oradata/arch/1_775046_804791836.arc
ORA-00280: change 6194107977528 for thread 1 is in sequence #775046


20:49:02 Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
/u01/oradata/arch/1_775046_804791836.arc
ORA-00283: recovery session canceled due to errors
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 2048 change 6194107977662 time 04/13/2015 12:28:54
ORA-00334: archived log: '/u01/oradata/arch/1_775046_804791836.arc'


ORA-01112: media recovery not started


SYS@ lunar1db3> 

感觉是arch异常了,而且貌似所有报错的,都是没有从主库传过来的
而所有没传过来的,都是损坏的,因此手工传过来也没用,因此抽取一个arch进行校验:

SYS@ lunar1db3> alter system dump logfile '/u01/oradata/arch/1_775046_804791836.arc' VALIDATE;
alter system dump logfile '/u01/oradata/arch/1_775046_804791836.arc' VALIDATE
*
ERROR at line 1:
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 2048 change 6194107977662 time 04/13/2015 12:28:54
ORA-00334: archived log: '/u01/oradata/arch/1_775046_804791836.arc'


Elapsed: 00:00:01.08
SYS@ lunar1db3> 

结果显示,确实arch损坏了。
.
根据问题发生的时间,想起来那天调整数据库参数的时候,有一个filesystemio_options=SETALL
也就是调整文件系统AIO方式的,去年在公司还给大家发邮件说起来“ext4上不要使用 filesystemio_options=SETALL”
否则会造成数据库坏块,没想到今年自己被坑了……(哇哇大哭啊……)
具体可以参见Oracle 文档:
ORA-1578 ORA-353 ORA-19599 Corrupt blocks with zeros when filesystemio_options=SETALL on ext4 file system using Linux (Doc ID 1487957.1)
.
大概是说,在下面的版本上,如果使用ext4,那么设置了filesystemio_options=SETALL可能会造成数据库坏块:

- for RHEL5
kernel-2.6.18-238.el5 - RHEL5.6 Errata RHSA-2011-0017 or later
( [fs] ext4: move aio completion after unwritten extent con (Eric Sandeen) [617690] )

- for RHEL6 you need to have
kernel-2.6.32-71 and later 
( [fs] ext4: move aio completion after unwritten extent conversion (Christoph Hellwig) [589985] )

检查了一下我们的系统,果然命中:

[oracle@lunardb1 ~]$ uname -a
Linux lunardb1.800best.com 2.6.32-300.3.1.el6uek.x86_64 #1 SMP Fri Dec 9 18:57:35 EST 2011 x86_64 x86_64 x86_64 GNU/Linux
[oracle@lunardb1 ~]$ cat /etc/issue
Oracle Linux Server release 6.2
Kernel \r on an \m

[oracle@lunardb1 ~]$ mount
/dev/mapper/vg_lunar1adg01-lv_root on / type ext4 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
/dev/sda1 on /boot type ext4 (rw)
/dev/mapper/vg_lunar1adg01-LogVol02 on /u01 type ext4 (rw)
/dev/mapper/vg_lunar1data01-oradata on /u01/oradata type ext4 (rw)
/dev/mapper/vg_lunar1data01-tmpdata on /tmpdata type ext4 (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
[oracle@lunardb1 ~]$ 

后悔呀,上周类似的一个系统同样的一套参数调整,怎么没问题呢?
查看了一下,因为是ext3…………

[oracle@lunardb2 ~]$ uname -a
Linux lunardb2.800best.com 2.6.18-308.el5 #1 SMP Fri Jan 27 17:17:51 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
[oracle@lunardb2 ~]$ cat /etc/issue
Red Hat Enterprise Linux Server release 5.8 (Tikanga)
Kernel \r on an \m

[oracle@lunardb2 ~]$ 
[oracle@lunardb2 ~]$ mount
/dev/sda1 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
/dev/mapper/vg00-lv00 on /tmpdata type ext3 (rw)
/dev/mapper/vg00-lv01 on /hddata type ext3 (rw)
/dev/sda3 on /u01 type ext3 (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
[oracle@lunardb2 ~]$ 

没办法,申请业务,短期内,只能修改数据库参数回到filesystemio_options=none,然后重启主库
至于备库,只能重建了……
.
dd了两个好的归档日志和1个损坏的归档日志,对比了一下,发现有一个关键标示位,确实写错了:


无标题


无标题1


顺便说一下,这个bug在下面的版本已经被fixed了:
Updated kernel to version kernel-uek-2.6.39-200.29.3.el6uek

此条目发表在 backup&recovery, Dataguard 分类目录,贴了 , , 标签。将固定链接加入收藏夹。

发表评论

电子邮件地址不会被公开。 必填项已用 * 标注

您可以使用这些 HTML 标签和属性: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>