修复由于修改主机名造成Standbalone异常(ORA-29701 raised in ASM I/O path)

联系:QQ(5163721)

标题:修复由于修改主机名造成Standbalone异常(ORA-29701 raised in ASM I/O path)

作者:Lunar©版权所有[文章允许转载,但必须以链接方式注明源地址,否则追究法律责任.]

前不久,我同事修改standbalone的主机名,只是单纯的从主机层面修改了,has等并没有做调整,虽然数据库奇迹般的open了(有些时候,数据库在此情况下是不能open的,直接就会报错),但是数据库更改主机名以后,ASM和数据库的alert都有报错,而且分析后,觉得这个报错不解决,会在后期使用中造成数据库crash。
修复这个问题的程很简单,有几点说明:
1,8i和8i以前叫做OPS,Oracle Parallel Server,9.0.1~11.1之间叫做RAC(Real Application Cluster),11.2以后叫做GI(Grid Infrastructure)
2,Standbalone结构就是单实例数据库使用ASM的场景,他只需要db和asm通信的cssd等结构,不需要crs。
3,启动,关闭和管理has流程类似rac,但是千万别混淆,Standbalone的叫做crsctl start/stop has,RAC的叫做crsctl start/stop crs(整套架构中所有进程由ohasd创建),如果弄错了,会出现一些异常情况,在其他案例中,我们遇到过直接ASM报错的,后续在总结了那个案例分享出来。
4,不管是Standbalone还是RAC,修改主机名和IP等都需要特别小心,因为他们不像单机数据库一样,单纯的主机层面就该就可以,他们需要分别在has和crs中进行重配或者修改。
5,无论发生了什么,只要没有动ASM和DB,那么不用担心丢数据。因为has或者crs的结构都跟db是独立的,一般不会丢数据,最差的结果,重装一下,也可以把asm和db拉起来。
6,很多时候,在做类似该ip或者主机名,或者升级crs或者has的版本,或者升级数据库软件时,都建议对现有环境进行备份,要么借助NBU之类的工具,要么使用tar命令。千万别用ftp或者直接copy,这类的否不靠谱,不能用于文件恢复。

具体过程如下:
ASM ALERT:

Tue Mar 03 14:58:34 2015
Warning: VKTM detected a time drift.
Time drifts can result in an unexpected behavior such as time-outs. Please check trace file for more details.
Wed Mar 18 17:01:23 2015 
NOTE: client lunarogg:lunarogg deregistered
Wed Mar 18 17:07:18 2015 
NOTE: client lunarogg:lunarogg registered, osid 16085, mbr 0x1
~

数据库的日志,DB ALERT:

Fri Mar 20 11:29:44 2015
Archived Log entry 650 added for thread 1 sequence 107161 ID 0x9911db3 dest 1:
Fri Mar 20 11:30:15 2015
ERROR: unrecoverable error ORA-29701 raised in ASM I/O path; terminating process 25695
Fri Mar 20 11:32:05 2015
ERROR: unrecoverable error ORA-29701 raised in ASM I/O path; terminating process 25711
Fri Mar 20 11:32:54 2015
Thread 1 cannot allocate new log, sequence 107163
Private strand flush not complete
  Current log# 3 seq# 107162 mem# 0: /u01/oradata/redo/lunarogg/onlinelog/redo03.log
Fri Mar 20 11:32:55 2015
ERROR: unrecoverable error ORA-29701 raised in ASM I/O path; terminating process 25766
Thread 1 advanced to log sequence 107163 (LGWR switch)
  Current log# 4 seq# 107163 mem# 0: /u01/oradata/redo/lunarogg/onlinelog/redo04.log
Fri Mar 20 11:32:59 2015
Archived Log entry 651 added for thread 1 sequence 107162 ID 0x9911db3 dest 1:

CRS和ohasd也有明显报错:

2015-03-03 14:32:22.832: [UiServer][1135483200]{0:0:2} Communication exception sending reply back to client.FatalCommsException : Failed to send response to client.
2015-03-03 14:32:22.833: [UiServer][1135483200]{0:0:2} Communication exception sending reply back to client.FatalCommsException : Failed to send response to client.
2015-03-03 14:32:22.836: [UiServer][1135483200]{0:0:54} Communication exception sending reply back to client.FatalCommsException : Failed to send response to client.
2015-03-03 14:32:22.836: [UiServer][1135483200]{0:0:54} Communication exception sending reply back to client.FatalCommsException : Failed to send response to client.
2015-03-03 14:32:53.108: [UiServer][1135483200]{0:0:78} Communication exception sending reply back to client.FatalCommsException : Failed to send response to client.
2015-03-03 14:32:53.109: [UiServer][1135483200]{0:0:78} Communication exception sending reply back to client.FatalCommsException : Failed to send response to client.
2015-03-03 14:32:53.118: [UiServer][1135483200]{0:0:79} Communication exception sending reply back to client.FatalCommsException : Failed to send response to client.
2015-03-03 14:32:53.119: [UiServer][1135483200]{0:0:79} Communication exception sending reply back to client.FatalCommsException : Failed to send response to client.
2015-03-03 14:33:22.609: [  CRSSEC][1133381952]{0:0:95} Exception: setGroupIdentity failed while checking groupname: asmadmin
2015-03-03 14:33:22.615: [   CRSPE][1133381952]{0:0:95} Exception in setGroupIdentity (ignoring groupIdentity): Authorization Exception:The input identity user id does not belong to the group is invalid
2015-03-03 14:33:53.733: [UiServer][1135483200]{0:0:91} Communication exception sending reply back to client.FatalCommsException : Failed to send response to client.
2015-03-03 14:33:53.738: [UiServer][1135483200]{0:0:94} Communication exception sending reply back to client.FatalCommsException : Failed to send response to client.
2015-03-03 14:33:53.739: [UiServer][1135483200]{0:0:95} Communication exception sending reply back to client.FatalCommsException : Failed to send response to client.
2015-03-03 14:33:53.740: [UiServer][1135483200]{0:0:95} Communication exception sending reply back to client.FatalCommsException : Failed to send response to client.
2015-03-18 17:01:17.748: [  CRSSEC][1133381952]{0:0:8811} Exception: setGroupIdentity failed while checking groupname: asmadmin
2015-03-18 17:01:17.748: [   CRSPE][1133381952]{0:0:8811} Exception in setGroupIdentity (ignoring groupIdentity): Authorization Exception:The input identity user id does not belong to the group is invalid
2015-03-18 17:01:59.271: [  CRSSEC][1133381952]{0:0:8812} Exception: setGroupIdentity failed while checking groupname: asmadmin
2015-03-18 17:01:59.271: [   CRSPE][1133381952]{0:0:8812} Exception in setGroupIdentity (ignoring groupIdentity): Authorization Exception:The input identity user id does not belong to the group is invalid
2015-03-18 17:01:59.277: [UiServer][1135483200]{0:0:8812} Communication exception sending reply back to client.FatalCommsException : Failed to send response to client.
2015-03-18 17:01:59.295: [UiServer][1135483200]{0:0:8812} Communication exception sending reply back to client.FatalCommsException : Failed to send response to client.
2015-03-18 17:01:59.296: [UiServer][1135483200]{0:0:8812} Communication exception sending reply back to client.FatalCommsException : Failed to send response to client.
2015-03-18 17:34:14.510: [UiServer][1135483200]{0:0:8834} Communication exception sending reply back to client.FatalCommsException : Failed to send response to client.
2015-03-18 17:34:14.510: [UiServer][1135483200]{0:0:8834} Communication exception sending reply back to client.FatalCommsException : Failed to send response to client.

我猜测,此时如果做类似增加/删除文件数据库都会crash,果然,我让同事做rman备份,数据库就crash了。
处理的方法很简单,就是修改has的相关配置,讲新主机名配置进去。具体步骤如下:
首先记录当前的系统关键信息,因为修改完has后,需要将asm,db,diskgroup等关键信息重新注册到has中:

[grid@lunardb ~]$ srvctl config asm
ASM home: /u01/app/grid/product/11.2.0/grid
ASM listener: LISTENER
Spfile: +DATA/asm/asmparameterfile/registry.253.847035651
ASM diskgroup discovery string: /dev/mapper/*
[grid@lunardb ~]$ 
[oracle@lunardb ~]$ srvctl config database -d lunarogg
Database unique name: lunarogg
Database name: lunarogg
Oracle home: /u01/app/oracle/product/11.2.0/db_1
Oracle user: oracle
Spfile: +DATA/lunarogg/spfilelunarogg.ora
Domain: 
Start options: open
Stop options: immediate
Database role: PRIMARY
Management policy: AUTOMATIC
Database instance: lunarogg
Disk Groups: DATA,DATA2
Services: 
[oracle@lunardb ~]$ 

这里还发现了一个问题,这个库的ASM的在has中的信息和实际使用的spfile不一致

[grid@lunardb ~]$ asmcmd spbackup +DATA/asm/asmparameterfile/registry.253.847035651 /tmp/ASMspfile.backup
ASMCMD-8303: invalid SPFILE '+DATA/asm/asmparameterfile/registry.253.847035651'
ORA-15056: additional error message
ORA-15173: entry 'asm' does not exist in directory '/'
ORA-06512: at "SYS.X$DBMS_DISKGROUP", line 322
ORA-06512: at line 3 (DBD ERROR: OCIStmtExecute)
[grid@lunardb ~]$ 
检查ASM实例,确实使用的是pfile启动的,这个从alert中也可以得到证实:
[grid@lunardb ~]$ ss

SQL*Plus: Release 11.2.0.4.0 Production on Mon Mar 23 13:56:01 2015

Copyright (c) 1982, 2013, Oracle.  All rights reserved.


Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Automatic Storage Management option

SQL> show parameter pfile

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
spfile                               string
SQL> 


在所有的磁盘组中,我们发现实际上是找不到在has中注册的ASM实例使用的spfile的:

[grid@lunardb ~]$ asmcmd find --type ASMPARAMETERFILE +DATA2 "*"
[grid@lunardb ~]$ asmcmd find --type ASMPARAMETERFILE +DATA "*"

使用kfod检查会发现,大量磁盘的报错:

[grid@lunardb ~]$ kfod asm_diskstring='/dev/mapper/*' disks=all
--------------------------------------------------------------------------------
 Disk          Size Path                                     User     Group   
================================================================================
   1:    1429504 Mb /dev/mapper/D_R5_01_LUN01                grid     oinstall
   2:    1429504 Mb /dev/mapper/D_R5_01_LUN02                grid     oinstall
   3:    1425408 Mb /dev/mapper/D_R5_02_LUN01                grid     oinstall
   4:    1425408 Mb /dev/mapper/D_R5_02_LUN02                grid     oinstall
   5:    1425408 Mb /dev/mapper/D_R5_03_LUN01                grid     oinstall
   6:    1425408 Mb /dev/mapper/D_R5_03_LUN02                grid     oinstall
   7:    1425408 Mb /dev/mapper/D_R5_04_LUN01                grid     oinstall
   8:    1425408 Mb /dev/mapper/D_R5_04_LUN02                grid     oinstall
KFOD-00301: Unable to contact Cluster Synchronization Services (CSS). Return code 2 from kgxgncin.
KFOD-00311: Error scanning device /dev/mapper/control
ORA-15025: could not open disk "/dev/mapper/control"
Linux-x86_64 Error: 13: Permission denied
Additional information: 42
KFOD-00311: Error scanning device /dev/mapper/VolGroup00-LogVol00
ORA-15025: could not open disk "/dev/mapper/VolGroup00-LogVol00"
Linux-x86_64 Error: 13: Permission denied
Additional information: 42
KFOD-00311: Error scanning device /dev/mapper/VolGroup00-LogVol03
ORA-15025: could not open disk "/dev/mapper/VolGroup00-LogVol03"
Linux-x86_64 Error: 13: Permission denied
Additional information: 42
KFOD-00311: Error scanning device /dev/mapper/VolGroup00-LogVol04
ORA-15025: could not open disk "/dev/mapper/VolGroup00-LogVol04"
Linux-x86_64 Error: 13: Permission denied
Additional information: 42
KFOD-00311: Error scanning device /dev/mapper/VolGroup00-LogVol02
ORA-15025: could not open disk "/dev/mapper/VolGroup00-LogVol02"
Linux-x86_64 Error: 13: Permission denied
Additional information: 42
KFOD-00311: Error scanning device /dev/mapper/VolGroup00-LogVol01
ORA-15025: could not open disk "/dev/mapper/VolGroup00-LogVol01"
Linux-x86_64 Error: 13: Permission denied
Additional information: 42
KFOD-00311: Error scanning device /dev/mapper/Vol_oradata01-lv_oradata1
ORA-15025: could not open disk "/dev/mapper/Vol_oradata01-lv_oradata1"
Linux-x86_64 Error: 13: Permission denied
Additional information: 42
KFOD-00311: Error scanning device /dev/mapper/lunarogg01lun1
ORA-15025: could not open disk "/dev/mapper/lunarogg01lun1"
Linux-x86_64 Error: 13: Permission denied
Additional information: 42
Additional information: 22621383
Additional information: 1598903119
[grid@lunardb ~]$ 

接下来,我就开始重建has了,首先是使用force都无法停止has,因为已经异常了,这个在意料之中的,没关系:

[root@lunardb ~]# crsctl stop has -f
CLSU-00100: Operating System function: opendir failed with error data: 2
CLSU-00101: Operating System error message: No such file or directory
CLSU-00103: error location: scrsearch1
CLSU-00104: additional error information: cant open scr home dir scls_scr_getval
CRS-4544: Unable to connect to OHAS
CRS-4000: Command Stop failed, or completed with errors.
[root@lunardb ~]# 
[root@lunardb ~]# crsctl stop crs
CRS-4013: This command is not supported in a single-node configuration.
CRS-4000: Command Stop failed, or completed with errors.
[root@lunardb ~]# ps -ef|grep d.bin
grid      9107     1  0 Mar03 ?        00:01:43 /u01/app/grid/product/11.2.0/grid/bin/evmd.bin
grid      9140  9107  0 Mar03 ?        00:00:00 /u01/app/grid/product/11.2.0/grid/bin/evmlogger.bin -o /u01/app/grid/product/11.2.0/grid/evm/log/evmlogger.info -l /u01/app/grid/product/11.2.0/grid/evm/log/evmlogger.log
grid      9149     1  0 Mar03 ?        00:15:09 /u01/app/grid/product/11.2.0/grid/bin/cssdagent
grid      9172     1  0 Mar03 ?        00:01:50 /u01/app/grid/product/11.2.0/grid/bin/ocssd.bin 
grid     16751     1  0 Mar18 ?        00:00:17 /u01/app/grid/product/11.2.0/grid/bin/tnslsnr LISTENER -inherit
root     31046 30878  0 14:17 pts/8    00:00:00 grep d.bin
[root@lunardb ~]# 

这些进程已经停不掉了,于是只能重启主机(kill ocssd.bin 一样会导致主机重启,因此直接手工reboot了)。
起来后,先删除现有的has配置:

[root@lunardb ~]# $ORACLE_HOME/crs/install/roothas.pl -deconfig -force
Using configuration parameter file: /u01/app/grid/product/11.2.0/grid/crs/install/crsconfig_params
CRS-4639: Could not contact Oracle High Availability Services
CRS-4000: Command Stop failed, or completed with errors.
CRS-4639: Could not contact Oracle High Availability Services
CRS-4000: Command Delete failed, or completed with errors.
CLSU-00100: Operating System function: opendir failed with error data: 2
CLSU-00101: Operating System error message: No such file or directory
CLSU-00103: error location: scrsearch1
CLSU-00104: additional error information: cant open scr home dir scls_scr_getval
CRS-4544: Unable to connect to OHAS
CRS-4000: Command Stop failed, or completed with errors.
Successfully deconfigured Oracle Restart stack
[root@lunardb ~]# 

然后执行重新配置:

[root@lunardb ~]# $ORACLE_HOME/crs/install/roothas.pl
Using configuration parameter file: /u01/app/grid/product/11.2.0/grid/crs/install/crsconfig_params
LOCAL ADD MODE 
Creating OCR keys for user 'grid', privgrp 'oinstall'..
Operation successful.
LOCAL ONLY MODE 
Successfully accumulated necessary OCR keys.
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
CRS-4664: Node lunardb successfully pinned.
Adding Clusterware entries to inittab

lunardb     2015/03/23 14:29:29     /u01/app/grid/product/11.2.0/grid/cdata/lunardb/backup_20150323_142929.olr
Successfully configured Oracle Grid Infrastructure for a Standalone Server
You have new mail in /var/spool/mail/root
[root@lunardb ~]# 

这过程很快,比GI快多了,HAS还是结构简单啊,O(∩_∩)O哈哈~。
配置完成后,检查一下,服务都正常,只是需要将下面的服务改为自动启动:

[grid@lunardb ~]$ crsctl stat res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS       
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.ons
               OFFLINE OFFLINE      lunardb                                    
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.cssd
      1        OFFLINE OFFLINE                                                   
ora.diskmon
      1        OFFLINE OFFLINE                                                   
ora.evmd
      1        ONLINE  ONLINE       lunardb                                    
[grid@lunardb ~]$ 
[grid@lunardb ~]$ crsctl modify resource "ora.cssd" -attr "AUTO_START=1"
[grid@lunardb ~]$ crsctl modify resource "ora.evmd" -attr "AUTO_START=1"
[grid@lunardb ~]$ crsctl modify resource "ora.ons" -attr "AUTO_START=1"

使用NETCA重建监听:

[grid@lunardb ~]$ crsctl stat res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS       
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.LISTENER.lsnr
               ONLINE  ONLINE       lunardb                                    
ora.ons
               OFFLINE OFFLINE      lunardb                                    
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.cssd
      1        OFFLINE OFFLINE                                                   
ora.diskmon
      1        OFFLINE OFFLINE                                                   
ora.evmd
      1        ONLINE  ONLINE       lunardb                                    
[grid@lunardb ~]$ 

添加asm:

[grid@lunardb ~]$ srvctl add asm
[grid@lunardb ~]$ crsctl modify resource "ora.asm" -attr "AUTO_START=1"
[grid@lunardb ~]$ srvctl config asm
ASM home: /u01/app/grid/product/11.2.0/grid
ASM listener: LISTENER
Spfile: 
ASM diskgroup discovery string: ++no-value-at-resource-creation--never-updated-through-ASM++
[grid@lunardb ~]$
修改asm和listener的配置:
[grid@lunardb dbs]$ srvctl modify asm -l LISTENER -p "/u01/app/grid/product/11.2.0/grid/dbs/init+ASM.ora" -d "/dev/mapper/*"
[grid@lunardb dbs]$ crsctl modify resource "ora.asm" -attr "AUTO_START=1"
[grid@lunardb dbs]$ srvctl config asm
ASM home: /u01/app/grid/product/11.2.0/grid
ASM listener: LISTENER
Spfile: /u01/app/grid/product/11.2.0/grid/dbs/init+ASM.ora
ASM diskgroup discovery string: /dev/mapper/*
[grid@lunardb dbs]$ 

在has中添加db:

[oracle@lunardb ~]$ srvctl add database -d lunarogg -n lunardb -o /u01/app/oracle/product/11.2.0/db_1 -p +DATA/lunarogg/spfilelunarogg.ora
[oracle@lunardb ~]$ srvctl modify database -d lunarogg -a "DATA,DATA2"
[oracle@lunardb ~]$

现在再次检查has的配置,都已经好了:

[grid@lunardb dbs]$ crsctl status res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS       
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
               ONLINE  ONLINE       lunardb                                    
ora.DATA2.dg
               ONLINE  ONLINE       lunardb                                    
ora.LISTENER.lsnr
               ONLINE  ONLINE       lunardb                                    
ora.asm
               ONLINE  ONLINE       lunardb                Started             
ora.ons
               OFFLINE OFFLINE      lunardb                                    
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.cssd
      1        ONLINE  ONLINE       lunardb                                    
ora.diskmon
      1        OFFLINE OFFLINE                                                   
ora.evmd
      1        ONLINE  ONLINE       lunardb                                    
ora.lunarogg.db
      1        OFFLINE OFFLINE                                                   
[grid@lunardb dbs]$ 

在使用kfod检查asmdisk已经不抱错了:

[grid@lunardb dbs]$ kfod asm_diskstring='/dev/mapper/*' disks=all
--------------------------------------------------------------------------------
 Disk          Size Path                                     User     Group   
================================================================================
   1:    1429504 Mb /dev/mapper/D_R5_01_LUN01                grid     oinstall
   2:    1429504 Mb /dev/mapper/D_R5_01_LUN02                grid     oinstall
   3:    1425408 Mb /dev/mapper/D_R5_02_LUN01                grid     oinstall
   4:    1425408 Mb /dev/mapper/D_R5_02_LUN02                grid     oinstall
   5:    1425408 Mb /dev/mapper/D_R5_03_LUN01                grid     oinstall
   6:    1425408 Mb /dev/mapper/D_R5_03_LUN02                grid     oinstall
   7:    1425408 Mb /dev/mapper/D_R5_04_LUN01                grid     oinstall
   8:    1425408 Mb /dev/mapper/D_R5_04_LUN02                grid     oinstall
--------------------------------------------------------------------------------
ORACLE_SID ORACLE_HOME                                                          
================================================================================
      +ASM /u01/app/grid/product/11.2.0/grid                                    
[grid@lunardb dbs]$ 

使用has重新启动,把所有资源和db都带起来:

[grid@lunardb dbs]$ crsctl status res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS       
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
               ONLINE  ONLINE       lunardb                                    
ora.DATA2.dg
               ONLINE  ONLINE       lunardb                                    
ora.LISTENER.lsnr
               ONLINE  ONLINE       lunardb                                    
ora.asm
               ONLINE  ONLINE       lunardb                Started             
ora.ons
               OFFLINE OFFLINE      lunardb                                    
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.cssd
      1        ONLINE  ONLINE       lunardb                                    
ora.diskmon
      1        OFFLINE OFFLINE                                                   
ora.evmd
      1        ONLINE  ONLINE       lunardb                                    
ora.lunarogg.db
      1        ONLINE  ONLINE       lunardb                Open                
[grid@lunardb dbs]$ 

好了,都ok了。

此条目发表在 RAC 分类目录,贴了 标签。将固定链接加入收藏夹。

发表评论

电子邮件地址不会被公开。 必填项已用 * 标注

您可以使用这些 HTML 标签和属性: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>