Explained with a diagram: the ASM instance and the ASMB process

Contact: QQ (5163721)

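Author: Lunar © All rights reserved [This article may be reposted, but the source address must be cited as a link; otherwise legal action will be pursued.]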

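First, a look at the overall deployment of an ASM instance:
[Figure: overall ASM instance deployment (image not preserved)]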

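As we all know, the ASM instance manages the metadata, and an ordinary database instance queries that metadata in order to access the corresponding ASM files.
Both the ASM instance and the database instances have access to a common set of disks, called disk groups.
The database instance then accesses the contents of ASM files directly, and communicates with the ASM instance only to obtain information about how those files are laid out.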

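Group Services is used to register the connection information that database instances need in order to find an ASM instance.
When an ASM instance mounts a disk group, it registers the disk group information and the connect string with Group Services.
A database instance that knows the disk group name can therefore find out which ASM instance it should connect to.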
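The lookup described above can be pictured as a small name registry. The following is only a toy model for illustration: the class, its method names, the disk group name DATA and the placeholder connect string are all hypothetical, and none of this is Oracle's actual Group Services interface.

```python
class GroupServicesRegistry:
    """Toy stand-in for the registration and lookup idea; not an Oracle API."""

    def __init__(self):
        # disk group name -> (ASM instance name, connect string)
        self._by_diskgroup = {}

    def register_mount(self, diskgroup, asm_instance, connect_string):
        # Called, conceptually, when an ASM instance mounts a disk group.
        self._by_diskgroup[diskgroup.upper()] = (asm_instance, connect_string)

    def unregister(self, diskgroup):
        # Called, conceptually, when the disk group is dismounted.
        self._by_diskgroup.pop(diskgroup.upper(), None)

    def lookup(self, diskgroup):
        # A database instance only needs the disk group name to find ASM.
        return self._by_diskgroup.get(diskgroup.upper())


registry = GroupServicesRegistry()
registry.register_mount("DATA", "+ASM1", "<connect string>")  # hypothetical values
print(registry.lookup("data"))
```

The point of the sketch is only the direction of the lookup: the database side starts from a disk group name and ends up with the ASM instance to contact.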

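What is distinctive about an ASM instance:
1, INSTANCE_TYPE = ASM
2, startup = startup mount (from 11.2 on you can issue a plain startup against an ASM instance, but in effect it is still startup mount). For an ASM instance the mount option does not mount data files; it mounts the disk groups listed in the ASM_DISKGROUPS parameter of the parameter file.
3, connect / as sysdba (10g) and connect / as sysasm (11.2)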

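An ASM instance has many background processes; see the descriptions in the Reference for details. Here I only want to study the ASMB process, which is responsible for the heartbeat mechanism between the database and ASM.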

[grid@dm01db01 oraagent_grid]$ ps -ef|grep ASM1
grid      2714  2711  0 12:21 ?        00:00:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
grid      3467     1  0 09:24 ?        00:00:00 asm_pmon_+ASM1
grid      3471     1  0 09:24 ?        00:00:00 asm_psp0_+ASM1
grid      3475     1  0 09:24 ?        00:00:05 asm_vktm_+ASM1
grid      3481     1  0 09:24 ?        00:00:00 asm_gen0_+ASM1
grid      3485     1  0 09:24 ?        00:00:00 asm_diag_+ASM1
grid      3489     1  0 09:24 ?        00:00:00 asm_ping_+ASM1
grid      3493     1  0 09:24 ?        00:00:00 asm_dskm_+ASM1
grid      3497     1  0 09:24 ?        00:00:03 asm_dia0_+ASM1
grid      3501     1  0 09:24 ?        00:00:01 asm_lmon_+ASM1
grid      3505     1  0 09:24 ?        00:00:00 asm_lmd0_+ASM1
grid      3512     1  0 09:24 ?        00:00:01 asm_lms0_+ASM1
grid      3518     1  0 09:24 ?        00:00:00 asm_lmhb_+ASM1
grid      3522     1  0 09:24 ?        00:00:00 asm_mman_+ASM1
grid      3526     1  0 09:24 ?        00:00:00 asm_dbw0_+ASM1
grid      3530     1  0 09:24 ?        00:00:00 asm_lgwr_+ASM1
grid      3534     1  0 09:24 ?        00:00:00 asm_ckpt_+ASM1
grid      3538     1  0 09:24 ?        00:00:00 asm_smon_+ASM1
grid      3542     1  0 09:24 ?        00:00:00 asm_rbal_+ASM1
grid      3546     1  0 09:24 ?        00:00:00 asm_gmon_+ASM1
grid      3550     1  0 09:24 ?        00:00:00 asm_mmon_+ASM1
grid      3554     1  0 09:24 ?        00:00:00 asm_mmnl_+ASM1
grid      3558     1  0 09:24 ?        00:00:00 asm_xdmg_+ASM1
grid      3562     1  0 09:24 ?        00:00:00 asm_lck0_+ASM1
grid      3580     1  0 09:24 ?        00:00:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
grid      3628     1  0 09:24 ?        00:00:00 oracle+ASM1_ocr (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
grid      3637     1  0 09:24 ?        00:00:00 asm_asmb_+ASM1
-------------- the ASM instance's ASMB process
grid      3641     1  0 09:24 ?        00:00:00 oracle+ASM1_asmb_+asm1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))   ----- the ASMB process connects to +ASM1 and syncs storage statistics to CSS
grid      3847     1  0 09:24 ?        00:00:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
----- oraagent process
grid      4296     1  0 09:25 ?        00:00:00 oracle+ASM1_asmb_bbff1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
----- the ASMB process connects to the database instance and syncs storage-related statistics to CSS (for example, when a disk group is added)
grid      6596 30872  0 13:11 pts/4    00:00:00 grep ASM1
grid      8872     1  0 10:25 ?        00:00:00 oracle+ASM1_o000_bbff1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
[grid@dm01db01 oraagent_grid]$ 
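To pick the ASMB-related entries out of a listing like the one above, the process names can be split into their parts. This is a minimal sketch; the two patterns are assumptions inferred only from the names visible in this listing (background processes as asm_<name>_<SID> or ora_<name>_<SID>, connection processes as oracle<SID> optionally followed by _<tag>_<client>), and an SID containing an underscore would not be parsed correctly.

```python
import re

# Background processes, e.g. asm_asmb_+ASM1 or ora_asmb_bbff1 (assumed pattern).
BACKGROUND = re.compile(r"^(asm|ora)_([a-z0-9]+)_(\S+)$")
# Connection processes, e.g. oracle+ASM1_asmb_bbff1 or oraclebbff1 (assumed pattern).
FOREGROUND = re.compile(r"^oracle(\+?[A-Za-z0-9]+)(?:_([a-z0-9]+)(?:_(\S+))?)?$")


def classify(ps_line):
    """Classify one `ps -ef` line; return None for anything unrecognized."""
    fields = ps_line.split(None, 7)  # UID PID PPID C STIME TTY TIME CMD
    if len(fields) < 8 or not fields[1].isdigit():
        return None
    pid = int(fields[1])
    name = fields[7].split()[0]  # first word of the command column
    m = BACKGROUND.match(name)
    if m:
        return {"pid": pid, "kind": "background",
                "process": m.group(2), "instance": m.group(3)}
    m = FOREGROUND.match(name)
    if m:
        return {"pid": pid, "kind": "foreground", "instance": m.group(1),
                "tag": m.group(2), "client": m.group(3)}
    return None


def asmb_related(ps_lines):
    """Keep only entries whose process name or tag is asmb."""
    found = []
    for line in ps_lines:
        info = classify(line)
        if info and "asmb" in (info.get("process"), info.get("tag")):
            found.append(info)
    return found
```

Feeding the listing above to asmb_related() should leave the three entries with OS ids 3637, 3641 and 4296.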

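As we know, the ASMB process is in effect the communication bridge between a database instance and the ASM instance, used for operations that involve physical storage changes, such as creating, deleting, or modifying files in the database. First, let's observe the order in which these processes start, and how they relate to each other, while CRS, ASM and the database come up.
ASM alert log:

.....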
Sun Mar 09 09:24:47 2014
NOTE: [crsd.bin@dm01db01 (TNS V1-V3) 3615] opening OCR file
Starting background process ASMB
Sun Mar 09 09:24:47 2014
ASMB started with pid=27, OS id=3637 
Sun Mar 09 09:24:47 2014
NOTE: client +ASM1:+ASM registered, osid 3641, mbr 0x0
Sun Mar 09 09:26:06 2014
NOTE: client bbff1:bbff registered, osid 4296, mbr 0x1
.....

Database alert log:

.....
Sun Mar 09 09:25:49 2014
SMON started with pid=21, OS id=4272 
Sun Mar 09 09:25:50 2014
RECO started with pid=22, OS id=4276 
Sun Mar 09 09:25:50 2014
RBAL started with pid=23, OS id=4280 
Sun Mar 09 09:25:50 2014
ASMB started with pid=24, OS id=4284 
.....
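The OS ids in the two alert logs are what tie these entries back to the ps listing. Below is a minimal sketch for extracting them; it assumes only the three line shapes visible in the excerpts above (a timestamp line, "<NAME> started with pid=..., OS id=...", and "NOTE: client ... registered, osid ..."), and real alert logs contain many other formats that it simply ignores.

```python
import re

# Timestamp lines such as "Sun Mar 09 09:24:47 2014".
TIMESTAMP = re.compile(r"^[A-Z][a-z]{2} [A-Z][a-z]{2} \d{2} \d{2}:\d{2}:\d{2} \d{4}$")
# Lines such as "ASMB started with pid=27, OS id=3637".
STARTED = re.compile(r"^(\w+) started with pid=\d+, OS id=(\d+)")
# Lines such as "NOTE: client bbff1:bbff registered, osid 4296, mbr 0x1".
REGISTERED = re.compile(r"^NOTE: client (\S+) registered, osid (\d+)")


def parse_alert(lines):
    """Return (timestamp, event, name, os_pid) tuples in file order."""
    events = []
    current_ts = None
    for raw in lines:
        line = raw.strip()
        if TIMESTAMP.match(line):
            current_ts = line  # applies to the entries that follow it
            continue
        m = STARTED.match(line)
        if m:
            events.append((current_ts, "started", m.group(1), int(m.group(2))))
            continue
        m = REGISTERED.match(line)
        if m:
            events.append((current_ts, "registered", m.group(1), int(m.group(2))))
    return events
```

Run against the ASM alert excerpt, this yields the ASMB start (OS id 3637) followed by the two client registrations (osid 3641 and 4296); against the database alert excerpt it yields the SMON, RECO, RBAL and ASMB starts.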

The ASMB processes of both the ASM instance and the database instance register their information with CSS; see ocssd.log:

.....
2014-03-09 09:24:47.069: [    CSSD][1081276736]clssgmDestroyProc: cleaning up proc(0x1f7cba50) con(0x2518) skgpid 3628 ospid 3628 with 0 clients, refcount 0  ------- 3628 is the OCR process: oracle+ASM1_ocr (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq))
2014-03-09 09:24:47.069: [    CSSD][1081276736]clssgmDiscEndpcl: gipcDestroy 0x2518
2014-03-09 09:24:47.089: [    CSSD][1081276736]clssscSelect: cookie accept request 0x1ef1ef60
2014-03-09 09:24:47.089: [    CSSD][1081276736]clssgmAllocProc: (0x1f7cba50) allocated
2014-03-09 09:24:47.089: [    CSSD][1081276736]clssgmClientConnectMsg: properties of cmProc 0x1f7cba50 - 1,2,3,4,5
2014-03-09 09:24:47.089: [    CSSD][1081276736]clssgmClientConnectMsg: Connect from con(0x2579) proc(0x1f7cba50) pid(3628/3628) version 11:2:1:4, properties: 1,2,3,4,5
2014-03-09 09:24:47.089: [    CSSD][1081276736]clssgmClientConnectMsg: msg flags 0x0000
2014-03-09 09:24:47.487: [    CSSD][1081276736]clssscSelect: cookie accept request 0x1ef1ef60
2014-03-09 09:24:47.487: [    CSSD][1081276736]clssgmAllocProc: (0x1f7ddbd0) allocated
2014-03-09 09:24:47.487: [    CSSD][1081276736]clssgmClientConnectMsg: properties of cmProc 0x1f7ddbd0 - 1,2,3,4,5
2014-03-09 09:24:47.487: [    CSSD][1081276736]clssgmClientConnectMsg: Connect from con(0x25f5) proc(0x1f7ddbd0) pid(3641/3641) version 11:2:1:4, properties: 1,2,3,4,5 --- 3641 is the ASMB connection process, oracle+ASM1_asmb_+asm1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
...........
2014-03-09 09:25:50.663: [    CSSD][1081276736]clssgmAllocProc: (0x1f8b6290) allocated
2014-03-09 09:25:50.663: [    CSSD][1081276736]clssgmClientConnectMsg: properties of cmProc 0x1f8b6290 - 1,2,3,4,5
2014-03-09 09:25:50.663: [    CSSD][1081276736]clssgmClientConnectMsg: Connect from con(0x35fc) proc(0x1f8b6290) pid(4284/4284) version 11:2:1:4, properties: 1,2,3,4,5 ---- 4284 is the database's ASMB process
2014-03-09 09:25:50.663: [    CSSD][1081276736]clssgmClientConnectMsg: msg flags 0x0000
2014-03-09 09:25:50.921: [    CSSD][1081276736]clssgmDeadProc: proc 0x1f8b6290
2014-03-09 09:25:50.921: [    CSSD][1081276736]clssgmDestroyProc: cleaning up proc(0x1f8b6290) con(0x35fc) skgpid 4284 ospid 4284 with 0 clients, refcount 0
2014-03-09 09:25:50.921: [    CSSD][1081276736]clssgmDiscEndpcl: gipcDestroy 0x35fc
2014-03-09 09:25:51.195: [    CSSD][1081276736]clssscSelect: cookie accept request 0x1ef1ef60
2014-03-09 09:25:51.195: [    CSSD][1081276736]clssgmAllocProc: (0x1f8b6290) allocated
2014-03-09 09:25:51.196: [    CSSD][1081276736]clssgmClientConnectMsg: properties of cmProc 0x1f8b6290 - 1,2,3,4,5
2014-03-09 09:25:51.196: [    CSSD][1081276736]clssgmClientConnectMsg: Connect from con(0x3663) proc(0x1f8b6290) pid(4284/4284) version 11:2:1:4, properties: 1,2,3,4,5
2014-03-09 09:25:51.196: [    CSSD][1081276736]clssgmClientConnectMsg: msg flags 0x0000
2014-03-09 09:25:51.216: [    CSSD][1081276736]clssscSelect: cookie accept request 0x1ef1ef60
2014-03-09 09:25:51.216: [    CSSD][1081276736]clssgmAllocProc: (0x1f8cdb50) allocated
2014-03-09 09:25:51.218: [    CSSD][1081276736]clssgmClientConnectMsg: properties of cmProc 0x1f8cdb50 - 1,2,3,4,5
2014-03-09 09:25:51.218: [    CSSD][1081276736]clssgmClientConnectMsg: Connect from con(0x36dd) proc(0x1f8cdb50) pid(4231/4231) version 11:2:1:4, properties: 1,2,3,4,5 ---- 4231 is the database's LMON process
.......
2014-03-09 09:26:06.534: [    CSSD][1109334336]clssnmSendingThread: sending status msg to all nodes
2014-03-09 09:26:06.534: [    CSSD][1109334336]clssnmSendingThread: sent 4 status msgs to all nodes
2014-03-09 09:26:06.885: [    CSSD][1081276736]clssscSelect: cookie accept request 0x1ef1ef60
2014-03-09 09:26:06.885: [    CSSD][1081276736]clssgmAllocProc: (0x1f91e650) allocated
2014-03-09 09:26:06.886: [    CSSD][1081276736]clssgmClientConnectMsg: properties of cmProc 0x1f91e650 - 1,2,3,4,5
2014-03-09 09:26:06.886: [    CSSD][1081276736]clssgmClientConnectMsg: Connect from con(0x397e) proc(0x1f91e650) pid(4296/4296) version 11:2:1:4, properties: 1,2,3,4,5----oracle+ASM1_asmb_bbff1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq
2014-03-09 09:26:06.886: [    CSSD][1081276736]clssgmClientConnectMsg: msg flags 0x0000
2014-03-09 09:26:06.912: [    CSSD][1081276736]clssscSelect: cookie accept request 0x1f91e650
2014-03-09 09:26:06.912: [    CSSD][1081276736]clssscevtypSHRCON: getting client with cmproc 0x1f91e650
2014-03-09 09:26:06.912: [    CSSD][1081276736]clssgmRegisterClient: proc(33/0x1f91e650), client(1/0x1f7baaf0)
2014-03-09 09:26:06.912: [    CSSD][1081276736]clssgmJoinGrock: local grock UFG_+ASM1 new client 0x1f7baaf0 with con 0x39b6, requested num 1, flags 0x10100
2014-03-09 09:26:06.912: [    CSSD][1081276736]clssgmAddGrockMember: adding member to grock UFG_+ASM1
2014-03-09 09:26:06.912: [    CSSD][1081276736]clssgmAddMember: Adding fencing for member 1, group UFG_+ASM1, death 1, SAGE 0
2014-03-09 09:26:06.912: [    CSSD][1081276736]clssgmAddMember: member (1/0x1f332350) added. pbsz(108) prsz(108) flags 0x0 to grock (0x1f805240/UFG_+ASM1)
2014-03-09 09:26:06.912: [    CSSD][1081276736]clssgmCommonAddMember: local group grock UFG_+ASM1 member(1/Local) node(1) flags 0x0 0x30 
.....
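To see which OS processes connected to CSS, and when, the "Connect from" lines of ocssd.log can be collected per pid. This is a sketch that assumes the line layout shown in the excerpt above (11.2-style ocssd.log); other Clusterware versions may format these messages differently.

```python
import re

# Matches e.g.
# 2014-03-09 09:25:50.663: [    CSSD][1081276736]clssgmClientConnectMsg:
#   Connect from con(0x35fc) proc(0x1f8b6290) pid(4284/4284) version ...
CONNECT = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}): "
    r"\[\s*CSSD\]\[\d+\]clssgmClientConnectMsg: "
    r"Connect from con\(0x[0-9a-f]+\) proc\(0x[0-9a-f]+\) pid\((\d+)/\d+\)"
)


def css_connections(lines):
    """Map OS pid -> list of timestamps at which it connected to CSS."""
    by_pid = {}
    for line in lines:
        m = CONNECT.match(line)
        if m:
            by_pid.setdefault(int(m.group(2)), []).append(m.group(1))
    return by_pid
```

On the excerpt above, pid 4284 (the database's ASMB process) shows up twice, at 09:25:50.663 and 09:25:51.196, with a clssgmDeadProc / clssgmDestroyProc pair for the same proc in between.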

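So, the ASMB activity during startup is as follows:
1, The ASM instance's ASMB process starts (spid: 3637, asm_asmb_+ASM1)
2, The ASM instance's ASMB process starts a process that connects to the ASM instance (spid: 3641, oracle+ASM1_asmb_+asm1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq))))
3, The ASM instance's ASMB process registers the information of that connection process (oracle+ASM1_asmb_+asm1) with CSS
4, When the database starts, the database's ASMB process is started (spid: 4284, ora_asmb_bbff1)
5, The database's ASMB process registers itself with CSS
6, The ASM instance's ASMB process starts a process that connects to the database instance (20140309-09:26:05, oracle+ASM1_asmb_bbff1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq))))
7, The ASM instance's ASMB process registers the information of this database-facing connection process (oracle+ASM1_asmb_bbff1) with CSS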

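Of course, if the ASMB connection to the database runs into a problem, a new connection is normally created very quickly and registered with CSS; this can be seen in the CSS log.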

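My current test environment is an Exadata 11.2.3.2.1 VM. Tracing shows that when a database process performs any storage-related operation, such as adding or dropping a tablespace, the work is actually carried out through pipes (normally two pipes per process involved, one for reading and one for writing). I don't know whether the same conclusion holds in other ASM environments; I'll find an ordinary ASM environment and test it later :)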

Next we drop a tablespace and trace the operation to see what ASMB does:

[root@dm01db01 ~]# ps -ef|grep LOCAL=YES
grid      3580     1  0 09:24 ?        00:00:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
grid      3628     1  0 09:24 ?        00:00:00 oracle+ASM1_ocr (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
grid      3641     1  0 09:24 ?        00:00:00 oracle+ASM1_asmb_+asm1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
grid      3847     1  0 09:24 ?        00:00:00 oracle+ASM1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
grid     11438     1  0 14:20 ?        00:00:00 oracle+ASM1_asmb_bbff1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))  ---- the ASM instance's ASMB connection to the database
oracle   11465     1  0 14:20 ?        00:00:01 oraclebbff1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))       ------ oraagent process
oracle   11650     1  0 14:22 ?        00:00:00 oraclebbff1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))       ------ oraagent process
oracle   11666     1  0 14:22 ?        00:00:00 oraclebbff1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))       ------ oraagent process
oracle   13959 13956  0 14:54 ?        00:00:00 oraclebbff1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))       ------ my process
root     14019 13831  0 14:55 pts/1    00:00:00 grep LOCAL=YES
[root@dm01db01 ~]# 
[root@dm01db01 ~]# ps -ef|grep ocss
grid      2881     1  0 09:22 ?        00:00:25 /u01/app/11.2.0.3/grid/bin/ocssd.bin 
root     14465 13831  0 15:01 pts/1    00:00:00 grep ocss
[root@dm01db01 ~]#
[root@dm01db01 ~]# ps -ef|grep asmb
grid      3637     1  0 09:24 ?        00:00:00 asm_asmb_+ASM1
grid      3641     1  0 09:24 ?        00:00:00 oracle+ASM1_asmb_+asm1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle   11433     1  0 14:20 ?        00:00:00 ora_asmb_bbff1
grid     11438     1  0 14:20 ?        00:00:00 oracle+ASM1_asmb_bbff1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
root     12228 30240  0 14:29 pts/4    00:00:00 grep asmb
[root@dm01db01 ~]#

As you can see, spid 13959 is my current process (its parent process is 13956). Before dropping the tablespace, start tracing with strace:

strace -fr -o /tmp/11438.log -p 11438
strace -fr -o /tmp/13956.log -p 13956
strace -fr -o /tmp/2881.log -p 2881

SYS@bbff1>drop tablespace lunartest  include contents and datafiles;
drop tablespace lunartest  include contents and datafiles
                           *
ERROR at line 1:
ORA-02173: invalid option for DROP TABLESPACE


Elapsed: 00:00:00.08
SYS@bbff1>>drop tablespace lunartest including contents and datafiles;
SP2-0734: unknown command beginning ">drop tabl..." - rest of line ignored.
SYS@bbff1>
SYS@bbff1>drop tablespace lunartest including contents and datafiles;

Tablespace dropped.

Elapsed: 00:00:07.15
SYS@bbff1>

After the tablespace is dropped, stop the traces and examine them:

[root@dm01db01 ~]# strace -fr -o /tmp/11438.log -p 11438
Process 11438 attached - interrupt to quit
Process 11438 detached
[root@dm01db01 ~]# 

[root@dm01db01 ~]# strace -fr -o /tmp/13956.log -p 13956
Process 13956 attached - interrupt to quit
Process 13956 detached
[root@dm01db01 ~]# 

[root@dm01db01 ~]# strace -fr -o /tmp/2881.log -p 2881
Process 2881 attached with 20 threads - interrupt to quit
Process 2881 detached
Process 2885 detached
Process 2888 detached
Process 2889 detached
Process 2890 detached
Process 2891 detached
Process 2902 detached
Process 2903 detached
Process 2924 detached
Process 2925 detached
Process 2926 detached
Process 2927 detached
Process 2930 detached
Process 2934 detached
Process 2940 detached
Process 2941 detached
Process 2942 detached
Process 2944 detached
Process 2948 detached
Process 2949 detached
[root@dm01db01 ~]# 

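Note that the traced process 13956 is the parent of server process 13959 in the ps output above and reads the command from its own standard input, so it looks like the SQL*Plus client side of the session rather than the server process itself. In its trace, after the command "drop tablespace lunartest includ..." is read, the request is written to file descriptor 10 under /proc/13956/fd and the reply is read from file descriptor 11: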

13956      0.000203 read(0, "drop tablespace lunartest includ"..., 1024) = 60
13956      8.392710 gettimeofday({1394348248, 583001}, NULL) = 0
13956      0.000332 write(10, "\1S\0\0\6\0\0\0\0\0\21i\t\376\377\377\377\377\377\377\377\1\0\0\0\0\0\0\0\1\0\0"..., 339) = 339
13956      0.002337 read(11, "\0\313\0\0\6\0\0\0\0\0\10\6\0(\37\6\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0"..., 8208) = 203
13956      7.140464 write(1, "\n", 1)   = 1
13956      0.000291 lseek(3, 3072, SEEK_SET) = 3072
13956      0.000089 read(3, "\22\0A\0\0\0t\0B\0\0\0\212\0C\0\0\0\240\0D\0\0\0\261\0E\0\0\0\302\0"..., 512) = 512
13956      0.000962 write(1, "Tablespace dropped.", 19) = 19
13956      0.000855 write(1, "\n", 1)   = 1
13956      0.000333 write(1, "\n", 1)   = 1
13956      0.005740 gettimeofday({1394348255, 734445}, NULL) = 0
13956      0.000224 write(1, "Elapsed: 00:00:07.15\n", 21) = 21
13956      0.000403 write(1, "SYS@bbff1>", 10) = 10
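To summarize which descriptors carried the traffic in a trace like this, the read and write calls can be totalled per descriptor. A minimal sketch, assuming the `strace -fr` line layout shown above (pid, relative timestamp, then the call); only successful read and write calls are counted and everything else is skipped.

```python
import re

# Matches e.g.: 13956      0.000332 write(10, "\1S\0\0..."..., 339) = 339
IO_CALL = re.compile(
    r"^(\d+)\s+(\d+\.\d+)\s+(read|write)\((\d+),.*\)\s+=\s+(-?\d+)"
)


def bytes_by_fd(lines):
    """Return {(syscall, fd): total bytes transferred} for read/write calls."""
    totals = {}
    for line in lines:
        m = IO_CALL.match(line)
        if not m:
            continue  # lseek, gettimeofday, etc.
        nbytes = int(m.group(5))
        if nbytes < 0:
            continue  # failed call, nothing transferred
        key = (m.group(3), int(m.group(4)))
        totals[key] = totals.get(key, 0) + nbytes
    return totals
```

For the excerpt above this gives 60 bytes read from descriptor 0, 339 bytes written to descriptor 10, 203 bytes read from descriptor 11, and the remaining writes going to descriptor 1 (the terminal).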

Now look at the process's fd (file descriptor) information. Descriptors 10 and 11 are two separate pipes:

[root@dm01db01 fd]# pwd
/proc/13956/fd
[root@dm01db01 fd]# ls -lrt
total 0
lrwx------ 1 oracle oinstall 64 Mar  9 15:01 2 -> /dev/pts/2
lrwx------ 1 oracle oinstall 64 Mar  9 15:02 0 -> /dev/pts/2
lr-x------ 1 oracle oinstall 64 Mar  9 15:02 8 -> /u01/app/oracle/product/11.2.0.3/dbhome_1/rdbms/mesg/ocius.msb
lr-x------ 1 oracle oinstall 64 Mar  9 15:02 7 -> /proc/13956/fd
lr-x------ 1 oracle oinstall 64 Mar  9 15:02 6 -> /u01/app/oracle/product/11.2.0.3/dbhome_1/rdbms/mesg/diaus.msb
lr-x------ 1 oracle oinstall 64 Mar  9 15:02 5 -> /u01/app/oracle/product/11.2.0.3/dbhome_1/sqlplus/mesg/cpyus.msb
lr-x------ 1 oracle oinstall 64 Mar  9 15:02 4 -> /u01/app/oracle/product/11.2.0.3/dbhome_1/sqlplus/mesg/sp2us.msb
lr-x------ 1 oracle oinstall 64 Mar  9 15:02 3 -> /u01/app/oracle/product/11.2.0.3/dbhome_1/sqlplus/mesg/sp1us.msb
lr-x------ 1 oracle oinstall 64 Mar  9 15:02 11 -> pipe:[8035210]
l-wx------ 1 oracle oinstall 64 Mar  9 15:02 10 -> pipe:[8035209]
lrwx------ 1 oracle oinstall 64 Mar  9 15:02 1 -> /dev/pts/2
[root@dm01db01 fd]# 
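The same check can be scripted instead of reading the `ls -lrt` output by eye. This is a sketch for Linux only, relying on the /proc/<pid>/fd symlinks shown above; the function names are my own, and reading another user's process normally requires root, as in the session above.

```python
import os
import re

PIPE_TARGET = re.compile(r"^pipe:\[(\d+)\]$")


def pipe_inode(target):
    """Return the inode number from a 'pipe:[N]' link target, else None."""
    m = PIPE_TARGET.match(target)
    return int(m.group(1)) if m else None


def pipe_fds(pid):
    """Return {fd number: pipe inode} for every pipe the process has open."""
    fd_dir = "/proc/%d/fd" % pid
    result = {}
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed while we were scanning
        inode = pipe_inode(target)
        if inode is not None:
            result[int(name)] = inode
    return result
```

Both ends of a pipe report the same inode number, so running pipe_fds() against other candidate processes and looking for 8035209 or 8035210 is one way to confirm which process holds the other end of descriptors 10 and 11.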

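In other words, apart from the feedback written to the terminal, the traced process (13956) writes the drop-tablespace request to one pipe (10) and reads the reply from another pipe (11).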
