How to interpret WriteThrough/WriteBack mode changes and controller battery charge/discharge messages on Exadata cell nodes

Contact: QQ (5163721)

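Author: Lunar © All rights reserved [Reposting is permitted, but the source address must be credited with a link; otherwise legal action may be pursued.]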

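Exadata uses LSI disk controllers. If a message like the one below shows up on a cell during a routine health check (HC), you need to work out whether the battery needs replacing or whether you have hit a bug: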

Hardware Alert 10_1

Event Time 2011-01-17T17:26:30+02:00
Description All Logical drives are in WriteThrough caching mode. Either battery is in a learn cycle or it needs to be replaced. Please contact Oracle Support

Affected Cell 
Name sba2cel12
Chassis Serial Number 103XXXXXXX
Version OSS_MAIN_LINUX.X64_100929

Recommended Action Battery is either in a learn cycle or it needs replacement. Please contact Oracle Support

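This message means that the disk controller's write cache policy has changed from "write-back" to "write-through". The usual cause is that a battery learn cycle is in progress.
The learn cycle runs periodically, 4 times a year. Each run discharges the controller battery and then charges it again.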

Before Image 11.2.1.3, it ran once a month.
Starting with Image 11.2.1.3, it runs once every 3 months: at 2 a.m. on the 17th of January, April, July and October.
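
To make the quarterly schedule concrete, here is a minimal Python sketch that computes the next default learn cycle start after a given moment. The function name `next_learn_cycle` is made up for this illustration, and it assumes naive timestamps in the cell's local time; the value actually configured on a cell may have been changed from the default.

```python
from datetime import datetime

# Default schedule described above: the 17th of Jan/Apr/Jul/Oct at 02:00.
LEARN_MONTHS = (1, 4, 7, 10)


def next_learn_cycle(now: datetime) -> datetime:
    """Return the first default learn-cycle start strictly after `now`.

    Hypothetical helper: it only models the default quarterly schedule,
    not whatever value is actually configured on a given cell.
    """
    for year in (now.year, now.year + 1):
        for month in LEARN_MONTHS:
            candidate = datetime(year, month, 17, 2, 0, 0)
            if candidate > now:
                return candidate
    raise AssertionError("unreachable: next year's January always qualifies")


print(next_learn_cycle(datetime(2013, 10, 18)))  # 2014-01-17 02:00:00
```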

This default time (the time of the next learn cycle) can be changed with a command, for example: cellcli> alter cell bbuLearnCycleTime="2013-01-22T02:00:00-08:00"

Oracle recommends that the battery learn cycle be scheduled for the same time on all cells.

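Write-through is well known to perform worse than write-back. However, if the storage crashes or loses power, write-back risks losing the data still held in the controller cache.
For that reason, the write policy is automatically switched from write-back to write-through during the battery learn cycle, when the battery cannot be relied on to protect the cache.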

If you see a message like the following in the cell's alerts:

Event Time 2011-01-17T17:26:30+02:00
Description All Logical drives are in WriteThrough caching mode. Either battery is in a learn cycle or it needs to be replaced. Please contact Oracle Support

Affected Cell 
Name sba2cel12
Chassis Serial Number 103XXXXXXX
Version OSS_MAIN_LINUX.X64_100929

connect to the cell node and check the battery's charge percentage:

# MegaCli64 -AdpBbuCmd -GetBbuStatus -a0




BBU status for Adapter: 0

BatteryType: iBBU08
Voltage: 3721 mV
Current: 541 mA
Temperature: 43 C

BBU Firmware Status:
Charging Status : Charging    <<<< the battery is currently charging
Voltage : OK
Temperature : OK
Learn Cycle Requested : No
Learn Cycle Active : No
Learn Cycle Status : OK
Learn Cycle Timeout : No
I2c Errors Detected : No
Battery Pack Missing : No
Battery Replacement required : No
Remaining Capacity Low : Yes
Periodic Learn Required : No
Transparent Learn : No

Battery state:

GasGuageStatus:
Fully Discharged : No
Fully Charged : No
Discharging : No
Initialized : No
Remaining Time Alarm : Yes
Remaining Capacity Alarm: No
Discharge Terminated : No
Over Temperature : No
Charging Terminated : No
Over Charged : No

Relative State of Charge: 7 %
Charger System State: 1
Charger System Ctrl: 0
Charging current: 541 mA
Absolute state of charge: 0 %
Max Error: 0 %

Exit Code: 0x00
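
When this check has to be repeated on many cells, the output can be reduced to a few fields. The following is a minimal Python sketch, assuming the status text has already been captured as a string; `parse_bbu_status` and `summarize` are names invented for this example, and the sample holds only a few of the lines shown above.

```python
def parse_bbu_status(text: str) -> dict:
    """Collect 'key : value' lines of the BBU status output into a dict.

    Section headers such as 'BBU Firmware Status:' have no value and are skipped.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def summarize(fields: dict) -> dict:
    """Pull out the fields this article looks at (hypothetical helper)."""
    charge = int(fields["Relative State of Charge"].rstrip("%").strip())
    return {
        "charging": fields.get("Charging Status") == "Charging",
        "needs_replacement": fields.get("Battery Replacement required") == "Yes",
        "relative_charge_pct": charge,
    }


# A few lines copied from the output above.
SAMPLE = """\
Charging Status : Charging
Learn Cycle Active : No
Battery Replacement required : No
Relative State of Charge: 7 %
"""

print(summarize(parse_bbu_status(SAMPLE)))
# {'charging': True, 'needs_replacement': False, 'relative_charge_pct': 7}
```

A battery that is charging, at a low relative charge, with no replacement flag set matches the "learn cycle just finished discharging" picture rather than a failed battery.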

Once charging has completed, the following message appears in the cell's alerts:

Hardware Alert 10_2

Event Time 2011-01-17T19:14:51+02:00

Description Battery is back to a good state

Affected Cell 
Name sba2cel12
Chassis Serial Number 103XXXXXXX
Version OSS_MAIN_LINUX.X64_100929

Recommended Action Battery is back to a good state. No Action Required

Connect to the cell node and check the write mode (WriteThrough/WriteBack) of the disks. You should see:

# MegaCli64 -LDInfo -Lall -aALL | grep 'Current Cache Policy'

Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
...
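
With a dozen logical drives per cell, it is easier to count the policies than to read them line by line. A minimal Python sketch, assuming the command output has been captured as text (`cache_policy_counts` is a name made up for this example):

```python
from collections import Counter


def cache_policy_counts(text: str) -> Counter:
    """Count the write policy (first field) of each 'Current Cache Policy' line."""
    counts = Counter()
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        if sep and label.strip() == "Current Cache Policy":
            counts[value.split(",")[0].strip()] += 1
    return counts


# Three lines copied from the output above.
SAMPLE = """\
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
"""

counts = cache_policy_counts(SAMPLE)
print(dict(counts))                  # {'WriteBack': 3}
print(set(counts) == {"WriteBack"})  # True: every logical drive is in WriteBack
```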

The same can be seen in the cell alert history:

[root@dm02db01 ~]# dcli -g cell_group -l root "cellcli -e list alerthistory"
dm02cel01: 15_1  2013-10-17T02:00:01+08:00       info    "The disk controller battery is executing a learn cycle and may temporarily enter WriteThrough Caching mode as part of the learn cycle. Disk write throughput might be temporarily lower during this time. The flash drives are not affected. The battery learn cycle is a normal maintenance activity that occurs quarterly and runs for approximately 1 to 12 hours.  Note that many learn cycles do not require entering WriteThrough caching mode.  When the disk controller cache returns to the normal WriteBack caching mode, an additional informational alert will be sent.  Battery Serial Number : 6198  Battery Type          : iBBU08  Battery Temperature   : 42 C  Full Charge Capacity  : 1303 mAh  Relative Charge       : 98 %  Ambient Temperature   : 25 C"


dm02cel01: 15_2  2013-10-17T07:33:21+08:00       clear   "All disk drives are in WriteBack caching mode.  Battery Serial Number : 6198  Battery Type          : iBBU08  Battery Temperature   : 47 C  Full Charge Capacity  : 1303 mAh  Relative Charge       : 53 %  Ambient Temperature   : 25 C"


........

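The output above shows that the battery learn cycle on cel01 (dm02cel01) started at 02:00 on October 17 and finished at 07:33 the same morning. Once it finished, the disk drives were switched back to WriteBack mode.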
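
The duration can be worked out directly from the two alert timestamps. A minimal Python sketch (the helper name `learn_cycle_duration` is invented for this example; the 1 to 12 hour window is the one quoted in the alert text itself):

```python
from datetime import datetime, timedelta


def learn_cycle_duration(start_ts: str, clear_ts: str) -> timedelta:
    """Time between the 'info' alert and the matching 'clear' alert."""
    return datetime.fromisoformat(clear_ts) - datetime.fromisoformat(start_ts)


# Timestamps of alerts 15_1 and 15_2 from the alert history above.
elapsed = learn_cycle_duration("2013-10-17T02:00:01+08:00",
                               "2013-10-17T07:33:21+08:00")
print(elapsed)  # 5:33:20

# The alert says a learn cycle runs for approximately 1 to 12 hours.
print(timedelta(hours=1) <= elapsed <= timedelta(hours=12))  # True
```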

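Battery charging (the learning state) usually takes several hours. If the mode does not automatically revert to WriteBack once charging has completed, the controller battery may be faulty and you should contact Oracle Support.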

The battery learn cycle time can be checked as follows:
/usr/local/bin/dcli -g /opt/oracle.SupportTools/onecommand/cell_group -l root 'cellcli -e list cell attributes bbuLearnCycleTime'

A normal result looks like this:

root@exadb01# /usr/local/bin/dcli -g /opt/oracle.SupportTools/onecommand/cell_group -l root 'cellcli -e list cell attributes bbuLearnCycleTime'

exacel01: 2012-04-17T02:00:00-04:00
exacel02: 2012-04-17T02:00:00-04:00
exacel03: 2012-04-17T02:00:00-04:00  
exacel04: 2012-04-17T02:00:00-04:00
exacel05: 2012-04-17T02:00:00-04:00
exacel06: 2012-04-17T02:00:00-04:00
exacel07: 2012-04-17T02:00:00-04:00
exacel08: 2012-04-17T02:00:00-04:00
exacel09: 2012-04-17T02:00:00-04:00
exacel10: 2012-04-17T02:00:00-04:00
exacel11: 2012-04-17T02:00:00-04:00
exacel12: 2012-04-17T02:00:00-04:00
exacel13: 2012-04-17T02:00:00-04:00
exacel14: 2012-04-17T02:00:00-04:00

All 14 cells report the same value. Similar information can also be seen on the db nodes:

# /opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -GetBbuStatus -aAll
BBU status for Adapter: 0
....
Learn Cycle Requested                      : No   
.............
GasGuageStatus:
...................
  Remaining Time Alarm    : No
  Discharge Terminated    : No
...................

# /opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -BbuLearn -a0
Adapter 0: BBU Learn Succeeded.

If the values are inconsistent or the output is abnormal, for example:

root@exadb01# /usr/local/bin/dcli -g /opt/oracle.SupportTools/onecommand/cell_group -l root 'cellcli -e list cell attributes bbuLearnCycleTime'

exacel01: 2012-04-17T02:00:00+01:00
exacel02: 2012-04-17T02:00:00+01:00
exacel03: 2012-10-17T02:00:00+01:00  
exacel04: 2012-07-17T02:00:00+01:00
exacel05: 2012-04-17T02:00:00+01:00
exacel06: 2012-07-17T02:00:00+01:00
exacel07: 2012-07-17T02:00:00+01:00
exacel08: 2012-07-17T02:00:00+01:00
exacel09: 2012-07-17T02:00:00+01:00
exacel10: 2012-07-17T02:00:00+01:00
exacel11: 2012-07-17T02:00:00+01:00
exacel12: 2012-07-17T02:00:00+01:00
exacel13: 2012-07-17T02:00:00+01:00
exacel14: 2012-07-17T02:00:00+01:00

The battery learn cycle times above are inconsistent.
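
Spotting the odd cells by eye gets harder on larger racks. The following minimal Python sketch, which assumes the dcli output has been captured as text, lists every cell whose value differs from the most common one. `find_outliers` is a name made up for this example, and the most common value is only a reference point, not necessarily the correct setting.

```python
from collections import Counter


def find_outliers(dcli_output: str) -> dict:
    """Map each cell whose bbuLearnCycleTime differs from the most common value to that value."""
    times = {}
    for line in dcli_output.splitlines():
        # Split on the first colon only: 'exacel01: 2012-04-17T02:00:00+01:00'
        cell, sep, value = line.partition(":")
        if sep and value.strip():
            times[cell.strip()] = value.strip()
    if not times:
        return {}
    most_common, _ = Counter(times.values()).most_common(1)[0]
    return {cell: t for cell, t in times.items() if t != most_common}


# The inconsistent output shown above.
SAMPLE = """\
exacel01: 2012-04-17T02:00:00+01:00
exacel02: 2012-04-17T02:00:00+01:00
exacel03: 2012-10-17T02:00:00+01:00
exacel04: 2012-07-17T02:00:00+01:00
exacel05: 2012-04-17T02:00:00+01:00
exacel06: 2012-07-17T02:00:00+01:00
exacel07: 2012-07-17T02:00:00+01:00
exacel08: 2012-07-17T02:00:00+01:00
exacel09: 2012-07-17T02:00:00+01:00
exacel10: 2012-07-17T02:00:00+01:00
exacel11: 2012-07-17T02:00:00+01:00
exacel12: 2012-07-17T02:00:00+01:00
exacel13: 2012-07-17T02:00:00+01:00
exacel14: 2012-07-17T02:00:00+01:00
"""

print(sorted(find_outliers(SAMPLE)))
# ['exacel01', 'exacel02', 'exacel03', 'exacel05']
```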
Likewise, running the following commands on a db node also shows abnormal information:

# /opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -GetBbuStatus -aAll
BBU status for Adapter: 0
....
  Learn Cycle Requested        : Yes   <<<<
.............
GasGuageStatus:
...................
  Initialized                  : Yes  
  Remaining Time Alarm         : Yes
...................


# /opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -BbuLearn -a0
Adapter 0: BBU Learn Failed   <<<<<

The sundiag output also contains similar information (/opt/MegaRAID/MegaCli/MegaCli64 -adpbbucmd -aALL):

Relative State of Charge: 27 % 
Absolute State of charge: 23 % 
Remaining Capacity: 356 mAh <------- 
Full Charge Capacity: 1365 mAh 

This is a battery learn cycle bug:

 Bug 15788039: SUNBT7164516 BBU LEARN CYCLES NO LONGER RUN ON NIWOT CARD.
 This will be fixed in LSI firmware 12.12.0-0147.

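Workaround: reboot the cell nodes and db nodes so that the controller relearns and recalculates the battery state, which keeps the battery from running flat.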
