Monday, August 1, 2011

Oracle 11g: Processes That Cause Node Reboots / Avoiding False Reboots

A few processes can evict a node from the cluster or cause it to reboot:
  • hangcheck-timer: monitors the machine for hangs and pauses.
  • oclskd: used by CSS to reboot a node on request from another node within the cluster.
  • ocssd: monitors internode health.
From Oracle 11g Release 2 onward, hangcheck-timer is no longer needed.

To identify which of these processes caused a node reboot, check the following log files:
  • hangcheck-timer
    • /var/log/messages
  • oclskd
    • GRID_HOME/log/hostname/client/oclskd.log
  • ocssd
    • /var/log/messages
    • GRID_HOME/log/hostname/cssd/ocssd.log
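
The mapping above can be captured in a small helper so the right files are easy to find on a given node. This is only a sketch: the function name reboot_logs_for is hypothetical, and the Grid home and host name are passed in as arguments rather than read from the environment.

```shell
# Print the log files to inspect for a given process.
# Arguments: process name, Grid home directory, node host name.
reboot_logs_for() {
    process=$1
    grid_home=$2
    host=$3
    case "$process" in
        hangcheck-timer)
            echo "/var/log/messages" ;;
        oclskd)
            echo "$grid_home/log/$host/client/oclskd.log" ;;
        ocssd)
            echo "/var/log/messages"
            echo "$grid_home/log/$host/cssd/ocssd.log" ;;
        *)
            # Anything else is not one of the three processes listed above.
            echo "unknown process: $process" >&2
            return 1 ;;
    esac
}
```

For example, reboot_logs_for ocssd /u01/grid node1 prints the two ocssd locations, where /u01/grid and node1 are placeholder values for your own Grid home and host name.
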
At the time of the reboot, these processes write lines such as the following to their logs:
  • hangcheck-timer
    • "Hangcheck: hangcheck is restarting the machine."
  • ocssd
    • "Oracle CSSD failure. Rebooting for cluster integrity"
    • There may be additional information, such as "Begin Dump" and "End Dump" markers, just before the reboot.
    • If you don't find anything that identifies the cause of the reboot, you may need to enable tracing and additional debugging.
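
Searching for these signatures can be scripted. The following is a minimal sketch, assuming the messages appear in the log exactly as quoted above; the function name identify_reboot_cause is hypothetical.

```shell
# Report which process logged a reboot signature in the given file.
# Prints one name per matching signature, or "unknown" if neither appears.
identify_reboot_cause() {
    logfile=$1
    found=0
    # -F treats the pattern as a fixed string, -q suppresses output.
    if grep -qF "Hangcheck: hangcheck is restarting the machine" "$logfile"; then
        echo "hangcheck-timer"
        found=1
    fi
    if grep -qF "Oracle CSSD failure" "$logfile"; then
        echo "ocssd"
        found=1
    fi
    if [ "$found" -eq 0 ]; then
        echo "unknown"
    fi
}
```

Run it as identify_reboot_cause /var/log/messages. An "unknown" result only means that neither signature was found in that file; it may still be worth checking the other logs or enabling tracing.
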
A false reboot can sometimes occur because of low MARGIN settings combined with heavy CPU load, or because of a scheduler bug.

Scheduling latencies vary widely across operating systems and operating system versions, and this variation can lead to false reboots.

If diagwait is set too low and false reboots have occurred, increase its value with the following command:
  • crsctl set css diagwait <value> -force
If hangcheck-timer is in use and is identified as the cause, increase the hangcheck_margin parameter of the hangcheck-timer module. To validate the values of diagwait or hangcheck_margin, check that the following conditions hold:
  • CSS misscount > (TIMEOUT + MARGIN)
    • To get the current CSS misscount, run crsctl get css misscount
  • CSS misscount > diagwait
  • CSS misscount > hangcheck_tick + hangcheck_margin
Note: Do not change the values of misscount and disk timeout unless Oracle Support recommends it.
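
The last two conditions are simple integer comparisons, so they can be checked in a script once you have the values. This is a minimal sketch: the function name check_css_margins is hypothetical, and all four values (in seconds) are supplied by you rather than read from the cluster.

```shell
# Return 0 if misscount exceeds both diagwait and
# hangcheck_tick + hangcheck_margin; otherwise print what failed and return 1.
# Arguments: misscount diagwait hangcheck_tick hangcheck_margin
check_css_margins() {
    misscount=$1
    diagwait=$2
    tick=$3
    margin=$4
    status=0
    if [ "$misscount" -le "$diagwait" ]; then
        echo "misscount ($misscount) is not greater than diagwait ($diagwait)"
        status=1
    fi
    if [ "$misscount" -le $((tick + margin)) ]; then
        echo "misscount ($misscount) is not greater than hangcheck_tick + hangcheck_margin ($((tick + margin)))"
        status=1
    fi
    return $status
}
```

For example, check_css_margins 60 13 1 10 succeeds silently because 60 is greater than both 13 and 11. These numbers are illustrative only and are not tuning recommendations.
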
