There are a few processes that can evict a node from the cluster or cause a node to reboot:
- hangcheck-timer: This process monitors the machine for hangs and pauses.
- oclskd: This process is used by CSS to reboot a node based on requests from other nodes within the cluster.
- ocssd: This process monitors internode health.
To identify which of the above processes is causing the node reboot, we need to go through the following log files:
- hangcheck-timer
  - /var/log/messages
- oclskd
  - GRID_HOME/log/hostname/client/oclskd.log
- ocssd
  - /var/log/messages
  - GRID_HOME/log/hostname/cssd/ocssd.log
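The log locations above can be collected in a small helper so they are easy to check on each node. This is only a sketch: the function name `reboot_log_paths` and the fallback `/u01/app/grid` are assumptions made for illustration, and the short hostname is assumed to match the directory name under GRID_HOME/log.

```python
import os
import posixpath
import socket


def reboot_log_paths(grid_home, hostname):
    """Map each process to the log files listed above (hypothetical helper)."""
    node_log_dir = posixpath.join(grid_home, "log", hostname)
    return {
        "hangcheck-timer": ["/var/log/messages"],
        "oclskd": [posixpath.join(node_log_dir, "client", "oclskd.log")],
        "ocssd": [
            "/var/log/messages",
            posixpath.join(node_log_dir, "cssd", "ocssd.log"),
        ],
    }


if __name__ == "__main__":
    # Assumption: GRID_HOME is exported; the fallback path is only a placeholder.
    grid_home = os.environ.get("GRID_HOME", "/u01/app/grid")
    short_host = socket.gethostname().split(".")[0]
    for process, paths in reboot_log_paths(grid_home, short_host).items():
        for path in paths:
            status = "found" if os.path.exists(path) else "missing"
            print(f"{process}: {path} ({status})")
```

Running it on a cluster node prints each expected log file and whether it exists, which is a quick way to confirm where to look before reading the logs themselves.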
Look for the following messages in those log files:
- hangcheck-timer
  - "Hangcheck: hangcheck is restarting the machine."
- ocssd
  - "Oracle CSSD failure. Rebooting for cluster integrity"
- There might be some more information, such as "Begin Dump" and "End Dump" entries, just before the reboot.
- If you don't find any indication of what caused the node reboot, you might need to enable tracing and additional debugging.
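Searching for these messages can be scripted. The sketch below matches the two message texts quoted above against log lines and reports which process they point to. The names `REBOOT_SIGNATURES` and `find_reboot_causes` are hypothetical, and the sample lines are synthetic (invented timestamps and hostname), not real log output.

```python
# Message texts taken from the list above; whitespace is normalized before
# matching in case the log uses different spacing between sentences.
REBOOT_SIGNATURES = {
    "hangcheck-timer": "Hangcheck: hangcheck is restarting the machine",
    "ocssd": "Oracle CSSD failure. Rebooting for cluster integrity",
}


def find_reboot_causes(lines):
    """Return (process, line) pairs for every line containing a known message."""
    hits = []
    for line in lines:
        normalized = " ".join(line.split())
        for process, signature in REBOOT_SIGNATURES.items():
            if signature in normalized:
                hits.append((process, normalized))
    return hits


if __name__ == "__main__":
    # Synthetic sample lines for illustration only.
    sample = [
        "Jan  1 00:00:01 node1 kernel: Hangcheck: hangcheck is restarting the machine.\n",
        "Jan  1 00:00:02 node1 logger: Oracle CSSD failure.  Rebooting for cluster integrity.\n",
        "Jan  1 00:00:03 node1 sshd: unrelated line\n",
    ]
    for process, line in find_reboot_causes(sample):
        print(f"{process}: {line}")
```

To scan a real file, pass an open file object instead of the sample list, for example `find_reboot_causes(open("/var/log/messages"))`.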
Sometimes a false reboot can occur because of a low MARGIN setting combined with heavy CPU load or a scheduler bug.
Scheduling latencies vary widely across operating systems and operating system versions, and this variation can result in false reboots.
If diagwait is set too low and a false reboot has occurred, increase its value with the command below, replacing <value> with the new setting in seconds:
- crsctl set css diagwait <value> -force
If hangcheck-timer is in use and is found to be the cause, increase the value of the hangcheck_margin parameter of the hangcheck-timer module.
To validate the values of diagwait or hangcheck_margin, check that the following conditions hold:
- CSS misscount > (TIMEOUT + MARGIN)
- To get the current CSS misscount, run: crsctl get css misscount
- CSS misscount > diagwait
- CSS misscount > hangcheck_tick + hangcheck_margin
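The conditions above are simple inequalities, so they can be checked mechanically once the values are known. The sketch below assumes all four values are in seconds and have already been collected by hand. The function name `check_css_timeouts` and the numbers in the usage example are illustrative assumptions, not recommended settings.

```python
def check_css_timeouts(misscount, diagwait, hangcheck_tick, hangcheck_margin):
    """Return a list of violated conditions; an empty list means all hold."""
    problems = []
    if not misscount > diagwait:
        problems.append(
            f"misscount ({misscount}) must be greater than diagwait ({diagwait})"
        )
    hangcheck_total = hangcheck_tick + hangcheck_margin
    if not misscount > hangcheck_total:
        problems.append(
            f"misscount ({misscount}) must be greater than "
            f"hangcheck_tick + hangcheck_margin ({hangcheck_total})"
        )
    return problems


if __name__ == "__main__":
    # Illustrative values only; read the real ones from your own cluster.
    for problem in check_css_timeouts(
        misscount=30, diagwait=13, hangcheck_tick=30, hangcheck_margin=180
    ):
        print("WARNING:", problem)
```

With the illustrative values shown, the second condition fails because 30 is not greater than 30 + 180, so one warning is printed.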
Note: Do not change the values of misscount and disk timeout unless Oracle Support recommends it.