HA Cluster Introduction: Understanding Monitor Resources

November 12th, 2024

Machine translation is used partially for this article. See the Japanese version for the original article.

Introduction

In this article, as an introduction to HA clusters, we explain how monitor resources perform monitoring and what actions are taken after a monitor resource detects an error.
Understanding the behavior of monitor resources will help you build an HA cluster that meets your system's requirements.

1. What are Monitor Resources

Monitor resources monitor whether the business applications and the server resources needed to perform business operations are functioning properly. If an error is detected in the monitor target, the necessary recovery actions (for example, restarting group resources or executing a failover) are performed so that business operations can continue.

In EXPRESSCLUSTER, dedicated monitor resources are provided to monitor the status of group resources (such as business applications and floating IP addresses) and server resources (such as network communication paths and operating systems). This allows easy monitoring of the target resources. Additionally, custom monitor resources are available for monitoring any target by any method: by providing scripts, users can customize the monitoring method.
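
As a rough illustration of what a script for a custom monitor resource might look like, here is a minimal Python sketch that checks whether an application accepts TCP connections. The function names, host, and port are hypothetical, and the convention that exit code 0 means "normal" and any other value means "error" is an assumption to confirm against the manuals for your version.

```python
import socket


def check_tcp_port(host: str, port: int, timeout_sec: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_sec):
            return True
    except OSError:
        # Connection refused, unreachable host, and timeouts all land here.
        return False


def monitor_exit_code(is_healthy: bool) -> int:
    """Map a health result to a script exit code.

    Assumption: 0 is treated as "normal" and any other value as "error".
    """
    return 0 if is_healthy else 1
```

A script built on this sketch would end with something like sys.exit(monitor_exit_code(check_tcp_port("127.0.0.1", 8080))), where the address and port stand in for your own application.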

Furthermore, all monitor resources come with default values chosen so that they can be used as is with little risk of problems. This allows you to start monitoring without designing detailed monitoring settings. If you want monitoring that is better tailored to your requirements, you can customize the settings.

For information on the monitor resources supported by EXPRESSCLUSTER and their functions, please refer to the Getting Started Guide.

[Reference]

Documentation - Manuals

EXPRESSCLUSTER X 5.2 > EXPRESSCLUSTER X 5.2 for Windows > Getting Started Guide

-> 3. Using EXPRESSCLUSTER

-> 3.6 What is a resource?

-> 3.6.4 Monitor resources

EXPRESSCLUSTER X 5.2 > EXPRESSCLUSTER X 5.2 for Linux > Getting Started Guide

-> 3. Using EXPRESSCLUSTER

-> 3.6 What is a resource?

-> 3.6.4 Monitor resources

2. Behavior of Monitor Resources

To understand the behavior of monitor resources, we will explain the process from the detection of an error to the recovery actions (recovery from the anomaly) by monitor resources in three parts: monitoring timing, content of monitoring, and recovery actions after an error detection.

2.1 Monitoring Timing

There are two types of monitoring timings: always monitoring and active monitoring.

2.1.1 Always Monitoring

Always monitoring starts monitoring as soon as the cluster service is started, regardless of the state of business operations. It is appropriate for monitoring server resources, such as network communication paths and the OS, that are necessary for running business operations and for accessing them.

Furthermore, in the case of always monitoring, it is possible to monitor standby servers that are not performing business operations. This allows monitoring whether the standby servers are ready to perform business operations.

For example, the IP monitor resource (ipw) shown in the figure below uses ping to check communication with servers that every server in the HA cluster must always be able to reach, monitoring for errors in the network path or on the target server.

2.1.2 Active Monitoring

Active monitoring executes monitoring when a business operation starts and the group resources set as the monitoring target are activated. It is appropriate for monitoring whether the applications controlled by the monitored group resources or the connection destinations (such as virtual IP addresses, DNS names, etc.) are in a usable state.

For example, the AWS virtual IP monitor resource (awsvipw1) shown in the figure below monitors whether the virtual IP address assignment and the AWS route table route setting controlled by its monitoring target, the AWS virtual IP resource (awsvip), are in a normal state.
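
The difference between the two monitoring timings can be summarized in a few lines of Python. This is only a conceptual sketch of the rule described above, not EXPRESSCLUSTER code; the names Timing and should_monitor are hypothetical.

```python
from enum import Enum


class Timing(Enum):
    ALWAYS = "always"   # monitor whenever the cluster service is running
    ACTIVE = "active"   # monitor only while the target group resource is active


def should_monitor(timing: Timing,
                   cluster_service_running: bool,
                   target_resource_active: bool) -> bool:
    """Decide whether a monitor resource performs monitoring right now."""
    if not cluster_service_running:
        return False
    if timing is Timing.ALWAYS:
        # Runs on standby servers too, regardless of business operations.
        return True
    # Active monitoring follows the activation state of its target.
    return target_resource_active
```

With this rule, an always-monitoring resource such as ipw runs on a standby server, while an active-monitoring resource such as awsvipw1 runs only on the server where awsvip is activated.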

2.2 Contents of Monitoring

We will explain the monitoring process for monitor resources. Monitor resources periodically execute the following monitoring processes to ensure that the target is operating normally.

As settings for the monitoring process, you can configure the monitoring interval and the number of retries allowed before the monitored target is judged to be in error.

In addition, when the application is overloaded or the server is under high load, monitoring may not complete within the timeout. Even in such cases, the timeout itself is detected as an error.

Furthermore, by setting the monitor delay warning from the cluster properties, you can output an alert log when monitoring delay occurs without reaching the timeout, allowing you to be aware of delays in advance. The following figure shows the relationship between monitoring timeouts and delay warnings.
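
To make the interplay of these settings concrete, the following Python sketch models the interval, timeout, retry count, and delay warning. It is a simplified illustration, not the actual implementation. All names are hypothetical, the delay warning is assumed to be a ratio of the timeout, and an error is assumed to be confirmed once consecutive failures exceed the retry count.

```python
from dataclasses import dataclass


@dataclass
class MonitorSettings:
    interval_sec: float = 60.0        # wait between checks (a scheduler would use this)
    timeout_sec: float = 120.0        # a check taking this long counts as a timeout
    retry_count: int = 1              # failures tolerated before confirming an error
    delay_warning_ratio: float = 0.8  # assumption: warn at this fraction of the timeout


def classify_check(succeeded: bool, elapsed_sec: float, s: MonitorSettings) -> str:
    """Classify one monitoring attempt from its result and how long it took."""
    if elapsed_sec >= s.timeout_sec:
        return "timeout"
    if not succeeded:
        return "failure"
    if elapsed_sec >= s.timeout_sec * s.delay_warning_ratio:
        # Finished in time, but slowly enough to deserve an alert log entry.
        return "delay_warning"
    return "ok"


class ErrorJudge:
    """Track consecutive failures and report when an error is confirmed."""

    def __init__(self, settings: MonitorSettings) -> None:
        self.settings = settings
        self.consecutive_failures = 0

    def record(self, result: str) -> bool:
        """Return True once the target should be treated as being in error."""
        if result in ("failure", "timeout"):
            self.consecutive_failures += 1
        else:
            self.consecutive_failures = 0
        return self.consecutive_failures > self.settings.retry_count
```

In this model a delayed but successful check resets the failure counter, which reflects the idea that a delay warning is advance notice rather than an error.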

2.3 Recovery Actions after Error Detection

After a monitor resource detects an error, the recovery actions are executed.
In the recovery action settings, you can specify in detail what the recovery target is and how to recover from the abnormal condition.
Presets are also available for the recovery actions: restarting the group resource or failover group selected as the recovery target; executing a failover if the error is still detected after restarting the recovery target; and custom settings, in which each setting item can be set freely. Setting items that are not needed for the chosen preset are grayed out, which makes configuration easier.

Example) Immediate failover settings

When using the "Executing failover to the recovery target" preset

Executing Failover to the Recovery Target

When using the "Custom settings" preset
By setting the "Maximum Reactivation Count" to 0 and the "Maximum Failover Count" to 1 or more, an immediate failover will occur upon error detection.
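
The effect of these two counts can be sketched as a small decision function. This is an illustrative model only, assuming that reactivation is tried first and failover second. The names and the "exhausted" result are hypothetical, and what happens once both counts are used up is outside this sketch.

```python
from dataclasses import dataclass


@dataclass
class RecoveryState:
    reactivations_done: int = 0
    failovers_done: int = 0


def next_recovery_action(state: RecoveryState,
                         max_reactivation_count: int,
                         max_failover_count: int) -> str:
    """Pick the next recovery step after an error is detected."""
    if state.reactivations_done < max_reactivation_count:
        state.reactivations_done += 1
        return "reactivate"   # restart the recovery target on the same server
    if state.failovers_done < max_failover_count:
        state.failovers_done += 1
        return "failover"     # move the recovery target to another server
    return "exhausted"        # both counts used up (not modeled further here)
```

With max_reactivation_count set to 0 and max_failover_count set to 1, the first call returns "failover", which corresponds to the immediate failover setting described above.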

Messages that are useful for troubleshooting, such as those output during recovery actions, are written to the alert log of the Cluster WebUI and to the OS logs (the event log on Windows, syslog on Linux).

Additionally, by using EXPRESSCLUSTER X Alert Service, an optional product of EXPRESSCLUSTER, you can be notified by email of important events such as a server going down or a failover triggered by a monitoring error. For details about the Alert Service, please refer to the Reference Guide.

[Reference]

Documentation - Manuals

EXPRESSCLUSTER X 5.2 > EXPRESSCLUSTER X 5.2 for Windows > Reference Guide

-> 8. Information on other settings

-> 8.1. Alert Service

EXPRESSCLUSTER X 5.2 > EXPRESSCLUSTER X 5.2 for Linux > Reference Guide

-> 8. Information on other settings

-> 8.3. Alert Service

3. Checking the Operation Using the Dummy Failure

As an example of how to check the operation of monitor resources, we will introduce a method that uses the dummy failure function of EXPRESSCLUSTER. By using the dummy failure function, you can easily check whether the monitor resources execute the intended monitoring and recovery actions.

The dummy failure function allows you to simulate abnormal or normal conditions of monitor resources using the verification mode of the Cluster WebUI or the clpmonctrl command. This does not actually cause a failure. Instead, it simulates a monitor resource detecting an error so that the recovery actions are executed.

Example) When enabling the dummy failure in the Cluster WebUI

1. Switch to the verification mode from the top of the Cluster WebUI and move to the status tab.

2. Expand the monitor resource for which you want to enable the dummy failure and click the "Enable dummy failure" button.

3. After the error is detected, check that the monitor resource executes the recovery action (in this case, failover) according to the monitoring settings.

4. After checking the operation, click the "Disable dummy failure" button to disable the dummy failure.

The dummy failure function is not supported by all monitor resources. For monitor resources that do not support it, check the operation by intentionally causing an error. For details on how to check the operation by intentionally causing errors, please refer to the Installation and Configuration Guide.

[Reference]

Documentation - Manuals

EXPRESSCLUSTER X 5.2 > EXPRESSCLUSTER X 5.2 for Windows > Installation and Configuration Guide

-> 8. Verifying operation

EXPRESSCLUSTER X 5.2 > EXPRESSCLUSTER X 5.2 for Linux > Installation and Configuration Guide

-> 9. Verifying operation

Conclusion

In this article, as an introduction to HA clusters, we explained the monitoring behavior of monitor resources and the actions taken after a monitor resource detects an error.
Thank you for reading this article to the end.

We always welcome your requests for this blog. If you have any questions about HA clusters, or requests such as configurations you would like us to verify, please do not hesitate to contact us.
