[2022 Edition] How to Prevent Both-System Activation on AWS (Windows/Linux)

January 10th, 2023

Machine translation is used partially for this article. See the Japanese version for the original article.

Last updated: December 20th, 2024

* A description about the Service Startup Delay Time has been added.

Introduction

To prevent both-system activation of an HA cluster on Amazon Web Services (hereinafter called "AWS"), we built an HA cluster that uses the forced stop resource, one of the features of EXPRESSCLUSTER.

This article is the revised version (2022 Edition) of the previous article. In EXPRESSCLUSTER X 5.0, the forced stop function and the script for forced stop have been reworked into the forced stop resource, so the description of how to configure them has been revised.
If you are using EXPRESSCLUSTER X 4.3 or earlier, refer to the previous article.

EXPRESSCLUSTER can use network partition resolution (hereinafter called "NP resolution") as a method to prevent both-system activation. The recommended configuration on AWS is HTTP NP resolution. (For more information on NP resolution, refer to the previous article.)

However, even if NP resolution is set, both-system activation may still occur if a failure such as an OS stall or a communication disconnection between Availability Zones occurs.
In such cases, the forced stop resource can be used to prevent both-system activation.
In this article, we configure the forced stop resource for an HA cluster on AWS.

1. Both-system Activation and the Forced Stop Resource
2. HA Cluster Configuration
3. HA Cluster Building Procedure
3.1 Preparation for Building an HA Cluster
3.2 Procedure for Building an HA Cluster
4. Checking the Operation at the Time of NP Resolution
4.1 When Not to Execute the Forced Stop Resource
4.2 When Executing the Forced Stop Resource

1. Both-system Activation and the Forced Stop Resource

Both-system activation
Both-system activation is an event in which a failover group starts on multiple servers in a cluster that is in a network partition (split-brain) state. When both-system activation occurs, the business application running on each server reads and writes business data independently, which can cause critical problems such as data inconsistency between servers. It is therefore important to take measures to prevent both-system activation.

Forced stop resource
The forced stop resource is a function that runs on the remaining (normal) server in the cluster when that server detects that the other server is down, and forcibly stops the down server from the outside. As a result, even if a network partition (split-brain) occurs, the server can be stopped by the forced stop resource in addition to the action taken by the NP resolution resource, which makes the prevention of both-system activation more reliable.

The forced stop resource is executed when a server down is detected by heartbeat timeout and the failover group that was running on the down server starts on the other server. If you stop a server normally, for example from Cluster WebUI, or if no failover group was running on the down server and therefore no failover occurs, the forced stop resource is not executed. This means the forced stop resource does not stop a server when it is not necessary.
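The trigger conditions described above can be summarized as a small decision function. This is only an illustrative sketch of the described behavior: the class, field, and function names are hypothetical and are not part of EXPRESSCLUSTER.

```python
from dataclasses import dataclass


@dataclass
class ServerDownEvent:
    """Hypothetical description of how a server left the cluster."""
    detected_by_heartbeat_timeout: bool      # False for a normal stop, e.g. from Cluster WebUI
    group_was_running_on_down_server: bool   # True if a failover group must move


def forced_stop_is_executed(event: ServerDownEvent) -> bool:
    """Return True only when a failover is caused by a heartbeat timeout."""
    return (event.detected_by_heartbeat_timeout
            and event.group_was_running_on_down_server)


# Active server stalls: heartbeat times out and the group fails over.
active_server_stall = ServerDownEvent(True, True)
# Server is stopped normally by an operator.
normal_stop = ServerDownEvent(False, True)
# Standby server goes down: no group was running there, so no failover.
standby_down = ServerDownEvent(True, False)

for name, event in [("active server stall", active_server_stall),
                    ("normal stop", normal_stop),
                    ("standby down", standby_down)]:
    print(f"{name}: forced stop executed = {forced_stop_is_executed(event)}")
```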

There are several types of forced stop resource, depending on the environment. For example, you can use a BMC forced stop resource, which uses the IPMI function to stop servers in a physical environment, or a vCenter forced stop resource, which uses the VMware vCenter Server function to stop virtual machines.

For more information on the forced stop resource, see the Reference Guide.

[Reference]

Documentation - Manuals

EXPRESSCLUSTER X 5.0 > EXPRESSCLUSTER X 5.0 for Windows > Reference Guide

-> 7. Forced stop resource details

EXPRESSCLUSTER X 5.0 > EXPRESSCLUSTER X 5.0 for Linux > Reference Guide

-> 7. Forced stop resource details

2. HA Cluster Configuration

We build an HA cluster based on VIP control. We make a few changes from the configuration in the HA Cluster Configuration Guide for Amazon Web Services.

In the Configuration Guide, a web server outside the VPC (on the Internet) is specified as the target for HTTP NP resolution. However, this cannot be used if your security policies do not allow the servers of the HA cluster to access the Internet.
In this article, we will build an "HA cluster based on VIP control" combined with a "VPC endpoint" as an HA cluster configuration that does not access the Internet.

The diagram of the HA cluster to be built is as follows:

We will register only AWS virtual IP resources and mirror disk resources in the failover group.
We will also specify a website created on Amazon S3 as the target for HTTP NP resolution, and access to Amazon S3 will go through a gateway endpoint. For security reasons, we will configure the website so that it can be accessed only from the VPC where the HA cluster is built. In addition, we will set up the forced stop resource to stop the down server.

* Configure the NP resolution resources as appropriate for your HA cluster environment.
For example, if on-premises clients access the HA cluster on AWS through AWS Direct Connect, NP resolution can be achieved by specifying the on-premises gateway as the target of a Ping NP resolution resource.

Refer to the below for more information about gateway endpoints, how to create them, and how to host a website on Amazon S3.

[Reference]

Gateway endpoints

Controlling access from VPC endpoints with bucket policies

Tutorial: Configuring a static website on Amazon S3
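As a rough illustration of restricting the Amazon S3 website to the VPC, the following sketch builds a bucket policy that allows `s3:GetObject` only for requests arriving through a specific gateway endpoint, using the `aws:SourceVpce` condition key. The bucket name is a hypothetical placeholder, the endpoint ID is the sample ID used in this article, and the exact policy you need may differ depending on your security requirements.

```python
import json


def build_bucket_policy(bucket_name: str, vpc_endpoint_id: str) -> dict:
    """Build a bucket policy that allows reads only via one VPC endpoint."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowGetObjectFromVpcEndpointOnly",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
                # Requests that do not come through this endpoint do not match.
                "Condition": {
                    "StringEquals": {"aws:SourceVpce": vpc_endpoint_id}
                },
            }
        ],
    }


# "example-np-target-bucket" is a made-up name; replace it with your bucket.
policy = build_bucket_policy("example-np-target-bucket", "vpce-5678cdef")
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the bucket policy editor or passed to your usual deployment tooling.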

3. HA Cluster Building Procedure

In this section, we introduce the procedure for building an HA cluster.
Please note that the procedures are different between Windows and Linux.

3.1 Preparation for Building an HA Cluster

For more information on preparing your AWS environment, refer to the HA Cluster Configuration Guide for Amazon Web Services.

In this configuration, the servers of the HA cluster communicate with the AWS endpoints through VPC endpoints when executing AWS CLI commands, so creating NAT instances is not necessary.
For the procedure for configuring the VPC endpoints, please refer to our previous article.

[Reference]

Documentation - Setup Guides

Windows > Cloud > Amazon Web Services > EXPRESSCLUSTER X 5.0 for Windows HA Cluster Configuration Guide for Amazon Web Services

-> 5. Constructing an HA cluster based on VIP control

-> 5.1 Configure the VPC Environment
-> 5.2 Configuring the instance

Linux > Cloud > Amazon Web Services > EXPRESSCLUSTER X 5.0 for Linux HA Cluster Configuration Guide for Amazon Web Services

-> 5. Constructing an HA cluster based on VIP control

-> 5.1 Configure the VPC Environment
-> 5.2 Configuring the instance

In addition, to stop an EC2 instance with the forced stop resource, the servers of the HA cluster execute the stop-instances command of the AWS CLI from the forced stop resource.
Therefore, we add the ec2:StopInstances action to the policy of the IAM role assigned to the servers, along with the actions required to control the AWS virtual IP resources. The policy of the IAM role allows the following actions:

"ec2:Describe*"
"ec2:StopInstances"
"ec2:ReplaceRoute"

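For reference, the three actions listed above could be expressed as an IAM policy document like the one generated by the sketch below. Using `"Resource": "*"` is an assumption made for brevity and is not taken from this article; narrow it if your security policy requires.

```python
import json

# Actions listed in this article for the IAM role of the cluster servers.
REQUIRED_ACTIONS = [
    "ec2:Describe*",
    "ec2:StopInstances",
    "ec2:ReplaceRoute",
]


def build_iam_policy(actions: list) -> dict:
    """Build a minimal IAM policy document that allows the given actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(actions),
                "Resource": "*",  # assumption: not restricted to specific resources
            }
        ],
    }


iam_policy = build_iam_policy(REQUIRED_ACTIONS)
print(json.dumps(iam_policy, indent=2))
```
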
The configuration of the VPC is as below. Routes to Amazon S3 (VPC endpoints) are automatically added to the route tables that you selected when you created the gateway endpoint.

VPC(VPC ID : vpc-1234abcd)

- CIDR : 10.0.0.0/16
- Subnets

■Subnet-A1 (Subnet ID : sub-1111aaaa) : 10.0.10.0/24
■Subnet-A2 (Subnet ID : sub-2222aaaa) : 10.0.110.0/24
■Subnet-B1 (Subnet ID : sub-1111bbbb) : 10.0.20.0/24
■Subnet-B2 (Subnet ID : sub-2222bbbb) : 10.0.120.0/24

- RouteTables

■Main (Route table ID : rtb-00000001)

>10.0.0.0/16 -> local

>0.0.0.0/0 -> igw-1234abcd (Internet Gateway)
>20.0.0.200/32 -> eni-1234abcd (ENI ID)

■Route-A (Route table ID : rtb-0000000a)

>10.0.0.0/16 -> local
>20.0.0.200/32 -> eni-1234abcd (ENI ID)
>pl-xxxxxxxx -> vpce-5678cdef (Endpoint ID)
* Route to Amazon S3 (VPC endpoint)

■Route-B (Route table ID : rtb-0000000b)

>10.0.0.0/16 -> local
>20.0.0.200/32 -> eni-1234abcd (ENI ID)
>pl-xxxxxxxx -> vpce-5678cdef (Endpoint ID)
* Route to Amazon S3 (VPC endpoint)

3.2 Procedure for Building an HA Cluster

For more information on the procedure for building an "HA cluster based on VIP control", refer to the HA Cluster Configuration Guide for Amazon Web Services.

[Reference]

Documentation - Setup Guides

Windows > Cloud > Amazon Web Services > EXPRESSCLUSTER X 5.0 for Windows HA Cluster Configuration Guide for Amazon Web Services

-> 5. Constructing an HA cluster based on VIP control

-> 5.3 Setting up EXPRESSCLUSTER

Linux > Cloud > Amazon Web Services > EXPRESSCLUSTER X 5.0 for Linux HA Cluster Configuration Guide for Amazon Web Services

-> 5. Constructing an HA cluster based on VIP control

-> 5.3 Setting up EXPRESSCLUSTER

In this configuration, we change the target host of the HTTP NP resolution resource from the one used in the Configuration Guide, and additionally set the forced stop resource.
Specify the website you created on Amazon S3 as the target host for HTTP NP resolution.

3.2.1 Setting HTTP NP Resolution Resources

The HTTP NP resolution resource is set as below.
This time, we will set up HTTP NP resolution resources for a website created on Amazon S3.

The properties of the HTTP NP resolution resource are as follows:

3.2.2 Setting the Forced Stop Resource

The forced stop resource is configured as follows:

1. On the "Fencing" screen where you configured the HTTP NP resolution resource, select "AWS" in [Type] of [Forced Stop], and then click [Properties].

2. Select “server01” from the [Available Servers], and then click [Add].

3. Enter the EC2 instance ID for server01 in [Instance ID], and then click [OK].

4. Repeat steps 2 and 3 to add the EC2 instance ID for server02.

5. Click the "Forced Stop" tab and check [Disable Group Failover When Execution Fails].
With this setting enabled, failover is suppressed if the forced stop fails, so both-system activation can be prevented more reliably.

3.2.3 Specifying the Service Startup Delay Time

Specify the service startup delay time of EXPRESSCLUSTER.
This prevents the both-system activation that could result if the OS of the opposite server is restarted while the forced stop resource is being executed. It also prevents the forced stop from being executed during the cluster startup process.
Specify the service startup delay time as follows:

Service Startup Delay Time ≥ "Forced Stop Timeout" of the forced stop resource + "Time to Wait for Stop to Be Completed" of the forced stop resource + Heartbeat Timeout + Heartbeat Interval

The service startup delay time can be set in [Service Startup Delay Time] on the "Timeout" tab.
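The lower bound given by the formula above is a simple sum, as the following sketch shows. The function name is hypothetical, and the numeric values are placeholders chosen only for illustration, not product defaults. Substitute the values configured in your own cluster.

```python
def minimum_service_startup_delay(forced_stop_timeout: int,
                                  stop_completion_wait: int,
                                  heartbeat_timeout: int,
                                  heartbeat_interval: int) -> int:
    """Return the smallest acceptable Service Startup Delay Time in seconds.

    Implements: Forced Stop Timeout + Time to Wait for Stop to Be Completed
                + Heartbeat Timeout + Heartbeat Interval
    """
    return (forced_stop_timeout + stop_completion_wait
            + heartbeat_timeout + heartbeat_interval)


# Placeholder values in seconds (assumptions for this example only).
minimum_delay = minimum_service_startup_delay(
    forced_stop_timeout=10,
    stop_completion_wait=30,
    heartbeat_timeout=30,
    heartbeat_interval=3,
)
print(f"Service Startup Delay Time must be at least {minimum_delay} seconds")
```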


Instead of specifying the service startup delay time, you can adjust the OS startup time.
Specify the OS startup time as follows:

OS startup time ≥ "Forced Stop Timeout" of the forced stop resource + "Time to Wait for Stop to Be Completed" of the forced stop resource + Heartbeat Timeout + Heartbeat Interval

For more information on adjusting the OS startup time, refer to the following:

[Reference]

Documentation - Manuals

EXPRESSCLUSTER X 5.0 > EXPRESSCLUSTER X 5.0 for Windows > Installation and Configuration Guide

-> 2. Determining a system configuration

-> 2.6 Setting after configuring hardware

-> 2.6.3 Adjustment of the operating system startup time (Required)

EXPRESSCLUSTER X 5.0 > EXPRESSCLUSTER X 5.0 for Linux > Installation and Configuration Guide

-> 2. Determining a system configuration

-> 2.8 Setting after configuring hardware

-> 2.8.5 Adjustment of the operating system startup time (Required)

Additionally, for a cluster configuration that uses a shared disk, the Service Startup Delay Time or the OS startup time must be calculated from a different perspective.
For more detailed information, please also refer to "[2024 Edition] Introduction of the Service Startup Delay Time Setting Feature".

4. Checking the Operation at the Time of NP Resolution

We will check the operation of the HA cluster by causing a network partition in a configuration where the forced stop resource is executed and in a configuration where it is not executed. In this article, we will describe an example of checking the operation in a Windows environment.

To cause a network partition, we set up network ACLs that block all communication across the Availability Zones of the servers of the HA cluster. This stops the heartbeats between the servers, but HTTP communication from the servers to the website created on Amazon S3 is still possible, so each server decides that there is a problem on the other server and tries to start the failover group.

4.1 When Not to Execute the Forced Stop Resource

If servers do not execute the forced stop resource, each server starts a failover group, resulting in a both-system activation. If you check the status of the HA cluster on each server, it will be as follows. You can see that the failover group is running on each server.

[Status of Cluster for server01]
A failover group is running on server01.

C:\Users\Administrator>clpstat
 ========================  CLUSTER STATUS  ===========================
  Cluster : cluster
  <server>
   *server01 ........: Online            <- server01 is running
      lankhb1        : Normal            LAN Heartbeat
      httpnp1        : Normal            http resolution
    server02 ........: Offline           <- server02 is stopped
      lankhb1        : Unknown           LAN Heartbeat
      httpnp1        : Unknown           http resolution
  <group>
    failover ........: Online
      current        : server01          <- Failover group starting on server01
      awsvip         : Online
      md             : Online
  <monitor>
    awsvipw1         : Normal
    mdw1             : Caution
    userw            : Normal
 =====================================================================

[Status of Cluster for server02]
A failover group is running on server02.

C:\Users\Administrator>clpstat
 ========================  CLUSTER STATUS  ===========================
  Cluster : cluster
  <server>
    server01 ........: Offline           <- server01 is stopped
      lankhb1        : Unknown           LAN Heartbeat
      httpnp1        : Unknown           http resolution
   *server02 ........: Online            <- server02 is running
      lankhb1        : Normal            LAN Heartbeat
      httpnp1        : Normal            http resolution
  <group>
    failover ........: Online
      current        : server02          <- Failover group starting on server02
      awsvip         : Online
      md             : Online
  <monitor>
    awsvipw1         : Normal
    mdw1             : Caution
    userw            : Normal
 =====================================================================

4.2 When Executing the Forced Stop Resource

When servers execute the forced stop resource, the standby server executes the forced stop resource to stop the active server before starting the failover group. This prevents both systems from being activated.

The alert log output after the standby server detects that the active server is down is as follows. (The received time of each alert log entry is omitted.) You can verify that the forced stop resource is executed before the failover group is started.

Info 2022/12/14 06:42:00.649 server02 nm 2 The server server01 has been stopped. ★ Detecting down of the active server
Info 2022/12/14 06:42:04.743 server02 forcestop 5201 Forced stop of server server01 has been requested.(aws, stop) ★ Start executing the forced stop resource
Warning 2022/12/14 06:42:19.368 server02 mdadmn 3880 The mirror disk connect of the mirror disk md has been disconnected.
Warning 2022/12/14 06:42:19.603 server02 rm 1504 Monitor mdw1 is in the warning status. (105 : Whether mirror disk md data is old/new is not determined.)
Info 2022/12/14 06:42:37.180 server02 forcestop 5202 Forced stop of server server01 has completed.(aws, stop) ★ Execution of the forced stop resource is complete
Info 2022/12/14 06:42:37.180 server02 rc 1060 Failing over the group failover. ★ Start the failover group activation
Info 2022/12/14 06:42:37.180 server02 rc 1010 The group failover is starting.
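The ordering shown in the log can also be checked programmatically. The following sketch parses a few of the alert log lines above and confirms that the forced stop completed no later than the start of the failover. It assumes the whitespace-separated column layout seen in the pasted log (level, date, time, server, module, event ID, message); this is not an official export format, and the annotations marked with ★ have been left out.

```python
from datetime import datetime

ALERT_LOG = """\
Info 2022/12/14 06:42:00.649 server02 nm 2 The server server01 has been stopped.
Info 2022/12/14 06:42:04.743 server02 forcestop 5201 Forced stop of server server01 has been requested.(aws, stop)
Info 2022/12/14 06:42:37.180 server02 forcestop 5202 Forced stop of server server01 has completed.(aws, stop)
Info 2022/12/14 06:42:37.180 server02 rc 1060 Failing over the group failover.
"""


def parse_line(line: str) -> dict:
    """Split one alert log line into its columns (assumed layout)."""
    level, date, time, server, module, event_id, message = line.split(None, 6)
    timestamp = datetime.strptime(f"{date} {time}", "%Y/%m/%d %H:%M:%S.%f")
    return {"time": timestamp, "server": server, "module": module,
            "event_id": int(event_id), "message": message}


events = [parse_line(line) for line in ALERT_LOG.splitlines() if line.strip()]


def first_time(module: str, event_id: int) -> datetime:
    """Return the timestamp of the first event with this module and ID."""
    return next(e["time"] for e in events
                if e["module"] == module and e["event_id"] == event_id)


requested = first_time("forcestop", 5201)   # forced stop requested
completed = first_time("forcestop", 5202)   # forced stop completed
failover = first_time("rc", 1060)           # failover of the group begins

stop_duration = (completed - requested).total_seconds()
stop_before_failover = completed <= failover

print(f"Forced stop took {stop_duration:.3f} seconds")
print(f"Forced stop completed before failover: {stop_before_failover}")
```
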

After failover is complete, check the status of the EC2 instance of the active server, and you will see that it is stopped, which means that the active server has been stopped.

C:\Users\Administrator>aws ec2 describe-instances --instance-ids i-11111111111111111 --query "Reservations[0].Instances[0].[InstanceId, State.Name]"
[
    [
        [
            "i-11111111111111111",
            "stopped"
        ]
    ]
]

Conclusion

In this article, we introduced the procedure for building an HA cluster that uses the forced stop resource.
Because the forced stop resource lets you prevent both-system activation more reliably when a network partition occurs on AWS, please consider using it.

If you are considering introducing the configuration described in this article, you can validate it with the trial module of EXPRESSCLUSTER. Please do not hesitate to contact us if you have any questions.