
[2022 Edition] How to Prevent Both-System Activation on AWS (Windows/Linux)

EXPRESSCLUSTER Official Blog

January 10th, 2023

Machine translation is used partially for this article. See the Japanese version for the original article.

Introduction

We tried building an HA cluster using the forced stop resource, one of the features of EXPRESSCLUSTER, to prevent both-system activation of an HA cluster on Amazon Web Services (hereinafter called "AWS").

This article is a revised version (2022 Edition) of the previous article. In EXPRESSCLUSTER X 5.0, the forced stop function and the forced stop script have been reworked into the forced stop resource, so the setup instructions have been revised accordingly.
If you are using EXPRESSCLUSTER X 4.3 or earlier, refer to the previous article.

EXPRESSCLUSTER provides network partition resolution (hereinafter called "NP resolution") as a method to prevent both-system activation. The recommended configuration on AWS is to set HTTP NP resolution. (For more information on NP resolution, refer to the previous article.)

However, even if NP resolution is set, both-system activation may still occur when a failure such as an OS stall or a communication disconnection between Availability Zones happens.
In such cases, a function called the forced stop resource can be used to prevent both-system activation.
This time, we will configure the forced stop resource for an HA cluster on AWS.


1. Both-system Activation and the Forced Stop Resource

Both-system activation
Both-system activation is an event in which a failover group starts on multiple servers in a cluster that is in a network partition (split-brain) state. When both-system activation occurs, the business application running on each server reads and writes business data independently, which can lead to critical problems such as data inconsistency between the servers. Therefore, it is important to take measures to prevent both-system activation.

Forced stop resource
The forced stop resource is a function that runs on the remaining (normal) server in the cluster when that server detects that the other server is down, and forcibly stops the down server from the outside. As a result, even if a network partition (split-brain) occurs, the other server can be stopped by the forced stop resource in addition to the handling by the NP resolution resource, which prevents both-system activation more reliably.

The forced stop resource is executed when a server down is detected by heartbeat timeout and the failover group that was running on the down server is to be started on the other server. It is not executed if you stop a server normally from Cluster WebUI or similar, or if no failover group was running on the down server and therefore no failover occurs. So a server is not forcibly stopped at unnecessary times.

There are several types of forced stop resource depending on the environment. For example, in a physical environment you can use a BMC forced stop resource, which uses the IPMI function to stop servers, and in a virtual environment you can use a vCenter forced stop resource, which uses the VMware vCenter Server function to stop virtual machines.

For more information on the forced stop resource, see the Reference Guide.

[Reference]
  • EXPRESSCLUSTER X 5.0 > EXPRESSCLUSTER X 5.0 for Windows > Reference Guide
  • -> 7. Forced stop resource details

  • EXPRESSCLUSTER X 5.0 > EXPRESSCLUSTER X 5.0 for Linux > Reference Guide
  • -> 7. Forced stop resource details

2. HA Cluster Configuration

We will build an HA cluster based on VIP control, with a few changes from the configuration described in the HA Cluster Configuration Guide for Amazon Web Services.

The Configuration Guide specifies a web server outside the VPC (on the Internet) as the target for HTTP NP resolution, but this cannot be used if your security policy does not allow the servers of the HA cluster to access the Internet.
In this article, we will combine an "HA cluster based on VIP control" with a "VPC endpoint" to build an HA cluster configuration that does not access the Internet.

The diagram of the HA cluster to be built is as follows:

HA Cluster Configuration

We will register only an AWS virtual IP resource and a mirror disk resource in the failover group.
We will also specify a website created on Amazon S3 as the target for HTTP NP resolution, and access to Amazon S3 will go through a gateway endpoint. For security reasons, we will configure the website so that it can be reached only from the VPC where the HA cluster is built. In addition, we will set up the forced stop resource to stop the down server.

* Configure the NP resolution resources as appropriate for your HA cluster environment.
For example, if on-premises clients access the HA cluster on AWS through AWS Direct Connect, NP resolution can be achieved by specifying the on-premises gateway as the target of a Ping NP resolution resource.


Refer to the AWS documentation for more information about gateway endpoints, how to create them, and how to host a website on Amazon S3.

3. HA Cluster Building Procedure

In this section, we introduce the procedure for building an HA cluster.
Please note that the procedures are different between Windows and Linux.

3.1 Preparation for Building an HA Cluster

For more information on preparing your AWS environment, refer to the HA Cluster Configuration Guide for Amazon Web Services.

In this configuration, VPC endpoints are used for communication from the servers of the HA cluster to the AWS service endpoints when executing AWS CLI commands, so NAT instances do not need to be created.
For the procedure for configuring the VPC endpoints, refer to our previous article.

[Reference]
  • Windows > Cloud > Amazon Web Services > EXPRESSCLUSTER X 5.0 for Windows HA Cluster Configuration Guide for Amazon Web Services
  • -> 5. Constructing an HA cluster based on VIP control
  • -> 5.1 Configure the VPC Environment
  • -> 5.2 Configuring the instance

  • Linux > Cloud > Amazon Web Services > EXPRESSCLUSTER X 5.0 for Linux HA Cluster Configuration Guide for Amazon Web Services
  • -> 5. Constructing an HA cluster based on VIP control
  • -> 5.1 Configure the VPC Environment
  • -> 5.2 Configuring the instance

In addition, to stop an EC2 instance with the forced stop resource, the servers of the HA cluster run the stop-instances command of the AWS CLI from the forced stop resource.
So, we will add the ec2:StopInstances action to the policy of the IAM role assigned to the servers along with the actions required to control the AWS virtual IP resources. The policy of the IAM role will allow the following actions:

"ec2:Describe*"
"ec2:StopInstances"
"ec2:ReplaceRoute"

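The three actions above could be expressed as an IAM policy document along the following lines. This is only a minimal sketch: the single-statement layout and the "Resource": "*" scope are assumptions made for illustration, not taken from the Configuration Guide, and you may want to narrow them for your environment.

```python
import json

# Actions listed in this article for the IAM role attached to the cluster servers.
ACTIONS = ["ec2:Describe*", "ec2:StopInstances", "ec2:ReplaceRoute"]


def build_policy(actions):
    """Return an IAM policy document (as a dict) that allows the given actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(actions),
                # Assumption: all resources. Restrict this where your policy requires it.
                "Resource": "*",
            }
        ],
    }


policy = build_policy(ACTIONS)
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the policy editor of the IAM role after you have reviewed the resource scope.
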
The configuration of the VPC is as below. Routes to Amazon S3 (VPC endpoints) are automatically added to the route tables that you selected when you created the gateway endpoint.

  • VPC (VPC ID : vpc-1234abcd)
    - CIDR : 10.0.0.0/16
    - Subnets
      Subnet-A1 (Subnet ID : sub-1111aaaa) : 10.0.10.0/24
      Subnet-A2 (Subnet ID : sub-2222aaaa) : 10.0.110.0/24
      Subnet-B1 (Subnet ID : sub-1111bbbb) : 10.0.20.0/24
      Subnet-B2 (Subnet ID : sub-2222bbbb) : 10.0.120.0/24
    - RouteTables
      Main (Route table ID : rtb-00000001)
        > 10.0.0.0/16 -> local
        > 0.0.0.0/0 -> igw-1234abcd (Internet Gateway)
        > 20.0.0.200/32 -> eni-1234abcd (ENI ID)
      Route-A (Route table ID : rtb-0000000a)
        > 10.0.0.0/16 -> local
        > 20.0.0.200/32 -> eni-1234abcd (ENI ID)
        > pl-xxxxxxxx -> vpce-5678cdef (Endpoint ID) * Route to Amazon S3 (VPC endpoint)
      Route-B (Route table ID : rtb-0000000b)
        > 10.0.0.0/16 -> local
        > 20.0.0.200/32 -> eni-1234abcd (ENI ID)
        > pl-xxxxxxxx -> vpce-5678cdef (Endpoint ID) * Route to Amazon S3 (VPC endpoint)

VPC
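
As a quick sanity check of the addressing above, the following sketch uses the Python standard library to confirm that every subnet lies inside the VPC CIDR and that the 20.0.0.200/32 route does not overlap it. Treating 20.0.0.200 as the address of the AWS virtual IP resource is our assumption here; the listing itself only shows it as a route to an ENI.

```python
import ipaddress

# Values copied from the VPC configuration listed above.
VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")
SUBNETS = {
    "Subnet-A1": "10.0.10.0/24",
    "Subnet-A2": "10.0.110.0/24",
    "Subnet-B1": "10.0.20.0/24",
    "Subnet-B2": "10.0.120.0/24",
}
# Assumption: this /32 route is the one used for the virtual IP.
VIP_ROUTE = ipaddress.ip_network("20.0.0.200/32")


def subnets_inside_vpc(vpc, subnets):
    """Map each subnet name to True if its CIDR is contained in the VPC CIDR."""
    return {
        name: ipaddress.ip_network(cidr).subnet_of(vpc)
        for name, cidr in subnets.items()
    }


def overlaps_vpc(vpc, route):
    """Return True if the route's destination overlaps the VPC CIDR."""
    return route.overlaps(vpc)


print(subnets_inside_vpc(VPC_CIDR, SUBNETS))
print("VIP route overlaps VPC CIDR:", overlaps_vpc(VPC_CIDR, VIP_ROUTE))
```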

3.2 Procedure for Building an HA Cluster

For more information on the procedure for building an "HA cluster based on VIP control", refer to the HA Cluster Configuration Guide for Amazon Web Services.

[Reference]
  • Windows > Cloud > Amazon Web Services > EXPRESSCLUSTER X 5.0 for Windows HA Cluster Configuration Guide for Amazon Web Services
  • -> 5. Constructing an HA cluster based on VIP control
  • -> 5.3 Setting up EXPRESSCLUSTER

  • Linux > Cloud > Amazon Web Services > EXPRESSCLUSTER X 5.0 for Linux HA Cluster Configuration Guide for Amazon Web Services
  • -> 5. Constructing an HA cluster based on VIP control
  • -> 5.3 Setting up EXPRESSCLUSTER

In this configuration, we change the target host of the HTTP NP resolution resource from the one used in the Configuration Guide, and additionally set the forced stop resource.
Specify the website you created on Amazon S3 as the target host for HTTP NP resolution.

3.2.1 Setting HTTP NP Resolution Resources

The HTTP NP resolution resource is configured as follows.
This time, we set it up to target the website created on Amazon S3.

The properties of the HTTP NP resolution resource are as follows:

Setting of HTTP NP Resolution 02

3.2.2 Setting the Forced Stop Resource

The forced stop resource is configured as follows:

1. On the "Fencing" screen where you configured the HTTP NP resolution resource, select "AWS" in [Type] of [Forced Stop], and then click [Properties].

2. Select “server01” from the [Available Servers], and then click [Add].

3. Enter the EC2 instance ID for server01 in [Instance ID], and then click [OK].

Setting for Forced Stop resource 03

4. Repeat steps 2 and 3 to add the EC2 instance ID for server02.

5. Click the "Forced Stop" tab and check [Disable Group Failover When Execution Fails].
By enabling this setting, failover is suppressed if the forced stop fails, so it is possible to prevent both-system activation more reliably.
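
The effect of this option can be pictured with a small model. This is a simplified sketch of the ordering described in this article, not EXPRESSCLUSTER code: the function name and the action strings are hypothetical, and the behavior with the option turned off (failover proceeds even though the forced stop failed) is our assumption.

```python
from typing import Callable, List


def handle_server_down(
    force_stop: Callable[[], bool],
    disable_failover_on_failure: bool,
) -> List[str]:
    """Return the ordered actions the surviving server takes in this simplified model.

    force_stop is a callable that returns True when the other server was stopped.
    """
    actions = ["forced stop requested"]
    if force_stop():
        # The other server is confirmed stopped, so failover is safe.
        actions += ["forced stop completed", "failover started"]
    elif disable_failover_on_failure:
        # Option checked: keep the group stopped rather than risk both-system activation.
        actions += ["forced stop failed", "failover suppressed"]
    else:
        # Assumption: with the option unchecked, failover proceeds anyway.
        actions += ["forced stop failed", "failover started"]
    return actions


print(handle_server_down(lambda: True, disable_failover_on_failure=True))
print(handle_server_down(lambda: False, disable_failover_on_failure=True))
```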

3.2.3 Specifying the Service Startup Delay Time

Specify the service startup delay time of EXPRESSCLUSTER.
This prevents the both-system activation that could otherwise result if the OS of the other server is restarted while the forced stop resource is being executed.
Specify the service startup delay time as follows:

Service Startup Delay Time ≥ "Forced Stop Timeout" of the forced stop resource + "Time to Wait for Stop to Be Completed" of the forced stop resource + heartbeat timeout

The service startup delay time can be set in [Service Startup Delay Time] on the "Timeout" tab.
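
As a worked example of the inequality above, the sketch below adds up the three terms and checks a candidate delay against the result. The numeric values are hypothetical placeholders in seconds, not product defaults; substitute the values configured in your own cluster.

```python
def minimum_startup_delay(forced_stop_timeout, wait_for_stop_completion, heartbeat_timeout):
    """Sum of the three terms on the right-hand side of the inequality (seconds)."""
    return forced_stop_timeout + wait_for_stop_completion + heartbeat_timeout


def is_delay_sufficient(service_startup_delay, minimum):
    """True if the configured delay satisfies: delay >= minimum."""
    return service_startup_delay >= minimum


# Hypothetical values for illustration only (seconds).
FORCED_STOP_TIMEOUT = 10
WAIT_FOR_STOP_COMPLETION = 10
HEARTBEAT_TIMEOUT = 30

minimum = minimum_startup_delay(
    FORCED_STOP_TIMEOUT, WAIT_FOR_STOP_COMPLETION, HEARTBEAT_TIMEOUT
)
print("Minimum service startup delay:", minimum, "seconds")
print("Is a 60-second delay sufficient?", is_delay_sufficient(60, minimum))
```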

Instead of specifying the service startup delay time, you can adjust the OS startup time.

OS startup time ≥ "Forced Stop Timeout" of the forced stop resource + "Time to Wait for Stop to Be Completed" of the forced stop resource + heartbeat timeout

For more information on adjusting the OS startup time, refer to the following:
[Reference]
  • EXPRESSCLUSTER X 5.0 > EXPRESSCLUSTER X 5.0 for Windows > Installation and Configuration Guide
  • -> 2. Determining a system configuration
  • -> 2.6 Setting after configuring hardware
  • -> 2.6.3 Adjustment of the operating system startup time (Required)

  • EXPRESSCLUSTER X 5.0 > EXPRESSCLUSTER X 5.0 for Linux > Installation and Configuration Guide
  • -> 2. Determining a system configuration
  • -> 2.8 Setting after configuring hardware
  • -> 2.8.5 Adjustment of the operating system startup time (Required)

4. Checking the Operation at the Time of NP Resolution

We will check the operation of the HA cluster by causing a network partition in a configuration where the forced stop resource is executed and in a configuration where it is not executed. In this article, we will describe an example of checking the operation in a Windows environment.

To cause a network partition, we set up network ACLs that block all communication between the Availability Zones of the servers of the HA cluster. This stops the heartbeats between the servers, but HTTP communication from the servers to the website created on Amazon S3 is still possible, so each server concludes that the other server has a problem and tries to start the failover group.
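
One way to set up such a block is to add deny entries to the network ACL of one server's subnet with the AWS CLI command aws ec2 create-network-acl-entry. The sketch below only builds and prints the command lines and does not run them. The ACL ID, the rule number, and the choice of 10.0.120.0/24 as the other server's subnet are hypothetical; this is not necessarily how the blocking was done for this article. Network ACL rules are evaluated starting from the lowest rule number, so the deny entries need a lower number than any existing allow entries.

```python
def deny_entry_command(acl_id, rule_number, peer_cidr, egress):
    """Build an AWS CLI command that denies all protocols to or from peer_cidr."""
    direction = "--egress" if egress else "--ingress"
    return [
        "aws", "ec2", "create-network-acl-entry",
        "--network-acl-id", acl_id,
        "--rule-number", str(rule_number),
        "--protocol", "-1",          # -1 means all protocols
        "--rule-action", "deny",
        "--cidr-block", peer_cidr,
        direction,
    ]


def partition_commands(acl_id, peer_cidr, rule_number=10):
    """Return the inbound and outbound deny commands for one subnet's network ACL."""
    return [
        deny_entry_command(acl_id, rule_number, peer_cidr, egress=False),
        deny_entry_command(acl_id, rule_number, peer_cidr, egress=True),
    ]


# Hypothetical ACL ID and peer subnet, for illustration only.
for command in partition_commands("acl-xxxxxxxx", "10.0.120.0/24"):
    print(" ".join(command))
```

Removing the two entries afterwards (aws ec2 delete-network-acl-entry) restores communication between the subnets.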

4.1 When the Forced Stop Resource Is Not Executed

If the servers do not execute the forced stop resource, each server starts the failover group, resulting in both-system activation. The status of the HA cluster as seen from each server is shown below. You can see that the failover group is running on each server.

[Status of Cluster for server01]
A failover group is running on server01.

C:\Users\Administrator>clpstat
 ========================  CLUSTER STATUS  ===========================
  Cluster : cluster
  <server>
   *server01 ........: Online     <- server01 is running
      lankhb1        : Normal           LAN Heartbeat
      httpnp1        : Normal           http resolution
    server02 ........: Offline    <- server02 is stopped
      lankhb1        : Unknown          LAN Heartbeat
      httpnp1        : Unknown          http resolution
  <group>
    failover ........: Online
      current        : server01   <- Failover group starting on server01
      awsvip         : Online
      md             : Online
  <monitor>
    awsvipw1         : Normal
    mdw1             : Caution
    userw            : Normal
 =====================================================================

[Status of Cluster for server02]
A failover group is running on server02.

C:\Users\Administrator>clpstat
 ========================  CLUSTER STATUS  ===========================
  Cluster : cluster
  <server>
    server01 ........: Offline    <- server01 is stopped
      lankhb1        : Unknown          LAN Heartbeat
      httpnp1        : Unknown          http resolution
   *server02 ........: Online     <- server02 is running
      lankhb1        : Normal           LAN Heartbeat
      httpnp1        : Normal           http resolution
  <group>
    failover ........: Online
      current        : server02   <- Failover group starting on server02
      awsvip         : Online
      md             : Online
  <monitor>
    awsvipw1         : Normal
    mdw1             : Caution
    userw            : Normal
 =====================================================================
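
The two views above can also be compared mechanically. The following sketch parses the group section of each server's clpstat output and reports any group that both views show as Online on different servers. It assumes the text layout shown above (a group line followed by a "current" line) and is an illustration only, not an EXPRESSCLUSTER tool; the two sample strings are excerpts of the outputs above.

```python
import re

# A group line such as "failover ........: Online" followed by its "current" line.
GROUP_PATTERN = re.compile(
    r"^\s*(\S+) \.+: (\S+)[^\n]*\n\s*current\s*: (\S+)", re.MULTILINE
)


def online_groups(clpstat_output):
    """Return {group name: server it runs on} for groups reported as Online."""
    return {
        group: server
        for group, status, server in GROUP_PATTERN.findall(clpstat_output)
        if status == "Online"
    }


def both_system_groups(view_a, view_b):
    """Return groups that the two views report as Online on different servers."""
    a, b = online_groups(view_a), online_groups(view_b)
    return sorted(g for g in a.keys() & b.keys() if a[g] != b[g])


SERVER01_VIEW = """\
  <group>
    failover ........: Online
      current        : server01
      awsvip         : Online
"""
SERVER02_VIEW = """\
  <group>
    failover ........: Online
      current        : server02
      awsvip         : Online
"""

print("Groups active on both systems:", both_system_groups(SERVER01_VIEW, SERVER02_VIEW))
```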

4.2 When the Forced Stop Resource Is Executed

When the forced stop resource is configured, the standby server executes it to stop the active server before starting the failover group. This prevents both-system activation.

The alert log output after the standby server detects that the active server is down is as follows. (The received time column of the alert log is omitted.) You can verify that the forced stop resource is executed before the failover group is started.

Info         2022/12/14 06:42:00.649   server02    nm                    2     The server server01 has been stopped.  ★ Detecting down of the active server
Info         2022/12/14 06:42:04.743   server02    forcestop    5201     Forced stop of server server01 has been requested.(aws, stop)  ★ Start executing the forced stop resource
Warning  2022/12/14 06:42:19.368   server02    mdadmn    3880     The mirror disk connect of the mirror disk md has been disconnected.
Warning  2022/12/14 06:42:19.603   server02    rm              1504     Monitor mdw1 is in the warning status. (105 : Whether mirror disk md data is old/new is not determined.)
Info         2022/12/14 06:42:37.180   server02    forcestop    5202     Forced stop of server server01 has completed.(aws, stop)  ★ Execution of the forced stop resource is complete
Info         2022/12/14 06:42:37.180   server02    rc                1060     Failing over the group failover.  ★ Start the failover group activation
Info         2022/12/14 06:42:37.180   server02    rc                1010     The group failover is starting.
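
The timestamps in this log show how long each stage took in our run. The sketch below copies three timestamps from the log above and computes the intervals; the dictionary keys are our own labels, not alert log fields.

```python
from datetime import datetime

FMT = "%Y/%m/%d %H:%M:%S.%f"

# Timestamps copied from the alert log above.
EVENTS = {
    "server down detected": "2022/12/14 06:42:00.649",
    "forced stop requested": "2022/12/14 06:42:04.743",
    "forced stop completed": "2022/12/14 06:42:37.180",
}


def seconds_between(start, end):
    """Elapsed seconds between two alert log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


detect_to_request = seconds_between(
    EVENTS["server down detected"], EVENTS["forced stop requested"]
)
stop_duration = seconds_between(
    EVENTS["forced stop requested"], EVENTS["forced stop completed"]
)
print(f"Detection to forced stop request: {detect_to_request:.3f} s")  # 4.094 s
print(f"Forced stop request to completion: {stop_duration:.3f} s")     # 32.437 s
```

The second interval is the kind of figure to keep in mind when choosing the forced stop timeout and the service startup delay time.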

After failover is complete, check the state of the EC2 instance of the formerly active server. It is reported as "stopped", which confirms that the forced stop resource stopped that server.

C:\Users\Administrator>aws ec2 describe-instances --instance-ids i-11111111111111111 --query "Reservations[*].Instances[*].[InstanceId, State.Name]"
[
    [
        [
            "i-11111111111111111",
            "stopped"
        ]
    ]
]

Conclusion

This time, we introduced the procedure for building an HA cluster using the forced stop resource.
Since you can more reliably prevent both-system activation when a network partition occurs on AWS, please consider using the forced stop resource.

If you are considering introducing the configuration described in this article, you can validate it with the trial module of EXPRESSCLUSTER. Please do not hesitate to contact us if you have any questions.