
[2021 Edition] How to Prevent Both-System Activation on AWS (Windows/Linux)

EXPRESSCLUSTER Official Blog

December 24th, 2021

Machine translation is used partially for this article. See the Japanese version for the original article.


* A revised version (2022 Edition) of this article has been published.
For details, refer to the 2022 Edition.

Several EXPRESSCLUSTER updates have been released since this article was published, and the 2022 Edition introduces the updated setting methods and items.

Introduction

In this article, we build an HA cluster on Amazon Web Services (hereinafter called "AWS") that uses a script for forced stop, one of the features of EXPRESSCLUSTER, to prevent both-system activation.

EXPRESSCLUSTER provides network partition resolution (hereinafter called "NP resolution") as a method of preventing both-system activation. The recommended configuration on AWS is HTTP NP resolution, which is available in EXPRESSCLUSTER X 4.1 or later.
For more information on NP resolution, refer to this page.

However, even with NP resolution configured, both systems may become active if a failure occurs such as an OS stall or a loss of communication between Availability Zones.
In such cases, a feature called the script for forced stop can be used to prevent both-system activation.
In this article, we configure the script for forced stop on an HA cluster on AWS.

* In a physical environment, servers can be stopped by the "forced stop function", which uses IPMI or VMware vCenter Server. The same function is not available on AWS, so the server is stopped by using the script for forced stop instead.

1. What is the Script for Forced Stop?

The script for forced stop is a feature that runs a user-created script on the remaining (normal) server when that server detects that another server is down. With this feature, the downed server can be forcibly stopped.

Stopping the server with the script for forced stop, in addition to the response of the NP resolution resource, makes the prevention of both-system activation more reliable even when a network partition (split brain) occurs.

The script for forced stop is executed when a server is detected as down by a heartbeat timeout and the failover group that was running on it starts on another server. It is not executed if you stop a server normally from Cluster WebUI or a similar tool, or if no failover group was running on the downed server and therefore no failover occurs. The server is thus not forcibly stopped at unnecessary times.

For more information on the script for forced stop, refer to the Reference Guide.

[Reference]
  • Manuals > EXPRESSCLUSTER X > EXPRESSCLUSTER X 4.3 for Windows > Reference Guide
    • 4. Information on other settings
    • → 7.2 Script for forced stop
  • Manuals > EXPRESSCLUSTER X > EXPRESSCLUSTER X 4.3 for Linux > Reference Guide
    • 4. Information on other settings
    • → 7.4 Script for forced stop

2. HA Cluster Configuration

We build an HA cluster based on VIP control, with a few changes from the configuration in the HA Cluster Configuration Guide for Amazon Web Services.

The Configuration Guide specifies a web server outside the VPC (on the Internet) as the target for HTTP NP resolution, but this cannot be used if security policies prohibit the servers of the HA cluster from accessing the Internet.
In this article, we combine an "HA cluster based on VIP control" with a "VPC endpoint" to build an HA cluster configuration that does not access the Internet.

The diagram of the HA cluster to be built is as follows:

HA Cluster Configuration

We register only an AWS virtual IP resource and a mirror disk resource in the failover group.
We also specify a website created on Amazon S3 as the target for HTTP NP resolution, and access to Amazon S3 goes through a gateway endpoint. For security reasons, we configure the website so that it can be reached only from the VPC where the HA cluster is built. In addition, we set up a script for forced stop that uses the AWS CLI to stop the downed server.

* Configure the NP resolution resources as appropriate for your HA cluster environment.
For example, if on-premises clients access the HA cluster on AWS through AWS Direct Connect, NP resolution can be achieved by specifying the on-premises gateway as the target of a Ping NP resolution resource.


Refer to the below for more information about gateway endpoints, how to create them, and how to host a website on Amazon S3.

3. HA Cluster Building Procedure

In this section, we introduce the procedure for building an HA cluster.
Please note that the procedures are different between Windows and Linux.

3.1 Preparation for Building an HA Cluster

For more information on preparing your AWS environment, refer to the HA Cluster Configuration Guide for Amazon Web Services.

In this configuration, the servers of the HA cluster reach the AWS service endpoints through VPC endpoints when executing AWS CLI commands, so creating NAT instances is not necessary.
For the procedure for configuring the VPC endpoints, please refer to our previous blog.

[Reference]
  • Windows > Cloud > Amazon Web Services > EXPRESSCLUSTER X 4.3 for Windows HA Cluster Configuration Guide for Amazon Web Services
    • → 5. Constructing an HA cluster based on VIP control
    • → 5.1 Configure the VPC Environment
    • → 5.2 Configuring the instance
  • Linux > Cloud > Amazon Web Services > EXPRESSCLUSTER X 4.3 for Linux HA Cluster Configuration Guide for Amazon Web Services
    • → 5. Constructing an HA cluster based on VIP control
    • → 5.1 Configure the VPC Environment
    • → 5.2 Configuring the instance

In addition, to stop the EC2 instance, the script for forced stop on the servers of the HA cluster executes the stop-instances command of the AWS CLI.
We therefore add the ec2:StopInstances action to the policy of the IAM role assigned to the servers, along with the actions required to control the AWS virtual IP resources. The policy of the IAM role allows the following actions:

"ec2:Describe*"
"ec2:StopInstances"
"ec2:ReplaceRoute"

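As a rough illustration, the three actions above could be written as an IAM policy document like the one printed by the sketch below. The function name print_forcestop_policy and the wildcard "Resource" are assumptions made for this example and are not taken from the Configuration Guide; narrow the resource scope to suit your own security policy.

```shell
#!/bin/sh
# Sketch: print an IAM policy document that allows the three actions
# listed above. "Resource": "*" is an assumption made for brevity.
print_forcestop_policy() {
    cat <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:Describe*",
        "ec2:StopInstances",
        "ec2:ReplaceRoute"
      ],
      "Resource": "*"
    }
  ]
}
EOF
}

# Usage (not executed here):
#   print_forcestop_policy > forcestop-policy.json
```

The resulting file can then be attached to the IAM role from the management console or with the AWS CLI.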
The configuration of the VPC is as below. Routes to Amazon S3 (VPC endpoints) are automatically added to the route tables that you selected when you created the gateway endpoint.

  • VPC (VPC ID: vpc-1234abcd)
    • CIDR: 10.0.0.0/16
    • Subnets
      • Subnet-A1 (Subnet ID: sub-1111aaaa): 10.0.10.0/24
      • Subnet-A2 (Subnet ID: sub-2222aaaa): 10.0.110.0/24
      • Subnet-B1 (Subnet ID: sub-1111bbbb): 10.0.20.0/24
      • Subnet-B2 (Subnet ID: sub-2222bbbb): 10.0.120.0/24
    • RouteTables
      • Main (Route table ID: rtb-00000001)
        • 10.0.0.0/16 → local
        • 0.0.0.0/0 → igw-1234abcd (Internet Gateway)
        • 20.0.0.200/32 → eni-1234abcd (ENI ID)
      • Route-A (Route table ID: rtb-0000000a)
        • 10.0.0.0/16 → local
        • 20.0.0.200/32 → eni-1234abcd (ENI ID)
        • pl-xxxxxxxx → vpce-5678cdef (Endpoint ID) * Route to Amazon S3 (VPC endpoint)
      • Route-B (Route table ID: rtb-0000000b)
        • 10.0.0.0/16 → local
        • 20.0.0.200/32 → eni-1234abcd (ENI ID)
        • pl-xxxxxxxx → vpce-5678cdef (Endpoint ID) * Route to Amazon S3 (VPC endpoint)

VPC
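To confirm that the route to Amazon S3 was actually added to a route table, something like the following sketch may help. It only checks whether the route table contains any route whose destination is a prefix list (pl-...), which is how the gateway endpoint route appears in the list above. The function name has_prefix_list_route and the overridable AWS_CLI variable are assumptions for this example.

```shell
#!/bin/sh
# Sketch: report whether a route table contains a prefix-list route
# (the kind of route a gateway endpoint adds for Amazon S3).
# AWS_CLI can be overridden; it defaults to "aws" found on PATH.
AWS_CLI="${AWS_CLI:-aws}"

has_prefix_list_route() {
    # $1: route table ID, for example rtb-0000000a
    # Returns 0 if a prefix-list route exists, 1 if not, 2 if the CLI call fails.
    routes="$(${AWS_CLI} ec2 describe-route-tables \
        --route-table-ids "$1" \
        --query "RouteTables[].Routes[].DestinationPrefixListId" \
        --output text)" || return 2
    case "${routes}" in
        *pl-*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Usage (not executed here):
#   has_prefix_list_route rtb-0000000a && echo "S3 route found"
```

With the configuration above, the check would be expected to succeed for Route-A and Route-B and to fail for the Main route table.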

3.2 Procedure for Building an HA Cluster

For more information on the procedure for building an "HA cluster based on VIP control", refer to the HA Cluster Configuration Guide for Amazon Web Services.

[Reference]
  • Windows > Cloud > Amazon Web Services > EXPRESSCLUSTER X 4.3 for Windows HA Cluster Configuration Guide for Amazon Web Services
    • → 5. Constructing an HA cluster based on VIP control
    • → 5.3 Setting up EXPRESSCLUSTER
  • Linux > Cloud > Amazon Web Services > EXPRESSCLUSTER X 4.3 for Linux HA Cluster Configuration Guide for Amazon Web Services
    • → 5. Constructing an HA cluster based on VIP control
    • → 5.3 Setting up EXPRESSCLUSTER

In this configuration, we change the target host of the HTTP NP resolution resource from the one used in the Configuration Guide, and additionally set a script for forced stop.
Specify the website you created on Amazon S3 as the target host for HTTP NP resolution.

3.2.1 Setting HTTP NP Resolution Resources

The HTTP NP resolution resource is set as below.
This time, we will set up HTTP NP resolution resources for a website created on Amazon S3.

The properties of the HTTP NP resolution resource are as follows:

Setting of HTTP NP Resolution 02
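Before relying on the HTTP NP resolution resource, it can be useful to check manually from each server that the target website answers through the gateway endpoint. The sketch below is such a manual reachability check with curl; it is not how EXPRESSCLUSTER itself probes the target. The function name http_target_ok, the bucket website URL in the usage comment, and the choice of status 200 as the success condition are all assumptions for this example.

```shell
#!/bin/sh
# Sketch: check that an HTTP target answers with status 200.
# CURL can be overridden; it defaults to "curl" found on PATH.
CURL="${CURL:-curl}"

http_target_ok() {
    # $1: URL of the HTTP NP resolution target
    # -s: silent, -o /dev/null: discard the body, -m 5: give up after 5 seconds,
    # -w '%{http_code}': print only the HTTP status code
    code="$(${CURL} -s -o /dev/null -m 5 -w '%{http_code}' "$1")" || return 1
    [ "${code}" = "200" ]
}

# Usage (hypothetical bucket website endpoint, not executed here):
#   http_target_ok "http://example-bucket.s3-website-ap-northeast-1.amazonaws.com/" \
#       && echo "reachable"
```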

3.2.2 Setting a Script for Forced Stop

The script for forced stop should be configured as follows:

1. On the "Extension" tab of "Cluster Properties", check "Execute Script for Forced Stop" and click "Script Settings".

2. In "Edit Script", click Edit.

3. In the text editor that opens, enter the following script.

  • The following scripts are samples, and their operation is not guaranteed.
  • Set the values of the variables marked with ★ according to your environment.
  • Write the script so that it returns 0 when it exits normally.

Script for Forced Stop (Windows)

@echo off
rem ***************************************
rem *             forcestop.bat            *
rem ***************************************
rem Configuration information for AWS CLI
set AWS_CONFIG_FILE=C:\Users\Administrator\.aws\config
rem Absolute path for AWS CLI
set AWS_PATH="C:\Program Files\Amazon\AWSCLI\bin\aws.exe"

rem Configuration information for the first node
rem ★ Server1 hostname
set SERVER1_NAME=server01
rem ★ Instance ID for server1
set SERVER1_INSTANCE=i-11111111111111111

rem Configuration information for the second node
rem ★ Server2 hostname
set SERVER2_NAME=server02
rem ★ Instance ID for server2
set SERVER2_INSTANCE=i-22222222222222222

rem ★ Maximum number of attempts to stop an instance
set STOP_LOOP_MAX=2

rem ★ Maximum number of attempts to check instance status
set CHECK_LOOP_MAX=240

echo "START"
echo "DOWN SERVER NAME : %CLP_SERVER_DOWN%"
echo "LOCAL SERVER NAME: %CLP_SERVER_LOCAL%"

rem Quote both sides so that an empty CLP_SERVER_DOWN does not cause a syntax error
if "%CLP_SERVER_DOWN%"=="%SERVER1_NAME%" (
    set INSTANCE_ID=%SERVER1_INSTANCE%
) else if "%CLP_SERVER_DOWN%"=="%SERVER2_NAME%" (
    set INSTANCE_ID=%SERVER2_INSTANCE%
) else (
    echo "SERVER is not found."
    exit 1
)

rem Shutdown request
set STOP_LOOP_COUNT=0

:STOP_LOOP

%AWS_PATH% ec2 stop-instances --instance-ids %INSTANCE_ID% --force
set ret=%ERRORLEVEL%

if %ERRORLEVEL%==0 (
    echo "succeeded to stop instance. (%INSTANCE_ID%)"
    GOTO STOP_CHECK
)

echo "failed to stop instance. (%INSTANCE_ID%, ret=%ret%)"

timeout /t 1
set /a STOP_LOOP_COUNT=STOP_LOOP_COUNT+1

if %STOP_LOOP_COUNT%==%STOP_LOOP_MAX% (
    echo "EXIT %ret%"
    exit %ret%
)

GOTO STOP_LOOP


:STOP_CHECK
rem Server down confirmed
rem Even if we can't confirm that the server is down,
rem we will stop the process that has failed
rem when CHECK_LOOP_COUNT exceeds CHECK_LOOP_MAX.
set CHECK_LOOP_COUNT=0

:CHECK_LOOP

%AWS_PATH% ec2 describe-instances --instance-ids %INSTANCE_ID% --filters "Name=instance-state-name,Values=stopped" | findstr %INSTANCE_ID%
set ret=%ERRORLEVEL%

if %ERRORLEVEL%==0 (
    echo "%INSTANCE_ID% has been stopped."
    GOTO EXIT
)

timeout /t 1
set /a CHECK_LOOP_COUNT=CHECK_LOOP_COUNT+1

if %CHECK_LOOP_COUNT%==%CHECK_LOOP_MAX% (
    echo "EXIT %ret%"
    exit %ret%
)

GOTO CHECK_LOOP


:EXIT
echo "EXIT 0"
exit 0

Script for Forced Stop (Linux)

#! /bin/sh
# ***************************************
# *             forcestop.sh             *
# ***************************************

# Absolute path of AWS CLI
# ★ AWS CLI path
AWS_CLI="/usr/local/bin/aws"

# First node configuration information
# ★ Server1 hostname
SERVER1_NAME="server01"
# ★ Instance ID for server1
SERVER1_INSTANCE="i-11111111111111111"

# Second node configuration information
# ★ Server2 hostname
SERVER2_NAME="server02"
# ★ Instance ID for server2
SERVER2_INSTANCE="i-22222222222222222"

# ★ Maximum number of attempts to stop instances
STOP_LOOP_MAX=2

# ★ Maximum number of attempts to check instance state
CHECK_LOOP_MAX=240

echo "START"
echo "DOWN SERVER NAME : ${CLP_SERVER_DOWN}"
echo "LOCAL SERVER NAME: ${CLP_SERVER_LOCAL}"
if [ "${CLP_SERVER_DOWN}" = "${SERVER1_NAME}" ]; then
    INSTANCE="${SERVER1_INSTANCE}"
elif [ "${CLP_SERVER_DOWN}" = "${SERVER2_NAME}" ]; then
    INSTANCE="${SERVER2_INSTANCE}"
else
    echo "DOWN SERVER is not found."
    echo "EXIT 1"
    exit 1
fi

# Shutdown request
STOP_LOOP_COUNT=0

while [ ${STOP_LOOP_COUNT} -lt ${STOP_LOOP_MAX} ]
do
    ${AWS_CLI} ec2 stop-instances --instance-ids ${INSTANCE} --force
    ret=$?
    if [ ${ret} -eq 0 ]; then
        echo "succeeded to stop instance. (${INSTANCE})"
        break
    fi
    echo "failed to stop instance. (${INSTANCE}, ret=${ret})"

    sleep 1
    # POSIX arithmetic expansion is used because "let" is not available in every /bin/sh
    STOP_LOOP_COUNT=$((STOP_LOOP_COUNT + 1))
done

if [ ${ret} -ne 0 ]; then
    echo "EXIT ${ret}"
    exit ${ret}
fi

# Server Down Check
# Even if we can't confirm that the server is down,
# we will stop the process that has failed when CHECK_LOOP_COUNT exceeds CHECK_LOOP_MAX.
CHECK_LOOP_COUNT=0
while [ ${CHECK_LOOP_COUNT} -lt ${CHECK_LOOP_MAX} ]
do
    ${AWS_CLI} ec2 describe-instances --instance-ids ${INSTANCE} --filters "Name=instance-state-name,Values=stopped" | grep ${INSTANCE}
    ret=$?
    if [ ${ret} -eq 0 ]; then
        echo "${INSTANCE} has been stopped."
        break
    fi

    sleep 1
    # POSIX arithmetic expansion is used because "let" is not available in every /bin/sh
    CHECK_LOOP_COUNT=$((CHECK_LOOP_COUNT + 1))
done

if [ ${ret} -ne 0 ]; then
    echo "EXIT ${ret}"
    exit ${ret}
fi

echo "EXIT 0"
exit 0

4. Go back to "Edit Script" and set "Disable Failover" and "Timeout".

Check "Disable Group Failover When Execution Fails" to enable the setting. (This setting is available in EXPRESSCLUSTER X 4.1 or later.)
When this setting is enabled, the return value of the script for forced stop is evaluated. If the script fails, failover is disabled, which prevents both-system activation more reliably.

In this article, we also set the timeout to 300 seconds so that the cluster waits for the script for forced stop to finish stopping the downed server.

After configuring the script for forced stop, adjust the OS startup time. This prevents both-system activation caused by the other server restarting its OS while the script for forced stop is running.
Set the OS startup time as follows:

OS startup time ≥ script for forced stop timeout + heartbeat timeout
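As a worked example of this inequality, the sketch below simply adds the two timeouts. The 300 seconds is the script timeout chosen in this article; the 30-second heartbeat timeout is only an assumed value for illustration, so substitute the value configured in your own cluster. The function name min_os_startup_time is likewise an assumption.

```shell
#!/bin/sh
# Sketch: lower bound for the OS startup time, in seconds.
min_os_startup_time() {
    # $1: timeout of the script for forced stop (seconds)
    # $2: heartbeat timeout (seconds)
    echo $(( $1 + $2 ))
}

# 300 s is the script timeout used in this article;
# 30 s is an assumed heartbeat timeout, used only for illustration.
min_os_startup_time 300 30   # prints 330
```

With these values, the OS startup time would need to be at least 330 seconds.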

For more information on how to adjust OS startup time, refer to the Installation and Configuration Guide.
[Reference]
  • Manuals > EXPRESSCLUSTER X > EXPRESSCLUSTER X 4.3 for Windows > Installation and Configuration Guide
    • 2. Determining a system configuration
    • 2.6 Setting after configuring hardware
    • 2.6.3 Adjustment of the operating system startup time (Required)
  • Manuals > EXPRESSCLUSTER X > EXPRESSCLUSTER X 4.3 for Linux > Installation and Configuration Guide
    • 2. Determining a system configuration
    • 2.8 Setting after configuring hardware
    • 2.8.5 Adjustment of the operating system startup time (Required)

On Windows, instead of adjusting the OS startup time, you can delay the start of the EXPRESSCLUSTER service. Set the startup delay of the EXPRESSCLUSTER service as follows:

Delay time at startup of EXPRESSCLUSTER service ≥ script for forced stop timeout + heartbeat timeout

For more information on setting the startup delay of the EXPRESSCLUSTER service, refer to the Legacy Feature Guide.
[Reference]
  • Common for Version 4.x > EXPRESSCLUSTER X for Windows Legacy Feature Guide
    • 4. Compatible command reference
    • 4.24 Setting or displaying the start delay time (armdelay command)

4. Checking the Operation at the Time of NP Resolution

We check the operation of the HA cluster by causing a network partition, both in a configuration where the script for forced stop is executed and in one where it is not. This article describes an example of checking the operation in a Windows environment.

To cause a network partition, we set up network ACLs to block all communication between the Availability Zones in which the servers of the HA cluster are located. Heartbeats between the servers then stop, but HTTP communication from each server to the website created on Amazon S3 remains possible, so each server determines that the other server has a problem and tries to start the failover group.
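One way to set up such a block from the command line is sketched below: it adds a deny entry in each direction to a network ACL for the CIDR of the peer server's subnet. This is an illustrative sketch, not the exact procedure used for this article. The function name block_peer_subnet, the ACL ID, the CIDR, and the rule number in the usage comment are all hypothetical, and the rule number is assumed to be lower than that of any existing allow rule so that the deny entry is evaluated first.

```shell
#!/bin/sh
# Sketch: add deny entries to a network ACL so that traffic to and from
# the peer subnet is blocked. All IDs and the rule number are hypothetical.
# AWS_CLI can be overridden; it defaults to "aws" found on PATH.
AWS_CLI="${AWS_CLI:-aws}"

block_peer_subnet() {
    # $1: network ACL ID
    # $2: CIDR of the peer server's subnet
    # $3: rule number (must be lower than the existing allow rules)
    for direction in --ingress --egress; do
        # --protocol=-1 means all protocols
        ${AWS_CLI} ec2 create-network-acl-entry \
            --network-acl-id "$1" \
            "${direction}" \
            --rule-number "$3" \
            --protocol=-1 \
            --cidr-block "$2" \
            --rule-action deny || return 1
    done
}

# Usage (not executed here):
#   block_peer_subnet acl-1234abcd 10.0.120.0/24 90
```

After the check, the added entries can be removed again with the delete-network-acl-entry command, using the same ACL ID and rule number for each direction.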

4.1 When Not Running the Script for Forced Stop

If the servers do not run the script for forced stop, each server starts the failover group, resulting in both-system activation. The status of the HA cluster on each server is shown below. You can see that the failover group is running on both servers.

[Status of Cluster for Server1]
A failover group is running on Server1.

C:\Users\Administrator>clpstat
 ========================  CLUSTER STATUS  ===========================
  Cluster : cluster
  
   *server01 ........: Online     ←Server1 is running
      lankhb1        : Normal           LAN Heartbeat
      httpnp1        : Normal           http resolution
    server02 ........: Offline    ←Server2 is stopped
      lankhb1        : Unknown          LAN Heartbeat
      httpnp1        : Unknown          http resolution
 
    failover ........: Online
      current        : server01   ←Failover group starting on Server1
      awsvip         : Online
      md             : Online
  
    awsvipw1         : Normal
    mdnw1            : Error
    mdw1             : Caution
    userw            : Normal
 =====================================================================

[Status of Cluster for Server2]
A failover group is running on Server2.

C:\Users\Administrator>clpstat
 ========================  CLUSTER STATUS  ===========================
  Cluster : cluster
  
    server01 ........: Offline    ←Server1 is stopped
      lankhb1        : Unknown          LAN Heartbeat
      httpnp1        : Unknown          http resolution
   *server02 ........: Online     ←Server2 is running
      lankhb1        : Normal           LAN Heartbeat
      httpnp1        : Normal           http resolution
 
    failover ........: Online
      current        : server02   ←Failover group starting on Server2
      awsvip         : Online
      md             : Online
  
    awsvipw1         : Normal
    mdnw1            : Caution
    mdw1             : Caution
    userw            : Normal
 =====================================================================

4.2 When Running the Script for Forced Stop

When the script for forced stop is used, the standby server runs the script to stop the active server before starting the failover group. This prevents both-system activation.
The event log output after the standby server detects that the active server is down is shown below. You can confirm that the script for forced stop is executed before the failover group starts.

Warning 12/10/2021 8:06:31 EXPRESSCLUSTER X 1504 None Monitor mdw1 is in the warning status. (102 : Mirror disk md is not being mirrored.)
Info 12/10/2021 8:07:01 EXPRESSCLUSTER X 1526 None Status of monitor mdw1 was returned to normal.
Info 12/10/2021 8:47:48 EXPRESSCLUSTER X 2 None The server server1 has been stopped.
★ The active server is detected as down
Info 12/10/2021 8:47:49 EXPRESSCLUSTER X 1405 None Script for forced stop has started.
★ Start executing script for forced stop
Warning 12/10/2021 8:48:01 EXPRESSCLUSTER X 1504 None Monitor mdw1 is in the warning status. (102 : Mirror disk md is not being mirrored.)
Warning 12/10/2021 8:48:01 EXPRESSCLUSTER X 1504 None Monitor mdnw1 is in the warning status. (100 : Network was interrupted.)
Info 12/10/2021 8:48:41 EXPRESSCLUSTER X 1406 None Script for forced stop has completed.
★ Execution of the script for forced stop is complete
Info 12/10/2021 8:48:41 EXPRESSCLUSTER X 1060 No Failing over the group failover.
★ Start the failover group activation
Info 12/10/2021 8:48:41 EXPRESSCLUSTER X 1010 No The group failover is starting.
Info 12/10/2021 8:48:41 EXPRESSCLUSTER X 1030 No The resource md is starting.

After failover is complete, checking the state of the active server's EC2 instance shows that it is "stopped", which confirms that the active server has been stopped.

C:\Users\Administrator>aws ec2 describe-instances --instance-ids i-11111111111111111 --query "Reservations[*].Instances[*].[InstanceId, State.Name]"
[
    [
        [
            "i-11111111111111111",
            "stopped"
        ]
    ]
]
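For a scripted version of this check, the state name alone can be extracted as plain text, as in the sketch below. The function names instance_state and is_stopped and the overridable AWS_CLI variable are assumptions for this example.

```shell
#!/bin/sh
# Sketch: print the state name of one EC2 instance (for example "stopped").
# AWS_CLI can be overridden; it defaults to "aws" found on PATH.
AWS_CLI="${AWS_CLI:-aws}"

instance_state() {
    # $1: instance ID
    ${AWS_CLI} ec2 describe-instances \
        --instance-ids "$1" \
        --query "Reservations[0].Instances[0].State.Name" \
        --output text
}

is_stopped() {
    # Succeeds only when the instance state is exactly "stopped".
    [ "$(instance_state "$1")" = "stopped" ]
}

# Usage (not executed here):
#   is_stopped i-11111111111111111 && echo "active server has been stopped"
```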

Conclusion

In this article, we introduced the procedure for building an HA cluster that uses a script for forced stop.

Because the script for forced stop lets you prevent both-system activation more reliably when a network partition occurs on AWS, please consider using it. If you are considering the configuration described in this article, you can validate it with the trial module of EXPRESSCLUSTER. Please do not hesitate to contact us if you have any questions.