Solutions with Practical SQL DBA : Understanding the concept of SQL Server Failover Cluster

Wednesday, 14 March 2012

Understanding the concept of SQL Server Failover Cluster

Before getting hands on experience in SQL server on failover cluster , I used to read many article about failover clustering. Unfortunately I was not able to digest the concept till I wet my hands.I will keep this point in my mind while explaining the failover cluster in this post.

What is a windows Cluster ? A cluster is a group of independent computer systems, referred to as nodes, working together as a unified computing resource. When we say unified computing , it does not mean that a single application can access the resources (cpu/memory) of multiple computer at any point of time. A cluster provides a single name for clients to use it services.There are two type of cluster:

Network Load Balancing Cluster : A Network Load Balancing Cluster (NLB) distribute the load (request from the client) to the different nodes which are part of the cluster based on predefined rules.The client application has to communicate to a single cluster IP address (virtual server) and client does not know which node in the cluster exactly serving its request.Network Load Balancing cluster helps to enhance the the availability and scalability of application. We are not going to talk much about this on this post.

Fail over Cluster: Failover cluster is a collection of servers that by working together increase the availability of applications and services that run on the cluster. It can be described as a technology that automatically allows one physical server to take over the tasks and responsibilities of another physical server that has failed.A failover cluster provides high availability for cluster aware application like SQL server ,exchange server,etc. The major difference between NLB cluster and failover cluster is, failover cluster will not help to improve the scalability of the application. It can only support high availability.

There are many misunderstanding among people about the advantage of having SQL server on a failover cluster environment. Below table will give a clear picture about the capability of SQL server on cluster environment.

To understand the failover cluster it is important to familiar with the terminologies. Let us have look on the terminologies used in the windows clustering.

Server Node: Physical node with operating system that support windows clustering. Each server node should have minimum two network card for public and private network.Minimum one local hard disk is required for OS and other application binaries.

Private Storage: Local disks are referred as private storage. There will be minimum of one private disk for OS and SQL binaries.Server node can have a additional local disk for bigger page file.
Shared Disk Array: Each server needs to be attached to the shared external storage.In non-clustered SQL server instance, databases (system/user/resource) are store on locally attached disk storage but in clustered SQL server instances databases are store data on a shared disk array.That mean all the server nodes in the cluster setup are physically connected to the disk array.This shared storage configuration allow application to failover between server in the cluster.

Quorum Disk: Quorum is the cluster's configuration file.This file (quorum.log) resides in the the quorum disk (one disk from shared disk array).Quorum is the main interpreter between all nodes. It stores latest cluster configuration and resource data. This helps the other nodes to take ownership when one node goes down.
In real life, it is like emergency contact number and other medical information in your wallet which help others to take action in case of emergency.

Public Network and Private Network: Each server needs at least two network cards. Typically, one is the public network and the other is a private network between the nodes. The private network is used for internal communication of cluster.This is referred as heart beat. Public network is used to connect the external world or intranet.

Heart Beat: Heart beat is health check mechanism in cluster A single UDP packet sent between nodes in the cluster via the private network to confirm that nodes are still online. By default cluster service will wait five seconds(one heart beat sent in every second) before considering a cluster node to be unreachable.

Looks Alive check:Looks alive check is a basic resource health check to verify that the service(SQL service in our context) is running properly.To perform this , cluster service queries the windows service control manager to check the status of the service.By default looks alive check will happen in every five seconds.

Is Alive check: An exhaustive check to verify that a resource is running properly. If this check fails, the resource is moved offline and the failover process is triggered. During the Is alive check the cluster service connects to the SQL server instance and execute select @@SERVERNAME.It will check only the SQL server instance availability and does not check the availability of user databases.

Failover: Clusters service monitor the health of the nodes in the cluster setup and the resources in the cluster. In the event of a server/resource failure, the cluster service re-starts the failed server's workload on one of the remaining servers based on possible owners settings. The process of detecting failures and restarting the application on another server in the cluster is known as failover .

Virtual Server:Virtual server consist of a network name and IP address to which the client connect.This helps the client to connect the service which hosted in the cluster environment with out knowing the details of the server node on which the service is running.In simple word , virtual server helps the client application to connect the SQL server instance with out changing the connection string in case of failover of SQL instance.

Possible Owners :Possible owners are the server nodes on which cluster group (in our context SQL instance) can failover

Preferred Owner: Preferred owner is the server node which best suited to running an application or group.

Cluster Setup: Below schema diagram will give a clear picture about the windows cluster setup.

This picture depicts two node cluster setup.At this stage Node A and Node B will have only OS and windows cluster service.One of the disk from cluster disk will be designated as the quorum disk. This setup basically done by windows administrators and storage specialists.On top of this, DBA's install and configure the SQL instances. We are not going talk about the SQL server installation on cluster environment on this post.

To make it more clear, let us describe our cluster environment as given below.

Let us look into the pictorial representation current setup.

Fig 1

From the Fig-1, we can see that INST1 and INST2 are installed on both physical node , but at any point of time one instance will be online only in one node.At this stage INST1 is running on physical node Pnode-A and INST2 is running on node Pnode-B.The cluster disk D and E will be visible in the node Pnode-A and F and G will be visible in the node Pnode-B.The request from the App1 will be routed through the virtual server A and landed in Pnode-A where the INST1 is running.In the same way the request from the App2 will be routed through the virtual Server B and landed in Pnode-B where the INST2 is running.

Let us see what will happen if there is a hardware/network failure in Pnode-A. Fig 2 depicts the failover of scenario of INST1 to Pnode-B.

Fig 2

Now the INST1 went offline in Pnode-A and came online Pnode-B.The connection from the App1 will be routed through the same virtual Server A and landed in Pnode-B.All four cluster disk will be visible in Pnode-B and only C (Local disk) will be accessible from Pnode-A. Note that even after the failover of INST1,App1 will be connecting to the virtual server A.It helps to make the failover transparent to the application.

Thank you for reading this article. Please feel free to pass your comments to improve the quality of the post. To follow this blog visit my Facebook Page