DNS Time-Outs in Multi-Subnet SQL 2014 Availability Group Cluster using InfoBlox DNS

Hello all, this is my first blog post on WordPress and I thought I would start with a recent issue that we encountered at work.

The Setup:

My company uses InfoBlox for DNS, and I recently created a 4-node AG cluster spanning two subnets.  These are test servers, currently running on VMware 5.1 with Windows Server 2012 R2 and SQL 2014 SP1 (the second SP1 release, that is).  Our Windows team built the cluster, which went smoothly; once the environment was handed over, my team installed SQL on each node and then configured 4 Availability Groups within the cluster.

The cluster was modified via PowerShell to lower the TTL for each resource (AG) to 120 seconds (down from the default of 20 minutes), and the RegisterAllProvidersIP option was set to zero because we have several legacy apps that cannot make use of the “MultiSubnetFailover=True” option.

The setup of servers and groups can be summarized as follows:

       Subnet1            Subnet2
       Server1  Server2   Server3  Server4
AG1    Sync     Sync      -        Async
AG2    Sync     Sync      Async    Sync
AG3    -        Async     Sync     Sync
AG4    Async    Sync      Sync     Sync

AG1 and AG2 are synchronous within Subnet1, while AG1 is a DR group with an Async copy in Subnet2 and AG2 is an HA group with a Sync and Async copy in Subnet2.  AG3 and AG4 are the same, but based in Subnet2 with copies in Subnet1.

Each AG has a listener with two IP addresses, one per subnet, so that it can handle a failover from one subnet to the other.  The DNS entries for each listener name were pre-created in InfoBlox and both IPs were assigned to each listener’s A-record (more on this in the problem section).

Each listener name could be pinged and connected to via SSMS, the AGs could fail over between servers at will with no immediate issues, and users could access their databases via the listener names.

The Problem(s):

The first error was detected in the Cluster Events in the Failover Cluster Manager:

[Screenshot: ClusterError1]

A search for error 1196 and ‘DNS Bad Key’ came up with a lot of hits, but none that seemed to be specific to this situation.  Further investigation into the Cluster Diagnostic Log (Event Viewer–>Applications and Services Logs–>Microsoft–>Windows–>Failover Clustering–>Diagnostic) found repeated errors stating that the SQL listeners could not register in DNS:

[Screenshots: ClusterError2, ClusterError3, ClusterError4]

It became obvious that there was an issue writing updates to DNS from the cluster servers.

Most sites online mentioned granting the Cluster Name Object full control on the Listener Objects in DNS, and for normal Microsoft DNS that would probably work, but we run InfoBlox for DNS, so that fix does not apply.

Shortly after finding these errors in the logs, we started to receive reports of time-outs from the users.  Because this is a test environment, some users connect with SSMS; they reported that the time-outs appeared to be random and that they could get around them by flushing their local DNS cache.  While that can be fine for developers in test, it won’t go over well in production, since most users will be coming in from app servers or won’t have the knowledge or permissions to run ipconfig /flushdns.  It also isn’t a solution, just a workaround.
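One way to see what a client is actually being handed is to resolve the listener name and try a TCP connection to each address that comes back.  The following is a minimal Python sketch under stated assumptions: the listener name in the comment is hypothetical, and port 1433 is assumed to be the listener port (the setup above does not say which port was used).  It only tests whether a TCP connection opens; it does not log in to SQL Server.

```python
import socket


def resolve_ipv4(name):
    """Return the distinct IPv4 addresses the local resolver gives for name, in order."""
    infos = socket.getaddrinfo(name, None, family=socket.AF_INET, type=socket.SOCK_STREAM)
    addresses = []
    for info in infos:
        address = info[4][0]  # sockaddr is (host, port) for AF_INET
        if address not in addresses:
            addresses.append(address)
    return addresses


def can_connect(address, port, timeout=3.0):
    """Return True if a TCP connection to (address, port) opens within timeout seconds."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, unreachable hosts, and time-outs
        return False


def check_listener(name, port=1433, timeout=3.0):
    """Map each resolved address of the listener name to whether it accepts TCP connections."""
    return {address: can_connect(address, port, timeout) for address in resolve_ipv4(name)}


# Example call (hypothetical listener name, assumed port):
# print(check_listener("aglistener1.example.com", port=1433))
```

If the name resolves to two addresses and only one of them accepts connections, a client that happens to try the other one first would be a plausible source of the intermittent time-outs described above.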

I contacted one of our network administrators, who removed and re-added the DNS records in InfoBlox and added the IP addresses for the Cluster (2 of them again, one per subnet) to the Update ACL in the InfoBlox grid control.  This still did not work.

The Solution:

The first realization was that the 4 AG Listener DNS records did not need to be manually set up within InfoBlox, so we deleted them (after making sure there were no aliases attached to them).  Once they were removed, we tested a failover of one of the Availability Groups and used the InfoBlox live monitor to capture any errors.

We found an error record stating that the IP of the server we failed the AG to could not enter a record in DNS due to lack of permissions.  We then added the IP of each Host Server to the InfoBlox Update ACL and tested a failover again; it failed with the same permissions error.  We waited a few minutes (in case any TTL was involved) and had the same result.

For InfoBlox, at least, the IP of each Host Server has to be added to the Update ACL, not just the Cluster IPs, AND the InfoBlox Grid Control Service has to be restarted.
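The host-IP versus cluster-IP point can be sketched as a quick coverage check: the update that was rejected came from the IP of the server hosting the AG, so every node address needs a matching ACL entry.  The Python below is only an illustration.  All addresses are hypothetical, the ACL is modelled as a plain list of IPs or CIDR ranges, and nothing here talks to InfoBlox.

```python
import ipaddress


def missing_from_acl(acl_entries, source_ips):
    """Return the source IPs that are not covered by any ACL entry (single IP or CIDR range)."""
    networks = [ipaddress.ip_network(entry, strict=False) for entry in acl_entries]
    missing = []
    for ip in source_ips:
        address = ipaddress.ip_address(ip)
        if not any(address in network for network in networks):
            missing.append(ip)
    return missing


# Hypothetical addresses: one cluster IP per subnet, one host IP per node.
cluster_ips = ["10.1.0.40", "10.2.0.40"]
node_ips = ["10.1.0.11", "10.1.0.12", "10.2.0.13", "10.2.0.14"]

# ACL with only the cluster IPs: every node address is uncovered.
print(missing_from_acl(cluster_ips, node_ips))

# ACL with cluster IPs and node IPs: nothing is missing.
print(missing_from_acl(cluster_ips + node_ips, node_ips))
```

With only the two cluster IPs in the list, all four node addresses are reported as missing, which is the situation we were in before the host IPs were added.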

After we restarted the Grid Control Service, everything worked as expected.  On each failover test, the servers in the cluster were able to add records to InfoBlox, and the time-outs are no longer an issue.

InfoBlox does have a newer version that will test all IPs for a name and return the first to respond, but the version we have currently does not have that option and it checks IPs in a round robin fashion.
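To see why a record holding both listener IPs can produce time-outs that look random, here is a simplified Python model of round-robin answers.  It rests on assumptions that are not measurements from our environment: the two addresses are hypothetical, the DNS server is modelled as rotating the answer order by one position on every query, and the client is assumed to give up if the first address it receives does not answer.

```python
def round_robin_answers(addresses, queries):
    """Yield the answer order for each successive query, rotating by one position each time."""
    count = len(addresses)
    for query in range(queries):
        shift = query % count
        yield addresses[shift:] + addresses[:shift]


def stale_first_fraction(addresses, online, queries):
    """Return the fraction of queries whose first answer is not the online address."""
    first_answers = [answer[0] for answer in round_robin_answers(addresses, queries)]
    stale = sum(1 for address in first_answers if address != online)
    return stale / queries


# Hypothetical listener IPs, one per subnet; the AG is currently online in the first subnet.
both_ips = ["10.1.0.50", "10.2.0.50"]
online_ip = "10.1.0.50"

print(stale_first_fraction(both_ips, online_ip, 100))     # both IPs in the A-record
print(stale_first_fraction([online_ip], online_ip, 100))  # only the online IP registered
```

Under these assumptions, the record with both addresses hands out the offline address first on every other lookup (a fraction of 0.5), while a record holding only the online address never does (0.0).  That would be consistent with users getting past a time-out by flushing their DNS cache and looking the name up again.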