DNS Time-Outs in Multi-Subnet SQL 2014 Availability Group Cluster using InfoBlox DNS

Hello all, this is my first blog post on WordPress and I thought I would start with a recent issue that we encountered at work.

The Setup:

My company uses InfoBlox for our DNS solution and I have recently created a 4 node AG cluster with two subnets.  These are test servers and are running on VMware 5.1 currently with Windows 2012R2 and SQL 2014 SP1 (the 2nd SP1 release that is).  The creation of the cluster was handled by our Windows team, but it went smoothly and once the environment was handed over, my team installed SQL on each node and then configured 4 Availability Groups within the cluster.

The cluster was modified via PowerShell to lower the TTL for each resource (AG) to 120 seconds, down from the default of 20 minutes. The RegisterAllProvidersIP option was set to zero because we have several legacy apps that cannot make use of the “MultiSubnetFailover=True” option.
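The trade-off behind RegisterAllProvidersIP can be illustrated with a small timing model. This is only a sketch in Python, not a measurement from any real environment. It assumes that a legacy client tries the addresses it gets from DNS one at a time, while a MultiSubnetFailover-aware client tries them all at once. The addresses, the 21 second per-attempt timeout, the 15 second login timeout, and the function names are all illustrative assumptions.

```python
def sequential_connect_seconds(addresses, online, attempt_timeout, latency):
    """Seconds until a client that tries addresses in order gets connected."""
    elapsed = 0.0
    for address in addresses:
        if address == online:
            return elapsed + latency
        elapsed += attempt_timeout  # wait out the offline address first
    raise ConnectionError("none of the addresses is online")


def parallel_connect_seconds(addresses, online, latency):
    """Seconds until a client that tries all addresses at once gets connected."""
    if online not in addresses:
        raise ConnectionError("none of the addresses is online")
    return latency


# Assumed values for illustration only.
listener_ips = ["10.2.0.50", "10.1.0.50"]  # both IPs registered, offline one first
online_ip = "10.1.0.50"
attempt_timeout = 21.0   # seconds spent on an address that never answers
login_timeout = 15.0     # seconds the application is willing to wait
latency = 0.05           # seconds to reach the online address

legacy = sequential_connect_seconds(listener_ips, online_ip, attempt_timeout, latency)
aware = parallel_connect_seconds(listener_ips, online_ip, latency)

print(f"legacy client: {legacy:.2f}s, times out: {legacy > login_timeout}")
print(f"MultiSubnetFailover client: {aware:.2f}s, times out: {aware > login_timeout}")
```

With RegisterAllProvidersIP set to zero only the online address is registered, so in this model the legacy client never has to wait out the offline one.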

The setup of servers and groups can be summarized as follows:

        Subnet1             Subnet2
        Server1   Server2   Server3   Server4
AG1     Sync      Sync      -         Async
AG2     Sync      Sync      Async     Sync
AG3     -         Async     Sync      Sync
AG4     Async     Sync      Sync      Sync

AG1 and AG2 are synchronous within Subnet1, while AG1 is a DR group with an Async copy in Subnet2 and AG2 is an HA group with a Sync and Async copy in Subnet2.  AG3 and AG4 are the same, but based in Subnet2 with copies in Subnet1.

Each AG has a listener with two IP addresses, one per subnet, so that it can handle a failover from one subnet to the other.  The DNS entries for each listener name were pre-created in InfoBlox and both IPs were assigned to each listener’s A-record (more on this in the problem section).

Each listener name could be pinged and connected to via SSMS, and the AGs could fail over between servers at will with no immediate issues.  Users could access their databases via the listener names.

The Problem(s):

The first error was detected in the Cluster Events in the Failover Cluster Manager:

[Screenshot: ClusterError1]

A search for error 1196 and ‘DNS Bad Key’ came up with a lot of hits, but none that seemed to be specific to this situation.  Further investigation into the Cluster Diagnostic Log (Event Viewer–>Applications and Services Logs–>Microsoft–>Windows–>Failover Clustering–>Diagnostic) found repeated errors stating that the SQL listeners could not register in DNS:

[Screenshots: ClusterError2, ClusterError3, ClusterError4]

It became obvious that there was an issue writing updates to DNS from the cluster servers.

Most sites online mentioned granting the Cluster Name Object full control on the Listener Objects in DNS, and for normal Microsoft DNS that would probably work, but we run InfoBlox for DNS so it doesn’t have the same solution.

Shortly after finding these errors in the logs, we started to receive reports of time-outs from the users.  Being a test environment, we have some users connecting with SSMS. They claimed that the time-outs appeared to be random, and that they could get around them by flushing their local DNS cache.  While that can be fine for developers in test, it won’t go over well in production, since most users will be coming in from app servers or won’t have the knowledge or permissions to run ipconfig /flushdns. It also isn’t a solution, just a workaround.
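One plausible reading of why flushing helped is that a client keeps using whatever answer it has cached until the TTL runs out. The toy cache below sketches that behavior in Python. The class, the listener name, and the address are hypothetical, and a real Windows DNS client cache is more involved than this.

```python
from dataclasses import dataclass, field


@dataclass
class DnsCache:
    """Toy client-side cache: name -> (address, expiry time in seconds)."""
    entries: dict = field(default_factory=dict)

    def put(self, name, address, ttl, now):
        self.entries[name] = (address, now + ttl)

    def get(self, name, now):
        """Return the cached address, or None if it is missing or expired."""
        cached = self.entries.get(name)
        if cached is None or now >= cached[1]:
            self.entries.pop(name, None)
            return None
        return cached[0]

    def flush(self):
        """Drop everything, similar in spirit to ipconfig /flushdns."""
        self.entries.clear()


cache = DnsCache()
# Hypothetical case: the client cached the address that is currently offline,
# with the 120 second TTL configured on the cluster.
cache.put("ag1-listener", "10.1.0.50", ttl=120, now=0)

print(cache.get("ag1-listener", now=60))   # still the cached address
cache.flush()
print(cache.get("ag1-listener", now=60))   # None, so the next lookup goes to DNS
```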

I contacted one of our network administrators, and he removed and re-added the DNS records in InfoBlox and he added the IP addresses for the Cluster (2 of them again, one per subnet) to the Update ACL in the InfoBlox grid control.  This still did not work.

The Solution:

The first realization was that the 4 AG Listener DNS records did not need to be manually set up within InfoBlox, so we deleted them (after making sure there were no aliases attached to them).  Once they were removed, we tested a failover of one of the Availability Groups and used the InfoBlox live monitor to capture any errors.

We found an error record stating that the IP of the server we failed the AG to could not enter a record in DNS due to lack of permissions.  We then added the IP of each Host Server to the InfoBlox Update ACL and tested a failover again; it failed with the same permissions error.  We waited a few minutes (in case any TTL was involved) and had the same result.

For InfoBlox, at least, the IP of each Host Server has to be added to the Update ACL, not just the Cluster IPs, AND the InfoBlox Grid Control Service has to be restarted.
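As a rough aid for this step, the Python sketch below compares the addresses that need update rights against the contents of an ACL and lists what is missing. All addresses are made up, the function name is hypothetical, and it assumes the ACL holds individual addresses rather than networks. It does not talk to InfoBlox at all.

```python
import ipaddress


def missing_from_acl(required, acl):
    """Return the required addresses that are absent from the ACL, sorted."""
    allowed = {ipaddress.ip_address(a) for a in acl}
    wanted = {ipaddress.ip_address(a) for a in required}
    return [str(a) for a in sorted(wanted - allowed)]


# Hypothetical addresses: one cluster IP per subnet plus the four node IPs.
cluster_ips = ["10.1.0.10", "10.2.0.10"]
host_ips = ["10.1.0.11", "10.1.0.12", "10.2.0.11", "10.2.0.12"]

# Mirrors the first attempt described above: only the cluster IPs are listed.
update_acl = ["10.1.0.10", "10.2.0.10"]

for address in missing_from_acl(cluster_ips + host_ips, update_acl):
    print(f"not in Update ACL: {address}")
```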

After we restarted the Grid Control Service, everything worked as expected.  On each failover test, the servers in the cluster could add records to InfoBlox and the time-outs are no longer an issue.
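A failover test like this can be double-checked from a client by looking at what the listener name resolves to afterwards. The Python sketch below is one way to do that, assuming RegisterAllProvidersIP is zero so that exactly one address is expected. The function names, listener name, and subnet are hypothetical, and the result reflects the local resolver, including anything it has cached.

```python
import ipaddress
import socket


def resolve_ipv4(name):
    """Return the sorted, unique IPv4 addresses the local resolver gives for name."""
    infos = socket.getaddrinfo(name, None, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def in_subnet(address, cidr):
    """True if address belongs to the network written in CIDR notation."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(cidr)


def listener_points_at(name, active_cidr):
    """True if name resolves to exactly one address inside active_cidr."""
    addresses = resolve_ipv4(name)
    return len(addresses) == 1 and in_subnet(addresses[0], active_cidr)


# Example call with a hypothetical listener name and subnet:
# listener_points_at("ag1-listener.example.com", "10.2.0.0/24")
```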

InfoBlox does have a newer version that will test all IPs for a name and return the first to respond, but the version we have currently does not have that option and it checks IPs in a round robin fashion.
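To illustrate why round robin answers can look like random time-outs when a name carries two addresses and only one of them is online, here is a small Python simulation. It assumes the server simply rotates the address order on every lookup and that the client only tries the first address. The addresses and function names are made up for the example.

```python
from collections import deque


def round_robin_answers(addresses, lookups):
    """Yield the address order returned for each successive lookup."""
    ring = deque(addresses)
    for _ in range(lookups):
        yield list(ring)
        ring.rotate(-1)  # the next lookup starts with the next address


def failed_first_attempts(addresses, online, lookups):
    """Count lookups whose first address is not the online one."""
    return sum(1 for answer in round_robin_answers(addresses, lookups)
               if answer[0] != online)


# Hypothetical listener with one address per subnet; only the first is online.
listener_ips = ["10.1.0.50", "10.2.0.50"]
failures = failed_first_attempts(listener_ips, online="10.1.0.50", lookups=10)
print(f"{failures} of 10 lookups put the offline address first")
```

In this model half of the lookups hand the client the offline address first, which is consistent with time-outs that come and go.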

12 thoughts on “DNS Time-Outs in Multi-Subnet SQL 2014 Availability Group Cluster using InfoBlox DNS”

  1. We have a similar situation, but in our environment all 4 nodes are in a single subnet and DNS is a combination of Infoblox and Active Directory. All of our AG records and hostname A records are created in AD DNS, and the servers also point to AD DNS. There are CNAME records for all AGs and nodes in Infoblox DNS.

    We verified the permissions on the AD DNS records, and the cluster record has full control permissions.


    1. Hi Dinesh,

      I have not tested that exact configuration, but if your Cluster IP and Server IPs have the correct ACL permissions in Infoblox, then I would think it should work from that point. I have a production rollout of a more detailed system in the next month and I plan to blog about that setup as well, but please post if you are seeing another issue, preferably with the version number of Infoblox that you are using.

      Thanks,
      Dave


    1. In InfoBlox you can remove IPs by going to the Data Management tab, then IPAM, then drilling into your subnet until you find the IP in question. At that point you can open the IP properties and delete it. The ACL list is also under Data Management, but under DNS instead of IPAM. From there you go to the Grid DNS Properties and click the Update link on the left. The list that appears is the list of IPs that have permission to update InfoBlox.


  2. I had the same issue. Is it possible to set the TTL for the cluster name IPs on Infoblox to 60 seconds and also change the cluster properties to HostRecordTTL=60 and RegisterAllProvidersIP=0? I’m asking because our administrators don’t allow dynamic updates to DNS, but they can change the individual TTL values.


    1. Hi Amit,

      I spoke with our Infoblox administrator and I don’t think this would work. In order for the servers to fail over to another subnet automatically, and for the applications to find those databases, DNS would need to be updated to the new IP value. If the DNS administrators do not want this IP update to happen automatically via the failover mechanisms, then they might need to be available 24/7 to manually type in the new IP whenever it needs to change, which most people would not enjoy doing.

      Setting RegisterAllProvidersIP=0 tells the cluster to register only the current IP with DNS, rather than all possible IPs (which is the default). If you allow it to register all of the IPs, then DNS won’t need to be updated on a failover, but it is very likely that your applications would time out before they connected to the database, as they would be forced to try each IP one at a time, and most applications have a shorter timeout than what would be needed. If your application is newer or custom developed for your company, then you could try leaving RegisterAllProvidersIP at the default of 1 and adding MultiSubnetFailover=True to your connection string. This makes the application try all IPs at once rather than one at a time, so the timeouts can be avoided.

      If you are using ODBC to connect, then you can try using the SQL Server Native Client rather than the SQL Server entry in ODBC. In the Native Client there is a checkbox labeled Multi-subnet Failover, this checkbox is the same as adding MultiSubnetFailover=True to a connection string. Most vendor applications do not support that option yet, so it may be of limited use.

      I hope this helps some,
      Dave

