Troubleshooting #Azure ILB connection issues in a SQL Server AlwaysOn FCI Cluster

Many times in troubleshooting SQL Server connectivity issues, especially in Azure with all the ILB requirements I use the following tools to help troubleshoot. I’ll keep this article up to date as I find and use other tools. In addition to this article, I have found another resource which is pretty decent.

https://blogs.msdn.microsoft.com/sql_pfe_blog/2017/02/21/trouble-shooting-availability-group-listener-in-azure-sql-vm/

Netstat

The first tool is a simple test to verify whether the SQL Cluster IP is listening on the port I think it should be listening on. In this case, the SQL Cluster IP address is 10.0.0.201 and it is using the default instance which is port 1433.

Here is the command which will help you quickly identify whether the active node is listening on that port. In our case below everything looks normal.

C:\Users\dave.SIOS>netstat -na | find "1433"
TCP    10.0.0.4:49584         10.0.0.201:1433        ESTABLISHED
TCP    10.0.0.4:49592         10.0.0.201:1433        ESTABLISHED
TCP    10.0.0.4:49593         10.0.0.201:1433        ESTABLISHED
TCP    10.0.0.4:49595         10.0.0.201:1433        ESTABLISHED
TCP    10.0.0.201:1433        0.0.0.0:0              LISTENING
ESTABLISHED
TCP    10.0.0.201:1433        10.0.0.4:49592         ESTABLISHED
TCP    10.0.0.201:1433        10.0.0.4:49593         ESTABLISHED
TCP    10.0.0.201:1433        10.0.0.4:49595         ESTABLISHED

Once I can be sure SQL is listening to the proper port I use PSPING to try to connect to the port remotely.

PSPing

[EDIT 7/5/18]
I just discovered Test-NetConnection, this should eliminate the need to use PSPing. [/Edit]

PSPing is part of the PSTools package available from Microsoft. I usually download the tool and put PSPing directly in my System32 folder so I can use it whenever I want without having to change directories.

Assuming everything is configured properly from the ILB, Cluster and Firewall perspective you should be able to ping the SQL Cluster IP address and port 1433 from the passive server and get the results shown below…

C:\Users\dave.SIOS>psping 10.0.0.201:1433
PsPing v2.01 - PsPing - ping, latency, bandwidth measurement utility
Copyright (C) 2012-2014 Mark Russinovich
Sysinternals - www.sysinternals.com
TCP connect to 10.0.0.201:1433:
5 iterations (warmup 1) connecting test:
Connecting to 10.0.0.201:1433 (warmup): 6.99ms
Connecting to 10.0.0.201:1433: 0.78ms
Connecting to 10.0.0.201:1433: 0.96ms
Connecting to 10.0.0.201:1433: 0.68ms
Connecting to 10.0.0.201:1433: 0.89ms
If things are not configured properly you may see results similar to the following…
C:\Users\dave.SIOS>psping 10.0.0.201:1433
TCP connect to 10.0.0.102:1433:
5 iterations (warmup 1) connecting test:
Connecting to 10.0.0.102:1433 (warmup): This operation returned because the time out period expired.
Connecting to 10.0.0.102:1433 (warmup): This operation returned because the time out period expired.
Connecting to 10.0.0.102:1433 (warmup): This operation returned because the time out period expired.
Connecting to 10.0.0.102:1433 (warmup): This operation returned because the time out period expired.
Connecting to 10.0.0.102:1433 (warmup): This operation returned because the time out period expired.

If PSPing connects yet your application is having a problem connecting you may need to dig a bit deeper. I have seen some application like Great Plains also want to make a connection to port 445. If your application can’t connect but PSPing connects fine to 1433 then you may need to do a network trace and see what other ports your application is trying to connect to and then add load balancing rules for those ports as well.

Named Instances

If you are using a named instances, not only do you need to make sure you lock down your TCP service to use a static port, but you also need to make sure you add a rule to your load balancer to redirect UDP 1434 for the SQL Browser Service, otherwise you won’t be able to connect to your named instance.

Firewall

Opening up TCP ports 1433 and 59999 should cover all the manual steps required, but when troubleshooting connection issues I generally turn the Windows Firewall off to eliminate the firewall as a possible cause of the problem. Don’t forget, Azure also has a firewall called Network Security Groups. If anyone changed that from the default that could be blocking traffic as well.

Name Resolution

Try pinging the SQL cluster name. It should resolve to the SQL Server cluster iP address, but I have seen on more than a few occasions the DNS A-record associated with the SQL Cluster network name mysteriously disappear from DNS. If that is the case go ahead and read-ad the SQL Custer name and IP address as an A record in DNS.

SQL Configuration Manager

In SQL Configuration Manager you should see the SQL Cluster IP Address listed and port 1433. If by chance you installed a Named Instace you of course will need to go in here and lock the port to a specific port and make your load balancing rules reflect that port. Because of the Azure ILB limitation of only on ILB per AG, I really don’t see an valid reason to use a named instance. Make it easier on yourself and just use the default instance of SQL. (Update: as of Oct 2016 you CAN have multiple IP addresses per ILB, so you CAN have multiple instances of SQL installed in the cluster.

Conclusion

I’ll be updating this article as I continue to learn about new tips, tricks, tools and common problems I encounter in the field as I support customers trying to get started with clusters in Azure.

Troubleshooting #Azure ILB connection issues in a SQL Server AlwaysOn FCI Cluster

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s