Troubleshoot Microservice Networks

August 5, 2017

There are several advantages to using microservices. As one might expect, they also bring some additional challenges. For us, one of those challenges is increased network complexity. All of our microservices have to talk to each other, and they frequently do this via REST APIs. Sometimes, this communication doesn’t work as expected.

There are a few questions I ask myself when I suspect foul network play. Here they are with the tools I use to answer them. I have found these tools to be indispensable when troubleshooting our microservice deployment. It may be important to know that I am running these commands on CentOS 6.9.

Question 1

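Is the service actively listening on the port I expect? To answer this question, I use netstat with a few options: -a shows all sockets (including listening ones), -n keeps addresses and ports numeric, and -p shows the owning process.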

$ sudo netstat -anp | grep <port number>

tcp    0   0 0.0.0.0:3000     0.0.0.0:*    LISTEN     15241/puma 3.9.1

Note: if your user owns the process in question, there is no need for sudo.

If there is no output from the above command while the service is running, the service is not listening where I expect it to, and I’ve narrowed my search to the port configuration. The port is probably misconfigured somewhere. If I do get the output I expect, I like to stop my service and run the netstat command again to make sure the line goes away. This sanity test proves it is, in fact, my service that is listening on the given port.
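One catch with a plain grep is that it matches the port number anywhere on the line, so a search for 3000 also hits port 30001, a PID containing 3000, or an established connection. The following is a minimal sketch of a stricter filter, assuming awk is available; the function name listening_on_port is made up for this example and is not part of netstat.

```shell
# Hypothetical helper: read `netstat -anp`-style output on stdin and keep
# only TCP sockets in LISTEN state whose local port is exactly $1.
listening_on_port() {
  awk -v port="$1" '
    $1 ~ /^tcp/ && $6 == "LISTEN" {
      # $4 is the local address, e.g. 0.0.0.0:3000 or :::3000.
      # The port is whatever follows the last colon.
      n = split($4, parts, ":")
      if (parts[n] == port) print
    }'
}

# Usage (3000 is the port from the example output above):
#   sudo netstat -anp | listening_on_port 3000
```

No output from the filter means nothing is listening on exactly that port, which is the same conclusion as the empty grep but without the false positives.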

Question 2

Am I receiving any traffic on the port I expect? tcpdump is my tool of choice to answer this question.

$ sudo tcpdump -i any port <port number>

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
...

Leave that running while the requesting service makes its queries. If no additional output appears when a request should be coming in (it would look similar to the example output below), my search for the problem is narrowed to the service making the request on the other end or the network itself. While I enjoy blaming the network as much as the next guy, it’s usually a misconfigured target IP address, hostname, or port in the other service. Note that tcpdump prints well-known port names by default, so port 3000 appears below as hbci; adding -nn keeps addresses and ports numeric.

...
21:28:08.257231 IP 192.168.1.2.56284 > 192.168.1.15.hbci: Flags [S], seq 3502080001, win 65535, options [mss 1460], length 0
21:28:08.257279 IP 192.168.1.15.hbci > 192.168.1.2.56284: Flags [S.], seq 1941830594, ack 3502080002, win 14600, options [mss 1460], length 0
...
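When the suspect is a misconfigured requester, it helps to see at a glance which hosts are actually showing up in the capture. Here is a small sketch that pulls the unique source addresses out of tcpdump output. It assumes the line layout shown in the example above (timestamp, IP, source, >, destination); newer tcpdump versions may print extra fields with -i any, in which case the field numbers would need adjusting. The function name capture_sources is hypothetical.

```shell
# Hypothetical helper: read tcpdump text output on stdin and list the
# unique source hosts, with the trailing ".port" removed.
capture_sources() {
  awk '$2 == "IP" || $2 == "IP6" {
         src = $3
         sub(/\.[^.]*$/, "", src)   # 192.168.1.2.56284 -> 192.168.1.2
         print src
       }' | LC_ALL=C sort -u
}

# Usage (-l makes tcpdump line-buffered so the pipe sees lines promptly;
# -c 20 stops after 20 packets so sort can finish):
#   sudo tcpdump -l -nn -c 20 -i any port 3000 | capture_sources
```

If the address of the requesting service never appears in that list, the request is going somewhere else.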

Question 3

Is a firewall rule getting in the way? On our servers, the answer to this question comes from iptables.

$ sudo iptables -L

Chain INPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:XmlIpcRegSvc
ACCEPT     icmp --  anywhere             anywhere
ACCEPT     tcp  --  anywhere             anywhere            state NEW tcp dpt:ssh
ACCEPT     udp  --  anywhere             anywhere            multiport dports snmp,snmptrap
ACCEPT     tcp  --  anywhere             anywhere            state NEW tcp dpt:http
ACCEPT     tcp  --  anywhere             anywhere            state NEW tcp dpt:webcache
ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:1234
REJECT     all  --  anywhere             anywhere            reject-with icmp-host-prohibited

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination
REJECT     all  --  anywhere             anywhere            reject-with icmp-host-prohibited

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     udp  --  anywhere             anywhere            udp spt:10053
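Rules in a chain are evaluated from top to bottom and the first match wins, so in a listing like the one above, a port is blocked if no ACCEPT rule for it appears before the catch-all REJECT. The sketch below narrows a listing down to the rules that name a given destination port plus any REJECT or DROP rules. It assumes numeric output (the -n flag), and it does not understand multiport rules or port ranges. The function name rules_touching_port is hypothetical.

```shell
# Hypothetical helper: read `iptables -L <chain> -n --line-numbers` output
# on stdin and keep rules that either name destination port $1 or have a
# REJECT/DROP target. Order is preserved, because the first match wins.
rules_touching_port() {
  awk -v port="$1" '
    {
      hit = 0
      for (i = 1; i <= NF; i++) {
        if ($i == ("dpt:" port)) hit = 1              # rule names this port
        if ($i == "REJECT" || $i == "DROP") hit = 1   # rule blocks traffic
      }
      if (hit) print
    }'
}

# Usage (3000 is an example port):
#   sudo iptables -L INPUT -n --line-numbers | rules_touching_port 3000
```

If the only lines that come back are REJECT or DROP rules, there is no explicit rule letting that port through.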

Finding the rule that is getting in my way can be a bigger task than I want to tackle immediately. The quickest (and most foolproof) way to find out whether the firewall is causing me grief is to temporarily turn it off.

NOTICE: I only do this in non-public staging or dev environments. I would not turn off my firewall on production servers or any other publicly accessible server, even temporarily. Please take the time to understand the risk when playing with your firewall settings.

With the disclaimer out of the way, this is how I do it on our slightly older versions of Red Hat:

$ sudo service iptables stop

Then I confirm it is no longer in my way:

$ sudo iptables -L

Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
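It is easy to forget to turn the firewall back on after a test like this. The following is a minimal sketch of a wrapper that stops iptables, runs whatever command it is given, and then starts iptables again regardless of how that command exited. It assumes the CentOS 6 style init script (service iptables stop / start). The function name with_firewall_stopped and the SERVICE_CMD override are hypothetical; the override exists only so the function can be dry-run without root.

```shell
# Hypothetical helper: run a command while the firewall is stopped, then
# start the firewall again and return the command's exit status.
with_firewall_stopped() {
  # SERVICE_CMD is an assumption of this sketch, not a system variable.
  _svc="${SERVICE_CMD:-sudo service}"
  $_svc iptables stop || return 1   # give up if the firewall did not stop
  "$@"                              # the test to run while it is down
  _rc=$?
  $_svc iptables start              # always bring the firewall back
  return "$_rc"
}

# Usage: open a 60-second window to retry the request from the other service.
#   with_firewall_stopped sleep 60
```

The same warning applies here as above: this is only for non-public staging or dev machines.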

An Interesting Note

The tcpdump tool shows incoming traffic as it arrives, before it goes through the firewall. The upshot is that seeing traffic in tcpdump does not mean it is getting to your service. It could still be getting stopped by the firewall.
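One way to tell the difference from the capture itself is to look at the TCP flags. An incoming connection attempt is a SYN (Flags [S]); a service that actually received it answers with a SYN-ACK (Flags [S.]). SYNs with no SYN-ACK are consistent with a firewall in the way, while a reset (Flags [R] or [R.]) usually means the packet got through but nothing was listening. Here is a rough sketch that counts these in tcpdump output, assuming the flag format shown in the example output earlier. The function name count_handshake_flags is hypothetical.

```shell
# Hypothetical helper: read tcpdump text output on stdin and count
# connection attempts (SYN), accepted handshakes (SYN-ACK), and resets.
count_handshake_flags() {
  awk '
    /Flags \[S\],/   { syn++ }      # SYN only: a connection attempt
    /Flags \[S\.\],/ { synack++ }   # SYN-ACK: the listener answered
    /Flags \[R/      { rst++ }      # RST: refused or torn down
    END { printf "syn=%d synack=%d rst=%d\n", syn, synack, rst }'
}

# Usage (-c 50 stops after 50 packets so the summary gets printed):
#   sudo tcpdump -l -nn -c 50 -i any port 3000 | count_handshake_flags
```

Fed the two example lines from Question 2, this prints syn=1 synack=1 rst=0, which is what a healthy handshake looks like.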

Wrap Up

That’s it. A few simple commands to ensure the network side of your services is up and running correctly. It’s great when everything works. When Murphy comes to visit, however, it helps to have some good tools at your disposal. If you’ve got a network tool that regularly makes your life easier, please let me know with the contact form below.