
ECS / Docker NFS Issue hunt down

2018-10-10

This post originally appeared on the Scout24 Engineering Blog.

Who doesn’t know this situation: sometimes something weird is going on in your setup, but everything is so vague that you don’t know where to start. This is the story of a bug we experienced in a production setup and how we tracked down its root cause.

The problem

Imagine an ECS cluster with multiple tasks, each running with an NFS-backed (EFS in this case) persistent root volume. Launching new tasks works just fine, but as soon as you do a cluster upgrade, entire hosts seem to be unable to start tasks, or start a few and then break, while other hosts work just fine. Scaling your cluster up or down suddenly becomes a really scary operation that can no longer be done without manual intervention. That’s an unacceptable situation, but where to start?

Want to know how we handle the NFS mount? My colleague Philipp has a blog post on that.
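To give a rough idea of what such a setup can look like: a Docker-managed local volume with NFS options pointing at EFS. A minimal sketch, assuming placeholder names (fs-12345678 and task-root-volume are made up, and the exact options may differ from our real setup):

# hypothetical example: a Docker local volume backed by EFS via NFS 4.1
docker volume create --driver local \
    --opt type=nfs4 \
    --opt o=addr=fs-12345678.efs.eu-west-1.amazonaws.com,nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport \
    --opt device=:/ \
    task-root-volume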

Symptoms

Only some hosts are affected

During the upgrades we noticed that some hosts didn’t have any issues with launching tasks. But what was the common pattern? Number of tasks, availability zone, something else? As many of the affected host instances were in eu-west-1c, the AZ theory was the first to verify. Things that could go wrong with EFS in an AZ (a few example checks are sketched after the list):

  • EFS mountpoint not there
  • wrong security group attached to the mountpoint
  • per-subnet routing in the VPC gone wrong
  • routing / subnet config on the instances
  • iptables on the instance misconfigured
  • Docker interfaces catching NFS traffic
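
Roughly the kind of commands I used to verify each point, with placeholder IDs (fs-12345678 and sg-12345678 are made up):

# is there a mount target in every AZ the cluster uses?
aws efs describe-mount-targets --file-system-id fs-12345678
# does the security group on the mount target allow 2049/tcp from the instances?
aws ec2 describe-security-groups --group-ids sg-12345678
# does the EFS DNS name resolve on the instance, and is the NFS port reachable?
dig +short fs-12345678.efs.eu-west-1.amazonaws.com
nc -zv fs-12345678.efs.eu-west-1.amazonaws.com 2049
# routing and iptables on the instance itself
ip route
iptables -L -n -v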

I checked all of them and everything was configured correctly, but still, when I logged on to a host and tried to mount one of the EFS volumes manually, it just hung forever.

Could it be something else network-related that I did not think of? UDP vs. TCP in NFS?

It’s tcpdump time!

tcpdump -v port not 443 -i any -s 65535 -w dump.pcap excludes the traffic from the SSM agents that are running on the ECS instance and should really catch all other traffic on all interfaces. Start tcpdump, run the mount command manually, stop tcpdump, put the pcap file on S3, fetch it and analyse it locally with Wireshark. Digging through the whole dump I was unable to find a single packet to the EFS mountpoint. Did I do something wrong? But also after the second try there was not a single packet related to NFS.
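For completeness, the capture workflow looked roughly like this (the S3 bucket name is a placeholder):

# capture everything except the SSM agent's HTTPS traffic, on all interfaces
tcpdump -v port not 443 -i any -s 65535 -w dump.pcap &
TCPDUMP_PID=$!
# trigger the hanging mount in the background (it may never return), then give it a minute
mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,tcp,intr $EFSID.efs.eu-west-1.amazonaws.com:/ /opt/debugmount/$EFSID &
sleep 60
# stop the capture and ship it off the instance for analysis in Wireshark
kill $TCPDUMP_PID
aws s3 cp dump.pcap s3://my-debug-bucket/dump.pcap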

The only possible conclusion here: the system does not even try to mount, but why?

Stracing things down

ps shows many other hanging NFS mounts triggered by ECS / Docker. But why are they hanging?

It’s strace time! Let’s see what is going on:

strace -f mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,tcp,intr $EFSID.efs.eu-west-1.amazonaws.com:/ /opt/debugmount/$EFSID &

We see that everything hangs on the mount syscall, and that for longer than 10 minutes, which should have hit every timeout by then. (timeo is given in tenths of a second, so in this case the last timeout should actually have triggered after two minutes: 600/10 * 2 = 120 seconds.)

As I noticed before, there were many other hanging NFS mounts, so let’s kill them off and try again. Now the mount runs perfectly fine! Are we hitting an AWS service limit here? Is there some race condition where the mounts block each other? According to the AWS docs there is no such service limit, so let’s stick with the deadlock theory and try to build a test case.
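Clearing out the hung mounts looked roughly like this (a sketch, not the exact commands):

# list the hanging NFS mount processes spawned by ECS / Docker ...
ps aux | grep '[m]ount.*nfs4'
# ... and kill them off before retrying the manual mount
pkill -9 -f 'mount.*nfs4'

With the hung mounts gone, the first attempt at a test case: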

function debugmount {
    mkdir -p "/opt/debugmount/$1"
    echo "start $1"
    mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,tcp,intr "$1.efs.eu-west-1.amazonaws.com:/" "/opt/debugmount/$1"
    echo "finish $1"
}
# this is a random list of about 15 EFS file system IDs
for fs in $MYFSLIST; do
    debugmount "$fs"
done

The test case above works perfectly fine. But maybe Docker mounts the volume in a different way, so the test case might not be accurate? Let’s have a look at what is happening within Docker: https://github.com/docker/docker-ce/blob/master/components/engine/volume/local/local_unix.go#L85 https://github.com/moby/moby/blob/master/pkg/mount/mount.go#L70

All of this looks quite sane. There is some fiddling with the addr option, but we also tested mounting by IP and had the same issue.

Stepping back to our initial problem: the issues only occur during a cluster update, where all tasks are re-scheduled to other instances. What if that happens not sequentially but in parallel? Our simple test just runs in a loop, so one mount command runs after another.

mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,tcp,intr $1.efs.eu-west-1.amazonaws.com:/ /opt/debugmount/$1 &

Now we are backgrounding the mount command, so we no longer wait for it to complete but immediately start the next one. Here we go: after a few iterations things seem to block and mounting stops working. We now have a test case that we can use to verify whether we fixed the issue.
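Putting the pieces together, the full reproduction looks roughly like this (debugmount_parallel is just my name for the modified function):

# same as debugmount above, but the mount itself is backgrounded, so the loop
# no longer waits for one mount to finish before starting the next one
function debugmount_parallel {
    mkdir -p "/opt/debugmount/$1"
    echo "start $1"
    mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,tcp,intr "$1.efs.eu-west-1.amazonaws.com:/" "/opt/debugmount/$1" &
    echo "finish $1"
}
for fs in $MYFSLIST; do
    debugmount_parallel "$fs"
done
# give the mounts a moment, then check which file systems actually got mounted
sleep 30
mount | grep nfs4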

And now

We now know that the issue has to be in the kernel, and we have a way to reproduce it, but I have no experience in tracing what is actually going on inside the kernel here. So Google has to help out, and a search for nfs mount race condition deadlock rhel turns up a bug report.

The issue described in the report looks very similar to what we experience, and one of the comments mentions that this behaviour was introduced with NFS 4.1.

So let’s change our snippet to use NFS 4.0 and see if everything now works as expected:

mount -t nfs4 -o nfsvers=4.0,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,tcp,intr $1.efs.eu-west-1.amazonaws.com:/ /opt/debugmount/$1 &

Awesome. Our issue, which was quite complex to debug, turned out to be easy to solve. We changed the mount options in our ECS cluster and everything is running smoothly now.

At the time of writing, the kernel of the current AWS ECS-optimized AMI does not seem to contain that bug fix yet.