When performing a backup of a DTR environment using the command docker run --rm -i docker/dtr backup, the backup fails with the following message:
FATA Error waiting for dtr-phase2 to finish:
An error occurred trying to connect: Post https://docker-ucp1:443/v1.22/containers/<id>/wait: EOF, code: -1
It can also happen when you try to open a backup file:
tar -tf /tmp/backup-images.tar dtr-registry-<replica-id>
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
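A truncated backup can also be detected programmatically before it is needed for a restore. The following is a minimal sketch, assuming the backup is a plain tar file on local disk. The function name backup_archive_is_complete is hypothetical and not part of any Docker tooling. It reads every member of the archive in full and reports failure if the data ends early.

```python
import tarfile


def backup_archive_is_complete(path):
    """Return True if every member of the tar archive can be read in full.

    Hypothetical helper for illustration; not part of DTR or Docker.
    """
    try:
        with tarfile.open(path, mode="r") as archive:
            for member in archive:
                if not member.isfile():
                    continue  # directories and links carry no data to check
                stream = archive.extractfile(member)
                remaining = member.size
                while remaining > 0:
                    chunk = stream.read(min(remaining, 1024 * 1024))
                    if not chunk:
                        return False  # data ended before the declared size
                    remaining -= len(chunk)
    except (tarfile.TarError, EOFError, OSError):
        # tarfile raises ReadError ("unexpected end of data") on a cut-off file
        return False
    return True
```

For example, backup_archive_is_complete("/tmp/backup-images.tar") would return False for an archive that was cut off partway through a file. An archive truncated exactly at a member boundary may still pass this check, so treat a True result as necessary rather than sufficient.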
During a DTR backup job, the bootstrap script for the backup command spins up a dtr-phase2 container, where most of the backup work is performed. The bootstrapper then monitors the progress of dtr-phase2 through a single long-running call to the ContainerWait API endpoint, which blocks until the container returns an exit status.
While it is blocked, the ContainerWait call sends little or no traffic over the wire. This becomes a problem when the connection passes through a load balancer that does not keep idle connections open for long enough: the load balancer cuts the connection before dtr-phase2 finishes, and the bootstrapper reports the EOF error.
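To illustrate the general mechanism, the sketch below enables TCP keepalive on a client socket, so that an otherwise idle connection periodically sends probe packets. This is only an illustration of what keepalive means at the TCP level. It is not what the DTR bootstrapper does, the function name enable_keepalive and its default values are hypothetical, and whether such probes reset a given load balancer's idle timer depends on the load balancer.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=15, probes=4):
    """Turn on TCP keepalive probes for an idle connection.

    Hypothetical helper for illustration; the defaults are arbitrary.
    """
    # Portable switch: ask the kernel to send keepalive probes at all.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tuning knobs below are platform-specific (present on Linux),
    # so each one is applied only if this platform exposes it.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idleness before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between successive probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes allowed before the connection is dropped.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock
```

With the probe interval shorter than the load balancer's idle timeout, a connection that carries no application data still shows periodic activity at the TCP level.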
To fix this issue, increase the tcp_keepalive setting on the load balancer that balances traffic across the DTR replicas to 5 minutes, so that the idle ContainerWait connection is not dropped prematurely.