Data Distribution Server (FTP)

tl;dr: only passive FTP connections work

GINA has developed the concept of a Data Distribution Server (DDS). The basic premise of a DDS is that public data can be made available for distribution via both the HTTP and FTP protocols in such a way that the hostname and path portions of the URLs are identical.

The original implementation was a CentOS 4.8 server running Apache2 and vsftpd, with storage attached from a Fibre Channel SAN. Over time, some NFS and GlusterFS mountpoints have been added to make more data available. The current server has grown organically and is proving difficult to replace without breaking functionality that our users depend on.

Recently, we noticed that, because of limitations of the older system, files larger than 2 GB cannot be accessed over HTTP. To address this problem, we decided to move the DDS behind our load-balanced HAProxy setup. This lets us direct HTTP traffic to a different server for data that we know contains files larger than 2 GB. We do a lot of HTTP proxying at GINA, and HAProxy handles it very well. However, the idea behind a DDS is that both services share the same hostname, which forces FTP to be proxied along with HTTP. FTP is notoriously tricky to firewall and proxy compared to other network services. We knew this going in and thought that we could overcome the challenges.

As it turns out, having FTP behind a proxy isn't that bad, and it allows us to keep data available while we work on replacing the CentOS 4.8 server with something newer. Unfortunately, it means that we can no longer support active FTP sessions; currently we can only support passive ones. Both active and passive FTP make an initial control connection to the server on TCP port 21; they differ in how they handle data transfer. In active FTP, the client tells the server which TCP port it is listening on, and the server connects back to that port to transfer the data. In passive FTP, the client makes a second TCP connection to the server on a high-numbered port for data transfers. Most people use passive FTP anyway because it is much easier to use from a network that sits behind a NAT/PAT device or a perimeter firewall.
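To make the difference concrete, here is a minimal Python sketch of how the data port is encoded on the control connection in each mode, following the standard FTP reply and command formats. The function names are hypothetical, and the addresses used in the usage note are documentation-range examples, not real GINA hosts.

```python
import re
from typing import Tuple

# Six comma-separated numbers: four address octets, then two port bytes.
_HOST_PORT_RE = re.compile(r"(\d+),(\d+),(\d+),(\d+),(\d+),(\d+)")


def parse_pasv_reply(reply: str) -> Tuple[str, int]:
    """Passive mode: extract the (host, port) the client must connect to."""
    if not reply.startswith("227"):
        raise ValueError(f"not a passive-mode reply: {reply!r}")
    match = _HOST_PORT_RE.search(reply)
    if match is None:
        raise ValueError(f"no host/port found in reply: {reply!r}")
    numbers = [int(n) for n in match.groups()]
    host = ".".join(str(n) for n in numbers[:4])
    # The port is sent as two bytes: high byte first, then low byte.
    port = numbers[4] * 256 + numbers[5]
    return host, port


def format_port_command(host: str, port: int) -> str:
    """Active mode: build the PORT command that tells the server where to connect."""
    octets = host.replace(".", ",")
    return f"PORT {octets},{port // 256},{port % 256}"
```

For example, `parse_pasv_reply("227 Entering Passive Mode (192,0,2,10,39,17)")` yields port 39 * 256 + 17 = 10001, a connection the client opens outbound. In active mode the client would instead send something like `format_port_command("192.0.2.20", 50000)` and wait for the server to connect in, which is the step that a proxy or NAT device in the path tends to break.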

We have had problems with some networks blocking outgoing connections to high-numbered TCP ports. If you are having trouble using FTP to reach the DDS, you should:
  • Verify that your FTP client is using passive mode
  • Verify that you can connect to the DDS on TCP ports in the range 10000-10250
  • You can run a command like this on a Linux box to test the connection: nc -z 10001 && echo "can connect" || echo "cannot connect"
  • Or you could use telnet or nmap to check the ports
  • Email if you continue to experience problems
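
If nc, telnet, and nmap are not available, the same check can be sketched in Python with only the standard library. This is a rough equivalent of the nc test above, not an official GINA tool; the function names are hypothetical, and the hostname is a parameter you must supply yourself.

```python
import socket
from typing import Dict, Iterable


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_ports(host: str, ports: Iterable[int], timeout: float = 3.0) -> Dict[int, bool]:
    """Map each port to whether a TCP connection to it could be opened."""
    return {port: can_connect(host, port, timeout) for port in ports}
```

Calling `check_ports(host, range(10000, 10251))` with the server's hostname covers the whole passive range; any port that maps to `False` is likely being blocked somewhere between you and the server. For the first item in the list, Python's own `ftplib.FTP` client uses passive mode by default, and `set_pasv(True)` makes that explicit.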