Tips for using the sockets API

2020-08-10

My notes on common gotchas and usage tips for the POSIX sockets API (also known as Berkeley sockets or BSD sockets). Unless you have specific reasons to use the sockets API directly, I recommend a library such as libuv, libevent or ASIO, which helps you avoid these gotchas and supports multiple platforms.

If you need an introductory guide to the sockets API, I recommend Beej’s Guide to Network Programming.

Prevent SIGPIPE signals

When you write to a TCP socket that has been closed by the peer, the kernel generates a SIGPIPE signal, whose default action is to terminate the process. This is rarely useful and causes problems in multi-threaded applications, since signal handlers are process global.

On Linux and most BSDs (OpenBSD, FreeBSD and NetBSD) the SIGPIPE signal can be prevented by supplying the MSG_NOSIGNAL flag to the send system call¹:

ssize_t n = send(fd, msg, len, MSG_NOSIGNAL);

The MSG_NOSIGNAL flag is not supported on macOS. Instead, the SIGPIPE signal can be prevented on a per file descriptor basis using the SO_NOSIGPIPE socket option. This option is supported on macOS and most BSDs²:

int opt = 1;
if (setsockopt(fd, SOL_SOCKET, SO_NOSIGPIPE, &opt, sizeof(opt)) == -1) {
    perror("setsockopt");
    return -1;
}

Finally, if none of the above methods is available, the SIGPIPE signal can be ignored for the whole process. Writes to a closed connection then fail with the error EPIPE instead:

if (signal(SIGPIPE, SIG_IGN) == SIG_ERR) {
    perror("signal");
    return -1;
}
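
The three approaches can be combined behind compile-time checks. The following is a minimal sketch, not a definitive implementation: the helper names socket_disable_sigpipe and send_nosigpipe are hypothetical, and it assumes that a platform lacking both MSG_NOSIGNAL and SO_NOSIGPIPE has already ignored SIGPIPE for the whole process as shown above.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: call once after creating the socket.
 * Sets SO_NOSIGPIPE where it exists (macOS and most BSDs); otherwise
 * it does nothing and reports success. */
static int socket_disable_sigpipe(int fd)
{
#ifdef SO_NOSIGPIPE
    int opt = 1;
    return setsockopt(fd, SOL_SOCKET, SO_NOSIGPIPE, &opt, sizeof(opt));
#else
    (void)fd;
    return 0;
#endif
}

/* Hypothetical helper: send() that passes MSG_NOSIGNAL where it exists.
 * With either mechanism active, writing to a closed peer returns -1 with
 * errno set to EPIPE instead of raising SIGPIPE. */
static ssize_t send_nosigpipe(int fd, const void *buf, size_t len)
{
#ifdef MSG_NOSIGNAL
    return send(fd, buf, len, MSG_NOSIGNAL);
#else
    return send(fd, buf, len, 0);
#endif
}
```

Call socket_disable_sigpipe once per socket and use send_nosigpipe in place of send; the caller then only has to handle the EPIPE error.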

Dealing with signals

If a system call is interrupted by a signal it’s either transparently restarted or fails with the error EINTR³. The restart behavior is optionally enabled using the SA_RESTART flag⁴. If the restart behavior is not guaranteed to be enabled, it’s safest to always handle the EINTR error. Additionally on Linux the epoll_wait system call can always return EINTR.

Example code to handle EINTR:

ssize_t n;
restart:
n = recv(fd, buf, len, 0);
if (n == -1) {
    if (errno == EINTR) {
        goto restart;
    }
    perror("recv");
    return -1;
}

Resolving hostnames asynchronously

The POSIX standard only provides the blocking function getaddrinfo for resolving hostnames into IP addresses. That’s not usable when doing non-blocking / asynchronous socket IO. Linux’s glibc provides the non-standard getaddrinfo_a function for performing asynchronous DNS lookups, but it’s not easy to use with epoll.

If you only need to support numeric IP addresses, you can call getaddrinfo with the AI_NUMERICHOST flag, which skips name resolution so the call does not block.
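
A minimal sketch of such a call follows. The helper name resolve_numeric is hypothetical, and the additional AI_NUMERICSERV flag (which requires the port to be numeric as well) is my own assumption about what a caller would want.

```c
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: parse a numeric address and port without any DNS
 * lookup. Returns 0 on success or a getaddrinfo error code (printable with
 * gai_strerror). On success the caller releases *res with freeaddrinfo. */
static int resolve_numeric(const char *host, const char *port,
                           struct addrinfo **res)
{
    struct addrinfo hints;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;      /* accept both IPv4 and IPv6 */
    hints.ai_socktype = SOCK_STREAM;  /* TCP */
    hints.ai_flags = AI_NUMERICHOST | AI_NUMERICSERV;
    return getaddrinfo(host, port, &hints, res);
}
```

A hostname such as "localhost" is rejected with an error instead of triggering a lookup, so the call is safe to make from an event loop.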

If you need to do asynchronous DNS lookups there are external libraries that can do that, for example c-ares and getdns.

Performing an asynchronous connect

Create a non-blocking socket with information returned by getaddrinfo in res:

int fd = socket(res->ai_family, res->ai_socktype | SOCK_NONBLOCK | SOCK_CLOEXEC, res->ai_protocol);
if (fd == -1) {
    perror("socket");
    return -1;
}

Use the SOCK_NONBLOCK flag to create the socket in non-blocking mode and the SOCK_CLOEXEC flag so that the file descriptor is closed automatically on exec and does not leak into programs started by child processes.

Connect the socket to the address returned by getaddrinfo in res:

int rc;
restart:
rc = connect(fd, res->ai_addr, res->ai_addrlen);
if (rc == -1 && errno != EINPROGRESS) {
    if (errno == EINTR) {
        goto restart;
    }
    perror("connect");
    return -1;
}
if (rc == 0) {
    // Connection succeeded immediately
} else {
    // Connection attempt is in progress
}

When poll/select/epoll/kqueue indicates that the socket is writable (EPOLLOUT or equivalent) or reports an error (EPOLLERR, EPOLLHUP, EPOLLRDHUP or equivalents), check whether the connection succeeded:

int opt;
socklen_t optlen = sizeof(opt);
if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &opt, &optlen) == -1) {
    perror("getsockopt");
    return -1;
}

if (opt != 0) {
    // Connection failed
    errno = opt;
    perror("connect");
    return -1;
}
// Connection succeeded
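
Outside of an event loop, the readiness wait and the SO_ERROR check can be folded into one helper. This is a minimal sketch using poll; the name wait_for_connect and its timeout_ms parameter are hypothetical.

```c
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>

/* Hypothetical helper: wait up to timeout_ms milliseconds for a connect()
 * that returned EINPROGRESS to finish.
 * Returns 0 if the connection succeeded, -1 with errno set otherwise. */
static int wait_for_connect(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int rc;
    int err = 0;
    socklen_t len = sizeof(err);

    pfd.fd = fd;
    pfd.events = POLLOUT;
    pfd.revents = 0;

    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc == -1 && errno == EINTR);

    if (rc == -1) {
        return -1;
    }
    if (rc == 0) {
        errno = ETIMEDOUT; /* nothing reported before the timeout */
        return -1;
    }
    /* Writable or error: SO_ERROR holds the outcome of the connect. */
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) == -1) {
        return -1;
    }
    if (err != 0) {
        errno = err;
        return -1;
    }
    return 0;
}
```

Note that a retry after EINTR restarts the full timeout here; a stricter version would track the remaining time.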

Asynchronously accepting connections

Create a non-blocking socket as described above.

Bind to the address you want to listen on:

if (bind(fd, res->ai_addr, res->ai_addrlen) == -1) {
    perror("bind");
    close(fd);
    return -1;
}

Start listening for incoming connections:

if (listen(fd, 16) == -1) {
    perror("listen");
    close(fd);
    return -1;
}

When poll/select/epoll/kqueue reports any event on the listening socket (lfd below), try to accept all pending connections:

for (;;) {
    int fd = accept4(lfd, NULL, NULL, SOCK_NONBLOCK | SOCK_CLOEXEC);
    if (fd == -1) {
        if (errno == EAGAIN || errno == EWOULDBLOCK) {
            break;
        }
        if (errno == EINTR || errno == ECONNABORTED) {
            continue;
        }
        perror("accept4");
        return -1;
    }
    // Handle new connection
}

The accept4 system call is available on both Linux and BSD. It allows you to accept connections directly in non-blocking mode, avoiding an extra call to fcntl in order to make the socket non-blocking.
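
On a platform without accept4, the same result takes a plain accept followed by fcntl calls. The sketch below shows that fallback; the helper names set_nonblock_cloexec and accept_nonblock are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: set O_NONBLOCK and FD_CLOEXEC on an existing
 * descriptor. Returns 0 on success, -1 with errno set on failure. */
static int set_nonblock_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1) {
        return -1;
    }
    flags = fcntl(fd, F_GETFD, 0);
    if (flags == -1 || fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == -1) {
        return -1;
    }
    return 0;
}

/* Hypothetical fallback for accept4(lfd, NULL, NULL,
 * SOCK_NONBLOCK | SOCK_CLOEXEC). Returns the new descriptor or -1. */
static int accept_nonblock(int lfd)
{
    int fd = accept(lfd, NULL, NULL);
    if (fd == -1) {
        return -1;
    }
    if (set_nonblock_cloexec(fd) == -1) {
        int saved = errno; /* close() may overwrite errno */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}
```

Unlike accept4, this fallback is not atomic: another thread that forks and execs between accept and fcntl can still leak the descriptor into the new program.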

See my epollserver.c example code for full details.

Nagle’s algorithm and delayed acknowledgement

Nagle’s algorithm and TCP delayed acknowledgements interact poorly and can lead to data being delayed when sent over TCP sockets⁵. Depending on your application’s behavior you should consider disabling these features.

Nagle’s algorithm is used to prevent many small packets from being sent. Small packets are only sent if you call send with a small amount of data, which also incurs extra system call overhead. It’s better to call send with larger data buffers or to use scatter/gather IO with sendmsg. This prevents multiple small packets from being sent and reduces system call overhead. To disable Nagle’s algorithm use the TCP_NODELAY socket option:

int opt = 1;
if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &opt, sizeof(opt)) == -1) {
    perror("setsockopt");
    return -1;
}
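
The scatter/gather approach can be sketched with the POSIX sendmsg call and an iovec array, so that a header and a body kept in separate buffers go out in one system call. The helper name send_header_and_body is hypothetical.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical helper: hand a header and a body to the kernel with a single
 * sendmsg() call instead of two small send() calls.
 * Returns the number of bytes sent, or -1 with errno set. */
static ssize_t send_header_and_body(int fd, const void *hdr, size_t hdr_len,
                                    const void *body, size_t body_len)
{
    struct iovec iov[2];
    struct msghdr msg;

    iov[0].iov_base = (void *)hdr;   /* sendmsg does not modify the data */
    iov[0].iov_len = hdr_len;
    iov[1].iov_base = (void *)body;
    iov[1].iov_len = body_len;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = iov;
    msg.msg_iovlen = 2;

    /* Like send(), this may transmit fewer bytes than requested. */
    return sendmsg(fd, &msg, 0);
}
```

The flags argument is 0 here for brevity; the SIGPIPE considerations from earlier apply to sendmsg in the same way as to send.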

TCP delayed acknowledgements were designed to reduce the number of TCP acknowledgements being sent by delaying them and, hopefully, piggybacking them on outgoing data. For example, an RPC server sends a reply for each incoming request, and delayed acknowledgements allow the server to piggyback the acknowledgement on the reply instead of sending a separate TCP acknowledgement packet. The RPC client, however, doesn’t send any data in response to the server’s reply, so delayed acknowledgements are ineffective there. On Linux, use the TCP_QUICKACK socket option to disable delayed acknowledgements:

int opt = 1;
if (setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &opt, sizeof(opt)) == -1) {
    perror("setsockopt");
    return -1;
}
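
One caveat from the Linux tcp(7) manual page: the TCP_QUICKACK flag is not permanent, and the kernel may leave quickack mode again on its own. One way to cope, sketched below under that assumption, is to set the option again after each receive. The helper name recv_quickack is hypothetical, and the setsockopt call is only compiled where TCP_QUICKACK is defined.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: recv() that re-enables TCP_QUICKACK after every
 * successful read. Returns exactly what recv() returned. */
static ssize_t recv_quickack(int fd, void *buf, size_t len, int flags)
{
    ssize_t n = recv(fd, buf, len, flags);
#ifdef TCP_QUICKACK
    if (n > 0) {
        int opt = 1;
        /* Best effort: a failure here does not invalidate the data read. */
        (void)setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &opt, sizeof(opt));
    }
#endif
    return n;
}
```

This costs one extra system call per receive, so it is worth measuring whether it helps your workload.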

Achieving low latency

First follow my guide to tune your system for low latency workloads.

For low latency networking I don’t recommend using the kernel networking stack. Instead I recommend using kernel bypass technologies such as DPDK, OpenOnload, Mellanox VMA or Exablaze.

You can enable kernel busy polling if you are using Linux:

int val = 10000; // number of microseconds to busy poll
if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &val, sizeof(val)) == -1) {
    perror("setsockopt");
    return -1;
}

And/or use userspace busy polling:

ssize_t n;
start:
n = recv(fd, buf, len, MSG_DONTWAIT);
if (n == -1) {
    if (errno == EINTR || errno == EAGAIN || errno == EWOULDBLOCK) {
        goto start;
    }
    perror("recv");
    return -1;
}

On Linux you can also disable interrupt coalescing:

ethtool -C eth0 adaptive-rx off rx-usecs 0 rx-frames 0

I suggest reading the Red Hat Enterprise Linux Performance Tuning Guide and the blog post “How to achieve low latency with 10Gbps Ethernet” to learn more about these options.

Packet timestamping

Linux supports hardware and kernel software timestamping of network packets. This is used by the chrony Network Time Protocol (NTP) implementation to achieve more accurate time synchronization. You can also use it in your own application, for example to measure request latencies accurately.

¹ https://man7.org/linux/man-pages/man2/send.2.html http://man.openbsd.org/send.2#MSG_NOSIGNAL https://www.freebsd.org/cgi/man.cgi?query=send&sektion=2&manpath=FreeBSD+12.1-RELEASE+and+Ports https://netbsd.gw.com/cgi-bin/man-cgi?send++NetBSD-current
² https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man2/getsockopt.2.html
³ https://man7.org/linux/man-pages/man7/signal.7.html
⁴ https://man7.org/linux/man-pages/man2/sigaction.2.html
⁵ https://news.ycombinator.com/item?id=10607422

That still irks me. The real problem is not tinygram prevention. It’s ACK delays, and that stupid fixed timer. They both went into TCP around the same time, but independently. I did tinygram prevention (the Nagle algorithm) and Berkeley did delayed ACKs, both in the early 1980s. The combination of the two is awful. Unfortunately by the time I found about delayed ACKs, I had changed jobs, was out of networking, and doing a product for Autodesk on non-networked PCs.

Delayed ACKs are a win only in certain circumstances - mostly character echo for Telnet. (When Berkeley installed delayed ACKs, they were doing a lot of Telnet from terminal concentrators in student terminal rooms to host VAX machines doing the work. For that particular situation, it made sense.) The delayed ACK timer is scaled to expected human response time. A delayed ACK is a bet that the other end will reply to what you just sent almost immediately. Except for some RPC protocols, this is unlikely. So the ACK delay mechanism loses the bet, over and over, delaying the ACK, waiting for a packet on which the ACK can be piggybacked, not getting it, and then sending the ACK, delayed. There’s nothing in TCP to automatically turn this off. However, Linux (and I think Windows) now have a TCP_QUICKACK socket option. Turn that on unless you have a very unusual application.

Turning on TCP_NODELAY has similar effects, but can make throughput worse for small writes. If you write a loop which sends just a few bytes (worst case, one byte) to a socket with “write()”, and the Nagle algorithm is disabled with TCP_NODELAY, each write becomes one IP packet. This increases traffic by a factor of 40, with IP and TCP headers for each payload. Tinygram prevention won’t let you send a second packet if you have one in flight, unless you have enough data to fill the maximum sized packet. It accumulates bytes for one round trip time, then sends everything in the queue. That’s almost always what you want. If you have TCP_NODELAY set, you need to be much more aware of buffering and flushing issues.

None of this matters for bulk one-way transfers, which is most HTTP today. (I’ve never looked at the impact of this on the SSL handshake, where it might matter.)

Short version: set TCP_QUICKACK. If you find a case where that makes things worse, let me know.

John Nagle
