Erlang Latency Guide
Introduction
Latency is a tricky subject; sometimes it's not even clear what to measure or how to measure it. I've had the experience of writing a fairly complex system requiring low latencies in Erlang. Fortunately, Erlang provides really good baseline performance. Most of the time you simply write your program and it will perform well. There are, however, a few tricks that can be used to lower the latency of a specific path in the system. This document describes a few of these tricks.
Yield
Erlang allows you to design efficient concurrent systems without caring how processes are scheduled or how many cores the system is running on. When running Erlang with multiple schedulers (generally one per CPU-core) the runtime will balance the load between the schedulers by migrating processes to starved schedulers. There is no way to bind processes to schedulers or control how processes are migrated between schedulers. This introduces a non-deterministic behavior in the system and makes it hard to control latency.
A common pattern is to have a demultiplexer that receives a message, sends it to some other process/processes and then performs some additional processing on the message:
    loop(State) ->
        receive
            Msg ->
                Pid = lookup_pid(Msg, State),
                Pid ! Msg,
                State2 = update_state(Msg, State),
                loop(State2)
        end.
After the message has been sent the receiving process will be ready to execute, but unless the receiving process is on a different scheduler the demultiplexer will first finish executing. Ideally we would bind the demultiplexer to one scheduler and bind the receiving processes to the other schedulers, but that's not allowed in Erlang.
Erlang provides only one simple but powerful way to control scheduling: the built-in function (BIF) erlang:yield/0 lets a process voluntarily give up execution so that other processes get a chance to execute.
The demultiplexer pattern can be modified by adding erlang:yield() after sending the message:
    loop(State) ->
        receive
            Msg ->
                Pid = lookup_pid(Msg, State),
                Pid ! Msg,
                erlang:yield(),
                State2 = update_state(Msg, State),
                loop(State2)
        end.
After the message has been sent, the demultiplexer will give up execution. If the demultiplexer and the receiver are on the same scheduler, the receiver will execute before the demultiplexer finishes executing. If they are on different schedulers, they will execute in parallel.
Using the erlang:yield/0 BIF it's possible to control the scheduling of Erlang processes. If used correctly, this can reduce the latency in a system.
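To make the pattern concrete, here is a minimal, self-contained sketch of a yielding demultiplexer. The module name demux_yield, the {Key, Payload} message shape, the map-based state and the worker/1 function are all assumptions made for illustration; lookup_pid and update_state stand in for whatever your system really does.

```erlang
-module(demux_yield).
-export([start/1, worker/1]).

%% Hypothetical demultiplexer: routes {Key, Payload} to the worker
%% registered under Key and counts routed messages as its "state".
start(Keys) ->
    Parent = self(),
    Workers = maps:from_list(
                [{K, spawn_link(?MODULE, worker, [Parent])} || K <- Keys]),
    spawn_link(fun() -> loop(#{workers => Workers, routed => 0}) end).

loop(State) ->
    receive
        {Key, _Payload} = Msg ->
            Pid = lookup_pid(Key, State),
            Pid ! Msg,
            %% Let the receiver run before we do our own bookkeeping.
            erlang:yield(),
            loop(update_state(State))
    end.

%% Illustrative helpers; an unknown Key crashes the demultiplexer.
lookup_pid(Key, #{workers := Workers}) -> maps:get(Key, Workers).
update_state(#{routed := N} = State) -> State#{routed := N + 1}.

%% Workers just report what they handled back to the starting process.
worker(Parent) ->
    receive
        Msg ->
            Parent ! {handled, self(), Msg},
            worker(Parent)
    end.
```

Whether the yield actually helps depends on how expensive the bookkeeping after the send is, so it is worth measuring with and without it.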
Network
All network I/O in Erlang is implemented as an Erlang driver. The driver is interfaced by the module prim_inet, which in turn is interfaced by the network related modules in the kernel application.
There is a performance issue with the prim_inet:send/2 and prim_inet:recv/2 functions affecting all the network related modules. When calling prim_inet:send/2 or prim_inet:recv/2 the process will do a selective receive. If the process's message queue is long there will be a performance penalty from doing this selective receive.
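A rough way to see this penalty is to time a selective receive with a varying number of unrelated messages queued in front of the one being matched. The sketch below does that in a throwaway process; the module name selrecv_cost and the message shapes are made up for the illustration, and the numbers it produces will vary between machines and runtime versions.

```erlang
-module(selrecv_cost).
-export([measure/1]).

%% Returns the time in microseconds a selective receive takes when
%% QueueLen unrelated messages are queued ahead of the matching one.
measure(QueueLen) ->
    Parent = self(),
    spawn_link(fun() ->
        Self = self(),
        %% Fill our own mailbox with messages the receive must skip.
        [Self ! {junk, I} || I <- lists:seq(1, QueueLen)],
        Self ! {reply, wanted},
        {Us, ok} = timer:tc(fun() ->
            receive {reply, wanted} -> ok end
        end),
        Parent ! {measured, QueueLen, Us}
    end),
    receive
        {measured, QueueLen, Micros} -> Micros
    end.
```

Comparing, say, selrecv_cost:measure(10) with selrecv_cost:measure(100000) should show the receive time growing with the queue length.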
For receiving there is a simple solution to this problem: use the {active, once} socket option.
A simple selective receive-free TCP receiver:
    loop(Sock) ->
        %% Re-arm the socket: exactly one more packet is delivered as a message.
        inet:setopts(Sock, [{active, once}]),
        receive
            {tcp, Sock, _Data} ->
                loop(Sock);
            {tcp_error, Sock, Reason} ->
                exit(Reason);
            {tcp_closed, Sock} ->
                exit(normal)
        end.
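The following is a small loopback exercise of the same receiver, written as a sketch rather than production code. The module name active_once_demo, the accumulator argument and the {received, Bin} report message are assumptions added so the example returns something observable.

```erlang
-module(active_once_demo).
-export([run/0]).

%% Sends <<"hello">> over a loopback connection and returns what an
%% {active, once} receiver collected before the peer closed.
run() ->
    {ok, L} = gen_tcp:listen(0, [binary, {active, false},
                                 {ip, {127,0,0,1}}]),
    {ok, Port} = inet:port(L),
    Parent = self(),
    spawn_link(fun() ->
        %% The accepting process becomes the owner of the new socket.
        {ok, S} = gen_tcp:accept(L),
        loop(S, Parent, [])
    end),
    {ok, C} = gen_tcp:connect({127,0,0,1}, Port, [binary, {active, false}]),
    ok = gen_tcp:send(C, <<"hello">>),
    ok = gen_tcp:close(C),
    Result = receive
                 {received, Bin} -> Bin
             after 2000 -> timeout
             end,
    ok = gen_tcp:close(L),
    Result.

loop(Sock, Parent, Acc) ->
    inet:setopts(Sock, [{active, once}]),
    receive
        {tcp, Sock, Data} ->
            loop(Sock, Parent, [Acc, Data]);
        {tcp_error, Sock, Reason} ->
            exit(Reason);
        {tcp_closed, Sock} ->
            Parent ! {received, iolist_to_binary(Acc)}
    end.
```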
To implement sending without doing a selective receive it is necessary to use the low-level port interface function erlang:port_command/2. Calling erlang:port_command(Sock, Data) on a TCP socket sends the data Data on the socket and returns true. The socket will reply by sending {inet_reply, Sock, Status} to the process that called erlang:port_command.
A simple selective receive-free TCP writer:
    loop(Sock) ->
        receive
            {inet_reply, _, ok} ->
                loop(Sock);
            {inet_reply, _, Status} ->
                exit(Status);
            Msg ->
                try erlang:port_command(Sock, Msg)
                catch error:Error -> exit(Error)
                end,
                loop(Sock)
        end.
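As a quick check of the mechanism, the sketch below sends one packet with erlang:port_command/2 over a loopback connection and waits for the driver's reply. It assumes the default port-based (inet_drv) socket implementation, where the call returns true and the reply message is tagged with the socket itself; the module name port_send_demo is made up for the example.

```erlang
-module(port_send_demo).
-export([run/0]).

%% Returns {Status, Data}: the driver's send status and the bytes
%% that arrived on the accepting side of a loopback connection.
run() ->
    {ok, L} = gen_tcp:listen(0, [binary, {active, false},
                                 {ip, {127,0,0,1}}]),
    {ok, Port} = inet:port(L),
    {ok, C} = gen_tcp:connect({127,0,0,1}, Port, [binary, {active, false}]),
    {ok, S} = gen_tcp:accept(L),
    %% Bypass gen_tcp:send/2 and talk to the port directly.
    true = erlang:port_command(C, <<"ping">>),
    Status = receive
                 {inet_reply, C, St} -> St
             after 1000 -> timeout
             end,
    {ok, Data} = gen_tcp:recv(S, 4, 1000),
    ok = gen_tcp:close(C),
    ok = gen_tcp:close(S),
    ok = gen_tcp:close(L),
    {Status, Data}.
```

Note that this relies on an undocumented interface between prim_inet and the driver, so it may change between releases.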
Though not Erlang specific, it is important to remember to tune the send and receive buffer sizes. If the TCP receive window is full, data may be delayed by up to one network round trip. For UDP, packets will be dropped.
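In Erlang the kernel buffer sizes can be requested with the sndbuf and recbuf socket options. The sketch below asks for a size when opening a listen socket and reads back what the operating system actually granted, since the OS is free to round or clamp the request. The 256 KB figure and the module name bufsize_demo are arbitrary choices for the example, not recommendations.

```erlang
-module(bufsize_demo).
-export([run/0]).

%% Requests send/receive buffer sizes and returns the effective values,
%% e.g. [{recbuf, R}, {sndbuf, S}] with whatever the OS granted.
run() ->
    Size = 256 * 1024,
    {ok, L} = gen_tcp:listen(0, [binary, {active, false},
                                 {recbuf, Size}, {sndbuf, Size}]),
    {ok, Effective} = inet:getopts(L, [recbuf, sndbuf]),
    ok = gen_tcp:close(L),
    Effective.
```

The same options can be changed on an already open socket with inet:setopts/2.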
Distribution
Erlang allows you to send messages between processes on different nodes, on the same or different computers. It is also possible to interact with C-nodes (Erlang nodes implemented in C). The communication is done over TCP/IP, and this introduces latency, especially when communicating between nodes on a network.
Even when the nodes are running on the same computer they communicate using TCP/IP over the loopback interface. Different operating systems have widely different loopback performance (Solaris has lower latency than Linux). If your system uses the loopback interface it's a good idea to consider this.
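One way to get a feel for this cost is to time a message round trip to an echo process on another node. The sketch below is a rough probe, not a benchmark: the module name rtt_probe and the message shapes are assumptions, the module has to be loaded on the target node, and a single sample says little without repetition.

```erlang
-module(rtt_probe).
-export([rtt/1, echo/0]).

%% Round-trip time in microseconds for one message to an echo
%% process on Node. Pass node() to measure the local baseline.
rtt(Node) ->
    %% Spawning first also sets up the connection to Node, so the
    %% connection setup is not included in the measurement.
    Pid = spawn(Node, ?MODULE, echo, []),
    Ref = make_ref(),
    {Micros, ok} = timer:tc(fun() ->
        Pid ! {ping, self(), Ref},
        receive {pong, Ref} -> ok end
    end),
    Pid ! stop,
    Micros.

echo() ->
    receive
        {ping, From, Ref} ->
            From ! {pong, Ref},
            echo();
        stop ->
            ok
    end.
```

Comparing rtt_probe:rtt(node()) with the result for a second node on the same machine gives an indication of what the loopback interface adds.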
Further Reading
erts/preloaded/src/prim_inet.erl from the Erlang release
erts/emulator/drivers/common/inet_drv.c from the Erlang release