PROJET AUTOBLOG


Shaarli - Les discussions de Shaarli

Archived

Original site: Shaarli - Les discussions de Shaarli, 23/07/2013

⇐ back to index

How to receive a million packets per second

Tuesday, 16 June 2015 at 17:17
GuiGui's Show - Liens 16/06/2015
An absolute must-read.

« Last week during a casual conversation I overheard a colleague saying: "The Linux network stack is slow! You can't expect it to do more than 50 thousand packets per second per core!"

That got me thinking. While I agree that 50kpps per core is probably the limit for any practical application, what is the Linux networking stack capable of? Let's rephrase that to make it more fun:

On Linux, how hard is it to write a program that receives 1 million UDP packets per second?

[...]

While we could have used the usual send syscall, it wouldn't be efficient. Context switches to the kernel have a cost, and it is better to avoid them. Fortunately, a handy syscall was recently added to Linux: sendmmsg. It allows us to send many packets in one go. Let's do 1,024 packets at once.
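A minimal sketch of what such a batched sender can look like in C, assuming Linux ≥ 3.0; the destination address, port, and payload below are placeholders, not the article's actual test setup:

    /* Sketch: batched UDP sending with sendmmsg(2).
     * Destination address, port, and payload are placeholders. */
    #define _GNU_SOURCE
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define BATCH 1024  /* packets per syscall, as in the article */

    int main(void) {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        struct sockaddr_in dst = {0};
        dst.sin_family = AF_INET;
        dst.sin_port = htons(4321);                          /* placeholder port */
        inet_pton(AF_INET, "192.168.254.1", &dst.sin_addr);  /* placeholder IP */

        char payload[32] = {0};  /* small fixed-size UDP payload */
        struct mmsghdr msgs[BATCH];
        struct iovec iovs[BATCH];
        memset(msgs, 0, sizeof(msgs));

        for (int i = 0; i < BATCH; i++) {
            iovs[i].iov_base = payload;
            iovs[i].iov_len = sizeof(payload);
            msgs[i].msg_hdr.msg_iov = &iovs[i];
            msgs[i].msg_hdr.msg_iovlen = 1;
            msgs[i].msg_hdr.msg_name = &dst;
            msgs[i].msg_hdr.msg_namelen = sizeof(dst);
        }

        for (;;) {
            /* One context switch pushes up to 1,024 packets to the kernel. */
            int sent = sendmmsg(fd, msgs, BATCH, 0);
            if (sent < 0) { perror("sendmmsg"); break; }
        }
        close(fd);
        return 0;
    }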

With the naive approach we can do between 197k and 350k pps. Not too bad. Unfortunately there is quite a bit of variability. It is caused by the kernel shuffling our programs between cores. Pinning the processes to CPUs will help.
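Pinning can be done from the shell (for example with taskset) or from inside the program; a minimal sketch using sched_setaffinity, with the CPU number as a placeholder:

    /* Sketch: pin the current process to one CPU with sched_setaffinity(2),
     * roughly equivalent to launching it under `taskset -c 1`. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(1, &set);  /* placeholder: CPU core #1 */

        /* pid 0 means "the calling process" */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        /* ... run the send/receive loop here; the scheduler will no longer
           migrate the process between cores, stabilizing the numbers ... */
        return 0;
    }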

[...]

While 370k pps is not bad for a naive program, it's still quite far from the goal of 1Mpps. To receive more, first we must send more packets. How about sending independently from two threads: [...] The numbers on the receiving side didn't increase. ethtool -S will reveal where the packets actually went: [...] Through these stats, the NIC reports that it had successfully delivered around 350kpps to RX queue #4. The rx_nodesc_drop_cnt is a Solarflare-specific counter saying the NIC failed to deliver 450kpps to the kernel.

Sometimes it's not obvious why the packets weren't delivered. In our case though, it's very clear: RX queue #4 delivers packets to CPU #4, and CPU #4 can't do any more work - it's totally busy just reading the 350kpps. Here's how that looks in htop:

[...]

Historically, network cards had a single RX queue that was used to pass packets between hardware and kernel. This design had an obvious limitation - it was impossible to deliver more packets than a single CPU could handle.

To utilize multicore systems, NICs began to support multiple RX queues. The design is simple: each RX queue is pinned to a separate CPU, therefore, by delivering packets to all the RX queues a NIC can utilize all CPUs. But it raises a question: given a packet, how does the NIC decide to which RX queue to push it?

Round-robin balancing is not acceptable, as it might introduce reordering of packets within a single connection. An alternative is to use a hash computed from the packet to decide the RX queue number. The hash is usually computed from the tuple (src IP, dst IP, src port, dst port). This guarantees that packets for a single flow will always end up on exactly the same RX queue, and reordering of packets within a single flow can't happen.
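As a toy illustration only (real NICs typically implement this with a Toeplitz hash and a programmable key, not the formula below), the invariant can be sketched like this:

    /* Toy illustration of flow-to-queue hashing. This simplified formula
     * only demonstrates the invariant the article describes:
     * one flow -> one RX queue, always. */
    #include <stdint.h>

    uint32_t flow_hash(uint32_t src_ip, uint32_t dst_ip,
                       uint16_t src_port, uint16_t dst_port) {
        uint32_t h = src_ip;
        h = h * 31 + dst_ip;
        h = h * 31 + src_port;
        h = h * 31 + dst_port;
        return h;
    }

    /* Packets of the same 4-tuple always land on the same queue,
     * so per-flow ordering is preserved. */
    uint32_t rx_queue(uint32_t hash, uint32_t num_queues) {
        return hash % num_queues;
    }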

[...]

This is pretty limited, as it ignores the port numbers. Many NICs allow customization of the hash. Again, using ethtool we can select the tuple (src IP, dst IP, src port, dst port) for hashing: [...] Unfortunately our NIC doesn't support it - we are constrained to (src IP, dst IP) hashing.
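For reference, the shell form of that request is roughly `ethtool -N <iface> rx-flow-hash udp4 sdfn`; the same setting can be sketched through the ETHTOOL_SRXFH ioctl, with the interface name as a placeholder:

    /* Sketch: ask the NIC to hash UDP/IPv4 flows on the full 4-tuple via
     * the ETHTOOL_SRXFH ioctl. Interface name is a placeholder; the NIC
     * may refuse, as the article's Solarflare card did. */
    #include <linux/ethtool.h>
    #include <linux/sockios.h>
    #include <net/if.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>

    int main(void) {
        struct ethtool_rxnfc nfc = {0};
        nfc.cmd = ETHTOOL_SRXFH;
        nfc.flow_type = UDP_V4_FLOW;
        /* hash on src IP, dst IP, src port, dst port */
        nfc.data = RXH_IP_SRC | RXH_IP_DST | RXH_L4_B_0_1 | RXH_L4_B_2_3;

        struct ifreq ifr = {0};
        strncpy(ifr.ifr_name, "eth2", IFNAMSIZ - 1);  /* placeholder iface */
        ifr.ifr_data = (char *)&nfc;

        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (ioctl(fd, SIOCETHTOOL, &ifr) != 0)
            perror("ioctl(SIOCETHTOOL)");  /* NIC doesn't support it */
        return 0;
    }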

[...]

Since the hashing algorithm on our NIC is pretty limited, the only way to distribute the packets across RX queues is to use many IP addresses.

Hurray! With two cores busy handling RX queues and a third running the application, it's possible to get ~650k pps!

[...]

We can increase this number further by sending traffic to three or four RX queues, but soon the application will hit another limit. This time the rx_nodesc_drop_cnt is not growing, but the netstat "receiver errors" are [...] This means that while the NIC is able to deliver the packets to the kernel, the kernel is not able to deliver them to the application. In our case it delivers only 440kpps; the remaining 390kpps + 123kpps are dropped because the application is not receiving them fast enough.
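On the receive side, the counterpart of sendmmsg is recvmmsg, which drains a batch of packets per syscall; a minimal sketch, with port, buffer size, and batch size as placeholders:

    /* Sketch: batched UDP receive with recvmmsg(2), the receive-side
     * counterpart of sendmmsg. Port, buffer size, batch size: placeholders. */
    #define _GNU_SOURCE
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>

    #define BATCH 1024

    int main(void) {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        struct sockaddr_in addr = {0};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(4321);  /* placeholder port */
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));

        static char bufs[BATCH][2048];
        struct mmsghdr msgs[BATCH];
        struct iovec iovs[BATCH];
        memset(msgs, 0, sizeof(msgs));
        for (int i = 0; i < BATCH; i++) {
            iovs[i].iov_base = bufs[i];
            iovs[i].iov_len = sizeof(bufs[i]);
            msgs[i].msg_hdr.msg_iov = &iovs[i];
            msgs[i].msg_hdr.msg_iovlen = 1;
        }

        for (;;) {
            /* Drain up to BATCH packets per context switch. */
            int got = recvmmsg(fd, msgs, BATCH, 0, NULL);
            if (got < 0) { perror("recvmmsg"); break; }
            /* msgs[i].msg_len now holds the size of each datagram. */
        }
        return 0;
    }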

The receiving performance is down compared to a single-threaded program. That's caused by lock contention on the UDP receive buffer. Since both threads are using the same socket descriptor, they spend a disproportionate amount of time fighting for the lock around the UDP receive buffer.

[...]

Fortunately, there is a workaround recently added to Linux: the SO_REUSEPORT flag. When this flag is set on a socket descriptor, Linux will allow many processes to bind to the same port. In fact, any number of processes will be allowed to bind and the load will be spread across them.

With SO_REUSEPORT each of the processes will have a separate socket descriptor. Therefore each will own a dedicated UDP receive buffer. This avoids the contention issues previously encountered.
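A minimal sketch of that pattern, assuming a placeholder port and worker count: each forked worker binds its own SO_REUSEPORT socket and therefore owns a private receive buffer:

    /* Sketch: several workers bind their own socket to the same UDP port
     * with SO_REUSEPORT (Linux >= 3.9). Port and worker count are
     * placeholders. */
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define WORKERS 4

    static int bound_socket(void) {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        int one = 1;
        /* Must be set on every descriptor before bind(). */
        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

        struct sockaddr_in addr = {0};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(4321);  /* placeholder port */
        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0)
            perror("bind");
        return fd;
    }

    int main(void) {
        for (int i = 0; i < WORKERS; i++) {
            if (fork() == 0) {
                int fd = bound_socket();  /* private descriptor and buffer */
                char buf[2048];
                for (;;)  /* the kernel spreads flows across the workers */
                    recv(fd, buf, sizeof(buf), 0);
            }
        }
        for (;;) pause();  /* parent just waits */
    }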

1.126M pps  34.374MiB / 288.354Mb [...]

This is more like it! The throughput is decent now!

More investigation will reveal further room for improvement. Even though we started four receiving threads, the load is not being spread evenly across them [...] Two threads received all the work and the other two got no packets at all. This is caused by a hashing collision, but this time it is at the SO_REUSEPORT layer.

[...]

I've done some further tests, and with perfectly aligned RX queues and receiver threads on a single NUMA node it was possible to get 1.4Mpps. Running the receiver on a different NUMA node caused the numbers to drop, to at best 1Mpps.

To sum up, if you want perfect performance you need to:

   Ensure traffic is distributed evenly across many RX queues and SO_REUSEPORT processes. In practice, the load is usually well distributed as long as there are a large number of connections (or flows).

   Have enough spare CPU capacity to actually pick up the packets from the kernel.

   To make things harder, keep both RX queues and receiver processes on a single NUMA node.

While we have shown that it is technically possible to receive 1Mpps on a Linux machine, the application was not doing any actual processing of received packets - it didn't even look at the content of the traffic. Don't expect performance like that for any practical application without a lot more work. »
(Permalink)

Doo's bookmarks 16/06/2015
an interesting article on tweaking the Linux network stack :)
(Permalink)