PROJET AUTOBLOG


Shaarli - Les discussions de Shaarli

Archivé

Site original : Shaarli - Les discussions de Shaarli du 23/07/2013

⇐ retour index

Kernel bypass

jeudi 15 octobre 2015 à 16:35
GuiGui's Show - Liens
« Unfortunately the speed of vanilla Linux kernel networking is not sufficient for more specialized workloads. For example, here at CloudFlare, we are constantly dealing with large packet floods. Vanilla Linux can do only about 1M pps. This is not enough in our environment, especially since the network cards are capable of handling a much higher throughput. Modern 10Gbps NIC's can usually process at least 10M pps.

The performance limitations of the Linux kernel network are nothing new. Over the years there had been many attempts to address them. The most common techniques involve creating specialized API's to aid with receiving packets from the hardware at very high speed. Unfortunately these techniques are in total flux and a single widely adopted approach hasn't emerged yet.

Here is a list of the most widely known kernel bypass techniques.

PACKET_MMAP
Packet_mmap is a Linux API for fast packet sniffing. While it's not strictly a kernel bypass technique, it requires a special place on the list - it's already available in vanilla kernels.

PF_RING
PF_RING is another known technique that intends to speed up packet capture. [...]

Snabbswitch
Snabbswitch is a networking framework in Lua mostly geared towards writing L2 applications. It works by completely taking over a network card, and implements a hardware driver in userspace. It's done on a PCI device level with a form of userspace IO (UIO), by mmaping the device registers with sysfs. This allows for very fast operation, but it means the packets completely skip the kernel network stack.

DPDK
DPDK is a networking framework written in C, created especially for Intel chips. It's similar to snabbswitch in spirit, since it's a full framework and relies on UIO.

Netmap
Netmap is also a rich framework, but as opposed to UIO techniques it is implemented as a couple of kernel modules. To integrate with networking hardware it requires users to patch the kernel network drivers. The main benefit of the added complexity is a nicely documented, vendor-agnostic and clean API.

Since the goal of kernel bypass is to spare the kernel from processing packets, we can rule out packet_mmap. It doesn't take over the packets - it's just a fast interface for packet sniffing. Similarly, plain PF_RING without ZC modules is unattractive since its main goal is to speed up libpcap.

[...] In order to achieve a kernel bypass all of the remaining techniques: Snabbswitch, DPDK and netmap take over the whole network card, not allowing any traffic on that NIC to reach the kernel. At CloudFlare, we simply can't afford to dedicate the whole NIC to a single offloaded application.

Solarflare's EF_VI
Solarflare network cards support OpenOnload, a magical network accelerator. It achieves a kernel bypass by implementing the network stack in userspace and using an LD_PRELOAD to overwrite network syscalls of the target program. For low level access to the network card OpenOnload relies on an "EF_VI" library. This library can be used directly and is well documented.

EF_VI, being a proprietary library, can be only used on Solarflare NIC's, but you may wonder how it actually works behind the scenes. It turns out EF_VI reuses the usual NIC features in a very smart way.

Under the hood each EF_VI program is granted access to a dedicated RX queue, hidden from the kernel. By default the queue receives no packets, until you create an EF_VI "filter". This filter is nothing more than a hidden flow steering rule.

[...]

Bifurcated driver
While EF_VI is specific to Solarflare, it's possible to replicate its techniques with other NIC's. To start off we need a multi-queue network card that supports flow steering and indirection table manipulation. [...] This idea is referred to as a "bifurcated driver" in the DPDK community. There was an attempt to create a bifurcated driver in 2014, unfortunately the patch didn't make it to the mainline kernel yet.

Virtualization approach
There is an alternative strategy for the Intel 82599 chips. Instead of having a bifurcated driver we could use the virtualization features of the NIC to do a kernel bypass.

[...]

Network card vendors came to the rescue and cooked features to speed up the virtualized guests. In one of the virtualization techniques the network card is asked to present itself as many PCI devices. Those fake virtual interfaces can then be used inside the virtualized guests without requiring any cooperation from the host operating system.

You may wonder how that helps with kernel bypass. Since the "ixgbevf" device is not used by the kernel to do normal networking, we could dedicate it to the bypass. It seems possible to run DPDK on "ixgbevf" devices.

To recap: the idea is to keep the PF device around to handle normal kernel work and run a VF interface dedicated to the kernel bypass. Since the VF is dedicated we can run the "take over the whole NIC" techniques.

[...] by default the VF interface won't receive any packets. To send some flows from PF to VF you need this obscure patch to ixgbe.

[...]

Unfortunately out of the many techniques we've researched only EF_VI seem to be a practical in our circumstances. I do hope an open source kernel bypass API will emerge soon, one that doesn't require a dedicated NIC. »
(Permalink)