Today I am at netdev, a conference about Linux networking. I promised my coworkers notes so you all get notes too :)
This is a different conference for me than usual – I think I’m learning more at this conference than most conferences I’ve been at in the last few years. A lot of it goes above my head (a lot of the presenters/attendees work on networking subsystems in the Linux kernel), but I kind of like that! It means there are a lot of new ideas and terminology.
Here are some notes from today’s talks. They’re relatively stream-of-consciousness but if I’m going to have notes for this conference at all that’s how it’s gonna have to be. There are gonna be a bunch of mistakes in here.
XDP in practice: integrating XDP in our DDoS mitigation pipeline
by Gilberto Bertin from Cloudflare.
This talk was from Cloudflare! They’re a CDN, and one of the services they provide is DDOS mitigation. So if you’re being DDOSed, they’ll figure out which network traffic is an attack and absorb the malicious traffic themselves instead of passing.
how do you figure out whether traffic is malicious or not?
To block malicious traffic, first you need to figure out that that it’s malicious! When you’re doing DDOS mitigation manually (which is slow), you look for a pattern that the malicious traffic matches, and then write code to block traffic that matches that pattern. They’re doing the same thing, except automatically
They do this by
- sampling some small percentage of traffic at every edge server
- encoding that traffic with “sFlow” (wikipedia article and sending it by UDP to some central servers that do packet analysis
- The central servers come up with “patterns” that should be blocked
- for example, those central servers might notice that there are a ton of suspicious-looking packets that are all from the same IP address
how do you block the malicious traffic?
They compile their malicious-traffic-patterns into BPF rules. Cloudflare has some software called bpftools that lets you take a pattern and compile it into code that the kernel can run to filter packets.
./bpfgen dns -- *.www.example.uk will create a BPF filter for DNS
Okay, so suppose you have BPF bytecode that matches the traffic that you want to block! How do you actually
way 1: iptables
He said that they started out using IPtables to filter traffic. This made sense because iptables is the only way I know to filter traffic :)
The problem with this was that there were “IRQ storms”. I’m not totally sure what this means, but I think that it means that in the Linux networking stack there are a lot of interrupts, and interrupts at some point are kind of expensive, using iptables to filter really high packet volumes eventually goes badly.
way 2 userspace offload
Okay, so doing networking in the kernel is going slowly. Let’s do it in userspace! He said they used solarflare which is a proprietary userspace networking stack. (and they also sell hardware?)
There were 2 problems with this but the one I understood was that “getting packets back into the kernel networking stack is expensive”
way 3: XDP
this talk was about XDP so obviously they ended up using it. here are some links about XDP:
- a nice introduction on LWN: Debating the value of XDP
What I understand about XDP so far: there are a lot of steps in the Linux networking stack (see Monitoring and Tuning the Linux Networking Stack: Receiving Data ).
If you want to filter out packets very quickly, you don’t want to go through all those steps! You want to just read packets off the network card, be like “NOPE not that one!”, and move on really quickly.
It seems like XDP is a pretty new system (at the end of this talk, someone commented “congratulations on using such new technology!), and they started using it at Cloudflare to do packet filtering. A few notes from this section of the talk:
- “ebpftools generates XDP programs”
- XDP programs are C programs, which are compiled by clang, which then become eBPF bytecode
- the eBPF uses maps (hashmaps? unclear) that are shared with userspace. I didn’t catch what the userspace code that those maps are shared with does.
- Also, he mentioned a tool called p0f, wikipedia that can do OS fingerprinting of network traffic.
- their BPF bytecode generation tool supports p0f signatures
- XDP requires at least a 4.8 kernel. They use 4.9 because they want to be using an LTS release
- XDP doesn’t support all network cards
There was some discussion at the end of this talk about how you can mark packets in XDP. I didn’t totally understand why you would want to mark packets, but here are some things I learned.
- the fundamental data structure in the linux networking stack is the skb or socket buffer
this was a really good talk and I learned a lot
XDP at Facebook
by Huapeng Zhou, Doug Porter, Ryan Tierney, Nikita Shirokov from Facebook (though only one of them spoke)
This talk was about how Facebook uses XDP for 2 different things:
- to implement an L4 load balancer
- for DDOS mitigation
He said they have an L4 load balancer (called “SHIV”?) which forwards traffic to downstream L7 load balancers. This is the frontend for Facebook, I think – like the first servers your requests hit when you go to facebook.com.
There’s another talk about load balancing at Facebook from SRECon a while back: Building A Billion User Load Balancer
He said that before they were using something called “IPVS”. At work we use HAProxy to do load balancing. It seems like IPVS is a load balancer that is inside of the Linux kernel? I found this blog post about IPVS which has more information.
Anyway, so they stopped using IPVS for some reason and decided to use XDP instead. At the end of this talk they said that XDP is 10x faster.
This talk didn’t talk about how you configure the load balancer (how do you tell it which backends to use for which requests? does it do healthchecks to figure out if some backend is temporarily down? Are the healthchecks done in userspace? How do the results from the healthchecks get back into the XDP program running in the kernel?)
The interesting thing in this talk is that they made some progress on debugging XDP programs: debugging XDP programs seems kind of hard (is that because they run on the eBPF virtual machine? all this ebpf stuff is still a little unclear to me).
eBPF seems to support putting events into perf events via
bpf_perf_event_output, which is cool because
perf_events is an existing
Linux Networking for the Enterprise
by Shrijeet Mukherjee from Cumulus
This talk seemed to be about what networking features enterprise customers care about. I learned from this talk that people in the enterprise have a lot of ethernet cables, and that ethtool is (or should be?) a good tool for figuring out if your ethernet cables are working.
He also talked about how multicast is not that popular anymore, but a lot of enterprises care about multicast, so you need to have multicast support.
At lunch there was a vegetarian table which was a cool way to mix people who might not otherwise sit together :)
Everyone was really nice and let me ask a bunch of questions. Here are a few things I learned at lunch:
sharing a network card between the Linux kernel and userspace
Usually on Linux people use the Linux kernel to do networking. But sometimes you want to do networking in userspace! (for performance reasons, or something). I wanted to know if the Linux kernel can share access to the network card with a userspace program. It turns out that it can!
The person I was talking to mentioned that there are “many RX queues”. I took that hint and Googled after lunch, and found this great blog post from Cloudflare.
So! It turns out that the first step in receiving a network packet is for the network card to put data in an “RX ring” or “RX queue”, which is an area in memory. two facts:
- there can be more than one of these queues
- you can program the network card to only put certain packets in certain queues (“put all UDP packets from port 53 into queue 4”)
- different applications can use different queues
So if you want Linux to share, you can have Linux handle most of the RX queues, but make your userspace program handle one of them. That’s cool!
more about tc
I wrote the other day about how to make your internet slow with tc. It turns out that tc can do a lot more things than make your internet slow!
For example! I used to share my internet connection with my cousin upstairs,
and I run a Linux router. I could use
tc to program my router to limit
traffic from my cousin’s computer (by filtering + ratelimiting traffic from her
MAC Address or something). There’s more about how to do that on OpenWRT’s site
Another thing apparently tc can do is hardware offload. Basically as far as
I understand it – a lot of network cards have super fancy
features. They can do a lot of the packet processing that usually your kernel
would do (like checksums, and more).
tc knows how to program your network
card to do fancy packet processing.
There is a lot more but those were the two easiest things for me to understand.
networking performance workshop
chaired by Alex Duyck, I didn’t manage to take notes of all the speakers.
This was a workshop about new developments in networking performance (by people from Intel, Mellanox, and somewhere else).
My notes here are pretty sparse because I didn’t understand most of it, but they were all around “how do you make networking faster?”
Page based receive + page reuse
The premise for here seemed to be
- the usual size of a page on Linux is 4kb
- the usual MTU for a packet (“maximum transmission unit”, or basically the packet size) is 1500 bytes
- making new pages is expensive (though I didn’t 100% understand this)
- but a page is more than twice as big as a packet, so somehow you can put two packets in one page
- and there are a bunch of complications
Interesting thing: the speaker mentioned that in their network drivers, a lot of people use “memory barriers” that are too strong. But wait, what’s a memory barrier? By Googling I found https://www.kernel.org/doc/Documentation/memory-barriers.txt about memory barriers people use in the Linux kernel which seems pretty readable. So that goes on my to-read list :)
I think he suggested using the
dma_wmb memory barriers instead of
The second talk was about a pretty similar topic – the idea is that memory allocator performance is limiting how fast you can process packets with XDP.
They said that if you use bigger pages (like 64k), then you don’t have to allocate as much memory (which is good), but the problem is that if you put many packets/fragments into the same large page, then you have to wait for all those fragments to be processed before you can free the page. So there’s a tradeoff there.
Okay, this is a different thing! This thing is a work in progress and they haven’t even written the code that would go in the kernel yet. The motivation for this work as far as I understood it was to make tcpdump (and tools like tcpdump) run faster.
I think the idea here is:
- when you use tcpdump, it reads raw packets from the network interface and parses them
- but it has to copy those packets into its memory (in userspace) and copying takes time
- so it would be better if tcpdump could get the packets without having to copy them
I think the idea here is that you could open a socket in mode
or something, and then you would get access to packets from a network interface
without then being copied at all.
I like tcpdump a lot so anything that could make tcpdump faster is something I am excited about! I would like to understand this better.
When I was looking up stuff about this I found this cool blog post from Marek at Cloudflare about how tcpdump works: BPF: the forgotten bytecode
a talk from Oracle
by Sowmini Varadhan, Tushar Dave
I missed the first half of this talk so I can’t tell you what it was about but she said something interesting!
She showed 4 graphs of latency vs throughput. In all the graphs:
- they started out flat (so you could do 10,000 reqs/s with 1000ms latency, 50,000 reqs/s with 1000ms latency)
- at some point they hit a throughput “cliff” – like, the system just would not process more than 100,000 reqs/s at all, and as you put more requests into it the latency would just keep going up
I know that there’s a tradeoff between latency and throughput for any system (if you want fast replies, you have to send less requests), and it was nice to see graphs of that tradeoff. And it was interesting to see such an abrupt throughput cliff.
evolution of network i/o virtualization
by: Anjali Jain, Alexander Duyck, Parthasarathy Sarangam, Nrupal Jani
This one was about the past/present/future of how network virtualization works. By people from Intel.
What I learned:
- SR-IOV is a way of doing network virtualization
- they spent most of their time talking about SR-IOV so it must be important? maybe it’s someting Intel invests in a lot?
- an “IOMMU” is an important thing
- in SR-IOV, you can only have 2^16 virtual machines at a time (16 bits)
- “east-west” is traffic between VMs on the same machines, and “north-south” is traffic between VMs on different boxes
- there are performance problems with VMs on the same host talking (“east-west traffic”). this is because “it’s bottlenecked by the PCI bus”. what is the PCI bus though and why does it cause bottlenecks?
They spent a long time in this talk talking about live migration of VMs here and how it interacts with network virtualization. Live migration means that you take a virtual machine and move it to another computer while it’s running. I thought it was interesting that they talked about it like it’s a normal thing because I think of it as like.. something totally magical that doesn’t exist in real life.
But I guess if you’re building the networking hardware we’re going to be using 5-10 years in the future, you need to be thinking ahead!
that’s all for today
more tomorrow, hopefully.