Yes, the RAM is mostly used by content sitting in the VM page cache.
Yes, you could go NVMe->NIC with P2P DMA. The problem is that NICs want to read data one TCP MSS (~1448 bytes) at a time, while NVMe really wants to speak in 4K-sized chunks (a 4K block spans roughly three segments), so there need to be buffers somewhere. It might eventually be CXL-based memory, but for now it is host memory.
EDIT: missed the last question. No, with NIC kTLS, host RAM usage is about the same as it would be without TLS at all. E.g., connection data sitting in the socket buffers refers to pages in the host VM page cache, which can be shared among multiple connections. With software kTLS, data in the socket buffers must refer to private, per-connection encrypted data, which increases RAM requirements.
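To make the SW-vs-NIC distinction concrete, here's a minimal sketch of asking the kernel which kTLS transmit mode a socket ended up in (assuming FreeBSD 13+; the constant names are my recollection of ktls(4) / netinet/tcp.h, so check your release):

    /* Sketch: query which kTLS transmit mode a connected socket got.
     * Assumes TLS send offload was already enabled on `s` (e.g. by
     * OpenSSL via TCP_TXTLS_ENABLE during the handshake). */
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <stdio.h>

    static void
    print_ktls_mode(int s)
    {
        int mode;
        socklen_t len = sizeof(mode);

        if (getsockopt(s, IPPROTO_TCP, TCP_TXTLS_MODE, &mode, &len) != 0) {
            perror("getsockopt(TCP_TXTLS_MODE)");
            return;
        }
        if (mode == TCP_TLS_MODE_IFNET)
            /* NIC inline crypto: socket buffers keep pointing at shared
             * page-cache pages, so RAM usage stays roughly flat. */
            printf("NIC (inline) kTLS\n");
        else if (mode == TCP_TLS_MODE_SW)
            /* Kernel encrypts into private per-connection buffers. */
            printf("software kTLS\n");
        else
            printf("no TLS offload (mode %d)\n", mode);
    }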
Thank you, I understood that efficient offload may eventually be possible.
Back when I was at NetApp, folks had researched splitting 4K chunks into 3 Ethernet packets (on the NetCache line) so that they'd fit happily, and issuing 3 I/Os on non-4K-aligned boundaries. There was also the converse problem of reassembling smaller I/Os into a bigger packet, because some disks still used 512-byte blocks back then. The idea was to hand the engine multiple gather/scatter entries and let it take care of reassembly.
Really looking forward to what interesting things happen in this space :)
A. Just curious, are these servers performing any work besides purely serving content? E.g., user auth, album art, show descriptions, etc.?
B. What’s the current biggest bottleneck preventing higher throughput?
C. Has everything been upstreamed? Meaning, if I were to theoretically purchase the exact same hardware, would I be able to achieve similar throughput?
(Amazing work, by the way, on these continued accomplishments. These posts over the years are always my favorite HN stories.)
a) These are CDN servers, so they serve CDN stuff. Some do serve cover art, etc sorts of things.
b) Memory bandwidth and PCIe bandwidth. I'm eagerly awaiting Gen5 PCIe NICs and Gen5 PCIe / DDR5 based servers :)
c) Yes, everything in the kernel has been upstreamed. I think there may be some patches to nginx that we have not upstreamed (SO_REUSEPORT_LB patches, TCP_REUSPORT_LB_NUMA patches).
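For anyone curious what those options look like from userspace, here's a rough sketch of a NUMA-pinned, load-balanced listen socket (assuming FreeBSD 13+; option names are from memory, and this is an approximation, not the actual nginx patches):

    /* Sketch: one listen socket per NUMA domain, joined to FreeBSD's
     * load-balanced reuseport group and filtered to its local domain.
     * Approximate; not the actual nginx patches. */
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <arpa/inet.h>
    #include <stdint.h>
    #include <err.h>

    static int
    make_lb_listener(uint16_t port, int numa_domain)
    {
        int s, on = 1;
        struct sockaddr_in sin = {
            .sin_len = sizeof(sin),
            .sin_family = AF_INET,
            .sin_port = htons(port),
        };

        if ((s = socket(AF_INET, SOCK_STREAM, 0)) < 0)
            err(1, "socket");

        /* Join the load-balanced SO_REUSEPORT group: the kernel spreads
         * incoming connections across all sockets bound this way. */
        if (setsockopt(s, SOL_SOCKET, SO_REUSEPORT_LB, &on, sizeof(on)) < 0)
            err(1, "SO_REUSEPORT_LB");

        if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) < 0)
            err(1, "bind");
        if (listen(s, 128) < 0)
            err(1, "listen");

        /* Ask the kernel to steer only connections that arrived on NICs
         * in this NUMA domain to this listener (set on the listen socket). */
        if (setsockopt(s, IPPROTO_TCP, TCP_REUSPORT_LB_NUMA,
            &numa_domain, sizeof(numa_domain)) < 0)
            err(1, "TCP_REUSPORT_LB_NUMA");

        return (s);
    }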
You may have been asked this before in older stories, but why FreeBSD, other than legacy reasons, and why not DPDK? Is it purely because you need the fastest TCP implementation?
We already do: the Mellanox ConnectX-6 Dx with crypto support. It does inline crypto on TLS records as they are transmitted, which saves memory bandwidth compared to a traditional lookaside card.
Were you assuming they were giant FPGA-based NICs? They are production server NICs, using ASICs with a reasonable power budget. I don't recall any failures.
I don't have any visibility into how many DOA NICs we have, so I can't say whether Mellanox is better or worse on that point. But I do see most tickets for NIC failures once machines are in production. In general, we've found Mellanox NICs to be very reliable.
This is going to be a “How long is a piece of string?” question. Each ASN will be unique, and even within any large ISP there may be many OCA deployment sites (there won’t just be one for Virgin Media in the UK), and each site will likely have subtly different traffic patterns and content-consumption patterns, meaning the OCA deployment may be customized to suit and the content pushed out (particularly to these NVMe-based nodes) will be tailored accordingly.
Since the alternative for an ISP is to be carrying the bits for Netflix further, the likelihood is they’ll devote whatever space is required because that’s much cheaper than backhauling the traffic and ingressing over either a settlement-free PNI or IXP link to a Netflix-operated cache site, or worse, ingressing the traffic over a paid transit link.
Meanwhile, on the flipside, since Netflix funds the OCA deployments they have a strong interest in not “oversizing” the sites. That said I’m sure there is an element of growth forecasting involved once a site has been operational for a period of time.
We use ZFS for root, but not for content. For content we use UFS. This is because ZFS is not compatible with "zero-copy" sendfile: it uses its own ARC cache rather than the kernel page cache, so sending data stored on ZFS requires an extra data copy out of the ARC. It's also not compatible with async sendfile, as it does not have the methods required to call the sendfile completion handler after data is read from disk into memory.
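For context, here's a simplified sketch of the zero-copy path that refers to (just the shape of FreeBSD's sendfile(2) call, not the actual nginx code); the extra ARC copy would defeat exactly this step:

    /* Simplified sketch: sendfile(2) wraps the file's page-cache pages in
     * mbufs and queues them on the socket without copying (and, with async
     * sendfile, without blocking on the disk read). Not the actual nginx
     * code, just the syscall shape. */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <errno.h>
    #include <err.h>

    static off_t
    serve_range(int filefd, int sockfd, off_t offset, size_t len)
    {
        off_t sent = 0;

        if (sendfile(filefd, sockfd, offset, len, NULL, &sent, 0) < 0 &&
            errno != EAGAIN)
            err(1, "sendfile");
        return (sent);
    }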
I found this extremely interesting. ZFS is almost a cure-all for what ails you WRT storage, but there is always something that even Superman can't do. Sometimes old-school is best-school.
Possibly. Remember the goal is to keep data read from the drives unmapped in the kernel address space, and hence never accessed by the CPU, for efficiency. So we'd have to give up on ZFS, or at least alter it so it doesn't checksum each block as it's read.
The most interesting use of ZFS for us would be on servers with traditional hard drives, where ZFS is supposedly more efficient than UFS at keeping data contiguous on disk, thus resulting in fewer seeks and increased read bandwidth.
Could you do a mix of NIC oriented siloing and Disk oriented siloing?
It seems like the bottlenecks are different, so if you serve X% of traffic on the NIC node and Y% on the disk node, you might be able to squeeze a bit more traffic?
Also, how amenable is this to real-time analysis? Could you look at a request where it comes in on node A and the disk is on node B, tell which NIC is less loaded, and send the output through that NIC? (Some selection algorithm based on relative loading, anyway.)
Replicating popular content is good if there's a smallish core set of popular content. Then you'd be able to serve popular content on the NIC it comes in on without cross NUMA traffic. Really depends on traffic distribution though, if you need to replicate a lot of content, then maybe you push too much content out of cache.
It'd be neat if you could teach the page cache to do replication for you... Then you might use SF_NOCACHE for not very popular content, no option medium content, and SF_NUMACACHE (or whatever) for content you wanted cached on the local NUMA node. I'm sure there's lots of dragons in there though ;)
> How did you generate those flamegraphs and what other tools did you use to measure performance?
We have an internal shell script that takes hwpmc output and generates flamegraphs from the stacks. It also works with dtrace. I'm a huge fan of dtrace. I also make heavy use of lockstat, AMD uProf, and Intel Vtune.
> Did the graph show the bottleneck contention on aio queue? Did the graph show that "a lot of time was spent accessing memory"?
See the graph on page 32 or so of the presentation. It shows huge plateaus in lock_delay called out of the aio code. It's also obvious from lockstat stacks (run as lockstat -x aggsize=4m -s 10 sleep 10 > results.txt).
See the graph on page 38 or so. The plateaus are mostly memory copy functions (memcpy, copyin, copyout).
We already use FreeBSD on our CDN, so it just made sense to do the work in FreeBSD.
They are generated by a local shell script that uses the same helpers (stackcollapse*.pl, difffolded.pl). Our revision control says the script was committed by somebody else though. It existed before I joined Netflix.
Not the OP, but that's basically in the slides: that's what you get with kTLS, but not NIC kTLS. Maybe you could optimize that a bit more around the edges if NIC kTLS wasn't an option.
The ktls(4) man page is a start. The reference implementation is OpenSSL right now. I did support for an internal Netflix library a while ago; I probably should have documented it at the time. For now feel free to contact me via email with questions (the username in the URL, but @netflix.com).
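If it helps while the docs are thin, here's a rough sketch of the OpenSSL 3.0+ path (my own minimal example, assuming the handshake completes inside the elided step; not the Netflix-internal library):

    /* Rough sketch of kTLS via OpenSSL 3.0+, the reference implementation
     * mentioned above. Error handling trimmed. */
    #include <sys/types.h>
    #include <openssl/ssl.h>
    #include <openssl/bio.h>

    static void
    send_file_over_ktls(SSL *ssl, int filefd, size_t len)
    {
        /* Opt in before the handshake so OpenSSL pushes the negotiated
         * keys into the kernel (and, with a capable NIC, onto the NIC). */
        SSL_set_options(ssl, SSL_OP_ENABLE_KTLS);

        /* ... complete the TLS handshake on `ssl` ... */

        if (BIO_get_ktls_send(SSL_get_wbio(ssl))) {
            /* kTLS is active: records are encrypted in the kernel or on
             * the NIC, so we can sendfile() straight from the page cache
             * without ever touching the data in userspace. */
            SSL_sendfile(ssl, filefd, 0, len, 0);
        } else {
            /* Fall back to plain SSL_write() with a userspace copy. */
        }
    }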
Not the OP, but PCIe Gen5 NICs are already available on the market; I've seen people asking for help getting them to work on desktop platforms, which have had PCIe Gen5 as of the most recent chips. AFAIK both AMD and Intel currently release desktop parts before server parts; I don't think there's a public release date for Zen 4 server chips, but probably this quarter or next. Intel's release process is too hard for me to follow, but they've got desktop chips with PCIe Gen5, so whenever those get to the server, that might be an option too.
At a very basic microbenchmark level, I use stream, netperf, a few private VM stress tests, etc. But the majority of my testing is done using real production traffic.