
Author here, happy to answer questions



Read the presentation. Had some super noob-level questions.

Is the RAM mostly used by page content read by the NICs due to kTLS?

If there were better DMA/offload, could this be done with a fraction of the RAM? (NVMe->NIC)

If there were no need for TLS, would the RAM usage drop dramatically?


These are actually fantastic questions.

Yes, the RAM is mostly used by content sitting in the VM page cache.

Yes, you could go NVMe->NIC with P2P DMA. The problem is that NICs want to read data one TCP MSS (~1448 bytes) at a time, while NVMe really wants to speak in 4K-sized chunks, so there need to be buffers somewhere. Those buffers might eventually be CXL-based memory, but for now they are host memory.

EDIT: missed the last question. No, with NIC kTLS the host RAM usage is about the same as it would be without TLS at all. E.g., connection data sitting in the socket buffers refers to pages in the host VM page cache, which can be shared among multiple connections. With software kTLS, the data in the socket buffers must refer to private, per-connection encrypted copies, which increases RAM requirements.
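
To make the page-sharing point concrete, here is a very simplified sketch (illustrative only, not our production code) of the send path: sendfile(2) hands page-cache pages straight to the socket buffer, and on a socket with NIC kTLS keys installed the NIC encrypts those pages in-line on transmit, so the host never holds a per-connection encrypted copy.

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <errno.h>

    /* Queue one chunk of a file onto a non-blocking socket, zero-copy. */
    static ssize_t
    send_chunk(int content_fd, int sock, off_t off, size_t len)
    {
        off_t sent = 0;

        if (sendfile(content_fd, sock, off, len, NULL, &sent, 0) == -1 &&
            errno != EAGAIN)
            return (-1);
        /* 'sent' bytes are queued; the socket buffer just references
         * page-cache pages, so the same pages can back many connections. */
        return ((ssize_t)sent);
    }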


Thank you, I understood that efficient offload may eventually be possible.

Back when I was at NetApp, folks had researched splitting 4K chunks into 3 Ethernet packets (in the NetCache line) so that they'd fit happily, and issuing 3 I/Os on non-4K-aligned boundaries. There was also a similar issue of reassembling smaller I/Os into a bigger packet, because some disks had 512-byte blocks back then. The idea was to provide multiple scatter/gather entries and have the engine take care of reassembly.

Really looking forward to what interesting things happen in this space :)


A. Just curious, are these servers performing any work besides purely serving content? E.g. user auth, album art, show descriptions, etc.?

B. What’s the current biggest bottleneck preventing higher throughput?

C. Has everything been upstreamed? Meaning, if I were to theoretically purchase the exact same hardware, would I be able to achieve similar throughput?

(Amazing work, by the way, on these continued accomplishments. These posts over the years are always my favorite HN stories.)


a) These are CDN servers, so they serve CDN stuff. Some do serve cover art and those sorts of things.

b) Memory bandwidth and PCIe bandwidth. I'm eagerly awaiting Gen5 PCIe NICs and Gen5 PCIe / DDR5 based servers :)

c) Yes, everything in the kernel has been upstreamed. I think there may be some patches to nginx that we have not upstreamed (the SO_REUSEPORT_LB and TCP_REUSPORT_LB_NUMA patches).
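
For anyone curious what those nginx patches are about: SO_REUSEPORT_LB lets each worker open its own listen socket on the same port and have the kernel load-balance incoming connections across them, and the TCP-level NUMA option steers each listener toward connections that arrived on its own NUMA domain. A rough sketch of the listener setup (simplified, not the actual patch; see setsockopt(2) and tcp(4) for the exact semantics):

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    /* One of these per worker, each bound to the same address/port. */
    static int
    make_lb_listener(const struct sockaddr_in *sa, int numa_domain)
    {
        int s = socket(AF_INET, SOCK_STREAM, 0);
        int on = 1;

        if (s == -1)
            return (-1);
        /* Join the kernel's load-balancing group for this address/port
         * (must be set before bind). */
        setsockopt(s, SOL_SOCKET, SO_REUSEPORT_LB, &on, sizeof(on));
        if (bind(s, (const struct sockaddr *)sa, sizeof(*sa)) == -1 ||
            listen(s, 128) == -1)
            return (-1);
        /* Prefer handing this listener connections that arrived on the
         * worker's own NUMA domain. */
        setsockopt(s, IPPROTO_TCP, TCP_REUSPORT_LB_NUMA,
            &numa_domain, sizeof(numa_domain));
        return (s);
    }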


You may have been asked this before in older stories, but why FreeBSD, other than legacy reasons, and why not DPDK? Is it purely because you need the fastest TCP implementation?


At what point will it make more sense to use specialized hardware, e.g. a network card that can do encryption?


We already do: the Mellanox ConnectX-6 Dx with crypto support. It does inline crypto on TLS records as they are transmitted, which saves memory bandwidth compared to a traditional lookaside card, since the data doesn't have to make an extra round trip through host memory to a crypto engine and back.


What's the error rate, or uptime ratio, of those cards?


Were you assuming they were giant FPGA-based NICs? They are production server NICs, using ASICs with a reasonable power budget. I don't recall any failures.


Well, I wasn't, though I was expecting some non-zero number of failures.

That's pretty impressive if it's literally zero.

How many machines are deployed with NICs?


I don't have any visibility into how many DOA NICs we have, so I can't say whether Mellanox is better or worse on that front. But I do see most NIC-related tickets for NIC failures once machines are in production, and in general we've found Mellanox NICs to be very reliable.


1. I got excited when I saw arm64 mentioned. How competitive is it? Do you think it will be a viable alternative for Netflix in the future?

2. On AMD, did you play around with BIOS settings, like turbo, sub-NUMA clustering, or cTDP?


Arm64 is very competitive. As you can see from the slides, the Ampere Q80-30 is pretty much on-par with our production AMD systems.

Yes, I've spent lots of time in the AMD BIOS over the years, and lots of time with our AMD FAE (who is fantastic, BTW) poking at things.


Which NIC and driver combinations support kTLS offloading to the NIC?

How did you deal with the hardware/firmware limitations on the number of offloadable TLS sessions?


We use Mellanox ConnectX6-DX NICs, with the Mellanox drivers built into FreeBSD 14-current (which are also present in FreeBSD 13).


> We use Mellanox ConnectX6-DX NICs

Is there a plan to move to the ConnectX-7 eventually?

Depending on the bandwidth available, that'd be either 2x to get the same 800Gb/s as here, or perhaps eventually 4x to get 1600Gb/s.


Yes, I'm looking forward to CX7. And to other PCIe Gen5 NICs!


How much "U's" of space do ISP typically give you (e.g. 4U, 8U, etc)?


This is going to be a “how long is a piece of string?” kind of answer. Each ASN will be unique, and even within any large ISP there may be many OCA deployment sites (there won’t just be one for Virgin Media in the UK), and each site will likely have subtly different traffic patterns and content-consumption patterns, meaning the OCA deployment may be customized to suit, and the content pushed out (particularly to these NVMe-based nodes) will be tailored accordingly.

Since the alternative for an ISP is to be carrying the bits for Netflix further, the likelihood is they’ll devote whatever space is required because that’s much cheaper than backhauling the traffic and ingressing over either a settlement-free PNI or IXP link to a Netflix-operated cache site, or worse, ingressing the traffic over a paid transit link.

Meanwhile, on the flipside, since Netflix funds the OCA deployments they have a strong interest in not “oversizing” the sites. That said I’m sure there is an element of growth forecasting involved once a site has been operational for a period of time.


What filesystem(s) are you using for root and content?

And if ZFS, what options are you using?


We use ZFS for root, but not for content; for content we use UFS. This is because ZFS is not compatible with "zero-copy" sendfile: it uses its own ARC cache rather than the kernel page cache, so sending data stored on ZFS requires an extra data copy out of the ARC. It's also not compatible with async sendfile, as it does not have the methods required to call the sendfile completion handler after data is read from disk into memory.


>For content we use UFS

I found this extremely interesting. ZFS is almost a cure-all for what ails you WRT storage, but there is always something that even Superman can't do. Sometimes old-school is best-school.

Thanks for the presentation and QA!


Would Netflix benefit if ZFS were modified to use a unified cache to make zero-copy possible?


Possibly. Remember, the goal is to keep data read from the drives unmapped in the kernel address space, and hence never touched by the CPU, for efficiency. So we'd have to give up, or at least alter, ZFS's checksumming of each block as it is read.

The most interesting use of ZFS for us would be on servers with traditional hard drives, where ZFS is supposedly more efficient than UFS at keeping data contiguous on disk, thus resulting in fewer seeks and increased read bandwidth.


What prevents Linux from achieving the same bandwidth?


Not sure about all other optimisations, but Linux doesn't have support for async sendfile.


How involved was Netflix in the design of the Mellanox NIC? How many stakeholders does this type of networking hardware have, relatively speaking?

Also, what percentage of CDN traffic that reaches the user is served directly from your co-located appliances?


Could you do a mix of NIC-oriented siloing and disk-oriented siloing?

It seems like the bottlenecks are different, so if you serve X% of traffic on the NIC node and Y% on the disk node, you might be able to squeeze a bit more traffic?

Also, how amenable is this to real-time analysis? Could you look at a request that comes in on node A while the disk is on node B, tell which NIC is less loaded, and send the output through that NIC? (Some selection algorithm based on relative loading, anyway.)


Those are good ideas. The solution I was thinking of would be to do replication of the most popular content to both nodes.


Replicating popular content is good if there's a smallish core set of popular content. Then you'd be able to serve popular content on the NIC it comes in on without cross-NUMA traffic. It really depends on the traffic distribution, though: if you need to replicate a lot of content, then maybe you push too much other content out of cache.

It'd be neat if you could teach the page cache to do replication for you... Then you might use SF_NOCACHE for not-very-popular content, no option for medium content, and SF_NUMACACHE (or whatever) for content you wanted cached on the local NUMA node. I'm sure there are lots of dragons in there though ;)
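
In sendfile(2) flag terms, the idea sketches out roughly like this (SF_NOCACHE is a real flag today; SF_NUMACACHE is the made-up flag from above and does not exist):

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    enum popularity { COLD, WARM, HOT };

    /* Pick per-title sendfile(2) flags based on how popular a title is. */
    static int
    cache_flags(enum popularity p)
    {
        switch (p) {
        case COLD:
            return (SF_NOCACHE);   /* don't let one-off reads evict hot data */
        case HOT:
            /* return (SF_NUMACACHE);  hypothetical: replicate/pin on the
             * local NUMA domain -- no such flag exists today */
            /* FALLTHROUGH */
        case WARM:
        default:
            return (0);            /* normal page-cache behaviour */
        }
    }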


How did you generate those flamegraphs and what other tools did you use to measure performance?

My motivation for asking comes from these findings in the pdf,

Did the graph show the bottleneck contention on aio queue? Did the graph show that "a lot of time was spent accessing memory"?

What made FreeBSD a better platform than Linux for tackling this problem in the first place?

Thanks! Super interesting. I'm both a FreeBSD fan and someone with workloads that I'd love to benchmark to squeeze out more performance.


> How did you generate those flamegraphs and what other tools did you use to measure performance?

We have an internal shell script that takes hwpmc output and generates flamegraphs from the stacks. It also works with dtrace. I'm a huge fan of dtrace. I also make heavy use of lockstat, AMD uProf, and Intel VTune.

> Did the graph show the bottleneck contention on aio queue? Did the graph show that "a lot of time was spent accessing memory"?

See the graph on page 32 or so of the presentation. It shows huge plateaus in lock_delay called out of the aio code. It's also obvious from lockstat stacks (run as lockstat -x aggsize=4m -s 10 sleep 10 > results.txt).

See the graph on page 38 or so. The plateaus are mostly memory copy functions (memcpy, copyin, copyout).

We already use FreeBSD on our CDN, so it just made sense to do the work in FreeBSD.

The talk is on YouTube: https://youtu.be/36qZYL5RlgY


The flame graphs might be generated using Brendan Gregg's utilities; see https://www.brendangregg.com/flamegraphs.html


They are generated by a local shell script that uses the same helpers (stackcollapse*.pl, difffolded.pl). Our revision control says the script was committed by somebody else though. It existed before I joined Netflix.


* How is the DRM applied?
* Is the software that does the DRM open source, too?


There are a lot of slides and I am on my phone, so sorry if this was already addressed there.

How does Linux compare currently? I know in the past FreeBSD was faster, but are there any current comparisons?


If a "typical" NIC was used, what do you think the throughput would be?

I have to imagine considerably less (e.g. 100 Gb/s instead of 800).


Not the OP, but that's basically in the slides: the case where it's kTLS, but not NIC kTLS. Maybe you could optimize around the edges a bit more if NIC kTLS weren't an option.


Back of the envelope guess is ~400Gb/s. Each node has enough memory BW for about 240Gb/s; then factor in some efficiency loss for NUMA.


What do you mean by a typical NIC? These are COTS NICs anyone can buy.


Do you know if there is any documentation regarding interfacing with kTLS, e.g. to implement support in a new library?


The ktls(4) man page is a start. The reference implementation is OpenSSL right now. I added support to an internal Netflix library a while ago; I probably should have documented it at the time. For now, feel free to contact me via email with questions (the username in the URL, but @netflix.com).
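
The basic shape on the OpenSSL 3.x side looks roughly like this (a sketch of the publicly documented path, not our internal library):

    #include <sys/types.h>
    #include <openssl/ssl.h>
    #include <openssl/bio.h>

    /* Before the handshake: ask OpenSSL to program the TLS keys into the
     * kernel (and, if supported, the NIC) via ktls(4). */
    static void
    enable_ktls(SSL_CTX *ctx)
    {
        SSL_CTX_set_options(ctx, SSL_OP_ENABLE_KTLS);
    }

    /* After the handshake: if the kernel accepted the keys, send the file
     * zero-copy; otherwise the caller falls back to read()+SSL_write(). */
    static ossl_ssize_t
    send_file(SSL *ssl, int fd, off_t off, size_t len)
    {
        if (!BIO_get_ktls_send(SSL_get_wbio(ssl)))
            return (-1);
        return (SSL_sendfile(ssl, fd, off, len, 0));
    }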


For Linux, there is documentation at kernel.org: https://docs.kernel.org/networking/tls.html


How long will you be able to keep up with this near-yearly doubling of bandwidth used for serving video? :)


It depends on when we get PCIe Gen5 NICs and servers with DDR5 :)


Any current estimates on timing?


Not the OP, but PCIe5 NICs are already available in the market; I've seen people requesting help getting them to work on desktop platforms, which have PCIe5 as of the most recent chips. AFAIK, both AMD and Intel currently release desktop before server. I don't think there's a public release date for Zen4 server chips, but probably this quarter or next? Intel's release process is too hard for me to follow, but they've got desktop chips with PCIe5, so whenever those get to servers, that might be an option too.


The public release date for Zen4 server chips has been disclosed as November 10, FYI. https://www.servethehome.com/amd-epyc-genoa-launches-10-nove....

Looks like Intel's release is coming January 10. https://www.tomshardware.com/news/intel-sapphire-rapids-laun...


Wondering if there's a video presentation to go along with the slides?


This talk was given at this year's EuroBSDcon in Vienna; the recording is up on YouTube.

https://2022.eurobsdcon.org/

https://www.youtube.com/watch?v=36qZYL5RlgY

There were some really great talks this year from all the *BSDs; I highly recommend taking a look: https://www.youtube.com/playlist?list=PLskKNopggjc6_N7kpccFZ...


And is the video presentation on Netflix?


What tools do you use for load testing / benchmarking?


At a very basic microbenchmark level, I use stream, netperf, a few private VM stress tests, etc. But the majority of my testing is done using real production traffic.



