FreeBSD optimizations used by Netflix to serve video at 800Gb/s [pdf] (freebsd.org)
390 points by ltadeut on Nov 3, 2022 | 177 comments



Yet, in order to watch Netflix on FreeBSD, you have to jump through such hoops as "downloading either google chrome, vivaldi, or brave, and [using] a small shell script which basically creates a small jail for some ubuntu binaries that actually install widevine which is essential for viewing some DRM content such as Netflix" [1]

[1] https://www.youtube.com/watch?v=mBYor4wL62Q


Devil's advocate: the people who work on server engineering at Netflix don't exactly have much control over copyright holders being lawyer-brained man-children.


BitTorrent is probably easier. I just wish there was a good way to send money to the artists without also funding DRM enhancements.


So you want to send money to all the people who worked on the TV Show or the Movie you just downloaded?

I don’t think you realize how impractical that is. Take a look at the credits at the end of a movie some time. Or look up the list of people who worked on a particular episode of a show (yes, it can vary throughout a season).


It wouldn't be impractical if the studio planned ahead for it.

There could be the address of a smart contract at the end of the credits. Every time more than, say, $1000 piles up in that address, whatever is there gets dispensed to the contributors at the end of that month.

Plex could aggregate those addresses and tell you how to allocate your payment based on how you allocated your attention. Yes I know that's what Netflix does, but I control my Plex server. Nobody is then going to find additional ways to monetize that data.

I know it's unconventional, but I really don't think it's crazy to want to reward the creators of content that you consume while simultaneously not wanting to contribute towards the development of ecosystems that prevent people from being in control of their tech.


> It wouldn't be impractical if the studio planned ahead for it.

Studios already plan for this.

For a short time in the 80's, one of my mother's job responsibilities was making sure every single person involved in the production of a movie in the 1940's got their revenue check each quarter, whether it was for $50.00, or 12¢. Hundreds of people. Hundreds of checks.


Ok, so I've torrented a movie and I want to send in money so that next quarter one of those checks is $0.13 instead of $0.12. Where do I look in the credits to find the address to send it to?

Perhaps in the 80's it would've been impractical to pay her to multiplex hundreds of $1 input checks into the appropriate set of $50 or $0.12 output checks, but that's now a job that's easily done by a computer.


Certainly impractical for big-budget shows, but Patreon has proved the model works.


It doesn't have to be that impractical; even today you could include wallet addresses and a verification website with the movies.

Or perhaps IMDb could include wallet addresses on cast & crew pages.


Somebody has to get in contact with each person and get an address for them, plus handle cases where somebody loses their key and needs to register a different address.

Whoever does this is essentially part of the crew now and probably deserves to get paid too, but it would be an easy scam for them to just set up each contributor with a wallet that secretly they control. I'm not sure how to prevent that.


Plus pay for the people that put up their money in advance and risked losing it if the show flops.


>I don’t think you realize how impractical that is

If only we had some sort of distributed ledger that could programmatically send payments to anyone from anyone on the network, in almost any quantity large or small!


> which is essential for viewing some DRM content such as Netflix

Are you complaining that Netflix doesn't want people to pirate content, content they might have licensed from third parties under contracts that oblige them to prevent piracy?

Plus: is the development/resource cost of serving so few people on FreeBSD even worth it?

Note: I'm a huge FreeBSD fan. But consider this totally understandable on Netflix's part.


But it doesn’t prevent it from being pirated at all. You can get any Netflix release you want within minutes of release on any torrent site. Sometimes before the official release even.

It just makes normal people jump through hoops to watch the things they are trying to pay for. That’s a DRM issue in general though, I acknowledge this isn’t just a Netflix thing.


And I will stick to getting it that way for as long as DRM exists on the given platform. I'll still pay for the subscription, but I'm handling the data my way.


> I'll still pay for the subscription, but I'm handling the data my way.

Huh, that's an interesting take. I feel like something similar might end up being what you need to do with certain video games as well.

For example, I bought Grand Theft Auto IV as a boxed copy back when it came out (though most of my games are digital now). The problem is that the game expects Games For Windows Live to be present, which is now deprecated and some folks out there can't even launch the game anymore. It's pretty obvious what one of the solutions here is.


Me too. Especially because this same DRM will soon be used to uniquely identify and profile you when these streamers also become ad platforms.


What does DRM have to do with this? They'll connect what you watch on Peacock with what you watch on Netflix on your computer? Do you have a reference?


DRM can be used to uniquely identify your device. An example of using DRM for tracking in Android:

> When a device uses DRM for the first time, a device provisioning occurs, which means that the device will obtain a unique certificate and it will be stored in the DRM service of the device ... This provisioning profile has a unique ID, and you can obtain it with a simple call. This ID is not only the same on all apps, but also it is the same for all users of the device. So a guest account, for example, will also obtain the same ID, as opposed to the ANDROID_ID.

Source: https://beltran.work/blog/2018-03-27-device-unique-id-androi...


DRM makes zero sense when you can get any content from torrents in two minutes. It's not protecting anything; as a matter of fact, it's just pushing more people to download illegally, since the legitimate experience is so painful.

For example on Windows with Chrome you only get 720p playback for Netflix, complete nonsense.


Sure, but if there was no drm, there would probably just be a chrome extension you could install and rip/share content more readily than via BitTorrent.

I don't like it, but there is some logic to it. For business types, it isn't merely the existence of ripped copies, but the ease of creating and spreading them.


Yeah, but tell that to braindead content license owners.


FreeBSD is not a "desktop first" system and has strengths elsewhere. I've used it constantly for 20+ years. Sadly, my experiments with a FreeBSD desktop ended years ago, as there was always something "not working".


Typing this on a FreeBSD laptop.

Haven't tried using netflix on it though.


I don't think it needs to be said that while FreeBSD can serve as a daily driver for some people, it is insufficient for the vast majority of computer users in the world


Ok then, you didn't need to say it.


UNIX's strength was never the desktop experience; it was the server room.


That's just not true. It's not just Apple but also SGI; maybe not desktop exactly, but workstation for sure.


Not true for macOS.


You mean NeXTSTEP; everything that makes it unique isn't part of POSIX, and Steve Jobs had a quite clear position on the value of UNIX for desktop computing.


macOS is not a UNIX-based operating system. It used to be marketed as "UNIX" but only because it met the Open Group's POSIX standards.


Not only does macOS comply with the POSIX standard, it is also a fully certified UNIX system.

https://www.opengroup.org/openbrand/register/brand3683.htm


That is the problem with the BSD license: it effectively says "use my work and don't give anything back". Of course, the GPL gets violated too, but that would be very difficult for an American company like Netflix.


Since Netflix is NOT deploying its code/devices to users, it would be exactly the same with the GPL.

>it says "use my work and don't give anything back".

That's not written.

But compared to the GPL, you can read the whole BSD-2 license in a minute.


Just some napkin maths. ( Correct me if I am wrong )

Looking at the 800Gbps config: a Dell R7525 with dual 64C/128T CPUs and 4x ConnectX-6 Dx NICs (800Gbps total) in 2U.

With Zen 4c (128C), PCIe 5.0, and ConnectX-7, two nodes could fit into 2U, i.e. doubling to 1.6Tbps per 2U.

That is going from 16Tbps to 32Tbps per rack (using only 40U).

To put things in perspective: if every user were to use a 20Mbps stream at the same time (not going to happen, due to time-zone differences), the 250M Netflix subscribers worldwide would need 5,000,000,000 Mbps, or 5,000 Tbps. That is less than 200 racks to serve every single one of their customers on planet earth (ignoring storage). You could ship a rack to every region, state, nation, jurisdiction, or local ISP and exchange, and be done with it.
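
A quick sanity check of that arithmetic (the 20Mbps per stream and 32Tbps per rack figures are the assumptions from above, not measured numbers):

    # napkin math from above, as a sanity check
    subscribers = 250e6        # worldwide Netflix subscribers (assumed)
    per_stream_bps = 20e6      # 20Mbps per concurrent stream (assumed worst case)
    rack_bps = 32e12           # 32Tbps per rack with the hypothetical Zen 4c / CX-7 build

    total_bps = subscribers * per_stream_bps   # 5e15 bps = 5,000 Tbps
    racks = total_bps / rack_bps               # ~156 racks
    print(f"{total_bps / 1e12:.0f} Tbps total, ~{racks:.0f} racks")
    # -> 5000 Tbps total, ~156 racks, i.e. "less than 200 racks"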

I hope Lisa Su sends drewg123 and his team at Netflix some Zen 4c ASAP to play with, cough, I mean to help them test it.

Note: we have PCIe 6.0 (and 7.0) and DDR6 on the roadmap. The 200 racks could be down to 50 racks by the end of this decade, assuming Netflix is still streaming at the same bitrate.


There is the rather intriguing prospect of NVM Express over Fabrics (NVMe-oF): https://en.wikipedia.org/wiki/NVM_Express#NVMe-oF

Marvell Octeon 10 DPU (with an integrated 1 Terabit switch): https://www.marvell.com/content/dam/marvell/en/company/media...

Probably pretty soon you'll be able to chuck a few hot-swappable 100 TB Nimbus ExaDrives (https://nimbusdata.com/products/exadrive/) in there and call it a day. 1T in 1U. :)


Interesting to see that Infiniband is still kicking


Not really. Ethernet and Infiniband are both perfectly capable from a bandwidth perspective. Streaming video isn't remotely close to latency-bound, which is where Infiniband would be better suited.


Streaming video is about the perfect thing to send as you can cache it for days.

If only it could start playing faster.


> You could ship a Rack to every Region, State, Nation, Jurisdiction or Local ISP and Exchange and be done with it.

And when a single rack is down the whole region might be impacted negatively. The blast radius of such a setup would be huge!

As a CDN you really want to avoid this. Even for regular operations you need to be able to afford some servers going out of service for software update and other maintenance - without impacting availability, latency for users (the latter would happen if you route them to a complete different region), or infrastructure costs (which increase if you can't serve data from as close to the user as possible anymore, and have to pay for additional networking fees).

All major CDN providers have hundreds of regions (all with multiple hosts), and you can't really avoid that. You could run fewer hosts per region, but whether one is feasible will depend a lot on other parts of your system.


> That is going from 16Tbps to 32Tbps per Rack ... only need 200 racks

I doubt ISPs give an entire rack to Netflix. I wouldn't be surprised if they only get something like 4U total (hence why throughput per server is so important to Netflix).


Why not? It's the top bandwidth consumer for a retail ISP, and surely any reasonable amount of rack space is worth the savings in interconnect bandwidth.


They are frequently rack-space constrained, hence this super-dense hardware.


Some ISPs give a full rack, some don't. It depends on how much traffic they have and how willing they are.

But a lot of the racks sit at internet exchange points, where Netflix rents one or more racks at a time.


The minimum requirements

https://openconnect.zendesk.com/hc/en-us/articles/3600345383...

I think it depends on the size of the ISP; a full rack would probably be too much even for the biggest ISPs, but a single 4U too little.


Looking at the banner pic on their main page, they seem to have at least one ISP install of multiple racks in the wild. Also, doing a little reading on how "fill" of the devices works, they talk about doing peer-to-peer filling of appliances located at the same site, which leads me to believe that, even if not deploying a full rack, deploying multiple appliances to an ISP site is a relatively normal occurrence.

https://openconnect.netflix.com/en/peering/


You are probably overestimating Netflix traffic by a lot.

IX.br peak traffic is 20Tb/s, DE-CIX peak traffic is 14Tb/s, AMS-IX is around 11Tb/s.

The 800Gbps machine is probably enough for a country.

Netflix traffic stats at PIT Chile, this is their only peering connection in Chile: https://www.pitchile.cl/wp/graficos-con-indicadores/streamin...


This assumption misses out on all the private interconnect links and deployed OpenConnect appliances within ISP networks - a majority of Netflix's traffic today. IXes are only a small portion of overall internet traffic.


I notice people streaming in very low resolution without realising it, and sometimes intervene when the pain gets too great.

I'd be very surprised if the average bitrate was anywhere near that approximation.

However that wasn’t the point of the calculation, it was looking for a maximum.


Agreed, and 20Mbps is a reasonable rate with modern codecs for resolutions up to 4K for 99% of viewers.


Netflix 4K is pretty low bitrate. I haven’t checked recently but was not seeing anything approaching 20mbps last time I looked. That said, it looked pretty good.


You still need the premium package for this. It maxes at 1080p with no HDR on the standard package.


Back of the Napkin Zen4 / Genoa gets you to ~500GB/s PCIe and ~500GB/s of DRAM bandwidth -- nearly 4Tbps! Zen3/Rome is ~300GB/s PCIe and ~300GB/s DRAM -- about 2.4Tbps. A single 2U box with Genoa might scale to 1.25Tbps+ of useful Netflix traffic. We'll have to see what magic Drew can pull :)
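
One way to read that estimate (a sketch in Python; the GB/s figures are from the comment above, and the "useful traffic is roughly a third of raw DRAM bandwidth" ratio is just inferred from the ~800Gb/s achieved on current hardware, not from measurements):

    def tbps(gb_per_s):
        return gb_per_s * 8 / 1000      # 500 GB/s -> 4.0 Tbps

    current_raw = tbps(300)             # ~2.4 Tbps raw on today's boxes
    genoa_raw = tbps(500)               # ~4.0 Tbps raw on a Genoa box

    useful_ratio = 0.8 / current_raw    # ~800Gb/s served out of ~2.4Tbps raw
    print(f"~{genoa_raw * useful_ratio * 1000:.0f} Gb/s useful on Genoa")
    # -> roughly 1300 Gb/s, in the same ballpark as the 1.25Tbps+ guess above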


> That is less than 200 racks to serve every single one of their customers on planet earth. (Ignoring storage.)

If you're going to ignore storage, Netflix could just ship a low-end video server to every one of its customers and be done with it.

Every problem is an easy problem if you pretend the hard parts don't exist.


How much storage does Netflix actually need for its whole library?

It's got about 17,000 titles globally [1]. If they have copies in SD, 720p, HD, and 4K, that would be 68,000 versions (plus some extra audio tracks for stuff dubbed in multiple languages, but I suspect those are fairly minimal in terms of storage).

Let's assume those resolutions have bitrates of 5, 10, 15, and 20 Mbps respectively.

The average length of a Netflix original movie is ~90mins [2]

So that would require about 575TB in storage if I have done my maths correctly.

You would need about 20x30TB Kioxia CD6 SSDs for all that. Very expensive but definitely technically possible.

I could totally see it being possible to fit those drives in a single node to push the 800Gbps required, not increasing the overall rack requirement at all. (Not sure if the bandwidth from that many drives is enough; you might have to cache some of the most-watched stuff in RAM.)

Not gonna see any in-home boxes with all the titles preloaded any time soon, though. As a hard-drive array that's still 30x 20TB drives.

[1] https://www.comparitech.com/blog/vpn-privacy/netflix-statist...

[2] https://stephenfollows.com/netflix-original-movies-shows/#:~...)
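
For what it's worth, a quick check of that arithmetic (using the title count, bitrates, and 90-minute average assumed above):

    titles = 17_000
    bitrates_mbps = [5, 10, 15, 20]     # SD / 720p / HD / 4K versions
    runtime_s = 90 * 60                 # ~90 minutes per title

    bits_per_title = sum(b * 1e6 for b in bitrates_mbps) * runtime_s
    total_tb = titles * bits_per_title / 8 / 1e12
    print(f"{total_tb:.0f} TB")         # -> ~574 TB, i.e. the ~575TB figure above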


> 575TB

I wouldn't be surprised if there were a few bored and cashed-up engineers with private Plex servers this big out there. 575TB is way smaller than I would have guessed, although 575TB of SSD is still much more expensive than 575TB of HDD.

I guess it makes sense - Netflix's library isn't that big.


In the article you link it refers to Squid Game as "just one of the platform’s many titles" so that minutes-per-title estimate is probably super low since series are gonna be heavy hitters there compared to films.

It's probably larger than we'd guess (there's likely a lot of device-specific stupid codec profile crap that may cause extra copies too).

And - even more interestingly to me - it's not a stationary target. Gotta take mobile consumption into account too!


Do they keep the global library on every server? I guess they partition it geographically.


In their OpenConnect network they keep the most demanded titles and the latest releases. And IIRC that refreshes nightly (with new releases and whatever is hot that day)

https://openconnect.zendesk.com/hc/en-us/articles/3600356180...


Netflix is more likely to use a single box with this kind of throughput at any given POP than a rack of them. For bigger installations they can use cheaper, less throughput-dense hardware (although I don't know if they do).


Take a look at the hardware, it isn't particularly expensive stuff.


Aside from GPUs, I'm not sure how you would increase the cost density much. Those NICs doing hundreds of Gbps and TLS aren't cheap, nor are the fast SSDs needed to sustain the load, nor is RAM or top end AMD server CPUs. Of course, the cost is absolutely worth it to Netflix!


Yes, but it's still just one box; if you're building a cluster of cheaper machines you need more of everything. A high-end server vs a cluster of 10 machines: the 10 machines wouldn't be cheaper to get to the same throughput. It's not alien specialized supertech, it's just top-of-the-line commodity hardware. (10 is just an example number here.)


I mean, I guess I disagree with your stipulation that you couldn't lower total costs somewhat using slightly more, slightly lower-end hardware, if rack space were cheap.

> top of the line commodity hardware

Yeah -- cost in commodity hardware scales super-linearly with performance.


With one box you don't have to buy a $15,000 network switch (or two); that's significant. You also don't have to pay for N chassis, motherboards, NICs, and NVMe drives.

Considering they're really eking every last bit of juice out of the box, I'm doubtful that distributing it would be cheaper.

Also, they're not maxing the CPU out; they just need memory bandwidth and the Mellanox NICs. Storage would be more expensive across more boxes since they can't distribute it; they have to use local storage to reach the performance they're after.


> Storage would be more expensive across more boxes since they can't distribute it; they have to use local storage to reach the performance they're after.

Not necessarily. To a first approximation, if they can get 800Gbps out of the disks in a single box, they could split those disks over, say, 8x 100Gbps boxes and get the same performance out of the disks for the same price. Once you split it into 8 boxes, maybe instead of giving each box 2x 16TB PCIe 4.0 x4 drives, you give them 4x 8TB PCIe 3.0 x4 drives. Half-size and older-generation drives are likely to be less than half the cost. Netflix does have ways to segment the cache among their appliances, so they wouldn't need the same storage capacity on each box as on the combined box.

It's certainly a procurement analysis to see if those savings will add up to overall savings, and there's a good chance they won't; especially if you need to add an 800Gbps network switch. You do often get a pretty good cost savings by having two single-socket servers vs one dual-socket server, though; again, probably not if you have to add a switch.
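
A rough check of the disk-bandwidth side of that trade-off (the per-lane PCIe throughput numbers are approximate, and the drive configurations are the hypothetical ones from the comment above):

    # does each 100Gbps box have enough NVMe read bandwidth in either config?
    need_gb_per_s = 100 / 8                  # 100Gb/s of video = 12.5 GB/s of reads

    gen3_x4_drive = 4 * 0.985                # ~3.9 GB/s usable per PCIe 3.0 x4 drive
    gen4_x4_drive = 4 * 1.969                # ~7.9 GB/s usable per PCIe 4.0 x4 drive

    split_box = 4 * gen3_x4_drive            # 4x 8TB Gen3 drives  -> ~15.8 GB/s
    combined_share = 2 * gen4_x4_drive       # 2x 16TB Gen4 drives -> ~15.8 GB/s
    print(need_gb_per_s, round(split_box, 1), round(combined_share, 1))
    # either config has ~15.8 GB/s of drive bandwidth against ~12.5 GB/s needed,
    # so the split mostly trades drive generation/size rather than headroom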


The people doing this might also be doing infra as code for the virtualization layer on the hardware itself - which this might not be able to satisfy. At minimum they surely have a ton of this stuff deployed already so changing hardware specs big time might not be worth the cost.

Also are you taking into account encryption for those specs?


Author here, happy to answer questions


Read the presentation. I had some super-noob-level questions.

Is the RAM mostly used by page content read by the NICs due to kTLS?

If there was better DMA/Offload could this be done with a fraction of the RAM? (NVME->NIC)

If there was no need to TLS, would the RAM usage drop dramatically?


These are actually fantastic questions.

Yes, the RAM is mostly used by content sitting in the VM page cache.

Yes, you could go NVMe->NIC with P2P DMA. The problem is that NICs want to read data one TCP MSS (~1448 bytes) at a time, while NVMe really wants to speak in 4K-sized chunks. So there need to be some buffers somewhere. It might eventually be CXL-based memory, but for now it is host memory.

EDIT: missed the last question. No, with NIC kTLS, the host RAM usage is about the same as it would be without TLS at all. Eg, connection data sitting in the socket buffers refers to pages in the host vm page cache which can be shared among multiple connections. With software kTLS, data in the socket buffers must refer to private, per-connection encrypted data which increases RAM requirements.
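
To make those two points concrete (a toy calculation; the connection count and socket-buffer size are made-up illustrative numbers, not Netflix figures):

    mss = 1448                          # typical TCP payload per segment
    nvme_block = 4096                   # NVMe's preferred I/O granularity
    print(nvme_block / mss)             # ~2.83 segments per block: the boundaries
                                        # never line up, so data has to land in host
                                        # (or eventually CXL) memory in between

    # software kTLS: every connection holds its own encrypted copy in its socket buffer
    connections = 400_000               # illustrative only
    sockbuf = 2 * 1024 * 1024           # 2 MiB per connection, illustrative only
    print(connections * sockbuf / 2**40)
    # ~0.76 TiB of per-connection copies that NIC kTLS avoids, since its socket
    # buffers just reference shared, unencrypted page-cache pages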


Thank you, I understood that efficient offload may eventually be possible.

Back when I was at NetApp, folks had researched splitting 4K chunks into 3 Ethernet packets (on the NetCache line) so that they'd fit neatly, issuing 3 I/Os on non-4K-aligned boundaries. There was also a similar issue reassembling smaller I/Os into a bigger packet, because some disks had 512-byte blocks back then. The idea was to allow multiple gather/scatter entries and have the engine take care of reassembly.

Really looking forward to what interesting things happen in this space :)


A. Just curious, are these servers performing any work besides purely serving content? Eg user auth, album art, show description, etc?

B. What's the current biggest bottleneck preventing higher throughput?

C. Has everything been upstreamed? Meaning, if I were to theoretically purchase the exact same hardware, would I be able to achieve similar throughput?

(Amazing work, by the way, on these continued accomplishments. These posts over the years are always my favorite HN stories.)


a) These are CDN servers, so they serve CDN stuff. Some do serve cover art, etc sorts of things.

b) Memory bandwidth and PCIe bandwidth. I'm eagerly awaiting Gen5 PCIe NICs and Gen5 PCIe / DDR5 based servers :)

c) Yes, everything in the kernel has been upstreamed. I think there may be some patches to nginx that we have not upstreamed (SO_REUSEPORT_LB patches, TCP_REUSPORT_LB_NUMA patches).


You may have been asked this before in older stories, but why FreeBSD, other than legacy reasons, and why not DPDK? Is it purely because you need the fastest TCP implementation?


At what point will it make more sense to use specialized hardware, e.g. network card that can do encryption?


We already do: the Mellanox ConnectX-6 Dx with crypto support. It does inline crypto on TLS records as they are transmitted, which saves memory bandwidth compared to a traditional lookaside card.


What's the error rate, or uptime ratio, of those cards?


Were you assuming they were giant FPGA-based NICs? They are production server NICs, using ASICs with a reasonable power budget. I don't recall any failures.


Well I wasn't, though I was expecting some non-zero amount of failures.

That's pretty impressive if it's literally zero.

How many machines are deployed with NICs?


I don't have any visibility into how many DOA NICs we have, so I can't say whether Mellanox is better or worse on that point. But I do see most NIC-related tickets for NIC failures once machines are in production. In general, we've found Mellanox NICs to be very reliable.


1. I got excited when I saw arm64 mentioned. How competitive is it? Do you think it will be a viable alternative for Netflix in the future?

2. On amd, did you play around with BIOS settings? Like turbo, sub-numa clustering or cTDP?


Arm64 is very competitive. As you can see from the slides, the Ampere Q80-30 is pretty much on-par with our production AMD systems.

Yes, I've spent lots of time in the AMD BIOS over the years, and lots of time with our AMD FAE (who is fantastic, BTW) poking at things.


Which NIC and driver combinations support kTLS offloading to the NIC?

How did you deal with the hardware/firmware limitations on the number of offloadable TLS sessions?


We use Mellanox ConnectX6-DX NICs, with the Mellanox drivers built into FreeBSD 14-current (which are also present in FreeBSD 13).


> We use Mellanox ConnectX6-DX NICs

Is there a plan to move to the Connect X-7 eventually?

Depending on the bandwidth available, that'd be either 2x to get the same 800Gb/s as here (or perhaps eventually with 4x to get 1600Gb/s).


Yes, I'm looking forward to CX7. And to other pcie Gen5 NICs!


How many "U"s of space do ISPs typically give you (e.g. 4U, 8U, etc.)?


This is going to be a “How long is a piece of string?”. Each ASN will be unique, and even within any large ISP, there may be many OCA deployment sites (there won’t just be one for Virgin Media in UK) and each site will likely have subtly different traffic patterns and content consumption patterns, meaning the OCA deployment may be customized to suit, and the content pushed out (particularly to these NVME-based nodes) will be tailored accordingly.

Since the alternative for an ISP is to be carrying the bits for Netflix further, the likelihood is they’ll devote whatever space is required because that’s much cheaper than backhauling the traffic and ingressing over either a settlement-free PNI or IXP link to a Netflix-operated cache site, or worse, ingressing the traffic over a paid transit link.

Meanwhile, on the flipside, since Netflix funds the OCA deployments they have a strong interest in not “oversizing” the sites. That said I’m sure there is an element of growth forecasting involved once a site has been operational for a period of time.


What filesystem(s) are you using for root and content?

And If ZFS, what options are you using?


We use ZFS for root, but not content. For content we use UFS. This is because ZFS is not compatible with "zero-copy" sendfile, since it uses its own ARC cache rather than the kernel page cache, meaning sending data stored on ZFS requires an extra data copy out of the ARC. It's also not compatible with async sendfile, as it does not have the methods required to call the sendfile completion handler after data is read from disk into memory.
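
For anyone unfamiliar with the zero-copy path being described, a minimal sketch (Python only for brevity; os.sendfile wraps the same sendfile(2) syscall, and on FreeBSD the kernel hands page-cache pages straight to the socket without copying them through userspace):

    import os
    import socket

    def serve_file(conn: socket.socket, path: str) -> None:
        # send one file over an already-accepted connection using sendfile(2)
        with open(path, "rb") as f:
            size = os.fstat(f.fileno()).st_size
            offset = 0
            while offset < size:
                # pages come from the kernel page cache; UFS plays along, but ZFS's
                # separate ARC is why the content disks stay on UFS (per the answer above)
                sent = os.sendfile(conn.fileno(), f.fileno(), offset, size - offset)
                if sent == 0:
                    break
                offset += sent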


>For content we use UFS

I found this extremely interesting. ZFS is almost a cure-all for what ails you WRT storage, but there is always something that even Superman can't do. Sometimes old-school is best-school.

Thanks for the presentation and QA!


Would Netflix benefit if ZFS were modified to use a unified cache to make zero-copy possible?


Possibly. Remember the goal is to keep data read from the drives unmapped in the kernel address space, and hence never accessed by the CPU for efficiency. So we'd have to give up, or at least alter, zfs to not do checksums each time a block is read.

The most interesting use of ZFS for us would be on servers with traditional hard drives, where ZFS is supposedly more efficient than UFS at keeping data contiguous on disk, thus resulting in fewer seeks and increased read bandwidth.


What prevents Linux from achieving the same bandwidth?


Not sure about all other optimisations, but Linux doesn't have support for async sendfile.


How involved was Netflix in the design of the Mellanox NIC? How many stakeholders does this type of networking hardware have, relatively speaking?

Also, what percentage of CDN traffic that reaches the user is served directly from your co-located appliances?


Could you do a mix of NIC oriented siloing and Disk oriented siloing?

It seems like the bottlenecks are different, so if you serve X% of traffic on the NIC node and Y% on the disk node, you might be able to squeeze a bit more traffic?

Also, how amenable to real-time analysis is this? Could you look at a request that comes in on node A where the disk is on node B, tell which NIC is less loaded, and send the output through that NIC? (Some selection algorithm based on relative loading, anyway.)


Those are good ideas. The solution I was thinking of would be to do replication of the most popular content to both nodes.


Replicating popular content is good if there's a smallish core set of popular content. Then you'd be able to serve popular content on the NIC it comes in on without cross NUMA traffic. Really depends on traffic distribution though, if you need to replicate a lot of content, then maybe you push too much content out of cache.

It'd be neat if you could teach the page cache to do replication for you... Then you might use SF_NOCACHE for not-very-popular content, no option for medium content, and SF_NUMACACHE (or whatever) for content you wanted cached on the local NUMA node. I'm sure there's lots of dragons in there though ;)


How did you generate those flamegraphs and what other tools did you use to measure performance?

My motivation for asking comes from these findings in the pdf,

Did the graph show the bottleneck contention on aio queue? Did the graph show that "a lot of time was spent accessing memory"?

What made freebsd a better platform compared to Linux to begin tackling this problem?

Thanks! Super interesting. I'm a FreeBSD fan, and I have workloads I'd love to benchmark to squeeze out more performance.


> How did you generate those flamegraphs and what other tools did you use to measure performance?

We have an internal shell script that takes hwpmc output and generates flamegraphs from the stacks. It also works with dtrace. I'm a huge fan of dtrace. I also make heavy use of lockstat, AMD uProf, and Intel Vtune.

> Did the graph show the bottleneck contention on aio queue? Did the graph show that "a lot of time was spent accessing memory"?

See the graph on page 32 or so of the presentation. It shows huge plateaus in lock_delay called out of the aio code. It's also obvious from lockstat stacks (run as lockstat -x aggsize=4m -s 10 sleep 10 > results.txt).

See the graph on page 38 or so. The plateaus are mostly memory copy functions (memcpy, copyin, copyout).

We already use FreeBSD on our CDN, so it just made sense to do the work in FreeBSD.

The talk is on Youtube https://youtu.be/36qZYL5RlgY


The flame graphs might be generated using Brendan Gregg's utility, see https://www.brendangregg.com/flamegraphs.html


They are generated by a local shell script that uses the same helpers (stackcollapse*.pl, difffolded.pl). Our revision control says the script was committed by somebody else though. It existed before I joined Netflix.


* How is the DRM applied?
* Is the software that does the DRM open source, too?


There are a lot of slides and I am on my phone, so sorry if it was addressed in the slides.

How does Linux compare currently? I know in the past FreeBSD was faster, but are there any current comparisons?


If a "typical" NIC was used, what do you think the throughput would be?

I have to imagine considerably less (e.g. 100 Gb/s instead of 800).


Not the OP, but that's basically in the slides. When it's kTLS, but not NIC kTLS. Maybe you could optimize that a bit more around the edges if NIC kTLS wasn't an option.


Back-of-the-envelope guess is ~400Gb/s. Each node has enough memory BW for about 240Gb/s; then factor in some efficiency loss for NUMA.
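
Spelling that guess out (the per-node figure is from the sentence above; the NUMA penalty is just an assumed round number):

    per_node_gbps = 240        # memory-bandwidth-limited per NUMA node with software kTLS
    numa_nodes = 2
    numa_efficiency = 0.85     # assumed loss from cross-node traffic
    print(per_node_gbps * numa_nodes * numa_efficiency)   # ~408, i.e. "~400Gb/s"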


What do you mean by a typical NIC? These are COTS NICs anyone can buy.


Do you know if there is any documentation on interfacing with kTLS, e.g. to implement support for it in a new library?


The ktls(4) man page is a start. The reference implementation is OpenSSL right now. I did support for an internal Netflix library a while ago, I probably should have documented it at the time. For now feel free to contact me via email with questions (the username in the URL, but @netflix.com)


For Linux, there is documentation at kernel.org: https://docs.kernel.org/networking/tls.html
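
If you just want to poke at kTLS from userspace without going straight to the OpenSSL C API, Python 3.12's ssl module exposes a toggle for it (a sketch; "server.pem" is a placeholder, and whether the kernel path is actually used depends on the OpenSSL build and on the OS having kTLS support, i.e. ktls(4) on FreeBSD or the kernel.org doc above on Linux):

    import socket
    import ssl

    # ask OpenSSL to hand TLS record processing to the kernel after the handshake
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    ctx.load_cert_chain("server.pem")        # placeholder cert/key bundle
    ctx.options |= ssl.OP_ENABLE_KTLS        # maps to OpenSSL's SSL_OP_ENABLE_KTLS

    srv = socket.create_server(("0.0.0.0", 8443))
    conn, _ = srv.accept()
    with ctx.wrap_socket(conn, server_side=True) as tls:
        # with kTLS active, bulk writes (and sendfile) are encrypted in the kernel
        tls.sendall(b"hello over kTLS (maybe)\n")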


How long will you be able to keep up with this near yearly doubling of bandwidth used for serving video? :)


It depends on when we get PCIe Gen5 NICs and servers with DDR5 :)


Any current estimates on timing?


Not the OP, but PCIe5 NICs are already available in the market; I've seen people requesting help getting them to work on desktop platforms which have PCIe5 as of the most recent chips. AFAIK, currently, both AMD and Intel release desktop before server; I don't think there's a public release date for Zen4 server chips, but probably this quarter or next? Intel's release process is too hard for me to follow, but they've got desktop chips with PCIe5, so whenever those get to the server, then that might be an option too.


Public release date for Zen4 server has been disclosed for November 10, FYI. https://www.servethehome.com/amd-epyc-genoa-launches-10-nove....

Looks like Intel's release is coming January 10. https://www.tomshardware.com/news/intel-sapphire-rapids-laun...


Wondering if there's a video presentation to go along with the slides?


This talk was given at this year's EuroBSDcon in Vienna; the recording is up on YouTube.

https://2022.eurobsdcon.org/

https://www.youtube.com/watch?v=36qZYL5RlgY

Some really great talks this year from all the *BSDs, highly recommend taking a look: https://www.youtube.com/playlist?list=PLskKNopggjc6_N7kpccFZ...


And is the video presentation on Netflix?


What tools do you use for load testing / benchmarking?


At a very basic microbenchmark level, I use stream, netperf, a few private VM stress tests, etc. But the majority of my testing is done using real production traffic.


Recording of the presentation can be found here: https://www.youtube.com/watch?v=36qZYL5RlgY

Pretty cool stuff


And to think not that long ago I remember being excited when the V.92 standard was released and I could get 56 kb/s on my dial-up connection:

* https://en.wikipedia.org/wiki/V.92


How about the marvel that was Walnut Creek's cdrom.com that served 10,000 simultaneous FTP connections back in 1999? [1]

I was always blown away by how much more efficient FreeBSD's network stack was compared to Linux at the time. It convinced me to go FreeBSD-only for a few years.

[1] http://www.kegel.com/c10k.html


> compared to Linux at the time

Do you consider that not to still be the case?


Before 2003, FreeBSD was definitely both faster and more reliable than Linux, especially for networking or storage applications.

After that, Intel and AMD introduced cheap multi-threaded and multi-core CPUs. Linux was adapted very quickly to work well on such CPUs, but FreeBSD struggled for many years before reaching acceptable performance on multi-threaded or multi-core CPUs, so it became much slower than Linux.

Later, the performance gap between Linux and FreeBSD has diminished continuously, so now there is no longer any large difference between them.

Depending on the hardware and on the application, either Linux or FreeBSD can be faster, but in the majority of the cases the winner is Linux.

Despite that, for certain applications there may be good reasons to choose FreeBSD, even where it happens to be slower than Linux.


FreeBSD was held back by limited TCP options around when packet mobile internet (GPRS) came along. That was around 2003 too.

I remember noticing Yahoo properties being almost unusable over GPRS because they did packet loss detection and recovery in such basic ways, e.g. no SACK.


Any settings for today's connections on capped mobile data? 2.7 KB/s max.


> Depending on the hardware and on the application, either Linux or FreeBSD can be faster, but in the majority of the cases the winner is Linux.

I'm not denying this, but do you have a source? I've been trying to find modern "Linux vs FreeBSD" performance tests but haven't been super successful. Mostly I find things from the early 2000s when FreeBSD had a clear lead.



Thank you very much.


> Depending on the hardware and on the application, either Linux or FreeBSD can be faster, but in the majority of the cases the winner is Linux.

Do you have any data to back that up? Everything I've seen recently and my own experience tells me this isn't the case but I also don't have any data to back up my position. Would love to find some good data on this either way.


I have been using continuously both FreeBSD and Linux since around 1995, since FreeBSD 2.0 and some Slackware Linux distribution.

In the early years, I have run many benchmarks between them, in order to choose the one that was the best suited for certain applications.

However, during the last decade, I did not bother to compare them any more, because now the main reasons why I choose one or the other do not include the speed.

Even though I have, right beside me, several computers with FreeBSD and several with Linux, it would not be easy for me to run any benchmark, because they have very different hardware, which would influence the results much more than the OS.

For all the applications where I use FreeBSD (for various networking and storage services), its performance is adequate, and I use it instead of Linux for other reasons, not depending on whether it might be faster or slower.

In the applications where computational performance is important, I use Linux, but that is not due to some benchmark results, but because some commercial software is available only for Linux, e.g. CUDA libraries or FPGA design programs.

Many benchmark results comparing FreeBSD and Linux may be influenced more by the file systems used than by the OS kernel.

I have seen recently some benchmark comparing FreeBSD and Linux for a database application dominated by SSD I/O, but I cannot remember a link to it.

The only file system shared by Linux and FreeBSD is ZFS. With ZFS, the benchmark results were similar for Linux and FreeBSD. However, FreeBSD was faster when using UFS, and Linux was much faster when using either XFS or EXT4 (BTRFS was much slower than ZFS). Such a benchmark was much more influenced by the file system than by the operating system.

In conclusion, it is very hard to make a good comparison between FreeBSD and Linux, because you need identical hardware, which must be restricted to the shorter list that is well supported by FreeBSD, and you need to run some micro-benchmark testing some kernel system calls.

Otherwise, the result may depend more on the supported software, hardware or file systems, than on the OS kernel.


Right, exactly, which is why it's hard to find data. But I'd love to see someone who has tried to limit the variables to just the network stack, to figure out whether one network stack is better than the other.

But you're right, in the end you just have to set up both for your particular use case with the best optimizations each has to offer and see which performs better.


The web runs on Linux, as do most FAANG servers, so with that much more money, people, and R&D it makes sense that this OS is faster. A conservative number would be that 99.9% of the web runs on Linux, and it's probably much higher.

At the scale of Google/MS/Amazon/Apple, if servers ran faster on BSD they would use it. We're talking about tens of millions of servers here.

https://www.phoronix.com/review/bsd-linux-eo2021/7

It gives you a pretty clear picture.


Based on that logic, Windows is the superior operating system and always has been, because it's always been used by more people on their desktop than anything else.

There are a lot more factors involved in OS choice that could drive popularity other than the speed of the network stack. And BTW, Hotmail runs on BSD. MacOS is a fork of BSD. And Yahoo ran on BSD (and may still).


I did not even mention the 1B+ Android phones running Linux. Also, all of the top 500 HPC systems run Linux: https://www.top500.org/statistics/details/osfam/1/

Hotmail running BSD is long dead; I doubt MS is using BSD at all. They're using the same stack as outlook.com, which is not BSD.

I'm not saying Linux is better, just that it is overall faster because of an order of magnitude more $$$ and contributions.


Netflix distributes a fairly large chunk of the internet traffic and they do it using FreeBSD.

I suspect most system administrators and engineers use Linux because that's what everyone else is using.


It's nice to see someone actually still does proper engineering instead of farting something about cloud and webscale and just throwing money at a problem.


From what I can see in a quick search (and from this presentation), Netflix only uses FreeBSD for serving video and they run these servers themselves in their own datacenters I guess. In contrast their apps on EC2 use Linux [0]. Sounds like the time has not yet come when AWS is paying anyone full time to support FreeBSD on EC2.

[0] https://twitter.com/brendangregg/status/1412201241472471048


Netflix works because they move content close to the users. This is done by either having the ISP establish a peering connection directly to Netflix hosted servers or by having the ISPs host "Open Connect Appliances" which cache the most requested content. These appliances are based on FreeBSD.

The AWS egress savings from this setup must be immense.

https://openconnect.netflix.com/


Yup, cloud bandwidth is insanely expensive compared to what you actually pay to get a link to your datacenter.

And you pay either by 95th percentile (basically "peak usage") or for the whole link, not per megabyte sent.


cperciva, whom you link, has worked quite a bit on EC2 support for FreeBSD, a lot of it documented on his blog [0] and supported by patrons at [1].

But yeah, it would be nice if there were someone who could work on it full time.

[0]: https://www.daemonology.net/blog/2022-03-29-FreeBSD-EC2-repo...

[1]: https://www.patreon.com/cperciva


Yep! In the thread he describes how he alone is not enough.


What does it mean to support FreeBSD on EC2? Surely it's just KVM, so you can run whatever you want?


It means, for example, writing a FreeBSD kernel driver for the Elastic Network Adapter (ENA). Both the Linux and FreeBSD kernel drivers are available at https://github.com/amzn/amzn-drivers


And here I sit like a chump with my home server connected to a 100MBit switch. (I paid for that switch, and I'm not replacing it until it gives up the ghost.) (And before you ask, the server also runs FreeBSD, and I'm very happy with the result.)


Bring it to the max with multipath ;) Since you already have FreeBSD, there's no need to throw those beautiful, reliable things away; maybe just buy a second... third? dirt-cheap 100MBit card:

https://en.wikipedia.org/wiki/Multipath_TCP#Implementation


The server has a second NIC, but the switch has no more free ports. I briefly thought of bonding, but stopped when I read that the switch would need to support it (which it almost certainly does not).

But my point was that for my requirements, 100MBit are actually sufficient and FreeBSD still is a good choice for me, I was just being snarky about it. (I do find it aesthetically displeasing, though, that my wifi is now faster than my wired network, but I can live with that.)


>the switch would need to support it

That's what I was thinking, which is why I brought up multipath and not LACP ;)


Alrighty then. I'll look into multipath.


I understand the motivation, but $20 gets you an 8-port gigE switch, so it seems like the wrong hill to die on. :)


I know, but so far 100MBit is sufficient, actually, I rarely move Gigabytes of data around. When it becomes annoying, I'll get a new switch, but so far the pressure is really low.


It amazes me that Netflix is capable of such top-of-the-line engineering (really mindblowing stuff, one machine that streams nearly 1 Terabit per second), but is for the love of god unable to stream HD content to my iPhone (newest firmware, everything up to date). Tried everything: gigabit wifi, cellular, multiple ISPs...

It is better for me to pirate their content, play it with Plex, and be happy. I pay for Netflix, but still have to download it to see it in acceptable quality. Absurd. The support couldn't help. It doesn't affect me, because I have my torrent/Plex setup, but for 99.9% of people it is a subpar experience.

I think the best years are over for Netflix. The hard awakening is here: they have to make content that the users want, and they are a movie/TV content company, not primarily a "tech company".


> unable to stream HD Content to my iPhone

Yeah this has been the case since forever. It prioritizes instant playback vs forcing 1080p or similar.

Can't speak for iPhone, but on iPad, I've moved to using the website, which goes to 1080p immediately.

> still have to download it, to see it an acceptable quality

Downloaded content does have a whole lot more compression than streaming at the max phone-supported quality, so just a tiny FYI.


You live in a bubble. The vast majority of the world likely cannot even tell the difference between HD and 4K. Netflix continues to grow its content and retain subscribers.


Netflix is a media company as I said.

Well, 4K vs HD you are right about, but 480p on a Retina display right in front of me? Really obvious.


I really think Netflix could make some good money being a multimedia-cdn (even for "competitors")


I thought the same thing 10 years ago when I worked there. At the time management was not interested in losing focus on doing anything other than streaming movies to customers.

But it should be noted that the FreeBSD Openconnect boxes are highly optimized to Netflix's use case. Which is serving a predefined set of content that has been pre-rendered. Youtube and its ilk are a completely different use case.

The Netflix cache is so optimized for serving Netflix movies that for many years we still used Akamai for all of our other CDN needs, but it looks like they may have finally moved that to Netflix's own CDN now.


Wow, that is actually a really interesting idea in the context of developing a YouTube competitor. Delivery & bandwidth are a really high barrier to entry, and piggy-backing off of Netflix's existing network could really lower those costs. I agree the "providing services to your direct competition" is probably a stumbling block, and likely Netflix has other irons in the fire. But anyway it's a cool idea to think about.


It works with such high efficiency because we know how to place content in advance, and the catalog is relatively small. Trying to serve 800Gbps of YouTube content would be a nightmare.


Indeed. YT has a much different problem, which is to determine which video is going to go viral, and then transcode it into popular formats when it does.

In comparison, we pre-transcode everything to exacting standards, so all our CDN has to do is serve static files.


I'm not talking just about YouTube, but also about serving Disney, Hulu, and ESPECIALLY national/continental portals like Arte.tv, Play SRF, ARD Mediathek, etc.


I wonder how much of YT traffic is the "big" (say >200k viewers in a month) vs the small guys.

But yeah, once your hot data size exceeds the cache, bye-bye efficiency.


One hard part is that on the YouTube side, most views occur within the first 48 hours or so, and a good fraction occur within the first 6. Netflix has a catalogue of ~5,000 videos and gets <200 new ones per month. YouTube has around 30k channels with more than 500k subscribers, so that's somewhere around 30k new videos per week.


I come from the time when the first internet connection my house had was a 56k modem...just before cable modems/DOCSIS started rolling out in the midwest. These speeds are somewhat mind boggling to me. (Yeah, yeah, datacenter vs home, but it's still somewhat hard to imagine saturating pipes like those.)

While standing in a state of mild awe at 800Gb/s, I read reviews and consider upgrading my house to 2.5Gb/s equipment... Should I just wait for 10Gbit to get a bit cheaper? Should I ditch copper and go fiber like that guy who was on the front page here recently (probably not, but that was cool)? Maybe raw single-core CPU performance is starting to level off a bit, but it seems that networking technologies are still advancing at a rapid clip!


Fiber 10Gb is very cheap. NICs and SFPs from eBay, fibre from FS.com in whatever length you want. I got a plenum-rated 100 ft 4-pair cable from FS.com for $100 or so, and it was only that expensive because of the plenum rating, as it runs through my cold-air returns.


There should be a EuroBSDcon video about this, and there is:

https://www.youtube.com/watch?v=36qZYL5RlgY


What is the Gb/s per watt of power between two 400Gb/s servers and a single 800Gb/s one?

I've been following these reports since 2015, when I compared the estimated cost of your 9Gb/s server to an F5 load balancer :)


Related to slide 4...

How much does Netflix donate to the FreeBSD Foundation relative to their profits?


"Netflix does contribute financially to the FreeBSD Foundation and has done so since 2012. Last year they engaged at the "platinum" level with contributing more than $50,000+ USD to the foundation." (2019)

Took about five seconds to Google, it's the first result for "netflix donations to freebsd".

NFLX Q3 2019 revenue was about $5.2B.

So about 0.001%, I guess.


haha


I think it’s weird and cool how Netflix used FreeBSD/Dlang.

Linux is just the automatic go to. It’s great the big tech companies are rethinking these basics.


Where are you seeing any mention of Dlang?



"In networking units"




