This is a well-written article with excellent explanations and I thoroughly enjoyed it.
However, none of the variants using vmsplice (i.e., all but the slowest) are safe. When you gift [1] pages to the kernel there is no reliable general purpose way to know when the pages are safe to reuse again.
This post (and the earlier FizzBuzz variant) tries to get around this by assuming the pages are available again after "pipe size" bytes have been written after the gift, _but this is not true in general_. For example, the read side may also use splice-like calls to move the pages to another pipe or IO queue in a zero-copy way, so the lifetime of the page can extend beyond the original pipe.
This will show up as race conditions and spontaneously changing data, where a downstream consumer sees the page suddenly change as it is overwritten by the original process.
The author of these splice methods, Jens Axboe, had proposed a mechanism which enabled you to determine when it was safe to reuse the page, but as far as I know nothing was ever merged. So the scenarios where you can use this are limited to those where you control both ends of the pipe and can be sure of the exact page lifetime.
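To make the assumption concrete, here's a minimal sketch in C of the pattern under discussion (my own illustration, not the article's or FizzBuzz's actual code; run with stdout connected to a pipe). It rotates through buffers whose combined size matches the pipe capacity and simply assumes a buffer is reusable once a full pipe's worth of data has been vmsplice()d after it, which is exactly the assumption that can break if the reader moves the pages onward instead of copying them:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/uio.h>

    #define BUF_SIZE (128 * 1024)
    #define NBUFS    2                   /* 2 * 128 KiB = one pipe's worth */

    int main(void) {
        fcntl(1, F_SETPIPE_SZ, NBUFS * BUF_SIZE);    /* grow stdout's pipe */
        char *bufs[NBUFS];
        for (int i = 0; i < NBUFS; i++) {
            bufs[i] = aligned_alloc(4096, BUF_SIZE);
            memset(bufs[i], 'X', BUF_SIZE);
        }
        for (int i = 0; ; i = (i + 1) % NBUFS) {
            /* Refilling bufs[i] here is the step whose safety is being
               questioned: the rotation only guarantees that a pipe's worth
               of bytes has been spliced since this buffer was last used. */
            struct iovec iov = { .iov_base = bufs[i], .iov_len = BUF_SIZE };
            if (vmsplice(1, &iov, 1, 0) < 0)    /* no SPLICE_F_GIFT; partial */
                return 1;                       /* writes ignored for brevity */
        }
    }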
I haven't digested this comment fully yet, but just to be clear, I am _not_ using SPLICE_F_GIFT (and I don't think the fizzbuzz program is either). However I think what you're saying makes sense in general, SPLICE_F_GIFT or not.
Are you sure this unsafety depends on SPLICE_F_GIFT?
Also, do you have a reference to the discussions regarding this (presumably on LKML)?
Yeah my mention of gift was a red herring: I had assumed gift was being used but the same general problem (the "page garbage collection issue") crops up regardless.
If you don't use gift, you never know when the pages are free to use again, so in principle you need to keep writing to new buffers indefinitely. One "solution" to this problem is to gift the pages, in which case the kernel does the GC for you, but you need to churn through new pages constantly because you've gifted the old ones. Gift is especially useful when the page gifted can be used directly in the page cache (i.e., writing a file, not a pipe).
Without gift, some consumption patterns may be safe, but I think they are exactly those which involve a copy (not using gift means that a copy will occur in additional read-side scenarios). Ultimately the problem is: if some downstream process is able to get a zero-copy view of a page from an upstream writer, how can that page be safe against concurrent modification? The pipe-size trick is one way it could work, but it doesn't pan out because the pages may live beyond the immediate pipe (this is actually alluded to in the FizzBuzz article, where they mention things blew up if more than one pipe was involved).
Yes, this all makes sense, although like everything splicing-related, it is very subtle. Maybe I should have mentioned the subtleness and dangerousness of splicing at the beginning, rather than at the end.
I still think the man page of vmsplice is quite misleading! Specifically:
SPLICE_F_GIFT
The user pages are a gift to the kernel. The application may not modify
this memory ever, otherwise the page cache and on-disk data may differ.
Gifting pages to the kernel means that a subsequent splice(2)
SPLICE_F_MOVE can successfully move the pages; if this flag is not
specified, then a subsequent splice(2) SPLICE_F_MOVE must copy the pages.
Data must also be properly page aligned, both in memory and length.
To me, this indicates that if we're _not_ using SPLICE_F_GIFT downstream splices will be automatically taken care of, safety-wise.
Hmm, reading this side-by-side with a paragraph from BeeOnRope's comment:
> This post (and the earlier FizzBuzz variant) tries to get around this by assuming the pages are available again after "pipe size" bytes have been written after the gift, _but this is not true in general_. For example, the read side may also use splice-like calls to move the pages to another pipe or IO queue in a zero-copy way, so the lifetime of the page can extend beyond the original pipe.
The paragraph you quoted says that the "splice-like calls to move the pages" actually copy when SPLICE_F_GIFT is not specified. So perhaps the combination of not using SPLICE_F_GIFT and waiting until "pipe size" bytes have been written is safe.
Yes, it is not clear to me when the copy actually happens, but I had assumed the >30 GB/s result after the read side was changed to use splice must imply zero copy.
It could be that when splicing to /dev/null (which I'm doing), the kernel knows that their content is never witnessed, and therefore no copy is required. But I haven't verified that.
Splicing seems to work well for the "middle" part of a chain of piped processes, e.g., how pv works: it can splice pages from one pipe to another w/o needing to worry about reusing the page since someone upstream already wrote the page.
Similarly for splicing from a pipe to a file or something like that. It's really the end(s) of the chain that want to (a) generate the data in memory or (b) read the data in memory that seem to create the problem.
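For what it's worth, that "middle of the chain" case is small enough to sketch; this is my own illustration of the idea (not pv's actual code), moving data from the pipe on stdin to the pipe on stdout without ever touching the bytes in userspace:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* Run as: producer | ./relay | consumer */
    int main(void) {
        for (;;) {
            ssize_t n = splice(0, NULL, 1, NULL, 1 << 20, SPLICE_F_MOVE);
            if (n == 0) return 0;   /* upstream closed its end */
            if (n < 0)  return 1;   /* error */
        }
    }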
I think you're right that the same problem applies without SPLICE_F_GIFT. One of the other fizzbuzz code golfers discusses that here: https://codegolf.stackexchange.com/a/239848
I wonder if io_uring handles this (yet). io_uring is a newer async IO mechanism by the same author which tells you when your IOs have completed. So you might think it would:
* But from a quick look, I think its vmsplice equivalent operation just tells you when the syscall would have returned, so maybe not. [edit: actually, looks like there's not even an IORING_OP_VMSPLICE operation in the latest mainline tree yet, just drafts on lkml. Maybe if/when the vmsplice op is added, it will wait to return for the right time.]
* And in this case (no other syscalls or work to perform while waiting) I don't see any advantage in io_uring's read/write operations over just plain synchronous read/write.
Perhaps it could be sort of simulated in uring using the splice op against a memfd that has been mmapped in advance? I wonder how fast that could be and how it would compare safety-wise.
uring only really applies for async IO - and would tell you when an otherwise blocking syscall would have finished. Since the benchmark here uses blocking calls, there shouldn’t be any change in behavior. The lifetime of the buffer is an orthogonal concern to the lifetime of the operation. Even if the kernel knows when the operation is done inside the kernel it wouldn’t have a way to know whether the consuming application is done with it.
> uring only really applies for async IO - and would tell you when an otherwise blocking syscall would have finished. Since the benchmark here uses blocking calls, there shouldn’t be any change in behavior. The lifetime of the buffer is an orthogonal concern to the lifetime of the operation. Even if the kernel knows when the operation is done inside the kernel it wouldn’t have a way to know whether the consuming application is done with it.
That doesn't match what I've read. E.g. https://lwn.net/Articles/810414/ opens with "At its core, io_uring is a mechanism for performing asynchronous I/O, but it has been steadily growing beyond that use case and adding new capabilities."
More precisely:
* While most/all ops are async IO now, is there any reason to believe folks won't want to extend it to batch basically any hot-path non-vDSO syscall? As I said, batching doesn't help here, but it does in a lot of other scenarios.
* Several IORING_OP_s seem to be growing capabilities that aren't matched by like-named syscalls. E.g. IO without file descriptors, registered buffers, automatic buffer selection, multishot, and (as of a month ago) "ring mapped supplied buffers". Beyond the individual operation level, support for chains. Why not a mechanism that signals completion when the buffer passed to vmsplice is available for reuse? (Maybe by essentially delaying the vmsplice syscall's return [1], maybe by a second command, maybe by some extra completion event from the same command, details TBD.)
[1] edit: although I guess that's not ideal. The reader side could move the page and want to examine following bytes, but those won't get written until the writer sees the vmsplice return and issues further writes.
The vanilla io_uring fits "naturally" in an async model, but batching and some of the other capabilities it provides are definitely useful for stuff written to a synchronous model too.
Additionally, io_uring can avoid syscalls sometimes even without any explicit batching by the application, because it can poll the submission queue (root only, last time I checked unfortunately): so with the right setup a series of "synchronous" ops via io_uring (i.e., submit & immediately wait for the response) could happen with < 1 user-kernel transition per op, because the kernel is busy servicing ops directly from the incoming queue and the application gets the response during its polling phase before it waits.
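A rough liburing sketch of that setup, as I understand it (illustrative only: SQPOLL has needed root or special privileges, and on older kernels also registered files, so details vary by kernel version):

    #include <liburing.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        struct io_uring ring;
        struct io_uring_params p;
        memset(&p, 0, sizeof(p));
        p.flags = IORING_SETUP_SQPOLL;   /* kernel thread polls the SQ      */
        p.sq_thread_idle = 2000;         /* ms of idle before it sleeps     */
        if (io_uring_queue_init_params(8, &ring, &p) < 0) {
            fprintf(stderr, "io_uring setup failed (privileges? kernel?)\n");
            return 1;
        }
        char buf[] = "hello\n";
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_write(sqe, 1, buf, sizeof(buf) - 1, 0);
        io_uring_submit(&ring);          /* often no syscall while the
                                            poller thread is awake          */
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);  /* may still block in a syscall    */
        io_uring_cqe_seen(&ring, cqe);
        io_uring_queue_exit(&ring);
        return 0;
    }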
Actually, from re-reading the man page for vmsplice, it seems like it _should_ depend on SPLICE_F_GIFT (or in other words, it should be safe without it).
But from what I know about how vmsplice is implemented, gifting or not, it sounds like it should be unsafe anyhow.
> However, none of the variants using vmsplice (i.e., all but the slowest) are safe. When you gift [1] pages to the kernel there is no reliable general purpose way to know when the pages are safe to reuse again. [snip] This will show up as race conditions and spontaneously changing data where a downstream consumer sees the page suddenly change as it is overwritten by the original process.
That sounds like a security issue - the ability of an upstream generator process to write into the memory of a downstream reader process, or, even worse, vice versa. I presume that the Linux kernel only lets this happen (zero copy) when the two processes are running as the same user?
It’s not clear to me that the kernel allows the receiving process to write instead of just read.
But also, if you are sending data, why would you later read/process that send buffer?
The only attack vector I could imagine would be if one sender was splicing the same memory to two or more receivers. A malicious receiver with write access to the spliced memory could compromise other readers.
No, because "freeing" memory makes sense within a single process and really means "removing the V->P mapping for the page", or more accurately something like "telling the malloc() implementation this pointer (implying a range of virtual memory) is free, which may or may not end up unmapping the V range from the process (or doing things like M_ADV_DONTNEED))".
None of those would impact the same physical page mapped into another process with its own V->P mapping.
I once had to change my mental model for how fast some of these things were. I was using `seq` as an input for something else, and my thinking was along the lines that it is a small generator program running hot in the cpu and would be super quick. Specifically because it would only be writing things out to memory for the next program to consume, not reading anything in.
But that was way off and `seq` turned out to be ridiculously slow. I dug down a little and made a faster version of `seq`, that kind of got me what I wanted. But then noticed at the end that the point was moot anyway, because just piping it to the next program over the command line was going to be the slow point, so it didn't matter anyway.
I had a somewhat similar discovery once using GNU parallel. I was trying to generate as much web traffic as possible from a single machine to load test a service I was building, and I assumed that the network I/O would be the bottleneck by a long shot, not the overhead of spawning many processes. I was disappointed by the amount of traffic generated, so I rewrote it in Ruby using the parallel gem with threads (instead of processes), and got orders of magnitude more performance.
Not a valid comparison between the two machines because I don't know what the original machine is, but MacOS rarely comes out shining in this sort of comparison, and the simplistic approach here giving 8 GB/s rather than the author's 3.5 GB/s was better than I'd expected, even given the machine I'm using.
The majority of this overhead (and the slow transfers) naively seem to be in the scripts/systems using the pipes.
I was worried when I saw zfs send/receive used pipes for instance because of performance worries - but using it in reality I had no problems pushing 800MB/s+. It seemed limited by iop/s on my local disk arrays, not any limits in pipe performance.
Right. I’m actually surprised the test with 256kB transfers gives reasonable results, and would rather have tested with > 1GB instead. For such a small transfer it seemed likely that the overhead of spawning the process and loading libraries by far dominates the amount of actual work. I’m also surprised this didn’t show up in profiles. But it obviously depends on where the measurement start and end points are.
Perhaps I've misunderstood what you're referring to, but the test in the article is measuring speed transferring 10 GiB. 256 KiB is just the buffer size.
The first C program in the blog post allocates a 256kB buffer and writes that one exactly once to stdout. I don't see another loop which writes it multiple times.
There's an outer while(true){} loop - the write side just writes continuously.
More generally though, sidenote 5 says that the code in the article itself is incomplete and the real test code is available in the github repo: https://github.com/bitonic/pipes-speed-test
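For reference, the shape of the write side as described in this thread is roughly this (a paraphrase, not the post's exact code):

    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define BUF_SIZE (256 * 1024)

    int main(void) {
        char *buf = malloc(BUF_SIZE);
        memset(buf, 'X', BUF_SIZE);
        while (1) {                          /* the outer loop in question */
            if (write(1, buf, BUF_SIZE) < 0)
                return 1;
        }
    }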
I get around 1.56MiB/s with that code. PHP gets 4.04MiB/s. Python gets 4.35MiB/s.
> What's also interesting is that node crashes after about a minute
I believe this is because `while(1)` runs so fast that there is no "idle" time for V8 to actually run GC. V8 is a strange beast, and this is just a guess of mine.
The following code shouldn't crash, give it a try:
>> What's also interesting is that node crashes after about a minute
> I believe this is because `while(1)` runs so fast that there is no "idle" time for V8 to actually run GC. V8 is a strange beast, and this is just a guess of mine.
Not exactly: the GC is still running; it’s live memory that’s growing unbounded.
What’s going on here is that WritableStream is non-blocking; it has advisory backpressure, but if you ignore that it will do its best to accept writes anyway and keep them in a buffer until it can actually write them out. Since you’re not giving it any breathing room, that buffer just keeps growing until there’s no more memory left. `process.nextTick()` is presumably slowing things down enough on your system to give it a chance to drain the buffer. (I see there’s some discussion below about this changing by version; I’d guess that’s an artifact of other optimizations and such.)
To do this properly, you need to listen to the return value from `.write()` and, if it returns false, back off until the stream drains and there’s room in the buffer again.
Here’s the (not particularly optimized) function I use to do that:
async function writestream(chunks, stream) {
for await (const chunk of chunks) {
if (!stream.write(chunk)) {
// When write returns false, the stream is starting to buffer and we need to wait for it to drain
// (otherwise we'll run out of memory!)
await new Promise(resolve => stream.once('drain', () => resolve()))
}
}
}
I do wish Node made it more obvious what was going on in this situation; this is a very common mistake with streams and it’s easy to not notice until things suddenly go very wrong.
ETA: I should probably note that transform streams, `readable.pipe()`, `stream.pipeline()`, and the like all handle this stuff automatically. Here’s a one-liner, though it’s not especially fast:
Are there still no async write functions which handle this more easily than the old event-based mechanism? Waiting for drain also sounds like it might reduce throughput, since then there is 0 buffered data and the peer would be forced to pause reading. A "writable" event sounds more appropriate - but the node docs don’t mention one.
Hm, strange. With the same out of memory error as before or a different one? Tried running that one for 2 minutes, no errors here, and memory stays constant.
Huh yeah, seems to be a old memory leak. Running it on v10.24.0 crashes for me too.
After some quick testing in a couple of versions, it seems like it got fixed in v11 at least (didn't test any minor/patch versions).
By the way, all versions up to NodeJS 12 (LTS) are "end of life", and should probably not be used if you're downloading 3rd party dependencies, as there are bunch of security fixes since then, that are not being backported.
> I believe this is because `while(1)` runs so fast that there is no "idle" time for V8 to actually run GC. V8 is a strange beast, and this is just a guess of mine.
Java has (had) weird idiosyncrasies like this as well (well, it doesn't crash), but depending on the construct you can get performance degradations depending on how the language inserts safepoints (where the VM is at a knowable state and a thread can be safely paused for GC or whatever).
I don't know if this holds today, but I know there was a time where you basically wanted to avoid looping over long-type variables, as they had different semantics. The details are a bit fuzzy to me right now.
If you ever need to write a random character to a pipe very fast, GNU coreutils has you covered with yes(1). It runs at about 6 GiB/s on my system:
yes | pv > /dev/null
There's an article floating around [1] about how yes(1) is extremely optimized considering its original purpose. In case you're wondering, yes(1) is meant for commands that (repeatedly) ask whether to proceed, expecting a y/n input or something like that. Instead of repeatedly typing "y", you just run "yes | the_command".
Not sure about how yes(1) compares to the techniques presented in the linked post. Perhaps there's still room for improvement.
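As I understand the trick described in that article (this is my sketch, not coreutils' code), most of the win comes from filling a large buffer with "y\n" once and then handing the whole thing to write() each time, so the per-byte syscall overhead almost disappears:

    #include <string.h>
    #include <unistd.h>

    #define BUF_SIZE (128 * 1024)

    int main(void) {
        static char buf[BUF_SIZE];
        for (size_t i = 0; i + 2 <= sizeof(buf); i += 2)
            memcpy(buf + i, "y\n", 2);            /* fill the buffer once */
        for (;;)
            if (write(1, buf, sizeof(buf)) < 0)   /* then just write it   */
                return 1;
    }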
Honest question: what are the practical use cases of this?
Repeatedly typing the 'y' character into a Linux pipe is surely not that common, especially at that bit rate. Also seems like the bottleneck would always be the consuming program...
Historically, you could have dirty filesystems after a reboot that "fsck" would ask an absurd number of questions about ("blah blah blah inode 1234567890 fix? (y/n)"). Unless you were in a very specific circumstance, you'd probably just answer "y" to them. It could easily ask thousands of questions though. So: "yes | fsck" was not uncommon.
It's probably still common in installation scripts, like in Dockerfiles. `apt-get install` has the `-y` option, but it would be useful for all other programs that don't.
Just to clarify: I was applying "historically" to "fsck", not to the use of "yes" in general. I can't remember the last time I've had the need to use "yes | fsck"
> Honest question: what are the practical use cases of this?
It also allows you to script otherwise interactive command line operations with the correct answer. Many command line tools nowadays provide specific options to override queries. But there are still a couple of holdouts which might not.
> Repeatedly typing the 'y' character into a Linux pipe is surely not that common, especially at that bit rate.
At that rate no, but I definitely use it once in a while. For example, if I copy quite a few files and then get repeatedly asked if I want to overwrite the destination (when it's already present). Sure, I could get my command back and use the proper flag to "cp" or whatever to overwrite, but it's usually much quicker to just get back the previous line, go to the beginning (C-a), then type "yes | " and be done with it.
Note that you can pass a parameter to "yes" and then it repeats what you passed instead of 'y'.
It is optimized quite seriously. I remember there was a comparison of it with, I believe, a BSD version, where the latter was a thousand times more readable (although slower).
I'm getting ~3.10GiB/s with both GNU's and FreeBSD's. I do see that GNU's version has some optimizations, but their effectiveness isn't apparent when doing `yes | pv > /dev/null`.
However, my point was just that its performance was never a main point of it. Even without optimizations, it's still very fast, and I don't think whoever created it first was concerned with it having to be super fast, as long as it was faster than the prompts of whatever was downstream in the pipe.
A major contributing factor is whether or not the language buffers output by default, and how big the buffer is. I don't think NodeJS buffers, whereas Python does. Here's some comparisons with Go (does not buffer by default):
- Node (no buffering): 1.2 MiB/s
- Go (no buffering): 2.4 MiB/s
- Python (8 KiB buffer): 2.7 MiB/s
- Go (8 KiB buffer): 218 MiB/s
Go program:
package main
import ("bufio"; "os")
func main() {
	f := bufio.NewWriterSize(os.Stdout, 8192)
	for {
		f.WriteRune('1')
	}
}
Not specifically addressed at you, but it's a bit amusing watching a younger generation of programmers rediscovering things like this, which seemed hugely important in like 1990 but largely don't matter that much to modern workflows with dedicated APIs or various shared memory or network protocols, as not much that is really performance-critical is typically piped back and forth anymore.
More than a few old backup or transfer scripts had extra dd or similar tools in the pipeline to create larger and semi-asynchronous buffers, or to re-size blocks on output to something handled better by the receiver, which was a big deal on high speed tape drives back in the day. I suspect most modern hardware devices have large enough static RAM and fast processors to make that mostly irrelevant.
I did the same test, but added a rust and bash version. My results:
Rust: 21.9MiB/s
Bash: 282KiB/s
PHP: 2.35MiB/s
Python: 2.30MiB/s
Node: 943KiB/s
In my case, node did not crash after about two minutes. I find it interesting that PHP and Python are comparable for me but not you, but I'm sure there's a plethora of reasons to explain that. I'm not surprised rust is vastly faster and bash vastly slower, I just thought it interesting to compare since I use those languages a lot.
Rust:
fn main() {
loop {
print!("1");
}
}
Bash (no discernible difference between echo and printf):
For languages like C, C++, and Rust, the bottleneck is going to mainly be system calls. With a big buffer, on an old machine, I get about 1.5 GiB/s with C++. Writing 1 char at a time, I get less than 1 MiB/s.
#include <cstddef>
#include <random>
#include <chrono>
#include <cassert>
#include <array>
#include <cstdio>
#include <unistd.h>
#include <cstring>
#include <cstdlib>
int main(int argc, char **argv) {
int rv;
assert(argc == 3);
const unsigned int n = std::atoi(argv[1]);
char *buf = new char[n];
std::memset(buf, '1', n);
const unsigned int k = std::atoi(argv[2]);
auto start = std::chrono::high_resolution_clock::now();
for (size_t i = 0; i < k; i++) {
rv = write(1, buf, n);
assert(rv == int(n));
}
auto stop = std::chrono::high_resolution_clock::now();
auto duration = stop - start;
std::chrono::duration<double> secs = duration;
std::fprintf(stderr, "buffer size: %d, num syscalls: %d, perf:%f MiB/s\n", n, k, (double(n)*k)/(1024*1024)/secs.count());
}
EDIT: Also note that a big write to a pipe (bigger than PIPE_BUF) may require multiple syscalls on the read side.
EDIT 2: Also, it appears that the kernel is smart enough to not copy anything when it's clear that there is no need. When I don't go through cat, I get rates that are well above memory bandwidth, implying that it's not doing any actual work:
I suspect (but am not sure) that the shell may be doing something clever for a stream redirection (>) and giving your program a STDOUT file descriptor directly to /dev/null.
I may be wrong, though. Check with lsof or similar.
There's no special "no work" detection needed. a.out is calling the write function for the null device, which just returns without doing anything. No pipes are involved.
Seems like it's buffering output, which Python also does. Python is much slower if you flush every write (I get 2.6 MiB/s default, 600 KiB/s with flush=True).
Interestingly, Go is very fast with a 8 KiB buffer (same as Python's), I get 218 MiB/s.
Why is it cheating to use a buffer? This is the behavior you would get in C if you used the C standard library (putc/fputc) instead of a system call (write).
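To illustrate the comparison (my own toy example, not from the article): the same one-character-at-a-time loop through stdio's userspace buffer versus a raw write(2) per character only differs in how often the kernel gets involved:

    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        if (argc > 1) {                /* "./a.out raw": one syscall per byte */
            for (;;)
                write(1, "1", 1);
        } else {                       /* "./a.out": stdio buffers a few KiB  */
            for (;;)
                putchar('1');          /* flushed via write(2) when full      */
        }
    }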
That manages about 7 GiB/s reusing the same buffer, or about 300 MiB/s with clearing and refilling the buffer every time
(the magic is in using java’s APIs for writing to files/sockets, which are designed for high performance, instead of using the APIs which are designed for writing to stdout)
`process.stdout.write` is different to PHP's `echo` and Python's `print` in that it pushes a write to an event queue without waiting for the result, which could result in filling the event queue with writes. Instead, you can consider `await`-ing `write` so that it would write before pushing another `write` to the event queue.
For Python 3.10.4, I get about 2.8 MiB/s as you have it written, but around 5 MiB/s (same for 3.9 but only 4 MiB/s for 3.8) with this. I also get 4.8 MiB/s with 2.7:
[Edit: as pointed out below, this is no longer the case!]
Strings are printed one character at a time in Haskell. This choice is justified by unpredictability of the interaction between laziness and buffering; I am uncertain it's the correct choice, but the proper response is to use Text where performance is relevant.
With the recursive code, it buffered the output in the same way but bugged the kernel a whole lot more in-between writes. Not exactly sure what is going on:
I'm honestly surprised either of them wind up buffered! That must be a change since I stopped paying as much attention to GHC.
I'm also not sure what's going on in the second case. IIRC, at some point historically, a sufficiently tight loop could cause trouble with handling SIGINT, so it might be related to some overaggressive workaround for that?
Potential buffering issues aside, as others have pointed out the node.js example is performing asynchronous writes, unlike the other languages' examples (as far as I know).
To do a proper synchronous write, you'd do something like:
You're testing a very specific operation, a loop, in each language to determine its speed, not sure if I'd generalize that. I wonder what it'd look like if you replaced the loop with static print statements that were 1000s of characters long with line breaks, the sort of things that compiler optimizations do.
I find that NodeJS runs eventually out of memory and crashes with applications that do a large amount of data processing over a long time with little breaks even if there are no memory leaks.
Edit: I've found this consistently building multiple data processing applications over multiple years and multiple companies
I'll tell you what's fun. I get 5MB/sec with Python, 1.3MB/sec with Node and.... 12.6MB/sec with Ruby! :-) (Added: Same speed as Node if I use $stdout.sync = true though..)
It looks like it is using the "Tufte" style, named after Edward Tufte, who is very famous for his writing on data visualization.
More examples: https://rstudio.github.io/tufte/
Android's flavor of Linux uses "binder" instead of pipes because of its security model. IMHO filesystem-based IPC mechanisms (notably pipes) can't be used because of the lack of a world-writable directory - I may be wrong here.
Pipes don’t necessarily mean one has to use FS permissions. Eg a server could hand out anonymous pipes to authorized clients via fd passing on Unix domain sockets. The server can then implement an arbitrary permission check before doing this.
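A minimal sketch of that fd-passing step (the helper and the setup are mine, just to show the mechanism): the server creates a pipe, applies whatever permission check it wants, then ships one end over the already-connected UNIX-domain socket as SCM_RIGHTS ancillary data:

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <unistd.h>

    /* Send one file descriptor over a connected UNIX-domain socket. */
    static int send_fd(int sock, int fd) {
        char dummy = 'x';
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } u;
        struct msghdr msg = { 0 };
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = u.buf;
        msg.msg_controllen = sizeof(u.buf);
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
        return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
    }

    int main(void) {
        int sv[2], p[2];
        socketpair(AF_UNIX, SOCK_STREAM, 0, sv);  /* stand-in for the real
                                                     server/client socket  */
        pipe(p);
        return send_fd(sv[0], p[0]);              /* receiver would recvmsg()
                                                     the fd on sv[1]       */
    }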
I usually just use cat /dev/urandom > /dev/null to generate load. Not sure how this compares to their code.
Edit: it’s actually “yes” that I’ve used before for generating load. I remember reading somewhere “yes” was optimized differently than the original Unix command as part of the unix certification lawsuit(s).
I think they were generating load? Going through the urandom device isn't bad for that, as it has to do a bit of work to get the random numbers. Just for throughput, though, /dev/zero is probably better.
I don't understand. If you're testing how fast pipes are, then I'd expect you to measure throughput or latency. Why would you measure how fast something unrelated to pipes is? If you want to measure this other thing on the other hand, why would you bother with pipes, which add noise to the measurement?
UPDATE: If you mean that you want to test how fast pipes are when there is other load in the system, then I'd suggest just running a lot of stuff in the background. But I wouldn't put the process dedicated for doing something else into the pipeline you're measuring. As a matter of fact, the numbers I gave were taken with plenty of heavy processes running in the background, such as Firefox, Thunderbird, a VM with another instance of Firefox, OpenVPN, etc. etc. :)
Oh, wait. You mean that this "cat </dev/urandom >/dev/null" was meant to be running in the background and not be the pipeline which is tested? Ok, my bad for not getting the point.
I'm glad huge pages make a big difference because I just spent several hours setting them up. Also everyone says to disable transparent_hugepage, so I set it to `madvise`, but I'm skeptical that any programs outside databases will actually use them.
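For what it's worth, here's roughly what the madvise opt-in looks like for an anonymous buffer (my sketch, not the article's code), which is the path that the "madvise" setting gates on:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 2 << 20;                 /* 2 MiB: one x86-64 huge page */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return 1;
        if (madvise(buf, len, MADV_HUGEPAGE) != 0)   /* opt in to THP */
            perror("madvise");
        /* ... use buf as the write buffer; a careful version would also
           align the mapping to 2 MiB so THP can actually back it ... */
        return 0;
    }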
I just had 25Gb/s internet installed (https://www.init7.net/en/internet/fiber7/), and at those speeds Chrome and Firefox (which is Chrome-based) pretty much die when using speedtest.net at around 10-12Gbps.
The symptoms are that the whole tab freezes, and the shown speed drops from those 10-12Gbps to <1Gbps and the page starts updating itself only every second or so.
IIRC Chrome-based browsers use some form of IPC with a separate networking process, which actually handles networking, I wonder if this might be the case that the local speed limit for socketpair/pipe under Linux was reached and that's why I'm seeing this.
> and at those speeds Chrome and Firefox (which is Chrome-based)
AFAIK, Firefox is not Chrome-based anywhere.
On iOS it uses whatever iOS provides for webview - as does Chrome on iOS.
Firefox and Safari are now the only supported mainstream browsers that have their own rendering engines. Firefox is the only one that has its own rendering engine and is cross-platform. It is also open source.
> Firefox is the only that has their own rendering engine and is cross platform.
Interestingly, Safari's rendering engine is open source and cross-platform, but the browser is not. Lots of Linux-focused browsers (Konqueror, GNOME Web, surf) and most embedded browsers (Nintendo DS & Switch, PlayStation) use WebKit. Also some user interfaces (like webOS, which is running all of LG's TVs and smart refrigerators) use WebKit as their renderer.
Skia started in 2004 independently of google and was then acquired by google. Calling it "Chrome's Skia graphics engine" makes it sound like it was built for chrome.
Chrome fires up many processes and creates an IPC-based comm-network between them to isolate stuff. It's somewhat abusing your OS to get what it wants in terms of isolation and whatnot.
(Which is similar to how K8S abuses ip-tables and makes it useless for other ends, and makes you install a dedicated firewall in front of your ingress path, but let's not digress).
On the other hand, Firefox is neither chromium based, nor is a cousin of it. It's a completely different codebase, inherited from Netscape days and evolved up to this point.
As another test point, Firefox doesn't even blink at a symmetric gigabit connection going at full speed (my network is capped by my NIC, the pipe is way fatter).
Yes, Firefox is also doing the same, however due to the nature of Firefox's processes, the OS doesn't lose much responsiveness or doesn't feel bogged down when I have 50+ tabs open due to some research.
If you need security, you need isolation. If you want hardware-level isolation, you need processes. That's normal.
My disagreement with Google's applications is how they behave like they're the only running processes on the system. I'm pretty aware that some of the most performant or secure things don't have the prettiest implementation on paper.
There used to be a setting to tweak Chrome's process behavior.
I believe the default behavior is "Coalesce tabs into the same content process if they're from the same trust domain".
Then you can make it more aggressive like "Don't coalesce tabs ever" or less aggressive like "Just have one content process". I think.
I'm not sure how Firefox decides when to spawn new processes. I know they have one GPU process and then multiple untrusted "content processes" that can touch untrusted data but can't touch the GPU.
I don't mind it. It's a trade-off between security and overhead. The IPC is pretty efficient and the page cache in both Windows and Linux _should_ mean that all the code pages are shared between all content processes.
Static pages actually feel light to me. I think crappy webapps make the web slow, not browser security.
(inb4 I'm replying to someone who works on the Firefox IPC team or something lol)
I'm harmless, don't worry. :) Also you can find more information about me in my profile.
Even if I was working on Firefox/Chrome/whatever, I'd not be mad at someone who doesn't know something very well. Why should I? We're just conversing here.
Also, I've been very wrong here at times, and this improved my conversation / discussion skills a great deal.
> As another test point, Firefox doesn't even blink at a symmetric gigabit connection going at full speed (my network is capped by my NIC, the pipe is way fatter).
FWIW Firefox under Linux (Firefox Browser 100.0.2 (64-bit)) behaves pretty much the same as Chrome. The speed raises quickly to 5-8Gb/s, then the UI starts choking, and the shown speed drops to 500Mb/s. It could be that there's some scheduling limit or other bottleneck hit in the OS itself, assuming these are different codebases (are they?).
I'd love to test and debug the path where it dies, but none of the systems we have Firefox on have pipes that fat (again, NIC limited).
However, you can test the limits of Linux by installing CLI version of Speedtest and hitting a nearby server.
The bottleneck may be in the browser itself, or in your graphics stack, too.
Linux can do pretty amazing things in the network department, otherwise 100Gbps Infiniband cards wouldn't be possible at Linux servers, yet we have them on our systems.
And yes, Chrome and Firefox are way different browsers. I can confidently say this, because I've been using Firefox since it was called Netscape 6.0 (and Mozilla in Knoppix).
From my experience long ago, all high performance networking under Linux was traditionally user space and pre-allocated pools (netmap, dpdk, pf-ring...). I haven't followed how much io_uring has been catching up for network stack usage... Maybe somebody else knows?
While I'm not very knowledgeable in specifics, there are many paths for networking in Linux now. The usual kernel based one is there, also there's kernel-bypass [0] paths used by very high performance cards.
Also, Infiniband can directly RDMA to and from MPI processes for making "remote memory local", allowing very low latencies and high performance in HPC environments.
I also like this post from Cloudflare [1]. I've read it completely, but the specifics are lost on me since I'm not directly concerned with the network part of our system.
I have a service that beats epoll with io_uring (it reads gre packets from one socket, and does some lookups/munging on the inner packet and re-encaps them to a different mechanism and writes them back to a different socket). General usage for io_uring vs epoll is pretty comparable IIUC. It wouldn't surprise me if streams (e.g. tcp) end up being faster via io_uring and buffer registration though.
Totally tangential - it looks like io_uring is evolving beyond just io and into an alternate syscall interface, which is pretty neat imho.
I'm not talking about the version which evolved to Seamonkey. I'm talking about Mozilla/Firefox 0.8 which had a Mozilla logo as a "Spinner" instead of Netscape logo on the top right.
I know. Firefox was not even an idea when Netscape 6 was released. However, inverse is true. Firefox is based on Netscape. It's just branched off actually. It started as a pared down version of SeaMonkey apparently.
The thing I was remembering from Knoppix 3.x days was "Mozilla Navigator" of SeaMonkey/Mozilla Suite, which is even older than Firefox, and discontinued 3 years later. I just booted the CD to look at it.
At the end of the day, Firefox is just Netscape Navigator, evolved.
> ... on August 16, 1999 that [Lars Knoll] had checked in what amounted to a complete rewrite of the KHTML library—changing KHTML to use the standard W3C DOM as its internal document representation. https://en.wikipedia.org/wiki/KHTML#Re-write_and_improvement
> In March 1998, Netscape released most of the code base for its popular Netscape Communicator suite under an open source license. The name of the application developed from this would be Mozilla, coordinated by the newly created Mozilla Organization https://en.wikipedia.org/wiki/Mozilla_Application_Suite#Hist...
Netscape Communicator (or Netscape 4) was released in 1997, so If we are tracing lineage, I'd say Firefox has a 2 year head start.
Unrelated question, what hardware do you use to setup your network for 25Gb/s?
I've been looking at init7 for a while, but gave up and stayed with Salt after trying to find the right hardware for the job.
Router: Mikrotik CCR-2004 - https://mikrotik.com/product/ccr2004_1g_12s_2xs - warning: it's good to up to ~20Gb/s one way. It can handle ~25Gb/s down, but only ~18Gb/s up, and with IPv6 the max seems to be ~10Gb/s any direction.
If Mikrotik is something you're comfortable using you can also take a look at https://mikrotik.com/product/ccr2216_1g_12xs_2xq - it's more expensive (~2500EUR), but should handle 25Gb/s easily.
Thing to note: the open source version on GitHub, installable by homebrew and native package managers, is not the same version as Ookla distributes from their website and is not accurate at all.
Is it only affecting the browser or the entire system? It might be possible that the CPU is busy handling interrupts from the ethernet controller, although in general these controllers should use DMA and should not send interrupts frequently.
It's Safari-based, which is Webkit-based. Chrome is also Safari-based on iOS, because all the browsers must be. There's no actual Chrome (as in Blink, the browser engine) on iOS, at least in Play Store.
So I'm talking only about iOS. When I said it's Safari-based, I meant Webkit based, but I thought Firefox/Chrome actually pull parts of Safari on iOS. Quick research says that's wrong and they just use Webkit. Not an iOS dev, so someone can point out better sources for the 100% correct terminology.
The small "b" is customarily used to refer to bits, with the large "B" used to refer to bytes. So 25 Gb/s would be 25 Gbit/s, while 25 GB/s would be 200 Gbit/s.
pv is written in perl so isn't the snappiest, I'm surprised to see it score so highly. I wonder what the initial speed would have been if it just wrote to /dev/null
Oh yes, Linux pipes were invented by Douglas McIlroy while working for Bell Labs on Research UNIX and first described in the man pages of Version 3 Unix, Feb. 1974, just a couple months after Linus Torvalds's 4th birthday.
Where and how and when will the unjust and blatant plagiarism of Linux cease? The software was made free by BSD, so feel free to use it, roll it all into GNU/Linux, have at it, but please stop incorrectly describing these things as Linux things. Because the only software that I am certain actually belongs to Linux is systemd. So let's start calling that "Linux systemd," and stop calling anything else Linux anything.
> In this post, we will explore how Unix pipes are implemented in Linux
Seems to me like this post is pretty specifically about Unix pipes on Linux (i.e. Linux pipes), as opposed to Unix pipes in general.
The article also talks about “Linux paging”, again clearly referring to the implementation and usage of virtual memory on Linux, rather than…whatever ancient architecture first invented the page table.
This would be valid, but only if the implementation of pipes in Linux is different than other pipelines, regardless of differences in memory paging, which the article, as detailed and in depth as it is, never makes clear.
So I'm not sure it is not analogous to, "today, we're going to talk about Linux electricity, specifically the way electricity is utilized in Linux."
When you have multiple ways to interpret what someone says, it's generally a good idea to assume best intentions. In this case Linux pipes would then refer to pipes in Linux, rather than implied ownership or origin. This avoids a lot of unnecessary squabbling
So pipes in Linux are significantly different than other pipelines? Was the wheel really reinvented when Linux was developed from Minix, such that the Minix pipeline implementation was abandoned? RLY??! I doubt it.
It's just how language works. You can say "The desert sun is making me thirsty", even when in fact it is the same sun. It's not even located in the desert, nor is it doing anything special at all. Still somehow everyone intuitively knows what you mean.
---
[1] Specifically, using SPLICE_F_GIFT.