How fast are Linux pipes anyway? (mazzo.li)
698 points by rostayob on June 2, 2022 | 200 comments



This is a well-written article with excellent explanations and I thoroughly enjoyed it.

However, none of the variants using vmsplice (i.e., all but the slowest) are safe. When you gift [1] pages to the kernel there is no reliable general purpose way to know when the pages are safe to reuse again.

This post (and the earlier FizzBuzz variant) tries to get around this by assuming the pages are available again after "pipe size" bytes have been written after the gift, _but this is not true in general_. For example, the read side may also use splice-like calls to move the pages to another pipe or IO queue in a zero-copy way, so the lifetime of the page can extend beyond the original pipe.

This will show up as race conditions and spontaneously changing data where a downstream consumer sees the page suddenly change as it is overwritten by the original process.

The author of these splice methods, Jens Axboe, had proposed a mechanism which enabled you to determine when it was safe to reuse the page, but as far as I know nothing was ever merged. So the scenarios where you can use this are limited to those where you control both ends of the pipe and can be sure of the exact page lifetime.

---

[1] Specifically, using SPLICE_F_GIFT.
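
For concreteness, the pattern being discussed looks roughly like this - my own sketch of the alternating-buffer vmsplice loop (not the post's exact code; buffer handling simplified, error handling omitted), with the unsafe assumption marked:

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/uio.h>

  int main(void) {
      long pipe_sz = fcntl(1, F_GETPIPE_SZ);       /* stdout must be a pipe */
      if (pipe_sz < 0) return 1;

      char *bufs[2];                               /* two pipe-sized buffers */
      for (int i = 0; i < 2; i++) {
          bufs[i] = aligned_alloc(4096, pipe_sz);  /* page-aligned */
          memset(bufs[i], 'x', pipe_sz);
      }

      for (unsigned n = 0; ; n ^= 1) {             /* alternate between buffers */
          struct iovec iov = { .iov_base = bufs[n], .iov_len = pipe_sz };
          while (iov.iov_len > 0) {                /* vmsplice may be partial */
              ssize_t r = vmsplice(1, &iov, 1, 0); /* note: no SPLICE_F_GIFT */
              if (r < 0) return 1;
              iov.iov_base = (char *)iov.iov_base + r;
              iov.iov_len -= (size_t)r;
          }
          /* UNSAFE assumption: by the time we come back to bufs[n], a full
             pipe's worth of data has been spliced after it, so the kernel
             (and any downstream reader) must be done with those pages. As
             described above, a read side that splices the pages onward to
             another pipe or IO queue breaks this. */
      }
  }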


(I am the author of the post)

I haven't digested this comment fully yet, but just to be clear, I am _not_ using SPLICE_F_GIFT (and I don't think the fizzbuzz program is either). However I think what you're saying makes sense in general, SPLICE_F_GIFT or not.

Are you sure this unsafety depends on SPLICE_F_GIFT?

Also, do you have a reference to the discussions regarding this (presumably on LKML)?


Yeah my mention of gift was a red herring: I had assumed gift was being used but the same general problem (the "page garbage collection issue") crops up regardless.

If you don't use gift, you never know when the pages are free to use again, so in principle you need to keep writing to new buffers indefinitely. One "solution" to this problem is to gift the pages, in which case the kernel does the GC for you, but you need to churn through new pages constantly because you've gifted the old ones. Gift is especially useful when the page gifted can be used directly in the page cache (i.e., writing a file, not a pipe).

Without gift some consumption patterns may be safe, but I think they are exactly those which involve a copy (not using gift means that a copy will occur for additional read-side scenarios). Ultimately the problem is: if some downstream process is able to get a zero-copy view of a page from an upstream writer, how can this be safe against concurrent modification? The pipe size trick is one way it could work, but it doesn't pan out because the pages may live beyond the immediate pipe (this is actually alluded to in the FizzBuzz article, where they mentioned things blew up if more than one pipe was involved).


Yes, this all makes sense, although like everything splicing-related, it is very subtle. Maybe I should have mentioned the subtlety and danger of splicing at the beginning, rather than at the end.

I still think the man page of vmsplice is quite misleading! Specifically:

       SPLICE_F_GIFT
              The  user pages are a gift to the kernel.  The application may not modify
              this memory ever, otherwise the page cache and on-disk data  may  differ.
              Gifting   pages   to   the  kernel  means  that  a  subsequent  splice(2)
              SPLICE_F_MOVE can successfully move the pages; if this flag is not speci‐
              fied,  then  a  subsequent  splice(2)  SPLICE_F_MOVE must copy the pages.
              Data must also be properly page aligned, both in memory and length.
To me, this indicates that if we're _not_ using SPLICE_F_GIFT downstream splices will be automatically taken care of, safety-wise.


Hmm, reading this side-by-side with a paragraph from BeeOnRope's comment:

> This post (and the earlier FizzBuzz variant) tries to get around this by assuming the pages are available again after "pipe size" bytes have been written after the gift, _but this is not true in general_. For example, the read side may also use splice-like calls to move the pages to another pipe or IO queue in a zero-copy way, so the lifetime of the page can extend beyond the original pipe.

The paragraph you quoted says that the "splice-like calls to move the pages" actually copy when SPLICE_F_GIFT is not specified. So perhaps the combination of not using SPLICE_F_GIFT and waiting until "pipe size" bytes have been written is safe.
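
(For what it's worth, the "pipe size" in question is just the pipe's capacity, which can be queried and enlarged with fcntl - a quick sketch, not from the post:)

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void) {
      int fds[2];
      if (pipe(fds) < 0) return 1;
      /* Default capacity is 64 KiB on current kernels. */
      printf("capacity: %d bytes\n", fcntl(fds[1], F_GETPIPE_SZ));
      /* F_SETPIPE_SZ rounds up to a power-of-two page multiple and is
         capped by /proc/sys/fs/pipe-max-size for unprivileged users. */
      fcntl(fds[1], F_SETPIPE_SZ, 1 << 20);
      printf("now: %d bytes\n", fcntl(fds[1], F_GETPIPE_SZ));
      return 0;
  }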


Yes, it is not clear to me when the copy actually happens, but I had assumed that the > 30 GB/s result after the read side was changed to use splice must imply zero copy.


It could be that when splicing to /dev/null (which I'm doing), the kernel knows that their content is never witnessed, and therefore no copy is required. But I haven't verified that.


Makes sense. If so, some of the nice benchmark numbers for vmsplice would go away in a real scenario, so that'd be nice to know.


Splicing seems to work well for the "middle" part of a chain of piped processes, e.g., how pv works: it can splice pages from one pipe to another w/o needing to worry about reusing the page since someone upstream already wrote the page.

Similarly for splicing from a pipe to a file or something like that. It's really the end(s) of the chain that want to (a) generate the data in memory or (b) read the data in memory that seem to create the problem.
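
Roughly what that middle-of-the-chain pattern looks like - a sketch of a minimal pv-like forwarder (not pv's actual code):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <unistd.h>

  /* Forward stdin to stdout without the data ever entering this
     process's address space; at least one side must be a pipe. */
  int main(void) {
      for (;;) {
          ssize_t n = splice(0, NULL, 1, NULL, 1 << 20, SPLICE_F_MOVE);
          if (n == 0) break;     /* upstream closed the pipe */
          if (n < 0) return 1;   /* e.g. neither fd is a pipe */
      }
      return 0;
  }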


I think you're right that the same problem applies without SPLICE_F_GIFT. One of the other fizzbuzz code golfers discusses that here: https://codegolf.stackexchange.com/a/239848

I wonder if io_uring handles this (yet). io_uring is a newer async IO mechanism by the same author which tells you when your IOs have completed. So you might think it would:

* But from a quick look, I think its vmsplice equivalent operation just tells you when the syscall would have returned, so maybe not. [edit: actually, looks like there's not even an IORING_OP_VMSPLICE operation in the latest mainline tree yet, just drafts on lkml. Maybe if/when the vmsplice op is added, it will wait to return for the right time.]

* And in this case (no other syscalls or work to perform while waiting) I don't see any advantage in io_uring's read/write operations over just plain synchronous read/write.


I don't know if io_uring provides a mechanism to solve this page ownership thing but I bet Jens does: I've asked [1].

---

[1] https://twitter.com/trav_downs/status/1532491167077572608


Perhaps it could be sort of simulated in uring using the splice op against a memfd that has been mmapped in advance? I wonder how fast that could be and how it would compare safety-wise.


uring only really applies for async IO - and would tell you when an otherwise blocking syscall would have finished. Since the benchmark here uses blocking calls, there shouldn’t be any change in behavior. The lifetime of the buffer is an orthogonal concern to the lifetime of the operation. Even if the kernel knows when the operation is done inside the kernel it wouldn’t have a way to know whether the consuming application is done with it.


> uring only really applies for async IO - and would tell you when an otherwise blocking syscall would have finished. Since the benchmark here uses blocking calls, there shouldn’t be any change in behavior. The lifetime of the buffer is an orthogonal concern to the lifetime of the operation. Even if the kernel knows when the operation is done inside the kernel it wouldn’t have a way to know whether the consuming application is done with it.

That doesn't match what I've read. E.g. https://lwn.net/Articles/810414/ opens with "At its core, io_uring is a mechanism for performing asynchronous I/O, but it has been steadily growing beyond that use case and adding new capabilities."

More precisely:

* While most/all ops are async IO now, is there any reason to believe folks won't want to extend it to batch basically any hot-path non-vDSO syscall? As I said, batching doesn't help here, but it does in a lot of other scenarios.

* Several IORING_OP_s seem to be growing capabilities that aren't matched by like-named syscalls. E.g. IO without file descriptors, registered buffers, automatic buffer selection, multishot, and (as of a month ago) "ring mapped supplied buffers". Beyond the individual operation level, support for chains. Why not a mechanism that signals completion when the buffer passed to vmsplice is available for reuse? (Maybe by essentially delaying the vmsplice syscall's return [1], maybe by a second command, maybe by some extra completion event from the same command, details TBD.)

[1] edit: although I guess that's not ideal. The reader side could move the page and want to examine following bytes, but those won't get written until the writer sees the vmsplice return and issues further writes.


Yeah this.

The vanilla io_uring fits "naturally" in an async model, but batching and some of the other capabilities it provide are definitely useful for stuff written to a synchronous model too.

Additionally, io_uring can avoid syscalls sometimes even without any explicit batching by the application, because it can poll the submission queue (root only, last time I checked unfortunately): so with the right setup a series of "synchronous" ops via io_uring (i.e., submit & immediately wait for the response) could happen with < 1 user-kernel transition per op, because the kernel is busy servicing ops directly from the incoming queue and the application gets the response during its polling phase before it waits.
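
Something like this is what I mean - a rough liburing sketch of submit-and-immediately-wait with SQPOLL (assumes liburing is installed; fd registration and privilege requirements vary by kernel version):

  #include <liburing.h>
  #include <string.h>

  int main(void) {
      struct io_uring ring;
      struct io_uring_params p;
      memset(&p, 0, sizeof(p));
      p.flags = IORING_SETUP_SQPOLL;   /* a kernel thread polls the SQ */
      p.sq_thread_idle = 2000;         /* ms of idle before it sleeps */
      if (io_uring_queue_init_params(8, &ring, &p) < 0)
          return 1;                    /* needs root/CAP_SYS_NICE before 5.11 */

      char buf[4096];
      memset(buf, '1', sizeof(buf));
      for (int i = 0; i < 1000; i++) {
          struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
          io_uring_prep_write(sqe, 1, buf, sizeof(buf), 0);  /* write to stdout */
          io_uring_submit(&ring);           /* often no syscall while SQPOLL is awake */
          struct io_uring_cqe *cqe;
          io_uring_wait_cqe(&ring, &cqe);   /* "synchronous" completion */
          io_uring_cqe_seen(&ring, cqe);
      }
      io_uring_queue_exit(&ring);
      return 0;
  }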


Hello

https://mazzo.li/posts/fast-pipes.html#what-are-pipes-made-o...

I think the diagram near the start of this section has "head" and "tail" swapped.

Edit: Nevermind, I didn't read far enough.


Actually, from re-reading the man page for vmsplice, it seems like it _should_ depend on SPLICE_F_GIFT (or in other words, it should be safe without it).

But from what I know about how vmsplice is implemented, gifting or not, it sounds like it should be unsafe anyhow.


> However, none of the variants using vmsplice (i.e., all but the slowest) are safe. When you gift [1] pages to the kernel there is no reliable general purpose way to know when the pages are safe to reuse again. [snip] This will show up as race conditions and spontaneously changing data where a downstream consumer sees the page suddenly change as it is overwritten by the original process.

That sounds like a security issue - the ability of an upstream generator process to write into the memory of a downstream reader process (or, even worse, vice versa). I presume that the Linux kernel only lets this happen (zero copy) when the two processes are running as the same user?


It’s not clear to me that the kernel allows the receiving process to write instead of just read.

But also, if you are sending data, why would you later read/process that send buffer?

The only attack vector I could imagine would be if one sender was splicing the same memory to two or more receivers. A malicious receiver with write access to the spliced memory could compromise other readers.


What if the writer frees the memory entirely? Can you segv the reader? That would be quite a dangerous pattern.


No, because "freeing" memory makes sense within a single process and really means "removing the V->P mapping for the page", or more accurately something like "telling the malloc() implementation this pointer (implying a range of virtual memory) is free, which may or may not end up unmapping the V range from the process (or doing things like MADV_DONTNEED)".

None of those would impact the same physical page mapped into another process with its own V->P mapping.


I once had to change my mental model for how fast some of these things were. I was using `seq` as an input for something else, and my thinking was along the lines that it is a small generator program running hot in the cpu and would be super quick. Specifically because it would only be writing things out to memory for the next program to consume, not reading anything in.

But that was way off and `seq` turned out to be ridiculously slow. I dug down a little and made a faster version of `seq`, which kind of got me what I wanted. But then I noticed that the point was moot anyway, because just piping the output to the next program was going to be the slow point, so it didn't matter.

https://github.com/tverniquet/hseq


I had a somewhat similar discovery once using GNU parallel. I was trying to generate as much web traffic as possible from a single machine to load test a service I was building, and I assumed that the network I/O would be the bottleneck by a long shot, not the overhead of spawning many processes. I was disappointed by the amount of traffic generated, so I rewrote it in Ruby using the parallel gem with threads (instead of processes), and got orders of magnitude more performance.


Node is great for this use case


Ran the basic initial implementation on my Mac Studio and was pleasantly surprised to see

  @elysium pipetest % pipetest | pv > /dev/null
   102GiB 0:00:13 [8.00GiB/s] 

  @elysium ~ % pv < /dev/zero > /dev/null
   143GiB 0:00:04 [36.4GiB/s] 
Not a valid comparison between the two machines because I don't know what the original machine is, but MacOS rarely comes out shining in this sort of comparison, and the simplistic approach here giving 8 GB/s rather than the author's 3.5 GB/s was better than I'd expected, even given the machine I'm using.


Given the machine as in a brand new Mac?


given that the machine is the most performant Mac that Apple makes.


This was a long but highly insightful read!

(And as an aside, the combination of that font with the hand-drawn diagrams is really cool)


Would definitely be curious to know the font name


It's the IBM Plex font, and they're using a combination of IBM Plex Mono, IBM Plex Serif, and IBM Plex Sans.

Here is the source: https://www.ibm.com/plex/

Hope that helps!


Netmap offers zero-copy pipes (included in FreeBSD, on Linux it's a third party module): https://www.freebsd.org/cgi/man.cgi?query=netmap&sektion=4


The majority of this overhead (and the slow transfers) naively seem to be in the scripts/systems using the pipes.

I was worried when I saw zfs send/receive used pipes, for instance, because of performance worries - but using it in reality I had no problems pushing 800MB/s+. It seemed limited by IOPS on my local disk arrays, not any limits in pipe performance.


Right. I'm actually surprised the test with 256kB transfers gives reasonable results, and would rather have tested with > 1GB instead. For such a small transfer it seemed likely that the overhead of spawning the process and loading libraries by far dominates the amount of actual work. I'm also surprised this didn't show up in profiles. But it obviously depends on where the measurement start and end points are.


Perhaps I've misunderstood what you're referring to, but the test in the article is measuring speed transferring 10 GiB. 256 KiB is just the buffer size.


The first C program in the blog post allocates a 256kB buffer and writes that one exactly once to stdout. I don't see another loop which writes it multiple times.


There's an outer while(true){} loop - the write side just writes continuously.

More generally though, sidenote 5 says that the code in the article itself is incomplete and the real test code is available in the github repo: https://github.com/bitonic/pipes-speed-test


For some reason, this raised my curiosity how fast different languages write individual characters to a pipe:

PHP comes in at about 900KiB/s:

    php -r 'while (1) echo 1;' | pv > /dev/null
Python is about 50% faster at about 1.5MiB/s:

    python3 -c 'while (1): print (1, end="")' | pv > /dev/null
Javascript is slowest at around 200KiB/s:

    node -e 'while (1) process.stdout.write("1");' | pv > /dev/null
What's also interesting is that node crashes after about a minute:

    FATAL ERROR: Ineffective mark-compacts
    near heap limit Allocation failed -
    JavaScript heap out of memory
All results from within a Debian 10 docker container with the default repo versions of PHP, Python and Node.

Update:

Checking with strace shows that Python caches the output:

    strace python3 -c 'while (1): print (1, end="")' | pv > /dev/null
Outputs a series of:

    write(1, "11111111111111111111111111111111"..., 8193) = 8193
PHP and JS do not.

So the Python equivalent would be:

    python3 -c 'while (1): print (1, end="", flush=True)' | pv > /dev/null
Which makes it comparable to the speed of JS.

Interesting that PHP is over 4x faster than Python and JS.


> Javascript is slowest at around 200KiB/s:

I get around 1.56MiB/s with that code. PHP gets 4.04MiB/s. Python gets 4.35MiB/s.

> What's also interesting is that node crashes after about a minute

I believe this is because `while(1)` runs so fast that there is no "idle" time for V8 to actually run GC. V8 is a strange beast, and this is just a guess of mine.

The following code shouldn't crash, give it a try:

    node -e 'function write() {process.stdout.write("1"); process.nextTick(write)} write()' | pv > /dev/null
It's slower for me though, giving me 1.18MiB/s.

More examples with Babashka and Clojure:

    bb -e "(while true (print \"1\"))" | pv > /dev/null
513KiB/s

    clj -e "(while true (print \"1\"))" | pv > /dev/null
3.02MiB/s

    clj -e "(require '[clojure.java.io :refer [copy]]) (while true (copy \"1\" *out*))" | pv > /dev/null
3.53MiB/s

    clj -e "(while true (.println System/out \"1\"))" | pv > /dev/null
5.06MiB/s

Versions: PHP 8.1.6, Python 3.10.4, NodeJS v18.3.0, Babashka v0.8.1, Clojure 1.11.1.1105


>> What's also interesting is that node crashes after about a minute

> I believe this is because `while(1)` runs so fast that there is no "idle" time for V8 to actually run GC. V8 is a strange beast, and this is just a guess of mine.

Not exactly: the GC is still running; it’s live memory that’s growing unbounded.

What’s going on here is that WritableStream is non-blocking; it has advisory backpressure, but if you ignore that it will do its best to accept writes anyway and keep them in a buffer until it can actually write them out. Since you’re not giving it any breathing room, that buffer just keeps growing until there’s no more memory left. `process.nextTick()` is presumably slowing things down enough on your system to give it a chance to drain the buffer. (I see there’s some discussion below about this changing by version; I’d guess that’s an artifact of other optimizations and such.)

To do this properly, you need to listen to the return value from `.write()` and, if it returns false, back off until the stream drains and there’s room in the buffer again.

Here’s the (not particularly optimized) function I use to do that:

  async function writestream(chunks, stream) {
      for await (const chunk of chunks) {
          if (!stream.write(chunk)) {
              // When write returns false, stream is starting to buffer and we need to wait for it to drain
              // (otherwise we'll run out of memory!)
              await new Promise(resolve => stream.once('drain', () => resolve()))
          }
      }
  }
I do wish Node made it more obvious what was going on in this situation; this is a very common mistake with streams and it’s easy to not notice until things suddenly go very wrong.

ETA: I should probably note that transform streams, `readable.pipe()`, `stream.pipeline()`, and the like all handle this stuff automatically. Here’s a one-liner, though it’s not especially fast:

  node -e 'const {Readable} = require("stream"); Readable.from(function*(){while(1) yield "1"}()).pipe(process.stdout)' | pv > /dev/null


Are there still no async write functions which handle this more easily than the old event-based mechanism? Waiting for drain also sounds like it might reduce throughput, since then there is 0 buffered data and the peer would be forced to pause reading. A "writable" event sounds more appropriate - but the node docs don't mention one.


Your node version indeed did not crash. Tried for 2 minutes.

But using a longer string crashed after 23s here:

    node -e 'function write() {process.stdout.write("1111111111222222222233333333334444444444555555555566666666667777777777888888888899999999990000000000"); process.nextTick(write)} write()' | pv > /dev/null


Hm, strange. With the same out of memory error as before or a different one? Tried running that one for 2 minutes, no errors here, and memory stays constant.

Also, what NodeJS version are you on?


Yes, same error as before. Memory usage stays the same for a while, then starts to skyrocket shortly before it crashes.

node is v10.24.0. (Default from the Debian 10 repo)


Huh yeah, seems to be a old memory leak. Running it on v10.24.0 crashes for me too.

After some quick testing in a couple of versions, it seems like it got fixed in v11 at least (didn't test any minor/patch versions).

By the way, all versions up to NodeJS 12 (LTS) are "end of life", and should probably not be used if you're downloading 3rd party dependencies, as there are a bunch of security fixes since then that are not being backported.


I used this exact issue today while pointing out how Debian support dates can be misleading as packages themselves aren’t always getting fixes: https://github.com/endoflife-date/endoflife.date/issues/763#...


> I believe this is because `while(1)` runs so fast that there is no "idle" time for V8 to actually run GC. V8 is a strange beast, and this is just a guess of mine.

Java has (had) weird idiosyncrasies like this as well - well, it doesn't crash, but depending on the construct you can get performance degradations depending on how the VM inserts safepoints (where the VM is at a knowable state and a thread can be safely paused for GC or whatever).

I don't know if this holds today, but I know there was a time where you basically wanted to avoid looping over long-type variables, as they had different semantics. The details are a bit fuzzy to me right now.


If you ever need to write a random character to a pipe very fast, GNU coreutils has you covered with yes(1). It runs at about 6 GiB/s on my system:

  yes | pv > /dev/null
There's an article floating around [1] about how yes(1) is extremely optimized considering its original purpose. In case you're wondering, yes(1) is meant for commands that (repeatedly) ask whether to proceed, expecting a y/n input or something like that. Instead of repeatedly typing "y", you just run "yes | the_command".

Not sure about how yes(1) compares to the techniques presented in the linked post. Perhaps there's still room for improvement.

[1] Previous HN discussion: https://news.ycombinator.com/item?id=14542938


Faster still is

  pv < /dev/zero > /dev/null


Yes but you don't have control of which character is written (only NULLs).

yes lets you specify which character to output. 'yes n' for example to output n.


Yes doesn't just let you choose a character. It lets you choose a string that will be repeated. So

    yes 123abc
will print

    123abc123abc123abc123abc123abc
and so on.


each time terminated by a newline, so:

  123abc
  123abc
  123abc
  ...


> It runs at about 6 GiB/s on my system...

Honest question: what are the practical use cases of this?

Repeatedly typing the 'y' character into a Linux pipe is surely not that common, especially at that bit rate. Also seems like the bottleneck would always be the consuming program...


Historically, you could have dirty filesystems after a reboot that "fsck" would ask an absurd number of questions about ("blah blah blah inode 1234567890 fix? (y/n)"). Unless you were in a very specific circumstance, you'd probably just answer "y" to them. It could easily ask thousands of questions though. So: "yes | fsck" was not uncommon.


> Historically

It's probably still common in installation scripts, like in Dockerfiles. `apt-get install` has the `-y` option, but it would be useful for all other programs that don't.


Just to clarify: I was applying "historically" to "fsck", not to the use of "yes" in general. I can't remember the last time I've had the need to use "yes | fsck"


> Honest question: what are the practical use cases of this?

It also allows you to script otherwise interactive command line operations with the correct answer. Many command line tools nowadays provide specific options to override queries, but there are still a couple of holdouts which might not.


> Repeatedly typing the 'y' character into a Linux pipe is surely not that common, especially at that bit rate.

At that rate, no, but I definitely use it once in a while. For example if I copy quite a few files and then get repeatedly asked if I want to overwrite the destination (when it's already present). Sure, I could get my command back and use the proper flag to "cp" or whatever to overwrite, but it's usually much quicker to just get back the previous line, go to the beginning (C-a), then type "yes | " and be done with it.

Note that you can pass a parameter to "yes" and then it repeats what you passed instead of 'y'.


> especially at that bit rate. Also seems like the bottleneck would always be the consuming program...

It's not made to be fast; it's just fast by nature, because there's no other computation it needs to do than to just output the string.


It is optimized quite seriously. I remember there was a comparison of it with, I believe, a BSD version, where the latter was a thousand times more readable (although slower).


I'm getting ~3.10GiB/s with both GNU's and FreeBSD's. I do see that GNU's version has some optimizations, but their effectiveness isn't apparent when doing `yes | pv > /dev/null`.

However, my point was just that its performance was never a main point of it. Even without optimizations, it's still very fast, and I don't think whoever created it first was concerned with it having to be super fast, as long as it was faster than the prompts of whatever was downstream in the pipe.


It really is! It's been a few years since I saw the article on HN so I just reposted it: https://news.ycombinator.com/item?id=31619076


Yes can repeat any string, not just "y". It can be useful for basic load generation.


I've used it to test some db behavior with `yes 'insert ...;' | mysql ...`. Fastest insertions I could think of.


A major contributing factor is whether or not the language buffers output by default, and how big the buffer is. I don't think NodeJS buffers, whereas Python does. Here's some comparisons with Go (does not buffer by default):

- Node (no buffering): 1.2 MiB/s

- Go (no buffering): 2.4 MiB/s

- Python (8 KiB buffer): 2.7 MiB/s

- Go (8 KiB buffer): 218 MiB/s

Go program:

    f := bufio.NewWriterSize(os.Stdout, 8192)
    for {
       f.WriteRune('1')
    }


Not specifically addressed at you, but it's a bit amusing watching a younger generation of programmers rediscovering things like this, which seemed hugely important in like 1990 but largely don't matter that much to modern workflows with dedicated APIs or various shared memory or network protocols, as not much that is really performance-critical is typically piped back and forth anymore.

More than a few old backup or transfer scripts had extra dd or similar tools in the pipeline to create larger and semi-asynchronous buffers, or to re-size blocks on output to something handled better by the receiver, which was a big deal on high speed tape drives back in the day. I suspect most modern hardware devices have large enough static RAM and fast processors to make that mostly irrelevant.


In addition to the buffering within the process, note that stdout is usually block-buffered by the C library (a few KiB) when it isn't a terminal, while stderr is unbuffered.
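
If you want to control that explicitly from C, stdio lets you choose the buffering mode and size - a sketch:

  #include <stdio.h>

  int main(void) {
      /* Force stdout to be fully buffered with a 64 KiB buffer, even when
         it's a pipe or a terminal; must be called before any output. */
      static char buf[1 << 16];
      setvbuf(stdout, buf, _IOFBF, sizeof(buf));
      for (;;)
          putchar('1');
  }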


I did the same test, but added a rust and bash version. My results:

Rust: 21.9MiB/s

Bash: 282KiB/s

PHP: 2.35MiB/s

Python: 2.30MiB/s

Node: 943KiB/s

In my case, node did not crash after about two minutes. I find it interesting that PHP and Python are comparable for me but not you, but I'm sure there's a plethora of reasons to explain that. I'm not surprised rust is vastly faster and bash vastly slower, I just thought it interesting to compare since I use those languages a lot.

Rust:

  fn main() {
      loop {
          print!("1");
      }
  }
Bash (no discernible difference between echo and printf):

  while :; do printf "1"; done | pv > /dev/null


For languages like C, C++, and Rust, the bottleneck is going to mainly be system calls. With a big buffer, on an old machine, I get about 1.5 GiB/s with C++. Writing 1 char at a time, I get less than 1 MiB/s.

    $ ./a.out 1000000 2000 | cat >/dev/null
    buffer size: 1000000, num syscalls: 2000, perf:1578.779593 MiB/s
    $ ./a.out 1 2000000 | cat >/dev/null
    buffer size: 1, num syscalls: 2000000, perf:0.832587 MiB/s
Code is:

    #include <cstddef>
    #include <random>
    #include <chrono>
    #include <cassert>
    #include <array>
    #include <cstdio>
    #include <unistd.h>
    #include <cstring>
    #include <cstdlib>

    int main(int argc, char **argv) {

        int rv;

        assert(argc == 3);
        const unsigned int n = std::atoi(argv[1]);
        char *buf = new char[n];
        std::memset(buf, '1', n);

        const unsigned int k = std::atoi(argv[2]);

        auto start = std::chrono::high_resolution_clock::now();
        for (size_t i = 0; i < k; i++) {
            rv = write(1, buf, n);
            assert(rv == int(n));
        }
        auto stop = std::chrono::high_resolution_clock::now();

        auto duration = stop - start;
        std::chrono::duration<double> secs = duration;

        std::fprintf(stderr, "buffer size: %d, num syscalls: %d, perf:%f MiB/s\n", n, k, (double(n)*k)/(1024*1024)/secs.count());
    }
EDIT: Also note that a big write to a pipe (bigger than PIPE_BUF) may require multiple syscalls on the read side.
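
i.e. a reader that wants a full buffer generally has to loop over short reads, along these lines (sketch):

  #include <unistd.h>

  /* Read exactly len bytes from fd, looping because a pipe returns at
     most whatever is currently buffered; returns bytes read, or -1. */
  ssize_t read_full(int fd, char *buf, size_t len) {
      size_t got = 0;
      while (got < len) {
          ssize_t r = read(fd, buf + got, len - got);
          if (r < 0) return -1;          /* error */
          if (r == 0) break;             /* EOF */
          got += (size_t)r;
      }
      return (ssize_t)got;
  }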

EDIT 2: Also, it appears that the kernel is smart enough to not copy anything when it's clear that there is no need. When I don't go through cat, I get rates that are well above memory bandwidth, implying that it's not doing any actual work:

    $ ./a.out 1000000 1000 >/dev/null
    buffer size: 1000000, num syscalls: 1000, perf: 1827368.373827 MiB/s


I suspect (but am not sure) that the shell may be doing something clever for a stream redirection (>) and giving your program a STDOUT file descriptor directly to /dev/null.

I may be wrong, though. Check with lsof or similar.


There's no special "no work" detection needed. a.out is calling the write function for the null device, which just returns without doing anything. No pipes are involved.



Seems like it's buffering output, which Python also does. Python is much slower if you flush every write (I get 2.6 MiB/s default, 600 KiB/s with flush=True).

Interestingly, Go is very fast with a 8 KiB buffer (same as Python's), I get 218 MiB/s.


for the bash case, the cost of forking to write two chars is overwhelming compared to anything related to I/O.


Echo and printf are shell built-ins in bash[0]. Does it have to fork to execute them?

You could probably answer this by replacing printf with /bin/echo and comparing the results. I'm not in front of a Linux box, or I'd try.

[0] https://www.gnu.org/software/bash/manual/html_node/Bash-Buil...


> Echo and printf are shell built-ins in bash

Ah, yeah, good point, I am wrong.


There's no forking and it's writing one character.


with Rust you could also avoid using a lock on STDOUT and get it even faster!


Tested it, seems to about double the speed (from 22.3 MiB/s to 47.6 MiB/s).


> python3 -c 'while (1): print (1, end="")' | pv > /dev/null

python actually buffers its writes with print only flushing to stdout occasionally, you may want to try:

    python3 -c 'while (1): print (1, end="", flush=True)' | pv > /dev/null
which I find goes much slower (550Kib/s)


Luajit using print and io.write

  LuaJIT 2.1.0-beta3
Using print is about 17 MiB/s

  luajit -e "while true do print('x') end" | pv > /dev/null
Using io.write is about 111 MiB/s

  luajit -e "while true do io.write('x') end" | pv > /dev/null


"Javascript" is slowest probably because node pushes the writes to a thread instead of printing directly from the main process like PHP.

Python cheats, and it's still slow as heck even while cheating (it buffers the output in 8192-byte chunks instead of issuing 1-byte writes).

write(1, "1", 1) loop in C pushes 6.38MiB/s on my PC. :)
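
For reference, that loop is just this (sketch):

  #include <unistd.h>

  int main(void) {
      for (;;)
          write(1, "1", 1);   /* unbuffered: every byte is its own write(2) syscall */
  }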


Why is it cheating to use a buffer? This is the behavior you would get in C if you used the C standard library (putc/fputc) instead of a system call (write).
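
i.e. something like this, where stdio quietly batches the bytes into multi-KiB write(2) calls when stdout is a pipe (a sketch):

  #include <stdio.h>

  int main(void) {
      for (;;)
          putchar('1');   /* buffered by stdio, flushed in large chunks */
  }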


Because it doesn't answer the question "how fast individual languages write individual characters to a pipe" if in fact some languages do not.

It's not language "cheating" of course. It's just OP "measuring the wrong thing".


If you want to compare apples to apples, you can switch Python to use unbuffered stdout/stderr (via `-u` or fcntl inside the script [0])

[0]: https://stackoverflow.com/a/881751/6001364


Adding a few results:

Using OP's code for following

    php 1.8 MB/s
    python 3.8 MB/s
    node 1.0 MB/s
Java print 1.3 MB/s

    echo 'class Code {public static void main(String[] args) {while (true){System.out.print("1");}}}' >Code.java; javac Code.java ; java Code | pv>/dev/null
Java with buffering 57.4 MB/s

    echo 'import java.io.*;class Code2 {public static void main(String[] args) throws IOException {BufferedWriter log = new BufferedWriter(new OutputStreamWriter(System.out));while(true){log.write("1");}}}' > Code2.java ; javac Code2.java ; java Code2 | pv >/dev/null


Java can get even much much faster: https://gist.github.com/justjanne/12306b797f4faa977436070ec0...

That manages about 7 GiB/s reusing the same buffer, or about 300 MiB/s with clearing and refilling the buffer every time

(the magic is in using java’s APIs for writing to files/sockets, which are designed for high performance, instead of using the APIs which are designed for writing to stdout)


Nice, that's pretty cool!


`process.stdout.write` is different to PHP's `echo` and Python's `print` in that it pushes a write to an event queue without waiting for the result which could result in filling event queue with writes. Instead, you can consider `await`-ing `write` so that it would write before pushing another `write` to an event queue.

    node -e '
        const stdoutWrite = util.promisify(process.stdout.write).bind(process.stdout);
        (async () => {
            while (true) {
                await stdoutWrite("1");
            }
        })();
    ' | pv > /dev/null


I'm on a 2015 MB Air with two browsers running, probably a dozen tabs between them, three tabs in iTerm2, Outlook, Word, and Teams running.

Perl 5.18.0 gives me 3.5 MiB per second. Perl 5.28.3, 5.30.3, and 5.34.0 gives 4 MiB per second.

    perl5.34.0 -e 'while (){ print 1 }' | pv > /dev/null
For Python 3.10.4, I get about 2.8 MiB/s as you have it written, but around 5 MiB/s (same for 3.9 but only 4 MiB/s for 3.8) with this. I also get 4.8 MiB/s with 2.7:

    python3 -c 'while (1): print (1)' | pv > /dev/null
If I make Perl behave like yes and print a character and a newline, it has a jump of its own. The following gives me 37.3 MiB per second.

    perl5.34.0 -e 'while (){ print "1\n" }' | pv > /dev/null
Interestingly, using Perl's say function (which is like a Println) slows it down significantly. This version is only 7.3 MiB/s.

    perl5.34.0 -E 'while (1) {say 1}' | pv > /dev/null
Go 1.18 has 940 KiB/s with fmt.Print and 1.5 MiB/s with fmt.Println for some comparison.

    package main

    import "fmt"

    func main() {
            for ;; {
                    fmt.Println("1")
            }
    }

These are all macports builds.


For me:

Python3: 3 MiB/s

Node: 350 KiB/s

Lua: 12 MiB/s

  lua -e 'while true do io.write("1") end' | pv > /dev/null
Haskell: 5 MiB/s

  loop = do
    putStr "1"
    loop

  main = loop
Awk: 4.2 MiB/s

  yes | awk '{printf("1")}' | pv > /dev/null


Lua is an interesting one.

    while true do
      io.write "1"
    end
PUC-Rio 5.1: 25 MiB/s

PUC-Rio 5.4: 25 MiB/s

LuaJIT 2.1.0-beta3: 550 MiB/s <--- WOW

They all go slightly faster if you localize the reference to `io.write`

    local write = io.write
    while true do
      write "1"
    end


> They all go slightly faster if you localize the reference to `io.write`

No noticeable difference for LuaJIT, which makes sense, since JIT should figure it out without help.


And this, folks, is why you have immutable modules. If you know before runtime what something is, lookup is a lot faster.


Ah yes you're right. Basically no difference with LuaJIT.

5.1 and 5.4 show about ~8% improvement.


Haskell can be even simpler:

    main = putStr (repeat '1')
[Edit: as pointed out below, this is no longer the case!]

Strings are printed one character at a time in Haskell. This choice is justified by unpredictability of the interaction between laziness and buffering; I am uncertain it's the correct choice, but the proper response is to use Text where performance is relevant.


Wow, this does 160 MiB/s. That's a huge improvement! The output of strace looks completely different:

  poll([{fd=1, events=POLLOUT}], 1, 0)    = 1 ([{fd=1, revents=POLLOUT}])
  write(1, "11111111111111111111111111111111"..., 8192) = 8192
  poll([{fd=1, events=POLLOUT}], 1, 0)    = 1 ([{fd=1, revents=POLLOUT}])
  write(1, "11111111111111111111111111111111"..., 8192) = 8192
With the recursive code, it buffered the output in the same way but bugged the kernel a whole lot more in-between writes. Not exactly sure what is going on:

  poll([{fd=1, events=POLLOUT}], 1, 0)    = 1 ([{fd=1, revents=POLLOUT}])
  write(1, "11111111111111111111111111111111"..., 8192) = 8192
  rt_sigprocmask(SIG_BLOCK, [INT], [], 8) = 0
  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=920390843}) = 0
  rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
  rt_sigprocmask(SIG_BLOCK, [INT], [], 8) = 0
  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=920666397}) = 0
  ...
  rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
  poll([{fd=1, events=POLLOUT}], 1, 0)    = 1 ([{fd=1, revents=POLLOUT}])
  write(1, "11111111111111111111111111111111"..., 8192) = 8192


I'm honestly surprised either of them wind up buffered! That must be a change since I stopped paying as much attention to GHC.

I'm also not sure what's going on in the second case. IIRC, at some point historically, a sufficiently tight loop could cause trouble with handling SIGINT, so it might be related to some overagressive workaround for that?


On my extremely old desktop PC (Phenom II 550) running an out-of-date OS (Slackware 14.2):

Bash:

    while :; do printf "1"; done  | ./pv > /dev/null
    [ 156KiB/s]
Python3 3.7.2:

    python3 -c 'while (1): print (1, end="")' | ./pv > /dev/null
    [1,02MiB/s]
Perl 5.22.2:

    perl -e 'while (true) {print 1}'  | ./pv > /dev/null
    [3,03MiB/s]
Node.js v12.22.1:

    node -e 'while (1) process.stdout.write("1");' | ./pv > /dev/null
    [ 482KiB/s]


Potential buffering issues aside, as others have pointed out the node.js example is performing asynchronous writes, unlike the other languages' examples (as far as I know).

To do a proper synchronous write, you'd do something like:

  node -e 'const { writeSync } = require("fs"); while (1) writeSync(1, "1");' | pv > /dev/null
That gets me ~1.1MB/s with node v18.1.0 and kernel 5.4.0.


You're testing a very specific operation, a loop, in each language to determine its speed, not sure if I'd generalize that. I wonder what it'd look like if you replaced the loop with static print statements that were 1000s of characters long with line breaks, the sort of things that compiler optimizations do.


I find that NodeJS runs eventually out of memory and crashes with applications that do a large amount of data processing over a long time with little breaks even if there are no memory leaks.

Edit: I've found this consistently building multiple data processing applications over multiple years and multiple companies


Perhaps different approaches to caching?

I'm reminded of this StackOverflow question, Why is reading lines from stdin much slower in C++ than Python?

https://stackoverflow.com/q/9371238/


I'll tell you what's fun. I get 5MB/sec with Python, 1.3MB/sec with Node and.... 12.6MB/sec with Ruby! :-) (Added: Same speed as Node if I use $stdout.sync = true though..)


Python pushes 15MiB/s on my M1 Pro if you go down a level and use sys directly.

   python3 -c 'import sys
   while (1): sys.stdout.write("1")'| pv>/dev/null


That caches though. You can see it when you strace it.


    python3 -u -c 'import sys
    while (1): sys.stdout.write("1")'| pv>/dev/null
427KiB/s

    python3 -c 'import sys
    while (1): sys.stdout.write("1")'| pv>/dev/null
6.08MiB/s

Using python 3.9.7 on macOS Monterey.


Good point, but so does a standard print call. Calling flush() after each write does bring the perf down to 1.5MiB/s.


I was getting different results depending on when I run it. Took me a second to realize it was my processor frequency scaling.


What version of node are you using? It seems to run indefinitely on 14.19.3 that comes with Ubuntu 20.04.


Using `sys.stdout.write()` instead of `print()` gets ~8MiB/s on my machine.


This site is pleasing to the eye.


It looks like it is using the "Tufte" style, named after Edward Tufte, who is very famous for his writing on data visualization. More examples: https://rstudio.github.io/tufte/


Love the subtle "stonks" overlay on the first chart


Now this is the kind of content I come to HN for. Absolutely fascinating read.


Android's flavor of Linux uses "binder" instead of pipes because of its security model. IMHO filesystem-based IPC mechanisms (notably pipes) can't be used because of the lack of a world-writable directory - I may be wrong here.

Binder comes from Palm actually (OpenBinder)


Pipes don’t necessarily mean one has to use FS permissions. Eg a server could hand out anonymous pipes to authorized clients via fd passing on Unix domain sockets. The server can then implement an arbitrary permission check before doing this.
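
The fd-passing part looks roughly like this on the sender side (a sketch assuming an already-connected AF_UNIX socket; the function name is mine):

  #include <string.h>
  #include <sys/socket.h>
  #include <sys/uio.h>

  /* Send one file descriptor (e.g. the write end of a pipe) over a
     connected Unix domain socket using SCM_RIGHTS ancillary data. */
  int send_fd(int sock, int fd_to_send) {
      char byte = 0;                        /* must carry >= 1 byte of data */
      struct iovec iov = { .iov_base = &byte, .iov_len = 1 };

      union {                               /* properly aligned cmsg buffer */
          struct cmsghdr hdr;
          char buf[CMSG_SPACE(sizeof(int))];
      } u;
      memset(&u, 0, sizeof(u));

      struct msghdr msg = {
          .msg_iov = &iov, .msg_iovlen = 1,
          .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
      };
      struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
      cmsg->cmsg_level = SOL_SOCKET;
      cmsg->cmsg_type = SCM_RIGHTS;
      cmsg->cmsg_len = CMSG_LEN(sizeof(int));
      memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

      return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
  }

The receiver does the mirror image with recvmsg() and pulls the new fd out of CMSG_DATA; the permission check can happen before the server ever calls this.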


"lack of a world-writable directory"

What's that?

A lot of programs store sockets in /run which is typically implemented by `tmpfs`.


History of binder is more involved and has its seeds on BeOS IIRC.


I've dumped pixels and pcm audio through a pipe, it certainly was fast enough for that https://git.cloudef.pw/glcapture.git/tree/glcapture.c (I suggest gamescope + pipewire to do this instead nowadays however)


I usually just use cat /dev/urandom > /dev/null to generate load. Not sure how this compares to their code.

Edit: it’s actually “yes” that I’ve used before for generating load. I remember reading somewhere “yes” was optimized differently than the original Unix command as part of the unix certification lawsuit(s).

Long night.


On 5.10.0-14-amd64 "pv < /dev/urandom >/dev/null" reports 72.2MiB/s. "pv < /dev/zero >/dev/null" reports 16.5GiB/s. AMD Ryzen 7 2700X with 16GB of DDR4 3000MHz memory.

"tr '\0' 1 </dev/zero | pv >/dev/null" reports 1.38GiB/s.

"yes | pv >/dev/null" reports 7.26GiB/s.

So "/dev/urandom" may not be the best source when testing performance.


You're right and I mistyped, it's yes that I usually use. I think it's optimized for throughput.


I think they were generating load. Going through the urandom device isn't bad for that, since it has to do a bit of work to get each random number. Just for throughput though, zero is probably better.


I don't understand. If you're testing how fast pipes are, then I'd expect you to measure throughput or latency. Why would you measure how fast something unrelated to pipes is? If you want to measure this other thing on the other hand, why would you bother with pipes, which add noise to the measurement?

UPDATE: If you mean that you want to test how fast pipes are when there is other load in the system, then I'd suggest just running a lot of stuff in the background. But I wouldn't put the process dedicated for doing something else into the pipeline you're measuring. As a matter of fact, the numbers I gave were taken with plenty of heavy processes running in the background, such as Firefox, Thunderbird, a VM with another instance of Firefox, OpenVPN, etc. etc. :)


Because they mentioned generating load, not testing pipe performance.


Oh, wait. You mean that this "cat </dev/urandom >/dev/null" was meant to be running in the background and not be the pipeline which is tested? Ok, my bad for not getting the point.


"Generating load" for measuring pipe performance means generating bytes. Any bytes. urandom is terrible for that.


I'm glad huge pages make a big difference because I just spent several hours setting them up. Also everyone says to disable transparent_hugepage, so I set it to `madvise`, but I'm skeptical that any programs outside databases will actually use them.


JVM can. I have JetBrains set up to use them.


yep, you want perf? Don't mutex then yield, do spin and check your cpu heat sink.

:)


Maybe a stupid question, but why aren't pipes simply implemented as a contiguous buffer in a shared memory segment + a futex?
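
Something like this toy single-slot version is what I have in mind - shared memory plus a futex for wakeups (a sketch: one slot instead of a ring buffer, no error handling, seq_cst atomics for simplicity):

  #define _GNU_SOURCE
  #include <linux/futex.h>
  #include <stdatomic.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <sys/syscall.h>
  #include <sys/wait.h>
  #include <unistd.h>

  struct slot {
      atomic_int state;            /* 0 = empty, 1 = full; also the futex word */
      char data[64];
  };

  static void fwait(atomic_int *addr, int val) {   /* sleep while *addr == val */
      syscall(SYS_futex, addr, FUTEX_WAIT, val, NULL, NULL, 0);
  }
  static void fwake(atomic_int *addr) {
      syscall(SYS_futex, addr, FUTEX_WAKE, 1, NULL, NULL, 0);
  }

  int main(void) {
      struct slot *s = mmap(NULL, sizeof(*s), PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
      atomic_store(&s->state, 0);

      if (fork() == 0) {                      /* child: reader */
          while (atomic_load(&s->state) == 0)
              fwait(&s->state, 0);            /* returns early if state changed */
          printf("got: %s\n", s->data);
          atomic_store(&s->state, 0);
          fwake(&s->state);
          return 0;
      }

      strcpy(s->data, "hello via shared memory");   /* parent: writer */
      atomic_store(&s->state, 1);
      fwake(&s->state);
      while (atomic_load(&s->state) == 1)     /* wait until consumed */
          fwait(&s->state, 1);
      wait(NULL);
      return 0;
  }

Of course a real pipe also has to work between unrelated processes and play nicely with read/write semantics, O_NONBLOCK, poll/select, splice, fd passing and permissions, which I assume is a big part of why it lives in the kernel instead.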


The visual design is amazing.


Something maybe a bit related.

I just had 25Gb/s internet installed (https://www.init7.net/en/internet/fiber7/), and at those speeds Chrome and Firefox (which is Chrome-based) pretty much die when using speedtest.net at around 10-12Gbps.

The symptoms are that the whole tab freezes, and the shown speed drops from those 10-12Gbps to <1Gbps and the page starts updating itself only every second or so.

IIRC Chrome-based browsers use some form of IPC with a separate networking process, which actually handles networking, I wonder if this might be the case that the local speed limit for socketpair/pipe under Linux was reached and that's why I'm seeing this.


> and at those speeds Chrome and Firefox (which is Chrome-based)

AFAIK, Firefox is not Chrome-based anywhere.

On iOS it uses whatever iOS provides for webview - as does Chrome on iOS.

Firefox and Safari are now the only supported mainstream browsers that have their own rendering engines. Firefox is the only one that has its own rendering engine and is cross platform. It is also open source.


> Firefox is the only one that has its own rendering engine and is cross platform.

Interestingly Safari's rendering engine is open source and cross platform, but the browser is not. Lots of Linux-focused browsers (Konqueror, GNOME Web, surf) and most embedded browsers (Nintendo DS & Switch, PlayStation) use WebKit. Also some user interfaces (like WebOS, which is running on all of LG's TVs and smart refrigerators) use WebKit as their renderer.


WebKit itself is a fork of the Konqueror's original KHTML engine by the way.


Browser Genealogy


Now I want to see the family tree!



iOS uses WebKit, which is also what Chrome is based on.


Chrome uses Blink, which was forked from WebKit's WebCore in 2013. They replaced JavaScriptCore with V8.


> AFAIK, Firefox is not Chrome-based anywhere.

Not technically "Chrome-based", but Firefox draws graphics using Chrome's Skia graphics engine.

Firefox is not completely independent from Chrome.


Skia started in 2004 independently of Google and was then acquired by Google. Calling it "Chrome's Skia graphics engine" makes it sound like it was built for Chrome.


I feel like counting every library is silly.

In any case, I thought Chrome used libnss, which is a Mozilla library, so you could say the reverse as well.


Chrome fires up many processes and creates an IPC-based comm network between them to isolate stuff. It's somewhat abusing your OS to get what it wants in terms of isolation and whatnot.

(Which is similar to how K8S abuses iptables and makes it useless for other ends, and makes you install a dedicated firewall in front of your ingress path, but let's not digress).

On the other hand, Firefox is neither Chromium-based, nor is it a cousin of it. It's a completely different codebase, inherited from Netscape days and evolved up to this point.

As another test point, Firefox doesn't even blink at a symmetric gigabit connection going at full speed (my network is capped by my NIC, the pipe is way fatter).


It is using what OS processes were created for in the first place.

Unfortunately the security industry has proven why threads are a bad idea for applications where security is a top concern.

Same applies to dynamically loaded code as plugins, where the host application takes the blame for all the instability and exploits they introduce.


Yes, Firefox is also doing the same; however, due to the nature of Firefox's processes, the OS doesn't lose much responsiveness or feel bogged down when I have 50+ tabs open for some research.

If you need security, you need isolation. If you want hardware-level isolation, you need processes. That's normal.

My disagreement with Google's applications is how they behave like they're the only processes running on the system. I'm well aware that some of the most performant or secure things don't have the prettiest implementation on paper.


There used to be a setting to tweak Chrome's process behavior.

I believe the default behavior is "Coalesce tabs into the same content process if they're from the same trust domain".

Then you can make it more aggressive like "Don't coalesce tabs ever" or less aggressive like "Just have one content process". I think.

I'm not sure how Firefox decides when to spawn new processes. I know they have one GPU process and then multiple untrusted "content processes" that can touch untrusted data but can't touch the GPU.

I don't mind it. It's a trade-off between security and overhead. The IPC is pretty efficient and the page cache in both Windows and Linux _should_ mean that all the code pages are shared between all content processes.

Static pages actually feel light to me. I think crappy webapps make the web slow, not browser security.

(inb4 I'm replying to someone who works on the Firefox IPC team or something lol)


> inb4 I'm replying to someone who works on the Firefox IPC team or something lol

The danger and joy of commenting on HN!


I'm harmless, don't worry. :) Also you can find more information about me in my profile.

Even if I was working on Firefox/Chrome/whatever, I'd not be mad at someone who doesn't know something very well. Why should I? We're just conversing here.

Also, I've been very wrong here at times, and this improved my conversation / discussion skills a great deal.

So, don't worry, and comment away.


> As another test point, Firefox doesn't even blink at a symmetric gigabit connection going at full speed (my network is capped by my NIC, the pipe is way fatter).

FWIW Firefox under Linux (Firefox Browser 100.0.2 (64-bit)) behaves pretty much the same as Chrome. The speed raises quickly to 5-8Gb/s, then the UI starts choking, and the shown speed drops to 500Mb/s. It could be that there's some scheduling limit or other bottleneck hit in the OS itself, assuming these are different codebases (are they?).


I'd love to test and debug the path where it dies, but none of the systems we have Firefox on have pipes that fat (again, NIC-limited).

However, you can test the limits of Linux by installing the CLI version of Speedtest and hitting a nearby server.

The bottleneck maybe in the browser itself, or in your graphics stack, too.

Linux can do pretty amazing things in the network department, otherwise 100Gbps Infiniband cards wouldn't be possible on Linux servers, yet we have them on our systems.

And yes, Chrome and Firefox are way different browsers. I can confidently say this, because I've been using Firefox since it was called Netscape 6.0 (and Mozilla in Knoppix).


From my experience long ago, all high performance networking under Linux was traditionally user space and pre-allocated pools (netmap, dpdk, pf-ring...). I did not follow how much io_uring has been catching up for network stack usage... maybe somebody else knows?


While I'm not very knowledgeable in specifics, there are many paths for networking in Linux now. The usual kernel based one is there, also there's kernel-bypass [0] paths used by very high performance cards.

Also, Infiniband can directly RDMA to and from MPI processes for making "remote memory local", allowing very low latencies and high performance in HPC environments.

I also like this post from Cloudflare [1]. I've read it completely, but the specifics are lost on me since I'm not directly concerned with the network part of our system.

[0]: https://medium.com/@penberg/on-kernel-bypass-networking-and-...

[1]: https://blog.cloudflare.com/how-to-receive-a-million-packets...


I have a service that beats epoll with io_uring (it reads gre packets from one socket, and does some lookups/munging on the inner packet and re-encaps them to a different mechanism and writes them back to a different socket). General usage for io_uring vs epoll is pretty comparable IIUC. It wouldn't surprise me if streams (e.g. tcp) end up being faster via io_uring and buffer registration though.

Totally tangential - it looks like io_uring is evolving beyond just io and into an alternate syscall interface, which is pretty neat imho.


> I can confidently say this, because I'm using Firefox since it's called Netscape 6.0 (and Mozilla in Knoppix).

Mozilla suite/seamonkey isn't usually considered the same as firefox, although obviously related.


I'm not talking about the version which evolved to Seamonkey. I'm talking about Mozilla/Firefox 0.8 which had a Mozilla logo as a "Spinner" instead of Netscape logo on the top right.


Netscape 6 was not firefox based https://en.m.wikipedia.org/wiki/Netscape_6

Firefox 0.8 did not have netscape branding http://theseblog.free.fr/firefox-0.8.jpg


> Netscape 6 was not Firefox based.

I know. Firefox was not even an idea when Netscape 6 was released. However, the inverse is true. Firefox is based on Netscape. It just branched off, actually. It started as a pared down version of SeaMonkey apparently.

The thing I was remembering from Knoppix 3.x days was "Mozilla Navigator" of SeaMonkey/Mozilla Suite, which is even older than Firefox, and discontinued 3 years later. I just booted the CD to look at it.

At the end of the day, Firefox is just Netscape Navigator, evolved.


Firefox is not based on the chromium codebase, it is older.


Well if we're talking ancestors that's technically true, but not by that much - Firefox comes from Netscape, Chrome/Safari/... come from KHTML.


> ... on August 16, 1999 that [Lars Knoll] had checked in what amounted to a complete rewrite of the KHTML library—changing KHTML to use the standard W3C DOM as its internal document representation. https://en.wikipedia.org/wiki/KHTML#Re-write_and_improvement

> In March 1998, Netscape released most of the code base for its popular Netscape Communicator suite under an open source license. The name of the application developed from this would be Mozilla, coordinated by the newly created Mozilla Organization https://en.wikipedia.org/wiki/Mozilla_Application_Suite#Hist...

Netscape Communicator (or Netscape 4) was released in 1997, so if we are tracing lineage, I'd say Firefox has a 2 year head start.


AFAIR, KHTML was/is not related to Netscape/Gecko in any way.


Unrelated question, what hardware do you use to setup your network for 25Gb/s? I've been looking at init7 for a while, but gave up and stayed with Salt after trying to find the right hardware for the job.


NIC: Intel E810-XXVDA2

Optics: To ISP: Flexoptics (https://www.flexoptix.net/de/p-b1625g-10-ad.html?co10426=972...), Router-PC: https://mikrotik.com/product/S-3553LC20D

Router: Mikrotik CCR-2004 - https://mikrotik.com/product/ccr2004_1g_12s_2xs - warning: it's good for up to ~20Gb/s one way. It can handle ~25Gb/s down, but only ~18Gb/s up, and with IPv6 the max seems to be ~10Gb/s in any direction.

If Mikrotik is something you're comfortable using you can also take a look at https://mikrotik.com/product/ccr2216_1g_12xs_2xq - it's more expensive (~2500EUR), but should handle 25Gb/s easily.


IIRC most Mikrotik products lack hardware IPv6 offload which is probably why you're seeing lower speeds.


In that case 10Gb/s sounds actually pretty good, if that's without hardware offload.


Speedtest does have a CLI as well, might be interesting to compare them.


Thing to note: the open source version on GitHub, installable by homebrew and native package managers, is not the same version as Ookla distributes from their website and is not accurate at all.



This makes me wonder... does anyone offer an iperf-based speedtest service on the Internet?


Ha.. my ISP does :) I can hit those 25Gb/s when connecting directly (bypassing the router as it barely handles those 25Gb/s).

With it in the way I get ~15-20Gb/s

  $ iperf3 -l 1M --window 64M -P10 -c speedtest.init7.net
  ..
  [SUM]   0.00-1.00   sec  1.87 GBytes  16.0 Gbits/sec  181406

  $ iperf3 -R -l 1M --window 64M -P10 -c speedtest.init7.net
  ..
  [SUM]   0.00-1.00   sec  2.29 GBytes  19.6 Gbits/sec


Well there are some public iperf servers listed here: https://iperf.fr/iperf-servers.php


https://github.com/R0GGER/public-iperf3-servers is also a good list - a couple more US servers in various datacenters


Is it only affecting the browser or the entire system? It might be possible that the CPU is busy handling interrupts from the ethernet controller, although in general these controllers should use DMA and should not send interrupts frequently.


Only browser(s), the OS is capable of 25Gb/s - checked with iperf and also speedtest-cli - https://www.speedtest.net/result/c/e9104814-294f-4927-af9f-d...


I ran into this with a VDI environment in a data center. We had initially delivered 10Gb Ethernet to the VMs, because why not.

Turned out Windows 7 or the NICs needed a lot of tuning to work well. There was a lot of freezing and other fail.


Sounds like a hard drive cache filling up.


One would assume speed testing website would use `Cache-Control: no-store`...

But alas, they do not, lol. They just use no-cache on the query which will not prevent the browser from storing the data.

https://megous.com/dl/tmp/8112dd9346dd66e8.png


Firefox is only Chrome-based on iOS.


It's Safari-based, which is WebKit-based. Chrome is also Safari-based on iOS, because all the browsers must be. There's no actual Chrome (as in Blink, the browser engine) on iOS, at least in the App Store.


> It's Safari-based, which is Webkit-based.

Firefox only uses Webkit on iOS, due to Apple requirements. It uses Gecko everywhere else. And I don't think it's ever been Safari-based anywhere.


Replying to

> Firefox is only Chrome-based on iOS.

So I'm talking only about iOS. When I said it's Safari-based, I meant Webkit based, but I thought Firefox/Chrome actually pull parts of Safari on iOS. Quick research says that's wrong and they just use Webkit. Not an iOS dev, so someone can point out better sources for the 100% correct terminology.


You mean WebKit.


Do you actually mean Gbit/s? 25Gb/s would translate to 200Gbit/s ...


The small "b" is customarily used to refer to bits, with the large "B" used to refer to bytes. So 25 Gb/s would be 25 Gbit/s, while 25 GB/s would be 200 Gbit/s.


Gb != GB. Per Wikipedia, which aligns with my understanding,

"The gigabit has the unit symbol Gbit or Gb."

25GB/s would translate to 200Gbit/s and also 200Gb/s.


Does an API similar to vmsplice exist for Windows?


Love the subtle stonks background in the first image.


pv is written in perl so isn't the snappiest, I'm surprised to see it score so highly. I wonder what the initial speed would have been if it just wrote to /dev/null


It's not written in perl, it's written in C, and it uses splice() (one of the syscalls discussed in the post).


Definitely C, per what appears to be the official repo (linking the splice syscall) - https://github.com/icetee/pv/blob/master/src/pv/transfer.c#L...


I was totally wrong. Thank you for showing me the facts.


Confused with parallel, maybe?


Linux pipes?

Oh yes, Linux pipes were invented by Douglas McIlroy while working for Bell Labs on Research UNIX and first described in the man pages of Version 3 Unix, Feb. 1974, just a couple of months after Linus Torvalds' 4th birthday.

Where and how and when will the unjust and blatant plagiarism of Linux cease? The software was made free by BSD, so feel free to use it, roll it all into GNU/Linux, have at it, but please stop incorrectly describing these things as Linux things. Because the only software that I am certain actually belongs to Linux is systemd. So let's start calling that "Linux systemd," and stop calling anything else Linux anything.


> In this post, we will explore how Unix pipes are implemented in Linux

Seems to me like this post is pretty specifically about Unix pipes on Linux (i.e. Linux pipes), as opposed to Unix pipes in general.

The article also talks about “Linux paging”, again clearly referring to the implementation and usage of virtual memory on Linux, rather than…whatever ancient architecture first invented the page table.


This would be valid, but only if the implementation of pipes in Linux is different than other pipelines, regardless of differences in memory paging, which the article, as detailed and in depth as it is, never makes clear.

So I'm not sure it is not analogous to, "today, we're going to talk about Linux electricity, specifically the way electricity is utilized in Linux."


When you have multiple ways to interpret what someone says, it's generally a good idea to assume best intentions. In this case Linux pipes would then refer to pipes in Linux, rather than implied ownership or origin. This avoids a lot of unnecessary squabbling


So pipes in Linux are significantly different than other pipelines? Was the wheel really reinvented when Linux was developed from Minix, such that the Minix pipeline implementation was abandoned? RLY??! I doubt it.


It's just how language works. You can say "The desert sun is making me thirsty", even when in fact it is the same sun. It's not even located in the desert, nor is it doing anything special at all. Still somehow everyone intuitively knows what you mean.



