Hacker News

John Carmack tweeted about this problem today:

>I started a tar with bzip command on a big directory, and it has been running for two days. Of course, it is only using 1.07 cores out of the 128 available. The Unix pipeline tool philosophy often isn’t aligned with parallel performance.

https://twitter.com/ID_AA_Carmack/status/1656708636570271768...



I wasn't aware the Unix philosophy was to not use multithreading on large jobs that can be parallelized.

You can complain about philosophies, but this is just using the wrong tool for the job. Complain about bzip2 if you feel the bzip2 authors should have made a multithreaded implementation for you.


Unix generally favors processes over threads, at least the old school Unix. Threads are a more recent innovation.

The old approach was that programs don't need internal parallelism because you can get it by just piping stuff and relying on the kernel's buffering to keep multiple processes busy.

Eg, tar is running on one core dealing with the filesystem, gzip is running on another core compressing stuff.

In the early days, Windows would have a single program doing everything (eg, Winzip) on a single core, while Unix would have a pipeline with the task split into multiple programs that would allow for this implicit parallelism and perform noticeably better.
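That implicit parallelism is just the plain two-process pipeline; a minimal sketch (the sample directory here is invented):

```shell
# tar walks the filesystem in one process while gzip compresses in another;
# the kernel's pipe buffering keeps both busy, so two cores get used "for free".
set -e
d=$(mktemp -d)
mkdir -p "$d/src"
printf 'hello pipeline\n' > "$d/src/a.txt"
tar -C "$d" -cf - src | gzip > "$d/src.tar.gz"
tar -tzf "$d/src.tar.gz"
```

Each stage blocks only when the pipe buffer is full or empty, so as long as both sides have work, both run concurrently.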

Today this old approach doesn't cut it anymore on 128-core setups.


“Recent” as in the '80s or '90s, sure. Threads are now older than Unix was when threads were introduced.


I'm not sure I understand. I've typically thought of threads, under that name, as being formalized with POSIX threads. Before that, though my memory is vague on this, it was all lumped into multiprocessing under various names.


Even if you ignore threading models prior to pthreads, pthreads itself dates to 1995.


And Unix dates to 1973.


So: Unix was ~22 when Pthreads were introduced in 1995, and Pthreads are 28 years old now.


I agree with most of these points, except blaming this on processes vs threads. The only difference is all memory being shared by default versus explicitly deciding what memory to share. With all the emphasis on memory safety on HN, you'd think this point would be appreciated.


That's fairly new, with threads and processes becoming basically the same. Historically threads didn't exist; then they were horrifically implemented and non-standard; then they were standardized but still horrifically implemented; then they were better implemented but the APIs were difficult to use safely, etc. Threads were also much more lightweight than a process. This shifted with lightweight processes, etc.


That's true, but if you're looking that far back, multicore is new too


I guess “that far back” becomes different as you get older :-) it doesn’t seem that long ago to me :-)


Well, if you split the file into chunks you could fan it across cores by compressing each individually.
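A hand-rolled version of that fan-out might look like this (a sketch; the chunk size and file names are arbitrary). gzip members can be concatenated, so the per-chunk outputs still form one valid archive:

```shell
# Split the input into fixed-size chunks and compress each chunk in its own
# background gzip process, roughly one per core.
set -e
w=$(mktemp -d)
head -c 1048576 /dev/zero > "$w/big.bin"   # stand-in for the real big file
split -b 262144 "$w/big.bin" "$w/chunk."   # 4 chunks: chunk.aa .. chunk.ad
for c in "$w"/chunk.*; do
  gzip "$c" &
done
wait
# Concatenated gzip members decompress back to the original byte stream.
cat "$w"/chunk.*.gz | gunzip | wc -c
```

The final count matches the original 1 MiB input, since the lexicographic chunk names keep the pieces in order.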


Having to do things like this is exactly the problem.


Processes can still be fine. If you're lucky you can be debating over whether you'd like something to be 99x or 100x faster.


With all due respect to Carmack, he's using bzip2 in 2023; that's pretty outdated on every front.


You'd be surprised. There are some workloads - for me, it's geospatial data - where bzip2 clobbers all of the alternatives.


I'm using bzip2 to compress a specific type of backup. In my case I cannot afford to steal CPU time from other running processes, so the backup process runs with a severely limited CPU percentage. By crude testing I found that bzip2 used 10x less memory and finished several times faster than xz, while being very close on compression ratio.

Other algorithms like zstd and gz resulted in much lower compression ratios.

I'm sure there is a more efficient solution, but changing three letters in a script was pretty much the maximum amount of effort I was going to put in.

On an unrelated note, has someone already made a meta-compression algorithm which simply picks the best performing compression algorithm for each input?


I've not seen one that picks the best compression algorithm, but I've seen ones that run a quick test to determine whether the data is worth compressing at all. For example, the borg-backup software can be configured to try a light & fast compression algorithm on each chunk of data. If the chunk compresses well, it then uses a more computationally expensive algorithm to really squash it down.


I've also noticed for some text documents (was it JSON? I don't remember) that bzip2 compresses significantly better than xz (and of course gzip/pigz). Not sure if I tested zstd with high/extreme settings at that time.


For some reason, bzip2 compresses text incredibly well. And it has for years; I remember noticing this almost 20 years ago.


It uses the Burrows-Wheeler transform to place similar strings next to each other before using other compression tricks, so it usually does a bit better.


Oh, that's interesting, I stopped using bzip2 at the time kernel sources started shipping in xz.

Do you know if there are any tests showing which compressor is better (compression-wise) for which kind of data?
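I don't know of a maintained comparison table, but it's quick to measure on your own data; a sketch that uses whichever of these tools happen to be installed:

```shell
set -e
sample=$(mktemp)
seq 1 50000 > "$sample"   # repetitive text, the kind BWT-based bzip2 likes
for tool in gzip bzip2 xz zstd; do
  command -v "$tool" >/dev/null 2>&1 || continue
  "$tool" -c "$sample" > "$sample.$tool"
  printf '%s\t%s bytes\n' "$tool" "$(wc -c < "$sample.$tool")"
done
```

Swap in a representative sample of your real data; rankings shift a lot between logs, JSON, and binary formats.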


With all respect to John Carmack (and it is really a lot of respect!), I'm surprised he seems unaware of pbzip2. It's a parallel implementation that scales almost linearly with the number of cores, and it has been around since ~2010, so while it's not yet old enough to drive, anyone dealing with bzip2'ing large amounts of data should have discovered it long ago.

And yes, use zstandard (or xz, where the default binary in your distro is already multithreaded) where you can.
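The drop-in parallel pipeline, as a sketch (falls back to single-threaded bzip2 when pbzip2 is absent; the directory name is invented):

```shell
set -e
d=$(mktemp -d)
mkdir "$d/bigdir"
seq 1 10000 > "$d/bigdir/data.txt"
if command -v pbzip2 >/dev/null 2>&1; then
  bz="pbzip2 -p4"    # 4 compressor threads; raise to match your core count
else
  bz="bzip2"         # single-threaded fallback
fi
tar -C "$d" -cf - bigdir | $bz > "$d/bigdir.tar.bz2"
bzip2 -t "$d/bigdir.tar.bz2" && echo OK
```

pbzip2 emits independent bzip2 streams back to back, which stock bzip2 can still test and decompress.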


> anyone dealing with bzip2'ing large amounts of data

... should really reconsider their choice of compression algorithm at this point.


But pigz shows that the Unix pipeline philosophy works just fine. (Of course, compressing before tarring is probably better than compressing the tarred file, but that should be pipelineable as well.)
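E.g. a sketch that uses pigz when present and plain gzip otherwise; the output is an ordinary .tar.gz either way:

```shell
set -e
d=$(mktemp -d)
mkdir "$d/tree"
seq 1 20000 > "$d/tree/log.txt"
gz=$(command -v pigz || command -v gzip)   # parallel compressor if available
tar -C "$d" -cf - tree | "$gz" > "$d/tree.tar.gz"
gunzip -t "$d/tree.tar.gz" && echo OK      # plain gunzip can read pigz output
```

Because pigz produces standard gzip output, nothing downstream has to change.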


For many small files, compressing first will compress worse, because each file has its own dictionary; compressing last means you can take advantage of similarities in files to improve compression ratio.

Compressing first can also be slower if the average file size is smaller than the block size, because the main thread cannot queue new jobs as fast as cores complete them (this happens e.g. with 7zip at fast compression settings with solid archive turned off). Tarring then compressing means small files can be aggregated into a single block, giving both good speed and compression ratio.
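The per-file overhead is easy to see with toy data:

```shell
set -e
d=$(mktemp -d)
mkdir "$d/many"
for i in $(seq 1 100); do seq 1 50 > "$d/many/f$i.txt"; done
# Per-file: every gzip stream pays its own header and dictionary warm-up.
per_file=$(for f in "$d"/many/*.txt; do gzip -c "$f"; done | wc -c)
# Tar then compress: one stream sees the redundancy across all the files.
whole=$(tar -C "$d" -cf - many | gzip -c | wc -c)
echo "per-file: $per_file bytes, tar-then-gzip: $whole bytes"
```

With many small, similar files the tar-then-gzip total comes out far smaller, even though tar adds a 512-byte header per file.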


Zstandard has a dictionary functionality which allows you to pre-train on sample data to achieve higher compression ratios and faster compression on large numbers of small files.
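The dictionary step is manual; roughly (a sketch, assuming zstd is installed; the sample files here are invented):

```shell
set -e
d=$(mktemp -d)
mkdir "$d/samples"
for i in $(seq 1 200); do seq "$i" $((i + 400)) > "$d/samples/s$i"; done
command -v zstd >/dev/null 2>&1 || exit 0   # nothing to demo without zstd
# Train a shared dictionary once on representative samples...
zstd -q --train "$d"/samples/* -o "$d/dict" --maxdict=16384
# ...then reuse it per small file (decompression needs -D with the same dict).
zstd -q -D "$d/dict" "$d/samples/s1" -o "$d/s1.zst"
test -s "$d/s1.zst" && echo trained
```

Training wants a decent volume of representative samples, and the dictionary has to travel with the compressed files.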


Don't you then need to store that dictionary somewhere out of band? It seems like you would still need a tar-like format to manage that, at which point as an archive format it seems more complicated with a worse separation of concerns.


Interesting. Does it do that automatically, or is there a manual step?


> compressing before tarring is probably better

Not if the files are similar. If you're compressing the files separately you'll start with a clean state rather than reusing previous fragments. Compressing a BMP after a TXT may not be beneficial, but compressing 3 tar'ed TXTs is definitely better than doing them separately.


That is true, and it can easily halve the total size with small, similar files, but it also means you have to unpack the whole archive when you need the last file in a tarball.

AFAIK the gzip command still cannot compress directory information and therefore needs tar in front of it if you want to retain a folder structure.


Sometimes you need fast indexed access to a specific file in the compressed content without decompressing the entire file (let's say JARs, that are just ZIPs).

TIL: you can use method 93 - Zstandard (zstd) Compression - with ZIPs


The only point of using .zip is maximum compatibility, and you lose that if you use anything other than deflate or no compression. If you are going to use something else, you might as well use a less wasteful and better-defined archive container, or something entirely custom like most games do.


93?



There is a _parallel_ bzip2:

http://compression.great-site.net/pbzip2/

which should solve the 'my cores are idle' issue.


He should just be using pbzip2 :)

https://linux.die.net/man/1/pbzip2


How big is that file? I have 2TB files that compress down to ~300GB, and gunzip'ing them takes ~2-3 hours. Granted, that's still a long-ass time, but not 2-3 days.

If anything, I wonder what kind of hard drive John has. If you're reading them off a network drive backed by tape drums it's probably going to take a while ;P


bzip2 is much slower than gzip.


The bigger bzip2 problem is that even decompression is slower.


bzip2 is not much slower at compression than gzip; gzip is far faster at decompression though. Either way since bzip2 is a block based compressor, parallelization is trivial, and parallel implementations started appearing about 20 years ago; pbzip2 is almost certainly in whichever package manager is in use for TFA.


Of course the problem there is that `tar` outputs a single stream. You might, in similar situations, start multiple `tar` processes on subsets of the input, whose pipelines then become fully parallel again.
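A sketch of that fan-out (the subdirectory layout is invented):

```shell
set -e
d=$(mktemp -d)
mkdir -p "$d/in/a" "$d/in/b"
seq 1 1000 > "$d/in/a/x"
seq 1 1000 > "$d/in/b/y"
for sub in "$d"/in/*/; do
  name=$(basename "$sub")
  # One independent tar|gzip pipeline per subdirectory, all running in parallel.
  tar -C "$d/in" -cf - "$name" | gzip > "$d/$name.tar.gz" &
done
wait
ls "$d"/*.tar.gz
```

The trade-off is several archives instead of one, and load balancing depends on the subdirectories being similarly sized.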


Shame he didn't discover pbzip2 before starting that job.


If it's less than ~98% complete, he could stop it, start over, and still finish sooner.


I'd bet tar becomes the bottleneck before pbzip2 does on that multicore monster. It can be surprisingly slow and I don't think any version of tar uses more than one core.


There are several programs to solve this problem. He just wanted to complain and write something built around that.


Is the .07 of a core a margin of error or some kind of status report done on a different core?


IIRC it's because core usage is an average over time, and you can get weird situations where switching between cores (which happens frequently, depending on the scheduler) means the program's average usage of one core hasn't dropped to zero yet while its usage of another core is already 100%.

This is purely speculation though



