Hacker News

John Carmack tweeted about this problem today:

>I started a tar with bzip command on a big directory, and it has been running for two days. Of course, it is only using 1.07 cores out of the 128 available. The Unix pipeline tool philosophy often isn’t aligned with parallel performance.

https://twitter.com/ID_AA_Carmack/status/1656708636570271768...



I wasn't aware the Unix philosophy was to not use multithreading on large jobs that can be parallelized.

You can complain about philosophies, but this is just using the wrong tool for the job. Complain about bzip2 if you feel the bzip2 authors should have made a multithreaded implementation for you.


Unix generally favors processes over threads, at least the old school Unix. Threads are a more recent innovation.

The old approach was that programs don't need internal parallelism because you can get it by just piping stuff and relying on the kernel's buffering to keep multiple processes busy.

Eg, tar is running on one core dealing with the filesystem, gzip is running on another core compressing stuff.

In the early days, Windows would have a single program doing everything (eg, Winzip) on a single core, while Unix would have a pipeline with the task split into multiple programs that would allow for this implicit parallelism and perform noticeably better.
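That implicit parallelism is just the plain two-process pipeline; a minimal sketch (the sample directory here is invented):

```shell
# tar walks the filesystem in one process while gzip compresses in another;
# the kernel's pipe buffering keeps both busy, so two cores get used "for free".
set -e
d=$(mktemp -d)
mkdir -p "$d/src"
printf 'hello pipeline\n' > "$d/src/a.txt"
tar -C "$d" -cf - src | gzip > "$d/src.tar.gz"
tar -tzf "$d/src.tar.gz"
```

Each stage blocks only when the pipe buffer is full or empty, so as long as both sides have work, both run concurrently.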

Today this old approach doesn't cut it anymore on 128-core setups.


“Recent” as in the '80s or '90s, sure. Threads are now older than Unix was when threads were introduced.


I'm not sure I understand. I've typically thought of threads, under that name, as being formalized with POSIX threads. Before that, though my memory is vague on this, it was all lumped into multiprocessing under various names.


Even if you ignore threading models prior to pthreads, pthreads itself dates to 1995.


And Unix dates to 1973.


So: Unix was ~22 when Pthreads were introduced in 1995, and Pthreads are 28 years old now.


I agree with most of these points, except blaming this on processes vs threads. The only difference is all memory being shared by default versus explicitly deciding what memory to share. With all the emphasis on memory safety on HN, you'd think this point would be appreciated.


That's fairly new, with threads and processes becoming basically the same. Historically threads didn't exist; then they were horrifically implemented and non-standard; then they were standardized but still horrifically implemented; then they were better implemented but the APIs were difficult to use safely, etc. Threads were also much more lightweight than a process. This shifted with lightweight processes, etc.


That's true, but if you're looking that far back, multicore is new too


I guess “that far back” becomes different as you get older :-) it doesn’t seem that long ago to me :-)


Well, if you split the file into chunks you could fan it across cores by compressing each individually.
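A hand-rolled version of that fan-out might look like this (a sketch; the chunk size and file names are arbitrary). gzip members can be concatenated, so the per-chunk outputs still form one valid archive:

```shell
# Split the input into fixed-size chunks and compress each chunk in its own
# background gzip process, roughly one per core.
set -e
w=$(mktemp -d)
head -c 1048576 /dev/zero > "$w/big.bin"   # stand-in for the real big file
split -b 262144 "$w/big.bin" "$w/chunk."   # 4 chunks: chunk.aa .. chunk.ad
for c in "$w"/chunk.*; do
  gzip "$c" &
done
wait
# Concatenated gzip members decompress back to the original byte stream.
cat "$w"/chunk.*.gz | gunzip | wc -c
```

The final count matches the original 1 MiB input, since the lexicographic chunk names keep the pieces in order.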


Having to do things like this is exactly the problem.


Processes can still be fine. If you're lucky you can be debating over whether you'd like something to be 99x or 100x faster.


With all due respect to Carmack, he's using bzip2 in 2023; that's pretty outdated on every front.


You'd be surprised. There are some workloads - for me, it's geospatial data - where bzip2 clobbers all of the alternatives.


I'm using bzip2 to compress a specific type of backup. In my case I cannot afford to steal CPU time from other running processes, so the backup process runs with a severely limited CPU percentage. By crude testing I found that bzip2 used 10x less memory and finished several times faster than xz, while being very close on compression ratio.

Other algorithms like zstd and gz resulted in much lower compression ratios.

I'm sure there is a more efficient solution, but changing three letters in a script was pretty much the maximum amount of effort I was going to put in.

On an unrelated note, has someone already made a meta-compression algorithm which simply picks the best performing compression algorithm for each input?


I've not seen one that picks the best compression algorithm, but I've seen ones that run a quick test to determine whether the data is worth compressing at all. For example, the borg-backup software can be configured to try a light & fast compression algorithm on each chunk of data. If the chunk compresses well, it then uses a more computationally expensive algorithm to really squash it down.


I've also noticed for some text documents (was it JSON? I don't remember) that bzip2 compresses significantly better than xz (and of course gzip/pigz). Not sure if I tested zstd with high/extreme settings at that time.


For some reason, bzip2 compresses text incredibly well. And it has for years; I remember noticing this almost 20 years ago.


It uses the Burrows-Wheeler transform to place similar strings next to each other before using other compression tricks, so it usually does a bit better.


Oh, that's interesting, I stopped using bzip2 at the time kernel sources started shipping in xz.

Do you know if there are any tests showing which compressor is better (compression-wise) for which kind of data?
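I don't know of a maintained comparison table, but it's quick to measure on your own data; a sketch that uses whichever of these tools happen to be installed:

```shell
set -e
sample=$(mktemp)
seq 1 50000 > "$sample"   # repetitive text, the kind BWT-based bzip2 likes
for tool in gzip bzip2 xz zstd; do
  command -v "$tool" >/dev/null 2>&1 || continue
  "$tool" -c "$sample" > "$sample.$tool"
  printf '%s\t%s bytes\n' "$tool" "$(wc -c < "$sample.$tool")"
done
```

Swap in a representative sample of your real data; rankings shift a lot between logs, JSON, and binary formats.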


With all respect to John Carmack (and it is really a lot of respect!), I'm surprised he seems unaware of pbzip2. It's a parallel implementation that scales almost linearly with the number of cores, and it has been around since ~2010, so while it's not yet old enough to drive, anyone dealing with bzip2'ing large amounts of data should have discovered it long ago.

And yes, use zstandard (or xz, where the default binary in your distro is already multithreaded) where you can.
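The drop-in parallel pipeline, as a sketch (falls back to single-threaded bzip2 when pbzip2 is absent; the directory name is invented):

```shell
set -e
d=$(mktemp -d)
mkdir "$d/bigdir"
seq 1 10000 > "$d/bigdir/data.txt"
if command -v pbzip2 >/dev/null 2>&1; then
  bz="pbzip2 -p4"    # 4 compressor threads; raise to match your core count
else
  bz="bzip2"         # single-threaded fallback
fi
tar -C "$d" -cf - bigdir | $bz > "$d/bigdir.tar.bz2"
bzip2 -t "$d/bigdir.tar.bz2" && echo OK
```

pbzip2 emits independent bzip2 streams back to back, which stock bzip2 can still test and decompress.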


> anyone dealing with bzip2'ing large amounts of data

... should really reconsider their choice of compression algorithm at this point.


But pigz shows that the Unix pipeline philosophy works just fine. (Of course, compressing before tarring is probably better than compressing the tarred file, but that should be pipelineable as well.)
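E.g. a sketch that uses pigz when present and plain gzip otherwise; the output is an ordinary .tar.gz either way:

```shell
set -e
d=$(mktemp -d)
mkdir "$d/tree"
seq 1 20000 > "$d/tree/log.txt"
gz=$(command -v pigz || command -v gzip)   # parallel compressor if available
tar -C "$d" -cf - tree | "$gz" > "$d/tree.tar.gz"
gunzip -t "$d/tree.tar.gz" && echo OK      # plain gunzip can read pigz output
```

Because pigz produces standard gzip output, nothing downstream has to change.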


For many small files, compressing first will compress worse, because each file has its own dictionary; compressing last means you can take advantage of similarities in files to improve compression ratio.

Compressing first can also be slower if the average file size is smaller than the block size, because the main thread cannot queue new jobs as fast as cores complete them (this happens e.g. with 7zip at fast compression settings with solid archive turned off). Tarring then compressing means small files can be aggregated into a single block, giving both good speed and compression ratio.
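The per-file overhead is easy to see with toy data:

```shell
set -e
d=$(mktemp -d)
mkdir "$d/many"
for i in $(seq 1 100); do seq 1 50 > "$d/many/f$i.txt"; done
# Per-file: every gzip stream pays its own header and dictionary warm-up.
per_file=$(for f in "$d"/many/*.txt; do gzip -c "$f"; done | wc -c)
# Tar then compress: one stream sees the redundancy across all the files.
whole=$(tar -C "$d" -cf - many | gzip -c | wc -c)
echo "per-file: $per_file bytes, tar-then-gzip: $whole bytes"
```

With many small, similar files the tar-then-gzip total comes out far smaller, even though tar adds a 512-byte header per file.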


Zstandard has a dictionary functionality which allows you to pre-train on sample data to achieve higher compression ratios and faster compression on large numbers of small files.
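The dictionary step is manual; roughly (a sketch, assuming zstd is installed; the sample files here are invented):

```shell
set -e
d=$(mktemp -d)
mkdir "$d/samples"
for i in $(seq 1 200); do seq "$i" $((i + 400)) > "$d/samples/s$i"; done
command -v zstd >/dev/null 2>&1 || exit 0   # nothing to demo without zstd
# Train a shared dictionary once on representative samples...
zstd -q --train "$d"/samples/* -o "$d/dict" --maxdict=16384
# ...then reuse it per small file (decompression needs -D with the same dict).
zstd -q -D "$d/dict" "$d/samples/s1" -o "$d/s1.zst"
test -s "$d/s1.zst" && echo trained
```

Training wants a decent volume of representative samples, and the dictionary has to travel with the compressed files.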


Don't you then need to store that dictionary somewhere out of band? It seems like you would still need a tar-like format to manage that, at which point as an archive format it seems more complicated with a worse separation of concerns.


Interesting. Does it do that automatically, or is there a manual step?


> compressing before tarring is probably better

Not if the files are similar. If you're compressing the files separately you'll start with a clean state rather than reusing previous fragments. Compressing a BMP after a TXT may not be beneficial, but compressing 3 tar'ed TXTs is definitely better than doing them separately.


That is true, and it can easily halve the total size with small, similar files, but it also means you have to unpack the whole archive when you need the last file in a tarball.

AFAIK the gzip command still cannot compress directory information and therefore needs tar in front of it if you want to retain a folder structure.


Sometimes you need fast indexed access to a specific file in the compressed content without decompressing the entire file (let's say JARs, that are just ZIPs).

TIL: you can use method 93 - Zstandard (zstd) Compression - with ZIPs


The only point of using .zip is maximum compatibility, and you lose that if you use anything other than deflate or no compression. If you are going to use something else, you might as well use a less wasteful and better-defined archive container, or something entirely custom like most games do.


93?



There is a _parallel_ bzip2:

http://compression.great-site.net/pbzip2/

which should solve the 'my cores are idle' issue.


He should just be using pbzip2 :)

https://linux.die.net/man/1/pbzip2


How big is that file? I have 2TB files that compress down to ~300GB, and gunzip'ing them takes ~2-3 hours. Granted, that's still a long-ass time, but not 2-3 days.

If anything, I wonder what kind of hard drive John has. If you're reading them off a network drive backed by tape drums it's probably going to take a while ;P


bzip2 is much slower than gzip.


The bigger bzip2 problem is that even decompression is slower.


bzip2 is not much slower at compression than gzip; gzip is far faster at decompression though. Either way since bzip2 is a block based compressor, parallelization is trivial, and parallel implementations started appearing about 20 years ago; pbzip2 is almost certainly in whichever package manager is in use for TFA.


Of course the problem there is that `tar` outputs a single stream. You might, in similar situations, start multiple `tar` processes on subsets of the input, whose pipelines then become fully parallel again.
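A sketch of that fan-out (the subdirectory layout is invented):

```shell
set -e
d=$(mktemp -d)
mkdir -p "$d/in/a" "$d/in/b"
seq 1 1000 > "$d/in/a/x"
seq 1 1000 > "$d/in/b/y"
for sub in "$d"/in/*/; do
  name=$(basename "$sub")
  # One independent tar|gzip pipeline per subdirectory, all running in parallel.
  tar -C "$d/in" -cf - "$name" | gzip > "$d/$name.tar.gz" &
done
wait
ls "$d"/*.tar.gz
```

The trade-off is several archives instead of one, and load balancing depends on the subdirectories being similarly sized.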


Shame he didn't discover pbzip2 before starting that job.


If it's less than ~98% complete, he could stop it, start over, and still finish sooner.


I'd bet tar becomes the bottleneck before pbzip2 does on that multicore monster. It can be surprisingly slow and I don't think any version of tar uses more than one core.


There are several programs to solve this problem. He just wanted to complain and write something built around that.


Is the .07 of a core a margin of error or some kind of status report done on a different core?


IIRC it's because core usage is an average over time, and you can get weird situations where switching between cores (which happens frequently, depending on the scheduler) means the program's average usage of one core hasn't dropped to zero yet while its usage of another core is already 100%.

This is purely speculation though



