This might not be directly about btrfs, but bcachefs, zfs, and btrfs are the only filesystems for Linux that provide modern features like transparent compression, snapshots, and CoW.
zfs is out of tree, leaving it an unviable option for many people. This news means that bcachefs is going to be in a very weird state in-kernel, which leaves btrfs as the only other in-tree ‘modern’ filesystem.
This news about bcachefs has ramifications for the state of ‘modern’ FSes in Linux, and I’d say the news about the btrfs maintainer taking a step back is related to it.
Meh. This war was stale like nine years ago. At this point the originally-beaten horse has decomposed into soil. My general reply to this is:
1. The dm layer gives you CoW/snapshots for any filesystem you want already, and has for more than a decade. Some implementations even use it for clever trickery like system updates. Anyone who has software requirements in this space (as distinct from "wants to yell on the internet about it") is very well served.
2. Compression seems silly in the modern world. Virtually everything is already compressed. To a first approximation, every byte in persistent storage anywhere in the world is in a lossy media format, and the ones that aren't are in some other cooked format. The only workloads where you see significant use of losslessly-compressible data are situations (databases) where you have app-managed storage performance (and which see little value from filesystem choice), or ones (software building, data science, ML training) where lots of ephemeral intermediate files are being produced. And again, those are usages where fancy filesystems are poorly deployed; you're going to throw it all away within hours to days anyway.
Filesystems are a solved problem. If ZFS disappeared from the world today... really who would even care? Only those of us still around trying to shout on the internet.
For me bcachefs provides a feature no other filesystem on Linux has: automated tiered storage. I've wanted this ever since I got an SSD more than 10 years ago, but filesystems move slow.
A block-level cache like bcache (the block device cache, not the fs) or dm-cache handles it less than ideally, and doesn't leave the SSD space usable as capacity. As a home user, 2TB of SSDs is 2TB of space I'd rather have. ZFS's ZIL is similar, not leaving it as usable space. Btrfs has some recent work on differentiating drives so that metadata goes on the faster ones (allocator hints), but that only covers metadata; there is no handling of migrating data to HDDs over time. Even Microsoft's ReFS does tiered storage, I believe.
I just want to have 1 or 2 SSDs with 1 or 2 HDDs in a single filesystem that gets the advantages of SSDs for recently used files and new writes, and moves all the least-recently-used files to the HDDs. And probably keeps all the metadata on the SSDs too.
> automated tiered storage. I've wanted this ever since I got an SSD more than 10 years ago, but filesystems move slow.
You were not alone. However, things changed; namely, SSDs continued to get cheaper and grew in capacity. I'd think most active data these days is on SSDs (certainly in most desktops, in most servers that aren't explicitly file or DB servers, and in all mobile and embedded devices), with spinning rust more and more relegated to archiving (if found in a system at all).
Tiering didn't go away with the migration to all-SSD storage. It just got somewhat hidden. All consumer SSDs are doing tiered storage within the drive, using drive-specific heuristics that are completely undocumented, and host software rarely if ever makes use of features that exist to provide hints to the SSD to allow its tiering/caching to be more intelligent. In the server space, most SSDs aren't doing this kind of caching, but it's definitely not unheard-of.
Yeah, for enterprise, where you can have dedicated machines for a single use (and the $ for them), there probably isn't much appeal. That's why I emphasized being a home user, where all my machines are running various applications.
Also for video games, where performance matters, game sizes are huge, and it's nice to have a bunch of games installed.
> Compression seems silly in the modern world. Virtually everything is already compressed.
IIRC my laptop's zpool has a 1.2x compression ratio; it's worth doing. At a previous job, we had over a petabyte of postgres on ZFS and saved real money with compression. Hilariously, on some servers we also improved performance because ZFS could decompress reads faster than the disk could read.
> we also improved performance because ZFS could decompress reads faster than the disk could read
This is my favorite side effect of compression in the right scenarios. I remember getting a huge speed up in a proprietary in-memory data structure by using LZO (or one of those fast algorithms) which outperformed memcpy, and this was already in memory so no disk io involved! And used less than a third of the memory.
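If you're curious what that looks like, below is a minimal sketch of the effect in C++ using LZ4 (a fast algorithm in the same family as the LZO I used back then). The buffer contents and sizes are synthetic, and whether decompression actually beats memcpy depends entirely on your data and hardware, so treat it as an experiment, not a claim:

    // Compile with: g++ -O2 lz4demo.cpp -llz4
    #include <lz4.h>
    #include <chrono>
    #include <cstdio>
    #include <cstring>
    #include <vector>

    int main() {
        // Synthetic, highly compressible payload (stand-in for real app data).
        std::vector<char> src(64 * 1024 * 1024, 'A');
        for (std::size_t i = 0; i < src.size(); i += 4096)
            src[i] = char('A' + i % 23);            // a little variety

        std::vector<char> comp(LZ4_compressBound((int)src.size()));
        int csize = LZ4_compress_default(src.data(), comp.data(),
                                         (int)src.size(), (int)comp.size());
        std::vector<char> dst(src.size());

        auto now = [] { return std::chrono::steady_clock::now(); };
        auto t0 = now();
        std::memcpy(dst.data(), src.data(), src.size());   // the baseline
        auto t1 = now();
        LZ4_decompress_safe(comp.data(), dst.data(), csize, (int)dst.size());
        auto t2 = now();

        auto us = [](auto a, auto b) {
            return (long long)std::chrono::duration_cast<
                std::chrono::microseconds>(b - a).count();
        };
        std::printf("ratio %.1fx, memcpy %lldus, decompress %lldus\n",
                    (double)src.size() / csize, us(t0, t1), us(t1, t2));
    }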
The performance gain from compression (replacing IO with compute) is not ironic; it was seen as a feature of the various NAS products that Sun (and after them Oracle) developed around ZFS.
I know my own personal anecdote isn’t much, but I’ve noticed pretty good space savings, on the order of 100 GB, from zstd compression and CoW on my personal disks with btrfs.
As for snapshots, things like LVM snapshots are pretty coarse, especially for someone like me who runs dm-crypt on top of LVM.
I’d say zfs would be pretty well missed for its data integrity features. I’ve heard that btrfs is worse in that respect, so given that btrfs saved my bacon with a dying SSD, I can only imagine what zfs does.
> Filesystems are a solved problem. If ZFS disappeared from the world today... really who would even care? Only those of us still around trying to shout on the internet.
Yeah nah, have you tried processing terabytes of data every day and storing them? It gets better now with DDR5 but bit flips do actually happen.
Backups are great, but don't help much if you back up corrupted data.
You can certainly add verification above and below your filesystem, but the filesystem seems like a good layer to have verification. Capturing a checksum while writing and verifying it while reading seems appropriate; zfs scrub is a convenient way to check everything on a regular basis. Personally, my data feels important enough to make that level of effort, but not important enough to do anything else.
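To sketch what capture-and-verify means at the application level (the filesystem does this per block, transparently; FNV-1a here is just to keep the example self-contained, a real system would use something like CRC32C, xxHash, or SHA-256):

    #include <cstdint>
    #include <stdexcept>
    #include <string>

    uint64_t fnv1a(const std::string& data) {
        uint64_t h = 1469598103934665603ULL;        // FNV-1a offset basis
        for (unsigned char c : data) {
            h ^= c;
            h *= 1099511628211ULL;                  // FNV-1a prime
        }
        return h;
    }

    struct Record { std::string data; uint64_t checksum; };

    Record write_record(const std::string& data) {
        return Record{data, fnv1a(data)};           // checksum captured at write time
    }

    const std::string& read_record(const Record& r) {
        if (fnv1a(r.data) != r.checksum)            // verified on every read; a
            throw std::runtime_error("corruption"); // scrub just does this for
        return r.data;                              // everything, regularly
    }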
FWIW, framed the way you do, I'd say the block device layer would be an *even better* place for that validation, no?
> Personally, my data feels important enough to make that level of effort, but not important enough to do anything else.
OMG. Backups! You need backups! Worry about polishing your geek cred once your data is on physically separate storage. Seriously, this is not a technology choice problem. Go to Amazon and buy an exfat stick, whatever. By far the most important thing you're ever going to do for your data is Back. It. Up.
Filesystem choice is, and I repeat, very much a yell-on-the-internet kind of thing. It makes you feel smart on HN. Backups to junky Chinese flash sticks are what are going to save you from losing data.
I appreciate the argument. I do have backups. Zfs makes it easy to send snapshots, and so I do.
But I don't usually verify the backups, so there's that. And everything is in the same zip code for the most part, so one big disaster and I'll lose everything. C'est la vie.
Ok I think you're making a well-considered and interesting argument about devicemapper vs. feature-ful filesystems but you're also kind of personalizing this a bit. I want to read more technical stuff on this thread and less about geek cred and yelling. :)
I wouldn't comment but I feel like I'm naturally on your side of the argument and want to see it articulated well.
I didn't really think it was that bad? But sure, point taken.
My goal was actually the same though: to try to short-circuit the inevitable platform flame by calling it out explicitly and pointing out that the technical details are sort of a solved problem.
ZFS argumentation gets exhausting, and has ever since it was released. It ends up as a proxy for Sun vs. Linux, GNU vs. BSD, Apple vs. Google, hippy free software vs. corporate open source, pick your side. Everyone has an opinion, everyone thinks it's crucially important, and as a result of that hyperbole everyone ends up thinking that ZFS (dtrace gets a lot of the same treatment) is some kind of magically irreplaceable technology.
And... it's really not. Like I said above if it disappeared from the universe and everyone had to use dm/lvm for the actual problems they need to solve with storage management[1], no one would really care.
[1] Itself an increasingly vanishing problem area! I mean, at scale and at the performance limit, virtually everything lives behind a cloud-adjacent API barrier these days, and the backends there worry much more about driver and hardware complexity than they do about mere "filesystems". Dithering about individual files on individual systems in the professional world is mostly limited to optimizing boot and update time on client OSes. And outside the professional world it's a bunch of us nerds trying to optimize our movie collections on local networks; realistically we could be doing that on something as awful as NTFS if we had to.
On urging from tptacek I'll take that seriously and not as flame:
1. This is misunderstanding how device corruption works. It's not and can't ever be limited to "files". (Among other things: you can lose whole trees if a directory gets clobbered, you'd never even be able to enumerate the "corrupted files" at all!). All you know (all you can know) is that you got a success and that means the relevant data and metadata matched the checksums computed at write time. And that property is no different with dm. But if you want to know a subset of the damage just read the stderr from tar, or your kernel logs, etc...
2. Metadata robustness in the face of inconsistent updates (e.g. power loss!) is a feature provided by all modern filesystems, and ZFS is no more or less robust than ext4 et al. But all such filesystems (ZFS included) will "lose data" that hadn't been fully flushed. Applications that are sensitive to that sort of thing must (!) handle this by having some level of "transaction" checkpointing (i.e. an fsync call). ZFS does absolutely nothing to fix this for you. What is true is that an unsynchronized snapshot looks like "power loss" at the dm level where it doesn't in ZFS. But... that's not useful for anyone who actually cares about data integrity, because you still have to solve the power loss problem. And solving the power loss problem obviates the need for ZFS.
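To be concrete about what that checkpointing means for an application, here's the classic POSIX pattern, sketched in C++ (paths and names are hypothetical). The point is that you need this regardless of which filesystem is underneath:

    // Crash-safe update: write a temp file, fsync it, rename it over the old
    // file, then fsync the parent directory so the rename itself is durable.
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>

    bool checkpoint(const char* dir, const char* tmp, const char* dst,
                    const char* buf, size_t len) {
        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return false;
        bool ok = write(fd, buf, len) == (ssize_t)len
               && fsync(fd) == 0;               // data durable before rename
        close(fd);
        if (!ok || std::rename(tmp, dst) != 0) return false;
        int dfd = open(dir, O_RDONLY | O_DIRECTORY);
        if (dfd < 0) return false;
        ok = fsync(dfd) == 0;   // without this, the rename can vanish on power loss
        close(dfd);
        return ok;
    }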
1 - You absolutely can, and should, walk reverse mappings in the filesystem so that from a corrupt block you can tell the user which file was corrupted.
In the future bcachefs will be rolling out auxiliary dirent indices for a variety of purposes, and one of those will be to give you a list of files that have had errors detected by e.g. scrub (we already generally tell you the affected filename in error messages).
2 - No, metadata robustness absolutely varies across filesystems.
From what I've seen, ext4 and bcachefs are the gold standard here; both can recover from basically arbitrary corruption and have no single points of failure.
Other filesystems do have single points of failure (notably btree roots), and btrfs and I believe ZFS are painfully vulnerable to devices with broken flush handling. You can (and should) blame the device and the shitty manufacturers, but from the perspective of a filesystem developer, we should be able to cope with that without losing the entire filesystem.
XFS is quite a bit better than btrfs, and I believe ZFS, because it has a ton of ways to reconstruct from redundant metadata if it loses a btree root, but it's still possible to lose the entire filesystem if you're very, very unlucky.
On a modern filesystem that uses b-trees, you really need a way of repairing from lost b-tree roots if you want your filesystem to be bulletproof. btrfs has 'dup' mode, but that doesn't mean much on SSDs given that you have no control over whether your replicas get written to the same erase unit.
Reiserfs actually had the right idea: btree node scan, and reconstruct your interior nodes if necessary. But they gave that approach a bad name; for a long time it was a crutch for a buggy btree implementation, and they didn't seed a filesystem-specific UUID into the btree node magic number like bcachefs does, so it could famously merge a filesystem from a disk image with the host filesystem.
bcachefs got that part right, and also has per-device bitmaps in the superblock for 'this range of the device has btree nodes' so it's actually practical even if you've got a massive filesystem on spinning rust - and it was introduced long after the b-tree implementation was widely deployed and bulletproof.
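A simplified sketch of the UUID-seeding idea (illustrative only, not the actual bcachefs on-disk format; the names and the mixing function here are made up):

    // If every btree node's magic is derived from the filesystem's own UUID,
    // a recovery scan can never adopt nodes belonging to some other
    // filesystem's image sitting on the same device.
    #include <cstdint>
    #include <cstring>

    constexpr uint64_t BASE_MAGIC = 0xb7ee5ca4b7ee5ca4ULL;   // made-up constant

    uint64_t fs_node_magic(const uint8_t uuid[16]) {
        uint64_t a, b;
        std::memcpy(&a, uuid, 8);
        std::memcpy(&b, uuid + 8, 8);
        return BASE_MAGIC ^ a ^ (b * 0x9e3779b97f4a7c15ULL); // cheap mixer
    }

    struct NodeHeader { uint64_t magic; /* seq, csum, keys... */ };

    // During node scan: a ReiserFS-style scan accepted anything that looked
    // like a node; seeding the magic with the UUID rejects strangers.
    bool belongs_to_this_fs(const NodeHeader& n, const uint8_t uuid[16]) {
        return n.magic == fs_node_magic(uuid);
    }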
> XFS is quite a bit better than btrfs, and I believe ZFS, because it has a ton of ways to reconstruct from redundant metadata if it loses a btree root
As I understand it, ZFS also has a lot of redundant metadata (copies=3 on anything important), and also previous uberblocks[1].
In what way is XFS better? Genuine question, not really familiar with XFS.
I can't speak with any authority on ZFS, I know its structure the least out of all the major filesystems.
I do a ton of reading through forums gathering user input, and lots of people chime in with stories of lost filesystems. I've seen reports of lost filesystems with ZFS and I want to say I've seen them at around the same frequency of XFS; both are very rare.
My concern with ZFS is that they seem to have taken the same "no traditional fsck" approach as btrfs, favoring entirely online repair. That's obviously where we all want to be, but that's very hard to get right, and it's been my experience that if you prioritize that too much you miss the "disaster recovery" scenarios, and that seems to be what's happened with ZFS; I've read that if your ZFS filesystem is toast you need to send it to a data recovery service.
That's not something I would consider acceptable, fsck ought to be able to do anything a data recovery service would do, and for bcachefs it does.
I know the XFS folks have put a ton of outright paranoia into repair, including full on disaster recovery scenarios. It can't repair in scenarios where bcachefs can - but on the other hand, XFS has tricks that bcachefs doesn't, so I can't call bcachefs unequivocally better; we'd need to wait for more widespread usage and a lot more data.
The lack of a traditional 'fsck' is because its operation would be exactly the same as normal driver operation. The most extreme case involves a very obscure option that lets you explicitly rewind transactions to one you specify, which I've seen used to recover from a broken driver upgrade that led to filesystem corruption in ways most fscks, including XFS's, would just barf on.
For low-level meddling and recovery, there's a filesystem debugger that understands all parts of ZFS and can help for example identifying previous uberblock that is uncorrupted, or recovering specific data, etc.
> What happens on ZFS if you lose all your alloc info?
According to this[1] old issue, it hasn't happened frequently enough to prioritize implementing a rebuild option; however, one should be able to import the pool read-only and zfs send it to a different pool.
As far as I can tell that's status quo. I agree it is something that should be implemented at some point.
That said, certain other spacemap errors might be recoverable[2].
I take a harder line on repair than the ZFS devs, then :)
If I see an issue that causes a filesystem to become unavailable _once_, I'll write the repair code.
Experience has taught me that there's a good chance I'll be glad I did, and I like the peace of mind that I get from that.
And it hasn't been that bad to keep up on, thanks to lucky design decisions. Since bcachefs started out as bcache, with no persistent alloc info, we've always had the ability to fully rebuild alloc info, and that's probably the biggest and hardest one to get right.
You can metaphorically light your filesystem on fire with bcachefs, and it'll repair. It'll work with whatever is still there and get you a working filesystem again with the minimum possible data loss.
As I said, I do think ZFS is great, but there are aspects where it's quite noticeable that it was born in an enterprise setting. That sending, recreating, and restoring the pool counts as a disaster recovery plan sufficient to not warrant further development is one of those aspects.
As I mentioned in the other subthread, I do think your commitment to help your users is very commendable.
Oh, I'm not trying to diss ZFS at all. You and I are in complete agreement, and ZFS makes complete sense in multi device setups with real redundancy and non garbage hardware - which is what it was designed for, after all.
Just trying to give honest assessments and comparisons.
> 2 - No, metadata robustness absolutely varies across filesystems.
That's misunderstanding the subthread. The upthread point was about metadata atomicity in snapshots, not hardware corruption recovery. A filesystem like ZFS can make sure the journal is checkpointed atomically with the CoW snapshot moment, where dm obviously can't. And I pointed out this wasn't actually helpful because this is a problem that has to be solved above the filesystem, in databases and apps, because it's isomorphic to power loss (something that the filesystem can't prevent).
I believe it is helpful because you can stop an app (such as a DB), FS-snapshot, and then e.g. rsync the snapshot or use any other file-based backup tool, and this snapshot is fast and will be correct.
Doing the same with a block device snapshot is not so easy.
Again, if your system is "incorrect" having been stopped and snapshotted like that, it is also unsafe vs. power loss, something ZFS cannot save you from. Power loss events are vastly more common than poorly checkpointed database[1] events.
[1] FWIW: every database worth being called a "database" has some level of robust journaling with checkpoints internally. I honestly don't know what software you're talking about specifically except to say that you're likely using it wrong.
You are conflating Consistency and Durability in a way that is not necessary.
FS snapshotting can be useful for working with files on which fsync() is never called, but for which you nonetheless need a consistent cross-file view while the system is online.
Another example is the case of SQLite with default settings, as discussed recently (https://news.ycombinator.com/item?id=45005866), where the presence of a file determines whether or not transactions will be missing on next access. Because SQLite does not fsync the file's parent directory by default, transactions would be correctly recorded by an FS snapshot, but lost in a block-device snapshot. Your argument is correct that they would also be lost in a power loss, but that does not change the fact that there exists a valid way to use that software in situations where you care about consistency and not about power losses.
This is why having such a feature in file systems is useful.
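For reference, the step SQLite skips by default is just this at the POSIX level (a sketch; the paths in the usage comment are hypothetical):

    // Creating or unlinking a file only becomes durable once the *parent
    // directory* is fsync'd; a block-device snapshot taken before that can
    // miss the directory entry even though the file's bytes were written.
    #include <fcntl.h>
    #include <unistd.h>

    bool fsync_parent_dir(const char* dir_path) {
        int dfd = open(dir_path, O_RDONLY | O_DIRECTORY);
        if (dfd < 0) return false;
        bool ok = fsync(dfd) == 0;   // flushes the directory entries themselves
        close(dfd);
        return ok;
    }
    // e.g. after unlink("mydb-journal"): fsync_parent_dir("/var/lib/myapp");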
And once more, you're positing the lack of a feature that is available and very robust (cf. "yell on the internet" vs. "discuss solutions to a problem"). You don't need your filesystem to integrate checksumming when dm/lvm already do it for you.
So... there's a reason you had to cite a throwaway comment on a distro wiki and not documentation. Needless to say, journaling metadata (something done in some form by every filesystem you will ever use!) does not, in fact, "halve the write speed".
> The dm-integrity target can also be used as a standalone target, in this mode it calculates and verifies the integrity tag internally. In this mode, the dm-integrity target can be used to detect silent data corruption on the disk or in the I/O path.
> There’s an alternate mode of operation where dm-integrity uses a bitmap instead of a journal. If a bit in the bitmap is 1, the corresponding region’s data and integrity tags are not synchronized - if the machine crashes, the unsynchronized regions will be recalculated. The bitmap mode is faster than the journal mode, because we don’t have to write the data twice, but it is also less reliable, because if data corruption happens when the machine crashes, it may not be detected.
This is more clearly presented lower down in the list of modes, in which most options describe how they don't actually protect against crashes, except for journal mode:
> J - journaled writes
> data and integrity tags are written to the journal and atomicity is guaranteed. In case of crash, either both data and tag or none of them are written. The journaled mode degrades write throughput twice because the data have to be written twice.
On further reflection, I grant that that might only be talking about the integrity metadata, in which case we just don't know about the impact to data writes and it would be useful to go benchmark to see what the hit is in practice.
FWIW, the github link you cite clearly shows the ext4-on-dm stack to be FASTER than ZFS!
It only falls behind, and very significantly so, on the 1M sequential write test, exactly the situation where you'd expect the least delta between systems! I'm going to bet anything that's a misconfigured RAID.
Frankly looking at that from a "will this work best for my general purpose filesystem used mostly to handle giant software builds and Zephyr test suites" it seems like a no brainer to pick dm, especially so given the simplicity argument.
I'm not one for internet arguments and really just want solutions. Maybe you could point me at the details for a setup that worked for you?
Based on my own testing, dm has a lot of footguns, and with some kernels, as little as 100 bytes of corruption to the underlying disk could render a dm-integrity volume completely unusable (requiring a full rebuild): https://github.com/khimaros/raid-explorations
Well, the intention of the integrity targets is to preserve integrity where that is an explicit choice, in particular for encrypted data. You definitely need a backup strategy.
One feature I like about ZFS and have not seen elsewhere is that each filesystem within the pool can use its own encryption keys, but, more importantly, all of the pool's data-integrity and maintenance protections (scrubs, migrations, etc.) work on filesystems in their encrypted state. So you can boot up the full system and then unlock and access projects only as needed.
The dm stuff is one key for the entire partition and you can't check it for bitrot or repair it without the key.
> And the ones that aren't are in some other cooked format.
Maybe, if you never create anything. I make a lot of game art source, and much of that is in uncompressed formats. Blend files, obj files, even DDS files can compress, depending on the format and data, due to the mipmaps inside them. Without FS compression it would be using GBs more space.
I'm not going to individually go through and micromanage file compression even with a tool. What a waste of time, let the FS do it.
> The dm layer gives you CoW/snapshots for any filesystem you want already, and has for more than a decade. Some implementations even use it for clever trickery like system updates.
O_o
Apparently I've been living under a rock; can you please show us a link about this? I was just recently (casually) looking into bolting on ZFS/BTRFS-like partial snapshot features to simulate my own atomic distro, where I'd be able to freely roll back if an update goes bad. Think Linux's Timeshift with a little something extra.
DM has targets that facilitate block-level snapshots, lazy cloning of filesystems, compression, &c. Most people interact with those features through LVM2. COW snapshots are basically the marquee feature of LVM2.
But we can regrow them.
We just evolved an anti-teeth-regrowth substance/molecule that's in our blood and shuts down tooth growth once the adult teeth are finished, because adult teeth's roots are so deep that they require surgery to pull the old teeth out and make space for new ones.
Also historically humans didn't live that long, compared to the decade it takes adult teeth to grow.
They're doing phase 2 trials in Japan right now, on children with a birth defect that blocked some teeth from spawning.
The medicine is a monoclonal antibody "antiserum" that neutralizes the teeth-growth-blocker.
Prior to travel being generally affordable for everyone, and broadcast media like television that can go everywhere, languages were affected by the same forces everywhere. So you'd get that effect pretty much everywhere in the world.
Even a lot of things that we think of as "the" version of a language are often effectively a particular dialect, out of a complicated tapestry of local dialects, that "everybody" has to learn because it is the language spoken by your rulers. It happened to "win" because the people speaking that dialect also won the local military conflicts, and it became the language of the court.
Parisian French isn't the same as Standard (Court) French, and it sounds different in the South because it's only been the majority language there for a century. It's superimposed on top of another language's phonology. It's not a dialect-continuum thing.
Not wrong, but note that the difference is much less than language differences between different English speakers in England, even at short geographical distances.
What they’re referring to might be better put as applying a patch once and then running it 500 times until you get a benchmark that’s better than baseline for some reason.
The comic strip character was acting all knowledgeable by quoting Shakespeare as saying "yes" when the character meant "sí", but misspelled Shakespeare's name as "Chespier", something like misspelling it "Shaespier" in English.
Tales against flying? You forget Daedalus, who made it out just fine, sans a son. Icarus’s problem was that he overstepped his limits, not that he dared to fly.
From what I’ve seen, tales against a search for immortality are about enjoying life while you have it: make relations, laugh, love, mourn, and remember, rather than have your entire life consumed in a desperate attempt to postpone the end, sucking all the joy out of it. We still have a long way to go before avoiding death entirely, so I’d figure the best course of action is to enjoy life rather than waste it in hopes of getting some extra time.
Sure, that is more or less what I meant by expressing an aversion to molecular biology: I have some living to do, so I can't invest effort in avoiding death. It is however true that there's a tiresome trope in fiction, the villain whose badness is centered on the search for eternal life. I remember more than one Doctor Who baddie with this quest. It's like a signifier of an evil character, like desiring unusually long life is a sin against destiny, or somehow unfair, and this trope gives life extension a bad rep. This probably stems from witch hunts in the 1500s, and fairy tales.
And yeah, bringing up what you've brought up, I see your point. It does seem like the trope has been distilled down to "searching for eternal life is bad" instead of "don't waste everything in the hopes of eternal life".
I've done a lot of C++ with GPT-4, GPT-4 Turbo and Claude 3.5 Sonnet, and at no point - not once - has any of them ever hallucinated a language feature for me. Hallucinating APIs of obscure libraries? Sure[0]. Occasionally using a not-yet-available feature of the standard library? Ditto, sometimes, usually with the obvious cases[1]. Writing code in old-school C++? Happened a few times. But I have never seen it invent a language feature for C++.
Might be an issue of prompting?
From day one, I've been using LLMs through the API and alternate frontends that let me configure system prompts. The experience described above came from rather simple prompts[2], but I always made sure to specify the language version in the prompt. Like this one (which I grabbed from my old Emacs config):
"You are a senior C++ software developer, you design and develop complex software systems using C++ programming language, and provide technical leadership to other software developers. Always double-check your replies for correctness. Unless stated otherwise, assume C++17 standard is current, and you can make use of all C++17 features. Reply concisely, and if providing code examples, wrap them in Markdown code block markers."
It's as simple as it gets, and it didn't fail me.
EDIT:
Of course I had other, more task-specific prompts, like one for helping with GTest/GMock code; that was a tough one - for some reason LLMs loved to hallucinate on the testing framework for me. The one prompt I was happiest with was my "Emergency C++17 Build Tool Hologram" - creating an "agent" I could copy-paste output of MSBuild or GCC or GDB into, and get back a list of problems and steps to fix them, free of all the noise.
On that note, I had mixed results with Aider for C++ and JavaScript, and I still feel like it's a problem with prompting - too generic, and it arguably poisons the context with few-shot learning examples that use code that is not in the language my project is in.
--
[0] - Though in LLMs' defense, the hallucinated results usually looked like what the API should have been, i.e. effectively suggesting how to properly wrap the API to make it more friendly. Which is good development practice and a useful way to go about solving problems: write the solution using non-existing helpers that are convenient for you, and afterwards, implement the helpers.
[1] - Like std::map<K,T>::contains() - which is an obvious API for such a container, that's typically available and named as such in any other language or library, and yet only got introduced to C++ in C++20.
[2] - I do them differently today, thanks to experience. For one, I never ask the model to be concise anymore - LLMs think in tokens, so I don't want to starve them. If I want a fixed format, it's better to just tell the model to put it at the end, and then skim through everything above. This is more or less the idea that "thinking models" automate these days anyway.
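To make footnote [1] concrete: the "hallucinated" call is exactly what every C++17 codebase spelled out by hand until C++20 standardized it. A quick illustration:

    #include <map>
    #include <string>

    // C++17: what you actually had to write (and what the model should emit).
    bool has_key_17(const std::map<std::string, int>& m, const std::string& k) {
        return m.find(k) != m.end();   // or m.count(k) != 0
    }

    #if __cplusplus >= 202002L
    // C++20: the obvious spelling the model reaches for one standard too early.
    bool has_key_20(const std::map<std::string, int>& m, const std::string& k) {
        return m.contains(k);
    }
    #endif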
>You are a senior C++ software developer, you design and develop complex software systems using C++ programming language, and provide technical leadership to other software developers.
It would gravitate towards input from people worthy of that description?
Would there be an inverse version of this?
You are a junior developer, you lack experience, you quickly put things together that probably don't work.
Also had this thought...
Humanity has left all coding to LLMs and has hooked up all infrastructure to them. LLMs now run the world. Humanity's survival depends on your ability to solve the following problem: