I'm designing a new format for a sub-niche of cheminformatics. I've used ideas common to FourCC formats: blocks with a size, 4-character name, and data chunk. I have blocks >2GB so I'm using 64-bit sizes instead of 32, so I can't leverage the formats I know about exactly.
I have questions about the design, and would like feedback:
1) PNG ends the chunk with a CRC-32 check value. The other IFF/FourCC formats don't. How important has the CRC-32 been in practice? How often does a PNG chunk fail to pass the check?
1b) (Assuming the check value is useful): Given multi-gigabyte chunks, should I use CRC-64-ECMA-182 instead? I'm assuming that there will eventually be 10-100TB of data across the world in this format. Is this something I should worry about?
1c) How often should I validate? I don't want to do it automatically on load, because even cat > /dev/null takes 43 seconds to process a ~3GB file, while single-lookup command-line queries are sub-second. Should it be a user-defined paranoia test? My feeling is that those never get run.
2) The PNG format uses the NUL character as a separator, e.g. the "tEXt" chunk is "an uncompressed keyword or key phrase, a null (zero) byte, and the actual text." The end position of the 'actual text' is determined by the end of the chunk. Wouldn't a NUL terminator make the code easier for C programs to handle? Is there a reason not to use NUL-terminated fields, other than to save a byte?
3) The PNG format uses bit 5 (the upper-case/lower-case bit) of each byte in the chunk code to encode things like "Safe-to-copy". This seems like a cute idea, but has it proved useful? Does it work? That is, do people add their own chunk types, and find that other software mostly tends to follow those bit settings? Or does it end up that most software just ignores/skips chunks that it doesn't know about?
1: http://fileformats.archiveteam.org/wiki/Electronic_File_Form...
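For concreteness, here is a rough sketch of the block layout described above with a PNG-style check value appended. The details are assumptions for illustration, not a settled design: a big-endian 64-bit size, then the 4-byte name, then the data, then a CRC-32 computed over name plus data (which is what PNG covers). The function names write_chunk and read_chunk are hypothetical.

```python
import struct
import zlib

# Assumed layout: 8-byte big-endian size, 4-byte name, data, 4-byte CRC-32.
HEADER = struct.Struct(">Q4s")
TRAILER = struct.Struct(">I")


def write_chunk(out, name: bytes, data: bytes) -> None:
    """Write one chunk to a binary file-like object."""
    if len(name) != 4:
        raise ValueError("chunk name must be exactly 4 bytes")
    out.write(HEADER.pack(len(data), name))
    out.write(data)
    # Like PNG, the check value covers the name and the data, not the size.
    crc = zlib.crc32(data, zlib.crc32(name))
    out.write(TRAILER.pack(crc))


def read_chunk(inp, verify: bool = True):
    """Read one chunk; verification is optional so fast lookups can skip it."""
    header = inp.read(HEADER.size)
    if len(header) != HEADER.size:
        raise EOFError("truncated chunk header")
    size, name = HEADER.unpack(header)
    data = inp.read(size)
    trailer = inp.read(TRAILER.size)
    if len(data) != size or len(trailer) != TRAILER.size:
        raise EOFError("truncated chunk body")
    (stored,) = TRAILER.unpack(trailer)
    if verify and zlib.crc32(data, zlib.crc32(name)) != stored:
        raise ValueError("CRC mismatch in chunk %r" % name)
    return name, data
```

For multi-gigabyte chunks the data would not be read in one call; zlib.crc32 accepts a running value as its second argument, so the check can be accumulated block by block while streaming.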
I will answer what little I can based on some work I did in 2005-2006 that resulted in me abusing JPEG restart markers.
1. Part of the reason PNG is chunked is that it's designed for network use (that's the N), where you can lose or corrupt a portion of the file and still get most of the file usable. That's what JPEG restart markers are for: you can repeat the headers throughout the file (rather than CRC throughout the file) so if something in the middle gets messed up, you can always pick right back up. Formats that are expected to sit on a disk, never move, and trust rotational media don't have those checks, because it's assumed you check the data when you write it and then you're fine for a reasonable value of forever. So: are you transferring over a network constantly like you do with JPEGs and PNGs, or can you do an MD5 sum (or whatever) once and be done with it?
1b. I don't know what these words mean.
1c. This should be answered by 1.
2. This is probably explained somewhere in the PNG spec or the PNG mailing list archives, but specifying the length of a run of data instead of looping until you find a \0 helps you optimize and prevents buffer overflows.
3. Yes, lots of things use user-defined chunks and all sorts of crazy stuff gets stored in them. A video game once stored its character data as PNG files with custom chunks so you could put them online as images, and then right-click-save-as and import them into the game. But no, most software doesn't understand other chunks and ignores them (but preserves them).
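To make points 2 and 3 a bit more concrete, here is a small sketch of both ideas. It assumes the chunk body has already been read using its stored length, so the parser never scans past the end of the buffer looking for a \0. The function names parse_text_chunk and chunk_flags are made up for this example; the keyword limits (1 to 79 bytes, Latin-1) and the meaning of bit 5 in each of the four name bytes follow the PNG spec.

```python
def parse_text_chunk(body: bytes):
    """Split a tEXt-style body into (keyword, text).

    The body length is already known from the chunk header, so the only
    NUL we look for is the separator, and only inside this buffer.
    """
    keyword, sep, text = body.partition(b"\x00")
    if not sep:
        raise ValueError("missing NUL separator")
    if not 1 <= len(keyword) <= 79:
        raise ValueError("keyword must be 1 to 79 bytes")
    return keyword.decode("latin-1"), text.decode("latin-1")


def chunk_flags(name: bytes):
    """Decode the PNG property bits: bit 5 (0x20) of each name byte."""
    if len(name) != 4:
        raise ValueError("chunk name must be exactly 4 bytes")
    return {
        "ancillary": bool(name[0] & 0x20),     # lowercase: not critical
        "private": bool(name[1] & 0x20),       # lowercase: not a public chunk
        "reserved": bool(name[2] & 0x20),      # should be uppercase in valid files
        "safe_to_copy": bool(name[3] & 0x20),  # lowercase: editors may copy it
    }
```

A program that doesn't recognize a chunk can then check chunk_flags(name)["ancillary"] to decide whether it may skip it, and "safe_to_copy" to decide whether to keep it when writing a modified file.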