Have experience with PNG/IFF/4CC formats? I want design help.

vitovito · on Dec 21, 2012

This is an interesting question, but I don't know how many file format neckbeards you're going to find on HN; it's an awfully young community. Maybe Stack Overflow? Maybe the File Formats wiki[1] will have links to format inventors? Maybe read the specs for the formats and ask the authors, as well as the authors of standard libraries for them?

1: http://fileformats.archiveteam.org/wiki/Electronic_File_Form...

I will answer what little I can based on some work I did in 2005-2006 that resulted in me abusing JPEG restart markers.

1. Part of the reason PNG is chunked is because it's designed for network use (that's the N), where you can lose or corrupt a portion of the file and still get most of the file usable. That's what JPEG restart markers are for: you can repeat the headers throughout the file (rather than CRC throughout the file) so if something in the middle gets messed up, you can always pick right back up. Formats expected to sit on a disk and never move and trust rotational media don't have those checks because it's assumed you check it when you write it and then you're fine for a reasonable value of forever. So: are you transferring over a network constantly like you do with JPEGs and PNGs, or can you do an MD5 sum (or whatever) once and be done with it?

1b. I don't know what these words mean.

1c. This should be answered by 1.

2. This is probably explained somewhere in the PNG spec or the PNG mailing list archives, but specifying the length of a run of data instead of looping until you find a \0 helps you optimize and prevents buffer overflows.

3. Yes, lots of things use user-defined chunks and all sorts of crazy stuff gets stored in them. A video game once stored its character data as PNG files with custom chunks so you could put them online as images, and then right-click-save-as and import them into the game. But no, most software doesn't understand other chunks and ignores them (but preserves them).
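
To make points 1 to 3 concrete, here is a minimal Python sketch of the PNG chunk layout: a 4-byte big-endian length, a 4-byte type, the data, and a CRC-32 computed over type plus data. The `gaMe` chunk type and its `hero=42` payload are invented for illustration and do not come from any real game; the helper names are hypothetical too. The sketch builds a 1x1 grayscale PNG in memory, damages one byte of the custom chunk, and shows that only that chunk fails its check.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Serialize one chunk: length, type, data, CRC-32 over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def iter_chunks(buf: bytes):
    """Yield (type, data, crc_ok, ancillary) for each chunk in a PNG byte string."""
    if buf[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG stream")
    pos = 8
    while pos < len(buf):
        if pos + 8 > len(buf):
            raise ValueError("truncated chunk header")
        # The length prefix tells us where the chunk ends; no scanning for a terminator.
        length, chunk_type = struct.unpack_from(">I4s", buf, pos)
        end = pos + 8 + length
        if end + 4 > len(buf):
            raise ValueError("truncated chunk")
        data = buf[pos + 8:end]
        (stored_crc,) = struct.unpack_from(">I", buf, end)
        crc_ok = (zlib.crc32(chunk_type + data) & 0xFFFFFFFF) == stored_crc
        # A lowercase first letter (bit 5 set) marks a chunk as ancillary.
        ancillary = bool(chunk_type[0] & 0x20)
        yield chunk_type, data, crc_ok, ancillary
        pos = end + 4


# A 1x1, 8-bit grayscale image plus one made-up private chunk.
ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
png = (
    PNG_SIGNATURE
    + make_chunk(b"IHDR", ihdr)
    + make_chunk(b"gaMe", b"hero=42")  # hypothetical custom chunk
    + make_chunk(b"IDAT", zlib.compress(b"\x00\x00"))
    + make_chunk(b"IEND", b"")
)

# Flip one byte inside the custom chunk's data and re-check every chunk.
damaged = bytearray(png)
damaged[png.index(b"gaMe") + 4] ^= 0xFF
report = [(ctype, ok) for ctype, _, ok, _ in iter_chunks(bytes(damaged))]
```

In this sketch the damaged stream still yields intact `IHDR`, `IDAT`, and `IEND` chunks, which is the "most of the file is still usable" property. A reader that does not know `gaMe` can skip it using the length alone.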

dalke · on Dec 21, 2012

Thanks for your response.

I tried programming.stackexchange a year or two back. (I've been waiting for funding and/or a specific need to work on this format, which I now have.) The one response then also pointed out PNG was designed to work with the network, particularly dialup, but could give no numbers as to the failure rates.

Based on what I've read, internet packets have a failure rate of something like 1 in 16 million to 1 in 10 billion packets, which works out to about one failure per O(1 TB) of traffic. Bram Cohen concurs, saying that BitTorrent sees failures in the 1 per 10TB range. I wonder about when/if I should worry about this sort of problem.
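
As a rough check on the O(1 TB) figure, the packet counts can be converted to bytes. This reads the numbers above as one failure per 16 million to 10 billion packets and assumes 1500-byte packets; the packet size is my assumption, not something stated in the thread.

```python
# Assumed packet size in bytes (a common Ethernet MTU; an assumption, not from the thread).
PACKET_BYTES = 1500


def bytes_per_failure(packets_per_failure: int, packet_bytes: int = PACKET_BYTES) -> int:
    """Expected amount of traffic, in bytes, per failed packet."""
    return packets_per_failure * packet_bytes


pessimistic = bytes_per_failure(16_000_000)      # one failure per 16 million packets
optimistic = bytes_per_failure(10_000_000_000)   # one failure per 10 billion packets

print(f"{pessimistic / 1e9:.0f} GB to {optimistic / 1e12:.0f} TB per failure")
```

Under that assumption the range is about 24 GB to 15 TB per failure, so "O(1 TB)" sits inside it and the BitTorrent figure of 1 per 10TB is near the optimistic end.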

Most of the data files will be sitting on disk, mmap'ed for use. I'm going to take your advice that this problem is mostly theoretical for the situation I'm in, and suggest that users manually md5 if they want to verify a copy, or use BitTorrent if they want better guarantees on data transfer over a flaky network.
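
A minimal sketch of the "md5 it once" approach, for anyone who wants to verify a copy without shelling out. The function name `file_md5`, the file names, and the block size are hypothetical choices for illustration; the file is read in blocks so a large data file does not have to fit in memory.

```python
import hashlib
import os
import tempfile


def file_md5(path: str, block_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it block by block."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "data.bin")
    copy = os.path.join(tmp, "copy.bin")
    payload = bytes(range(256)) * 1024

    for path in (original, copy):
        with open(path, "wb") as f:
            f.write(payload)

    original_digest = file_md5(original)
    copy_digest = file_md5(copy)

    # Simulate a single corrupted byte in the copy.
    with open(copy, "r+b") as f:
        f.seek(1000)
        f.write(b"\xff")
    damaged_digest = file_md5(copy)

print(original_digest == copy_digest, original_digest == damaged_digest)
```

The digest says only that something changed, not where. That is the trade-off against per-chunk CRCs: one check for the whole file, and no way to salvage the undamaged parts.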

1b. was a fancy way to say CRC-64 instead of CRC-32. There are several CRC-64 variants, so I picked the one used in xz.

1c. Yep. I've given up worrying. Now it's intellectual curiosity.

2. I should ask the PNG list, yes. The documentation doesn't explain the logic, and the 15+ year old mailing lists are only available via ftp'ed zip files, making them a bit harder to trawl than a web search.

I don't believe the logic about length vs. NUL, since the first field is NUL-delimited, making it susceptible to the same optimization and possible overflow attacks. The iTXt chunk, for example, has three NUL separators instead of using a length parameter.

3) That's neat! I'll need to think more about including this PNG-like sort of functionality.
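
On 1b, here is a bit-at-a-time sketch of the CRC-64 variant used in xz, as I understand it: the ECMA-182 polynomial in reflected form, with an all-ones initial value and final XOR. The function name `crc64_xz` is made up for this example, and a real implementation would use a lookup table or a library rather than this slow loop.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF
# ECMA-182 polynomial 0x42F0E1EBA9EA3693, bit-reflected for LSB-first processing.
POLY_REFLECTED = 0xC96C5795D7870F42


def crc64_xz(data: bytes, crc: int = 0) -> int:
    """Compute CRC-64 (xz variant) one bit at a time; pass a previous result to continue."""
    crc ^= MASK64  # apply the all-ones initial value (or undo the previous final XOR)
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ POLY_REFLECTED if crc & 1 else crc >> 1
    return crc ^ MASK64  # final XOR


check = crc64_xz(b"123456789")
print(f"{check:016x}")
```

Because the function accepts a running value, a file can be fed through in blocks, for example `crc64_xz(b"6789", crc64_xz(b"12345"))`, which gives the same result as a single call over the whole input.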

dalke · on Dec 22, 2012

I asked on one of the PNG lists, and dug up the Jan.-Feb. 1995 discussion, where the CRC discussion mostly took place.

It seems there was an early advocate for CRC-32 support. This person had experience as the main UnZip developer, and knew that large archival files need this sort of check. Zip is a container format and PNG is a container format, so I see how that works, but PNG isn't really seen as an archival format.

Still, there was an early push for CRC, and that was taken up, though I found almost no discussion about failure modes or frequency of failures. One of the few examples I did find argued that uuencoded images on Usenet can be corrupted mid-stream, and PNG was better able to detect those problems.

The original proposal seemed to have put the CRC in the terminal chunk. Someone decided it was more elegant to have it at the end of every chunk. Others agreed. It was put in.

CRC failures don't occur often. I wonder if the failure rate due to incorrectly implemented CRCs is higher, given that Firefox has a switch to ignore CRC check failures in 'ancillary' chunks after the image data, as a workaround for someone's bad chunk.

I've decided not to have the CRC in my own format.