I think that’s less of a problem than often claimed.
For formats such as csv, parquet or a SQL database that store the names of fields only once, the overhead is constant, regardless of the number of items.
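A quick sketch of that difference, using hypothetical lat/lon records (the field names and values here are just for illustration): in CSV the field names appear once in the header, while a JSON-lines file repeats them in every record, so its naming overhead grows linearly with the record count.

```python
import csv
import io
import json

# Illustrative records; any fixed set of field names shows the same effect.
records = [{"lat": 52.0 + i * 0.001, "lon": 4.0 + i * 0.001} for i in range(1000)]

# CSV: the field names "lat" and "lon" are written once, in the header row.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["lat", "lon"])
writer.writeheader()
writer.writerows(records)
csv_size = len(buf.getvalue())

# JSON lines: every record repeats the "lat" and "lon" keys.
jsonl_size = sum(len(json.dumps(r)) + 1 for r in records)

# The CSV header cost is constant; the JSON key cost scales with the data.
print(csv_size, jsonl_size)
```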
So it can only be a problem for formats such as xml or json that repeat the names of fields in every record.
Those happen to be formats with variable record length, so you can’t index into such files; the only way to process them is in their entirety.
In that case, if storage size is a problem, you can compress the files. Your typical LZW variant will, if the files are large enough, eventually encode both the "(lat=" and ",lon=" parts as single codes of (typically) 12 bits each. That happens after about the fifth occurrence of each string, so fairly soon. That's 24 bits of overhead per item. Significant, but if you use xml or json, chances are you're already giving up way more by storing floating point values as text strings.
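To see the effect without implementing LZW by hand, here is a rough sketch using Python's zlib (which uses DEFLATE, i.e. LZ77 plus Huffman coding rather than LZW, but the repeated-string behaviour is comparable): the repeated JSON key strings collapse to short back-references, so the per-record cost after compression is far below the raw per-record size.

```python
import json
import zlib

# Illustrative records with repeated "lat"/"lon" field names.
records = [{"lat": 52.0 + i * 0.001, "lon": 4.0 + i * 0.001} for i in range(10000)]
raw = "\n".join(json.dumps(r) for r in records).encode()

compressed = zlib.compress(raw, 9)

# Repeated structure like '{"lat": ' and ', "lon": ' compresses to a few
# bits per occurrence, so the bulk of the remaining size is the values.
print(len(raw) / len(records), len(compressed) / len(records))
```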
So, that leaves json or xml files that each store only a few items per file. With a typical file system block size of 8 kB, such files already waste 4 kB on average in their last, partially filled block, which dwarfs the field-name overhead.