CSV is the source of so many problems. It has no defined character set and no single rule for escaping double-quotation marks, commas, or newlines. There's not even a way to look at a CSV file and tell what its "rules" are beyond heuristics, and those only take you so far.
I ask for XLSX files since at least it's structured, unambiguous and documented, but even better: a minimal XLSX parser is trivial (about a page) to write.
Also: Educating users on how to specify the character set in every application that the user seems to want to use is a special kind of hell.
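To put a number on "about a page": here is a minimal sketch in Python (not the PHP parser mentioned elsewhere in the thread), assuming the common layout of a single worksheet at xl/worksheets/sheet1.xml plus a shared-strings table, and ignoring dates, styles, inline strings, and empty-cell gaps. The function name is my own.

    import zipfile
    import xml.etree.ElementTree as ET

    NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}

    def read_xlsx(path):
        rows = []
        with zipfile.ZipFile(path) as z:
            # Shared strings: cells with t="s" store an index into this table.
            shared = []
            if "xl/sharedStrings.xml" in z.namelist():
                strings = ET.fromstring(z.read("xl/sharedStrings.xml"))
                for si in strings.findall("m:si", NS):
                    shared.append("".join(t.text or "" for t in si.findall(".//m:t", NS)))
            # First worksheet only; a fuller reader would consult xl/workbook.xml.
            sheet = ET.fromstring(z.read("xl/worksheets/sheet1.xml"))
            for row in sheet.find("m:sheetData", NS):
                cells = []
                for c in row.findall("m:c", NS):
                    v = c.find("m:v", NS)
                    value = v.text if v is not None else ""
                    if c.get("t") == "s":        # shared-string reference
                        value = shared[int(value)]
                    cells.append(value)
                rows.append(cells)
        return rows

A real importer would at least read xl/workbook.xml to find the sheet names, but even that stays well within a page.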
I'd say that the spreadsheet model is long in the tooth, but there has been a failure of will in the industry to kill it.
People use Excel when they should really use a database; they use it because they want to format something on a 2-D grid, edit tabular data, make plots, do calculations, make projections, and so on.
The problems go down to the data structures in use.
For instance, there is nothing two-dimensional about financial reports (and projections); they are really hyperdimensional. Proper handling of dates and times is absolutely critical. And the use of floating point with a binary exponent is a huge distraction in any math tool aimed at ordinary people. (Mainframes got that right in the 1960s!)
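A quick illustration of the binary-exponent point, sketched in Python; decimal arithmetic (which decimal-capable mainframe hardware did natively) avoids the surprise:

    from decimal import Decimal

    print(0.1 + 0.2)                        # 0.30000000000000004
    print(0.1 + 0.2 == 0.3)                 # False: the binary exponent at work
    print(Decimal("0.1") + Decimal("0.2"))  # 0.3, exactly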
Google Sheets is just a stripped-down version of Excel and, other than the ability for multiple people to work on it simultaneously, is really no better.
It's always fun to look down on the likes of Excel and PHP etc. Of course, we should try to learn, too.
Excel, for all its faults, is easy for beginners to pick up. Semi-technical people can quickly hack together a bug-ridden prototype of their ideas. In most companies the alternative is not well-written software, but countless meetings to get IT to spend millions delivering a bug-ridden prototype in a few years' time.
One of the main problems with Excel as I can see it is that it is effectively a write-only language. Auditing an Excel sheet is famously almost impossible. And that's before you add VBA in the mix.
Since I mentioned PHP: one of the best things I can say about the language and its ecosystem is that they made it really easy to add a 'web-counter' to your otherwise late-'90s static HTML-only page on shared hosting. You begin with a web-counter, and then just keep on copy-and-pasting... Haskell is not nearly that easy to pick up, even if you are prepared to do it badly.
So true. At my wife's accounting firm they recently moved from Excel to Postgres. Accountants now interact with the database directly rather than messing with the mess that is Excel, and everyone is better for it.
That's not at all helpful unless the document comes with a statement that it complies with RFC4180. His point wasn't that there are no standards for doing so, but that there are so many, and it's near impossible to determine which standard a document uses by itself.
RFC 4180 just tried to write down what the most common quirks of CSVs were in the wild at the time. It's a reference more than a specification.
It's really a missed opportunity. Had they had the balls to actually specify CSV properly this wouldn't be nearly as much of a problem. This would have probably left a lot of the existing CSV writers non-conformant, but it would have likely improved the future situation considerably.
It doesn't even need to be complex. A basic CSV spec needs just a few rules (a sketch parser follows the list):
1. Reserved characters are , " \
2. If a reserved character appears in a field, escape it with \
3. Alternatively you can double quote a field, in which case you only need to escape " characters in the field
4. Line endings are either CR, CR/LF, or LF. Parsers must support all line endings.
5. Character set is UTF-8; however, the " , and \ characters must be the ASCII code points 0x22, 0x2C, and 0x5C. Lookalike characters from other languages will not be treated as control characters.
6. Whitespace adjacent to a comma is ignored by the parser
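A sketch of what a state-machine parser for those six rules might look like, in Python; the function name and the exact treatment of stray characters after a closing quote are my own assumptions, not part of the proposal:

    def parse_minimal_csv(text):
        rows, row, buf = [], [], []
        state = "start"            # start | unquoted | quoted | after_quote
        quoted = False
        i, n = 0, len(text)

        def end_field():
            nonlocal quoted
            value = "".join(buf)
            row.append(value if quoted else value.strip())  # rule 6
            buf.clear()
            quoted = False

        while i < n:
            ch = text[i]
            if state == "quoted":
                if ch == "\\" and i + 1 < n:        # rule 3: escaped character
                    buf.append(text[i + 1]); i += 2; continue
                elif ch == '"':
                    state = "after_quote"
                else:
                    buf.append(ch)
            elif ch == "\\" and i + 1 < n:          # rule 2: backslash escape
                buf.append(text[i + 1]); state = "unquoted"; i += 2; continue
            elif ch == '"' and state == "start":    # rule 3: quoted field opens
                state, quoted = "quoted", True
            elif ch == ",":                         # rule 1: field separator
                end_field(); state = "start"
            elif ch in "\r\n":                      # rule 4: CR, CRLF, or LF
                if ch == "\r" and i + 1 < n and text[i + 1] == "\n":
                    i += 1
                end_field(); rows.append(row); row = []; state = "start"
            elif state == "after_quote":            # rule 6: space after closing quote
                pass
            elif state == "start" and ch in " \t":  # rule 6: space before a field
                pass
            else:
                buf.append(ch); state = "unquoted"
            i += 1

        if buf or row or quoted:                    # last line without a newline
            end_field(); rows.append(row)
        return rows

For example, parse_minimal_csv('a, "b, c", d\\,e\n') returns [['a', 'b, c', 'd,e']].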
RFC4180 is worse than worthless; do not implement it and claim you support "CSV files". Almost every paragraph is useless, wrong, or dangerous; its few citations offer conflicting and sometimes clearly stupid advice, where they advise the use of CSV at all. Some examples:
§ 2.1 says that lines end in CRLF. This is in opposition to every UNIX-based system out there (which outnumber systems that use CRLF as a line delimiter by between 2:1 and 8:1 depending on how you choose to estimate this) and means that CSV files either don't exist on Mac OS X or Linux, or aren't "text files" by the standard definition of that term -- both absolutely silly conclusions! Nevertheless, following RFC4180-logic, § 2.6 thus suggests that a bare line-feed can appear unquoted.
§ 2.3 says the header is optional and that a client knows if it is present because the MIME type says "text/csv; header" then § 3 admits that this isn't very helpful and clients will have to "make their own decision" anyway.
§ 2.7 requires Fortran-style doubling of double-quote marks, like COBOL and SQL, and ignores that many "CSV" systems use backslash to escape quotes.
§ 3 says that the character set is in the mime type. Operating systems which don't use MIME types for their file system (i.e. almost all of them) thus cannot support any character set other than "US-ASCII".
None of these "suggestions" are true of any operating system I'm aware of, nor are they true of any popular CSV consumer; If a conforming implementation of RFC4180 exists, it's definitely not useful. In fact, one of the citations (ESR) says:
The bad results of proliferating special cases are twofold. First, the complexity of the parser (and its vulnerability to bugs) is increased. Second, because the format rules are complex and underspecified, different implementations diverge in their handling of edge cases. Sometimes continuation lines are supported, by starting the last field of the line with an unterminated double quote — but only in some products! Microsoft has incompatible versions of CSV files between its own applications, and in some cases between different versions of the same application (Excel being the obvious example here).
A better spec would be honest about these special cases. A "good" implementation of CSV:
• needs a flag or switch indicating whether there is a header or not
• needs to be explicitly told the character set
• needs a flag to specify the escaping method \ or ""
• needs the line-ending specified (CR, LF, CRLF, [\r\n]{1,}, or \r{0,}\n{1,})
... and it needs all of these flags and switches "user-accessible". RFC4180 doesn't mention any of this, so anyone who picks it up looking for guidance is going to be deluded into thinking that there are rules for "escaping double quotes" or "commas" or "newlines" that will help them consume and produce CSV files. Anyone writing specifications for developers who tries to use RFC4180 as guidance for implementing an "import CSV files" feature is going to be hung out to dry.
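As one concrete (and hedged) example of what "user-accessible" switches look like, Python's csv module exposes most of them explicitly; the file name, encoding, and escaping choices below are illustrative, not a recommendation:

    import csv

    # Character set and newline handling are decided at open() time;
    # newline="" hands line-ending interpretation to the csv module.
    with open("vendor-export.csv", encoding="cp1252", newline="") as f:
        reader = csv.reader(
            f,
            delimiter=",",
            quotechar='"',
            doublequote=False,   # not "" doubling ...
            escapechar="\\",     # ... backslash escaping instead
        )
        header = next(reader)    # header presence: the caller still has to decide
        rows = list(reader)

Note that even here the header question has no flag at all; you either consume the first row as a header or you don't.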
The devil has demanded I support CSV, so the advice I can give to anyone who has received a similar requirement:
• parse using a state machine (and not by recursive splitting or a regex-with-backtracking).
• use heuristics to "guess" the delimiter by observing that a file is likely rectangular -- tricky, since some implementations don't include trailing delimiters if the trailing values in a row are missing. I use a bag of delimiters (, ; and tab) and choose the one that produces the most columns, but has no rows with more columns than the first (a sketch follows this list).
• use heuristics to "guess" the character set especially if the import might have come from Windows. For web applications I use header-hints to guess the operating system and language settings to adjust my weights.
• use heuristics to "guess" the line-ending method. I normally use whatever the first-line ends in unless [\r\n]{1,} produces a better rectangle and no subsequent lines extend beyond the first one.
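A sketch of the delimiter-guessing heuristic from the second bullet, under the stated rectangularity assumption; a real version would count columns with the full state-machine parser rather than a naive count, so quoted delimiters are a known blind spot of this illustration:

    def guess_delimiter(sample_lines, candidates=(",", ";", "\t")):
        best, best_width = ",", 0
        for d in candidates:
            widths = [line.count(d) + 1 for line in sample_lines if line]
            if not widths:
                continue
            first = widths[0]
            # Prefer the candidate that yields the most columns, as long as
            # no later row spills past the first row's width.
            if first > best_width and all(w <= first for w in widths[1:]):
                best, best_width = d, first
        return best

Something like guess_delimiter(raw_text.splitlines()[:50]) is usually enough of a sample.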
A successful and fast implementation of all of these tricks is a challenge for any programmer. If you guess first, then let the user guide changes with switches, your users will be happiest, but this is very relative: Users think parsing CSV is "easy" so they're sure to complain about any issues. I have saved per-user coefficients for my heuristic guesses to try and make users happier. I have wasted more time on CSV than I have any other file format.
My advice is that we should ignore RFC4180 completely and, given the choice, avoid CSV files for any real work. XLSX is unambiguous and easy enough that a useful implementation takes less space than my criticism of CSV does, and, weirdly enough, many uses of "CSV" are just "I want to get my data out of Excel" anyway.
To me the issue isn't how we should read in CSV (as you stated, we should use a different format that is less ambiguous) but rather getting vendors to provide the data in that new format. I've found that most vendors I work with are using a cobbled-together solution built on top of a mainframe, and getting them to add or remove a field from a CSV is a monumental task; there is no way they will move to a different format like XLSX or JSON or XML. Until there is an open, industry-wide standard, I just don't see CSV going away.
None? You seem to have touched mostly on consuming CSV and safe ways to do that, and suggested XLSX. I'm just adding that digesting CSV safely isn't our biggest issue; it's getting vendors to use something other than CSV, and unfortunately that is typically out of most people's control.
I've used several different characters for escaping, though I don't know why backslash (\) isn't the default universal one. It seems pretty sensible, it's used across multiple scripting languages and CLIs. If you need to escape a backslash, you escape it with a backslash.
The comma-separated-value file almost certainly comes from list-directed I/O in Fortran, which used doubling of quote characters to "escape" them. This was probably an IBM-thing since other IBM-popularised languages also do this (including SQL and COBOL). As someone else pointed out, RFC4180's own interpretation of "csv" also does this.
My guess is that they did this because parsing and encoding was simple (only a few instructions) on these systems and required no extra memory.
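For illustration, the two conventions side by side in a short Python sketch (output shown in the comments):

    def quote_doubled(value):
        # Fortran / COBOL / SQL / RFC 4180 style: wrap the field and double the quotes.
        return '"' + value.replace('"', '""') + '"'

    def backslash_escaped(value):
        # Unix / scripting style: escape the reserved characters themselves.
        return value.replace("\\", "\\\\").replace('"', '\\"').replace(",", "\\,")

    print(quote_doubled('say "hi", ok'))      # "say ""hi"", ok"
    print(backslash_escaped('say "hi", ok'))  # say \"hi\"\, ok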
You mean does Google Sheets suffer from the same vulnerabilities? IMPORTXML is a Google Sheets function and isn't in Excel.
My PHP-based XLSX parser is about 100 lines. If you don't need time series you can save almost twenty lines. Strings cost about ten lines. Even so, whilst it may not support every string type or every time format, it is more reliable than any CSV parser...
If you are opening the resulting file in Excel, yes. If you are writing your own minimal XLSX parser like geocar says he has, probably not. It's unlikely that your own parser is going to implement the functionality to go and send an arbitrary AJAX request.