
It feels like issues like this are more common in parsers than in other kinds of software.

But why?

Why is parsing so hard? Or is it only hard in low-level languages? Or maybe in languages with poor string primitives?

I've written parsers in high-level languages and it didn't feel dangerous or insanely hard.




Poor string primitives is the answer, I think. Or more generally, C’s tendency to require a buffer of sufficient size to be passed for output, and the complexity involved with:

    - What size to pick for the buffer
    - Making sure the buffer is of the size you think it is
    - What to do when the buffer isn’t big enough
    - How to know for sure that the buffer is big enough
A parser runs right into these problems quickly: An input string may have some formatting specifiers (%d, etc), may involve control characters that cause the output buffer to expand, etc etc… Any function that deals with this kind of thing has to be able to correctly tell the caller that their buffer wasn’t big enough, and by how much, and has to make sure it didn’t accidentally write past the end of the buffer before doing this.

Callers also have to make sure the buffer size they’re passing to the parser is actually accurate… you don’t want to malloc 100 bytes and tell the parser the buffer is 200. They also have to make sure that if they say the buffer is 100 bytes, that the string actually terminates with a \0 by the end of it, or it gets even more confusing.

It’s overall a complicated problem, and arises specifically because of the tendency of C library functions to avoid allocations… because the ownership model of idiomatic C is that callers should be the ones that allocate (because they probably know best how to allocate and when to free, etc.)
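As a rough sketch of that contract, the standard snprintf already follows the "report how big the buffer would have needed to be" convention; the format string and values below are just placeholders:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        char small[8];
        /* snprintf never writes past sizeof small, and returns how many
           characters it *wanted* to write (excluding the trailing NUL). */
        int needed = snprintf(small, sizeof small, "host=%s port=%d",
                              "example.test", 443);
        if (needed < 0)
            return 1;                          /* encoding error */
        if ((size_t)needed >= sizeof small) {
            /* Buffer was too small: allocate needed+1 bytes and retry. */
            char *big = malloc((size_t)needed + 1);
            if (!big)
                return 1;
            snprintf(big, (size_t)needed + 1, "host=%s port=%d",
                     "example.test", 443);
            /* ... use big ... */
            free(big);
        }
        return 0;
    }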


It's not hard to use a "vector" library or even just a small set of macros that ensure memory safety when you are manipulating slices of strings or buffers.

Unfortunately it's very tempting to play fast and loose with raw pointers, with ad hoc validation logic all over the place.
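A minimal sketch of the kind of helper that comment has in mind; the slice type and function names here are made up, not from any particular library:

    #include <stdbool.h>
    #include <stddef.h>

    /* A slice carries the length along with the pointer. */
    typedef struct {
        const unsigned char *data;
        size_t len;
    } slice;

    /* Checked element access: fails instead of reading out of bounds. */
    static bool slice_at(slice s, size_t i, unsigned char *out) {
        if (i >= s.len)
            return false;
        *out = s.data[i];
        return true;
    }

    /* Checked sub-slice: [start, start+n) must fit entirely inside s. */
    static bool slice_sub(slice s, size_t start, size_t n, slice *out) {
        if (start > s.len || n > s.len - start)
            return false;
        out->data = s.data + start;
        out->len = n;
        return true;
    }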

OTOH, maybe OpenSSL's maintainers consider punycode parsing performance more important than security.


Parsing isn’t all that hard, it’s about working with the right tools and level of abstractions. Messing around in C with null-terminated strings is just hilariously error prone.


I think it's easy to hand-write a parser that can get into a weird state. Some formats are also easier to write a safe parser for than others.

For example, say you have a Tag-Length-Value data format where "5ABCDE" means "the next 5 bytes are a string, then there's something else afterwards or it's the end". You don't want to allocate a buffer of 5 bytes and just keep writing into it without checking that you're reading valid values.

Doing this sort of thing efficiently may also mean that you're tempted to remove something like a bounds check. After all, if you already know that the value is 5 bytes, why check on every access? People really want parsers to be fast. Ideally in the above example I wouldn't even need to copy those bytes out, I could just reference those values, which now leads to potential lifetime issues.

Further, in C, buffers don't have lengths attached to them. In every other language you typically don't work with null terminated strings. So now you have to manage sizes of things throughout your parser.
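A minimal sketch of what checking that toy "5ABCDE" format might look like, with the length carried alongside the pointer; all of the names here are hypothetical:

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* Cursor that tracks how much input is actually left. */
    typedef struct {
        const unsigned char *p;
        size_t remaining;
    } cursor;

    /* Read one length digit ('0'..'9'), then copy exactly that many
       bytes out -- but only if the input really contains them and the
       caller's buffer really has room for them. */
    static bool read_lv_string(cursor *c, char *out, size_t out_cap) {
        if (c->remaining < 1 || c->p[0] < '0' || c->p[0] > '9')
            return false;
        size_t len = (size_t)(c->p[0] - '0');
        if (c->remaining - 1 < len)   /* declared length vs. actual input */
            return false;
        if (len + 1 > out_cap)        /* room for the bytes plus a NUL */
            return false;
        memcpy(out, c->p + 1, len);
        out[len] = '\0';
        c->p += 1 + len;
        c->remaining -= 1 + len;
        return true;
    }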

One way to view this is that the programmer's mental model of the parsing machine can easily drift from the implementation of the parsing machine, leading to vulnerabilities.

Basically it ends up being very easy to accidentally end up with out of bounds reads/writes + there's pressure to be fast + formats can be very complex.

That's my view on it at least.


>Doing this sort of thing efficiently may also mean that you're tempted to remove something like a bounds check. After all, if you already know that the value is 5 bytes, why check on every access? People really want parsers to be fast.

In what kind of software do bounds checks have such a significant performance penalty?

The only people I've heard talking about such stuff were firmware devs.

>Further, in C, buffers don't have lengths attached to them. In every other language you typically don't work with null terminated strings. So now you have to manage sizes of things throughout your parser.

Can't C have some wrapper over those poor strings that leads to better safety, better dev experience, etc.?


> In what kind of software do bounds checks have such a significant performance penalty?

All kinds. But a lot of it is just that parsers are extremely easy to benchmark and benchmarks promote optimization.

> Can't C have some wrapper over those poor strings

Sure, but there's nothing native.


>Sure, but there's nothing native.

But why? Strings are used by all programmers every day.

I struggle to understand why you wouldn't want to make them state of the art, or decent at least.


> > Sure, but there's nothing native.

> But why? Strings are used by all programmers every day.

Different kinds of strings can have extremely different performance profiles.

A statically allocated string of ASCII (single-byte) characters is one thing.

A dynamically allocated string of Unicode (multibyte) characters, where the length of the allocation is not known ahead of time, is very different. C requires the developer to know the differences and know how to deal with them.
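As a sketch of that difference (the dynstr type below is made up, not a standard facility):

    #include <stdlib.h>
    #include <string.h>

    /* Fixed-size, statically allocated: no heap traffic, but a hard cap. */
    static char label[16] = "host";

    /* A growable string: every append may have to move the whole thing. */
    typedef struct {
        char *buf;
        size_t len;
        size_t cap;
    } dynstr;

    static int dynstr_append(dynstr *s, const char *piece, size_t n) {
        if (s->len + n + 1 > s->cap) {
            size_t new_cap = s->cap ? s->cap * 2 : 16;
            while (new_cap < s->len + n + 1)
                new_cap *= 2;
            char *p = realloc(s->buf, new_cap);  /* may relocate the buffer */
            if (!p)
                return -1;
            s->buf = p;
            s->cap = new_cap;
        }
        memcpy(s->buf + s->len, piece, n);
        s->len += n;
        s->buf[s->len] = '\0';
        return 0;
    }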


C is an extremely conservative language. You don't even get booleans unless you're on C99 and include the header for them.


You do get them without #include <stdbool.h>, you'll just have to contend with the ugly spelling _Bool (and use 0 and 1 instead of false and true).


bool, true, and false as native keywords are on track to show up in C23. https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2393.pdf
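For the record, a quick sketch of the three spellings (variable names made up):

    #include <stdbool.h>          /* C99: bool/true/false as macros */

    _Bool raw_flag = 1;           /* the bare C99 keyword, ugly spelling */
    bool nice_flag = true;        /* what <stdbool.h> gives you */

    /* In C23, bool, true and false become ordinary keywords and the
       #include is no longer needed. */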


Because WG14 doesn't care about security, for them C is basically a portable macro assembler with added conveniences.


Your questions are less interrelated than you might think.

Why is general parsing hard? It's sense-making from free form input. We have so many different languages and syntaxes and there are different algorithmic approaches needed to parse them depending on the format and language, frequently made as dense as possible for various more or less (usually more) misguided "optimization" reasons.

Why is parsing in low level languages hard? This is more about incidental features of low level languages. C just happens to be exceptionally unsuited and unsafe for the tasks involved in parsing. From a safety and security POV C is chainsaw juggling, but C for parsing untrusted data is chainsaw juggling while running through a mine field.


I'll give you a tautological and useless answer. I hope you don't mind. A parser is a kind of interpreter, where code is executed based on external input. When user input controls how code is executed you have opened the doors of hell: it's hard to guarantee that all of the possible executions are safe.


But why?

Parsers execute safe code fragments, in an order determined by the input.

How can safe operations result in unsafe results?

I'm not talking about side channels here.


Well, I see two ways it might happen: either the operations are not really safe in an absolute sense, or composing the operations does not preserve safety. Put another way, an operation can be safe in a given context, but operation plus context doesn't compose as well as we would like in order to guarantee that safety is preserved.


Here's an even more tautological answer. You can't misparse input unless you're parsing input. That's why parsers have trouble with parsing input.


true


Probably low level + safe + performant == hard


But why? Where does the complexity come from?


There are naturally lots of edge cases when you parse a format, because you have to constrain the combination of all the different fields.

Some formats are simple and the fields don't interact with each other at all, some are complex and the format changes depending on other values.

Parsing is hard because you have to handle all the possible inputs someone could throw at you, and depending on the format that can leave hundreds of very rare edge cases no reasonable human would normally think of.

This is also why fuzzing is so effective on parsers: fuzzers are great at throwing many different combinations at the wall until they find a new interesting edge case, then jumping off from there to see if they can mutate it into more.
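For concreteness, here is roughly what a fuzz harness for a parser looks like; this is a libFuzzer-style sketch, and parse_record is a made-up stand-in for the real entry point:

    #include <stddef.h>
    #include <stdint.h>

    /* Stand-in for the parser under test. */
    static int parse_record(const uint8_t *data, size_t size) {
        (void)data; (void)size;   /* ... real parsing logic goes here ... */
        return 0;
    }

    /* libFuzzer entry point; build with: clang -fsanitize=fuzzer,address
       The fuzzer calls this over and over with mutated inputs, hunting
       for the rare combinations a human would never write by hand. */
    int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
        parse_record(data, size);
        return 0;   /* non-zero return values are reserved */
    }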


You are of course correct. This also ties into my sibling comment - the cleverness in clever low-level parsing code is based on assumptions that may be wrong for some part of the huge input space.


The first level of complexity comes from the format. A bit array is super easy to parse (in C, and assuming you take care of endianness). JSON is more complicated; YAML is more complicated than JSON; XML is more complicated than YAML; X.509 is more complicated than XML (I think, anyway). The more complex the data format, the more complex the parsing, and the more opportunity for bugs.
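To illustrate the easy end of that spectrum, a fixed-width big-endian field only needs the endianness handled, e.g.:

    #include <stdint.h>

    /* Assemble a 32-bit big-endian value byte by byte, instead of
       casting the buffer to uint32_t* and hoping about endianness. */
    static uint32_t read_be32(const unsigned char *p) {
        return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
               ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
    }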

The second level of complexity comes from variability. A bit array doesn't vary, every bit is in the same place. A string varies. Anything that can vary causes complexity; more varying, more complexity. This applies to the data format and the data.

The third level of complexity comes from features. Every feature is a new thing that has to be parsed and then affects some code somewhere, the result of which affects more parsing and code. The more features and options there are, the more complexity.

"Why does it seem easier in high-level languages?" High-level languages have slowly had their bugs stripped out, and give you features that are rarer in low-level languages. You literally aren't writing the same routines in high-level languages because you don't need to. If you had to do all the same things, you'd have the same bugs. And a lot of newbies simply are lucky and don't personally run into the bugs that are already there.


>"Why does it seem easier in high-level languages?" High-level languages have slowly had their bugs stripped out, and give you features that are rarer in low-level languages. You literally aren't writing the same routines in high-level languages because you don't need to. If you had to do all the same things, you'd have the same bugs. And a lot of newbies simply are lucky and don't personally run into the bugs that are already there.

Which bugs precisely are you talking about?

Sane string implementation, so instead of performing some shenanigans with buffers to concat two strings, I can just "a" + "b"?

>If you had to do all the same things, you'd have the same bugs.

Why can't people write some handy abstractions in lower-level languages that would result in better security and dev experience?


>Sane string implementation, so instead of performing some shenanigans with buffers to concat two strings, I can just "a" + "b"?

Because system language users care about what happens when you do that. Where is it being allocated? What happens to the original strings? And a lot of other questions that relate to memory management. These languages are faster in part because of the control you get over allocations.

In C++ one could use a vector<char> to build a temporary string (or use stringstream for a fancier interface), but that also means that whenever something is appended to it, it must check the container capacity to handle possible reallocations, etc. Very similar to how most high level languages handle strings. But that comes at a cost.

A common C pattern is to simply receive a pointer to a block of memory to use for the string, write to that and null-terminate it. If the buffer size is insufficient, it will stop writing but continue parsing, so it can return the size of the required buffer and the user can allocate that and call the function again. Most libraries avoid allocating stuff on behalf of the user. In the common case (the buffer size is enough), it'll be much faster, since no heap memory needs to be allocated, moved around, etc.
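A sketch of that pattern (the function and names are hypothetical, and the "decoding" is just a copy to keep it short):

    #include <stddef.h>

    /* Writes at most dst_cap bytes (including the NUL), but keeps counting
       so it can report the size the caller actually needed. */
    static size_t decode_into(const char *src, char *dst, size_t dst_cap) {
        size_t needed = 0;    /* bytes we wanted to write, excluding NUL */
        size_t written = 0;

        for (; *src; src++) {
            char c = *src;    /* real decoding of c would happen here */
            needed++;
            if (written + 1 < dst_cap)   /* keep room for the NUL */
                dst[written++] = c;
        }
        if (dst_cap > 0)
            dst[written] = '\0';
        return needed + 1;    /* required buffer size, including the NUL */
    }

If the return value is larger than dst_cap, the caller allocates that many bytes and calls it again.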


What is a string? A series of bytes? Characters? What locale is being used? What are you assigning it to? What's dealing with memory? What happens when you leave current scope? What if the two strings aren't the same type, or one or both of them contains "binary" (and what is binary)? What are you doing with the string? Does your language have a bunch of fancy features like calculating length, taking slices, copying? Are you going to read character by character, or use a regex? Are you going to implement a state machine, linked list, hash, sorting algorithm, objects/classes? For any of the functions you'll be passing this data to, will they ever expect some specific type, much less encoding, specific character set, or length? Do you need to use "safe" functions "safely"? What do you do about 3rd party libraries, external applications, network sockets? Are your operations thread-safe and concurrent?

Higher-level languages, being, you know, higher-level, have a barrage of subtly difficult and uninteresting crap taken care of for you. Sure, you could "write some handy abstractions which will result in better security and dev. experience". And you would end up with.... a high-level language. But it would be slow, fat, and there'd be a dozen things you just couldn't do. Kernels and crypto pretty much need to be low-level.

The real reason low-level languages don't get any easier to program in is standards bodies. A bunch of chuckleheads argue over the dumbest things and it takes two decades to get some marginally better feature built into the language, or some braindeadness removed. There's only so much that function abstractions can do before you lose the low-level benefits of speed, size and control. (and to be fair, good compilers aren't exactly easy to write)


As the handy abstraction would have a significant performance overhead, a low-level language would not want to use it everywhere or even as a default - especially in the third-party libraries you'll be using everywhere.

The handy abstraction is not a win-win, it's a tradeoff of better security and dev. experience versus performance and control - and a key point is that people who have explicitly chosen a low-level language likely have done so exactly because they want this tradeoff to be more towards performance or control. If someone really wants to get better security and dev. experience at the cost of performance, then why not just use a high-level language instead of combining the worst of both worlds where you have to do the work in a low-level language (even if partially mitigated by these handy abstractions) but still pay the performance overhead that a high-level language would have?


> Why can't people write some handy abstractions in lower-level languages that would result in better security and dev experience?

Because those programs are sloooooooooooooooooooooooooooooow


Parsers and serializers have one thing in common: they read from (and write to) a (usually manually allocated) byte array. And the content of that byte array is often under attacker control.

C has terrible support for dealing with byte arrays. They must be manually allocated, and accesses must be manually checked to be in bounds.

Lots of critical software has parsers written in C. This combination leads to CVEs like this one.

FWIW, a bug such as this one (which ends up with an invalid array access) could happen in any language, and would end up with a panic in Rust, or a NullPointerException in Java, etc... The thing that makes this especially dangerous is that, because C is low-level and unchecked, this can also lead to Remote Code Execution instead of a simple Denial Of Service/crash.


> or a NullPointerException in Java

Just nitpicking, but the exception for an invalid array access in Java would be IndexOutOfBoundsException (or one of its subclasses), not NullPointerException.


Parsing code without any kind of framework or high-level helpers is, for lack of a better word, fiddly. C strings are an awful tool for it because memory safety depends on getting a lot of buffer size / string length calculations right - there are plenty of opportunities to make mistakes. The fiddliness also induces developers to make changes in existing code in the form of local clever tricks instead of adapting the code to cleanly implement new requirements. It's only a small change and the tests (hopefully there are some) pass, job done! But maybe it violates a non-obvious assumption for another clever trick somewhere else, etc...


I think the biggest difference is bounds checking. If you write off the end of your array in Java, you get an exception. In C, you get a CVE. Other things like manual memory management and fiddly string types don't help, but simple bounds checks would catch so many things.


tl;dr: Using C/C++ and being human => memory safety problems.


Also, as we learned back when Heartbleed was discovered, the OpenSSL code is not in good shape. It "suffers from maintenance", as one clever wag said about legacy code. There's a reason LibreSSL forked the code. More distributions need to switch away from OpenSSL.

And before anyone pipes up, I'm not claiming LibreSSL does not and will not ever have vulnerabilities. I'm saying that ripping stuff like punycode out of the library reduces the attack surface. https://isc.sans.edu/diary/rss/29208


>Also, as we learned back when Heartbleed was discovered, the OpenSSL code is not in good shape. It "suffers from maintenance", as one clever wag said about legacy code. There's a reason LibreSSL forked the code. More distributions need to switch away from OpenSSL.

Anyone who's ever worked with the OpenSSL API or looked at its code can tell you that it's a steaming pile of crap. It's no surprise that this vulnerability was discovered. Honestly, OpenSSL should just be banned because it's so horrible, and there are better alternatives available.


I just started making openssl -Werror safe. Oh my, what did I get into.

Halfway through it's about 125 changed files, > 1000 changes. Look at the WIP commit. The API is insane. 50% of the args are unused. All the struct and vtable updates are uninitialized, i.e. missing methods.

https://github.com/rurban/openssl/commits/Werror


One of the (possibly first?) things the LibreSSL people did after forking OpenSSL was to enable -Wall, -Werror, -Wextra, -Wuninitialized on the code[1]. Many years ago we'd look at compiler (and linter) warnings with a skeptical eye, but these days, they really mean something. That alone smoked out a lot of lurking problems.

[1] https://en.wikipedia.org/wiki/LibreSSL#Proactive_measures
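As a tiny, made-up illustration of the kind of latent bug those flags turn into a hard build failure:

    /* cc -Wall -Wextra -Werror -Wuninitialized -c example.c */
    int count_items(int have_header) {
        int n;             /* only assigned on one branch */
        if (have_header)
            n = 1;
        return n;          /* flagged as possibly uninitialized (with
                              optimization enabled, in gcc's case) */
    }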



