There are naturally lots of edge cases when you parse a format, because you have to constrain how all the different fields can combine.
Some formats are simple and the fields don't interact with each other at all; others are complex, and the format changes depending on other values.
Parsing is hard because you have to handle all the possible inputs someone could throw at you, and depending on the format that can leave hundreds of very rare edge cases no reasonable human would normally think of.
This is also why fuzzing is so effective on parsers: fuzzers are great at throwing many different combinations at the wall until they find a new interesting edge case, then jumping off from there to see if they can mutate it into more.
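To give a flavour of how little it takes to point a fuzzer at a parser, here's a minimal libFuzzer-style harness (parse_message is a made-up name standing in for whatever parser you'd actually link in):

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical parser under test -- the name and signature are
     * invented for illustration; any function that takes a buffer of
     * untrusted bytes fits here. */
    int parse_message(const uint8_t *data, size_t size);

    /* libFuzzer calls this entry point with mutated inputs, keeping any
     * input that reaches new code paths and mutating it further. */
    int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
        parse_message(data, size);
        return 0;
    }

Built with something like clang -fsanitize=fuzzer,address, the fuzzer does the combination-throwing and the "jumping off from there" for you.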
You are of course correct. This also ties into my sibling comment - the cleverness in clever low-level parsing code is based on assumptions that may be wrong for some part of the huge input space.
The first level of complexity comes from the format. A bit array is super easy to parse (in C, and assuming you take care of endianness). JSON is more complicated; YAML is more complicated than JSON; XML is more complicated than YAML; X.509 is more complicated than XML (I think, anyway). The more complex the data format, the more complex the parsing; and the more complex the parsing, the more opportunity for bugs.
The second level of complexity comes from variability. A bit array doesn't vary: every bit is in the same place. A string varies. Anything that can vary causes complexity; the more variation, the more complexity. This applies to both the data format and the data.
The third level of complexity comes from features. Every feature is a new thing that has to be parsed and then affects some code somewhere, the result of which affects more parsing and code. The more features and options there are, the more complexity.
"Why does it seem easier in high-level languages?" High-level languages have slowly had their bugs stripped out, and give you features that are rarer in low-level languages. You literally aren't writing the same routines in high-level languages because you don't need to. If you had to do all the same things, you'd have the same bugs. And a lot of newbies simply are lucky and don't personally run into the bugs that are already there.
>"Why does it seem easier in high-level languages?" High-level languages have slowly had their bugs stripped out, and give you features that are rarer in low-level languages. You literally aren't writing the same routines in high-level languages because you don't need to. If you had to do all the same things, you'd have the same bugs. And a lot of newbies simply are lucky and don't personally run into the bugs that are already there.
Which bugs precisely are you talking about?
Sane string implementation, so instead of performing some shenanigans with buffers to concat two strings, I can just "a" + "b"?
>If you had to do all the same things, you'd have the same bugs.
Why can't people in lower-level languages write some handy abstractions which will result in better security and dev. experience?
>Sane string implementation, so instead of performing some shenanigans with buffers to concat two strings, I can just "a" + "b"?
Because systems-language users care about what happens when you do that. Where is it being allocated? What happens to the original strings? And a lot of other questions that relate to memory management. These languages are faster in part because of the control you get over allocations.
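To make that concrete, here's roughly what hides behind "a" + "b" once you have to answer those questions yourself in C (just a sketch, with minimal error handling):

    #include <stdlib.h>
    #include <string.h>

    /* Concatenate two NUL-terminated strings into a freshly allocated
     * buffer. The answers are all explicit: the result lives on the heap,
     * the inputs are untouched, and the caller must free() the result. */
    char *concat(const char *a, const char *b) {
        size_t la = strlen(a);
        size_t lb = strlen(b);
        char *out = malloc(la + lb + 1);   /* +1 for the terminating NUL */
        if (out == NULL)
            return NULL;
        memcpy(out, a, la);
        memcpy(out + la, b, lb + 1);       /* copies b's NUL as well */
        return out;
    }

Every call site now has to remember who frees the result, and chaining three strings means doing this dance (or something smarter) twice.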
In C++ one could use a vector<char> to build a temporary string (or use stringstream for a fancier interface), but that also means that whenever something is appended to it, it must check the container capacity to handle possible reallocations, etc. Very similar to how most high level languages handle strings. But that comes at a cost.
A common C pattern is to simply receive a pointer to a block of memory to use for the string, write to that and null-terminate it. If the buffer size is insufficient, it will stop writing but continue parsing, so it can return the size of the required buffer and the user can allocate that and call the function again. Most libraries avoid allocating stuff on behalf of the user. In the common case (the buffer is big enough) this is much faster, as no heap memory needs to be allocated, moved around, etc.
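A sketch of that calling convention (the names are invented; the return-value idea is similar in spirit to snprintf's):

    #include <stddef.h>
    #include <string.h>

    /* Hypothetical API in the style described above: write a result into a
     * caller-supplied buffer, always NUL-terminate what fits, and return
     * the size the caller would need so it can retry with a bigger buffer. */
    size_t get_name(char *buf, size_t bufsize) {
        const char *name = "example";        /* stand-in for real parsed data */
        size_t needed = strlen(name) + 1;    /* +1 for the NUL terminator */

        if (buf != NULL && bufsize > 0) {
            size_t n = (needed <= bufsize) ? needed - 1 : bufsize - 1;
            memcpy(buf, name, n);
            buf[n] = '\0';
        }
        return needed;  /* caller compares this to bufsize and retries if larger */
    }

The user calls it once with a stack buffer; only if the return value is bigger than the buffer do they allocate that much and call again, so in the common case nothing touches the heap at all.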
What is a string? A series of bytes? Characters? What locale is being used? What are you assigning it to? What's dealing with memory? What happens when you leave current scope? What if the two strings aren't the same type, or one or both of them contains "binary" (and what is binary)? What are you doing with the string? Does your language have a bunch of fancy features like calculating length, taking slices, copying? Are you going to read character by character, or use a regex? Are you going to implement a state machine, linked list, hash, sorting algorithm, objects/classes? For any of the functions you'll be passing this data to, will they ever expect some specific type, much less encoding, specific character set, or length? Do you need to use "safe" functions "safely"? What do you do about 3rd party libraries, external applications, network sockets? Are your operations thread-safe and concurrent?
Higher-level languages, being, you know, higher-level, have a barrage of subtly difficult and uninteresting crap taken care of for you. Sure, you could "write some handy abstractions which will result in better security and dev. experience". And you would end up with... a high-level language. But it would be slow, fat, and there'd be a dozen things you just couldn't do. Kernels and crypto pretty much need to be low-level.
The real reason low-level languages don't get any easier to program in is standards bodies. A bunch of chuckleheads argue over the dumbest things and it takes two decades to get some marginally better feature built into the language, or some braindeadness removed. There's only so much that function abstractions can do before you lose the low-level benefits of speed, size and control. (and to be fair, good compilers aren't exactly easy to write)
As the handy abstraction would have a significant performance overhead, a low-level language would not want to use it everywhere or even as a default - especially in the third-party libraries you'll be using everywhere.
The handy abstraction is not a win-win, it's a tradeoff of better security and dev. experience versus performance and control - and a key point is that people who have explicitly chosen a low-level language likely have done so exactly because they want this tradeoff to be more towards performance or control. If someone really wants to get better security and dev. experience at the cost of performance, then why not just use a high-level language instead of combining the worst of both worlds where you have to do the work in a low-level language (even if partially mitigated by these handy abstractions) but still pay the performance overhead that a high-level language would have?
Parsers and serializers have one thing they often do: read from (and write to) a (usually manually allocated) byte array. And the content of that byte array is often under attacker control.
C has terrible support for dealing with byte arrays. They must be manually allocated, and every access must be manually checked to be in bounds.
Lots of critical software has parsers written in C. This combination leads to CVEs like this one.
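As a sketch (not the code from this CVE): even pulling a single length-prefixed field out of an attacker-controlled buffer means writing every check yourself, and dropping any one of them is exactly the kind of bug that ends up exploitable:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Read a field laid out as: 1 length byte, then that many payload
     * bytes. Every bound is checked manually; remove one check and an
     * attacker-chosen length reads or writes past the end of a buffer. */
    int read_field(const uint8_t *buf, size_t buflen,
                   uint8_t *out, size_t outlen) {
        if (buflen < 1)
            return -1;                  /* no room for the length byte */
        size_t fieldlen = buf[0];
        if (fieldlen > buflen - 1)
            return -1;                  /* length claims more than we have */
        if (fieldlen > outlen)
            return -1;                  /* would overflow the destination */
        memcpy(out, buf + 1, fieldlen);
        return (int)fieldlen;
    }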
FWIW, a bug such as this one (which ends up with an invalid array access) could happen in any language, and would end up with a panic in Rust, or a NullPointerException in Java, etc... The thing that makes this especially dangerous is that, because C is low-level and unchecked, this can also lead to Remote Code Execution instead of a simple Denial Of Service/crash.
Just nitpicking, but the exception for an invalid array access in Java would be IndexOutOfBoundsException (or one of its subclasses), not NullPointerException.
Parsing code without any kind of framework or high-level helpers is, for lack of a better word, fiddly. C strings are an awful tool for it because memory safety depends on getting a lot of buffer-size / string-length calculations right - there are plenty of opportunities to make mistakes. The fiddliness also induces developers to make changes to existing code in the form of local clever tricks instead of adapting the code to cleanly implement new requirements. It's only a small change and the tests (hopefully there are some) pass, job done! But maybe it violates a non-obvious assumption behind another clever trick somewhere else, etc...
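A tiny example of the kind of fiddliness meant here (just a sketch; the +1s and the comparison are exactly where the mistakes like to creep in):

    #include <string.h>

    /* Build "key=value" into a fixed caller-supplied buffer. The commented
     * spots are where the length arithmetic classically goes wrong. */
    int build_pair(char *buf, size_t bufsize,
                   const char *key, const char *value) {
        /* '=' and the NUL terminator each need their own byte. */
        size_t needed = strlen(key) + 1 + strlen(value) + 1;
        if (needed > bufsize)      /* '>=' here rejects valid input;
                                      no check at all overflows */
            return -1;
        memcpy(buf, key, strlen(key));
        buf[strlen(key)] = '=';
        memcpy(buf + strlen(key) + 1, value, strlen(value) + 1); /* copies the NUL */
        return (int)(needed - 1);  /* length written, excluding the NUL */
    }

None of this is hard in isolation; it's the accumulation of these little calculations across a whole parser, under ongoing change, that produces the bugs.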