Begrudgingly Choosing CBOR over MessagePack (taylor.town)
55 points by jjgreen 3 months ago | 78 comments



The article links to this[1] older HN comment which argues against CBOR. I must admit this passage made me laugh out loud:

    A decoder that comes across a simple value (Section 2.3) that it does not
    recognize, such as a value that was added to the IANA registry after the
    decoder was deployed or a value that the decoder chose not to implement,
    might issue a warning, might stop processing altogether, might handle the
    error by making the unknown value available to the application as such (as
    is expected of generic decoders), or take some other type of action.
I choose not to implement `int`. I decide instead to fill up your home folder. I'm a compliant CBOR implementation.

Seems this part of the specification has been rewritten[2], so now a generic decoder is to pass the tag and value on to the user or return an error.

[1]: https://news.ycombinator.com/item?id=14072598

[2]: https://www.rfc-editor.org/rfc/rfc8949.html#name-generic-enc...


Disclaimer: I wrote and maintain a MessagePack implementation.

Hey that's me!

Yeah they fixed that, but there's other parts of the spec that are basically unworkable like indefinite length values, "canonicalization", and tags, making it essentially MP (MP does have extension types, I should say; the virtue of tossing out CBOR's tags is that you then don't have to implement things like datetimes/timezones, bignums, etc.), and indeed at least FIDO tosses all of this out: https://fidoalliance.org/specs/fido-v2.0-ps-20190130/fido-cl...

Beyond that, CBOR is MessagePack. The story is that Carsten Bormann wanted to create an IETF standardized MP version, the creators asked him not to (after he acted in pretty bad faith), he forked off a version, added the aforementioned ill-advised tweaks, named it after himself, and submitted it anyway. All this design by committee stuff is mostly wrong--though IETF has modified it by committee since.

There's no reason an MP implementation has to be slower than a CBOR implementation. If a given library wanted to be very fast it could be. If anything, the fact that CBOR more or less requires you to allocate should put a ceiling on how fast it can really be. Or, put another way, benchmarks of dynamic language implementations of a serialization format aren't a high signal indication of its speed ceiling. If you use a dynamic language and speed is a concern to this degree, you'd write an adapter yourself, probably building on one of the low level implementations.

That said, people are usually disappointed by MP's speed over JSON. A lot of engineering hours have gone into making JSON fast, to the point where I don't think it ever made sense to choose MP over it for speed reasons (there are other good reasons). Other posters here have pointed out that your metrics are usually dominated by something else.

But finally, CBOR is fine! The implementations are good and it's widely used. Users of CBOR and MP alike will probably have very similar experiences unless you have a niche use case (on an embedded device that can't allocate, you really need bignums, etc).


> there's other parts of the spec that are basically unworkable like indefinite length values, "canonicalization", and tags, making it essentially MP [...].

I'm not sure, but your wording suggests that CBOR is just as unworkable as MP because they implement the same feature set...?

But anyway, those features are not always required but are useful from time to time, and any complete serialization format ought to include them in some way. Canonicalization for example is an absolute requirement for cryptographic applications; you know JWT got so cursed due to JSON's lack of this property. The tag facilities are well thought out in my opinion, while specific tags are less so, but implementations can choose to ignore them anyway---thankfully after the aforementioned revisions.

> The story is that Carsten Bormann wanted to create an IETF standardized MP version, the creators asked him not to (after he acted in pretty bad faith), he forked off a version, added the aforementioned ill-advised tweaks, named it after himself, and submitted it anyway.

To me it looks more like Markdown vs. CommonMark disputes; John Gruber repeatedly refused to standardize Markdown or even any subset of it in spite of huge demand, because he somehow believes that standardization ruins simplicity. I don't really agree---a simple but correct standard is possible, albeit with effort. So people did their own standardizations, including CommonMark, which subtly differ from each other, but any further efforts would be inadvertently blocked by Gruber. MessagePack seems no different to me.


> To me it looks more like Markdown vs. CommonMark disputes; John Gruber repeatedly refused to standardize Markdown or even any subset of it in spite of huge demand, because he somehow believes that standardization ruins simplicity. I don't really agree---a simple but correct standard is possible, albeit with effort.

Right, that was my take after reading about it for a while. The way MessagePack and CBOR frame the problem is fairly different, with CBOR intentionally opting for an open tagging system.

Plus it feels a bit childish bringing up the circumstances of the fork (correct or not) when they've clearly diverged a bit in purpose and scope. The Markdown vs CommonMark one is an apt comparison.

I've used both and both work very well. They're both stable, and can be parsed into native objects at a speed nearing that of a memory copy with the right implementations.


> CBOR intentionally opting for an open tagging system

CBOR's tags are MP's extension types


Not so much. (This response seems to repeat; please refrain from ignoring details when making such a bold claim.) MP extension types are opaque and applications can do nothing about them. CBOR tags are extensions to the existing data model and can be processed to some extent---for example unknown tags don't prevent implementations from inspecting their contents. And I don't think MP had any sort of registry for extension types, they are more like "reserved for future expansion" instead of "less used but still spec-worthy types to be defined with a reasonable proposal".


> Not so much. (This response seems to repeat; please refrain from ignoring details when making such a bold claim.)

I said what I meant; feel free to disagree but this policing is pretty condescending. I don't need to constantly repeat that the fundamental data format is lifted from MP and the extra features/process Bormann added on top of it are uniformly poorly thought out.

Just like CBOR's tags, extension types are additional, non core types. Bormann renamed them and bumped up the size so you can have way more of them in CBOR, but the tag also takes up more space, and since the odds of needing billions of extension types are basically zero it's not a good tradeoff.

> MP extension types are opaque and applications can do nothing about them. CBOR tags are extensions to the existing data model and can be processed to some extent

I think CBOR's "have your cake and eat it too" design has confused you. Yes CBOR establishes a tag registry, but implementations are free to ignore all tags. In practice what this means is if you can control the receiver you can use whatever tags you want, and if you don't control the receiver you have to either avoid tags or potentially limit your audience (i.e. do I use the "Standard date/time string" and eat the extra int8 or do I just send it as a string and note it as a date/time in my docs). You might think, "oh pish posh what can't process a date/time string", but the answer is "many embedded devices you'd want to use CBOR on". It's yet another feature added with no consideration for real world use cases.
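
(For scale, the tradeoff being weighed here is exactly one leading byte; a sketch with byte values taken from RFC 8949:)

    /* "2013-03-21T20:04:00Z" as a bare CBOR text string: */
    static const unsigned char plain[] = {
        0x74,                            /* text string, length 20 */
        '2','0','1','3','-','0','3','-','2','1',
        'T','2','0',':','0','4',':','0','0','Z'
    };

    /* The same value with tag 0 ("Standard date/time string"):
       one extra byte, but the receiver must understand the tag. */
    static const unsigned char tagged[] = {
        0xc0,                            /* tag 0 */
        0x74,                            /* text string, length 20 */
        '2','0','1','3','-','0','3','-','2','1',
        'T','2','0',':','0','4',':','0','0','Z'
    };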

> for example unknown tags don't prevent implementations from inspecting their contents. And I don't think MP had any sort of registry for extension types, they are more like "reserved for future expansion" instead of "less used but still spec-worthy types to be defined with a reasonable proposal".

You fundamentally misunderstand MP's extension types. Instead of guessing you can read about them in the MP spec [0]:

---

Extension types

MessagePack allows applications to define application-specific types using the Extension type. Extension type consists of an integer and a byte array where the integer represents a kind of types and the byte array represents data.

Applications can assign 0 to 127 to store application-specific type information. An example usage is that application defines type = 0 as the application's unique type system, and stores name of a type and values of the type at the payload.

MessagePack reserves -1 to -128 for future extension to add predefined types. These types will be added to exchange more types without using pre-shared statically-typed schema across different programming environments.

[0, 127]: application-specific types

[-128, -1]: reserved for predefined types

Because extension types are intended to be added, old applications may not implement all of them. However, they can still handle such type as one of Extension types. Therefore, applications can decide whether they reject unknown Extension types, accept as opaque data, or transfer to another application without touching payload of them.

---

[0]: https://github.com/msgpack/msgpack/blob/master/spec.md#exten...


> I said what I meant; feel free to disagree but this policing is pretty condescending.

Sorry it came across that way, but when the same thing gets repeated three times (I think) I have to note that something is off in your messaging. I'll try to be more cautious in the future.

> You fundamentally misunderstand MP's extension types. Instead of guessing you can read about them in the MP spec [0]:

Maybe my line of thought is confusing to you, but I have read all of that in order to avoid relying on my fragile recollection. And they are qualitatively different to me. You can't really do much with the encoded bytes `c7 05 00 94 01 02 03 04` if the application-specific type 0 is unknown, even though `94 01 02 03 04` is a valid MP sequence and the author probably intended it as such. So tag-unaware tools like diagnostics or compression algorithms would have to guess. The equivalent CBOR bytes `c0 84 01 02 03 04` clearly express such intent. If there is no such intent, you can put a byte string instead (`c0 45 84 01 02 03 04`).
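
(To annotate those example bytes -- my breakdown; note the `c0` above is just a stand-in for an arbitrary tag, since tag 0 is formally the date/time tag:)

    /* MessagePack: ext 8, application-specific type 0, opaque payload. */
    static const unsigned char mp_ext[] = {
        0xc7,                          /* ext 8 */
        0x05,                          /* payload length: 5 bytes */
        0x00,                          /* extension type 0 */
        0x94, 0x01, 0x02, 0x03, 0x04   /* opaque bytes (happen to be array [1,2,3,4]) */
    };

    /* CBOR: a tag wrapping a structurally visible array. */
    static const unsigned char cbor_tagged[] = {
        0xc0,                          /* tag head */
        0x84,                          /* array of 4 elements */
        0x01, 0x02, 0x03, 0x04
    };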

As you have acknowledged, the tag registry has its pros and cons. It might not be obvious which tag should be used in a given use case. Tags are prone to be ill-designed and stuck forever (this has already happened with the IPv4/v6 tags, to be clear). But the registry means that spec development can happen in a distributed manner and for more specific situations. I mean, the only extension type ever defined by MP is a timestamp. It doesn't even have other obvious tags like UUID. Is that justified?


The registry isn't useful for this. Either you're defining a format to be consumed by a generic decoder and therefore can't rely on tags in the registry, or you're defining a format to be consumed by a custom decoder you control, so it can understand whatever tags/extension types you make it understand. The registry is strictly a negative because--again--you can't rely on it, and it requires extensions to go through the registration process. You can't define application-specific types in CBOR.

> It doesn't even have other obvious tags like UUID. Is that justified?

Yes; UUIDs are huge 128-bit values and many popular embedded platforms are 32-bit. If your app needs them in MP that's what extension types are for.

---

I think maybe what makes us talk past each other is: there's no use-case for a generic CBOR (or MP) decoder on its own. JSON/XML/HTML won in that space (you know things are bad when there are more public XML APIs than public CBOR APIs). There's no serious use-case for a "tag-unaware diagnostic" tool for CBOR or MP APIs. You will always build things on top of the CBOR/MP decoder, there will always be API docs, or reverse engineering the wire format is trivial. CBOR really wants this to not be true; it really wants to be the binary JSON despite the fact this is more or less an oxymoron. The questions that illustrate the difference are:

- how does the format avoid forcing things on you you don't need

- how does the format provide for extension

MP's answers to these questions are:

- be very conservative about what's required of implementations

- extension types

CBOR's answers to these questions are:

- interact with IETF

- interact with IETF

Different people will react to that differently, but that's the bottom line.


> Yes; UUIDs are huge 128-bit values and many popular embedded platforms are 32-bit.

Size is irrelevant because UUIDs are meant to be used as-is (see my other comment). `b210cdca-5d10-4c2e-a604-0fdd9502f02b` has no intrinsic meaning as the number 236689833926310967579631802650001076267; all parsing and formatting of UUIDs can be done bytewise for that reason.
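
(A minimal sketch of "bytewise" here -- hypothetical helper, not from any particular library; no 128-bit arithmetic is ever needed:)

    static int hexval(int c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    /* Parse a canonical UUID string into 16 raw bytes, one hex pair
       at a time; the value is never treated as a number. */
    static int uuid_parse(const char *s, unsigned char out[16]) {
        int i = 0;
        while (*s && i < 16) {
            if (*s == '-') { s++; continue; }
            int hi = hexval(s[0]), lo = hexval(s[1]);
            if (hi < 0 || lo < 0) return -1;
            out[i++] = (unsigned char)(hi << 4 | lo);
            s += 2;
        }
        return (i == 16 && *s == '\0') ? 0 : -1;
    }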

> If your app needs them in MP that's what extension types are for.

No, I mean: why aren't UUIDs built-in (negative) extension types? I'm not talking about application-specific types, which could be literally anything by definition, and CBOR does support such "private" tags (starting at 80000) if you want anyway [1].

[1] In fact, I would argue that about a half of tags past them should be made private.

> there's no use-case for a generic CBOR (or MP) decoder on its own. JSON/XML/HTML won in that space (you know things are bad when there are more public XML APIs than public CBOR APIs).

And that's your claim, not a verifiable fact. Public CBOR APIs are now starting to appear (even though very slowly), while I have never seen a single public MP API---please let me know if there is one. CBOR API is rare mainly because CBOR is new. The same can't be said for MP APIs, which have existed much longer than CBOR; they are rare if not non-existent for some other reason. Maybe in, say, 10 years we can tell whether CBOR APIs were indeed rare for the same reason as MP APIs.

> CBOR really wants this to not be true; it really wants to be the binary JSON despite the fact this is more or less an oxymoron.

So, everything you claimed seems to ultimately originate from this line of thought. And you know what? Needs for binary JSON were always high---otherwise we wouldn't have needed any sort of schemaless serialization, and yet so many people tried to design one! CBOR is probably one of the best binary-JSON alternatives we have ever seen. (Again, that doesn't mean that MP is not one of them. But I will avoid any sort of irrelevant judgement here.) Maybe you're right, but I think there is no concrete evidence for or against your claim right now.

I'm not totally satisfied with CBOR, of course. Some tags were proposed too late to undo harm already done, like the private tags I've mentioned. Bormann in particular seems to be more interested in adding more tags instead of making the most out of an optimal number of tags, and I don't like his attitude in general. So my ideal is actually somewhere between CBOR and MP; it just happens that it can be implemented as a subset of CBOR, while MP is simply insufficient.


> Size is irrelevant because UUIDs are meant to be used as-is (see my other comment). `b210cdca-5d10-4c2e-a604-0fdd9502f02b` has no intrinsic meaning as the number 236689833926310967579631802650001076267; all parsing and formatting of UUIDs can be done bytewise for that reason.

I don't understand your point here. Either there's benefits to representing them numerically (size, speed of comparison, etc) that can be realized w/ MP's extension types, or we can just leave them as strings and MP supports everything you'd want to do with them. What's your issue w/ MP here again?

> CBOR does support such "private" tags (starting at 80000) if you want anyway

I really thought this too, but I can't find it in the spec. The spec links to a big ass list of tags [0] which, holy shit haha, what is going on here? "YANG bits datatype"? "Gordian Envelope"? "Bigfloat with arbitrary exponent"? "Extended bigfloat"? What on earth supports any of this? Anyway, can you link what you're looking at?

Later: Oh, I found it! It's in the big ass list of tags. Although I don't think it's really official? I read through the linked email thread and they don't mention that range. They seem to settle on using 1010 and then switching on data after the tag.

> And that's your claim, not a verifiable fact.

You're not seriously claiming CBOR has anywhere near the usage of JSON/XML/HTML.

> CBOR API is rare mainly because CBOR is new

CBOR is over ten years old. That's not new.

> Needs for binary JSON were always high

Where are all these binary JSON APIs? Is there a list anywhere near as large as this big public APIs list on GitHub [1]?

---

I've been nerd sniped by this enough so I'm gonna quit following these threads. I want to leave you with the fact that I've been right about everything all along, and that the world would be a better place if everyone just listened to me always. Good luck out there.

[0]: https://www.iana.org/assignments/cbor-tags/cbor-tags.xhtml

[1]: https://github.com/public-apis/public-apis


> I'm not sure, but your wording suggests that CBOR is just as unworkable as MP because they implement the same feature set...?

That's fair; I've been a little confusing when I say things like "CBOR is MessagePack". To be clear I mean that CBOR's format is fundamentally MessagePack's, and my issues are with the stuff added beyond that.

> But anyway, those features are not always required but are useful from time to time, and any complete serialization format ought to include them in some way.

Totally! I think MP's extension types (CBOR's "tags") are pretty perfect for this. I mean, bignums or datetimes or whatever are often useful, and supporting extension types/tags in an implementation is really straightforward. I just don't think stuff like this belongs in a data representation format. There's a reason JSON, gRPC, Cap'n Proto, Thrift, etc. don't even support datetimes.

> Canonicalization for example is an absolute requirement for cryptographic applications; you know JWT got so cursed due to JSON's lack of this property.

This is the example I always have in my head too. But canonicalization puts some significant requirements on a serializer. Like, when do you enable canonicalization? CBOR limits the feature set when canonicalizing, so you can do it up front and error if someone tries to add an indefinite-length type, or you can do it at the end and error then, and this by itself is a big question. You also have to recursively descend through any potentially nested type and canonicalize it. What about duplicate keys? CBOR's description on how to handle them is pretty hands off [0], and canonicalization is silent on it [1].

But alright, you can make a reasonable library even aside from all this stuff. But are you really just trusting that things are canonicalized receiver side? Definitely not, so you do a lot of validation on your own which pretty much obviates any advantage you might get. JWT is a great use-case of people assuming the JSON was well-formed: canonicalized or not, you gotta validate. You're a lot better off just defining the format for JWT and validating receiver side; canonicalization is basically just extra work.

> To me it looks more like Markdown vs. CommonMark disputes

There was some of this because of the bytes vs. strings debate. Basically people were like, "hey wait, when I deserialize in a dynamic language that assumes strings are UTF-8, I get raw byte strings? I don't like that", but on the other hand Treasure Data (MP creators) had lots of data already stored in the existing format so they needed (well, wanted anyway) a solution that was backwards compatible, plus you want to consider languages that don't really know about things like UTF-8 or use something else internally (C/C++, Java, Python for a while). That's where MPv5 came from, and the solution is really elegant.

If CBOR was MPv4 + strings I'd 100% agree with you, but it's a kitchen sink of stuff that's pretty poorly thought out. You can see this in the diversity of support in CBOR implementations. I'm not an expert so LMK if you know differently, but for example the "best" Rust lib for this doesn't support canonicalization [2]. Go's is really comprehensive [3] but the lengths it has to go to (internal buffer pools, etc) are pretty bananas and beyond what you'd expect for a data representation format, plus it has knobs like disabling indefinite-length encodings, limiting the sizes of them, limiting the stack depth for nesting, and so on, again really easy to get into trouble here.

[0]: https://datatracker.ietf.org/doc/html/rfc8949#name-specifyin...

[1]: https://datatracker.ietf.org/doc/html/rfc8949#name-serializa...

[2]: https://github.com/enarx/ciborium/issues/144

[3]: https://github.com/fxamacker/cbor


> What about duplicate keys? CBOR's description on how to handle them is pretty hands off [0], and canonicalization is silent on it [1].

I agree on this point, DAG-CBOR in my knowledge is defined to avoid such pitfall. Again, we can agree that Bormann is not a good spec writer nor a good communicator regardless of his design skill.

> You're a lot better off just defining the format for JWT and validating receiver side; canonicalization is basically just extra work.

> I'm not an expert so LMK if you know differently, but for example the "best" Rust lib for this doesn't support canonicalization [2].

However, this argument is... absurd, to be frank. Canonicalization is an additional feature and not every implementation is going to implement it. More specifically, I'm only leaning on the fact that there is a single defined canonicalization scheme that can be leveraged by any interested user, not that it is mandatory (say, unlike bencode), because canonicalization and other features naturally require different API designs anyway.

Let's think about a concrete case: sorted keys in maps. Most implementations are expected to return a standard mapping type for them because that's the natural thing to do. But many if not most mapping types are not sorted by keys. (Python is a rare counterexample AFAIK, but its decision to order keys by default was motivated by the exact point I'm about to make.) So you have to shift the burden of verification to the implementation, or you need an ordered key iterator API, which will remain a niche. We seem to agree that the canonicalization itself has to be done somewhere, but we end up with an implementation burden wherever we put the verification step. So this is not a good argument against format-standardized canonicalization at all.
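
(To make the verification burden concrete, a minimal sketch of the check a decoder could run while reading a map, assuming RFC 8949's deterministic encoding rule -- keys sorted bytewise-lexicographically on their encoded form; the function name is mine:)

    #include <string.h>

    /* A verifying decoder only needs to remember the previous encoded
       key and confirm each new one sorts strictly after it. */
    static int key_strictly_increasing(const unsigned char *prev, size_t prev_len,
                                       const unsigned char *cur, size_t cur_len)
    {
        size_t n = prev_len < cur_len ? prev_len : cur_len;
        int cmp = memcmp(prev, cur, n);
        if (cmp != 0) return cmp < 0;
        return prev_len < cur_len;  /* prefix sorts first; equal = duplicate key */
    }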


I don't think canonicalization is really important in the world of data serialization formats (ex: Protocol Buffers doesn't do it and things seem fine). If you're defining something you're--for example--gonna HMAC, canonicalization is overkill because a data serialization format is overkill. The problem w/ JWT wasn't that JSON didn't have canonicalization (I think this is true?) at the time, the problem is that it used JSON at all. There was no real reason to do this, especially when everyone uses a JWT library anyway: the underlying format could have been anything (and newer token formats have learned this lesson).


> The story is that Carsten Bormann wanted to create an IETF standardized MP version, the creators asked him not to (after he acted in pretty bad faith), he forked off a version, added the aforementioned ill-advised tweaks, named it after himself, and submitted it anyway. All this design by committee stuff is mostly wrong--though IETF has modified it by committee since.

The IETF does not have a committee process. The CBOR RFC has 2 authors, Carsten Bormann and Paul Hoffman. Authors bring documents into the IETF, and the process is basically that everyone bashes¹ on it (the doc, not the people, please) until either there's a reasonable amount of agreement or they give up.

And everyone here means everyone. You could've sent a mail to the mailing list to bash on CBOR. Other MessagePack people could've sent a mail to the mailing list. You could've had comments relayed on the microphone for IETF meetings. Did that happen?

One of very few things that won't fly there is that standardization in general is bad, because the IETF doesn't believe that. But that's only the general argument — "standardizing this particular thing is bad" can and has gone through before. From some sibling comments I see this may have been a major point of contention, but I don't know it was the only one. *If* it is, it's in poor faith to drag this personal dispute into the discussion (I don't know what other disagreements and bad faith there were.)

¹ bashing here means pointing out flaws. It's up to the authors to make text changes to address them.


Aren't there a bunch of emails and what-not about it? I think that's what people are referring to.

EDIT:

> Other MessagePack people could've sent a mail to the mailing list. You could've had comments relayed on the microphone for IETF meetings. Did that happen?

Yes


> Aren't there a bunch of emails and what-not about it? I think that's what people are referring to.

Sorry, what "it"/"that" is this? I'm failing to process due to unclear references.

> > Other MessagePack people could've sent a mail to the mailing list. You could've had comments relayed on the microphone for IETF meetings. Did that happen?

> Yes

Can you point to anything? Best I can find is https://mailarchive.ietf.org/arch/msg/apps-discuss/iZM_ZqA9i... but that's not particularly useful. Boils down to questioning the utility of standards...



That's unhelpful to the degree that if it's any indication of MessagePack people's behaviour back then, I can see why the IETF would ignore the input.

I understand it's not easy to find things there, I tried, and I understand you might not want to spend the time to dig things up. I primarily asked since I hoped you could call something up by remembering some searchable content. If you can't, just say so and that's fine. Throwing that link at me is just rude.


All the links you want are in the post I made that was linked by TFA (IETF very annoyingly killed its URLs for some reason so you have to wayback machine a little). That, plus trying to argue IETF doesn't design by committee (which if anyone knows anything about IETF it's that) has made me assume you're just trolling me. If not, sorry! But what are you trying to add to the conversation here? Is your argument "IETF good, get involved"? It turns out someone can take yr serialization format, rename it, and standardize it entirely without your consent, so no thank you.

I'm happy to discuss stuff, even (especially) in depth, even to be your entree into this whole thing, but you gotta meet me halfway. This whole thread is me saying "hey that's my post!" Please start there.


Actually what happened here is that I saw the first small-grey-text paragraph, dismissed it as not interesting, saw the second such paragraph, dismissed that as well, and when it came to the 4th one which contains the links you're referring to I was in auto-dismiss mode. After your saying the links are in the article, I had to reread it twice until I found the links.

I now see there's been a bunch of back and forth. Good. (in a sense)

The IETF is not a committee process and your claim of "which if anyone knows anything about IETF it's that" is very likely something coloured by your bubble and context. Committee means the members of the standard body itself design things. The IETF doesn't even have members. It's the most open standards body I'm aware of (no idea how W3C works though, maybe they're even better.)

The IETF looks like a committee if you contrast against working without any standardization process, e.g. single project github protocol development. This works until it doesn't. It looks like MessagePack sat on making a string type/tag for more than 2 years; I don't know the story but that's not great either way. I've found a bunch of discussion now and honestly I can empathize with both "sides". What I don't understand is the hostility at the fork. IETF people needed something for use in other protocols, with a standards doc to reference. They had a choice of either making up something completely new, or base it on an existing design. They acknowledged MessagePack as a good design and extended upon it to fix issues, after those weren't addressed there. What's the problem?

And of course it's not compatible. It's not intended to be, it's not MessagePack, just MessagePack-derived.

> It turns out someone can take yr serialization format, rename it, and standardize it entirely without your consent, so no thank you.

That's the most pessimistic view possible, and ignores that changes were made. An optimistic view would be, someone acknowledged your serialization format as good, extended upon it, and took on the hassle to standardize it.


> The IETF is not a committee process and your claim of "which if anyone knows anything about IETF it's that" is very likely something coloured by your bubble and context. Committee means the members of the standard body itself design things.

There's no reward to this argument. I concede there's no actual committees. OK how do I add an "address" type to CBOR, that's right, I email a lot of people on IETF mailing lists, try and build support among people who have influence, spec authors feel persuaded/pressured, spec changes. This is a distinction without a difference, and is what almost everyone really means when they say "design by committee".

> What I don't understand is the hostility at the fork.

MP creators asked Bormann repeatedly to not submit anything to the IETF, and then to withdraw what he did submit. That wasn't cool of him!

> IETF people needed something for use in other protocols, with a standards doc to reference.

"needed" isn't the right verb here.

> They had a choice of either making up something completely new

That's fine. Or they could have waited. I think MPv5 came out like, a few months after Bormann's 1st draft (I'm guessing I don't really know). It's not like there was real urgency here. Nothing the IETF does is urgent.

> They acknowledged MessagePack as a good design and extended upon it to fix issues

CBOR's spec literally includes a section about MP [0] which was wrong when it was written and has become more wrong as time passes. Also not cool!

> That's the most pessimistic view possible, and ignores that changes were made. An optimistic view would be, someone acknowledged your serialization format as good, extended upon it, and took on the hassle to standardize it.

The changes were universally very bad, there was a good chance of causing huge confusion with slightly different very similar differently named formats, and there was an existing community of implementations. Imagine coming up w/ a pretty good data serialization format, building a company around it, fostering a diverse community of implementations, and some dude comes by, submits a similar yet incompatible version for standardization, adds a bunch of bad things to it, disses you in the spec, and names it after himself. What did standardization get them?

[0]: https://datatracker.ietf.org/doc/html/rfc8949#name-messagepa...


> OK how do I add an "address" type to CBOR, that's right, I email a lot of people on IETF mailing lists, try and build support among people who have influence,

It's normally about getting rid of complaints rather than building support, and the "people who have influence" thing is not entirely untrue but also not quite true either (I'm not willing to spend the time to go into this here.)

> spec authors feel persuaded/pressured, spec changes.

No, you'd publish a new document describing your extension; existing IETF documents never change and previous authors can't do jack shit other than writing sad mails on the mailing lists.

> This is a distinction without a difference, and is what almost everyone really means when they say "design by committee".

idk, I guess I'm not "almost everyone". For me, design by committee means the following things:

  - standards are written by a preexisting group of people
  - outside contributions not accepted
  - high barrier of entry to joining that group (financial, academic, or "by company name")
  - group membership is about the group rather than the standard, can't join just for one thing
  - the people in the group frequently aren't even interested in the particular work
The IETF is none of these things.

> MP creators asked Bormann repeatedly to not submit anything to the IETF, and then to withdraw what he did submit. That wasn't cool of him!

I'll (sadly) agree that from the stuff I've seen by now that certainly wasn't done well.

> > IETF people needed something for use in other protocols, with a standards doc to reference.

> "needed" isn't the right verb here.

IMHO it is; CBOR was standardized in order to get used in a whole bunch of IoT RFCs. RFCs can certainly reference external specifications, the barrier here is that the same people that bash the document itself also need to be happy with its references.

> That's fine. Or they could have waited. I think MPv5 came out like, a few months after Bormann's 1st draft (I'm guessing I don't really know).

I mean, sure, but at that point it was already 2 years of trying to get string encoding into MessagePack? I think it's reasonable that at some point people give up…

> It's not like there was real urgency here. Nothing the IETF does is urgent.

This is sadly very untrue, e.g. the homenet stuff died because it did not get ready in time for incorporation into CableLabs CPE specs, and as a result of that now no normal CPE on the planet is IPv6 multi-router or multi-uplink capable. I don't know enough about the IoT stuff back then to say anything about urgency there.

> CBOR's spec literally includes a section about MP [0] which was wrong when it was written and has become more wrong as time passes. Also not cool!

Yeah that should've been updated when RFC 8949 superseded 7049.

As far as your last paragraph is concerned, I don't have the background to agree or disagree on the changes being "universally bad", have no idea what the risk of confusion was (there seems to be none now), will certainly agree that the naming choice is pretty poor, and as far as the "forked idea" is concerned, that's a personal-political opinion about ownership of ideas and concepts.

(And I think I've just about run out of time I want to spend on this. Thanks for your notes!)


> Yeah they fixed that, but there's other parts of the spec that are basically unworkable

Yeah, it just made me chuckle because it was such an obvious oversight and a fun way of pointing it out. That said, I totally get that writing specs is hard, so I'm not dissing the authors as such.

> There's no reason an MP implementation has to be slower than a CBOR implementation.

Yeah, that also struck me. Like, OK, that CBOR library might be faster than that MP library, but it could be that either is just missing some optimizations. And it didn't look like there were orders-of-magnitude differences in either case.

Anyway, I've only looked at CBOR and MessagePack when I dabbled with some microcontroller projects. I found both to be too big, i.e. I couldn't find a suitably small library, in either compiled size or memory requirements or both. So I ended up with JSON for those. Using a SAX-like parser I could avoid dynamic allocations entirely (or close enough).
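
(A sketch of the kind of SAX-like interface meant here -- hypothetical names, not any specific library; the parser scans the buffer in place and fires callbacks with pointers into it, so it never allocates:)

    #include <stddef.h>

    struct json_events {
        void (*on_key)(const char *s, size_t len, void *user);
        void (*on_string)(const char *s, size_t len, void *user);
        void (*on_number)(double v, void *user);
        void (*on_bool)(int v, void *user);
        void (*on_null)(void *user);
        void (*on_begin)(int is_object, void *user);  /* '{' or '[' */
        void (*on_end)(int is_object, void *user);    /* '}' or ']' */
    };

    /* Returns 0 on success; fails hard on malformed input. */
    int json_parse(const char *buf, size_t len,
                   const struct json_events *ev, void *user);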


> That said, I totally get that writing specs is hard, so I'm not dissing the authors as such.

Oh definitely. Yeah maybe I come off as anti-spec or something, but in this case I just think MP was really well thought out, and then Bormann hung a bunch of stuff on it that really wasn't, and I'm salty haha.

> Anyway, I've only looked at CBOR and MessagePack when I dabbled with some microcontroller projects. I found both to be too big, i.e. I couldn't find a suitably small library, in either compiled size or memory requirements or both. So I ended up with JSON for those. Using a SAX-like parser I could avoid dynamic allocations entirely (or close enough).

Whaa? I wrote an MP implementation specifically for this use case: https://github.com/camgunz/cmp. JSON parsing terrifies me; there was some table of tons of JSON (de)serializers with all their weirdo bugs that I never would've thought of. There are probably pretty good test suites now though? I've never looked.


> I wrote an MP implementation specifically for this use case

Perhaps I missed that, can't recall. Will definitely try (again) tho, looks very promising.

As for parsing JSON, the upside is that it's trivial to debug over serial, both viewing and sending, and in my case I could assume limited shenanigans and fail hard if there were issues.


> As for parsing JSON, the upside is that it's trivial to debug over serial, both viewing and sending, and in my case I could assume limited shenanigans and fail hard if there were issues.

Totally yeah, text formats are way easier to work with. This is a very undersold benefit of JSON.


> parts of the spec that are basically unworkable like indefinite length values,

Is this really a problem in practice? Say, an HTTP/1.1 message also may have the body of indefinite length, and it usually works just fine.


No in practice people just ignore the spec, but that's not really what you're hoping for when writing one.


Is it? Take JSON: its spec states that JSON numbers are theoretically infinite precision rationals, but implementations are free to impose their own restrictions. And they do: Python, for instance, is perfectly happy with both serializing and deserializing e.g. 2**7000, while many other implementations (e.g. Golang's) would balk at such values. Still, it works out mostly fine in practice. Is CBOR really worse than JSON?


> Is it? Take JSON: its spec states that JSON numbers are theoretically infinite precision rationals, but implementations are free to impose their own restrictions.

Well, like you say, JSON gives you an out: "An implementation may set limits on the range and precision of numbers"; CBOR doesn't. I'm not really making a claim about CBOR vs. JSON (or HTTP). My TL;DR on this is: the nice thing about MP is that it asks very little of implementations, and that gives them a lot of freedom. CBOR asks way more of implementations--which by itself isn't bad--but it doesn't reckon with the tradeoffs at all. A good example is "indefinite-length encodings"; this paragraph is still in the spec [0]:

---

Note that some applications and protocols will not want to use indefinite-length encoding. Using indefinite-length encoding allows an encoder to not need to marshal all the data for counting, but it requires a decoder to allocate increasing amounts of memory while waiting for the end of the item. This might be fine for some applications but not others.

---

You might think, "well that's not so bad, maybe I don't have to implement this, after all it is in the seemingly optional 'Creating CBOR-Based Protocols' section", but unfortunately it's a core part of CBOR [1].

Confusingly, CBOR seems to care about this kind of thing: "The design does not allow nesting indefinite-length strings as chunks into indefinite-length strings. If it were allowed, it would require decoder implementations to keep a stack, or at least a count, of nesting levels." [2]. But what's the difference between having to keep a stack of nesting levels and having to allocate as much as your network peer tells you to?

The fact is you can't reasonably implement streaming in a data representation format. It's protocol-level functionality, which is why HTTP/1.1's description of it is way more useful. Including it is pretty indicative of the whole "let's just throw some features into the spec and see what happens" attitude.
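
(For readers following along, here's what the feature looks like on the wire; byte values per RFC 8949:)

    /* Indefinite-length array [1, 2, 3]: no count up front; the decoder
       must keep buffering until it sees the 0xff "break" byte. */
    static const unsigned char indef[] = {
        0x9f,              /* array, indefinite length */
        0x01, 0x02, 0x03,  /* elements */
        0xff               /* break */
    };

    /* Definite-length equivalent: the element count is known immediately. */
    static const unsigned char definite[] = {
        0x83,              /* array of 3 elements */
        0x01, 0x02, 0x03
    };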

[0]: https://datatracker.ietf.org/doc/html/rfc8949#section-5.1

[1]: https://datatracker.ietf.org/doc/html/rfc8949#name-indefinit...

[2]: https://datatracker.ietf.org/doc/html/rfc8949#name-indefinit...


I admit I've only skimmed the RFC, but it seems to explicitly allow receivers to refuse to deal with lots of features. It doesn't say anywhere that compliant decoders MUST accept all well-formed CBOR inputs. In fact, it says quite the opposite.

> But what's the difference between having to keep a stack of nesting levels and having to allocate as much as your network peer tells you to?

In that you may preallocate a (large enough) buffer in the latter case, and bail out when the incoming message grows out of it but still be able to skip to the rest of the message as opposed to not being able to re-synchronize because the minimally needed parsing context grows without bound?


> I admit I've only skimmed the RFC, but it seems to explicitly allow receivers to refuse to deal with lots of features.

That's true but like, at what point are people disappointed your library doesn't support functionality that they think is pretty core? I'm not saying libraries can or should support everything, just that specs that ask a lot of implementations are inviting this kind of thing, and that while MP considers this, CBOR pretty clearly does not.

> In that you may preallocate a (large enough) buffer in the latter case, and bail out when the incoming message grows out of it but still be able to skip to the rest of the message as opposed to not being able to re-synchronize because the minimally needed parsing context grows without bound?

Neither of these are recoverable because you don't know how much to skip to resync (assuming skipping a bunch of data isn't by itself fatal):

- Stack: Maps and arrays only tell you their elements/pair counts, not how many bytes to skip in a stream, so to skip them you have to parse them fully, because they may be nested.

- Stream: By definition you don't know how much to skip.

Again these are the kinds of issues you'd address in a protocol definition. Though it kind of tries to pose as one, CBOR is not a protocol definition. A more reasonable thing to do would be to stream MP over HTTP, because you get so many things for free (connection management, caching and TLS to name a few).
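
(A sketch of that "Stack" point, with hypothetical decoder helpers -- not from any particular library:)

    #include <stddef.h>
    #include <stdint.h>

    struct decoder;  /* opaque input cursor (hypothetical) */
    enum item_type { TYPE_ARRAY, TYPE_MAP, TYPE_SCALAR };
    struct header { enum item_type type; uint64_t count; size_t byte_len; };

    int read_header(struct decoder *d, struct header *h);  /* hypothetical */
    int skip_bytes(struct decoder *d, size_t n);           /* hypothetical */

    /* Skipping a container means parsing it fully: headers carry
       element counts, not byte lengths, and elements may nest. */
    int skip_item(struct decoder *d) {
        struct header h;
        if (read_header(d, &h) != 0) return -1;
        switch (h.type) {
        case TYPE_ARRAY:
            for (uint64_t i = 0; i < h.count; i++)
                if (skip_item(d) != 0) return -1;
            return 0;
        case TYPE_MAP:
            for (uint64_t i = 0; i < h.count; i++)   /* key + value per pair */
                if (skip_item(d) != 0 || skip_item(d) != 0) return -1;
            return 0;
        default:
            return skip_bytes(d, h.byte_len);  /* strings/scalars carry byte sizes */
        }
    }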


CBOR started as a complementary project to previous-decade IoT (Internet of Things) and WSN (Wireless Sensor Networks) initiatives. It was designed together with 6LoWPAN, CoAP, RPL and other standards. The main improvement over MessagePack was discriminating between byte strings and text strings - an important use case for firmware updates etc. The reasoning is probably available in the IETF mailing archive somewhere.

All these standards were designed as research and have been rather slow to gain general popularity (6LoWPAN is used by Thread, but its uptake is also quite slow - e.g. Nanoleaf announced dropping support for it).

I would say if CBOR fits your purpose it's a good pick, and you shouldn't be worried by it being "not cool". Design by committee is how IETF works, and I wouldn't call it a weakness, although in DOGE times it might sound bloated and outdated.


To be fair, CBOR proper is amazingly well designed given its constraints and design-by-committee nature. It's not even hard to keep the whole specification in your head, due to the regular design. Unfortunately, though, I can't say that for the rest of the CBOR ecosystem; many related specs show varying levels of bloat. I recently heavily criticized the packed CBOR draft because I couldn't make any sense out of it [1], and Bormann seemed to have clearly missed most of my points.

[1] https://mailarchive.ietf.org/arch/msg/cbor/qdMZwu-CxHT5XP0nj...


Disclaimer: I wrote and maintain a MessagePack implementation.

To be uncharitable, that's probably because CBOR's initial design was lifted from MP, and everything Bormann added to it was pretty bad. This snippet from your great post captures it pretty well I think:

---

CBOR records the number of nested items and thus has to maintain a stack to skip to a particular nested item.

Alternatively, we can define the "processability" to only include a particular set of operations. The statement 3c implies and 3d seems to confirm that it should include a space-constrained decoding, but even that is quite vague. For example,

- Can we assume we have enough memory to buffer the whole packed CBOR data item? If we can't, how many past bytes can we keep during the decoding process?


> To be uncharitable, that's probably because CBOR's initial design was lifted from MP, and everything Bormann added to it was pretty bad.

To be clear, I disagree and believe that Bormann did make a great addition by forking. I can explain this right away by how my point can be fixed entirely within CBOR itself.

CBOR tags are of course not required to be processed at all, but some common tags have useful functions that many implementations are expected to implement. One example is the tag 24 "Encoded CBOR data item" (Section 3.4.5.1), which indicates that the following byte string is encoded as CBOR. Since this string carries its size in bytes, every array or map can be embedded in such tags to ensure easy skippability. [1] This can be made into a formal rule if the supposed processability is highly desirable. And given that those tags were defined so early, my design sketch must have already been considered in advance, which is why I believe CBOR is indeed designed better.

[1] Alternatively, RFC 8742 CBOR sequences (tag 63) can be used to emulate an array or map of indeterminate size.
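
(Concretely, that tag-24 wrapping looks like this -- my sketch, byte values per RFC 8949:)

    /* [1, 2, 3, 4] wrapped in tag 24 ("Encoded CBOR data item"): the byte
       string header gives an explicit byte length, so a decoder can skip
       the whole item without parsing the embedded array. */
    static const unsigned char wrapped[] = {
        0xd8, 0x18,                     /* tag 24 */
        0x45,                           /* byte string, length 5 */
        0x84, 0x01, 0x02, 0x03, 0x04    /* embedded CBOR: array [1, 2, 3, 4] */
    };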


Sure, I think CBOR's "suggested" tags (or whatever they are) are probably useful to most people. The tradeoff is that they create pressure for implementations to support them, and that's not free. For example, bignum libraries are pretty heavyweight; they're not really the kind of thing you'd want to include in a C implementation as a dependency, especially when very few of your users will use them. Well OK, now you have a choice between:

- include it anyway, bloat your library for almost everyone, maybe consider supporting different underlying implementations, manage all these dependencies forever, also those libraries have different ways of setting precision, allocating statically or dynamically, etc, so expose that somehow

- don't include it, you're probably now incompatible with all dynamic language implementations that get bignums for free and you should note that up front

This is just one example, but it's pretty representative of Bormann's "have your cake and eat it too" design instincts where he tosses on features and doesn't consider the tradeoffs.

> One example is the tag 24 "Encoded CBOR data item" (Section 3.4.5.1), which indicates that the following byte string is encoded as CBOR. Since this string carries its size in bytes, every array or map can be embedded in such tags to ensure easy skippability.

This only works for types that aren't nested unless you significantly complicate bookkeeping during serialization (store the byte size of every compound object up front), which has the potential to seriously slow down serializing. My approach to that would be to let individual apps do that if they want (encode the size manually), because I don't think it's a common usage.


> Well OK, now you have a choice between: - include it anyway, [...] - don't include it, [...]

So I guess that's why MP doesn't have a bignum. But MP's inability to store anything more than (u)int64 and float64 does make its data model technically different from JSON because JSON didn't properly specify that its number format should be round-trippable in those native types. Even worse, if you could assume that everything is at most a float64, you would still have to write a considerable amount of subtle code to do a correct round-trip! [1] At that point your code would already contain some bignum machinery anyway. So why not support bignums then?

[1] Correct floating point formatting and parsing is very difficult and needs a non-trivial amount of precomputed tables and sometimes bignum routines (depending on the exact algorithm)---for the record, I'm the main author of Rust's floating point formatting routine. Also for this reason, most language-standard libraries already have hidden support for size-limited bignums!
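
(As a toy illustration of that subtlety -- plain C, nothing CBOR-specific; the shortest-round-trip algorithms alluded to, e.g. Grisu or Ryu, are much more involved:)

    #include <stdio.h>

    int main(void) {
        double x = 0.1 + 0.2;
        printf("%.15g\n", x);  /* "0.3": parses back to a *different* double */
        printf("%.17g\n", x);  /* "0.30000000000000004": always round-trips, but ugly */
        /* Emitting the *shortest* string that still round-trips is the
           hard part that needs the tables/bignum routines mentioned above. */
        return 0;
    }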

> My approach to that would be to let individual apps do that if they want (encode the size manually), because I don't think it's a common usage.

I mean, the supposed processability is already a poorly defined metric as I wrote earlier. I too suppose that it would be entirely up to the application's (or possibly library's educated) request.


> But MP's inability to store anything more than (u)int64 and float64 does make its data model technically different from JSON....

Yeah I don't love the MP/JSON comparison the site pushes. I don't really think they solve the same problems, but the reasons are kind of obscure so shrug. MP is quite different from JSON and yeah, numbers is one of those ways.

> [1] Correct floating point formatting and parsing is very difficult and needs a non-trivial amount of precomputed tables and sometimes bignum routines (depending on the exact algorithm)---for the record, I'm the main author of Rust's floating point formatting routine. Also for this reason, most language-standard libraries already have hidden support for size-limited bignums!

Oh man yeah tell me about it; I attempted this way back when and gave up lol. I was doing a bunch of research into arbitrary precision libraries and the benchmarks all contain "rendering a big 'ol floating point number" and that's why. Wild.

> I mean, the supposed processability is already a poorly defined metric as I wrote earlier. I too suppose that it would be entirely up to the application's (or possibly library's educated) request

I think in practice implementations are either heavily spec'd (FIDO) on top of a restricted subset of CBOR, or they control both sender and receiver. This is why I think much of the additional protocol discussion in CBOR is pretty moot; if you're taking the CBOR spec's advice on protocols you're not building a good protocol.


> Oh man yeah tell me about it; I attempted this way back when and gave up lol. I was doing a bunch of research into arbitrary precision libraries and the benchmarks all contain "rendering a big 'ol floating point number" and that's why. Wild.

Yes, it's something whose existence people generally don't even realize. To my knowledge only RapidJSON and simdjson have seriously invested in optimizing this aspect---their authors do know this stuff and its difficulty. Others tend to use a performant but not optimal library like double-conversion (which was the SOTA at the time of its release!).


> Well OK, now you have a choice between: - include it anyway, [...] - don't include it, [...]

I do not see an issue here. In the decoder, one does not need a bignum library; just pass the bignum to the application as a memory blob.

In the application, one knows the semantic restrictions on the given values, and can either reject bignums as semantically invalid (out of range) or pull in a bignum processing library anyway.


Nah it's a pain in the ass if I'm writing a C program to consume your API and I need to pull in MPFR because you used bignums.


A reasonable C API would just give a pointer to decimal digits and a scaling factor. Why do you think MPFR is needed?


You can replace "pull in MPFR" with "work any harder than just using `double`". Bignums are an obvious pain in the ass; I can think of no data representation formats that include support for them and that's why


I'm aware of plenty (though I have surveyed at least 20 formats in the past and so that would include more obscure ones). At the very least, you can feed it back to sscanf if you are fine with an ordinary float or double, a thoughtful API would include this as an option too. That's what I expect for the supposed bignum support: round-trippability.


Maybe an example is useful. I want to build a generic CBOR decoder in C. I have 2 options:

- link GMP/mpdecimal/whatever (or hey, provide an abstraction layer and let a user choose)

- accept function pointers to handle bignum tags

Function pointers are an irritation (I know this because my MP library uses them); they're slower than not using them, you've gotta check for NULL a lot, and you're also asking any application that uses your library and wants bignum support to include GMP itself (with all the attendant maintenance, setup, etc.)

Or, you can include it yourself, but welcome to doing all the maintenance yourself, and exposing all of GMP's knobs (ex: [0])

You might argue that these aren't the only options, but a deserialized value has to be understood by the application; your suggestions aren't good tradeoffs. sscanf (also do not use sscanf) doesn't work if the value is actually a bignum, and yielding a bespoke bignum format is just as unusable as simply returning whatever's encoded in CBOR. How would I add two such values together? How would I display it? This is what bignum libraries are for.

All this is made far worse by the fact that there are effectively no public CBOR (or MP) APIs where you're expecting them to be consumed entirely by generic decoders, so there's not even a need to force generic decoders to go through all this effort to support bignums (etc.) Further, unlike MP, CBOR doesn't let you use tags for application-specific purposes. Put it all together and it's uniformly worse: implementations are either more complex or have surprising holes, you can't count on generic decoders supporting tags when building an API or defining messages, and you can't even just say, "for this protocol, tag 31 is a UUID".

This is probably a big reason (though I can think of others) why the only formats you can think of w/ bignum support are obscure.

> That's what I expect for the supposed bignum support: round-trippability.

Round-tripping is only meaningful if a receiver can use the values before reserializing, otherwise memcpy meets your requirements. If a sender gives me a serialized bignum, the deserializing library has to deserialize it into a value I can understand and use; that's the whole point of a deserialization library.

MP's support for timestamps is a reasonable example here: it decomposes into a time_t, and it can do this because it defines the max size. You can't do that w/ a bignum--the whole point of a bignum is it's big beyond defining. A CBOR sender can send you an infinite series of digits, and the spec doesn't reckon with this at all.
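
(For contrast, here's the bounded layout in question, per the MP spec's timestamp extension -- timestamp 32 shown; there are also 64- and 96-bit forms:)

    /* MP timestamp 32: fixext 4 with extension type -1, then big-endian
       seconds. The size is bounded up front, so it decomposes directly. */
    static const unsigned char ts32[] = {
        0xd6, 0xff,              /* fixext 4, extension type -1 (timestamp) */
        0x55, 0xd4, 0xbf, 0x40   /* uint32 seconds since the epoch, big-endian */
    };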

[0]: https://gmplib.org/manual/Memory-Management


> I have 2 options: - link GMP/mpdecimal/whatever (or hey, provide an abstraction layer and let a user choose) - accept function pointers to handle bignum tags

I would just provide two kinds of functions:

    // For each representative native type...
    cbor_read_t cbor_read_float(struct cbor *ctx, float *f);

    // And there is a generic number handling:
    struct cbor_num {
        int sign; // -1, 0 or 1
        int base; // 10 or 16
        int exponent;
        const char *digits;
        size_t digits_len;
    };
    cbor_read_t cbor_read_number(struct cbor *ctx, struct cbor_num *num);

    // And then someone will define the following on top of cbor_read_number:
    cbor_read_t my_cbor_read_mpz(struct cbor *ctx, mpz_t num);
Memory lifetimes and the like also have to be considered here (left as an exercise), but the point is that you never need function pointers in this case. In fact I would actively avoid them, because proper function pointer support is indeed a PITA, as you said. They can generally be avoided with a (sorta) inversion of control, which is popular in compact C APIs and to some extent in Rust APIs too. It's just that you haven't thought of this possibility.

> sscanf (also do not use sscanf) doesn't work if the value is actually a bignum, and yielding a bespoke bignum format is just as unusable as simply returning whatever's encoded in CBOR. How would I add two such values together? How would I display it? This is what bignum libraries are for.

In practice many bignums are just left as-is. For example, X.509 certificate serial numbers are technically bignums, but you never compute anything out of them, so you don't need any bignum library to read serial numbers. If you do need computation then you need an adapter function as above, but the library proper needs no knowledge of such an adapter. What's the problem now?

By the way, sscanf is fine here because the API's contract constrains sscanf's inputs enough to be safe. sscanf in general is also safe when every `char *` output is bounded. It is certainly a difficult beast, but so is everything about C.
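
For example (a generic illustration, nothing CBOR-specific), an explicit field width is what bounds the write:

    #include <stdio.h>

    void read_token(const char *input) {
        char buf[32];
        // "%31s" writes at most 31 characters plus the terminating NUL,
        // so buf cannot overflow no matter what input contains.
        if (sscanf(input, "%31s", buf) == 1)
            printf("token: %s\n", buf);
    }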


This isn't responsive to what I wrote:

> and yielding a bespoke bignum format is just as unusable as simply returning whatever's encoded in CBOR. How would I add two such values together? How would I display it? This is what bignum libraries are for.

I know this is what you've been getting at. Maybe I've been unclear about why this isn't useful, but here are the main points:

- Without bignum functionality, your data structure doesn't provide any more functionality than memcpy. How do I apply the base? How do I apply the exponent? How would I add two of them together? This may as well just be a `char *`.

- Speaking of just being a `char *`, CBOR's bignums are just that: a raw big-endian byte string, so you'd hand whatever is in the buffer straight to `mpz_import` (see the sketch after this list). Parsing into your struct here is counterproductive.

- Even the minimal functionality you're proposing here is added bloat to every application that doesn't care about bignums and wants to ignore the tag (probably almost all applications). Ameliorating this requires conditional compilation.
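
The sketch referenced above (assuming GMP, with `buf`/`len` holding the content of a CBOR tag-2 byte string, i.e. an unsigned big-endian magnitude per RFC 8949):

    #include <gmp.h>
    #include <stdint.h>

    void bignum_from_cbor(mpz_t out, const uint8_t *buf, size_t len) {
        mpz_init(out);
        // len bytes, most significant first (order 1), 1-byte words
        // (size 1, so endian is irrelevant), no nail bits.
        mpz_import(out, len, 1, 1, 1, 0, buf);
    }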

> In practice many bignums are just left as is.

I'd believe this; I'd also believe there's very little real need for them generally. This is an argument for not including them in a data serialization format.

> By the way, sscanf is fine here

The problem with sscanf isn't that it can never be safe; it's that if you aren't safe every single time, you blow everything up. It's better to just not use it.


> Design by committee is how IETF works

If the IETF is design by committee, almost any collaboratively developed standard could be called designed by committee. And I'm rather confident in assuming you haven't seen the ITU or IEEE in action, or you'd be singing angels'-choir praises of the IETF process…

(The IETF really does not have a committee process by any reasonable definition.)


> Obviously MessagePack is what cool kids would use.

Why is that even a consideration?

> To measure complexity, you can often use documentation length as a proxy. MessagePack is just a markdown file. The CBOR spec has its own gravitational field.

That's a proxy for underspecification, not complexity.


>> Obviously MessagePack is what cool kids would use.

> Why is that even a consideration?

What’s “cool” is not important, but being “cool” _can_ mean there is a larger ecosystem around it. It can also be a proxy for how well your stuff will interact with other systems and how many people you can hire who will just know how it works without having to learn yet another new thing.

On the other hand, none of that applies to this person’s personal project. And in that context, I think your comment stands.


Because he seems to be working on "fun hobby projects", not work.


> Everything about CBOR is uncool. It was designed by a committee. It reeks of RFCs. Acronyms are lame. Saying "SEE-BORE" is like licking a nickel. One of the authors is "Carsten Bormann", which makes the name feel masturbatory.

Carsten Bormann was my professor for Rechnernetze (computer networks). He is also one of the authors of GNU Screen. I mentioned to him that I use tmux, and he asked what was wrong with screen :). His wife is also in the same IT department at the university, where she was the dean. She helped me sort out a problem regarding my course selections, a very kind person. I think he is a decent teacher and knowledgeable in his field, but if you look at his work over the past decades, it's evident that he has a tendency to author RFCs that are rarely used.


Interesting. I think single person RFCs are like most other forms of publication in that there is a long tail of unused ones and a few that take off. But you miss all the shots you don't take.


Just a question: were you trying to optimize for speed or size?

I never tried CBOR when looking for a sub-10ms solution for WebSocket comms; my use case was not bound by data size but entirely by speed (local network, not internet).

However, it all came down to a surprising realisation: "compression on both ends is the primary performance culprit".

Optimizing the hell out of the protocol over WebSockets got me to a fairly OK response time; just using JSON strings and zero compression blew it out of the water.

So the result was that the data loaded faster and was easier to debug going with JSON strings vs. any other optimization (message sizes were in the 10-50 MB realm).

The amount of shoddy WS server implementations and gzip operations in the communications pipeline is mind-blowing. I would be interested in hearing how pure JSON with zero compression/binary transforms performed :)


Virtually every use of zlib (which can be used to implement gzip) should be replaced with zlib-ng, in fact. The stock zlib is too slow for modern computers. If you have the right workload---no streaming, data fits in memory, etc.---then libdeflate is even faster. Compression can't be a bottleneck when you've got a correct library.
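
For the in-memory case, libdeflate's one-shot API is about as simple as it gets (a sketch; error and allocation checks abbreviated):

    #include <libdeflate.h>
    #include <stdlib.h>

    size_t gzip_once(const void *in, size_t in_len, void **out) {
        struct libdeflate_compressor *c = libdeflate_alloc_compressor(6);
        size_t max = libdeflate_gzip_compress_bound(c, in_len);
        *out = malloc(max);
        // Returns the compressed size, or 0 if it didn't fit.
        size_t n = libdeflate_gzip_compress(c, in, in_len, *out, max);
        libdeflate_free_compressor(c);
        return n;
    }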


zlib-rs (the Rust port, which exposes a zlib-compatible API) is now faster in most cases.


Fun fact: turning off permessage-deflate is nearly impossible when browsers are in the mix. The best option is to strip the header in a reverse proxy, since many browsers ignore the setting and emit the header despite your configs. Add to that that most servers assume the client knows what it is doing and adhere to the header request, while not allowing a global override to turn it off.

It gives you a fun ball of yarn to untangle.


Patently untrue.

Seriously, try to DEFLATE-compress a 50 MB JSON structure vs. just piping it down the wire on a high-bandwidth connection, and try to measure that. (In real life, with a server and a browser; browsers are super slow at this.)


> browsers are super slow at this

No. DEFLATE is an asymmetric compression algorithm, meaning that decompression is disproportionately faster (at least 100 MB/s in my experience) than compression. It is mostly the server's fault if it uses too high and ineffective a compression setting, or an inefficient library like the stock zlib.


> The compression can't be a bottleneck when you've got a correct library.

It absolutely can, though. You’re not going to do memory compression using zlib, regardless of its flavour.


In this context, of course. It is not a general statement ;-)


ramdisk!


I think one important thing to realize is that using CBOR or MessagePack does not involve compression (unless you add it as another layer, the same way you would for JSON).

CBOR and MessagePack are more compact, but they do not achieve this by compression; instead, they add less noise in between your data when placing it on the wire.

E.g., instead of (as in JSON) outputting a `"`, going through every UTF-8 code point, checking whether it needs escaping and escaping it, and then placing another `"`, they place a type tag plus a length hint and then just memcpy the UTF-8 to the wire (assuming they can rely on the input being valid UTF-8).
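
A minimal sketch of that "type tag + length + memcpy" idea, using MessagePack's fixstr header (0xa0 | len, valid for lengths up to 31; the helper name is made up):

    #include <stdint.h>
    #include <string.h>

    size_t mp_write_fixstr(uint8_t *out, const char *s, size_t len) {
        // Caller guarantees len <= 31 and that s is valid UTF-8.
        out[0] = 0xa0 | (uint8_t)len;  // one byte: type tag + embedded length
        memcpy(out + 1, s, len);       // no per-character escaping pass
        return len + 1;
    }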

The only thing which goes a bit in the direction of compression is that you can encode an integer as a tiny, short, or long field. But even then it is still way faster than converting it to its decimal US-ASCII representation...

Though that doesn't mean they are guaranteed to always be faster. There are some heavily, absurdly optimized JSON libraries using all kinds of trickery like SIMD, and many "straightforward" implementations of CBOR and MessagePack.

Similarly, your data might already be in JSON, in which case cross-encoding it is likely to outweigh any gains.


Is your data already JSON at rest? Because encoding/decoding CBOR should easily beat encoding/decoding JSON.


I wasn't aware of CBOR, as I have yet to come across a project that needed an alternative to MessagePack, or even MessagePack itself (I only considered using it once in the past). However, based on my experience on various projects, if you have to get approvals and buy-in from architects, legal, and security teams, using something that has an RFC helps you win those battles, regardless of the technical merits of the RFC- or non-RFC-backed project/tool/protocol.


Yubico's python-fido2 library (https://github.com/Yubico/python-fido2) contains a (minimal) CBOR implementation too: https://developers.yubico.com/python-fido2/API_Documentation...

I found it wouldn't encode `None`s, but didn't dig at all, just worked around it.

Star count would place it about midway in the list.


Just curious if you considered Cap'n Proto as another option, or if it wasn't in the running?

[1] https://capnproto.org/


It covers a different use case. JSON, MessagePack, and CBOR fall into the "schemaless" category of formats, which mandate a common but useful-enough data model (CBOR is novel in that this data model is somewhat extensible too). Cap'n Proto and Protobuf fall into the "schematic" or "schemaful" category, where you always need the correct schema to encode and decode anything, but which is possibly more efficient in encoded size and general performance.


Thanks for clarifying! I thought that Cap'n Proto allows for evolving schemas, but I guess it's true that if each of your messages is completely different, it's not going to benefit you as much perhaps.


I think the list of stars is probably not a good representation of popularity. Serde and JSON For Modern C++ have vastly more stars than any of those libraries and they both support CBOR and MessagePack.

I think CBOR is pretty decent though it is fairly inexplicable that a format designed in 2013 uses big endian.


I don't understand why the author doesn't prefer CBOR; isn't doing things according to an RFC standard better? MsgPack and CBOR are pretty much comparable, feature-wise.

Anyway, I work on IBM mainframes, and big endian is so much easier to read in hex. Not sure why anybody would want little endian, honestly.
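
A quick self-contained way to see the difference on your own machine:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        uint32_t v = 0x12345678;
        uint8_t b[4];
        memcpy(b, &v, sizeof b);  // native byte order
        // A big-endian machine (e.g. an IBM mainframe) prints 12 34 56 78,
        // matching how you'd write the number; a little-endian machine
        // prints 78 56 34 12.
        printf("%02x %02x %02x %02x\n", b[0], b[1], b[2], b[3]);
        return 0;
    }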


> big endian is so much easier to read in hex. Not sure why anybody would want little endian, honestly.

Because you don't need to read these files in a hex editor and 99.999% of people aren't working on an IBM mainframe; they're working on a little endian machine.


> MsgPack and CBOR are pretty much comparable, feature-wise.

They're pretty much exactly the same thing. IIRC the difference is that CBOR specifies how to handle custom types slightly more verbosely.


serde doesn't support CBOR/MP; implementations of those support serde, and those implementations are listed in the table. You might have a point about JfMC++, though.


Good point, and actually the MessagePack Serde library has way more stars than the CBOR one.


TL;DR: CBOR is a bit more complex, but mainly due to additional features (tags, indefinite/unknown-length types) which, if you need them, will make using CBOR the simpler choice, and which libraries can decide to support only in a "naive" but simple way (ignore most tags; directly collect unknown-length types, with some hard size limits).

---

One interesting aspect of CBOR vs. MessagePack is simplicity.

But it's not really that much of a win for MessagePack, especially if we ignore "tags".

Like, sure, MessagePack splits its type/length markers at clean nibble boundaries, and as such, if you need to read a MessagePack file through a hex editor, it's easier.

But why would you do so without additional tooling?

Like, if we are realistic, most developers won't even read (non-trivially small) JSON without tooling which displays the data with nice formatting and syntax highlighting. Whether it's in your browser or editor, or by piping it through jq on the command line, tooling is often just a click away.

And if you use tooling anyway, it doesn't matter whether type/size hints use a 4/4-bit split or a 3/5-bit split. Implementation-wise, outside of maybe some crazy SIMD tricks or similar, it also doesn't matter (too) much.

And the moment the 4/4-bit and 3/5-bit splits are seen as similarly complex in practice, the two formats end up really similar in overall complexity.
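
Concretely, decoding CBOR's initial byte under its 3/5-bit split is one shift and one mask (per RFC 8949):

    #include <stdint.h>

    // Top 3 bits: major type (0..7 = uint, nint, bstr, tstr, array, map,
    // tag, simple/float). Low 5 bits: additional info (0..23 is an
    // immediate value; 24..27 say how many length bytes follow).
    void cbor_initial_byte(uint8_t b, uint8_t *major, uint8_t *info) {
        *major = b >> 5;
        *info  = b & 0x1f;
    }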

CBOR is a bit more consistent with its encodings than MessagePack (e.g., MessagePack has "fixint" special cases which don't follow the 4/4-bit split). But it's also lacking dedicated boolean and null types in its core data model.

CBOR's encoding of true, false, and null is ... iffy (they're "simple values" under major type 7 rather than first-class types, though each still fits in one byte). It also has both null and undefined, for whatever cursed reason.

MessagePack has an extension system which associates a custom type in [0;127] with a byte array.

CBOR has a tagging system which associates a custom tag in [0;2^64-1] with any kind of CBOR value.

In this aspect MessagePack is simpler, but also so restrictive that running into collisions in large interconnected enterprise use cases is not just possible but expected if extension types are widely used (though for most apps 0..127 is good enough).
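
For reference, MessagePack's "ext 8" framing is just three header bytes before the payload (a sketch; the helper name is made up):

    #include <stdint.h>
    #include <string.h>

    size_t mp_write_ext8(uint8_t *out, int8_t type, const uint8_t *data, uint8_t len) {
        out[0] = 0xc7;           // ext 8 marker
        out[1] = len;            // payload length in bytes
        out[2] = (uint8_t)type;  // application-defined type in [0;127]
        memcpy(out + 3, data, len);
        return (size_t)len + 3;
    }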

On the other hand, if a CBOR library wants to support all kinds of complex custom-tag-based extension use cases, it could add quite a bit to the library's complexity. But then, ignoring tags when parsing is valid. One benefit of the large tag space is that there are some pretty useful predefined tags, e.g. for a Unix timestamp, or for marking that a byte string is itself valid CBOR, or the bignum encoding.

Lastly, CBOR supports "indefinite"/"unknown when encoding starts" length byte strings, UTF-8 strings, lists, and maps. That clearly adds complexity to any library fully implementing CBOR, but if you need it, it removes a lot of complexity from your side, as you now don't need to come up with some JSONL-ish format for MessagePack (which is trivial, but outside of the spec). Most importantly, you can stream the insides of nested items, which is not viable with MessagePack.
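
The wire framing for this is simple (per RFC 8949, section 3.2): an indefinite-length array opens with 0x9f and closes with the "break" byte 0xff, with items streamed one at a time in between. A sketch:

    #include <stdio.h>

    void stream_array(FILE *out) {
        fputc(0x9f, out);       // start an indefinite-length array
        for (int i = 0; i < 3; i++)
            fputc(i, out);      // ints 0..23 encode as a single byte
        fputc(0xff, out);       // "break": closes the array
    }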



