Smart kids can be a distraction as well. It certainly would have benefitted me to enter G&T in kindergarten instead of 3rd grade. Much of my first grade was spent separate from the other kids, doing 5th grade workbooks.
While I think this is Rust's biggest flaw, it doesn't stem from any particular hatred of C/C++. It's related to memory safety: it is very difficult to reason about the memory lifetimes of object graphs with cycles.
I've started to use agents on some very low-level code, and have had middling results. For pure algorithmic stuff, it works great. But I asked it to write me some arm64 assembly and it failed miserably: it couldn't keep track of which registers were which.
I was there for three years and you're totally right that everything's been automated, but there are also a large number of product-level decisions that just don't make sense. They make financial sense, sure, but that means the engineer has drunk the MBA Kool-Aid (or not enough of it), things get killed off, and they are no longer to be trusted around things that need proper love and care put into them. Promo packets, though, sure.
It's hard to read that as a human, though, and not want to build a system that lets people update bad map data? Which there used to be, but then yeah.
So yeah, the inmates (engineers) used to run the asylum (Google), but then a group of fucking psychopaths (DoubleClick) got added to the asylum and got given meth (ad money), and shit's fucking unhinged.
Unfortunately, c-ares is not problem-free on all platforms.
On iOS, its use triggers a local network access popup (it tries to reach your DNS server, which is often on your LAN). If a user denies access, your app will simply not work.
On Android, it's not compatible with some VPN apps. Those apps are to blame, but your users are going to blame you, not them.
So, at my previous company we ended up building libcurl with a threaded DNS resolver on both iOS and Android.
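If it helps anyone, you can check at runtime which resolver backend a given libcurl build ended up with via curl_version_info(). As I understand it, both c-ares and the threaded resolver report the async-DNS feature bit, but only c-ares builds fill in the ares version field, so treat that heuristic as my assumption rather than gospel:

    #include <stdio.h>
    #include <curl/curl.h>

    /* Sketch: report whether this libcurl build resolves DNS asynchronously,
     * and whether it does so via c-ares or the threaded resolver. */
    int main(void)
    {
        curl_version_info_data *info = curl_version_info(CURLVERSION_NOW);

        if (info->features & CURL_VERSION_ASYNCHDNS) {
            if (info->ares)   /* non-NULL only when built against c-ares */
                printf("async DNS via c-ares %s\n", info->ares);
            else
                printf("async DNS via the threaded resolver\n");
        } else {
            printf("blocking (synchronous) resolver\n");
        }
        return 0;
    }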
This. Plus ASN.1 is pluggable as to encoding rules and has a large family of them:
- BER/DER/CER (TLV; see the sketch below)
- OER and PER ("packed" -- no tags and
no lengths wherever
possible)
- XER (XML!)
- JER (JSON!)
- GSER (textual representation)
- you can add your own!
(One could add one based on XDR, which would look a lot like OER/PER in a way.)
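To make the TLV point concrete, here's a rough sketch (my own illustration, not from any particular ASN.1 toolkit) of DER-encoding an INTEGER as tag, then length, then minimal big-endian content:

    #include <stdint.h>
    #include <stddef.h>

    /* Rough sketch of DER TLV for an INTEGER: tag 0x02, a short-form length
     * byte, then the minimal big-endian two's-complement content.
     * Returns the number of bytes written. */
    static size_t der_encode_int32(int32_t value, uint8_t out[6])
    {
        uint32_t u = (uint32_t)value;       /* two's-complement bit pattern */
        uint8_t be[4] = {
            (uint8_t)(u >> 24), (uint8_t)(u >> 16),
            (uint8_t)(u >> 8),  (uint8_t)u
        };
        /* Drop redundant leading 0x00/0xFF bytes while the sign bit of the
         * next byte still agrees, to keep the encoding minimal. */
        size_t start = 0;
        while (start < 3 &&
               ((be[start] == 0x00 && (be[start + 1] & 0x80) == 0) ||
                (be[start] == 0xFF && (be[start + 1] & 0x80) != 0)))
            start++;

        size_t content_len = 4 - start;
        out[0] = 0x02;                      /* T: universal tag INTEGER */
        out[1] = (uint8_t)content_len;      /* L: short-form length */
        for (size_t i = 0; i < content_len; i++)
            out[2 + i] = be[start + i];     /* V: big-endian content */
        return 2 + content_len;
    }

So INTEGER 1 comes out as 02 01 01, and INTEGER 300 as 02 02 01 2C.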
ASN.1 also gives you a way to do things like formalize typed holes.
Not looking at ASN.1, not even its history and evolution, when creating PB was a crime.
The people who wrote PB clearly knew ASN.1. It was the most famous IDL at the time. Do you assume they just came one morning and decided to write PB without taking a look at what existed?
Anyway, as stated PB does more than ASN.1. It specifies both the description format and the encoding. PB is ready to be used out of the box. You have a compact IDL and a performant encoding format without having to think about anything. You have to remember that PB was designed for internal Google use as a tool to solve their problems, not as a generic solution.
ASN.1 is extremely unwieldy in comparison. It has accumulated a lot of cruft over the years. Plus they don't provide a default implementation.
> Strange that at the same time (2001) people were busy implementing everything in Java and XML, not ASN.1
Yes. Meanwhile Google was designing an IDL with a default binary serialisation format. And this is not the 2025-typical big corp Google we are talking about, overstaffed and heavy with fake HR levels. That's Google in its heyday. I think you have answered your own comment.
> Do you assume they just came one morning and decided to write PB without taking a look at what existed?
Considering how bad an imitation of 1984 ASN.1 PB's IDL is, and how bad an imitation of 1984 DER PB is, yes, I assume that PB's creators did not in fact know ASN.1 well. They almost certainly knew of ASN.1, and they almost certainly did not know enough about it, because PB re-created all the worst mistakes in ASN.1 while adding zero new ideas or functionality. It's a terrible shame.
PB is not a bad imitation of 1984 ASN.1. ASN.1 is chock-full of useless representations clearly there to serve what a committee thought the needs of the telco industry would be.
I find it funny you are making it look like a good and pleasant-to-use IDL. It's a perfect example of design by committee at its worst.
PB is significantly more space efficient than DER by the way.
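To illustrate the space point, here's a rough sketch (mine, not Google's code) of the base-128 varint PB uses, which is a big part of why small integers stay small on the wire:

    #include <stdint.h>
    #include <stddef.h>

    /* Sketch of protobuf-style base-128 varint encoding: 7 payload bits per
     * byte, least-significant group first, MSB set on every byte except the
     * last. Returns the number of bytes written (1..10 for a uint64_t). */
    static size_t varint_encode(uint64_t v, uint8_t out[10])
    {
        size_t n = 0;
        while (v >= 0x80) {
            out[n++] = (uint8_t)(v & 0x7F) | 0x80;  /* continuation bit set */
            v >>= 7;
        }
        out[n++] = (uint8_t)v;                      /* final byte, MSB clear */
        return n;
    }

    /* Example: 300 encodes as 0xAC 0x02 (plus a one-byte field tag in a real
     * message), while DER's INTEGER 300 is 02 02 01 2C. */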
From an engineering perspective, most browsers are "pretty much just Chrome(ium)", but that's not what I'm talking about here. The delivery mechanism isn't really relevant from a product perspective. It is a different product with a different price and different features.
Also, my point was just to say that there's a market for something like this. Chrome Enterprise is not even really that competitive of a product in the space.
For the most part, default Chrome and Firefox are designed primarily for B2C use cases.
Which is what enterprises need. They don't need their own version of Chrome, they need the ability to make changes to it, like forcing a proxy, inserting MiTM certs, and various other enterprise stuff.
I wish the same applied to written numbers in LTR scripts. Arithmetic operations would be a lot easier to do that way on paper or even mentally. I also wish that the world would settle on a sane date-time format like ISO 8601 or RFC 3339 (both of which would be reversed if my first wish were also granted).
> It will be relegated to the computing dustbin like non-8-bit bytes and EBCDIC.
I never really understood those non-8-bit bytes, especially the 7-bit byte. If you consider the multiplexer and demux/decoder circuits that are used heavily in CPUs, FPGAs and custom digital circuits, the only number that really makes sense is 8: it's what you get for a 3-bit selector code, the other nearby values being 4 and 16. Why did they go for 7 bits instead of 8? I assume that it was a design choice made long before I was even born. Does anybody know the rationale?
> I also wish that the world would settle on a sane date-time format like the ISO 8601
IIRC, in most countries the native format is D-M-Y (with varying separators), but some Asian countries use Y-M-D. Since those formats are easy to distinguish, that's no problem. That's why Y-M-D is spreading in Europe for official or technical documents.
There's mainly one country which messes things up...
YYYY-MM-DD is also the official date format in Canada, though it's not officially enforced, so outside of government documents you end up seeing a bit of all three formats all over the place. I've always used ISO 8601 and no one bats an eye, and it's convenient since YYYY-DD-MM isn't really a thing, so it can't be confused for anything else, unlike the other two formats.
YMD has caught on, I think, because it allows for the numbers to be "in order" (not mixed-endian) while still having the month before the day which matches the practice for speaking dates in (at least) the US and Canada.
I used to think this was really important, but what's the use case here?
If I'm writing a document for human consumption then why would I expect the dates to be sortable by a naive string sorting algorithm?
On the other hand, if it's data for computer consumption then just skip the complicated serialisation completely and dump the Unix timestamp as a decimal. Any modern data format would include the ability to label that as a timestamp data type. If you really want to be able to "read" the data file then just include another column with a human-formatted timestamp, but I can't imagine why in 2025 I would be manually reading through a data file like some ancient mathematician using a printed table of logarithms.
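For what it's worth, emitting both columns is a one-liner each; a quick sketch (assuming POSIX gmtime_r):

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        time_t now = time(NULL);
        struct tm utc;
        gmtime_r(&now, &utc);            /* POSIX; Windows has gmtime_s */

        char human[32];
        strftime(human, sizeof human, "%Y-%m-%dT%H:%M:%SZ", &utc);

        /* e.g. "1735689600,2025-01-01T00:00:00Z" */
        printf("%lld,%s\n", (long long)now, human);
        return 0;
    }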
> If I'm writing a document for human consumption then why would I expect the dates to be sortable by a naive string sorting algorithm?
If you're naming a document for human consumption, having the files sorted by date easily without relying on modification date (which is changed by fixing a typo/etc...) is pretty neat
YYYY-MM-DD is ISO8601 extended format, YYYYMMDD is ISO8601 basic format (section 5.2.1.1 of ISO8601:2000(E)[1]). Both are fully according to spec, and neither format takes precedence over the other.
It does have a good name: RFC 3339. Unlike the ISO standard, that one mandates the "-" separators. Meanwhile it lets you substitute a space for the ugly "T" separator between date and time:
> NOTE: ISO 8601 defines date and time separated by "T". Applications using this syntax may choose, for the sake of readability, to specify a full-date and full-time separated by (say) a space character.
I live in that country, and I am constantly messing up date forms. My brain always goes yyyy-mm-dd. If I write it out, September 1st, 2025, I get it in the “right” order. But otherwise, especially if I’m tired, it’s always in a sortable format.
There are a lot of computations where 256 is too small of a range but 65536 is overkill. When designers of early computers were working out how many digits of precision their calculations needed to have for their intended purpose 12 bits commonly ended up being a sweet spot.
When your RAM is vacuum tubes or magnetic core memory, you don't want 25% of it to go unused just to round your word size up to a power of two.
> There are a lot of computations where 256 is too small of a range but 65536 is overkill
Wasn't this more to do with cost? They could do arbitrary-precision code even back then. It's not like they were only calculating numbers less than 65537 and ignoring anything larger.
I don't know that 7-bit bytes were ever used. Computer word sizes have historically been multiples of 6 or 8 bits, and while I can't say as to why particular values were chosen, I would hypothesize that multiples of 6 and 8 work well for representation in octal and hexadecimal respectively. For many of these early machines, sub-word addressability wasn't really a thing, so the question of 'byte' is somewhat academic.
For the representation of text of an alphabetic language, you need to hit 6 bits if your script doesn't have case and 7 bits if it does have case. ASCII ended up encoding English into 7 bits and EBCDIC chose 8 bits (as it's based on a binary-coded decimal scheme which packs a decimal digit into 4 bits). Early machines did choose to use the unused high bit of an ASCII character stored in 8 bits as a parity bit, but most machines have instead opted to extend the character repertoire in a variety of incompatible ways, which eventually led to Unicode.
On the DEC-10 the word size is 36 bits. There was (an option to include) a special set of instructions to enable any given byte size with bytes packed. Five 7-bit bytes per word, for example, with a wasted bit in each word.
I wouldn’t be surprised if other machines had something like this in hardware.
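As a rough modern-C illustration (not actual PDP-10 code), packing five 7-bit bytes into a 36-bit word with one bit left over looks like this:

    #include <stdint.h>

    /* Illustration only: pack five 7-bit bytes into the low 36 bits of a
     * 64-bit word, most significant byte first, leaving 1 of the 36 bits
     * unused (5 * 7 = 35). */
    static uint64_t pack36(const uint8_t bytes[5])
    {
        uint64_t word = 0;
        for (int i = 0; i < 5; i++)
            word = (word << 7) | (bytes[i] & 0x7F);
        return word & ((1ULL << 36) - 1);   /* stay within a 36-bit word */
    }

    static uint8_t unpack36(uint64_t word, int i)  /* i = 0..4, left to right */
    {
        return (uint8_t)((word >> (7 * (4 - i))) & 0x7F);
    }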
> For the representation of text of an alphabetic language, you need to hit 6 bits if your script doesn't have case
Only if you assume a 1:1 mapping. But e.g. the original Baudot code was 5-bit, with codes reserved to switch between letters and "everything else". When ASCII was designed, some people wanted to keep the same arrangement.
To get the little-endian ordering. The place values of digits increase from left to right - in the same direction as how we write literature (assuming LTR scripts), allowing us to do arithmetic operations (addition, multiplication, etc) in the same direction.
> The brilliance of 8601/3339 is that string sorting is also correct datetime sorting.
I hadn't thought about that. But it does reveal something interesting. In literature, we assign the highest significance to the left-most (first) letter - in the direction opposite to how we write. This needs a bit more contemplation.
I was asking about ASCII encoding and not the word size. But this information is also useful. So apparently, people were representing both numbers and script codes (EBCDIC in particular) in packed decimal or octal at times. The standardization on 8 bits and adoption of raw binary representation seems to have come later.
I believe that 10- and 12-bit bytes were also attested in the early days. As for "why": the tradeoffs are different when you're at the scale that any computer was at in the 70s (and 60s), and while I can't speak to the specific reasons for such a choice, I do know that nobody was worrying about scaling up to billions of memory locations, and also using particular bit combinations to signal "special" values was a lot more common in older systems, so I imagine both were at play.
In Britain the standard way to write a date has always been, e.g., "12th March 2023" or 12/3/2023 for short. I don't think there's a standard for where to put the time, though; I can imagine it both before and after.
Doing numbers little-endian does make more sense. It's weird that we switch to RTL when doing arithmetic. Amusingly the Wikipedia page for Hindu-Arabic numeral system claims that their RTL scripts switch to LTR for numbers. Nope... the inventors of our numeral system used little-endian and we forgot to reverse it for our LTR scripts...
Edit: I had to pull out Knuth here (vol. 2). So apparently the original Hindu scripts were LTR, like Latin, and Arabic is RTL. According to Knuth the earliest known Hindu manuscripts have the numbers "backwards", meaning most significant digit at the right, but soon switched to most significant at the left. So I read that as starting in little-endian but switching to big-endian.
These were later translated to Arabic (RTL), but the order of writing numbers remained the same, so became little-endian ("backwards").
Later still the numerals were introduced into Latin but, again, the order remained the same, so becoming big-endian again.
We in India use the same system for dates as you described, for obvious reasons. But I really don't like the pattern of switching directions multiple times when reading a date and time.
And as for numbers, perhaps it isn't too late to set it right once and for all. The French did that with the SI system after all.
> So apparently the original Hindu scripts were LTR
I can confirm. All Indian scripts are LTR (Though there are quite a few of them. I'm not aware of any exceptions). All of them seem to have evolved from an ancient and now extinct script named Brahmi. That one was LTR. It's unlikely to have switched direction any time during subsequent evolution into modern scripts.
> I also wish that the world would settle on a sane date-time format like the ISO 8601 or RFC 3339 (both of which would reverse if my first wish is also granted).
YYYY-MM-DD to me always feels like a timestamp, while when I want to write a date, I think of a name (for me, DD. MM. YYYY).
7 bits was chosen to reduce transmission costs, not storage costs, because you send 12.5% less data. Also, because computers usually worked on 8-bit bytes, the 8th bit could be used as a parity bit, where extra reliability was needed.
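As a sketch of the parity idea (my illustration, not any particular hardware's): compute even parity over the 7 data bits and put it in the spare eighth bit:

    #include <stdint.h>

    /* Sketch: set bit 7 of a 7-bit ASCII byte so the total number of 1 bits
     * is even ("even parity"). The receiver recomputes parity and flags a
     * transmission error if it comes out odd. */
    static uint8_t add_even_parity(uint8_t c7)
    {
        uint8_t c = c7 & 0x7F;
        uint8_t ones = 0;
        for (int i = 0; i < 7; i++)
            ones ^= (c >> i) & 1;          /* XOR of the data bits */
        return c | (uint8_t)(ones << 7);   /* parity bit in the spare MSB */
    }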
Big endian will stay around as long as IBM continues to put in the resources to provide first-class Linux support on s390x. Of course if you don’t expect your software to ever be run on s390x you can just assume little-endian, but that’s already been the case for the vast majority of software developers ever since Apple stopped supporting PowerPC.
Of the two, UTF-16 is much less of a problem; it's trivially[1] and losslessly convertible.
[1] Ok I admit, not trivially when it comes to unpaired surrogates, BOMs, endian detection, and probably a dozen other edge and corner cases I don't even know about. But you can offload the work to pretty well-understood and trouble-free library calls.
It causes issues such as opening files by filename not being cross-platform with standard libc functions, since on Windows the wide-string versions are required: UTF-16 characters can contain 0-bytes that aren't zero terminators.
Yes, it makes sense, but it still resulted in a lot of work.
Most Unix syscalls use C-style strings, which are a string of 8-bit bytes terminated with a zero byte. With many (most?) character encodings you can continue to present string data to syscalls in the same way, since they often also reserved a byte value of zero for the same purpose. Even some multi-byte encodings would work if they chose to avoid using 0-value bytes for this reason.
UTF-16LE/BE (and UTF-32 for that matter) chose not to allow for this, and the result is that if you want UTF-16 support in your existing C-string-based syscalls you need to make a second copy of every syscall which supports strings in your UTF-16 type of choice.
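For a concrete (and hedged) illustration of that duplication, here's roughly what it looks like with the Windows CRT, using a hypothetical file name:

    #include <stdio.h>
    #ifdef _WIN32
    #include <wchar.h>
    #endif

    /* Illustration of the duplicated API surface (hypothetical file name). */
    static FILE *open_data_file(void)
    {
    #ifdef _WIN32
        /* The wide ("W") variant Windows had to add: a UTF-16 path contains
         * 0x00 bytes (L"data.txt" is 64 00 61 00 74 00 ...), which would
         * truncate a NUL-terminated byte string. */
        return _wfopen(L"data.txt", L"rb");
    #else
        /* The narrow, NUL-terminated byte-string call used everywhere else. */
        return fopen("data.txt", "rb");
    #endif
    }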
> Most Unix syscalls use C-style strings, which are a string of 8-bit bytes terminated with a zero byte. With many (most?) character encodings you can continue to present string data to syscalls in the same way, since they often also reserved a byte value of zero for the same purpose
That's completely wrong. If a syscall (or a function) expects text in encoding A, you should not be sending it in encoding B because it would be interpreted incorrectly, or even worse, this would become a vulnerability.
For every function, the encoding must be specified, just as the types of arguments, constraints, and ownership rules are. Sadly many open source libraries do not do it. How are you supposed to call a function when you don't know the expected encoding?
Also, it is better to send a pointer and a length of the string rather than potentially infinitely search for a zero byte.
> and the result is that if you want UTF-16 support in your existing C-string-based syscalls
There is no need to support multiple encodings, it only makes things complicated. The simplest solution would be to use UTF-8 for all kernel facilities as a standard.
For example, it would be better if open() syscall required valid UTF-8 string for a file name. This would leave no possibility for displaying file names as question marks.
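For reference, checking validity is cheap; here's a sketch of a validator (my own, simplified, but it rejects overlongs, surrogates, and out-of-range values) that such an open() could run on the name:

    #include <stddef.h>
    #include <stdint.h>

    /* Sketch: returns 1 if buf[0..len) is well-formed UTF-8. */
    static int is_valid_utf8(const uint8_t *buf, size_t len)
    {
        size_t i = 0;
        while (i < len) {
            uint8_t b = buf[i];
            if (b < 0x80) { i += 1; continue; }                 /* ASCII */
            size_t n; uint32_t cp, min;
            if ((b & 0xE0) == 0xC0)      { n = 2; cp = b & 0x1F; min = 0x80; }
            else if ((b & 0xF0) == 0xE0) { n = 3; cp = b & 0x0F; min = 0x800; }
            else if ((b & 0xF8) == 0xF0) { n = 4; cp = b & 0x07; min = 0x10000; }
            else return 0;                                      /* bad lead byte */
            if (i + n > len) return 0;                          /* truncated */
            for (size_t k = 1; k < n; k++) {
                if ((buf[i + k] & 0xC0) != 0x80) return 0;      /* bad continuation */
                cp = (cp << 6) | (buf[i + k] & 0x3F);
            }
            if (cp < min) return 0;                             /* overlong */
            if (cp >= 0xD800 && cp <= 0xDFFF) return 0;         /* surrogate */
            if (cp > 0x10FFFF) return 0;                        /* out of range */
            i += n;
        }
        return 1;
    }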
Yes and my argument is that the OS should treat strings as a blob and not care about the encoding. How can it know what shiny new encoding the program uses? Encoding is a concern of the program, the OS should just leave it alone and not try to decode it.
The OS treats strings as a blob, yes, but typically specifies that they're a blob of nul-terminated data.
Unfortunately some text encodings (UTF-16 among them) use nuls for codepoints other than U+0000. In fact UTF-16 will use nuls for every character below U+0100, in other words all of ASCII and Latin-1. Therefore you can't just support _all_ text encodings for filenames on these OSes, unless the OS provides a second syscall for it (this is what Windows did since they wanted to use UTF-16LE across the board).
I've only mentioned syscalls in this, in truth it extends all through the C stdlib which everything ends up using in some way as well.
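A tiny example of why: even plain ASCII text has embedded zero bytes once it's UTF-16, so anything that walks the buffer as a C string stops after the first code unit.

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* "Hi" in UTF-16LE: 'H' = 48 00, 'i' = 69 00 (hex). */
        const char utf16le[] = { 0x48, 0x00, 0x69, 0x00, 0x00, 0x00 };

        /* A C-string API sees the 0x00 after 'H' as the terminator. */
        printf("%zu\n", strlen(utf16le));   /* prints 1, though "Hi" is 4 bytes */
        return 0;
    }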
You should not be passing file names in different encodings because other apps won't be able to display them properly. There should be one standard encoding for file names. It would also help with things like looking up a name ignoring case and extra spaces.
I mean, I agree there _should_ be one standard encoding, but the Unix API (to pick the example I'm closest to) predates these nuances. All it says is that filenames are a string [of bytes] and can't contain the bytes '/' or '\0'.
It is good for an implementation to enforce this at some level, sure. macOS has proved that features like case insensitivity and Unicode normalization can be integrated with Unix filename APIs.
A file name is not a blob, because it is entered by the user as a text string and displayed to the user as a text string, not as a bunch of hex digits. Also, it cannot contain some characters (like slash or null), so it's not a blob anyway.
And you should be using one specified encoding for file names if you want them to be displayed correctly in all applications. It would be inconvenient if different applications stored file names in different encodings.
For the same reason, encoding should be specified in libraries documentation for all functions accepting or returning strings.
It's not just Windows and JavaScript. On Apple platforms, NSString is UTF-16. On Linux, Qt uses UTF-16 strings. Looking at languages, we have Java (which is where JS got this bug from) and C# both enshrining it in their respective language specs.
So it's far more pervasive than people think, and will likely be in the picture for decades to come.
Qt could really change if they wanted to, and really should have by now; it's not like they keep long-term backwards compatibility anyway, unlike the others that you mentioned.
Of course they chose to integrate JavaScript so that's less likely now.
ICU (International Components for Unicode, the library published by the Unicode folks) itself uses UTF-16 internally, and most of these things are built on ICU. I agree strongly with your conclusion--UTF-16 isn't going anywhere. I don't think the ICU people are even talking about changing the internals to UTF-8.
UTF-16 arguably is Unicode 2.0+. It's how the code point address space is defined. Code points are either 1 or 2 16-bit code units. Easy. Compare w/ UTF-8 where a code point may be 1, 2, 3, or 4 8-bit code units.
UTF-16 is annoying, but it's far from the biggest design failure in Unicode.
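For the record, the "1 or 2 code units" rule is just this bit of arithmetic (a sketch; it assumes cp is a valid scalar value, i.e. not itself a surrogate):

    #include <stdint.h>
    #include <stddef.h>

    /* Sketch of UTF-16 encoding: BMP code points are one 16-bit unit,
     * everything above U+FFFF becomes a high/low surrogate pair.
     * Returns the number of code units written (1 or 2). */
    static size_t utf16_encode(uint32_t cp, uint16_t out[2])
    {
        if (cp < 0x10000) {
            out[0] = (uint16_t)cp;
            return 1;
        }
        cp -= 0x10000;                              /* 20 bits remain */
        out[0] = (uint16_t)(0xD800 + (cp >> 10));   /* high surrogate */
        out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* low surrogate */
        return 2;
    }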
We can argue about "biggest" all day long but UTF-16 is a huge design failure because it made a huge chunk of the lower Unicode space unusable, thereby making better encodings like UTF-8 that could easily represent those code points less efficient. This layer-violating hack should have made it clear that UTF-16 was a bad idea from the start.
Then there is also the issue that technically there is no such thing as plain UTF-16; instead you need to distinguish UTF-16LE and UTF-16BE. Even though approximately no one uses the latter, we still can't ignore it, and we have to prepend documents and strings with byte order marks (another wasted pair of code points for the sake of an encoding issue), which means you can't even trivially concatenate them anymore.
Meanwhile UTF-8 is backwards compatible with ASCII, byte order independent, has tons of useful properties and didn't require any Unicode code point assignments to achieve that.
The only reason we have UTF-16 is because early adopters of Unicode bet on UCS-2 and were too cheap to correct their mistake properly when it became clear that two bytes wasn't going to be enough. It's a dirty hack to cover up a mistake that should have never existed.
There are many OS interfaces that were deprecated after five years or even longer. It's been multiple times those five years since then, and we'll likely have to deal with UTF-16 for much longer still. Having to provide backwards compatibility for a UTF-16 interface doesn't mean they had to keep it as the default or provide new UTF-16 interfaces. In particular, Win32 already has 8-bit char interfaces that Microsoft could have easily added UTF-8 support to right then and re-blessed as the default. The decision not to do that was not a technical one but a political one.
This isn't "deprecate a few functions" -- it's basically an effort on par with migrating to Unicode in the first place.
I disagree you could just "easily" shove it into the "A" version of functions. Functions that accept UTF-8 could accept ASCII, but you can't just change the semantics of existing functions that emit text because it would blow up backwards compatibility. In a sense it is covariant but not contravariant.
And now, after you've gone through all of this effort: what was the actual payoff? And at what cost of maintaining compatibility with the other representations?
UTF-32 is arguably the worst of all worlds. You don't get fixed-size units in any meaningful way. Yes, you have fixed-size code points, but those aren't the "units" you care about; you still have variable-size grapheme clusters, so you still can't do things like reversing a string or splitting a string at an arbitrary index or anything else like that. Yet it consumes twice the space of UTF-16 for almost everything, and four times the space of UTF-8 for many things.
UTF-32 is the worst of all worlds. UTF-16 has the teeny tiny advantage that pure Chinese text takes a bit less space in UTF-16 than UTF-8 (typically irrelevant because that advantage is outweighed by the fact that the markup surrounding the text takes more space). UTF-8 is the best option for pretty much everything.
As a consequence, never use UTF-32, only use UTF-16 where necessary due to backwards compatibility, always use UTF-8 where possible.
In order to implement grapheme cluster segmentation, you have to start with a sequence of Unicode scalars. In practice, that means a sequence of 32-bit integers, which is UTF-32 in all but name. It's not a good interchange format, but it is a necessary intermediate/internal format.
There's also the problem that grapheme cluster boundaries change over time. Unicode has become a true mess.
Yeah, you need some kind of sequence of Unicode scalars. But there's no reason for that sequence to be "a contiguous chunk of memory filled with 32-bit ints" (aka a UTF-32 string); it can just as well be an iterator which operates on an in-memory UTF-8 string and produces code points.
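Something like this is what I mean; a sketch of yielding scalars straight from an in-memory UTF-8 buffer (it assumes well-formed input; a real iterator would validate or substitute U+FFFD):

    #include <stdint.h>
    #include <stddef.h>

    /* Sketch: decode the next scalar from a UTF-8 buffer and advance *i.
     * Assumes s[*i..] is well-formed UTF-8. */
    static uint32_t next_scalar(const uint8_t *s, size_t *i)
    {
        uint8_t b = s[*i];
        size_t n;
        uint32_t cp;
        if      (b < 0x80) { n = 1; cp = b; }          /* ASCII fast path */
        else if (b < 0xE0) { n = 2; cp = b & 0x1F; }
        else if (b < 0xF0) { n = 3; cp = b & 0x0F; }
        else               { n = 4; cp = b & 0x07; }
        for (size_t k = 1; k < n; k++)
            cp = (cp << 6) | (s[*i + k] & 0x3F);       /* fold continuations */
        *i += n;
        return cp;
    }

    /* Usage: size_t i = 0; while (i < len) { uint32_t cp = next_scalar(buf, &i); ... } */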
> It's how the code point address space is defined.
Not really. Unicode is still fundamentally based on codepoints, which go from 0 to 2^16 + 2^20 - 1, and all of the Unicode algorithms and properties operate on these codepoints. It's just that Unicode has left open a gap of codepoints so that the upper 2^20 codepoints can be encoded in UTF-16 without risk of confusion with other UCS-2 text.
You forgot `- 2^11` for the surrogate pairs. Gee, why isn't Unicode 2^21 code points? To understand the Unicode code point space you must understand UTF-16. The code space is defined by how UTF-16 works. That was my initial point.
If you're going to count the surrogate pairs as not-a-Unicode-codepoint, you should also count the other noncharacters: the last two codepoints on each of the 17 planes and the range U+FDD0-U+FDEF.
The expansion of Unicode beyond the BMP was designed to facilitate an upgrade compatibility path from UCS-2 systems, but it is extremely incorrect to somehow equate Unicode with UTF-16.
I have some places in some software where I assume little endian for simplicity, and I just leave in a static_assert(std::endian::native == std::endian::little) to let future me (or future someone else) know that a particular piece of code must be modified if it is ever to run on a not-little-endian machine.
In an ideal world you could just write endian-independent code (i.e. read byte by byte) and leave the compiler optimizer to sort it out. This has the benefit of also not tripping up any alignment restrictions.
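Something like this, say (a sketch; optimizers generally turn these shifts into plain loads and stores on little-endian targets):

    #include <stdint.h>

    /* Read a little-endian uint16_t from a byte buffer regardless of host
     * endianness or alignment. */
    static uint16_t load_le16(const uint8_t *p)
    {
        return (uint16_t)(p[0] | (p[1] << 8));
    }

    /* And the matching store, for writing the on-disk format back out. */
    static void store_le16(uint8_t *p, uint16_t v)
    {
        p[0] = (uint8_t)(v & 0xFF);
        p[1] = (uint8_t)(v >> 8);
    }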
I have a relatively large array of uint16_t with highly repetitive (low entropy) data. I want to serialize that to disk without wasting a lot of space, so I run compress2 from zlib on the data when serializing it, and decompress it when deserializing. However, these files need to make sense across machines, so I have defined the file format to use compressed little-endian 16-bit unsigned ints. Therefore, if you ever want to run this code on a big-endian machine, you need to add some code to flip the bytes around before compressing, then flip them back after decompressing.
You're right that when your code is iterating through data byte for byte, you can write it in an endian-agnostic way and let the optimizer take care of recognizing that your shifts and ORs can be replaced with a memcpy on little-endian systems. But it's not always that simple.