Smart kids can be a distraction as well. It certainly would have benefitted me to enter G&T in kindergarten instead of 3rd grade. Much of my first grade was spent separate from the other kids, doing 5th grade workbooks.
While I think this is Rust's biggest flaw, it doesn't stem from any particular hatred of C/C++. It's related to memory safety: it is very difficult to reason about the memory lifetimes of object graphs with cycles.
I've started to use agents on some very low-level code, and have had middling results. For pure algorithmic stuff, it works great. But I asked it to write me some arm64 assembly and it failed miserably: it couldn't keep track of which registers were which.
I was there for three years and you're totally right that everything's been automated, but there are also a large number of product-level decisions that just don't make sense. They make financial sense, sure, but that means the engineer has drunk the MBA Kool-Aid (or not enough of it), things get killed off, and they are no longer to be trusted around things that need proper love and care put into them. Promo packets, though, sure.
It's hard to read that as a human, though, and not want to build a system that lets people update bad map data? Which there used to be, but then yeah.
So yeah, the inmates (engineers) used to run the asylum (Google), but then a group of fucking psychopaths (DoubleClick) got added to the asylum and got given meth (ad money), and shit's fucking unhinged.
Unfortunately, c-ares is not problem-free on all platforms.
On iOS, its use triggers a local network access popup (it tries to reach your DNS server, which is often on your LAN). If a user denies access, your app will simply not work.
On Android, it's not compatible with some VPN apps. Those apps are to blame, but your users are going to blame you, not them.
So, at my previous company we ended up building libcurl with a threaded DNS resolver on both iOS and Android.
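If it helps anyone, you can check at runtime which resolver backend a given libcurl build ended up with via curl_version_info(). As I understand it, both c-ares and the threaded resolver report the async-DNS feature bit, but only c-ares builds fill in the ares version field, so treat that heuristic as my assumption rather than gospel:

    #include <stdio.h>
    #include <curl/curl.h>

    /* Sketch: report whether this libcurl build resolves DNS asynchronously,
     * and whether it does so via c-ares or the threaded resolver. */
    int main(void)
    {
        curl_version_info_data *info = curl_version_info(CURLVERSION_NOW);

        if (info->features & CURL_VERSION_ASYNCHDNS) {
            if (info->ares)   /* non-NULL only when built against c-ares */
                printf("async DNS via c-ares %s\n", info->ares);
            else
                printf("async DNS via the threaded resolver\n");
        } else {
            printf("blocking (synchronous) resolver\n");
        }
        return 0;
    }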
This. Plus ASN.1 is pluggable as to encoding rules and has a large family of them:
- BER/DER/CER (TLV; see the sketch below)
- OER and PER ("packed" -- no tags and
no lengths wherever
possible)
- XER (XML!)
- JER (JSON!)
- GSER (textual representation)
- you can add your own!
(One could add one based on XDR, which would look a lot like OER/PER in a way.)
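To make the TLV point concrete, here's a rough sketch (my own illustration, not from any particular ASN.1 toolkit) of DER-encoding an INTEGER as tag, then length, then minimal big-endian content:

    #include <stdint.h>
    #include <stddef.h>

    /* Rough sketch of DER TLV for an INTEGER: tag 0x02, a short-form length
     * byte, then the minimal big-endian two's-complement content.
     * Returns the number of bytes written. */
    static size_t der_encode_int32(int32_t value, uint8_t out[6])
    {
        uint32_t u = (uint32_t)value;       /* two's-complement bit pattern */
        uint8_t be[4] = {
            (uint8_t)(u >> 24), (uint8_t)(u >> 16),
            (uint8_t)(u >> 8),  (uint8_t)u
        };
        /* Drop redundant leading 0x00/0xFF bytes while the sign bit of the
         * next byte still agrees, to keep the encoding minimal. */
        size_t start = 0;
        while (start < 3 &&
               ((be[start] == 0x00 && (be[start + 1] & 0x80) == 0) ||
                (be[start] == 0xFF && (be[start + 1] & 0x80) != 0)))
            start++;

        size_t content_len = 4 - start;
        out[0] = 0x02;                      /* T: universal tag INTEGER */
        out[1] = (uint8_t)content_len;      /* L: short-form length */
        for (size_t i = 0; i < content_len; i++)
            out[2 + i] = be[start + i];     /* V: big-endian content */
        return 2 + content_len;
    }

So INTEGER 1 comes out as 02 01 01, and INTEGER 300 as 02 02 01 2C.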
ASN.1 also gives you a way to do things like formalize typed holes.
Not looking at ASN.1, not even its history and evolution, when creating PB was a crime.
The people who wrote PB clearly knew ASN.1. It was the most famous IDL at the time. Do you assume they just came one morning and decided to write PB without taking a look at what existed?
Anyway, as stated PB does more than ASN.1. It specifies both the description format and the encoding. PB is ready to be used out of the box. You have a compact IDL and a performant encoding format without having to think about anything. You have to remember that PB was designed for internal Google use as a tool to solve their problems, not as a generic solution.
ASN.1 is extremely unwieldy in comparison. It has accumulated a lot of cruft over the years. Plus they don't provide a default implementation.
> Strange that at the same time (2001) people were busy implementing everything in Java and XML, not ASN.1
Yes. Meanwhile Google was designing an IDL with a default binary serialisation format. And this is not the 2025-typical big corp Google we are talking about, overstaffed and heavy with fake HR levels. That's Google in its heyday. I think you have answered your own comment.
> Do you assume they just came one morning and decided to write PB without taking a look at what existed?
Considering how bad an imitation of 1984 ASN.1 PB's IDL is, and how bad an imitation of 1984 DER PB is, yes, I assume that PB's creators did not in fact know ASN.1 well. They almost certainly knew of ASN.1, and they almost certainly did not know enough about it, because PB re-created all the worst mistakes in ASN.1 while adding zero new ideas or functionality. It's a terrible shame.
PB is not a bad imitation of 1984 ASN.1. ASN.1 is chock-full of useless representations clearly there to serve what a committee thought the needs of the telco industry would be.
I find it funny you are making it look like a good and pleasant-to-use IDL. It's a perfect example of design by committee at its worst.
PB is significantly more space efficient than DER by the way.
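To illustrate the space point, here's a rough sketch (mine, not Google's code) of the base-128 varint PB uses, which is a big part of why small integers stay small on the wire:

    #include <stdint.h>
    #include <stddef.h>

    /* Sketch of protobuf-style base-128 varint encoding: 7 payload bits per
     * byte, least-significant group first, MSB set on every byte except the
     * last. Returns the number of bytes written (1..10 for a uint64_t). */
    static size_t varint_encode(uint64_t v, uint8_t out[10])
    {
        size_t n = 0;
        while (v >= 0x80) {
            out[n++] = (uint8_t)(v & 0x7F) | 0x80;  /* continuation bit set */
            v >>= 7;
        }
        out[n++] = (uint8_t)v;                      /* final byte, MSB clear */
        return n;
    }

    /* Example: 300 encodes as 0xAC 0x02 (plus a one-byte field tag in a real
     * message), while DER's INTEGER 300 is 02 02 01 2C. */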
From an engineering perspective, most browsers are "pretty much just Chrome(ium)", but that's not what I'm talking about here. The delivery mechanism isn't really relevant from a product perspective. It is a different product with a different price and different features.
Also, my point was just to say that there's a market for something like this. Chrome Enterprise is not even really that competitive of a product in the space.
For the most part, default Chrome and Firefox are designed primarily for B2C use cases.
Which is what enterprises need. They don't need their own version of Chrome, they need the ability to make changes to it, like forcing a proxy, inserting MiTM certs, and various other enterprise stuff.
I wish the same applied to written numbers in LTR scripts. Arithmetic operations would be a lot easier to do that way on paper or even mentally. I also wish that the world would settle on a sane date-time format like ISO 8601 or RFC 3339 (both of which would be reversed if my first wish were also granted).
> It will be relegated to the computing dustbin like non-8-bit bytes and EBCDIC.
I never really understood those non-8-bit bytes, especially the 7-bit byte. If you consider the multiplexer and demux/decoder circuits that are used heavily in CPUs, FPGAs and custom digital circuits, the only number that really makes sense is 8: it's what you get for a 3-bit selector code, the other nearby values being 4 and 16. Why did they go for 7 bits instead of 8? I assume that it was a design choice made long before I was even born. Does anybody know the rationale?
> I also wish that the world would settle on a sane date-time format like the ISO 8601
IIRC, in most countries the native format is D-M-Y (with varying separators), but some Asian countries use Y-M-D. Since those formats are easy to distinguish, that's no problem. That's why Y-M-D is spreading in Europe for official or technical documents.
There's mainly one country which messes things up...
YYYY-MM-DD is also the official date format in Canada, though it's not officially enforced, so outside of government documents you end up seeing a bit of all three formats all over the place. I've always used ISO 8601 and no one bats an eye, and it's convenient since YYYY-DD-MM isn't really a thing, so it can't be confused for anything else, unlike the other two formats.
YMD has caught on, I think, because it allows for the numbers to be "in order" (not mixed-endian) while still having the month before the day which matches the practice for speaking dates in (at least) the US and Canada.
I used to think this was really important, but what's the use case here?
If I'm writing a document for human consumption then why would I expect the dates to be sortable by a naive string sorting algorithm?
On the other hand, if it's data for computer consumption then just skip the complicated serialisation completely and dump the Unix timestamp as a decimal. Any modern data format would include the ability to label that as a timestamp data type. If you really want to be able to "read" the data file then just include another column with a human-formatted timestamp, but I can't imagine why in 2025 I would be manually reading through a data file like some ancient mathematician using a printed table of logarithms.
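For what it's worth, emitting both columns is a one-liner each; a quick sketch (assuming POSIX gmtime_r):

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        time_t now = time(NULL);
        struct tm utc;
        gmtime_r(&now, &utc);            /* POSIX; Windows has gmtime_s */

        char human[32];
        strftime(human, sizeof human, "%Y-%m-%dT%H:%M:%SZ", &utc);

        /* e.g. "1735689600,2025-01-01T00:00:00Z" */
        printf("%lld,%s\n", (long long)now, human);
        return 0;
    }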
> If I'm writing a document for human consumption then why would I expect the dates to be sortable by a naive string sorting algorithm?
If you're naming a document for human consumption, having the files sorted by date easily without relying on modification date (which is changed by fixing a typo/etc...) is pretty neat
YYYY-MM-DD is ISO8601 extended format, YYYYMMDD is ISO8601 basic format (section 5.2.1.1 of ISO8601:2000(E)[1]). Both are fully according to spec, and neither format takes precedence over the other.
It does have a good name: RFC 3339. Unlike the ISO standard, that one mandates the "-" separators. Meanwhile it lets you substitute a space for the ugly "T" separator between date and time:
> NOTE: ISO 8601 defines date and time separated by "T". Applications using this syntax may choose, for the sake of readability, to specify a full-date and full-time separated by (say) a space character.
I live in that country, and I am constantly messing up date forms. My brain always goes yyyy-mm-dd. If I write it out, September 1st, 2025, I get it in the “right” order. But otherwise, especially if I’m tired, it’s always in a sortable format.
There are a lot of computations where 256 is too small of a range but 65536 is overkill. When designers of early computers were working out how many digits of precision their calculations needed to have for their intended purpose 12 bits commonly ended up being a sweet spot.
When your RAM is vacuum tubes or magnetic core memory, you don't want 25% of it to go unused just to round your word size up to a power of two.
> There are a lot of computations where 256 is too small of a range but 65536 is overkill
Wasn't this more to do with cost? They could do arbitrary-precision code even back then. It's not like they were only calculating numbers less than 65537 and ignoring anything larger.
I don't know that 7-bit bytes were ever used. Computer word sizes have historically been multiples of 6 or 8 bits, and while I can't say as to why particular values were chosen, I would hypothesize that multiples of 6 and 8 work well for representation in octal and hexadecimal respectively. For many of these early machines, sub-word addressability wasn't really a thing, so the question of 'byte' is somewhat academic.
For the representation of text of an alphabetic language, you need to hit 6 bits if your script doesn't have case and 7 bits if it does have case. ASCII ended up encoding English into 7 bits and EBCDIC chose 8 bits (as it's based on a binary-coded decimal scheme which packs a decimal digit into 4 bits). Early machines did choose to use the unused high bit of an ASCII character stored in 8 bits as a parity bit, but most machines have instead opted to extend the character repertoire in a variety of incompatible ways, which eventually led to Unicode.
On the DEC-10 the word size is 36 bits. There was (an option to include) a special set of instructions to enable any given byte size with bytes packed. Five 7-bit bytes per word, for example, with a wasted bit in each word.
I wouldn’t be surprised if other machines had something like this in hardware.
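As a rough modern-C illustration (not actual PDP-10 code), packing five 7-bit bytes into a 36-bit word with one bit left over looks like this:

    #include <stdint.h>

    /* Illustration only: pack five 7-bit bytes into the low 36 bits of a
     * 64-bit word, most significant byte first, leaving 1 of the 36 bits
     * unused (5 * 7 = 35). */
    static uint64_t pack36(const uint8_t bytes[5])
    {
        uint64_t word = 0;
        for (int i = 0; i < 5; i++)
            word = (word << 7) | (bytes[i] & 0x7F);
        return word & ((1ULL << 36) - 1);   /* stay within a 36-bit word */
    }

    static uint8_t unpack36(uint64_t word, int i)  /* i = 0..4, left to right */
    {
        return (uint8_t)((word >> (7 * (4 - i))) & 0x7F);
    }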
> For the representation of text of an alphabetic language, you need to hit 6 bits if your script doesn't have case
Only if you assume a 1:1 mapping. But e.g. the original Baudot code was 5-bit, with codes reserved to switch between letters and "everything else". When ASCII was designed, some people wanted to keep the same arrangement.
To get the little-endian ordering. The place values of digits increase from left to right - in the same direction as how we write literature (assuming LTR scripts), allowing us to do arithmetic operations (addition, multiplication, etc) in the same direction.
> The brilliance of 8601/3339 is that string sorting is also correct datetime sorting.
I hadn't thought about that. But it does reveal something interesting. In literature, we assign the highest significance to the left-most (first) letter - in the direction opposite to how we write. This needs a bit more contemplation.
I was asking about ASCII encoding and not the word size. But this information is also useful. So apparently, people were representing both numbers and script codes (EBCDIC in particular) in packed decimal or octal at times. The standardization on 8 bits and adoption of raw binary representation seems to have come later.
I believe that 10- and 12-bit bytes were also attested in the early days. As for "why": the tradeoffs are different when you're at the scale that any computer was at in the 70s (and 60s), and while I can't speak to the specific reasons for such a choice, I do know that nobody was worrying about scaling up to billions of memory locations, and also using particular bit combinations to signal "special" values was a lot more common in older systems, so I imagine both were at play.
In Britain the standard way to write a date has always been, e.g., "12th March 2023" or 12/3/2023 for short. I don't think there's a standard for where to put the time, though; I can imagine it both before and after.
Doing numbers little-endian does make more sense. It's weird that we switch to RTL when doing arithmetic. Amusingly the Wikipedia page for Hindu-Arabic numeral system claims that their RTL scripts switch to LTR for numbers. Nope... the inventors of our numeral system used little-endian and we forgot to reverse it for our LTR scripts...
Edit: I had to pull out Knuth here (vol. 2). So apparently the original Hindu scripts were LTR, like Latin, and Arabic is RTL. According to Knuth the earliest known Hindu manuscripts have the numbers "backwards", meaning most significant digit at the right, but soon switched to most significant at the left. So I read that as starting in little-endian but switching to big-endian.
These were later translated to Arabic (RTL), but the order of writing numbers remained the same, so became little-endian ("backwards").
Later still the numerals were introduced into Latin but, again, the order remained the same, so becoming big-endian again.
We in India use the same system for dates as you described, for obvious reasons. But I really don't like the pattern of switching directions multiple times when reading a date and time.
And as for numbers, perhaps it isn't too late to set it right once and for all. The French did that with the SI system after all.
> So apparently the original Hindu scripts were LTR
I can confirm. All Indian scripts are LTR (Though there are quite a few of them. I'm not aware of any exceptions). All of them seem to have evolved from an ancient and now extinct script named Brahmi. That one was LTR. It's unlikely to have switched direction any time during subsequent evolution into modern scripts.
> I also wish that the world would settle on a sane date-time format like the ISO 8601 or RFC 3339 (both of which would reverse if my first wish is also granted).
YYYY-MM-DD to me always feels like a timestamp, while when I want to write a date, I think of a name (for me, DD. MM. YYYY).
7 bits was chosen to reduce transmission costs, not storage costs, because you send 12.5% less data. Also, because computers usually worked on 8-bit bytes, the 8th bit could be used as a parity bit, where extra reliability was needed.
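As a sketch of the parity idea (my illustration, not any particular hardware's): compute even parity over the 7 data bits and put it in the spare eighth bit:

    #include <stdint.h>

    /* Sketch: set bit 7 of a 7-bit ASCII byte so the total number of 1 bits
     * is even ("even parity"). The receiver recomputes parity and flags a
     * transmission error if it comes out odd. */
    static uint8_t add_even_parity(uint8_t c7)
    {
        uint8_t c = c7 & 0x7F;
        uint8_t ones = 0;
        for (int i = 0; i < 7; i++)
            ones ^= (c >> i) & 1;          /* XOR of the data bits */
        return c | (uint8_t)(ones << 7);   /* parity bit in the spare MSB */
    }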
Big endian will stay around as long as IBM continues to put in the resources to provide first-class Linux support on s390x. Of course if you don’t expect your software to ever be run on s390x you can just assume little-endian, but that’s already been the case for the vast majority of software developers ever since Apple stopped supporting PowerPC.
Of the two, UTF-16 is much less of a problem; it's trivially[1] and losslessly convertible.
[1] Ok I admit, not trivially when it comes to unpaired surrogates, BOMs, endian detection, and probably a dozen other edge and corner cases I don't even know about. But you can offload the work to pretty well-understood and trouble-free library calls.
It causes issues such as opening files by filename not being cross-platform with standard libc functions, since on Windows the wide-string versions are required: UTF-16 characters can contain 0-bytes that aren't zero terminators.
Yes, it makes sense, but it still resulted in a lot of work.
Most Unix syscalls use C-style strings, which are a string of 8-bit bytes terminated with a zero byte. With many (most?) character encodings you can continue to present string data to syscalls in the same way, since they often also reserved a byte value of zero for the same purpose. Even some multi-byte encodings would work if they chose to avoid using 0-value bytes for this reason.
UTF-16LE/BE (and UTF-32 for that matter) chose not to allow for this, and the result is that if you want UTF-16 support in your existing C-string-based syscalls you need to make a second copy of every syscall which supports strings in your UTF-16 type of choice.
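For a concrete (and hedged) illustration of that duplication, here's roughly what it looks like with the Windows CRT, using a hypothetical file name:

    #include <stdio.h>
    #ifdef _WIN32
    #include <wchar.h>
    #endif

    /* Illustration of the duplicated API surface (hypothetical file name). */
    static FILE *open_data_file(void)
    {
    #ifdef _WIN32
        /* The wide ("W") variant Windows had to add: a UTF-16 path contains
         * 0x00 bytes (L"data.txt" is 64 00 61 00 74 00 ...), which would
         * truncate a NUL-terminated byte string. */
        return _wfopen(L"data.txt", L"rb");
    #else
        /* The narrow, NUL-terminated byte-string call used everywhere else. */
        return fopen("data.txt", "rb");
    #endif
    }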
> Most Unix syscalls use C-style strings, which are a string of 8-bit bytes terminated with a zero byte. With many (most?) character encodings you can continue to present string data to syscalls in the same way, since they often also reserved a byte value of zero for the same purpose
That's completely wrong. If a syscall (or a function) expects text in encoding A, you should not be sending it in encoding B because it would be interpreted incorrectly, or even worse, this would become a vulnerability.
For every function, the encoding must be specified, just as the types of arguments, constraints, and ownership rules are. Sadly many open source libraries do not do it. How are you supposed to call a function when you don't know the expected encoding?
Also, it is better to send a pointer and a length of the string rather than potentially infinitely search for a zero byte.
> and the result is that if you want UTF-16 support in your existing C-string-based syscalls
There is no need to support multiple encodings, it only makes things complicated. The simplest solution would be to use UTF-8 for all kernel facilities as a standard.
For example, it would be better if open() syscall required valid UTF-8 string for a file name. This would leave no possibility for displaying file names as question marks.
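For reference, checking validity is cheap; here's a sketch of a validator (my own, simplified, but it rejects overlongs, surrogates, and out-of-range values) that such an open() could run on the name:

    #include <stddef.h>
    #include <stdint.h>

    /* Sketch: returns 1 if buf[0..len) is well-formed UTF-8. */
    static int is_valid_utf8(const uint8_t *buf, size_t len)
    {
        size_t i = 0;
        while (i < len) {
            uint8_t b = buf[i];
            if (b < 0x80) { i += 1; continue; }                 /* ASCII */
            size_t n; uint32_t cp, min;
            if ((b & 0xE0) == 0xC0)      { n = 2; cp = b & 0x1F; min = 0x80; }
            else if ((b & 0xF0) == 0xE0) { n = 3; cp = b & 0x0F; min = 0x800; }
            else if ((b & 0xF8) == 0xF0) { n = 4; cp = b & 0x07; min = 0x10000; }
            else return 0;                                      /* bad lead byte */
            if (i + n > len) return 0;                          /* truncated */
            for (size_t k = 1; k < n; k++) {
                if ((buf[i + k] & 0xC0) != 0x80) return 0;      /* bad continuation */
                cp = (cp << 6) | (buf[i + k] & 0x3F);
            }
            if (cp < min) return 0;                             /* overlong */
            if (cp >= 0xD800 && cp <= 0xDFFF) return 0;         /* surrogate */
            if (cp > 0x10FFFF) return 0;                        /* out of range */
            i += n;
        }
        return 1;
    }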
Yes and my argument is that the OS should treat strings as a blob and not care about the encoding. How can it know what shiny new encoding the program uses? Encoding is a concern of the program, the OS should just leave it alone and not try to decode it.
The OS treats strings as a blob, yes, but typically specifies that they're a blob of nul-terminated data.
Unfortunately some text encodings (UTF-16 among them) use nuls for codepoints other than U+0000. In fact UTF-16 will use nuls for every character below U+0100, in other words all of ASCII and Latin-1. Therefore you can't just support _all_ text encodings for filenames on these OSes, unless the OS provides a second syscall for it (this is what Windows did since they wanted to use UTF-16LE across the board).
I've only mentioned syscalls in this, in truth it extends all through the C stdlib which everything ends up using in some way as well.
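A tiny example of why: even plain ASCII text has embedded zero bytes once it's UTF-16, so anything that walks the buffer as a C string stops after the first code unit.

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* "Hi" in UTF-16LE: 'H' = 48 00, 'i' = 69 00 (hex). */
        const char utf16le[] = { 0x48, 0x00, 0x69, 0x00, 0x00, 0x00 };

        /* A C-string API sees the 0x00 after 'H' as the terminator. */
        printf("%zu\n", strlen(utf16le));   /* prints 1, though "Hi" is 4 bytes */
        return 0;
    }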
You should not be passing file names in different encodings because other apps won't be able to display them properly. There should be one standard encoding for file names. It would also help with things like looking up a name ignoring case and extra spaces.
I mean, I agree there _should_ be one standard encoding, but the Unix API (to pick the example I'm closest to) predates these nuances. All it says is that filenames are a string [of bytes] and can't contain the bytes '/' or '\0'.
It is good for an implementation to enforce this at some level, sure. macOS has proved that features like case insensitivity and Unicode normalization can be integrated with Unix filename APIs.
A file name is not a blob, because it is entered by the user as a text string and displayed to the user as a text string, not as a bunch of hex digits. Also, it cannot contain some characters (like slash or null), so it's not a blob anyway.
And you should be using one specified encoding for file names if you want them to be displayed correctly in all applications. It would be inconvenient if different applications stored file names in different encodings.
For the same reason, encoding should be specified in libraries documentation for all functions accepting or returning strings.
It's not just Windows and JavaScript. On Apple platforms, NSString is UTF-16. On Linux, Qt uses UTF-16 strings. Looking at languages, we have Java (which is where JS got this bug from) and C# both enshrining it in their respective language specs.
So it's far more pervasive than people think, and will likely be in the picture for decades to come.
Qt could really change if they wanted to, and really should have by now; it's not like they keep long-term backwards compatibility anyway, unlike the others that you mentioned.
Of course they chose to integrate JavaScript so that's less likely now.
ICU (International Components for Unicode, the library published by the Unicode folks) itself uses UTF-16 internally, and most of these things are built on ICU. I agree strongly with your conclusion--UTF-16 isn't going anywhere. I don't think the ICU people are even talking about changing the internals to UTF-8.
UTF-16 arguably is Unicode 2.0+. It's how the code point address space is defined. Code points are either 1 or 2 16-bit code units. Easy. Compare w/ UTF-8 where a code point may be 1, 2, 3, or 4 8-bit code units.
UTF-16 is annoying, but it's far from the biggest design failure in Unicode.
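For the record, the "1 or 2 code units" rule is just this bit of arithmetic (a sketch; it assumes cp is a valid scalar value, i.e. not itself a surrogate):

    #include <stdint.h>
    #include <stddef.h>

    /* Sketch of UTF-16 encoding: BMP code points are one 16-bit unit,
     * everything above U+FFFF becomes a high/low surrogate pair.
     * Returns the number of code units written (1 or 2). */
    static size_t utf16_encode(uint32_t cp, uint16_t out[2])
    {
        if (cp < 0x10000) {
            out[0] = (uint16_t)cp;
            return 1;
        }
        cp -= 0x10000;                              /* 20 bits remain */
        out[0] = (uint16_t)(0xD800 + (cp >> 10));   /* high surrogate */
        out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* low surrogate */
        return 2;
    }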
We can argue about "biggest" all day long but UTF-16 is a huge design failure because it made a huge chunk of the lower Unicode space unusable, thereby making better encodings like UTF-8 that could easily represent those code points less efficient. This layer-violating hack should have made it clear that UTF-16 was a bad idea from the start.
Then there is also the issue that technically there is no such thing as plain UTF-16; instead you need to distinguish UTF-16LE and UTF-16BE. Even though approximately no one uses the latter, we still can't ignore it, and we have to prepend documents and strings with byte order marks (another wasted pair of code points for the sake of an encoding issue), which means you can't even trivially concatenate them anymore.
Meanwhile UTF-8 is backwards compatible with ASCII, byte order independent, has tons of useful properties and didn't require any Unicode code point assignments to achieve that.
The only reason we have UTF-16 is because early adopters of Unicode bet on UCS-2 and were too cheap to correct their mistake properly when it became clear that two bytes wasn't going to be enough. It's a dirty hack to cover up a mistake that should have never existed.
There are many OS interfaces that were deprecated after five years or even longer. It's been multiple times those five years since then, and we'll likely have to deal with UTF-16 for much longer still. Having to provide backwards compatibility for a UTF-16 interface doesn't mean they had to keep it as the default or provide new UTF-16 interfaces. In particular, Win32 already has 8-bit char interfaces that Microsoft could have easily added UTF-8 support to right then and re-blessed as the default. The decision not to do that was not a technical one but a political one.
This isn't "deprecate a few functions" -- it's basically an effort on par with migrating to Unicode in the first place.
I disagree you could just "easily" shove it into the "A" version of functions. Functions that accept UTF-8 could accept ASCII, but you can't just change the semantics of existing functions that emit text because it would blow up backwards compatibility. In a sense it is covariant but not contravariant.
And now, after you've gone through all of this effort: what was the actual payoff? And at what cost of maintaining compatibility with the other representations?
UTF-32 is arguably the worst of all worlds. You don't get fixed-size units in any meaningful way. Yes, you have fixed-size code points, but those aren't the "units" you care about; you still have variable-size grapheme clusters, so you still can't do things like reversing a string or splitting a string at an arbitrary index or anything else like that. Yet it consumes twice the space of UTF-16 for almost everything, and four times the space of UTF-8 for many things.
UTF-32 is the worst of all worlds. UTF-16 has the teeny tiny advantage that pure Chinese text takes a bit less space in UTF-16 than UTF-8 (typically irrelevant because that advantage is outweighed by the fact that the markup surrounding the text takes more space). UTF-8 is the best option for pretty much everything.
As a consequence, never use UTF-32, only use UTF-16 where necessary due to backwards compatibility, always use UTF-8 where possible.
In order to implement grapheme cluster segmentation, you have to start with a sequence of Unicode scalars. In practice, that means a sequence of 32-bit integers, which is UTF-32 in all but name. It's not a good interchange format, but it is a necessary intermediate/internal format.
There's also the problem that grapheme cluster boundaries change over time. Unicode has become a true mess.
Yeah, you need some kind of sequence of Unicode scalars. But there's no reason for that sequence to be "a contiguous chunk of memory filled with 32-bit ints" (aka a UTF-32 string); it can just as well be an iterator which operates on an in-memory UTF-8 string and produces code points.
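Something like this is what I mean; a sketch of yielding scalars straight from an in-memory UTF-8 buffer (it assumes well-formed input; a real iterator would validate or substitute U+FFFD):

    #include <stdint.h>
    #include <stddef.h>

    /* Sketch: decode the next scalar from a UTF-8 buffer and advance *i.
     * Assumes s[*i..] is well-formed UTF-8. */
    static uint32_t next_scalar(const uint8_t *s, size_t *i)
    {
        uint8_t b = s[*i];
        size_t n;
        uint32_t cp;
        if      (b < 0x80) { n = 1; cp = b; }          /* ASCII fast path */
        else if (b < 0xE0) { n = 2; cp = b & 0x1F; }
        else if (b < 0xF0) { n = 3; cp = b & 0x0F; }
        else               { n = 4; cp = b & 0x07; }
        for (size_t k = 1; k < n; k++)
            cp = (cp << 6) | (s[*i + k] & 0x3F);       /* fold continuations */
        *i += n;
        return cp;
    }

    /* Usage: size_t i = 0; while (i < len) { uint32_t cp = next_scalar(buf, &i); ... } */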
> It's how the code point address space is defined.
Not really. Unicode is still fundamentally based on codepoints, which go from 0 to 2^16 + 2^20 - 1, and all of the Unicode algorithms and properties operate on these codepoints. It's just that Unicode has left open a gap of codepoints so that the upper 2^20 codepoints can be encoded in UTF-16 without risk of confusion with other UCS-2 text.
You forgot `- 2^11` for the surrogate pairs. Gee, why isn't Unicode 2^21 code points? To understand the Unicode code point space you must understand UTF-16. The code space is defined by how UTF-16 works. That was my initial point.
If you're going to count the surrogate pairs as not-a-Unicode-codepoint, you should also count the other noncharacters: the last two codepoints on each of the 17 planes and the range U+FDD0-U+FDEF.
The expansion of Unicode beyond the BMP was designed to facilitate an upgrade compatibility path from UCS-2 systems, but it is extremely incorrect to somehow equate Unicode with UTF-16.
I have some places in some software where I assume little endian for simplicity, and I just leave in a static_assert(std::endian::native == std::endian::little) to let future me (or future someone else) know that a particular piece of code must be modified if it is ever to run on a not-little-endian machine.
In an ideal world you could just write endian-independent code (i.e. read byte by byte) and leave the compiler optimizer to sort it out. This has the benefit of also not tripping up any alignment restrictions.
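Something like this, say (a sketch; optimizers generally turn these shifts into plain loads and stores on little-endian targets):

    #include <stdint.h>

    /* Read a little-endian uint16_t from a byte buffer regardless of host
     * endianness or alignment. */
    static uint16_t load_le16(const uint8_t *p)
    {
        return (uint16_t)(p[0] | (p[1] << 8));
    }

    /* And the matching store, for writing the on-disk format back out. */
    static void store_le16(uint8_t *p, uint16_t v)
    {
        p[0] = (uint8_t)(v & 0xFF);
        p[1] = (uint8_t)(v >> 8);
    }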
I have a relatively large array of uint16_t with highly repetitive (low entropy) data. I want to serialize that to disk without wasting a lot of space, so I run compress2 from zlib on the data when serializing it, and decompress it when deserializing. However, these files need to make sense across machines, so I have defined the file format to use compressed little-endian 16-bit unsigned ints. Therefore, if you ever want to run this code on a big-endian machine, you need to add some code to flip the bytes around before compressing, then flip them back after decompressing.
You're right that when your code is iterating through data byte for byte, you can write it in an endian-agnostic way and let the optimizer take care of recognizing that your shifts and ORs can be replaced with a memcpy on little-endian systems. But it's not always that simple.