It causes issues such as opening files by filename not being cross-platform with the standard libc functions: on Windows the w-string versions are required, because UTF-16 characters can contain 0-bytes that aren't zero terminators.
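A rough sketch of what that split looks like in practice; the file name here is made up, `_wfopen` is the Windows CRT's wide-string counterpart to `fopen`, and the snippet assumes the source file itself is UTF-8:

```c
#include <stdio.h>

/* Sketch: opening a file whose name isn't plain ASCII.
 * On POSIX systems the byte-oriented fopen() is enough (file names are
 * just bytes, conventionally UTF-8), but on Windows the wide-string
 * variant is needed because the native file name encoding is UTF-16. */
FILE *open_journal(void)
{
#ifdef _WIN32
    /* UTF-16 path; the L prefix makes a wchar_t string literal */
    return _wfopen(L"журнал.txt", L"rb");
#else
    /* UTF-8 bytes are passed straight through to the kernel */
    return fopen("журнал.txt", "rb");
#endif
}
```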
Yes, it makes sense, but it still resulted in a lot of work.
Most Unix syscalls use C-style strings: sequences of 8-bit bytes terminated with a zero byte. With many (most?) character encodings you can continue to present string data to syscalls in the same way, since they also reserve the zero byte for that purpose. Even some multi-byte encodings work, because they deliberately avoid using 0-value bytes for exactly this reason.
UTF-16LE/BE (and UTF-32, for that matter) chose not to allow for this, and the result is that if you want UTF-16 support in your existing C-string-based syscalls, you need a second copy of every string-accepting syscall, taking your UTF-16 type of choice.
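To make this concrete, here's what happens if UTF-16LE data is handed to anything nul-terminated:

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "Hi" encoded as UTF-16LE: every ASCII character carries a 0x00 high byte */
    const char utf16le[] = { 'H', 0, 'i', 0, 0, 0 };

    /* Any nul-terminated API stops at the first zero byte, so it sees
     * one byte of "string" instead of four bytes of text. */
    printf("%zu\n", strlen(utf16le));   /* prints 1 */
    return 0;
}
```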
> Most Unix syscalls use C-style strings: sequences of 8-bit bytes terminated with a zero byte. With many (most?) character encodings you can continue to present string data to syscalls in the same way, since they also reserve the zero byte for that purpose
That's completely wrong. If a syscall (or a function) expects text in encoding A, you should not be sending it in encoding B because it would be interpreted incorrectly, or even worse, this would become a vulnerability.
For every function, the expected encoding must be specified, just as argument types, constraints, and ownership rules are. Sadly, many open source libraries don't do this. How are you supposed to call a function when you don't know the expected encoding?
Also, it is better to pass a pointer and a length for a string than to search, potentially without bound, for a zero byte.
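A minimal sketch of the difference, with a hypothetical str_slice type standing in for the pointer-plus-length form:

```c
#include <stddef.h>

/* Hypothetical "pointer + length" string argument: the callee never has
 * to scan for a terminator, and embedded zero bytes are allowed. */
struct str_slice {
    const char *data;
    size_t      len;
};

/* Compare with the classic C-string form, where the length is only
 * discovered by walking the bytes until a zero is found (and never
 * found at all if the terminator is missing). */
size_t cstr_len(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}
```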
> and the result is that if you want UTF-16 support in your existing C-string-based syscalls
There is no need to support multiple encodings; it only makes things complicated. The simplest solution would be to use UTF-8 for all kernel facilities as a standard.
For example, it would be better if the open() syscall required a valid UTF-8 string for the file name. That would leave no possibility of file names being displayed as question marks.
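A minimal sketch of the structural check such a UTF-8-only open() might apply; a real check would also reject overlong and surrogate encodings:

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal UTF-8 structural check: verifies lead bytes and continuation
 * bytes only. Overlong forms starting with 0xE0/0xF0 and surrogate
 * code points are not fully rejected here. */
static bool is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = s[i];
        size_t extra;
        if (b < 0x80)                    extra = 0;  /* ASCII */
        else if (b >= 0xC2 && b <= 0xDF) extra = 1;  /* 2-byte sequence */
        else if (b >= 0xE0 && b <= 0xEF) extra = 2;  /* 3-byte sequence */
        else if (b >= 0xF0 && b <= 0xF4) extra = 3;  /* 4-byte sequence */
        else return false;                           /* invalid lead byte */

        if (len - i <= extra)
            return false;                            /* truncated sequence */
        for (size_t k = 1; k <= extra; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return false;                        /* bad continuation byte */
        i += extra + 1;
    }
    return true;
}
```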
Yes, and my argument is that the OS should treat strings as a blob and not care about the encoding. How can it know what shiny new encoding the program uses? Encoding is a concern of the program; the OS should just leave it alone and not try to decode it.
The OS treats strings as a blob, yes, but typically specifies that they're a blob of nul-terminated data.
Unfortunately some text encodings (UTF-16 among them) use nuls for code points other than U+0000. In fact UTF-16 uses nul bytes for every character below U+0100, in other words all of ASCII and Latin-1. Therefore you can't just support _all_ text encodings for filenames on these OSes, unless the OS provides a second set of syscalls for it (this is what Windows did, since they wanted to use UTF-16LE across the board).
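You can see that split throughout the Win32 API, where most string-taking calls exist twice: an "A" variant taking byte strings interpreted in the legacy code page, and a "W" variant taking UTF-16LE. A sketch with CreateFile:

```c
#include <windows.h>

/* Sketch of the duplicated-API situation on Windows: the same operation
 * exists once for byte strings (legacy ANSI code page) and once for
 * UTF-16LE wide strings. */
HANDLE open_readonly_ansi(const char *name)
{
    return CreateFileA(name, GENERIC_READ, FILE_SHARE_READ, NULL,
                       OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
}

HANDLE open_readonly_wide(const wchar_t *name)
{
    return CreateFileW(name, GENERIC_READ, FILE_SHARE_READ, NULL,
                       OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
}
```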
I've only mentioned syscalls here, but in truth it extends all through the C stdlib, which everything ends up using in some way as well.
You should not be passing file names in different encodings because other apps won't be able to display them properly. There should be one standard encoding for file names. It would also help with things like looking up a name ignoring case and extra spaces.
I mean, I agree there _should_ be one standard encoding, but the Unix API (to pick the example I'm closest to) predates these nuances. All it says is that filenames are a string [of bytes] and can't contain the bytes '/' or '\0'.
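A sketch of that rule for a single name component; everything else about the bytes is left to convention:

```c
#include <stdbool.h>
#include <string.h>

/* Sketch of the only rule classic Unix imposes on a single file name
 * component: it must be non-empty and contain neither '/' (the path
 * separator) nor '\0' (the C-string terminator). The encoding of the
 * remaining bytes is left entirely to convention. */
static bool is_valid_unix_name(const char *name)
{
    return name[0] != '\0' && strchr(name, '/') == NULL;
}
```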
It is good for an implementation to enforce a standard encoding at some level, sure. macOS has proved that features like case insensitivity and Unicode normalization can be integrated with the Unix filename APIs.
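For a sense of why normalization matters there: the same visible name can arrive as two different byte sequences, and a filesystem that only compares raw bytes would treat them as two files:

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* The same visible name "café" in two Unicode normalization forms:
     * NFC uses the precomposed U+00E9, NFD uses 'e' plus a combining
     * acute accent (U+0301). */
    const char nfc[] = "caf\xC3\xA9";
    const char nfd[] = "cafe\xCC\x81";

    /* The byte sequences differ (even in length), so a byte-comparing
     * filesystem sees two different names; a normalizing filesystem
     * treats them as the same file. */
    printf("%zu %zu %d\n", strlen(nfc), strlen(nfd), strcmp(nfc, nfd) != 0);
    /* prints: 5 6 1 */
    return 0;
}
```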
A file name is not a blob, because it is entered by the user as a text string and displayed to the user as a text string, not as a bunch of hex digits. Also, it cannot contain certain characters (like slash or null), so it isn't an arbitrary blob anyway.
And you should be using one specified encoding for file names if you want them to be displayed correctly in all applications. It would be inconvenient if different applications stored file names in different encodings.
For the same reason, the encoding should be specified in library documentation for every function that accepts or returns strings.