It causes issues such as opening files by filename not being cross-platform with the standard libc functions: on Windows the w-string versions are required, because UTF-16 characters can contain 0-bytes that aren't zero terminators.
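A rough sketch of what that split looks like in practice; the file name here is made up, `_wfopen` is the Windows CRT's wide-string counterpart to `fopen`, and the snippet assumes the source file itself is UTF-8:

```c
#include <stdio.h>

/* Sketch: opening a file whose name isn't plain ASCII.
 * On POSIX systems the byte-oriented fopen() is enough (file names are
 * just bytes, conventionally UTF-8), but on Windows the wide-string
 * variant is needed because the native file name encoding is UTF-16. */
FILE *open_journal(void)
{
#ifdef _WIN32
    /* UTF-16 path; the L prefix makes a wchar_t string literal */
    return _wfopen(L"журнал.txt", L"rb");
#else
    /* UTF-8 bytes are passed straight through to the kernel */
    return fopen("журнал.txt", "rb");
#endif
}
```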
Yes, it makes sense, but it still resulted in a lot of work.
Most Unix syscalls use C-style strings: sequences of 8-bit bytes terminated with a zero byte. With many (most?) character encodings you can continue to present string data to syscalls in the same way, since they also reserve the zero byte for that purpose. Even some multi-byte encodings work, because they deliberately avoid using 0-value bytes for exactly this reason.
UTF-16LE/BE (and UTF-32, for that matter) chose not to allow for this, and the result is that if you want UTF-16 support in your existing C-string-based syscalls, you need a second copy of every string-accepting syscall, taking your UTF-16 type of choice.
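To make this concrete, here's what happens if UTF-16LE data is handed to anything nul-terminated:

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "Hi" encoded as UTF-16LE: every ASCII character carries a 0x00 high byte */
    const char utf16le[] = { 'H', 0, 'i', 0, 0, 0 };

    /* Any nul-terminated API stops at the first zero byte, so it sees
     * one byte of "string" instead of four bytes of text. */
    printf("%zu\n", strlen(utf16le));   /* prints 1 */
    return 0;
}
```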
> Most Unix syscalls use C-style strings: sequences of 8-bit bytes terminated with a zero byte. With many (most?) character encodings you can continue to present string data to syscalls in the same way, since they also reserve the zero byte for that purpose
That's completely wrong. If a syscall (or a function) expects text in encoding A, you should not be sending it in encoding B because it would be interpreted incorrectly, or even worse, this would become a vulnerability.
For every function, the expected encoding must be specified, just as argument types, constraints, and ownership rules are. Sadly, many open source libraries don't do this. How are you supposed to call a function when you don't know the expected encoding?
Also, it is better to pass a pointer and a length for a string than to search, potentially without bound, for a zero byte.
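A minimal sketch of the difference, with a hypothetical str_slice type standing in for the pointer-plus-length form:

```c
#include <stddef.h>

/* Hypothetical "pointer + length" string argument: the callee never has
 * to scan for a terminator, and embedded zero bytes are allowed. */
struct str_slice {
    const char *data;
    size_t      len;
};

/* Compare with the classic C-string form, where the length is only
 * discovered by walking the bytes until a zero is found (and never
 * found at all if the terminator is missing). */
size_t cstr_len(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}
```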
> and the result is that if you want UTF-16 support in your existing C-string-based syscalls
There is no need to support multiple encodings; it only makes things complicated. The simplest solution would be to use UTF-8 for all kernel facilities as a standard.
For example, it would be better if the open() syscall required a valid UTF-8 string for the file name. That would leave no possibility of file names being displayed as question marks.
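A minimal sketch of the structural check such a UTF-8-only open() might apply; a real check would also reject overlong and surrogate encodings:

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal UTF-8 structural check: verifies lead bytes and continuation
 * bytes only. Overlong forms starting with 0xE0/0xF0 and surrogate
 * code points are not fully rejected here. */
static bool is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = s[i];
        size_t extra;
        if (b < 0x80)                    extra = 0;  /* ASCII */
        else if (b >= 0xC2 && b <= 0xDF) extra = 1;  /* 2-byte sequence */
        else if (b >= 0xE0 && b <= 0xEF) extra = 2;  /* 3-byte sequence */
        else if (b >= 0xF0 && b <= 0xF4) extra = 3;  /* 4-byte sequence */
        else return false;                           /* invalid lead byte */

        if (len - i <= extra)
            return false;                            /* truncated sequence */
        for (size_t k = 1; k <= extra; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return false;                        /* bad continuation byte */
        i += extra + 1;
    }
    return true;
}
```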
Yes, and my argument is that the OS should treat strings as a blob and not care about the encoding. How can it know what shiny new encoding the program uses? Encoding is a concern of the program; the OS should just leave it alone and not try to decode it.
The OS treats strings as a blob, yes, but typically specifies that they're a blob of nul-terminated data.
Unfortunately some text encodings (UTF-16 among them) use nuls for code points other than U+0000. In fact UTF-16 uses nul bytes for every character below U+0100, in other words all of ASCII and Latin-1. Therefore you can't just support _all_ text encodings for filenames on these OSes, unless the OS provides a second set of syscalls for it (this is what Windows did, since they wanted to use UTF-16LE across the board).
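You can see that split throughout the Win32 API, where most string-taking calls exist twice: an "A" variant taking byte strings interpreted in the legacy code page, and a "W" variant taking UTF-16LE. A sketch with CreateFile:

```c
#include <windows.h>

/* Sketch of the duplicated-API situation on Windows: the same operation
 * exists once for byte strings (legacy ANSI code page) and once for
 * UTF-16LE wide strings. */
HANDLE open_readonly_ansi(const char *name)
{
    return CreateFileA(name, GENERIC_READ, FILE_SHARE_READ, NULL,
                       OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
}

HANDLE open_readonly_wide(const wchar_t *name)
{
    return CreateFileW(name, GENERIC_READ, FILE_SHARE_READ, NULL,
                       OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
}
```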
I've only mentioned syscalls here, but in truth it extends all through the C stdlib, which everything ends up using in some way as well.
You should not be passing file names in different encodings because other apps won't be able to display them properly. There should be one standard encoding for file names. It would also help with things like looking up a name ignoring case and extra spaces.
I mean, I agree there _should_ be one standard encoding, but the Unix API (to pick the example I'm closest to) predates these nuances. All it says is that filenames are a string [of bytes] and can't contain the bytes '/' or '\0'.
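A sketch of that rule for a single name component; everything else about the bytes is left to convention:

```c
#include <stdbool.h>
#include <string.h>

/* Sketch of the only rule classic Unix imposes on a single file name
 * component: it must be non-empty and contain neither '/' (the path
 * separator) nor '\0' (the C-string terminator). The encoding of the
 * remaining bytes is left entirely to convention. */
static bool is_valid_unix_name(const char *name)
{
    return name[0] != '\0' && strchr(name, '/') == NULL;
}
```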
It is good for an implementation to enforce a standard encoding at some level, sure. macOS has proved that features like case insensitivity and Unicode normalization can be integrated with the Unix filename APIs.
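For a sense of why normalization matters there: the same visible name can arrive as two different byte sequences, and a filesystem that only compares raw bytes would treat them as two files:

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* The same visible name "café" in two Unicode normalization forms:
     * NFC uses the precomposed U+00E9, NFD uses 'e' plus a combining
     * acute accent (U+0301). */
    const char nfc[] = "caf\xC3\xA9";
    const char nfd[] = "cafe\xCC\x81";

    /* The byte sequences differ (even in length), so a byte-comparing
     * filesystem sees two different names; a normalizing filesystem
     * treats them as the same file. */
    printf("%zu %zu %d\n", strlen(nfc), strlen(nfd), strcmp(nfc, nfd) != 0);
    /* prints: 5 6 1 */
    return 0;
}
```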
A file name is not a blob, because it is entered by the user as a text string and displayed to the user as a text string, not as a bunch of hex digits. Also, it cannot contain certain characters (like slash or null), so it isn't an arbitrary blob anyway.
And you should be using one specified encoding for file names if you want them to be displayed correctly in all applications. It would be inconvenient if different applications stored file names in different encodings.
For the same reason, the encoding should be specified in library documentation for every function that accepts or returns strings.