digitalmars.D - The encoding of Windows/Linux filenames
- Arcane Jill (38/38) Jun 26 2004 To clarify a point made in another post... Suppose you have a file calle...
- Walter (4/4) Jun 26 2004 This issue is already fixed for std.file operations under Win32, this fi...
- Carlos Santander B. (10/10) Jun 26 2004 "Walter" wrote in message
- Walter (4/9) Jun 26 2004 fix
To clarify a point made in another post...

Suppose you have a file called "café". This filename will be stored within the Windows filesystem as a sequence of 16-bit words, each representing a Unicode character, which in this case will be the sequence { 0x0063, 0x0061, 0x0066, 0x00E9 }. (This is not of course true with old DOS 8.3 filenames, but we'll ignore them.) The D char[] string "café", on the other hand, will be stored as the five-byte UTF-8 sequence { 0x63, 0x61, 0x66, 0xC3, 0xA9 }.

However, when you call the char version of CreateFile(), the filename string will be interpreted as if it were encoded in the default Windows codepage (normally WINDOWS-1252 in English-speaking countries). Under this interpretation, the byte sequence { 0x63, 0x61, 0x66, 0xC3, 0xA9 } will be seen as the string "cafÃ©", and so Windows will attempt to open a file of that name. So, either it will fail, or it will open the wrong file.

The fix, of course, is to pass the value (std.utf.toUTF16(filename)), instead of (filename), to the wide-character version of CreateFile(). This should not have to be done by users - it needs to be done at the Phobos level.

The situation is more complicated on Linux, unfortunately. On Linux, filenames are stored as a sequence of bytes, not 16-bit words. On one level that sequence of bytes is kind of "raw" - fopen() can be passed any sequence of bytes not containing "/" or "\0", and it will consider a filename to match only if it is byte-for-byte identical. However, this does not really mitigate the problem, because bytes only turn into characters - even 8-bit-wide ones - when you interpret them according to an encoding. Thus, if you have C source code which says fopen("café", "r"), your C compiler will still need to know what sequence of bytes should represent these characters. By and large, it will assume the system default encoding, called the "locale" in Linux-speak (although it has very little to do with the ISO language-country-variant understanding of "locale").

Some Linux users will have set their default "locale" to UTF-8. Others won't. Getting this right will be tricky. Unfortunately, you can't ignore this problem, unless you want to tell people that D's File (FileStream?) class will only work for filenames containing ASCII characters - and that is hardly a realistic option if you want D to compete seriously with C++ and Java.

It will be easier to fix this for Windows, for the reasons given above. I think, at least, that should happen as part of the ongoing std.stream improvements. Someone who knows more about Linux encoding will have to help out on the Linux fix.

Arcane Jill
Jun 26 2004
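The mismatch described in the post above can be checked byte by byte. The following is only a sketch, written in Python rather than D purely because it makes the encodings easy to inspect; the variable names are illustrative and not part of Phobos or the Windows API. It assumes "é" is written as the single code point U+00E9, whose UTF-8 form is the two bytes C3 A9.

```python
import struct

name = "caf\u00e9"  # "café", with é as the single code point U+00E9

# The 16-bit words (UTF-16 code units) that Windows would store for this name.
utf16_le = name.encode("utf-16-le")
utf16_units = list(struct.unpack("<%dH" % (len(utf16_le) // 2), utf16_le))

# The bytes a UTF-8 string (such as a D char[]) holds for the same name.
utf8_bytes = name.encode("utf-8")

# What results if those UTF-8 bytes are decoded as Windows-1252,
# which is what the char version of an API would do on such a system.
misread = utf8_bytes.decode("cp1252")

# On Linux the same name is a different byte sequence under a Latin-1 locale.
latin1_bytes = name.encode("iso-8859-1")

print([hex(u) for u in utf16_units])
print([hex(b) for b in utf8_bytes])
print(ascii(misread))  # printed escaped so it is safe on any terminal
print(latin1_bytes != utf8_bytes)
```

The misread string has five characters where the original had four, so a lookup by that name cannot match the file as stored. The last comparison is the Linux side of the problem: the same visible name corresponds to different bytes depending on the encoding in use.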
This issue is already fixed for std.file operations under Win32; this fix just needs to be propagated to std.stream. For Linux, the file name operations assume the Linux APIs take UTF-8. I don't know how to do code pages in Linux, so this will have to wait until I figure it out <g>.
Jun 26 2004
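One possible shape for the Linux side, offered only as a sketch and not as what Phobos does: ask the runtime which encoding the user's locale implies, then transcode the UTF-8 name into that encoding before handing the bytes to the file API. In C the usual query is nl_langinfo(CODESET); the Python below uses locale.getpreferredencoding() as a stand-in. The helper filename_bytes is hypothetical and exists only for this illustration.

```python
import locale

def filename_bytes(name, codeset):
    """Hypothetical helper: encode a filename into the bytes expected
    under the given locale encoding.

    Raises UnicodeEncodeError if the name contains a character that the
    target encoding cannot represent.
    """
    return name.encode(codeset)

# The encoding implied by the current locale settings; the value depends
# entirely on the machine and environment this runs on.
current_codeset = locale.getpreferredencoding(False)

name = "caf\u00e9"  # "café"
print(current_codeset)
print(filename_bytes(name, "UTF-8"))
print(filename_bytes(name, "ISO-8859-1"))
```

Assuming UTF-8 unconditionally, as described in the post above, is the special case where the transcoding step is the identity. It gives the expected result only for users whose locale really is UTF-8.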
"Walter" <newshound digitalmars.com> wrote in message news:cbkbv8$abd$2 digitaldaemon.com
| This issue is already fixed for std.file operations under Win32, this fix
| just needs to be propagated to std.stream.

... which I already did, and posted in the bugs ng.

| For linux, the file name
| operations assume the linux APIs take UTF-8. I don't know how to do code
| pages in linux, so this will have to wait until I figure it out <g>.

-----------------------
Carlos Santander Bernal
Jun 26 2004
"Carlos Santander B." <carlos8294 msn.com> wrote in message news:cbkdvf$d4a$1 digitaldaemon.com...
"Walter" <newshound digitalmars.com> wrote in message news:cbkbv8$abd$2 digitaldaemon.com
| This issue is already fixed for std.file operations under Win32, this fix
| just needs to be propagated to std.stream.
... which I already did, and posted in the bugs ng.

Yes, you did.
Jun 26 2004