digitalmars.D.learn - UTF-16 endianness
- Marek Janukowicz (16/16) Jan 29 2016 I have trouble understanding how endianness works for UTF-16.
- Steven Schveighoffer (10/24) Jan 29 2016 It's not any different from other endianness.
- Marek Janukowicz (9/19) Jan 29 2016 To be precise - my case is IMAP UTF7 folder name encoding and I finally ...
- Steven Schveighoffer (16/37) Jan 29 2016 No clever way, just the straightforward way ;)
- Johannes Pfau (3/36) Jan 29 2016 There's also a phobos solution: bigEndianToNative in std.bitmanip.
- Marek Janukowicz (4/17) Jan 30 2016 That's a good point, thanks.
- Adam D. Ruppe (7/8) Jan 29 2016 UTF-16 (as well as UTF-32) comes in both little-endian and
I have trouble understanding how endianness works for UTF-16.

For example, the UTF-16 code for the 'ł' character is 0x0142. But this program shows otherwise:

    import std.stdio;

    public void main () {
        ubyte[] properOrder = [0x01, 0x42];
        ubyte[] reverseOrder = [0x42, 0x01];
        writefln( "proper: %s, reverse: %s",
            cast(wchar[])properOrder,
            cast(wchar[])reverseOrder );
    }

output:

    proper: 䈁, reverse: ł

Is there anything I should know about UTF endianness?

-- Marek Janukowicz
Jan 29 2016
On 1/29/16 5:36 PM, Marek Janukowicz wrote:
> I have trouble understanding how endianness works for UTF-16.
> For example, the UTF-16 code for the 'ł' character is 0x0142. But this
> program shows otherwise:
> [...]
> Is there anything I should know about UTF endianness?

It's not any different from other endianness. In other words, a UTF-16 code unit is expected to be in the endianness of the platform you are running on. If you are on x86 or x86_64 (very likely), then it should be little-endian.

If your source of data is big-endian (or opposite from your native endianness), it will have to be converted before being treated as a wchar[].

Note that the version identifiers BigEndian and LittleEndian can be used to compile the correct code.

-Steve
Jan 29 2016
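To make the point about byte order concrete, here is a minimal sketch that builds a code unit from two bytes in an explicitly chosen order instead of relying on a cast. The helper name `codeUnit` is hypothetical and exists only for illustration; `Endian` and `endian` come from `std.system`.

```d
import std.stdio;
import std.system : Endian, endian;

/// Builds a UTF-16 code unit from two bytes stored in the given byte order.
/// `codeUnit` is a hypothetical helper for illustration, not a Phobos function.
wchar codeUnit(ubyte first, ubyte second, Endian order)
{
    return order == Endian.bigEndian
        ? cast(wchar)((first << 8) | second)   // first byte is the high byte
        : cast(wchar)((second << 8) | first);  // first byte is the low byte
}

void main()
{
    // The bytes 0x01 0x42 from the original post, read both ways.
    writefln("as big-endian:    U+%04X",
        cast(uint) codeUnit(0x01, 0x42, Endian.bigEndian));
    writefln("as little-endian: U+%04X",
        cast(uint) codeUnit(0x01, 0x42, Endian.littleEndian));

    // cast(wchar[]) on a ubyte[] reads the bytes in this platform's order.
    writeln("native order: ", endian);
}
```

Read as big-endian, the two bytes give 0x0142 ('ł'); read as little-endian, they give 0x4201, which matches the unexpected character in the original output on x86_64.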
On Fri, 29 Jan 2016 17:43:26 -0500, Steven Schveighoffer wrote:
>> Is there anything I should know about UTF endianness?
>
> It's not any different from other endianness. In other words, a UTF16
> code unit is expected to be in the endianness of the platform you are
> running on. If you are on x86 or x86_64 (very likely), then it should
> be little endian.

To be precise - my case is IMAP UTF-7 folder name encoding and I finally found out it's indeed big-endian, which explains my problem (as I'm indeed on x86_64).

> If your source of data is big-endian (or opposite from your native
> endianness), it will have to be converted before treating as a wchar[].

Is there any clever way to do the conversion? Or do I need to swap the bytes manually?

> Note the version identifiers BigEndian and LittleEndian can be used to
> compile the correct code.

This solution is of no use to me as I don't want to change the endianness in general.

-- Marek Janukowicz
Jan 29 2016
On 1/29/16 6:03 PM, Marek Janukowicz wrote:
> To be precise - my case is IMAP UTF7 folder name encoding and I finally
> found out it's indeed big endian, which explains my problem (as I'm
> indeed on x86_64).
>
> Is there any clever way to do the conversion? Or do I need to swap the
> bytes manually?

No clever way, just the straightforward way ;)

Swapping endianness of 32-bits can be done with core.bitop.bswap. Doing it with 16 bits I believe you have to do bit shifting. Something like:

    foreach(ref elem; wcharArr)
        elem = ((elem << 8) & 0xff00) | ((elem >> 8) & 0x00ff);

Or you can do it with the bytes directly before casting.

>> Note the version identifiers BigEndian and LittleEndian can be used to
>> compile the correct code.
>
> This solution is of no use to me as I don't want to change the
> endianess in general.

What I mean is that you can annotate your code with version statements like:

    version(LittleEndian) {
        // perform the byteswap
        ...
    }

so your code is portable to BigEndian systems (where you would not want to byte swap).

-Steve
Jan 29 2016
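Putting the two suggestions from this post together, a minimal sketch might look like the following. The function name `fromBigEndianUtf16` is hypothetical, and the explicit `cast(wchar)` around the shift expression is an addition here to keep the narrowing from `int` explicit; the rest follows the shift-and-mask approach and the `version(LittleEndian)` guard described above.

```d
import std.stdio;

/// Reinterprets big-endian UTF-16 bytes as native wchar code units.
/// `fromBigEndianUtf16` is a hypothetical helper name, not a Phobos function.
wchar[] fromBigEndianUtf16(const(ubyte)[] bytes)
{
    assert(bytes.length % 2 == 0, "UTF-16 data must have an even byte count");

    // Copy first so the caller's buffer is left untouched.
    wchar[] units = cast(wchar[]) bytes.dup;

    version (LittleEndian)
    {
        // Native order differs from the data: swap the two bytes of each unit.
        foreach (ref elem; units)
            elem = cast(wchar)(((elem << 8) & 0xff00) | ((elem >> 8) & 0x00ff));
    }
    // On BigEndian targets the bytes are already in native order.

    return units;
}

void main()
{
    ubyte[] data = [0x01, 0x42]; // big-endian bytes for U+0142
    writeln(fromBigEndianUtf16(data));
}
```

Working on a copy is a design choice for this sketch; swapping in place, as in the one-liner above, avoids the allocation if the input buffer is not needed afterwards.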
On Fri, 29 Jan 2016 18:58:17 -0500, Steven Schveighoffer <schveiguy yahoo.com> wrote:
> No clever way, just the straightforward way ;)
>
> Swapping endianness of 32-bits can be done with core.bitop.bswap. Doing
> it with 16 bits I believe you have to do bit shifting. Something like:
>
> foreach(ref elem; wcharArr) elem = ((elem << 8) & 0xff00) | ((elem >> 8) & 0x00ff);
>
> Or you can do it with the bytes directly before casting

There's also a Phobos solution: bigEndianToNative in std.bitmanip.
Jan 29 2016
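A minimal sketch of the Phobos route, assuming the input is a plain buffer of big-endian UTF-16 bytes. The helper name `decodeBigEndian` is hypothetical; `bigEndianToNative` is the `std.bitmanip` function mentioned above, which takes a static array whose size matches the target type.

```d
import std.bitmanip : bigEndianToNative;
import std.stdio;

/// Converts big-endian UTF-16 bytes to native wchar code units.
/// `decodeBigEndian` is a hypothetical helper name for this sketch.
wchar[] decodeBigEndian(const(ubyte)[] data)
{
    wchar[] units;
    // Walk the buffer two bytes at a time; a trailing odd byte is ignored.
    for (size_t i = 0; i + 1 < data.length; i += 2)
    {
        // bigEndianToNative expects a ubyte[2] for a two-byte type like wchar.
        ubyte[2] pair = [data[i], data[i + 1]];
        units ~= bigEndianToNative!wchar(pair);
    }
    return units;
}

void main()
{
    ubyte[] data = [0x01, 0x42, 0x00, 0x41]; // U+0142, U+0041 in big-endian order
    writeln(decodeBigEndian(data));
}
```

Because the conversion is a no-op on big-endian targets, this version needs no `version` block of its own.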
On Fri, 29 Jan 2016 18:58:17 -0500, Steven Schveighoffer wrote:
>>> Note the version identifiers BigEndian and LittleEndian can be used to
>>> compile the correct code.
>>
>> This solution is of no use to me as I don't want to change the
>> endianess in general.
>
> What I mean is that you can annotate your code with version statements
> like:
>
> version(LittleEndian) { // perform the byteswap ... }
>
> so your code is portable to BigEndian systems (where you would not want
> to byte swap).

That's a good point, thanks.

-- Marek Janukowicz
Jan 30 2016
On Friday, 29 January 2016 at 22:36:37 UTC, Marek Janukowicz wrote:
> I have trouble understanding how endianness works for UTF-16.

UTF-16 (as well as UTF-32) comes in both little-endian and big-endian variants. A byte-order marker in the file can help you detect which one it is in. See this table:

http://www.unicode.org/faq/utf_bom.html#gen6
Jan 29 2016
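As a rough sketch of the byte-order-marker check, the function below looks only at the two UTF-16 signatures, FE FF for big-endian and FF FE for little-endian. The names `Utf16Order` and `detectUtf16Order` are hypothetical. Note the assumptions: the data is already known to be UTF-16 (a UTF-32 little-endian BOM also begins with FF FE), and a missing marker is reported as unknown, not guessed.

```d
import std.stdio;

/// Byte order reported by the sketch below (hypothetical type).
enum Utf16Order { unknown, bigEndian, littleEndian }

/// Inspects the first two bytes for a UTF-16 byte-order marker.
/// Assumes the data is UTF-16; it does not distinguish UTF-32.
Utf16Order detectUtf16Order(const(ubyte)[] bytes)
{
    if (bytes.length >= 2)
    {
        if (bytes[0] == 0xFE && bytes[1] == 0xFF)
            return Utf16Order.bigEndian;
        if (bytes[0] == 0xFF && bytes[1] == 0xFE)
            return Utf16Order.littleEndian;
    }
    return Utf16Order.unknown; // no marker present
}

void main()
{
    ubyte[] withBom = [0xFE, 0xFF, 0x01, 0x42];
    ubyte[] withoutBom = [0x01, 0x42];
    writeln(detectUtf16Order(withBom));
    writeln(detectUtf16Order(withoutBom));
}
```

Data such as the two-byte example from the original post carries no marker, so there the byte order has to come from the protocol or format definition instead.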