digitalmars.D.learn - D1: UTF8 char[] casting to wchar[] array cast misalignment ERROR
- jicman (120/120) Jun 16 2014 Greetings!
- Jacob Carlborg (8/21) Jun 16 2014 I don't know if you use Tango [1], but it has a module [2] to help with
- monarch_dodra (19/41) Jun 17 2014 No, the issue is that the OP is taking an array of smaller
- jicman (2/44) Jun 17 2014
- jicman (6/39) Jun 17 2014 Thanks, but can't use Tango. Historically, Tango (originally
- "Marc =?UTF-8?B?U2Now7x0eiI=?= <schuetzm gmx.net> (7/18) Jun 17 2014 If the length of the data is odd, it cannot be (valid) UTF16. You
- jicman (2/25) Jun 17 2014 Indeed. Thanks.
- Jesse Phillips (9/17) Jun 17 2014 If the BOM is missing and it is not UTF-8, it isn't a valid UTF
- jicman (2/22) Jun 18 2014 Thanks.
Greetings! I have a bunch of files: plain ASCII, UTF8, and UTF16, with and without a BOM (Byte Order Mark). I had, I thought, a nice way of figuring out what encoding a file used (ASCII, UTF8 or UTF16) when the BOM was missing, by reading the content and applying the std.utf.validate function to the char[] or wchar[] string. The problem is that lately I am hitting a wall with the "array cast misalignment" error when casting to wchar[], i.e.:

auto text = cast(string) file.read();
wchar[] temp = cast(wchar[]) text;

What would be the correct process to find out a text file's encoding? Any help would be greatly appreciated. This is the code that I have right now...

//begin code
char[] ReadFileData2UTF8(char[] file, out char[] bom)
{
    auto text = cast(string) file.read();
    if (text.length == 0)
    {
        bom = "NO_BOM";
        return "";
    }
    else if (text.length == 1)
    {
        ubyte[1] b = cast(ubyte[]) text[0 .. 1];
        bom = getBOM(b);
    }
    else if (text.length == 2)
    {
        ubyte[2] b = cast(ubyte[]) text[0 .. 2];
        bom = getBOM(b);
    }
    else if (text.length == 3)
    {
        ubyte[3] b = cast(ubyte[]) text[0 .. 3];
        bom = getBOM(b);
    }
    else // text.length > 3
    {
        ubyte[4] b = cast(ubyte[]) text[0 .. 4];
        bom = getBOM(b);
    }

    if (std.string.find(bom, "UTF16") == 0)
    {
        ubyte[] bs = cast(ubyte[]) text;
        if (bs[0 .. 2] == UTF16_be || bs[0 .. 2] == UTF16_le)
            bs = bs[2 .. $];
        text = cast(char[]) bs;
        wchar[] temp = cast(wchar[]) text;
        text = std.utf.toUTF8(temp);
    }
    else if (std.string.find(bom, "UTF32") == 0)
    {
        ubyte[] bs = cast(ubyte[]) text;
        if (bs[0 .. 4] == UTF32_be || bs[0 .. 4] == UTF32_le)
            bs = bs[4 .. $];
        text = cast(char[]) bs;
        dchar[] temp = cast(dchar[]) text;
        text = std.utf.toUTF8(temp);
    }
    else if (bom == "UTF8")
    {
        ubyte[] bs = cast(ubyte[]) text;
        if (bs[0 .. 3] == UTF8)
            bs = bs[3 .. $];
        text = cast(char[]) bs; // text is already UTF8
    }
    else // no BOM; hoping I can figure out the type...
    {
        try // UTF-8
        {
            validate(text);
            bom = "UTF8";
        }
        catch (UtfException e)
        {
            // Failed UTF8. Trying UTF16.
            //if ((text.length % 2) == 1)
            //    text ~= " ";
            try // UTF-16
            {
                wchar[] temp = cast(wchar[]) text;
                validate(temp);
                text = std.utf.toUTF8(temp);
                bom = "UTF16_le";
            }
            catch (UtfException e)
            {
                // Failed UTF16. Trying UTF32.
                try // UTF-32
                {
                    dchar[] temp = cast(dchar[]) text;
                    validate(temp);
                    text = std.utf.toUTF8(temp);
                    bom = "UTF32_le";
                }
                catch (UtfException e) // hoping for ASCII
                {
                    text ~= "\000";
                    char[] temp = std.windows.charset.fromMBSz(text.ptr, 0);
                    text = std.utf.toUTF8(temp);
                    bom = "NO_BOM";
                }
            }
        }
    }
    return text;
}
//end code
Jun 16 2014
On 17/06/14 04:27, jicman wrote:
> The problem is that lately I am hitting a wall with the "array cast
> misalignment" error when casting to wchar[], i.e.:
>
> auto text = cast(string) file.read();
> wchar[] temp = cast(wchar[]) text;

How about casting to "wchar[]" directly, instead of going through "string"?

> What would be the correct process to find out a text file's encoding?
> Any help would be greatly appreciated.

I don't know if you use Tango [1], but it has a module [2] to help with this sort of thing.

[1] http://dsource.org/projects/tango
[2] http://dsource.org/projects/tango/docs/stable/tango.io.UnicodeFile.html

--
/Jacob Carlborg
Jun 16 2014
On Tuesday, 17 June 2014 at 06:44:40 UTC, Jacob Carlborg wrote:
> How about casting to "wchar[]" directly, instead of going through
> "string"?

No, the issue is that the OP is taking an array of smaller elements (probably containing an *odd* number of them) and casting it to a larger element type. If the original array size is not a multiple of the target element size, the cast ends up "slicing" the last element, and that triggers said "array cast misalignment" error. The error message is unclear (IMO), since "misalignment" usually refers to *position* in memory; I think "array cast length mismatch" would be a better error message.

In any case, OP, something like:

auto text = file.read().assumeUnique;
size_t u = text.length % (wchar.sizeof / char.sizeof);
if (u != 0)
{
    // text MUST be of type "char[]"
}
else
{
    // OK! Here, the cast is legal: "text" *can* be of type "wchar[]".
}
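Spelled out as a small runnable sketch (a D1-flavoured illustration; the buffer contents and the helper name are made up, not the OP's code):

```d
import std.utf;

// A wchar is 2 bytes, so a buffer whose byte count is not a multiple
// of wchar.sizeof can never be reinterpreted as a wchar[]; the cast
// itself is rejected at runtime with "array cast misalignment".
bool canReinterpretAsWchar(ubyte[] data)
{
    return (data.length % wchar.sizeof) == 0;
}

void main()
{
    ubyte[] even = cast(ubyte[]) "A\0B\0".dup; // "AB" as UTF-16LE bytes
    ubyte[] odd  = even[0 .. 3];               // truncated: 3 bytes

    assert(canReinterpretAsWchar(even));
    assert(!canReinterpretAsWchar(odd)); // casting this would abort

    wchar[] w = cast(wchar[]) even; // legal: 4 bytes -> 2 wchars
    assert(w.length == 2);
    assert(w[0] == 'A');
}
```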
Jun 17 2014
On Tuesday, 17 June 2014 at 07:49:59 UTC, monarch_dodra wrote:
> If the original array size is not a multiple of the target element
> size, the cast ends up "slicing" the last element, and that triggers
> said "array cast misalignment" error.

Thanks.
Jun 17 2014
On Tuesday, 17 June 2014 at 06:44:40 UTC, Jacob Carlborg wrote:
> I don't know if you use Tango [1], but it has a module [2] to help
> with this sort of thing.

Thanks, but I can't use Tango. Historically, Tango (originally Mango) and Phobos did not play well together, and by the time Tango came along, my project was already built entirely on Phobos, so I have to continue using Phobos.

josé
Jun 17 2014
On Tuesday, 17 June 2014 at 02:27:43 UTC, jicman wrote:
> The problem is that lately, I am hitting a wall with the "array cast
> misalignment" error when casting to wchar[], i.e.:
>
> auto text = cast(string) file.read();
> wchar[] temp = cast(wchar[]) text;

If the length of the data is odd, it cannot be (valid) UTF16. You can check for that, and skip the test for UTF16 in that case.

Another thing: it is better not to cast the data to `string` before you know that it's actually UTF8. Better to make it `ubyte[]`; that way you don't need all the casts inside the if blocks.
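As a sketch of what that might look like (a D1-flavoured illustration; the function name and the order of the fallbacks are assumptions, not tested code):

```d
import std.utf;

// Sketch: keep the raw bytes as ubyte[] and commit to a character
// type only once a candidate encoding actually validates.
char[] guessAndDecode(ubyte[] data, out char[] bom)
{
    try // UTF-8 first
    {
        char[] t = cast(char[]) data;
        validate(t);
        bom = "UTF8";
        return t;
    }
    catch (UtfException e) { }

    // Only try UTF-16 when the byte count is even; an odd-length
    // buffer cannot be valid UTF-16, and the cast itself would fail.
    if (data.length % wchar.sizeof == 0)
    {
        try
        {
            wchar[] t = cast(wchar[]) data;
            validate(t);
            bom = "UTF16_le";
            return toUTF8(t);
        }
        catch (UtfException e) { }
    }

    bom = "NO_BOM"; // give up: treat as the system codepage / ASCII
    return cast(char[]) data;
}
```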
Jun 17 2014
On Tuesday, 17 June 2014 at 12:54:39 UTC, Marc Schütz wrote:
> If the length of the data is odd, it cannot be (valid) UTF16. You can
> check for that, and skip the test for UTF16 in that case.

Indeed. Thanks.
Jun 17 2014
On Tuesday, 17 June 2014 at 02:27:43 UTC, jicman wrote:
> I had, I thought, a nice way of figuring out what encoding a file used
> (ASCII, UTF8 or UTF16) when the BOM was missing, by reading the
> content and applying the std.utf.validate function to the char[] or
> wchar[] string.

If the BOM is missing and it is not UTF-8, it isn't a valid UTF encoding; otherwise you have your answer. Don't cast a char[] to wchar[]: if you have a valid char[], it must be converted (use std.conv.to).

From some testing: the mentioned even-length check for UTF-16 is exactly what's at play here; the "array cast misalignment" error fired because the array wasn't an even number of bytes.
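The cast-versus-convert distinction as a small sketch (a D1-flavoured illustration with made-up data; under D1-era Phobos the conversion routine is std.utf.toUTF16, as std.conv.to only handles this in D2):

```d
import std.utf;

void main()
{
    char[] utf8 = "héllo".dup; // 6 bytes of UTF-8, 5 code points

    // cast(wchar[]) utf8 would merely reinterpret the 6 raw bytes as
    // 3 wchar code units -- garbage, even though the length happens
    // to be even.

    // A real conversion transcodes, one wchar per BMP code point:
    wchar[] w = toUTF16(utf8);
    assert(w.length == 5);
    assert(toUTF8(w) == utf8); // round-trips back to the same UTF-8
}
```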
Jun 17 2014
On Wednesday, 18 June 2014 at 02:25:34 UTC, Jesse Phillips wrote:
> Don't cast a char[] to wchar[]: if you have a valid char[], it must
> be converted (use std.conv.to).

Thanks.
Jun 18 2014