
digitalmars.D.learn - D1: UTF8 char[] casting to wchar[] array cast misalignment ERROR

reply "jicman" <cabrera wrc.xerox.com> writes:
Greetings!

I have a bunch of files (plain ASCII, UTF8 and UTF16), with and 
without a BOM (Byte Order Mark).  I had, I thought, a nice way of 
figuring out what type of encoding a file used (ASCII, UTF8 or 
UTF16) when the BOM was missing: read the content and apply the 
std.utf.validate function to the char[] or wchar[] string.  The 
problem is that lately I am hitting a wall with the "array cast 
misalignment" error when casting to wchar[], i.e.

auto text = cast(string) file.read();
wchar[] temp = cast(wchar[]) text;

What would be the correct process to find out a text file's 
encoding?

Any help would be greatly appreciated.  This is the code that I 
have right now...

//begin code
char[] ReadFileData2UTF8(char[] file, out char[] bom)
{
   auto text = cast(string) file.read();
   if (text.length == 0)
   {
     bom = "NO_BOM";
     return "";
   }
   else if (text.length == 1)
   {
     ubyte[1] b = cast(ubyte[]) text[0 .. 1];
     bom = getBOM(b);
   }
   else if (text.length == 2)
   {
     ubyte[2] b = cast(ubyte[]) text[0 .. 2];
     bom = getBOM(b);
   }
   else if (text.length == 3)
   {
     ubyte[3] b = cast(ubyte[]) text[0 .. 3];
     bom = getBOM(b);
   }
   else if (text.length > 3)
   {
     ubyte[4] b = cast(ubyte[]) text[0 .. 4];
     bom = getBOM(b);
   }
   //writefln(bom);
   if (std.string.find(bom, "UTF16") == 0)
   {
     ubyte[] bs = cast(ubyte[]) text;
     if (bs[0 .. 2] == UTF16_be || bs[0 .. 2] == UTF16_le)
       bs = bs[2 .. $];
     text = cast(char[]) bs;
     wchar[] temp = cast(wchar[]) text; //text[2 .. $];
     text = std.utf.toUTF8(temp);
   }
   else if (std.string.find(bom, "UTF32") == 0)
   {
     ubyte[] bs = cast(ubyte[]) text;
     if (bs[0 .. 4] == UTF32_be || bs[0 .. 4] == UTF32_le)
       bs = bs[4 .. $];
     text = cast(char[]) bs;
     dchar[] temp = cast(dchar[]) text; //text[2 .. $];
     text = std.utf.toUTF8(temp);
   }
   else if (bom == "UTF8")
   {
     ubyte[] bs = cast(ubyte[]) text;
     if (bs[0 .. 3] == UTF8)
       bs = bs[3 .. $];
     text = cast(char[]) bs;
     // text is already UTF8
   }
   else // hoping I can figure out the type...
   {
     //msgBox("No BOM");
     //ubyte[] bs = cast(ubyte[]) text;
     try // utf8
     {
       validate(text);
       bom = "UTF8";
     }
     catch (UtfException e)
     {
       //msgBox("Failed UTF8. Trying UTF16");
       //text = cast(char[]) bs;
       //if ((text.length % 2) == 1)
       //  text ~= " ";
       try //utf16
       {
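          // This is the cast that raises "array cast misalignment" when
          // text.length is odd, i.e. not a multiple of wchar.sizeof.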
         wchar[] temp = cast(wchar[]) text; //text[2 .. $];
         //wchar[] temp = std.utf.toUTF16(text); //text[2 .. $];
         validate(temp);
         text = std.utf.toUTF8(temp);
         bom = "UTF16_le";
       }
       catch (UtfException e)
       {
         //msgBox("Failed UTF16. Trying UTF32");
         //text = cast(char[]) bs;
         try // utf32
         {
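            // Likewise, this cast fails unless text.length is a
            // multiple of dchar.sizeof (4 bytes).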
           dchar[] temp = cast(dchar[]) text; //text[2 .. $];
           //dchar[] temp = std.utf.toUTF32(text); //text[2 .. $];
           validate(temp);
           text = std.utf.toUTF8(temp);
           bom = "UTF32_le";
         }
         catch (UtfException e) // hoping for ASCII
         {
           //msgBox("Failed UTF32. Hoping ASCII");
           text ~= "\000";
           char[] temp = std.windows.charset.fromMBSz(text.ptr,0);
           text = std.utf.toUTF8(temp);
           //text = temp;
           bom = "NO_BOM";
         }
       }
     }
   }
   return text;
}
//end code
Jun 16 2014
Jacob Carlborg <doob me.com> writes:
On 17/06/14 04:27, jicman wrote:
 Greetings!

 I have a bunch of files plain ASCII, UTF8 and UTF16 with and without BOM
 (Byte Order Mark).  I had, "I thought", a nice way of figuring out what
 type of encoding the file was (ASCII, UTF8 or UTF16) when the BOM was
 missing, by reading the content and applying the std.utf.validate
 function to the char[] or, wchar[] string.  The problem is that lately,
 I am hitting into a wall with the "array cast misalignment" when casting
 wchar[].  ie.

 auto text = cast(string) file.read();
 wchar[] temp = cast(wchar[]) text;
How about casting to "wchar[]" directly, instead of going through "string"?
 What would be the correct process to find out a text file encoding?

 Any help would be greatly appreciated.  This is the code that I have
 right now...
I don't know if you use Tango [1], but it has a module [2] to help with this sort of thing.

[1] http://dsource.org/projects/tango
[2] http://dsource.org/projects/tango/docs/stable/tango.io.UnicodeFile.html

--
/Jacob Carlborg
Jun 16 2014
next sibling parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Tuesday, 17 June 2014 at 06:44:40 UTC, Jacob Carlborg wrote:
 On 17/06/14 04:27, jicman wrote:
 Greetings!

 I have a bunch of files plain ASCII, UTF8 and UTF16 with and 
 without BOM
 (Byte Order Mark).  I had, "I thought", a nice way of figuring 
 out what
 type of encoding the file was (ASCII, UTF8 or UTF16) when the 
 BOM was
 missing, by reading the content and applying the 
 std.utf.validate
 function to the char[] or, wchar[] string.  The problem is 
 that lately,
 I am hitting into a wall with the "array cast misalignment" 
 when casting
 wchar[].  ie.

 auto text = cast(string) file.read();
 wchar[] temp = cast(wchar[]) text;
How about casting to "wchar[]" directly, instead of going through "string"?
No, the issue is that the OP is taking an array of smaller elements (probably containing an *ODD* number of elements), and casting that as a bigger element type. If the original array size is not a multiple of the target element size, then it'll end up "slicing" the last element, and trigger said "array cast misalignment".

The error message, IMO, is unclear, since "misalignment" usually refers to *position* in memory. I think "array cast length mismatch" would be a better error message.

In any case, OP, something like:

auto text = file.read().assumeUnique;
size_t u = text.length % wchar.sizeof/char.sizeof;
if (u != 0) {
    // text MUST be of "char[]" type
} else {
    // OK! Here the cast is legal: "text" *can* be of type "wchar[]".
}
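
As a minimal sketch (assuming D1 Phobos; tryAsUTF16 is just a hypothetical helper name), the even-length check can be combined with validate before trying the wchar[] interpretation:

//begin code
import std.utf;  // validate, toUTF8, UtfException

// Hypothetical helper: returns the data transcoded to UTF-8 if it
// validates as UTF-16 (native byte order), or null if it cannot be UTF-16.
char[] tryAsUTF16(ubyte[] raw)
{
  // An odd byte count can never be UTF-16; casting anyway is what
  // triggers the "array cast misalignment" error.
  if (raw.length % wchar.sizeof != 0)
    return null;
  wchar[] temp = cast(wchar[]) raw;
  try
  {
    validate(temp);
    return toUTF8(temp);
  }
  catch (UtfException e)
  {
    return null;
  }
}
//end code
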
Jun 17 2014
parent "jicman" <cabrera wrc.xerox.com> writes:
On Tuesday, 17 June 2014 at 07:49:59 UTC, monarch_dodra wrote:
 On Tuesday, 17 June 2014 at 06:44:40 UTC, Jacob Carlborg wrote:
 On 17/06/14 04:27, jicman wrote:
 Greetings!

 I have a bunch of files plain ASCII, UTF8 and UTF16 with and 
 without BOM
 (Byte Order Mark).  I had, "I thought", a nice way of 
 figuring out what
 type of encoding the file was (ASCII, UTF8 or UTF16) when the 
 BOM was
 missing, by reading the content and applying the 
 std.utf.validate
 function to the char[] or, wchar[] string.  The problem is 
 that lately,
 I am hitting into a wall with the "array cast misalignment" 
 when casting
 wchar[].  ie.

 auto text = cast(string) file.read();
 wchar[] temp = cast(wchar[]) text;
How about casting to "wchar[]" directly, instead of going through "string"?
Thanks.
 No, the issue is that the OP is taking an array of smaller 
 elements (probably containing an *ODD* amount of elements), and 
 casting that as a bigger element type. If the original array 
 size is not a multiple of the target element size, then it'll 
 end up "slicing" the last element, and trigger said "array cast 
 misalignment".

 The error message (IMO), is unclear, since "misalignment" 
 usually refers to *position* in memory. I think "array cast 
 length mismatch" would be a better error message.

 In any case, OP, something like:

 auto text = file.read().assumeUnique;
 size_t u = text.length % wchar.sizeof/char.sizeof;
 if (u != 0) {
     // text MUST be of "char[]" type
 } else {
   //OK! Here, the cast is legal: "text" *can* be of type 
 "wchar[]" type.
 }
Jun 17 2014
prev sibling parent "jicman" <cabrera wrc.xerox.com> writes:
On Tuesday, 17 June 2014 at 06:44:40 UTC, Jacob Carlborg wrote:
 On 17/06/14 04:27, jicman wrote:
 Greetings!

 I have a bunch of files plain ASCII, UTF8 and UTF16 with and 
 without BOM
 (Byte Order Mark).  I had, "I thought", a nice way of figuring 
 out what
 type of encoding the file was (ASCII, UTF8 or UTF16) when the 
 BOM was
 missing, by reading the content and applying the 
 std.utf.validate
 function to the char[] or, wchar[] string.  The problem is 
 that lately,
 I am hitting into a wall with the "array cast misalignment" 
 when casting
 wchar[].  ie.

 auto text = cast(string) file.read();
 wchar[] temp = cast(wchar[]) text;
How about casting to "wchar[]" directly, instead of going through "string"?
 What would be the correct process to find out a text file 
 encoding?

 Any help would be greatly appreciated.  This is the code that 
 I have
 right now...
I don't know if you use Tango [1], but it has a module [2] to help with this sort of things. [1] http://dsource.org/projects/tango [2] http://dsource.org/projects/tango/docs/stable/tango.io.UnicodeFile.html
Thanks, but I can't use Tango. Historically, Tango (originally Mango) and Phobos did not play well together, and by the time Tango came along, my project had been written entirely with Phobos, so I have to continue using Phobos.

josé
Jun 17 2014
prev sibling next sibling parent reply "Marc =?UTF-8?B?U2Now7x0eiI=?= <schuetzm gmx.net> writes:
On Tuesday, 17 June 2014 at 02:27:43 UTC, jicman wrote:
 Greetings!

 I have a bunch of files plain ASCII, UTF8 and UTF16 with and 
 without BOM (Byte Order Mark).  I had, "I thought", a nice way 
 of figuring out what type of encoding the file was (ASCII, UTF8 
 or UTF16) when the BOM was missing, by reading the content and 
 applying the std.utf.validate function to the char[] or, 
 wchar[] string.  The problem is that lately, I am hitting into 
 a wall with the "array cast misalignment" when casting wchar[].
  ie.

 auto text = cast(string) file.read();
 wchar[] temp = cast(wchar[]) text;
If the length of the data is odd, it cannot be (valid) UTF16. You can check for that, and skip the test for UTF16 in this case.

Another thing: it is better not to cast the data to `string` before you know that it's actually UTF8. Better make it `ubyte[]`; this way you don't need all the casts inside the if-blocks.
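
As a small sketch of that suggestion (assuming D1 Phobos; loadAsUTF8 is just a hypothetical helper name), keep the data as raw bytes until the encoding is known, and rule out UTF-16 up front when the byte count is odd:

//begin code
import std.file;  // read
import std.utf;   // validate, toUTF8, UtfException

// Hypothetical helper: returns the contents as UTF-8, or null if the
// encoding could not be recognised.
char[] loadAsUTF8(char[] fileName)
{
  // Keep the data as raw bytes until the encoding is known.
  ubyte[] raw = cast(ubyte[]) std.file.read(fileName);

  // Try UTF-8 first, as the original code does.
  char[] narrow = cast(char[]) raw;
  try
  {
    validate(narrow);
    return narrow;
  }
  catch (UtfException e) { /* not UTF-8, keep going */ }

  // Only try UTF-16 when the byte count is even; an odd count cannot
  // be UTF-16, and casting it would raise "array cast misalignment".
  if (raw.length % 2 == 0)
  {
    wchar[] wide = cast(wchar[]) raw;
    try
    {
      validate(wide);
      return toUTF8(wide);
    }
    catch (UtfException e) { /* not UTF-16 either */ }
  }

  return null;  // caller decides how to handle an unknown encoding
}
//end code
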
Jun 17 2014
parent "jicman" <cabrera wrc.xerox.com> writes:
On Tuesday, 17 June 2014 at 12:54:39 UTC, Marc Schütz wrote:
 On Tuesday, 17 June 2014 at 02:27:43 UTC, jicman wrote:
 Greetings!

 I have a bunch of files plain ASCII, UTF8 and UTF16 with and 
 without BOM (Byte Order Mark).  I had, "I thought", a nice way 
 of figuring out what type of encoding the file was (ASCII, 
 UTF8 or UTF16) when the BOM was missing, by reading the 
 content and applying the std.utf.validate function to the 
 char[] or, wchar[] string.  The problem is that lately, I am 
 hitting into a wall with the "array cast misalignment" when 
 casting wchar[].
 ie.

 auto text = cast(string) file.read();
 wchar[] temp = cast(wchar[]) text;
If the length of the data is odd, it cannot be (valid) UTF16. You can check for that, and skip the test for UTF16 in this case. Another thing: it is better not to cast the data to `string` before you know that it's actually UTF8. Better make it `ubyte[]`; this way you don't need all the casts inside the if-blocks.
Indeed. Thanks.
Jun 17 2014
prev sibling parent reply "Jesse Phillips" <Jesse.K.Phillips+D gmail.com> writes:
On Tuesday, 17 June 2014 at 02:27:43 UTC, jicman wrote:
 Greetings!

 I have a bunch of files plain ASCII, UTF8 and UTF16 with and 
 without BOM (Byte Order Mark).  I had, "I thought", a nice way 
 of figuring out what type of encoding the file was (ASCII, UTF8 
 or UTF16) when the BOM was missing, by reading the content and 
 applying the std.utf.validate function to the char[] or, 
 wchar[] string.  The problem is that lately, I am hitting into 
 a wall with the "array cast misalignment" when casting wchar[].
If the BOM is missing and it is not UTF-8, it isn't a valid UTF encoding. Otherwise you have your answer.

Don't cast a char[] to wchar[]; if you have a valid char[], then it must be converted (use std.conv.to).

From some testing, the mentioned even-length requirement for UTF-16 is exactly what caused the "array cast misalignment" error (the array wasn't an even number of bytes).
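
For the cast-versus-convert distinction, a minimal sketch assuming D1 Phobos, where the conversion routine is std.utf.toUTF16 (std.conv.to is its D2 counterpart); toWide is just a hypothetical name:

//begin code
import std.utf;  // validate, toUTF16

wchar[] toWide(char[] text)
{
  // cast(wchar[]) text would only reinterpret the bytes: it requires an
  // even length and yields garbage code units for real UTF-8 data.
  // Converting re-encodes the characters instead:
  validate(text);               // throws UtfException if not valid UTF-8
  return std.utf.toUTF16(text); // char[] (UTF-8) -> wchar[] (UTF-16)
}
//end code
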
Jun 17 2014
parent "jicman" <cabrera wrc.xerox.com> writes:
On Wednesday, 18 June 2014 at 02:25:34 UTC, Jesse Phillips wrote:
 On Tuesday, 17 June 2014 at 02:27:43 UTC, jicman wrote:
 Greetings!

 I have a bunch of files plain ASCII, UTF8 and UTF16 with and 
 without BOM (Byte Order Mark).  I had, "I thought", a nice way 
 of figuring out what type of encoding the file was (ASCII, 
 UTF8 or UTF16) when the BOM was missing, by reading the 
 content and applying the std.utf.validate function to the 
 char[] or, wchar[] string.  The problem is that lately, I am 
 hitting into a wall with the "array cast misalignment" when 
 casting wchar[].
If the BOM is missing and it is not UTF-8, it isn't a valid UTF encoding. Otherwise you have your answer. Don't cast a char[] to wchar[], if you have a valid char[] then it must be converted (use std.conv.to); Some testing, the mentioned check for UTF-16 being even is exactly what caused the "array cast misalignment" error (the array wasn't an even number of bytes).
Thanks.
Jun 18 2014