
digitalmars.D.learn - D1: UTF8 char[] casting to wchar[] array cast misalignment ERROR

reply "jicman" <cabrera wrc.xerox.com> writes:
Greetings!

I have a bunch of files (plain ASCII, UTF8 and UTF16), with and 
without a BOM (Byte Order Mark).  I had, I thought, a nice way of 
figuring out what type of encoding a file used (ASCII, UTF8 or 
UTF16) when the BOM was missing: read the content and apply the 
std.utf.validate function to the char[] or wchar[] string.  The 
problem is that lately I am hitting a wall with the "array cast 
misalignment" error when casting to wchar[], i.e.

auto text = cast(string) file.read();
wchar[] temp = cast(wchar[]) text;

What would be the correct process to find out a text file's 
encoding?

Any help would be greatly appreciated.  This is the code that I 
have right now...

//begin code
char[] ReadFileData2UTF8(char[] file, out char[] bom)
{
   auto text = cast(string) file.read();
   if (text.length == 0)
   {
     bom = "NO_BOM";
     return "";
   }
   else if (text.length == 1)
   {
     ubyte[1] b = cast(ubyte[]) text[0 .. 1];
     bom = getBOM(b);
   }
   else if (text.length == 2)
   {
     ubyte[2] b = cast(ubyte[]) text[0 .. 2];
     bom = getBOM(b);
   }
   else if (text.length == 3)
   {
     ubyte[3] b = cast(ubyte[]) text[0 .. 3];
     bom = getBOM(b);
   }
   else if (text.length > 3)
   {
     ubyte[4] b = cast(ubyte[]) text[0 .. 4];
     bom = getBOM(b);
   }
   //writefln(bom);
   if (std.string.find(bom, "UTF16") == 0)
   {
     ubyte[] bs = cast(ubyte[]) text;
     if (bs[0 .. 2] == UTF16_be || bs[0 .. 2] == UTF16_le)
       bs = bs[2 .. $];
     text = cast(char[]) bs;
     wchar[] temp = cast(wchar[]) text; //text[2 .. $];
     text = std.utf.toUTF8(temp);
   }
   else if (std.string.find(bom, "UTF32") == 0)
   {
     ubyte[] bs = cast(ubyte[]) text;
     if (bs[0 .. 4] == UTF32_be || bs[0 .. 4] == UTF32_le)
       bs = bs[4 .. $];
     text = cast(char[]) bs;
     dchar[] temp = cast(dchar[]) text; //text[2 .. $];
     text = std.utf.toUTF8(temp);
   }
   else if (bom == "UTF8")
   {
     ubyte[] bs = cast(ubyte[]) text;
     if (bs[0 .. 3] == UTF8)
       bs = bs[3 .. $];
     text = cast(char[]) bs;
     // text is already UTF8
   }
   else // hoping I can figure out the type...
   {
     //msgBox("No BOM");
     //ubyte[] bs = cast(ubyte[]) text;
     try // utf8
     {
       validate(text);
       bom = "UTF8";
     }
     catch (UtfException e)
     {
       //msgBox("Failed UTF8. Trying UTF16");
       //text = cast(char[]) bs;
       //if ((text.length % 2) == 1)
       //  text ~= " ";
       try //utf16
       {
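          // This is the cast that raises "array cast misalignment" when
          // text.length is odd, i.e. not a multiple of wchar.sizeof.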
         wchar[] temp = cast(wchar[]) text; //text[2 .. $];
         //wchar[] temp = std.utf.toUTF16(text); //text[2 .. $];
         validate(temp);
         text = std.utf.toUTF8(temp);
         bom = "UTF16_le";
       }
       catch (UtfException e)
       {
         //msgBox("Failed UTF16. Trying UTF32");
         //text = cast(char[]) bs;
         try // utf32
         {
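            // Likewise, this cast fails unless text.length is a
            // multiple of dchar.sizeof (4 bytes).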
           dchar[] temp = cast(dchar[]) text; //text[2 .. $];
           //dchar[] temp = std.utf.toUTF32(text); //text[2 .. $];
           validate(temp);
           text = std.utf.toUTF8(temp);
           bom = "UTF32_le";
         }
         catch (UtfException e) // hoping for ASCII
         {
           //msgBox("Failed UTF32. Hoping ASCII");
           text ~= "\000";
           char[] temp = std.windows.charset.fromMBSz(text.ptr,0);
           text = std.utf.toUTF8(temp);
           //text = temp;
           bom = "NO_BOM";
         }
       }
     }
   }
   return text;
}
//end code
Jun 16 2014
Jacob Carlborg <doob me.com> writes:
On 17/06/14 04:27, jicman wrote:
 Greetings!

 I have a bunch of files plain ASCII, UTF8 and UTF16 with and without BOM
 (Byte Order Mark).  I had, "I thought", a nice way of figuring out what
 type of encoding the file was (ASCII, UTF8 or UTF16) when the BOM was
 missing, by reading the content and applying the std.utf.validate
 function to the char[] or, wchar[] string.  The problem is that lately,
 I am hitting into a wall with the "array cast misalignment" when casting
 wchar[].  ie.

 auto text = cast(string) file.read();
 wchar[] temp = cast(wchar[]) text;
How about casting to "wchar[]" directly, instead of going through "string"?
 What would be the correct process to find out a text file encoding?

 Any help would be greatly appreciated.  This is the code that I have
 right now...
I don't know if you use Tango [1], but it has a module [2] to help with this sort of thing.

[1] http://dsource.org/projects/tango
[2] http://dsource.org/projects/tango/docs/stable/tango.io.UnicodeFile.html

--
/Jacob Carlborg
Jun 16 2014
next sibling parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Tuesday, 17 June 2014 at 06:44:40 UTC, Jacob Carlborg wrote:
 On 17/06/14 04:27, jicman wrote:
 Greetings!

 I have a bunch of files plain ASCII, UTF8 and UTF16 with and 
 without BOM
 (Byte Order Mark).  I had, "I thought", a nice way of figuring 
 out what
 type of encoding the file was (ASCII, UTF8 or UTF16) when the 
 BOM was
 missing, by reading the content and applying the 
 std.utf.validate
 function to the char[] or, wchar[] string.  The problem is 
 that lately,
 I am hitting into a wall with the "array cast misalignment" 
 when casting
 wchar[].  ie.

 auto text = cast(string) file.read();
 wchar[] temp = cast(wchar[]) text;
How about casting to "wchar[]" directly, instead of going through "string"?
No, the issue is that the OP is taking an array of smaller elements (probably containing an *ODD* number of elements), and casting that as a bigger element type. If the original array size is not a multiple of the target element size, then it'll end up "slicing" the last element, and trigger said "array cast misalignment".

The error message, IMO, is unclear, since "misalignment" usually refers to *position* in memory. I think "array cast length mismatch" would be a better error message.

In any case, OP, something like:

auto text = file.read().assumeUnique;
size_t u = text.length % wchar.sizeof/char.sizeof;
if (u != 0) {
    // text MUST be of "char[]" type
} else {
    // OK! Here the cast is legal: "text" *can* be of type "wchar[]".
}
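
As a minimal sketch (assuming D1 Phobos; tryAsUTF16 is just a hypothetical helper name), the even-length check can be combined with validate before trying the wchar[] interpretation:

//begin code
import std.utf;  // validate, toUTF8, UtfException

// Hypothetical helper: returns the data transcoded to UTF-8 if it
// validates as UTF-16 (native byte order), or null if it cannot be UTF-16.
char[] tryAsUTF16(ubyte[] raw)
{
  // An odd byte count can never be UTF-16; casting anyway is what
  // triggers the "array cast misalignment" error.
  if (raw.length % wchar.sizeof != 0)
    return null;
  wchar[] temp = cast(wchar[]) raw;
  try
  {
    validate(temp);
    return toUTF8(temp);
  }
  catch (UtfException e)
  {
    return null;
  }
}
//end code
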
Jun 17 2014
parent "jicman" <cabrera wrc.xerox.com> writes:
On Tuesday, 17 June 2014 at 07:49:59 UTC, monarch_dodra wrote:
 On Tuesday, 17 June 2014 at 06:44:40 UTC, Jacob Carlborg wrote:
 On 17/06/14 04:27, jicman wrote:
 Greetings!

 I have a bunch of files plain ASCII, UTF8 and UTF16 with and 
 without BOM
 (Byte Order Mark).  I had, "I thought", a nice way of 
 figuring out what
 type of encoding the file was (ASCII, UTF8 or UTF16) when the 
 BOM was
 missing, by reading the content and applying the 
 std.utf.validate
 function to the char[] or, wchar[] string.  The problem is 
 that lately,
 I am hitting into a wall with the "array cast misalignment" 
 when casting
 wchar[].  ie.

 auto text = cast(string) file.read();
 wchar[] temp = cast(wchar[]) text;
How about casting to "wchar[]" directly, instead of going through "string"?
Thanks.
 No, the issue is that the OP is taking an array of smaller 
 elements (probably containing an *ODD* amount of elements), and 
 casting that as a bigger element type. If the original array 
 size is not a multiple of the target element size, then it'll 
 end up "slicing" the last element, and trigger said "array cast 
 misalignment".

 The error message (IMO), is unclear, since "misalignment" 
 usually refers to *position* in memory. I think "array cast 
 length mismatch" would be a better error message.

 In any case, OP, something like:

 auto text = file.read().assumeUnique;
 size_t u = text.length % wchar.sizeof/char.sizeof;
 if (u != 0) {
     // text MUST be of "char[]" type
 } else {
   //OK! Here, the cast is legal: "text" *can* be of type 
 "wchar[]" type.
 }
Jun 17 2014
prev sibling parent "jicman" <cabrera wrc.xerox.com> writes:
On Tuesday, 17 June 2014 at 06:44:40 UTC, Jacob Carlborg wrote:
 On 17/06/14 04:27, jicman wrote:
 Greetings!

 I have a bunch of files plain ASCII, UTF8 and UTF16 with and 
 without BOM
 (Byte Order Mark).  I had, "I thought", a nice way of figuring 
 out what
 type of encoding the file was (ASCII, UTF8 or UTF16) when the 
 BOM was
 missing, by reading the content and applying the 
 std.utf.validate
 function to the char[] or, wchar[] string.  The problem is 
 that lately,
 I am hitting into a wall with the "array cast misalignment" 
 when casting
 wchar[].  ie.

 auto text = cast(string) file.read();
 wchar[] temp = cast(wchar[]) text;
How about casting to "wchar[]" directly, instead of going through "string"?
 What would be the correct process to find out a text file 
 encoding?

 Any help would be greatly appreciated.  This is the code that 
 I have
 right now...
I don't know if you use Tango [1], but it has a module [2] to help with this sort of things. [1] http://dsource.org/projects/tango [2] http://dsource.org/projects/tango/docs/stable/tango.io.UnicodeFile.html
Thanks, but I can't use Tango. Historically, Tango (originally Mango) and Phobos did not play well together, and by the time Tango came along, my project had been written entirely with Phobos, so I have to continue using Phobos.

josé
Jun 17 2014
prev sibling next sibling parent reply "Marc =?UTF-8?B?U2Now7x0eiI=?= <schuetzm gmx.net> writes:
On Tuesday, 17 June 2014 at 02:27:43 UTC, jicman wrote:
 Greetings!

 I have a bunch of files plain ASCII, UTF8 and UTF16 with and 
 without BOM (Byte Order Mark).  I had, "I thought", a nice way 
 of figuring out what type of encoding the file was (ASCII, UTF8 
 or UTF16) when the BOM was missing, by reading the content and 
 applying the std.utf.validate function to the char[] or, 
 wchar[] string.  The problem is that lately, I am hitting into 
 a wall with the "array cast misalignment" when casting wchar[].
  ie.

 auto text = cast(string) file.read();
 wchar[] temp = cast(wchar[]) text;
If the length of the data is odd, it cannot be (valid) UTF16. You can check for that, and skip the test for UTF16 in this case.

Another thing: it is better not to cast the data to `string` before you know that it's actually UTF8. Better make it `ubyte[]`; this way you don't need all the casts inside the if-blocks.
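
As a small sketch of that suggestion (assuming D1 Phobos; loadAsUTF8 is just a hypothetical helper name), keep the data as raw bytes until the encoding is known, and rule out UTF-16 up front when the byte count is odd:

//begin code
import std.file;  // read
import std.utf;   // validate, toUTF8, UtfException

// Hypothetical helper: returns the contents as UTF-8, or null if the
// encoding could not be recognised.
char[] loadAsUTF8(char[] fileName)
{
  // Keep the data as raw bytes until the encoding is known.
  ubyte[] raw = cast(ubyte[]) std.file.read(fileName);

  // Try UTF-8 first, as the original code does.
  char[] narrow = cast(char[]) raw;
  try
  {
    validate(narrow);
    return narrow;
  }
  catch (UtfException e) { /* not UTF-8, keep going */ }

  // Only try UTF-16 when the byte count is even; an odd count cannot
  // be UTF-16, and casting it would raise "array cast misalignment".
  if (raw.length % 2 == 0)
  {
    wchar[] wide = cast(wchar[]) raw;
    try
    {
      validate(wide);
      return toUTF8(wide);
    }
    catch (UtfException e) { /* not UTF-16 either */ }
  }

  return null;  // caller decides how to handle an unknown encoding
}
//end code
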
Jun 17 2014
parent "jicman" <cabrera wrc.xerox.com> writes:
On Tuesday, 17 June 2014 at 12:54:39 UTC, Marc Schütz wrote:
 On Tuesday, 17 June 2014 at 02:27:43 UTC, jicman wrote:
 Greetings!

 I have a bunch of files plain ASCII, UTF8 and UTF16 with and 
 without BOM (Byte Order Mark).  I had, "I thought", a nice way 
 of figuring out what type of encoding the file was (ASCII, 
 UTF8 or UTF16) when the BOM was missing, by reading the 
 content and applying the std.utf.validate function to the 
 char[] or, wchar[] string.  The problem is that lately, I am 
 hitting into a wall with the "array cast misalignment" when 
 casting wchar[].
 ie.

 auto text = cast(string) file.read();
 wchar[] temp = cast(wchar[]) text;
If the length of the data is odd, it cannot be (valid) UTF16. You can check for that, and skip the test for UTF16 in this case. Another thing: it is better not to cast the data to `string` before you know that it's actually UTF8. Better make it `ubyte[]`; this way you don't need all the casts inside the if-blocks.
Indeed. Thanks.
Jun 17 2014
prev sibling parent reply "Jesse Phillips" <Jesse.K.Phillips+D gmail.com> writes:
On Tuesday, 17 June 2014 at 02:27:43 UTC, jicman wrote:
 Greetings!

 I have a bunch of files plain ASCII, UTF8 and UTF16 with and 
 without BOM (Byte Order Mark).  I had, "I thought", a nice way 
 of figuring out what type of encoding the file was (ASCII, UTF8 
 or UTF16) when the BOM was missing, by reading the content and 
 applying the std.utf.validate function to the char[] or, 
 wchar[] string.  The problem is that lately, I am hitting into 
 a wall with the "array cast misalignment" when casting wchar[].
If the BOM is missing and it is not UTF-8, it isn't a valid UTF encoding. Otherwise you have your answer.

Don't cast a char[] to wchar[]; if you have a valid char[], then it must be converted (use std.conv.to).

From some testing, the mentioned even-length requirement for UTF-16 is exactly what caused the "array cast misalignment" error (the array wasn't an even number of bytes).
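
For the cast-versus-convert distinction, a minimal sketch assuming D1 Phobos, where the conversion routine is std.utf.toUTF16 (std.conv.to is its D2 counterpart); toWide is just a hypothetical name:

//begin code
import std.utf;  // validate, toUTF16

wchar[] toWide(char[] text)
{
  // cast(wchar[]) text would only reinterpret the bytes: it requires an
  // even length and yields garbage code units for real UTF-8 data.
  // Converting re-encodes the characters instead:
  validate(text);               // throws UtfException if not valid UTF-8
  return std.utf.toUTF16(text); // char[] (UTF-8) -> wchar[] (UTF-16)
}
//end code
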
Jun 17 2014
parent "jicman" <cabrera wrc.xerox.com> writes:
On Wednesday, 18 June 2014 at 02:25:34 UTC, Jesse Phillips wrote:
 On Tuesday, 17 June 2014 at 02:27:43 UTC, jicman wrote:
 Greetings!

 I have a bunch of files plain ASCII, UTF8 and UTF16 with and 
 without BOM (Byte Order Mark).  I had, "I thought", a nice way 
 of figuring out what type of encoding the file was (ASCII, 
 UTF8 or UTF16) when the BOM was missing, by reading the 
 content and applying the std.utf.validate function to the 
 char[] or, wchar[] string.  The problem is that lately, I am 
 hitting into a wall with the "array cast misalignment" when 
 casting wchar[].
If the BOM is missing and it is not UTF-8, it isn't a valid UTF encoding. Otherwise you have your answer. Don't cast a char[] to wchar[], if you have a valid char[] then it must be converted (use std.conv.to); Some testing, the mentioned check for UTF-16 being even is exactly what caused the "array cast misalignment" error (the array wasn't an even number of bytes).
Thanks.
Jun 18 2014