digitalmars.D.learn - Reading and writing Unicode files

jicman <cabrera_ _wrc.xerox.com> writes:

Greetings.

Sorry guys, please be patient with me.  I am having a hard time understanding
these Unicode, ANSI, and UTF* ideas.  I know how to take a UTF-8 file and turn
it into ANSI, and I know how to take an ANSI file and turn it into a UTF file.
But now I have a Unicode file, and I need to change the content and create a
new Unicode file with the changes.  I have read in all kinds of places, and I
found mtext, from Chris Miller's site, by reading

http://www.prowiki.org/wiki4d/wiki.cgi?DanielKeep/TextInD

Anyway, what I need is to read a Unicode file, search the strings inside, make
changes, and write the result back to a Unicode file.

Any help would be greatly appreciated.

thanks,

josé

Feb 27 2009

downs <default_357-line yahoo.de> writes:


Wow, you're in luck!

D is all Unicode.

Just do import std.file; auto text = cast(string) filename.read(); make your
changes; then filename.write(cast(void[]) text);

and you're done.
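
For readers on a current D2 compiler, here is a minimal sketch of the same read, change, write cycle. It assumes the file really is UTF-8; the helper name replaceInFile and the file name are made up for the example, and std.file.readText is used in place of the cast because it also checks that the bytes are well-formed UTF-8.

```d
import std.array : replace;
import std.file : readText, remove, write;

// Hypothetical helper: replace every occurrence of `from` with `to`
// in a UTF-8 encoded file, in place.
void replaceInFile(string path, string from, string to)
{
    // readText throws if the file is not well-formed UTF-8.
    string text = readText(path);
    text = text.replace(from, to);
    // A string is written out as its UTF-8 bytes, unchanged.
    write(path, text);
}

void main()
{
    // Hypothetical file name, created here only for the demonstration.
    enum path = "utf8_demo.txt";
    write(path, "Gr\u00FC\u00DFe, Welt\n");

    replaceInFile(path, "Welt", "D");
    assert(readText(path) == "Gr\u00FC\u00DFe, D\n");

    remove(path);
}
```

Note that nothing here writes or strips a byte order mark; the bytes on disk are exactly the UTF-8 code units of the string.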

Feb 28 2009

downs <default_357-line yahoo.de> writes:


PS: You may need to do detection for UTF-16. In that case, just cast to a
wstring instead, then (optionally) use std.utf.toUTF8.
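
A rough sketch of that idea in current D2, under the assumption that the bytes are UTF-16 in the machine's own byte order (little-endian on x86). The function name utf16ToUtf8 is invented for the example; std.utf.toUTF8 is the real Phobos call.

```d
import std.utf : toUTF8;

// Hypothetical helper: reinterpret raw bytes as native-endian UTF-16
// code units and transcode them to UTF-8.
string utf16ToUtf8(const(ubyte)[] raw)
{
    if (raw.length % 2 != 0)
        throw new Exception("UTF-16 data must contain an even number of bytes");

    // The cast only reinterprets the bytes; no conversion happens here.
    auto units = cast(const(wchar)[]) raw;

    // Drop a leading byte order mark (U+FEFF) if there is one.
    if (units.length > 0 && units[0] == 0xFEFF)
        units = units[1 .. $];

    return toUTF8(units);
}

void main()
{
    // Stand-in for file contents: a BOM followed by UTF-16 text.
    wstring sample = "\uFEFFGr\u00FC\u00DFe"w;
    auto raw = cast(const(ubyte)[]) sample;

    assert(utf16ToUtf8(raw) == "Gr\u00FC\u00DFe");
}
```

If the file's byte order differs from the machine's, the cast yields byte-swapped code units, so this shortcut only covers the native-order case.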

Feb 28 2009

jicman <cabrera_ _wrc.xerox.com> writes:


Shouldn't auto take care of that?

Feb 28 2009

Daniel Keep <daniel.keep.lists gmail.com> writes:


The compiler doesn't know what format your text files are in.  auto does
type inference.
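
To make that concrete, a small D2 sketch: auto takes its type from the expression on the right at compile time, so the cast decides the type and the file's bytes play no part. The helper name isWellFormedUtf8 is an assumption for the example; std.utf.validate is the real Phobos function.

```d
import std.string : representation;
import std.utf : UTFException, validate;

// Hypothetical helper: report whether the bytes are well-formed UTF-8.
bool isWellFormedUtf8(const(ubyte)[] raw)
{
    // auto infers const(char)[] from the cast expression.
    // The compiler never looks at what the bytes contain.
    auto text = cast(const(char)[]) raw;
    static assert(is(typeof(text) == const(char)[]));

    try
    {
        validate(text); // throws UTFException on malformed UTF-8
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

void main()
{
    // Genuine UTF-8 bytes pass the check.
    assert(isWellFormedUtf8("h\u00E9llo".representation));

    // "hi" stored as UTF-16LE with a BOM: the cast still compiles,
    // but the data is not valid UTF-8.
    assert(!isWellFormedUtf8([0xFF, 0xFE, 0x68, 0x00, 0x69, 0x00]));
}
```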

  -- Daniel

Feb 28 2009

Jarrett Billingsley <jarrett.billingsley gmail.com> writes:


You seem to be distinguishing between UTF and Unicode; it's kind of
apples to oranges.  Unicode is a standard for character encoding (a
mapping from numbers to characters, like ASCII).  UTF is a way - or
rather, _several_ ways - of encoding Unicode text.  There are three
major encodings, UTF-8, UTF-16, and UTF-32 (and the 16- and 32-bit
encodings have both little- and big-endian versions), which correspond
to D's char[], wchar[], and dchar[].
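
As a quick illustration (a D2 sketch; the function name codeUnitCounts is made up), the same four characters take a different number of code units in each encoding:

```d
import std.utf : toUTF16, toUTF32;

// Hypothetical helper: code unit counts of the same text as
// UTF-8 (char), UTF-16 (wchar) and UTF-32 (dchar).
size_t[3] codeUnitCounts(string s)
{
    return [s.length, toUTF16(s).length, toUTF32(s).length];
}

void main()
{
    // U+0061, U+00E9, U+20AC, U+1F600
    auto counts = codeUnitCounts("a\u00E9\u20AC\U0001F600");

    assert(counts[0] == 10); // UTF-8:  1 + 2 + 3 + 4 bytes
    assert(counts[1] == 5);  // UTF-16: 1 + 1 + 1 + 2 (surrogate pair)
    assert(counts[2] == 4);  // UTF-32: one unit per code point
}
```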

When you say a "Unicode" file do you mean it's encoded in UTF-16?  If
so, you can just read the file's contents as a wchar[].  If you're
using Phobos, keep in mind that it provides no functionality for
searching or manipulating wchar[]s, which means you'll have to convert
it to UTF-8 (char[]).  If you're using Tango, you can give
tango.io.UnicodeFile a shot - it will automatically transcode a file
from any Unicode encoding to any other, and if your file has a BOM, it
can even automatically detect which encoding it's in.
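
BOM detection itself is only a prefix comparison. A hand-rolled D2 sketch, not Tango's implementation; the names Bom, hasPrefix and detectBom are invented for the example:

```d
// Hypothetical names throughout; this is not a library API.
enum Bom { none, utf8, utf16le, utf16be, utf32le, utf32be }

bool hasPrefix(const(ubyte)[] data, const(ubyte)[] prefix)
{
    return data.length >= prefix.length && data[0 .. prefix.length] == prefix;
}

Bom detectBom(const(ubyte)[] data)
{
    // UTF-32 must be tested before UTF-16, because the UTF-32 LE mark
    // (FF FE 00 00) begins with the UTF-16 LE mark (FF FE).
    if (hasPrefix(data, [0x00, 0x00, 0xFE, 0xFF])) return Bom.utf32be;
    if (hasPrefix(data, [0xFF, 0xFE, 0x00, 0x00])) return Bom.utf32le;
    if (hasPrefix(data, [0xEF, 0xBB, 0xBF]))       return Bom.utf8;
    if (hasPrefix(data, [0xFE, 0xFF]))             return Bom.utf16be;
    if (hasPrefix(data, [0xFF, 0xFE]))             return Bom.utf16le;
    return Bom.none;
}

void main()
{
    assert(detectBom([0xFF, 0xFE, 0x68, 0x00]) == Bom.utf16le);
    assert(detectBom([0xEF, 0xBB, 0xBF, 0x68]) == Bom.utf8);
    assert(detectBom([0x68, 0x69]) == Bom.none);
}
```

One caveat: a UTF-16 LE file whose first character is U+0000 also starts with FF FE 00 00, so the UTF-32 LE answer is a guess in that rare case.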

Feb 28 2009

jicman <cabrera_ _wrc.xerox.com> writes:


OK, the only reason I say Unicode is that when I open the file in Notepad and
do a Save As, the Encoding says Unicode.  When I read this file and write it
back to another file, the Encoding turns to UTF-8.  I want to keep it as
Unicode.

I will give the suggestion a try; I have not tried it yet.  Maybe Phobos should
think about taking care of the BOM bytes and provide support for these
encodings.  I am a big fan of Phobos. :-)  I have not tried Tango yet, because
I would have to uninstall Phobos, and I have just spent two years using Phobos.
We already have an application based on Phobos, and changing over to Tango
would slow us down and set us back.  Maybe version 2.0.

Thanks, Jarrett.

josé

Feb 28 2009

Daniel Keep <daniel.keep.lists gmail.com> writes:

jicman wrote:
 OK, the only reason I say Unicode is that when I open the file in Notepad and
 do a Save As, the Encoding says Unicode.  When I read this file and write it
 back to another file, the Encoding turns to UTF-8.  I want to keep it as
 Unicode.

There is no such thing as a Unicode file format.  There just isn't.  I
know the option you speak of, and I have no idea what it's supposed to
be; probably UCS-2 or UTF-16.

 I will give the suggestion a try; I have not tried it yet.  Maybe Phobos should
 think about taking care of the BOM bytes and provide support for these
 encodings.  I am a big fan of Phobos. :-)  I have not tried Tango yet, because
 I would have to uninstall Phobos, and I have just spent two years using Phobos.
 We already have an application based on Phobos, and changing over to Tango
 would slow us down and set us back.  Maybe version 2.0.

There's std.stream.EndianStream, which looks like it can read and write
BOMs.  As for converting between UTF encodings, std.utf.
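
For the std.utf part, a short D2 sketch of the three conversion functions; the sample text is arbitrary. These functions transcode between code unit types only. They do not read, write, or strip BOMs, and the result is always in the machine's native byte order.

```d
import std.utf : toUTF16, toUTF32, toUTF8;

void main()
{
    string  u8  = "na\u00EFve \u20AC";
    wstring u16 = toUTF16(u8);   // UTF-8  -> UTF-16
    dstring u32 = toUTF32(u16);  // UTF-16 -> UTF-32
    string  back = toUTF8(u32);  // UTF-32 -> UTF-8

    // The text survives the round trip unchanged.
    assert(back == u8);
    assert(u16 == "na\u00EFve \u20AC"w);
    assert(u32 == "na\u00EFve \u20AC"d);
}
```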

  -- Daniel

Feb 28 2009

jicman <cabrera_ _wrc.xerox.com> writes:


Thanks, Daniel...

I could not get the above to work, maybe for lack of understanding and examples,
but the code below is working.  For now, the 1000+ XML files I have all carry
the same BOM (UTF16_le), so it will work at least for now.  However, I need to
fill in the rest later.  I have a question about the code below, which works
for UTF16_le:

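import std.stdio;
import std.file;
import std.string;
import std.utf;

// Identify the byte order mark in the first four bytes of a file.
char[] getBOM(ubyte[] t)
{
  ubyte[] UTF32_be = [0x00,0x00,0xfe,0xff];
  ubyte[] UTF32_le = [0xff,0xfe,0x00,0x00];
  ubyte[] UTF8 = [0xef,0xbb,0xbf];
  ubyte[] UTF16_be = [0xfe,0xff];
  ubyte[] UTF16_le = [0xff,0xfe];
  // The UTF-32 checks come first, because the UTF-32 LE mark
  // starts with the same two bytes as the UTF-16 LE mark.
  if(t == UTF32_be)
    return "UTF32_be";
  if(t == UTF32_le)
    return "UTF32_le";
  if(t[0 .. 3] == UTF8)
    return "UTF8";
  if(t[0 .. 2] == UTF16_be)
    return "UTF16_be";
  if(t[0 .. 2] == UTF16_le)
    return "UTF16_le";
  return "NO_BOM";
}

void main()
{
  char[] f0 = "Unicode.ttx.xml";
  char[] f1 = "UnicodeNew.ttx.xml";
  auto text = cast(string) f0.read();
  ubyte[4] b = cast(ubyte[]) text[0 .. 4];
  char[] bom = getBOM(b);
  char[] wText;
  writefln(bom);
  // Check the length first: "UTF8" has only four characters,
  // so bom[0 .. 5] would be out of bounds for it.
  if (bom.length >= 5 && bom[0 .. 5] == "UTF16")
  {
    // The BOM stays in the text as its first character, so it is
    // carried through the conversions and written back out below.
    wchar[] temp = cast(wchar[]) text; //text[2 .. $];
    wText = std.utf.toUTF8(temp);
  }
  else if (bom.length >= 5 && bom[0 .. 5] == "UTF32")
  {
  }
  else if (bom == "UTF8")
  {
  }
  
  // find returns -1 when there is no match, so test for >= 0.
  if (std.string.find(wText,"DisplayText=\"TrixieTag\">") >= 0)
  {
    writefln("Found Trixie Tags in " ~ f0);
    // Keep the text before the first TrixieTag element ...
    char[][] nt = std.string.split(wText,`<ut Style="external" DisplayText="TrixieTag">`);
    wText = nt[0];
    // ... and the text between its closing tag and the next "</ut>".
    nt = std.string.split(nt[1],"</ut>");
    wText ~= nt[1];
    wchar[] eText = std.utf.toUTF16(wText);
    f1.write(cast(void[]) eText);
  }
}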


The question: what happens when I get a UTF16_be (big-endian) file?  Will the
call to std.utf.toUTF16(wText) take care of the BOM?

Thanks so much for all the support.

josé

Mar 03 2009
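
The thread ends without an answer, so here is a hedged note on that last question. std.utf.toUTF16 only transcodes to UTF-16 code units in the machine's native byte order; it does not add or remove a BOM and does not swap bytes. The posted code keeps the BOM only because the mark is left in the text as U+FEFF and survives the round trip. On a little-endian machine, a big-endian file cast straight to wchar[] would give byte-swapped code units. One way to handle both byte orders is to build the code units from byte pairs by hand. The D2 sketch below does that; ByteOrder, decodeUtf16 and encodeUtf16 are made-up names, not Phobos APIs.

```d
import std.utf : toUTF16, toUTF8;

// Hypothetical names; none of these are part of Phobos.
enum ByteOrder { little, big }

// Decode UTF-16 bytes that begin with a BOM. The detected byte order is
// returned through `order` so the output file can use the same one.
string decodeUtf16(const(ubyte)[] raw, out ByteOrder order)
{
    if (raw.length < 2 || raw.length % 2 != 0)
        throw new Exception("not a complete UTF-16 byte sequence");

    if (raw[0] == 0xFF && raw[1] == 0xFE)
        order = ByteOrder.little;
    else if (raw[0] == 0xFE && raw[1] == 0xFF)
        order = ByteOrder.big;
    else
        throw new Exception("no UTF-16 byte order mark found");

    // Assemble each code unit from its two bytes, skipping the BOM.
    // This works the same way whatever the machine's own byte order is.
    auto units = new wchar[](raw.length / 2 - 1);
    foreach (i, ref u; units)
    {
        immutable uint b0 = raw[2 + 2 * i];
        immutable uint b1 = raw[3 + 2 * i];
        u = cast(wchar)(order == ByteOrder.little ? (b0 | (b1 << 8))
                                                  : ((b0 << 8) | b1));
    }
    return toUTF8(units);
}

// Encode text as UTF-16 bytes in the requested byte order, BOM first.
ubyte[] encodeUtf16(string text, ByteOrder order)
{
    ubyte[] result;
    if (order == ByteOrder.little)
        result = [0xFF, 0xFE];
    else
        result = [0xFE, 0xFF];

    foreach (wchar u; toUTF16(text))
    {
        immutable lo = cast(ubyte)(u & 0xFF);
        immutable hi = cast(ubyte)(u >> 8);
        if (order == ByteOrder.little) { result ~= lo; result ~= hi; }
        else                           { result ~= hi; result ~= lo; }
    }
    return result;
}

void main()
{
    // "hi" as big-endian UTF-16 with a BOM.
    const(ubyte)[] be = [0xFE, 0xFF, 0x00, 0x68, 0x00, 0x69];

    ByteOrder order;
    string text = decodeUtf16(be, order);
    assert(text == "hi");
    assert(order == ByteOrder.big);

    // Writing it back with the same byte order reproduces the input.
    assert(encodeUtf16(text, order) == be);
}
```

With helpers like these, the search-and-replace step can stay in UTF-8, and the file is written back in whatever byte order it arrived in.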
