digitalmars.D.learn - Reading and writing Unicode files

jicman <cabrera_ _wrc.xerox.com> writes:
Greetings.

Sorry guys, please be patient with me.  I am having a hard time understanding
these Unicode, ANSI, and UTF* ideas.  I know how to take a UTF-8 file and turn
it into ANSI, and I know how to take an ANSI file and turn it into a UTF file.
But now I have a Unicode file, and I need to change its content and create a
new Unicode file with the changes.  I have read all kinds of places, and I
found mtext, from Chris Miller's site, by reading

http://www.prowiki.org/wiki4d/wiki.cgi?DanielKeep/TextInD

Anyway, what I need is to read a Unicode file, search the strings inside, make
changes, and write the result back to a Unicode file.

Any help would be greatly appreciated.

thanks,

josé
Feb 27 2009
downs <default_357-line yahoo.de> writes:
Wow, you're in luck! D is all Unicode. Just do

    import std.file;
    auto text = cast(string) filename.read();
    // do your changes
    filename.write(cast(void[]) text);

and you're done.
Feb 28 2009
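
The read, change, write approach suggested above could be sketched as follows. This is only a sketch under assumptions: it targets current D2 Phobos rather than the D1 Phobos used in this thread, it assumes the input file really is UTF-8, and the helper name rewriteUtf8File is made up for illustration.

```d
import std.array : replace;
import std.file : readText, write;

/// Hypothetical helper: reads a UTF-8 file, replaces every occurrence of
/// `from` with `to`, and writes the result to `outPath`.
void rewriteUtf8File(string inPath, string outPath, string from, string to)
{
    // readText checks that the bytes are well-formed UTF-8 and throws if
    // they are not; it does not convert from other encodings.
    string text = readText(inPath);
    write(outPath, text.replace(from, to));
}
```

Calling rewriteUtf8File("in.xml", "out.xml", "old", "new") would leave the output in UTF-8 as well, because the bytes are never transcoded. A UTF-16 file needs the extra steps discussed later in the thread.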
downs <default_357-line yahoo.de> writes:
PS: You may need to do detection for UTF-16. In that case, just cast to a wstring instead, then (optionally) use std.utf.toUTF8.
Feb 28 2009
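
The detection step mentioned in the PS usually means looking at the byte-order mark (BOM) at the start of the file. A minimal sketch in current D2 follows; the Bom enum and detectBom function are hypothetical names, not part of Phobos, and a file without a BOM simply reports Bom.none.

```d
/// Encodings that can be identified from a byte-order mark.
enum Bom { none, utf8, utf16le, utf16be, utf32le, utf32be }

/// Returns the BOM found at the start of `data`, or Bom.none.
Bom detectBom(const(ubyte)[] data)
{
    static immutable ubyte[] utf32le = [0xFF, 0xFE, 0x00, 0x00];
    static immutable ubyte[] utf32be = [0x00, 0x00, 0xFE, 0xFF];
    static immutable ubyte[] utf8    = [0xEF, 0xBB, 0xBF];
    static immutable ubyte[] utf16le = [0xFF, 0xFE];
    static immutable ubyte[] utf16be = [0xFE, 0xFF];

    // Length-checked prefix test, so short or empty files are safe.
    static bool hasPrefix(const(ubyte)[] d, const(ubyte)[] prefix)
    {
        return d.length >= prefix.length && d[0 .. prefix.length] == prefix;
    }

    // UTF-32 LE must be tested before UTF-16 LE: both begin with FF FE.
    if (hasPrefix(data, utf32le)) return Bom.utf32le;
    if (hasPrefix(data, utf32be)) return Bom.utf32be;
    if (hasPrefix(data, utf8))    return Bom.utf8;
    if (hasPrefix(data, utf16le)) return Bom.utf16le;
    if (hasPrefix(data, utf16be)) return Bom.utf16be;
    return Bom.none;
}
```

The result tells you which cast or conversion to apply to the rest of the bytes. Note that a UTF-16 LE file whose first character is U+0000 would be misread as UTF-32 LE by any BOM sniffer of this kind.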
jicman <cabrera_ _wrc.xerox.com> writes:
Shouldn't auto take care of that?
Feb 28 2009
Daniel Keep <daniel.keep.lists gmail.com> writes:
The compiler doesn't know what format your text files are in. auto only does compile-time type inference: it takes the type of the initializer, which here is whatever you cast to. -- Daniel
Feb 28 2009
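
To illustrate the point about auto, here is a small sketch in current D2. The byte array stands in for a file's contents and the function names are made up for the example. The same four bytes get whatever static type the cast names, and nothing inspects them at run time.

```d
// Four bytes standing in for a file's contents: "hi" encoded as UTF-16 LE.
immutable ubyte[] fileBytes = [0x68, 0x00, 0x69, 0x00];

size_t lengthAsUtf8()
{
    auto text = cast(string) fileBytes;   // static type is string
    static assert(is(typeof(text) == string));
    return text.length;                   // 4 code units, two of them NUL
}

size_t lengthAsUtf16()
{
    auto text = cast(wstring) fileBytes;  // static type is wstring
    static assert(is(typeof(text) == wstring));
    return text.length;                   // 2 code units
}
```

Both functions compile and run without complaint, which is exactly the problem: only the second interpretation matches what the bytes actually contain.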
Jarrett Billingsley <jarrett.billingsley gmail.com> writes:
You seem to be distinguishing between UTF and Unicode; it's kind of apples to oranges. Unicode is a standard for character encoding (a mapping from numbers to characters, like ASCII). UTF is a way - or rather, _several_ ways - of encoding Unicode text. There are three major encodings, UTF-8, UTF-16, and UTF-32 (and the 16- and 32-bit encodings have both little- and big-endian versions), which correspond to D's char[], wchar[], and dchar[].

When you say a "Unicode" file, do you mean it's encoded in UTF-16? If so, you can just read the file's contents as a wchar[]. If you're using Phobos, keep in mind that it provides no functionality for searching or manipulating wchar[]s, which means you'll have to convert it to UTF-8 (char[]).

If you're using Tango, you can give tango.io.UnicodeFile a shot - it will automatically transcode a file from any Unicode encoding to any other, and if your file has a BOM, it can even automatically detect which encoding it's in.
Feb 28 2009
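
The correspondence between the three encodings and D's string types can be seen by counting code units. A small sketch, assuming current D2 Phobos (std.utf.toUTF16 and std.utf.toUTF32); codeUnitCounts is a hypothetical helper name.

```d
import std.utf : toUTF16, toUTF32;

/// Returns how many code units `s` needs in UTF-8, UTF-16 and UTF-32,
/// in that order.
size_t[3] codeUnitCounts(string s)
{
    wstring w = toUTF16(s);  // 16-bit code units
    dstring d = toUTF32(s);  // 32-bit code units, one per code point
    return [s.length, w.length, d.length];
}
```

For "jos\u00E9" this gives 5, 4 and 4, because é (U+00E9) takes two bytes in UTF-8 but a single unit in the other two. A character outside the Basic Multilingual Plane such as "\U0001F600" gives 4, 2 and 1. The text is the same Unicode text in every case; only the encoding differs.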
jicman <cabrera_ _wrc.xerox.com> writes:
Ok, the only reason that I say Unicode is that when I open the file in Notepad and do a Save As, the Encoding says Unicode. So, when I read this file and write it back to another file, the Encoding turns to UTF-8. I want to keep it as Unicode.

I will give the suggestion a try. I did not try it yet. Maybe Phobos should think about taking care of the BOM bytes and provide support for these encodings. I am a big fan of Phobos. :-) I have not tried Tango yet, because I would have to uninstall Phobos, and I have just spent two years using Phobos; we already have an application based on Phobos, and switching to Tango would slow us down and set us back. Maybe version 2.0.

Thanks, Jarrett.

josé
Feb 28 2009
Daniel Keep <daniel.keep.lists gmail.com> writes:
There is no such thing as a Unicode file format. There just isn't. I know the option you speak of, and I have no idea what it's supposed to be; probably UCS-2 or UTF-16.

There's std.stream.EndianStream, which looks like it can read and write BOMs. As for converting between UTF encodings, there's std.utf.

-- Daniel
Feb 28 2009
jicman <cabrera_ _wrc.xerox.com> writes:
Thanks, Daniel... I could not get the above to work, maybe for lack of understanding and examples, but the code below is working. For now, the 1000+ XML files I have all have the same BOM (UTF16_le), so it will work at least for now. However, I need to fill in the rest later. I have a question about the code below, which is working for UTF16_le:

    import std.stdio;
    import std.file;
    import std.string;
    import std.utf;

    // Identify the encoding from the first four bytes of the file.
    char[] getBOM(ubyte[] t)
    {
      ubyte[] UTF32_be = [0x00,0x00,0xfe,0xff];
      ubyte[] UTF32_le = [0xff,0xfe,0x00,0x00];
      ubyte[] UTF8     = [0xef,0xbb,0xbf];
      ubyte[] UTF16_be = [0xfe,0xff];
      ubyte[] UTF16_le = [0xff,0xfe];
      // The UTF32 checks come first: UTF32_le also starts with ff fe.
      if(t == UTF32_be) return "UTF32_be";
      if(t == UTF32_le) return "UTF32_le";
      if(t[0 .. 3] == UTF8) return "UTF8";
      if(t[0 .. 2] == UTF16_be) return "UTF16_be";
      if(t[0 .. 2] == UTF16_le) return "UTF16_le";
      return "NO_BOM";
    }

    void main()
    {
      char[] f0 = "Unicode.ttx.xml";
      char[] f1 = "UnicodeNew.ttx.xml";
      auto text = cast(string) f0.read();
      ubyte[4] b = cast(ubyte[]) text[0 .. 4];
      char[] bom = getBOM(b);
      char[] wText;
      writefln(bom);
      // "UTF8" is checked first: it is too short for bom[0 .. 5].
      if (bom == "UTF8")
      {
      }
      else if (bom[0 .. 5] == "UTF16")
      {
        wchar[] temp = cast(wchar[]) text; //text[2 .. $];
        wText = std.utf.toUTF8(temp);
      }
      else if (bom[0 .. 5] == "UTF32")
      {
      }
      if (std.string.find(wText,"DisplayText=\"TrixieTag\">") > 0)
      {
        writefln("Found Trixie Tags in " ~ f0);
        // Cut out everything from the TrixieTag opening tag to its </ut>.
        char[][] nt = std.string.split(wText,`<ut Style="external" DisplayText="TrixieTag">`);
        wText = nt[0];
        nt = std.string.split(nt[1],"</ut>");
        wText ~= nt[1];
        wchar[] eText = std.utf.toUTF16(wText);
        f1.write(cast(void[]) eText);
      }
    }

The question: what happens when I get a UTF16_be (big endian) file? Will the call to std.utf.toUTF16(wText) take care of the BOM?

Thanks so much for all the support.

josé
Mar 03 2009
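
On the big-endian question: std.utf.toUTF16 converts between code units in memory, so it has no notion of a file's byte order and does not add or interpret a BOM. In the program above the BOM survives only because the whole file, BOM included, is cast and converted, so U+FEFF travels along as an ordinary character. For a UTF16_be file on a little-endian machine, the cast would yield byte-swapped code units, so the bytes have to be reordered explicitly. A minimal sketch in current D2 follows; decodeUtf16 and encodeUtf16WithBom are hypothetical helper names, and the input to decodeUtf16 is assumed to have had its BOM stripped already.

```d
import std.utf : toUTF8;

/// Decodes raw UTF-16 bytes (BOM already removed) into a UTF-8 string,
/// independent of the byte order of the machine running the code.
string decodeUtf16(const(ubyte)[] bytes, bool bigEndian)
{
    assert(bytes.length % 2 == 0, "UTF-16 data must have an even length");
    auto units = new wchar[bytes.length / 2];
    foreach (i, ref u; units)
    {
        immutable first = bytes[2 * i], second = bytes[2 * i + 1];
        u = cast(wchar)(bigEndian ? (first << 8) | second
                                  : (second << 8) | first);
    }
    return toUTF8(units);
}

/// Encodes `text` as UTF-16 bytes in the requested byte order,
/// preceded by the matching BOM.
ubyte[] encodeUtf16WithBom(string text, bool bigEndian)
{
    ubyte[] result;

    void put(wchar u)
    {
        immutable high = cast(ubyte)(u >> 8), low = cast(ubyte)(u & 0xFF);
        if (bigEndian) { result ~= high; result ~= low; }
        else           { result ~= low;  result ~= high; }
    }

    put('\uFEFF');                    // byte-order mark
    foreach (wchar u; text) put(u);   // foreach transcodes UTF-8 to UTF-16 units
    return result;
}
```

With these two helpers, the editing step can stay in UTF-8 (where the string functions work), and the output can be written back in the same byte order that was detected on input.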