digitalmars.D.learn - Reading and writing Unicode files
- jicman (7/7) Feb 27 2009 Greetings.
- downs (5/18) Feb 28 2009 Wow, you're in luck!
- downs (2/24) Feb 28 2009 PS: You may need to do detection for UTF-16. In that case, just cast to ...
- jicman (2/27) Feb 28 2009 shouldn't auto take care of that?
- Daniel Keep (4/31) Feb 28 2009 The compiler doesn't know what format your text files are in. auto does
- Jarrett Billingsley (23/27) Feb 28 2009 anding this Unicode, ANSI, UTF* ideas. I know how to get an UTF8 File...
- jicman (5/31) Feb 28 2009 Ok, the only reason that I say Unicode is that when I open the file in N...
- Daniel Keep (7/9) Feb 28 2009 There is no such thing as a Unicode file format. There just isn't. I
- jicman (60/73) Mar 03 2009 Thanks, Daniel...
Greetings. Sorry guys, please be patient with me. I am having a hard time understanding these Unicode, ANSI, UTF* ideas. I know how to take a UTF-8 file and turn it into ANSI, and I know how to take an ANSI file and turn it into a UTF file. But now I have a Unicode file, and I need to change the content and create a new Unicode file with the changes. I have read all kinds of places, and I found mtext, from Chris Miller's site, by reading

http://www.prowiki.org/wiki4d/wiki.cgi?DanielKeep/TextInD

Anyway, what I need is to read a Unicode file, search the strings inside, make changes, and write the changes back to a Unicode file. Any help would be greatly appreciated.

thanks,
josé
Feb 27 2009
jicman wrote:
> Anyway, what I need is to read an Unicode file, search the strings inside, make changes to the file and write the changes back to an Unicode file.

Wow, you're in luck! D is all Unicode. Just do

    import std.file;
    auto text = cast(string) filename.read();
    // do your changes
    filename.write(cast(void[]) text);

and you're done.
Feb 28 2009
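A minimal sketch of the read, change, write cycle suggested above, for the case where the file really is UTF-8. It assumes D2 with a current Phobos rather than the D1 used in this thread; `rewriteUtf8` and the file names are hypothetical. `readText` is used instead of a bare cast because it rejects input that is not valid UTF-8.

```d
import std.array : replace;
import std.file : readText, remove, tempDir, write;
import std.path : buildPath;

/// Reads the UTF-8 file `src`, replaces every `from` with `to`,
/// and writes the result to `dst`. (Hypothetical helper.)
void rewriteUtf8(string src, string dst, string from, string to)
{
    string text = readText(src);         // throws if the bytes are not valid UTF-8
    write(dst, text.replace(from, to));  // a string is written as its raw UTF-8 bytes
}

void main()
{
    // Hypothetical file names, created in the system temp directory.
    auto src = buildPath(tempDir(), "utf8_in.txt");
    auto dst = buildPath(tempDir(), "utf8_out.txt");

    write(src, "gr\u00FC\u00DFe, world");
    scope (exit) remove(src);

    rewriteUtf8(src, dst, "world", "D");
    scope (exit) remove(dst);

    assert(readText(dst) == "gr\u00FC\u00DFe, D");
}
```

This only covers UTF-8 input; a UTF-16 file would fail the validation in `readText` and needs decoding first.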
downs wrote:
> Just do import std.file; auto text = cast(string) filename.read(); do your changes; filename.write(cast(void[]) text); and you're done.

PS: You may need to do detection for UTF-16. In that case, just cast to a wstring instead, then (optionally) use std.utf.toUTF8.
Feb 28 2009
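The detection step mentioned in the PS could look roughly like this. It is a sketch assuming D2, a little-endian machine, and that any file without the FF FE signature is already UTF-8; `toUtf8Text` is a hypothetical name.

```d
import std.utf : toUTF8;

/// Returns the file contents as UTF-8. (Hypothetical helper.)
/// Assumption: the host is little-endian, so UTF-16 LE bytes can be
/// reinterpreted as wchar code units without swapping.
string toUtf8Text(const(ubyte)[] raw)
{
    // FF FE marks UTF-16 LE. Note: the UTF-32 LE BOM (FF FE 00 00) also
    // starts with these bytes; that case is not handled in this sketch.
    if (raw.length >= 2 && raw[0] == 0xFF && raw[1] == 0xFE)
    {
        // Skip the 2-byte BOM, then view the remaining bytes as UTF-16.
        // The cast fails at run time if the byte count is odd.
        auto units = cast(const(wchar)[]) raw[2 .. $];
        return toUTF8(units);
    }
    return cast(string) raw.idup;  // assumed to be UTF-8 already
}

void main()
{
    ubyte[] utf16 = [0xFF, 0xFE, 0x68, 0x00, 0x69, 0x00];  // BOM + "hi"
    ubyte[] utf8  = [0x68, 0x69];                          // "hi"
    assert(toUtf8Text(utf16) == "hi");
    assert(toUtf8Text(utf8) == "hi");
}
```

The bytes would typically come from `cast(ubyte[]) std.file.read(name)`.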
downs wrote:
> PS: You may need to do detection for UTF-16. In that case, just cast to a wstring instead, then (optionally) use std.utf.toUTF8.

Shouldn't auto take care of that?
Feb 28 2009
jicman wrote:
> shouldn't auto take care of that?

The compiler doesn't know what format your text files are in. auto does type inference.

-- Daniel
Feb 28 2009
On Sat, Feb 28, 2009 at 1:40 AM, jicman <cabrera_ _wrc.xerox.com> wrote:
> Anyway, what I need is to read an Unicode file, search the strings inside, make changes to the file and write the changes back to an Unicode file.

You seem to be distinguishing between UTF and Unicode; it's kind of apples to oranges. Unicode is a standard for character encoding (a mapping from numbers to characters, like ASCII). UTF is a way - or rather, _several_ ways - of encoding Unicode text. There are three major encodings, UTF-8, UTF-16, and UTF-32 (and the 16- and 32-bit encodings have both little- and big-endian versions), which correspond to D's char[], wchar[], and dchar[].

When you say a "Unicode" file do you mean it's encoded in UTF-16? If so, you can just read the file's contents as a wchar[]. If you're using Phobos, keep in mind that it provides no functionality for searching or manipulating wchar[]s, which means you'll have to convert it to UTF-8 (char[]).

If you're using Tango, you can give tango.io.UnicodeFile a shot - it will automatically transcode a file from any Unicode encoding to any other, and if your file has a BOM, it can even automatically detect which encoding it's in.
Feb 28 2009
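To make the char[] / wchar[] / dchar[] correspondence concrete, here is a small sketch (assuming D2 Phobos; `codeUnitCounts` is a hypothetical name) that encodes the same two characters three ways and counts the code units each encoding needs.

```d
import std.utf : toUTF16, toUTF32;

/// Returns the number of code units the text occupies as
/// UTF-8, UTF-16 and UTF-32, in that order. (Hypothetical helper.)
size_t[3] codeUnitCounts(string s)
{
    wstring w = toUTF16(s);  // 16-bit code units
    dstring d = toUTF32(s);  // 32-bit code units, one per code point
    return [s.length, w.length, d.length];
}

void main()
{
    // U+00E9 (e with acute) followed by U+1D11E (musical G clef).
    size_t[3] expected = [6, 3, 2];
    // UTF-8:  2 + 4 bytes; UTF-16: 1 unit + a surrogate pair; UTF-32: 1 + 1.
    assert(codeUnitCounts("\u00E9\U0001D11E") == expected);
}
```

The text is the same in all three cases; only the in-memory (and on-disk) representation changes.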
Jarrett Billingsley wrote:
> When you say a "Unicode" file do you mean it's encoded in UTF-16?

OK, the only reason that I say Unicode is that when I open the file in Notepad and do a Save As, the Encoding says Unicode. So, when I read this file and write it back to another file, the Encoding turns to UTF8. I want to keep it as Unicode. I will give the suggestion a try; I have not tried it yet.

Maybe Phobos should think about taking care of the BOM bytes and provide support for these encodings. I am a big fan of Phobos. :-) I have not tried Tango yet, because I would have to uninstall Phobos. I have just spent two years using Phobos, we already have an application based on Phobos, and changing to Tango would slow us down and set us back. Maybe version 2.0.

Thanks, Jarrett.

josé
Feb 28 2009
jicman wrote:
> Ok, the only reason that I say Unicode is that when I open the file in Notepad and I do a SaveAs, the Encoding says Unicode. So, when i read this file and I write it back to the another file, the Encoding turns to UTF8. I want to keep it as Unicode.

There is no such thing as a Unicode file format. There just isn't. I know the option you speak of, and I have no idea what it's supposed to be; probably UCS-2 or UTF-16.

> Maybe Phobos should think about taking care of the BOM byte and provide support for these encodings.

There's std.stream.EndianStream, which looks like it can read and write BOMs. As for converting between UTF encodings, there's std.utf.

-- Daniel
Feb 28 2009
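One way to keep the output in a 16-bit encoding, without relying on std.stream, is to emit the BOM and the code units by hand. This is only a sketch, assuming D2 and assuming that the Notepad "Unicode" option means UTF-16 LE with a BOM, as guessed above; `encodeUtf16LeWithBom` is a hypothetical name.

```d
import std.utf : toUTF16;

/// Encodes UTF-8 text as UTF-16 LE bytes, preceded by the FF FE BOM.
/// (Hypothetical helper; works the same on any host byte order.)
ubyte[] encodeUtf16LeWithBom(string text)
{
    ubyte[] bytes = [0xFF, 0xFE];           // UTF-16 LE byte order mark
    foreach (wchar unit; toUTF16(text))     // iterate code units, no decoding
    {
        bytes ~= cast(ubyte)(unit & 0xFF);  // low byte first
        bytes ~= cast(ubyte)(unit >> 8);    // then high byte
    }
    return bytes;
}

void main()
{
    ubyte[] expected = [0xFF, 0xFE, 0x68, 0x00, 0x69, 0x00];  // BOM + "hi"
    assert(encodeUtf16LeWithBom("hi") == expected);
}
```

The resulting bytes can be passed straight to `std.file.write(name, bytes)`.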
Daniel Keep wrote:
> There's std.stream.EndianStream, which looks like it can read and write BOMs. As for converting between UTF encodings, std.utf.

Thanks, Daniel...

I could not get the above to work, maybe for lack of understanding and examples, but the code below is working. For now, the 1000+ XML files I have all carry the same BOM (UTF16_le), so it will work at least for now. However, I need to fill in the rest later. I have a question about the code below, which is working for UTF16_le:

    import std.stdio;
    import std.file;
    import std.string;
    import std.utf;

    // Identify the encoding from the first four bytes of the file.
    char[] getBOM(ubyte[] t)
    {
        ubyte[] UTF32_be = [0x00,0x00,0xfe,0xff];
        ubyte[] UTF32_le = [0xff,0xfe,0x00,0x00];
        ubyte[] UTF8     = [0xef,0xbb,0xbf];
        ubyte[] UTF16_be = [0xfe,0xff];
        ubyte[] UTF16_le = [0xff,0xfe];
        // The UTF-32 checks come first because the UTF-32 LE BOM
        // starts with the same two bytes as the UTF-16 LE BOM.
        if(t == UTF32_be) return "UTF32_be";
        if(t == UTF32_le) return "UTF32_le";
        if(t[0 .. 3] == UTF8) return "UTF8";
        if(t[0 .. 2] == UTF16_be) return "UTF16_be";
        if(t[0 .. 2] == UTF16_le) return "UTF16_le";
        return "NO_BOM";
    }

    void main()
    {
        char[] f0 = "Unicode.ttx.xml";
        char[] f1 = "UnicodeNew.ttx.xml";
        auto text = cast(string) f0.read();
        ubyte[4] b = cast(ubyte[]) text[0 .. 4];
        char[] bom = getBOM(b);
        char[] wText;
        writefln(bom);
        if (bom[0 .. 5] == "UTF16")
        {
            // The BOM is kept in the text, so it is converted along with it.
            wchar[] temp = cast(wchar[]) text; //text[2 .. $];
            wText = std.utf.toUTF8(temp);
        }
        else if (bom[0 .. 5] == "UTF32")
        {
            // not handled yet
        }
        else if (bom == "UTF8")
        {
            // not handled yet
        }
        if (std.string.find(wText,"DisplayText=\"TrixieTag\">") > 0)
        {
            writefln("Found Trixie Tags in " ~ f0);
            // Cut out everything from the TrixieTag <ut> element up to the next </ut>.
            char[][] nt = std.string.split(wText,`<ut Style="external" DisplayText="TrixieTag">`);
            wText = nt[0];
            nt = std.string.split(nt[1],"</ut>");
            wText ~= nt[1];
            wchar[] eText = std.utf.toUTF16(wText);
            f1.write(cast(void[]) eText);
        }
    }

The question: what happens when I get a UTF16_be (big-endian) file? Will the call to std.utf.toUTF16(wText) take care of the BOM?

Thanks so much for all the support.

josé
Mar 03 2009
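On the big-endian question: std.utf.toUTF16 only converts between in-memory code units in the machine's native byte order; it neither inspects nor emits a BOM, so a UTF16_be file cast directly to wchar[] on a little-endian machine would come out byte-swapped. A minimal sketch of decoding with an explicit byte order, assuming D2 (`decodeUtf16` is a hypothetical name, and the BOM is assumed to have been stripped by the caller):

```d
import std.utf : toUTF8;

/// Decodes UTF-16 bytes of a known byte order (BOM already removed)
/// into UTF-8. (Hypothetical helper.)
string decodeUtf16(const(ubyte)[] raw, bool bigEndian)
{
    assert(raw.length % 2 == 0, "UTF-16 data must have an even byte count");
    auto units = new wchar[raw.length / 2];
    foreach (i, ref unit; units)
    {
        immutable first  = raw[2 * i];
        immutable second = raw[2 * i + 1];
        // Assemble each code unit from its two bytes in the stated order,
        // so the result does not depend on the host's endianness.
        unit = bigEndian
            ? cast(wchar)((first << 8) | second)
            : cast(wchar)((second << 8) | first);
    }
    return toUTF8(units);
}

void main()
{
    ubyte[] be = [0x00, 0x68, 0x00, 0x69];  // "hi" as UTF-16 BE
    ubyte[] le = [0x68, 0x00, 0x69, 0x00];  // "hi" as UTF-16 LE
    assert(decodeUtf16(be, true) == "hi");
    assert(decodeUtf16(le, false) == "hi");
}
```

Writing a big-endian file back would need the mirror image: emit FE FF, then the high byte of each code unit before the low byte.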