digitalmars.D.learn - Transparent ANSI to UTF-8 conversion

Lubos Pintes <lubos.pintes gmail.com> writes:

Hi,
I would like to transparently convert from ANSI to UTF-8 when dealing 
with text files. For example, here in Slovakia, virtually every text file 
is in Windows-1250.
If someone opens a text file, he or she expects that it will work 
properly. So I suppose it is not feasible to tell someone "if you 
want to use my program, please convert every text to UTF-8".

Obtaining the mapping from ANSI to Unicode for a particular code page is 
trivial. Maybe even MultiByteToWideChar could help with this.

However, I need to know how to do it the "D way". Could I define something 
like a TextReader class? Or perhaps some support already exists somewhere?
Thanks

Feb 27 2013

"monarch_dodra" <monarchdodra gmail.com> writes:

On Wednesday, 27 February 2013 at 10:56:16 UTC, Lubos Pintes wrote:
> I would like to transparently convert from ANSI to UTF-8 when
> dealing with text files. For example here in Slovakia,
> virtually every text file is in Windows-1250.
> [...]
> I however need to know how to do it "D-way". Could I define
> something like TextReader class? Or perhaps some support
> already exists somewhere?

I'd say the D way would be to simply exploit the fact that UTF is 
built into the language, and as such, not worry about encoding, 
and use raw code points.

You get your "code page to Unicode *code point*" table, and then you 
simply map each character to a dchar. From there, D will itself 
convert your raw Unicode (aka UTF-32) to UTF-8 on the fly, when 
you need it. For example, writing to a file will automatically 
convert the input to UTF-8. You can also simply use 
std.conv.to!string to convert any UTF scheme to UTF-8 (or to any 
other UTF, for that matter).

This may not be as efficient as a "true" "code page to UTF-8 table" 
but:
1) Given you'll most probably be IO bound anyway, who cares?
2) Scalability. D does everything but the code page to code point 
mapping. Why bother doing any more than that?
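
The code-point approach described above could be sketched roughly as follows. This is only an illustration, not a complete solution: `cp1250ToDchar` and `decodeCp1250` are hypothetical names, and the table is deliberately partial (a real one needs all 256 entries for the code page). Bytes missing from the partial table are mapped to U+FFFD.

```d
import std.conv : to;

// Hypothetical helper: maps one Windows-1250 byte to a Unicode code point.
// Only a handful of entries are filled in; a real table needs all 256.
dchar cp1250ToDchar(ubyte b)
{
    if (b < 0x80)
        return cast(dchar) b;          // ASCII range maps to itself
    switch (b)
    {
        case 0x8A: return '\u0160';    // LATIN CAPITAL LETTER S WITH CARON
        case 0x9A: return '\u0161';    // LATIN SMALL LETTER S WITH CARON
        case 0xE1: return '\u00E1';    // LATIN SMALL LETTER A WITH ACUTE
        case 0xE8: return '\u010D';    // LATIN SMALL LETTER C WITH CARON
        default:   return '\uFFFD';    // not covered by this partial table
    }
}

// Map every byte to a dchar, then let std.conv.to re-encode as UTF-8.
string decodeCp1250(const(ubyte)[] raw)
{
    auto codePoints = new dchar[raw.length];
    foreach (i, b; raw)
        codePoints[i] = cp1250ToDchar(b);
    return codePoints.to!string;
}

void main()
{
    import std.stdio : writeln;

    const(ubyte)[] raw = [0x8A, 0x61, 0x9A];
    writeln(decodeCp1250(raw));
}
```

The intermediate dchar[] costs four bytes per input byte, which is the inefficiency conceded in the text; the only code-page-specific part is the lookup function.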

Feb 27 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

27-Feb-2013 16:20, monarch_dodra wrote:
> You get your "code page to Unicode *code point*" table, and then you simply
> map each character to a dchar. From there, D will itself convert your
> raw Unicode (aka UTF-32) to UTF-8 on the fly, when you need it.
> [...]

A table that translates ANSI to UTF-8 is trivially constructible 
using CTFE from the static one that does ANSI -> dchar.

> This may not be as efficient as a "true" "code page to UTF-8 table" but:
> 1) Given you'll most probably be IO bound anyway, who cares?

With in-memory transcoding you won't be. Text editors typically work 
entirely in memory or on mmap-ed files.


-- 
Dmitry Olshansky

Feb 27 2013

Lubos Pintes <lubos.pintes gmail.com> writes:

I don't understand the CTFE usage in this context. I thought about 
something like
dchar[] windows_1250=[...];
Isn't this enough?
Thanks

Feb 27 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

28-Feb-2013 00:35, Lubos Pintes wrote:
> I don't understand the CTFE usage in this context. I thought about
> something like
> dchar[] windows_1250=[...];
> Isn't this enough?

It's fine. What I meant is that if all you want to do is convert ANSI -> 
UTF-8, there is no need to convert to dchar and then to UTF-8 chars.

So the table becomes more like:

char[][] windows_1250_to_UTF8 = [...];

Or rather (far better memory footprint):

char[2][] windows_1250_to_UTF8 = [ ... ];

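I think 2 UTF-8 chars should be enough for the letters in your codepage. 
Punctuation such as the euro sign (0x80 in Windows-1250, U+20AC) needs 3, 
so widen the element type if you want to cover those too.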

Then CTFE is just a tool to create one table from another:
char[2][] windows_1250UTF = createUTF8Table(windows_1250);

The point is that inside createUTF8Table you create the array using 
new and simple loops + std.utf.encode, just like in normal code, but 
it'll be CTFE-ed.

The same goes for going backwards: you can treat char[2] as ushort and 
build the tables. Though now they may have gaps, because the encoding is 
not linear but rather has some stride with a certain period.
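
A minimal sketch of such a CTFE-built table might look like the following. It rests on a few assumptions of my own: the source table is partial (`partialWindows1250` is a hypothetical stand-in for a full 256-entry table), and each entry is a small struct (`Utf8Entry`, also hypothetical) holding up to 3 code units plus a length rather than a bare char[2], because the euro sign at 0x80 encodes to 3 UTF-8 bytes. Only `createUTF8Table` and `std.utf.encode` come from the description above.

```d
import std.utf : encode;

// Hypothetical entry type: up to 3 UTF-8 code units plus their count.
struct Utf8Entry
{
    char[3] bytes;
    ubyte len;
}

// Hypothetical partial code-point table; a real one lists all 256 entries.
dchar[256] partialWindows1250()
{
    dchar[256] t;
    foreach (i; 0 .. 256)
        t[i] = i < 0x80 ? cast(dchar) i : '\uFFFD';
    t[0x80] = '\u20AC'; // EURO SIGN
    t[0x8A] = '\u0160'; // LATIN CAPITAL LETTER S WITH CARON
    t[0x9A] = '\u0161'; // LATIN SMALL LETTER S WITH CARON
    t[0xE8] = '\u010D'; // LATIN SMALL LETTER C WITH CARON
    return t;
}

// Ordinary code: plain loops plus std.utf.encode, usable at compile time.
Utf8Entry[256] createUTF8Table(const dchar[256] codePoints)
{
    Utf8Entry[256] table;
    foreach (i, cp; codePoints)
    {
        char[4] buf;
        immutable n = encode(buf, cp);
        assert(n <= 3, "entry does not fit in 3 code units");
        foreach (j; 0 .. n)
            table[i].bytes[j] = buf[j];
        table[i].len = cast(ubyte) n;
    }
    return table;
}

// A module-level initializer is evaluated at compile time (CTFE).
immutable Utf8Entry[256] windows1250ToUTF8 =
    createUTF8Table(partialWindows1250());

// Forces compile-time evaluation: the euro sign takes 3 code units.
static assert(createUTF8Table(partialWindows1250())[0x80].len == 3);

// Transcode by copying the pre-encoded bytes for each input byte.
string toUTF8(const(ubyte)[] raw)
{
    string result;
    result.reserve(raw.length);
    foreach (b; raw)
    {
        immutable entry = windows1250ToUTF8[b];
        result ~= entry.bytes[0 .. entry.len];
    }
    return result;
}

void main()
{
    import std.stdio : writeln;

    const(ubyte)[] raw = [0x8A, 0x61, 0x80];
    writeln(toUTF8(raw));
}
```

With this layout the per-byte work at run time is a table lookup and a short copy, with no dchar intermediate.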

-- 
Dmitry Olshansky

Feb 27 2013

"Era Scarecrow" <rtcvb32 yahoo.com> writes:

On Wednesday, 27 February 2013 at 10:56:16 UTC, Lubos Pintes wrote:
> I would like to transparently convert from ANSI to UTF-8 when
> dealing with text files. For example here in Slovakia,
> virtually every text file is in Windows-1250. If someone opens
> a text file, he or she expects that it will work properly. So I
> suppose, that it is not feasible to tell someone "if you want
> to use my program, please convert every text to UTF-8".

A while back I wrote a little code that effectively does that. 
Mind you, it's probably not for the specific encoding you need, but 
you should be able to find the code points and replace them. I 
think it was for ISO-8859-1.

See "Reading ASCII file with some codes above 127 (exten ascii)"

http://forum.dlang.org/thread/lehgyzmwewgvkdgraizv forum.dlang.org
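
For ISO-8859-1 in particular, no table is needed, because each byte value is also the Unicode code point of the character. The following is a small sketch of that special case (it is not the code from the linked thread, and `latin1ToUTF8` is a hypothetical name). For Windows-1250 the cast would have to be replaced by a table lookup, since its upper half does not line up with Unicode.

```d
// ISO-8859-1 only: byte value == Unicode code point.
string latin1ToUTF8(const(ubyte)[] raw)
{
    string result;
    result.reserve(raw.length);
    foreach (b; raw)
        result ~= cast(dchar) b; // appending a dchar encodes it as UTF-8
    return result;
}

void main()
{
    import std.stdio : writeln;

    const(ubyte)[] raw = [0x63, 0x61, 0x66, 0xE9];
    writeln(latin1ToUTF8(raw));
}
```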

Feb 27 2013

Lubos Pintes <lubos.pintes gmail.com> writes:

Thank you all. Now I believe I will be able to solve this.
Feb 28 2013
