digitalmars.D.learn - how to localize console and GUI apps in Windows

Andrei <aalub mail.ru> writes:

There is one everlasting problem with writing Cyrillic programs for
Windows: Microsoft successively introduced two very different code
pages for Russia and other Cyrillic-alphabet countries: first
MSDOS-866 (and the like), then Windows-1251. Nowadays MS Windows
uses the first code page for console programs and the second for
GUI applications, and there have always been many workarounds to
get proper translation between them. Mostly a programmer must
either write program sources in one code page for console and
another for GUI, or use .NET, which basically uses UTF8 in sources
and translates seamlessly depending on the back end.

In the D language, which uses only UTF8 for string encoding, I can
write program texts in neither the MS866 code page nor
Windows-1251: both cases end in a compiler error like "Invalid
trailing code unit" or "Outside Unicode code space". And writing
Cyrillic strings in UTF8 format is fatal for both console and GUI
Windows targets.

My question is: is there any standard means in the D libraries to
translate Cyrillic or any other localized UTF8 strings for console
and GUI output? If so, where can I get more information and a good
example? Google did not help.

Thanks.

Dec 28 2017

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Thu, Dec 28, 2017 at 05:56:32PM +0000, Andrei via Digitalmars-d-learn wrote:
 There is one everlasting problem writing Cyrillic programs in Windows:
 Microsoft consequently invented two much different code pages for
 Russia and other Cyrillic-alphabet countries: first was MSDOS-866 (and
 alike), second Windows-1251. Nowadays MS Windows uses first code page
 for console programs, second for GUI applications, and there always
 are many workarounds to get proper translation between them. Mostly a
 programmer should write program sources either in one code page for
 console and other for GUI, or use .NET, which basically uses UTF8 in
 sources and makes seamless translation depending on back end.
 
 In D language which uses only UTF8 for string encoding I cannot write
 neither MS866 code page program texts, nor Windows-1251 - both cases
 end in a compiler error like "Invalid trailing code unit" or "Outside
 Unicode code space". And writing Cyrillic strings in UTF8 format is
 fatal for both console and GUI Windows targets.
 
 My question is: is there any standard means to translate Cyrillic or
 any other localized UTF8 strings for console and GUI output in D
 libraries. If so - where I can get more information and good example.
 Google would not help.

[...]

The string / wstring / dstring types in D are intended to be Unicode
strings.  If you need to use other encodings, you really should be using
ubyte[] or const(ubyte)[] or immutable(ubyte)[], instead of string.

One approach is to use UTF-8 in your code, and only translate to one of
the code pages when you need to produce output.  I wrote a small module
for translating to/from KOI8-R when dealing with Russian text; you might
find it helpful:

-------------------------------------------------------------------------------
/**
 * Module to convert between UTF and KOI8-R
 */
module koi8r;

import std.string;
import std.range;

static immutable ubyte[0x450 - 0x410] utf2koi8r = [
    225, 226, 247, 231, 228, 229, 246, 250, // АБВГДЕЖЗ
    233, 234, 235, 236, 237, 238, 239, 240, // ИЙКЛМНОП
    242, 243, 244, 245, 230, 232, 227, 254, // РСТУФХЦЧ
    251, 253, 255, 249, 248, 252, 224, 241, // ШЩЪЫЬЭЮЯ
    193, 194, 215, 199, 196, 197, 214, 218, // абвгдежз
    201, 202, 203, 204, 205, 206, 207, 208, // ийклмноп
    210, 211, 212, 213, 198, 200, 195, 222, // рстуфхцч
    219, 221, 223, 217, 216, 220, 192, 209  // шщъыьэюя
];

/**
 * Translates a range of UTF characters into KOI8-R characters.
 * Returns: Range of KOI8-R characters (as ubyte).
 */
auto toKOI8r(R)(R range)
    if (isInputRange!R && is(ElementType!R : dchar))
{
    static struct Result
    {
        R _range;

        @property bool empty() { return _range.empty; }

        @property ubyte front()
        {
            dchar ch = _range.front;

            // ASCII
            if (ch < 128)
                return cast(ubyte)ch;

            // Primary alphabetic range
            if (ch >= 0x410 && ch < 0x450)
                return utf2koi8r[ch - 0x410];

            // Special case: Ё and ё are outside the usual range.
            if (ch == 0x401) return 179;
            if (ch == 0x451) return 163;

            throw new Exception(
                "Encoding error: unable to convert '%c' to KOI8-R".format(ch));
        }

        void popFront() { _range.popFront(); }

        static if (isForwardRange!R)
        {
            @property Result save()
            {
                Result copy;
                copy._range = _range.save;
                return copy;
            }
        }
    }
    return Result(range);
}

unittest
{
    import std.string;
    import std.algorithm : equal;

    assert("юабцдефгхийклмнопярстужвьызшэщчъ".toKOI8r.equal(iota(192, 224)));
    assert("ЮАБЦДЕФГХИЙКЛМНОПЯРСТУЖВЬЫЗШЭЩЧЪ".toKOI8r.equal(iota(224, 256)));
}

unittest
{
    auto r = "abc абв".toKOI8r;
    static assert(isForwardRange!(typeof(r)));
    import std.algorithm.comparison : equal;
    assert(r.equal(['a', 'b', 'c', ' ', 193, 194, 215]));
}

static dchar[0x100 - 0xC0] koi8r2utf = [
    'ю', 'а', 'б', 'ц', 'д', 'е', 'ф', 'г', // 192-199
    'х', 'и', 'й', 'к', 'л', 'м', 'н', 'о', // 200-207
    'п', 'я', 'р', 'с', 'т', 'у', 'ж', 'в', // 208-215
    'ь', 'ы', 'з', 'ш', 'э', 'щ', 'ч', 'ъ', // 216-223
    'Ю', 'А', 'Б', 'Ц', 'Д', 'Е', 'Ф', 'Г', // 224-231
    'Х', 'И', 'Й', 'К', 'Л', 'М', 'Н', 'О', // 232-239
    'П', 'Я', 'Р', 'С', 'Т', 'У', 'Ж', 'В', // 240-247
    'Ь', 'Ы', 'З', 'Ш', 'Э', 'Щ', 'Ч', 'Ъ'  // 248-255
];

/**
 * Translates a range of KOI8-R characters to UTF.
 * Returns: Range of UTF characters (as dchar).
 */
auto fromKOI8r(R)(R range)
    if (isInputRange!R && is(ElementType!R : ubyte))
{
    static struct Result
    {
        R _range;
        @property bool empty() { return _range.empty; }
        @property dchar front()
        {
            ubyte b = _range.front;
            if (b < 128) return b;
            if (b >= 192)
                return koi8r2utf[b - 192];

            switch (b)
            {
                case 128: return '─';
                case 152: return '≤';
                case 153: return '≥';
                case 163: return 'ё';
                case 179: return 'Ё';
                default:
                    import std.string : format;
                    throw new Exception(
                        "KOI8-R character %d not implemented yet".format(b));
            }
        }
        void popFront() { _range.popFront(); }
        static if (isForwardRange!R)
        {
            @property Result save()
            {
                Result copy;
                copy._range = _range.save;
                return copy;
            }
        }
    }
    return Result(range);
}

unittest
{
    import std.algorithm.comparison : equal;
    ubyte[] lower = [
        193, 194, 215, 199, 196, 197, 163, 214,
        218, 201, 202, 203, 204, 205, 206, 207,
        208, 210, 211, 212, 213, 198, 200, 195,
        222, 219, 221, 223, 217, 216, 220, 192,
        209
    ];
    assert(lower.fromKOI8r.equal("абвгдеёжзийклмнопрстуфхцчшщъыьэюя"));

    ubyte[] upper = [
        225, 226, 247, 231, 228, 229, 179, 246,
        250, 233, 234, 235, 236, 237, 238, 239,
        240, 242, 243, 244, 245, 230, 232, 227,
        254, 251, 253, 255, 249, 248, 252, 224,
        241
    ];
    assert(upper.fromKOI8r.equal("АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ"));
}
-------------------------------------------------------------------------------

As the unittests show, you just call toKOI8r or fromKOI8r to translate
between encodings.  All non-Unicode strings are handled as ubyte[], so
that you won't accidentally mix up a Unicode string with a KOI8-R string.

And the code should be straightforward enough to be adapted for other
encodings as well.
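
As an illustration of adapting the same idea to another code page, here
is a minimal sketch for CP866, the console code page mentioned at the
start of the thread. The function name toCP866 is made up for this
example and is not part of the module above. It returns an array rather
than a lazy range to keep the sketch short, and it covers only ASCII and
the Russian letters; the box-drawing and other CP866 characters are left
out.

```d
import std.format : format;

/// Encodes a UTF-8 string as CP866 bytes.
/// Hypothetical helper: handles ASCII and Russian letters only.
ubyte[] toCP866(string text)
{
    ubyte[] result;
    foreach (dchar ch; text)    // foreach over dchar decodes the UTF-8 input
    {
        if (ch < 0x80)
            result ~= cast(ubyte) ch;                   // ASCII
        else if (ch >= 0x410 && ch <= 0x43F)
            result ~= cast(ubyte)(ch - 0x410 + 0x80);   // А..Я and а..п
        else if (ch >= 0x440 && ch <= 0x44F)
            result ~= cast(ubyte)(ch - 0x440 + 0xE0);   // р..я
        else if (ch == 0x401)
            result ~= ubyte(0xF0);                      // Ё
        else if (ch == 0x451)
            result ~= ubyte(0xF1);                      // ё
        else
            throw new Exception(
                format("cannot encode U+%04X in CP866", cast(uint) ch));
    }
    return result;
}

unittest
{
    immutable ubyte[] expected = [0x8F, 0xE0, 0xA8, 0xA2, 0xA5, 0xE2];
    assert(toCP866("Привет") == expected);
}
```

Unlike KOI8-R, CP866 keeps the letters in alphabetical order, so two
offset calculations replace the lookup table.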

Hope this helps.


T

-- 
For every argument for something, there is always an equal and opposite
argument against it. Debates don't give answers, only wounded or inflated egos.

Dec 28 2017

Andrei <aalub mail.ru> writes:

On Thursday, 28 December 2017 at 18:45:39 UTC, H. S. Teoh wrote:
 On Thu, Dec 28, 2017 at 05:56:32PM +0000, Andrei via 
 Digitalmars-d-learn wrote:
 ...
 The string / wstring / dstring types in D are intended to be 
 Unicode strings.  If you need to use other encodings, you 
 really should be using ubyte[] or const(ubyte)[] or 
 immutable(ubyte)[], instead of string.

Thank you, Teoh, for the advice and the good example! I was looking
towards writing something like that if no solution exists. Still,
this way of deliberate translation does not seem to be the best. It
requires an explicit workaround for every ahchoo in Russian and
constant conversion of ubyte[] to string and back. No formatting
gems, no simple and elegant I/O statements or string/char
comparisons. This may be endurable if you write an application
where Russian is only one of several rare options, but what if your
whole environment is totally Russian?

Or some other non-ASCII locale... Many other cultures have the same
mix of DOS/Windows/Unix code pages. The decision to use only
Unicode for strings in the D language seems excellent just because
of this, but the implementation turns out to be disappointing.
Folks in such countries won’t appreciate a language which is
elegant only for English-speaking communication.

This problem is common to most programming languages and
runtimes I know of. The only system which has solved the whole
problem is .NET, I think.

The way proposed by zabruk70 below seems more appropriate, though
more specific too: I feel it suits only console applications.
Alas, it proved to be buggy for this type of application too.

On Thursday, 28 December 2017 at 22:49:30 UTC, zabruk70 wrote:
 you can just set console CP to UTF-8:

 https://github.com/CyberShadow/ae/blob/master/sys/console.d

Yes! This seems to be what is required, thank you very much! Though
it is not suitable for a GUI Windows application.

Still, some testing showed that this way fixes only console
output. The simple read/write/compare script listed below works
very well until the user enters something in Russian. It then
prints an **empty** response and falls into an infinite loop,
printing the prompt and then immediately an empty response without
actually reading anything.

But I think this is a subject for the “Issues” part of this forum.

p.s. I’ve found that I can set the “Consolas” font for a console
window and then output properly localized UTF8 strings without any
special code in the D script for managing code pages. Still, this
does not solve the localized input problem: any localized input
throws an exception “std.utf.UTFException... Invalid UTF-8
sequence”.

The script:

import core.sys.windows.windows;
import std.stdio;
import std.string;

int main(string[] args)
{
     const UTF8CP = 65001;
     UINT oldCP, oldOutputCP;
     oldCP = GetConsoleCP();
     oldOutputCP = GetConsoleOutputCP();

     SetConsoleCP(UTF8CP);
     SetConsoleOutputCP(UTF8CP);

     writeln("hello world, привет всем!");

     bool quit = false;
     string response;
     while (!quit)
     {
         write("respond with something: ");
         response=readln().strip();
         writefln("your response is \"%s\"", response);
         if (response == "quit")
         {
             writeln("goodbye then!");
             quit = true;
         }
     }

     SetConsoleCP(oldCP);
     SetConsoleOutputCP(oldOutputCP);

     return 0;
}

Dec 29 2017

zabruk70 <sorry noem.ail> writes:

On Friday, 29 December 2017 at 10:35:53 UTC, Andrei wrote:
 Though it is not suitable for GUI type of a Windows application.

AFAIK, the Windows GUI has no ANSI/OEM problem.
You can use Unicode.

For the Windows ANSI/OEM problem you can also use
https://dlang.org/phobos/std_windows_charset.html
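
A minimal sketch of how the two functions in that module fit together,
assuming a Windows build. toMBSz and fromMBSz are the real Phobos
functions; the wrapper name viaOemCodePage is made up for this example,
and code page 1 is the Windows constant CP_OEMCP (the console code page,
e.g. CP866 on a Russian system).

```d
version (Windows)
{
    import std.windows.charset : fromMBSz, toMBSz;

    /// Round-trips a UTF-8 string through the OEM (console) code page.
    /// Hypothetical helper, for illustration only.
    string viaOemCodePage(string text)
    {
        enum oemCodePage = 1;   // CP_OEMCP

        // UTF-8 -> zero-terminated string in the OEM code page.
        const(char)* oem = toMBSz(text, oemCodePage);

        // OEM code page -> UTF-8 again. The cast is needed because
        // fromMBSz takes an immutable pointer; the buffer is not
        // modified afterwards in this sketch.
        return fromMBSz(cast(immutable(char)*) oem, oemCodePage);
    }
}
```

Characters that the OEM code page cannot represent will not survive the
round trip, so this is only suitable for text in the system's own
language.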

Dec 29 2017

Andrei <aalub mail.ru> writes:

On Friday, 29 December 2017 at 11:14:39 UTC, zabruk70 wrote:
 On Friday, 29 December 2017 at 10:35:53 UTC, Andrei wrote:
 Though it is not suitable for GUI type of a Windows 
 application.

 AFAIK, Windows GUI have no ANSI/OEM problem.
 You can use Unicode.

Partly, yes. Just for a test I tried to "russify" the example
Windows GUI program that comes with the D installation pack
(samples\d\winsamp.d). Window captions, button captions and message
box texts written in UTF8 all show fine. But the direct text output
functions CreateFont()/TextOut() render all Cyrillic from UTF8
strings as garbage.

 For Windows ANSI/OEM problem you can use also
 https://dlang.org/phobos/std_windows_charset.html

Thank you very much, toMBSz() does the requisite translation for
the TextOut() function, with some workarounds.

Jan 02 2018

thedeemon <dlang thedeemon.com> writes:

On Wednesday, 3 January 2018 at 06:42:42 UTC, Andrei wrote:
 AFAIK, Windows GUI have no ANSI/OEM problem.
 You can use Unicode.

 Partly, yes. Just for a test I tried to "russify" the example 
 Windows GUI program that comes with D installation pack 
 (samples\d\winsamp.d). Window captions, button captions, 
 message box texts written in UTF8 all shows fine. But direct 
 text output functions CreateFont()/TextOut() render all 
 Cyrillic from UTF8 strings into garbage.

The Windows API contains two sets of functions: those whose names
end with A (meaning ANSI), and those whose names end with W (wide
characters, meaning Unicode). The sample uses TextOutA, a function
that expects an 8-bit encoding. Properly, you need to use TextOutW,
which accepts 16-bit Unicode, so just convert your UTF-8 D strings
to 16-bit Unicode wstrings; there are appropriate conversion
functions in Phobos.

Jan 03 2018

thedeemon <dlang thedeemon.com> writes:

On Wednesday, 3 January 2018 at 09:11:32 UTC, thedeemon wrote:
 you need to use TextOutW that accepts 16-bit Unicode, so just 
 convert your UTF-8 D strings to 16-bit Unicode wstrings, there 
 are appropriate conversion functions in Phobos.

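Some details:
import std.utf : toUTF16;
...
string s = "привет";
wstring ws = s.toUTF16;
// hdc, x and y stand for your own device context and text position
TextOutW(hdc, x, y, ws.ptr, cast(int) ws.length);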
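
To make the conversion step concrete, here is a small platform-independent
sketch of what std.utf does with the string from the snippet. It only
demonstrates the conversion; the actual drawing call still needs a device
context and coordinates from your own window code.

```d
import std.utf : toUTF16, toUTF16z;

unittest
{
    string s = "привет";        // UTF-8: two bytes per Cyrillic letter
    wstring w = s.toUTF16;      // UTF-16: one code unit per letter here

    assert(s.length == 12);     // length counts UTF-8 code units
    assert(w.length == 6);      // length counts UTF-16 code units

    // toUTF16z returns a zero-terminated pointer, convenient for W
    // functions that take a C-style wide string without a length.
    const(wchar)* wz = s.toUTF16z;
    assert(wz[0] == 'п');
    assert(wz[6] == 0);
}
```

For W functions that take an explicit character count, pass the length of
the wstring, not the length of the original UTF-8 string.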

Jan 03 2018

Andrei <aalub mail.ru> writes:

On Wednesday, 3 January 2018 at 09:11:32 UTC, thedeemon wrote:
 Windows API contains two sets of functions: those whose names 
 end with A (meaning ANSI), the other where names end with W 
 (wide characters, meaning Unicode). The sample uses TextOutA, 
 this function that expects 8-bit encoding.

Gosh, I should have known this :)) Thanks for the pointer!
TextOutW() works fine with wstring texts in this example, and no
more changes are needed.

That's just enough for this example. Thank you!

Yet my particular interest is console interaction. With the
help of this forum I've learned the console settings needed to
write Cyrillic properly and simply to the console using UTF8
encoding.

One thing that remains is to read and process the user's input.

For now, in the example I've cited above, the response=readln();
statement returns an empty string, in a console set to the UTF8
code page, if the user's input contains any Cyrillic letters. Then
the program's behavior differs depending on the compiler (or more
likely on the runtime library): the one compiled with ldc
continues to read and returns empty lines instead of the user's
input, and the one compiled with dmd only returns empty lines, not
waiting for the user's input and not actually reading anything
(i.e. it falls into an infinite loop, busily printing an empty
response hundreds of times a second).

That's only for localized input. With ASCII input the same program
works fine.

Maybe there are some more settings I must learn to set up the
console to properly read non-ASCII input?

Jan 03 2018

Martin Krejcirik <mk-junk i-line.cz> writes:

On Friday, 29 December 2017 at 11:14:39 UTC, zabruk70 wrote:
 AFAIK, Windows GUI have no ANSI/OEM problem.
 You can use Unicode.

Be advised there are some problems with console UTF-8
input/output in Windows. The most usable is the new Win10 console
window, but I recommend using the Windows API (WriteConsole)
instead. It works correctly regardless of the codepage setting, OS
version and C library.
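
A minimal sketch of that approach, assuming a Windows build and the
bindings in core.sys.windows.windows. The wrapper name
writeConsoleUnicode is made up for this example. WriteConsoleW takes
UTF-16, so the D string is converted first.

```d
import std.utf : toUTF16;

version (Windows)
{
    import core.sys.windows.windows;

    /// Writes a UTF-8 string to the console as UTF-16 via WriteConsoleW.
    /// Hypothetical helper. Returns false if the call fails, which is
    /// expected when stdout is redirected to a file or a pipe.
    bool writeConsoleUnicode(string text)
    {
        wstring wide = text.toUTF16;
        HANDLE console = GetStdHandle(STD_OUTPUT_HANDLE);
        DWORD written;
        return WriteConsoleW(console, wide.ptr, cast(DWORD) wide.length,
                             &written, null) != 0;
    }
}
```

Calling writeConsoleUnicode("привет всем!\n") from main should print the
text whatever the console code page is, provided the console font has
Cyrillic glyphs. Because the function only works on a real console, a
program that may be redirected needs a fallback such as writeln.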

Jan 03 2018

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Fri, Dec 29, 2017 at 10:35:53AM +0000, Andrei via Digitalmars-d-learn wrote:
 On Thursday, 28 December 2017 at 18:45:39 UTC, H. S. Teoh wrote:
 On Thu, Dec 28, 2017 at 05:56:32PM +0000, Andrei via Digitalmars-d-learn
 wrote:
 ...
 The string / wstring / dstring types in D are intended to be Unicode
 strings.  If you need to use other encodings, you really should be
 using ubyte[] or const(ubyte)[] or immutable(ubyte)[], instead of
 string.

 
 Thank you Teoh for advise and good example! I was looking towards
 writing something like that if no decision exists. Still this way of
 deliberate translations seems to be not the best. It supposes explicit
 workaround for every ahchoo in Russian and steady converting ubyte[]
 to string and back around. No formatting gems, no simple and elegant
 I/O statements or string/char comparisons. This may be endurable if
 you write an application where Russian is only one of rare options,
 and what if your whole environment is totally Russian?

You mean if your environment uses a non-UTF encoding?  If your
environment uses UTF, there is no problem.  I have code with strings in
Russian (and other languages) embedded, and it's no problem because
everything is in Unicode, all input and all output.

But I understand that in Windows you may not have this luxury. So you
have to deal with codepages and what-not.

Converting back and forth is not a big problem, and it actually also
solves the problem of string comparisons, because std.uni provides
utilities for collating strings, etc. But it only works for Unicode, so
you have to convert to Unicode internally anyway.  Also, for static
strings, it's not hard to make the codepage mapping functions CTFE-able,
so you can actually write string literals in a codepage and have the
compiler automatically convert it to UTF-8.
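
A minimal sketch of a CTFE-able mapping, here in the UTF-8 to code page
direction and for Windows-1251. The names toCP1251 and greeting1251 are
made up for this example, and only ASCII and the Russian letters are
covered. Because the function is pure and uses only simple array
operations, the compiler can evaluate it when initializing a static
variable.

```d
/// Encodes ASCII and Russian letters as Windows-1251 bytes.
/// Hypothetical helper; written so that it can run at compile time.
immutable(ubyte)[] toCP1251(string text) pure
{
    immutable(ubyte)[] result;
    foreach (dchar ch; text)    // decodes the UTF-8 input
    {
        if (ch < 0x80)
            result ~= cast(ubyte) ch;                   // ASCII
        else if (ch >= 0x410 && ch <= 0x44F)
            result ~= cast(ubyte)(ch - 0x410 + 0xC0);   // А..я
        else if (ch == 0x401)
            result ~= ubyte(0xA8);                      // Ё
        else if (ch == 0x451)
            result ~= ubyte(0xB8);                      // ё
        else
            throw new Exception("character not covered by this sketch");
    }
    return result;
}

// The initializer of a static variable is evaluated by the compiler,
// so the converted bytes end up in the binary as a constant.
static immutable ubyte[] greeting1251 = toCP1251("Привет");
static assert(greeting1251 == [0xCF, 0xF0, 0xE8, 0xE2, 0xE5, 0xF2]);
```

An unsupported character in such a literal then becomes a compile error
instead of a run-time exception.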

The other approach, if you don't like the idea of converting codepages
all the time, is to explicitly work in ubyte[] for all strings. Or,
preferably, create your own string type with ubyte[] representation
underneath, and implement your own comparison functions, etc., then use
this type for all strings. Better yet, contribute this to code.dlang.org
so that others who have the same problem can reuse your code instead of
needing to write their own.

[...]
 p.s. I’ve found that I may set “Consolas” font for a console window
 and then you can output properly localized UTF8 strings without any
 special code in D script managing code pages. Still this does not
 decide localized input problem: any localized input throws an
 exception “std.utf.UTFException...  Invalid UTF-8 sequence”.

Is the exception thrown in readln() or in writeln()? If it's in
writeln(), it shouldn't be a big deal, you just have to pass the data
returned by readln() to fromKOI8r (or whatever other codepage you're
using).

If the problem is in readln(), then you probably need to read the input
in binary (i.e., as ubyte[]) and convert it manually. Unfortunately,
there's no other way around this if you're forced to use codepages. The
ideal situation is if you can just use Unicode throughout your
environment. But of course, sometimes you have no choice.


T

-- 
Heuristics are bug-ridden by definition. If they didn't have bugs, they'd be
algorithms.

Dec 29 2017

Andrei <aalub mail.ru> writes:

On Friday, 29 December 2017 at 18:13:04 UTC, H. S. Teoh wrote:
 On Fri, Dec 29, 2017 at 10:35:53AM +0000, Andrei via 
 Digitalmars-d-learn wrote:
 This may be endurable if you write an application where 
 Russian is only one of rare options, and what if your whole 
 environment is totally Russian?

 You mean if your environment uses a non-UTF encoding?  If your 
 environment uses UTF, there is no problem.  I have code with 
 strings in Russian (and other languages) embedded, and it's no 
 problem because everything is in Unicode, all input and all 
 output.

No, I mean the difficulties of writing a program for non-ASCII
locales. Learning any programming language since C starts with
a "hello world" program, which every non-English programmer
naturally tries to translate into their native language, only to
get an unreadable mess on the screen. Thousands try, hundreds look
for a solution, dozens find it, and a few continue with the new
language. That's not because these programmers cannot read
English textbooks; they can. That's because they want to write
non-English programs for non-English people, and that's
essential. And there are many programming languages (or rather
their runtimes) which do not suffer from such a deficiency.

That's the reason for Unicode adoption all over the programming
world, including the D language, but what good is it to me if I
can write a UTF8 string with my native language text in a D
program and still get the same unreadable mess on the screen?

Yes, a new language in development can lack support for some
features, but this forum thread shows that a simple and handy
solution exists, yet nobody cares to bring it to the first pages
of every textbook for beginners, at least as a footnote. Thus
thousands of potential fans of the new language are lost from the
start.

 But I understand that in Windows you may not have this luxury. 
 So you have to deal with codepages and what-not.

 Converting back and forth is not a big problem, and it actually 
 also solves the problem of string comparisons, because std.uni 
 provides utilities for collating strings, etc.. But it only 
 works for Unicode, so you have to convert to Unicode internally 
 anyway.  Also, for static strings, it's not hard to make the 
 codepage mapping functions CTFE-able, so you can actually write 
 string literals in a codepage and have the compiler 
 automatically convert it to UTF-8.

 The other approach, if you don't like the idea of converting 
 codepages all the time, is to explicitly work in ubyte[] for 
 all strings. Or, preferably, create your own string type with 
 ubyte[] representation underneath, and implement your own 
 comparison functions, etc., then use this type for all strings. 
 Better yet, contribute this to code.dlang.org so that others 
 who have the same problem can reuse your code instead of 
 needing to write their own.

I'd definitely try this if I decide to use the D language for my
purposes (which is not settled yet). But to decide I need some
experience, and for now it has stopped at reading the user's input
(for training I intend to translate into D my recent rather


 Still this does not decide localized input problem: any 
 localized input throws an exception “std.utf.UTFException...  
 Invalid UTF-8 sequence”.

 Is the exception thrown in readln() or in writeln()? If it's in
 writeln(), it shouldn't be a big deal, you just have to pass 
 the data returned by readln() to fromKOI8 (or whatever other 
 codepage you're using).

 If the problem is in readln(), then you probably need to read 
 the input in binary (i.e., as ubyte[]) and convert it manually. 
 Unfortunately, there's no other way around this if you're 
 forced to use codepages. The ideal situation is if you can just 
 use Unicode throughout your environment. But of course, 
 sometimes you have no choice.

It depends.

If I skip proper console code page initialization, I see in the
debugger that the runtime reads the user's input as CP866 (MS DOS)
Cyrillic and then throws the exception "Invalid UTF-8 sequence"
when trying to handle it as a UTF8 string (in particular in the
strip() or writeln() functions). This situation seems quite
manageable with the code page conversions you've mentioned above.
I've tried the first library function I found
(std.windows.charset), and got a rather fanciful working statement:

response = fromMBSz((readln()~"\0").ptr, 1).strip();

which assigns the correct Latin/Cyrillic contents to the response
variable.
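
For comparison, the same CP866 to UTF-8 step can be sketched in plain D
without the Windows API, in the spirit of the KOI8-R module earlier in
the thread. The name fromCP866 is made up for this example, and only
ASCII and the Russian letters are handled.

```d
/// Decodes CP866 bytes into a UTF-8 string.
/// Hypothetical helper: handles ASCII and Russian letters only.
string fromCP866(const(ubyte)[] bytes)
{
    string result;
    foreach (b; bytes)
    {
        dchar ch;
        if (b < 0x80)
            ch = cast(dchar) b;                         // ASCII
        else if (b <= 0xAF)
            ch = cast(dchar)(0x410 + (b - 0x80));       // А..Я and а..п
        else if (b >= 0xE0 && b <= 0xEF)
            ch = cast(dchar)(0x440 + (b - 0xE0));       // р..я
        else if (b == 0xF0)
            ch = cast(dchar) 0x401;                     // Ё
        else if (b == 0xF1)
            ch = cast(dchar) 0x451;                     // ё
        else
            throw new Exception("CP866 byte not handled by this sketch");

        result ~= ch;   // appending a dchar to a string encodes it as UTF-8
    }
    return result;
}

unittest
{
    immutable ubyte[] input = [0x8F, 0xE0, 0xA8, 0xA2, 0xA5, 0xE2];
    assert(fromCP866(input) == "Привет");
}
```

The result is an ordinary D string, so strip(), comparisons and writeln()
can be used on it afterwards.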

And if I initialize the console with a SetConsoleCP(65001)
statement, things get worse, as I've said above. Then the readln()
statement returns an empty string and something gets broken inside
the runtime, because any further readln() statements do not wait
for user input and return empty strings immediately.

Jan 04 2018

Andrei <aalub mail.ru> writes:

On Friday, 29 December 2017 at 18:13:04 UTC, H. S. Teoh wrote:
 If the problem is in readln(), then you probably need to read 
 the input in binary (i.e., as ubyte[]) and convert it manually.

Could you kindly explain how I can read console input into a binary
ubyte[]?
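
One possible way, as a sketch rather than an answer from the thread: read
the line byte by byte through the C runtime, so that nothing tries to
interpret the input as UTF-8. The helper name readLineBytes is made up
for this example; fgetc and File.getFP are the real C and Phobos
functions.

```d
import core.stdc.stdio : EOF, fgetc;
import std.stdio : File;

/// Reads one line from `input` as raw bytes, with no UTF-8 validation.
/// Hypothetical helper. The trailing '\n' (and a preceding '\r', if
/// present) is dropped.
ubyte[] readLineBytes(File input)
{
    ubyte[] line;
    for (;;)
    {
        immutable int c = fgetc(input.getFP());
        if (c == EOF || c == '\n')
            break;
        line ~= cast(ubyte) c;
    }
    if (line.length && line[$ - 1] == '\r')
        line = line[0 .. $ - 1];
    return line;
}
```

In a console program this would be called as readLineBytes(stdin), and
the resulting bytes can then be converted with whichever code page
routine fits the console (for example the std.windows.charset functions
mentioned earlier in the thread).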

Jan 04 2018

zabruk70 <sorry noem.ail> writes:

you can just set console CP to UTF-8:

https://github.com/CyberShadow/ae/blob/master/sys/console.d

Dec 28 2017
