digitalmars.D - [Suggestion] Standard version identifiers for language

Stewart Gordon <smjg_1998 yahoo.com> writes:

Suppose one wants to write an application with versions in different 
languages.  That's human languages, not programming languages.

At face value, that's simple - use version blocks to hold the various 
languages' UI text.  Or for Windows, define a separate resource file for 
each language.  (Or lists of string macros to be imported by one 
resource file.)

But what if you want to use one or more libraries from various sources, 
which also may have language versions?  Then you'd have to set all the 
version identifiers that the different library designers have chosen for 
your choice of language, which could lead to quite long command lines. 
It would be simpler if there could be a standard system of language 
identifiers for everyone to follow.

A system based on ISO 639-1 would probably be good.  One could then write

----------
version (en) {
     const char[][] DAY = [ "Sunday", "Monday", "Tuesday", "Wednesday",
       "Thursday", "Friday", "Saturday" ];
} else version (fr) {
     const char[][] DAY = [ "dimanche", "lundi", "mardi", "mercredi",
       "jeudi", "vendredi", "samedi" ];
} else {
     static assert (false);
}
----------

There are a few matters of debate to be considered:

1. Should we create a prefixed language namespace?  Or will these codes 
by themselves do?

2. How really should a lib be written to deal with unsupported 
languages, or if no language has been specified?  Two possibilities I 
can see:
(a) the lib programmer would put his/her own language (or maybe the one 
predicted to be most popular) as the default.
(b) a static assert as above, effectively telling the app programmer 
"please set a language, or create a version block in me for your 
language".  Maybe a future D compiler could be configured to use a 
certain language version as the default if none is specified on the 
command line.

3. Should we really have them as version identifiers?  Or invent a new 
CC block called 'language' that would have the specifics of language 
designation built in, a corresponding command line option and a 
corresponding compiler configuration setting?

Dialects of a language could be indicated by replacing the hyphen in the 
ISO code with an underscore.  Libs would then have something like

----------
version (en_GB) {
     ...
} else version (en_US) {
     ...
} else version (en) {
     ...
}
----------

It would be necessary either for the compiler to automatically set en if 
en_GB or en_US or en_anything is set, and similarly for other language 
codes, or to persuade all D users to do this.  Of course, this would be 
done in the aforementioned default language setting.
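
A minimal sketch of the "dialect implies base language" idea, done by hand inside one module rather than by the compiler. The `version = en_GB;` line is only a stand-in for passing `-version=en_GB` on the command line, and the convention itself is an assumption, not an existing D feature. Note that version identifiers set this way apply only to the module they appear in.

```d
import std.stdio : writeln;

// Stand-in for `dmd -version=en_GB`; remove it when the flag is passed instead.
version = en_GB;

// Hypothetical convention: every dialect identifier also sets its base language.
// These must appear before `en` is first tested.
version (en_GB) { version = en; }
version (en_US) { version = en; }

version (en)
{
    immutable string[] DAY = ["Sunday", "Monday", "Tuesday", "Wednesday",
                              "Thursday", "Friday", "Saturday"];
}
else version (fr)
{
    immutable string[] DAY = ["dimanche", "lundi", "mardi", "mercredi",
                              "jeudi", "vendredi", "samedi"];
}
else
{
    static assert(false, "Set a language version, e.g. -version=en");
}

void main()
{
    // With en_GB set, the generic English table is selected.
    writeln(DAY[0]);
}
```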

This would give lib programmers a choice of writing for each dialect of 
each language, just covering the basic languages, or a mixture.  A 
default fallback for an unsupported dialect would, I guess, typically be 
some 'default' dialect if that makes sense.

This provides for compile-time localisation.  Of course, some might want 
run-time l10n, in which case the app would be explicitly programmed to 
do this.

In writing a lib one might choose to support run-time localisation 
(RTL).  In that case the version/language blocks would be used to select 
the default language, which would make the lib usable for monolingual 
apps, compile-time-localised (CTL) apps and RTL apps alike.  Of course, 
one could argue that there should be some global variable in Phobos or 
somewhere for run-time language....

Stewart.

-- 
My e-mail is valid but not my primary mailbox, aside from its being the 
unfortunate victim of intensive mail-bombing at the moment.  Please keep 
replies on the 'group where everyone may benefit.

Jul 16 2004

Arcane Jill <Arcane_member pathlink.com> writes:

In article <cd8rji$ar0$1 digitaldaemon.com>, Stewart Gordon says...
<some interesting ideas>

Can I respond to this next week?

Short answer - this is a run-time problem, not a compile-time problem. A third
party should (IMO) be able to add additional human languages given access only
to the executable binary and some ini files (or something similar).

Long answer - can you wait? I've got lots of ideas about this, but I'm really not
up to debate just yet.

Arcane Jill

Jul 16 2004

J C Calvarese <jcc7 cox.net> writes:

Stewart Gordon wrote:
 Suppose one wants to write an application with versions in different 
 languages.  That's human languages, not programming languages.
 
 At face value, that's simple - use version blocks to hold the various 
 languages' UI text.  Or for Windows, define a separate resource file for 
 each language.  (Or lists of string macros to be imported by one 
 resource file.)
 
 But what if you want to use one or more libraries from various sources, 
 which also may have language versions?  Then you'd have to set all the 
 version identifiers that the different library designers have chosen for 
 your choice of language, which could lead to quite long command lines. 
 It would be simpler if there could be a standard system of language 
 identifiers for everyone to follow.
 
 A system based on ISO 639-1 would probably be good.  One could then write
 
 ----------
 version (en) {
     const char[][] DAY = [ "Sunday", "Monday", "Tuesday", "Wednesday",
       "Thursday", "Friday", "Saturday" ];
 } else version (fr) {
     const char[][] DAY = [ "dimanche", "lundi", "mardi", "mercredi",
       "jeudi", "vendredi", "samedi" ];
 } else {
     static assert (false);
 }
 ----------

There seem to be two schools of thought in the area of localization (of 
which language issues are a subset):

1. Compile-time generated using version() as you described in your post.

2. Run-time generated using some sort of a plugin architecture or 
language resource files (as Arcane Jill alludes to in her reply).


Since D is capable enough for either method, both parties can be happy.

Personally, I think I'd prefer to use compile-time localization, but 
that doesn't prevent others from designing run-time localization 
functions.

I have a quick comment about the specifics of your ideas. I think the 
version identifiers should have a prefix (such as "lang_"). This should 
make it clear to most viewers of the code what's happening. Not everyone 
would intuitively know that ky_KG is a language feature, but lang_ky_KG 
is guessable.

version (lang_en_GB) {
     ...
} else version (lang_en_US) {
     ...
} else version (lang_en) {
     ...
}

Since I don't have any real experience with localization, I'd love to 
hear some opinions from those who have actually worked with 
localization. Which programming languages make localization easy? Which 
libraries are helpful?

-- 
Justin (a/k/a jcc7)
http://jcc_7.tripod.com/d/

Jul 16 2004

Arcane Jill <Arcane_member pathlink.com> writes:

In article <cd9peo$oce$1 digitaldaemon.com>, J C Calvarese says...

There seems to be 2 schools of thought in the area of localization (of 
which language issues are a subset):

1. Compile-time generated using version() as you described in your post.

2. Runtime-time generated using some sort of a plugin architecture or 
language resource files (as Arcane Jill alludes to in her reply).

Since D is capable enough for either method, both parties can be happy.

Personally, I think I'd prefer to use compile-time localization, but 
that doesn't prevent others from designing runtime-time localization 
functions.

Compile-time localization is not something I'd thought about before, so it's been
kind of an interesting thing to think about. The thing is, though, we already
_have_ compile-time localization. We've always had it. As Stewart said, you can
do:
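
----------
version (en) {
     const char[][] DAY = [ "Sunday", "Monday", "Tuesday", "Wednesday",
       "Thursday", "Friday", "Saturday" ];
} else version (fr) {
     const char[][] DAY = [ "dimanche", "lundi", "mardi", "mercredi",
       "jeudi", "vendredi", "samedi" ];
} else {
     static assert (false);
}
----------
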
And we've been able to do that in C ever since we learned how to use #ifdef. So
it requires no new language features. It's already there. But, since this
technique has been around for so long, you'd expect it to be widely used ...
unless, that is, it turns out to be not very useful.

There are a number of disadvantages I can think of. For a start, your
locale-specific code will end up distributed throughout your source code,
instead of all in one place. This could be a nightmare if you decide to support
a new locale. Another problem is that, if you choose the locale at compile-time,
then the end-user (as opposed to the developer) has to have the source code, OR
an executable which was compiled especially for their locale. And that's just
for executables. For libraries, the situation is even worse. A
compile-time-localized executable would have to be linked with
compile-time-localized libraries, compiled for the same locale. It would be a
serious headache.

Another problem is that someone might compile it for version(en_US), without
realizing that they should have been using version(en).

And all for what? To save a small amount of run-time overhead. Well, /how much/
run-time overhead? In most cases, run-time-localization amounts to looking
something up in a map. Is that bad? A matter of judgement, maybe, but I'd say it
was insignificant compared to the overhead incurred by writing that localized
string to printf() or a file. 

Localization, to me, is the flip-side of internationalization (or i18n for lazy
typists). The way it's traditionally done is you "internationalize" your code -
a compile-time thing, and then "localize" it at run-time. Here's an example.
Start with some normal, unlocalized code:



Now, internationalize it. It will become something like this:



Not much different really. The function local() would probably be something like
this, and would get inlined:








"Localizing" this program now consists only of initializing the localizedLookup
map, which could happen in any number of ways.
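
A minimal sketch of the pattern described above, in present-day D. The names `local()` and `localizedLookup` come from the post; the greeting strings and the fall-back-to-source-text behaviour are assumptions made for illustration.

```d
import std.stdio : writeln;

// Maps source-language strings to their translations.
// Empty until the program is "localized" at run time.
string[string] localizedLookup;

// Returns the translation of `text`, or `text` itself when none is loaded
// (the fallback is an assumption; the post does not specify it).
string local(string text)
{
    if (auto translated = text in localizedLookup)
        return *translated;
    return text;
}

void main()
{
    // Unlocalized code would be: writeln("Hello, world");
    // Internationalized code routes the literal through local():
    writeln(local("Hello, world"));

    // "Localizing" is just filling the map, here by hand; a real program
    // might load it from a file chosen by the user's locale.
    localizedLookup["Hello, world"] = "Bonjour, le monde";
    writeln(local("Hello, world"));
}
```

The lookup is a single associative-array access, which is the run-time overhead being weighed against compile-time selection.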

My guess is that what would be most useful in terms of
internationalization/localization would be some classes and functions to make
stuff like the above easier, providing localized number formats and so on. Plus
of course the D definition of a "locale" - we have to standardize that somehow.
Me, I'd prefer enums to strings - saves all that messing about with case, for
one thing, and string-splitting to get at the two (possibly three) parts.



I have a quick comment about the specifics of your ideas. I think the 
version identifiers should have a prefix (such as "lang_"). This should 
make it clear to most viewers of the code what's happening. Not everyone 
would intuitively know that ky_KG is a language feature, but lang_ky_KG 
is guessable.

Strictly speaking it should be "locale_", not "lang_". ("ky" is a lang, or
language; "KG" is a country. The combination "ky_KG" is a locale). Sorry for
being a pedantic bugger here.

But the original question was, should there be "standard version identifiers"?
Thing is - I don't see how there can be. A version identifier is just D's name
for a #define, and there's nothing to stop anyone from using whatever they want as such.
A note in the style guide might help, but even that won't force people to use
said standard.

Just my thoughts.

Arcane Jill

Jul 18 2004

Juanjo Álvarez <juanjuxNO SPAMyahoo.es> writes:

What about just porting GNU gettext to phobos? This way you have a
semi-standard way of localizing programs (which a lot of translators know
about), and a set of pre-written tools (even nice GUI ones).

Looking at the Python implementation, it should not be difficult; the Python
implementation is only 493 lines (gettext.py). I'll see if I can take enough
time to do it in the next few weeks.

Jul 18 2004

Arcane Jill <Arcane_member pathlink.com> writes:

In article <cdeva5$2p2o$1 digitaldaemon.com>, Juanjo Álvarez says...
What about just porting GNU gettext to phobos? This way you have a
semi-standart way of localizing programs (which a lot of translators know
about), and a set of pre-written tools (even nice GUI ones).

I'm not sure. I'll admit I don't know much about gettext, so perhaps you could
clear a few things up for me (and others)?

What is the encoding of a .po or .mo file? Specifically, is it UTF-8? (Or CAN it
be UTF-8?). Put another way - how good is its Unicode support?

I guess I have to say I'd be disappointed if we had to rely on yet another C
library. Maybe I'm just completely mad, but I'd prefer a pure D solution. (That
said, I have no objection to using files of the same format). gettext looks like
it does some cool stuff, but it's ... well ... C. It's not OO, unless I've
misunderstood. It doesn't use exceptions. Moreover, it assumes the Linux meaning
of "locale", which is (again, in my opinion) not right for D.

The way I see it, D should define locales exclusively in terms of ISO language
and country codes, plus variant extensions. Unicode defines locales that way,
and the etc.unicode library will have no choice but to use the ISO codes.
Collation and stuff like that will need to rely on data from the CLDR (Common
Locale Data Repository - see http://www.unicode.org/cldr/).

I suppose my gut feeling is that internationalization /isn't that hard/, so it
ought to be relatively simple a task to come up with a native-D solution.
gettext seems to do string translation only (again, correct me if I'm wrong),
which is only a small part of internationalization/localization.

So, I guess, on balance, I'd vote against this one, at least pending some
persuasive argument. That said, I'm way too busy to volunteer for any work (plus
I'm still taking a bit of time off from coding for personal reasons) - although
I /do/ intend to tackle Unicode localization quite soon.

Does that help? Probably not, I guess. Ah well.

Tell you what, let's start an open discussion. (I've changed the thread title).
I think we should hear lots of opinions before anyone actually DOES anything. A
wrong early decision here could hamper D's potential future as /the/ language
for internationalization (which I'd like it to become).

Arcane Jill

Jul 19 2004

Juanjo Álvarez <juanjuxNO SPAMyahoo.es> writes:

First things first, you can have all the documentation about GNU gettext
here:

http://www.gnu.org/software/gettext/manual/html_chapter/gettext_toc.html

Arcane Jill wrote:

 I'm not sure. I'll admit I don't know much about gettext, so perhaps you
 could clear a few things up for me (and others)?

Let's try.
 
 What is the encoding of a .po or .mo file? Specifically, is it UTF-8? (Or
 CAN it be UTF-8?). Put another way - how good is its Unicode support?

The encoding of the file is declared in the header of the po files; so it
can be (I think) anything, for example:

"Project-Id-Version: animail\n"
"POT-Creation-Date: 2003-12-07 02:02+0100\n"
"PO-Revision-Date: 2004-07-08 20:19+0200\n"
"Last-Translator: XXX XXX <XXX XXX.de>\n"
"Language-Team: Deutsch <de li.org>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
                                   ^^^^^
"Content-Transfer-Encoding: 8bit\n"
                            ^^^^
"X-Generator: KBabel 1.3.1\n"

 I guess I have to say I'd be disappointed if we had to rely on yet another
 C library. 

We don't; the Python implementation (the one of about 500 lines) doesn't use any
external C lib at all; it's 100% pure Python. 

 gettext looks like it does some cool stuff, but it's ... well ...
 C. 

I repeat, it doesn't have to be C. gettext is more like a set of tools and
formats than a library (although the library exists, of course, and not
only for C but for a lot of languages.)

 It's not OO.

It can be done as OO.

 unless I've misunderstood. It doesn't use exceptions.  

Our implementation could.

 Moreover, it assumes the Linux meaning of "locale", which is (again, in my
 opinion) not right for D.
 The way I see it, D should define locales exclusively in terms of ISO
 language and country codes, plus variant extensions. 
 Unicode defines 
 locales that way, and the etc.unicode library will have no choice but to
 use the ISO codes. Collation and stuff like that will need to rely on data
 from the CDLR (Common Locale Data Repository - see
 http://www.unicode.org/cldr/).

I don't know the answer to this one; gettext seems to use ISO 3166 country
codes and ISO 639 language codes.

 I suppose my gut feeling is that internationalization /isn't that hard/,
 so it ought to be relatively simple a task to come up with a native-D
 solution. 

 gettext seems to do string translation only (again, correct me 
 if I'm wrong), which only a small part of
 internationalization/localazation.

That's true. It also handles plural forms which is not so simple
(http://www.gnu.org/software/gettext/manual/html_chapter/gettext_10.html#SEC150)

Anyway, if we can all discuss the matter and come up with a better solution than
gettext (which I'm sure is possible) I doubt many will be opposed.

Jul 19 2004

"Thomas Kuehne" <eisvogel users.sourceforge.net> writes:

Juanjo Álvarez:
 Anyway if we all can discuss the matter and come with a better solution than
 gettext (which I'm sure it's possible) I doubt many will be opposed.

Just to point out some other - not necessarily better - localization libs:

qt/kde: http://doc.trolltech.com/3.0/linguist-manual.html
java: http://java.sun.com/j2se/1.5.0/docs/api/java/util/Formattable.html

Jul 19 2004

Juanjo Álvarez <juanjuxNO SPAMyahoo.es> writes:

Thomas Kuehne wrote:

 Juanjo Álvarez:
 Anyway if we all can discuss the matter and come with a better solution than
 gettext (which I'm sure it's possible) I doubt many will be opposed.

 
 Just to point out some other - not neccessary better - localization libs:
 
 qt/kde: http://doc.trolltech.com/3.0/linguist-manual.html
 java: http://java.sun.com/j2se/1.5.0/docs/api/java/util/Formattable.html

I don't know about the Java implementation but Qt tr() is very similar to
gettext (the format of the translation files is different.) I don't know if
KDE uses gettext internally but they use po/mo files just like gettext (and
with the same format.)

Jul 19 2004

Berin Loritsch <bloritsch d-haven.org> writes:

Juanjo Álvarez wrote:

 Thomas Kuehne wrote:

  Juanjo Álvarez:
  Anyway if we all can discuss the matter and come with a better solution than
  gettext (which I'm sure it's possible) I doubt many will be opposed.

  Just to point out some other - not neccessary better - localization libs:

  qt/kde: http://doc.trolltech.com/3.0/linguist-manual.html
  java: http://java.sun.com/j2se/1.5.0/docs/api/java/util/Formattable.html

 I don't know about the Java implementation but Qt tr() is very similar to
 gettext (the format of the translation files is different.) I don't know if
 KDE uses gettext internally but they use po/mo files just like gettext (and
 with the same format.)

With the Java MessageFormat solution, things work fairly well.  Consider 
for instance:

MessageFormat.format(
    "There was a problem in {0}, where {2} parse errors " +
    "encountered at line {1}", location, lineNum, numErrs);

There is a way to map which argument goes to which location in the line. 
  Also, the MessageFormat does have a shorthand for kind of an if:then 
construct so that the same message would be interpreted differently for 
plurals/etc.  That makes it convenient to handle those issues in I18N.

However, I would not say that the MessageFormat is super easy to use. 
It could have a better interface, but the concepts are pretty decent.
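
For comparison, here is a rough D sketch of the same indexed-placeholder idea. `formatMessage` is a hypothetical helper written for this example, not a Phobos function and not a port of Java's MessageFormat; it handles only plain `{n}` placeholders and none of the plural or choice syntax mentioned above.

```d
import std.array : appender;
import std.conv : to;
import std.stdio : writeln;
import std.string : indexOf;

/// Replaces each "{n}" in `pattern` with args[n].
/// Hypothetical helper: no escaping, no plural handling.
string formatMessage(string pattern, string[] args...)
{
    auto result = appender!string();
    size_t pos = 0;
    while (pos < pattern.length)
    {
        ptrdiff_t open = pattern.indexOf('{', pos);
        if (open < 0)
            break;
        ptrdiff_t close = pattern.indexOf('}', cast(size_t) open);
        if (close < 0)
            break;
        // Copy the literal text before the placeholder.
        result.put(pattern[pos .. open]);
        // Throws ConvException if the placeholder is not a number.
        size_t index = pattern[open + 1 .. close].to!size_t;
        result.put(args[index]);
        pos = close + 1;
    }
    // Copy whatever literal text remains.
    result.put(pattern[pos .. $]);
    return result.data;
}

void main()
{
    // Arguments are passed in one fixed order; the pattern decides where
    // each one lands, so a translated pattern can reorder them freely.
    writeln(formatMessage(
        "There was a problem in {0}, where {2} parse errors encountered at line {1}",
        "main.d", "42", "3"));
}
```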

Jul 19 2004

Arcane Jill <Arcane_member pathlink.com> writes:

In article <cdg2mr$7dn$1 digitaldaemon.com>, Juanjo Álvarez says...

The enconding of the file is declared in the header of the po files; so it
can be (I think) anything, for example:

the Python implementation, (that of about 500 lines) don't use any
external C lib at all; it's 100% pure Python. 

It can be done as OO.

Our implementation could [use exceptions].

gettext seems to use ISO 3166 country
codes and ISO 639 language codes.

Wow!

Well, you've quashed all of my objections then. I'll change my vote then. Looks
like a D implementation of gettext is the way to go.

Just one last thing though - we do need a D definition of a locale. In effect,
we need a class Locale (or possibly a struct Locale) containing those ISO codes.
Java uses strings internally (I _think_), but there are a whole bunch of reasons
why that's not such a good idea - such as the fact that "fr", "fra" and "fre"
are all, equivalently, the language code for French, and should all compare as
equal; such as case and other punctuation concerns ("en-us" == "en-US" ==
"en_us" == "en_US", etc.). I'd vote for putting enums inside the class (enum
Language and enum Country - the variant field will still need to be a string). I
imagine that the gettext implementation will need to use our yet-to-be-invented
Locale class, and the unicode lib certainly will (and soon). Any thoughts? Class
or struct? Strings or enums? Something else?
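
One way the enum-based idea could look, as a sketch only. It uses a struct for brevity (the class-or-struct question is left open above), and the enum members, the accepted code spellings and the `parseLocale` helper are all assumptions covering just a handful of codes.

```d
import std.array : replace, split;
import std.exception : enforce;
import std.stdio : writeln;
import std.uni : toLower, toUpper;

// Tiny illustrative subsets; a real library would cover the full ISO lists.
enum Language { en, fr }
enum Country { none, GB, US, FR }

struct Locale
{
    Language language;
    Country country = Country.none;
    string variant; // free-form, so it stays a string
}

// Accepts two- and three-letter codes in any case.
Language parseLanguage(string code)
{
    switch (code.toLower)
    {
        case "en", "eng":        return Language.en;
        case "fr", "fra", "fre": return Language.fr;
        default: throw new Exception("Unknown language code: " ~ code);
    }
}

Country parseCountry(string code)
{
    switch (code.toUpper)
    {
        case "GB": return Country.GB;
        case "US": return Country.US;
        case "FR": return Country.FR;
        default: throw new Exception("Unknown country code: " ~ code);
    }
}

// Parses tags such as "en-us", "en_US" or "fra" into a Locale.
Locale parseLocale(string tag)
{
    auto parts = tag.replace("-", "_").split("_");
    enforce(parts.length >= 1 && parts[0].length > 0, "Empty locale tag");
    Locale loc;
    loc.language = parseLanguage(parts[0]);
    if (parts.length > 1) loc.country = parseCountry(parts[1]);
    if (parts.length > 2) loc.variant = parts[2];
    return loc;
}

void main()
{
    // Spelling differences vanish once the tag is parsed into enums.
    assert(parseLocale("en-us") == parseLocale("en_US"));
    assert(parseLocale("fra") == parseLocale("fre"));
    writeln(parseLocale("fr_FR"));
}
```

Because comparison happens on enum values, the case and punctuation concerns only have to be dealt with once, at parse time.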

Arcane Jill

Jul 19 2004

Juanjo Álvarez <juanjuxNO SPAMyahoo.es> writes:

Arcane Jill wrote:


 Just one last thing though - we do need a D definition of a locale. In
 effect, we need a class Locale (or possibly a struct Locale) containing
 those ISO codes.

Yes, definitely.

 the language code for
 French, and should all compare as equal; such as case and other
 punctuation concerns ("en-us" == "en-US" == "en_us" == "en_US", etc.). I'd
 vote for putting enums inside the class (enum Language and enum Country -
 the variant field will still need to be a string).

I vote for that too.

 I imagine that the 
 gettext implementation will need to use our yet-to-be-invented Locale
 class, and the unicode lib certainly will (and soon). Any thoughts? Class
 or struct? Strings or enums? Something else?

Class + enums, IMHO :)

Also the Python locale module[1] (sorry, I'm a pythonist :) could be a good
source of inspiration; it supports Unix-, Windows- and Mac-style locales with
a bunch of useful functions (getlocale, getdefaultlocale, setlocale,
normalize, locale-aware atoi+atof+str+format+strcoll, etc...). 

I'll take a look at it this weekend.

[1] http://doc.astro-wise.org/locale.html

Jul 19 2004

Arcane Jill <Arcane_member pathlink.com> writes:

In article <cdgjdq$e82$1 digitaldaemon.com>, Juanjo Álvarez says...

Also the Python locale module[1] (sorry, I'm a pythonist :) could be a good
source of inspiration; it supports Unix, Windows and MAC style locales with
a bunch of useful functions (getlocale, getdefaultlocale, setlocale,
normalize, locate-aware atoi+atof+str+format+strcol, etc...). 

I'll take a look at it this weekend.

Be careful not to go too over-the-top here. I think that stuff like
locale-aware-atoi(), etc., should NOT be member functions of class Locale. I'll
explain my reasoning below. class Locale itself should be short, sweet and
simple - little more than the embodiment of those ISO codes, in fact. Locales
can identify a resource by being used as a map key, so you don't need tons of
other stuff built in.

The reason I say this is circularity, or bootstrapping, or simplicity, depending
on your point of view. To implement (say) locale-aware-atoi() would require
actual KNOWLEDGE of how to do that, for every locale. Now, we COULD pull all
that data from CLDR and implement it by hand, but it would be a lot of work.

Conceptually, it's simpler if we get the very basics up and running first, and
then overload functions such as strcoll() later. In fact, in this
/particular/ example (collation) we are most certainly better off leaving this
until later. I plan later to implement the Unicode Collation Algorithm, based on
the data in CLDR. That will end up as a function which takes a Locale as one of
its parameters, and whose behavior is controlled by that parameter. Same with
full casing. Where such an algorithm exists (along with all the data) it makes
sense to take advantage of it, but we are not in a position to do that yet,
because not enough of the basics are there.

But do take a look anyway. Look also at Java's class Locale. It basically does
nothing, except identify a locale. That's the kind of line I'm thinking along,
as it allows for unlimited expansion later without tying us down to anything.
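
A small sketch of that split: `Locale` only identifies, and locale-aware behaviour lives in free functions that take a `Locale` parameter. The two-language enum, the separator table and `localeAtoi` are made up for illustration; real data would come from CLDR.

```d
import std.array : replace;
import std.conv : to;
import std.stdio : writeln;

enum Language { en, de }

// Deliberately minimal: it identifies a locale and does nothing else.
struct Locale
{
    Language language;
}

// Hypothetical per-locale data: the digit-grouping separator.
string groupSeparator(Locale loc)
{
    final switch (loc.language)
    {
        case Language.en: return ",";
        case Language.de: return ".";
    }
}

// A free function whose behaviour is controlled by the Locale argument,
// instead of a member function on Locale.
long localeAtoi(string text, Locale loc)
{
    return text.replace(groupSeparator(loc), "").to!long;
}

void main()
{
    writeln(localeAtoi("1,234,567", Locale(Language.en)));
    writeln(localeAtoi("1.234.567", Locale(Language.de)));
}
```

New locale-aware functions can then be added later without touching `Locale` itself.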

Jill

Jul 19 2004

Hauke Duden <H.NS.Duden gmx.net> writes:

As far as I remember (I looked at gettext a few years ago) gettext 
has some serious drawbacks. The worst being that parameters that are 
inserted into the translated string have to be specified in "printf" 
formatting. That means that their order in the translated string must be 
the same as the order in the original text, which is not always possible 
and often awkward.

My memory is a little fuzzy about the specifics, so please correct me if 
I'm wrong.

Hauke


Jul 19 2004

Juanjo Álvarez <juanjuxNO SPAMyahoo.es> writes:

Hauke Duden wrote:

 As far as I remember (I looked at gettext a few of years ago) gettext
 has some serious drawbacks. The worst being that parameters that are
 inserted into the translated string have to be specified in "printf"
 formatting. That means that their order in the translated string must be
 the same as the order in the original text, which is not always possible
 and often awkward.
 
 My memory is a little fuzzy about the specifics, so please correct me if
 I'm wrong.
 
 Hauke

Mmm, not the GNU gettext, you can put:

printf(_("There are %d %s %s\n"), count, _(color), _(name));

And the output po file will be:

"There are %1$d %2$s %3$s"

So the translator can change the numbers, thus changing the word order.
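
A sketch of the reordering step in D, assuming a current Phobos in which `std.format` accepts POSIX-style positional specifiers such as `%1$d` (this says nothing about the 2004 `writef`). The two format strings stand in for catalog entries and are made up; no gettext lookup is performed here.

```d
import std.format : format;
import std.stdio : writeln;

void main()
{
    // The program always passes the arguments in the same order:
    // count, colour, name.
    // Each language's format string decides where they appear.
    string english = "There are %1$d %2$s %3$s";
    string spanish = "Hay %1$d %3$s %2$s"; // noun before adjective

    writeln(format(english, 3, "red", "cars"));
    writeln(format(spanish, 3, "rojos", "coches"));
}
```

Only the format string changes between languages; the call site and argument order stay the same.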

Jul 19 2004

Arcane Jill <Arcane_member pathlink.com> writes:

In article <cdgiqm$dua$1 digitaldaemon.com>, Juanjo Álvarez says...
"There are %1$d %2$s %3$s"

So translator can change the numbers thus changing the word order.

Is this a feature of printf()? If so, is it a Linux thing or an all-platform thing?
And (probably a silly question, but someone might know the answer) is this
functionality available in the new writef()?

Jul 19 2004

Sean Kelly <sean f4.ca> writes:

In article <cdgq1p$gqi$1 digitaldaemon.com>, Arcane Jill says...
In article <cdgiqm$dua$1 digitaldaemon.com>, Juanjo Álvarez says...
"There are %1$d %2$s %3$s"

So translator can change the numbers thus changing the word order.

Is this a feature of printf()? If so, is a Linux thing or an all-platform thing?
And (probably a silly question, but someone might know the answer) is this
functionality available in the new writef()?

It's not a feature of printf and AFAIK it's not in the new writef either.

Semi-related: I'm recoding my scanf implementation as unFormat (to match
doFormat) and changing the calling syntax to readf.  So with any luck there will
be both input and output routines written in D.

Sean

Jul 19 2004

Jonathan Leffler <jleffler earthlink.net> writes:

Arcane Jill wrote:

 In article <cdgiqm$dua$1 digitaldaemon.com>, Juanjo Álvarez says...
 
  "There are %1$d %2$s %3$s"

  So translator can change the numbers thus changing the word order.

 Is this a feature of printf()? If so, is a Linux thing or an all-platform thing?
 And (probably a silly question, but someone might know the answer) is this
 functionality available in the new writef()?

It depends on whose printf() you're looking at.
Standard C - no.  POSIX - yes.

See:

http://www.opengroup.org/onlinepubs/009695399/functions/fprintf.html

I discussed this once before in this news group, a few weeks after the 
thread had gone stale (mainly because I only just started to pay 
attention to D).  ...dig...dig...dig...Friday 9th July 2004...

http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/5662
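
For anyone who wants to try the positional form, here is a minimal C sketch. It assumes a libc that implements the POSIX numbered-argument extension (glibc does; a strictly Standard C library need not). The function name format_inventory and the sample strings are made up for illustration and are not part of gettext or any library.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Formats the inventory message with whatever format string the translator
 * supplied. The arguments are always passed in the same order; only the
 * numbered conversions (%1$d, %2$s, %3$s) decide where each one appears.
 * Returns the number of characters snprintf would have written.
 */
int format_inventory(char *buf, size_t size, const char *fmt,
                     int count, const char *color, const char *name)
{
    assert(buf != NULL && fmt != NULL);
    return snprintf(buf, size, fmt, count, color, name);
}
```

Calling it with "There are %1$d %2$s %3$s" keeps the English order, while a translated format such as "%3$s (%2$s): %1$d" puts the name first without any change to the calling code. Note that POSIX does not allow numbered and unnumbered conversions to be mixed in one format string.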


-- 
Jonathan Leffler                   #include <disclaimer.h>
Email: jleffler earthlink.net, jleffler us.ibm.com
Guardian of DBD::Informix v2003.04 -- http://dbi.perl.org/

Jul 19 2004

Arcane Jill <Arcane_member pathlink.com> writes:

In article <cdgiqm$dua$1 digitaldaemon.com>, Juanjo =?ISO-8859-15?Q?=C1lvarez?=
says...

Mmm, not the GNU gettext, you can put:

printf(_("There are %d %s %s\n"), count, _(color), _(name));

And the output po file will be:

"There are %1$d %2$s %3$s"

So translator can change the numbers thus changing the word order.

Well, that /sounds/ like the kind of thing we need, but your above example is a
little unclear to those of us who have not used gettext() before. As I read the
above, and assuming that _() is the text-localizing function, that wouldn't
change the word order. But you say it does, so I must have misunderstood
something. Can you break that down into steps?

Berin mentioned Java's MessageFormat class. This does the job of word order
switching. It's cumbersome to use in practice, but we could still borrow the
technique if we so needed.

We will certainly find a way to do word reordering in D. The question is where is
the right place for that? Does gettext do it? Should we petition Walter to get
writef() to do it? Would Hauke's string class be the right place?

We need more information....

Jill

Jul 19 2004

Hauke Duden <H.NS.Duden gmx.net> writes:

Arcane Jill wrote:
 In article <cdgiqm$dua$1 digitaldaemon.com>, Juanjo =?ISO-8859-15?Q?=C1lvarez?=
 says...
 
 
Mmm, not the GNU gettext, you can put:

printf(_("There are %d %s %s\n"), count, _(color), _(name));

And the output po file will be:

"There are %1$d %2$s %3$s"

So translator can change the numbers thus changing the word order.

 
 
 Well, that /sounds/ like the kind of thing we need, but your above example is a
 little unclear to those of us who have not used gettext() before. As I read the
 above, and assuming that _() is the text-localizing function, that wouldn't
 change the word order. But you say it does, so I must have misunderstood
 something. Can you break that down into steps?
 
 Berin mentioned Java's MessageFormat class. This does the job of word order
 switching. It's cumbersome to use in practice, but we could still borrow the
 technique if we so needed.

I've been using a pretty simple but effective technique for quite some 
time. The translatable string can contain place holders of the form 
%NAME% and the translation function can take a map parameter that 
inserts the correct values.

This also has the advantage of better documentation. It is pretty hard 
to deduce the intended meaning of a string like "There are %d %ss in the 
%s". It gets easier if you have something like "There are %NUM% %OBJ%s 
in the %CONTAINER%". Less room for error.
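
A rough C sketch of the %NAME% idea follows, under my own assumptions: the struct placeholder type and the expand_placeholders function are hypothetical names, the "map" is a plain array searched linearly, and a '%' that does not start a known placeholder is copied through unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One NAME -> value pair; an array of these stands in for the map. */
struct placeholder { const char *name; const char *value; };

/* Returns the value for the name of length len, or NULL when it is unknown. */
static const char *find_value(const struct placeholder *vars, size_t nvars,
                              const char *name, size_t len)
{
    for (size_t i = 0; i < nvars; i++)
        if (strlen(vars[i].name) == len && strncmp(vars[i].name, name, len) == 0)
            return vars[i].value;
    return NULL;
}

/* Appends n bytes of src at *pos, always leaving room for the final NUL. */
static void append(char *out, size_t outsz, size_t *pos, const char *src, size_t n)
{
    for (size_t i = 0; i < n && *pos + 1 < outsz; i++)
        out[(*pos)++] = src[i];
}

/* Copies tmpl to out, replacing each known %NAME% with its value. */
void expand_placeholders(const char *tmpl, const struct placeholder *vars,
                         size_t nvars, char *out, size_t outsz)
{
    assert(tmpl != NULL && out != NULL && outsz > 0);
    size_t pos = 0;
    while (*tmpl != '\0') {
        /* A candidate placeholder runs from this '%' to the next one. */
        const char *end = (*tmpl == '%') ? strchr(tmpl + 1, '%') : NULL;
        const char *value = end
            ? find_value(vars, nvars, tmpl + 1, (size_t)(end - tmpl - 1))
            : NULL;
        if (value != NULL) {
            append(out, outsz, &pos, value, strlen(value));
            tmpl = end + 1;
        } else {
            append(out, outsz, &pos, tmpl, 1);  /* ordinary character */
            tmpl++;
        }
    }
    out[pos] = '\0';
}
```

Because the values are looked up by name, a translator can move %CONTAINER% to the front of the sentence and the caller passes exactly the same array.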

I have also found that it can sometimes be helpful to be able to include 
some kind of comment for the translator that describes the intended use 
or any constraints of the string. For example "keep this as short as 
possible" or "context is file I/O". I implemented this by adding an 
optional parameter that can be passed to the translation function. It is 
ignored at runtime, but the "harvester" tool that extracts the strings 
from the code files includes it in the translatable files.

And last but not least, I think translatable strings should have an ID 
(a string ID, not a number). Not all strings that are the same in one 
language are the same in other languages. So if the translation is bound 
to the original text then you can have situations where you need to 
specify two different texts for two different contexts, but you are not 
able to do so, because the original text serves as ID/key.

A good example that I encountered a few years ago:
At the time I played the German version of the game Baldur's Gate. It 
contained some horrible text bugs that obviously originated from a 
translation system where the original text served as the ID.
One particular case was the text "XXX attacks YYY" that was displayed 
whenever one character attacked another. "attacks" in English can mean 
the plural of the noun "attack" or it can be a form of the verb "to 
attack". In this case it is the verb form. Unfortunately it was 
translated with the German plural of the noun, which is different from 
the verb (probably because it was also used in a different context where 
it meant the noun). So the translation made no sense at all.
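
To make the string-ID point concrete, here is a small C sketch. The IDs, the struct message type and the translate function are invented for illustration, and the German words are only my own example translations. Both entries share the English text "attacks", yet each can be translated separately because the ID is the key.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One catalog entry: a string ID plus the text in each supported language. */
struct message { const char *id; const char *en; const char *de; };

static const struct message CATALOG[] = {
    { "combat.attacks.verb", "attacks", "attackiert" },  /* "XXX attacks YYY" */
    { "stats.attacks.noun",  "attacks", "Angriffe"   },  /* plural of "attack" */
};

/* Looks the text up by ID; any language other than "de" gets the English text. */
const char *translate(const char *id, const char *lang)
{
    assert(id != NULL && lang != NULL);
    for (size_t i = 0; i < sizeof CATALOG / sizeof CATALOG[0]; i++)
        if (strcmp(CATALOG[i].id, id) == 0)
            return strcmp(lang, "de") == 0 ? CATALOG[i].de : CATALOG[i].en;
    return id;  /* unknown ID: fall back to the ID itself so the gap is visible */
}
```

With the original text as the key, the two entries would collapse into one and only one German form could be stored.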


Hauke

Jul 19 2004

J C Calvarese <jcc7 cox.net> writes:

Arcane Jill wrote:
 In article <cd9peo$oce$1 digitaldaemon.com>, J C Calvarese says...
 
 
There seem to be two schools of thought in the area of localization (of 
which language issues are a subset):

1. Compile-time generated using version() as you described in your post.

2. Run-time generated using some sort of a plugin architecture or 
language resource files (as Arcane Jill alludes to in her reply).

Since D is capable enough for either method, both parties can be happy.

Personally, I think I'd prefer to use compile-time localization, but 
that doesn't prevent others from designing run-time localization 
functions.

 
 
 Compile-time localization is not something I'd thought about before, so it's been
 kind of an interesting thing to think about. The thing is, though, we already
 _have_ compile-time localization. We've always had it. As Stewart said, you can
 do:
 












 
 And we've been able to do that in C ever since we learned how to use #ifdef. So
 it requires no new language features. It's already there. But, since this
 technique has been around for so long, you'd expect it to be widely used ...
 unless, that is, it turns out to be not very useful.
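
For reference, the #ifdef flavour of compile-time localization might look like the C sketch below. The macro names LOCALE_DE and LOCALE_FR and the function greet are assumptions of mine, not an established convention; the locale would be picked with something like -DLOCALE_DE on the compiler command line.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Exactly one format string is compiled in; English is the default. */
#if defined(LOCALE_DE)
#define GREETING_FMT "Hallo, %s!"
#elif defined(LOCALE_FR)
#define GREETING_FMT "Bonjour, %s !"
#else
#define GREETING_FMT "Hello, %s!"
#endif

/* Writes the greeting for name into buf and returns the stored length. */
size_t greet(char *buf, size_t size, const char *name)
{
    assert(buf != NULL && size > 0 && name != NULL);
    snprintf(buf, size, GREETING_FMT, name);
    return strlen(buf);
}
```

Only the selected string ends up in the binary, which is the saving being weighed against shipping one executable per locale.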

It's nothing revolutionary, but it's a start.

 
 There are a number of disadvantages I can think of. For a start, your
 locale-specific code will end up distributed throughout your source code,
 instead of all in one place. This could be a nightmare if you decide to support
 a new locale. Another problem is that, if you choose the locale at compile-time,
 then the end-user (as opposed to the developer) has to have the source code, OR
 an executable which was compiled especially for their locale. And that's just
 for executables. For libraries, the situation is even worse. A
 compile-time-localized executable would have to be linked with
 compile-time-localized libraries, compiled for the same locale. It would be a
 serious headache.
 
 Another problem is that someone might compile it for version(en_US), without
 realizing that they should have been using version(en).
 
 And all for what? To save a small amount of run-time overhead. Well, /how much/
 run-time overhead? In most cases, run-time-localization amounts to looking
 something up in a map. Is that bad? A matter of judgement, maybe, but I'd say it
 was insignificant compared to the overhead incurred by writing that localized
 string to printf() or a file. 

It could be a lot of run-time overhead. It could be a little. But 
ultimately, it should be left up to the individual programmer.

 
 Localization, to me, is the flip-side of internationalization (or i18n for lazy
 typists). The way it's traditionally done is you "internationalize" your code -
 a compile-time thing, and then "localize" it at run-time. Here's an example.

I didn't realize there was a difference between localization and 
internationalization. The OP was mostly concerned with "human languages", but 
I think that other issues such as date formats would naturally be 
discussed at the same time.

 Start with some normal, unlocalized code:
 

 
 Now, internationalize it. It will become something like this:
 

 
 Not much different really. The function local() would probably be something like
 this, and would get inlined:
 






 
 "Localizing" this program now consists only of initializing the localizedLookup
 map, which could happen in any number of ways.
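
The code from the quoted post did not survive here, so the following is not it. It is only a C sketch of the general shape being described: a local() function that looks the original text up in a table and falls back to the original when nothing is loaded. The names load_locale, struct entry and the FRENCH sample table are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One row of the lookup table: original text and its localized form. */
struct entry { const char *original; const char *localized; };

/* The currently loaded table; empty until load_locale() is called. */
static const struct entry *localized_lookup = NULL;
static size_t localized_count = 0;

/* "Localizing" the program amounts to pointing the lookup at a table. */
void load_locale(const struct entry *table, size_t count)
{
    localized_lookup = table;
    localized_count = count;
}

/* Returns the localized text, or the original when there is no translation. */
const char *local(const char *text)
{
    assert(text != NULL);
    for (size_t i = 0; i < localized_count; i++)
        if (strcmp(localized_lookup[i].original, text) == 0)
            return localized_lookup[i].localized;
    return text;
}

/* Sample data; in practice this might be read from a resource file. */
const struct entry FRENCH[] = {
    { "Hello, world", "Bonjour, le monde" },
    { "Goodbye",      "Au revoir" },
};
const size_t FRENCH_COUNT = sizeof FRENCH / sizeof FRENCH[0];
```

A linear search keeps the sketch short; a real implementation would more likely use a hash map, which is the "looking something up in a map" cost mentioned earlier.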
 
 My guess is that what would be most useful in terms of
 internationalization/localization would be some classes and functions to make
 stuff like the above easier, providing localized number formats and so on. Plus
 of course the D definition of a "locale" - we have to standardize that somehow.
 Me, I'd prefer enums to strings - saves all that messing about with case, for
 one thing, and string-splitting to get at the two (possibly three) parts.

Sure, Phobos should include some modules for run-time support (and maybe 
compile-time support, too).

 
 
I have a quick comment about the specifics of your ideas. I think the 
version identifiers should have a prefix (such as "lang_"). This should 
make it clear to most viewers of the code what's happening. Not everyone 
would intuitively know that ky_KG is a language feature, but lang_ky_KG 
is guessable.

 
 
 Strictly speaking it should be "locale_", not "lang_". ("ky" is a lang, or
 language; "KG" is a country. The combination "ky_KG" is a locale). Sorry for
 being a pedantic bugger here.

I don't mind nit-picking (I do a lot of it myself). I hereby retract 
"lang_" in favor of either "locale_" or "loc_".

 
 But the original question was, should there be "standard version identifiers"?
 Thing is - I don't see how there can be. A version identifier is just D's name
 for a #define, and there's nothing to stop anyone from using whatever they want as such.
 A note in the style guide might help, but even that won't force people to use
 said standard.

Right. I was thinking "convention" when I read "standard". I don't 
intend to compel anyone (and as you state, they can't really be 
compelled), but I think if the convention makes sense, many people would 
use it.

 
 Just my thoughts.
 
 Arcane Jill
 
 
 


-- 
Justin (a/k/a jcc7)
http://jcc_7.tripod.com/d/

Jul 18 2004

Arcane Jill <Arcane_member pathlink.com> writes:

In article <cdfgli$2vdr$1 digitaldaemon.com>, J C Calvarese says...
I didn't realize there was a difference between localization and 
internationalization.

I found a good explanation about this when looking up gettext on the web. Have a
look at http://www.gnu.org/software/gettext/manual/html_mono/gettext.html#SEC3.

To quote its summary of itself: "Also, very roughly said, when it comes to
multi-lingual messages, internationalization is usually taken care of by
programmers, and localization is usually taken care of by translators."

I consider myself a good programmer, but I'd make a lousy translator. I'd want
to leave that job to someone else.

Arcane Jill

Jul 19 2004

Stewart Gordon <smjg_1998 yahoo.com> writes:

Arcane Jill wrote:

<snip>
 There are a number of disadvantages I can think of. For a start, your
 locale-specific code will end up distributed throughout your source code,
 instead of all in one place.

Only if you choose to do it that way.  You can just as well have one 
version block per module, with the locale-specific data and code in 
them, and have the rest of the module use stuff in here.

 This could be a nightmare if you decide to support a new locale.  
 Another problem is that, if you choose the locale at compile-time, 
 then the end-user (as opposed to the developer) has to have the 
 source code, OR an executable which was compiled especially for their 
 locale.

I believe it's quite common to offer separate downloadable versions in 
each language.  That way, a unilingual end-user isn't faced with the 
bloat of a multilingual UI or the overhead of compiling it, and you can 
choose whether to release the source or not.

 And that's just for executables. For libraries, the situation is even worse. A
 compile-time-localized executable would have to be linked with
 compile-time-localized libraries, compiled for the same locale. It would be a
 serious headache.

To me it would seem straightforward to build a copy of the lib, and give 
it an identifying name, for each language that your app supports.

<snip>
 Strictly speaking it should be "locale_", not "lang_". ("ky" is a lang, or
 language; "KG" is a country. The combination "ky_KG" is a locale). Sorry for
 being a pedantic bugger here.

<snip>

ISO language codes are language-dialect pairs.  So en-GB is British 
English, es-MX is Mexican Spanish.  AIUI they don't cover other aspects 
of locale, such as time zones, date formats and the like.  They tend to 
be managed by the OS - it would seem pointless to try and write apps to 
override this.

Stewart.

-- 
My e-mail is valid but not my primary mailbox, aside from its being the 
unfortunate victim of intensive mail-bombing at the moment.  Please keep 
replies on the 'group where everyone may benefit.

Jul 19 2004

Arcane Jill <Arcane_member pathlink.com> writes:

In article <cdg8ab$98s$1 digitaldaemon.com>, Stewart Gordon says...

ISO language codes are language-dialect pairs.  So en-GB is British 
English, es-MX is Mexican Spanish.

I know.


AIUI they don't cover other aspects 
of locale, such as time zones, date formats and the like.

In a sense, they don't cover ANYTHING. They are just tuples of
language/country/variant tags. However, if you think of these as map keys, you
can turn them into anything else quite straightforwardly.
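
As a rough illustration of treating a locale as a tuple, here is a C sketch that splits an identifier such as "ky_KG" or "en-GB" into its parts so that they can serve as lookup keys. The struct locale_tag type, the parse_locale function and the 8-character limit per part are assumptions made for this sketch, not taken from any standard.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TAG_MAX 9  /* up to 8 characters per part, plus the terminating NUL */

/* A locale as a (language, country, variant) tuple; unused parts are "". */
struct locale_tag {
    char language[TAG_MAX];
    char country[TAG_MAX];
    char variant[TAG_MAX];
};

/* Splits text on '_' or '-'. Returns 0 on success, -1 on malformed input. */
int parse_locale(const char *text, struct locale_tag *out)
{
    assert(text != NULL && out != NULL);
    char *fields[3] = { out->language, out->country, out->variant };
    size_t field = 0, len = 0;

    memset(out, 0, sizeof *out);
    for (; *text != '\0'; text++) {
        if (*text == '_' || *text == '-') {
            if (++field == 3)
                return -1;              /* more than three parts */
            len = 0;
        } else {
            if (len + 1 >= TAG_MAX)
                return -1;              /* part too long */
            fields[field][len++] = *text;
        }
    }
    return out->language[0] != '\0' ? 0 : -1;
}
```

Once parsed, the language part alone can key translated text, while the full tuple can key things like date and number formats.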


They tend to be managed by the OS

That would be nice, but collation, number formats, date formats,  etc. (to give
just a few examples) are not handled very well at all by any OS of which I am
aware. 


it would seem pointless to try and write apps to 
override this.

But not pointless for a library. The CLDR, which is maintained by the Unicode
Consortium, contains just about every fragment of information you could possibly
imagine wanting (short of actual language translation). Its data files are in
XML - actually a custom format called LDML (Locale Data Markup Language). It
absolutely DOES include such information as time zones, currencies, number
formats, and so on. It's a resource we would be foolish to ignore, and since
it's XML, it can be robot-parsed far, far more easily than the Unicode database.
I will most certainly be using /some/ of the CLDR data for the Unicode collation
algorithm.

If you want to write several monolingual applications from the same source,
no-one is going to stop you. Go ahead and do it, and use whatever version
identifiers you want. There's room in this world (and indeed in D) for BOTH
compile-time language selection AND true internationalization/localization, so
I guess we can all be happy.

Arcane Jill

Jul 19 2004

Stewart Gordon <smjg_1998 yahoo.com> writes:

Arcane Jill wrote:

<snip>
 Another problem is that someone might compile it for version(en_US), without
 realizing that they should have been using version(en).

<snip>

Yes, as I said in my original post:
 It would be necessary either for the compiler to automatically 
 set en if en_GB or en_US or en_anything is set, and similarly for 
 other language codes, or to persuade all D users to do this. Of 
 course, this would be done in the aforementioned default language 
 setting.
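
In C terms, the "en is set whenever en_GB or en_US is set" rule could be approximated in a shared header, as in the sketch below. The LOCALE_* macro names and the three functions are hypothetical, and a D compiler would of course need its own mechanism for version identifiers.

```c
#include <assert.h>
#include <string.h>

/* Normally chosen on the command line (e.g. cc -DLOCALE_en_US ...).
   A default is supplied here only so the sketch compiles on its own. */
#if !defined(LOCALE_en_GB) && !defined(LOCALE_en_US) && !defined(LOCALE_fr_FR)
#define LOCALE_en_GB 1
#endif

/* Any regional variant implies the bare language identifier. */
#if defined(LOCALE_en_GB) || defined(LOCALE_en_US)
#define LOCALE_en 1
#endif
#if defined(LOCALE_fr_FR)
#define LOCALE_fr 1
#endif

/* Code that only cares about the language tests the derived identifier. */
const char *word_for_yes(void)
{
#if defined(LOCALE_en)
    return "yes";
#elif defined(LOCALE_fr)
    return "oui";
#else
#error "no supported locale selected"
#endif
}

/* Code that cares about the region tests the specific identifier first. */
const char *word_for_colour(void)
{
#if defined(LOCALE_en_US)
    return "color";
#elif defined(LOCALE_en)
    return "colour";
#else
    return "couleur";
#endif
}

/* Returns 1 when word is spelled the way the selected locale expects. */
int is_local_spelling(const char *word)
{
    assert(word != NULL);
    return strcmp(word, word_for_colour()) == 0;
}
```

This way someone who builds for en_US still gets everything written for plain en, which addresses the en_US versus en worry quoted above.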



Stewart.

-- 
My e-mail is valid but not my primary mailbox, aside from its being the 
unfortunate victim of intensive mail-bombing at the moment.  Please keep 
replies on the 'group where everyone may benefit.

Jul 19 2004
