digitalmars.D - [Suggestion] Standard version identifiers for language
- Stewart Gordon (73/73) Jul 16 2004 Suppose one wants to write an application with versions in different
- Arcane Jill (8/9) Jul 16 2004 Can I respond to this next week?
- J C Calvarese (29/57) Jul 16 2004 There seems to be 2 schools of thought in the area of localization (of
- Arcane Jill (71/85) Jul 18 2004 Compile-time localization not something I'd thought about before, so it'...
- Juanjo Álvarez (6/6) Jul 18 2004 What about just porting GNU gettext to phobos? This way you have a
- Arcane Jill (31/34) Jul 19 2004 I'm not sure. I'll admit I don't know much about gettext, so perhaps you...
- Juanjo Álvarez (32/57) Jul 19 2004 First things first, you can have all the documentation about GNU gettext
- Thomas Kuehne (5/7) Jul 19 2004 Just to point out some other - not necessarily better - localization libs...
- Juanjo Álvarez (5/14) Jul 19 2004 I don't know about the Java implementation but Qt tr() is very similar t...
- Berin Loritsch (11/32) Jul 19 2004 With the Java MessageFormat solution, things work fairly well. Consider...
- Arcane Jill (17/25) Jul 19 2004 Wow!
- Juanjo Álvarez (10/22) Jul 19 2004 I vote for that too.
- Arcane Jill (25/30) Jul 19 2004 Be careful not to go too over-the-top here. I think that stuff like
- Hauke Duden (10/58) Jul 19 2004 As far as I remember (I looked at gettext a few of years ago) gettext
- Juanjo Álvarez (6/17) Jul 19 2004 Mmm, not the GNU gettext, you can put:
- Arcane Jill (5/7) Jul 19 2004 Is this a feature of printf()? If so, is a Linux thing or an all-platfor...
- Sean Kelly (6/15) Jul 19 2004 It's not a feature of printf and AFAIK it's not in the new writef either...
- Jonathan Leffler (13/23) Jul 19 2004 It depends on whose printf() you're looking at.
- Arcane Jill (15/20) Jul 19 2004 Well, that /sounds/ like the kind of thing we need, but your above examp...
- Hauke Duden (34/58) Jul 19 2004 I've been using a pretty simple but effective technique for quite some
- J C Calvarese (19/135) Jul 18 2004 I could be a lot of run-time overhead. It could be a little. But
- Arcane Jill (9/11) Jul 19 2004 I found a good explanation about this when looking up gettext on the web...
- Stewart Gordon (23/38) Jul 19 2004 Only if you choose to do it that way. You can just as well have one
- Arcane Jill (23/30) Jul 19 2004 In a sense, they don't cover ANYTHING. They are just tuples of
- Stewart Gordon (9/16) Jul 19 2004
Suppose one wants to write an application with versions in different languages. That's human languages, not programming languages.

At face value, that's simple - use version blocks to hold the various languages' UI text. Or for Windows, define a separate resource file for each language. (Or lists of string macros to be imported by one resource file.)

But what if you want to use one or more libraries from various sources, which also may have language versions? Then you'd have to set all the version identifiers that the different library designers have chosen for your choice of language, which could lead to quite long command lines.

It would be simpler if there could be a standard system of language identifiers for everyone to follow. A system based on ISO 639-1 would probably be good. One could then write

----------
version (en) {
    const char[][] DAY = [ "Sunday", "Monday", "Tuesday", "Wednesday",
        "Thursday", "Friday", "Saturday" ];
} else version (fr) {
    const char[][] DAY = [ "dimanche", "lundi", "mardi", "mercredi",
        "jeudi", "vendredi", "samedi" ];
} else {
    static assert (false);
}
----------

There are a few matters of debate to be considered:

1. Should we create a prefixed language namespace? Or will these codes by themselves do?

2. How really should a lib be written to deal with unsupported languages, or if no language has been specified? Two possibilities I can see:
(a) the lib programmer would put his/her own language (or maybe the one predicted to be most popular) as the default.
(b) a static assert as above, effectively telling the app programmer "please set a language, or create a version block in me for your language".
Maybe a future D compiler could be configured to use a certain language version as the default if none is specified on the command line.

3. Should we really have them as version identifiers? Or invent a new CC block called 'language' that would have the specifics of language designation built in, a corresponding command line option and a corresponding compiler configuration setting?

Dialects of a language could be indicated by replacing the hyphen in the ISO code with an underscore. Libs would then have something like

----------
version (en_GB) {
    ...
} else version (en_US) {
    ...
} else version (en) {
    ...
}
----------

It would be necessary either for the compiler to automatically set en if en_GB or en_US or en_anything is set, and similarly for other language codes, or to persuade all D users to do this. Of course, this would be done in the aforementioned default language setting.

This would give lib programmers a choice of writing for each dialect of each language, just covering the basic languages, or a mixture. A default fallback for an unsupported dialect would, I guess, typically be some 'default' dialect if that makes sense.

This provides for compile-time localisation. Of course, some might want run-time l10n, in which case the app would be explicitly programmed to do this.

In writing a lib one might choose to support RTL. In which case the version/language blocks would be used to select the default language, which would make it usable for monolingual apps, CTL apps and RTL apps alike.

Of course, one could argue that there should be some global variable in Phobos or somewhere for run-time language....

Stewart.

--
My e-mail is valid but not my primary mailbox, aside from its being the unfortunate victim of intensive mail-bombing at the moment. Please keep replies on the 'group where everyone may benefit.
Jul 16 2004
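The dialect fallback described in the post above can be approximated with today's `version` blocks, without compiler support. The following is only a minimal sketch: the constant `colourLabel` is a hypothetical name chosen for illustration, and the English default corresponds to option (a) in the post rather than the `static assert` of option (b).

```d
import std.stdio : writeln;

// Map dialect identifiers onto their base language. These assignments sit at
// module scope, before `en` is tested, and only affect this module.
version (en_GB) version = en;
version (en_US) version = en;

// Most specific match first, then the base language, then a default.
version (en_GB)
    enum string colourLabel = "Colour";
else version (en)
    enum string colourLabel = "Color";
else version (fr)
    enum string colourLabel = "Couleur";
else
    enum string colourLabel = "Color"; // option (a): library author's default

void main()
{
    writeln(colourLabel);
}
```

Building with something like `dmd -version=en_GB app.d` selects the British string. Because a `version = en;` assignment is local to the module it appears in, every library would have to repeat the mapping, which is the coordination problem the post raises.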
In article <cd8rji$ar0$1 digitaldaemon.com>, Stewart Gordon says...
<some interesting ideas>

Can I respond to this next week?

Short answer - this is a run-time problem, not a compile-time problem. A third party should (IMO) be able to add additional human languages given access only to the executable binary and some ini files (or something similar).

Long answer - can you wait? I've got lots of ideas about this, but I'm really not up to debate just yet.

Arcane Jill
Jul 16 2004
Stewart Gordon wrote:
Suppose one wants to write an application with versions in different languages. That's human languages, not programming languages. At face value, that's simple - use version blocks to hold the various languages' UI text. Or for Windows, define a separate resource file for each language. (Or lists of string macros to be imported by one resource file.) But what if you want to use one or more libraries from various sources, which also may have language versions? Then you'd have to set all the version identifiers that the different library designers have chosen for your choice of language, which could lead to quite long command lines. It would be simpler if there could be a standard system of language identifiers for everyone to follow. A system based on ISO 639-1 would probably be good. One could then write

----------
version (en) {
    const char[][] DAY = [ "Sunday", "Monday", "Tuesday", "Wednesday",
        "Thursday", "Friday", "Saturday" ];
} else version (fr) {
    const char[][] DAY = [ "dimanche", "lundi", "mardi", "mercredi",
        "jeudi", "vendredi", "samedi" ];
} else {
    static assert (false);
}
----------

There seem to be two schools of thought in the area of localization (of which language issues are a subset):

1. Compile-time generated using version() as you described in your post.
2. Run-time generated using some sort of a plugin architecture or language resource files (as Arcane Jill alludes to in her reply).

Since D is capable enough for either method, both parties can be happy. Personally, I think I'd prefer to use compile-time localization, but that doesn't prevent others from designing run-time localization functions.

I have a quick comment about the specifics of your ideas. I think the version identifiers should have a prefix (such as "lang_"). This should make it clear to most viewers of the code what's happening. Not everyone would intuitively know that ky_KG is a language feature, but lang_ky_KG is guessable.

version (lang_en_GB) {
    ...
} else version (lang_en_US) {
    ...
} else version (lang_en) {
    ...
}

Since I don't have any real experience with localization, I'd love to hear some opinions from those who have actually worked with localization. Which programming languages make localization easy? Which libraries are helpful?

--
Justin (a/k/a jcc7)
http://jcc_7.tripod.com/d/
Jul 16 2004
In article <cd9peo$oce$1 digitaldaemon.com>, J C Calvarese says...
There seems to be 2 schools of thought in the area of localization (of which language issues are a subset): 1. Compile-time generated using version() as you described in your post. 2. Runtime-time generated using some sort of a plugin architecture or language resource files (as Arcane Jill alludes to in her reply). Since D is capable enough for either method, both parties can be happy. Personally, I think I'd prefer to use compile-time localization, but that doesn't prevent others from designing runtime-time localization functions.

Compile-time localization is not something I'd thought about before, so it's been kind of an interesting thing to think about.

The thing is, though, we already _have_ compile-time localization. We've always had it. As Stewart said, you can do:

And we've been able to do that in C ever since we learned how to use #ifdef. So it requires no new language features. It's already there.

But, since this technique has been around for so long, you'd expect it to be widely used ... unless, that is, it turns out to be not very useful. There are a number of disadvantages I can think of. For a start, your locale-specific code will end up distributed throughout your source code, instead of all in one place. This could be a nightmare if you decide to support a new locale.

Another problem is that, if you choose the locale at compile-time, then the end-user (as opposed to the developer) has to have the source code, OR an executable which was compiled especially for their locale. And that's just for executables. For libraries, the situation is even worse. A compile-time-localized executable would have to be linked with compile-time-localized libraries, compiled for the same locale. It would be a serious headache.

Another problem is that someone might compile it for version(en_US), without realizing that they should have been using version(en).

And all for what? To save a small amount of run-time overhead. Well, /how much/ run-time overhead? In most cases, run-time-localization amounts to looking something up in a map. Is that bad? A matter of judgement, maybe, but I'd say it was insignificant compared to the overhead incurred by writing that localized string to printf() or a file.

Localization, to me, is the flip-side of internationalization (or i18n for lazy typists). The way it's traditionally done is you "internationalize" your code - a compile-time thing, and then "localize" it at run-time. Here's an example. Start with some normal, unlocalized code:

Now, internationalize it. It will become something like this:

Not much different really. The function local() would probably be something like this, and would get inlined:

"Localizing" this program now consists only of initializing the localizedLookup map, which could happen in any number of ways.

My guess is that what would be most useful in terms of internationalization/localization would be some classes and functions to make stuff like the above easier, providing localized number formats and so on. Plus of course the D definition of a "locale" - we have to standardize that somehow. Me, I'd prefer enums to strings - saves all that messing about with case, for one thing, and string-splitting to get at the two (possibly three) parts.

I have a quick comment about the specifics of your ideas. I think the version identifiers should have a prefix (such as "lang_"). This should make it clear to most viewers of the code what's happening. Not everyone would intuitively know that ky_KG is a language feature, but lang_ky_KG is guessable.

Strictly speaking it should be "locale_", not "lang_". ("ky" is a lang, or language; "KG" is a country. The combination "ky_KG" is a locale). Sorry for being a pedantic bugger here.

But the original question was, should there be "standard version identifiers"? Thing is - I don't see how there can be. A version identifier is just D's name for a #define, and there's nothing to stop anyone from using whatever they want as such. A note in the style guide might help, but even that won't force people to use said standard.

Just my thoughts.

Arcane Jill
Jul 18 2004
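The run-time approach in the post above (a `local()` function backed by a `localizedLookup` map) might look roughly like the following in present-day D. This is a sketch under assumptions, not the code from the original post: only the two names come from the post, while the associative-array representation and the fallback to the untranslated string are illustrative choices.

```d
import std.stdio : writeln;

// Maps source-language strings to their translations. How this gets filled
// (ini file, resource file, plugin) is left open, as in the post.
string[string] localizedLookup;

// Returns the translation if one is loaded, otherwise the original string.
string local(string s)
{
    if (auto translated = s in localizedLookup)
        return *translated;
    return s;
}

void main()
{
    // "Localizing" the program is just a matter of populating the map.
    localizedLookup["Hello, world"] = "Bonjour, le monde";

    writeln(local("Hello, world")); // translated
    writeln(local("Goodbye"));      // no entry, falls back to the original
}
```

Falling back to the original string means an incomplete translation degrades to the source language instead of failing, which fits the goal of letting third parties add languages without touching the binary.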
What about just porting GNU gettext to phobos? This way you have a semi-standard way of localizing programs (which a lot of translators know about), and a set of pre-written tools (even nice GUI ones).

Looking at the Python implementation it should not be difficult; the Python implementation is only 493 lines (gettext.py).

I'll see if I can take enough time to do it in the next few weeks.
Jul 18 2004
In article <cdeva5$2p2o$1 digitaldaemon.com>, Juanjo Álvarez says...
What about just porting GNU gettext to phobos? This way you have a semi-standard way of localizing programs (which a lot of translators know about), and a set of pre-written tools (even nice GUI ones).

I'm not sure. I'll admit I don't know much about gettext, so perhaps you could clear a few things up for me (and others)?

What is the encoding of a .po or .mo file? Specifically, is it UTF-8? (Or CAN it be UTF-8?). Put another way - how good is its Unicode support?

I guess I have to say I'd be disappointed if we had to rely on yet another C library. Maybe I'm just completely mad, but I'd prefer a pure D solution. (That said, I have no objection to using files of the same format). gettext looks like it does some cool stuff, but it's ... well ... C. It's not OO, unless I've misunderstood. It doesn't use exceptions.

Moreover, it assumes the Linux meaning of "locale", which is (again, in my opinion) not right for D. The way I see it, D should define locales exclusively in terms of ISO language and country codes, plus variant extensions. Unicode defines locales that way, and the etc.unicode library will have no choice but to use the ISO codes. Collation and stuff like that will need to rely on data from the CLDR (Common Locale Data Repository - see http://www.unicode.org/cldr/).

I suppose my gut feeling is that internationalization /isn't that hard/, so it ought to be a relatively simple task to come up with a native-D solution. gettext seems to do string translation only (again, correct me if I'm wrong), which is only a small part of internationalization/localization.

So, I guess, on balance, I'd vote against this one, at least pending some persuasive argument. That said, I'm way too busy to volunteer for any work (plus I'm still taking a bit of time off from coding for personal reasons) - although I /do/ intend to tackle Unicode localization quite soon.

Does that help? Probably not, I guess. Ah well. Tell you what, let's start an open discussion. (I've changed the thread title). I think we should hear lots of opinions before anyone actually DOES anything. A wrong early decision here could hamper D's potential future as /the/ language for internationalization (which I'd like it to become).

Arcane Jill
Jul 19 2004
First things first, you can have all the documentation about GNU gettext here:
http://www.gnu.org/software/gettext/manual/html_chapter/gettext_toc.html

Arcane Jill wrote:
I'm not sure. I'll admit I don't know much about gettext, so perhaps you could clear a few things up for me (and others)?

Let's try.

What is the encoding of a .po or .mo file? Specifically, is it UTF-8? (Or CAN it be UTF-8?). Put another way - how good is its Unicode support?

The encoding of the file is declared in the header of the po files; so it can be (I think) anything, for example:

"Project-Id-Version: animail\n"
"POT-Creation-Date: 2003-12-07 02:02+0100\n"
"PO-Revision-Date: 2004-07-08 20:19+0200\n"
"Last-Translator: XXX XXX <XXX XXX.de>\n"
"Language-Team: Deutsch <de li.org>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
                                   ^^^^^^^
"Content-Transfer-Encoding: 8bit\n"
                            ^^^^^^^
"X-Generator: KBabel 1.3.1\n"

I guess I have to say I'd be disappointed if we had to rely on yet another C library.

We don't; the Python implementation (the one of about 500 lines) doesn't use any external C lib at all; it's 100% pure Python.

gettext looks like it does some cool stuff, but it's ... well ... C.

I repeat, it doesn't have to be C. gettext is more like a set of tools and formats than a library (although the library exists, of course, but not only for C but for a lot of languages.)

It's not OO.

It can be done as OO.

unless I've misunderstood. It doesn't use exceptions.

Our implementation could.

Moreover, it assumes the Linux meaning of "locale", which is (again, in my opinion) not right for D. The way I see it, D should define locales exclusively in terms of ISO language and country codes, plus variant extensions. Unicode defines locales that way, and the etc.unicode library will have no choice but to use the ISO codes. Collation and stuff like that will need to rely on data from the CLDR (Common Locale Data Repository - see http://www.unicode.org/cldr/).

I don't know the answer to this one; gettext seems to use ISO 3166 country codes and ISO 639 language codes.

I suppose my gut feeling is that internationalization /isn't that hard/, so it ought to be a relatively simple task to come up with a native-D solution.

gettext seems to do string translation only (again, correct me if I'm wrong), which is only a small part of internationalization/localization.

That's true. It also handles plural forms, which is not so simple (http://www.gnu.org/software/gettext/manual/html_chapter/gettext_10.html#SEC150)

Anyway, if we all can discuss the matter and come up with a better solution than gettext (which I'm sure is possible) I doubt many will be opposed.
Jul 19 2004
Juanjo Álvarez:
Anyway if we all can discuss the matter and come with a better solution than gettext (which I'm sure it's possible) I doubt many will be opposed.

Just to point out some other - not necessarily better - localization libs:
qt/kde: http://doc.trolltech.com/3.0/linguist-manual.html
java: http://java.sun.com/j2se/1.5.0/docs/api/java/util/Formattable.html
Jul 19 2004
Thomas Kuehne wrote:
Juanjo Álvarez:
Anyway if we all can discuss the matter and come with a better solution than gettext (which I'm sure it's possible) I doubt many will be opposed.
Just to point out some other - not necessarily better - localization libs:
qt/kde: http://doc.trolltech.com/3.0/linguist-manual.html
java: http://java.sun.com/j2se/1.5.0/docs/api/java/util/Formattable.html

I don't know about the Java implementation but Qt tr() is very similar to gettext (the format of the translation files is different.) I don't know if KDE uses gettext internally but they use po/mo files just like gettext (and with the same format.)
Jul 19 2004
Juanjo Álvarez wrote:
Thomas Kuehne wrote:
Juanjo Álvarez:
Anyway if we all can discuss the matter and come with a better solution than gettext (which I'm sure it's possible) I doubt many will be opposed.
Just to point out some other - not necessarily better - localization libs:
qt/kde: http://doc.trolltech.com/3.0/linguist-manual.html
java: http://java.sun.com/j2se/1.5.0/docs/api/java/util/Formattable.html
I don't know about the Java implementation but Qt tr() is very similar to gettext (the format of the translation files is different.) I don't know if KDE uses gettext internally but they use po/mo files just like gettext (and with the same format.)

With the Java MessageFormat solution, things work fairly well. Consider for instance:

MessageFormat.format("There was a problem in {0}, where {2} parse errors encountered at line {1}", location, lineNum, numErrs);

There is a way to map which argument goes to which location in the line. Also, the MessageFormat does have a shorthand for kind of an if:then construct so that the same message would be interpreted differently for plurals/etc. That makes it convenient to handle those issues in I18N.

However, I would not say that the MessageFormat is super easy to use. It could have a better interface, but the concepts are pretty decent.
Jul 19 2004
In article <cdg2mr$7dn$1 digitaldaemon.com>, Juanjo Álvarez says...
The encoding of the file is declared in the header of the po files; so it can be (I think) anything, for example:
the Python implementation (the one of about 500 lines) doesn't use any external C lib at all; it's 100% pure Python.
It can be done as OO.
Our implementation could [use exceptions].
gettext seems to use ISO 3166 country codes and ISO 639 language codes.

Wow! Well, you've quashed all of my objections then. I'll change my vote then. Looks like a D implementation of gettext is the way to go.

Just one last thing though - we do need a D definition of a locale. In effect, we need a class Locale (or possibly a struct Locale) containing those ISO codes. Java uses strings internally (I _think_), but there are a whole bunch of reasons why that's not such a good idea - such as the fact that "fr", "fra" and "fre" are all, equivalently, the language code for French, and should all compare as equal; such as case and other punctuation concerns ("en-us" == "en-US" == "en_us" == "en_US", etc.). I'd vote for putting enums inside the class (enum Language and enum Country - the variant field will still need to be a string).

I imagine that the gettext implementation will need to use our yet-to-be-invented Locale class, and the unicode lib certainly will (and soon). Any thoughts? Class or struct? Strings or enums? Something else?

Arcane Jill
Jul 19 2004
Arcane Jill wrote:
Just one last thing though - we do need a D definition of a locale. In effect, we need a class Locale (or possibly a struct Locale) containing those ISO codes.

Yes, definitely.

the language code for French, and should all compare as equal; such as case and other punctuation concerns ("en-us" == "en-US" == "en_us" == "en_US", etc.). I'd vote for putting enums inside the class (enum Language and enum Country - the variant field will still need to be a string).

I vote for that too.

I imagine that the gettext implementation will need to use our yet-to-be-invented Locale class, and the unicode lib certainly will (and soon). Any thoughts? Class or struct? Strings or enums? Something else?

Class + enums, IMHO :)

Also the Python locale module[1] (sorry, I'm a pythonist :) could be a good source of inspiration; it supports Unix, Windows and Mac style locales with a bunch of useful functions (getlocale, getdefaultlocale, setlocale, normalize, locale-aware atoi+atof+str+format+strcoll, etc...). I'll take a look at it this weekend.

[1] http://doc.astro-wise.org/locale.html
Jul 19 2004
In article <cdgjdq$e82$1 digitaldaemon.com>, Juanjo Álvarez says...
Also the Python locale module[1] (sorry, I'm a pythonist :) could be a good source of inspiration; it supports Unix, Windows and Mac style locales with a bunch of useful functions (getlocale, getdefaultlocale, setlocale, normalize, locale-aware atoi+atof+str+format+strcoll, etc...). I'll take a look at it this weekend.

Be careful not to go too over-the-top here. I think that stuff like locale-aware-atoi(), etc., should NOT be member functions of class Locale. I'll explain my reasoning below.

class Locale itself should be short, sweet and simple - little more than the embodiment of those ISO codes, in fact. Locales can identify a resource by being used as a map key, so you don't need tons of other stuff built in.

The reason I say this is circularity, or bootstrapping, or simplicity, depending on your point of view. To implement (say) locale-aware-atoi() would require actual KNOWLEDGE of how to do that, for every locale. Now, we COULD pull all that data from CLDR and implement it by hand, but it would be a lot of work. Conceptually, it's simpler if we get the very basics up and running first, and then overload functions such as strcoll() later.

In fact, in this /particular/ example (collation) we are most certainly better off leaving this until later. I plan later to implement the Unicode Collation Algorithm, based on the data in CLDR. That will end up as a function which takes a Locale as one of its parameters, and whose behavior is controlled by that parameter. Same with full casing. Where such an algorithm exists (along with all the data) it makes sense to take advantage of it, but we are not in a position to do that yet, because not enough of the basics are there.

But do take a look anyway. Look also at Java's class Locale. It basically does nothing, except identify a locale. That's the kind of line I'm thinking along, as it allows for unlimited expansion later without tying us down to anything.

Jill
Jul 19 2004
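The "short, sweet and simple" Locale discussed above could be sketched as a plain value type with enums for the language and country and a string for the variant. Everything named here is hypothetical: `Language`, `Country`, `Locale` and `parseLocale` are illustrative, the enum members are a tiny sample rather than the full ISO lists, and the equivalence of "fr", "fra" and "fre" is not handled.

```d
import std.array : replace, split;
import std.conv : to;
import std.stdio : writeln;
import std.uni : toLower, toUpper;

enum Language { en, fr, ky }           // sample of ISO 639-1 codes
enum Country { none, GB, US, FR, KG }  // sample of ISO 3166 codes

// Little more than the embodiment of the ISO codes, as the post suggests.
struct Locale
{
    Language language;
    Country country = Country.none;
    string variant;
}

// Accepts "en", "en-us", "en_US", "en_US_variant" and so on. Unknown codes
// make std.conv.to throw a ConvException.
Locale parseLocale(string tag)
{
    auto parts = tag.replace("-", "_").split("_");
    if (parts.length == 0 || parts[0].length == 0)
        throw new Exception("empty locale tag");

    Locale loc;
    loc.language = parts[0].toLower.to!Language;
    if (parts.length > 1)
        loc.country = parts[1].toUpper.to!Country;
    if (parts.length > 2)
        loc.variant = parts[2];
    return loc;
}

void main()
{
    // Case and separator differences disappear once the tag is parsed.
    writeln(parseLocale("en-us") == parseLocale("en_US"));
}
```

Because the struct carries no behaviour, functions such as collation can later take a `Locale` parameter, which is the expansion path the post argues for.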
As far as I remember (I looked at gettext a few years ago) gettext has some serious drawbacks. The worst being that parameters that are inserted into the translated string have to be specified in "printf" formatting. That means that their order in the translated string must be the same as the order in the original text, which is not always possible and often awkward.

My memory is a little fuzzy about the specifics, so please correct me if I'm wrong.

Hauke

Arcane Jill wrote:
In article <cdeva5$2p2o$1 digitaldaemon.com>, Juanjo Álvarez says...
What about just porting GNU gettext to phobos? This way you have a semi-standard way of localizing programs (which a lot of translators know about), and a set of pre-written tools (even nice GUI ones).

I'm not sure. I'll admit I don't know much about gettext, so perhaps you could clear a few things up for me (and others)?

What is the encoding of a .po or .mo file? Specifically, is it UTF-8? (Or CAN it be UTF-8?). Put another way - how good is its Unicode support?

I guess I have to say I'd be disappointed if we had to rely on yet another C library. Maybe I'm just completely mad, but I'd prefer a pure D solution. (That said, I have no objection to using files of the same format). gettext looks like it does some cool stuff, but it's ... well ... C. It's not OO, unless I've misunderstood. It doesn't use exceptions.

Moreover, it assumes the Linux meaning of "locale", which is (again, in my opinion) not right for D. The way I see it, D should define locales exclusively in terms of ISO language and country codes, plus variant extensions. Unicode defines locales that way, and the etc.unicode library will have no choice but to use the ISO codes. Collation and stuff like that will need to rely on data from the CLDR (Common Locale Data Repository - see http://www.unicode.org/cldr/).

I suppose my gut feeling is that internationalization /isn't that hard/, so it ought to be a relatively simple task to come up with a native-D solution. gettext seems to do string translation only (again, correct me if I'm wrong), which is only a small part of internationalization/localization.

So, I guess, on balance, I'd vote against this one, at least pending some persuasive argument. That said, I'm way too busy to volunteer for any work (plus I'm still taking a bit of time off from coding for personal reasons) - although I /do/ intend to tackle Unicode localization quite soon.

Does that help? Probably not, I guess. Ah well. Tell you what, let's start an open discussion. (I've changed the thread title). I think we should hear lots of opinions before anyone actually DOES anything. A wrong early decision here could hamper D's potential future as /the/ language for internationalization (which I'd like it to become).

Arcane Jill
Jul 19 2004
Hauke Duden wrote:
As far as I remember (I looked at gettext a few years ago) gettext has some serious drawbacks. The worst being that parameters that are inserted into the translated string have to be specified in "printf" formatting. That means that their order in the translated string must be the same as the order in the original text, which is not always possible and often awkward. My memory is a little fuzzy about the specifics, so please correct me if I'm wrong. Hauke

Mmm, not the GNU gettext, you can put:

printf(_("There are %d %s %s\n"), count, _(color), _(name));

And the output po file will be:

"There are %1$d %2$s %3$s"

So translator can change the numbers thus changing the word order.
Jul 19 2004
In article <cdgiqm$dua$1 digitaldaemon.com>, Juanjo Álvarez says...
"There are %1$d %2$s %3$s" So translator can change the numbers thus changing the word order.

Is this a feature of printf()? If so, is it a Linux thing or an all-platform thing?

And (probably a silly question, but someone might know the answer) is this functionality available in the new writef()?
Jul 19 2004
In article <cdgq1p$gqi$1 digitaldaemon.com>, Arcane Jill says...
In article <cdgiqm$dua$1 digitaldaemon.com>, Juanjo Álvarez says...
"There are %1$d %2$s %3$s" So translator can change the numbers thus changing the word order.
Is this a feature of printf()? If so, is it a Linux thing or an all-platform thing? And (probably a silly question, but someone might know the answer) is this functionality available in the new writef()?

It's not a feature of printf and AFAIK it's not in the new writef either.

Semi-related: I'm recoding my scanf implementation as unFormat (to match doFormat) and changing the calling syntax to readf. So with any luck there will be both input and output routines written in D.

Sean
Jul 19 2004
Arcane Jill wrote:
In article <cdgiqm$dua$1 digitaldaemon.com>, Juanjo Álvarez says...
"There are %1$d %2$s %3$s" So translator can change the numbers thus changing the word order.
Is this a feature of printf()? If so, is it a Linux thing or an all-platform thing? And (probably a silly question, but someone might know the answer) is this functionality available in the new writef()?

It depends on whose printf() you're looking at. Standard C - no. POSIX - yes. See:
http://www.opengroup.org/onlinepubs/009695399/functions/fprintf.html

I discussed this once before in this news group, a few weeks after the thread had gone stale (mainly because I only just started to pay attention to D).
...dig...dig...dig...Friday 9th July 2004...
http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/5662

--
Jonathan Leffler #include <disclaimer.h>
Email: jleffler earthlink.net, jleffler us.ibm.com
Guardian of DBD::Informix v2003.04 -- http://dbi.perl.org/
Jul 19 2004
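The POSIX-style numbered specifiers discussed above can be tried out in D as well. This sketch assumes a present-day Phobos, where `std.format.format` accepts `%1$d`-style positional arguments (the writef of the time did not, as noted above). The helper `describe` and the second, reordered format string are made up for illustration and do not represent any real language.

```d
import std.format : format;
import std.stdio : writeln;

// The format string would normally come from a translation catalogue at run
// time; the arguments are always passed in the same order.
string describe(string fmt, int count, string colour, string name)
{
    return format(fmt, count, colour, name);
}

void main()
{
    // Source-language order: count, colour, name.
    writeln(describe("There are %1$d %2$s %3$s", 3, "red", "cars"));

    // A translator can reorder the pieces by renumbering the specifiers.
    writeln(describe("%3$s (%2$s): %1$d", 3, "red", "cars"));
}
```

The call site never changes; only the string stored in the catalogue does, which is what lets a translator change word order without access to the source.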
In article <cdgiqm$dua$1 digitaldaemon.com>, Juanjo Álvarez says...
Mmm, not the GNU gettext, you can put: printf(_("There are %d %s %s\n"), count, _(color), _(name)); And the output po file will be: "There are %1$d %2$s %3$s" So translator can change the numbers thus changing the word order.

Well, that /sounds/ like the kind of thing we need, but your above example is a little unclear to those of us who have not used gettext() before. As I read the above, and assuming that _() is the text-localizing function, that wouldn't change the word order. But you say it does, so I must have misunderstood something. Can you break that down into steps?

Berin mentioned Java's MessageFormat class. This does the job of word order switching. It's cumbersome to use in practice, but we could still borrow the technique if we so needed.

We will certainly find a way to do word reordering in D. The question is where is the right place for that? Does gettext do it? Should we petition Walter to get writef() to do it? Would Hauke's string class be the right place? We need more information....

Jill
Jul 19 2004
Arcane Jill wrote:
> In article <cdgiqm$dua$1 digitaldaemon.com>, Juanjo Álvarez says...
>> Mmm, not the GNU gettext, you can put:
>>     printf(_("There are %d %s %s\n"), count, _(color), _(name));
>> And the output po file will be: "There are %1$d %2$s %3$s" So translator can change the numbers thus changing the word order.
> Well, that /sounds/ like the kind of thing we need, but your above example is a little unclear to those of us who have not used gettext() before. As I read the above, and assuming that _() is the text-localizing function, that wouldn't change the word order. But you say it does, so I must have misunderstood something. Can you break that down into steps?
>
> Berin mentioned Java's MessageFormat class. This does the job of word order switching. It's cumbersome to use in practice, but we could still borrow the technique if we so needed.

I've been using a pretty simple but effective technique for quite some time. The translatable string can contain placeholders of the form %NAME%, and the translation function can take a map parameter that inserts the correct values.

This also has the advantage of better documentation. It is pretty hard to deduce the intended meaning of a string like "There are %d %ss in the %s". It gets easier if you have something like "There are %NUM% %OBJ%s in the %CONTAINER%". Less room for error.

I have also found that it can sometimes be helpful to be able to include some kind of comment for the translator that describes the intended use or any constraints of the string. For example "keep this as short as possible" or "context is file I/O". I implemented this by adding an optional parameter that can be passed to the translation function. It is ignored at runtime, but the "harvester" tool that extracts the strings from the code files includes it in the translatable files.

And last but not least, I think translatable strings should have an ID (a string ID, not a number). Not all strings that are the same in one language are the same in other languages. So if the translation is bound to the original text then you can have situations where you need to specify two different texts for two different contexts, but you are not able to do so, because the original text serves as ID/key.

A good example that I encountered a few years ago: at the time I played the German version of the game Baldur's Gate. It contained some horrible text bugs that obviously originated from a translation system where the original text served as the ID. One particular case was the text "XXX attacks YYY" that was displayed whenever one character attacked another. "attacks" in English can mean the plural of the noun "attack" or it can be a form of the verb "to attack". In this case it is the verb form. Unfortunately it was translated with the German plural of the noun, which is different from the verb (probably because it was also used in a different context where it meant the noun), so the translation made no sense at all.

Hauke
Jul 19 2004
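Editorial sketch of the scheme Hauke describes: named %KEY% placeholders, a string ID per message, and a translator note that is ignored at run time. This is not Hauke's actual code; the names `catalog`, `fillPlaceholders`, `tr` and the ID "combat.attacks.verb" are all hypothetical, and the German template is only an illustration.

```d
import std.array : replace;
import std.stdio : writeln;

// Hypothetical catalog for the active locale: string ID -> translated template.
string[string] catalog;

// Replaces each %KEY% in tmpl with values["KEY"]. Simplification: the
// replacements run one after another, so a value that itself contains
// %KEY%-style text could be substituted a second time.
string fillPlaceholders(string tmpl, string[string] values)
{
    string result = tmpl;
    foreach (key, value; values)
        result = result.replace("%" ~ key ~ "%", value);
    return result;
}

// Looks the template up by ID and falls back to the source-language text.
// translatorNote is unused at run time; a string-harvesting tool could
// copy it into the file handed to translators.
string tr(string id, string fallback, string[string] values,
          string translatorNote = "")
{
    auto translated = id in catalog;
    return fillPlaceholders(translated is null ? fallback : *translated, values);
}

void main()
{
    auto who = ["ATTACKER": "XXX", "TARGET": "YYY"];

    // No translation loaded yet, so the fallback text is used.
    writeln(tr("combat.attacks.verb", "%ATTACKER% attacks %TARGET%", who,
               "verb form, shown in the combat log"));

    // The verb use has its own ID, so it cannot collide with a noun
    // "attacks" that appears elsewhere under a different ID.
    catalog["combat.attacks.verb"] = "%ATTACKER% greift %TARGET% an";
    writeln(tr("combat.attacks.verb", "%ATTACKER% attacks %TARGET%", who));
}
```

Because the key is the ID and not the English text, two messages that happen to read the same in English can still receive different translations.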
Arcane Jill wrote:
> In article <cd9peo$oce$1 digitaldaemon.com>, J C Calvarese says...
>> There seems to be 2 schools of thought in the area of localization (of which language issues are a subset):
>> 1. Compile-time generated using version() as you described in your post.
>> 2. Runtime-time generated using some sort of a plugin architecture or language resource files (as Arcane Jill alludes to in her reply).
>> Since D is capable enough for either method, both parties can be happy. Personally, I think I'd prefer to use compile-time localization, but that doesn't prevent others from designing runtime-time localization functions.
> Compile-time localization not something I'd thought about before, so it's been kind of an interesting thing to think about. The thing is, though, we already _have_ compile-time localization. We've always had it. As Stewart said, you can do:
> And we've been able to do that in C ever since we learned how to use #ifdef. So it requires no new language features. It's already there. But, since this technique has been around for so long, you'd expect it be widely used ... unless, that is, it turns out to be not very useful.

It's nothing revolutionary, but it's a start.

> There are a number of disadvantages I can think of. For a start, your locale-specific code will end up distributed throughout your source code, instead of all in one place. This could be a nightmare if you decide to support a new locale. Another problem is that, if you choose the locale at compile-time, then the end-user (as opposed to the developer) has to have the source code, OR an executable which was compiled especially for their locale. And that's just for executables. For libraries, the situation is even worse. A compile-time-localized executable would have to be linked with compile-time-localized libraries, compiled for the same locale. It would be a serious headache.
> Another problem is that someone might compile it for version(en_US), without realizing that they should have been using version(en). And all for what? To save a small amount of run-time overhead. Well, /how much/ run-time overhead? In most cases, run-time-localization amounts to looking something up in a map. Is that bad? A matter of judgement, maybe, but I'd say it was insignificant compared to the overhead incurred by writing that localized string to printf() or a file.

It could be a lot of run-time overhead. It could be a little. But ultimately, it should be left up to the individual programmer.

> Localization, to me, is the flip-side of internationalization (or i18n for lazy typists). The way it's traditionally done is you "internationalize" your code - a compile-time thing, and then "localize" it at run-time. Here's an example.

I didn't realize there was a difference between localization and internationalization. The OP was mostly concerned with "human languages", but I think that other issues such as date formats would naturally be discussed at the same time.

> Start with some normal, unlocalized code:
> Now, internationalize it. It will become something like this:
> Not much different really. The function local() would probably be something like this, and would get inlined:
> "Localizing" this program now consists only of initializing the localizedLookup map, which could happen in any number of ways. My guess is that what would be most useful in terms of internationalization/localization would be some classes and functions to make stuff like the above easier, providing localized number formats and so on. Plus of course the D definition of a "locale" - we have to standardize that somehow.
> Me, I'd prefer enums to strings - saves all that messing about with case, for one thing, and string-splitting to get at the two (possibly three) parts.

Sure, Phobos should include some modules for run-time support (and maybe compile-time support, too).

>> I have a quick comment about the specifics of your ideas. I think the version identifiers should have a prefix (such as "lang_"). This should make it clear to most viewers of the code what's happening. Not everyone would intuitively know that ky_KG is a language feature, but lang_ky_KG is guessable.
> Strictly speaking it should be "locale_", not "lang_". ("ky" is a lang, or language; "KG" is a country. The combination "ky_KG" is a locale). Sorry for being a pedantic bugger here.

I don't mind nit-picking (I do a lot of it myself). I hereby retract "lang_" in favor of either "locale_" or "loc_".

> But the original questions was, should there be "standard version identifiers"? Thing is - I don't see how there can be. A version identifier is just D's name for a #define, and there's nothing to stop anyone from using they want as such. A note in the style guide might help, but even that won't force people to use said standard.

Right. I was thinking "convention" when I read "standard". I don't intend to compel anyone (and as you state, they can't really be compelled), but I think if the convention makes sense, many people would use it.

> Just my thoughts.
> Arcane Jill

--
Justin (a/k/a jcc7)
http://jcc_7.tripod.com/d/
Jul 18 2004
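Editorial sketch of the run-time approach quoted above. The code listings in the quoted post did not survive in this archive, so what follows is a guess at the shape of the idea, not a reconstruction of the original. The names `local` and `localizedLookup` come from the post; the function body and the sample strings are assumptions.

```d
import std.stdio : writeln;

// Map from source-language text to its translation for the active locale.
// How it gets filled (resource file, plugin, ...) is left open in the post.
string[string] localizedLookup;

// Returns the translation when one is loaded, else the original text.
string local(string text)
{
    if (auto translated = text in localizedLookup)
        return *translated;
    return text;
}

void main()
{
    // Internationalized code: every user-visible string goes through local().
    writeln(local("Hello, world"));

    // "Localizing" the program is then only a matter of filling the map.
    localizedLookup["Hello, world"] = "Bonjour, le monde";
    writeln(local("Hello, world"));
}
```

The cost per string is one associative-array lookup, which is the overhead the two posters are weighing against compile-time selection.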
In article <cdfgli$2vdr$1 digitaldaemon.com>, J C Calvarese says...
> I didn't realize there was a difference between localization and internationalization.

I found a good explanation about this when looking up gettext on the web. Have a look at http://www.gnu.org/software/gettext/manual/html_mono/gettext.html#SEC3. To quote its summary of itself:

"Also, very roughly said, when it comes to multi-lingual messages, internationalization is usually taken care of by programmers, and localization is usually taken care of by translators."

I consider myself a good programmer, but I'd make a lousy translator. I'd want to leave that job to someone else.

Arcane Jill
Jul 19 2004
Arcane Jill wrote:
> <snip>
> There are a number of disadvantages I can think of. For a start, your locale-specific code will end up distributed throughout your source code, instead of all in one place.

Only if you choose to do it that way. You can just as well have one version block per module, with the locale-specific data and code in them, and have the rest of the module use stuff in here.

> This could be a nightmare if you decide to support a new locale. Another problem is that, if you choose the locale at compile-time, then the end-user (as opposed to the developer) has to have the source code, OR an executable which was compiled especially for their locale.

I believe it's quite common to offer separate downloadable versions in each language. That way, a unilingual end-user isn't faced with the bloat of a multilingual UI or the overhead of compiling it, and you can choose whether to release the source or not.

> And that's just for executables. For libraries, the situation is even worse. A compile-time-localized executable would have to be linked with compile-time-localized libraries, compiled for the same locale. It would be a serious headache.

To me it would seem straightforward to build a copy of the lib, and give it an identifying name, for each language that your app supports.

> <snip>
> Strictly speaking it should be "locale_", not "lang_". ("ky" is a lang, or language; "KG" is a country. The combination "ky_KG" is a locale). Sorry for being a pedantic bugger here.
> <snip>

ISO language codes are language-dialect pairs. So en-GB is British English, es-MX is Mexican Spanish. AIUI they don't cover other aspects of locale, such as time zones, date formats and the like. They tend to be managed by the OS - it would seem pointless to try and write apps to override this.

Stewart.

--
My e-mail is valid but not my primary mailbox, aside from its being the unfortunate victim of intensive mail-bombing at the moment. Please keep replies on the 'group where everyone may benefit.
Jul 19 2004
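Editorial sketch of the "one version block per module" layout suggested above. The version identifiers `locale_de` and `locale_fr` follow the prefix floated earlier in the thread but are not any agreed standard, and the strings are placeholders.

```d
import std.stdio : writeln;

// All locale-specific data for this module sits in one version block.
// Select a branch at build time, for example: dmd -version=locale_de app.d
version (locale_de)
{
    enum greeting = "Guten Tag";
    enum farewell = "Auf Wiedersehen";
}
else version (locale_fr)
{
    enum greeting = "Bonjour";
    enum farewell = "Au revoir";
}
else
{
    // Default when no locale version identifier is given.
    enum greeting = "Good day";
    enum farewell = "Goodbye";
}

// The rest of the module only refers to the names, never to a locale.
void main()
{
    writeln(greeting);
    writeln(farewell);
}
```

Adding a locale then means adding one branch per module, which addresses the "scattered throughout the source" objection but still yields one binary per language.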
In article <cdg8ab$98s$1 digitaldaemon.com>, Stewart Gordon says...
> ISO language codes are language-dialect pairs. So en-GB is British English, es-MX is Mexican Spanish.

I know.

> AIUI they don't cover other aspects of locale, such as time zones, date formats and the like.

In a sense, they don't cover ANYTHING. They are just tuples of language/country/variant tags. However, if you think of these as map keys, you can turn them into anything else quite straightforwardly.

> They tend to be managed by the OS

That would be nice, but collation, number formats, date formats, etc. (to give just a few examples) are not handled very well at all by any OS of which I am aware.

> it would seem pointless to try and write apps to override this.

But not pointless for a library. The CLDR, which is maintained by the Unicode Consortium, contains just about every fragment of information you could possibly imagine wanting (short of actual language translation). Its data files are in XML - actually a custom format called LDML (Locale Data Markup Language). It absolutely DOES include such information as time zones, currencies, number formats, and so on. It's a resource we would be foolish to ignore, and since it's XML, it can be robot-parsed far, far more easily than the Unicode database. I will most certainly be using /some/ of the CLDR data for the Unicode collation algorithm.

If you want to write several monolingual applications from the same source, no-one is going to stop you. Go ahead and do it, and use whatever version identifiers you want. There's room in this world (and indeed in D) for BOTH compile-time language selection AND true internationalization/localization, so I guess we can all be happy.

Arcane Jill
Jul 19 2004
Arcane Jill wrote:
> <snip>
> Another problem is that someone might compile it for version(en_US), without realizing that they should have been using version(en).
> <snip>

Yes, as I said in my original post:

> It would be necessary either for the compiler to automatically set en if en_GB or en_US or en_anything is set, and similarly for other language codes, or to persuade all D users to do this. Of course, this would be done in the aforementioned default language setting.

Stewart.

--
My e-mail is valid but not my primary mailbox, aside from its being the unfortunate victim of intensive mail-bombing at the moment. Please keep replies on the 'group where everyone may benefit.
Jul 19 2004
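Editorial sketch of the "persuade all D users to do this" option: the source itself derives the general identifier from the specific one. The identifiers `en`, `en_GB` and `en_US` are the ones used in the thread; the `language` constant is made up. Two caveats as far as the language rules go: a `version = ...;` assignment only affects the module it appears in, and it has to come before the first place that identifier is tested.

```d
import std.stdio : writeln;

// If a regional variant was requested on the command line
// (for example: dmd -version=en_GB app.d), also enable the bare language.
version (en_GB) version = en;
version (en_US) version = en;

version (en)
{
    enum language = "English";
}
else
{
    enum language = "unspecified";
}

void main()
{
    writeln(language);
}
```

Since the assignment is per-module, every module that tests `en` would need the same two lines, which is presumably why the post would prefer the compiler to do it.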