digitalmars.D - ICU (International Components for Unicode)
- Arcane Jill (57/57) Aug 23 2004 Following Ben's mention of ICU, I've been checking out what it does and ...
- Arcane Jill (15/26) Aug 23 2004 Or I could port it to D!
- Jörg Rüppel (6/16) Aug 23 2004 According to the API docs at
- Arcane Jill (9/12) Aug 23 2004 I just didn't see it, that's all.
- Julio César Carrascal Urquijo (2/126) Aug 23 2004 Most of ICU is written in C, with C++ wrappers.
- antiAlias (50/107) Aug 23 2004 This would be a great thing for D to adopt. Just a few things to note:
- Walter (9/14) Aug 23 2004 Ok, but suppose we ditch char[]. Then, we find some great library we want ...
- Regan Heath (11/29) Aug 23 2004 YAY! .. Sorry I can't help myself, I think _this_ is the way to go, see ...
- Juanjo Álvarez (3/6) Aug 23 2004 Excuse me if I'm saying something stupid but byte[] would not do the job...
- Regan Heath (14/21) Aug 23 2004 I didn't write this! :)
- Walter (4/10) Aug 23 2004 That could work, but it just wouldn't look right.
- Regan Heath (9/21) Aug 23 2004 http://www.digitalmars.com/d/htomodule.html
- Walter (6/10) Aug 23 2004 I'm sorry that wasn't clear, but I meant that when 'unsigned char' and ...
- Regan Heath (9/20) Aug 24 2004 However, an old C lib might return latin-1 (or any other encoding) encod...
- Walter (5/23) Aug 25 2004 Yup. You'll have to understand what the C code is using the char type for...
- Arcane Jill (8/11) Aug 24 2004 Implicit conversions are absolutely fine by me. In fact, that was the fi...
- Arcane Jill (28/33) Aug 24 2004 It's usually regarded as the best by most other sources, however. Conver...
- Julio César Carrascal Urquijo (10/17) Aug 24 2004 I'm sorry to interrupt. From your reply I would say that you are arguing...
- Arcane Jill (29/33) Aug 24 2004 No, I wasn't asking that, and I don't really care what things are called...
- Walter (30/62) Aug 24 2004 Converting would be faster, sure, but if the bulk of your app is char[], ...
- Walter (18/39) Aug 25 2004 It sounds pretty cool. It being a very large library, is only what you use...
- Arcane Jill (14/17) Aug 25 2004 No idea (yet). I worried about that myself, but it's a library not an
- Roald Ribe (17/33) Aug 25 2004 The ICU has a C API, you want to port the C++ API to D, why not compile a ...
- Arcane Jill (14/18) Aug 25 2004 This is possibly the best idea yet, but I'm going to have to study the I...
- Sean Kelly (4/9) Aug 25 2004 What are those two operators used for? Could the functionality be worke...
- pragma (8/17) Aug 25 2004 I'm sure it can be worked around, as there is a Java interface to ICU al...
- Arcane Jill (4/13) Aug 25 2004 I don't know. I'm still looking into it all. Fortunately, I don't think ...
- Walter (8/23) Aug 25 2004 The Java API would likely be the best starting point.
Following Ben's mention of ICU, I've been checking out what it does and doesn't do. Basically, it does EVERYTHING. There is the work of years there. Not just Unicode stuff, but a whole swathe of classes for internationalization and transcoding. It would take me a very long time to duplicate that. It's also free, open source, and with a license which basically says there's no problem with our using it. So I'm thinking seriously about ditching the whole etc.unicode project and replacing it with a wrapper around ICU.

It's not completely straightforward. ICU is written in C++ (not C), and so we can't link against it directly. It uses classes, not raw functions. So, I'd have to write a C wrapper around ICU which gave me a C interface, and /then/ I'd have to write a D wrapper to call the C wrapper - at which point we could get the classes back again (and our own choice of architecture, so plugging into std or mango streams won't suffer). But the outermost (D) wrapper can, at least, be composed of D classes.

If we want D to be the language of choice for Unicode, we would need all this functionality. So, if we went the ICU route, we'd need to bundle ICU (currently a ten megabyte zip file) with DMD, along with whatever wrapper I come up with. (etc.unicode is not likely to be smaller).

I'd like to see some discussion on this. Read this page to inform yourself: http://oss.software.ibm.com/icu/userguide/index.html

Finally, back to strings. Ben was right. The ICU says: "In order to take advantage of Unicode with its large character repertoire and its well-defined properties, there must be types with consistent definitions and semantics. The Unicode standard defines a default encoding based on 16-bit code units. This is supported in ICU by the definition of the UChar to be an unsigned 16-bit integer type. This is the base type for character arrays for strings in ICU."

Get that? A *16-BIT* type is the basis for ICU strings. ICU defines no special string class - a string is just an array of wchars. So obviously, if we go the ICU route, I withdraw my suggestion to ditch the wchar. What I now recommend is:

(1) Ditch "etc.unicode" in favor of - let's call it "etc.icu" (a D wrapper around ICU). Eventually I hope for this to change into "std.icu" (as I originally hoped that "etc.unicode" would turn into "std.unicode").

(2) Ditch the char. 8-bits is really too small for a character these days, honestly, and all previous arguments still apply. The existence of char only encourages ASCII and discourages Unicode anyway.

(3) Native D strings shall be arrays of wchars. This means that Object.toString() must return a wchar[], and string literals in D source must compile to wchar[]s. ICU's type UChar would map directly to wchar. To reinforce this, there should probably be an alias in object.d.

(4) We retain dchar (so that we can get character properties), but all string code is based on wchar[]s, not dchar[]s. ICU's type UChar32 would map directly to dchar.

(5) Transcoding/streams/etc. go ahead as planned, but based around wchar[]s instead of dchar[]s.

(6) That's pretty much it, although once "char" is gone, we could rename "wchar" as "char" (a la Java).

Discussion please? And I really do want this talked through because it affects D work I'm currently involved in. Input is also requested from Walter - in particular the request that Object.toString() be re-jigged to return wchar[] instead of char[].

Okay, let's chew this one over.
Jill
Aug 23 2004
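To make the proposed layering concrete, here is a minimal sketch of what the bottom of an "etc.icu" wrapper could look like: extern(C) declarations against ICU's C API, with plain D functions on top so callers only ever see dchar. The function names come from ICU's documented C headers (uchar.h), but the D-side typing is simplified and real ICU builds often suffix exported symbols with a version number, so treat this as a shape rather than working bindings.

    // Sketch only: a possible bottom layer for the proposed etc.icu module.
    // The UChar32/UBool typings are assumptions about the C ABI.
    alias int  UChar32;   // ICU's 32-bit code point type
    alias byte UBool;     // ICU's 8-bit boolean type

    extern (C)
    {
        // Character property queries from ICU's uchar.h
        UBool   u_isalpha(UChar32 c);
        UBool   u_isdigit(UChar32 c);
        UChar32 u_toupper(UChar32 c);
    }

    // Thin D layer: dchar in, dchar out, so user code never sees the C types.
    bool isLetter(dchar c) { return u_isalpha(cast(UChar32) c) != 0; }
    bool isDigit (dchar c) { return u_isdigit(cast(UChar32) c) != 0; }
    dchar toUpper(dchar c) { return cast(dchar) u_toupper(cast(UChar32) c); }

Everything above the extern(C) block is ordinary D, which is why the outermost wrapper can be composed of D classes (or free functions) regardless of how the C layer underneath is produced.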
In article <cgcv4n$2fsf$1 digitaldaemon.com>, Arcane Jill says...
It's not completely straightforward. ICU is written in C++ (not C), and so we can't link against it directly. It uses classes, not raw functions. So, I'd have to write a C wrapper around ICU which gave me a C interface, and /then/ I'd have to write a D wrapper to call the C wrapper - at which point we could get the classes back again (and our own choice of architecture, so plugging into std or mango streams won't suffer).
Or I could port it to D! Or...
In article <cgd0h9$2ggf$1 digitaldaemon.com>, Juanjo Álvarez says...
AJ, I don't know a shit^D^D^D^D too much about Unicode but your excitement about ICU is really contagious, only one question, are the C wrappers at the same level as the C++/Java ones? If so it seems that with a little easy and boring (compared to writing etc.unicode) wrapping we're going to have a first-class Unicode lib :) => (i18n version of <g>)
I don't know, as I've only just started looking into it. But either way (port or wrap C) we move the time spent on development from a year or so down to only a few months. It really is worth thinking about, but it /does/ mean that D really should standardize on wchar[] strings, and this has consequences for (a) parsing string literals, and (b) Object.toString() - and probably a few other things too, not to mention all the code it would break, and the future (or not) of the char type. It's all this that I'd be concerned about, and should really be discussed by all of us in the D community, and Walter as its architect - not just those of us interested in Unicode.
Arcane Jill
Aug 23 2004
Arcane Jill wrote:
So I'm thinking seriously about ditching the whole etc.unicode project and replacing it with a wrapper around ICU. It's not completely straightforward. ICU is written in C++ (not C), and so we can't link against it directly. It uses classes, not raw functions. So, I'd have to write a C wrapper around ICU which gave me a C interface, and /then/ I'd have to write a D wrapper to call the C wrapper - at which point we could get the classes back again (and our own choice of architecture, so plugging into std or mango streams won't suffer).
According to the API docs at http://oss.software.ibm.com/icu/apiref/index.html there is a C API. Didn't you see that or is there a reason why that can't be used?
Regards, Jörg
Aug 23 2004
In article <cgd5go$2j1k$1 digitaldaemon.com>, Jörg Rüppel says...
According to the API docs at http://oss.software.ibm.com/icu/apiref/index.html there is a C API. Didn't you see that or is there a reason why that can't be used?
I just didn't see it, that's all. I think now that the best approach would be something which is part-port and part-wrapper around the C API. I would want the D interface to maintain the classes and so forth which are present in the C++ and Java APIs, so the D API's C wrappers would have to be part-port anyway, even if only to recreate the class hierarchy and put things back into member functions.
Jill
Aug 23 2004
Most of ICU is written in C, with C++ wrappers.
Arcane Jill wrote:
It's not completely straightforward. ICU is written in C++ (not C), and so we can't link against it directly. It uses classes, not raw functions. So, I'd have to write a C wrapper around ICU which gave me a C interface, and /then/ I'd have to write a D wrapper to call the C wrapper - at which point we could get the classes back again (and our own choice of architecture, so plugging into std or mango streams won't suffer). [...]
Aug 23 2004
This would be a great thing for D to adopt. Just a few things to note:
1) This would be best served as a DLL (given its size). In fact, the team apparently like to compile the string-resource files into DLLs (which makes a lot of sense IMO). If D treated DLLs as first-class citizens, this would be a no-brainer. Right now, that's not the case.
2) There's a rather nice String class (C++). That's a perfect candidate for porting directly to D.
3) From what I've seen, the lib is mostly C. Even better, it eschews the traditional morass of header files. Building an ICU.d import will be much easier because of this. The project on dsource.org might handle that part without issue? Wrapping those library functions with D shells would be nice, if only to take advantage of D arrays.
4) The transcoders deal with arrays of buffered data, so they're efficient. ICU has transcoders and code-page tables up-the-wazoo.
For those who haven't looked through the lib, it's far more than just Unicode transcoders (as Jill notes): you get sophisticated and flexible date & number parsing/formatting; I18N message ordering; BiDi support; collation/sorting support; text-break algorithms; a text layout engine; unicode regex; and much more. It's a first-class suite of libraries, and an awesome resource to leverage.
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cgcv4n$2fsf$1 digitaldaemon.com...
Following Ben's mention of ICU, I've been checking out what it does and doesn't do. Basically, it does EVERYTHING. [...] Okay, let's chew this one over. Jill
Aug 23 2004
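Point 4 can be illustrated with a short sketch of driving one of those buffered transcoders from D. The function names (ucnv_open, ucnv_toUChars, ucnv_close) are from ICU's documented C converter API; the void* handle, int error code, charset name and buffer sizing are simplifications for the sketch, and error checking is omitted.

    import std.string;   // toStringz

    extern (C)
    {
        void* ucnv_open(char* name, int* err);
        void  ucnv_close(void* cnv);
        int   ucnv_toUChars(void* cnv, wchar* dest, int destCapacity,
                            char* src, int srcLength, int* err);
    }

    // Convert a whole buffer of, say, "ISO-8859-1" bytes to UTF-16 in one
    // call. The sizing assumes at most one UTF-16 unit per input byte, which
    // holds for single-byte charsets but not in general.
    wchar[] toWideViaICU(char[] charsetName, ubyte[] src)
    {
        int err = 0;
        void* cnv = ucnv_open(toStringz(charsetName), &err);
        wchar[] buf = new wchar[src.length + 1];
        int len = ucnv_toUChars(cnv, buf.ptr, cast(int) buf.length,
                                cast(char*) src.ptr, cast(int) src.length, &err);
        ucnv_close(cnv);
        return buf[0 .. len];
    }

Because the conversion works on whole arrays rather than one code unit at a time, a D shell like this maps naturally onto D array slices, which is part of why wrapping the C layer is attractive.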
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cgcv4n$2fsf$1 digitaldaemon.com...Get that? A *16-BIT* type is the basis for ICU strings. ICU defines nospecialstring class - a string is just an array of wchars. So obviously, if we gotheICU route, I withdraw my suggestion to ditch the wchar.Ok, but suppose we ditch char[]. Then, we find some great library we want to bring into D, or build a D interface too, that is in char[].Input is also requested from Walter - in particular the request that Object.toString() be re-jigged to return wchar[] instead of char[].My experience with all-wchar is that its performance is not the best. It'll also become a nuisance interfacing with C. I'd rather explore perhaps making implict conversions between the 3 utf types more seamless.
Aug 23 2004
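For the record, here is the kind of boilerplate those "more seamless" conversions would remove. Today the transcoding step is explicit through std.utf (the toUTF8/toUTF16 routines mentioned elsewhere in this thread); under the suggestion the compiler would insert it. The takesWide function is just a hypothetical stand-in for any wchar[]-based API such as an ICU wrapper.

    import std.utf;

    void takesWide(wchar[] s) { /* e.g. hand the string to an ICU routine */ }

    void today(char[] s)
    {
        takesWide(toUTF16(s));   // explicit UTF-8 -> UTF-16 conversion
    }

    // With implicit conversions between the three UTF types, the call site
    // could simply be takesWide(s), and the conversion would happen behind
    // the scenes, in either direction, with no information loss.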
On Mon, 23 Aug 2004 14:06:57 -0700, Walter <newshound digitalmars.com> wrote:
My experience with all-wchar is that its performance is not the best. It'll also become a nuisance interfacing with C. I'd rather explore perhaps making implicit conversions between the 3 utf types more seamless.
YAY! .. Sorry I can't help myself, I think _this_ is the way to go, see my post here: http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/9494 for my arguments. I would add one additional argument. I can't imagine the suggested change breaking existing code.. more likely it will fix existing bugs.
Regan.
-- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 23 2004
Regan Heath wrote:Excuse me if I'm saying something stupid but byte[] would not do the job of interfacing with C char[]?Ok, but suppose we ditch char[]. Then, we find some great library we want to bring into D, or build a D interface too, that is in char[].
Aug 23 2004
On Tue, 24 Aug 2004 01:18:09 +0200, Juanjo Álvarez <juanjuxNO SPAMyahoo.es> wrote:Regan Heath wrote:I didn't write this! :)I almost made that same comment in reply to 'Walter' (who made the comment above) I think you're right, you could use byte[], in fact it'd be more correct to use byte[] as the C 'char' type is a byte with no specified encoding (whereas D's char[] is utf-8 encoded). If we had no char[] you'd have to transcode the byte[] to dchar[], this is why I disagree with removing char[], I think char[] has a place in D, I just want to see implicit transcoding of char[] to wchar[] to dchar[]. Regan -- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/Excuse me if I'm saying something stupid but byte[] would not do the job of interfacing with C char[]?Ok, but suppose we ditch char[]. Then, we find some great library we want to bring into D, or build a D interface too, that is in char[].
Aug 23 2004
"Juanjo Álvarez" <juanjuxNO SPAMyahoo.es> wrote in message news:cgdu4b$2ed$1 digitaldaemon.com...Regan Heath wrote:ofExcuse me if I'm saying something stupid but byte[] would not do the jobOk, but suppose we ditch char[]. Then, we find some great library we want to bring into D, or build a D interface too, that is in char[].interfacing with C char[]?That could work, but it just wouldn't look right.
Aug 23 2004
On Mon, 23 Aug 2004 16:41:10 -0700, Walter <newshound digitalmars.com> wrote:"Juanjo Álvarez" <juanjuxNO SPAMyahoo.es> wrote in message news:cgdu4b$2ed$1 digitaldaemon.com...http://www.digitalmars.com/d/htomodule.html Specifically states that C's 'char' should be represented by a 'byte' in D. So when building an interface to the C lib that uses char[] you'd use byte[]. Regan. -- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/Regan Heath wrote:ofExcuse me if I'm saying something stupid but byte[] would not do the jobOk, but suppose we ditch char[]. Then, we find some great library we want to bring into D, or build a D interface too, that is in char[].interfacing with C char[]?That could work, but it just wouldn't look right.
Aug 23 2004
"Regan Heath" <regan netwin.co.nz> wrote in message news:opsc7lqdnp5a2sq9 digitalmars.com...http://www.digitalmars.com/d/htomodule.html Specifically states that C's 'char' should be represented by a 'byte' inD.So when building an interface to the C lib that uses char[] you'd use byte[].I'm sorry that wasn't clear, but I meant that when 'unsigned char' and 'signed char' in C are used not as text, but as very small integers, the corresponding D types should be ubyte and byte.
Aug 23 2004
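A small illustration of that distinction, using two made-up C prototypes: the first uses unsigned char as a bucket of small integers, the second uses char as text in some unspecified 8-bit charset. Only the intent tells you which D type to pick; both declarations below are hypothetical.

    // Hypothetical C declarations being translated:
    //     unsigned int checksum(const unsigned char *buf, unsigned int len);
    //     const char  *get_label(int id);   /* returns text, charset unknown */
    extern (C)
    {
        uint checksum(ubyte* buf, uint len);   // small integers -> ubyte
        char* get_label(int id);               // text -> nominally char*, but the
                                               // bytes are not guaranteed to be UTF-8
    }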
On Mon, 23 Aug 2004 23:35:07 -0700, Walter <newshound digitalmars.com> wrote:
I'm sorry that wasn't clear, but I meant that when 'unsigned char' and 'signed char' in C are used not as text, but as very small integers, the corresponding D types should be ubyte and byte.
However, an old C lib might return latin-1 (or any other encoding) encoded data, in which case you also have to use ubyte then transcode to utf-8 and store in char[] (if that is the desired result). Right?
Regan
-- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 24 2004
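The transcode itself is cheap in the Latin-1 case, because Latin-1's 256 values coincide with the code points U+0000 to U+00FF. A hand-rolled sketch of such a helper (hypothetical, not an existing Phobos routine) turning the C lib's ubyte data into a UTF-8 char[]:

    // Latin-1 bytes -> UTF-8. Every byte is its own code point, and code
    // points up to U+00FF need at most two UTF-8 bytes.
    char[] fromLatin1(ubyte[] raw)
    {
        char[] result;
        for (size_t i = 0; i < raw.length; i++)
        {
            ubyte b = raw[i];
            if (b < 0x80)
                result ~= cast(char) b;                    // ASCII passes through
            else
            {
                result ~= cast(char) (0xC0 | (b >> 6));    // two-byte sequence
                result ~= cast(char) (0x80 | (b & 0x3F));
            }
        }
        return result;
    }

Any other legacy encoding would need a real conversion table, which is exactly the sort of thing ICU's converters already provide.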
"Regan Heath" <regan netwin.co.nz> wrote in message news:opsc9o90n65a2sq9 digitalmars.com...On Mon, 23 Aug 2004 23:35:07 -0700, Walter <newshound digitalmars.com> wrote:in"Regan Heath" <regan netwin.co.nz> wrote in message news:opsc7lqdnp5a2sq9 digitalmars.com...http://www.digitalmars.com/d/htomodule.html Specifically states that C's 'char' should be represented by a 'byte'Yup. You'll have to understand what the C code is using the char type for, in order to select the best equivalent D type.D.However, an old C lib might return latin-1 (or any other encoding) encoded data, in which case you also have to use ubyte then transcode to utf-8 and store in char[] (if that is the desired result). Right?So when building an interface to the C lib that uses char[] you'd use byte[].I'm sorry that wasn't clear, but I meant that when 'unsigned char' and 'signed char' in C are used not as text, but as very small integers, the corresponding D types should be ubyte and byte.
Aug 25 2004
In article <cgdmj9$2v1a$1 digitaldaemon.com>, Walter says...
My experience with all-wchar is that its performance is not the best. It'll also become a nuisance interfacing with C. I'd rather explore perhaps making implicit conversions between the 3 utf types more seamless.
Implicit conversions are absolutely fine by me. In fact, that was the first suggestion (then the discussion wandered, as things do, along the lines of "if they're interchangeable, why not have just the one type"). But sure, I'd be more than happy with implicit conversions between the three UTF types. Since such conversions lose no information in any direction, they are always guaranteed to be harmless.
Jill
Aug 24 2004
In article <cgdmj9$2v1a$1 digitaldaemon.com>, Walter says...
My experience with all-wchar is that its performance is not the best.
It's usually regarded as the best by most other sources, however. Converting between wchar[] and dchar[] is /almost/ as fast as doing a memcpy(), because UTF-16 encoding is very, very simple. UTF-16 is as efficient for the codepoint range U+0000 to U+FFFF as UTF-8 is for the codepoint range U+0000 to U+007F. Outside of these ranges, UTF-16 is still very fast, since all remaining characters consist of /precisely/ two wchars, whereas UTF-8 conversion outside of the ASCII range is always going to be inefficient, what with its variable number of required bytes, variable width bitmasks, and the additional requirements of validation and rejection of non-shortest sequences. If you're arguing on the basis of performance, UTF-8 loses hands down.
It'll also become a nuisance interfacing with C.
Actually, it's D's char (which C doesn't have) which is a nuisance interfacing with C. As others have pointed out, C has no type which enforces UTF-8 encoding, and in fact on the Windows PC on which I am typing right now, every C char is going to be storing characters from Windows code page 1252 unless I take special action to do otherwise. That is /not/ interchangeable with D's chars. Beyond U+007F, one C char corresponds to two (or more) D chars. You don't regard that as a nuisance?
In fact, I believe that comment of yours which I just quoted above actually adds further weight to my argument. I argue that the existence of the char type *causes confusion*. People /think/ (erroneously) that it does the same job as C's char, and is interchangeable therewith. Now if you, the architect of D, can fall prey to that confusion, I would take that as clear evidence that such confusion exists.
I'd rather explore perhaps making implicit conversions between the 3 utf types more seamless.
Yes. If all three D string types were implicitly convertible, then there would be nothing for me to complain about. (The char confusion would still exist, but that's just education)
Arcane Jill
Aug 24 2004
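The arithmetic behind the "precisely two wchars" claim is small enough to show in full: a code point above U+FFFF is split into a lead surrogate in 0xD800-0xDBFF and a trail surrogate in 0xDC00-0xDFFF, and recovering it is one shift and two subtractions. The sketch below omits the validation a real decoder needs.

    // Decode one UTF-16 surrogate pair back into a code point. Real code
    // must first check that lead/trail are actually in the surrogate ranges.
    dchar decodePair(wchar lead, wchar trail)
    {
        return cast(dchar) (0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00));
    }

    // Example: U+1D11E (musical symbol G clef) is encoded as the pair
    // D834 DD1E, and decodePair(0xD834, 0xDD1E) gives back 0x1D11E.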
Arcane Jill wrote:
(...) adds further weight to my argument. I argue that the existence of the char type *causes confusion*. People /think/ (erroneously) that it does the same job as C's char, and is interchangeable therewith. Now if you, the architect of D, can fall prey to that confusion, I would take that as clear evidence that such confusion exists. (...)
I'm sorry to interrupt. From your reply I would say that you are arguing more about the name of the type being used as the *representation of a character* (which it is not). Maybe we should ask Walter to change the names of the types to utf8, utf16 and utf32. I read somewhere that those were the original names in earlier DMD implementations. If that's what you are asking, you have my vote.
-- Julio César Carrascal Urquijo
Aug 24 2004
In article <cgfmdm$v3t$1 digitaldaemon.com>, Julio César Carrascal Urquijo says...
Maybe we should ask Walter to change the names of the types to utf8, utf16 and utf32. I read somewhere that those were the original names in earlier DMD implementations. If that's what you are asking, you have my vote.
No, I wasn't asking that, and I don't really care what things are called. I do /try/ to stay on the topic of the thread title, and in this thread the discussion is about how/whether to make use of ICU for our internationalization and Unicode needs.
I've been looking more closely at ICU, and I keep being (pleasantly) surprised to discover that it already has zillions of other goodies not previously mentioned in this thread, but which we've been talking about in this forum in the past. For example - Locales, ResourceBundles, everything you need for text internationalization/localization.
It's relevant to D's character types because ICU has only two character types - a type equivalent to wchar that is used to make UTF-16 strings, and a type equivalent to dchar that is used to access character properties. The important detail here is that ICU strings are wchar[]s, but D's basic "string" concept is char[]. So, calling lots of ICU routines would result in lots of explicit toUTF8() and toUTF16() calls all over your code, /unless/ either:
(1) D adopted wchar[] as the basic string type, or
(2) D implicitly auto-converted between its various string types as required
I'm trying to suggest that (1) is the best option. That's all. Walter prefers (2), but that's acceptable too. As a corollary to (1), if we start using wchar[]s as the default native string used by Phobos and the compiler, it would then follow that the char type would be superfluous and could be dropped. Or at least, it seems that way to me. Opinions differ. However, the "should we ditch the char or not?" discussion is over on another thread.
Renaming the character types is kind of irrelevant to this, although it is pertinent to Walter's reply. No - I'm not asking that they be renamed (except insofar as, if "char" is ditched, then "wchar" could be renamed "char", but that again is for the other thread).
Arcane Jill
Aug 24 2004
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cgf8d1$os0$1 digitaldaemon.com...In article <cgdmj9$2v1a$1 digitaldaemon.com>, Walter says...ConvertingMy experience with all-wchar is that its performance is not the best.It's usually regarded as the best by most other sources, however.between wchar[] and dchar[] is /almost/ as fast as doing a memcpy(),becauseUTF-16 encoding is very, very simple. UTF-16 is as efficient for thecodepointrange U+0000 to U+FFFF as UTF-8 is for the codepoint range U+0000 toU+007F.Outside of these ranges, UTF-16 is still very fast, since all remaining characters consist of /precisely/ two wchars, wheras UTF-8 conversionoutside ofthe ASCII range is always going to be inefficient, what with it's variable number of required bytes, variable width bitmasks, and the additional requirements of validation and rejection of non-shortest sequences. Ifyou'rearguing on the basis performance, UTF-8 loses hands down.Converting would be faster, sure, but if the bulk of your app is char[], there is little conversion happening.interfacingIt'll also become a nuisance interfacing with C.Actually, it's D's char (which C doesn't have) which is a nuisancewith C. As others have pointed out, C has no type which enforces UTF-8encoding,and in fact on the Windows PC on which I am typing right now, every C charisgoing to be storing characters from Windows code page 1252 unless I takespecialaction to do otherwise. That is /not/ interchangable with D's chars.BeyondU+007F, one C char corresponds to two (or more) D chars. You don't regardthatas a nuisance?I've been dealing with multibyte charsets in C for decades - it's not just UTF-8 that's multibyte, there are also the Shift-JIS, Korean, and Taiwan code pages. You can also set up your windows machine so UTF-8 *is* the charset used by the "A" APIs. I've written UTF-8 apps in C, and D would map onto them directly. There is no way to avoid, when interfacing with C, dealing with whatever charset it might be in. It can't happen automatically. And that is the source of the nuisance.In fact, I believe that comment of yours which I just quoted above,actuallyadds further weight to my argument. I argue that the existence of the chartype*causes confusion*. People /think/ (erroneously) that it does the same jobasC's char, and is interchangable therewith. Now if you, the architect of D,canfall prey to that confusion, I would take that as clear evidence that such confusion exists.wouldI'd rather explore perhaps making implict conversions between the 3 utf types more seamless.Yes. If all three D string types were implicitly convertable, then therebe nothing for me to complain about. (The char confusion would stillexist, butthat's just education)
Aug 24 2004
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cgcv4n$2fsf$1 digitaldaemon.com...Following Ben's mention of ICU, I've been checking out what it does anddoesn'tdo. Basically, it does EVERYTHING. There is the work of years there. NotjustUnicode stuff, but a whole swathe of classes for internationalization and transcoding. It would take me a very long time to duplicate that. It'salsofree, open source, and with a license which basically says there's noproblemwith our using it. So I'm thinking seriously about ditching the whole etc.unicode project and replacing it with a wrapper around ICU. It's not completely straightforward. ICU is written in C++ (not C), and sowecan't link against it directly. It uses classes, not raw functions. So,I'd haveto write a C wrapper around ICU which gave me a C interface, and /then/I'd haveto write a D wrapper to call the C wrapper - at which point we could gettheclasses back again (and our own choice of architecture, so plugging intostd ormango streams won't suffer). But the outermost (D) wrapper can, at least, be composed of D classes. If we want D to be the language of choice for Unicode, we would need allthisfunctionality. So, if we went the ICU route, we'd need to bundle ICU(currentlya ten megabyte zip file) with DMD, along with whatever wrapper I come upwith.(etc.unicode is not likely to be smaller). I'd like to see some discussion on this. Read this page to informyourself:http://oss.software.ibm.com/icu/userguide/index.htmlIt sounds pretty cool. It being a very large library, is only what you use linked in? Or does using anything tend to pull in the whole shebang? I hope it's the former!
Aug 25 2004
In article <cghofr$202k$3 digitaldaemon.com>, Walter says...
It sounds pretty cool. It being a very large library, is only what you use linked in? Or does using anything tend to pull in the whole shebang? I hope it's the former!
No idea (yet). I worried about that myself, but it's a library not an indivisible object file, so it must have /some/ granularity. But the functionality of ICU goes way beyond what I was even planning to achieve - there's even stuff for font rendering in there for feck's sake! (Unicode allows you to put any accent over any glyph, ligate any two glyphs into one, etc. Font rendering engines have a fair bit of work to do). So ICU is hard to beat. Therefore, I'm inclining to the view that we'd be better off trying to plumb ICU into D than to reject it because it fails to meet any one single D requirement.
So, if it won't split into sufficiently small chunks on linking, I'd say that was as good an argument as any for improving D's DLL support. ICU would clearly be an excellent candidate for a DLL (assuming we can have classes in DLLs).
Jill
Aug 25 2004
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cghrn3$21dg$1 digitaldaemon.com...In article <cghofr$202k$3 digitaldaemon.com>, Walter says...useIt sounds pretty cool. It being a very large library, is only what youhopelinked in? Or does using anything tend to pull in the whole shebang? Iachieveit's the former!No idea (yet). I worried about that myself, but it's a library not an indivisibile object file, so it must have /some/ granularity. But the functionality of ICU goes way beyond what I was even planning to- there's even stuff for font rendering in there for feck's sake! (Unicode allows you to put any accent over any glyph, ligate any two glyphs intoone,etc. Font rendering engines have a fair bit of work to do). So ICU is hardtobeat. Therefore, I'm inclining to the view that we'd be better off tryingtoplumb ICU into D than to reject it because it fails to meet any one singleDrequirement. So, if it won't split into sufficiently small chunks on linking, I'd saythatwas as good an argument as any for improving D's DLL support. ICU wouldclearlybe an excellent candidate for a DLL (assuming we can have classes inDLLs). The ICU has a C API, you want to port the C++ API to D, why not compile a DLL with the C API, and a lib with the ported D API? This would also be good for minimum disturbance between the ICU source and the D API. Roald
Aug 25 2004
In article <cgi7f6$25qf$1 digitaldaemon.com>, Roald Ribe says...
The ICU has a C API, you want to port the C++ API to D, why not compile a DLL with the C API, and a lib with the ported D API? This would also be good for minimum disturbance between the ICU source and the D API.
This is possibly the best idea yet, but I'm going to have to study the ICU source before I know how feasible that is. Some fundamentals:
(1) ICU's C API requires you to acquire and release (memory) resources. D programmers are accustomed to letting the garbage collector do the releasing.
(2) ICU's C++ API requires classes to have a copy constructor and an assignment operator. This isn't possible with D.
The above two points mean that the D API is probably going to look more like the Java API than the C++ API. Exactly how much of it can be done via a wrapper around C I'm not sure (yet). But yes - if the C-based core can be put into a DLL, and a D-API built which calls it, I guess that would work a treat.
Arcane Jill
Aug 25 2004
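On point (1), one plausible way to paper over the acquire/release pattern is to let a small D class own the C handle and release it in its destructor, so that delete (or eventual GC finalization) performs the close. The ucol_open/ucol_close names are from ICU's documented C collation API; the void* handle, the int error code and the rest of the D side are simplifications for the sketch.

    import std.string;   // toStringz

    extern (C)
    {
        void* ucol_open(char* locale, int* status);
        void  ucol_close(void* collator);
    }

    class Collator
    {
        private void* handle;

        this(char[] locale)
        {
            int status = 0;
            handle = ucol_open(toStringz(locale), &status);
            if (status > 0 || handle == null)     // ICU errors are positive values
                throw new Exception("ucol_open failed for " ~ locale);
        }

        ~this()
        {
            if (handle != null)
                ucol_close(handle);   // runs on delete or when the GC finalizes us
            handle = null;
        }
    }

Declared with the auto storage class, such an object would even get its destructor run deterministically at the end of the scope, which is roughly as close as D gets to the C++ RAII style that the ICU C++ API leans on.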
In article <cgi8jj$26dr$1 digitaldaemon.com>, Arcane Jill says...
Some fundamentals: (1) ICU's C API requires you to acquire and release (memory) resources. D programmers are accustomed to letting the garbage collector do the releasing. (2) ICU's C++ API requires classes to have a copy constructor and an assignment operator. This isn't possible with D.
What are those two operators used for? Could the functionality be worked around somehow?
Sean
Aug 25 2004
In article <cgifct$29kf$1 digitaldaemon.com>, Sean Kelly says...
What are those two operators used for? Could the functionality be worked around somehow?
I'm sure it can be worked around, as there is a Java interface to ICU already available. ( http://oss.software.ibm.com/icu4j/download/ ) ;) This table ( http://oss.software.ibm.com/icu4j/comparison/index.html ) is about as close an indicator as we're going to get for what D would need to do to compete with what's already out there.
-Pragma
[[ EricAnderton at (code it, and they will come) yahoo.com ]]
Aug 25 2004
In article <cgifct$29kf$1 digitaldaemon.com>, Sean Kelly says...
What are those two operators used for?
I don't know. I'm still looking into it all. Fortunately, I don't think it matters.
Could the functionality be worked around somehow?
Obviously yes, since there is a Java API.
Aug 25 2004
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cgihb5$2aub$1 digitaldaemon.com...In article <cgifct$29kf$1 digitaldaemon.com>, Sean Kelly says...DIn article <cgi8jj$26dr$1 digitaldaemon.com>, Arcane Jill says...Some fundamentals: (1) ICU's C API requires you to acquire and release (memory) resources.releasing.programmers are accustomed to letting the garbage collector do theassignment(2) ICU's C++ API requires classes to have a copy constructor and anThe Java API would likely be the best starting point. IBM is no slouch in its support of Java, and the Java interface will have solved the issues with interfacing to a GC language.I don't know. I'm still looking into it all. Fortunately, I don't think it matters.operator. This isn't possible with D.What are those two operators used for?Could the functionality be worked around somehow?Obviously yes, since there is a Java API.
Aug 25 2004