digitalmars.D - Unicode library now in Deimos
- Arcane Jill (139/139) Jun 27 2004 With humungous thanks to Hauke for ideas, suggestions, algorithms, inspr...
- Walter (1/1) Jun 27 2004 Cool! Is this a supplement or a replacement for Hauke's earlier work?
- Arcane Jill (19/20) Jun 28 2004 Both really. etc.unicode is a very different beast from unichar. Hauke's...
- Walter (3/7) Jul 01 2004 Hmm. This can be confusing. Can the functionality of each be made unique...
- Arcane Jill (68/76) Jul 02 2004 Hauke's "utype" module (drop-in replacement for ctype) is unique. It is ...
- Hauke Duden (20/123) Jul 02 2004 I agree that Phobos should only have one Unicode package - everything
- Arcane Jill (25/45) Jul 02 2004 No problem. I'll do that soon, like within a week. You should definitely...
- Hauke Duden (9/17) Jul 03 2004 Hmmm. I ran a few tests and it seems that you're right. As soon as you
- Hauke Duden (5/153) Jun 27 2004 This is incredible!
- Hauke Duden (8/11) Jun 27 2004 Hmmm. Why do you store each page separately with a manual switch for
- Arcane Jill (18/27) Jun 28 2004 The old space/speed tradeoff. You can probably tweak this, once I make t...
- Hauke Duden (9/42) Jun 28 2004 Not necessarily bigger. If you add RLE compression it can even get
- Arcane Jill (14/21) Jun 28 2004 You mentioned that before, but I'm not sure I agree. RLE is pretty much ...
- Hauke Duden (32/52) Jun 28 2004 Heh. Sorry if I seem like I want to push my ideas onto you. I usually
- Arcane Jill (7/10) Jun 29 2004 As well as what?
- Hauke Duden (16/31) Jun 29 2004 Yes, but they ARE in RAM. My point was that you don't save RAM if you
- Arcane Jill (14/33) Jun 29 2004 But clearly you do. If the compressed size is X, and the uncompressed si...
- Hauke Duden (22/66) Jun 29 2004 In this particular case, yes, as I stated in my post. I just wanted to
- Martin M. Pedersen (7/11) Jun 29 2004 "Hauke Duden" wrote in message news:cbrvnp$1uf...
- Hauke Duden (5/18) Jun 30 2004 I'm pretty sure about it, but not 100% sure. I have frequently observed
- Arcane Jill (71/82) Jun 30 2004 Once upon a time, I did just that. I had to write some Unicode stuff for...
- Martin M. Pedersen (22/28) Jun 30 2004 my
- Sam McCall (7/7) Jun 30 2004 Hi, this is really impressive! (Okay, so I'm only using isWhiteSpace and...
- Arcane Jill (7/14) Jun 30 2004 It is, kindof. At least the source code is there, at
- Sam McCall (14/40) Jun 30 2004 Sorry, yeah, that's what I meant.
With humungous thanks to Hauke for ideas, suggestions, algorithms, inspiration, etc., I've got the first version of etc.unicode uploaded to Deimos. It gives you access to pretty much all Unicode properties. For an idea of the flavor of the thing, these are the functions you get:

  char[] getAge(dchar c)
  char[] getArabicShapingName(dchar c)
  BidiClass getBidiClass(dchar c)
  char[] getBidiClassName(BidiClass e)
  char[] getBidiClassName(dchar c)
  dchar getBidiMirroringGlyph(dchar c)
  char[] getBlock(dchar c)
  uint getCanonicalCombiningClass(dchar c)
  int getDecimalDigit(dchar c)
  wchar[] getDecompositionMappingUTF16(dchar c)
  dchar[] getDecompositionMappingUTF32(dchar c)
  char[] getDecompositionMappingUTF8(dchar c)
  DecompositionType getDecompositionType(dchar c)
  char[] getDecompositionTypeName(DecompositionType e)
  char[] getDecompositionTypeName(dchar c)
  int getDigit(dchar c)
  EastAsianWidth getEastAsianWidth(dchar c)
  char[] getEastAsianWidthName(EastAsianWidth e)
  char[] getEastAsianWidthName(dchar c)
  GeneralCategory getGeneralCategory(dchar c)
  char[] getGeneralCategoryName(GeneralCategory e)
  char[] getGeneralCategoryName(dchar c)
  HangulSyllableType getHangulSyllableType(dchar c)
  char[] getHangulSyllableTypeName(HangulSyllableType e)
  char[] getHangulSyllableTypeName(dchar c)
  int getHexValue(dchar c)
  char[] getISOComment(dchar c)
  char[] getJamo(dchar c)
  char[] getJoiningGroup(dchar c)
  JoiningType getJoiningType(dchar c)
  char[] getJoiningTypeName(JoiningType e)
  char[] getJoiningTypeName(dchar c)
  LineBreak getLineBreak(dchar c)
  char[] getLineBreakName(LineBreak e)
  char[] getLineBreakName(dchar c)
  wchar[] getLowercaseMappingLocalUTF16(dchar c, char[] locale)
  dchar[] getLowercaseMappingLocalUTF32(dchar c, char[] locale)
  char[] getLowercaseMappingLocalUTF8(dchar c, char[] locale)
  wchar[] getLowercaseMappingUTF16(dchar c)
  dchar[] getLowercaseMappingUTF32(dchar c)
  char[] getLowercaseMappingUTF8(dchar c)
  char[] getName(dchar c)
  char[] getNormalizationCorrectionVersion(dchar c)
  dchar getNormalizationCorrectionsCorrection(dchar c)
  dchar getNormalizationCorrectionsOriginal(dchar c)
  char[] getNumeric(dchar c)
  uint getNumericType(dchar c)
  Script getScript(dchar c)
  char[] getScriptName(Script e)
  char[] getScriptName(dchar c)
  dchar getSimpleCaseFolding(dchar c)
  dchar getSimpleLowercaseMapping(dchar c)
  dchar getSimpleTitlecaseMapping(dchar c)
  dchar getSimpleUppercaseMapping(dchar c)
  char[] getSpecialCaseCondition(dchar c)
  char[] getSpecialCaseConditionLocal(dchar c)
  wchar[] getTitlecaseMappingLocalUTF16(dchar c, char[] locale)
  dchar[] getTitlecaseMappingLocalUTF32(dchar c, char[] locale)
  char[] getTitlecaseMappingLocalUTF8(dchar c, char[] locale)
  wchar[] getTitlecaseMappingUTF16(dchar c)
  dchar[] getTitlecaseMappingUTF32(dchar c)
  char[] getTitlecaseMappingUTF8(dchar c)
  char[] getUnicode1Name(dchar c)
  wchar[] getUppercaseMappingLocalUTF16(dchar c, char[] locale)
  dchar[] getUppercaseMappingLocalUTF32(dchar c, char[] locale)
  char[] getUppercaseMappingLocalUTF8(dchar c, char[] locale)
  wchar[] getUppercaseMappingUTF16(dchar c)
  dchar[] getUppercaseMappingUTF32(dchar c)
  char[] getUppercaseMappingUTF8(dchar c)
  bool isASCIIHexDigit(dchar c)
  bool isAlphabetic(dchar c)
  bool isBidiControl(dchar c)
  bool isBidiMirrored(dchar c)
  bool isCompositionExclusion(dchar c)
  bool isDash(dchar c)
  bool isDefaultIgnorableCodePoint(dchar c)
  bool isDeprecated(dchar c)
  bool isDiacritic(dchar c)
  bool isExtender(dchar c)
  bool isGraphemeBase(dchar c)
  bool isGraphemeExtend(dchar c)
  bool isGraphemeLink(dchar c)
  bool isHexDigit(dchar c)
  bool isHyphen(dchar c)
  bool isIDContinue(dchar c)
  bool isIDSBinaryOperator(dchar c)
  bool isIDSTrinaryOperator(dchar c)
  bool isIDStart(dchar c)
  bool isIdeographic(dchar c)
  bool isJoinControl(dchar c)
  bool isLogicalOrderException(dchar c)
  bool isLowercase(dchar c)
  bool isMath(dchar c)
  bool isNoncharacterCodePoint(dchar c)
  bool isOtherAlphabetic(dchar c)
  bool isOtherDefaultIgnorableCodePoint(dchar c)
  bool isOtherGraphemeExtend(dchar c)
  bool isOtherIDStart(dchar c)
  bool isOtherLowercase(dchar c)
  bool isOtherMath(dchar c)
  bool isOtherUppercase(dchar c)
  bool isQuotationMark(dchar c)
  bool isRadical(dchar c)
  bool isSTerm(dchar c)
  bool isSoftDotted(dchar c)
  bool isTerminalPunctuation(dchar c)
  bool isUnifiedIdeograph(dchar c)
  bool isUppercase(dchar c)
  bool isVariationSelector(dchar c)
  bool isWhiteSpace(dchar c)
  bool isXIDContinue(dchar c)
  bool isXIDStart(dchar c)

Pretty much every function is in its own module. This means that when you link against it you only get those functions which you actually call. In addition, the tables that get linked in are tiny (well, most of them), and in some cases even non-existent, thanks to some seriously aggressive space optimization. For instance, if you call getSimpleUppercaseMapping(), which converts a character to uppercase, you will add only 5K to the size of your executable.

Despite this space saving, the functions should still be pretty fast. The code for that uppercasing function consists of two if tests, a shift, a switch statement with seven cases, and a table lookup. And nothing else. Most of the other functions go the same way. Some are optimized in different ways, but I believe we now have a very good balance between speed and size.

This is only the first step toward full Unicode support for D. Character properties are the heart of the Unicode algorithms. You need those first - so here they are.

Currently, Deimos is not very well organized, so my next task will be trying to get that together. There are lots of interesting things in Deimos now (and some of them I don't even know what they are), but what we're lacking is overall organization, a build script, a "ready-to-go" downloadable library, proper doxygen documentation, and so on. It's a bit irritating, so I guess now is the time to deal with that.

In the meantime, you can download the etc.unicode source files and documentation and build it yourself. (But be patient. There are A LOT of files to compile).
Arcane Jill
Jun 27 2004
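The lookup shape described in the announcement above (two if tests, a shift, a switch and a table lookup) might look roughly like the following. This is only a minimal sketch in present-day D, not the generated etc.unicode code: the 256-entry page size, the single Basic Latin page, and the names page00, makePage00 and simpleUpper are all assumptions made up for illustration.

```d
// Hypothetical sketch only: NOT the generated etc.unicode code.
// The page size, table contents and all names below are assumptions.

enum pageShift = 8;                     // assumed: 256 code points per page
enum pageMask  = (1 << pageShift) - 1;

// One table per page that has any mappings. Only Basic Latin is filled in.
immutable dchar[256] page00 = makePage00();

dchar[256] makePage00()
{
    dchar[256] p;
    foreach (i, ref e; p)
        e = cast(dchar) i;              // default: map each character to itself
    foreach (int c; 'a' .. 'z' + 1)
        p[c] = cast(dchar)(c - 'a' + 'A');
    return p;
}

dchar simpleUpper(dchar c)
{
    if (c > 0x10FFFF)                   // first test: not a code point at all
        return c;

    immutable(dchar)[] page = null;
    switch (c >> pageShift)             // the shift selects the page number
    {
        case 0x00: page = page00[]; break;
        // ...a real build would have one case per non-empty page...
        default: break;                 // no table stored for this page
    }

    if (page is null)                   // second test: page has no mappings
        return c;
    return page[c & pageMask];          // the table lookup
}

unittest
{
    assert(simpleUpper('a') == 'A');
    assert(simpleUpper('Z') == 'Z');
    assert(simpleUpper(0x4E2D) == 0x4E2D);  // page without a table
}
```

Pages with no mappings cost no table space at all in this layout, which is where the space saving would come from; the price is the switch and the null test on every call.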
Cool! Is this a supplement or a replacement for Hauke's earlier work?
Jun 27 2004
In article <cbnbdl$182e$1 digitaldaemon.com>, Walter says...
> Cool! Is this a supplement or a replacement for Hauke's earlier work?

Both really. etc.unicode is a very different beast from unichar. Hauke's unichar is very easy to use - there's just one source file. You stick that source file in your project and you're done. With etc.unicode it is (for now) more complicated, as there are many, many source files, and, what with Deimos being slightly disorganized at present, it may be a while before we get a build script together. Deimos will work like Phobos - you download a lib and link against it. But we're not that organized yet. Getting Deimos organized now has quite a high priority for me (more even than writing code, and that's saying something).

It's a replacement for PART of Hauke's work. Hauke's utype module will always be necessary if you want a drop-in replacement for ctype. We should keep that forever. I don't know how isprint() and isgraph() are implemented in utype right now, but they could in any case be implemented in terms of etc.unicode if needed. (isgraph() == isGraphemeBase() || isGraphemeExtend()), etc.

etc.unicode does overlap the functionality of unichar though. That's because etc.unicode is written by robot, and it was easier to get the robot to write the lot rather than just some of it. I need to make the codebuilder robot public - or at least available to Hauke - because he may want to tweak it in places.

Arcane Jill
Jun 28 2004
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cboirj$1ca$1 digitaldaemon.com...
> In article <cbnbdl$182e$1 digitaldaemon.com>, Walter says...
>> Cool! Is this a supplement or a replacement for Hauke's earlier work?
> It's a replacement for PART of Hauke's work.

Hmm. This can be confusing. Can the functionality of each be made unique?
Jul 01 2004
In article <cc297e$ik$2 digitaldaemon.com>, Walter says...
> "Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cboirj$1ca$1 digitaldaemon.com...
>> In article <cbnbdl$182e$1 digitaldaemon.com>, Walter says...
>>> Cool! Is this a supplement or a replacement for Hauke's earlier work?
>> It's a replacement for PART of Hauke's work.
> Hmm. This can be confusing.

I meant, it's a replacement for "unichar", but not for "utype".

> Can the functionality of each be made unique?

Hauke's "utype" module (drop-in replacement for ctype) is unique. It is not duplicated in etc.unicode. (The FUNCTIONALITY of some functions is duplicated, of course, but that's because what makes utype special is the fact that all functions have the same _name_ as the corresponding ctype functions).

For example, to convert a character to uppercase using simple-casing, you can currently use any of:

(a) toupper(c) // ASCII only, from ctype
(b) toupper(c) // all Unicode chars, from utype
(c) charToUpper(c) // all Unicode chars, from unichar
(d) getSimpleUppercaseMapping(c) // all Unicode chars, from etc.unicode

All of the above are locale-unaware, but very shortly, there will also be:

(e) getUppercaseMapping(c, locale);

It is not possible, however, to make "etc.unicode" and "unichar" unique. The former is a superset of the latter. The codebuilder could, of course, be instructed NOT to generate those functions for which similar functionality exists in unichar, but then you'd have problems with keeping both versions in step with each other, function names being inconsistent, linking strategy being different, and so on.

My vote would go to retaining "utype" (because people are familiar with ctype), but using "etc.unicode" in place of "unichar". What "etc.unicode" can do is a superset of what "unichar" can do, in terms of provided functions, and in addition can be rebuilt for any future (or even past) version of Unicode by any end-user in a matter of minutes (once the codebuilder program goes public). It's also likely to be smaller, because you link in only those parts you need, whereas with "unichar" it's all or nothing. It should be noted of course that "etc.unicode" is well optimized for space (but still with guaranteed constant-time lookup), and Hauke's input is what made this possible.

The function names are definitely confusing, I agree. But in the case of "etc.unicode", the names originate from the Unicode Consortium. These folk define property names, such as "Simple_Uppercase_Mapping", and so on. The names are part of the Unicode standard - so I just slavishly put my metaphorical blinkers on, removed the underscores and added a "get" or "is" prefix ("get" for non-boolean properties, "is" for boolean properties) to conform to the D style guide. Such names may be cumbersome, but I still think it's better than using made-up names, and it's a consistent methodology to extend to the remaining properties we haven't added yet.

I hope that makes things less confusing. Unfortunately, right now, Deimos is not well organized, because it is in the hands of many people. To me it makes more sense that people should be able to download the whole of Deimos in one go (instead of individual packages), just like they can currently download the whole of Phobos in one go. That sort of organization would be easy if Deimos were a one-person-project, or even a project with one leader whose word was law, but it's a collective effort, and I think those involved are going to HAVE to put some effort into making it look like a unified effort. This will happen in time, but I mention it because, right now (and I hope this is a temporary phase), "unichar" is easier to use than "etc.unicode", even though both are currently supplied in source code form, if only for the simple reason that "unichar" is one file and "etc.unicode" is many files.

In the (near?) future, I would hope to have the following:

(1) Deimos being easy-to-download and easy-to-use, with pre-built linkable libraries for all platforms, in both Debug and Release builds.
(2) Headers for etc.unicode (by which I mean, stripped versions of the source code, with large tables removed), to speed up compilation time.
(3) The codebuilder program (which generates etc.unicode) being made public, along with documentation, so that people can compile Unicode lookups for any version of Unicode, past, present or future (or even customized).

Until this is done, unichar is likely to be easier to use. However, once these steps are taken, I would then have no hesitation in suggesting that we use as standard:

(i) etc.unicode (possibly renamed to std.unicode - the codebuilder can locate it anywhere).
(ii) utype (possibly renamed to std.utype).

Arcane Jill
Jul 02 2004
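The difference between options (a) and (b) to (d) above, an ASCII-only conversion versus one that covers all of Unicode under the same function name, can be seen with present-day Phobos. This is a sketch under an assumption: std.ascii and std.uni are today's modules, not the ctype, utype, unichar or etc.unicode packages discussed in this thread, and they stand in for them purely as an illustration.

```d
// Illustration using present-day Phobos modules, which are NOT the
// 2004 packages discussed in the thread. Both export a toUpper, so the
// imports are renamed to keep the two apart.
import ascii = std.ascii;   // ASCII-only case conversion
import uni   = std.uni;     // Unicode-aware simple case conversion

unittest
{
    dchar e = '\u00E9';                     // LATIN SMALL LETTER E WITH ACUTE

    assert(ascii.toUpper(e) == '\u00E9');   // outside ASCII: left unchanged
    assert(uni.toUpper(e)   == '\u00C9');   // simple uppercase mapping applied

    // Inside ASCII the two agree.
    assert(ascii.toUpper('a') == 'A');
    assert(uni.toUpper('a')   == 'A');
}
```

Because both functions share a name, the caller has to choose by module, which is the same situation as the two toupper(c) entries in the list above.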
I agree that Phobos should only have one Unicode package - everything else would be a bad idea. My hope is that I'll be able to integrate some of the current advantages of unichar (faster lookup, smaller footprint) with AJ's work so that they can apply to all Unicode functions. I'll know more about the possibilities once AJ releases the code generator.

I also have a feeling that it is not necessary to have as many separate modules as etc.unicode currently has (but since I have only glanced at etc.unicode I could be wrong). Since the linker will always throw out uncalled functions and unaccessed data (correct?) it should be possible to make it easier to use. The main thing we'd have to keep an eye on is that static module constructors do not pull in all the data and functions.

The function names could also use some tuning - right now they feel a little clunky (as do the unichar function names, of course). The main problem here is making sure that people who know Unicode will still recognize the property names, and that the functions will still be sufficiently different from utype/ctype to prevent confusion (since utype/ctype define quite a few properties with the same name in a different way).

Hauke
Jul 02 2004
In article <cc3a30$1ni0$1 digitaldaemon.com>, Hauke Duden says...
> I agree that Phobos should only have one Unicode package - everything else would be a bad idea. My hope is that I'll be able to integrate some of the current advantages of unichar (faster lookup, smaller footprint) with AJ's work so that they can apply to all Unicode functions. I'll know more about the possibilities once AJ releases the code generator.

No problem. I'll do that soon, like within a week. You should definitely be given write access, and if you can make it better/faster/whatever that would certainly be great.

> I also have a feeling that it is not necessary to have as many separate modules as etc.unicode currently has (but since I have only glanced at etc.unicode I could be wrong).

As currently written, the codebuilder generates between two and four modules per Unicode property - one is just a wrapper to present a usable interface to humans; the others are purely robot-generated and (in general) will comprise one module per lookup table (but you still need a module even if there are zero lookup tables). You could certainly have a bash at reducing the number of modules per Unicode property, but it would be bad (in my opinion) to go below one module per Unicode property; that is the minimum. I don't know if having many modules is actually a problem though. From the end users' point of view, they will still need to import ONE module, and link with ONE library, and the rest happens automatically. But yeah, if you can make that happen better or faster, great!

> Since the linker will always throw out uncalled functions and unaccessed data (correct?) it should be possible to make it easier to use.

No, the linker can only throw out whole unused modules. It cannot throw out /parts/ of modules. Therefore, each "optional thing" needs to be in its own module or modules.

> The main thing we'd have to keep an eye on is that static module constructors do not pull in all the data and functions.

There aren't any static module constructors. None at all. Zero. So keeping an eye on that should be fairly easy, I'd say.

> The function names could also use some tuning - right now they feel a little clunky (as do the unichar function names, of course). The main problem here is making sure that people who know Unicode will still recognize the property names, and that the functions will still be sufficiently different from utype/ctype to prevent confusion (since utype/ctype define quite a few properties with the same name in a different way).

Yes, I agree. The problem is that the names are defined by the Unicode Consortium (though I did tweak them to match D style). They are the standard, official names for properties. I think we'd have to be quite imaginative to come up with any reasonable alternative.

Arcane Jill
Jul 02 2004
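The naming rule mentioned above (take the official Unicode property name, drop the underscores, and prefix "get" or "is") is mechanical enough to sketch in a few lines. The function name accessorName and its isBoolean parameter are hypothetical; nothing here is taken from the codebuilder itself.

```d
import std.array : replace;

// Hypothetical helper: derives an accessor name from an official
// Unicode property name, following the rule described in the thread.
string accessorName(string propertyName, bool isBoolean)
{
    // "is" for boolean properties, "get" for everything else
    return (isBoolean ? "is" : "get") ~ propertyName.replace("_", "");
}

unittest
{
    assert(accessorName("Simple_Uppercase_Mapping", false)
           == "getSimpleUppercaseMapping");
    assert(accessorName("White_Space", true) == "isWhiteSpace");
    assert(accessorName("ASCII_Hex_Digit", true) == "isASCIIHexDigit");
}
```

The results match names in the function list from the announcement, which is the point of the rule: anyone who knows the Unicode property name can predict the function name.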
Arcane Jill wrote:
>> Since the linker will always throw out uncalled functions and unaccessed data (correct?) it should be possible to make it easier to use.
> No, the linker can only throw out whole unused modules. It cannot throw out /parts/ of modules. Therefore, each "optional thing" needs to be in its own module or modules.

Hmmm. I ran a few tests and it seems that you're right. As soon as you specify a module to the linker it will be included fully in the executable, regardless of whether it is used or not. It is unfortunate that it isn't a little more sophisticated :(.

GDC should be able to do better (since GCC/G++ does for C/C++), but since the unicode lib should work well with all compilers it seems that your current approach is the only feasible one.

Hauke
Jul 03 2004
This is incredible! Hopefully I'll have some free time next week to check this out :). Great work.

Hauke
Jun 27 2004
Arcane Jill wrote:
> Despite this space saving, the functions should still be pretty fast. The code for that uppercasing function consists of two if tests, a shift, a switch statement with seven cases, and a table lookup. And nothing else.

Hmmm. Why do you store each page separately with a manual switch for choosing the right one? A second lookup table should be a lot faster.

You could also save some more cycles if you add a single page that contains only 0 values instead of returning a null-pointer. That way you do not need to check for null every time you read a value.

But these are just minor points. This is a great piece of work.

Hauke
Jun 27 2004
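The two suggestions above, a second lookup table in place of the switch and a shared all-zero page in place of a null pointer, could be combined along these lines. This is a minimal sketch under stated assumptions: it stores signed deltas so that 0 means "no mapping", uses 256-entry pages, fills in only Basic Latin, and every name in it (pages, pageIndex, simpleUpper) is made up for illustration rather than taken from unichar or etc.unicode.

```d
// Hypothetical two-stage table sketch; layout and names are assumptions.

enum pageShift = 8;                         // assumed: 256 code points per page
enum pageMask  = (1 << pageShift) - 1;
enum pageCount = 0x110000 >> pageShift;     // pages covering U+0000..U+10FFFF

// pages[0] is the shared all-zero page; pages[1] holds Basic Latin deltas.
immutable int[256][2] pages = makePages();

// Stage one: maps a page number to an index into `pages`.
immutable ubyte[pageCount] pageIndex = makePageIndex();

int[256][2] makePages()
{
    int[256][2] p;                          // int.init is 0: pages start zeroed
    foreach (int c; 'a' .. 'z' + 1)
        p[1][c] = 'A' - 'a';                // delta from lowercase to uppercase
    return p;
}

ubyte[pageCount] makePageIndex()
{
    ubyte[pageCount] idx;                   // default 0: the shared zero page
    idx[0] = 1;                             // U+0000..U+00FF use pages[1]
    return idx;
}

dchar simpleUpper(dchar c)
{
    if (c > 0x10FFFF)                       // range test is still needed
        return c;
    // Two table reads, no switch and no null test.
    immutable delta = pages[pageIndex[c >> pageShift]][c & pageMask];
    return cast(dchar)(cast(int) c + delta);
}

unittest
{
    assert(simpleUpper('q') == 'Q');
    assert(simpleUpper('Q') == 'Q');
    assert(simpleUpper(0x1F600) == 0x1F600);    // served by the zero page
}
```

The trade is visible in the data: the stage-one index and the zero page take space that the switch version does not need, in exchange for a branch-free lookup.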
In article <cbneog$1cu3$1 digitaldaemon.com>, Hauke Duden says...
> Arcane Jill wrote:
>> Despite this space saving, the functions should still be pretty fast. The code for that uppercasing function consists of two if tests, a shift, a switch statement with seven cases, and a table lookup. And nothing else.
> Hmmm. Why do you store each page separately with a manual switch for choosing the right one? A second lookup table should be a lot faster. You could also save some more cycles if you add a single page that contains only 0 values instead of returning a null-pointer. That way you do not need to check for null every time you read a value.

The old space/speed tradeoff. You can probably tweak this, once I make the codebuilder program public. There are various parameters which control decisions the robot makes, and right now those parameters are just constants. But we COULD change that so that more popular lookups get biased in favor of speed, while less popular lookups get biased in favor of minimal space.

In the current configuration, the robot decides that a big table full of zeroes is a waste of space compared with a test for null, and that a switch statement is acceptable so long as there are fewer than sixteen cases. But all these things are ultimately tweakable if we later decide to tweak them. If we do that, the robot will simply write different code (i.e. faster but bigger).

Personally, I think that the choices currently made are quite reasonable for most properties. There may be a case for speeding up uppercasing and a few others, but you still have to think in terms of how much RAM that would consume at runtime. Right now it's 5K for uppercasing, and another 5K for lowercasing. Unicode contains a *lot* of data, and I'm hesitant to give speed too high a priority here, for fear of everything getting huge.

Arcane Jill
Jun 28 2004
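The two decisions Jill describes (drop all-zero pages in favor of a null test, and emit a switch only while there are fewer than sixteen cases) might look roughly like this inside a generator. The codebuilder had not been published at this point in the thread, so everything here is hypothetical: the function names, the 256-entry page size, and the strategy labels are assumptions, and only the threshold of sixteen comes from the post.

```python
PAGE_BITS = 8                 # assumed page size: 256 code points
PAGE_SIZE = 1 << PAGE_BITS
MAX_SWITCH_CASES = 16         # "fewer than sixteen cases", per the post


def non_empty_pages(values):
    """Group a sparse {code point: value} mapping into pages.

    Pages that would contain only zeroes are never created, which
    corresponds to answering them with a null test at lookup time.
    """
    pages = {}
    for cp, v in values.items():
        if v != 0:
            page = pages.setdefault(cp >> PAGE_BITS, [0] * PAGE_SIZE)
            page[cp & (PAGE_SIZE - 1)] = v
    return pages


def choose_dispatch(pages, max_switch_cases=MAX_SWITCH_CASES):
    """Pick how the generated function selects a page."""
    if len(pages) < max_switch_cases:
        return "switch"       # small and compact: one case per page
    return "index_table"      # faster for many pages, but costs space
```

Lowering `max_switch_cases` biases the output toward speed, raising it biases toward size, which is the kind of knob the post says could later be set per property.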
Arcane Jill wrote:
> In article <cbneog$1cu3$1 digitaldaemon.com>, Hauke Duden says...
> > Arcane Jill wrote:
> > > Despite this space saving, the functions should still be pretty fast. The code for that uppercasing function consists of two if tests, a shift, a switch statement with seven cases, and a table lookup. And nothing else.
> > Hmmm. Why do you store each page separately with a manual switch for choosing the right one? A second lookup table should be a lot faster. You could also save some more cycles if you add a single page that contains only 0 values instead of returning a null-pointer. That way you do not need to check for null every time you read a value.
> The old space/speed tradeoff. You can probably tweak this, once I make the codebuilder program public. There are various parameters which control decisions the robot makes, and right now those parameters are just constants. But we COULD change that so that more popular lookups get biased in favor of speed, while less popular lookups get biased in favor of minimal space. In the current configuration, the robot decides that a big table full of zeroes is a waste of space compared with a test for null, and that a switch statement is acceptable so long as there are fewer than sixteen cases. But all these things are ultimately tweakable if we later decide to tweak them. If we do that, the robot will simply write different code (i.e. faster but bigger).

Not necessarily bigger. If you add RLE compression it can even get smaller than what you have now.

I'd love to take a look at the "robot" code. I have some things in mind that might improve on both speed and size. I'd like to see how easy it would be to integrate them into your current system.

> Personally, I think that the choices currently made are quite reasonable for most properties. There may be a case for speeding up uppercasing and a few others, but you still have to think in terms of how much RAM that would consume at runtime. Right now it's 5K for uppercasing, and another 5K for lowercasing. Unicode contains a *lot* of data, and I'm hesitant to give speed too high a priority here, for fear of everything getting huge.

We're on the same page here. Both sides need to be optimized but you have to find a good balance.

Hauke
Jun 28 2004
In article <cbosl4$f9d$1 digitaldaemon.com>, Hauke Duden says...
> Not necessarily bigger. If you add RLE compression it can even get smaller than what you have now.

You mentioned that before, but I'm not sure I agree. RLE is pretty much the /only/ one of your ideas that I didn't go with. You see, I take the view that hard disk space is plentiful, but RAM is not. With that perspective, compressing on disk, but decompressing into RAM, is /not/ a good thing to do. You might as well load it into RAM in the uncompressed state in the first place.

> I'd love to take a look at the "robot" code. I have some things in mind that might improve on both speed and size. I'd like to see how easy it would be to integrate them into your current system.

I thought you might. Rest assured, you will be the /first/ person to get write access. I'll probably need to start a new project for it though. The codebuilder itself doesn't REALLY belong in Deimos, as it's not general purpose.

> We're on the same page here. Both sides need to be optimized but you have to find a good balance.

The codebuilder is a good way to get that balance. Change a few constants, run it again and new source gets written reflecting the new balance. But fine tuning it is probably more your area of expertise. You seem to know more about this sort of stuff than I, anyway.

Jill
Jun 28 2004
Arcane Jill wrote:
> In article <cbosl4$f9d$1 digitaldaemon.com>, Hauke Duden says...
> > Not necessarily bigger. If you add RLE compression it can even get smaller than what you have now.
> You mentioned that before, but I'm not sure I agree. RLE is pretty much the /only/ one of your ideas that I didn't go with. You see, I take the view that hard disk space is plentiful, but RAM is not. With that perspective, compressing on disk, but decompressing into RAM, is /not/ a good thing to do. You might as well load it into RAM in the uncompressed state in the first place.

Heh. Sorry if I seem like I want to push my ideas onto you. I usually write these messages in a hurry and I'm often not sure if I have mentioned something before ;)

Regarding the size problem: for me the trade-off is not so much disk space against RAM usage but executable size against RAM usage. My concern is that if the executables get too big then people might not want to use the unicode functions for some applications (and fall back to ASCII instead). For example, good Setup software adds as little overhead as possible to the installed data. If all the Unicode stuff together amounts to 100K then that may already be too much.

You should also keep in mind that the executable is held in RAM as well, so increasing executable size to save RAM does not always give you an advantage.

Please also let me emphasize that I'm not advocating holding completely uncompressed tables in RAM. On the contrary: I think the layout you currently have is good for the uncompressed version. What I mean is not storing this data directly in the executable, but storing an RLE compressed version and unpacking it into the current form at runtime.

The RLE'ed version should be quite a bit smaller (it reduces the mapping data in unichar to about 1/4th). So RAM usage would rise to around 125% of what you have now (assuming that the other data packs similarly well - the 25% increase comes from the second, RLE compressed copy of the data in the executable). But executable size goes down to 25%. I think that's worth it.

Also keep in mind that we're talking about kilobyte sizes here. 500 KB of RAM is not much nowadays (my estimate for an application that uses just about everything), but downloading 500 KB more from the internet is very noticeable for modem users.

> > I'd love to take a look at the "robot" code. I have some things in mind that might improve on both speed and size. I'd like to see how easy it would be to integrate them into your current system.
> I thought you might. Rest assured, you will be the /first/ person to get write access. I'll probably need to start a new project for it though. The codebuilder itself doesn't REALLY belong in Deimos, as it's not general purpose.

Take your time - I'm curious, but I don't have much free time to spend on this anyway. I can wait :).

Hauke
Jun 28 2004
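The scheme Hauke proposes (ship the table run-length encoded, unpack it once at startup) can be sketched as below. This is a minimal illustration, not the encoding used in unichar; the `[value, run_length]` pair format, the function names, and the toy 256-entry table are assumptions.

```python
def rle_encode(values):
    """Collapse runs of equal values into [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


def rle_decode(runs):
    """Expand [value, run_length] pairs back into the flat table."""
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out


# Toy page: zero everywhere except a-z, which carry the delta -32.
TABLE = [0] * 97 + [-32] * 26 + [0] * 133
RUNS = rle_encode(TABLE)
```

Here 256 entries collapse to three runs. Real Unicode property data is less regular, so the ratio of about one quarter quoted in the post is Hauke's measurement for unichar and not something this toy example demonstrates.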
In article <cbpp7k$1pgi$1 digitaldaemon.com>, Hauke Duden says...
> You should also keep in mind that the executable is held in RAM as well,

As well as what? The tables are directly contained in the RAM image of the executable. They are not duplicated or otherwise reconstructed. They are accessed in-place.

> so increasing executable size to save RAM does not always give you an advantage.

Curiously, you seem to be arguing in favor of my position. Had we used RLE decompression, THEN we'd have to worry about the "as well".

Jill
Jun 29 2004
Arcane Jill wrote:
> In article <cbpp7k$1pgi$1 digitaldaemon.com>, Hauke Duden says...
> > You should also keep in mind that the executable is held in RAM as well,
> As well as what?

As well as the data you store in explicitly allocated memory.

> The tables are directly contained in the RAM image of the executable. They are not duplicated or otherwise reconstructed. They are accessed in-place.

Yes, but they ARE in RAM. My point was that you don't save RAM if you put the data in the executable instead of an explicitly allocated memory block.

> > so increasing executable size to save RAM does not always give you an advantage.
> Curiously, you seem to be arguing in favor of my position. Had we used RLE decompression, THEN we'd have to worry about the "as well".

I don't think I understand what you mean. If I understood your last post correctly you didn't want to use RLE compression because disk space is cheap, but RAM is not. I am arguing that:

- executable size (=disk space) is more expensive than RAM if the file is downloaded from the internet
- RAM usage is increased only slightly (my rough estimate was 125% of the original space) but executable size is reduced significantly (down to 25%).

In an age where many programs are downloaded from the internet that is worth thinking about.

Hauke
Jun 29 2004
In article <cbrvnp$1uf8$1 digitaldaemon.com>, Hauke Duden says...
> > > You should also keep in mind that the executable is held in RAM as well,
> > As well as what?
> As well as the data you store in explicitly allocated memory.

That would be zero.

> > The tables are directly contained in the RAM image of the executable. They are not duplicated or otherwise reconstructed. They are accessed in-place.
> Yes, but they ARE in RAM. My point was that you don't save RAM if you put the data in the executable instead of an explicitly allocated memory block.

But clearly you do. If the compressed size is X, and the uncompressed size is Y, then storing the uncompressed table in the executable costs Y bytes of RAM. Decompressing at runtime costs (X+Y) bytes of RAM, since you can't un-allocate the X. Since X is not negative, it follows that Y will always be less than (X+Y).

> I don't think I understand what you mean. If I understood your last post correctly you didn't want to use RLE compression because disk space is cheap, but RAM is not. I am arguing that:
> - executable size (=disk space) is more expensive than RAM if the file is downloaded from the internet

That's what zip files are for. Besides which, I don't think my obj files actually do contain large arrays full of zeroes. Such zero blocks will all have been removed and replaced by null pointer returns. Or were you arguing that zero-blocks should be re-inserted?

> - RAM usage is increased only slightly (my rough estimate was 125% of the original space) but executable size is reduced significantly (down to 25%).

I'm not arguing with that. I'm arguing in favor of not increasing RAM usage *AT ALL*. Like, not even slightly.

> In an age where many programs are downloaded from the internet that is worth thinking about.

Zip files use much better compression than simple RLE. I say zip 'em.

Jill
Jun 29 2004
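The two positions can be put side by side with a little arithmetic. The sketch below takes Jill's X (compressed size) and Y (uncompressed size) and Hauke's estimate that X is about a quarter of Y; it assumes, as both posters do at this point, that the whole executable image is resident in RAM. The function name and the 400/100 figures are illustrative only.

```python
def footprint(y_uncompressed, x_compressed):
    """Executable and RAM cost of each layout, in the unit of the inputs."""
    return {
        # Table compiled into the executable and read in place.
        "static": {"exe": y_uncompressed, "ram": y_uncompressed},
        # Compressed copy in the executable, unpacked copy on the heap.
        "rle": {"exe": x_compressed, "ram": x_compressed + y_uncompressed},
    }


costs = footprint(y_uncompressed=400, x_compressed=100)
ram_ratio = costs["rle"]["ram"] / costs["static"]["ram"]
exe_ratio = costs["rle"]["exe"] / costs["static"]["exe"]
```

With these inputs `ram_ratio` is 1.25 and `exe_ratio` is 0.25, which are Hauke's 125% and 25%. Jill's point is the same table read the other way: the RLE row always uses X more RAM than the static row.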
Arcane Jill wrote:
> In article <cbrvnp$1uf8$1 digitaldaemon.com>, Hauke Duden says...
> > > > You should also keep in mind that the executable is held in RAM as well,
> > > As well as what?
> > As well as the data you store in explicitly allocated memory.
> That would be zero.
> > > The tables are directly contained in the RAM image of the executable. They are not duplicated or otherwise reconstructed. They are accessed in-place.
> > Yes, but they ARE in RAM. My point was that you don't save RAM if you put the data in the executable instead of an explicitly allocated memory block.
> But clearly you do. If the compressed size is X, and the uncompressed size is Y, then storing the uncompressed table in the executable costs Y bytes of RAM. Decompressing at runtime costs (X+Y) bytes of RAM, since you can't un-allocate the X. Since X is not negative, it follows that Y will always be less than (X+Y).

In this particular case, yes, as I stated in my post. I just wanted to emphasize that moving data into statically compiled arrays (as opposed to dynamic ones) doesn't automatically reduce RAM usage.

> > I don't think I understand what you mean. If I understood your last post correctly you didn't want to use RLE compression because disk space is cheap, but RAM is not. I am arguing that:
> > - executable size (=disk space) is more expensive than RAM if the file is downloaded from the internet
> That's what zip files are for.

What about installers and self-extractors? You do not ZIP those because they ARE the ZIP file (in a manner of speaking). I'd like to be able to write such applications in D.

Besides, half of my reasoning is that even if D executables could be compressed better than C++ ones, many people would still compare their size in uncompressed form. A similar thing happened to C++: C++ executables usually compress better than C ones (templates and exception handling create lots of similar code), yet C++ is often said to be the "bloat king" among languages.

I just don't want people to shun the Unicode routines because of the size difference, even if it may not have such a big impact on the end result as they might think.

> Besides which, I don't think my obj files actually do contain large arrays full of zeroes. Such zero blocks will all have been removed and replaced by null pointer returns. Or were you arguing that zero-blocks should be re-inserted?

RLE doesn't just pack zero arrays. Unicode contains lots of ranges with the same values.

> > - RAM usage is increased only slightly (my rough estimate was 125% of the original space) but executable size is reduced significantly (down to 25%).
> I'm not arguing with that. I'm arguing in favor of not increasing RAM usage *AT ALL*. Like, not even slightly.

As I said, I think 100 KB of extra RAM usage is a lot better than 400 KB of increased executable size. Especially for a garbage collected language that will always use more RAM than strictly necessary.

Hauke
Jun 29 2004
"Hauke Duden" <H.NS.Duden gmx.net> wrote in message news:cbrvnp$1uf8
> > The tables are directly contained in the RAM image of the executable. They are not duplicated or otherwise reconstructed. They are accessed in-place.
> Yes, but they ARE in RAM. My point was that you don't save RAM if you put the data in the executable instead of an explicitly allocated memory block.

Are you sure about that? I would expect individual pages to be loaded on demand.

Regards,
Martin
Jun 29 2004
Martin M. Pedersen wrote:
> "Hauke Duden" <H.NS.Duden gmx.net> wrote in message news:cbrvnp$1uf8
> > > The tables are directly contained in the RAM image of the executable. They are not duplicated or otherwise reconstructed. They are accessed in-place.
> > Yes, but they ARE in RAM. My point was that you don't save RAM if you put the data in the executable instead of an explicitly allocated memory block.
> Are you sure about that? I would expect individual pages to be loaded on demand.

I'm pretty sure about it, but not 100% sure. I have frequently observed in the past that RAM usage increases by the size of a DLL as soon as it is loaded.

Hauke
Jun 30 2004
In article <cbsj6s$2s9m$1 digitaldaemon.com>, Martin M. Pedersen says...
> "Hauke Duden" <H.NS.Duden gmx.net> wrote in message news:cbrvnp$1uf8
> > > The tables are directly contained in the RAM image of the executable. They are not duplicated or otherwise reconstructed. They are accessed in-place.
> > Yes, but they ARE in RAM. My point was that you don't save RAM if you put the data in the executable instead of an explicitly allocated memory block.
> Are you sure about that? I would expect individual pages to be loaded on demand. Regards, Martin

Once upon a time, I did just that. I had to write some Unicode stuff for my employer a few years back, and I adopted exactly that "load on demand" approach. It is with this hindsight that I now believe it to have been a bad idea (although with some modification, it might be a good idea in a DLL).

The basic load-on-demand approach is this: The user calls something like isUppercase(c). The library evaluates (c >> N) (for some N) to get what we might loosely call a "page" number. Then it says to itself, "Is this page cached?". If so, use the in-RAM table to look up the answer; if not, load the page from disk, decompress it into RAM, cache it so we don't have to do all that again, and THEN look up the value. Plus, you'd have to do this every time you re-ran your application, which might matter for some very small applications.

At the time, it seemed like this was quite a promising approach, but it had a huge number of drawbacks. For one thing, neither C++ nor D has any concept of resource files (essentially a Java concept), so, in order to load anything off disk, YOU FIRST HAVE TO FIND IT. This means either a DLL (per "page"? per property?), or you have to read an environment variable to tell you where to look. Requiring users to set an environment variable just to get isUppercase() working is not desirable. For another thing, the extra code you have to go through at runtime to answer the question "is it cached?" is itself a few extra cycles.

I took a different approach this time round. In this new approach, there are two important principles:

(1) The source code shall be written by robot. This protects us against Unicode itself being updated (and it is /constantly/ being updated), and it also allows for some SERIOUS optimization, because of course a robot can try many, many different optimization strategies and, using sheer brute force, pick the best.

(2) Each property shall be in its own object module, so that you do not link in those properties which you do not need. This is load-on-demand in a sense, but it's COMPILE-TIME load-on-demand, which is better (in my opinion) and the split is in a different direction (property-required rather than codepoint-range).

The new approach makes complete sense when you realize just how small things get. That isUppercase() function generates a linkable object module which is a measly 3420 /bytes/ in size, and does not depend on (pull in) anything else. I mean - come on guys - just 3K! Are we REALLY saying that's too much? Plus, as a bonus, you get isUppercase() data for all characters, so there is no bias toward any particular subrange.

And in any case, I didn't get that 3K figure by measuring the size of the generated tables, I got it by compiling an object module and typing dir at the command line to see how big it ended up. For reference, a do-nothing function measured by the same standard comes out at 232 bytes. However, this measure overestimates, because, even if it isn't inlined, there are some linker symbols in there that will be discarded when constructing an executable. So even these small estimates are overestimates. I think you would be hard-pressed to do better than my robot.

The good news, however, is that nobody has to agree with me. (I guess it was too much to hope for that everyone would). Because, pretty soon, I'm going to make the codebuilder robot open source, once I've added a few more tweaks and got a few other things sorted out. That means that if anyone can come up with a better strategy than I, they would be perfectly welcome to take a branch of the codebuilder source tree and modify it to do something else. Then we could run various efficiency tests to compare all versions. You could do this to see if your load-on-demand-by-codepoint-range idea was feasible; Hauke could do it to see if his RLE encoding idea works out better than what I've done, and without doubt, the most efficient one is the one we'll keep (though I'm not sure how you define efficient).

In any case, I'm putting my next efforts into (a) fixing some bugs in Int, and (b) making Deimos easy to download and use. This sort of feature modification you suggest is not on my agenda in the near future, because, although there ARE a few more functions I need to add to etc.unicode, I've basically achieved what I set out to achieve, and I'm happy with it, and pretty soon I'm going to be keen to get back to my crypto stuff.

I hope that helps.

Arcane Jill
Jun 30 2004
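The application-level load-on-demand scheme Jill describes having tried earlier (compute `c >> N`, ask whether that page is cached, load and decompress it if not, then look up) can be sketched like this. It is not her employer's code: the class name, the page size, and the in-memory `fake_loader` that stands in for "find the file on disk and decompress it" are all assumptions.

```python
PAGE_BITS = 8                          # the "N" in (c >> N); assumed value
PAGE_MASK = (1 << PAGE_BITS) - 1


class PagedProperty:
    """A Unicode property whose pages are fetched on first use."""

    def __init__(self, load_page):
        self._load_page = load_page    # hypothetical loader: page number -> values
        self._cache = {}
        self.loads = 0                 # counts slow-path loads, for illustration

    def get(self, c):
        page_no = c >> PAGE_BITS
        page = self._cache.get(page_no)    # the "is it cached?" test, paid on every call
        if page is None:
            page = self._load_page(page_no)
            self._cache[page_no] = page
            self.loads += 1
        return page[c & PAGE_MASK]


def fake_loader(page_no):
    """Stand-in for loading a page from disk: marks only A-Z as uppercase."""
    base = page_no << PAGE_BITS
    return [ord("A") <= base + i <= ord("Z") for i in range(1 << PAGE_BITS)]


is_uppercase = PagedProperty(fake_loader)
```

The sketch makes both drawbacks from the post visible: every call pays for the cache test, and a real `load_page` would first have to locate its data file.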
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cbu6ru$2nm4$1 digitaldaemon.com...
> > Are you sure about that? I would expect individual pages to be loaded on demand.
> Once upon a time, I did just that. I had to write some Unicode stuff for my employer a few years back, and I adopted exactly that "load on demand" approach. It is with this hindsight that I now believe it to have been a bad idea (although with some modification, it might be a good idea in a DLL).

What I meant was load-on-demand implemented by the operating system. I think you have done a great job, and also in this respect made the right decision :-)

When a modern operating system starts executing a program, it does not actually load the program into RAM. Instead, it sets up page tables, and uses the page-fault mechanism of the CPU to implement load-on-demand for the code segment. For example, when the very first instruction of the program is to be executed, it generates a page fault, and it is at this time that the very first page is loaded into RAM. At least, this is my understanding.

Static, constant tables like the ones in the Unicode library would be - or can be - embraced by the same mechanism, meaning that you only pay for what you use. A decompression scheme at application level would mean that you would always pay for everything. It is better to leave that kind of stuff to the operating system, and mechanisms for doing that have existed for a long time, although they might not be universally available. NTFS has built-in compression (LZNT1, an LZ77 variant), and it has been a long time since "Stacker" was invented. For distribution, we have de-facto standards such as zip.

Regards,
Martin
Jun 30 2004
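The operating-system mechanism Martin refers to can be observed from user code through an explicit file mapping, which is paged in on access in broadly the same way as an executable's read-only data. The sketch below is only an analogy for that behaviour, not a description of how any particular loader treats the Unicode tables; the temporary file, its 1 MiB size, and the toy contents are assumptions.

```python
import mmap
import os
import tempfile


def open_table(path):
    """Map a table file read-only; the OS pages it in as it is touched."""
    with open(path, "rb") as f:
        # The mapping stays valid after the file object is closed.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def demo():
    # Write a 1 MiB table whose byte at offset i is i % 251 (arbitrary toy data).
    data = bytes(i % 251 for i in range(1 << 20))
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        table = open_table(path)
        try:
            # Reading one entry touches one page; nothing is decompressed
            # or copied by the application itself.
            return table[12345], len(table)
        finally:
            table.close()
    finally:
        os.remove(path)
```

Whether untouched pages really stay out of RAM depends on the operating system and its read-ahead policy, which is the uncertainty both Martin and Hauke acknowledge.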
Hi, this is really impressive! (Okay, so I'm only using isWhiteSpace and simple folding/casing atm, I'll learn what the others do later ;-) The unicode stuff isn't in the library in subversion AFAICS though. (Actually, I can't get subversion to check out properly, I've been using the HTTP gateway. I tried http://svn.dsource.org/svn/projects/deimos as the location in TortoiseSVN, does that look right?) Sam
Jun 30 2004
In article <cbu5r3$2kpj$1 digitaldaemon.com>, Sam McCall says...
> Hi, this is really impressive! (Okay, so I'm only using isWhiteSpace and simple folding/casing atm, I'll learn what the others do later ;-) The unicode stuff isn't in the library in subversion AFAICS though.

It is, kind of. At least the source code is there, at http://svn.dsource.org/svn/projects/deimos/trunk/etc/unicode/. But what we really NEED is a downloadable pre-built library. That's the part that's currently missing.

> (Actually, I can't get subversion to check out properly, I've been using the HTTP gateway. I tried http://svn.dsource.org/svn/projects/deimos as the location in TortoiseSVN, does that look right?) Sam

Er - I hope someone else can answer that...?

Jill
Jun 30 2004
Arcane Jill wrote:
> In article <cbu5r3$2kpj$1 digitaldaemon.com>, Sam McCall says...
> > Hi, this is really impressive! (Okay, so I'm only using isWhiteSpace and simple folding/casing atm, I'll learn what the others do later ;-) The unicode stuff isn't in the library in subversion AFAICS though.
> It is, kind of. At least the source code is there, at http://svn.dsource.org/svn/projects/deimos/trunk/etc/unicode/. But what we really NEED is a downloadable pre-built library. That's the part that's currently missing.
> > (Actually, I can't get subversion to check out properly, I've been using the HTTP gateway. I tried http://svn.dsource.org/svn/projects/deimos as the location in TortoiseSVN, does that look right?) Sam
> Er - I hope someone else can answer that...? Jill

Sorry, yeah, that's what I meant. I built one here with (from trunk/)

    for /R etc %f in (*.d) do dmd -c -release %f
    lib -c deimos.lib age.obj
    for %f in (*.obj) do lib deimos.lib %f

(I'm not at all familiar with this stuff, so this may well be the Wrong Way. In particular the age.obj thing is a hack, because I can't seem to get the lib tool to add an obj while creating the library if it doesn't exist.)

Then I hit the problem that I couldn't use most of the functions, due to unknown symbol errors. Sure enough, most of the functions weren't in the library. I changed the ones I wanted to use to "public" in the source, and that worked. I'm not sure if this is the right fix.

Sam
Jun 30 2004