digitalmars.D - upper case
- FLorian Rivoal, Jul 12 2004
- Arcane Jill, Jul 13 2004
- Blandger, Jul 13 2004
- Hauke Duden, Jul 13 2004
- Arcane Jill, Jul 13 2004
- Hauke Duden, Jul 13 2004
- Blandger, Jul 13 2004
- Arcane Jill, Jul 13 2004
- Arcane Jill, Jul 13 2004
- Blandger, Jul 13 2004
- Walter, Jul 13 2004
- Blandger, Jul 13 2004
- Thomas Kuehne, Jul 13 2004
- Arcane Jill, Jul 14 2004
- Blandger, Jul 14 2004
- Arcane Jill, Jul 14 2004
- Roberto Mariottini, Jul 13 2004
- Arcane Jill, Jul 14 2004
- Roberto Mariottini, Jul 14 2004
- Arcane Jill, Jul 14 2004
- Roberto Mariottini, Jul 14 2004
- Arcane Jill, Jul 13 2004
- Blandger, Jul 13 2004
- Walter, Jul 13 2004
Overall, D fully integrates Unicode strings, in data structures as well as in the various functions provided. But there seem to be some little things forgotten on the way in std.string: everything concerning upper-case and lower-case characters only processes non-accented roman letters. This is the behaviour I would expect for functions processing ANSI strings, but since D strings encode Unicode characters, it might be a good idea to extend their behaviour to other characters like accented roman letters, Cyrillic letters, and so on... those also have upper-case and lower-case forms.

For the sake of efficiency, clarity or something, maybe those could be supplied as separate functions. Maybe not. But anyway, I think this would have its place in std.string. Otherwise, include something like "assert(language is english);" in the preconditions of the functions ;)

Of course, this is not difficult for the programmer who needs it to implement. But neither would be the current version, which processes only non-accented roman letters. So if it is considered worth including for this case, why not for the other?
Jul 12 2004
In article <ccve36$r2o$1 digitaldaemon.com>, FLorian Rivoal says...
> But there seem to be some little things forgotten on the way in std.string: everything concerning upper-case and lower-case characters only processes non-accented roman letters. [...]

Panicke ye not. The full Unicode casing algorithms are on their way, complete with locale-sensitivity as required by Turkish, Azeri and Lithuanian, and context-sensitivity as required by Greek and a few others. Just wait a little bit longer.

Right now, the functions getSimpleLowercaseMapping(), getSimpleUppercaseMapping() and getSimpleTitlecaseMapping() in etc.unicode.unicode perform "Default Simple Case Mapping" as defined by the Unicode standard. "Default" means not locale-sensitive, and "Simple" means "one character at a time, as defined in UnicodeData.txt". They perform case mappings on a character-by-character basis, and work for ALL languages (except Turkish, Azeri and Lithuanian, which will have to wait for the next version).

The forthcoming version will do everything. Including casefolding and normalization. It's a few weeks away, unfortunately, so be patient.

It would not have been possible for std.string to do all that you require, because a Unicode casing algorithm cannot possibly work unless it can first access all the Unicode properties. std.string does not have that advantage - hence etc.unicode.unicode. One day in the future, it is my hope that all of this will be integrated into Phobos.

Arcane Jill

Oh - PS - must apologize. A pre-linked downloadable version of etc.unicode.unicode is STILL not available (so it's still just source code). The reason for this was that it was my birthday last weekend, and I was partying instead of coding. Since I actually have a day job, it will have to wait until next weekend now.
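To make "one character at a time" concrete, here is a minimal sketch of a default simple uppercase mapping. It does not use etc.unicode; it assumes a present-day D compiler and Phobos, where std.uni.toUpper(dchar) and std.utf.toUTF8 are available. The function name simpleUpper is made up for this illustration.

```d
import std.uni : toUpper;
import std.utf : toUTF8;

/// Uppercases a UTF-8 string one code point at a time.
/// No locale rules (Turkish, Azeri, Lithuanian) and no context rules
/// (Greek final sigma) are applied, and no character ever expands
/// into several characters.
string simpleUpper(const(char)[] s)
{
    dstring mapped;
    // foreach with a dchar loop variable decodes the UTF-8 input,
    // so each iteration sees a whole code point, not a single byte.
    foreach (dchar c; s)
        mapped ~= toUpper(c);
    return toUTF8(mapped);
}
```

Because the mapping is per code point, accented roman and Cyrillic letters are handled, but anything that needs locale or context information is left exactly as the simple mapping table gives it.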
Jul 13 2004
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cd04a3$2280$1 digitaldaemon.com...
> The forthcoming version will do everything. Including casefolding and normalization. It's a few weeks away, unfortunately, so be patient.

Sounds great. Thank you Jill in advance.

I think D lacks a good and consistent String class like the one Java has. For example, I recently got stuck with:

    Object { ... char[] toString() ... }

but I need wchar[] at least for supporting non-ASCII languages. DMD complains about a different return type.

It seems that many good libs are about to reach their first versions very soon. I'm looking forward to the first DTL as well.

> Oh - PS - must apologize. A pre-linked downloadable version of etc.unicode.unicode is STILL not available (so it's still just source code). The reason for this was that it was my birthday last weekend,

Congratulations! It's a good reason for the rest. :))
Jul 13 2004
Blandger wrote:
> I think D lacks a good and consistent String class like the one Java has. For example, I recently got stuck with: Object { ... char[] toString() ... } but I need wchar[] at least for supporting non-ASCII languages. DMD complains about a different return type.

I'm currently working on this. A String interface that abstracts from the specific encoding + a bunch of implementations for the most common ones (UTF-8, 16, 32, system codepage, etc...). It provides some very useful (IMHO) functionality too (like "split", which is so rarely implemented in non-script languages).

It is near completion and needs only a few more hours of work on documentation and testing. I hope to find the time within the next one or two weeks.

Hauke
Jul 13 2004
In article <cd0bgb$2g5g$1 digitaldaemon.com>, Hauke Duden says...
> I'm currently working on this. A String interface that abstracts from the specific encoding + a bunch of implementations for the most common ones (UTF-8, 16, 32, system codepage, etc...). It provides some very useful (IMHO) functionality too (like "split", which is so rarely implemented in non-script languages).

Hauke, dude, did anyone ever tell you you're brilliant? Well, I'll say it anyway - you're brilliant. We need this.

I've always been annoyed that, while std.string has got some amazing functions in it, like find() and so forth, they ONLY work on chars! Huh???? I reckon that now that we have templates, find() should be made to work for ANY kind of array - no need to limit it even to strings. Same for all the other nice stringy functions.

> It is near completion and needs only a few more hours of work on documentation and testing. I hope to find the time within the next one or two weeks.

Yay. Looking forward to it.

Jill
Jul 13 2004
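The idea of a find() that works for any kind of array can be sketched with a function template. This is only an illustration written against a present-day D compiler, not the std.string function and not Hauke's mixin; the name findIndex and its return convention (-1 for "not found") are assumptions made for the example.

```d
/// Returns the index of the first occurrence of `needle` in `haystack`,
/// or -1 if it does not occur. Works for any element type that
/// supports array comparison: char, wchar, dchar, int, ...
ptrdiff_t findIndex(T)(const(T)[] haystack, const(T)[] needle)
{
    if (needle.length > haystack.length)
        return -1;
    // Try every start position at which the needle could still fit.
    foreach (i; 0 .. haystack.length - needle.length + 1)
    {
        if (haystack[i .. i + needle.length] == needle)
            return cast(ptrdiff_t) i;
    }
    return -1;
}
```

Note that on char[] and wchar[] this compares code units, so the returned index is a position in the array, not a count of characters.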
Arcane Jill wrote:
> Hauke, dude, did anyone ever tell you you're brilliant? Well, I'll say it anyway - you're brilliant. We need this.

Not recently, so thank you very much ;).

> I've always been annoyed that, while std.string has got some amazing functions in it, like find() and so forth, they ONLY work on chars! Huh???? I reckon that now that we have templates, find() should be made to work for ANY kind of array - no need to limit it even to strings. Same for all the other nice stringy functions.

Yes. I've written a mixin that contains the string algorithms and that is used in the String classes. I've also gone to some length to ensure that the character decoding stuff can be inlined into the mixed-in algorithms. So performance will (hopefully - I haven't done any tests yet) be good.

Hauke
Jul 13 2004
"Hauke Duden" <H.NS.Duden gmx.net> wrote in message news:cd0bgb$2g5g$1 digitaldaemon.com...
> I'm currently working on this. A String interface that abstracts from the specific encoding + a bunch of implementations for the most common ones (UTF-8, 16, 32, system codepage, etc...). It provides some very useful (IMHO) functionality too (like "split", which is so rarely implemented in non-script languages).

Wow! Nice to hear it. :)

> It is near completion and needs only a few more hours of work on documentation and testing. I hope to find the time within the next one or two weeks.

Good. Don't hurry too much, just make it good, consistent and handy to work with. Thanks!
Jul 13 2004
In article <cd085g$29tq$1 digitaldaemon.com>, Blandger says...
> but I need wchar[] at least for supporting non-ASCII languages.

Not true. char[] stores UTF-8, not ASCII. The whole of Unicode is available to char[] arrays. [...] is perfectly legal. (And you can use etc.unicode's getSimpleUppercaseMapping() to uppercase it too).

Arcane Jill
Jul 13 2004
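The point that char[] holds UTF-8 rather than ASCII can be checked by counting code units in each UTF form. The sketch below assumes present-day Phobos (std.utf.toUTF16 and std.utf.toUTF32); the helper name codeUnitCounts is made up here.

```d
import std.utf : toUTF16, toUTF32;

/// Returns the number of code units the same text occupies as
/// UTF-8 (char), UTF-16 (wchar) and UTF-32 (dchar), in that order.
size_t[3] codeUnitCounts(const(char)[] s)
{
    // s.length counts UTF-8 code units (bytes), not characters.
    return [s.length, toUTF16(s).length, toUTF32(s).length];
}
```

For a string made of 'a', U+20AC (euro sign) and U+1D11E (musical G clef), written with escapes as "a\u20AC\U0001D11E", the three counts differ: the characters take 1, 3 and 4 bytes in UTF-8, 1, 1 and 2 units in UTF-16, and one unit each in UTF-32. All three forms encode the same text, so nothing is lost by staying in char[].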
In article <cd0jdn$2sru$1 digitaldaemon.com>, Arcane Jill says...
> Not true. char[] stores UTF-8, not ASCII. The whole of Unicode is available to char[] arrays. [...] is perfectly legal.

Okay, so it doesn't come out right on this forum! But it will work in D source.

Arcane Jill
Jul 13 2004
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cd0jdn$2sru$1 digitaldaemon.com...
> Not true. char[] stores UTF-8, not ASCII. The whole of Unicode is available to char[] arrays. [...] is perfectly legal. (And you can use etc.unicode's getSimpleUppercaseMapping() to uppercase it too).

Thanks for the addition. You are right, it's legal, but it looks (and I think works) ugly.

It seems to me there is no 'normal way' to work with upper/lowercase, sort, search, collate, replace, and code page stuff for non-ASCII letters within Phobos in this case. Or have I missed something??
Jul 13 2004
"Blandger" <zeroman prominvest.com.ua> wrote in message news:cd0lhh$30mc$1 digitaldaemon.com...
> Thanks for the addition. You are right, it's legal, but it looks (and I think works) ugly. It seems to me there is no 'normal way' to work with upper/lowercase, sort, search, collate, replace, and code page stuff for non-ASCII letters within Phobos in this case. Or have I missed something??

It looks ugly because it's written with Unicode code numbers rather than the actual characters. If you write your source code using an editor that supports UTF-8, UTF-16, or UTF-32 you can write it using the actual characters. The D compiler can handle UTF-8, UTF-16, or UTF-32 source text.
Jul 13 2004
"Walter" <newshound digitalmars.com> wrote in message news:cd17f0$115j$2 digitaldaemon.com...
> It looks ugly because it's written with Unicode code numbers rather than the actual characters. If you write your source code using an editor that supports UTF-8, UTF-16, or UTF-32 you can write it using the actual characters. The D compiler can handle UTF-8, UTF-16, or UTF-32 source text.

I keep catching myself thinking that I'm afraid to write code using UTF editors. Actually I don't know why! Maybe it's an old, outdated habit, maybe it's something like an 'internal fear' of UTF-x stuff. I really don't know why.

So I decided to ask: how many people in the NG use UTF-x editors for coding their sources??
Jul 13 2004
Blandger wrote:
> So I decided to ask: how many people in the NG use UTF-x editors for coding their sources??

Hasn't this been the standard for several years now - at least in the Perl and Java world?

Thomas
Jul 13 2004
In article <cd1fmv$1fqa$3 digitaldaemon.com>, Blandger says...
> So I decided to ask: how many people in the NG use UTF-x editors for coding their sources??

I wasn't aware that there were still any _non_ UTF-XX editors in use! Even Microsoft Notepad - the bottom end of text editors if you're a programmer (no syntax highlighting, etc.) - understands UTF-8. These days, what text editors don't?

Me, I use TextPad. TextPad is not fully Unicode-aware (yet), but it CAN save files in UTF-8 format, which is all I need.

Arcane Jill
Jul 14 2004
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cd2lsj$hsu$1 digitaldaemon.com...
> Me, I use TextPad. TextPad is not fully Unicode-aware (yet), but it CAN save files in UTF-8 format, which is all I need.

Actually I'd like to ask: how many people at present use 'Unicode editors' for their project's sources on a regular basis, rather than occasionally? It seems to me it happens very rarely (if ever) and it's not a 'strict rule' in companies/projects. So I wonder why that is, if Unicode is so wonderful?
Jul 14 2004
In article <cd2vp0$164f$1 digitaldaemon.com>, Blandger says...
> Actually I'd like to ask: how many people at present use 'Unicode editors' for their project's sources on a regular basis, rather than occasionally? It seems to me it happens very rarely (if ever) and it's not a 'strict rule' in companies/projects. So I wonder why that is, if Unicode is so wonderful?

And I say again, almost ALL text editors these days can save in UTF. In fact, I'm not even sure I can name one that doesn't. On that basis, then, the probable answer is almost everyone (although they may not consciously be aware of it).

Arcane Jill
Jul 14 2004
In article <cd17f0$115j$2 digitaldaemon.com>, Walter says...
[...]
> If you write your source code using an editor that supports UTF-8, UTF-16, or UTF-32 you can write it using the actual characters. The D compiler can handle UTF-8, UTF-16, or UTF-32 source text.

This leads to some questions:

How can it detect the right coding?
Does endianness matter?
And what about my current default codepage (windows-1252)?
If I pass an HTML file as source, does it honor the encoding specified in the header?

Ciao
Jul 13 2004
In article <cd2ksg$fng$1 digitaldaemon.com>, Roberto Mariottini says...
> This leads to some questions:
> How can it detect the right coding?

UTF-8, UTF-16BE, UTF-16LE, UTF-32BE and UTF-32LE are very easy to tell apart, either with or without a BOM (a BOM is a special prefix). It cannot, however, distinguish the above from any OTHER encoding.

> Does endianness matter?

With the UTF family, no. As I said, they are easy to tell apart.

> And what about my current default codepage (windows-1252)?

D is designed with a global philosophy, so it will ignore your default codepage, and signal an error if you rely upon it. This is a good thing, because in D (unlike C/C++), the same source file will compile identically on all machines. Consider the following fragment of C++:

[...]

(assuming the existence of a C++ toUTF16() function). Even in Western Europe and America, if you run that on Linux (where the default encoding is ISO-8859-1) you'll end up with s containing U+0080, but if you run it on Windows (where the default encoding is WINDOWS-1252) you'll end up with s containing U+20AC. Outside of Western Europe and America, the situation would be decidedly worse. D, on the other hand, will produce a consistent binary for the same source, no matter where you live or what your encoding is. In other words, the short answer to your question:

> And what about my current default codepage (windows-1252)?

is, if you're using D, forget it.

> If I pass an HTML file as source, does it honor the encoding specified in the header?

No. It can't, because DMD doesn't come armed with hundreds of different decoders.

Arcane Jill
Jul 14 2004
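How a tool can tell the UTF encodings apart from a BOM can be sketched as follows. This is an illustration only, not DMD's actual lexer code: the function name encodingFromBOM is hypothetical, it only inspects the byte order mark, and it falls back to UTF-8 when no mark is present.

```d
/// Guesses the UTF encoding of a file from its byte order mark (BOM).
/// Returns "UTF-8" when no BOM is found.
string encodingFromBOM(const(ubyte)[] data)
{
    static immutable ubyte[] bomUTF32BE = [0x00, 0x00, 0xFE, 0xFF];
    static immutable ubyte[] bomUTF32LE = [0xFF, 0xFE, 0x00, 0x00];
    static immutable ubyte[] bomUTF8    = [0xEF, 0xBB, 0xBF];
    static immutable ubyte[] bomUTF16BE = [0xFE, 0xFF];
    static immutable ubyte[] bomUTF16LE = [0xFF, 0xFE];

    static bool startsWithBytes(const(ubyte)[] d, const(ubyte)[] bom)
    {
        return d.length >= bom.length && d[0 .. bom.length] == bom;
    }

    // The UTF-32LE mark begins with the UTF-16LE mark (FF FE),
    // so the 4-byte marks must be tested before the 2-byte ones.
    if (startsWithBytes(data, bomUTF32BE)) return "UTF-32BE";
    if (startsWithBytes(data, bomUTF32LE)) return "UTF-32LE";
    if (startsWithBytes(data, bomUTF8))    return "UTF-8";
    if (startsWithBytes(data, bomUTF16BE)) return "UTF-16BE";
    if (startsWithBytes(data, bomUTF16LE)) return "UTF-16LE";
    return "UTF-8"; // no BOM: assume UTF-8, which also covers plain ASCII
}
```

A file in windows-1252 or ISO-8859-1 carries no such mark, which is why a check like this cannot recognise those encodings at all.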
In article <cd2p4m$p0c$1 digitaldaemon.com>, Arcane Jill says...
[...]
> > And what about my current default codepage (windows-1252)?
> is, if you're using D, forget it.

Thanks for the answer. I should have RTFM before asking, though. In http://www.digitalmars.com/d/lex.html it is stated that D supports only ASCII and UTF-*; if there isn't a BOM at the beginning then UTF-8 is assumed (so ASCII is safe too).

> > If I pass an HTML file as source, does it honor the encoding specified in the header?
> No. It can't, because DMD doesn't come armed with hundreds of different decoders.

Well, do you know any translator from 1252 to UTF-8?

Ciao
Jul 14 2004
In article <cd372c$1i77$1 digitaldaemon.com>, Roberto Mariottini says...
> Well, do you know any translator from 1252 to UTF-8?

How about I just make one up right now:

[...]

Arcane Jill
Jul 14 2004
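The translator code itself is not preserved in this archive. As a stand-in, here is a minimal sketch of a windows-1252 to UTF-8 conversion, written for a present-day D compiler and Phobos (it uses std.utf.encode). The names cp1252High and windows1252ToUTF8 are made up for the example, and mapping the five bytes that windows-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) to the C1 control codes of the same value is an assumption, not something stated in the thread.

```d
import std.utf : encode;

// windows-1252 agrees with ISO-8859-1 (and so with Unicode U+0000..U+00FF)
// everywhere except the range 0x80..0x9F, which this table covers.
immutable dchar[32] cp1252High = [
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
];

/// Converts windows-1252 bytes to a UTF-8 string.
string windows1252ToUTF8(const(ubyte)[] input)
{
    char[] output;
    char[4] buf;
    foreach (b; input)
    {
        // Outside 0x80..0x9F the byte value is already the code point.
        dchar c = cast(dchar) b;
        if (b >= 0x80 && b <= 0x9F)
            c = cp1252High[b - 0x80];
        // encode() writes 1 to 4 UTF-8 code units and returns how many.
        immutable n = encode(buf, c);
        output ~= buf[0 .. n];
    }
    return output.idup;
}
```

With this table, byte 0x80 becomes U+20AC (the euro sign), while a plain ISO-8859-1 reading of the same byte would give U+0080, which is the discrepancy described earlier in the thread.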
I've played a little with this, but I can't seem to find a suitable solution. Attached is Jill's code modified to make a filter program.

My test program is this:

    import std.c.stdio;
    import std.utf;

    int main(char[][] args)
    {
        int perché;
        printf("Perché\n");
        return 0;
    }

Obviously, if I compile it in its original encoding (Windows 1252) I get an error:

    test.d(6): invalid UTF-8 sequence
    test.d(6): invalid UTF-8 sequence
    test.d(6): unsupported char 0xe9

So I translate it to UTF-8, using:

    w2u.exe test.d > test2.d

This newly encoded file compiles without errors, but the printf output is scrambled by the conversion: two characters are printed instead of the special one. In fact the special character is translated into a two-byte UTF-8 sequence by the filter, and printf doesn't recognize UTF-8 encoded strings. So I changed it to use wprintf: [...]
Jul 14 2004
In article <cd0lhh$30mc$1 digitaldaemon.com>, Blandger says...
> You are right, it's legal, but it looks (and I think works) ugly.

Errm. That was an artifact of this forum's web interface. When I typed it in, it looked to me like a nice bunch of Russian and Chinese characters with a few Runes and Dingbats thrown in. It would look like that in my text editor too. And it would work. Alas, the HTML capacities of the D forum web site were not up to the job, so you didn't see what I intended for you to see.

Apparently you have to be a virgin to see unicode. :) Something like that anyway. Walter says Unicode is the future. I think he's right, but unfortunately it isn't the present.

> It seems to me there is no 'normal way' to work with upper/lowercase, sort, search, collate, replace, and code page stuff for non-ASCII letters within Phobos in this case. Or have I missed something??

Right now, no. But you can use the getSimpleUppercaseMapping() etc. functions from Deimos to do casing. Lexicographical sort isn't a problem, obviously. Search - depends what you mean. If you're waiting for the Unicode regular expression engine, you'll have to wait a while - that will be one of the last things we get. If you want an exact match though, that's pretty easy right now - a string is just an array, after all. Collation will be available (but isn't yet) via the Unicode Collation Algorithm - for which we'll have to download the CLDR (Common Locale Data Repository) from Unicode to get all the locale-specific weightings, but that will come.

"Code pages", note, have nothing to do with Unicode. That comes into play in our sphere during transcoding (encoding/decoding), which is something that I imagine will ultimately be built into streams.

Much of Phobos was written in the early days of D, when there was no access to Unicode property data. It takes time to organize a proper Unicode library. Unicode has layers of features, with each algorithm relying on the services of the next layer down. Phobos had access to none of this when it was written. Even now, Deimos's Unicode support is still only at the character level, but we'll get to the string level eventually.

But all this will come. And I strongly suspect that D's Unicode support will eventually make it the language of choice for Unicode projects.

Arcane Jill
Jul 13 2004
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cd18gg$135d$1 digitaldaemon.com...
> it would work. Alas, the HTML capacities of the D forum web site were not up to the job, so you didn't see what I intended for you to see.

I see. :)

> Apparently you have to be a virgin to see unicode. :) Something like that anyway. Walter says Unicode is the future. I think he's right, but unfortunately it isn't the present.

Agree with you both.

> "Code pages", note, have nothing to do with Unicode. That comes into play in our sphere during transcoding (encoding/decoding), which is something that I imagine will ultimately be built into streams.

Exactly. I meant that I don't want to think about code pages when I use something like a 'String class' in D code, because it should be 'internally Unicode' as it is in Java. But I have to think about code pages for I/O, because there are lots of 'old files' with 'old non-Unicode' content.

> Even now, Deimos's Unicode support is still only at the character level, but we'll get to the string level eventually.
> But all this will come. And I strongly suspect that D's Unicode support will eventually make it the language of choice for Unicode projects.

Hope so. :)
Jul 13 2004
"Blandger" <zeroman prominvest.com.ua> wrote in message news:cd085g$29tq$1 digitaldaemon.com...
> For example, I recently got stuck with: Object { ... char[] toString() ... } but I need wchar[] at least for supporting non-ASCII languages. DMD complains about a different return type.

char[] isn't ASCII, it's UTF-8. Any UTF-8 string can be converted to UTF-16 (which is wchar[]) by calling std.utf.toUTF16(). So, char[] toString() does fully support non-ASCII languages.
Jul 13 2004
"Walter" <newshound digitalmars.com> wrote in message news:cd17ev$115j$1 digitaldaemon.com...
> char[] isn't ASCII, it's UTF-8. Any UTF-8 string can be converted to UTF-16 (which is wchar[]) by calling std.utf.toUTF16(). So, char[] toString() does fully support non-ASCII languages.

Sorry for misleading you all a little. DWT has an 'internal convention' of using 'alias wchar[] String;' as a replacement for the 'java String class'. I don't know why. It seems it was Andy's decision. I hope it's right, but...

Recently I got stuck with this:

    alias wchar[] String;

    public class ToStringTest
    {
        this() { }
        String toString() { return "ff"; }
    }

DMD complains about a different return type:

    //function toString overrides but is not covariant with toString

How can we get past this 'probable error'? The error has gone away for now, for an unknown reason (it happened before), but I'm not sure it won't come back again later...

(sorry for my probably wrong English grammar here).
Jul 13 2004
"Blandger" <zeroman aport.ru> wrote in message news:cd1fmq$1fqa$2 digitaldaemon.com...
> DMD complains about a different return type: //function toString overrides but is not covariant with toString
> How can we get past this 'probable error'?

The "not covariant" error happens when the overriding function has a return type that is not the same as the return type of the overridden function, or is not derived from that type.
Jul 13 2004
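One way to stay covariant while still offering UTF-16 is to keep toString() returning UTF-8 and add a separate accessor that converts. The sketch below assumes a present-day D compiler, where Object.toString returns string (immutable UTF-8) rather than the char[] of 2004; the method name toWString is invented for the illustration and is not part of DWT or Phobos.

```d
import std.utf : toUTF16;

class ToStringTest
{
    // Same return type as Object.toString, so the override is accepted.
    override string toString()
    {
        return "ff";
    }

    // UTF-16 view for APIs that want wide strings; the conversion
    // is lossless because both forms encode the same code points.
    wstring toWString()
    {
        return toUTF16(toString());
    }
}
```

Declaring toString() with a wide return type instead would reproduce the "not covariant" error described above, since a wide string type is neither the same as nor derived from the overridden return type.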