digitalmars.D - performance of char vs wchar
- Ben Hinkle (21/21) Aug 27 2004 All the threads about char vs wchar got me thinking about char performan...
- Ben Hinkle (7/7) Aug 27 2004 Note: looks like the version that got attached was not the version I use...
- Berin Loritsch (23/46) Aug 27 2004 FWIW, the Java 'char' is a 16 bit value due to the unicode standards.
- Ben Hinkle (24/71) Aug 27 2004 performance,
- Arcane Jill (39/49) Aug 27 2004 Actually, wchar and dchar are pretty much the same in terms of ease of u...
- Ben Hinkle (31/80) Aug 27 2004 use. In
- Berin Loritsch (5/9) Aug 27 2004 Either that or hamper Japanese coders :)
- Walter (12/16) Aug 27 2004 general,
- Arcane Jill (34/44) Aug 27 2004 Are you sure? That's one hellava claim to make.
- Arcane Jill (5/8) Aug 27 2004 Whoops - dumb brain fart! Please pretend I didn't say that.
- Walter (67/115) Aug 28 2004 general,
- Berin Loritsch (15/21) Aug 30 2004 Umm, what about the toString() function? Doesn't that assume char[]?
- Walter (9/24) Aug 30 2004 Yes, but it isn't char(!)acteristic of D.
- Roald Ribe (14/31) Sep 02 2004 But this claim holds true only for those who have English as their only
- Berin Loritsch (27/55) Aug 27 2004 Internally there is no such thing. It's just easier to deal with that
- Arcane Jill (8/11) Aug 27 2004 Yeah, I forgot about allocation time. Of course, D initializes all array...
- Walter (8/16) Aug 27 2004 no
- Arcane Jill (15/19) Aug 27 2004 There you go again, assuming that wchar[] strings are double the length ...
- Walter (25/45) Aug 28 2004 real
- Berin Loritsch (18/22) Aug 30 2004 But not completely. There is the euro symbol (I dare say would
- Roald Ribe (55/77) Aug 30 2004 Even Britain has a non-ASCII used quite extensively: Pound. £
- Lars Ivar Igesund (7/99) Aug 30 2004 I couldn't agree more about Walter's ASCII argument. It's way out there
- Ben Hinkle (15/114) Aug 30 2004 Walter did use the word "most". Does anyone know of any studies on the
- Ilya Minkov (18/37) Aug 30 2004 When serving HTML, extended european characters are usually not served
- Arcane Jill (17/21) Aug 31 2004 One option would be the encoding WINDOWS-1251. Quote...
- Ilya Minkov (34/44) Sep 01 2004 Oh come on. Do you rally think i don't know 1251 and all the other
- van eeshan (18/37) Aug 28 2004 What you fail to understand, Jill, is that such arguments are but pinpri...
- Arcane Jill (8/9) Aug 27 2004 Why are strings added to the GC root list anyway? It occurs to me that a...
- Walter (8/16) Aug 28 2004 arrays of
- Sean Kelly (9/12) Aug 27 2004 So:
- Arcane Jill (5/8) Aug 27 2004 Only if there are the same number of chars in a char array as there are ...
All the threads about char vs wchar got me thinking about char performance, so I thought I'd test out the difference. I've attached a simple test that allocates a bunch of strings (of type char[] or wchar[]) of random length, each with one random 'a' in them. Then it loops over the strings randomly and counts the number of 'a's. So this test does not measure transcoding speed - it just measures allocation and simple access speed across a large number of long strings. My times are:

  char:  .36 seconds
  wchar: .78
  dchar: 1.3

I tried to make sure the dchar version didn't run into swap space. If I scale the size of the problem up or down, the factor of 2 generally remains until the times get so small that they don't matter. It looks like a lot of the time is taken in initializing the strings, since when I modify the test to time only access performance the scale factor goes down from 2 to something under 1.5 or so. I built it using

  dmd charspeed.d -O -inline -release

It would be nice to try other tests that involve searching for sub-strings (like multi-byte unicode strings) but I haven't done that.

-Ben
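Here is a rough sketch of the kind of test described above. It is not the attached charspeed.d; the string count, lengths and the inline random generator are assumptions, and each phase should be timed with whatever timer you prefer:

  // Alias CharType to char, wchar or dchar and rebuild to compare.
  alias char CharType;

  const int NSTRINGS = 10_000;
  const int MAXLEN   = 10_000;

  // Tiny inline LCG so the sketch doesn't depend on any particular
  // Phobos release for random numbers.
  uint seed = 12345;
  uint nextRand()
  {
      seed = seed * 1664525 + 1013904223;
      return seed >> 8;
  }

  int main()
  {
      CharType[][] strings;
      strings.length = NSTRINGS;

      // Allocation phase: random lengths, one random 'a' in each string.
      for (int i = 0; i < NSTRINGS; i++)
      {
          int len = cast(int)(nextRand() % MAXLEN) + 1;
          strings[i] = new CharType[len];
          strings[i][] = 'x';
          strings[i][nextRand() % len] = 'a';
      }

      // Access phase: visit the strings in random order, count the 'a's.
      int count = 0;
      for (int i = 0; i < NSTRINGS; i++)
      {
          CharType[] s = strings[nextRand() % NSTRINGS];
          foreach (CharType c; s)
              if (c == 'a')
                  count++;
      }

      // Returning something derived from count keeps the optimizer from
      // discarding the access loop.
      return count == NSTRINGS ? 0 : 1;
  }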
Aug 27 2004
Note: looks like the version that got attached was not the version I used to run the tests - it was the one I was using to test access. I thought attaching the file would make a copy right after attaching it, but it looks like my newsreader makes a copy only when the message is posted. The version I ran to get my original numbers had longer strings (like 10000 instead of 1000) and timed the allocation as well as the access. -Ben
Aug 27 2004
Ben Hinkle wrote:
> All the threads about char vs wchar got me thinking about char
> performance, so I thought I'd test out the difference. I've attached a
> simple test that allocates a bunch of strings (of type char[] or
> wchar[]) of random length, each with one random 'a' in them. Then it
> loops over the strings randomly and counts the number of 'a's. [...]
> My times are:
>
>   char:  .36 seconds
>   wchar: .78
>   dchar: 1.3
>
> [...] It would be nice to try other tests that involve searching for
> sub-strings (like multi-byte unicode strings) but I haven't done that.

FWIW, the Java 'char' is a 16 bit value due to the unicode standards. The idea of course, is that internally to the program all strings are encoded the same and translated on IO. I would venture to say there are two observations about Java strings:

1) In most applications they are fast enough to handle a lot of work.
2) It can become a bottleneck when excessive string concatenation is
   happening, or logging is overdone.

As far as D is concerned, it should be expected that a same sized array would take longer if the difference between arrays is the element size. In this case 1 byte, 2 bytes, 4 bytes. I don't know much about allocation routines, but I imagine getting something to run at constant cost is either beyond the realm of possibility or it is just way too difficult to be practical to pursue.

IMO, the benefits of using a standard unicode encoding within the D program outweigh the costs of allocating the arrays. Anytime we make it easier to provide internationalization (i18n), it is a step toward greater acceptance. To assume the world only works with european (or american) character sets is to stick your head in the sand.

Honestly, I would prefer something that made internationalized strings easier to manage rather than more difficult. If there are no multi-char graphemes (i.e. graphemes that take up more than one code space) then that would be the easiest to work with and write libraries for.
Aug 27 2004
"Berin Loritsch" <bloritsch d-haven.org> wrote in message news:cgnc25$1l1f$1 digitaldaemon.com...Ben Hinkle wrote:performance,All the threads about char vs wchar got me thinking about charthatso I thought I'd test out the difference. I've attached a simple testlengthallocates a bunch of strings (of type char[] or wchar[]) of randomrandomlyeach with one random 'a' in them. Then it loops over the stringslargeand counts the number of 'a's. So this test does not measure transcoding speed - it just measures allocation and simple access speed across aremainsnumber of long strings. My times are char: .36 seconds wchar: .78 dchar: 1.3 I tried to make sure the dchar version didn't run into swap space. If I scale up or down the size of the problem the factor of 2 generallylotuntil the times gets so small that it doesn't matter. It looks like a2of the time is taken in initializing the strings since the when I modify the test to only time access performance the scale factor goes down fromsub-stringsto something under 1.5 or so. I built it using dmd charspeed.d -O -inline -release It would be nice to try other tests that involve searching forI wonder what Java strings in utf8 would be like... I wonder if anyone has tried that out.(like multi-byte unicode strings) but I haven't done that.FWIW, the Java 'char' is a 16 bit value due to the unicode standards. The idea of course, is that internally to the program all strings are encoded the same and translated on IO. I would venture to say there are two observations about Java strings: 1) In most applications they are fast enough to handle a lot of work. 2) It can become a bottleneck when excessive string concatenation is happening, or logging is overdone.As far as D is concerned, it should be expected that a same sized array would take longer if the difference between arrays is the element size. In this case 1 byte, 2 bytes, 4 bytes. I don't know much about allocation routines, but I imagine getting something to run at constant cost is either beyond the realm of possibility or it is just way too difficult to be practical to pursue.agreed. There probably isn't any way to speed up the initialization much.IMO, the benefits of using a standard unicode encoding within the D program outway the costs of allocating the arrays. Anytime we make it easier to provide internationalization (i18n), it is a step toward greater acceptance. To assume the world only works with european (or american) character sets is to stick your head in the sand.agreed. utf8 and utf16 are both unicode standards that require multi-byte handling to cover all of unicode so in terms of ease of use they shouldn't be any different.Honestly, I would prefer something that made internationalized strings easier to manage that more difficult. If there is no multi-char graphemes (i.e. takes up more than one code space) then that would be the easiest to work with and write libraries for.dchar would be the choice for ease of use but as you can see performance goes downhill significantly (at least for the naive test I ran). To me the performance of dchar is too poor to make it the standard and the ease-of-use of utf8 and utf16 are essentially equivalent so since utf8 has the best performance it should be the default. Hence my attempt to measure which is faster for typical usage: char or wchar? -Ben
Aug 27 2004
In article <cgner6$1mh4$1 digitaldaemon.com>, Ben Hinkle says...
> dchar would be the choice for ease of use

Actually, wchar and dchar are pretty much the same in terms of ease of use. In almost all cases, you can just /pretend/ that UTF-16 is not a multi-word encoding, and just write your code as if one wchar == one character. Except in very specialist cases, you'll be correct. Even if you're using characters beyond U+FFFF, regarding them as two characters instead of one is likely to be harmless.

Of course, there are applications for which you can't do this. Font rendering algorithms, for example, /must/ be able to distinguish true character boundaries. But even then, the cost of rendering greatly outweighs the (almost insignificant) cost of determining true character boundaries - a trivial task in UTF-16.

char, by contrast, is much more difficult to use, because you /cannot/ "pretend" it's a single-byte encoding, because there are too many characters beyond U+00FF which applications need to "understand".

> but as you can see performance goes downhill significantly (at least for
> the naive test I ran). To me the performance of dchar is too poor to
> make it the standard

Agreed.

> and the ease-of-use of utf8 and utf16 are essentially equivalent

Not really. The ICU API is geared around UTF-16, so if you use wchar[]s, you can call ICU functions directly. With char[]s, there's a bit more faffing, so UTF-8 becomes less easy to use in this regard.

And don't forget - UTF-8 is a complex algorithm, while UTF-16 is a trivially simple algorithm. As an application programmer, you may be shielded from the complexity of UTF-8 by library implementation and implicit conversion, but it's all going on under the hood. UTF-8's "ease of use" vanishes as soon as you actually have to implement it.

> so since utf8 has the best performance it should be the default.

But it doesn't. Your tests were unfair. A UTF-16 array will not, in general, require twice as many bytes as a UTF-8 array, which is what you seemed to assume. That will only be true if the string is pure ASCII, but not otherwise, and for the majority of characters the performance will be worse. Check out this table:

  Codepoint         Number of bytes
  range           UTF-8  UTF-16  UTF-32
  --------------------------------------
  0000 to 007F      1      2       4
  0080 to 07FF      2      2       4
  0800 to FFFF      3      2       4    <----- who wins on this row?
  10000+            4      4       4

> Hence my attempt to measure which is faster for typical usage: char or
> wchar?

You could try measuring it again, but this time without the assumption that all characters are ASCII.

Arcane Jill
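To make the "complexity under the hood" point concrete, here is a small sketch (mine, not from the thread) of counting characters in each encoding. It uses std.utf.stride, whose exact signature has varied between D releases:

  import std.utf;

  // Number of code points in a UTF-8 string: real decoding is needed.
  size_t countChars(char[] s)
  {
      size_t n = 0;
      for (size_t i = 0; i < s.length; n++)
          i += stride(s, i);    // 1 to 4 code units per character
      return n;
  }

  // The UTF-16 "pretend it's fixed width" shortcut: correct except for
  // characters beyond U+FFFF, which are counted as two (surrogate pairs).
  size_t countCharsApprox(wchar[] s)
  {
      return s.length;
  }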
Aug 27 2004
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cgngtc$1nep$1 digitaldaemon.com...In article <cgner6$1mh4$1 digitaldaemon.com>, Ben Hinkle says...use. Indchar would be the choice for ease of useActually, wchar and dchar are pretty much the same in terms of ease ofalmost all cases, you can just /pretend/ that UTF-16 is not a multi-word encoding, and just write your code as if one wchar == one character.Except invery specialist cases, you'll be correct. Even if you're using charactersbeyondU+FFFF, regarding them as two characters instead of one is likely to be harmless. Of course, there are applications for which you can't do this. Fontrenderingalgorithms, for example, /must/ be able to distinguish true character boundaries. But even then, the cost of rendering greatly outweight the(almostinsignificant) cost of determining true character boundaries - a trivialtask inUTF-16. char, by contrast, is much more difficult to use, because you /cannot/"pretend"it's a single-byte encoding, because there are too many characters beyondU+00FFwhich applications need to "understand".true - shortcuts can be taken if the application doesn't have to support all of unicode.thebut as you can see performance goes downhill significantly (at least for the naive test I ran). To meyou canperformance of dchar is too poor to make it the standardAgreed.and the ease-of-use of utf8 and utf16 are essentially equivalentNot really. The ICU API is geared around UTF-16, so if you use wchar[]s,call ICU functions directly. With char[]s, there's a bit more faffing, soUTF-8becomes less easy to use in this regard.I'm not sure what faffing is but yeah I agree - I just meant the ease-of-use without worrying about library APIs.And don't forget - UTF-8 is a complex algorithm, while UTF-16 is atriviallysimple algorithm. As an application programmer, you may be shielded fromthecomplexity of UTF-8 by library implementation and implicit conversion, butit'sall going on under the hood. UTF-8's "ease of use" vanishes as soon as you actually have to implement it.I meant ease-of-use for end-users. Calling toUTF8 is just as easy as calling toUTF16. Personally as long as it is correct the underlying library complexity doesn't affect me. Performance of the application becomes the key factor. .general,so since utf8 has the best performance it should be the default.But it doesn't. Your tests were unfair. A UTF-16 array will not, inrequire twice as many bytes as a UTF-8 array, which is what you seemed to assume. That will only be true if the string is pure ASCII, but nototherwise,and for the majority of characters the performance will be worse. Checkout thistable: Codepoint Number of bytes range UTF-8 UTF-16 UTF-32 ---------------------------------------- 0000 to 007F 1 2 4 0080 to 07FF 2 2 4 0800 to FFFF 3 2 4 <----- who wins on this row? 10000+ 4 4 4I was going to start digging into non-ascii next. I remember reading somewhere that encoding asian languages in utf8 typically results in longer strings than utf16. That will definitely hamper utf8.that allHence my attempt to measure which is faster for typical usage: char or wchar?You could try measuring it again, but this time without the assumptioncharacters are ASCII. Arcane Jill
Aug 27 2004
Ben Hinkle wrote:
> I was going to start digging into non-ascii next. I remember reading
> somewhere that encoding asian languages in utf8 typically results in
> longer strings than utf16. That will definitely hamper utf8.

Either that or hamper Japanese coders :)

If it comes out to a performance draw when dealing with non-ascii text, then might I suggest using programming ease (for library writers as well) as the tie breaker?
Aug 27 2004
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cgngtc$1nep$1 digitaldaemon.com...But it doesn't. Your tests were unfair. A UTF-16 array will not, ingeneral,require twice as many bytes as a UTF-8 array, which is what you seemed to assume. That will only be true if the string is pure ASCII, but nototherwise,and for the majority of characters the performance will be worse.The majority of characters are multibyte in UTF-8, that is true. But the distribution of characters is what matters for speed in real apps, and for those, the vast majority will be ASCII. Furthermore, many operations on strings can treat UTF-8 as if it were single byte, such as copying, sorting, and searching. Of course, there are still many situations where UTF-8 is not ideal, which is why D tries to be agnostic about whether the application programmer wants to use char[], wchar[], or dchar[] or any combination of the three.
Aug 27 2004
In article <cgobse$237t$1 digitaldaemon.com>, Walter says...
> The majority of characters [within strings] are multibyte in UTF-8, that
> is true. But the [frequency] distribution of characters is what matters
> for speed in real apps, and for those, the vast majority will be ASCII.

Are you sure? That's one hellava claim to make.

I've noticed that D seems to generate a lot of interest from Japan (judging from the existence of Japanese web sites). Of course, Japanese strings average 1.5 times as long in UTF-8 as those same strings would have been in UTF-16.

The whole "Most characters are ASCII" dogma is really only true if you happen to live in a certain part of the world, and to bias a computer language because of that assumption hurts performance for everyone else. /Please/ reconsider.

> Furthermore, many operations on strings can treat UTF-8 as if it were
> single byte, such as copying, sorting, and searching.

Copying, yes. But of course you miss the point that this would be just as true in UTF-16 as it is in UTF-8.

Sorting? - Lexicographical sorting, maybe, but the only reason you can get away with that is because, in ASCII-only parts of the world, codepoint order happens to correspond to the order we find letters in the alphabet, and even then only if we're prepared to compromise on case ("Foo" sorts before "bar"). Stick an acute accent over one of the vowels and lexicographical sort order goes out the window. Lexicographical sorting may be good for purely mechanical things like eliminating duplicates in an AA, but if someone wants to look up all entries between "Alice" and "Bob" in a database, I think they would be very surprised to find that "Äaron" was not in the list. (And again, you miss the point that if lexicographical sorting /is/ what you want, it works just as well in UTF-16 as it does in UTF-8). /Real/ sorting, however, requires full understanding of Unicode, and for that, ASCII is just not good enough.

Searching? If you treat "UTF-8 as if it were single byte", there are an /awful lot/ of characters you can't search for, including the British Pound (currency) sign, the Euro currency sign and anything with an accent over it. And searching for the single byte 0x81 (for example) is not exactly useful.

> Of course, there are still many situations where UTF-8 is not ideal,
> which is why D tries to be agnostic about whether the application
> programmer wants to use char[], wchar[], or dchar[] or any combination
> of the three.

Yes, agnostic is good. No problem there. I'm only talking about the /default/. I thought you were all /for/ internationalization and Unicode and all that? I'm surprised to find myself arguing with you on this one. (Okay, I didn't really expect you to ditch the char, but to prefer wchar[] over char[] is /reasonable/). I have given many, many, many reasons why I think that wchar[] strings should be the default in D, and if I can't convince you, I think that would be a big shame.

Arcane Jill
Aug 27 2004
In article <cgp7c3$2e36$1 digitaldaemon.com>, Arcane Jill says...
> if someone wants to look up all entries between "Alice" and "Bob" in a
> database, I think they would be very surprised to find that "Äaron" was
> not in the list.

Whoops - dumb brain fart! Please pretend I didn't say that. The reasoning is still sound - it's just my conception of alphabetical order that's up the spout.

Jill
Aug 27 2004
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cgp7c3$2e36$1 digitaldaemon.com...general,But it doesn't. Your tests were unfair. A UTF-16 array will not, intorequire twice as many bytes as a UTF-8 array, which is what you seemedotherwise,assume. That will only be true if the string is pure ASCII, but notforIn article <cgobse$237t$1 digitaldaemon.com>, Walter says...and for the majority of characters the performance will be worse.The majority of characters [within strings] are multibyte in UTF-8, that is true. But the [frequency] distribution of characters is what matters for speed in real apps, andYes. Nearly all the spam I get is in ascii <g>. When optimizing for speed, the first rule is optimize for what the bulk of the data will likely consist of for your application. For example, if you're writing a user interface for Chinese people, you'd be sensible to consider using dchar[] throughout. It probably makes sense to use wchar[] for the unicode library you're developing because programmers who have a need for such a library will most likely NOT be writing applications for ascii.those, the vast majority will be ASCII.Are you sure? That's one hellava claim to make.I've noticed that D seems to generate a lot of interest from Japan(judging fromthe existense of Japanese web sites). Of course, Japanese strings average1.5times as long in UTF-8 as those same strings would have been in UTF-16.If I was building a Japanese word processor, I certainly wouldn't use UTF-8 internally in it for that reason.The whole "Most characters are ASCII" dogma is really only true if youhappen tolive in a certain part of the world, and to bias a computer languagebecause ofthat assumption hurts performance for everyone else. /Please/ reconsider.If everything is optimized for Japanese, it will hurt performance for ASCII users. The point is, there is no UTF encoding that is optimal for everyone. That's why D supports all three.sorting,Furthermore, many operations on strings can treat UTF-8 as if it were single byte, such as copying,trueand searching.Copying, yes. But of course you miss the point that this would be just asin UTF-16 as it is in UTF-8.Of course. My point was that quite a few common string operations do not require decoding. For example, the D compiler processes source as UTF-8. It almost never has to do any decoding. The performance penalty for supporting multibyte encodings in D source is essentially zero.Sorting? - Lexicographical sorting, maybe, but the only reason you can getawaywith that is because, in ASCII-only parts of the world, codepoint orderhappensto correspond to the order we find letters in the alphabet, and even thenonlyif we're prepared to compromise on case ("Foo" sorts before "bar"). Stickanacute accent over one of the vowels and lexicographical sort order goesout thewindow. Lexicographical sorting may be good for things purely mechanicalthingslike eliminating duplicates in an AA, but if someone wants to look up all entries between "Alice" and "Bob" in a database, I think they would beverysurprised to find that "Äaron" was not in the list. (And again, you missthepoint that if lexicographical sorting /is/ what you want, it works just aswellin UTF-16 as it does in UTF-8).Sure - and my point is it wasn't necessary to decode UTF-8 to do that sort. 
It's not necessary for hashing the string, either./Real/ sorting, however, requires full understanding of Unicode, and for that, ASCII is just not good enough.There are many different ways to sort, and since the unicode characters are not always ordered the obvious way, you have to deal with that specially in each of UTF-8, -16, and -32.Searching? If you treat "UTF-8 as if it were single byte", there are an/awfullot/ of characters you can't search for, including the British Pound(currency)sign, the Euro currency sign and anything with an accent over it. Andsearchingfor the single byte 0x81 (for example) is not exactly useful.That's why std.string.find() takes a dchar as its search argument. What you do is treat it as a substring search. There are a lot of very fast algorithms for doing such searches, such as Boyer-Moore, which get pretty close to the performance of a single character search. Furthermore, I'd optimize it so the first thing the search did was check if the search character was ascii. If so, it'd do the single character scan. Otherwise, it'd do the substring search.whichOf course, there are still many situations where UTF-8 is not ideal,wantsis why D tries to be agnostic about whether the application programmer/default/. Ito use char[], wchar[], or dchar[] or any combination of the three.Yes, agnostic is good. No problem there. I'm only talking about thethought you were all /for/ internationalization and Unicode and all that?But D does not have a default. Programmers can use the encoding which is optimal for the data they expect to see. Even if UTF-8 were the default, UTF-8 still supports full internationalization and Unicode. I am certainly not talking about supporting only ASCII or having ASCII as the default.I'm surprised to find myself arguing with you on this one. (Okay, I didn'treallyexpect you to ditch the char, but to prefer wchar[] over char[] is /reasonable/). I have given many, many, many reasons why I think thatwchar[]strings should be the default in D, and if I can't convince you, I thinkthatwould be a big shame.My experience with using UTF-16 throughout a program is it speeded up quite a bit when converted to UTF-8. There is no blanket advantage to UTF-16, it depends on your expected data. When your expected data will be mostly ASCII, then UTF-8 is the reasonable choice.
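A rough illustration of the two-path search described above (this is not the Phobos std.string.find source, just a sketch of the idea, using the fixed-buffer overload of std.utf.toUTF8):

  import std.utf;

  // Return the byte index of dchar c in UTF-8 string s, or -1 if absent.
  int findChar(char[] s, dchar c)
  {
      if (c <= 0x7F)
      {
          // ASCII fast path: a plain byte scan can never match in the
          // middle of a multi-byte sequence, because continuation bytes
          // are always >= 0x80.
          for (int i = 0; i < s.length; i++)
              if (s[i] == c)
                  return i;
          return -1;
      }

      // Non-ASCII: search for the character's UTF-8 byte sequence.
      // (A production version might use Boyer-Moore or similar here.)
      char[4] buf;
      char[] needle = toUTF8(buf, c);
      for (int i = 0; i + needle.length <= s.length; i++)
          if (s[i .. i + needle.length] == needle)
              return i;
      return -1;
  }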
Aug 28 2004
Walter wrote:
> But D does not have a default. Programmers can use the encoding which is
> optimal for the data they expect to see. Even if UTF-8 were the default,
> UTF-8 still supports full internationalization and Unicode. I am
> certainly not talking about supporting only ASCII or having ASCII as the
> default.

Umm, what about the toString() function? Doesn't that assume char[]? Hence, it is the default by example.

I'll be honest, I don't get why optimization is so important when a need hasn't yet been determined. I am sure there can be quicker ways of dealing with allocation and de-allocation--this would make the system faster for all objects, not just strings. If that can be done, why not concentrate on that?

More advanced memory utilization can mean better overall performance, and reduce the cost of one type of string over another. Heck, if a page of memory is being allocated for string storage (multiple strings, mind you), what about a really fast bit blit for the whole page? That would make the strings default to their initialization state and speed things up. Ideally, the difference between a char[] and a dchar[] would be how much of that page is allocated.
Aug 30 2004
"Berin Loritsch" <bloritsch d-haven.org> wrote in message news:cgv92r$2fvv$1 digitaldaemon.com...Umm, what about the toString() function? Doesn't that assume char[]? Hense, it is the default by example.Yes, but it isn't char(!)acteristic of D.I'll be honest, I don't get why optimization is so important when there hasn't been determined a need yet.Efficiency, or at least potential efficiency, has always been a strong attraction that programmers have to C/C++. Since D is targetted at that market, efficiency will be a major consideration. If D acquires an early reputation for being "slow", like Java did, that reputation can be very, very hard to shake.I am sure there can be quicker ways of dealing with allocation and de-allocation--this would make the system faster for all objects, not just strings. If that can be done, why not concentrate on that?There's no way to just wipe away the costs of using double the storage.More advanced memory utilization can mean better overall performance, and reduce the cost of one type of string over another. Heck, if a page of memory is being allocated for string storage (multiple strings mind you), what about a really fast bit blit for the whole page? That would make the strings default to initialization state and speed things up. Ideally, the difference between a char[] and a dchar[] would be how much of that page is allocated.
Aug 30 2004
"Walter" <newshound digitalmars.com> wrote in message news:ch048q$2uql$1 digitaldaemon.com..."Berin Loritsch" <bloritsch d-haven.org> wrote in message news:cgv92r$2fvv$1 digitaldaemon.com...But this claim holds true only for those who have English as their only working language, and (maybe) for a few others in Europe. In all other markets (5 billion+) the utf8 storage will in fact (mostly) be _larger_ than the utf16 storage. And as I proposed earlier, you could leave an otption for English/Europeans in the form of char defined as in C/C++, which would in addition to being just as fast, actually make the transition to the D language easier. I think it all comes down to this: will D became a general purpose language for the international community or will it mostly become a "better" C++ for English speakers only? RoaldUmm, what about the toString() function? Doesn't that assume char[]? Hense, it is the default by example.Yes, but it isn't char(!)acteristic of D.I'll be honest, I don't get why optimization is so important when there hasn't been determined a need yet.Efficiency, or at least potential efficiency, has always been a strong attraction that programmers have to C/C++. Since D is targetted at that market, efficiency will be a major consideration. If D acquires an early reputation for being "slow", like Java did, that reputation can be very, very hard to shake.I am sure there can be quicker ways of dealing with allocation and de-allocation--this would make the system faster for all objects, not just strings. If that can be done, why not concentrate on that?There's no way to just wipe away the costs of using double the storage.
Sep 02 2004
Ben Hinkle wrote:
> "Berin Loritsch" <bloritsch d-haven.org> wrote in message
> news:cgnc25$1l1f$1 digitaldaemon.com...
>> FWIW, the Java 'char' is a 16 bit value due to the unicode standards.
>> The idea of course, is that internally to the program all strings are
>> encoded the same and translated on IO. [...]
>
> I wonder what Java strings in utf8 would be like... I wonder if anyone
> has tried that out.

Internally there is no such thing. It's just easier to deal with it that way. The translation happens with encoders and decoders on IO.

> dchar would be the choice for ease of use but as you can see performance
> goes downhill significantly (at least for the naive test I ran). To me
> the performance of dchar is too poor to make it the standard, and the
> ease-of-use of utf8 and utf16 are essentially equivalent, so since utf8
> has the best performance it should be the default. Hence my attempt to
> measure which is faster for typical usage: char or wchar?

Yea, but I've learned not to get hung up on strict performance. There is a difference between ultimately fast and fast enough. Sometimes, to squeeze those extra cycles out, we can cause more programming issues than need be. If the allocation routines could be done faster (big assumption here), would it be preferable to use a dchar if it is fast enough? For the record, I believe Java uses UTF16 internally, which means for most things there is less of a need to worry about MB characters.

The interesting test would be to have strings of the same length, and then test the algorithm to get a substring from that string. For example, a fixed string that uses some multi-codespace characters here and there, and then getting the 15th through 20th characters of that string. This will show just how things might come out in the wash when we are dealing with that type of issue. For example, it is not uncommon to have the system read in a block of text from a file into memory (say 4k worth at a time), and then just iterate one line at a time. Which gives us the substring scenario. Alternatively there is the regex algorithm, which would need to account for the multi-codespace characters and will take a performance hit as well.

There is a lot more to an overall performant system than just string allocation--even though we are aware that it can be a significant cost.
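A sketch of the substring test described here, for the UTF-8 case (my own illustration, not from the thread; "characters" here means code points, not graphemes, and std.utf signatures have varied between D releases):

  import std.utf;

  // Slice out code points [first, last) of a UTF-8 string. With dchar[]
  // the same operation would simply be s[first .. last].
  char[] charSlice(char[] s, size_t first, size_t last)
  {
      size_t i = 0;
      size_t n = 0;

      // advance to the first-th code point
      while (i < s.length && n < first) { i += stride(s, i); n++; }
      size_t start = i;

      // advance to the last-th code point
      while (i < s.length && n < last)  { i += stride(s, i); n++; }

      return s[start .. i];
  }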
Aug 27 2004
In article <cgn92o$1jj4$1 digitaldaemon.com>, Ben Hinkle says...
> char:  .36 seconds
> wchar: .78
> dchar: 1.3

Yeah, I forgot about allocation time. Of course, D initializes all arrays, no matter whence they are allocated. char[]s will be filled with all FFs, and wchar[]s will be filled with all FFFFs. Twice as many bytes = twice as many bytes to initialize. Damn!

A super-fast character array allocator would make a lot of difference here. There are probably many different ways of doing this very fast. I guess this has to happen within DMD.
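For reference, a minimal check of the initializers being discussed (char.init and wchar.init are deliberately invalid code units):

  void main()
  {
      char[]  a = new char[4];
      wchar[] b = new wchar[4];
      assert(a[0] == 0xFF);     // char.init  is 0xFF   (invalid UTF-8 unit)
      assert(b[0] == 0xFFFF);   // wchar.init is 0xFFFF (not a valid character)
  }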
Aug 27 2004
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cgncho$1la0$1 digitaldaemon.com...In article <cgn92o$1jj4$1 digitaldaemon.com>, Ben Hinkle says...nochar: .36 seconds wchar: .78 dchar: 1.3Yeah, I forgot about allocation time. Of course, D initializes all arrays,matter whence they are allocated. char[]s will be filled with all FFs, and wchar[]s will be filled with all FFFFs. Twice as many bytes = twice asmanybytes to initialize. Damn!There are also twice as many bytes to scan for the gc, and half the data until your machine starts thrashing the swap disk. The latter is a very real issue for server apps, since it means that you reach the point of having to double the hardware in half the time.
Aug 27 2004
In article <cgobse$237t$2 digitaldaemon.com>, Walter says...
> There are also twice as many bytes to scan for the gc, and half the data
> until your machine starts thrashing the swap disk. The latter is a very
> real issue for server apps, since it means that you reach the point of
> having to double the hardware in half the time.

There you go again, assuming that wchar[] strings are double the length of char[] strings. THIS IS NOT TRUE IN GENERAL.

In Chinese, wchar[] strings are shorter than char[] strings. In Japanese, wchar[] strings are shorter than char[] strings. In Mongolian, wchar[] strings are shorter than char[] strings. In Tibetan, wchar[] strings are shorter than char[] strings. I assume I don't need to go on...?

<sarcasm>But I guess server apps never have to deliver text in those languages.</sarcasm>

Walter, servers are one of the places where internationalization matters most. XML and HTML documents, for example, could be (a) stored and (b) requested in any encodings whatsoever. A server would have to push them through a transcoding function. For this, wchar[]s are more sensible.

I don't understand the basis of your determination. It seems ill-founded.

Jill
Aug 27 2004
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cgp845$2ea2$1 digitaldaemon.com...In article <cgobse$237t$2 digitaldaemon.com>, Walter says...realThere are also twice as many bytes to scan for the gc, and half the data until your machine starts thrashing the swap disk. The latter is a verytoissue for server apps, since it means that you reach the point of havingAre you sure? Even european languages are mostly ascii.double the hardware in half the time.There you go again, assuming that wchar[] strings are double the length of char[] strings. THIS IS NOT TRUE IN GENERAL.In Chinese, wchar[] strings are shorter than char[] strings. In Japanese, wchar[] strings are shorter than char[] strings. In Mongolian, wchar[] strings are shorter than char[]strings.In Tibetan, wchar[] strings are shorter than char[] strings. I assume Idon'tneed to go on...? <sarcasm>But I guess server apps never have to deliver text in those languages.</sarcasm>Never is not the right word here. The right idea is what is the frequency distribution of the various types of data one's app will see. Once you know that, you optimize for the most common cases. Tibetan is still fully supported regardless.Walter, servers are one the places where internationalization mattersmost. XMLand HTML documents, for example, could be (a) stored and (b) requested inanyencodings whatsoever.Of course. But what are the frequencies of the requests for various encodings? Each of the 3 UTF encodings fully support unicode and are fully internationalized. Which one you pick depends on the frequency distribution of your data.A server would have to push them through a transcoding function. For this, wchar[]s are more sensible.It is not optimal unless the person optimizing the server app has instrumented his data so he knows the frequency distribution of the various characters. Only then can he select the encoding that will deliver the best performance.I don't understand the basis of your determination. It seems ill-founded.Experience optimizing apps. One of the most potent tools for optimization is analyzing the data patterns, and making the most common cases take the shortest path through the code. UTF-16 is not optimal for a great many applications - and I have experience with it.
Aug 28 2004
Walter wrote:
> "Arcane Jill" <Arcane_member pathlink.com> wrote in message
> news:cgp845$2ea2$1 digitaldaemon.com...
> [...]
>
> Are you sure? Even european languages are mostly ascii.

But not completely. There is the euro symbol (I dare say it would be quite common). In Spanish the enye (n with ~ on top, can't really do that well in Windows) is fairly common, and important. There is a big difference between an anus and a year, but the only difference in Spanish is n vs. enye. Not to mention all those words that use an accent to mark an abnormally stressed syllable.

Then we get to French, which uses the circumflex, accents, and accent grave. Oh, then there's German which uses those two little dots a lot. And I haven't even touched on Greek or Russian, both European countries.

You can only make that assumption about English speaking countries. Yes, almost everyone is exposed to English in some way, and it is the current "lingua franca" (language of business, like French used to be--hence the term). The bottom line is that there are sufficient exceptions to your "rule" that it would be a shame to assume the world was America and Great Britain.
Aug 30 2004
"Berin Loritsch" <bloritsch d-haven.org> wrote in message news:cgv9m0$2g71$1 digitaldaemon.com...Walter wrote:Even Britain has a non-ASCII used quite extensively: Pound. £ Norway/Denmark/Sweden has three non ASCII characters (used all the time). The Sami peoples has their own characters (they live in Norway, Sweden, Russia). Finland, Estonia, Lituania, Poland, ++ all have their own characters in addition to ASCII. Russia has its own alphabet! All latin family languages (French/Spanish/Italian/ Portuguese) have all sorts of special characters (accents forwards/ backwards ++)... And now I have not even gone through HALF of Europe. In Asia there are wildly different systems, and several systems in use, _in_ each_ _country_. As I have stated before: I agree with Walter's concern for performance. But where I think there is some disagreement in these discussions is where to put the effort to "adapt" the environment, on those who only needs ASCII (most of the time), or on all those who would prefer the language to default to the more general need of application and server programmers all over the world. My view is that speed freaks are used to tune the tools for best speed, and the general case should reflect newbies and the 5 billion+ potential non English using markets. Everything else is selling D short, in a shortsighted quest for best speed as default as one of the language features. I have a rather radical suggestion, that may make sense, or it may happen that someone will shoot it down right away because of something I have not thought of: 1. Remove wchar and dchar from the language. 2. Make char mean 8-bit unsigned byte, containing US-ASCII/Latin1/ISO-8859/cp1252, with one character in each byte. Null termination is expected. AFAIK all the sets mentioned are compatible with each other. Char *may* contain characters from any 8-bit based encoding, given that either existing conv. table or application can convert to/from one of the types below. This type makes for a clean, minimum effort port, from C and C++, and interaction with current crop of OS and libraries. It also takes care of US/Western Europe speed freaks. 3. New types, utf8, utf16 and utf32 as suggested by others. 4. String based on utf16 as default storage. With overidden storage type like: new String(200, utf8) // 200 bytes new String(200, utf16) // 400 bytes new String(200) // 400 bytes new String(200, utf32) // 800 bytes Anyone can use string with the optimal performance for them. 5. String literals in source, default assumed to be utf16 encoded. Can be changed by app programmer like: c"text" // char[] 4 bytes u"text" // String() 4 bytes w"text" // String() 8 bytes "text" // String() 8 bytes d"text" // String() 16 bytes I am open to the fact that I am not at all experienced in language design, but I hope this may bring the discussion along. I think making char the same as in C/C++ (but slightly better defined default char set) and go with entirely different type for the rest is a sound idea. Roald"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cgp845$2ea2$1 digitaldaemon.com... Are you sure? Even european languages are mostly ascii.But not completely. There is the euro symbol (I dare say would be quite common). In Spanish the enye (n with ~ on top, can't really do that well in Windows) is fairly common, and important. There is a big difference between an anus and a year, but the only difference in Spanish is n vs. enye. Not to mention all those words that use an accent to mark an abnormally stressed sylable. 
Then we get to French, which uses the circumflex, accents, and accent grave. Oh, then there's German which uses those two little dots alot. And I haven't even touched on Greek or Russian, both European countries. You can only make that assumption about English speaking countries. Yes almost everyone is exposed to English in some way, and it is the current "lingua de franca" (language of business, like French used to be--hense the term). The bottom line is that there are sufficient exceptions to your "rule" that it would be a shame to assume the world was America and Great Britain.
Aug 30 2004
I couldn't agree more about Walter's ASCII argument. It's way out there and alienates all of us with non-english first languages (maybe I should start writing my messages using runes, just like my forefathers...).

If toString is only really useful for debugging anyway, it could as well return dchars. I'd rather remove it altogether, though.

Lars Ivar Igesund

Roald Ribe wrote:
> Even Britain has a non-ASCII character used quite extensively: the Pound
> sign, £. Norway/Denmark/Sweden have three non-ASCII characters (used all
> the time). The Sami peoples have their own characters (they live in
> Norway, Sweden, Russia). Finland, Estonia, Lithuania, Poland, ++ all
> have their own characters in addition to ASCII. Russia has its own
> alphabet! [...]
Aug 30 2004
Walter did use the word "most". Does anyone know of any studies on the frequency of non-ASCII chars for different document content and languages? There must be solid numbers about these things given all the zillions of electronic documents out there. A quick google for French just dug up a posting where someone scanned 86 million characters from Swiss-French news agency reports and got 22M non-accented vowels (aeiou) and 1.8M accented chars. That's a factor of roughly 10. That seems significant. But I don't want to read too much into one posting found in a minute of googling - I'm just curious what the data says.

"Lars Ivar Igesund" <larsivar igesund.net> wrote in message news:cgvoid$2nt9$1 digitaldaemon.com...
> I couldn't agree more about Walter's ASCII argument. It's way out there
> and alienates all of us with non-english first languages (maybe I should
> start writing my messages using runes, just like my forefathers...).
>
> If toString is only really useful for debugging anyway, it could as well
> return dchars. I'd rather remove it altogether, though.
>
> Lars Ivar Igesund
>
> Roald Ribe wrote:
>> [...]
Aug 30 2004
Berin Loritsch schrieb:
> Walter wrote:
>> Are you sure? Even european languages are mostly ascii.
>
> But not completely. There is the euro symbol (I dare say it would be
> quite common). [...] Oh, then there's German which uses those two little
> dots a lot. And I haven't even touched on Greek or Russian, both
> European countries.

When serving HTML, extended european characters are usually not served as Latin or Unicode. Instead, the &sym; escape encoding is preferred. There are ASCII escapes for all Latin-1 characters, as far as i know.

But what bothers me with all Unicode is that cyrillic languages cannot be handled with 8 bits as well. What would be nice is if we found an encoding which would work on 2 buffers - the primary one containing the ASCII and data in some codepage. The secondary one would contain packed codepage changes, so that russian, english, hebrew and other text can be mixed and would still need about one byte per character on average. For asian languages, the encoding should use, on average per character, one symbol in the primary string and one symbol in the secondary. The length of the primary stream must be exactly the length of the string; all of the overhang must be placed in the secondary one. I have a feeling that this could be great for most uses and most efficient in total.

We should also not forget that the world is mostly chinese, and soon the computer users will also be. The european race will lose its importance.

-eye
Aug 30 2004
In article <ch07mc$30ai$1 digitaldaemon.com>, Ilya Minkov says...
> But what bothers me with all Unicode is that cyrillic languages cannot
> be handled with 8 bits as well.

One option would be the encoding WINDOWS-1251. Quote...

"The Cyrillic text used in the data sets are encoded using the CP1251 Cyrillic system. Users will require CP1251 fonts to read or print such text correctly. CP1251 is the Cyrillic encoding used in Windows products as developed by Microsoft. The system replaces the underused upper 128 characters of the typical Latin character set with Cyrillic characters, leaving the full set of Latin type in the lower 128 characters. Thus the user may mix Cyrillic and Latin text without changing fonts."
(-- source: http://polyglot.lss.wisc.edu/creeca/kaiser/cp1251.html)

But that's just a transcoding issue, surely? Internally, we'd use Unicode, no?

> We should also not forget that the world is mostly chinese, and soon the
> computer users will also be.

Well, Chinese /certainly/ can't be handled with 8 bits. Traditionally, Chinese users have made use of the encoding SHIFT-JIS, which is (shock! horror!) a /multi-byte-encoding/ (there being vastly more than 256 "letters" in the Chinese alphabet). SHIFT-JIS is seriously horrible to work with, compared with the elegant simplicity of UTF-16.

Arcane Jill
Aug 31 2004
Arcane Jill schrieb:
> One option would be the encoding WINDOWS-1251. Quote... [...]

Oh come on. Do you really think i don't know 1251 and all the other Windows codepages???? Oh, how would a person who natively speaks russian ever know that? They are all into typewriters and handwriting, aren't they?

> But that's just a transcoding issue, surely? Internally, we'd use
> Unicode, no?

You have apparently ignored what i tried to say. What is used externally is determined by external conditions, and is not the subject of this part of the post. I have suggested to investigate and possibly develop another *internal* representation which would provide optimal performance. It should consist of 2 storages, the 8-bit primary storage and the variable-length "overhang" storage, and should be able to represent *all* unicode characters. We are back at the question of an efficient String class or struct.

The idea is that characters are not self-contained, but instead context-dependent. For example, the most commonly used escape in the overhang string would be "select a new unicode subrange to work on". Unicode documents are not just random data! They are words or sentences written in a combination of a few languages, with a change of the language happening perhaps every few words. But you don't have every symbol be in the new language. So why does every symbol need to carry the complete information, if most of it is more efficiently stored as a relatively rare state change?

>> We should also not forget that the world is mostly chinese, and soon
>> the computer users will also be.
>
> Well, Chinese /certainly/ can't be handled with 8 bits. Traditionally,
> Chinese users have made use of the encoding SHIFT-JIS, which is (shock!
> horror!) a /multi-byte-encoding/ (there being vastly more than 256
> "letters" in the Chinese alphabet). SHIFT-JIS is seriously horrible to
> work with, compared with the elegant simplicity of UTF-16.

Again, you have chosen to ignore my post. As you are much more familiar with Unicode than myself, could you possibly develop an encoding which takes:

- amortized 1 byte per character for usual codepages (not including the
  fixed-length subrange select command in the beginning),
- 2 bytes per character for all multibyte encodings which fit into UTF-16
  (not including the fixed-length subrange select command in the
  beginning)?

The rest of the Unicode characters should be representable as well. Besides, i would like that only the first byte from the character encoding is stored in the primary string, and the rest in the "overhang". I have my reasons to suggest that, and *if* you care to pay attention i would also like to explain in detail.

-eye
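Purely as an illustration of the two-buffer layout being proposed (nothing here exists in any library; the names and the escape scheme are invented):

  // Hypothetical layout only: one unit per character in `primary`, plus
  // a side channel of packed escapes ("select subrange", extra bytes).
  struct PackedString
  {
      ubyte[] primary;    // exactly one unit per character
      ubyte[] overhang;   // packed state changes and overflow bytes

      size_t length() { return primary.length; }  // O(1) character count
  }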
Sep 01 2004
What you fail to understand, Jill, is that such arguments are but pinpricks upon the World's foremost authority on everything from language-design to server-software to ease-of-use. Better to just build a wchar-based String class (and all the supporting goodies), and those who care about such things will naturally migrate to it; they'll curse D for the short-sighted approach to Object.toString, leaving the door further open for a D successor.

V

"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cgp845$2ea2$1 digitaldaemon.com...
> There you go again, assuming that wchar[] strings are double the length
> of char[] strings. THIS IS NOT TRUE IN GENERAL. In Chinese, wchar[]
> strings are shorter than char[] strings. [...]
>
> I don't understand the basis of your determination. It seems ill-founded.
>
> Jill
Aug 28 2004
In article <cgobse$237t$2 digitaldaemon.com>, Walter says...
> There are also twice as many (sic) bytes to scan for the gc,

Why are strings added to the GC root list anyway? It occurs to me that arrays of bit, byte, ubyte, short, ushort, char, wchar and dchar which are allocated on the heap can never contain pointers, and so should not be added to the GC's list of things to scan when created with new (or modification of .length).

I imagine that this one simple step would increase D's performance rather dramatically.

Arcane Jill
Aug 27 2004
"Arcane Jill" <Arcane_member pathlink.com> wrote in message news:cgp8di$2ec2$1 digitaldaemon.com...In article <cgobse$237t$2 digitaldaemon.com>, Walter says...arrays ofThere are also twice as many (sic) bytes to scan for the gc,Why are strings added to the GC root list anyway? It occurs to me thatbit, byte, ubyte, short, ushort, char, wchar and dchar which are allocatedonthe heap can never contain pointers, and so should not be added to theGC's listof things to scan when created with new (or modification of .length). I imagine that this one simple step would increase D's performance rather dramatically.There's certainly potential in D to add type awareness to the gc. But that adds penalties of its own, and it's an open question whether on the balance it will be faster or not.
Aug 28 2004
In article <cgn92o$1jj4$1 digitaldaemon.com>, Ben Hinkle says...
> char:  .36 seconds
> wchar: .78
> dchar: 1.3

So:

  wchar = char * 2
  dchar = char * 4

It looks like the time complexity is a direct factor of element size, which stands to reason since D default initializes all arrays. I would be interested in seeing performance comparisons for transcoding between different formats. For the sake of argument, perhaps using both std.utf and whatever Mango uses.

Sean
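A minimal sketch of such a transcoding comparison using std.utf only (the Mango side is omitted; the sample contents, iteration count and timing mechanism are left open, and later Phobos releases changed these functions to return immutable strings):

  import std.utf;

  // Round-trip a sample string between UTF-8 and UTF-16 many times; wrap
  // the call in whatever timer you prefer and compare against other
  // transcoders.
  void roundTrip(char[] sample, int iterations)
  {
      for (int i = 0; i < iterations; i++)
      {
          wchar[] w = toUTF16(sample);    // UTF-8  -> UTF-16
          char[]  c = toUTF8(w);          // UTF-16 -> UTF-8
          assert(c == sample);            // sanity check on the round trip
      }
  }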
Aug 27 2004
In article <cgnh8j$1nl6$1 digitaldaemon.com>, Sean Kelly says...
> So:
>
>   wchar = char * 2
>   dchar = char * 4

Only if there are the same number of chars in a char array as there are wchars in a wchar array, and dchars in a dchar array. This will /only/ be true if the string is pure ASCII.

Jill
Aug 27 2004