digitalmars.D - Higher level built-in strings
- bearophile (29/29) Jul 18 2010 This odd post comes from reading the nice part about strings of chapter ...
- %u (9/11) Jul 19 2010 only 7 bit chars. But in such cases you can also use a immutable(ubyte)[...
- bearophile (4/6) Jul 19 2010 A dchar and an ubyte are better. This is almost what Python3 does and I ...
- Walter Bright (11/16) Jul 19 2010 Strings in D are deliberately meant to be arrays, not special things. Ot...
- dsimcha (8/24) Jul 19 2010 4. Sometimes one can make valid assumptions about the contents of a str...
- Steven Schveighoffer (30/50) Jul 19 2010 Andrei is changing that. Already, isRandomAccessRange!(string) == false...
- Walter Bright (3/17) Jul 19 2010 Probably too late to change that one.
- Andrei Alexandrescu (10/20) Jul 19 2010 I think otherwise. In fact I confess I am extremely excited. The current...
- Walter Bright (2/11) Jul 20 2010 That's good to hear.
- bearophile (8/10) Jul 20 2010 There is very little D2 code around, so little changes as this one are p...
- Walter Bright (2/5) Jul 20 2010 It's a D1 feature, and has been there since nearly the beginning.
- Steven Schveighoffer (14/19) Jul 20 2010 Since when did we care about D1 compatibility?
- Walter Bright (6/24) Jul 20 2010 No argument there, but we do try to avoid silent breakage.
- Andrei Alexandrescu (8/39) Jul 20 2010 Unfortunately it's inconsistent. foreach for ranges operates in terms of...
- Michel Fortin (23/32) Jul 20 2010 Sad. That's one of the first things I tried when I first learned D and
- DCoder (20/45) Jul 20 2010 and
- Jonathan M Davis (16/26) Jul 20 2010 That doesn't really gain us anything. It would likely just make it so th...
- Walter Bright (2/5) Jul 20 2010 As I said to bearophile, it's a D1 feature.
- bearophile (9/16) Jul 19 2010 In my original post I have forgotten another difference over arrays:
- Walter Bright (16/35) Jul 19 2010 This is backwards. The [i] should behave as expected for arrays. As it t...
- Andrei Alexandrescu (6/47) Jul 19 2010 Exactly. And that's what the bidirectional range interface is doing for
- Sean Kelly (2/20) Jul 20 2010 I've had the same experience. The proposed changes would make string us...
- Walter Bright (4/6) Jul 20 2010 At this point, the experience with D strings is that D gets it right. To...
- Marianne Gagnon (6/20) Jul 19 2010 Pragmatically, I seem to have noted that in languages with low level str...
- Jonathan M Davis (16/35) Jul 19 2010 For the most part, D's strings are plenty high level. Between how fantas...
- Jesse Phillips (6/6) Jul 19 2010 What about:
- Andrei Alexandrescu (5/11) Jul 19 2010 Fortunately you can essentially achieve the above by simply writing free...
- Steven Schveighoffer (7/20) Jul 20 2010 How do we make this work?
- Sean Kelly (4/10) Jul 20 2010 foreach (dchar c; str)
- =?UTF-8?B?IkrDqXLDtG1lIE0uIEJlcmdlciI=?= (14/25) Jul 20 2010 And what about this one:
- Walter Bright (3/13) Jul 20 2010 You can specialize the template for strings:
- =?UTF-8?B?IkrDqXLDtG1lIE0uIEJlcmdlciI=?= (8/23) Jul 20 2010 Sure, i can also not use a template and write however many
- Walter Bright (4/11) Jul 20 2010 The overloaded template specialization capability is exactly because it ...
- Aelxx (12/25) Jul 21 2010 Hmm. Theoreticaly a bit more general
- Steven Schveighoffer (9/19) Jul 20 2010 The omission of dchar is on purpose. Phobos has characterized string as...
- Walter Bright (3/11) Jul 20 2010 For many algorithms on strings, iterating by char is preferred over dcha...
- Steven Schveighoffer (7/16) Jul 20 2010 Huh? Which ones? AFAIK, all of std.algorithm treats strings as ranges ...
- Walter Bright (4/21) Jul 20 2010 Andrei posted elsewhere that there were specializations for strings to d...
- Andrei Alexandrescu (3/27) Jul 20 2010 Boyer-Moore comes to mind.
- Rory McGuire (10/17) Jul 20 2010 You shouldn't need to do that:
- Rory McGuire (3/34) Jul 20 2010 such as?
- Rory McGuire (3/41) Jul 20 2010 I mean is there not another way to do the same thing?
-
Simen kjaeraas
(6/6)
Jul 20 2010
Rory McGuire
wrote: - Rory McGuire (4/8) Jul 20 2010 I'm using opera mail. Any suggestions for Linux+Windows, excluding
- Walter Bright (5/7) Jul 20 2010 I use thunderbird on both windows & linux, haven't noticed speed problem...
- Jonathan M Davis (7/19) Jul 20 2010 Well, since I'm a kde user, I use knode if I want a newsreader and kmail...
- Fawzi Mohamed (28/28) Jul 20 2010 I did not read all the discussion in detail, but in my opinion
- Jesse Phillips (2/11) Jul 20 2010 Actually I'm getting his messages as emails, and just thought he was onl...
- awishformore (3/14) Jul 20 2010 Same here. It's kindof annoying tbh.
- =?UTF-8?B?IkrDqXLDtG1lIE0uIEJlcmdlciI=?= (11/24) Jul 20 2010 ly sending to me.
- Bruno Medeiros (6/12) Jul 26 2010 In my client (Thunderbird 3), it appears as a top level post. Its been
This odd post comes from reading the nice part about strings of chapter 4 of TDPL. In the last few years I have seen changes in how D strings are meant and managed, changes that make them less and less like arrays (random-access sequences of mutable code units) and more and more what they are at high level (immutable bidirectional sequences of code points). So a final jump is to make string types something different from normal arrays. This frees them to behave better as high level strings. Probably other people have had similar ideas, so if you think this post is boring or useless, please ignore it. So strings can have some differences compared to arrays: 1) length returns code points, but it's immutable and stored in the string, so it's an O(1) operation (returns what std.utf.count() returns). 2) Assigning a string to another one is allowed: string s1 = s2; But changing the length manually is not allowed: s1.length += 2; // error This makes them more close to immutable, more similar to Python strings. 3) Another immutable string-specific attribute can be added that returns the number of code units in the string, for example .codeunits or .nunits or something similar. 4) s1[i] is not O(1), it's generally slower, and returns the i-th code point (there are ways to speed this operation up to O(log n) with some internal indexes). (Code points in dstrings can be accessed in O(1)). 5) foreach(c; s1) yields a code point dchars regardless if s1 is string, dstring, wstring. But you can use foreach(char c; s1) if s1 is a string and you are sure s1 uses only 7 bit chars. But in such cases you can also use a immutable(ubyte)[]. Python3 does something similar, its has a str type that's always UTF16 (or UTF32) and a bytes type that is similar to a ubyte[]. I think D can do something similar. std.string functions can made to work on ubyte[] and immutable(ubyte)[] too. Some more things, I am not sure about them: 6) Strings can contain their hash value. This field is initialized with a empty value (like -1) and computed lazily on-demand. (as done in Python). This can make them faster when they often are put inside associative arrays and sets, avoiding to compute their hash value again and again. So strings are not fully immutable, because this value gets initialized. But it's a pure value, it's determined only by the immutable contents of the string, so I don't think this can cause big problems in multi-thread programs. If two threads update it, they find the same result. 7) the validation (std.utf.validat(), or a cast) can be necessary to create a string. This means that the type string/dstring/wstring implies it's validated :-) 8) If strings are immutable, then the append can always create a new string. So the extra information at the end of the memory block (used for appending to dynamic arrays) is not necessary. 9) - Today memory, even RAM, is cheap, but moving RAM to the CPU caches is not so fast, so a program that works on just the cache is faster. - UFT8 and UTF16 make strings bidirectional ranges. - UTF encodings are just data compression, but it's not so efficient. So a smarter compression scheme can compress strings more in memory, and the decrease in cache misses can compensate for the increased CPU work to decompress them. (But keeping strings compressed can turn them from bidirectional ranges to forward ranges, this is not so bad). There is a compressor that gives a decompression speed of about 3 times slower than memcpy(): http://www.oberhumer.com/opensource/lzo/ LZO can be used transparently to compress strings in RAM when strings become long enough. Hash computation and equality tests done among compressed strings are faster. Turning strings into something different from arrays looks like a loss, but in practice they already are not arrays, thinking about them as normal arrays is no so useful and it's UTF-unsafe, using [] to read the code units doesn't seem so useful. Code that manages strings/wstrings as normal arrays is not correct in general. Bye, bearophile
Jul 18 2010
5) foreach(c; s1) yields a code point dchars regardless if s1 is string,dstring, wstring.But you can use foreach(char c; s1) if s1 is a string and you are sure s1 usesonly 7 bit chars. But in such cases you can also use a immutable(ubyte)[]. Python3 does something similar, its has a str type that's always UTF16 (or UTF32) and a bytes type that is similar to a ubyte[]. I think D can do something similar. std.string functions can made to work on ubyte[] and immutable(ubyte)[] too. Actually I think D should depreciate char and wchar in user code except inside externals for compatibility with C and Windows functions. When you are working with individual characters you almost always want either a dchar or a byte.
Jul 19 2010
%u:When you are working with individual characters you almost always want either a dchar or a byte.A dchar and an ubyte are better. This is almost what Python3 does and I think it can be good. Maybe other people will give their opinions about my original post. Bye, bearophile
Jul 19 2010
bearophile wrote:This odd post comes from reading the nice part about strings of chapter 4 of TDPL. In the last few years I have seen changes in how D strings are meant and managed, changes that make them less and less like arrays (random-access sequences of mutable code units) and more and more what they are at high level (immutable bidirectional sequences of code points).Strings in D are deliberately meant to be arrays, not special things. Other languages make them special because they have insufficiently powerful arrays. As for indexing by code point, I also believe this is a mistake. It is proposed often, but overlooks: 1. most string operations, such as copying and searching, even regular expressions, work just fine using regular indices. 2. doing the operations in (1) using code points and having to continually decode the strings would result in disastrously slow code. 3. the user can always layer a code point interface over the strings, but going the other way is not so practical.
Jul 19 2010
== Quote from Walter Bright (newshound2 digitalmars.com)'s articlebearophile wrote:4. Sometimes one can make valid assumptions about the contents of a string. For example, in an internal utility app that will never be internationalized you may get away with assuming a character is an ASCII byte. If you know your input will be in the Basic Multilingual Plane (for example if working with pre-sanitized input), you can use wstrings and always assume a character is 2 bytes. 5. For dchar strings, a code unit equals a code point. Should the interface for dchar strings be completely different than that for char and wchar strings?This odd post comes from reading the nice part about strings of chapter 4 of TDPL. In the last few years I have seen changes in how D strings are meant and managed, changes that make them less and less like arrays (random-access sequences of mutable code units) and more and more what they are at high level (immutable bidirectional sequences of code points).Strings in D are deliberately meant to be arrays, not special things. Other languages make them special because they have insufficiently powerful arrays. As for indexing by code point, I also believe this is a mistake. It is proposed often, but overlooks: 1. most string operations, such as copying and searching, even regular expressions, work just fine using regular indices. 2. doing the operations in (1) using code points and having to continually decode the strings would result in disastrously slow code. 3. the user can always layer a code point interface over the strings, but going the other way is not so practical.
Jul 19 2010
On Mon, 19 Jul 2010 16:04:21 -0400, Walter Bright <newshound2 digitalmars.com> wrote:bearophile wrote:Andrei is changing that. Already, isRandomAccessRange!(string) == false. I kind of don't like this direction, even though its clever. What you end up with is phobos refusing to believe that a string or char[] is an array, but the compiler saying it is. What I'd prefer is something where the compiler types string literals as string, a type defined by phobos which contains as its first member an immutable(char)[] (where the compiler puts the literal). Then we can properly limit the other operations.This odd post comes from reading the nice part about strings of chapter 4 of TDPL. In the last few years I have seen changes in how D strings are meant and managed, changes that make them less and less like arrays (random-access sequences of mutable code units) and more and more what they are at high level (immutable bidirectional sequences of code points).Strings in D are deliberately meant to be arrays, not special things. Other languages make them special because they have insufficiently powerful arrays.As for indexing by code point, I also believe this is a mistake. It is proposed often, but overlooks: 1. most string operations, such as copying and searching, even regular expressions, work just fine using regular indices. 2. doing the operations in (1) using code points and having to continually decode the strings would result in disastrously slow code. 3. the user can always layer a code point interface over the strings, but going the other way is not so practical.I agree here. Anything that uses indexing to perform a linear operation is bound for the scrap heap. But what about this: foreach(c; str) which types c as char (or immutable char), not dchar. These are the subtle problems that we have with the dichotomy of phobos refusing to believe a string is an array, but the compiler believing it is. I think the default inference for this should be dchar, and phobos can make that true as long as it controls the string type. There are other points to consider: 1) a string *could be* indexed by character and return the code point being pointed to. 2) even slicing could be valid as long as the slice operator jumps back to the start of the dchar being encoded. This might make for very tricky code, but then again, such is the cost of trying to slice something like a utf-8 string :) But having the compiler force the string type to be an array, when it clearly isn't, doesn't help. Give the runtime the choice, like it's done for AA's, and I think we may have something that is workable, and doesn't suck performance-wise. -Steve
Jul 19 2010
Steven Schveighoffer wrote:On Mon, 19 Jul 2010 16:04:21 -0400, Walter Bright <newshound2 digitalmars.com> wrote:That decision may be a mistake.Strings in D are deliberately meant to be arrays, not special things. Other languages make them special because they have insufficiently powerful arrays.Andrei is changing that. Already, isRandomAccessRange!(string) == false. I kind of don't like this direction, even though its clever.I agree here. Anything that uses indexing to perform a linear operation is bound for the scrap heap. But what about this: foreach(c; str) which types c as char (or immutable char), not dchar.Probably too late to change that one.
Jul 19 2010
On 07/19/2010 11:31 PM, Walter Bright wrote:Steven Schveighoffer wrote:I think otherwise. In fact I confess I am extremely excited. The current state of affairs described built-in strings very accurately: they are formally bidirectional ranges, yet they offer random access for code units that you can freely use if you so wish. It's modeling reality very accurately. As far as I know, all algorithms in std.algorithm that work well without decoding are special-cased for strings to work fast and yield correct results. AndreiOn Mon, 19 Jul 2010 16:04:21 -0400, Walter Bright <newshound2 digitalmars.com> wrote:That decision may be a mistake.Strings in D are deliberately meant to be arrays, not special things. Other languages make them special because they have insufficiently powerful arrays.Andrei is changing that. Already, isRandomAccessRange!(string) == false. I kind of don't like this direction, even though its clever.
Jul 19 2010
Andrei Alexandrescu wrote:I think otherwise. In fact I confess I am extremely excited. The current state of affairs described built-in strings very accurately: they are formally bidirectional ranges, yet they offer random access for code units that you can freely use if you so wish. It's modeling reality very accurately. As far as I know, all algorithms in std.algorithm that work well without decoding are special-cased for strings to work fast and yield correct results.That's good to hear.
Jul 20 2010
Walter:As it turns out, indexing by byte is *far* more common than indexing by code unit, in fact, I've never ever needed to index by code unit.<OK.Probably too late to change that one.There is very little D2 code around, so little changes as this one are possible still. As alternative see also this enhancement request, partially coming from a post of mine in D.learn: http://d.puremagic.com/issues/show_bug.cgi?id=4483 (If you refuse this fallback idea too, then it's better to close this bug report. Keeping too much dead wood in Bugzilla will eventually cause small troubles.) In my original post my points 2, 6, 8 and 9 are still valid. To be nice they need strings to be a little different from standard arrays. Thanks for the comments, bearophile
Jul 20 2010
bearophile wrote:It's a D1 feature, and has been there since nearly the beginning.Probably too late to change that one.There is very little D2 code around, so little changes as this one are possible still.
Jul 20 2010
On Tue, 20 Jul 2010 14:29:43 -0400, Walter Bright <newshound2 digitalmars.com> wrote:bearophile wrote:Since when did we care about D1 compatibility? const, inout, array appending, final, typeof(string), TLS globals just to name a few... If you expect D1 code to compile fine and run on D2, you are deluding yourself. The worst that happens is that code starts using dchar instead of char, and either a compiler error occurs and it's fixed simply by doing: foreach(char c; str) or it compiles fine because the type is never explicitly stated, and what's the big deal there? The code just becomes more utf compatible for free :) -SteveIt's a D1 feature, and has been there since nearly the beginning.Probably too late to change that one.There is very little D2 code around, so little changes as this one are possible still.
Jul 20 2010
Steven Schveighoffer wrote:On Tue, 20 Jul 2010 14:29:43 -0400, Walter Bright <newshound2 digitalmars.com> wrote:We care about incompatibilities that silently break code.It's a D1 feature, and has been there since nearly the beginning.Since when did we care about D1 compatibility?const, inout, array appending, final, typeof(string), TLS globals just to name a few... If you expect D1 code to compile fine and run on D2, you are deluding yourself.No argument there, but we do try to avoid silent breakage.The worst that happens is that code starts using dchar instead of char, and either a compiler error occurs and it's fixed simply by doing: foreach(char c; str) or it compiles fine because the type is never explicitly stated, and what's the big deal there? The code just becomes more utf compatible for free :)I don't think it's necessarily true that the user really wanted the decoded character rather than the byte, or even that wanting the decoded character is more likely to be desired.
Jul 20 2010
Walter Bright wrote:Steven Schveighoffer wrote:Unfortunately it's inconsistent. foreach for ranges operates in terms of front, empty, popFront - just not for strings. I avoid foreach in std.algorithm and in generic code. For my money I'd be okay if foreach (c; str) wouldn't even compile - the user would be asked to specify the type. But if the implicit type is allowed, I'm afraid I believe that dchar would be the best choice. AndreiOn Tue, 20 Jul 2010 14:29:43 -0400, Walter Bright <newshound2 digitalmars.com> wrote:We care about incompatibilities that silently break code.It's a D1 feature, and has been there since nearly the beginning.Since when did we care about D1 compatibility?const, inout, array appending, final, typeof(string), TLS globals just to name a few... If you expect D1 code to compile fine and run on D2, you are deluding yourself.No argument there, but we do try to avoid silent breakage.The worst that happens is that code starts using dchar instead of char, and either a compiler error occurs and it's fixed simply by doing: foreach(char c; str) or it compiles fine because the type is never explicitly stated, and what's the big deal there? The code just becomes more utf compatible for free :)I don't think it's necessarily true that the user really wanted the decoded character rather than the byte, or even that wanting the decoded character is more likely to be desired.
Jul 20 2010
On 2010-07-20 00:31:34 -0400, Walter Bright <newshound2 digitalmars.com> said:Steven Schveighoffer wrote:Sad. That's one of the first things I tried when I first learned D and the result did surprise me. I expected foreach to iterate on characters (code points), not code units. Then I saw I could add 'dchar' to get that behaviour and found that to be not too bad. The big problem here is that ranges and foreach behave differently. A range that doesn't work with foreach isn't a good range. That's even worse when that range is at the core of the language because it'll look bad on both ranges and the language. As it stands now, when doing generic programming we'd have to write foreach like this so it works the same with foreach as it does with the range APIs, just in case the range is a string: foreach (ElementType!(typeof(range)) c; range) {} Something needs to change so the above always work the same as not specifying the type! Either foreach should be adapted or ranges should let go the idea of iterating on code points for the default string type. As for the "too late to change" stance, I'm not sure. It'll certainly be to late to change in a year, but right now D2 is still pretty new. What makes you say it's too late? -- Michel Fortin michel.fortin michelf.com http://michelf.com/I agree here. Anything that uses indexing to perform a linear operation is bound for the scrap heap. But what about this: foreach(c; str) which types c as char (or immutable char), not dchar.Probably too late to change that one.
Jul 20 2010
== Quote from Michel Fortin (michel.fortin michelf.com)'s articleOn 2010-07-20 00:31:34 -0400, Walter Bright<newshound2 digitalmars.com> said:andSteven Schveighoffer wrote:Sad. That's one of the first things I tried when I first learned DI agree here. Anything that uses indexing to perform a linear operation is bound for the scrap heap. But what about this: foreach(c; str) which types c as char (or immutable char), not dchar.Probably too late to change that one.the result did surprise me. I expected foreach to iterate oncharacters(code points), not code units. Then I saw I could add 'dchar' togetthat behaviour and found that to be not too bad. The big problem here is that ranges and foreach behavedifferently. Arange that doesn't work with foreach isn't a good range. That'sevenworse when that range is at the core of the language because it'lllookbad on both ranges and the language. As it stands now, when doing generic programming we'd have towriteforeach like this so it works the same with foreach as it doeswith therange APIs, just in case the range is a string: foreach (ElementType!(typeof(range)) c; range) {} Something needs to change so the above always work the same as not specifying the type! Either foreach should be adapted or rangesshouldlet go the idea of iterating on code points for the default stringtype. I'm wondering how bad would it be introduce a schar (short char, 1 byte) type and then let char simply map to a "default" char type: dchar, wchar, or whatever we tell the compiler. By default, char would map to dchar. alias char dchar; Coupling this with implicit cast of schar/char (from imported C code) to dchar, I think this might even help fix many such situations.
Jul 20 2010
On Tuesday, July 20, 2010 05:30:51 DCoder wrote:I'm wondering how bad would it be introduce a schar (short char, 1 byte) type and then let char simply map to a "default" char type: dchar, wchar, or whatever we tell the compiler. By default, char would map to dchar. alias char dchar; Coupling this with implicit cast of schar/char (from imported C code) to dchar, I think this might even help fix many such situations.That doesn't really gain us anything. It would likely just make it so that char would mean dchar and be used by default. And honestly, using dchar is not really the solution. If you wan to do that, you can. However, using dchar in the general case is not necessarily a good idea since it wastes so much space. On top of all that, TDPL was _very_ clear on the differences between char, wchar, and dchar and TDPL is supposed to stay accurate. So, any changes which would contradict TDPL need a _very_ good reason for being made, or they won't be. Overall, strings in D work great. The only issue really is making it so that you properly deal with the cases where you need to treat them as code points vs when you can treat them as code units. Any programmer who wants to entirely avoid the problem can just using dchar and dstring. For the rest, you need to understand how string and wstring work and just handle them appropriately. There may be a few places where things should be smoothed out, but overall, I really do think that they work well as they are. - Jonathan M Davis
Jul 20 2010
Michel Fortin wrote:As for the "too late to change" stance, I'm not sure. It'll certainly be to late to change in a year, but right now D2 is still pretty new. What makes you say it's too late?As I said to bearophile, it's a D1 feature.
Jul 20 2010
Walter Bright:1. most string operations, such as copying and searching, even regular expressions, work just fine using regular indices. 2. doing the operations in (1) using code points and having to continually decode the strings would result in disastrously slow code.In my original post I have forgotten another difference over arrays: 5b) a method like ".unit()" that allows to index code units. So "foo".unit(1) is always O(1). Lower level code can use this method as [] is used for arrays. Copying is done on the bytes themselves, with a memcpy, no decoding necessary. If the point (9) (automatic LZO encoding) is used, then copying can be 2-3 times faster for long strings (because there is less data and you don't need to uncompress it to copy). (if such compression is added, then strings can need a third accessor method, to the true bytes).3. the user can always layer a code point interface over the strings, but going the other way is not so practical.This is true. But it makes the string usage unnecessarily low-level and hard... A better design in a smart system language as D is to give strings a default high level "interface" that sees strings as what they are at high level, and add a second lower level interface when you need faster lower-level fiddling (so they have [] that returns code points and unit() that returns code units). Bye, bearophile
Jul 19 2010
bearophile wrote:Walter Bright:This is backwards. The [i] should behave as expected for arrays. As it turns out, indexing by byte is *far* more common than indexing by code unit, in fact, I've never ever needed to index by code unit. (Though it is sometimes necessary to step through by code unit, that's different from indexing by code unit.)1. most string operations, such as copying and searching, even regular expressions, work just fine using regular indices. 2. doing the operations in (1) using code points and having to continually decode the strings would result in disastrously slow code.In my original post I have forgotten another difference over arrays: 5b) a method like ".unit()" that allows to index code units. So "foo".unit(1) is always O(1). Lower level code can use this method as [] is used for arrays.I don't believe that manipulating strings in D is hard, even if you do have to work with multibyte characters. You do have to be aware they are multibyte, but I think that just comes with being a programmer. A better design in a smart system language as D is to give strings a3. the user can always layer a code point interface over the strings, but going the other way is not so practical.This is true. But it makes the string usage unnecessarily low-level and hard...default high level "interface" that sees strings as what they are at high level, and add a second lower level interface when you need faster lower-level fiddling (so they have [] that returns code points and unit() that returns code units).I have some moderate experience with using utf. First there's the D javascript engine, which is fully utf'd. The D string design fits in with it perfectly. Then there are chunks of C++ ascii-only code I've translated to D, and it then worked with utf-8 without further modification. Based on that, I believe the D string design hits the sweet spot between efficiency and utility.
Jul 19 2010
On 07/19/2010 11:29 PM, Walter Bright wrote:bearophile wrote:Exactly. And that's what the bidirectional range interface is doing for strings.Walter Bright:This is backwards. The [i] should behave as expected for arrays. As it turns out, indexing by byte is *far* more common than indexing by code unit, in fact, I've never ever needed to index by code unit. (Though it is sometimes necessary to step through by code unit, that's different from indexing by code unit.)1. most string operations, such as copying and searching, even regular expressions, work just fine using regular indices. 2. doing the operations in (1) using code points and having to continually decode the strings would result in disastrously slow code.In my original post I have forgotten another difference over arrays: 5b) a method like ".unit()" that allows to index code units. So "foo".unit(1) is always O(1). Lower level code can use this method as [] is used for arrays.I agree. In fact there is no language I know that can compete with D at UTF string handling. AndreiI don't believe that manipulating strings in D is hard, even if you do have to work with multibyte characters. You do have to be aware they are multibyte, but I think that just comes with being a programmer. A better design in a smart system language as D is to give strings a3. the user can always layer a code point interface over the strings, but going the other way is not so practical.This is true. But it makes the string usage unnecessarily low-level and hard...default high level "interface" that sees strings as what they are at high level, and add a second lower level interface when you need faster lower-level fiddling (so they have [] that returns code points and unit() that returns code units).I have some moderate experience with using utf. First there's the D javascript engine, which is fully utf'd. The D string design fits in with it perfectly. Then there are chunks of C++ ascii-only code I've translated to D, and it then worked with utf-8 without further modification. Based on that, I believe the D string design hits the sweet spot between efficiency and utility.
Jul 19 2010
Walter Bright Wrote:bearophile wrote:I've had the same experience. The proposed changes would make string useless to me, even for Unicode work. I'd end up using ubyte[] instead.Walter Bright:This is backwards. The [i] should behave as expected for arrays. As it turns out, indexing by byte is *far* more common than indexing by code unit, in fact, I've never ever needed to index by code unit. (Though it is sometimes necessary to step through by code unit, that's different from indexing by code unit.)1. most string operations, such as copying and searching, even regular expressions, work just fine using regular indices. 2. doing the operations in (1) using code points and having to continually decode the strings would result in disastrously slow code.In my original post I have forgotten another difference over arrays: 5b) a method like ".unit()" that allows to index code units. So "foo".unit(1) is always O(1). Lower level code can use this method as [] is used for arrays.
Jul 20 2010
Sean Kelly wrote:I've had the same experience. The proposed changes would make string useless to me, even for Unicode work. I'd end up using ubyte[] instead.At this point, the experience with D strings is that D gets it right. To change it would require someone who has spent a *lot* of hours programming in utf making a very compelling argument.
Jul 20 2010
Pragmatically, I seem to have noted that in languages with low level strings, people invariably come up with librairies that provide higher-level strings. C/C++ provided low-level strings only initially, then a not-so-powerful std::string; and we saw QString, wxString, irr::string, BetterString, countless others... Java, on the other end, provided a powerful high-level String object from the start; and to my knowledge it is used consistently in all Java programs with no other string classes being made. I do acknowlegde that D arrays are much better than C/C++ arrays. Still, my prediction is that if D chooses to stick to C-style function calls, and does not provide a standard high-level String object, then a myriad of string objects will start popping around. Because lots of people like OOP and don't like C-style calls. Just my 2c :) I mean be wrong -- Auria Walter Bright Wrote:Strings in D are deliberately meant to be arrays, not special things. Other languages make them special because they have insufficiently powerful arrays. As for indexing by code point, I also believe this is a mistake. It is proposed often, but overlooks: 1. most string operations, such as copying and searching, even regular expressions, work just fine using regular indices. 2. doing the operations in (1) using code points and having to continually decode the strings would result in disastrously slow code. 3. the user can always layer a code point interface over the strings, but going the other way is not so practical.
Jul 19 2010
On Monday, July 19, 2010 16:49:58 Marianne Gagnon wrote:Pragmatically, I seem to have noted that in languages with low level strings, people invariably come up with librairies that provide higher-level strings. C/C++ provided low-level strings only initially, then a not-so-powerful std::string; and we saw QString, wxString, irr::string, BetterString, countless others... Java, on the other end, provided a powerful high-level String object from the start; and to my knowledge it is used consistently in all Java programs with no other string classes being made. I do acknowlegde that D arrays are much better than C/C++ arrays. Still, my prediction is that if D chooses to stick to C-style function calls, and does not provide a standard high-level String object, then a myriad of string objects will start popping around. Because lots of people like OOP and don't like C-style calls. Just my 2c :) I mean be wrong -- AuriaFor the most part, D's strings are plenty high level. Between how fantastic D's arrays are and the fact that you can call functions on them as if they were objects, D's strings are quite high level in terms of how you use them. Their solution for how to deal with unicode is also quite powerful. The problem is that the solution for how to deal with unicode makes it so that if you try and deal with individual chars or wchars, you're going to very quickly shoot yourself in the foot. If you want to avoid the problem entirely, you simply use dstring and they're at least as powerful - and arguably more so - than Java's strings. The only issue with strings in D that I'm aware of is the danger with trying to deal with individual characters. But their are lots of great functions for dealing with strings, and they allow you to easily deal with individual characters by just using them as single-character strings rather than as chars or wchars. On the whole, D's strings are the best strings that I've used. - Jonathan M Davis
Jul 19 2010
What about: struct String { string items; alias items this; } And add the needed functions you wish to have in string and it will still work in existing functions that operate on immutable(char)[]
Jul 19 2010
On 07/19/2010 06:51 PM, Jesse Phillips wrote:What about: struct String { string items; alias items this; } And add the needed functions you wish to have in string and it will still work in existing functions that operate on immutable(char)[]Fortunately you can essentially achieve the above by simply writing free functions that take a string or a ref string as their first argument. Then you can use str.foo(args) as an alternative for foo(str, args). Andrei
Jul 19 2010
On Mon, 19 Jul 2010 20:26:47 -0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:On 07/19/2010 06:51 PM, Jesse Phillips wrote:How do we make this work? auto str = "hello world"; foreach(c; str) assert(is(typeof(c) == dchar)); -SteveWhat about: struct String { string items; alias items this; } And add the needed functions you wish to have in string and it will still work in existing functions that operate on immutable(char)[]Fortunately you can essentially achieve the above by simply writing free functions that take a string or a ref string as their first argument. Then you can use str.foo(args) as an alternative for foo(str, args).
Jul 20 2010
Steven Schveighoffer Wrote:How do we make this work? auto str = "hello world"; foreach(c; str) assert(is(typeof(c) == dchar));foreach (dchar c; str) assert(...); This feature has been in D for years.
Jul 20 2010
Sean Kelly wrote:Steven Schveighoffer Wrote:And what about this one: void func(T) (T range) { foreach (elem; range) assert (is (typeof (elem) =3D=3D ElementType!(T))); } func ("azerty"); auto a =3D [ 1, 2, 3, 4, 5]; func (a); Jerome --=20 mailto:jeberger free.fr http://jeberger.free.fr Jabber: jeberger jabber.frHow do we make this work? auto str =3D "hello world"; foreach(c; str) assert(is(typeof(c) =3D=3D dchar));=20 foreach (dchar c; str) assert(...); =20 This feature has been in D for years.
Jul 20 2010
Jérôme M. Berger wrote:And what about this one: void func(T) (T range) { foreach (elem; range) assert (is (typeof (elem) == ElementType!(T))); } func ("azerty"); auto a = [ 1, 2, 3, 4, 5]; func (a);You can specialize the template for strings: void func(T:string)(T range) { ... }
Jul 20 2010
Walter Bright wrote:J=C3=A9r=C3=B4me M. Berger wrote:Sure, i can also not use a template and write however many overloaded functions I need. So what are templates for? Jerome --=20 mailto:jeberger free.fr http://jeberger.free.fr Jabber: jeberger jabber.frAnd what about this one: void func(T) (T range) { foreach (elem; range) assert (is (typeof (elem) =3D=3D ElementType!(T))); } func ("azerty"); auto a =3D [ 1, 2, 3, 4, 5]; func (a);=20 You can specialize the template for strings: =20 void func(T:string)(T range) { ... }
Jul 20 2010
Jérôme M. Berger wrote:Walter Bright wrote:The overloaded template specialization capability is exactly because it is often advantageous to write custom versions for certain types. The user of the template doesn't see that, it looks generic to him.You can specialize the template for strings: void func(T:string)(T range) { ... }Sure, i can also not use a template and write however many overloaded functions I need. So what are templates for?
Jul 20 2010
"Walter Bright" <newshound2 digitalmars.com> ÓÏÏÂÝÉÌ/ÓÏÏÂÝÉÌÁ × ÎÏ×ÏÓÔÑÈ ÓÌÅÄÕÀÝÅÅ: news:i24st1$12uh$1 digitalmars.com...Jerome M. Berger wrote:Hmm. Theoreticaly a bit more general void func(T, U, V )(T rangeT, U rangeU, V rangeV) { ... } void func(T:string, U, V )(T rangeT, U rangeU, V rangeV) { ... } void func(T, U:string, V )(T rangeT, U rangeU, V rangeV) { ... } void func(T, U, V:string )(T rangeT, U rangeU, V rangeV) { ... } void func(T:string, U:string, V )(T rangeT, U rangeU, V rangeV) { ... } void func(T:string, U, V:string )(T rangeT, U rangeU, V rangeV) { ... } void func(T, U:string, V:string )(T rangeT, U rangeU, V rangeV) { ... } void func(T:string, U:string, V:string )(T rangeT, U rangeU, V rangeV) { ... }And what about this one: void func(T) (T range) { foreach (elem; range) assert (is (typeof (elem) == ElementType!(T))); } func ("azerty"); auto a = [ 1, 2, 3, 4, 5]; func (a);You can specialize the template for strings: void func(T:string)(T range) { ... }
Jul 21 2010
On Tue, 20 Jul 2010 11:02:57 -0400, Sean Kelly <sean invisibleduck.org> wrote:Steven Schveighoffer Wrote:The omission of dchar is on purpose. Phobos has characterized string as a bidirectional range of dchars. For every range where I do: foreach(e; range) e is of the type of the range. Except for char and wchar. This schizophrenia of type induction is very bad for D, and it's a good argument of why strings should not simply be arrays. -SteveHow do we make this work? auto str = "hello world"; foreach(c; str) assert(is(typeof(c) == dchar));foreach (dchar c; str) assert(...); This feature has been in D for years.
Jul 20 2010
Steven Schveighoffer wrote:The omission of dchar is on purpose. Phobos has characterized string as a bidirectional range of dchars. For every range where I do: foreach(e; range) e is of the type of the range. Except for char and wchar. This schizophrenia of type induction is very bad for D, and it's a good argument of why strings should not simply be arrays.For many algorithms on strings, iterating by char is preferred over dchar, even for multibyte strings.
Jul 20 2010
On Tue, 20 Jul 2010 15:21:34 -0400, Walter Bright <newshound2 digitalmars.com> wrote:Steven Schveighoffer wrote:Huh? Which ones? AFAIK, all of std.algorithm treats strings as ranges of dchar. I am 100% in agreement with you that indexing and length should be done by char. All I'm talking about is foreach. -SteveThe omission of dchar is on purpose. Phobos has characterized string as a bidirectional range of dchars. For every range where I do: foreach(e; range) e is of the type of the range. Except for char and wchar. This schizophrenia of type induction is very bad for D, and it's a good argument of why strings should not simply be arrays.For many algorithms on strings, iterating by char is preferred over dchar, even for multibyte strings.
Jul 20 2010
Steven Schveighoffer wrote:On Tue, 20 Jul 2010 15:21:34 -0400, Walter Bright <newshound2 digitalmars.com> wrote:Searching, for one.Steven Schveighoffer wrote:Huh? Which ones?The omission of dchar is on purpose. Phobos has characterized string as a bidirectional range of dchars. For every range where I do: foreach(e; range) e is of the type of the range. Except for char and wchar. This schizophrenia of type induction is very bad for D, and it's a good argument of why strings should not simply be arrays.For many algorithms on strings, iterating by char is preferred over dchar, even for multibyte strings.AFAIK, all of std.algorithm treats strings as ranges of dchar.Andrei posted elsewhere that there were specializations for strings to do it one way or the other based on which was more efficient.
Jul 20 2010
Walter Bright wrote:Steven Schveighoffer wrote:Boyer-Moore comes to mind. AndreiOn Tue, 20 Jul 2010 15:21:34 -0400, Walter Bright <newshound2 digitalmars.com> wrote:Searching, for one.Steven Schveighoffer wrote:Huh? Which ones?The omission of dchar is on purpose. Phobos has characterized string as a bidirectional range of dchars. For every range where I do: foreach(e; range) e is of the type of the range. Except for char and wchar. This schizophrenia of type induction is very bad for D, and it's a good argument of why strings should not simply be arrays.For many algorithms on strings, iterating by char is preferred over dchar, even for multibyte strings.AFAIK, all of std.algorithm treats strings as ranges of dchar.Andrei posted elsewhere that there were specializations for strings to do it one way or the other based on which was more efficient.
Jul 20 2010
On Tue, 20 Jul 2010 01:51:51 +0200, Jesse Phillips <jessekphillips+d gmail.com> wrote:What about: struct String { string items; alias items this; } And add the needed functions you wish to have in string and it will still work in existing functions that operate on immutable(char)[]You shouldn't need to do that: string strstr(string haystack, string needle); can be used as: string s; s.strstr("needle"); so you can add "methods" to a string or whatever just by defining functions. -Rory
Jul 20 2010
On Tue, 20 Jul 2010 16:08:06 +0200, Jesse Phillips <jesse.k.phillips gmail.com> wrote:But then you can't overload operators. On Tue, Jul 20, 2010 at 12:54 AM, Rory McGuire <rmcguire neonova.co.za> wrote:such as?On Tue, 20 Jul 2010 01:51:51 +0200, Jesse Phillips <jessekphillips+d gmail.com> wrote:What about: struct String { string items; alias items this; } And add the needed functions you wish to have in string and it will still work in existing functions that operate on immutable(char)[]You shouldn't need to do that: string strstr(string haystack, string needle); can be used as: string s; s.strstr("needle"); so you can add "methods" to a string or whatever just by defining functions. -Rory
Jul 20 2010
On Tue, 20 Jul 2010 16:51:57 +0200, Rory McGuire <rmcguire neonova.co.za> wrote:On Tue, 20 Jul 2010 16:08:06 +0200, Jesse Phillips <jesse.k.phillips gmail.com> wrote:I mean is there not another way to do the same thing?But then you can't overload operators. On Tue, Jul 20, 2010 at 12:54 AM, Rory McGuire <rmcguire neonova.co.za> wrote:such as?On Tue, 20 Jul 2010 01:51:51 +0200, Jesse Phillips <jessekphillips+d gmail.com> wrote:What about: struct String { string items; alias items this; } And add the needed functions you wish to have in string and it will still work in existing functions that operate on immutable(char)[]You shouldn't need to do that: string strstr(string haystack, string needle); can be used as: string s; s.strstr("needle"); so you can add "methods" to a string or whatever just by defining functions. -Rory
Jul 20 2010
Rory McGuire <rmcguire neonova.co.za> wrote: [snip] Rory, is there something wrong with your newsreader? I keep seeing your posts as replies only to the top post. -- Simen
Jul 20 2010
On Tue, 20 Jul 2010 18:35:12 +0200, Simen kjaeraas <simen.kjaras gmail.com> wrote:Rory McGuire <rmcguire neonova.co.za> wrote: [snip] Rory, is there something wrong with your newsreader? I keep seeing your posts as replies only to the top post.I'm using opera mail. Any suggestions for Linux+Windows, excluding thunderbird(slow)?
Jul 20 2010
Rory McGuire wrote:I'm using opera mail. Any suggestions for Linux+Windows, excluding thunderbird(slow)?I use thunderbird on both windows & linux, haven't noticed speed problems other than my slow internet connection. I have noticed that thunderbird uses multithreading fairly effectively to speed itself up.
Jul 20 2010
On Tuesday, July 20, 2010 11:45:41 Rory McGuire wrote:On Tue, 20 Jul 2010 18:35:12 +0200, Simen kjaeraas <simen.kjaras gmail.com> wrote:Well, since I'm a kde user, I use knode if I want a newsreader and kmail if I want a mail client. I prefer knode for dealing with newsgroups rather than using a mail list with kmail, but I do sometimes end up using kmail with mail lists rather than knode because I can take advantage of imap and have stuff properly synced between my machines. - Jonathan M DavisRory McGuire <rmcguire neonova.co.za> wrote: [snip] Rory, is there something wrong with your newsreader? I keep seeing your posts as replies only to the top post.I'm using opera mail. Any suggestions for Linux+Windows, excluding thunderbird(slow)?
Jul 20 2010
I did not read all the discussion in detail, but in my opinion something that would be very useful in a library is struct String{ void *ptr; size_t _l; enum :size_t { MaskLen=((~cast(size_t)0)>>2) } enum :int { BitsLen=8*size_t.sizeof-2 } size_t len(){ return (_l & MaskLen); } int encodingId(){ return cast(int)(_l>>BitsLen); } } plus stuff to simplify its creation from T[] arrays and getting T[] arrays from it. this type would them be used where one wants a string without caring about its encoding, and without having to make all string accepting functions templates. As it was explained by others many string operations are rather generic. *this* is what I would have expected from string, not an alias to char[]. Fawzi
Jul 20 2010
Simen kjaeraas Wrote:Rory McGuire <rmcguire neonova.co.za> wrote: [snip] Rory, is there something wrong with your newsreader? I keep seeing your posts as replies only to the top post. -- SimenActually I'm getting his messages as emails, and just thought he was only sending to me.
Jul 20 2010
Am 20.07.2010 19:15, schrieb Jesse Phillips:Simen kjaeraas Wrote:Same here. It's kindof annoying tbh. /MaxRory McGuire<rmcguire neonova.co.za> wrote: [snip] Rory, is there something wrong with your newsreader? I keep seeing your posts as replies only to the top post. -- SimenActually I'm getting his messages as emails, and just thought he was only sending to me.
Jul 20 2010
Jesse Phillips wrote:Simen kjaeraas Wrote: =20rRory McGuire <rmcguire neonova.co.za> wrote: [snip] Rory, is there something wrong with your newsreader? I keep seeing you=ly sending to me. He's hitting "reply to all" instead of just plain "reply". Funny thing is I don't have any problem with his messages but yours does appear as a reply to the top post... Jerome --=20 mailto:jeberger free.fr http://jeberger.free.fr Jabber: jeberger jabber.frposts as replies only to the top post. --=20 Simen=20 Actually I'm getting his messages as emails, and just thought he was on=
Jul 20 2010
On 20/07/2010 20:41, "Jérôme M. Berger" wrote:Jesse Phillips wrote:In my client (Thunderbird 3), it appears as a top level post. Its been happening several times (with different people) since I started using TB 3 I think. -- Bruno Medeiros - Software EngineerSimen kjaeraas Wrote:Funny thing is I don't have any problem with his messages but yours does appear as a reply to the top post... Jerome
Jul 26 2010