
digitalmars.D - Why foreach(c; someString) must yield dchar

reply dsimcha <dsimcha yahoo.com> writes:
I've been hacking in Phobos and parallelfuture and I've come to the conclusion
that having typeof(c) in the expression foreach(c; string.init) not be a dchar
is simply ridiculous.  I don't care how much existing code gets broken, this
needs to be fixed.  Otherwise, all generic code will have to deal with it as a
special case.  Most of it will probably overlook this special case in
practice, and the net result will be more broken code than if we just bite the
bullet and fix this now.  Here are some examples of the absurdities created by
the current situation:

static assert(is(typeof({
    foreach(elem; T.init) {
        return elem;
    }
    assert(0);
}()) == ElementType!(T)));

Looks reasonable.  FAILS on narrow strings.

size_t walkLength1(R)(R input) {
    size_t ret = 0;
    foreach(elem; input) {
        ret++;
    }

    return ret;
}

size_t walkLength2(R)(R input) {
    size_t ret = 0;
    while(!input.empty) {
       ret++;
       input.popFront();
    }

    return ret;
}

assert(walkLength1(stuff) == walkLength2(stuff));

FAILS if stuff is a narrow string containing characters that take more than one
code unit.

void printRange(R)(R range) {
    foreach(elem; range) {
        write(elem, ' ');
    }
    writeln();
}

Prints garbage if range is a string containing characters that take more than one
code unit.

auto rangeMax(R)(R range) {
    enforce(!range.empty);

    auto ret = range.front;
    foreach(elem; range) {
        if(elem > ret) {
            ret = elem;
        }
    }

    return ret;
}

This will not find the largest character in the range if R is a narrow string.

If D is at all serious about generic programming, we simply can't require this
to be dealt with **everywhere** as a special case.
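To be clear about what "dealt with as a special case" means in practice, here is
roughly what every one of the functions above has to turn into today. This is just
a sketch; the helper name and the exact trait used are illustrative, not actual
Phobos code:

import std.traits;   // isSomeString

size_t walkLengthFixed(R)(R input) {
    size_t ret = 0;

    static if(isSomeString!R) {
        // Narrow strings must be decoded explicitly, or we count code
        // units instead of code points.
        foreach(dchar elem; input) {
            ret++;
        }
    } else {
        foreach(elem; input) {
            ret++;
        }
    }

    return ret;
}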
Aug 18 2010
next sibling parent reply Rainer Deyke <rainerd eldwood.com> writes:
On 8/18/2010 20:37, dsimcha wrote:
 I've been hacking in Phobos and parallelfuture and I've come to the conclusion
 that having typeof(c) in the expression foreach(c; string.init) not be a dchar
 is simply ridiculous.
I have long ago come to the opposite conclusion. An array of 'char' should act like any other array. If you want a sequence of 'dchar' that is internally stored as an array of 'char', don't call it 'char[]'.

You propose to fix a special case by adding more special cases. This will increase, not decrease, the number of cases that will need special treatment in generic code.

Iterating over a sequence of 'char' as a sequence of 'dchar' is very useful. Implementing this functionality as a language feature, tied to the built-in array type, is just plain wrong.
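To be concrete, the kind of library adapter I mean is roughly this (a sketch only; I'm not claiming Phobos spells it this way):

import std.utf;

// Presents a char[] as a range of dchar, with no help from the language.
struct ByDchar {
    private const(char)[] str;

    bool empty() { return str.length == 0; }

    dchar front() {
        size_t index = 0;
        return decode(str, index);       // decode one code point
    }

    void popFront() {
        str = str[stride(str, 0) .. $];  // skip that code point's code units
    }
}

ByDchar byDchar(const(char)[] str) { return ByDchar(str); }

An analogous adapter can wrap 'Array!char' or anything else that exposes code units; nothing about the idiom needs to be tied to the built-in array type.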
 static assert(is(typeof({
     foreach(elem; T.init) {
         return elem;
     }
     assert(0);
 }) == ElementType!(T));
 
 Looks reasonable.  FAILS on narrow strings.
Because ElementType!(string) is broken.
 size_t walkLength1(R)(R input) {
     size_t ret = 0;
     foreach(elem; input) {
         ret++;
     }
 
     return ret;
 }
 
 size_t walkLength2(R)(R input) {
     size_t ret = 0;
     while(!input.empty) {
        ret++;
        input.popFront();
     }
 
     return ret;
 }
 
 assert(walkLength1(stuff) == walkLength2(stuff));
 
 FAILS if stuff is a narrow string with characters that aren't a single code
point.
Because 'popFront' is broken for narrow strings.
 void printRange(R)(R range) {
     foreach(elem; range) {
         write(elem, ' ');
     }
     writeln();
 }
 
 Prints garbage if range is a string with characters that aren't a single code
 point.
Prints bytes from the string separated by spaces. This may be intentional behavior if the parser on the other side is not utf-aware.

-- 
Rainer Deyke - rainerd eldwood.com
Aug 18 2010
parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Wed, 18 Aug 2010 23:11:26 -0400, Rainer Deyke <rainerd eldwood.com>  
wrote:

 On 8/18/2010 20:37, dsimcha wrote:
 I've been hacking in Phobos and parallelfuture and I've come to the  
 conclusion
 that having typeof(c) in the expression foreach(c; string.init) not be  
 a dchar
 is simply ridiculous.
I have long ago come to the opposite conclusion. An array of 'char' should act like any other array. If you want a sequence of 'dchar' that is internally stored as an array of 'char', don't call it 'char[]'.
I have to agree with Rainer here. I think maybe string shouldn't just be an immutable(char)[]. I'd rather see it as a struct that wraps a char[] and presents the appropriate interface. Ditto for wchar.

-Steve
Aug 23 2010
prev sibling next sibling parent reply Jonathan M Davis <jmdavisprog gmail.com> writes:
On Wednesday 18 August 2010 19:37:04 dsimcha wrote:
 I've been hacking in Phobos and parallelfuture and I've come to the
 conclusion that having typeof(c) in the expression foreach(c; string.init)
 not be a dchar is simply ridiculous.  I don't care how much existing code
 gets broken, this needs to be fixed.  Otherwise, all generic code will
 have to deal with it as a special case.  Most of it will probably overlook
 this special case in practice, and the net result will be more broken code
 than if we just bite the bullet and fix this now.  Here are some examples
 of the absurdities created by the current situation:
 
[snip]
 If D is at all serious about generic programming, we simply can't require
 this to be dealt with **everywhere** as a special case.
Considering that in all likelihood 99+% of the cases where someone is iterating over char, they really want dchar, I have no problem whatsoever with such a change. It may break existing code, but I'd expect that it's more likely to fix it. People could still iterate over char or wchar if they want to - they'd just have to specify the type.

The one thing about it that bugs me is that it means that foreach acts differently with chars and wchars than it does with everything else, but really, that's a _lot_ less of an issue than the problems that you get with generic programming where you have to special case strings all over the place.

As I understand it, Walter doesn't want to do this because it silently breaks D1 code. However, since odds are that that code should have been iterating over dchars in the first place, I really think that this change is worth making. In light of the costs to generic programming and the fact that programmers the world over are going to screw up when using foreach with strings when only a bare handful are actually going to want to iterate over chars or wchars, I'd say that making this change is worth it. Yes, it may break some existing code, but one, I'd expect that it would _fix_ more code than it breaks, and two, this will forever be a recurring bug in D programs if it doesn't get fixed. You _know_ that the average programmer is going to screw this up and that experienced ones will periodically forget to specify the type for foreach and get bitten by it, and the cost to generic programming is obviously very high if we leave it as is.

So, I'd definitely vote to make it so that foreach over chars and wchars defaults to dchar. The pain that it will save is _far_ more than the pain that it will cost.

- Jonathan M Davis
Aug 18 2010
next sibling parent Rainer Deyke <rainerd eldwood.com> writes:
On 8/18/2010 21:12, Jonathan M Davis wrote:
 The one thing about it that bugs me is that it means 
 that foreach acts differently with chars and wchars then it does with
everything 
 else, but really, that's a _lot_ less of an issue than the problems that you
get 
 with generic programming where you have to special case strings all over the 
 place.
False dichotomy. If foreach acts differently with chars and wchars than it does with everything else, then you /do/ need to special case strings all over the place.

Thought experiment: what happens if you iterate not over 'char[]', but over 'Array!char'?

-- 
Rainer Deyke - rainerd eldwood.com
Aug 18 2010
prev sibling parent reply Kagamin <spam here.lot> writes:
Jonathan M Davis Wrote:

 Considering that in all likelihood 99+% of the cases where someone is
iterating 
 over char, they really want dchar
And when someone is iterating over byte[] or short[], they want long, right? Yeah, why not?
Aug 18 2010
parent reply Jonathan Davis <jmdavisprog gmail.com> writes:
On 8/19/10, Kagamin <spam here.lot> wrote:
 Jonathan M Davis Wrote:

 Considering that in all likelihood 99+% of the cases where someone is
 iterating
 over char, they really want dchar
And when someone is iterating over byte[] or short[], they want long, right? Yeah, why not?
The problem is that chars are not characters. They are UTF-8 code units. If all you're using is ASCII, you can get away with treating them like one byte characters, but that doesn't work if you have any characters which aren't in ASCII. dchars _are_ characters. The correct way to iterate over a string or wstring if you want to treat the elements as characters is to give the type as dchar.

foreach(dchar c; mystring) {
    //...
}

If you use char or wchar, you're going to iterate over code units, which is completely different. It is not generally the case that that is the correct thing to do. If someone does that in their code, odds are that it's a bug.

bytes and shorts are legitimate values on their own, so it wouldn't make sense to give the type to foreach as long. You can deal with each byte or short on its own just fine. You can't safely do that with code units unless for some reason, you actually want to operate on code units (which is unlikely), or you don't actually care about the contents of the string for whatever you're doing (since some algorithms don't actually care about the contents of the arrays/ranges that they're dealing with).

So, it's almost a guarantee that the correct type for iterating over a string or wstring is dchar, not char or wchar. String types are just weird that way due to how multibyte unicode encodings work. So, since it makes so little sense to iterate over chars or wchars by default, it would make sense to make the default dchar.

- Jonathan M Davis
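P.S. To make the code unit/code point distinction concrete, a tiny made-up example:

void main() {
    string s = "é";            // one character, but two UTF-8 code units

    foreach(c; s) {
        // c is char here: this body runs twice, and each c is half of 'é',
        // meaningless on its own.
    }

    foreach(dchar c; s) {
        // c is dchar: this body runs once, and c is the whole character.
    }
}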
Aug 19 2010
next sibling parent reply Kagamin <spam here.lot> writes:
Jonathan Davis Wrote:

 bytes and shorts are legitimate values on their own, so it wouldn't
 make sense to give the type to foreach as long.
Having a wider integer always makes sense.
 byte or short on its own just fine.
Yes, but odds are that it's a bug. You can easily hit an overflow.
 So, it's almost a guarantee that the correct type for iterating over a
 string or wstring is dchar, not char or wchar. String types are just
 weird that way due to how multibyte unicode encodings work.
If you don't like narrow strings, don't use them. Use dstring. You are free to write what you want.
 So, since it makes so little sense to iterate over chars or wchars by default,
 it would make sense to make the default dchar.
It's an iteration over array items. This makes perfect sense.
Aug 19 2010
next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Kagamin (spam here.lot)'s article
 Jonathan Davis Wrote:
 bytes and shorts are legitimate values on their own, so it wouldn't
 make sense to give the type to foreach as long.
Having wider integer always has sense.
 byte or short on its own just fine.
Yes, but odds are that it's a bug. You can easily hit an overflow.
 So, it's almost a guarantee that the correct type for iterating over a
 string or wstring is dchar, not char or wchar. String types are just
 weird that way due to how multibyte unicode encodings work.
If you don't like narrow strings, don't use them. Use dstring. You are free to
 write what you want.

One major problem with this is the brokenness of std.string on non-UTF8 strings. Otherwise this would be a good solution provided you're not dealing with tons of strings, so space efficiency isn't a major concern.

Hmm, lately I've been focusing my hacking efforts on debugging/polishing/removing annoying inconsistencies in Phobos. Maybe std.string should be my next target. It's generally a frustrating module because in addition to the wide character issue, lots of stuff requires immutable strings when it could work correctly and safely with a const or mutable string.
Aug 19 2010
parent reply Kagamin <spam here.lot> writes:
dsimcha Wrote:

 Hmm, lately I've been focusing my hacking efforts on
debugging/polishing/removing
 annoying inconsistencies in Phobos.  Maybe std.string should be my next target.
 It's generally a frustrating module because in addition to the wide character
 issue, lots of stuff requires immutable strings when it could work correctly
and
 safely with a const or mutable string.
They say there're bugs with inout. You don't need them fixed?
Aug 19 2010
parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Kagamin (spam here.lot)'s article
 dsimcha Wrote:
 Hmm, lately I've been focusing my hacking efforts on
debugging/polishing/removing
 annoying inconsistencies in Phobos.  Maybe std.string should be my next target.
 It's generally a frustrating module because in addition to the wide character
 issue, lots of stuff requires immutable strings when it could work correctly
and
 safely with a const or mutable string.
They say, there're bugs with inout. You don't need them fixed?
No. inout is only important where you can't/don't want to use templates. This is the case if you're concerned about code bloat, or need virtual functions. In std.string, not only **can** I use templates, I **have to** use them to deal with narrow vs. wide strings.
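For instance, something along these lines (a toy sketch, not actual std.string code) covers char, wchar, and dchar strings, mutable or not, with a single template:

import std.traits;   // isSomeString

// Counts how many times a given character occurs, decoding as it goes.
size_t countChar(S)(S haystack, dchar needle) if(isSomeString!S) {
    size_t count = 0;
    foreach(dchar c; haystack) {  // decodes UTF-8/UTF-16/UTF-32 uniformly
        if(c == needle) count++;
    }
    return count;
}

countChar("hällo", 'l') and countChar("hällo"w, 'l') both return 2, with no static if in sight.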
Aug 19 2010
parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 19 Aug 2010 10:34:01 -0400, dsimcha <dsimcha yahoo.com> wrote:

 == Quote from Kagamin (spam here.lot)'s article
 dsimcha Wrote:
 Hmm, lately I've been focusing my hacking efforts on  
debugging/polishing/removing
 annoying inconsistencies in Phobos.  Maybe std.string should be my  
next target.
 It's generally a frustrating module because in addition to the wide  
character
 issue, lots of stuff requires immutable strings when it could work  
correctly and
 safely with a const or mutable string.
They say, there're bugs with inout. You don't need them fixed?
No. inout is only important where you can't/don't want to use templates. This is the case if you're concerned about code bloat, or need virtual functions. In std.string, not only **can** I use templates, I **have to** use them to deal with narrow vs. wide strings.
No no, inout is essential on templates as well. e.g.:

inout(T) min(T)(inout(T) t1, inout(T) t2) {
    return t2 < t1 ? t2 : t1;
}

-Steve
Aug 23 2010
prev sibling parent reply Jonathan M Davis <jmdavisprog gmail.com> writes:
On Thursday, August 19, 2010 07:13:25 Kagamin wrote:
 Jonathan Davis Wrote:
 bytes and shorts are legitimate values on their own, so it wouldn't
 make sense to give the type to foreach as long.
Having wider integer always has sense.
 byte or short on its own just fine.
Yes, but odds are that it's a bug. You can easily hit an overflow.
No, it doesn't hurt to have the iteration type larger than the actual type, but you're not going to have overflow. The value is in the array already. Sure, you could have had overflow putting it in, but when you're taking it out, you know that it fits because it was already in there. You could have overflow issues with math or whatnot inside the body of your loop if you're assigning to the foreach variable, but that has nothing to do with what you're getting out of the loop. With string and wstring, you're almost certainly getting a type that is inappropriate to process by itself.
 
 So, it's almost a guarantee that the correct type for iterating over a
 string or wstring is dchar, not char or wchar. String types are just
 weird that way due to how multibyte unicode encodings work.
If you don't like narrow strings, don't use them. Use dstring. You are free to write what you want.
It's fine with me to use narrow strings. Much as I'd love to avoid a lot of these issues, dstrings take up too much memory if you're going to be doing a lot of string processing. I'm aware of the issues and can program around them.

The problem is that the default behavior is the abnormal (and therefore almost certainly buggy) behavior. Generally D tries to make the normal behavior the behavior that is less likely to cause bugs. Obviously, it doesn't always succeed, and this case is one of them. Very few people are actually going to want to deal with code units. They want characters. The result is that it becomes very easy to make mistakes with strings if you ever try and manipulate them character-by-character.
 
 So, since it makes so little sense to iterate over chars or wchars by
 default, it would make sense to make the default dchar.
It's an iteration over array items. This makes perfect sense.
It makes perfect sense for general arrays. It makes perfect sense if you don't really care about the contents of the array for your algorithm (that is, whether they're code points or characters or just bytes in memory doesn't matter for what you're doing). However, if you're actually processing characters, it makes no sense at all. This mess with foreach and strings is one of the big reasons why foreach tends to be avoided in std.algorithm.

The reality of the matter is that what the container conceptually contains (characters) and what it actually contains aren't the same. That causes problems all over the place. Some reasonable workarounds have been found (for instance, strings are special-cased so that they're not random access ranges), but you have to special case string all over the place. The only way to avoid it completely is to just use dstring everywhere, but that doesn't necessarily scale well, and given the fact that the string module deals almost exclusively with string rather than wstring or dstring, it really doesn't make sense to use dstrings in the general case. Not to mention, the Linux I/O stuff uses UTF-8, and the Windows I/O stuff uses UTF-16, so dstring is less efficient for dealing with I/O.

Even just making it an error - or at least a warning - to not give the type for foreach when iterating over UTF-8 and UTF-16 string types would help a lot in fixing string-related coding errors (so, they can choose char, wchar, or dchar, but they can't forget to put in the type and get shot in the foot because what they almost certainly wanted was dchar). However, there's a lot of generic code which runs into trouble because of this as well. The result is that you generally have to avoid foreach in generic code.

Perhaps what we need is some way to distinguish between the exact element type of an array and the conceptual element type. So, for most arrays, they'd both be whatever the element type of the array is, but for strings the exact element type would be char, wchar, or dchar while the conceptual type would be dchar. That way, algorithms that don't care what the actual contents mean can use the exact element type, and the algorithms that actually care about processing the contents can use the conceptual element type.

- Jonathan M Davis
Aug 19 2010
parent reply Kagamin <spam here.lot> writes:
Jonathan M Davis Wrote:

 Not to mention, the Linux I/O stuff uses UTF-8, and 
 the Windows I/O stuff uses UTF-16, so dstring is less efficient for dealing
with 
 I/O.
If we take dil as an example of an application that does a lot of string processing: how much string processing does it do, and how intensively does it communicate with the OS (with the associated string transcoding)?
Aug 19 2010
parent Jonathan M Davis <jmdavisprog gmail.com> writes:
On Thursday, August 19, 2010 12:24:22 Kagamin wrote:
 Jonathan M Davis Wrote:
 Not to mention, the Linux I/O stuff uses UTF-8, and
 the Windows I/O stuff uses UTF-16, so dstring is less efficient for
 dealing with I/O.
If we take dil as an example of application doing much of string processing. How much string processing it does and how intensively it communicates with OS (with string transcoding)?
I have never heard of dil. I have no idea. How big a hit the string type has on I/O is likely to be strongly dependent on the type of I/O you're using, the characteristics of your strings (as in things like what is the average number of code units in a code point in your strings and what is the average length of your strings), as well as all of the other CPU or memory-intensive stuff that you may be doing. However, it does make sense to make your string types the same size as the OS' native string types if you want to maximize efficiency.

Of more importance, however, is the fact that it costs a lot of memory to use UTF-32 strings if you have a lot of strings. The string processing itself could actually be more efficient using dstring since you can then use random access operations on them (or it could be less efficient because of the extra memory costs involved), but there are big memory costs to using lots of dstrings.

- Jonathan M Davis
Aug 19 2010
prev sibling parent reply Rainer Deyke <rainerd eldwood.com> writes:
On 8/19/2010 03:56, Jonathan Davis wrote:
 The problem is that chars are not characters. They are UTF-8 code
 units.
So what? You're acting like 'char' (and specifically 'char[]') is some sort of unique special case. In reality, it's just one case of encoded data. What about compressed data? What about packed arrays of bits? What about other containers?

There's a useful generic idiom for iterating over a sequence of A as if it was a sequence of B: the adapter range. Narrow strings aren't special enough to deserve special language support.

-- 
Rainer Deyke - rainerd eldwood.com
Aug 19 2010
next sibling parent dsimcha <dsimcha yahoo.com> writes:
== Quote from Rainer Deyke (rainerd eldwood.com)'s article
 On 8/19/2010 03:56, Jonathan Davis wrote:
 The problem is that chars are not characters. They are UTF-8 code
 units.
So what? You're acting like 'char' (and specifically 'char[]') is some sort of unique special case. In reality, it's just one case of encoded data. What about compressed data? What about packed arrays of bits? What about other containers? There's a useful generic idiom for iterating over a sequence of A as if it was a sequence of B: the adapter range. Narrow strings aren't special enough to deserve special language support.
Even though I don't agree, I believe this is a reasonable point of view. The biggest problem in my mind is that the inconsistency between ranges and foreach gives us the worst of both worlds when it comes to writing generic code. While I think both should use dchar, I'd rather see both use char than have this ridiculous inconsistency.

I'm waiting for Walter (who has the final say on the core language) or Andrei (who has the final say on Phobos and ranges) to chime in on this one. If Walter insists that we can't make foreach use dchar, then for consistency's sake at least let's make std.range use char.
Aug 19 2010
prev sibling parent reply "Simen kjaeraas" <simen.kjaras gmail.com> writes:
Rainer Deyke <rainerd eldwood.com> wrote:

 On 8/19/2010 03:56, Jonathan Davis wrote:
 The problem is that chars are not characters. They are UTF-8 code
 units.
So what? You're acting like 'char' (and specifically 'char[]') is some sort of unique special case. In reality, it's just one case of encoded data. What about compressed data? What about packed arrays of bits? What about other containers?
First off, char, wchar, and dchar are special cases already - they're basically byte, short, and int, but are treated somewhat differently.

One possibility, which would make strings a less integrated part of the language, is to make them simple range structs, and hide UTF-8/16 details in the implementation. If it were not for the fact that D touts its UTF capabilities, and that this would make it a little less true, and the fact that char/wchar/dchar are already treated specially, I would support this idea.

-- 
Simen
Aug 20 2010
next sibling parent reply Jonathan M Davis <jmdavisprog gmail.com> writes:
On Friday, August 20, 2010 09:44:26 Simen kjaeraas wrote:
 Rainer Deyke <rainerd eldwood.com> wrote:
 On 8/19/2010 03:56, Jonathan Davis wrote:
 The problem is that chars are not characters. They are UTF-8 code
 units.
So what? You're acting like 'char' (and specifically 'char[]') is some sort of unique special case. In reality, it's just one case of encoded data. What about compressed data? What about packed arrays of bits? What about other containers?
First off, char, wchar, and dchar are special cases already - they're basically byte, short, and int, but are treated somewhat differently. One possibility, which would make strings a less integrated part of the language, is to make them simple range structs, and hide UTF-8/16 details in the implementation. If it were not for the fact that D touts its UTF capabilities, and that this would make it a little less true, and the fact that char/wchar/dchar are already treated specially, I would support this idea.
If you do that, you'd probably do something like

struct String(C) {
    C[] array;

    dchar front() {
        size_t i = 0;
        return decode(array, i);   // std.utf.decode
    }

    dchar back() { /* more complicated code */ }

    void popFront() { array.popFront(); }
    void popBack() { array.popBack(); }
    bool empty() { return array.empty; }
}

alias String!(immutable char) string;

Naturally, there would be template constraints, the functions might be a bit more complex, and there would probably be some other functions (not to mention, you might have to do something fancy to get the immutable part to work since IIRC templates remove immutable and const so that they don't generate different templates for immutable, const, and mutable), but essentially, you would wrap the various string types in a struct with range operations based on dchar. You could get at the underlying array quite easily if you actually wanted array operations. And if you want string operations, well, you have the range operations. Everywhere in the code where you currently have string, you'd have String!(immutable char) instead of immutable(char)[].

I really don't know what all of the implications of this are. There have been similar suggestions before. You don't really hide the fact that they're UTF-8 and UTF-16. Rather, you just make it so that the main interface to them is UTF-32. Anyone who wants the UTF-8 or UTF-16 array can get at it just fine. I'm not sure how much this really saves you though, nor what all the problems a struct like this would cause over what we currently have. But you'd probably still have to special case stuff, since there are going to be algorithms that need to process the underlying array rather than the dchar range in order to be properly efficient, if they are to work at all. Also, without universal function call syntax, I think that the only way to make it possible to call functions on it as if they were member functions is to use opDispatch(), which would definitely cause bugs (opDot() won't work since the most that you could do at that point is pass it along to the internal array, and then we're right back where we started).

So, ultimately, I'm not sure that such a change would gain you much, and you're definitely losing something big. Ultimately, I think that we're stuck with what we've got, though we may be able to make some tweaks. Fundamentally, we're trying to treat something as two different things without treating it as two different things. We want to treat it as a range of characters and an array of Unicode code units at the same time, using it as a range of characters where appropriate and using it as an array of code units where appropriate, without having to special case it. I just don't think that that's going to work. We can improve our situation with the use of good template and trait stuff, along with making iterating over string types without specifying a type a warning/error or making it default to dchar. But ultimately, there's a fundamental disjoint going on here, and we have to deal with it.

- Jonathan M Davis
Aug 20 2010
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 08/20/2010 12:22 PM, Jonathan M Davis wrote:
 On Friday, August 20, 2010 09:44:26 Simen kjaeraas wrote:
 Rainer Deyke<rainerd eldwood.com>  wrote:
 On 8/19/2010 03:56, Jonathan Davis wrote:
 The problem is that chars are not characters. They are UTF-8 code
 units.
So what? You're acting like 'char' (and specifically 'char[]') is some sort of unique special case. In reality, it's just one case of encoded data. What about compressed data? What about packed arrays of bits? What about other containers?
First off, char, wchar, and dchar are special cases already - they're basically byte, short, and int, but are treated somewhat differently. One possibility, which would make strings a less integrated part of the language, is to make them simple range structs, and hide UTF-8/16 details in the implementation. If it were not for the fact that D touts its UTF capabilities, and that this would make it a little less true, and the fact that char/wchar/dchar are already treated specially, I would support this idea.
If you do that, you'd probably do something like struct String(C) { C[] array; dchar front() { size_t i = 0; return decod(a, i); } dchar back() { /* more complicated code*/ } void popFront() { array.popFront(); } void popBack() { array.popBack(); } bool empty() { return array.empty; } } alias String(immutable char) string;
Grep std/ for byDchar. Andrei
Aug 20 2010
prev sibling parent Kagamin <spam here.lot> writes:
Jonathan M Davis Wrote:

 Everywhere in the code where you currently have string, you'd have 
 String(immutable char) instead of immutable (char)[].
Not necessarily. I think you can leave std.algorithm string-agnostic and special-case string operations in, say, std.string, which would take and return regular string types but internally call std.algorithm on dchar range wrappers - this is what std.algorithm does now, I suppose.
 Fundamentally, we're trying to treat something as two 
 different things without treating it as two different things. We want to treat
it 
 as a range of characters and an array of unicode code units at the same time, 
 using it as a range of characters where appropriate and using it as an array
of 
 code units where appropriate without having to special case it. I just don't 
 think that that's going to work.
I think we just want to do string operations. I believe Java and .NET live fine with String classes and the string operations built into them.
Aug 20 2010
prev sibling parent Rainer Deyke <rainerd eldwood.com> writes:
On 8/20/2010 10:44, Simen kjaeraas wrote:
 First off, char, wchar, and dchar are special cases already - they're
 basically byte, short, and int, but are treated somewhat differently.
They're only special cases when placed in a built-in array. In any other container, they behave like normal types - unless the container uses built-in arrays internally, in which case it may not work at all.

I have no objection to a string type that uses utf-8 internally but iterates over full code points. My objection is specifically to special-casing built-in arrays to behave differently from all other arrays when instantiated on 'char' and 'wchar'. Rename 'char[]' to 'char""' (and keep 'char[]' as a simple array) and my objection goes away.

Again, I ask: what about 'Array!char'?

-- 
Rainer Deyke - rainerd eldwood.com
Aug 20 2010
prev sibling next sibling parent Kagamin <spam here.lot> writes:
dsimcha Wrote:

 If D is at all serious about generic programming, we simply can't require this
 to be dealt with **everywhere** as a special case.
Just remove the special case of automatic conversion from strings to dchar[] and you will have one less surprise. After all, it was a deliberate design decision to make strings arrays. Make dchar ranges explicit - you can't code without writing what you want, right? Or write a special-case library for strings that will do the job for you.
Aug 18 2010
prev sibling next sibling parent Kagamin <spam here.lot> writes:
dsimcha Wrote:

 If D is at all serious about generic programming, we simply can't require this
 to be dealt with **everywhere** as a special case.
I suspect ranges were designed for FP, so use map instead of foreach. Or fold. Or another 3-letter abbreviation. This will also give you the possibility to parallelize your code later; foreach is executed sequentially by design.
Aug 18 2010
prev sibling next sibling parent Pelle <pelle.mansson gmail.com> writes:
On 08/19/2010 04:37 AM, dsimcha wrote:
 I've been hacking in Phobos and parallelfuture and I've come to the conclusion
 that having typeof(c) in the expression foreach(c; string.init) not be a dchar
 is simply ridiculous.  I don't care how much existing code gets broken, this
 needs to be fixed.  Otherwise, all generic code will have to deal with it as a
 special case.  Most of it will probably overlook this special case in
 practice, and the net result will be more broken code than if we just bite the
 bullet and fix this now.
Currently, strings break foreach in generic code. This is terrible! I agree with this. I thought char[] was a UTF-8 sequence, not a byte sequence.
Aug 19 2010
prev sibling next sibling parent Michel Fortin <michel.fortin michelf.com> writes:
On 2010-08-18 22:37:04 -0400, dsimcha <dsimcha yahoo.com> said:

 If D is at all serious about generic programming, we simply can't require this
 to be dealt with **everywhere** as a special case.
I do agree that the current special case situation is pretty bad. Foreach really needs to use ElementType!string by default. Whether this is done by changing foreach (my preference), or by reverting ElementType!string to its previous incarnation and using a special range to iterate over characters, I think it'd be an improvement over the current situation. Having the standard library and the language disagree with each other is pretty bad.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Aug 19 2010
prev sibling next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from dsimcha (dsimcha yahoo.com)'s article
 I've been hacking in Phobos and parallelfuture and I've come to the conclusion
 that having typeof(c) in the expression foreach(c; string.init) not be a dchar
 is simply ridiculous.
BTW, what are some examples of where making dchar the default would **silently** break code? I can think of very few because now that we've gotten rid of implicit narrowing conversions in D2, you can't implicitly convert a dchar to a char or a byte. This should catch most cases at compile time.
Aug 19 2010
parent Kagamin <spam here.lot> writes:
dsimcha Wrote:

 BTW, what are some examples of where making dchar the default would
**silently**
 break code?
1. Read a file and cast the buffer to string.
2. A surprising difference in string lengths that were just checked.
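The first case, spelled out (a made-up snippet - the file name and function are just for illustration):

import std.file;

void scanLog() {
    // Today this "works" byte by byte even if the file isn't valid UTF-8.
    auto text = cast(string) read("data.log");

    foreach(c; text) {
        // c is currently a char, so nothing is decoded.  With dchar as the
        // default, the same loop would decode and could throw on malformed
        // UTF-8 -- a silent change in behavior.
    }
}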
Aug 19 2010
prev sibling next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from dsimcha (dsimcha yahoo.com)'s article
 I've been hacking in Phobos and parallelfuture and I've come to the conclusion
 that having typeof(c) in the expression foreach(c; string.init) not be a dchar
 is simply ridiculous.  I don't care how much existing code gets broken, this
 needs to be fixed.
Here's another good one. This one uses Lockstep, which is in the SVN version of std.range and is designed to provide syntactic sugar for iterating over multiple ranges in lockstep via foreach.

string str1, str2;

foreach(c1, c2; lockstep(str1, str2)) {}  // c1, c2 are dchars since Lockstep relies on range primitives.

foreach(c; str1) {}  // c is a char since the regular foreach loop doesn't use range primitives.

I'm starting to think the inconsistency between ranges and foreach is really the worst part. When viewed in isolation, Andrei's changes to std.range to make ElementType!string == dchar, etc. were definitely the right thing to do. However, if we can't fix foreach, it might be a good idea to undo them because in this case I think such a ridiculous, bug producing inconsistency is worse than doing The Wrong Thing consistently.
Aug 19 2010
parent Jonathan M Davis <jmdavisprog gmail.com> writes:
On Thursday, August 19, 2010 07:15:30 dsimcha wrote:
 == Quote from dsimcha (dsimcha yahoo.com)'s article
 
 I've been hacking in Phobos and parallelfuture and I've come to the
 conclusion that having typeof(c) in the expression foreach(c;
 string.init) not be a dchar is simply ridiculous.  I don't care how much
 existing code gets broken, this needs to be fixed.
Here's another good one. This one uses Lockstep, which is in the SVN version of std.range and is designed to provide syntactic sugar for iterating over multiple ranges in lockstep via foreach. string str1, str2; foreach(c1, c2; lockstep(str1, str2)) {} // c1, c2 are dchars since Lockstep relies on range primitives. foreach(c; str1) {} // c is a char since the regular foreach loop doesn't use range // primitives. I'm starting to think the inconsistency between ranges and foreach is really the worst part. When viewed in isolation, Andrei's changes to std.range to make ElementType!string == dchar, etc. were definitely the right thing to do. However, if we can't fix foreach, it might be a good idea to undo them because in this case I think such a ridiculous, bug producing inconsistency is worse than doing The Wrong Thing consistently.
Okay. Maybe this is what we do:

1. Make it a warning if not an outright error to use foreach with any char or wchar array (be they mutable, const, or immutable) without indicating the type. So,

foreach(c; mystring) {
    //...
}

would become illegal. You'd have to give the type for c. This would solve the problem where someone forgets to put the type. Since odds are that they wanted dchar anyway, the extra characters aren't really extra for most people. And the few who actually wanted char or wchar can just put the type. It shouldn't be a big deal. A programmer can still foolishly put char or wchar when what they actually need is dchar, but at least then it's a deliberate error due to ignorance rather than someone who knows what they're doing making a simple mistake. This will also catch errors in generic algorithms that end up trying to use foreach without giving the type.

2. Ditch ElementType in favor of something more like ExactElemType and ConceptElemType, where ExactElemType is the actual type in the array/range and ConceptElemType is the type that is conceptually in the array/range. So, for most types, those two will be the same, but for string types, ExactElemType will be char, wchar, or dchar, while ConceptElemType will always be dchar. That way, the algorithms that don't care about what the elements mean can just use ExactElemType while those that do care about what the elements mean use ConceptElemType.

I'm not sure that this is the best solution. However, the fact that string and wstring are arrays but can't always be treated as arrays is pretty much inescapable as long as they're arrays. It seems like no matter what we do, you either lose the ability to treat strings as arrays or you have to special case them all over the place. If they were structs that gave access to their underlying array for array operations and gave range operations for normal use (possibly along with a function for giving you the nth element, though it couldn't truly be random access unless it were a dstring), then maybe we could get this to work better. But we're dealing with the inherent problem that the container holds one type conceptually and a completely different type in reality.

- Jonathan M Davis
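P.S. A rough sketch of what I mean by the two templates (names and details are purely illustrative):

import std.traits;   // Unqual

// The type physically stored in the array.
template ExactElemType(T : T[]) {
    alias T ExactElemType;
}

// The type the array conceptually holds: dchar for narrow strings,
// the stored type for everything else.
template ConceptElemType(T : T[]) {
    static if(is(Unqual!T == char) || is(Unqual!T == wchar))
        alias dchar ConceptElemType;
    else
        alias T ConceptElemType;
}

static assert(is(ExactElemType!string == immutable(char)));
static assert(is(ConceptElemType!string == dchar));
static assert(is(ConceptElemType!(int[]) == int));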
Aug 19 2010
prev sibling next sibling parent "Simen kjaeraas" <simen.kjaras gmail.com> writes:
dsimcha <dsimcha yahoo.com> wrote:

 I've been hacking in Phobos and parallelfuture and I've come to the  
 conclusion
 that having typeof(c) in the expression foreach(c; string.init) not be a  
 dchar
 is simply ridiculous.  I don't care how much existing code gets broken,  
 this
 needs to be fixed.  Otherwise, all generic code will have to deal with  
 it as a
 special case.
The other alternative, as has been proposed, is to create (d|w)?string structs that are thin wrappers on top of immutable((d|w)?char)[]. Something along the lines of

struct string {
    immutable(ubyte)[] payload;
    alias payload this;

    // Implement range primitives here.
}

should work, though it is nowhere near as elegant as what we have, and would probably break some code. Do note that I used ubyte instead of char as the element type, seeing as how char/wchar would be unnecessary in this case.

This said, I am more in favor of changing the compiler than the strings.

-- 
Simen
Aug 19 2010
prev sibling parent reply Kagamin <spam here.lot> writes:
Jonathan M Davis Wrote:

 No, it doesn't hurt to have the iteration type larger than the actual type,
but 
 you're not going to have overflow.
Trivial: take byte and add 256.
 could have had overflow putting it in, but when you're taking it out, you know 
 that it fits because it was already in there. You could have overflow issues
with 
 math or whatnot inside the body of your loop if you're assigning to the
foreach 
 variable, but that has nothing to do with what you're getting out of the loop. 
As long as what you get out of the loop doesn't depend on the element type. Didn't you demonstrate how such a dependency can be introduced?
 It's fine with me to use narrow strings. Much as I'd love to avoid a lot of
these 
 issues, dstrings take up too much memory if you're going to be doing a lot of 
 string processing.
If you're going to use a lot of memory, there probably won't be much difference between strings and dstrings - you'll use a lot of memory in both cases. And don't forget that UTF-8 characters take up to 4 bytes.
 problem is that the default behavior is the abnormal (and therefore almost 
 certainly buggy) behavior. Generally D tries to make the normal behavior the 
 behavior that is less likely to cause bugs.
Type system hacks are likely to cause bugs.
 Very few people are actually going to 
 want to deal with code points. They want characters. The result is that it 
 becomes very easy to make mistakes with strings if you ever try and manipulate 
 them character-by-character.
If you care about people and want to force them to use dchar ranges, you can do it with the library: make it refuse narrow strings. As long as the library is unusable with narrow strings, people will have to do something about it - say, use wrappers like the one proposed in this thread (but providing a forward dchar range interface).
 It makes perfect sense for general arrays. It makes perfect sense if you don't 
 really care about the contents of the array for your algorithm (that is,
whether 
 they're code points or characters or just bytes in memory doesn't matter for 
 what you're doing). However, if you're actually processing characters, it
makes 
 no sense at all. This mess with foreach and strings is one of the big reasons 
 why foreach tends to be avoided in std.algorithm.
The problem here is that integers are not much different from characters in this regard.
 and given the fact that the string module deals almost exclusively with 
 string rather than wstring or dstring, it really doesn't make sense to use 
 dstrings in the general case.
This is my point: you can do it with the library; if you can't, fix the library.
 Not to mention, the Linux I/O stuff uses UTF-8, and 
 the Windows I/O stuff uses UTF-16, so dstring is less efficient for dealing
with 
 I/O.
Every string type is inefficient here, but a wrapper comparable to NSString can fix it for you.
 Perhaps what we need is some way to distinguish between the exact element type 
 on an array and the conceptual element type. So, for most arrays, they'd both
be 
 whatever the element type of the array is, but for strings the exact element 
 type would be char, whchar, or dchar while the conceptual type would be dchar. 
Conceptually, a number is an infinite sequence of digits with a decimal point. What do you plan to do about this?
Aug 19 2010
parent reply Jonathan M Davis <jmdavisprog gmail.com> writes:
On Thursday, August 19, 2010 12:18:03 Kagamin wrote:
 Jonathan M Davis Wrote:
 No, it doesn't hurt to have the iteration type larger than the actual
 type, but you're not going to have overflow.
Trivial: take byte and add 256.
Except that that only happens once you do something to the element that you get from foreach. You read a byte just fine without having overflow problems. You can't do the same with char or wchar. You often need multiple of them to get anything meaningful - unlike bytes. If you want to change the iteration type to int or long or whatever when iterating over bytes so that you can change the variable without overflow issues, you can. But a byte is meaningful by itself. Such is not generally the case with char or wchar.
 It's fine with me to use narrow strings. Much as I'd love to avoid a lot
 of these issues, dstrings take up too much memory if you're going to be
 doing a lot of string processing.
If you're going to take much memory, there probably won't be much difference between strings and dstrings, you'll take much memory in both cases. And don't forget that UTF-8 chars take up to 4 bytes.
For ASCII characters, a UTF-32 character takes _4_ times as much memory as a UTF-8 character. Even if you use lots of Asian characters, as I understand it, most won't take more than 3 bytes in UTF-8. So, even if you're using primarily Asian characters with UTF-8, you still have 25% space savings. And since apparently many Asian characters will fit into one wchar, if you use UTF-16 when you have lots of Asian characters, you're getting closer to 50% space savings over UTF-32. If you have a lot of strings, that's a lot of wasted memory.
 If you care about people and want to force them to use dchar ranges, you
 can do it with the library: make it refuse narrow strings - as long as the
 library is unusable with narrow strings, people will have to do something
 about it, say, use wrappers like one proposed in this thread (but
 providing forward dchar range interface).
We _can't_ force everyone to use dstring. That defeats having string and wstring in the first place and is incredibly inefficient space-wise. The standard libraries _need_ to work well with all string types.
 It makes perfect sense for general arrays. It makes perfect sense if you
 don't really care about the contents of the array for your algorithm
 (that is, whether they're code points or characters or just bytes in
 memory doesn't matter for what you're doing). However, if you're
 actually processing characters, it makes no sense at all. This mess with
 foreach and strings is one of the big reasons why foreach tends to be
 avoided in std.algorithm.
The problem here is that integers are not much different from characters in this regard.
Integers are totally different. An integer may be limited in the size of the number that it can hold, but it makes perfect sense to process each integer individually. An integer is a full value on its own. char and wchar are not. They're only parts of a whole.
 Conceptually number is an infinite sequence of digits with decimal point.
 What do you plan to do about this?
That's a totally different issue. The solution for that is to use a BigInt type which combines multiple integers (or bytes or longs or whatever) together to make larger values than primitive integral types can hold. In that case, if you were to try and iterate over individual ints within the BigInt, then you'd be screwed because they don't mean anything on their own. string and wstring are effectively BigInt for chars and wchars. You have to combine multiple of them to get meaningful values. The fact that one of them can't hold a big enough (let alone infinite) range is the whole reason that they were created in the first place (that and the fact that making the type big enough (i.e. dchar) on its own wastes a lot of space).

- Jonathan M Davis
Aug 19 2010
parent Kagamin <spam here.lot> writes:
Jonathan M Davis Wrote:

 Trivial: take byte and add 256.
If you want to change the iteration type to int or long or whatever when iterating over bytes so that you can change the variable without overflow issues, you can. But the byte itself is meaingful by itself. Such is not generally the case with char or wchar.
I thought it was your point that having a meaning doesn't help to avoid bugs.
 If you care about people and want to force them to use dchar ranges, you
 can do it with the library: make it refuse narrow strings - as long as the
 library is unusable with narrow strings, people will have to do something
 about it, say, use wrappers like one proposed in this thread (but
 providing forward dchar range interface).
We _can't_ force everyone to use dstring.
I'm not talking about dstrings; I said a dchar range wrapper. Andrei mentioned byDchar - I don't know if that's the thing. Anyway, std.algorithm does iterate over dchars in narrow strings somehow. You can do it too.
Aug 20 2010