
digitalmars.D - Why foreach(c; someString) must yield dchar

reply dsimcha <dsimcha yahoo.com> writes:
I've been hacking in Phobos and parallelfuture and I've come to the conclusion
that having typeof(c) in the expression foreach(c; string.init) not be a dchar
is simply ridiculous.  I don't care how much existing code gets broken, this
needs to be fixed.  Otherwise, all generic code will have to deal with it as a
special case.  Most of it will probably overlook this special case in
practice, and the net result will be more broken code than if we just bite the
bullet and fix this now.  Here are some examples of the absurdities created by
the current situation:

static assert(is(typeof({
    foreach(elem; T.init) {
        return elem;
    }
    assert(0);
}()) == ElementType!(T)));

Looks reasonable.  FAILS on narrow strings.

size_t walkLength1(R)(R input) {
    size_t ret = 0;
    foreach(elem; input) {
        ret++;
    }

    return ret;
}

size_t walkLength2(R)(R input) {
    size_t ret = 0;
    while(!input.empty) {
       ret++;
       input.popFront();
    }

    return ret;
}

assert(walkLength1(stuff) == walkLength2(stuff));

FAILS if stuff is a narrow string containing characters that take more than one
code unit.

void printRange(R)(R range) {
    foreach(elem; range) {
        write(elem, ' ');
    }
    writeln();
}

Prints garbage if range is a string containing characters that take more than one
code unit.

auto rangeMax(R)(R range) {
    enforce(!range.empty);

    auto ret = range.front;
    foreach(elem; range) {
        if(elem > ret) {
            ret = elem;
        }
    }

    return ret;
}

This will not find the largest character in the range if R is a narrow string.

If D is at all serious about generic programming, we simply can't require this
to be dealt with **everywhere** as a special case.
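To be clear about what "dealt with as a special case" means in practice, here is
roughly what every one of the functions above has to turn into today. This is just
a sketch; the helper name and the exact trait used are illustrative, not actual
Phobos code:

import std.traits;   // isSomeString

size_t walkLengthFixed(R)(R input) {
    size_t ret = 0;

    static if(isSomeString!R) {
        // Narrow strings must be decoded explicitly, or we count code
        // units instead of code points.
        foreach(dchar elem; input) {
            ret++;
        }
    } else {
        foreach(elem; input) {
            ret++;
        }
    }

    return ret;
}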
Aug 18 2010
next sibling parent reply Rainer Deyke <rainerd eldwood.com> writes:
On 8/18/2010 20:37, dsimcha wrote:
 I've been hacking in Phobos and parallelfuture and I've come to the conclusion
 that having typeof(c) in the expression foreach(c; string.init) not be a dchar
 is simply ridiculous.
I have long ago come to the opposite conclusion. An array of 'char' should act like any other array. If you want a sequence of 'dchar' that is internally stored as an array of 'char', don't call it 'char[]'.

You propose to fix a special case by adding more special cases. This will increase, not decrease, the number of cases that will need special treatment in generic code.

Iterating over a sequence of 'char' as a sequence of 'dchar' is very useful. Implementing this functionality as a language feature, tied to the built-in array type, is just plain wrong.
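To be concrete, the kind of library adapter I mean is roughly this (a sketch only; I'm not claiming Phobos spells it this way):

import std.utf;

// Presents a char[] as a range of dchar, with no help from the language.
struct ByDchar {
    private const(char)[] str;

    bool empty() { return str.length == 0; }

    dchar front() {
        size_t index = 0;
        return decode(str, index);       // decode one code point
    }

    void popFront() {
        str = str[stride(str, 0) .. $];  // skip that code point's code units
    }
}

ByDchar byDchar(const(char)[] str) { return ByDchar(str); }

An analogous adapter can wrap 'Array!char' or anything else that exposes code units; nothing about the idiom needs to be tied to the built-in array type.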
 static assert(is(typeof({
     foreach(elem; T.init) {
         return elem;
     }
     assert(0);
 }) == ElementType!(T));
 
 Looks reasonable.  FAILS on narrow strings.
Because ElementType!(string) is broken.
 size_t walkLength1(R)(R input) {
     size_t ret = 0;
     foreach(elem; input) {
         ret++;
     }
 
     return ret;
 }
 
 size_t walkLength2(R)(R input) {
     size_t ret = 0;
     while(!input.empty) {
        ret++;
        input.popFront();
     }
 
     return ret;
 }
 
 assert(walkLength1(stuff) == walkLength2(stuff));
 
 FAILS if stuff is a narrow string with characters that aren't a single code
point.
Because 'popFront' is broken for narrow strings.
 void printRange(R)(R range) {
     foreach(elem; range) {
         write(elem, ' ');
     }
     writeln();
 }
 
 Prints garbage if range is a string with characters that aren't a single code
 point.
Prints bytes from the string separated by spaces. This may be intentional behavior if the parser on the other side is not utf-aware.

-- 
Rainer Deyke - rainerd eldwood.com
Aug 18 2010
parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Wed, 18 Aug 2010 23:11:26 -0400, Rainer Deyke <rainerd eldwood.com>  
wrote:

 On 8/18/2010 20:37, dsimcha wrote:
 I've been hacking in Phobos and parallelfuture and I've come to the  
 conclusion
 that having typeof(c) in the expression foreach(c; string.init) not be  
 a dchar
 is simply ridiculous.
I have long ago come to the opposite conclusion. An array of 'char' should act like any other array. If you want a sequence of 'dchar' that is internally stored as an array of 'char', don't call it 'char[]'.
I have to agree with Rainer here. I think maybe string shouldn't just be an immutable(char)[]. I'd rather see it as a struct that wraps a char[] and presents the appropriate interface. Ditto for wchar.

-Steve
Aug 23 2010
prev sibling next sibling parent reply Jonathan M Davis <jmdavisprog gmail.com> writes:
On Wednesday 18 August 2010 19:37:04 dsimcha wrote:
 I've been hacking in Phobos and parallelfuture and I've come to the
 conclusion that having typeof(c) in the expression foreach(c; string.init)
 not be a dchar is simply ridiculous.  I don't care how much existing code
 gets broken, this needs to be fixed.  Otherwise, all generic code will
 have to deal with it as a special case.  Most of it will probably overlook
 this special case in practice, and the net result will be more broken code
 than if we just bite the bullet and fix this now.  Here are some examples
 of the absurdities created by the current situation:
 
[snip]
 If D is at all serious about generic programming, we simply can't require
 this to be dealt with **everywhere** as a special case.
Considering that in all likelihood 99+% of the cases where someone is iterating over char, they really want dchar, I have no problem whatsoever with such a change. It may break existing code, but I'd expect that it's more likely to fix it. People could still iterate over char or wchar if they want to - they'd just have to specify the type.

The one thing about it that bugs me is that it means that foreach acts differently with chars and wchars than it does with everything else, but really, that's a _lot_ less of an issue than the problems that you get with generic programming where you have to special case strings all over the place.

As I understand it, Walter doesn't want to do this because it silently breaks D1 code. However, since odds are that that code should have been iterating over dchars in the first place, I really think that this change is worth making. In light of the costs to generic programming and the fact that programmers the world over are going to screw up when using foreach with strings when only a bare handful are actually going to want to iterate over chars or wchars, I'd say that making this change is worth it. Yes, it may break some existing code, but one, I'd expect that it would _fix_ more code than it breaks, and two, this will forever be a recurring bug in D programs if it doesn't get fixed. You _know_ that the average programmer is going to screw this up and that experienced ones will periodically forget to specify the type for foreach and get bitten by it, and the cost to generic programming is obviously very high if we leave it as is.

So, I'd definitely vote to make it so that foreach over chars and wchars defaults to dchar. The pain that it will save is _far_ more than the pain that it will cost.

- Jonathan M Davis
Aug 18 2010
next sibling parent Rainer Deyke <rainerd eldwood.com> writes:
On 8/18/2010 21:12, Jonathan M Davis wrote:
 The one thing about it that bugs me is that it means 
 that foreach acts differently with chars and wchars then it does with
everything 
 else, but really, that's a _lot_ less of an issue than the problems that you
get 
 with generic programming where you have to special case strings all over the 
 place.
False dichotomy. If foreach acts differently with chars and wchars than it does with everything else, then you /do/ need to special case strings all over the place.

Thought experiment: what happens if you iterate not over 'char[]', but over 'Array!char'?

-- 
Rainer Deyke - rainerd eldwood.com
Aug 18 2010
prev sibling parent reply Kagamin <spam here.lot> writes:
Jonathan M Davis Wrote:

 Considering that in all likelihood 99+% of the cases where someone is
iterating 
 over char, they really want dchar
And when someone is iterating over byte[] or short[], they want long, right? Yeah, why not?
Aug 18 2010
parent reply Jonathan Davis <jmdavisprog gmail.com> writes:
On 8/19/10, Kagamin <spam here.lot> wrote:
 Jonathan M Davis Wrote:

 Considering that in all likelihood 99+% of the cases where someone is
 iterating
 over char, they really want dchar
And when someone is iterating over byte[] or short[], they want long, right? Yeah, why not?
The problem is that chars are not characters. They are UTF-8 code units. If all you're using is ASCII, you can get away with treating them like one byte characters, but that doesn't work if you have any characters which aren't in ASCII. dchars _are_ characters. The correct way to iterate over a string or wstring if you want to treat the elements as characters is to give the type as dchar.

foreach(dchar c; mystring) {
    //...
}

If you use char or wchar, you're going to iterate over code units, which is completely different. It is not generally the case that that is the correct thing to do. If someone does that in their code, odds are that it's a bug.

bytes and shorts are legitimate values on their own, so it wouldn't make sense to give the type to foreach as long. You can deal with each byte or short on its own just fine. You can't safely do that with code units unless for some reason, you actually want to operate on code units (which is unlikely), or you don't actually care about the contents of the string for whatever you're doing (since some algorithms don't actually care about the contents of the arrays/ranges that they're dealing with).

So, it's almost a guarantee that the correct type for iterating over a string or wstring is dchar, not char or wchar. String types are just weird that way due to how multibyte unicode encodings work. So, since it makes so little sense to iterate over chars or wchars by default, it would make sense to make the default dchar.

- Jonathan M Davis
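P.S. To make the code unit/code point distinction concrete, a tiny made-up example:

void main() {
    string s = "é";            // one character, but two UTF-8 code units

    foreach(c; s) {
        // c is char here: this body runs twice, and each c is half of 'é',
        // meaningless on its own.
    }

    foreach(dchar c; s) {
        // c is dchar: this body runs once, and c is the whole character.
    }
}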
Aug 19 2010
next sibling parent reply Kagamin <spam here.lot> writes:
Jonathan Davis Wrote:

 bytes and shorts are legitimate values on their own, so it wouldn't
 make sense to give the type to foreach as long.
Having a wider integer always makes sense.
 byte or short on its own just fine.
Yes, but odds are that it's a bug. You can easily hit an overflow.
 So, it's almost a guarantee that the correct type for iterating over a
 string or wstring is dchar, not char or wchar. String types are just
 weird that way due to how multibyte unicode encodings work.
If you don't like narrow strings, don't use them. Use dstring. You are free to write what you want.
 So, since it makes so little sense to iterate over chars or wchars by default,
 it would make sense to make the default dchar.
It's an iteration over array items. This makes perfect sense.
Aug 19 2010
next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Kagamin (spam here.lot)'s article
 Jonathan Davis Wrote:
 bytes and shorts are legitimate values on their own, so it wouldn't
 make sense to give the type to foreach as long.
Having wider integer always has sense.
 byte or short on its own just fine.
Yes, but odds are that it's a bug. You can easily hit an overflow.
 So, it's almost a guarantee that the correct type for iterating over a
 string or wstring is dchar, not char or wchar. String types are just
 weird that way due to how multibyte unicode encodings work.
If you don't like narrow strings, don't use them. Use dstring. You are free to
 write what you want.

One major problem with this is the brokenness of std.string on non-UTF8 strings. Otherwise this would be a good solution provided you're not dealing with tons of strings, so space efficiency isn't a major concern.

Hmm, lately I've been focusing my hacking efforts on debugging/polishing/removing annoying inconsistencies in Phobos. Maybe std.string should be my next target. It's generally a frustrating module because in addition to the wide character issue, lots of stuff requires immutable strings when it could work correctly and safely with a const or mutable string.
Aug 19 2010
parent reply Kagamin <spam here.lot> writes:
dsimcha Wrote:

 Hmm, lately I've been focusing my hacking efforts on
debugging/polishing/removing
 annoying inconsistencies in Phobos.  Maybe std.string should be my next target.
 It's generally a frustrating module because in addition to the wide character
 issue, lots of stuff requires immutable strings when it could work correctly
and
 safely with a const or mutable string.
They say there're bugs with inout. You don't need them fixed?
Aug 19 2010
parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Kagamin (spam here.lot)'s article
 dsimcha Wrote:
 Hmm, lately I've been focusing my hacking efforts on
debugging/polishing/removing
 annoying inconsistencies in Phobos.  Maybe std.string should be my next target.
 It's generally a frustrating module because in addition to the wide character
 issue, lots of stuff requires immutable strings when it could work correctly
and
 safely with a const or mutable string.
They say, there're bugs with inout. You don't need them fixed?
No. inout is only important where you can't/don't want to use templates. This is the case if you're concerned about code bloat, or need virtual functions. In std.string, not only **can** I use templates, I **have to** use them to deal with narrow vs. wide strings.
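For instance, something along these lines (a toy sketch, not actual std.string code) covers char, wchar, and dchar strings, mutable or not, with a single template:

import std.traits;   // isSomeString

// Counts how many times a given character occurs, decoding as it goes.
size_t countChar(S)(S haystack, dchar needle) if(isSomeString!S) {
    size_t count = 0;
    foreach(dchar c; haystack) {  // decodes UTF-8/UTF-16/UTF-32 uniformly
        if(c == needle) count++;
    }
    return count;
}

countChar("hällo", 'l') and countChar("hällo"w, 'l') both return 2, with no static if in sight.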
Aug 19 2010
parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 19 Aug 2010 10:34:01 -0400, dsimcha <dsimcha yahoo.com> wrote:

 == Quote from Kagamin (spam here.lot)'s article
 dsimcha Wrote:
 Hmm, lately I've been focusing my hacking efforts on  
debugging/polishing/removing
 annoying inconsistencies in Phobos.  Maybe std.string should be my  
next target.
 It's generally a frustrating module because in addition to the wide  
character
 issue, lots of stuff requires immutable strings when it could work  
correctly and
 safely with a const or mutable string.
They say, there're bugs with inout. You don't need them fixed?
No. inout is only important where you can't/don't want to use templates. This is the case if you're concerned about code bloat, or need virtual functions. In std.string, not only **can** I use templates, I **have to** use them to deal with narrow vs. wide strings.
No no, inout is essential on templates as well. e.g.:

inout(T) min(T)(inout(T) t1, inout(T) t2) {
    return t2 < t1 ? t2 : t1;
}

-Steve
Aug 23 2010
prev sibling parent reply Jonathan M Davis <jmdavisprog gmail.com> writes:
On Thursday, August 19, 2010 07:13:25 Kagamin wrote:
 Jonathan Davis Wrote:
 bytes and shorts are legitimate values on their own, so it wouldn't
 make sense to give the type to foreach as long.
Having wider integer always has sense.
 byte or short on its own just fine.
Yes, but odds are that it's a bug. You can easily hit an overflow.
No, it doesn't hurt to have the iteration type larger than the actual type, but you're not going to have overflow. The value is in the array already. Sure, you could have had overflow putting it in, but when you're taking it out, you know that it fits because it was already in there. You could have overflow issues with math or whatnot inside the body of your loop if you're assigning to the foreach variable, but that has nothing to do with what you're getting out of the loop. With string and wstring, you're almost certainly getting a type that is inappropriate to process by itself.
 
 So, it's almost a guarantee that the correct type for iterating over a
 string or wstring is dchar, not char or wchar. String types are just
 weird that way due to how multibyte unicode encodings work.
If you don't like narrow strings, don't use them. Use dstring. You are free to write what you want.
It's fine with me to use narrow strings. Much as I'd love to avoid a lot of these issues, dstrings take up too much memory if you're going to be doing a lot of string processing. I'm aware of the issues and can program around them.

The problem is that the default behavior is the abnormal (and therefore almost certainly buggy) behavior. Generally D tries to make the normal behavior the behavior that is less likely to cause bugs. Obviously, it doesn't always succeed, and this case is one of them. Very few people are actually going to want to deal with code units. They want characters. The result is that it becomes very easy to make mistakes with strings if you ever try and manipulate them character-by-character.
 
 So, since it makes so little sense to iterate over chars or wchars by
 default, it would make sense to make the default dchar.
It's an iteration over array items. This makes perfect sense.
It makes perfect sense for general arrays. It makes perfect sense if you don't really care about the contents of the array for your algorithm (that is, whether they're code points or characters or just bytes in memory doesn't matter for what you're doing). However, if you're actually processing characters, it makes no sense at all. This mess with foreach and strings is one of the big reasons why foreach tends to be avoided in std.algorithm.

The reality of the matter is that what the container conceptually contains (characters) and what it actually contains aren't the same. That causes problems all over the place. Some reasonable workarounds have been found (for instance, strings are special-cased so that they're not random access ranges), but you have to special case string all over the place. The only way to avoid it completely is to just use dstring everywhere, but that doesn't necessarily scale well, and given the fact that the string module deals almost exclusively with string rather than wstring or dstring, it really doesn't make sense to use dstrings in the general case. Not to mention, the Linux I/O stuff uses UTF-8, and the Windows I/O stuff uses UTF-16, so dstring is less efficient for dealing with I/O.

Even just making it an error - or at least a warning - to not give the type for foreach when iterating over UTF-8 and UTF-16 string types would help a lot in fixing string-related coding errors (so, they can choose char, wchar, or dchar, but they can't forget to put in the type and get shot in the foot because what they almost certainly wanted was dchar). However, there's a lot of generic code which runs into trouble because of this as well. The result is that you generally have to avoid foreach in generic code.

Perhaps what we need is some way to distinguish between the exact element type of an array and the conceptual element type. So, for most arrays, they'd both be whatever the element type of the array is, but for strings the exact element type would be char, wchar, or dchar while the conceptual type would be dchar. That way, algorithms that don't care what the actual contents mean can use the exact element type, and the algorithms that actually care about processing the contents can use the conceptual element type.

- Jonathan M Davis
Aug 19 2010
parent reply Kagamin <spam here.lot> writes:
Jonathan M Davis Wrote:

 Not to mention, the Linux I/O stuff uses UTF-8, and 
 the Windows I/O stuff uses UTF-16, so dstring is less efficient for dealing
with 
 I/O.
If we take dil as an example of an application that does a lot of string processing: how much string processing does it do, and how intensively does it communicate with the OS (with the associated string transcoding)?
Aug 19 2010
parent Jonathan M Davis <jmdavisprog gmail.com> writes:
On Thursday, August 19, 2010 12:24:22 Kagamin wrote:
 Jonathan M Davis Wrote:
 Not to mention, the Linux I/O stuff uses UTF-8, and
 the Windows I/O stuff uses UTF-16, so dstring is less efficient for
 dealing with I/O.
If we take dil as an example of application doing much of string processing. How much string processing it does and how intensively it communicates with OS (with string transcoding)?
I have never heard of dil. I have no idea. How big a hit the string type has on I/O is likely to be strongly dependent on the type of I/O you're using, the characteristics of your strings (as in things like what is the average number of code units in a code point in your strings and what is the average length of your strings), as well as all of the other CPU or memory-intensive stuff that you may be doing. However, it does make sense to make your string types the same size as the OS' native string types if you want to maximize efficiency.

Of more importance, however, is the fact that it costs a lot of memory to use UTF-32 strings if you have a lot of strings. The string processing itself could actually be more efficient using dstring since you can then use random access operations on them (or it could be less efficient because of the extra memory costs involved), but there are big memory costs to using lots of dstrings.

- Jonathan M Davis
Aug 19 2010
prev sibling parent reply Rainer Deyke <rainerd eldwood.com> writes:
On 8/19/2010 03:56, Jonathan Davis wrote:
 The problem is that chars are not characters. They are UTF-8 code
 units.
So what? You're acting like 'char' (and specifically 'char[]') is some sort of unique special case. In reality, it's just one case of encoded data. What about compressed data? What about packed arrays of bits? What about other containers?

There's a useful generic idiom for iterating over a sequence of A as if it was a sequence of B: the adapter range. Narrow strings aren't special enough to deserve special language support.

-- 
Rainer Deyke - rainerd eldwood.com
Aug 19 2010
next sibling parent dsimcha <dsimcha yahoo.com> writes:
== Quote from Rainer Deyke (rainerd eldwood.com)'s article
 On 8/19/2010 03:56, Jonathan Davis wrote:
 The problem is that chars are not characters. They are UTF-8 code
 units.
So what? You're acting like 'char' (and specifically 'char[]') is some sort of unique special case. In reality, it's just one case of encoded data. What about compressed data? What about packed arrays of bits? What about other containers? There's a useful generic idiom for iterating over a sequence of A as if it was a sequence of B: the adapter range. Narrow strings aren't special enough to deserve special language support.
Even though I don't agree, I believe this is a reasonable point of view. The biggest problem in my mind is that the inconsistency between ranges and foreach gives us the worst of both worlds when it comes to writing generic code. While I think both should use dchar, I'd rather see both use char than have this ridiculous inconsistency.

I'm waiting for Walter (who has the final say on the core language) or Andrei (who has the final say on Phobos and ranges) to chime in on this one. If Walter insists that we can't make foreach use dchar, then for consistency's sake at least let's make std.range use char.
Aug 19 2010
prev sibling parent reply "Simen kjaeraas" <simen.kjaras gmail.com> writes:
Rainer Deyke <rainerd eldwood.com> wrote:

 On 8/19/2010 03:56, Jonathan Davis wrote:
 The problem is that chars are not characters. They are UTF-8 code
 units.
So what? You're acting like 'char' (and specifically 'char[]') is some sort of unique special case. In reality, it's just one case of encoded data. What about compressed data? What about packed arrays of bits? What about other containers?
First off, char, wchar, and dchar are special cases already - they're basically byte, short, and int, but are treated somewhat differently.

One possibility, which would make strings a less integrated part of the language, is to make them simple range structs, and hide UTF-8/16 details in the implementation. If it were not for the fact that D touts its UTF capabilities, and that this would make it a little less true, and the fact that char/wchar/dchar are already treated specially, I would support this idea.

-- 
Simen
Aug 20 2010
next sibling parent reply Jonathan M Davis <jmdavisprog gmail.com> writes:
On Friday, August 20, 2010 09:44:26 Simen kjaeraas wrote:
 Rainer Deyke <rainerd eldwood.com> wrote:
 On 8/19/2010 03:56, Jonathan Davis wrote:
 The problem is that chars are not characters. They are UTF-8 code
 units.
So what? You're acting like 'char' (and specifically 'char[]') is some sort of unique special case. In reality, it's just one case of encoded data. What about compressed data? What about packed arrays of bits? What about other containers?
First off, char, wchar, and dchar are special cases already - they're basically byte, short, and int, but are treated somewhat differently. One possibility, which would make strings a less integrated part of the language, is to make them simple range structs, and hide UTF-8/16 details in the implementation. If it were not for the fact that D touts its UTF capabilities, and that this would make it a little less true, and the fact that char/wchar/dchar are already treated specially, I would support this idea.
If you do that, you'd probably do something like

struct String(C) {
    C[] array;

    dchar front() {
        size_t i = 0;
        return decode(array, i);   // std.utf.decode
    }

    dchar back() { /* more complicated code */ }

    void popFront() { array.popFront(); }
    void popBack() { array.popBack(); }
    bool empty() { return array.empty; }
}

alias String!(immutable char) string;

Naturally, there would be template constraints, the functions might be a bit more complex, and there would probably be some other functions (not to mention, you might have to do something fancy to get the immutable part to work since IIRC templates remove immutable and const so that they don't generate different templates for immutable, const, and mutable), but essentially, you would wrap the various string types in a struct with range operations based on dchar. You could get at the underlying array quite easily if you actually wanted array operations. And if you want string operations, well, you have the range operations. Everywhere in the code where you currently have string, you'd have String!(immutable char) instead of immutable(char)[].

I really don't know what all of the implications of this are. There have been similar suggestions before. You don't really hide the fact that they're UTF-8 and UTF-16. Rather, you just make it so that the main interface to them is UTF-32. Anyone who wants the UTF-8 or UTF-16 array can get at it just fine. I'm not sure how much this really saves you though, nor what all the problems a struct like this would cause over what we currently have. But you'd probably still have to special case stuff, since there are going to be algorithms that need to process the underlying array rather than the dchar range in order to be properly efficient, if they are to work at all. Also, without universal function call syntax, I think that the only way to make it possible to call functions on it as if they were member functions is to use opDispatch(), which would definitely cause bugs (opDot() won't work since the most that you could do at that point is pass it along to the internal array, and then we're right back where we started).

So, ultimately, I'm not sure that such a change would gain you much, and you're definitely losing something big. Ultimately, I think that we're stuck with what we've got, though we may be able to make some tweaks. Fundamentally, we're trying to treat something as two different things without treating it as two different things. We want to treat it as a range of characters and an array of Unicode code units at the same time, using it as a range of characters where appropriate and using it as an array of code units where appropriate, without having to special case it. I just don't think that that's going to work. We can improve our situation with the use of good template and trait stuff, along with making iterating over string types without specifying a type a warning/error or making it default to dchar. But ultimately, there's a fundamental disjoint going on here, and we have to deal with it.

- Jonathan M Davis
Aug 20 2010
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 08/20/2010 12:22 PM, Jonathan M Davis wrote:
 On Friday, August 20, 2010 09:44:26 Simen kjaeraas wrote:
 Rainer Deyke<rainerd eldwood.com>  wrote:
 On 8/19/2010 03:56, Jonathan Davis wrote:
 The problem is that chars are not characters. They are UTF-8 code
 units.
So what? You're acting like 'char' (and specifically 'char[]') is some sort of unique special case. In reality, it's just one case of encoded data. What about compressed data? What about packed arrays of bits? What about other containers?
First off, char, wchar, and dchar are special cases already - they're basically byte, short, and int, but are treated somewhat differently. One possibility, which would make strings a less integrated part of the language, is to make them simple range structs, and hide UTF-8/16 details in the implementation. If it were not for the fact that D touts its UTF capabilities, and that this would make it a little less true, and the fact that char/wchar/dchar are already treated specially, I would support this idea.
If you do that, you'd probably do something like struct String(C) { C[] array; dchar front() { size_t i = 0; return decod(a, i); } dchar back() { /* more complicated code*/ } void popFront() { array.popFront(); } void popBack() { array.popBack(); } bool empty() { return array.empty; } } alias String(immutable char) string;
Grep std/ for byDchar. Andrei
Aug 20 2010
prev sibling parent Kagamin <spam here.lot> writes:
Jonathan M Davis Wrote:

 Everywhere in the code where you currently have string, you'd have 
 String(immutable char) instead of immutable (char)[].
Not necessarily. I think you can leave std.algorithm string-agnostic and special-case string operations in, say, std.string, which would take and return regular string types but internally call std.algorithm on dchar range wrappers - this is what std.algorithm does now, I suppose.
 Fundamentally, we're trying to treat something as two 
 different things without treating it as two different things. We want to treat
it 
 as a range of characters and an array of unicode code units at the same time, 
 using it as a range of characters where appropriate and using it as an array
of 
 code units where appropriate without having to special case it. I just don't 
 think that that's going to work.
I think we just want to do string operations. I believe Java and .NET live fine with String classes and the string operations built into them.
Aug 20 2010
prev sibling parent Rainer Deyke <rainerd eldwood.com> writes:
On 8/20/2010 10:44, Simen kjaeraas wrote:
 First off, char, wchar, and dchar are special cases already - they're
 basically byte, short, and int, but are treated somewhat differently.
They're only special cases when placed in a built-in array. In any other container, they behave like normal types - unless the container uses built-in arrays internally, in which case it may not work at all.

I have no objection to a string type that uses utf-8 internally but iterates over full code points. My objection is specifically to special-casing built-in arrays to behave differently from all other arrays when instantiated on 'char' and 'wchar'. Rename 'char[]' to 'char""' (and keep 'char[]' as a simple array) and my objection goes away.

Again, I ask: what about 'Array!char'?

-- 
Rainer Deyke - rainerd eldwood.com
Aug 20 2010
prev sibling next sibling parent Kagamin <spam here.lot> writes:
dsimcha Wrote:

 If D is at all serious about generic programming, we simply can't require this
 to be dealt with **everywhere** as a special case.
Just remove the special case of automatic conversion from strings to dchar[] and you will have one less surprise. After all, it was a deliberate design decision to make strings arrays. Make dchar ranges explicit - you can't code without writing what you want, right? Or write a special-case library for strings that will do the job for you.
Aug 18 2010
prev sibling next sibling parent Kagamin <spam here.lot> writes:
dsimcha Wrote:

 If D is at all serious about generic programming, we simply can't require this
 to be dealt with **everywhere** as a special case.
I suspect ranges were designed for FP, so use map instead of foreach. Or fold. Or another 3-letter abbreviation. This will also give you the possibility to parallelize your code later; foreach is executed sequentially by design.
Aug 18 2010
prev sibling next sibling parent Pelle <pelle.mansson gmail.com> writes:
On 08/19/2010 04:37 AM, dsimcha wrote:
 I've been hacking in Phobos and parallelfuture and I've come to the conclusion
 that having typeof(c) in the expression foreach(c; string.init) not be a dchar
 is simply ridiculous.  I don't care how much existing code gets broken, this
 needs to be fixed.  Otherwise, all generic code will have to deal with it as a
 special case.  Most of it will probably overlook this special case in
 practice, and the net result will be more broken code than if we just bite the
 bullet and fix this now.
Currently, strings break foreach in generic code. This is terrible! I agree with this. I thought char[] was a UTF-8 sequence, not a byte sequence.
Aug 19 2010
prev sibling next sibling parent Michel Fortin <michel.fortin michelf.com> writes:
On 2010-08-18 22:37:04 -0400, dsimcha <dsimcha yahoo.com> said:

 If D is at all serious about generic programming, we simply can't require this
 to be dealt with **everywhere** as a special case.
I do agree that the current special case situation is pretty bad. Foreach really needs to use ElementType!string by default. Whether this is done by changing foreach (my preference), or by reverting ElementType!string to its previous incarnation and using a special range to iterate over characters, I think it'd be an improvement over the current situation. Having the standard library and the language disagree with each other is pretty bad.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Aug 19 2010
prev sibling next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from dsimcha (dsimcha yahoo.com)'s article
 I've been hacking in Phobos and parallelfuture and I've come to the conclusion
 that having typeof(c) in the expression foreach(c; string.init) not be a dchar
 is simply ridiculous.
BTW, what are some examples of where making dchar the default would **silently** break code? I can think of very few because now that we've gotten rid of implicit narrowing conversions in D2, you can't implicitly convert a dchar to a char or a byte. This should catch most cases at compile time.
Aug 19 2010
parent Kagamin <spam here.lot> writes:
dsimcha Wrote:

 BTW, what are some examples of where making dchar the default would
**silently**
 break code?
1. Read a file and cast the buffer to string.
2. A surprising difference in string lengths that were just checked.
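The first case, spelled out (a made-up snippet - the file name and function are just for illustration):

import std.file;

void scanLog() {
    // Today this "works" byte by byte even if the file isn't valid UTF-8.
    auto text = cast(string) read("data.log");

    foreach(c; text) {
        // c is currently a char, so nothing is decoded.  With dchar as the
        // default, the same loop would decode and could throw on malformed
        // UTF-8 -- a silent change in behavior.
    }
}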
Aug 19 2010
prev sibling next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from dsimcha (dsimcha yahoo.com)'s article
 I've been hacking in Phobos and parallelfuture and I've come to the conclusion
 that having typeof(c) in the expression foreach(c; string.init) not be a dchar
 is simply ridiculous.  I don't care how much existing code gets broken, this
 needs to be fixed.
Here's another good one. This one uses Lockstep, which is in the SVN version of std.range and is designed to provide syntactic sugar for iterating over multiple ranges in lockstep via foreach.

string str1, str2;

foreach(c1, c2; lockstep(str1, str2)) {}  // c1, c2 are dchars since Lockstep relies on range primitives.

foreach(c; str1) {}  // c is a char since the regular foreach loop doesn't use range primitives.

I'm starting to think the inconsistency between ranges and foreach is really the worst part. When viewed in isolation, Andrei's changes to std.range to make ElementType!string == dchar, etc. were definitely the right thing to do. However, if we can't fix foreach, it might be a good idea to undo them because in this case I think such a ridiculous, bug producing inconsistency is worse than doing The Wrong Thing consistently.
Aug 19 2010
parent Jonathan M Davis <jmdavisprog gmail.com> writes:
On Thursday, August 19, 2010 07:15:30 dsimcha wrote:
 == Quote from dsimcha (dsimcha yahoo.com)'s article
 
 I've been hacking in Phobos and parallelfuture and I've come to the
 conclusion that having typeof(c) in the expression foreach(c;
 string.init) not be a dchar is simply ridiculous.  I don't care how much
 existing code gets broken, this needs to be fixed.
Here's another good one. This one uses Lockstep, which is in the SVN version of std.range and is designed to provide syntactic sugar for iterating over multiple ranges in lockstep via foreach. string str1, str2; foreach(c1, c2; lockstep(str1, str2)) {} // c1, c2 are dchars since Lockstep relies on range primitives. foreach(c; str1) {} // c is a char since the regular foreach loop doesn't use range // primitives. I'm starting to think the inconsistency between ranges and foreach is really the worst part. When viewed in isolation, Andrei's changes to std.range to make ElementType!string == dchar, etc. were definitely the right thing to do. However, if we can't fix foreach, it might be a good idea to undo them because in this case I think such a ridiculous, bug producing inconsistency is worse than doing The Wrong Thing consistently.
Okay. Maybe this is what we do:

1. Make it a warning if not an outright error to use foreach with any char or wchar array (be they mutable, const, or immutable) without indicating the type. So,

foreach(c; mystring) {
    //...
}

would become illegal. You'd have to give the type for c. This would solve the problem where someone forgets to put the type. Since odds are that they wanted dchar anyway, the extra characters aren't really extra for most people. And the few who actually wanted char or wchar can just put the type. It shouldn't be a big deal. A programmer can still foolishly put char or wchar when what they actually need is dchar, but at least then it's a deliberate error due to ignorance rather than someone who knows what they're doing making a simple mistake. This will also catch errors in generic algorithms that end up trying to use foreach without giving the type.

2. Ditch ElementType in favor of something more like ExactElemType and ConceptElemType, where ExactElemType is the actual type in the array/range and ConceptElemType is the type that is conceptually in the array/range. So, for most types, those two will be the same, but for string types, ExactElemType will be char, wchar, or dchar, while ConceptElemType will always be dchar. That way, the algorithms that don't care about what the elements mean can just use ExactElemType while those that do care about what the elements mean use ConceptElemType.

I'm not sure that this is the best solution. However, the fact that string and wstring are arrays but can't always be treated as arrays is pretty much inescapable as long as they're arrays. It seems like no matter what we do, you either lose the ability to treat strings as arrays or you have to special case them all over the place. If they were structs that gave access to their underlying array for array operations and gave range operations for normal use (possibly along with a function for giving you the nth element, though it couldn't truly be random access unless it were a dstring), then maybe we could get this to work better. But we're dealing with the inherent problem that the container holds one type conceptually and a completely different type in reality.

- Jonathan M Davis
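P.S. A rough sketch of what I mean by the two templates (names and details are purely illustrative):

import std.traits;   // Unqual

// The type physically stored in the array.
template ExactElemType(T : T[]) {
    alias T ExactElemType;
}

// The type the array conceptually holds: dchar for narrow strings,
// the stored type for everything else.
template ConceptElemType(T : T[]) {
    static if(is(Unqual!T == char) || is(Unqual!T == wchar))
        alias dchar ConceptElemType;
    else
        alias T ConceptElemType;
}

static assert(is(ExactElemType!string == immutable(char)));
static assert(is(ConceptElemType!string == dchar));
static assert(is(ConceptElemType!(int[]) == int));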
Aug 19 2010
prev sibling next sibling parent "Simen kjaeraas" <simen.kjaras gmail.com> writes:
dsimcha <dsimcha yahoo.com> wrote:

 I've been hacking in Phobos and parallelfuture and I've come to the  
 conclusion
 that having typeof(c) in the expression foreach(c; string.init) not be a  
 dchar
 is simply ridiculous.  I don't care how much existing code gets broken,  
 this
 needs to be fixed.  Otherwise, all generic code will have to deal with  
 it as a
 special case.
The other alternative, as has been proposed, is to create (d|w)?string structs that are thin wrappers on top of immutable((d|w)?char)[]. Something along the lines of

struct string {
    immutable(ubyte)[] payload;
    alias payload this;

    // Implement range primitives here.
}

should work, though it is nowhere near as elegant as what we have, and would probably break some code. Do note that I used ubyte instead of char as the element type, seeing as how char/wchar would be unnecessary in this case.

This said, I am more in favor of changing the compiler than the strings.

-- 
Simen
Aug 19 2010
prev sibling parent reply Kagamin <spam here.lot> writes:
Jonathan M Davis Wrote:

 No, it doesn't hurt to have the iteration type larger than the actual type,
but 
 you're not going to have overflow.
Trivial: take byte and add 256.
 could have had overflow putting it in, but when you're taking it out, you know 
 that it fits because it was already in there. You could have overflow issues
with 
 math or whatnot inside the body of your loop if you're assigning to the
foreach 
 variable, but that has nothing to do with what you're getting out of the loop. 
As long as what you get out of the loop doesn't depend on the element type. Didn't you demonstrate how such a dependency can be introduced?
 It's fine with me to use narrow strings. Much as I'd love to avoid a lot of
these 
 issues, dstrings take up too much memory if you're going to be doing a lot of 
 string processing.
If you're going to use a lot of memory, there probably won't be much difference between strings and dstrings - you'll use a lot of memory in both cases. And don't forget that UTF-8 characters take up to 4 bytes.
 problem is that the default behavior is the abnormal (and therefore almost 
 certainly buggy) behavior. Generally D tries to make the normal behavior the 
 behavior that is less likely to cause bugs.
Type system hacks are likely to cause bugs.
 Very few people are actually going to 
 want to deal with code points. They want characters. The result is that it 
 becomes very easy to make mistakes with strings if you ever try and manipulate 
 them character-by-character.
If you care about people and want to force them to use dchar ranges, you can do it with the library: make it refuse narrow strings. As long as the library is unusable with narrow strings, people will have to do something about it - say, use wrappers like the one proposed in this thread (but providing a forward dchar range interface).
 It makes perfect sense for general arrays. It makes perfect sense if you don't 
 really care about the contents of the array for your algorithm (that is,
whether 
 they're code points or characters or just bytes in memory doesn't matter for 
 what you're doing). However, if you're actually processing characters, it
makes 
 no sense at all. This mess with foreach and strings is one of the big reasons 
 why foreach tends to be avoided in std.algorithm.
The problem here is that integers are not much different from characters in this regard.
 and given the fact that the string module deals almost exclusively with 
 string rather than wstring or dstring, it really doesn't make sense to use 
 dstrings in the general case.
This is my point: you can do it with the library; if you can't, fix the library.
 Not to mention, the Linux I/O stuff uses UTF-8, and 
 the Windows I/O stuff uses UTF-16, so dstring is less efficient for dealing
with 
 I/O.
Every string type is inefficient here, but a wrapper comparable to NSString can fix it for you.
 Perhaps what we need is some way to distinguish between the exact element type 
 on an array and the conceptual element type. So, for most arrays, they'd both
be 
 whatever the element type of the array is, but for strings the exact element 
 type would be char, whchar, or dchar while the conceptual type would be dchar. 
Conceptually, a number is an infinite sequence of digits with a decimal point. What do you plan to do about this?
Aug 19 2010
parent reply Jonathan M Davis <jmdavisprog gmail.com> writes:
On Thursday, August 19, 2010 12:18:03 Kagamin wrote:
 Jonathan M Davis Wrote:
 No, it doesn't hurt to have the iteration type larger than the actual
 type, but you're not going to have overflow.
Trivial: take byte and add 256.
Except that that only happens once you do something to the element that you get from foreach. You read a byte just fine without having overflow problems. You can't do the same with char or wchar. You often need multiple of them to get anything meaningful - unlike bytes. If you want to change the iteration type to int or long or whatever when iterating over bytes so that you can change the variable without overflow issues, you can. But a byte is meaningful by itself. Such is not generally the case with char or wchar.
 It's fine with me to use narrow strings. Much as I'd love to avoid a lot
 of these issues, dstrings take up too much memory if you're going to be
 doing a lot of string processing.
If you're going to take much memory, there probably won't be much difference between strings and dstrings, you'll take much memory in both cases. And don't forget that UTF-8 chars take up to 4 bytes.
For ASCII characters, a UTF-32 character takes _4_ times as much memory as a UTF-8 character. Even if you use lots of Asian characters, as I understand it, most won't take more than 3 bytes in UTF-8. So, even if you're using primarily Asian characters with UTF-8, you still have 25% space savings. And since apparently many Asian characters will fit into one wchar, if you use UTF-16 when you have lots of Asian characters, you're getting closer to 50% space savings over UTF-32. If you have a lot of strings, that's a lot of wasted memory.
 If you care about people and want to force them to use dchar ranges, you
 can do it with the library: make it refuse narrow strings - as long as the
 library is unusable with narrow strings, people will have to do something
 about it, say, use wrappers like one proposed in this thread (but
 providing forward dchar range interface).
We _can't_ force everyone to use dstring. That defeats having string and wstring in the first place and is incredibly inefficient space-wise. The standard libraries _need_ to work well with all string types.
 It makes perfect sense for general arrays. It makes perfect sense if you
 don't really care about the contents of the array for your algorithm
 (that is, whether they're code points or characters or just bytes in
 memory doesn't matter for what you're doing). However, if you're
 actually processing characters, it makes no sense at all. This mess with
 foreach and strings is one of the big reasons why foreach tends to be
 avoided in std.algorithm.
The problem here is that integers are not much different from characters in this regard.
Integers are totally different. An integer may be limited in the size of the number that it can hold, but it makes perfect sense to process each integer individually. An integer is a full value on its own. char and wchar are not. They're only parts of a whole.
 Conceptually number is an infinite sequence of digits with decimal point.
 What do you plan to do about this?
That's a totally different issue. The solution for that is to use a BigInt type which combines multiple integers (or bytes or longs or whatever) together to make larger values than primitive integral types can hold. In that case, if you were to try and iterate over individual ints within the BigInt, then you'd be screwed because they don't mean anything on their own. string and wstring are effectively BigInt for chars and wchars. You have to combine multiple of them to get meaningful values. The fact that one of them can't hold a big enough (let alone infinite) range is the whole reason that they were created in the first place (that and the fact that making the type big enough (i.e. dchar) on its own wastes a lot of space).

- Jonathan M Davis
Aug 19 2010
parent Kagamin <spam here.lot> writes:
Jonathan M Davis Wrote:

 Trivial: take byte and add 256.
If you want to change the iteration type to int or long or whatever when iterating over bytes so that you can change the variable without overflow issues, you can. But the byte itself is meaingful by itself. Such is not generally the case with char or wchar.
I thought it was your point that having a meaning doesn't help to avoid bugs.
 If you care about people and want to force them to use dchar ranges, you
 can do it with the library: make it refuse narrow strings - as long as the
 library is unusable with narrow strings, people will have to do something
 about it, say, use wrappers like one proposed in this thread (but
 providing forward dchar range interface).
We _can't_ force everyone to use dstring.
I'm not talking about dstrings; I said a dchar range wrapper. Andrei mentioned byDchar - I don't know if that's the thing. Anyway, std.algorithm does iterate over dchars in narrow strings somehow. You can do it too.
Aug 20 2010