digitalmars.D - const and phobos string functions

Regan Heath (41/41) Aug 01 2007 I wanted to post this real/concrete case of where 'const' and the

Kirk McDonald (7/66) Aug 01 2007 A lot of those casts go away if tmp is a string[].

Regan Heath (30/47) Aug 02 2007 True. I knew someone would realise/spot that. ;) I don't think it

Kirk McDonald (21/79) Aug 02 2007 In this particular case, you could call this mutating form of tolower()

Regan Heath (12/29) Aug 02 2007 Sure, that solves this particular case. But, do you agree there is

Reiner Pope (13/48) Aug 02 2007 I think the code you gave here would be misleading. For instance,

Regan Heath (34/75) Aug 02 2007 Good point.

Reiner Pope (67/158) Aug 02 2007 A lot of functions just to do the string operation efficiently, though,

Regan Heath (14/20) Aug 02 2007 Provided it always performs a .dup when it modifies the input it should

Daniel Keep (27/37) Aug 02 2007 I actually tried doing this in a math library of mine, but I didn't like

Daniel Keep (2/9) Aug 02 2007 Those members should, of course, be static. me.duh();

Regan Heath <regan netmail.co.nz> writes:

I wanted to post this real/concrete case of where 'const' and the 
current phobos implementation of certain string routines combine to 
cause a few irritations, inefficiencies and some IMO illogical behaviour.

Note in the code below 'remove' is a template function which uses 
'memmove' to remove an item from the array.

(The code):

char[][] tmp;

tmp = cast(char[][])splitlines(cast(string)read(a[3..$]));
foreach(ref line; tmp) line = cast(char[])strip(line);
for(int i = 0; i < tmp.length; i++)
{
   if (tmp[i].length == 0)
     tmp.remove(i);
}

(The problem):

Notice all the cast()'s required above.  I am of the belief that casting 
should only be done in exceptional circumstances, and this shouldn't be 
one of them.

If you consider char[] as being a a string which you are the full owner 
of, you can change it at will, and 'string' as being a string which you 
are not the full owner of, as you can only read it then in many cases it 
doesn't make sense to pass a char[] to a function and get a 'string' 
back, unless the function has 'taken ownership' of the string data or is 
returning different string data which you are not the owner of.

The obvious case where it make sense is a property setter which might 
copy the input and return a 'string' reference to the copy.

In the cases above (splitlines, strip) it doesn't make sense for the 
function to 'take ownership' of the strings as they do not store them 
beyond the lifetime of the function call.

Therefore if passed a char[] they should return char[][] and char[] 
respectively.

Of course if passed 'string' they should return 'string[]' and 'string' 
respectively.

(The solution):

Given the requirements it seems templating is the obvious solution.  I 
have described in earlier posts (though these relate to tolower 
specifically):

http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=55335
http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=55337

All the phobos routines should likely be templated in the same fashion.

Regan

Aug 01 2007

Kirk McDonald <kirklin.mcdonald gmail.com> writes:

Regan Heath wrote:
 I wanted to post this real/concrete case of where 'const' and the 
 current phobos implementation of certain string routines combine to 
 cause a few irritations, inefficiencies and some IMO illogical behaviour.
 
 Note in the code below 'remove' is a template function which uses 
 'memmove' to remove an item from the array.
 
 (The code):
 
 char[][] tmp;
 
 tmp = cast(char[][])splitlines(cast(string)read(a[3..$]));
 foreach(ref line; tmp) line = cast(char[])strip(line);
 for(int i = 0; i < tmp.length; i++)
 {
   if (tmp[i].length == 0)
     tmp.remove(i);
 }
 
 (The problem):
 
 Notice all the cast()'s required above.  I am of the belief that casting 
 should only be done in exceptional circumstances, and this shouldn't be 
 one of them.
 
 If you consider char[] as being a a string which you are the full owner 
 of, you can change it at will, and 'string' as being a string which you 
 are not the full owner of, as you can only read it then in many cases it 
 doesn't make sense to pass a char[] to a function and get a 'string' 
 back, unless the function has 'taken ownership' of the string data or is 
 returning different string data which you are not the owner of.
 
 The obvious case where it make sense is a property setter which might 
 copy the input and return a 'string' reference to the copy.
 
 In the cases above (splitlines, strip) it doesn't make sense for the 
 function to 'take ownership' of the strings as they do not store them 
 beyond the lifetime of the function call.
 
 Therefore if passed a char[] they should return char[][] and char[] 
 respectively.
 
 Of course if passed 'string' they should return 'string[]' and 'string' 
 respectively.
 
 (The solution):
 
 Given the requirements it seems templating is the obvious solution.  I 
 have described in earlier posts (though these relate to tolower 
 specifically):
 
 http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmar
.D&article_id=55335 
 
 http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmar
.D&article_id=55337 
 
 
 All the phobos routines should likely be templated in the same fashion.
 
 Regan

A lot of those casts go away if tmp is a string[].

-- 
Kirk McDonald
http://kirkmcdonald.blogspot.com
Pyd: Connecting D and Python
http://pyd.dsource.org

Aug 01 2007

Regan Heath <regan netmail.co.nz> writes:

Kirk McDonald wrote:
 Regan Heath wrote:
 Note in the code below 'remove' is a template function which uses 
 'memmove' to remove an item from the array.

 (The code):

 char[][] tmp;

 tmp = cast(char[][])splitlines(cast(string)read(a[3..$]));
 foreach(ref line; tmp) line = cast(char[])strip(line);
 for(int i = 0; i < tmp.length; i++)
 {
   if (tmp[i].length == 0)
     tmp.remove(i);
 }

 
 A lot of those casts go away if tmp is a string[].

True.  I knew someone would realise/spot that.  ;)  I don't think it 
solves the basic problem however, which is why I didn't mention it in me 
original post, perhaps I should have.  Let me explain further...

Imagine calling splitlines on char[] getting string[] then calling 
tolower on each line.

Because you have a string[], tolower always has to dup any line it wants 
to change.

If you template tolower and splitlines so that they accept const and 
non-const input and return the same then in the case where it gets 
non-const input tolower can avoid duplication (AKA *performance gain*)

The same is true for any function (phobos or otherwise) which will (or 
even might) modify the input data.

In the case of splitlines or strip they don't need to modify the input 
data so there is no performance gain by templating them and handling 
string and char[] differently.

However, what you do get is a function which no longer 'takes ownership' 
of the array passed and can then be used in a sequence of modifications 
without having to cast or dup to 'take ownership back' by brute force 
(the problem my code illustrates)

To think of it in the abstract sense, by passing a char[] you're passing 
ownership of the data to the function, when it returns it is passing 
ownership back again in the form char[] or char[][].

When you pass string you're not passing ownership and the function has 
no other option but to duplicate the input when it wants to modify it.

In the current case of tolower it returns string and therefore retains 
ownership of the string, despite the fact that it promptly forgets it 
ever existed.  Not much we can do about this case because we cannot 
decide what to return from a function at runtime.

Regan

Aug 02 2007

Kirk McDonald <kirklin.mcdonald gmail.com> writes:

Regan Heath wrote:
 Kirk McDonald wrote:
 Regan Heath wrote:
 Note in the code below 'remove' is a template function which uses 
 'memmove' to remove an item from the array.

 (The code):

 char[][] tmp;

 tmp = cast(char[][])splitlines(cast(string)read(a[3..$]));
 foreach(ref line; tmp) line = cast(char[])strip(line);
 for(int i = 0; i < tmp.length; i++)
 {
   if (tmp[i].length == 0)
     tmp.remove(i);
 }

 A lot of those casts go away if tmp is a string[].

 
 True.  I knew someone would realise/spot that.  ;)  I don't think it 
 solves the basic problem however, which is why I didn't mention it in me 
 original post, perhaps I should have.  Let me explain further...
 
 Imagine calling splitlines on char[] getting string[] then calling 
 tolower on each line.
 
 Because you have a string[], tolower always has to dup any line it wants 
 to change.
 
 If you template tolower and splitlines so that they accept const and 
 non-const input and return the same then in the case where it gets 
 non-const input tolower can avoid duplication (AKA *performance gain*)
 
 The same is true for any function (phobos or otherwise) which will (or 
 even might) modify the input data.
 
 In the case of splitlines or strip they don't need to modify the input 
 data so there is no performance gain by templating them and handling 
 string and char[] differently.
 
 However, what you do get is a function which no longer 'takes ownership' 
 of the array passed and can then be used in a sequence of modifications 
 without having to cast or dup to 'take ownership back' by brute force 
 (the problem my code illustrates)
 
 To think of it in the abstract sense, by passing a char[] you're passing 
 ownership of the data to the function, when it returns it is passing 
 ownership back again in the form char[] or char[][].
 
 When you pass string you're not passing ownership and the function has 
 no other option but to duplicate the input when it wants to modify it.
 
 In the current case of tolower it returns string and therefore retains 
 ownership of the string, despite the fact that it promptly forgets it 
 ever existed.  Not much we can do about this case because we cannot 
 decide what to return from a function at runtime.
 
 Regan

In this particular case, you could call this mutating form of tolower() 
on the buffer returned from read(), and then split it afterwards 
(perhaps yielding strings). This makes a certain degree of logical 
sense: If you'd wanted to keep the original contents of the buffer, 
you'd have to duplicate it at some point anyway. Since you don't, you 
can alter it immediately.

I submit the following Python idiom: Functions which mutate their 
arguments should return nothing. That is:

// Return new string
string tolower(string);
// Mutate argument
void tolower(char[]);

This rather strictly highlights the difference between char[] and 
string, and makes it essentially impossible to mix up library functions 
differentiated in this way.

-- 
Kirk McDonald
http://kirkmcdonald.blogspot.com
Pyd: Connecting D and Python
http://pyd.dsource.org

Aug 02 2007

Regan Heath <regan netmail.co.nz> writes:

Kirk McDonald wrote:
 In this particular case, you could call this mutating form of tolower() 
 on the buffer returned from read(), and then split it afterwards 
 (perhaps yielding strings). This makes a certain degree of logical 
 sense: If you'd wanted to keep the original contents of the buffer, 
 you'd have to duplicate it at some point anyway. Since you don't, you 
 can alter it immediately.

Sure, that solves this particular case.  But, do you agree there is 
generally a problem, or shall I continue to dream up example cases for 
you to solve in some other manner :)

 I submit the following Python idiom: Functions which mutate their 
 arguments should return nothing. That is:
 
 // Return new string
 string tolower(string);
 // Mutate argument
 void tolower(char[]);
 
 This rather strictly highlights the difference between char[] and 
 string, and makes it essentially impossible to mix up library functions 
 differentiated in this way.

That's all well and good until you want to write:

char[] s;
...
foo(tolower(s));

If you template tolower in the manner I described it's not possible to 
mix it up and call the wrong one anyway (as it selects the correct one 
based on the input) so it's a non-problem.

Regan

Aug 02 2007

Reiner Pope <some address.com> writes:

Regan Heath wrote:
 Kirk McDonald wrote:
 In this particular case, you could call this mutating form of 
 tolower() on the buffer returned from read(), and then split it 
 afterwards (perhaps yielding strings). This makes a certain degree of 
 logical sense: If you'd wanted to keep the original contents of the 
 buffer, you'd have to duplicate it at some point anyway. Since you 
 don't, you can alter it immediately.

 
 Sure, that solves this particular case.  But, do you agree there is 
 generally a problem, or shall I continue to dream up example cases for 
 you to solve in some other manner :)
 
 I submit the following Python idiom: Functions which mutate their 
 arguments should return nothing. That is:

 // Return new string
 string tolower(string);
 // Mutate argument
 void tolower(char[]);

 This rather strictly highlights the difference between char[] and 
 string, and makes it essentially impossible to mix up library 
 functions differentiated in this way.

 
 That's all well and good until you want to write:
 
 char[] s;
 ...
 foo(tolower(s));
 
 If you template tolower in the manner I described it's not possible to 
 mix it up and call the wrong one anyway (as it selects the correct one 
 based on the input) so it's a non-problem.
 
 Regan

I think the code you gave here would be misleading. For instance, 
suppose you had your templated overloads, effectively giving you

    // copy-on-write
    string tolower(string);
    // in-place
    char[] tolower(char[]);

Then you would get surprising results if you did this:

char[] s = "Hello World!".dup;
int a = howManyLettersDiffer(tolower(s), s);
assert (a == 2); // assert failed, a is actually 0

I think avoiding this is exactly why Kirk suggested his overloads.

   -- Reiner

Aug 02 2007

Regan Heath <regan netmail.co.nz> writes:

Reiner Pope wrote:
 Regan Heath wrote:
 Kirk McDonald wrote:
 I submit the following Python idiom: Functions which mutate their 
 arguments should return nothing. That is:

 // Return new string
 string tolower(string);
 // Mutate argument
 void tolower(char[]);

 This rather strictly highlights the difference between char[] and 
 string, and makes it essentially impossible to mix up library 
 functions differentiated in this way.

 That's all well and good until you want to write:

 char[] s;
 ...
 foo(tolower(s));

 If you template tolower in the manner I described it's not possible to 
 mix it up and call the wrong one anyway (as it selects the correct one 
 based on the input) so it's a non-problem.

 Regan

 
 I think the code you gave here would be misleading. For instance, 
 suppose you had your templated overloads, effectively giving you
 
    // copy-on-write
    string tolower(string);
    // in-place
    char[] tolower(char[]);
 
 Then you would get surprising results if you did this:
 
 char[] s = "Hello World!".dup;
 int a = howManyLettersDiffer(tolower(s), s);
 assert (a == 2); // assert failed, a is actually 0
 
 I think avoiding this is exactly why Kirk suggested his overloads.

Good point.

Using Kirk's Python'esque function signatures you'd have:

char[] s = "Hello World!".dup;
tolower(s);
int a = howManyLettersDiffer(s, s);
assert (a == 2); // assert failed, a is actually 0

which is immediately and obviously a pointless operation, requring a 
re-code to:

char[] s = "Hello World!".dup;
char[] s2 = s.dup
tolower(s2);
int a = howManyLettersDiffer(s2, s);
assert (a == 2); // assert failed, a is actually 0

And the 'string' case:

string s = "Hello World!";
int a = howManyLettersDiffer(tolower(s), s);
assert (a == 2); // assert failed, a is actually 0

which would already behave as desired.

It seems that tolower is always going to have to 'copy on write' and 
return that copy.

But, that doesn't stop you templating it so that it returns the copy as 
char[] when you pass char[] (which is all I really wanted in the first 
place) :)

Unfortunately as we cannot overload on return type it would prevent an 
inplace tolower in the form (given by Kirk):

void tolower(char[])

Instead perhaps we need a naming convention for inplace modifying 
functions, options:

void <func>Inplace(<args>)
void <func>IP(<args>)       // "IP" stands for inplace
void <func>M(<args>)        // "M" stands for mutates

or something like those.

Regan

Aug 02 2007

Reiner Pope <some address.com> writes:

Regan Heath wrote:
 Reiner Pope wrote:
 Regan Heath wrote:
 Kirk McDonald wrote:
 I submit the following Python idiom: Functions which mutate their 
 arguments should return nothing. That is:

 // Return new string
 string tolower(string);
 // Mutate argument
 void tolower(char[]);

 This rather strictly highlights the difference between char[] and 
 string, and makes it essentially impossible to mix up library 
 functions differentiated in this way.

 That's all well and good until you want to write:

 char[] s;
 ...
 foo(tolower(s));

 If you template tolower in the manner I described it's not possible 
 to mix it up and call the wrong one anyway (as it selects the correct 
 one based on the input) so it's a non-problem.

 Regan

 I think the code you gave here would be misleading. For instance, 
 suppose you had your templated overloads, effectively giving you

    // copy-on-write
    string tolower(string);
    // in-place
    char[] tolower(char[]);

 Then you would get surprising results if you did this:

 char[] s = "Hello World!".dup;
 int a = howManyLettersDiffer(tolower(s), s);
 assert (a == 2); // assert failed, a is actually 0

 I think avoiding this is exactly why Kirk suggested his overloads.

 
 Good point.
 
 Using Kirk's Python'esque function signatures you'd have:
 
 char[] s = "Hello World!".dup;
 tolower(s);
 int a = howManyLettersDiffer(s, s);
 assert (a == 2); // assert failed, a is actually 0
 
 which is immediately and obviously a pointless operation, requring a 
 re-code to:
 
 char[] s = "Hello World!".dup;
 char[] s2 = s.dup
 tolower(s2);
 int a = howManyLettersDiffer(s2, s);
 assert (a == 2); // assert failed, a is actually 0
 
 And the 'string' case:
 
 string s = "Hello World!";
 int a = howManyLettersDiffer(tolower(s), s);
 assert (a == 2); // assert failed, a is actually 0
 
 which would already behave as desired.
 
 It seems that tolower is always going to have to 'copy on write' and 
 return that copy.
 
 But, that doesn't stop you templating it so that it returns the copy as 
 char[] when you pass char[] (which is all I really wanted in the first 
 place) :)
 
 Unfortunately as we cannot overload on return type it would prevent an 
 inplace tolower in the form (given by Kirk):
 
 void tolower(char[])
 
 Instead perhaps we need a naming convention for inplace modifying 
 functions, options:
 
 void <func>Inplace(<args>)
 void <func>IP(<args>)       // "IP" stands for inplace
 void <func>M(<args>)        // "M" stands for mutates
 
 or something like those.
 
 Regan

A lot of functions just to do the string operation efficiently, though, 
isn't it? It's easy to see in cases like this why Walter cites the 
Pareto principle, saying mostly this isn't the bottleneck.

But still, isn't D supposed to give you fast code _easily_ ?

I must say, I *like* the simplicity of overloading with a mutable and a 
readonly version, although we have established that it can certainly 
lead to some confusion. Mind you, D arrays ignore that potential 
confusion (in the .sort and .reverse properties).

I also had some fancy ideas for a CoW wrapper, which looks something like:

struct CoW_Wrapper(T)
{
   const(T) val;
   private bool isMutable;

   void opAssign(T t)
   {
     val = t;
     isMutable = true;
   }

   void opAssign(const(T) t)
   {
     val = t;
     isMutable = false;
   }

   T mutable()
   {
     if (!isMutable)
     {
       val = val.dup;
       isMutable = true;
     }
     return cast(T) val;
   }
}

This would make using CoW quite simple, and would also solve the problem 
of keeping track across function boundaries of whether it's been duped, 
for things like:

    string s = ...;
    string s2 = s.tolower().replace("foo", "bar").entab();

which could potentially involve 3 dups. (I'm aware of the easier 
solution of templating to support char[]->char[] and string->string 
overloads, but let's have some fun going the extra way :-) )

You would write something like tolower as (ignoring unicode stuff):

alias CoW_Wrapper!(char[]) cow_string;

cow_string tolower(cow_string s)
{
   foreach (i; 0..s.val.length)
   {
     if (s.val[i] >= 'A' && s.val[i] <= 'Z')
       s.mutable[i] = s.val[i] + 'a' - 'A';
   }
   return s;
}

The neat feature is that the cow_string returned knows whether it 
isMutable, so you will always have the fewest required dups: 0 or 1.

Unfortunately, it doesn't currently look very nice because of the lack 
of opImplicitCast, which means you'd have to write:

   string s2 = s.tolower.val;

instead of

   string s2 = s.tolower;

(there's also problems with the type of s, but that can be solved with 
some template stuff on the tolower side; those template manipulations 
can also bypass the cow_string struct if the parameter type is char[], 
by defining
    char[] mutable(char[] s) { return s; }
)

   -- Reiner

Aug 02 2007

Regan Heath <regan netmail.co.nz> writes:

Reiner Pope wrote:
 I must say, I *like* the simplicity of overloading with a mutable and a 
 readonly version, although we have established that it can certainly 
 lead to some confusion. 

Provided it always performs a .dup when it modifies the input it should 
be fine.  i.e. take the existing tolower in std.string and simply change 
the function signature to:

T tolower(T)(T s)

and make no other changes and your example:

char[] s = "Hello World!".dup;
int a = howManyLettersDiffer(tolower(s), s);
assert (a == 2); // assert failed, a is actually 0

will work as expected.

 Mind you, D arrays ignore that potential
 confusion (in the .sort and .reverse properties).

True.

 I also had some fancy ideas for a CoW wrapper, which looks something like:

<snip>

I wrote something like that once too.

Regan

Aug 02 2007

Daniel Keep <daniel.keep.lists gmail.com> writes:

Regan Heath wrote:
 Instead perhaps we need a naming convention for inplace modifying
 functions, options:
 
 void <func>Inplace(<args>)
 void <func>IP(<args>)       // "IP" stands for inplace
 void <func>M(<args>)        // "M" stands for mutates
 
 or something like those.
 
 Regan

I actually tried doing this in a math library of mine, but I didn't like
using normal letters; I wanted to use a Unicode character that was easy
to type, showed up in Vim, and would visually distinguish the functions.

Sadly, all I could come up with[1] for "in place" was '¶', and the
compiler didn't like that :(

Sometimes, I wonder if it wouldn't help a lot to be able to define
function overloads based on argument decoration.  So, you could have
something like:

  tolower(str);
  tolower(inplace str);
  tolower(takecopy str);

There are so many times where you have several functions that do the
exact same thing (semantically speaking), but have different
implementations.

Maybe I could fake it with structs...

struct tolower
{
    string opCall(string str);
    void inplace(char[] str);
    char[] takecopy(char[] str);
    char[] takecopy(string str);
}

Something to think about, anyway...

	-- Daniel

[1] That's what you get if you type ^KIP in Vim.  If you type ^Kip, you
get 'ぴ', and ^KiP is 'ピ' :P

Aug 02 2007

Daniel Keep <daniel.keep.lists gmail.com> writes:

Daniel Keep wrote:
 struct tolower
 {
     string opCall(string str);
     void inplace(char[] str);
     char[] takecopy(char[] str);
     char[] takecopy(string str);
 }

Those members should, of course, be static.  me.duh();

Aug 02 2007

D Programming

C/C++ Programming

Other

digitalmars.D - const and phobos string functions