www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - const and phobos string functions

reply Regan Heath <regan netmail.co.nz> writes:
I wanted to post this real/concrete case of where 'const' and the 
current phobos implementation of certain string routines combine to 
cause a few irritations, inefficiencies and some IMO illogical behaviour.

Note in the code below 'remove' is a template function which uses 
'memmove' to remove an item from the array.

(The code):

char[][] tmp;

tmp = cast(char[][])splitlines(cast(string)read(a[3..$]));
foreach(ref line; tmp) line = cast(char[])strip(line);
for(int i = 0; i < tmp.length; i++)
{
   if (tmp[i].length == 0)
     tmp.remove(i);
}

(The problem):

Notice all the cast()'s required above.  I am of the belief that casting 
should only be done in exceptional circumstances, and this shouldn't be 
one of them.

If you consider char[] as being a a string which you are the full owner 
of, you can change it at will, and 'string' as being a string which you 
are not the full owner of, as you can only read it then in many cases it 
doesn't make sense to pass a char[] to a function and get a 'string' 
back, unless the function has 'taken ownership' of the string data or is 
returning different string data which you are not the owner of.

The obvious case where it make sense is a property setter which might 
copy the input and return a 'string' reference to the copy.

In the cases above (splitlines, strip) it doesn't make sense for the 
function to 'take ownership' of the strings as they do not store them 
beyond the lifetime of the function call.

Therefore if passed a char[] they should return char[][] and char[] 
respectively.

Of course if passed 'string' they should return 'string[]' and 'string' 
respectively.

(The solution):

Given the requirements it seems templating is the obvious solution.  I 
have described in earlier posts (though these relate to tolower 
specifically):

http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=55335
http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=55337

All the phobos routines should likely be templated in the same fashion.

Regan
Aug 01 2007
parent reply Kirk McDonald <kirklin.mcdonald gmail.com> writes:
Regan Heath wrote:
 I wanted to post this real/concrete case of where 'const' and the 
 current phobos implementation of certain string routines combine to 
 cause a few irritations, inefficiencies and some IMO illogical behaviour.
 
 Note in the code below 'remove' is a template function which uses 
 'memmove' to remove an item from the array.
 
 (The code):
 
 char[][] tmp;
 
 tmp = cast(char[][])splitlines(cast(string)read(a[3..$]));
 foreach(ref line; tmp) line = cast(char[])strip(line);
 for(int i = 0; i < tmp.length; i++)
 {
   if (tmp[i].length == 0)
     tmp.remove(i);
 }
 
 (The problem):
 
 Notice all the cast()'s required above.  I am of the belief that casting 
 should only be done in exceptional circumstances, and this shouldn't be 
 one of them.
 
 If you consider char[] as being a a string which you are the full owner 
 of, you can change it at will, and 'string' as being a string which you 
 are not the full owner of, as you can only read it then in many cases it 
 doesn't make sense to pass a char[] to a function and get a 'string' 
 back, unless the function has 'taken ownership' of the string data or is 
 returning different string data which you are not the owner of.
 
 The obvious case where it make sense is a property setter which might 
 copy the input and return a 'string' reference to the copy.
 
 In the cases above (splitlines, strip) it doesn't make sense for the 
 function to 'take ownership' of the strings as they do not store them 
 beyond the lifetime of the function call.
 
 Therefore if passed a char[] they should return char[][] and char[] 
 respectively.
 
 Of course if passed 'string' they should return 'string[]' and 'string' 
 respectively.
 
 (The solution):
 
 Given the requirements it seems templating is the obvious solution.  I 
 have described in earlier posts (though these relate to tolower 
 specifically):
 
 http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmar
.D&article_id=55335 
 
 http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmar
.D&article_id=55337 
 
 
 All the phobos routines should likely be templated in the same fashion.
 
 Regan
A lot of those casts go away if tmp is a string[]. -- Kirk McDonald http://kirkmcdonald.blogspot.com Pyd: Connecting D and Python http://pyd.dsource.org
Aug 01 2007
parent reply Regan Heath <regan netmail.co.nz> writes:
Kirk McDonald wrote:
 Regan Heath wrote:
 Note in the code below 'remove' is a template function which uses 
 'memmove' to remove an item from the array.

 (The code):

 char[][] tmp;

 tmp = cast(char[][])splitlines(cast(string)read(a[3..$]));
 foreach(ref line; tmp) line = cast(char[])strip(line);
 for(int i = 0; i < tmp.length; i++)
 {
   if (tmp[i].length == 0)
     tmp.remove(i);
 }
A lot of those casts go away if tmp is a string[].
True. I knew someone would realise/spot that. ;) I don't think it solves the basic problem however, which is why I didn't mention it in me original post, perhaps I should have. Let me explain further... Imagine calling splitlines on char[] getting string[] then calling tolower on each line. Because you have a string[], tolower always has to dup any line it wants to change. If you template tolower and splitlines so that they accept const and non-const input and return the same then in the case where it gets non-const input tolower can avoid duplication (AKA *performance gain*) The same is true for any function (phobos or otherwise) which will (or even might) modify the input data. In the case of splitlines or strip they don't need to modify the input data so there is no performance gain by templating them and handling string and char[] differently. However, what you do get is a function which no longer 'takes ownership' of the array passed and can then be used in a sequence of modifications without having to cast or dup to 'take ownership back' by brute force (the problem my code illustrates) To think of it in the abstract sense, by passing a char[] you're passing ownership of the data to the function, when it returns it is passing ownership back again in the form char[] or char[][]. When you pass string you're not passing ownership and the function has no other option but to duplicate the input when it wants to modify it. In the current case of tolower it returns string and therefore retains ownership of the string, despite the fact that it promptly forgets it ever existed. Not much we can do about this case because we cannot decide what to return from a function at runtime. Regan
Aug 02 2007
parent reply Kirk McDonald <kirklin.mcdonald gmail.com> writes:
Regan Heath wrote:
 Kirk McDonald wrote:
 Regan Heath wrote:
 Note in the code below 'remove' is a template function which uses 
 'memmove' to remove an item from the array.

 (The code):

 char[][] tmp;

 tmp = cast(char[][])splitlines(cast(string)read(a[3..$]));
 foreach(ref line; tmp) line = cast(char[])strip(line);
 for(int i = 0; i < tmp.length; i++)
 {
   if (tmp[i].length == 0)
     tmp.remove(i);
 }
A lot of those casts go away if tmp is a string[].
True. I knew someone would realise/spot that. ;) I don't think it solves the basic problem however, which is why I didn't mention it in me original post, perhaps I should have. Let me explain further... Imagine calling splitlines on char[] getting string[] then calling tolower on each line. Because you have a string[], tolower always has to dup any line it wants to change. If you template tolower and splitlines so that they accept const and non-const input and return the same then in the case where it gets non-const input tolower can avoid duplication (AKA *performance gain*) The same is true for any function (phobos or otherwise) which will (or even might) modify the input data. In the case of splitlines or strip they don't need to modify the input data so there is no performance gain by templating them and handling string and char[] differently. However, what you do get is a function which no longer 'takes ownership' of the array passed and can then be used in a sequence of modifications without having to cast or dup to 'take ownership back' by brute force (the problem my code illustrates) To think of it in the abstract sense, by passing a char[] you're passing ownership of the data to the function, when it returns it is passing ownership back again in the form char[] or char[][]. When you pass string you're not passing ownership and the function has no other option but to duplicate the input when it wants to modify it. In the current case of tolower it returns string and therefore retains ownership of the string, despite the fact that it promptly forgets it ever existed. Not much we can do about this case because we cannot decide what to return from a function at runtime. Regan
In this particular case, you could call this mutating form of tolower() on the buffer returned from read(), and then split it afterwards (perhaps yielding strings). This makes a certain degree of logical sense: If you'd wanted to keep the original contents of the buffer, you'd have to duplicate it at some point anyway. Since you don't, you can alter it immediately. I submit the following Python idiom: Functions which mutate their arguments should return nothing. That is: // Return new string string tolower(string); // Mutate argument void tolower(char[]); This rather strictly highlights the difference between char[] and string, and makes it essentially impossible to mix up library functions differentiated in this way. -- Kirk McDonald http://kirkmcdonald.blogspot.com Pyd: Connecting D and Python http://pyd.dsource.org
Aug 02 2007
parent reply Regan Heath <regan netmail.co.nz> writes:
Kirk McDonald wrote:
 In this particular case, you could call this mutating form of tolower() 
 on the buffer returned from read(), and then split it afterwards 
 (perhaps yielding strings). This makes a certain degree of logical 
 sense: If you'd wanted to keep the original contents of the buffer, 
 you'd have to duplicate it at some point anyway. Since you don't, you 
 can alter it immediately.
Sure, that solves this particular case. But, do you agree there is generally a problem, or shall I continue to dream up example cases for you to solve in some other manner :)
 I submit the following Python idiom: Functions which mutate their 
 arguments should return nothing. That is:
 
 // Return new string
 string tolower(string);
 // Mutate argument
 void tolower(char[]);
 
 This rather strictly highlights the difference between char[] and 
 string, and makes it essentially impossible to mix up library functions 
 differentiated in this way.
That's all well and good until you want to write: char[] s; ... foo(tolower(s)); If you template tolower in the manner I described it's not possible to mix it up and call the wrong one anyway (as it selects the correct one based on the input) so it's a non-problem. Regan
Aug 02 2007
parent reply Reiner Pope <some address.com> writes:
Regan Heath wrote:
 Kirk McDonald wrote:
 In this particular case, you could call this mutating form of 
 tolower() on the buffer returned from read(), and then split it 
 afterwards (perhaps yielding strings). This makes a certain degree of 
 logical sense: If you'd wanted to keep the original contents of the 
 buffer, you'd have to duplicate it at some point anyway. Since you 
 don't, you can alter it immediately.
Sure, that solves this particular case. But, do you agree there is generally a problem, or shall I continue to dream up example cases for you to solve in some other manner :)
 I submit the following Python idiom: Functions which mutate their 
 arguments should return nothing. That is:

 // Return new string
 string tolower(string);
 // Mutate argument
 void tolower(char[]);

 This rather strictly highlights the difference between char[] and 
 string, and makes it essentially impossible to mix up library 
 functions differentiated in this way.
That's all well and good until you want to write: char[] s; ... foo(tolower(s)); If you template tolower in the manner I described it's not possible to mix it up and call the wrong one anyway (as it selects the correct one based on the input) so it's a non-problem. Regan
I think the code you gave here would be misleading. For instance, suppose you had your templated overloads, effectively giving you // copy-on-write string tolower(string); // in-place char[] tolower(char[]); Then you would get surprising results if you did this: char[] s = "Hello World!".dup; int a = howManyLettersDiffer(tolower(s), s); assert (a == 2); // assert failed, a is actually 0 I think avoiding this is exactly why Kirk suggested his overloads. -- Reiner
Aug 02 2007
parent reply Regan Heath <regan netmail.co.nz> writes:
Reiner Pope wrote:
 Regan Heath wrote:
 Kirk McDonald wrote:
 I submit the following Python idiom: Functions which mutate their 
 arguments should return nothing. That is:

 // Return new string
 string tolower(string);
 // Mutate argument
 void tolower(char[]);

 This rather strictly highlights the difference between char[] and 
 string, and makes it essentially impossible to mix up library 
 functions differentiated in this way.
That's all well and good until you want to write: char[] s; ... foo(tolower(s)); If you template tolower in the manner I described it's not possible to mix it up and call the wrong one anyway (as it selects the correct one based on the input) so it's a non-problem. Regan
I think the code you gave here would be misleading. For instance, suppose you had your templated overloads, effectively giving you // copy-on-write string tolower(string); // in-place char[] tolower(char[]); Then you would get surprising results if you did this: char[] s = "Hello World!".dup; int a = howManyLettersDiffer(tolower(s), s); assert (a == 2); // assert failed, a is actually 0 I think avoiding this is exactly why Kirk suggested his overloads.
Good point. Using Kirk's Python'esque function signatures you'd have: char[] s = "Hello World!".dup; tolower(s); int a = howManyLettersDiffer(s, s); assert (a == 2); // assert failed, a is actually 0 which is immediately and obviously a pointless operation, requring a re-code to: char[] s = "Hello World!".dup; char[] s2 = s.dup tolower(s2); int a = howManyLettersDiffer(s2, s); assert (a == 2); // assert failed, a is actually 0 And the 'string' case: string s = "Hello World!"; int a = howManyLettersDiffer(tolower(s), s); assert (a == 2); // assert failed, a is actually 0 which would already behave as desired. It seems that tolower is always going to have to 'copy on write' and return that copy. But, that doesn't stop you templating it so that it returns the copy as char[] when you pass char[] (which is all I really wanted in the first place) :) Unfortunately as we cannot overload on return type it would prevent an inplace tolower in the form (given by Kirk): void tolower(char[]) Instead perhaps we need a naming convention for inplace modifying functions, options: void <func>Inplace(<args>) void <func>IP(<args>) // "IP" stands for inplace void <func>M(<args>) // "M" stands for mutates or something like those. Regan
Aug 02 2007
next sibling parent reply Reiner Pope <some address.com> writes:
Regan Heath wrote:
 Reiner Pope wrote:
 Regan Heath wrote:
 Kirk McDonald wrote:
 I submit the following Python idiom: Functions which mutate their 
 arguments should return nothing. That is:

 // Return new string
 string tolower(string);
 // Mutate argument
 void tolower(char[]);

 This rather strictly highlights the difference between char[] and 
 string, and makes it essentially impossible to mix up library 
 functions differentiated in this way.
That's all well and good until you want to write: char[] s; ... foo(tolower(s)); If you template tolower in the manner I described it's not possible to mix it up and call the wrong one anyway (as it selects the correct one based on the input) so it's a non-problem. Regan
I think the code you gave here would be misleading. For instance, suppose you had your templated overloads, effectively giving you // copy-on-write string tolower(string); // in-place char[] tolower(char[]); Then you would get surprising results if you did this: char[] s = "Hello World!".dup; int a = howManyLettersDiffer(tolower(s), s); assert (a == 2); // assert failed, a is actually 0 I think avoiding this is exactly why Kirk suggested his overloads.
Good point. Using Kirk's Python'esque function signatures you'd have: char[] s = "Hello World!".dup; tolower(s); int a = howManyLettersDiffer(s, s); assert (a == 2); // assert failed, a is actually 0 which is immediately and obviously a pointless operation, requring a re-code to: char[] s = "Hello World!".dup; char[] s2 = s.dup tolower(s2); int a = howManyLettersDiffer(s2, s); assert (a == 2); // assert failed, a is actually 0 And the 'string' case: string s = "Hello World!"; int a = howManyLettersDiffer(tolower(s), s); assert (a == 2); // assert failed, a is actually 0 which would already behave as desired. It seems that tolower is always going to have to 'copy on write' and return that copy. But, that doesn't stop you templating it so that it returns the copy as char[] when you pass char[] (which is all I really wanted in the first place) :) Unfortunately as we cannot overload on return type it would prevent an inplace tolower in the form (given by Kirk): void tolower(char[]) Instead perhaps we need a naming convention for inplace modifying functions, options: void <func>Inplace(<args>) void <func>IP(<args>) // "IP" stands for inplace void <func>M(<args>) // "M" stands for mutates or something like those. Regan
A lot of functions just to do the string operation efficiently, though, isn't it? It's easy to see in cases like this why Walter cites the Pareto principle, saying mostly this isn't the bottleneck. But still, isn't D supposed to give you fast code _easily_ ? I must say, I *like* the simplicity of overloading with a mutable and a readonly version, although we have established that it can certainly lead to some confusion. Mind you, D arrays ignore that potential confusion (in the .sort and .reverse properties). I also had some fancy ideas for a CoW wrapper, which looks something like: struct CoW_Wrapper(T) { const(T) val; private bool isMutable; void opAssign(T t) { val = t; isMutable = true; } void opAssign(const(T) t) { val = t; isMutable = false; } T mutable() { if (!isMutable) { val = val.dup; isMutable = true; } return cast(T) val; } } This would make using CoW quite simple, and would also solve the problem of keeping track across function boundaries of whether it's been duped, for things like: string s = ...; string s2 = s.tolower().replace("foo", "bar").entab(); which could potentially involve 3 dups. (I'm aware of the easier solution of templating to support char[]->char[] and string->string overloads, but let's have some fun going the extra way :-) ) You would write something like tolower as (ignoring unicode stuff): alias CoW_Wrapper!(char[]) cow_string; cow_string tolower(cow_string s) { foreach (i; 0..s.val.length) { if (s.val[i] >= 'A' && s.val[i] <= 'Z') s.mutable[i] = s.val[i] + 'a' - 'A'; } return s; } The neat feature is that the cow_string returned knows whether it isMutable, so you will always have the fewest required dups: 0 or 1. Unfortunately, it doesn't currently look very nice because of the lack of opImplicitCast, which means you'd have to write: string s2 = s.tolower.val; instead of string s2 = s.tolower; (there's also problems with the type of s, but that can be solved with some template stuff on the tolower side; those template manipulations can also bypass the cow_string struct if the parameter type is char[], by defining char[] mutable(char[] s) { return s; } ) -- Reiner
Aug 02 2007
parent Regan Heath <regan netmail.co.nz> writes:
Reiner Pope wrote:
 I must say, I *like* the simplicity of overloading with a mutable and a 
 readonly version, although we have established that it can certainly 
 lead to some confusion. 
Provided it always performs a .dup when it modifies the input it should be fine. i.e. take the existing tolower in std.string and simply change the function signature to: T tolower(T)(T s) and make no other changes and your example: char[] s = "Hello World!".dup; int a = howManyLettersDiffer(tolower(s), s); assert (a == 2); // assert failed, a is actually 0 will work as expected.
 Mind you, D arrays ignore that potential
 confusion (in the .sort and .reverse properties).
True.
 I also had some fancy ideas for a CoW wrapper, which looks something like:
<snip> I wrote something like that once too. Regan
Aug 02 2007
prev sibling parent reply Daniel Keep <daniel.keep.lists gmail.com> writes:
Regan Heath wrote:
 Instead perhaps we need a naming convention for inplace modifying
 functions, options:
 
 void <func>Inplace(<args>)
 void <func>IP(<args>)       // "IP" stands for inplace
 void <func>M(<args>)        // "M" stands for mutates
 
 or something like those.
 
 Regan
I actually tried doing this in a math library of mine, but I didn't like using normal letters; I wanted to use a Unicode character that was easy to type, showed up in Vim, and would visually distinguish the functions. Sadly, all I could come up with[1] for "in place" was '¶', and the compiler didn't like that :( Sometimes, I wonder if it wouldn't help a lot to be able to define function overloads based on argument decoration. So, you could have something like: tolower(str); tolower(inplace str); tolower(takecopy str); There are so many times where you have several functions that do the exact same thing (semantically speaking), but have different implementations. Maybe I could fake it with structs... struct tolower { string opCall(string str); void inplace(char[] str); char[] takecopy(char[] str); char[] takecopy(string str); } Something to think about, anyway... -- Daniel [1] That's what you get if you type ^KIP in Vim. If you type ^Kip, you get 'ぴ', and ^KiP is 'ピ' :P
Aug 02 2007
parent Daniel Keep <daniel.keep.lists gmail.com> writes:
Daniel Keep wrote:
 struct tolower
 {
     string opCall(string str);
     void inplace(char[] str);
     char[] takecopy(char[] str);
     char[] takecopy(string str);
 }
Those members should, of course, be static. me.duh();
Aug 02 2007