digitalmars.D - Why string alias is invariant ?
- Sergey Gromov (4/4) Jan 30 2008 Really, why ? You can't assign a dynamically created data to a variable...
- Sean Kelly (9/10) Jan 30 2008 I think the hope is that 'string' will place us in a position to get
- Sergey Gromov (9/19) Jan 30 2008 I don't quite get it. If 'const' is "for documenting/enforcing API
- Christian Kamm (8/11) Jan 30 2008 You need to duplicate your mutable strings:
- Janice Caron (24/27) Jan 31 2008 Correct, and this is a good default. There are, of course, times when
- Sean Kelly (19/30) Jan 31 2008 I'd say that it is an improvement in safety rather than efficiency
- Janice Caron (7/8) Jan 31 2008 The problem there is the idup. Replace it with
- Don Clugston (2/13) Jan 31 2008 Seems that assumeUnique is going to be so common, it deserves a shorter ...
- Janice Caron (19/20) Jan 31 2008 Oddly enough,
- Sean Kelly (3/14) Jan 31 2008 It does? I didn't think tolower would modify a string in place.
- Janice Caron (8/10) Jan 31 2008 OK - /one/ of the duplications goes away.
- Sean Kelly (3/17) Jan 31 2008 Ah, good point.
- Frits van Bommel (15/22) Jan 31 2008 I'm not sure if that's true[2].
- Janice Caron (8/14) Jan 31 2008 I know. I would have mentioned that, but I didn't want to needlessly
- Janice Caron (16/18) Feb 01 2008 I'm obviously telling you stuff you already know - I apologise. I
- Sergey Gromov (5/36) Jan 31 2008 If I were doing this in C by loading an entire file in memory, I would
- Janice Caron (9/12) Jan 31 2008 It does.
- Sean Kelly (3/37) Jan 31 2008 The D language does. It simply isn't a feature provided by Phobos.
- Sergey Gromov (21/33) Jan 31 2008 Yes, this works for string tokenizers for instance. But doesn't quite
- Janice Caron (22/25) Jan 31 2008 Yes, D has a problem here, in that it has no way to express
- Steven Schveighoffer (4/17) Jan 31 2008 I actually filed an enhancement request for that.
- =?ISO-8859-1?Q?Jari-Matti_M=e4kel=e4?= (11/42) Jan 31 2008 That's not true. Linear types and their derivatives solve this problem
Really, why ? You can't assign a dynamically created data to a variable of type string. You can't pass a dynamically created data into a function receiving string ! If you wanted efficiency, if you wanted to guarantee that anybody receiving string can receive a char literal without duplication,---make it const then. Invariant casts to const(char)[] implicitly, so as char[]. And even then, I'd prefer plain old char[] for string and foo(in string s) for a function that guarantees not to change it. Right now I'm having trouble calling std.file.listdir() just because it receives string. P.S. Many thought mus be put into choosing a return type for a library function. Because if it returns a unique copy of data it must be char[] so that i'm free to modify it. The const(char)[] or const(char[]) must only be used if the function is not sure if the returned result is unique. SnakE
Jan 30 2008
Sergey Gromov wrote:Really, why ? You can't assign a dynamically created data to a variable of type string. You can't pass a dynamically created data into a function receiving string ! If you wanted efficiency, if you wanted to guarantee that anybody receiving string can receive a char literal without duplication,---make it const then. Invariant casts to const(char)[] implicitly, so as char[]. And even then, I'd prefer plain old char[] for string and foo(in string s) for a function that guarantees not to change it.I think the hope is that 'string' will place us in a position to get functional programming-type optimizations "for free" once the language and compiler are in a position to provide them. Such optimizations typically aren't possible with "const char[]" because the compiler must assume that the string data could change unexpectedly. In essence, I believe that 'invariant' is intended for optimization while 'const' is intended for documenting/enforcing API contracts. Sean
Jan 30 2008
Sean Kelly Wrote:Sergey Gromov wrote:I don't quite get it. If 'const' is "for documenting/enforcing API contracts", then "string[] listdir(string)" is an API bug. If it were "char[][] listdir(in char[])", then the compiler wouldn't be able to "get functional programming-type optimizations" within the 'listdir()'. Overall, the use of invariant 'string' is limited to unmodified string literals and their substrings, because for other data the behaviour is undefined, by specification. SnakEReally, why ? You can't assign a dynamically created data to a variable of type string. You can't pass a dynamically created data into a function receiving string ! If you wanted efficiency, if you wanted to guarantee that anybody receiving string can receive a char literal without duplication,---make it const then. Invariant casts to const(char)[] implicitly, so as char[]. And even then, I'd prefer plain old char[] for string and foo(in string s) for a function that guarantees not to change it.I think the hope is that 'string' will place us in a position to get functional programming-type optimizations "for free" once the language and compiler are in a position to provide them. Such optimizations typically aren't possible with "const char[]" because the compiler must assume that the string data could change unexpectedly. In essence, I believe that 'invariant' is intended for optimization while 'const' is intended for documenting/enforcing API contracts.
Jan 30 2008
Overall, the use of invariant 'string' is limited to unmodified string literals and their substrings, because for other data the behaviour is undefined, by specification.You need to duplicate your mutable strings: char[] mutablestring = "hello".dup(); string invariantstring = mutablestring.idup(); Where I think the second line is a shorthand for string invariantstring = cast(string) mutablestring.dup(); and the cast to invariant is valid since you guarantee to give no one else access to the newly dupped data. Christian Kamm
Jan 30 2008
On Jan 30, 2008 11:03 PM, Sergey Gromov <snake.scaly gmail.com> wrote:You can't assign a dynamically created data to a variable of type string.Correct, and this is a good default. There are, of course, times when the default is not what you want. Others have suggested "idup" as the solution, but that means an extra copying pass. Forturnately, however, there is a solution which does /exactly/ what you want, with no superflous copying. It is this: import std.contracts; char[] buffer = whatever; string s = assumeUnique(buffer);You can't pass a dynamically created data into a function receiving string!Again, correct. This is also a good thing. It allows copy-on-write to work correctly. For example, consider a function which lowercases a string. Since string is invariant, that means that when lowercasing something that is /already/ lowercase, the function is free to return the original string. If string were merely const (as opposed to invariant), it would have to make a copy every time. Again, assumeUnique is your friend. (But don't lie to the compiler or things will go badly wrong!)P.S. Many thought mus be put into choosing a return type for a library function. Because if it returns a unique copy of data it must be char[] so that i'm free to modify it.Well, consider again the example of lowercasing a string to see why that is not so. If I return the original string (not a copy of it), then you are /not/ free to modify it, because there might be other pointers to that data. So you must first copy it (using dup) and then you can modify the copy. This means that the copy need be done /only when it is required/, instead of every single time you call the function - so yes, it is an improvement in efficiency.
Jan 31 2008
Janice Caron wrote:On Jan 30, 2008 11:03 PM, Sergey Gromov <snake.scaly gmail.com> wrote:I'd say that it is an improvement in safety rather than efficiency because the model assumes the string may be shared and thus enforces copy on write. But consider something like this: char[] data = cast(char[]) read( "myfile.txt" ); char[][] lines = splitlines( data ); foreach( line; lines ) { writefln( tolower( line.idup ) ); } In this routine, the programmer knows he is the sole owner of the data and simply wants to print the contents of a file in lower case line-by-line. And to do so the contents of data must be duplicated, which causes GC churn and may slow the app considerably. I suppose one possible fix for this sample app would be to convert the entire file to lowercase in one shot, thus incurring only one copy, but in a real app it may not be apparent or even possible to modify the algorithm in such a way. SeanP.S. Many thought mus be put into choosing a return type for a library function. Because if it returns a unique copy of data it must be char[] so that i'm free to modify it.Well, consider again the example of lowercasing a string to see why that is not so. If I return the original string (not a copy of it), then you are /not/ free to modify it, because there might be other pointers to that data. So you must first copy it (using dup) and then you can modify the copy. This means that the copy need be done /only when it is required/, instead of every single time you call the function - so yes, it is an improvement in efficiency.
Jan 31 2008
On 1/31/08, Sean Kelly <sean f4.ca> wrote:And to do so the contents of data must be duplicated,The problem there is the idup. Replace it with foreach( line; lines ) { writefln( tolower( assumeUnique(line)) ); } and the duplication goes away.
Jan 31 2008
Janice Caron wrote:On 1/31/08, Sean Kelly <sean f4.ca> wrote:Seems that assumeUnique is going to be so common, it deserves a shorter name.And to do so the contents of data must be duplicated,The problem there is the idup. Replace it with foreach( line; lines ) { writefln( tolower( assumeUnique(line)) ); } and the duplication goes away.
Jan 31 2008
On 1/31/08, Don Clugston <dac nospam.com.au> wrote:Seems that assumeUnique is going to be so common, it deserves a shorter name.Oddly enough, cast(string)s is shorter than assumeUnique(s) and achieves the same effect. :) I think someone somewhere must have decided that explicit casts are to be avoided. That said, assumeUnique works for things that aren't strings, too. It's basically equivalent to cast(invariant). assumeUnique does have a side effect though, which is that its parameter must be an lvalue, which assumeUnique nulls. So you couldn't do string data = assumeUnique( read( "myfile.txt" )); even if you wanted to. The "officially correct" way would be char[] temp = cast(char[]) read( "myfile.txt" ); string data = assumeUnique(temp); /* temp is now null */ But that's way too much typing for me, so I'd just write: string data = cast(string) read("myfile.txt");
Jan 31 2008
Janice Caron wrote:On 1/31/08, Sean Kelly <sean f4.ca> wrote:It does? I didn't think tolower would modify a string in place. SeanAnd to do so the contents of data must be duplicated,The problem there is the idup. Replace it with foreach( line; lines ) { writefln( tolower( assumeUnique(line)) ); } and the duplication goes away.
Jan 31 2008
On 1/31/08, Sean Kelly <sean f4.ca> wrote:OK - /one/ of the duplications goes away. It's not possible, even in principle, to lowercase a char[] in place, because a char[] by definition is an array of UTF-8 code units, /not/ an array of characters. Lowercasing a character may result in the length of its UTF-8 sequence changing. If the length increases, you're screwed. You can lowercase a dchar[] in place, but not a char[].and the duplication goes away.It does? I didn't think tolower would modify a string in place.
Jan 31 2008
Janice Caron wrote:On 1/31/08, Sean Kelly <sean f4.ca> wrote:Ah, good point. SeanOK - /one/ of the duplications goes away. It's not possible, even in principle, to lowercase a char[] in place, because a char[] by definition is an array of UTF-8 code units, /not/ an array of characters. Lowercasing a character may result in the length of its UTF-8 sequence changing. If the length increases, you're screwed. You can lowercase a dchar[] in place, but not a char[].and the duplication goes away.It does? I didn't think tolower would modify a string in place.
Jan 31 2008
Janice Caron wrote:It's not possible, even in principle, to lowercase a char[] in place, because a char[] by definition is an array of UTF-8 code units, /not/ an array of characters. Lowercasing a character may result in the length of its UTF-8 sequence changing. If the length increases, you're screwed. You can lowercase a dchar[] in place, but not a char[].I'm not sure if that's true[2]. However, I *am* sure it's *not* true for uppercasing. Some code points expand to 2 or 3 codepoints when uppercased. One common case is U+00DF "ß", LATIN SMALL LETTER SHARP S, which expands to "SS" (two characters) when uppercased[1]. Another example from the Unicode standard, U+0390, GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS apparently expands to three codepoints. [1] Interestingly though, the UTF-8 (aka char[]) representation is the same length :P. [2] The relevant section[3] of the Unicode standard says "Case mappings may produce strings of different lengths than the original." but proceeds to only give examples for uppercasing. [3] Section 5.18, see http://www.unicode.org/versions/Unicode5.0.0/ch05.pdf#G21180
Jan 31 2008
On Feb 1, 2008 12:03 AM, Frits van Bommel <fvbommel remwovexcapss.nl> wrote:Some code points expand to 2 or 3 codepoints when uppercased. One common case is U+00DF "ß", LATIN SMALL LETTER SHARP S, which expands to "SS" (two characters) when uppercased[1]. Another example from the Unicode standard, U+0390, GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS apparently expands to three codepoints.I know. I would have mentioned that, but I didn't want to needlessly complicate the issue. But Unicode makes a distinction between "simple casing" and "full casing". What you're talking about is full casing. In simple casing, one character maps to one character. So you could uppercase U+00DF (to itself) using simple-casing. When using full casing, as you quite rightly point out, one can only case-change strings, not characters.
Jan 31 2008
On Feb 1, 2008 7:50 AM, Janice Caron <caron800 googlemail.com> wrote:So you could uppercase U+00DF (to itself) using simple-casing.I'm obviously telling you stuff you already know - I apologise. I would imagine that normalisation forms probably also complicate full casing. One interesting thing is that simple casing also works just fine for wchar (that is, UTF-16). That's because every letter of every living language will be found in the Basic Multilingual Plane (the range U+0000 to U+FFFF). Codepoints outside this range are either symbols, or letters of dead languages. (Or combining characters, etc.), so it's probably safe to leave all non-BMP codepoints unchanged when case-changing. Codepoints in the BMP occupy a single UTF-16 code unit. In many languages (e.g. Chinese), a UTF-16 string is likely to be shorter than the corresponding UTF-8 string. This makes me suspect that UTF-16 may well be the ideal choice for string representation in the real world. (It's what Java went with). Maybe UTF-16, not UTF-8, should be the default kind of string?
Feb 01 2008
Sean Kelly Wrote:Janice Caron wrote:If I were doing this in C by loading an entire file in memory, I would lowercase it in-place which is both faster and takes less memory. D should support such efficient solutions. SnakEOn Jan 30, 2008 11:03 PM, Sergey Gromov <snake.scaly gmail.com> wrote:I'd say that it is an improvement in safety rather than efficiency because the model assumes the string may be shared and thus enforces copy on write. But consider something like this: char[] data = cast(char[]) read( "myfile.txt" ); char[][] lines = splitlines( data ); foreach( line; lines ) { writefln( tolower( line.idup ) ); } In this routine, the programmer knows he is the sole owner of the data and simply wants to print the contents of a file in lower case line-by-line. And to do so the contents of data must be duplicated, which causes GC churn and may slow the app considerably.P.S. Many thought mus be put into choosing a return type for a library function. Because if it returns a unique copy of data it must be char[] so that i'm free to modify it.Well, consider again the example of lowercasing a string to see why that is not so. If I return the original string (not a copy of it), then you are /not/ free to modify it, because there might be other pointers to that data. So you must first copy it (using dup) and then This means that the copy need be done /only when it is required/, instead of every single time you call the function - so yes, it is an improvement in efficiency.
Jan 31 2008
On 1/31/08, Sergey Gromov <snake.scaly gmail.com> wrote:If I were doing this in C by loading an entire file in memory, I would lowercase it in-place which is both faster and takes less memory. D should support such efficient solutions.It does. char[] data = cast(char[]) read( "myfile.txt" ); inPlaceLower(data); Now all you have to do is write the function inPlaceLower(). :) Presumably, what you really mean is that functions like inPlaceLower() should be in Phobos. I won't argue with that, because I agree with you. But the D language does not prevent you from modifying in place - it only prevents you from modifying invariant data in place.
Jan 31 2008
Sergey Gromov wrote:Sean Kelly Wrote:The D language does. It simply isn't a feature provided by Phobos. SeanJanice Caron wrote:If I were doing this in C by loading an entire file in memory, I would lowercase it in-place which is both faster and takes less memory. D should support such efficient solutions.On Jan 30, 2008 11:03 PM, Sergey Gromov <snake.scaly gmail.com> wrote:I'd say that it is an improvement in safety rather than efficiency because the model assumes the string may be shared and thus enforces copy on write. But consider something like this: char[] data = cast(char[]) read( "myfile.txt" ); char[][] lines = splitlines( data ); foreach( line; lines ) { writefln( tolower( line.idup ) ); } In this routine, the programmer knows he is the sole owner of the data and simply wants to print the contents of a file in lower case line-by-line. And to do so the contents of data must be duplicated, which causes GC churn and may slow the app considerably.P.S. Many thought mus be put into choosing a return type for a library function. Because if it returns a unique copy of data it must be char[] so that i'm free to modify it.Well, consider again the example of lowercasing a string to see why that is not so. If I return the original string (not a copy of it), then you are /not/ free to modify it, because there might be other pointers to that data. So you must first copy it (using dup) and then This means that the copy need be done /only when it is required/, instead of every single time you call the function - so yes, it is an improvement in efficiency.
Jan 31 2008
Well thanks for explanation, all this really makes sense. Though there are many benefits and drawbacks for different approaches which I can't keep track of. Janice Caron Wrote:On Jan 30, 2008 11:03 PM, Sergey Gromov <snake.scaly gmail.com> wrote:Yes, this works for string tokenizers for instance. But doesn't quite for lowercasing. The already lowercase case is generally rare and requires an additional pass to check, so you only benefit from this if you optimize for memory. And if you optimize for speed, you'd want lowercasing in-place because you're obviously in the middle of processing data, and the chances that this data is immutable are little. Even if so, there's always .dup for you. And of course there's no reason in having 'string' parameters in functions like read(), listdir() etc. because their result never contains parts of the arguments.You can't pass a dynamically created data into a function receiving string!Again, correct. This is also a good thing. It allows copy-on-write to work correctly. For example, consider a function which lowercases a string. Since string is invariant, that means that when lowercasing something that is /already/ lowercase, the function is free to return the original string.But I'm talking about always returning a unique copy of data. What about that listdir() ? The returned names are obviously unique because they're received from file system. If you make them invariant, and I want to uppercase them, I can't do that in-place because I'm not sure if they're actually unique. If you make them mutable but I need them as is, I still cannot count on them not to change because who knows where this data is also used. Maybe a contract should be added for standard library that if a function returns a mutable array, it guarantees that this array can be modified without side effects and therefore can be safely assumed unique. SnakEP.S. Many thought mus be put into choosing a return type for a library function. Because if it returns a unique copy of data it must be char[] so that i'm free to modify it.Well, consider again the example of lowercasing a string to see why that is not so.
Jan 31 2008
On 1/31/08, Sergey Gromov <snake.scaly gmail.com> wrote:Maybe a contract should be added for standard library that if a function returns a mutable array, it guarantees that this array can be modified without side effects and therefore can be safely assumed unique.Yes, D has a problem here, in that it has no way to express uniqueness. (That's true of other languages too, of course). If a string is immutable, then it doesn't /have/ to be unique (but if it is, it's safe to cast it to invariant), but if a string is mutable then it /must/ be. That's a problem, because one simply cannot declare unique(char)[] array; (...although it would be interesting to try...) So D just makes it your problem. One problem area where this sort of thing arises is concatenation. If you concatenate char arrays, then /regardless/ of the constancy of the inputs, the result is guaranteed to be unique (because, by specification, concatenation always makes a copy). If we had a "unique" type-constructor, one could make the result type unique(T)[]. But we don't, so D makes the (arbitrary) choice of invariant(T)[]. Also, it is currently a syntax error to concatenate a char[] with an invariant(char)[], even though both will implicitly cast to const(char)[]. So it's not a perfect system, but it is better than what we had before, and (I would argue) better than C++. But remember that D2.x is experimental. We're here to play with it. If we don't like it, things could change. Our experiences are important here in shaping the future.
Jan 31 2008
"Janice Caron" wroteOn 1/31/08, Sergey Gromov <snake.scaly gmail.com> wrote:I actually filed an enhancement request for that. http://d.puremagic.com/issues/show_bug.cgi?id=1654 -SteveMaybe a contract should be added for standard library that if a function returns a mutable array, it guarantees that this array can be modified without side effects and therefore can be safely assumed unique.One problem area where this sort of thing arises is concatenation. If you concatenate char arrays, then /regardless/ of the constancy of the inputs, the result is guaranteed to be unique (because, by specification, concatenation always makes a copy). If we had a "unique" type-constructor, one could make the result type unique(T)[]. But we don't, so D makes the (arbitrary) choice of invariant(T)[]. Also, it is currently a syntax error to concatenate a char[] with an invariant(char)[], even though both will implicitly cast to const(char)[].
Jan 31 2008
Janice Caron Wrote:On 1/31/08, Sergey Gromov <snake.scaly gmail.com> wrote:That's not true. Linear types and their derivatives solve this problem elegantly and have been known at least since 1990. They have interesting properties. The data can be simultaneously "constant" (referentially transparent) and unique, and still destructive updates are possible in situ. The Clean language is an example of this, it has uniqueness types[1]. But I'm afraid the current implementations of D aren't really well suited for these kinds of "advanced" type systems in the near future... [1] http://www.cs.ru.nl/~clean/download/papers/1996/bare96-uniclosed.pdfMaybe a contract should be added for standard library that if a function returns a mutable array, it guarantees that this array can be modified without side effects and therefore can be safely assumed unique.Yes, D has a problem here, in that it has no way to express uniqueness. (That's true of other languages too, of course).If a string is immutable, then it doesn't /have/ to be unique (but if it is, it's safe to cast it to invariant), but if a string is mutable then it /must/ be. That's a problem, because one simply cannot declare unique(char)[] array; (...although it would be interesting to try...) So D just makes it your problem. One problem area where this sort of thing arises is concatenation. If you concatenate char arrays, then /regardless/ of the constancy of the inputs, the result is guaranteed to be unique (because, by specification, concatenation always makes a copy). If we had a "unique" type-constructor, one could make the result type unique(T)[]. But we don't, so D makes the (arbitrary) choice of invariant(T)[]. Also, it is currently a syntax error to concatenate a char[] with an invariant(char)[], even though both will implicitly cast to const(char)[]. So it's not a perfect system, but it is better than what we had before, and (I would argue) better than C++. But remember that D2.x is experimental. We're here to play with it. If we don't like it, things could change. Our experiences are important here in shaping the future.
Jan 31 2008