digitalmars.D - Fixing std.string
- dsimcha (13/13) Aug 19 2010 As I mentioned buried deep in another thread, std.string is in serious n...
- Andrei Alexandrescu (25/38) Aug 19 2010 I don't know - my guess is that UTF-8 is widespread in English-speaking
- Russel Winder (17/19) Aug 19 2010 he
- Andrei Alexandrescu (14/24) Aug 20 2010 Hey Russell,
- Ezneh (2/2) Aug 20 2010 There's also this in std.string which requires a fix :
- Andrei Alexandrescu (8/11) Aug 20 2010 Sure. On the face of it, I think isNumeric is a silly function because
- bearophile (5/11) Aug 20 2010 Do you mean that such ugly replacement is meant to be used in user code,...
- Andrei Alexandrescu (4/13) Aug 20 2010 Put it in the body.
- bearophile (20/21) Aug 20 2010 A possible design is to use only one template function, like:
- Ezneh (7/15) Aug 20 2010 I found another way to improve the isNumeric function.
- Jonathan M Davis (10/21) Aug 20 2010 Oh, the immutability can definitely be a good thing. That's why string i...
- Michael Rynn (128/144) Aug 23 2010 The problems are combinatorial, because of encoding schemes.
- Jonathan M Davis (11/14) Aug 24 2010 A lot of functions in Phobos are templated on string type, so you don't ...
- Norbert Nemec (5/8) Aug 24 2010 Wouldn't it be sufficient to take const as input? IIRC, both mutable and...
- Simen kjaeraas (6/16) Aug 24 2010 What should the functions return, then? If the output is always
As I mentioned buried deep in another thread, std.string is in serious need of fixing, for two reasons: 1. Most of it doesn't work with UTF-16/UTF-32 strings. 2. Much of it requires the input to be immutable even when there's no good reason for this constraint. I'm trying to understand a few things before I dive into fixing it: 1. How did it get to be this way? Why did it seem like a good idea at the time to only support UTF-8 and only immutable strings? 2. Is there any "deep" design/technical issue that makes these hard to fix, or is it basically just lack of manpower and other priorities? 3. Is there any good reason to avoid just templating everything to work with all 9 string types (mutable/const/immutable char/wchar/dchar[]) or whatever subset is reasonable for the given function?
Aug 19 2010
On 08/19/2010 09:22 PM, dsimcha wrote:
> As I mentioned buried deep in another thread, std.string is in serious need of fixing, for two reasons: 1. Most of it doesn't work with UTF-16/UTF-32 strings. 2. Much of it requires the input to be immutable even when there's no good reason for this constraint.

Absolutely. Thanks for looking into this!

> I'm trying to understand a few things before I dive into fixing it: 1. How did it get to be this way? Why did it seem like a good idea at the time to only support UTF-8 and only immutable strings?

I don't know - my guess is that UTF-8 is widespread in English-speaking countries and this is one.

> 2. Is there any "deep" design/technical issue that makes these hard to fix, or is it basically just lack of manpower and other priorities?

The latter. I wanted to get to this for the longest time, and I think it's awesome that you're looking into it.

> 3. Is there any good reason to avoid just templating everything to work with all 9 string types (mutable/const/immutable char/wchar/dchar[]) or whatever subset is reasonable for the given function?

There's no reason. But I hope we'd go a step further: a) Aggressively make everything string-specific more general and move it into std.algorithm. b) After (a) ideally std.string should contain only a modicum of string-specific stuff such as case and whitespace information. I believe the functionality of the following functions could easily be generalized and moved to std.algorithm or std.range, perhaps consolidated with existing functionality and under a different name: cmp, indexOf, lastIndexOf, repeat, join, split, stripl, stripr, strip, chomp, chompPrefix, replace, replaceSlice, insert, count, maketrans, translate, squeeze, munch, succ, tr.
The other functions (or certain overloads of the above) stay put in std.string and should indeed be templated by input with the constraint if (isSomeString!Str) or, better yet, allow any input, forward, or bidirectional range (as the algorithm needs) constrained by if (isXxxRange!R && is(ElementType!R : dchar)). Thanks again for looking into this, it's important and rewarding work. Andrei
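For illustration only, here is a minimal sketch of what a range-based, constraint-guarded function of the kind described above might look like. The name countWhite is hypothetical (it is not a Phobos function); isInputRange, ElementType and isWhite are existing Phobos symbols. The point is that one template accepts mutable, const and immutable strings of any UTF width.

```d
import std.range.primitives : ElementType, empty, front, isInputRange, popFront;
import std.uni : isWhite;

// Hypothetical helper (not part of Phobos): counts whitespace code points
// in any input range whose elements convert to dchar.
size_t countWhite(R)(R r)
if (isInputRange!R && is(ElementType!R : dchar))
{
    size_t n = 0;
    for (; !r.empty; r.popFront())
    {
        if (isWhite(r.front))
            ++n;
    }
    return n;
}

unittest
{
    char[] mutableUtf8 = "a b c".dup;       // mutable UTF-8
    const(wchar)[] constUtf16 = "a b c"w;   // const UTF-16
    assert(countWhite(mutableUtf8) == 2);
    assert(countWhite(constUtf16) == 2);
    assert(countWhite("a b  c"d) == 3);     // immutable UTF-32
}
```

Because the constraint is on the range interface rather than on a concrete string type, the same body serves every one of the nine string types mentioned in the thread.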
Aug 19 2010
On Fri, 2010-08-20 at 02:22 +0000, dsimcha wrote:
[ . . . ]
> 1. How did it get to be this way? Why did it seem like a good idea at the time to only support UTF-8 and only immutable strings?

But isn't the thinking these days that immutable strings are a good thing? Immutability is generally a good thing for all parallel, and indeed concurrent, computations.

--
Russel.
Aug 19 2010
Russel Winder wrote:
> On Fri, 2010-08-20 at 02:22 +0000, dsimcha wrote: [ . . . ]
>> 1. How did it get to be this way? Why did it seem like a good idea at the time to only support UTF-8 and only immutable strings?
> But isn't the thinking these days that immutable strings are a good thing? Immutability is generally a good thing for all parallel, and indeed concurrent, computations.

Hey Russel, The idea is for the algorithms to impose as little as possible on their inputs. If you're searching for a character in a string you wouldn't care whether the string is mutable or not - the algorithm is the same. Currently many algorithms in std.string require (a) immutable and (b) UTF-8 strings as inputs. Either or both limitations should be relaxed as much as possible.

    char[] thisIsMutable = new char[100];
    wchar[] thisIsMutableW = new wchar[100];
    ...
    assert(indexOf(thisIsMutable, "abc") != -1); // should work
    assert(indexOf(thisIsMutableW, "abc") != -1); // should work
    assert(indexOf(thisIsMutableW, "abc"w) != -1); // even this should work

Andrei
Aug 20 2010
There's also this in std.string which requires a fix : http://d.puremagic.com/issues/show_bug.cgi?id=4673
Aug 20 2010
Ezneh wrote:
> There's also this in std.string which requires a fix : http://d.puremagic.com/issues/show_bug.cgi?id=4673

Sure. On the face of it, I think isNumeric is a silly function because the effort expended on doing a good prediction is almost the same as doing the actual conversion - so why not just try it. I guess

    return collectException(to!real(input)) is null;

should be a fine replacement for isNumeric.

Andrei
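A minimal sketch of the one-liner above wrapped in a function, for readers who want to try it. The name looksNumeric is hypothetical and chosen only to avoid clashing with std.string.isNumeric; to and collectException are the existing Phobos functions. Note that this accepts whatever to!real accepts, which is not necessarily the same set of inputs as the original isNumeric.

```d
import std.conv : to;
import std.exception : collectException;

// Hypothetical wrapper (name is an assumption, not a Phobos symbol).
// Taking const(char)[] lets mutable, const and immutable UTF-8 input through.
bool looksNumeric(const(char)[] input)
{
    // collectException returns the caught Exception, or null if nothing was thrown.
    return collectException(to!real(input)) is null;
}

unittest
{
    assert(looksNumeric("3.14"));
    assert(looksNumeric("-42"));
    assert(!looksNumeric("abc"));
    assert(!looksNumeric("12abc")); // to!real rejects trailing characters
}
```

Since the function only returns a bool, a const parameter is enough here; the return-type question raised later in the thread does not arise.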
Aug 20 2010
Andrei Alexandrescu:
> Sure. On the face of it, I think isNumeric is a silly function because the effort expended on doing a good prediction is almost the same as doing the actual conversion - so why not just try it.

The difference is that a well designed isNumeric doesn't need to use exceptions; this may make it faster (never forget that DMD exceptions are something like 12 times slower than Java-Sun ones). On the other hand I think isNumeric() is not used in situations where high performance is needed.

> I guess return collectException(to!real(input)) is null; should be a fine replacement for isNumeric.

Do you mean that such an ugly replacement is meant to be used in user code, or do you mean to replace the contents of the isNumeric() function with that code?

Bye, bearophile
Aug 20 2010
On 08/20/2010 07:21 AM, bearophile wrote:
> Andrei Alexandrescu:
>> Sure. On the face of it, I think isNumeric is a silly function because the effort expended on doing a good prediction is almost the same as doing the actual conversion - so why not just try it.
> The difference is that a well designed isNumeric doesn't need to use exceptions, this may make it faster (never forget that DMD exceptions are something like 12 times slower than Java-Sun ones). On the other hand I think isNumeric() is not used in situations where high performance is needed.

Good point.

>> I guess return collectException(to!real(input)) is null; should be a fine replacement for isNumeric.
> Do you mean that such ugly replacement is meant to be used in user code, or do you mean to replace the contents of the isNumeric() function with that code?

Put it in the body.

Andrei
Aug 20 2010
bearophile Wrote:
> The difference is that a well designed isNumeric doesn't need to use exceptions,

A possible design is to use only one template function, like:

    private auto _realConvert(bool useExceptions)(string txt) {
        ...
        if (some_error_condition) {
            static if (useExceptions)
                throw new ConversionError(...);
            else
                return false;
        }
        ...
        static if (useExceptions)
            return result;
        else
            return true;
    }

And then create isNumeric() and to!real() calling _realConvert!(false) and _realConvert!(true). (But maybe the simple implementation with collectException(to!real(input)) is enough because it's uncommon to use isNumeric where speed matters a lot).

Bye, bearophile
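To make the single-template idea concrete, here is a small runnable sketch. Everything in it is an assumption made for illustration: the names convertDecimal, isDecimal and toDecimal are hypothetical, and the grammar is deliberately tiny (optional sign, digits, at most one '.'), far simpler than what to!real or isNumeric accept. It only shows how one body can serve both a throwing conversion and a non-throwing check.

```d
import std.ascii : isDigit;
import std.exception : assertThrown;

// Hypothetical shared implementation. With useExceptions == true it returns
// the parsed value or throws; with false it returns a bool and never throws.
private auto convertDecimal(bool useExceptions)(const(char)[] txt)
{
    size_t i = 0;
    bool negative = false;
    if (i < txt.length && (txt[i] == '+' || txt[i] == '-'))
    {
        negative = (txt[i] == '-');
        ++i;
    }

    double mantissa = 0;  // all digits read so far, ignoring the '.'
    double divisor = 1;   // 10^(number of digits after the '.')
    size_t digits = 0;
    bool seenPoint = false;
    bool bad = false;
    for (; i < txt.length; ++i)
    {
        immutable c = txt[i];
        if (isDigit(c))
        {
            mantissa = mantissa * 10 + (c - '0');
            if (seenPoint)
                divisor *= 10;
            ++digits;
        }
        else if (c == '.' && !seenPoint)
            seenPoint = true;
        else
        {
            bad = true;
            break;
        }
    }

    if (bad || digits == 0)
    {
        static if (useExceptions)
            throw new Exception("not a number: " ~ txt.idup);
        else
            return false;
    }

    static if (useExceptions)
        return (negative ? -mantissa : mantissa) / divisor;
    else
        return true;
}

bool isDecimal(const(char)[] txt) { return convertDecimal!false(txt); }
double toDecimal(const(char)[] txt) { return convertDecimal!true(txt); }

unittest
{
    assert(isDecimal("2.5") && isDecimal("-42"));
    assert(!isDecimal("abc") && !isDecimal("1.2.3") && !isDecimal(""));
    assert(toDecimal("2.5") == 2.5);
    assert(toDecimal("-42") == -42);
    assertThrown(toDecimal("x"));
}
```

The static if branches are resolved at compile time, so the non-throwing instantiation contains no exception machinery at all, which is the performance argument made above.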
Aug 20 2010
Andrei Alexandrescu Wrote:
> I guess return collectException(to!real(input)) is null; should be a fine replacement for isNumeric. Andrei

I found another way to improve the isNumeric function. I'm doing it with an (ugly) regular expression; it works very well, but maybe we could (should!) improve it a bit more. See the attached file for how I did it. There are some asserts and "old tests". I have probably forgotten something, but this can be a new base for the isNumeric function. Give me any feedback and tell me what you think about it.
Aug 20 2010
On Thursday 19 August 2010 23:27:33 Russel Winder wrote:
> On Fri, 2010-08-20 at 02:22 +0000, dsimcha wrote: [ . . . ]
>> 1. How did it get to be this way? Why did it seem like a good idea at the time to only support UTF-8 and only immutable strings?
> But isn't the thinking these days that immutable strings are a good thing? Immutability is generally a good thing for all parallel, and indeed concurrent, computations.

Oh, the immutability can definitely be a good thing. That's why string is immutable(char)[]. However, forcing people to use string instead of the other possible string types is unnecessarily restrictive. There are cases where you can't use immutable stuff or where it's inefficient to do so. By making std.string handle all of the various string types as much as possible, it makes it much more flexible. But since string, wstring, and dstring are all immutable, most string processing will likely be on immutable. It's just that you won't be forced to do it that way if you want to take advantage of std.string.

- Jonathan M Davis
Aug 20 2010
On Fri, 20 Aug 2010 02:22:56 +0000, dsimcha wrote:
> As I mentioned buried deep in another thread, std.string is in serious need of fixing, for two reasons: 1. Most of it doesn't work with UTF-16/UTF-32 strings. 2. Much of it requires the input to be immutable even when there's no good reason for this constraint. I'm trying to understand a few things before I dive into fixing it: 1. How did it get to be this way? Why did it seem like a good idea at the time to only support UTF-8 and only immutable strings? 2. Is there any "deep" design/technical issue that makes these hard to fix, or is it basically just lack of manpower and other priorities?

The problems are combinatorial, because of encoding schemes. I imagine that when someone wants a function that is missing from std.string, they might write one, and might even add to it. I also found std.utf to not contain exactly what I needed. The functions toUTF16, toUTF8, have signatures like wstring toUTF16(const(dchar)[] s). But when hacking a class I found I wanted functions that would almost have the very same innards, but could also append to mutable character arrays of any sort.

    // Does almost the same as toUTF16, but creates or appends a mutable array.
    void append_UTF16m(ref wchar[] r, const(dchar)[] s) {...}

At the expense of another nested function call, which I imagine most people would not want to pay, toUTF16 becomes a call to append_UTF16m.

    wstring toUTF16(const(dchar)[] s)
    {
        wchar[] temp = null;
        append_UTF16m(temp, s);
        return assumeUnique(temp);
    }

But isNumeric for me required a parsing function, when I was religiously trying to use ranges, and know what sort of conversion function to call afterwards. I know it's really simple-minded, but it did the required job.

    enum NumberClass { NUM_ERROR = -1, NUM_EMPTY, NUM_INTEGER, NUM_REAL }

    /// R is an input range, P is an output range (put).
    /// Return a NumberClass value.
    /// Collect characters in P for later processing.
    /// Does no NAN or INF, only checks for error, empty, integer, or real.
    /// E or e might be an exponent, or just the end of a number.
    NumberClass getNumberString(R, P)(R ipt, P opt, int recurse = 0)
    {
        int digitct = 0;
        bool done = ipt.empty;
        bool decPoint = false;
        for (;;)
        {
            if (ipt.empty)
                break;
            auto test = ipt.front;
            ipt.popFront;
            switch (test)
            {
            case '-':
            case '+':
                if (digitct > 0)
                {
                    done = true;
                }
                break;
            case '.':
                if (!decPoint)
                    decPoint = true;
                else
                    done = true;
                break;
            default:
                if (!isdigit(test))
                {
                    done = true;
                    if (test == 'e' || test == 'E')
                    {
                        // Ambiguous end of number, or exponent?
                        if (recurse == 0)
                        {
                            opt.put(test);
                            if (getNumberString(ipt, opt, recurse + 1) == NumberClass.NUM_INTEGER)
                                return NumberClass.NUM_REAL;
                            else
                                return NumberClass.NUM_ERROR;
                        }
                        // assume end of number
                    }
                }
                else
                    digitct++;
                break;
            }
            if (done)
                break;
            opt.put(test);
        }
        if (digitct == 0)
            return NumberClass.NUM_EMPTY;
        if (decPoint)
            return NumberClass.NUM_REAL;
        return NumberClass.NUM_INTEGER;
    }

A string class. http://dsource.org/projects/xmlp/trunk/alt/ustring.d The component structures maintain a terminating null character and pretend it is not there. It seemed a good idea at the time when I was doing a lot of Windows API calls which expected null-terminated C-strings of char or wchar. The UString class does conversions on accessing cstr(), wstr() or dstr(), on the assumption that last used will be most frequent, and ideally caches a decent hash value. I only have some limited uses of UString so far, because character arrays are so powerful.

    struct cstext { char[] str_ = null; ... }
    struct wstext { wchar[] str_ = null; ... }
    struct dstext { dchar[] str_ = null; ... }

    class UString {
        private {
            union {
                vstruc vstr; // not fully supported?
                cstext cstr;
                wstext wstr;
                dstext dstr;
            }
            UStringType ztype;
            hash_t hash_;
        }
        ...
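The append-then-wrap pattern described earlier in this post (an appending function, with the immutable-returning conversion built on top of it) can be sketched as follows. This is not the poster's code, whose body was elided; the names appendUTF16 and toUTF16Sketch are hypothetical, and the sketch assumes the existing Phobos functions std.utf.encode (which appends the encoding of one code point to an array) and std.exception.assumeUnique.

```d
import std.exception : assumeUnique;
import std.utf : encode;

// Hypothetical appending variant: transcodes UTF-32 input and appends the
// resulting UTF-16 code units to an existing mutable buffer.
void appendUTF16(ref wchar[] r, const(dchar)[] s)
{
    foreach (dchar c; s)
        encode(r, c); // appends one or two UTF-16 code units for c
}

// The immutable-returning conversion is then a thin wrapper.
wstring toUTF16Sketch(const(dchar)[] s)
{
    wchar[] temp;
    appendUTF16(temp, s);
    // Safe here because temp is a fresh local buffer with no other references.
    return assumeUnique(temp);
}

unittest
{
    assert(toUTF16Sketch("abc"d) == "abc"w);

    wchar[] buf = "x"w.dup;
    appendUTF16(buf, "\U0001F600"d); // outside the BMP: needs a surrogate pair
    assert(buf.length == 3);
}
```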
Aug 23 2010
On Monday 23 August 2010 23:16:25 Michael Rynn wrote:
> The problems are combinatorial, because of encoding schemes. I imagine that when someone wants a function that is missing from std.string, they might write one, and might even add to it.

A lot of functions in Phobos are templated on string type, so you don't have to define multiple versions of them. Very few, if any, are actually defined for multiple string types. Now, because each template instantiation results in another version of the function in the resulting binary, if you try and use all of the functions with all of the string types, then you do get combinatorial problems. But thanks to the templates, you don't have to worry about it directly, and it's not like it's going to be a typical use case for most string functions to be used by multiple string types in the same program. It will happen, but not enough to generally be an issue.

- Jonathan M Davis
Aug 24 2010
On 20/08/10 03:22, dsimcha wrote:
> 3. Is there any good reason to avoid just templating everything to work with all 9 string types (mutable/const/immutable char/wchar/dchar[]) or whatever subset is reasonable for the given function?

Wouldn't it be sufficient to take const as input? IIRC, both mutable and immutable can be implicitly converted to const and this is exactly the purpose that const is designed for: data that I can't change but that other code may be able to change. Or am I mixing something up here?
Aug 24 2010
Norbert Nemec <Norbert nemec-online.de> wrote:
> On 20/08/10 03:22, dsimcha wrote:
>> 3. Is there any good reason to avoid just templating everything to work with all 9 string types (mutable/const/immutable char/wchar/dchar[]) or whatever subset is reasonable for the given function?
> Wouldn't it be sufficient to take const as input? IIRC, both mutable and immutable can be implicitly converted to const and this is exactly the purpose that const is designed for: data that I can't change but that other code may be able to change. Or am I mixing something up here?

What should the functions return, then? If the output is always const(char)[], I need to cast it to make it immutable(char)[] or char[], and casting is an unsafe operation.

--
Simen
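The return-type problem raised here can be shown with a small sketch. Both function names are hypothetical and exist only for this illustration: firstWordConst takes const input as suggested, while firstWord is templated so that the result has the same type as the argument.

```d
// Hypothetical non-template version: accepts any UTF-8 array,
// but can only hand back a const slice.
const(char)[] firstWordConst(const(char)[] s)
{
    foreach (i, c; s)
        if (c == ' ')
            return s[0 .. i];
    return s;
}

// Hypothetical template version: the return type follows the argument type.
S firstWord(S)(S s)
if (is(S : const(char)[]))
{
    foreach (i, c; s)
        if (c == ' ')
            return s[0 .. i];
    return s;
}

unittest
{
    string imm = "hello world";
    char[] mut = imm.dup;

    // The const version compiles for both inputs, but its result is const(char)[],
    // which does not convert back to string without a cast.
    static assert(is(typeof(firstWordConst(imm)) == const(char)[]));
    static assert(!__traits(compiles, { string r = firstWordConst(imm); }));

    // The template version preserves mutability in both directions.
    string a = firstWord(imm);
    char[] b = firstWord(mut);
    assert(a == "hello" && b == "hello");
}
```

For functions that return only a bool or an index, const input is sufficient; the difference matters for functions that return a slice of their input.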
Aug 24 2010