www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Re: Making all strings UTF ranges has some risk of WTF

reply bearophile <bearophileHUGS lycos.com> writes:
I'd like D to give the right results and to be not bug-prone by default (even
if it's a bit slower) and do something more optimized but maybe faster when I
want it and I know what I am doing. So I like the idea of making strings more
correct.

I liked the idea of D strings acting like normal arrays, but they are different
data structures, so it's better to accept them as something different, even if
similar (especially UTF32 ones). So a specific struct, named like String, can
be used to represent a string. String literals too return such struct.

In C language the strlen() is O(n), so people can learn that finding the length
of most strings is O(n) in D too (UTF32 strings today are not common. If D gets
success in a future the UTF32 strings can become the only used strings. Such
things are happened many times in computer history). The String structs can
cache inside themselves the length and hash value the first time such values
are computed (so each String is a struct 4 words long).

Time ago you (Andrei) told me that you don't like a function like len() to have
a different computational complexity, like O(n) or O(1), according to the
input. But a simple possibility is for "foo".length property to return the
length of the walk on all chars, that is O(n) the first time you call it (it's
O(1) on UTF32 strings). Some algorithms or some parts of code (and string
literals, etc.) can know what's the length of the string they return or use, so
they can put this this length value inside the String cached value and there's
never a need to compute it.

The struct String that represents the string supports the [] operator too
(there can be a MutableString too, that is not used often), but it usually
walks the string to find the right character to return. Then a method like
"foo".ubyte(i) can be used to return the i-th ubyte of the string. So initially
this [] is done with just a O(n) walk, later some performance optimization can
be added, like a simple skip tree based on indexes.

Bye,
bearophile
Feb 04 2010
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
 The String structs can cache inside themselves the length and hash value the
first time such
 values are computed (so each String is a struct 4 words long).

So it's not really immutable. It contains an immutable dynamic array that can be called "str", plus two mutable words. To access the underlying 8/16/32 bit units it's composed you can use the [] of the str attribute: "foo".str[i] Bye, bearophile
Feb 04 2010
prev sibling parent =?UTF-8?B?QWxpIMOHZWhyZWxp?= <acehreli yahoo.com> writes:
bearophile wrote:
 Time ago you (Andrei) told me that you don't like a function like 

according to the input. This reminds me of an excellent talk by Matt Austern on STL's singly-linked lists. One of the interesting points of the design was around the length of the singly-linked list. In the end, he decides not to provide one. I think his point was that getting the length of the data structure was not one of the main operations of a singly list. I agree that the users who really need length can wrap it in a struct that stores the length. Although there doesn't seem to be a slide dedicated to that point, the presentation is here: http://www.accu-usa.org/Slides/SinglyLinkedLists.ppt Ali
Feb 04 2010