digitalmars.D - std.stringbuffer

Janice Caron (89/89) Apr 29 2008 Hi all,

Jarrett Billingsley (3/9) Apr 29 2008 Might I ask why a StringBuffer class would be necessary?

Janice Caron (2/3) Apr 29 2008 Walter has vetoed that one, so it's moot now. :-)

Bruno Medeiros (12/21) Apr 29 2008 I'm with Jarret here, why the hell do we need a StringBuffer class?

Me Here (49/63) Apr 29 2008 As one of those that has request "a standard library of string functions...

Janice Caron (12/24) Apr 30 2008 Yeah, I got that from an earlier post when someone said "What you need

Sean Kelly (9/14) Apr 30 2008 This would only work for large arrays I'm afraid, given the GC

Janice Caron (11/13) Apr 30 2008 So does Phobos. std.gc.realloc().

Sean Kelly (6/19) Apr 30 2008 It's perhaps worth noting here that C++ objects don't typically minimize

Me Here (60/76) Apr 30 2008 I did laugh. Not quite "any colour you like so long as its black", but

Janice Caron (20/34) Apr 30 2008 One is file, the other is a folder. std.string is a file, so it can't

Me Here (31/32) Apr 30 2008 What's in a name? Pre-conceptions of other worlds and other tools.

Janice Caron (10/13) Apr 30 2008 I've kind of lost track of the number of times I've said this in

Matti Niemenmaa (12/16) Apr 30 2008 It's possible that, in some obscure case, you can't uppercase UTF-16 in ...

Janice Caron (8/10) Apr 30 2008 Perhaps surprisingly, that's not so. This is because the alphabets of
Janice Caron (9/9) Apr 30 2008 Oh, sorry, I didn't read your whole post before replying. .

Matti Niemenmaa (7/12) Apr 30 2008 You're right, of course. I was referring more to some hypothetical toUpp...

terranium (2/4) Apr 30 2008 does this have any practical use?

Janice Caron (7/11) Apr 30 2008 Private use characters can be used for invented alphabets, e.g.

terranium (2/5) Apr 30 2008 really?

Janice Caron (7/12) Apr 30 2008 Yes really.

Sean Kelly (10/24) Apr 30 2008 In all fairness, you can uppercase UTF-8 in place so long as none of
Me Here (34/48) Apr 30 2008 Ignoring for the moment Matti's pronouncement that this is an obscure an...
Frits van Bommel (9/26) Apr 30 2008 Actually, you can't uppercase UTF-16 and UTF-32 in-place either if you

Janice Caron (12/14) Apr 30 2008 I know about that, and for the future I have plans for a proper

Spacen Jasset (11/28) May 01 2008 I think uppercasing non ascii (english) characters is a more of

Janice Caron (12/15) May 01 2008 The Unicode Standard defines casing unambiguously for all characters.

Steven Schveighoffer (6/20) May 01 2008 What about inPlaceToUpperASCII(char[] str)?

Robert Fraser (10/52) Apr 30 2008 I like StringBuffers :-). Did Walter veto the idea completely or did he

Janice Caron (4/6) Apr 30 2008 Yeah, he said not a class. And that was probably my fault because in

Bill Baxter (6/16) Apr 30 2008 Herein lies the genius in Tango's naming conventions. You *can* have

Steven Schveighoffer (3/18) Apr 30 2008 Not on Windoze :)

Sean Kelly (6/24) Apr 30 2008 It should still work, I believe. The source file will have a .d extensi...

Steven Schveighoffer (6/35) Apr 30 2008 Excellent point, I completely forgot that even though you import std.Str...

Bill Baxter (4/40) May 01 2008 Yes it works fine on Windows too. I pretty much work only on Windows

Bruno Medeiros (5/12) May 01 2008 Something like this would be completely unacceptable not to work on Wind...

Steven Schveighoffer (3/10) May 01 2008 I was wrong, look at my response to Sean. Sorry about that.

Adam D. Ruppe (5/10) Apr 30 2008 --

Sean Kelly (17/24) Apr 30 2008 D arrays do have this feature, thanks to a suggestion by Derek Parnell. ...

Me Here (95/95) Apr 30 2008 As my ascii art was screwed by the time it got to the server, here is a

Janice Caron (4/9) Apr 30 2008 Sorry, I meant

Pedro Ferreira (2/16) May 02 2008 Weren't 'void[]'s banned?

Janice Caron (10/12) Apr 29 2008 That's why we're having this discussion.

Bruno Medeiros (10/25) May 01 2008 "mutable versions were called "by mistake" "? I don't think that point

Frits van Bommel (7/26) May 01 2008 What if you wanted a modified copy of the input, but that input happened...

Steven Schveighoffer (14/40) May 01 2008 Any modifying versions would take mutable strings, COW version would req...
Bruno Medeiros (8/36) May 01 2008 Yes, the idea to distinguish them with a different name sounds good

Frits van Bommel (18/27) May 01 2008 I don't like 'doToUpper', but something like 'makeUpper' could be a good...

Simen Kjaeraas (3/6) May 01 2008 So anyone who uses alphabets other than pure english will have to write ...
Pedro Ferreira (17/31) May 02 2008 (snip)

"Janice Caron" <caron800 googlemail.com> writes:

Hi all,

More than one person has complained about the lack of string functions
in Phobos which operate on mutable chars. In the thread titled "Is all
this Invariant ****....", I suggested creating a new module,
std.stringbuffer, to contain two things:

(1) a StringBuffer class
(2) parallel mutable versions of the functions in std.string.

Walter OKed the idea, so it looks like that's a go. To that end, I've
looked through the functions in std.string and sorted them into
different groups. I think it's important to get the API right so
comments are welcome on all of the below:

The following functions are incorrectly declared in std.string because
they are currently declared to take strings, not const(char)[]. They
should be:

	long atoi(in char[] s)
	real atof(in char[] s)
	size_t count(in char[] s, in char[] sub)
	bool inPattern(dchar c, in char[] pattern)
	int inPattern(dchar c, in char[][] patterns)
	size_t countchars(in char[] s, in char[] pattern)
	bool isNumeric(in char[] s, in bool bAllowSep = false)
	size_t column(in char[] str, int tabsize = 8)

The following functions are badly declared in std.string because they
are declared to take and return strings. With the following change,
they become type-agnostic:

	size_t isEmail(in char[] s)
	size_t isURL(in char[] s)

The following function is the /only/ function currently in std.string
which takes an optional mutable buffer to use instead of allocating on
the heap. For consistency, let's put the mutable version into
std.stringbuffer, and let std.string have an invariant version, as
follows:

	string soundex(string s)

The remaining functions go in std.stringbuffer.

The following functions all take an optional mutable buffer as input
into which to write the return value to avoid allocation.

	char[] tolower(in char[] s, char[] buffer=null)
	char[] toupper(in char[] s, char[] buffer=null)
	char[] capitalize(in char[] s, char[] buffer=null)
	char[] capwords(in char[] s, char[] buffer=null)
	char[] repeat(in char[] s, size_t n, char[] buffer=null)
	char[] join(in char[][] words, char[] sep, char[] buffer=null)
	char[] ljustify(in char[] s, int width, char[] buffer=null)
	char[] rjustify(in char[] s, int width, char[] buffer=null)
	char[] center(in char[] s, int width, char[] buffer=null)
	char[] zfill(in char[] s, int width, char[] buffer=null)
	char[] replace(in char[] s, in char[] from, in char[] to, char[] buffer=null)
	char[] replaceSlice(in char[] s, in char[] slice, in char[] replacement, char[] buffer=null)
	char[] insert(in char[] s, size_t index, in char[] sub, char[] buffer=null)
	char[] expandtabs(in char[] str, int tabsize=8, char[] buffer=null)
	char[] entab(in char[] s, int tabsize=8, char[] buffer=null) // in place?
	char[] maketrans(in char[] from, in char[] to, char[] buffer=null)
	char[] translate(in char[] s, in char[] transtab, in char[] delchars, char[] buffer=null)
	char[] succ(in char[] s, char[] buffer=null)
	char[] soundex(in char[] s, char[] buffer=null)
	char[] wrap(in char[] s, int columns = 80, in char[] firstindent = null, in char[] indent = null, int tabsize = 8, char[] buffer=null)
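
As a rough illustration of the optional-buffer convention in the list above, here is a minimal sketch of repeat in present-day D. It is not the proposed Phobos implementation. The fallback rule (use the caller's buffer only when it is large enough, otherwise allocate) is an assumption; the proposal does not spell it out.

```d
/// Sketch only: repeats `s` `n` times.
/// Assumption: `buffer` is used when it can hold the result; otherwise a new
/// array is allocated on the GC heap and `buffer` is left untouched.
char[] repeat(in char[] s, size_t n, char[] buffer = null)
{
    immutable needed = s.length * n;
    char[] result = buffer.length >= needed ? buffer[0 .. needed]
                                            : new char[needed];
    foreach (i; 0 .. n)
        result[i * s.length .. (i + 1) * s.length] = s[];
    return result;
}

unittest
{
    char[32] storage;
    char[] r = repeat("ab", 3, storage[]);
    assert(r == "ababab");
    assert(r.ptr is storage.ptr); // written into the caller's buffer
}
```

Under this rule the caller can tell whether an allocation happened by comparing the returned slice's .ptr with the buffer's.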

The following functions I am uncertain about. They could be declared
to take a mutable buffer as input, consistent with the above. /Or/
they could operate on data in place. Opinions are welcome.

	char[] removechars(in char[] s, in char[] pattern, char[] buffer=null) // in place?
	char[] squeeze(in char[] s, in char[] pattern = null, char[] buffer=null) // in place?
	char[] tr(in char[] str, in char[] from, in char[] to, in char[] modifiers=null, char[] buffer=null) // in place?

The following functions need to be overloaded for both const and mutable input

	char[][] split(char[] s)
	const(char)[][] split(const(char)[] s)
	char[][] split(char[] s, in char[] delim)
	const(char)[][] split(const(char)[] s, in char[] delim)
	char[][] splitlines(char[] s)
	const(char)[][] splitlines(const(char)[] s)

	char[] stripl(char[] s)
	const(char)[] stripl(const(char)[] s)
	char[] stripr(char[] s)
	const(char)[] stripr(const(char)[] s)
	char[] strip(char[] s)
	const(char)[] strip(const(char)[] s)
	char[] chop(char[] s)
	const(char)[] chop(const(char)[] s)

Not sure what to do about the following one. AAs of mutable arrays are
notoriously difficult to get bug free. Should we bother with this one?

	char[][char[]] abbrev(in char[][] values) // May be impractical

Finally - what do we all think about the inconsistent capitalization
throughout std.string? (toupper versus toString, capwords versus
endsWith, etc.)

Apr 29 2008

"Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:

"Janice Caron" <caron800 googlemail.com> wrote in message 
news:mailman.508.1209497029.2351.digitalmars-d puremagic.com...
 Hi all,

 More than one person has complained about the lack of string functions
 in Phobos which operate on mutable chars. In the thread titled "Is all
 this Invariant ****....", I suggested creating a new module,
 std.stringbuffer, to contain two things:

 (1) a StringBuffer class

Might I ask why a StringBuffer class would be necessary?

Apr 29 2008

"Janice Caron" <caron800 googlemail.com> writes:

2008/4/29 Jarrett Billingsley <kb3ctd2 yahoo.com>:
  Might I ask why a StringBuffer class would be necessary?

Walter has vetoed that one, so it's moot now. :-)

Apr 29 2008

Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:

A couple of thoughts:

Janice Caron wrote:
 Hi all,
 
 More than one person has complained about the lack of string functions
 in Phobos which operate on mutable chars. In the thread titled "Is all
 this Invariant ****....", I suggested creating a new module,
 std.stringbuffer, to contain two things:
 
 (1) a StringBuffer class
 (2) parallel mutable versions of the functions in std.string.

I'm with Jarrett here, why the hell do we need a StringBuffer class?
'string' is not a class either, so just use char[].

I would recommend aliasing char[] to 'mstring' (short for mutable
string). I think such an alias is more readable than 'char[]'.

Also, is there a reason why these mutable functions shouldn't be in
std.string, together with their invariant/const brethren? I don't think
it makes sense to have another package if one opts for solution (2).


-- 
Bruno Medeiros - Software Developer, MSc. in CS/E graduate
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D

Apr 29 2008

"Me Here" <p9e883002 sneakemail.com> writes:

Bruno Medeiros wrote:

 A couple of thoughts:

 std.stringbuffer, to contain two things:

 (1) a StringBuffer class
 (2) parallel mutable versions of the functions in std.string.

 I'm with Jarrett here, why the hell do we need a StringBuffer class?
 'string' is not a class either, so just use char[].

 I would recommend aliasing char[] to 'mstring' (short for mutable string).
 I think such an alias is more readable than 'char[]'.

 Also, is there a reason why these mutable functions shouldn't be in
 std.string, together with their invariant/const brethren?
 I don't think it makes sense to have another package if one opts for
 solution (2).

As one of those that has requested "a standard library of string functions
that accept and return mutable strings, i.e. char[]",
I see no reason it should be a class; free functions seem to work just
fine. A class would just be bloat.

I would be perfectly happy for these to co-exist in the std.string space.
Indeed I would prefer it.

If a separate namespace is deemed /essential/, then I see no reason to go
with, and certainly did not "ask for" it to be called,
the misleading name of std.StringBuffer. As far as I recall, that was
Janice's own suggestion.

For preference, if a separate namespace is absolutely necessary, I go for:

     std.string.mutable

The right namespace, and it does what it says on the tin.

Further, /if/ I had any input to the design, then the suggestion for me to
have to pass in preallocated buffers
to accommodate the mutated data if it needs to grow would be scotched
forthwith.

They should look and work in exactly the same way as the existing v2 
std.string functions taking the same number
and order of parameters. Just char[] (or compatible alias) instead of 
string.

If the buffers need to grow, then allocate space from wherever (I assume 
the heap) that std.string allocates from now.
If they do not change size, then return the original intact.
If they shrink, and if the D array internals permit this, then adjust the 
.length attribute whilst leaving the actual
allocation unchanged. That way it is there for use should further 
mutations cause it to grow again.
This also helps prevent heap fragmentation if the functions are called on 
heap allocated data.

Finally, if the retention of unused but allocated space in an array is a
feature of the current design, then I would add
a debug time warning indicating when a char[] has had to be grown. These
could be used during development to adjust
the preallocated size of arrays to be large enough to accommodate all
(most? typical?) requirements.

In summary: mutable string functions should do exactly the same as the
invariant functions do now,
except only reallocate if necessary and (optionally) issue a warning under
debug if they have to.

Seems almost as if a template solution could be used, except that I think
the additional conditional code would
hamper the performance of both instantiations. Unless D's templating is
capable of optimising away branches
of code that relate to the /other/ type instantiations? I've had no
occasion to use templates in D yet, so that might
be pie in the sky.



--

Apr 29 2008

"Janice Caron" <caron800 googlemail.com> writes:

2008/4/30 Me Here <p9e883002 sneakemail.com>:
  If a separate namespace is deemed /essential/, then I see no reason to go
 with, and certainly did not "ask for" it to be called,
  the misleading name of std.StringBuffer. As far as I recall, that was
 Janice's own suggestion.

Yeah, I got that from an earlier post when someone said "What you need
is a string buffer" in response to some question.

The name can be anything we want it to be.


  For preference, if a separate namespace is absolutely necessary, I go for:

     std.string.mutable

Except "std.string.anything" :-)

"std.string" is a module, so it can't also be a package. That's a
limitation of the D language.

  Finally, if the retention of unused but allocated space in an array is a
 feature of the current design, then I would add
  a debug time warning indicating when a char[] has had to be grown. These
 could be used during development to adjust
  the preallocated size of arrays to be large enough to accommodate all (most?
 typical?) requirements.

I would support the addition of some function like

    gc.minimise(char[])

which returned all the unused space following the end of the array
back to the gc, without any copying of the used part. I wouldn't be
able to write that though - the gc is not my area of expertise.

Apr 30 2008

Sean Kelly <sean invisibleduck.org> writes:

== Quote from Janice Caron (caron800 googlemail.com)'s article
 I would support the addition of some function like
     gc.minimise(char[])
 which returned all the unused space following the end of the array
 back to the gc, without any copying of the used part. I wouldn't be
 able to write that though - the gc is not my area of expertise.

This would only work for large arrays I'm afraid, given the GC
implementation for D--it uses fixed-size blocks until the block
size is 4096 bytes or larger.  Also, the shrinking would be done
in chunks of 4096 bytes, so a fairly substantial size change would
have to occur for anything to happen at all.  That said, things get
a lot easier if moving the block is allowed.  Tango even exposes
a GC.realloc() routine which will do this for you.


Sean

Apr 30 2008

"Janice Caron" <caron800 googlemail.com> writes:

2008/4/30 Sean Kelly <sean invisibleduck.org>:
  Tango even exposes
  a GC.realloc() routine which will do this for you.

So does Phobos. std.gc.realloc().

However, realloc() doesn't promise not to copy, and not copying is the
objective. Thanks for all the cool info, but I just think programmers
would just feel more "comfortable" if, after they've done all their
in-place string manipulations, they can call some minimizing function,
even if only to give them a warm fuzzy feeling that they're not
wasting any more memory than is necessary.

Frankly, it could even be implemented as a do-nothing function. That way,
at least "blame" for excessive memory use passes from the programmer
to Phobos, and future gc implementations might do things differently.

Apr 30 2008

Sean Kelly <sean invisibleduck.org> writes:

== Quote from Janice Caron (caron800 googlemail.com)'s article
 2008/4/30 Sean Kelly <sean invisibleduck.org>:
  Tango even exposes
  a GC.realloc() routine which will do this for you.

 So does Phobos. std.gc.realloc().
 However, realloc() doesn't promise not to copy, and not copying is the
 objective. Thanks for all the cool info, but I just think programmers
 would just feel more "comfortable" if, after they've done all their
 in-place string manipulations, they can call some minimizing function,
 even if only to give them a warm fuzzy feeling that they're not
 wasting any more memory than is necessary.

It's perhaps worth noting here that C++ objects don't typically minimize
either.  That's why Scott Meyers (?) proposed the idiom:

std::vector<T>(myVector).swap(myVector);  // T is the element type

 Frankly, it could even be implemented as a do-nothing function. That way,
 at least "blame" for excessive memory use passes from the programmer
 to Phobos, and future gc implementations might do things differently.

Fair enough.


Sean

Apr 30 2008

"Me Here" <p9e883002 sneakemail.com> writes:

Janice Caron wrote:


 The name can be anything we want it to be.
 ...
 Except "std.string.anything" :-)

I did laugh. Not quite "any colour you like so long as it's black", but
close :)

 "std.string" is a module, so it can't also be a package. That's a
 limitation of the D language.

Now. This is where you show me up to be nothing but a pretender in this
forum.
I have no idea what the distinction is between those two in D.

  Finally, if the retention of unused but allocated space in an array is a
  feature of the current design, then I would add
  a debug time warning indicating when a char[] has had to be grown. These
  could be used during development to adjust
  the preallocated size of arrays to be large enough to accommodate all (most?
  typical?) requirements.

 I would support the addition of some function like

      gc.minimise(char[])

 which returned all the unused space following the end of the array
 back to the gc, without any copying of the used part. I wouldn't be
 able to write that though - the gc is not my area of expertise.

I /think/ you may have misunderstood my intent here. Unsurprising cos it 
was badly outlined.
And I'm not at all sure that D works this way.

In, for example, Perl, an array can be pre-sized but then set to be empty.
That is, it can have space preallocated to it, but contain nothing.
Likewise strings have two length attributes internally.
- one denotes the length of the contents, as would be returned to the
program by the length() function.
- one indicates the actual length of the RAM allocated to it.

This allows, for example, chomp() to simply adjust a number (the
program-visible length) and do
no adjustment or reallocation at all. It can also adjust the left-hand
end of the contents,
effectively foreshortening the string, again without adjusting the
allocation.
So visually, a scalar holding a string might at some point in its life 
look something like:
(this ascii art is going to come out a mess on the server but...)

header
[offset     ]     |--------+
[actualLen  ]-------------------------------------------------------------------------->
[pgmVisible ]              |------------------------------------------------>
[pointer    ]----v         |
                 [][][][][][][the contents the program can see is here][][][][][][]

Basically, it starts out with offset zero and only as much padding (if any)
as is required to bring it to suitable alignment.
But if you remove characters at the end (chomp or chop) then the padding
grows as the content shrinks and nothing is allocated.
If you remove characters from the front of the string the offset
accommodates that and the allocation doesn't change.
And if further mutations expand the string, then these spaces are reused
before a new allocation is made.

If, for example, you know you are going to build a long string up
piecewise from small appendages, you can initialise it to some
length big enough for the expected final length and then truncate it
(assign '' to it) and it will retain its allocation, even though the
program-visible length is zero. Then, as you add stuff to it, it grows
into the allocation.

My point was that /if/ D's arrays have a similar capability, to be
preallocated large and empty and grow into the space, then
when a mutation requires a reallocation of a mutable array because it has
outgrown its original allocation,
a debug-enabled warning saying by how much might allow the
programmer to preallocate the initial mutable array larger and
so avoid reallocation at runtime.

There's a whole heap of speculation about what might be going on inside D 
that I have no real knowledge of at all.

Note: There is no suggestion here that D should work this way. Only that if
it does allow preallocation of array sizes,
then a warning when a mutation causes allocation would allow the
programmer to best use that facility.

Cheers, b.

--

Apr 30 2008

"Janice Caron" <caron800 googlemail.com> writes:

2008/4/30 Me Here <p9e883002 sneakemail.com>:
 "std.string" is a module, so it can't also be a package. That's a
 limitation of the D language.

  Now. This is where you show me up to be nothing but a pretender in this
 forum.
  I have no idea what the distinction is between those two in D.

One is a file, the other is a folder. std.string is a file, so it can't
also be a folder.


  I /think/ you may have misunderstood my intent here. Unsurprising cos it
 was badly outlined.
  And I'm not at all sure that D works this way.

  In, for example, Perl, an array can be pre-sized but then set to be empty.
  That is, it can have space preallocated to it, but contain nothing.
  Likewise strings have two length attributes internally.
  - one denotes the length of the contents, as would be returned to the
 program by the length() function.
  - one indicates the actual length of the RAM allocated to it.

Well, that's what a StringBuffer would do, but nobody seemed to like
the idea. A string contains two pieces of information: (1) ptr, and
(2) length. A StringBuffer would carry a third piece of information:
(3) capacity. (Actually, in general it would be Buffer!(T), with
StringBuffer just being a special case).

Built-in strings do have a capacity, but it's not carried round in a
field. Instead, to find the capacity of an array, you have to call
std.gc.capacity(array) - and I can't see how there can not be a
performance hit there.

Increasing the length of a D array doesn't necessarily mean
reallocating (although as noted above, the code has to do some work to
find out the capacity), but it /does/ mean re-initialising the newly
exposed elements. Again, that has to be a performance hit. With a
Buffer!(), you could increase the length (up to capacity) not only
without reallocating but also without reinitializing, just by changing
the value of an int.
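
A minimal sketch of the ptr/length/capacity idea described above, in present-day D. The names Buffer, store and used are hypothetical, and this is only one way such a type could be laid out, not a design anyone in the thread committed to. Growing the visible length within capacity changes a single integer; nothing is reallocated or reinitialised.

```d
/// Sketch only: a buffer that tracks its own capacity.
struct Buffer(T)
{
    private T[] store;   // backing allocation; store.length is the capacity
    private size_t used; // number of elements currently visible

    this(size_t cap)
    {
        store = new T[cap];
    }

    @property size_t length() const { return used; }
    @property size_t capacity() const { return store.length; }

    /// Sets the visible length. Within capacity this only updates `used`;
    /// the elements keep whatever values they already held.
    @property void length(size_t n)
    {
        if (n > store.length)
            store.length = n; // only here may the runtime reallocate
        used = n;
    }

    /// The visible part as an ordinary slice, so built-in slicing still works.
    T[] opSlice() { return store[0 .. used]; }
}

unittest
{
    auto buf = Buffer!char(16);
    buf.length = 5;
    assert(buf.length == 5 && buf.capacity == 16);
    assert(buf[].length == 5);
}
```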

But <shrugs> - if people don't want StringBuffers, who am I to argue?

Apr 30 2008

"Me Here" <p9e883002 sneakemail.com> writes:

Janice Caron wrote:

But <shrugs> - if people don't want StringBuffers, who am I to argue?

What's in a name? Pre-conceptions of other worlds and other tools. 
Specifically Java.
Additionally, the casing suggests a class?

For my part, I simply want string functions that operate on char[]s.

Because I perceive that, for the type of mutations I am currently doing,
invariant strings would incur too high a cost.

If your StringBuffer concept would accept and manipulate char[]s
and not require the instantiation, initialisation and syntax of an object.

By which I mean that if having used a string function upon my char[]
I can still apply slice operations to it using the standard syntax.
And then apply another string function, and then another slice.

Or even, apply a string function to a slice of a larger string and
mutate that larger string, in-place through the slice:

     char[] a = ...2000 chars from somewhere.

     char[] field1 = a[ 312 .. 357 ];
     field1.toUpper();

     char[] checksum = a[ $-16 ..  $ ];
     checksum = md5hex( a );
...

Then I will be very happy.

Beyond that, I have no requirements :)

All the stuff about warnings and internal and external lengths was just
speculation
about what might be going on inside on the basis of what I know, have seen
(Perl)
and have personally implemented. (Not Perl).

Cheers, b.

Ps. Is there a paper/article/reference on the reasoning behind Invariant 
strings somewhere?


--

Apr 30 2008

"Janice Caron" <caron800 googlemail.com> writes:

2008/4/30 Me Here <p9e883002 sneakemail.com>:
     char[] a = ...2000 chars from somewhere.

     char[] field1 = a[ 312 .. 357 ];
     field1.toUpper();

I've kind of lost track of the number of times I've said this in
recent days, but...

You cannot uppercase in place, because for any given dchar, c, the
number of UTF-8 bytes required to express c may be different from the
number of UTF-8 bytes required to express toupper(c).

If any of you have plans to uppercase or lowercase UTF-8 in place,
forget that now. It just ain't possible. (You can uppercase ASCII,
UTF-16, or UTF-32 in place. But not UTF-8, and char[], by definition,
is UTF-8).

Apr 30 2008

Matti Niemenmaa <see_signature for.real.address> writes:

Janice Caron wrote:
 If any of you have plans to uppercase or lowercase UTF-8 in place,
 forget that now. It just ain't possible. (You can uppercase ASCII,
 UTF-16, or UTF-32 in place. But not UTF-8, and char[], by definition,
 is UTF-8).

It's possible that, in some obscure case, you can't uppercase UTF-16 in place 
either.

A code point in the private use area (U+E000 to U+F8FF), which can be 
represented with one UTF-16 code unit, may uppercase to something in the 
supplementary private use areas (U+F0000 upwards), whose code points require 
two UTF-16 code units each. Of course the toUpper function in question must be 
aware of this configuration of the private use areas.

This is an extremely contrived case and I doubt it'll ever come up in practice, 
anywhere, but in theory it might. <g>

-- 
E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi

Apr 30 2008

"Janice Caron" <caron800 googlemail.com> writes:

2008/4/30 Matti Niemenmaa <see_signature for.real.address>:
 It's possible that, in some obscure case, you can't uppercase UTF-16 in
 place either.

Perhaps surprisingly, that's not so. This is because the alphabets of
*ALL* living languages exist within Unicode's "Basic Multilingual
Plane" (...which is to say, they can be encoded in a single wchar).

The characters outside the BMP (...those which need a dchar, not a
wchar...) are the letters of dead languages, or other special symbols.

The probability that a letter from a living language will uppercase to
a letter of a dead language is as near to zero as makes no odds.

Apr 30 2008

"Janice Caron" <caron800 googlemail.com> writes:

Oh, sorry, I didn't read your whole post before replying. <embarrassed>.

OK, so private use characters might be a contrived exception. BUT,
nobody expects toUpper() to acknowledge private use characters. That
would require a run-time extensibility mechanism which is way beyond
what toUpper() does now, and likely beyond anything it's ever likely
to do any time soon. Maybe some future Unicode library with a
registerPrivateUseCharacters() function might cover that
functionality, but there are no plans for that on the table right now.
(And even then - as you say - it's a /very/ contrived case).

Apr 30 2008

Matti Niemenmaa <see_signature for.real.address> writes:

Janice Caron wrote:
 OK, so private use characters might be a contrived exception. BUT,
 nobody expects toUpper() to acknowledge private use characters. That
 would require a run-time extensibility mechanism which is way beyond
 what toUpper() does now, and likely beyond anything it's ever likely
 to do any time soon.

You're right, of course. I was referring more to some hypothetical toUpper() 
function rather than one which I would expect to find in any standard 
library---the generic case of "uppercasing a character" as opposed to 
std.string[buffer].toUpper.

-- 
E-mail address: matti.niemenmaa+news, domain is iki (DOT) fi

Apr 30 2008

terranium <spam here.lot> writes:

Matti Niemenmaa Wrote:

 A code point in the private use area (U+E000 to U+F8FF), which can be 
 represented with one UTF-16 code unit, may uppercase to something in the 

does this have any practical use?

Apr 30 2008

"Janice Caron" <caron800 googlemail.com> writes:

2008/4/30 terranium <spam here.lot>:
 Matti Niemenmaa Wrote:

  > A code point in the private use area (U+E000 to U+F8FF), which can be
  > represented with one UTF-16 code unit, may uppercase to something in the

  does this have any practical use?

Private use characters can be used for invented alphabets, e.g.
Klingon, or my-made-up-funky-alphabet. You can define them to be
whatever you want. However the mechanism for /interpreting/ such
characters is outside the scope of Unicode. All co-operating
applications have to have the same knowledge of what those characters
"mean".

Apr 30 2008

terranium <spam here.lot> writes:

Janice Caron Wrote:

 You cannot uppercase in place, because for any given dchar, c, the
 number of UTF-8 bytes required to express c may be different from the
 number of UTF-8 bytes required to express toupper(c).

really?

Apr 30 2008

"Janice Caron" <caron800 googlemail.com> writes:

2008/4/30 terranium <spam here.lot>:
 Janice Caron Wrote:

  > You cannot uppercase in place, because for any given dchar, c, the
  > number of UTF-8 bytes required to express c may be different from the
  > number of UTF-8 bytes required to express toupper(c).

  really?

Yes really.

    toUpper( '\u2C65' ) == '\u023A'
    toLower( '\u023A' ) == '\u2C65'

'\u023A' requires two bytes in UTF-8
'\u2C65' requires three bytes in UTF-8

Not a problem in UTF-16, of course.
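
The byte counts above can be checked with a short sketch in present-day D. It assumes current Phobos, where std.uni.toUpper and std.utf.codeLength exist; the helper name utf8Length is made up for the example.

```d
import std.uni : toUpper;
import std.utf : codeLength;

/// Number of UTF-8 bytes needed to encode a single code point.
size_t utf8Length(dchar c)
{
    return codeLength!char(c);
}

unittest
{
    dchar lower = '\u2C65';
    dchar upper = toUpper(lower);

    assert(upper == '\u023A');
    assert(utf8Length(lower) == 3); // three bytes in UTF-8
    assert(utf8Length(upper) == 2); // two bytes in UTF-8
}
```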

Apr 30 2008

Sean Kelly <sean invisibleduck.org> writes:

== Quote from Janice Caron (caron800 googlemail.com)'s article
 2008/4/30 Me Here <p9e883002 sneakemail.com>:
     char[] a = ...2000 chars from somewhere.

     char[] field1 = a[ 312 .. 357 ];
     field1.toUpper();

 I've kind of lost track of the number of times I've said this in
 recent days, but...
 You cannot uppercase in place, because for any given dchar, c, the
 number of UTF-8 bytes required to express c may be different from the
 number of UTF-8 bytes required to express toupper(c).
 If any of you have plans to uppercase or lowercase UTF-8 in place,
 forget that now. It just ain't possible. (You can uppercase ASCII,
 UTF-16, or UTF-32 in place. But not UTF-8, and char[], by definition,
 is UTF-8).

In all fairness, you can uppercase UTF-8 in place so long as none of
the characters within the string require a multi-byte capital.  Thus
one questionable strategy would be to uppercase in place until the
first multibyte conversion is required.  The obvious downside being
that the original buffer may end up partially capitalized, with the
fully capitalized result returned in a new buffer.  I'm sure people
processing ASCII text would love this, but I can see it causing
problems elsewhere.


Sean

Apr 30 2008

"Me Here" <p9e883002 sneakemail.com> writes:

Janice Caron wrote:

2008/4/30 Me Here <p9e883002 sneakemail.com>:
     char[] a = ...2000 chars from somewhere.

     char[] field1 = a[ 312 .. 357 ];
     field1.toUpper();

I've kind of lost track of the number of times I've said this in
recent days, but...

You cannot uppercase in place, because for any given dchar, c, the
number of UTF-8 bytes required to express c may be different from the
number of UTF-8 bytes required to express toupper(c).

If any of you have plans to uppercase or lowercase UTF-8 in place,
forget that now. It just ain't possible. (You can uppercase ASCII,
UTF-16, or UTF-32 in place. But not UTF-8, and char[], by definition,
is UTF-8).

Ignoring for the moment Matti's pronouncement that this is an obscure and
unlikely event, it really depends upon how the library is coded.

For example, the case change could be effected in place in the majority of
cases, where that is possible. On the occasions when it is not, and a runtime
exception is raised, catch the error and use replaceSlice to handle it:

	import std.stdio;
	import std.string;

	int main( char[][] args ) {
	    char [] s = "the quick brown fox";
	    try{
	        s[ 8 .. 9 ] = \u1234;
	    }
	    catch {
	        s =  s.replaceSlice( s[ 8 .. 9 ], \u1234  );
	    }
	    writefln( s );
	    return 0;
	}

Though it would be (much) nicer if the builtin lvalue slice handled this
for us.
I was just disappointed, for the second time, to (re)discover this limitation of
D's slicing. I had forgotten because other languages I use do not have it.

This is one of those decisions that I doubt I will ever agree with.
But I'm just another jerk on the internet with an opinion, and we all know
what that is analogous to.

If the language doesn't handle it, then the library should.
If it doesn't, then I will have to. And you, and Bill and Fred and Sue...

Cheers, b.


--

Apr 30 2008

Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:

Janice Caron wrote:
 2008/4/30 Me Here <p9e883002 sneakemail.com>:
     char[] a = ...2000 chars from somewhere.

     char[] field1 = a[ 312 .. 357 ];
     field1.toUpper();

 
 I've kind of lost track of the number of times I've said this in
 recent days, but...
 
 You cannot uppercase in place, because for any given dchar, c, the
 number of UTF-8 bytes required to express c may be different from the
 number of UTF-8 bytes required to express toupper(c).
 
 If any of you have plans to uppercase or lowercase UTF-8 in place,
 forget that now. It just ain't possible. (You can uppercase ASCII,
 UTF-16, or UTF-32 in place. But not UTF-8, and char[], by definition,
 is UTF-8).

Actually, you can't uppercase UTF-16 and UTF-32 in-place either if you 
want to be entirely correct. For example: \u00df ("ß") --> \u0053 \u0053 
("SS"). This increases the byte count for both UTF-16 and UTF-32.
(This does work for UTF-8 though, since \u00df happens to require 2 
UTF-8 code units, and both \u0053s only one each)

(See <http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt> for what 
should be a complete list of characters with similar annoying casing 
properties)
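
A small sketch of the code-unit counts involved, assuming nothing beyond D's string literal suffixes (no suffix for UTF-8, `w` for UTF-16, `d` for UTF-32). It only measures the two spellings; it does not perform any case conversion.

```d
void main()
{
    // U+00DF (sharp s) versus its full-case uppercase form "SS".

    // UTF-8: two code units either way, so the size is unchanged.
    assert("\u00DF".length == 2);
    assert("SS".length == 2);

    // UTF-16: one code unit becomes two.
    assert("\u00DF"w.length == 1);
    assert("SS"w.length == 2);

    // UTF-32: one code unit becomes two.
    assert("\u00DF"d.length == 1);
    assert("SS"d.length == 2);
}
```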

Apr 30 2008

"Janice Caron" <caron800 googlemail.com> writes:

2008/5/1 Frits van Bommel <fvbommel remwovexcapss.nl>:
  Actually, you can't uppercase UTF-16 and UTF-32 in-place either if you want
 to be entirely correct. For example: \u00df ("ß") --> \u0053 \u0053 ("SS").

I know about that, and for the future I have plans for a proper
unicode lib with normalisation, full casing, etc. However - none of
that is the job of std.string.toUpper() or std.string.toLower(). These
functions only need to do /simple/ casing, not /full/ casing, and in
/simple/ casing, one dchar always maps to one dchar. In particular
'\u00DF' maps to '\u00DF'.

In full casing, toLower('\u1E9E') (LATIN CAPITAL LETTER SHARP S) is
'\u00DF' (LATIN SMALL LETTER SHARP S), but the converse is not true.
What fun! :-). But full casing is not the concern of std.string (nor
of std.stringbuffer, or whatever we end up calling it), so we don't
need to worry about that here.

Apr 30 2008

Spacen Jasset <spacenjasset yahoo.co.uk> writes:

Janice Caron wrote:
 2008/4/30 Me Here <p9e883002 sneakemail.com>:
     char[] a = ...2000 chars from somewhere.

     char[] field1 = a[ 312 .. 357 ];
     field1.toUpper();

 
 I've kind of lost track of the number of times I've said this in
 recent days, but...
 
 You cannot uppercase in place, because for any given dchar, c, the
 number of UTF-8 bytes required to express c may be different from the
 number of UTF-8 bytes required to express toupper(c).
 
 If any of you have plans to uppercase or lowercase UTF-8 in place,
 forget that now. It just ain't possible. (You can uppercase ASCII,
 UTF-16, or UTF-32 in place. But not UTF-8, and char[], by definition,
 is UTF-8).

I think uppercasing non-ASCII (English) characters is more of a 
specialised business anyway (some languages have no notion of upper 
case, and yet others depend on context), which often should be performed 
by a presentation layer.

People need a toupper/lower all the time, and 90% of the time they use 
it on strings that are in the ascii range, often because they deal with 
protocols, file formats and other such things.

In which case phobos's string.toupper shouldn't really be doing work 
outside of ascii, in my opinion anyway. This also means that a string 
can be uppercased in place.

May 01 2008

"Janice Caron" <caron800 googlemail.com> writes:

On 01/05/2008, Spacen Jasset <spacenjasset yahoo.co.uk> wrote:
  I think uppercasing non-ASCII (English) characters is more of a specialised
 business anyway (some languages have no notion of upper case, and yet others
 depend on context), which often should be performed by a presentation layer.

The Unicode Standard defines casing unambiguously for all characters.
Yes, toupper() of a Chinese character will leave it unchanged, but
it's still defined, and that is /not/ locale dependent.

However, casing in place is possible for UTF-8 if you're prepared to
throw an exception for those (extremely rare) cases when the sequence
length changes. So that means, you'd need two versions, the in-place
version

    toUpperInPlace(char[] s)  // might throw

and the general version

    char[] toUpper(const(char)[] s, char[] buffer=null)

That could be done.
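
A rough sketch of what the in-place version might look like. The name toUpperInPlace is the hypothetical one from the proposal; the body assumes a present-day Phobos (std.utf.decode/encode, std.uni.toUpper, std.exception.enforce) and is only one possible way to do it. It checks the whole string before writing anything, so that a failure does not leave the buffer partially capitalized.

```d
import std.exception : assertThrown, enforce;
import std.uni : toUpper;
import std.utf : decode, encode;

// Hypothetical function, not an existing Phobos API.
// Throws if any character's uppercase form needs a different number
// of UTF-8 bytes (decode also throws on malformed UTF-8).
void toUpperInPlace(char[] s)
{
    // Pass 1: check every character before touching the buffer.
    for (size_t i = 0; i < s.length; )
    {
        immutable start = i;
        immutable dchar c = decode(s, i); // advances i past the sequence
        char[4] tmp;
        enforce(encode(tmp, toUpper(c)) == i - start,
                "uppercase form needs a different number of UTF-8 bytes");
    }

    // Pass 2: every replacement is known to fit, so overwrite in place.
    for (size_t i = 0; i < s.length; )
    {
        immutable start = i;
        immutable dchar c = decode(s, i);
        char[4] tmp;
        immutable n = encode(tmp, toUpper(c));
        s[start .. i] = tmp[0 .. n];
    }
}

void main()
{
    char[] ok = "h\u00E9llo".dup;      // U+00E9 and U+00C9 both take two bytes
    toUpperInPlace(ok);
    assert(ok == "H\u00C9LLO");

    char[] bad = "a\u2C65".dup;        // U+2C65 would shrink from three bytes to two
    assertThrown(toUpperInPlace(bad));
    assert(bad == "a\u2C65");          // untouched, thanks to the checking pass
}
```

The price of the checking pass is that the string is decoded twice; a single pass would be faster but could throw halfway through.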

May 01 2008

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

"Janice Caron" wrote
 2008/4/30 Me Here:
     char[] a = ...2000 chars from somewhere.

     char[] field1 = a[ 312 .. 357 ];
     field1.toUpper();

 I've kind of lost track of the number of times I've said this in
 recent days, but...

 You cannot uppercase in place, because for any given dchar, c, the
 number of UTF-8 bytes required to express c may be different from the
 number of UTF-8 bytes required to express toupper(c).

 If any of you have plans to uppercase or lowercase UTF-8 in place,
 forget that now. It just ain't possible. (You can uppercase ASCII,
 UTF-16, or UTF-32 in place. But not UTF-8, and char[], by definition,
 is UTF-8).

What about inPlaceToUpperASCII(char[] str)?

in other words, yeah, toUpper can use a UTF-8 string, and return a UTF-8 
string, but I can see use in having a function that expects to receive ASCII 
and uppercases in-place.  The function would be a lot simpler in any case :)
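
Something along these lines, as a sketch only (inPlaceToUpperASCII is the hypothetical name from this post, not a Phobos function). It relies on the fact that every byte of a multi-byte UTF-8 sequence is 0x80 or above, so touching only 'a' to 'z' cannot damage non-ASCII text; it simply leaves it alone.

```d
// Hypothetical helper: uppercases ASCII letters in place and leaves
// every other byte (including multi-byte UTF-8 sequences) untouched.
void inPlaceToUpperASCII(char[] str)
{
    foreach (ref c; str)
    {
        if (c >= 'a' && c <= 'z')
            c = cast(char)(c - ('a' - 'A'));
    }
}

void main()
{
    // Uppercase one field of a larger buffer without copying it.
    char[] record = "id=abc;name=xyz".dup;
    inPlaceToUpperASCII(record[3 .. 6]);
    assert(record == "id=ABC;name=xyz");

    // Non-ASCII characters pass through unchanged.
    char[] mixed = "caf\u00E9".dup;
    inPlaceToUpperASCII(mixed);
    assert(mixed == "CAF\u00E9");
}
```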

-Steve

May 01 2008

Robert Fraser <fraserofthenight gmail.com> writes:

Janice Caron wrote:
 2008/4/30 Me Here <p9e883002 sneakemail.com>:
 "std.string" is a module, so it can't also be a package. That's a
 limitation of the D language.

  Now. This is where you show me up to be nothing but a pretender in this
 forum.
  I have no idea what the distinction is between those two in D.

 
 One is file, the other is a folder. std.string is a file, so it can't
 also be a folder.
 
 
  I /think/ you may have misunderstood my intent here. Unsurprising cos it
 was badly outlined.
  And I'm not at all sure that D works this way.

  In, for example, Perl, an array can be pre-sized but then set to be empty.
  That is, it can have space preallocated to it, but contain nothing.
  Likewise strings have two length attributes internally.
  - one denotes the length of the contents, as would be returned to the
 program by the length() function.
  - one indicates the actual length of the RAM allocated to it.

 
 Well, that's what a StringBuffer would do, but nobody seemed to like
 the idea. A string contains two pieces of information: (1) ptr, and
 (2) length. A StringBuffer would carry a third piece of information:
 (3) capacity. (Actually, in general it would be Buffer!(T), with
 StringBuffer just being a special case).
 
 Built-in strings do have a capacity, but it's not carried round in a
 field. Instead, to find the capacity of an array, you have to call
 std.gc.capacity(array) - and I can't see how there can not be a
 performance hit there.
 
 Increasing the length of a D array doesn't necessarily mean
 reallocating (although as noted above, the code has to do some work to
 find out the capacity), but it /does/ mean re-initialising the newly
 exposed elements. Again, that has to be a performance hit. With a
 Buffer!(), you could increase the length (up to capacity) not only
 without reallocating but also without reinitializing, just by changing
 the value of an int.
 
 But <shrugs> - if people don't want StringBuffers, who am I to argue?

I like StringBuffers :-). Did Walter veto the idea completely or did he 
say "not a class"? I'd use a struct - there's no extra bloat, the 
interface can be encapsulated, and people can use a pointer if they're 
passing between functions (since it will most often be used within the 
scope of a single function anyway). Or just pass it on the stack, if 
it's guaranteed to only be 3 DWORDs.

My suggestion (grain of salt) is to represent them similarly to the way 
mtext does by using two bits somewhere to hold the character type (char, 
wchar, dchar) and change character types as needed.
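
To make the struct idea concrete, here is a minimal sketch in present-day D. Everything in it (Buffer, StringBuffer, put, clear) is a hypothetical name for illustration, not a proposed Phobos API, and it leaves out the character-type switching suggested above. It stores a slice for the allocation plus a count of elements in use, which is three machine words.

```d
struct Buffer(T)
{
    private T[] store;   // the allocation: store.ptr plus store.length (capacity)
    private size_t used; // number of elements currently in use (length)

    this(size_t initialCapacity) { store = new T[](initialCapacity); }

    size_t length() const { return used; }
    size_t capacity() const { return store.length; }

    /// Appends data, growing the allocation only when capacity is exceeded.
    void put(const(T)[] data)
    {
        immutable needed = used + data.length;
        if (needed > store.length)
            store.length = needed * 2; // may reallocate
        store[used .. needed] = data[];
        used = needed;
    }

    /// Slice of the part in use; no copy is made.
    inout(T)[] opSlice() inout { return store[0 .. used]; }

    /// Forgets the contents but keeps the allocation.
    void clear() { used = 0; }
}

alias StringBuffer = Buffer!char;

void main()
{
    auto sb = StringBuffer(16);
    sb.put("hello");
    sb.put(", world");
    assert(sb[] == "hello, world");
    assert(sb.length == 12 && sb.capacity == 16);

    sb.clear();
    assert(sb.length == 0 && sb.capacity == 16);
}
```

Note that `new T[](n)` still initialises the elements, so this sketch does not deliver the "no reinitialising" benefit mentioned in the quoted post; that would need an uninitialised allocation.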

Apr 30 2008

"Janice Caron" <caron800 googlemail.com> writes:

2008/4/30 Robert Fraser <fraserofthenight gmail.com>:
  I like StringBuffers :-). Did Walter veto the idea completely or did he say
 "not a class"?

Yeah, he said not a class. And that was probably my fault because in
my first post on this thread I used the word "class".
Janice

Apr 30 2008

Bill Baxter <dnewsgroup billbaxter.com> writes:

Janice Caron wrote:
 2008/4/30 Me Here <p9e883002 sneakemail.com>:
 "std.string" is a module, so it can't also be a package. That's a
 limitation of the D language.

  Now. This is where you show me up to be nothing but a pretender in this
 forum.
  I have no idea what the distinction is between those two in D.

 
 One is file, the other is a folder. std.string is a file, so it can't
 also be a folder.
 

Herein lies the genius in Tango's naming conventions.  You *can* have 
both a package std.string, and a module named std.String.  If you 
consistently use different case for package and module names, then you 
can have your cake and eat it too.

--bb

Apr 30 2008

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

"Bill Baxter" wrote
 Janice Caron wrote:
 2008/4/30 Me Here :
 "std.string" is a module, so it can't also be a package. That's a
 limitation of the D language.

  Now. This is where you show me up to be nothing but a pretender in this
 forum.
  I have no idea what the distinction is between those two in D.

 One is file, the other is a folder. std.string is a file, so it can't
 also be a folder.

 Herein lies the genius in Tango's naming conventions.  You *can* have both 
 a package std.string, and a module named std.String.  If you consistently 
 use different case for package and module names, then you can have your 
 cake and eat it too.

Not on Windoze :)

-Steve

Apr 30 2008

Sean Kelly <sean invisibleduck.org> writes:

== Quote from Steven Schveighoffer (schveiguy yahoo.com)'s article
 "Bill Baxter" wrote
 Janice Caron wrote:
 2008/4/30 Me Here :
 "std.string" is a module, so it can't also be a package. That's a
 limitation of the D language.

  Now. This is where you show me up to be nothing but a pretender in this
 forum.
  I have no idea what the distinction is between those two in D.

 One is file, the other is a folder. std.string is a file, so it can't
 also be a folder.

 Herein lies the genius in Tango's naming conventions.  You *can* have both
 a package std.string, and a module named std.String.  If you consistently
 use different case for package and module names, then you can have your
 cake and eat it too.

 Not on Windoze :)

It should still work, I believe.  The source file will have a .d extension and
the folder won't, so there shouldn't be a filesystem collision.  Or are you
saying that the compiler does some checking behind the scenes anyway?  I'll
admit I've never actually tried this.


Sean

Apr 30 2008

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

"Sean Kelly" wrote
 == Quote from Steven Schveighoffer
 "Bill Baxter" wrote
 Janice Caron wrote:
 2008/4/30 Me Here :
 "std.string" is a module, so it can't also be a package. That's a
 limitation of the D language.

  Now. This is where you show me up to be nothing but a pretender in this
 forum.
  I have no idea what the distinction is between those two in D.

 One is file, the other is a folder. std.string is a file, so it can't
 also be a folder.

 Herein lies the genius in Tango's naming conventions.  You *can* have both
 a package std.string, and a module named std.String.  If you consistently
 use different case for package and module names, then you can have your
 cake and eat it too.

 Not on Windoze :)

 It should still work, I believe.  The source file will have a .d extension
 and the folder won't, so there shouldn't be a filesystem collision.  Or are
 you saying that the compiler does some checking behind the scenes anyway?
 I'll admit I've never actually tried this.

Excellent point, I completely forgot that even though you import std.String, 
you are really looking at the file std/String.d.

In that case, I think you are right, it would work on Windoze.

-Steve

Apr 30 2008

Bill Baxter <dnewsgroup billbaxter.com> writes:

Steven Schveighoffer wrote:
 "Sean Kelly" wrote
 == Quote from Steven Schveighoffer
 "Bill Baxter" wrote
 Janice Caron wrote:
 2008/4/30 Me Here :
 "std.string" is a module, so it can't also be a package. That's a
 limitation of the D language.

  Now. This is where you show me up to be nothing but a pretender in this
 forum.
  I have no idea what the distinction is between those two in D.

 One is file, the other is a folder. std.string is a file, so it can't
 also be a folder.

 Herein lies the genius in Tango's naming conventions.  You *can* have both
 a package std.string, and a module named std.String.  If you consistently
 use different case for package and module names, then you can have your
 cake and eat it too.

 Not on Windoze :)

 It should still work, I believe.  The source file will have a .d extension
 and the folder won't, so there shouldn't be a filesystem collision.  Or are
 you saying that the compiler does some checking behind the scenes anyway?
 I'll admit I've never actually tried this.

 
 Excellent point, I completely forgot that even though you import std.String, 
 you are really looking at the file std/String.d.
 
 In that case, I think you are right, it would work on Windoze.
 
 -Steve 

Yes it works fine on Windows too.  I pretty much work only on Windows, 
testing things occasionally on VMWare Linux.

--bb

May 01 2008

Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:

Steven Schveighoffer wrote:
 "Bill Baxter" wrote
 
 Not on Windoze :)
 
 -Steve 
 
 

Something like this would be completely unacceptable not to work on Windows.

-- 
Bruno Medeiros - Software Developer, MSc. in CS/E graduate
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D

May 01 2008

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

"Bruno Medeiros" wrote
 Steven Schveighoffer wrote:
 Not on Windoze :)

 -Steve

 Something like this would be completely unacceptable not to work on 
 Windows.

I was wrong, look at my response to Sean.  Sorry about that.

-Steve

May 01 2008

"Adam D. Ruppe" <destructionator gmail.com> writes:

On Thu, May 01, 2008 at 02:19:51AM +0900, Bill Baxter wrote:
 Herein lies the genius in Tango's naming conventions.  You *can* have 
 both a package std.string, and a module named std.String.  If you 
 consistently use different case for package and module names, then you 
 can have your cake and eat it too.

Does that work on Windows?

 --bb

-- 
Adam D. Ruppe
http://arsdnet.net

Apr 30 2008

Sean Kelly <sean invisibleduck.org> writes:

== Quote from Me Here (p9e883002 sneakemail.com)'s article
 My point was that /if/ Ds arrays have a similar capability, to be
 preallocated large and empty and grow into the space then
 when a mutation requires a reallocation of a mutable array because it has
 outgrown its original allocation,
 then a debug-enabled warning saying by how much, might allow the
 programmer to preallocate the initial mutable array larger and
 so avoid reallocation at runtime.

D arrays do have this feature, thanks to a suggestion by Derek Parnell.  That
is, reducing the array's length property does not cause a reallocation, even
when length is set to zero.  Thus it is possible to do:

    void fn( inout char[] buf )
    {
        buf.length = 1024; // preallocate 1024 bytes of storage
        buf.length = 0;
        buf ~= "hello"; // will copy into preallocated buffer
    }

Thus the proper way to discard a buffer is to do:

    buf = null;

I think for specific buffers it's probably enough to print their length when
you're done filling them and then explicitly preallocate the next run based on
this info.  Tango also offers a means of performing program-level preallocation
via GC.reserve() for people so inclined.
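
For readers trying this with a present-day D2 runtime, here is a sketch of the same preallocate-then-append idea. It is an assumption on my part that the reader's runtime matches current druntime, where the built-in `reserve`, `capacity` and `assumeSafeAppend` are available and where appending to a shortened slice reallocates unless `assumeSafeAppend` is called first; the 2008 runtime discussed in this thread behaves as described in the post.

```d
void main()
{
    char[] buf;
    buf.reserve(1024);              // ask the runtime for room for 1024 chars
    assert(buf.length == 0);
    assert(buf.capacity >= 1024);

    buf ~= "hello";
    const p = buf.ptr;
    buf ~= ", world";               // fits in the reserved block
    assert(buf.ptr is p);

    buf.length = 0;                 // keep the block, drop the contents
    buf.assumeSafeAppend();         // promise that no other slice uses the old data
    buf ~= "again";                 // reuses the same block
    assert(buf.ptr is p);
    assert(buf == "again");
}
```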


Sean

Apr 30 2008

"Me Here" <p9e883002 sneakemail.com> writes:

As my ASCII art was screwed by the time it got to the server, here is a 
better illustration of what goes on.
This is long and wordy and maybe of no interest. But it does illustrate 
the point I was trying to make.

[0] Perl> use Devel::Peek;;

allocated.
SV = NULL(0x0) at 0x194a9cc
   REFCNT = 1
   FLAGS = ()


[0] Perl> Dump $s;;
SV = PV(0x2252e8) at 0x194a9cc
   REFCNT = 1
   FLAGS = (POK,pPOK)


length

in case we pass it to C


the first 5 characters
[0] Perl> Dump $s;;
SV = PVIV(0x2256ec) at 0x194a9cc
   REFCNT = 1
   FLAGS = (POK,OOK,pPOK)

start of the buffer
   PV = 0x191e1a9 ( "abcde" . ) "fghijklmnopqrstuvwxyz"\0





[0] Perl> Dump $s;;
SV = PVIV(0x2256ec) at 0x194a9cc
   REFCNT = 1
   FLAGS = (POK,OOK,pPOK)
   IV = 5  (OFFSET)
   PV = 0x191e1a9 ( "abcde" . ) "fghijklmnop"\0



[0] Perl> $s = 'XX' . $s;;          Prepend some new stuff back
[0] Perl> Dump $s;;
SV = PVIV(0x2256ec) at 0x194a9cc
   REFCNT = 1
   FLAGS = (POK,OOK,pPOK)
   IV = 5  (OFFSET)
   PV = 0x191e1a9 ( "abcde" . ) "XXfghijklmnop"\0




[0] Perl> Dump $s;;
SV = PVIV(0x2256ec) at 0x194a9cc
   REFCNT = 1
   FLAGS = (POK,OOK,pPOK)
   IV = 5  (OFFSET)
   PV = 0x191e1a9 ( "abcde" . ) "XXfghijklmnopXX"\0

   LEN = 22


offset space
[0] Perl> Dump $s;;
SV = PVIV(0x2256ec) at 0x194a9cc
   REFCNT = 1
   FLAGS = (POK,OOK,pPOK)

   PV = 0x191e1a9 ( "abcde" . ) "XXfghijklmnopXX??????"\0
   CUR = 21
   LEN = 22


[0] Perl> Dump $s;;
SV = PVIV(0x2256ec) at 0x194a9cc
   REFCNT = 1
   FLAGS = (POK,pPOK)


   CUR = 23
   LEN = 27


[0] Perl> Dump $s;;
SV = PVIV(0x2256ec) at 0x194a9cc
   REFCNT = 1
   FLAGS = (POK,pPOK)


   CUR = 26
   LEN = 27


[0] Perl> Dump $s;;
SV = PVIV(0x2256ec) at 0x194a9cc
   REFCNT = 1
   FLAGS = (POK,pPOK)


memory.

   CUR = 27
   LEN = 28

--

Apr 30 2008

"Janice Caron" <caron800 googlemail.com> writes:

2008/4/30 Janice Caron <caron800 googlemail.com>:
  I would support the addition of some function like

     gc.minimise(char[])

  which returned all the unused space following the end of the array
  back to the gc, without any copying of the used part. I wouldn't be
  able to write that though - the gc is not my area of expertise.

Sorry, I meant

    std.gc.minimise(void[] array)

This function doesn't exist right now.

Apr 30 2008

Pedro Ferreira <ask me.pt> writes:

Janice Caron escreveu:
 2008/4/30 Janice Caron <caron800 googlemail.com>:
  I would support the addition of some function like

     gc.minimise(char[])

  which returned all the unused space following the end of the array
  back to the gc, without any copying of the used part. I wouldn't be
  able to write that though - the gc is not my area of expertise.

 
 Sorry, I meant
 
     std.gc.minimise(void[] array)
 
 This function doesn't exist right now.

Weren't 'void[]'s banned?

May 02 2008

"Janice Caron" <caron800 googlemail.com> writes:

2008/4/30 Bruno Medeiros <brunodomedeiros+spam com.gmail>:
  Also, is there a reason why these mutable functions shouldn't be in
 std.string, together with their invariant/const brethren?

That's why we're having this discussion.

The idea is that std.string can be optimised for invariant strings,
while std.stringbuffer could be optimised for mutable strings. There
are pros and cons for separate modules. I don't think Walter wants
std.string "polluted" by all these functions he doesn't much care for.
Also, it would be bad if mutable versions were called "by mistake"
with consequent unexpected behavior.

But keep discussing. The people I want to hear from most are the
people calling for mutable string functions.

Apr 29 2008

Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:

Janice Caron wrote:
 2008/4/30 Bruno Medeiros <brunodomedeiros+spam com.gmail>:
  Also, is there a reason why these mutable functions shouldn't be in
 std.string, together with their invariant/const brethren?

 
 That's why we're having this discussion.
 
 The idea is that std.string can be optimised for invariant strings,
 while std.stringbuffer could be optimised for mutable strings. There
 are pros and cons for separate modules. I don't think Walter wants
 std.string "polluted" by all these functions he doesn't much care for.
 Also, it would be bad if mutable versions were called "by mistake"
 with consequent unexpected behavior.
 

"mutable versions were called "by mistake" "? I don't think that point 
applies to D, after all, the purpose of the immutability system is for 
the compiler to check that this won't happen, so unless there is some 
compiler bug, that shouldn't happen in D.

 But keep discussing. The people I want to hear from most are the
 people calling for mutable string functions.

You may find that a large segment of those people are using Tango, and 
so they might not participate much in this Phobos design issue discussion.

-- 
Bruno Medeiros - Software Developer, MSc. in CS/E graduate
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D

May 01 2008

Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:

Bruno Medeiros wrote:
 Janice Caron wrote:
 2008/4/30 Bruno Medeiros <brunodomedeiros+spam com.gmail>:
  Also, is there a reason why these mutable functions shouldn't be in
 std.string, together with their invariant/const brethren?

 That's why we're having this discussion.

 The idea is that std.string can be optimised for invariant strings,
 while std.stringbuffer could be optimised for mutable strings. There
 are pros and cons for separate modules. I don't think Walter wants
 std.string "polluted" by all these functions he doesn't much care for.
 Also, it would be bad if mutable versions were called "by mistake"
 with consequent unexpected behavior.

 
 "mutable versions were called "by mistake" "? I don't think that point 
 applies to D, after all, the purpose of the immutability system is for 
 the compiler to check that this won't happen, so unless there is some 
 compiler bug, that shouldn't happen in D.

What if you wanted a modified copy of the input, but that input happened 
to be mutable?

The modifying versions should have some distinguishing characteristic to 
separate them from the COW versions. I'd say either a different function 
name or an extra out-buffer parameter (as long as they still work if the 
buffer is the same array as the normal input).

May 01 2008

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

"Frits van Bommel" wrote
 Bruno Medeiros wrote:
 Janice Caron wrote:
 2008/4/30 Bruno Medeiros:
  Also, is there a reason why these mutable functions shouldn't be in
 std.string, together with their invariant/const brethren?

 That's why we're having this discussion.

 The idea is that std.string can be optimised for invariant strings,
 while std.stringbuffer could be optimised for mutable strings. There
 are pros and cons for separate modules. I don't think Walter wants
 std.string "polluted" by all these functions he doesn't much care for.
 Also, it would be bad if mutable versions were called "by mistake"
 with consequent unexpected behavior.

 "mutable versions were called "by mistake" "? I don't think that point 
 applies to D, after all, the purpose of the immutability system is for 
 the compiler to check that this won't happen, so unless there is some 
 compiler bug, that shouldn't happen in D.

 What if you wanted a modified copy of the input, but that input happened 
 to be mutable?

 The modifying versions should have some distinguishing characteristic to 
 separate them from the COW versions. I'd say either a different function 
 name or an extra out-buffer parameter (as long as they still work if the 
 buffer is the same array as the normal input).

Any modifying versions would take mutable strings, COW version would require 
invariant strings.  They would be able to go in the same module, because 
there would be no ambiguity.

But if you have non-modifying versions that you want to use on mutable 
strings, those would most likely take a const pointer.  Those would have to 
be named differently than the invariant versions, because invariant 
implicitly casts to const.

Besides all this, it is good to separate them into 2 different modules 
because the linker includes all functions that are in a module, not just 
ones that are used.  So if you are of the persuasion to only use mutable or 
only use COW functions, then you probably don't want to link in the other 
versions if you can help it.

-Steve

May 01 2008

Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:

Frits van Bommel wrote:
 Bruno Medeiros wrote:
 Janice Caron wrote:
 2008/4/30 Bruno Medeiros <brunodomedeiros+spam com.gmail>:
  Also, is there a reason why these mutable functions shouldn't be in
 std.string, together with their invariant/const brethren?

 That's why we're having this discussion.

 The idea is that std.string can be optimised for invariant strings,
 while std.stringbuffer could be optimised for mutable strings. There
 are pros and cons for separate modules. I don't think Walter wants
 std.string "polluted" by all these functions he doesn't much care for.
 Also, it would be bad if mutable versions were called "by mistake"
 with consequent unexpected behavior.

 "mutable versions were called "by mistake" "? I don't think that point 
 applies to D, after all, the purpose of the immutability system is for 
 the compiler to check that this won't happen, so unless there is some 
 compiler bug, that shouldn't happen in D.

 
 What if you wanted a modified copy of the input, but that input happened 
 to be mutable?
 

Hum, I see what you mean, yes, that could happen.

 The modifying versions should have some distinguishing characteristic to 
 separate them from the COW versions. I'd say either a different function 
 name or an extra out-buffer parameter (as long as they still work if the 
 buffer is the same array as the normal input).

Yes, the idea to distinguish them with a different name sounds good 
(names like "doToUpper", maybe?). So that means you agree it should be 
in the same package? :P

-- 
Bruno Medeiros - Software Developer, MSc. in CS/E graduate
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D

May 01 2008

Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:

Bruno Medeiros wrote:
 Frits van Bommel wrote:
 The modifying versions should have some distinguishing characteristic 
 to separate them from the COW versions. I'd say either a different 
 function name or an extra out-buffer parameter (as long as they still 
 work if the buffer is the same array as the normal input).

 
 Yes, the idea to distinguish them with a different name sounds good 
 (names like "doToUpper", maybe?). So that means you agree it should be 
 in the same package? :P

I don't like 'doToUpper', but something like 'makeUpper' could be a good 
convention. That makes it pretty clear they're modifying the input, I think.

I don't particularly care what package they're in, but their names 
should make it clear what they do. Especially if you're working with 
both sets of functions in the same module...


Looking at the Phobos2 std.string docs, I do think some of those 
functions could benefit from at least a const(char)[] overload so 
they'll work with non-invariant parameters too. The ones that don't even 
return string data[1] should probably just replace all invariant 
parameters with const ones.
Of course, for the rest the return type of const overloads could be 
debated. (First question: should they ever return a slice? If not, 
should the return type be mutable or invariant[2]?)


[1]: In particular: inPattern(), size_t count*(), bool is*() and size_t 
column() are the ones I saw.
[2]: It shouldn't be const though, that'd be pointless: returning newly 
allocated memory as const means it's effectively invariant anyway.

May 01 2008

Simen Kjaeraas <simen.kjaras gmail.com> writes:

Spacen Jasset Wrote:
 string.toupper shouldn't really be doing work 
 outside of ascii, in my opinion anyway. This also means that a string 
 can be uppercased in place.

So anyone who uses alphabets other than pure English will have to write their
own function to uppercase their strings, even though the Unicode standard
defines how it should work, and D is supposed to support Unicode?

--Simen

May 01 2008

Pedro Ferreira <ask me.pt> writes:

Janice Caron escreveu:
 Hi all,
 
 More than one person has complained about the lack of string functions
 in Phobos which operate on mutable chars. In the thread titled "Is all
 this Invariant ****....", I suggested creating a new module,
 std.stringbuffer, to contain two things:
 
 (1) a StringBuffer class
 (2) parallel mutable versions of the functions in std.string.
 
 Walter OKed the idea, so it looks like that's a go. To that end, I've
 looked through the functions in std.string and sorted them into
 different groups. I think it's important to get the API right so
 comments are welcome on all of the below:

(snip)

I agree with this and will welcome the module. I've had to do some ugly 
.idup and .dup calls around a compiler I coded, to accommodate various 
Phobos functions (such as writeLine from OutputStream).
I'd like to suggest, though, the use of template code:

T[] split(T)(in data)

and perform a static if inside. It'd save the hassle of maintaining two 
separate modules, which are bound to end up with different functions some 
day. For example, say that a function is added to std.string and not to 
std.stringbuffer.
Also, it would be easier to maintain documentation consistency.
On an extra note, the ASCII and UTF variants could be taken care of in a single 
function.
That would require a lot of work though. Well, should you require 
assistance, gimme a shout.


Cheers

May 02 2008
