digitalmars.D.learn - D-ish way to work with strings?

Robert M. Münch (12/12) Dec 22 2019 I want to do all the basics mutating things with strings: append,

Robert M. Münch (8/8) Dec 22 2019 Want to add I'm talking about unicode strings.

H. S. Teoh (25/31) Dec 23 2019 [...]

Robert M. Münch (17/37) Dec 27 2019 I know. My point was that with UTF-8 code-points (not being a

H. S. Teoh (12/20) Dec 27 2019 [...]

Steven Schveighoffer (16/26) Dec 22 2019 switch to using char[].

Robert M. Münch (9/21) Dec 27 2019 My "strings" change a lot, so not really a good fit to use string.

Robert M. Münch <robert.muench saphirion.com> writes:

I want to do all the basic mutating operations on strings: append,
insert, replace.

What is the D-ish way to do that, since string is aliased to immutable(char)[]?

Using arrays and the ~ operator, always copying, changing, and combining my
strings into new ones? Does it make sense to think about reducing GC
pressure?

I'm a bit lost among the possibilities and can't find any "that's the way
to do it".

-- 
Robert M. Münch
http://www.saphirion.com
smarter | better | faster

Dec 22 2019

Robert M. Münch <robert.muench saphirion.com> writes:

I want to add that I'm talking about Unicode strings.

Wouldn't it make sense to handle everything as UTF-32 so that iteration 
is simple because code-point = code-unit?

And later on, convert to UTF-16 or UTF-8 on demand?
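For reference, a minimal sketch of what such a round trip could look like, assuming std.conv.to is used for the transcoding (the sample text and variable names are only for illustration):

```d
import std.conv : to;
import std.stdio : writeln;

void main()
{
    // "Grüße" written with escapes so the source encoding does not matter.
    string utf8 = "Gr\u00FC\u00DFe";   // UTF-8: U+00FC and U+00DF take two code units each
    dstring utf32 = utf8.to!dstring;   // UTF-32: one code unit per code point

    writeln(utf8.length);   // 7 (UTF-8 code units)
    writeln(utf32.length);  // 5 (code points)

    // Indexing a dstring is indexing by code point.
    dchar third = utf32[2];
    assert(third == '\u00FC');

    // Convert to another encoding only when it is needed.
    wstring utf16 = utf32.to!wstring;
    string back = utf32.to!string;
    writeln(utf16.length);  // 5 (no surrogate pairs in this text)
    assert(back == utf8);
}
```

Note that this only makes code point = code unit; it says nothing about what a user perceives as one character.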


Dec 22 2019

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Sun, Dec 22, 2019 at 06:27:03PM +0100, Robert M. Münch via
Digitalmars-d-learn wrote:
> I want to add that I'm talking about Unicode strings.
>
> Wouldn't it make sense to handle everything as UTF-32 so that
> iteration is simple because code-point = code-unit?
>
> And later on, convert to UTF-16 or UTF-8 on demand?

[...]

Be careful that code point != "character" the way most people understand
the word "character".  The word you're looking for is "grapheme".
Which, unfortunately, is rather complex and very slow to handle in
Unicode. See std.uni.byGrapheme.
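A small sketch of the difference, assuming a string made of "e" followed by a combining acute accent (the sample text is an illustration, not from the thread):

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: rendered as a single "é".
    string s = "e\u0301";

    writeln(s.length);                 // 3: UTF-8 code units
    writeln(s.walkLength);             // 2: code points (string ranges decode to dchar)
    writeln(s.byGrapheme.walkLength);  // 1: graphemes, i.e. what a user sees
}
```

Converting this string to UTF-32 would still give two elements for what the user sees as one character.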

You generally want to just stick with UTF-8 (usually) or UTF-16 (for
Windows and Java interop). UTF-32 wastes a lot of space, and *still*
doesn't give you what you think you want, and Grapheme[] is just dog
slow because of the amount of decoding/recoding needed to manipulate it.

What are you planning to do with your strings?  IME, using ~
occasionally doesn't add *too* much GC pressure, and slicing is usually
the idiomatic way of working with strings in D (it can result in faster
code than C because you don't have to keep strcpy()'d stuff all over the
place).  If you're appending to strings a LOT, you might want to consider
using std.array.appender in your inner loops to alleviate some of the
cost of using ~ too much.  Or use lazy evaluation and ranges to defer
actually constructing the string until the end when it's ready to be
stored.
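As a rough sketch of the appender suggestion (the function name joinWords and the sample words are hypothetical, chosen only for illustration):

```d
import std.array : appender;
import std.stdio : writeln;

/// Builds one string from many pieces through a single growing buffer,
/// instead of creating a new array with ~ on every iteration.
string joinWords(string[] words)
{
    auto buf = appender!string();
    foreach (i, w; words)
    {
        if (i > 0)
            buf.put(", ");
        buf.put(w);
    }
    return buf.data;
}

void main()
{
    writeln(joinWords(["append", "insert", "replace"])); // append, insert, replace
}
```

Whether this is measurably faster than plain ~ depends on how hot the loop is, so it is worth profiling before switching.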

Still, this all depends on what you're trying to do with your strings.
Elaborate a bit more about your use case, and we might be able to give
better advice.


T

-- 
Nobody is perfect.  I am Nobody. -- pepoluan, GKC forum

Dec 23 2019

Robert M. Münch <robert.muench saphirion.com> writes:

On 2019-12-23 15:05:20 +0000, H. S. Teoh said:

> On Sun, Dec 22, 2019 at 06:27:03PM +0100, Robert M. Münch via
> Digitalmars-d-learn wrote:
>> I want to add that I'm talking about Unicode strings.
>>
>> Wouldn't it make sense to handle everything as UTF-32 so that
>> iteration is simple because code-point = code-unit?
>>
>> And later on, convert to UTF-16 or UTF-8 on demand?
>
> [...]
>
> Be careful that code point != "character" the way most people understand
> the word "character".

I know. My point was that with UTF-8, code points (which are not
characters) have different sizes, which you need to take into account if
you want to iterate by code point.
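To make the variable width concrete, here is a minimal sketch that walks a UTF-8 string by code point. The helper name codePointOffsets and the sample text are hypothetical:

```d
import std.stdio : writefln;
import std.utf : decode;

/// Byte offsets at which each code point of a UTF-8 string starts.
size_t[] codePointOffsets(string s)
{
    size_t[] offsets;
    // With an explicit dchar loop variable, foreach decodes UTF-8 on the fly;
    // the index is the offset of the code point's first code unit.
    foreach (i, dchar c; s)
        offsets ~= i;
    return offsets;
}

void main()
{
    // 'a' (1 code unit), U+00FC (2 code units), U+20AC (3 code units)
    string s = "a\u00FC\u20AC";

    // The same walk done by hand: std.utf.decode advances pos past the
    // code point it has just decoded.
    size_t pos = 0;
    while (pos < s.length)
    {
        immutable start = pos;
        dchar c = decode(s, pos);
        writefln("U+%04X at offset %s, %s code unit(s)",
                 cast(uint) c, start, pos - start);
    }
}
```

So iterating by code point over UTF-8 is a solved problem in the language and in std.utf; the variable width mostly matters for random access by index.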

> The word you're looking for is "grapheme". Which, unfortunately, is
> rather complex and very slow to handle in
> Unicode. See std.uni.byGrapheme.

Yes, that's when we come to "characters". And a "grapheme" can consist
of several code-points. Is grapheme handling just slow in D or in
general? If it's the latter, well, then that's just how it is.

> You generally want to just stick with UTF-8 (usually) or UTF-16 (for
> Windows and Java interop). UTF-32 wastes a lot of space, and *still*
> doesn't give you what you think you want, and Grapheme[] is just dog
> slow because of the amount of decoding/recoding needed to manipulate it.

I need to handle graphemes when things are going to be rendered and edited.

> What are you planning to do with your strings?

Pretty simple: Have user-editable content that is rendered using
different fonts supporting Unicode.

So, all editing functions: insert, replace, delete at all locations in
the string supporting all Unicode characters.

Best regards.


Dec 27 2019

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Fri, Dec 27, 2019 at 01:23:57PM +0100, Robert M. Münch via
Digitalmars-d-learn wrote:
> On 2019-12-23 15:05:20 +0000, H. S. Teoh said:

[...]
>> What are you planning to do with your strings?
>
> Pretty simple: Have user-editable content that is rendered using
> different fonts supporting Unicode.
>
> So, all editing functions: insert, replace, delete at all locations in
> the string supporting all Unicode characters.

[...]

Ah, I see.  In that case you might want to consider using graphemes by
default, since that's what most closely corresponds to how the user will
perceive a "character".  For processing outside of editing, though, you
might want to consider converting to some other representation for
manipulation, since graphemes are slow (the decoding process is complex,
and we can't work around that because that's what Unicode requires).
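One possible shape for this, sketched under the assumption that the text is kept as UTF-8 in a char[] and grapheme positions are translated to code-unit offsets only at edit time. The helpers graphemeOffset, insertAt and deleteAt are hypothetical names, not Phobos functions:

```d
import std.array : insertInPlace;
import std.uni : graphemeStride;

/// Code-unit offset at which the n-th grapheme starts
/// (text.length if there are fewer than n graphemes).
size_t graphemeOffset(const(char)[] text, size_t n)
{
    size_t offset = 0;
    while (n > 0 && offset < text.length)
    {
        offset += graphemeStride(text, offset);
        --n;
    }
    return offset;
}

/// Inserts `what` in front of the grapheme with the given index.
void insertAt(ref char[] text, size_t graphemeIndex, const(char)[] what)
{
    text.insertInPlace(graphemeOffset(text, graphemeIndex), what);
}

/// Removes one whole grapheme, however many code units it spans.
void deleteAt(ref char[] text, size_t graphemeIndex)
{
    immutable start = graphemeOffset(text, graphemeIndex);
    if (start >= text.length)
        return;
    immutable end = start + graphemeStride(text, start);
    text = text[0 .. start] ~ text[end .. $];
}

void main()
{
    char[] text = "e\u0301x".dup;  // "é" built from two code points, then 'x'
    insertAt(text, 1, "a");        // lands after the whole "é", not inside it
    assert(text == "e\u0301ax");
    deleteAt(text, 0);             // removes 'e' and its combining accent together
    assert(text == "ax");
}
```

Each edit rescans from the start of the buffer, so for long documents a real editor would likely cache offsets or use a different data structure; this only illustrates the grapheme-to-offset mapping.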


T

-- 
Windows: the ultimate triumph of marketing over technology. -- Adrian von Bidder

Dec 27 2019

Steven Schveighoffer <schveiguy gmail.com> writes:

On 12/22/19 9:15 AM, Robert M. Münch wrote:
> I want to do all the basic mutating operations on strings: append,
> insert, replace.
>
> What is the D-ish way to do that, since string is aliased
> to immutable(char)[]?

Switch to using char[].

Unfortunately, there's a lot of code out there that accepts string 
instead of const(char)[], which is more usable.
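A short sketch of why const(char)[] is the more usable parameter type (countSpaces is a hypothetical function used only for illustration):

```d
import std.stdio : writeln;

/// Only reads its argument, so it asks for const(char)[]:
/// both string and char[] convert to that implicitly.
size_t countSpaces(const(char)[] text)
{
    size_t n = 0;
    foreach (char c; text)
        if (c == ' ')
            ++n;
    return n;
}

void main()
{
    string literal = "immutable text here";
    char[] buffer = "mutable text here".dup;

    writeln(countSpaces(literal)); // 2
    writeln(countSpaces(buffer));  // 2
}
```

Had countSpaces taken string, the second call would not compile without an idup copy.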

I think many people don't realize the purpose of the string type. It's 
meant to be something that is heap-allocated (or as a global), and NEVER 
goes out of scope. Many things are shoehorned into string which 
shouldn't be.

> Using arrays and the ~ operator, always copying, changing, and combining my
> strings into new ones? Does it make sense to think about reducing GC
> pressure?

It really depends on your use cases. Strings are great precisely because
they don't change. Slicing makes huge sense there.
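To illustrate the slicing point, a minimal sketch (valueOf and the "key=value" input are hypothetical):

```d
import std.stdio : writeln;
import std.string : indexOf;

/// Returns the part after the first '=' as a slice of the input.
/// No characters are copied; the result points into `line`.
string valueOf(string line)
{
    immutable eq = line.indexOf('=');
    return eq < 0 ? null : line[eq + 1 .. $];
}

void main()
{
    string line = "key=value";
    string value = valueOf(line);

    writeln(value); // value
    // The slice shares memory with the original string.
    assert(value.ptr is line.ptr + 4);
}
```

Because the characters are immutable, handing out such slices is safe: nobody holding `line` can change what `value` sees.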

> I'm a bit lost among the possibilities and can't find any "that's the way
> to do it".

Again, use char[] if you are going to be rearranging strings. And you 
have to take care not to cheat and cast to string. Always use idup if 
you need one.
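A minimal sketch of that workflow, assuming std.array's insertInPlace and replace are used for the edits (the sample text is only for illustration):

```d
import std.array : insertInPlace, replace;
import std.stdio : writeln;

void main()
{
    char[] buf = "Hello world".dup;   // mutable copy of a literal

    buf ~= "!";                       // append
    buf.insertInPlace(5, ",");        // insert at a code-unit offset
    buf[0] = 'h';                     // overwrite a single code unit
    buf = buf.replace("world", "D");  // replace returns a new array

    // idup makes an independent immutable copy. A cast(string) would
    // alias the mutable buffer and break the immutability guarantee.
    string frozen = buf.idup;
    buf[0] = 'H';                     // does not affect frozen

    writeln(frozen); // hello, D!
    writeln(buf);    // Hello, D!
}
```

The offsets here are code-unit offsets, which is fine for ASCII; for arbitrary Unicode text they have to fall on code point (or grapheme) boundaries.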

If you find Phobos functions that unnecessarily take string instead of
const(char)[], please post to Bugzilla.

-Steve

Dec 22 2019

Robert M. Münch <robert.muench saphirion.com> writes:

On 2019-12-22 18:45:52 +0000, Steven Schveighoffer said:

> Switch to using char[]. Unfortunately, there's a lot of code out there
> that accepts string instead of const(char)[], which is more usable. I
> think many people don't realize the purpose of the string type. It's
> meant to be something that is heap-allocated (or as a global), and
> NEVER goes out of scope.

Hi Steve, thanks for the feedback. Makes sense to me.

> It really depends on your use cases. Strings are great precisely
> because they don't change. Slicing makes huge sense there.

My "strings" change a lot, so string is not really a good fit.

> Again, use char[] if you are going to be rearranging strings. And you
> have to take care not to cheat and cast to string. Always use idup if
> you need one.

Will do.

> If you find Phobos functions that unnecessarily take string instead of
> const(char)[], please post to Bugzilla.

Ok, will keep an eye on it.


Dec 27 2019
