digitalmars.D.learn - D-ish way to work with strings?
- Robert M. Münch (12/12) Dec 22 2019 I want to do all the basics mutating things with strings: append,
- Robert M. Münch (8/8) Dec 22 2019 Want to add I'm talking about unicode strings.
- H. S. Teoh (25/31) Dec 23 2019 [...]
- Robert M. Münch (17/37) Dec 27 2019 I know. My point was that with UTF-8 code-points (not being a
- H. S. Teoh (12/20) Dec 27 2019 [...]
- Steven Schveighoffer (16/26) Dec 22 2019 switch to using char[].
- Robert M. Münch (9/21) Dec 27 2019 My "strings" change a lot, so not really a good fit to use string.
I want to do all the basic mutating things with strings: append, insert, replace.

What is the D-ish way to do that, since string is aliased to immutable(char)[]?

Using arrays and the ~ operator, always copying, changing, and combining my strings into a new one? Does it make sense to think about reducing GC pressure?

I'm a bit lost in the possibilities and can't find any "that's the way to do it".

-- Robert M. Münch http://www.saphirion.com smarter | better | faster
Dec 22 2019
I want to add that I'm talking about Unicode strings.

Wouldn't it make sense to handle everything as UTF-32, so that iteration is simple because code-point = code-unit? And later on, convert to UTF-16 or UTF-8 on demand?

-- Robert M. Münch http://www.saphirion.com smarter | better | faster
Dec 22 2019
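A small D sketch may make the question concrete. It assumes a UTF-8 source file, and the helper name `codePoints` is made up for illustration. It shows that `.length` counts code units, so only the UTF-32 `dstring` has exactly one code unit per code point.

```d
import std.conv : to;
import std.range : walkLength;

/// Hypothetical helper: number of code points in a UTF-8 string.
/// Range iteration over char arrays decodes to dchar, so this walks
/// the whole string instead of reading a stored length.
size_t codePoints(const(char)[] s)
{
    return s.walkLength;
}

void main()
{
    string  s8  = "Grüße";        // UTF-8:  immutable(char)[]
    wstring s16 = s8.to!wstring;  // UTF-16: immutable(wchar)[]
    dstring s32 = s8.to!dstring;  // UTF-32: immutable(dchar)[]

    // .length counts code units, which differ per encoding.
    assert(s8.length == 7);   // 'ü' and 'ß' take two bytes each
    assert(s16.length == 5);
    assert(s32.length == 5);  // one code unit per code point

    // Counting code points in UTF-8 requires decoding.
    assert(codePoints(s8) == 5);

    // Converting back on demand gives the original UTF-8 text.
    assert(s32.to!string == s8);
}
```

Indexing `s32[i]` yields the i-th code point directly, at the cost of four bytes per code point.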
On Sun, Dec 22, 2019 at 06:27:03PM +0100, Robert M. Münch via Digitalmars-d-learn wrote:
> Want to add I'm talking about unicode strings. Wouldn't it make sense to handle everything as UTF-32 so that iteration is simple because code-point = code-unit? And later on, convert to UTF-16 or UTF-8 on demand?
[...]

Be careful that code point != "character" the way most people understand the word "character". The word you're looking for is "grapheme". Which, unfortunately, is rather complex and very slow to handle in Unicode. See std.uni.byGrapheme.

Usually you want to just stick with UTF-8, or with UTF-16 for Windows and Java interop. UTF-32 wastes a lot of space, and *still* doesn't give you what you think you want, and Grapheme[] is just dog slow because of the amount of decoding/recoding needed to manipulate it.

What are you planning to do with your strings? IME, using ~ occasionally doesn't add *too* much GC pressure, and slicing is usually the idiomatic way of working with strings in D (it can result in faster code than C because you don't have to keep strcpy()'d stuff all over the place). If you're appending strings a LOT, you might want to consider using std.array.appender in your inner loops to alleviate some of the cost of using ~ too much. Or use lazy evaluation and ranges to defer actually constructing the string until the end, when it's ready to be stored.

Still, this all depends on what you're trying to do with your strings. Elaborate a bit more about your use case, and we might be able to give better advice.

T

-- Nobody is perfect. I am Nobody. -- pepoluan, GKC forum
Dec 23 2019
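The slicing and appender advice above can be sketched roughly as follows. The function name `joinWords` and the sample data are invented for illustration; `std.array.appender` is the Phobos facility named in the post.

```d
import std.array : appender;

/// Hypothetical helper: joins words with a separator using one growing
/// buffer instead of allocating a new string for every ~ operation.
string joinWords(string[] words, string sep)
{
    auto buf = appender!string();
    foreach (i, w; words)
    {
        if (i > 0)
            buf.put(sep);
        buf.put(w);
    }
    return buf.data;
}

void main()
{
    string line = "key=value";

    // Slices are views into the same memory; nothing is copied.
    string key = line[0 .. 3];
    string value = line[4 .. $];
    assert(key == "key" && value == "value");
    assert(key.ptr is line.ptr);

    assert(joinWords(["a", "b", "c"], ", ") == "a, b, c");
}
```

Whether the appender actually pays off depends on how often the loop runs; for occasional concatenation, plain ~ is usually fine, as the post says.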
On 2019-12-23 15:05:20 +0000, H. S. Teoh said:
> On Sun, Dec 22, 2019 at 06:27:03PM +0100, Robert M. Münch via Digitalmars-d-learn wrote:
>> Want to add I'm talking about unicode strings. Wouldn't it make sense to handle everything as UTF-32 so that iteration is simple because code-point = code-unit? And later on, convert to UTF-16 or UTF-8 on demand?
> [...]
> Be careful that code point != "character" the way most people understand the word "character".

I know. My point was that with UTF-8, code points (not being characters) have different sizes, which you need to take into account if you want to iterate by code point.

> The word you're looking for is "grapheme". Which, unfortunately, is rather complex and very slow to handle in Unicode. See std.uni.byGrapheme.

Yes, that's when we come to "characters". And a "grapheme" can consist of several code points. Is grapheme handling just slow in D, or in general? If it's the latter, well, then that's just how it is.

> Usually you want to just stick with UTF-8 (usually) or UTF-16 (for Windows and Java interop). UTF-32 wastes a lot of space, and *still* doesn't give you what you think you want, and Grapheme[] is just dog slow because of the amount of decoding/recoding needed to manipulate it.

I need to handle graphemes when things are going to be rendered and edited.

> What are you planning to do with your strings?

Pretty simple: have user-editable content that is rendered using different fonts supporting Unicode. So, all editing functions: insert, replace, delete at all locations in the string, supporting all Unicode characters.

Best regards.

-- Robert M. Münch http://www.saphirion.com smarter | better | faster
Dec 27 2019
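One possible way to approach such editing functions, sketched here under assumptions: the text stays in UTF-8, and positions are given as grapheme indices that are translated to byte offsets with `std.uni.graphemeStride`. The helper names `graphemeOffset`, `insertAtGrapheme`, and `removeGrapheme` are hypothetical, not part of Phobos.

```d
import std.range : walkLength;
import std.uni : byGrapheme, graphemeStride;

/// Byte offset at which the n-th grapheme starts
/// (text.length if n is past the end).
size_t graphemeOffset(const(char)[] text, size_t n)
{
    size_t offset = 0;
    while (n > 0 && offset < text.length)
    {
        offset += graphemeStride(text, offset);
        --n;
    }
    return offset;
}

/// Returns a new buffer with `piece` inserted before the n-th grapheme.
char[] insertAtGrapheme(const(char)[] text, size_t n, const(char)[] piece)
{
    immutable off = graphemeOffset(text, n);
    char[] result;
    result.reserve(text.length + piece.length);
    result ~= text[0 .. off];
    result ~= piece;
    result ~= text[off .. $];
    return result;
}

/// Returns a new buffer with the n-th grapheme removed.
char[] removeGrapheme(const(char)[] text, size_t n)
{
    immutable start = graphemeOffset(text, n);
    if (start >= text.length)
        return text.dup;
    immutable len = graphemeStride(text, start);
    char[] result;
    result ~= text[0 .. start];
    result ~= text[start + len .. $];
    return result;
}

void main()
{
    // 'e' + combining acute accent + 'a': 4 bytes, 3 code points, 2 graphemes.
    const(char)[] text = "e\u0301a";
    assert(text.length == 4);
    assert(text.byGrapheme.walkLength == 2);

    assert(insertAtGrapheme(text, 1, "X") == "e\u0301Xa");
    assert(removeGrapheme(text, 0) == "a");
}
```

Every lookup walks the text from the start, so a real editor would likely cache offsets or use a different data structure; this only illustrates keeping the accent attached to its base letter.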
On Fri, Dec 27, 2019 at 01:23:57PM +0100, Robert M. Münch via Digitalmars-d-learn wrote:
> On 2019-12-23 15:05:20 +0000, H. S. Teoh said:
[...]
>> What are you planning to do with your strings?
> Pretty simple: Have user editable content that is rendered using different fonts supporting unicode. So, all editing functions: insert, replace, delete at all locations in the string supporting all unicode characters.
[...]

Ah, I see. In that case you might want to consider using graphemes by default, since that's what most closely corresponds to how the user will perceive a "character". For processing outside of editing, though, you might want to consider converting to some other representation for manipulation, since graphemes are slow (the decoding process is complex, and we can't work around that because that's what Unicode requires).

T

-- Windows: the ultimate triumph of marketing over technology. -- Adrian von Bidder
Dec 27 2019
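As a rough illustration of using graphemes as the editing unit and converting back for other processing, the following sketch decodes once into a `Grapheme[]`, edits by index, and re-encodes to UTF-8 with `byCodePoint`. The helper names `toGraphemes`, `fromGraphemes`, and `removeAt` are assumptions made for this example.

```d
import std.array : array;
import std.conv : text;
import std.uni : byCodePoint, byGrapheme, Grapheme;

/// Decode UTF-8 text into an array with one element per grapheme.
Grapheme[] toGraphemes(const(char)[] s)
{
    return s.byGrapheme.array;
}

/// Re-encode a grapheme array as a UTF-8 string.
string fromGraphemes(Grapheme[] g)
{
    return g.byCodePoint.text;
}

/// Returns a new array without the grapheme at index i.
Grapheme[] removeAt(Grapheme[] g, size_t i)
{
    return g[0 .. i] ~ g[i + 1 .. $];
}

void main()
{
    // "noël" written with a combining diaeresis: 5 code points, 4 graphemes.
    auto g = toGraphemes("noe\u0308l");
    assert(g.length == 4);

    // Deleting index 2 removes the 'e' together with its diaeresis.
    g = removeAt(g, 2);
    assert(fromGraphemes(g) == "nol");
}
```

The decode and re-encode steps are where the cost mentioned above comes from, so it seems sensible to do them at the edges of the editing code rather than on every operation.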
On 12/22/19 9:15 AM, Robert M. Münch wrote:
> I want to do all the basics mutating things with strings: append, insert, replace What is the D-ish way to do that since string is aliased to immutable(char)[]?

Switch to using char[]. Unfortunately, there's a lot of code out there that accepts string instead of const(char)[], which is more usable. I think many people don't realize the purpose of the string type. It's meant to be something that is heap-allocated (or as a global), and NEVER goes out of scope. Many things are shoehorned into string which shouldn't be.

> Using arrays, using ~ operator, always copying, changing, combining my strings into a new one? Does it make sense to think about reducing GC pressure?

It really depends on your use cases. Strings are great precisely because they don't change. Slicing makes huge sense there.

> I'm a bit lost in the possibilities and don't find any "that's the way to do it".

Again, use char[] if you are going to be rearranging strings. And you have to take care not to cheat and cast to string. Always use idup if you need one.

If you find Phobos functions that unnecessarily take string instead of const(char)[], please post to bugzilla.

-Steve
Dec 22 2019
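A minimal sketch of that advice, with invented helper names (`countSpaces`, `snapshot`): mutate a `char[]`, accept `const(char)[]` in functions that only read, and call `.idup` when an immutable `string` is needed.

```d
/// Read-only parameter: accepts string, char[] and const(char)[] alike,
/// because both mutable and immutable data convert implicitly to const.
size_t countSpaces(const(char)[] text)
{
    size_t n;
    foreach (char c; text)
        if (c == ' ')
            ++n;
    return n;
}

/// Takes an immutable copy of the buffer instead of casting it.
string snapshot(const(char)[] buf)
{
    return buf.idup;
}

void main()
{
    char[] buf = "hello world".dup;  // mutable copy of a literal
    buf[0] = 'H';                    // in-place edit
    buf ~= "!";                      // append

    assert(countSpaces(buf) == 1);      // char[] is accepted
    assert(countSpaces("a b c") == 2);  // so is string

    string frozen = snapshot(buf);
    buf[0] = 'J';                    // later edits do not affect the copy
    assert(frozen == "Hello world!");
    assert(buf == "Jello world!");
}
```

Casting `buf` to `string` would compile, but later writes through `buf` would then change data the type system treats as immutable, which is why the copy is the safe route.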
On 2019-12-22 18:45:52 +0000, Steven Schveighoffer said:
> switch to using char[]. Unfortunately, there's a lot of code out there that accepts string instead of const(char)[], which is more usable. I think many people don't realize the purpose of the string type. It's meant to be something that is heap-allocated (or as a global), and NEVER goes out of scope.

Hi Steve, thanks for the feedback. Makes sense to me.

> It really depends on your use cases. strings are great precisely because they don't change. slicing makes huge sense there.

My "strings" change a lot, so string is not really a good fit.

> Again, use char[] if you are going to be rearranging strings. And you have to take care not to cheat and cast to string. Always use idup if you need one.

Will do.

> If you find Phobos functions that unnecessarily take string instead of const(char)[] please post to bugzilla.

OK, will keep an eye on it.

-- Robert M. Münch http://www.saphirion.com smarter | better | faster
Dec 27 2019