digitalmars.D.learn - iteration over a string
- Timothee Cour (69/69) May 28 2013 Questions regarding iteration over code points of a utf8 string:
- Ali Çehreli (20/23) May 28 2013 Yes, the whole situation is a little messy. :)
- Ali Çehreli (4/6) May 28 2013 Rather:
- Diggory (17/17) May 28 2013 Most algorithms for strings need the offset rather than the
Questions regarding iteration over code points of a utf8 string: In all that follows, I don't want to go through intermediate UTF32 representation by making a copy of my string, but I want to iterate over its code points. say my string is declared as: string a=3D"=CE=A9abc"; //if email reader screws this up, it's a 'Omega' fo= llowed by abc A) this doesn't work obviously: foreach(i,ai; a){ write(i,",",ai," "); } //prints 0,=EF=BF=BD 1,=EF=BF=BD 2,a 3,b 4,c (ie decomposes at the 'char' l= evel, so 5 elements) B) foreach(i,dchar ai;a){ write(i,",",ai," "); } // prints 0,=CE=A9 2,a 3,b 4,c (ie decomposes at code points, so 4 elements= ) But index i skips position 1, indicating the start index of code points; it prints [0,2,3,4] Is that a bug or a feature? C) writeln(a.walkLength); // prints 4 for(size_t i;!a.empty;a.popFront,i++) write(i,",",a.front," "); // prints 0,=CE=A9 1,a 2,b 3,c This seems the most correct for interpreting a string as a range over code points, where index i has positions [0,1,2,3] Is there a more idiomatic way? D) How to make the standard algorithms (std.map, etc) work well with the iteration over code points as in method C above ? For example this one is very confusing for me: string a=3D"=CE=A9=CE=A9ab"; auto b1=3Da.map!(a=3D>"<"d~a~">"d).array; writeln(b1.length);//6 writeln(b1);//["<=CE=A9>", "<=CE=A9>", "<a>", "<b>", "", ""] Why are there 2 empty strings at the end? (one per Omega if you vary the number of such symbols in the string). 
E) The fact that there are 2 ways to iterate over strings is confusing. For example, reading the docs, ForeachType is different from ElementType, and ElementType is special-cased for narrow strings;

    foreach(i, ai; a){foo(i, ai);}

doesn't behave like

    for(size_t i; !a.empty; a.popFront, i++) {foo(i, a.front);}

and walkLength != length for strings.

F) Why can't we have the following design instead:

* no special case with isNarrowString scattered throughout Phobos
* iteration with foreach behaves like iteration with popFront/empty/front, and walkLength == length
* ForeachType == ElementType (i.e. one is redundant)
* require *explicit user syntax* to construct a range over code points from a string:

    struct CodepointRange{
      this(string a){...}
      auto popFront(){}
      auto empty(){}
      auto length(){}
    }

Now the user can do:

    a.map!foo                 => will iterate over char
    a.CodepointRange.map!foo  => will iterate over code points

Everything seems more orthogonal that way, and the user has a clear understanding of the complexity of each operation.
May 28 2013
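The explicit opt-in proposed in point F can be approximated with std.utf.byCodeUnit and std.utf.byDchar, assuming a Phobos release that ships them (both postdate this thread). The following is only a sketch of that idea, not the design the post asks for; the helper names countCodeUnits and countCodePoints are hypothetical and exist purely for illustration.

```d
import std.range : walkLength;
import std.utf : byCodeUnit, byDchar;

/// Hypothetical helper: counts UTF-8 code units, with no decoding involved.
size_t countCodeUnits(string s)
{
    return s.byCodeUnit.walkLength;
}

/// Hypothetical helper: counts code points by decoding lazily, without
/// building an intermediate dstring copy.
size_t countCodePoints(string s)
{
    return s.byDchar.walkLength;
}

void main()
{
    string a = "Ωabc";                 // Omega occupies two UTF-8 code units
    assert(a.length == 5);             // length counts code units
    assert(countCodeUnits(a) == 5);    // explicit iteration over char
    assert(countCodePoints(a) == 4);   // explicit iteration over code points
}
```

With these wrappers the choice between char-level and code-point-level iteration is visible at the call site, which is the orthogonality point F argues for.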
On 05/28/2013 12:26 AM, Timothee Cour wrote:
> In all that follows, I don't want to go through intermediate UTF32
> representation by making a copy of my string, but I want to iterate
> over its code points.

Yes, the whole situation is a little messy. :)

There is also std.range.stride:

    foreach (ai; a.stride(1)) {
        // ...
    }

If you need the index as well, and do not want to manage it explicitly, one way is to use zip and sequence:

    import std.stdio;
    import std.range;

    void main()
    {
        string a = "Ωabc";
        foreach (i, ai; zip(sequence!"n", a.stride(1))) {
            write(i, ",", ai, " ");
        }
    }

The output:

    Ω a b c

Ali
May 28 2013
On 05/28/2013 12:42 AM, Ali Çehreli wrote:
> The output:
>
>     Ω a b c

Rather:

    0,Ω 1,a 2,b 3,c

Ali
May 28 2013
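The zip/sequence pairing above can also be written with std.range.enumerate, assuming a Phobos version that provides it (it was not available when this thread was written). This sketch relies on the string being iterated as a range of decoded code points; the function name indexedCodePoints is hypothetical.

```d
import std.conv : text;
import std.range : enumerate;

/// Hypothetical helper: formats each code point together with its
/// code point index (not its byte offset).
string indexedCodePoints(string s)
{
    string result;
    // enumerate pairs every decoded element with a running counter,
    // so i counts code points: 0, 1, 2, ...
    foreach (i, c; s.enumerate)
        result ~= text(i, ",", c, " ");
    return result;
}

void main()
{
    assert(indexedCodePoints("Ωabc") == "0,Ω 1,a 2,b 3,c ");
}
```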
Most algorithms for strings need the offset rather than the character index, so:

    foreach (i, dchar c; str)

gives the offset into the string for "i".

If you really need the character index, just count it:

    int charIndex = 0;
    foreach (dchar c; str) {
        // ...
        ++charIndex;
    }

If strings were treated specially so that they looked like arrays of dchars but used UTF-8 internally, it would hide all sorts of performance costs. Random access into a UTF-8 string by character index is O(n), whereas indexing by offset is O(1). If you use random access by character index heavily, you should therefore convert to a dstring first; then you get O(1) random access.
May 28 2013
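The cost difference described above can be made concrete. The sketch below uses std.utf.decode to reach a character index by decoding from the start of the string, and compares that with direct indexing into a dstring; the function name codePointAt is hypothetical, and the code assumes the index is within range.

```d
import std.conv : to;
import std.utf : decode;

/// Hypothetical helper: returns the code point at character index n.
/// It has to decode every preceding code point, so it runs in O(n).
dchar codePointAt(string s, size_t n)
{
    size_t offset = 0;
    foreach (_; 0 .. n)
        decode(s, offset);     // advances offset past one code point
    return decode(s, offset);  // decodes the code point at the final offset
}

void main()
{
    string s = "Ωabc";
    assert(codePointAt(s, 0) == 'Ω');
    assert(codePointAt(s, 1) == 'a');  // character index 1 ...
    assert(s[2] == 'a');               // ... lives at byte offset 2, an O(1) lookup

    dstring d = s.to!dstring;          // one-time O(n) conversion and copy
    assert(d[1] == 'a');               // afterwards, O(1) access by character index
}
```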