
digitalmars.D.learn - Does something like std.algorithm.iteration:splitter with multiple

reply ParticlePeter <ParticlePeter gmx.de> writes:
I need to parse an ascii with multiple tokens. The tokens can be 
seen as keys. After every token there is a bunch of lines 
belonging to that token, the values.
The order of tokens is unknown.

I would like to read the file in as a whole string, and split the 
string with:
splitter(fileString, [token1, token2, ... tokenN]);

And would like to get a range of strings each starting with 
tokenX and ending before the next token.

Does something like this exist?

I know how to parse the string line by line and create new 
strings and append the appropriate lines, but I don't know how to 
do this with a lazy result range and new allocations.
Mar 23 2016
next sibling parent reply ParticlePeter <ParticlePeter gmx.de> writes:
On Wednesday, 23 March 2016 at 11:57:49 UTC, ParticlePeter wrote:

Stupid typos:
 I need to parse an ascii
file
 with multiple tokens. ...
...
 to do this with a lazy result range and
without
 new allocations.
Mar 23 2016
parent reply Andrea Fontana <nospam example.com> writes:
On Wednesday, 23 March 2016 at 12:00:15 UTC, ParticlePeter wrote:
 On Wednesday, 23 March 2016 at 11:57:49 UTC, ParticlePeter 
 wrote:

 Stupid typos:
 I need to parse an ascii
file
 with multiple tokens. ...
...
 to do this with a lazy result range and
without
 new allocations.
Any input => output example?
Mar 23 2016
parent ParticlePeter <ParticlePeter gmx.de> writes:
On Wednesday, 23 March 2016 at 14:20:12 UTC, Andrea Fontana wrote:
 Any input => output example?
Sure, it is ensight gold case file format:

FORMAT
type: ensight gold
GEOMETRY
model: 1 exgold2.geo**
VARIABLE
scalar per node: 1 Stress exgold2.scl**
vector per node: 1 Displacement exgold2.dis**
TIME
time set: 1
number of steps: 3
filename start number: 0
filename increment: 1
time values: 1.0 2.0 3.0

The separators would be ["FORMAT", "TIME", "VARIABLE", "GEOMETRY"]. The blank lines between the blocks and the order of the separators in the file are not known. I would expect a range of four ranges of lines: one for each text block above.
Mar 23 2016
prev sibling next sibling parent reply Simen Kjaeraas <simen.kjaras gmail.com> writes:
On Wednesday, 23 March 2016 at 11:57:49 UTC, ParticlePeter wrote:
 I need to parse an ascii with multiple tokens. The tokens can 
 be seen as keys. After every token there is a bunch of lines 
 belonging to that token, the values.
 The order of tokens is unknown.

 I would like to read the file in as a whole string, and split 
 the string with:
 splitter(fileString, [token1, token2, ... tokenN]);

 And would like to get a range of strings each starting with 
 tokenX and ending before the next token.

 Does something like this exist?

 I know how to parse the string line by line and create new 
 strings and append the appropriate lines, but I don't know how 
 to do this with a lazy result range and new allocations.
Without a bit more detail, it's a bit hard to help.

std.algorithm.splitter has an overload that takes a function instead of a separator:

    import std.algorithm;
    auto a = "a,b;c";
    auto b = a.splitter!(e => e == ';' || e == ',');
    assert(equal(b, ["a", "b", "c"]));

However, not only are the separators lost in the process, it only allows single-element separators. This might be good enough given the information you've divulged, but I'll hazard a guess it isn't.

My next stop is std.algorithm.chunkBy:

    auto a = ["a","b","c", "d", "e"];
    auto b = a.chunkBy!(e => e == "a" || e == "d");
    auto result = [
        tuple(true, ["a"]), tuple(false, ["b", "c"]),
        tuple(true, ["d"]), tuple(false, ["e"])
        ];

No assert here, since the ranges in the tuples are not arrays. My immediate concern is that two consecutive tokens with no intervening values will mess it up. Also, the result looks a bit messy. A little more involved, and according to documentation not guaranteed to work:

    bool isToken(string s) {
        return s == "a" || s == "d";
    }

    bool tokenCounter(string s) {
        static string oldToken;
        static bool counter = true;
        if (s.isToken && s != oldToken) {
            oldToken = s;
            counter = !counter;
        }
        return counter;
    }

    unittest {
        import std.algorithm;
        import std.stdio;
        import std.typecons;
        import std.array;

        auto a = ["a","b","c", "d", "e", "a", "d"];
        auto b = a.chunkBy!tokenCounter.map!(e=>e[1]);
        auto result = [
            ["a", "b", "c"],
            ["d", "e"],
            ["a"],
            ["d"]
            ];
        writeln(b);
        writeln(result);
    }

Again no assert, but b and result have basically the same contents. Also handles consecutive tokens neatly (but consecutive identical tokens will be grouped together).

Hope this helps.

--
  Simen
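Adapting the tokenCounter idea to the ensight tokens from the example upthread might look like the sketch below. The token list and sample lines are assumptions based on that example, and the same caveat applies: chunkBy's documentation does not guarantee stateful predicates like this.

```d
import std.algorithm : canFind, chunkBy, map;
import std.stdio : writeln;

// assumed token list, taken from the ensight example upthread
immutable tokens = ["FORMAT", "GEOMETRY", "VARIABLE", "TIME"];

bool tokenCounter(string s) {
    // stateful predicate: flips its value whenever a new token line starts
    static string oldToken;
    static bool counter = true;
    if (tokens.canFind(s) && s != oldToken) {
        oldToken = s;
        counter = !counter;
    }
    return counter;
}

void main() {
    // made-up sample lines standing in for the file contents
    auto lines = ["FORMAT", "type: ensight gold",
                  "GEOMETRY", "model: 1 exgold2.geo**",
                  "TIME", "time set: 1"];
    // consecutive lines with the same counter value form one block,
    // so each block starts at a token line and runs until the next token
    auto blocks = lines.chunkBy!tokenCounter.map!(e => e[1]);
    foreach (b; blocks)
        writeln(b);
}
```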
Mar 23 2016
parent reply ParticlePeter <ParticlePeter gmx.de> writes:
On Wednesday, 23 March 2016 at 15:23:38 UTC, Simen Kjaeraas wrote:
 Without a bit more detail, it's a bit hard to help.

 std.algorithm.splitter has an overload that takes a function 
 instead of a separator:

     import std.algorithm;
     auto a = "a,b;c";
     auto b = a.splitter!(e => e == ';' || e == ',');
     assert(equal(b, ["a", "b", "c"]));

 However, not only are the separators lost in the process, it 
 only allows single-element separators. This might be good 
 enough given the information you've divulged, but I'll hazard a 
 guess it isn't.

 My next stop is std.algorithm.chunkBy:

     auto a = ["a","b","c", "d", "e"];
     auto b = a.chunkBy!(e => e == "a" || e == "d");
     auto result = [
         tuple(true, ["a"]), tuple(false, ["b", "c"]),
         tuple(true, ["d"]), tuple(false, ["e"])
         ];

 No assert here, since the ranges in the tuples are not arrays. 
 My immediate concern is that two consecutive tokens with no 
 intervening values will mess it up. Also, the result looks a 
 bit messy. A little more involved, and according to 
 documentation not guaranteed to work:

 bool isToken(string s) {
     return s == "a" || s == "d";
 }

 bool tokenCounter(string s) {
     static string oldToken;
     static bool counter = true;
     if (s.isToken && s != oldToken) {
         oldToken = s;
         counter = !counter;
     }
     return counter;
 }

 unittest {
     import std.algorithm;
     import std.stdio;
     import std.typecons;
     import std.array;

     auto a = ["a","b","c", "d", "e", "a", "d"];
     auto b = a.chunkBy!tokenCounter.map!(e=>e[1]);
     auto result = [
         ["a", "b", "c"],
         ["d", "e"],
         ["a"],
         ["d"]
         ];
     writeln(b);
     writeln(result);
 }

 Again no assert, but b and result have basically the same 
 contents. Also handles consecutive tokens neatly (but 
 consecutive identical tokens will be grouped together).

 Hope this helps.

 --
   Simen
Thanks Simen, your tokenCounter is inspirational, for the rest I'll take some time for testing.

But some additional thoughts from my side:
I get all the lines of the file into one range. Calling array on it should give me an array, but how would I use find to get an index into this array? With the indices I could slice up the array into four slices, no allocation required. If there is no easy way to just get an index instead of a range, I would try to use something like the tokenCounter to find all the indices.
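For the index question: std.algorithm.countUntil returns an index (or -1) instead of a range, so collecting token positions and slicing between them could be sketched like this (the line and token arrays are made-up stand-ins for the file contents):

```d
import std.algorithm : canFind, countUntil;

void main() {
    // made-up stand-in for the file's lines
    auto lines = ["FORMAT", "type: ensight gold", "GEOMETRY", "model: 1"];
    auto tokens = ["FORMAT", "GEOMETRY", "VARIABLE", "TIME"];

    // countUntil yields an index, unlike find, which yields a range
    assert(lines.countUntil("GEOMETRY") == 2);

    // collect the index of every token line ...
    size_t[] starts;
    foreach (i, line; lines)
        if (tokens.canFind(line))
            starts ~= i;
    assert(starts.length == 2 && starts[0] == 0 && starts[1] == 2);

    // ... then slice between consecutive indices: no string data is copied
    assert(lines[starts[0] .. starts[1]] == ["FORMAT", "type: ensight gold"]);
}
```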
Mar 23 2016
parent Simen Kjaeraas <simen.kjaras gmail.com> writes:
On Wednesday, 23 March 2016 at 18:10:05 UTC, ParticlePeter wrote:
 Thanks Simen,
 your tokenCounter is inspirational, for the rest I'll take some 
 time for testing.
My pleasure. :) Testing it on your example data shows it to work there. However, as stated above, the documentation says it's undefined, so future changes (even optimizations and bugfixes) to Phobos could make it stop working:

"This predicate must be an equivalence relation, that is, it must be reflexive (pred(x,x) is always true), symmetric (pred(x,y) == pred(y,x)), and transitive (pred(x,y) && pred(y,z) implies pred(x,z)). If this is not the case, the range returned by chunkBy may assert at runtime or behave erratically."
 But some additional thoughts from my side:
 I get all the lines of the file into one range. Calling array 
 on it should give me an array, but how would I use find to get 
 an index into this array?
 With the indices I could slice up the array into four slices, 
 no allocation required. If there is no easy way to just get an 
 index instead of a range, I would try to use something like 
 the tokenCounter to find all the indices.
The chunkBy example should not allocate. chunkBy itself is lazy, as are its sub-ranges. No copying of string contents is performed. So unless you have very specific reasons to use slicing, I don't see why chunkBy shouldn't be good enough.

Full disclosure: There is a malloc call in RefCounted, which is used for optimization purposes when chunkBy is called on a forward range. When chunkBy is called on an array, that's a 6-word allocation (24 bytes on 32-bit, 48 bytes on 64-bit), happening once. There are no other dependencies that allocate. Such is the beauty of D. :)

--
  Simen
Mar 23 2016
prev sibling parent reply wobbles <grogan.colin gmail.com> writes:
On Wednesday, 23 March 2016 at 11:57:49 UTC, ParticlePeter wrote:
 I need to parse an ascii with multiple tokens. The tokens can 
 be seen as keys. After every token there is a bunch of lines 
 belonging to that token, the values.
 The order of tokens is unknown.

 I would like to read the file in as a whole string, and split 
 the string with:
 splitter(fileString, [token1, token2, ... tokenN]);

 And would like to get a range of strings each starting with 
 tokenX and ending before the next token.

 Does something like this exist?

 I know how to parse the string line by line and create new 
 strings and append the appropriate lines, but I don't know how 
 to do this with a lazy result range and new allocations.
This isn't tested, but this is my first thought:

    void main(){
        string testString = "this:is:a-test;";
        foreach(str; testString.multiSlice([":","-",";"]))
            writefln("Got: %s", str);
    }

    auto multiSlice(string string, string[] delims){
        struct MultiSliceRange{
            string m_str;
            string[] m_delims;
            bool empty(){
                return m_str.length == 0;
            }
            void popFront(){
                auto idx = findNextIndex;
                m_str = m_str[idx..$];
                return;
            }
            string front(){
                auto idx = findNextIndex;
                return m_str[0..idx];
            }
            private long findNextIndex(){
                long foundIndex=-1;
                foreach(delim; m_delims){
                    if(m_str.canFind(delim)){
                        if(foundIndex == -1 || (m_str.indexOf(delim) < foundIndex && m_str.indexOf(delim) >= 0)){
                            foundIndex = m_str.indexOf(delim);
                        }
                    }
                }
                return foundIndex;
            }
        }
        return MultiSliceRange(string, delims);
    }

Again, totally untested, but I think logically it should work. ( No D compiler on this machine so it mightn't even compile :] )
Mar 23 2016
parent reply ParticlePeter <ParticlePeter gmx.de> writes:
On Wednesday, 23 March 2016 at 20:00:55 UTC, wobbles wrote:
 Again, totally untested, but I think logically it should work. 
 ( No D compiler on this machine so it mightn't even compile :] )
Thanks Wobbles, I took your approach. There were some minor issues, here is a working version:

    auto multiSlice(string data, string[] delims) {
        import std.algorithm : canFind;
        import std.string : indexOf;
        struct MultiSliceRange {
            string m_str;
            string[] m_delims;
            bool empty(){
                return m_str.length == 0;
            }
            void popFront(){
                auto idx = findNextIndex;
                m_str = m_str[idx..$];
                return;
            }
            string front(){
                auto idx = findNextIndex;
                return m_str[0..idx];
            }
            private size_t findNextIndex() {
                auto index = size_t.max;
                foreach(delim; m_delims) {
                    if(m_str.canFind(delim)) {
                        auto foundIndex = m_str.indexOf(delim);
                        if(index > foundIndex && foundIndex > 0) {
                            index = foundIndex;
                        }
                    }
                }
                return index;
            }
        }
        return MultiSliceRange(data, delims);
    }
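A note for later readers: when no further delimiter follows, findNextIndex returns size_t.max, and slicing with that looks like it would fail a bounds check on the final block. A self-contained variant that falls back to m_str.length in that case (my assumption about the intended behavior) might look like this:

```d
import std.string : indexOf;
import std.stdio : writeln;

auto multiSlice(string data, string[] delims) {
    struct MultiSliceRange {
        string m_str;
        string[] m_delims;
        bool empty() { return m_str.length == 0; }
        void popFront() { m_str = m_str[findNextIndex .. $]; }
        string front() { return m_str[0 .. findNextIndex]; }
        private size_t findNextIndex() {
            auto index = m_str.length; // fallback: the rest of the string
            foreach (delim; m_delims) {
                auto foundIndex = m_str.indexOf(delim);
                // skip a delimiter at position 0: it starts the current block
                if (foundIndex > 0 && cast(size_t) foundIndex < index)
                    index = foundIndex;
            }
            return index;
        }
    }
    return MultiSliceRange(data, delims);
}

void main() {
    // made-up input in the spirit of the ensight example upthread
    auto blocks = multiSlice("FORMAT\ntype: ensight gold\nTIME\ntime set: 1\n",
                             ["FORMAT", "TIME"]);
    foreach (b; blocks)
        writeln("---\n", b);
}
```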
Mar 27 2016
parent wobbles <grogan.colin gmail.com> writes:
On Sunday, 27 March 2016 at 07:45:00 UTC, ParticlePeter wrote:
 On Wednesday, 23 March 2016 at 20:00:55 UTC, wobbles wrote:
 [...]
Thanks Wobbles, I took your approach. There were some minor issues, here is a working version: [...]
Great, thanks for fixing it up!
Mar 28 2016