
digitalmars.D.learn - Does something like std.algorithm.iteration:splitter with multiple

reply ParticlePeter <ParticlePeter gmx.de> writes:
I need to parse an ascii with multiple tokens. The tokens can be 
seen as keys. After every token there is a bunch of lines 
belonging to that token, the values.
The order of tokens is unknown.

I would like to read the file in as a whole string, and split the 
string with:
splitter(fileString, [token1, token2, ... tokenN]);

And would like to get a range of strings each starting with 
tokenX and ending before the next token.

Does something like this exist?

I know how to parse the string line by line and create new 
strings and append the appropriate lines, but I don't know how to 
do this with a lazy result range and new allocations.
Mar 23 2016
next sibling parent reply ParticlePeter <ParticlePeter gmx.de> writes:
On Wednesday, 23 March 2016 at 11:57:49 UTC, ParticlePeter wrote:

Stupid typos:
 I need to parse an ascii
file
 with multiple tokens. ...
...
 to do this with a lazy result range and
without
 new allocations.
Mar 23 2016
parent reply Andrea Fontana <nospam example.com> writes:
On Wednesday, 23 March 2016 at 12:00:15 UTC, ParticlePeter wrote:
 On Wednesday, 23 March 2016 at 11:57:49 UTC, ParticlePeter 
 wrote:

 Stupid typos:
 I need to parse an ascii
file
 with multiple tokens. ...
...
 to do this with a lazy result range and
without
 new allocations.
Any input => output example?
Mar 23 2016
parent ParticlePeter <ParticlePeter gmx.de> writes:
On Wednesday, 23 March 2016 at 14:20:12 UTC, Andrea Fontana wrote:
 Any input => output example?
Sure, it is ensight gold case file format:

FORMAT
type: ensight gold
GEOMETRY
model: 1 exgold2.geo**
VARIABLE
scalar per node: 1 Stress exgold2.scl**
vector per node: 1 Displacement exgold2.dis**
TIME
time set: 1
number of steps: 3
filename start number: 0
filename increment: 1
time values: 1.0 2.0 3.0

The separators would be ["FORMAT", "TIME", "VARIABLE", "GEOMETRY"]. The blank lines between the blocks and the order of the separators in the file are not known. I would expect a range of four ranges of lines: one for each text block above.
Mar 23 2016
prev sibling next sibling parent reply Simen Kjaeraas <simen.kjaras gmail.com> writes:
On Wednesday, 23 March 2016 at 11:57:49 UTC, ParticlePeter wrote:
 I need to parse an ascii with multiple tokens. The tokens can 
 be seen as keys. After every token there is a bunch of lines 
 belonging to that token, the values.
 The order of tokens is unknown.

 I would like to read the file in as a whole string, and split 
 the string with:
 splitter(fileString, [token1, token2, ... tokenN]);

 And would like to get a range of strings each starting with 
 tokenX and ending before the next token.

 Does something like this exist?

 I know how to parse the string line by line and create new 
 strings and append the appropriate lines, but I don't know how 
 to do this with a lazy result range and new allocations.
Without a bit more detail, it's a bit hard to help.

std.algorithm.splitter has an overload that takes a function instead of a separator:

    import std.algorithm;
    auto a = "a,b;c";
    auto b = a.splitter!(e => e == ';' || e == ',');
    assert(equal(b, ["a", "b", "c"]));

However, not only are the separators lost in the process, it only allows single-element separators. This might be good enough given the information you've divulged, but I'll hazard a guess it isn't.

My next stop is std.algorithm.chunkBy:

    auto a = ["a","b","c", "d", "e"];
    auto b = a.chunkBy!(e => e == "a" || e == "d");
    auto result = [
        tuple(true, ["a"]), tuple(false, ["b", "c"]),
        tuple(true, ["d"]), tuple(false, ["e"])
        ];

No assert here, since the ranges in the tuples are not arrays. My immediate concern is that two consecutive tokens with no intervening values will mess it up. Also, the result looks a bit messy. A little more involved, and according to documentation not guaranteed to work:

    bool isToken(string s) {
        return s == "a" || s == "d";
    }

    bool tokenCounter(string s) {
        static string oldToken;
        static bool counter = true;
        if (s.isToken && s != oldToken) {
            oldToken = s;
            counter = !counter;
        }
        return counter;
    }

    unittest {
        import std.algorithm;
        import std.stdio;
        import std.typecons;
        import std.array;

        auto a = ["a","b","c", "d", "e", "a", "d"];
        auto b = a.chunkBy!tokenCounter.map!(e=>e[1]);
        auto result = [
            ["a", "b", "c"],
            ["d", "e"],
            ["a"],
            ["d"]
            ];
        writeln(b);
        writeln(result);
    }

Again no assert, but b and result have basically the same contents. Also handles consecutive tokens neatly (but consecutive identical tokens will be grouped together).

Hope this helps.

--
  Simen
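Adapting the tokenCounter idea to the ensight tokens from the example upthread might look like the sketch below. The token list and sample lines are assumptions based on that example, and the same caveat applies: chunkBy's documentation does not guarantee stateful predicates like this.

```d
import std.algorithm : canFind, chunkBy, map;
import std.stdio : writeln;

// assumed token list, taken from the ensight example upthread
immutable tokens = ["FORMAT", "GEOMETRY", "VARIABLE", "TIME"];

bool tokenCounter(string s) {
    // stateful predicate: flips its value whenever a new token line starts
    static string oldToken;
    static bool counter = true;
    if (tokens.canFind(s) && s != oldToken) {
        oldToken = s;
        counter = !counter;
    }
    return counter;
}

void main() {
    // made-up sample lines standing in for the file contents
    auto lines = ["FORMAT", "type: ensight gold",
                  "GEOMETRY", "model: 1 exgold2.geo**",
                  "TIME", "time set: 1"];
    // consecutive lines with the same counter value form one block,
    // so each block starts at a token line and runs until the next token
    auto blocks = lines.chunkBy!tokenCounter.map!(e => e[1]);
    foreach (b; blocks)
        writeln(b);
}
```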
Mar 23 2016
parent reply ParticlePeter <ParticlePeter gmx.de> writes:
On Wednesday, 23 March 2016 at 15:23:38 UTC, Simen Kjaeraas wrote:
 Without a bit more detail, it's a bit hard to help.

 std.algorithm.splitter has an overload that takes a function 
 instead of a separator:

     import std.algorithm;
     auto a = "a,b;c";
     auto b = a.splitter!(e => e == ';' || e == ',');
     assert(equal(b, ["a", "b", "c"]));

 However, not only are the separators lost in the process, it 
 only allows single-element separators. This might be good 
 enough given the information you've divulged, but I'll hazard a 
 guess it isn't.

 My next stop is std.algorithm.chunkBy:

     auto a = ["a","b","c", "d", "e"];
     auto b = a.chunkBy!(e => e == "a" || e == "d");
     auto result = [
         tuple(true, ["a"]), tuple(false, ["b", "c"]),
         tuple(true, ["d"]), tuple(false, ["e"])
         ];

 No assert here, since the ranges in the tuples are not arrays. 
 My immediate concern is that two consecutive tokens with no 
 intervening values will mess it up. Also, the result looks a 
 bit messy. A little more involved, and according to 
 documentation not guaranteed to work:

 bool isToken(string s) {
     return s == "a" || s == "d";
 }

 bool tokenCounter(string s) {
     static string oldToken;
     static bool counter = true;
     if (s.isToken && s != oldToken) {
         oldToken = s;
         counter = !counter;
     }
     return counter;
 }

 unittest {
     import std.algorithm;
     import std.stdio;
     import std.typecons;
     import std.array;

     auto a = ["a","b","c", "d", "e", "a", "d"];
     auto b = a.chunkBy!tokenCounter.map!(e=>e[1]);
     auto result = [
         ["a", "b", "c"],
         ["d", "e"],
         ["a"],
         ["d"]
         ];
     writeln(b);
     writeln(result);
 }

 Again no assert, but b and result have basically the same 
 contents. Also handles consecutive tokens neatly (but 
 consecutive identical tokens will be grouped together).

 Hope this helps.

 --
   Simen
Thanks Simen, your tokenCounter is inspirational, for the rest I'll take some time for testing.

But some additional thoughts from my side:
I get all the lines of the file into one range. Calling array on it should give me an array, but how would I use find to get an index into this array? With the indices I could slice up the array into four slices, no allocation required. If there is no easy way to just get an index instead of a range, I would try to use something like the tokenCounter to find all the indices.
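For the index question: std.algorithm.countUntil returns an index (or -1) instead of a range, so collecting token positions and slicing between them could be sketched like this (the line and token arrays are made-up stand-ins for the file contents):

```d
import std.algorithm : canFind, countUntil;

void main() {
    // made-up stand-in for the file's lines
    auto lines = ["FORMAT", "type: ensight gold", "GEOMETRY", "model: 1"];
    auto tokens = ["FORMAT", "GEOMETRY", "VARIABLE", "TIME"];

    // countUntil yields an index, unlike find, which yields a range
    assert(lines.countUntil("GEOMETRY") == 2);

    // collect the index of every token line ...
    size_t[] starts;
    foreach (i, line; lines)
        if (tokens.canFind(line))
            starts ~= i;
    assert(starts.length == 2 && starts[0] == 0 && starts[1] == 2);

    // ... then slice between consecutive indices: no string data is copied
    assert(lines[starts[0] .. starts[1]] == ["FORMAT", "type: ensight gold"]);
}
```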
Mar 23 2016
parent Simen Kjaeraas <simen.kjaras gmail.com> writes:
On Wednesday, 23 March 2016 at 18:10:05 UTC, ParticlePeter wrote:
 Thanks Simen,
 your tokenCounter is inspirational, for the rest I'll take some 
 time for testing.
My pleasure. :) Testing it on your example data shows it to work there. However, as stated above, the documentation says it's undefined, so future changes (even optimizations and bugfixes) to Phobos could make it stop working:

"This predicate must be an equivalence relation, that is, it must be reflexive (pred(x,x) is always true), symmetric (pred(x,y) == pred(y,x)), and transitive (pred(x,y) && pred(y,z) implies pred(x,z)). If this is not the case, the range returned by chunkBy may assert at runtime or behave erratically."
 But some additional thoughts from my side:
 I get all the lines of the file into one range. Calling array 
 on it should give me an array, but how would I use find to get 
 an index into this array?
 With the indices I could slice up the array into four slices, 
 no allocation required. If there is no easy way to just get an 
 index instead of a range, I would try to use something like 
 the tokenCounter to find all the indices.
The chunkBy example should not allocate. chunkBy itself is lazy, as are its sub-ranges. No copying of string contents is performed. So unless you have very specific reasons to use slicing, I don't see why chunkBy shouldn't be good enough.

Full disclosure: There is a malloc call in RefCounted, which is used for optimization purposes when chunkBy is called on a forward range. When chunkBy is called on an array, that's a 6-word allocation (24 bytes on 32-bit, 48 bytes on 64-bit), happening once. There are no other dependencies that allocate. Such is the beauty of D. :)

--
  Simen
Mar 23 2016
prev sibling parent reply wobbles <grogan.colin gmail.com> writes:
On Wednesday, 23 March 2016 at 11:57:49 UTC, ParticlePeter wrote:
 I need to parse an ascii with multiple tokens. The tokens can 
 be seen as keys. After every token there is a bunch of lines 
 belonging to that token, the values.
 The order of tokens is unknown.

 I would like to read the file in as a whole string, and split 
 the string with:
 splitter(fileString, [token1, token2, ... tokenN]);

 And would like to get a range of strings each starting with 
 tokenX and ending before the next token.

 Does something like this exist?

 I know how to parse the string line by line and create new 
 strings and append the appropriate lines, but I don't know how 
 to do this with a lazy result range and new allocations.
This isn't tested, but this is my first thought:

    void main(){
        string testString = "this:is:a-test;";
        foreach(str; testString.multiSlice([":","-",";"]))
            writefln("Got: %s", str);
    }

    auto multiSlice(string string, string[] delims){
        struct MultiSliceRange{
            string m_str;
            string[] m_delims;
            bool empty(){
                return m_str.length == 0;
            }
            void popFront(){
                auto idx = findNextIndex;
                m_str = m_str[idx..$];
                return;
            }
            string front(){
                auto idx = findNextIndex;
                return m_str[0..idx];
            }
            private long findNextIndex(){
                long foundIndex=-1;
                foreach(delim; m_delims){
                    if(m_str.canFind(delim)){
                        if(foundIndex == -1 || (m_str.indexOf(delim) < foundIndex && m_str.indexOf(delim) >= 0)){
                            foundIndex = m_str.indexOf(delim);
                        }
                    }
                }
                return foundIndex;
            }
        }
        return MultiSliceRange(string, delims);
    }

Again, totally untested, but I think logically it should work. ( No D compiler on this machine so it mightn't even compile :] )
Mar 23 2016
parent reply ParticlePeter <ParticlePeter gmx.de> writes:
On Wednesday, 23 March 2016 at 20:00:55 UTC, wobbles wrote:
 Again, totally untested, but I think logically it should work. 
 ( No D compiler on this machine so it mightn't even compile :] )
Thanks Wobbles, I took your approach. There were some minor issues, here is a working version:

    auto multiSlice(string data, string[] delims) {
        import std.algorithm : canFind;
        import std.string : indexOf;
        struct MultiSliceRange {
            string m_str;
            string[] m_delims;
            bool empty(){
                return m_str.length == 0;
            }
            void popFront(){
                auto idx = findNextIndex;
                m_str = m_str[idx..$];
                return;
            }
            string front(){
                auto idx = findNextIndex;
                return m_str[0..idx];
            }
            private size_t findNextIndex() {
                auto index = size_t.max;
                foreach(delim; m_delims) {
                    if(m_str.canFind(delim)) {
                        auto foundIndex = m_str.indexOf(delim);
                        if(index > foundIndex && foundIndex > 0) {
                            index = foundIndex;
                        }
                    }
                }
                return index;
            }
        }
        return MultiSliceRange(data, delims);
    }
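A note for later readers: when no further delimiter follows, findNextIndex returns size_t.max, and slicing with that looks like it would fail a bounds check on the final block. A self-contained variant that falls back to m_str.length in that case (my assumption about the intended behavior) might look like this:

```d
import std.string : indexOf;
import std.stdio : writeln;

auto multiSlice(string data, string[] delims) {
    struct MultiSliceRange {
        string m_str;
        string[] m_delims;
        bool empty() { return m_str.length == 0; }
        void popFront() { m_str = m_str[findNextIndex .. $]; }
        string front() { return m_str[0 .. findNextIndex]; }
        private size_t findNextIndex() {
            auto index = m_str.length; // fallback: the rest of the string
            foreach (delim; m_delims) {
                auto foundIndex = m_str.indexOf(delim);
                // skip a delimiter at position 0: it starts the current block
                if (foundIndex > 0 && cast(size_t) foundIndex < index)
                    index = foundIndex;
            }
            return index;
        }
    }
    return MultiSliceRange(data, delims);
}

void main() {
    // made-up input in the spirit of the ensight example upthread
    auto blocks = multiSlice("FORMAT\ntype: ensight gold\nTIME\ntime set: 1\n",
                             ["FORMAT", "TIME"]);
    foreach (b; blocks)
        writeln("---\n", b);
}
```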
Mar 27 2016
parent wobbles <grogan.colin gmail.com> writes:
On Sunday, 27 March 2016 at 07:45:00 UTC, ParticlePeter wrote:
 On Wednesday, 23 March 2016 at 20:00:55 UTC, wobbles wrote:
 [...]
Thanks Wobbles, I took your approach. There were some minor issues, here is a working version: [...]
Great, thanks for fixing it up!
Mar 28 2016