digitalmars.D.learn - Efficiently streaming data to associative array

Guillaume Chatelet <chatelet.guillaume gmail.com> writes:

Let's say I'm processing megabytes of data. I'm lazily iterating 
over the incoming lines and storing data in an associative array. 
I don't want to copy unless I have to.

Contrived example follows:

input file
----------
a,b,15
c,d,12
...

Efficient ingestion
-------------------
import std.stdio;
import std.format;

void main() {

   size_t[string][string] indexed_map;

   foreach(char[] line ; stdin.byLine) {
     char[] a;
     char[] b;
     size_t value;
     line.formattedRead!"%s,%s,%d"(a,b,value);

     auto pA = a in indexed_map;
     if(pA is null) {
       pA = &(indexed_map[a.idup] = (size_t[string]).init);
     }

     auto pB = b in (*pA);
     if(pB is null) {
       pB = &((*pA)[b.idup] = size_t.init);
     }

     // Technically unneeded but let's say we have more than 2 dimensions.
     (*pB) = value;
   }

   indexed_map.writeln;
}


I qualify this code as ugly but fast. Any idea on how to make 
this less ugly? Is there something in Phobos to help?

Aug 08 2017

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 8/8/17 11:28 AM, Guillaume Chatelet wrote:
 Let's say I'm processing MB of data, I'm lazily iterating over the 
 incoming lines storing data in an associative array. I don't want to 
 copy unless I have to.
 
 Contrived example follows:
 
 input file
 ----------
 a,b,15
 c,d,12
 ....
 
 Efficient ingestion
 -------------------
 void main() {
 
    size_t[string][string] indexed_map;
 
    foreach(char[] line ; stdin.byLine) {
      char[] a;
      char[] b;
      size_t value;
      line.formattedRead!"%s,%s,%d"(a,b,value);
 
      auto pA = a in indexed_map;
      if(pA is null) {
        pA = &(indexed_map[a.idup] = (size_t[string]).init);
      }
 
      auto pB = b in (*pA);
      if(pB is null) {
         pB = &((*pA)[b.idup] = size_t.init);
      }
 
      // Technically unneeded but let's say we have more than 2 dimensions.
      (*pB) = value;
    }
 
    indexed_map.writeln;
 }
 
 
 I qualify this code as ugly but fast. Any idea on how to make this less 
 ugly? Is there something in Phobos to help?

I wouldn't use formattedRead, as I think this is going to allocate 
temporaries for a and b.

Note, this is very close to Jon Degenhardt's blog post in May: 
https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/

-Steve

Aug 08 2017

Guillaume Chatelet <chatelet.guillaume gmail.com> writes:

On Tuesday, 8 August 2017 at 16:00:17 UTC, Steven Schveighoffer 
wrote:
 On 8/8/17 11:28 AM, Guillaume Chatelet wrote:
 Let's say I'm processing MB of data, I'm lazily iterating over 
 the incoming lines storing data in an associative array. I 
 don't want to copy unless I have to.
 
 Contrived example follows:
 
 input file
 ----------
 a,b,15
 c,d,12
 ....
 
 Efficient ingestion
 -------------------
 void main() {
 
    size_t[string][string] indexed_map;
 
    foreach(char[] line ; stdin.byLine) {
      char[] a;
      char[] b;
      size_t value;
      line.formattedRead!"%s,%s,%d"(a,b,value);
 
      auto pA = a in indexed_map;
      if(pA is null) {
        pA = &(indexed_map[a.idup] = (size_t[string]).init);
      }
 
      auto pB = b in (*pA);
      if(pB is null) {
         pB = &((*pA)[b.idup] = size_t.init);
       }
  
       // Technically unneeded but let's say we have more than 2 dimensions.
      (*pB) = value;
    }
 
    indexed_map.writeln;
 }
 
 
 I qualify this code as ugly but fast. Any idea on how to make 
 this less ugly? Is there something in Phobos to help?

 I wouldn't use formattedRead, as I think this is going to 
 allocate temporaries for a and b.

 Note, this is very close to Jon Degenhardt's blog post in May: 
 https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/

 -Steve

I haven't yet dug into formattedRead but thx for letting me know 
: )
I was mostly speaking about the pattern with the AA. I guess the 
best I can do is a templated function to hide the ugliness.


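import std.stdio;
import std.format;

// Returns a reference to map[key], inserting Value.init first if the key
// is missing. The key is copied (idup) only when it has to be inserted.
ref Value GetWithDefault(Value)(ref Value[string] map, const(char[]) key) {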
   auto pValue = key in map;
   if(pValue) return *pValue;
   return map[key.idup] = Value.init;
}

void main() {

   size_t[string][string] indexed_map;

   foreach(char[] line ; stdin.byLine) {
     char[] a;
     char[] b;
     size_t value;
     line.formattedRead!"%s,%s,%d"(a,b,value);

     indexed_map.GetWithDefault(a).GetWithDefault(b) = value;
   }

   indexed_map.writeln;
}


Not too bad actually !

Aug 08 2017

kerdemdemir <kerdemdemir hotmail.com> writes:

 I haven't yet dug into formattedRead but thx for letting me 
 know : )
 I was mostly speaking about the pattern with the AA. I guess 
 the best I can do is a templated function to hide the ugliness.


 ref Value GetWithDefault(Value)(ref Value[string] map, const 
 (char[]) key) {
   auto pValue = key in map;
   if(pValue) return *pValue;
   return map[key.idup] = Value.init;
 }

 void main() {

   size_t[string][string] indexed_map;

   foreach(char[] line ; stdin.byLine) {
     char[] a;
     char[] b;
     size_t value;
     line.formattedRead!"%s,%s,%d"(a,b,value);

     indexed_map.GetWithDefault(a).GetWithDefault(b) = value;
   }

   indexed_map.writeln;
 }


 Not too bad actually !

As a total beginner I feel a bit uncomfortable with basic 
operations on AAs.

First, although I am very happy we have pointers, using pointers 
in a common operation like this IMHO makes the language a bit 
unsafe.

Second, the "in" keyword has always seemed so specific to me.

I think I will use your solution "ref Value 
GetWithDefault(Value)" very often since it hides the two things 
above.

Aug 09 2017

Guillaume Chatelet <chatelet.guillaume gmail.com> writes:

On Wednesday, 9 August 2017 at 10:00:14 UTC, kerdemdemir wrote:
 As a total beginner I am feeling a bit not comfortable with 
 basic operations in AA.

 First even I am very happy we have pointers but using pointers 
 in a common operation like this IMHO makes the language a bit 
 not safe.

 Second "in" keyword always seemed so specific to me.

 I think I will use your solution "ref Value 
 GetWithDefault(Value)" very often since it hides the two things 
 above.

You don't need this most of the time. If you already have the 
correct type, it's easy:

size_t[string][string] indexed_map;

string a, b; // a and b are strings not char[]
indexed_map[a][b] = value; // this will create the AA slots if needed

In my specific case the data is streamed from stdin and is not 
kept in memory.
byLine returns a view of the stdin buffer, which may be overwritten 
at the next foreach iteration, so I can't use the index operator 
directly: I need a string that does not change over time.
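
A minimal sketch of why the copy is needed, using a plain char[] as a stand-in for the buffer that byLine reuses. The helper name overwrite is hypothetical, and the real buffer management inside byLine is not shown here; the point is only that a slice follows the buffer while an idup copy does not.

```d
import std.stdio : writeln;

// Hypothetical helper: overwrite a buffer in place, roughly what happens
// when the next line is read into the same memory.
void overwrite(char[] buffer, string next)
{
    buffer[0 .. next.length] = next;
}

void main()
{
    char[] buffer = "alpha".dup;   // stands in for the reused line buffer
    char[] view = buffer[0 .. 5];  // slice: shares memory with buffer
    string copy = view.idup;       // immutable copy with its own memory

    overwrite(buffer, "omega");    // the "next line" arrives

    writeln(view); // the slice now shows the new contents
    writeln(copy); // the copy still holds the old contents
    assert(view == "omega");
    assert(copy == "alpha");
}
```

This is why a char[] slice is fine for the lookup (a in indexed_map) but not as a stored key: the stored key must be an idup'ed string.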

I could have used this code:

import std.stdio;
import std.format;

void main() {
   size_t[string][string] indexed_map;
   foreach(char[] line ; stdin.byLine) {
     char[] a;
     char[] b;
     size_t value;
     line.formattedRead!"%s,%s,%d"(a,b,value);
     indexed_map[a.idup][b.idup] = value;
   }
   indexed_map.writeln;
}

It's perfectly ok if data is small. In my case data is huge and 
creating a copy of the strings at each iteration is costly.

Aug 09 2017

Anonymouse <asdf asdf.net> writes:

On Tuesday, 8 August 2017 at 16:00:17 UTC, Steven Schveighoffer 
wrote:
 I wouldn't use formattedRead, as I think this is going to 
 allocate temporaries for a and b.

What would you suggest to use in its stead? My use-case is 
similar to the OP's in that I have a string of tokens that I want 
split into variables.

import std.stdio;
import std.format;

void main()
{
     string abc, def;
     int ghi, jkl;

     string s = "abc,123,def,456";
     s.formattedRead!"%s,%d,%s,%d"(abc, ghi, def, jkl);

     writeln(abc);
     writeln(def);
     writeln(ghi);
     writeln(jkl);
}

Aug 08 2017

Steven Schveighoffer <schveiguy yahoo.com> writes:

On 8/8/17 3:43 PM, Anonymouse wrote:
 On Tuesday, 8 August 2017 at 16:00:17 UTC, Steven Schveighoffer wrote:
 I wouldn't use formattedRead, as I think this is going to allocate 
 temporaries for a and b.

 
 What would you suggest to use in its stead? My use-case is similar to 
 the OP's in that I have a string of tokens that I want split into 
 variables.

Use splitter(",") and then parse each field with the appropriate 
function (e.g. to!).

For example, the OP's code, I would do:

auto r = line.splitter(",");
a = r.front;
r.popFront;
b = r.front;
r.popFront;
value = r.front.to!size_t; // the OP's third field is a size_t named value

It would be nice if formattedRead didn't use appender, and instead 
sliced, but I'm not sure it can be fixed.

Note, one could make a template that does this automatically in one line.
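
One possible shape for such a template, as a sketch only. The name splitInto and its signature are hypothetical (nothing like it is claimed to exist in Phobos), and it assumes a single-character delimiter. Slice-typed arguments end up pointing into line, because to! returns the value unchanged when no conversion is needed.

```d
import std.algorithm.iteration : splitter;
import std.conv : to;
import std.exception : enforce;
import std.stdio : writeln;

// Hypothetical helper: fill each argument from the next delimited field.
// char[] arguments become slices of `line`; other types are parsed with to!.
void splitInto(Args...)(char[] line, char delim, ref Args args)
{
    auto fields = line.splitter(delim);
    foreach (i, Arg; Args)
    {
        enforce(!fields.empty, "not enough fields in line");
        args[i] = fields.front.to!Arg;
        fields.popFront();
    }
}

void main()
{
    char[] line = "a,b,15".dup;
    char[] a, b;
    size_t value;

    line.splitInto(',', a, b, value);

    writeln(a, " ", b, " ", value);
    assert(a == "a" && b == "b" && value == 15);
    assert(a.ptr == line.ptr); // a is a slice of line, not a copy
}
```

Because a and b alias the line buffer, they are only valid until the next line is read; anything stored beyond that still needs idup.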

-Steve

Aug 09 2017

Jon Degenhardt <jond noreply.com> writes:

On Wednesday, 9 August 2017 at 13:36:46 UTC, Steven Schveighoffer 
wrote:
 On 8/8/17 3:43 PM, Anonymouse wrote:
 On Tuesday, 8 August 2017 at 16:00:17 UTC, Steven 
 Schveighoffer wrote:
 I wouldn't use formattedRead, as I think this is going to 
 allocate temporaries for a and b.

 
 What would you suggest to use in its stead? My use-case is 
 similar to the OP's in that I have a string of tokens that I 
 want split into variables.

 using splitter(","), and then parsing each field using 
 appropriate function (e.g. to!)

 For example, the OP's code, I would do:

 auto r = line.splitter(",");
 a = r.front;
 r.popFront;
 b = r.front;
 r.popFront;
 c = r.front.to!int;

 It would be nice if formattedRead didn't use appender, and 
 instead sliced, but I'm not sure it can be fixed.

 Note, one could make a template that does this automatically in 
 one line.

 -Steve

The blog post Steve referred to has examples of this type of 
processing while iterating over lines in a file. A couple of 
different ways to access the elements are shown. AA access is 
also addressed: 
https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/
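
For readers who want the pieces of this thread in one place, here is a sketch that combines splitter-based field access with a copy-on-miss AA lookup. It is not taken from the blog post. The names getOrInsert and copies are hypothetical, the input is an in-memory array standing in for stdin.byLine, and the built-in require function for associative arrays is assumed to be available (it was added to the language runtime after this thread, in DMD 2.082).

```d
import std.algorithm.iteration : splitter;
import std.conv : to;
import std.stdio : writeln;

size_t copies; // counts key duplications, for illustration only

// Look up with the borrowed slice; duplicate the key only on a miss.
ref Value getOrInsert(Value)(ref Value[string] map, const(char)[] key)
{
    if (auto p = key in map)
        return *p;
    ++copies;
    return map.require(key.idup); // inserts Value.init and returns the slot
}

void main()
{
    size_t[string][string] indexed;

    foreach (text; ["a,b,15", "a,c,7", "a,b,12"])
    {
        char[] line = text.dup; // stands in for a reused byLine buffer
        auto fields = line.splitter(',');
        auto a = fields.front; fields.popFront();
        auto b = fields.front; fields.popFront();
        auto value = fields.front.to!size_t;

        indexed.getOrInsert(a).getOrInsert(b) = value;
    }

    writeln(indexed);
    assert(indexed["a"]["b"] == 12 && indexed["a"]["c"] == 7);
    assert(copies == 3); // "a", "b" and "c" were each copied once
}
```

Three lines with six key fields produce only three key copies here; the remaining lookups reuse the keys already stored in the AA.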

--Jon

Aug 10 2017
