digitalmars.D.learn - Efficiently streaming data to associative array
- Guillaume Chatelet (34/34) Aug 08 2017 Let's say I'm processing MB of data, I'm lazily iterating over
- Steven Schveighoffer (6/50) Aug 08 2017 I wouldn't use formattedRead, as I think this is going to allocate
- Guillaume Chatelet (24/75) Aug 08 2017 I haven't yet dug into formattedRead but thx for letting me know
- kerdemdemir (9/31) Aug 09 2017 As a total beginner I am feeling a bit not comfortable with basic
- Guillaume Chatelet (26/35) Aug 09 2017 You don't need this most of the time, if you already have the
- Anonymouse (18/20) Aug 08 2017 What would you suggest to use in its stead? My use-case is
- Steven Schveighoffer (14/21) Aug 09 2017 using splitter(","), and then parsing each field using appropriate
- Jon Degenhardt (8/31) Aug 10 2017 The blog post Steve referred to has examples of this type
Let's say I'm processing megabytes of data: I lazily iterate over the incoming lines and store the data in an associative array. I don't want to copy unless I have to. A contrived example follows.

input file
----------
a,b,15
c,d,12
...

Efficient ingestion
-------------------
    import std.format : formattedRead;
    import std.stdio;

    void main()
    {
        size_t[string][string] indexed_map;
        foreach(char[] line ; stdin.byLine)
        {
            char[] a;
            char[] b;
            size_t value;
            line.formattedRead!"%s,%s,%d"(a,b,value);
            // Look up with the char[] slice; copy the key (idup) only on insertion.
            auto pA = a in indexed_map;
            if(pA is null)
            {
                pA = &(indexed_map[a.idup] = (size_t[string]).init);
            }
            // Technically unneeded but let's say we have more than 2 dimensions.
            auto pB = b in (*pA);
            if(pB is null)
            {
                pB = &((*pA)[b.idup] = size_t.init);
            }
            (*pB) = value;
        }
        indexed_map.writeln;
    }

I qualify this code as ugly but fast. Any idea on how to make this less ugly? Is there something in Phobos to help?
Aug 08 2017
On 8/8/17 11:28 AM, Guillaume Chatelet wrote:
> [...]
> line.formattedRead!"%s,%s,%d"(a,b,value);
> [...]

I wouldn't use formattedRead, as I think this is going to allocate temporaries for a and b.

Note, this is very close to Jon Degenhardt's blog post in May: https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/

-Steve
Aug 08 2017
On Tuesday, 8 August 2017 at 16:00:17 UTC, Steven Schveighoffer wrote:
> I wouldn't use formattedRead, as I think this is going to allocate temporaries for a and b.

I haven't yet dug into formattedRead but thx for letting me know :) I was mostly speaking about the pattern with the AA. I guess the best I can do is a templated function to hide the ugliness.

    ref Value GetWithDefault(Value)(ref Value[string] map, const (char[]) key)
    {
        auto pValue = key in map;
        if(pValue) return *pValue;
        return map[key.idup] = Value.init;
    }

    void main()
    {
        size_t[string][string] indexed_map;
        foreach(char[] line ; stdin.byLine)
        {
            char[] a;
            char[] b;
            size_t value;
            line.formattedRead!"%s,%s,%d"(a,b,value);
            indexed_map.GetWithDefault(a).GetWithDefault(b) = value;
        }
        indexed_map.writeln;
    }

Not too bad actually!
Aug 08 2017
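To make the lookup-then-insert pattern from the post above easier to try in isolation, here is a minimal, self-contained sketch. It is not the poster's exact code: the helper is renamed getWithDefault (camelCase), the insertion is spelled out in separate statements, and a reusable char[] buffer stands in for the slices that stdin.byLine would hand out. Those choices are assumptions made for illustration.

```d
import std.stdio : writeln;

/// Returns a reference to map[key], inserting Value.init first if the key
/// is absent. The key is copied (idup) only when a new slot is created.
ref Value getWithDefault(Value)(ref Value[string] map, const(char)[] key)
{
    if (auto pValue = key in map)
        return *pValue;            // hit: no allocation for the key
    string owned = key.idup;       // miss: make one immutable copy of the key
    map[owned] = Value.init;
    return map[owned];
}

void main()
{
    size_t[string][string] indexedMap;

    // Mutable buffers standing in for slices of a reused line buffer.
    char[] a = "a".dup;
    char[] b = "b".dup;
    indexedMap.getWithDefault(a).getWithDefault(b) = 15;

    // Overwrite the buffers, as the next line of input would.
    a[0] = 'c';
    b[0] = 'd';
    indexedMap.getWithDefault(a).getWithDefault(b) = 12;

    // The first entry is intact because its keys were copied on insertion.
    assert(indexedMap["a"]["b"] == 15);
    assert(indexedMap["c"]["d"] == 12);
    writeln(indexedMap);
}
```

The trade-off in this variant is a second hash lookup on a miss, in exchange for not relying on the assignment expression being usable as a ref return value.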
> I guess the best I can do is a templated function to hide the ugliness.
> [...]
> Not too bad actually !

As a total beginner I don't feel quite comfortable with basic operations on AAs. First, although I am very happy that we have pointers, using pointers in a common operation like this IMHO makes the language a bit unsafe. Second, the "in" keyword has always seemed so specific to me.

I think I will use your solution "ref Value GetWithDefault(Value)" very often since it hides the two things above.
Aug 09 2017
On Wednesday, 9 August 2017 at 10:00:14 UTC, kerdemdemir wrote:
> I think I will use your solution "ref Value GetWithDefault(Value)" very often since it hides the two things above.

You don't need this most of the time. If you already have the correct type, it's easy:

    size_t[string][string] indexed_map;
    string a, b; // a and b are strings, not char[]
    indexed_map[a][b] = value; // this will create the AA slots if needed

In my specific case the data is streamed from stdin and is not kept in memory. byLine returns a view of the stdin buffer, which may be replaced at the next loop iteration, so I can't use the index operator directly: I need a string that does not change over time. I could have used this code:

    void main()
    {
        size_t[string][string] indexed_map;
        foreach(char[] line ; stdin.byLine)
        {
            char[] a;
            char[] b;
            size_t value;
            line.formattedRead!"%s,%s,%d"(a,b,value);
            indexed_map[a.idup][b.idup] = value;
        }
        indexed_map.writeln;
    }

It's perfectly OK if the data is small. In my case the data is huge, and creating a copy of the strings at each iteration is costly.
Aug 09 2017
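The buffer-reuse point in the post above can be illustrated without reading from stdin. The sketch below is an assumption-laden stand-in: a plain char[] plays the role of byLine's internal buffer, and overwriting it simulates the next iteration. It only shows why a slice cannot safely be kept as a key while an idup copy can.

```d
import std.stdio : writeln;

void main()
{
    // Stand-in for the line buffer that byLine reuses between iterations.
    char[] buffer = "a,b,15".dup;

    const(char)[] view = buffer[0 .. 1]; // slice: shares memory with buffer
    string owned = buffer[0 .. 1].idup;  // copy: independent immutable string

    // Simulate the next line being read into the same buffer.
    buffer[] = "c,d,12";

    assert(view == "c");  // the slice now shows the new line's data
    assert(owned == "a"); // the copy is unaffected
    writeln(view, " ", owned);
}
```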
On Tuesday, 8 August 2017 at 16:00:17 UTC, Steven Schveighoffer wrote:
> I wouldn't use formattedRead, as I think this is going to allocate temporaries for a and b.

What would you suggest to use in its stead? My use-case is similar to the OP's in that I have a string of tokens that I want split into variables.

    import std.stdio;
    import std.format;

    void main()
    {
        string abc, def;
        int ghi, jkl;
        string s = "abc,123,def,456";
        s.formattedRead!"%s,%d,%s,%d"(abc, ghi, def, jkl);
        writeln(abc);
        writeln(def);
        writeln(ghi);
        writeln(jkl);
    }
Aug 08 2017
On 8/8/17 3:43 PM, Anonymouse wrote:
> What would you suggest to use in its stead? My use-case is similar to the OP's in that I have a string of tokens that I want split into variables.

Using splitter(","), and then parsing each field with the appropriate function (e.g. to!). For example, for the OP's code, I would do:

    auto r = line.splitter(",");
    a = r.front;
    r.popFront;
    b = r.front;
    r.popFront;
    value = r.front.to!size_t;

It would be nice if formattedRead didn't use appender, and instead sliced, but I'm not sure it can be fixed.

Note, one could make a template that does this automatically in one line.

-Steve
Aug 09 2017
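The "template that does this automatically in one line" mentioned in the post above could look roughly like the following. This is only a sketch: splitInto is a hypothetical name, not a Phobos function, and the error handling (throwing on too few fields, silently ignoring extra ones) is an arbitrary choice made for the example.

```d
import std.algorithm.iteration : splitter;
import std.conv : to;
import std.exception : enforce;

/// Hypothetical helper: assigns each argument from the next delimited field.
/// Fields converted to the line's own type (e.g. char[] from a char[] line)
/// stay slices of the line; other types are parsed with std.conv.to.
void splitInto(Line, Args...)(Line line, char delim, ref Args args)
{
    auto fields = line.splitter(delim);
    foreach (i, T; Args)   // unrolled at compile time, one step per argument
    {
        enforce(!fields.empty, "too few fields in line");
        args[i] = fields.front.to!T;
        fields.popFront();
    }
}

void main()
{
    char[] line = "a,b,15".dup;
    char[] a, b;
    size_t value;

    line.splitInto(',', a, b, value);

    assert(a == "a" && b == "b" && value == 15);
    assert(a.ptr is line.ptr); // a is a slice of line, not a copy
}
```

Because a and b remain views into the line, the caveat about byLine's reused buffer still applies: they need an idup before being stored as AA keys.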
On Wednesday, 9 August 2017 at 13:36:46 UTC, Steven Schveighoffer wrote:
> using splitter(","), and then parsing each field using appropriate function (e.g. to!)

The blog post Steve referred to has examples of this type of processing while iterating over lines in a file. A couple of different ways to access the elements are shown. AA access is addressed also: https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/

--Jon
Aug 10 2017