digitalmars.D.learn - d word counting approach performs well but has higher mem usage

reply dwdv <dwdv posteo.de> writes:
Hi there,

the task is simple: count word occurrences from stdin (around 150mb in 
this case) and print sorted results to stdout in a somewhat idiomatic 
fashion.

Now, d is quite elegant while maintaining high performance compared to 
both c and c++, but I, as a complete beginner, can't identify where the 
10x memory usage (~300mb, see results below) is coming from.

Unicode overhead? Internal buffer? Is something slurping the whole file? 
Assoc array allocations? Couldn't find huge allocs with dmd -vgc and 
-profile=gc either. What did I do wrong?

```d
void main()
{
     import std.stdio, std.algorithm, std.range;

     int[string] count;
     foreach(const word; stdin.byLine.map!splitter.joiner) {
         ++count[word];
     }

     //or even:
     //foreach(line; stdin.byLine) {
     //    foreach(const word; line.splitter) {
     //        ++count[word];
     //    }
     //}

     count.byKeyValue
         .array
         .sort!((a, b) => a.value > b.value)
         .each!(a => writefln("%d %s", a.value, a.key));
}
```

C++ (for reference):

```cpp
#include <iostream>
#include <vector>
#include <unordered_map>
#include <algorithm>

using namespace std;

int main() {
     string s;
     unordered_map<string, int> count;
     std::ios::sync_with_stdio(false);
     while (cin >> s) {
         count[s]++;
     }

     vector<pair<string, int>> temp {begin(count), end(count)};
     sort(begin(temp), end(temp),
         [](const auto& a, const auto& b) {return b.second < a.second;});
     for (const auto& elem : temp) {
         cout << elem.second << " " << elem.first << '\n';
     }
}
```

Results on an old celeron dual core (wall clock and res mem):
0:08.78, 313732 kb <= d dmd
0:08.25, 318084 kb <= d ldc
0:08.38, 38512 kb  <= c++ idiomatic (above)
0:07.76, 30276 kb  <= c++ boost
0:08.42, 26756 kb  <= c verbose, hand-rolled hashtable

Mem and time measured like so:
/usr/bin/time -v $cmd < input >/dev/null

Input words file creation (around 300k * 50 words):
tr '\n' ' ' < /usr/share/dict/$lang > joined
for i in {1..50}; do cat joined >> input; done

word count sample output:
[... snip ...]
50 ironsmith
50 gloried
50 quindecagon
50 directory's
50 hydrobiological

Compilation flags:
dmd -O -release -mcpu=native -ofwc-d-dmd wc.d
ldc2 -O3 -release -flto=full -mcpu=native -ofwc-d-ldc wc.d
clang -std=c11 -O3 -march=native -flto -o wp-c-clang wp.c
clang++ -std=c++17 -O3 -march=native -flto -o wp-cpp-clang wp-boost.cpp

Versions:
dmd: v2.082.1
ldc: 1.12.0 (based on DMD v2.082.1 and LLVM 6.0.1)
llvm/clang: 6.0.1
Nov 03 2018
next sibling parent reply Stanislav Blinov <stanislav.blinov gmail.com> writes:
On Saturday, 3 November 2018 at 14:26:02 UTC, dwdv wrote:

 Assoc array allocations?
Yup. AAs do keep their memory around (supposedly for reuse). You can insert calls to GC.stats (import core.memory) at various points to see actual GC heap usage. If you don't touch that AA at all, you'll only use up a few KB of the GC heap when reading the file. Why it consumes so much is a question for the implementation.
 What did I do wrong?
Well, you didn't actually put the keys into the AA ;) I'm guessing you didn't look closely at the output, otherwise you would've noticed that something was wrong.

AAs want immutable keys. .byLine returns (in this case) a char[]. It's a slice of its internal buffer, which is reused for reading each line; it gets overwritten on every iteration. This way the reading loop only consumes as much memory as the longest line requires. But a `char[]` is not a `string`, and you wouldn't be able to index the AA with it:

```
Error: associative arrays can only be assigned values with immutable keys, not char[]
```

But by putting `const` in the `foreach` you tricked the compiler into letting you index the AA with a (supposedly) const key. Which, of course, went fine as far as insertions/updates were concerned, since the hashes still matched. But when you iterate later, pretty much every key is in fact a reference to some older memory that is still somewhere on the GC heap; you don't get a segfault, but neither do you get correct "words". You *need* to have an actual `string` when you first insert into the AA.
```d
void main()
{
    import std.stdio, std.algorithm, std.range;
    import core.memory;

    int[string] count;

    void updateCount(char[] word)
    {
        auto ptr = word in count;
        if (!ptr)
            count[word.idup] = 1; // note the .idup!
        else
            (*ptr)++;
    }

    foreach (word; stdin.byLine.map!splitter.joiner) { // no const!
        updateCount(word);
    }

    //or even:
    //foreach (line; stdin.byLine) {
    //    foreach (word; line.splitter) { // no const!
    //        updateCount(word);
    //    }
    //}

    writeln(GC.stats);
    GC.collect;
    writeln(GC.stats);

    count.byKeyValue
        .array
        .sort!((a, b) => a.value > b.value)
        .each!(a => writefln("%d %s", a.value, a.key));

    writeln(GC.stats);
    GC.collect;
    writeln(GC.stats);
}
```
Note that if you .clear() and even .destroy() the AA, it'll still keep a bunch of memory allocated. I guess built-in AAs just love to hoard.
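A minimal standalone way to watch that behaviour (my own sketch, not code from this thread): fill an AA, then clear and destroy it, printing GC.stats around each step. Whether usedSize drops after a collection depends on the GC implementation; resident memory as seen by the OS typically stays put either way.

```d
// Hedged sketch: observing AA memory retention via GC.stats.
// Fills an AA, then clears/destroys it and forces a collection.
void main()
{
    import std.stdio : writeln;
    import std.conv : to;
    import core.memory : GC;

    int[string] aa;
    foreach (i; 0 .. 100_000)
        aa[i.to!string] = i;           // allocate lots of keys and buckets

    writeln("after fill:    ", GC.stats.usedSize);

    aa.clear();                        // drop all entries
    destroy(aa);                       // reset the AA reference itself
    GC.collect();
    writeln("after destroy: ", GC.stats.usedSize);
}
```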
Nov 03 2018
parent dwdv <dwdv posteo.de> writes:
 Assoc array allocations?
Yup. AAs do keep their memory around (supposedly for reuse). [...] Why it consumes so much is a question to the implementation. [...] I guess built-in AAs just love to hoard.
What a darn shame. This way I'm missing out on all those slick internet benchmark points. :) Guess I have to find a workaround then, since it's swapping memory like crazy on larger real-world inputs on my fairly low-end machine.
 What did I do wrong?
Well, you didn't actually put the keys into the AA ;) I'm guessing you didn't look closely at the output, otherwise you would've noticed that something was wrong. [...] ``` Error: associative arrays can only be assigned values with immutable keys, not char[] ``` [...] But when you iterate later, pretty much every key is in fact a reference to some older memory, which is still somewhere on the GC heap; you don't get a segfault, but neither do you get correct "words".
You are absolutely right; I dismissed the aforementioned error without a second thought as soon as the compiler stopped complaining after I threw a const declaration at it. Oh well, I should have read the assoc array spec page more thoroughly, since it contains this very snippet:

```d
auto word = line[wordStart .. wordEnd];
++dictionary[word.idup];  // increment count for word
```

And yes, using more irregular log files as input instead of the concatenated uniform dict reveals quite a bit of nonsense being printed to stdout. Thank you, Stanislav, for taking the time to explain all this.
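For completeness, here is a minimal corrected version of the original program (my own paraphrase of the fix, not code from the spec): the one change is `word.idup` at the insertion site. It is correct, but it allocates a copy for every word rather than only for first occurrences, which is what the single-lookup version earlier in the thread avoids.

```d
// Hedged sketch: the original program with the minimal fix applied.
// `word.idup` copies each word into immutable memory so it can serve
// as a string key. Correct, but allocates on *every* word.
void main()
{
    import std.stdio, std.algorithm, std.range;

    int[string] count;
    foreach (word; stdin.byLine.map!splitter.joiner)
        ++count[word.idup];            // copy key into immutable memory

    count.byKeyValue
        .array
        .sort!((a, b) => a.value > b.value)
        .each!(a => writefln("%d %s", a.value, a.key));
}
```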
Nov 04 2018
prev sibling parent Jon Degenhardt <jond noreply.com> writes:
On Saturday, 3 November 2018 at 14:26:02 UTC, dwdv wrote:
 Hi there,

 the task is simple: count word occurrences from stdin (around 
 150mb in this case) and print sorted results to stdout in a 
 somewhat idiomatic fashion.

 Now, d is quite elegant while maintaining high performance 
 compared to both c and c++, but I, as a complete beginner, 
 can't identify where the 10x memory usage (~300mb, see results 
 below) is coming from.

 Unicode overhead? Internal buffer? Is something slurping the 
 whole file? Assoc array allocations? Couldn't find huge allocs 
 with dmd -vgc and -profile=gc either. What did I do wrong?
Not exactly the same problem, but there is relevant discussion in a blog post I wrote a while ago: https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/

See in particular the section on associative array lookup optimization. This takes advantage of the fact that it's only necessary to create the immutable string the first time a key is entered into the hash; subsequent occurrences do not need to take this step. As creating the string allocates new memory, even if only used temporarily, this is a meaningful savings. There have been additional APIs added to the AA interface since I wrote the blog post; I believe it is now possible to accomplish the same thing with more succinct code.

Other optimization possibilities:

* Avoid auto-decode: Not sure if your code is hitting this, but if so it's a significant performance hit. Unfortunately, it's not always obvious when this is happening. The task you are performing doesn't need auto-decode because it is splitting on single-byte utf-8 char boundaries (newline and space).

* LTO on druntime/phobos: This is easy and will give a material speedup. Simply add '-defaultlib=phobos2-ldc-lto,druntime-ldc-lto' to the 'ldc2' build line, after the '-flto=full' entry. This will be a win because it will enable a number of optimizations in the inner loop.

* Reading the whole file vs line by line: 'byLine' is really fast. It's also nice and general, as it allows reading arbitrary-size files or standard input without changes to the code. However, it's not as fast as reading the file in a single shot.

* std.algorithm.joiner: Has improved dramatically, but is still slower than a foreach loop. See: https://github.com/dlang/phobos/pull/6492

--Jon
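As an illustration of the "single shot" point (a sketch of my own, assuming the input is passed as a file argument rather than on stdin, and fits in memory): slurping the text once makes every word an immutable slice of one buffer, so the AA can use the slices as keys directly, with no per-word idup at all.

```d
// Hedged sketch: read the whole file once and count in-place slices.
// Every word returned by splitter is a slice of `text`, which is
// already immutable (a string), so it can be an AA key without idup.
void main(string[] args)
{
    import std.stdio, std.algorithm, std.range;
    import std.file : readText;

    auto text = readText(args[1]);     // one allocation for the whole input

    int[string] count;
    foreach (word; text.splitter)      // whitespace splitting; slices, no copies
        ++count[word];

    count.byKeyValue
        .array
        .sort!((a, b) => a.value > b.value)
        .each!(a => writefln("%d %s", a.value, a.key));
}
```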
Nov 04 2018