digitalmars.D.learn - d word counting approach performs well but has higher mem usage

reply dwdv <dwdv posteo.de> writes:
Hi there,

the task is simple: count word occurrences from stdin (around 150mb in 
this case) and print sorted results to stdout in a somewhat idiomatic 
fashion.

Now, d is quite elegant while maintaining high performance compared to 
both c and c++, but I, as a complete beginner, can't identify where the 
10x memory usage (~300mb, see results below) is coming from.

Unicode overhead? Internal buffer? Is something slurping the whole file? 
Assoc array allocations? Couldn't find huge allocs with dmd -vgc and 
-profile=gc either. What did I do wrong?

```d
void main()
{
     import std.stdio, std.algorithm, std.range;

     int[string] count;
     foreach(const word; stdin.byLine.map!splitter.joiner) {
         ++count[word];
     }

     //or even:
     //foreach(line; stdin.byLine) {
     //    foreach(const word; line.splitter) {
     //        ++count[word];
     //    }
     //}

     count.byKeyValue
         .array
         .sort!((a, b) => a.value > b.value)
         .each!(a => writefln("%d %s", a.value, a.key));
}
```

C++ (for reference):

```cpp
#include <iostream>
#include <vector>
#include <unordered_map>
#include <algorithm>

using namespace std;

int main() {
     string s;
     unordered_map<string, int> count;
     std::ios::sync_with_stdio(false);
     while (cin >> s) {
         count[s]++;
     }

     vector<pair<string, int>> temp {begin(count), end(count)};
     sort(begin(temp), end(temp),
         [](const auto& a, const auto& b) {return b.second < a.second;});
     for (const auto& elem : temp) {
         cout << elem.second << " " << elem.first << '\n';
     }
}
```

Results on an old celeron dual core (wall clock and res mem):
0:08.78, 313732 kb <= d dmd
0:08.25, 318084 kb <= d ldc
0:08.38, 38512 kb  <= c++ idiomatic (above)
0:07.76, 30276 kb  <= c++ boost
0:08.42, 26756 kb  <= c verbose, hand-rolled hashtable

Mem and time measured like so:
/usr/bin/time -v $cmd < input >/dev/null

Input words file creation (around 300k * 50 words):
tr '\n' ' ' < /usr/share/dict/$lang > joined
for i in {1..50}; do cat joined >> input; done

word count sample output:
[... snip ...]
50 ironsmith
50 gloried
50 quindecagon
50 directory's
50 hydrobiological

Compilation flags:
dmd -O -release -mcpu=native -ofwc-d-dmd wc.d
ldc2 -O3 -release -flto=full -mcpu=native -ofwc-d-ldc wc.d
clang -std=c11 -O3 -march=native -flto -o wp-c-clang wp.c
clang++ -std=c++17 -O3 -march=native -flto -o wp-cpp-clang wp-boost.cpp

Versions:
dmd: v2.082.1
ldc: 1.12.0 (based on DMD v2.082.1 and LLVM 6.0.1)
llvm/clang: 6.0.1
Nov 03 2018
next sibling parent reply Stanislav Blinov <stanislav.blinov gmail.com> writes:
On Saturday, 3 November 2018 at 14:26:02 UTC, dwdv wrote:

 Assoc array allocations?
Yup. AAs do keep their memory around (supposedly for reuse). You can insert calls to GC.stats (import core.memory) at various points to see actual GC heap usage. If you don't touch that AA at all, you'll only use up a few KB of the GC heap when reading the file. Why it consumes so much is a question for the implementation.
 What did I do wrong?
Well, you didn't actually put the keys into the AA ;) I'm guessing you didn't look closely at the output, otherwise you would've noticed that something was wrong.

AAs want immutable keys. .byLine returns (in this case) a char[]. It's a slice of its internal buffer, which is reused for reading each line; it gets overwritten on every iteration. This way the reading loop only consumes as much memory as the longest line requires. But a `char[]` is not a `string`, and you wouldn't be able to index the AA with it:

```
Error: associative arrays can only be assigned values with immutable keys, not char[]
```

But by putting `const` in the `foreach` you tricked the compiler into letting you index the AA with a (supposedly) const key. Which, of course, went fine as far as insertions/updates were concerned, since the hashes still matched. But when you iterate later, pretty much every key is in fact a reference to some older memory that is still somewhere on the GC heap; you don't get a segfault, but neither do you get correct "words". You *need* to have an actual `string` when you first insert into the AA.
```d
void main()
{
    import std.stdio, std.algorithm, std.range;
    import core.memory;

    int[string] count;

    void updateCount(char[] word)
    {
        auto ptr = word in count;
        if (!ptr)
            count[word.idup] = 1; // note the .idup!
        else
            (*ptr)++;
    }

    foreach (word; stdin.byLine.map!splitter.joiner) { // no const!
        updateCount(word);
    }

    //or even:
    //foreach (line; stdin.byLine) {
    //    foreach (word; line.splitter) { // no const!
    //        updateCount(word);
    //    }
    //}

    writeln(GC.stats);
    GC.collect;
    writeln(GC.stats);

    count.byKeyValue
        .array
        .sort!((a, b) => a.value > b.value)
        .each!(a => writefln("%d %s", a.value, a.key));

    writeln(GC.stats);
    GC.collect;
    writeln(GC.stats);
}
```
Note that if you .clear() and even .destroy() the AA, it'll still keep a bunch of memory allocated. I guess built-in AAs just love to hoard.
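A minimal standalone way to watch that behaviour (my own sketch, not code from this thread): fill an AA, then clear and destroy it, printing GC.stats around each step. Whether usedSize drops after a collection depends on the GC implementation; resident memory as seen by the OS typically stays put either way.

```d
// Hedged sketch: observing AA memory retention via GC.stats.
// Fills an AA, then clears/destroys it and forces a collection.
void main()
{
    import std.stdio : writeln;
    import std.conv : to;
    import core.memory : GC;

    int[string] aa;
    foreach (i; 0 .. 100_000)
        aa[i.to!string] = i;           // allocate lots of keys and buckets

    writeln("after fill:    ", GC.stats.usedSize);

    aa.clear();                        // drop all entries
    destroy(aa);                       // reset the AA reference itself
    GC.collect();
    writeln("after destroy: ", GC.stats.usedSize);
}
```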
Nov 03 2018
parent dwdv <dwdv posteo.de> writes:
 Assoc array allocations?
Yup. AAs do keep their memory around (supposedly for reuse). [...] Why it consumes so much is a question to the implementation. [...] I guess built-in AAs just love to hoard.
What a darn shame. This way I'm missing out on all those slick internet benchmark points. :) Guess I have to find a workaround then, since it's swapping memory like crazy on larger real-world inputs on my fairly low-end machine.
 What did I do wrong?
Well, you didn't actually put the keys into the AA ;) I'm guessing you didn't look closely at the output, otherwise you would've noticed that something was wrong. [...] ``` Error: associative arrays can only be assigned values with immutable keys, not char[] ``` [...] But when you iterate later, pretty much every key is in fact a reference to some older memory, which is still somewhere on the GC heap; you don't get a segfault, but neither do you get correct "words".
You are absolutely right; I dismissed the aforementioned error without a second thought as soon as the compiler stopped complaining after I threw a const declaration at it. Oh well, I should have read the assoc array spec page more thoroughly, since it contains this very snippet:

```d
auto word = line[wordStart .. wordEnd];
++dictionary[word.idup];  // increment count for word
```

And yes, using more irregular log files as input instead of the concatenated uniform dict reveals quite a bit of nonsense being printed to stdout. Thank you, Stanislav, for taking the time to explain all this.
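For completeness, here is a minimal corrected version of the original program (my own paraphrase of the fix, not code from the spec): the one change is `word.idup` at the insertion site. It is correct, but it allocates a copy for every word rather than only for first occurrences, which is what the single-lookup version earlier in the thread avoids.

```d
// Hedged sketch: the original program with the minimal fix applied.
// `word.idup` copies each word into immutable memory so it can serve
// as a string key. Correct, but allocates on *every* word.
void main()
{
    import std.stdio, std.algorithm, std.range;

    int[string] count;
    foreach (word; stdin.byLine.map!splitter.joiner)
        ++count[word.idup];            // copy key into immutable memory

    count.byKeyValue
        .array
        .sort!((a, b) => a.value > b.value)
        .each!(a => writefln("%d %s", a.value, a.key));
}
```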
Nov 04 2018
prev sibling parent Jon Degenhardt <jond noreply.com> writes:
On Saturday, 3 November 2018 at 14:26:02 UTC, dwdv wrote:
 Hi there,

 the task is simple: count word occurrences from stdin (around 
 150mb in this case) and print sorted results to stdout in a 
 somewhat idiomatic fashion.

 Now, d is quite elegant while maintaining high performance 
 compared to both c and c++, but I, as a complete beginner, 
 can't identify where the 10x memory usage (~300mb, see results 
 below) is coming from.

 Unicode overhead? Internal buffer? Is something slurping the 
 whole file? Assoc array allocations? Couldn't find huge allocs 
 with dmd -vgc and -profile=gc either. What did I do wrong?
Not exactly the same problem, but there is relevant discussion in a blog post I wrote a while ago: https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/

See in particular the section on associative array lookup optimization. This takes advantage of the fact that it's only necessary to create the immutable string the first time a key is entered into the hash; subsequent occurrences do not need to take this step. As creating the string allocates new memory, even if only used temporarily, this is a meaningful savings. There have been additional APIs added to the AA interface since I wrote the blog post; I believe it is now possible to accomplish the same thing with more succinct code.

Other optimization possibilities:

* Avoid auto-decode: Not sure if your code is hitting this, but if so it's a significant performance hit. Unfortunately, it's not always obvious when this is happening. The task you are performing doesn't need auto-decode because it is splitting on single-byte utf-8 char boundaries (newline and space).

* LTO on druntime/phobos: This is easy and will give a material speedup. Simply add '-defaultlib=phobos2-ldc-lto,druntime-ldc-lto' to the 'ldc2' build line, after the '-flto=full' entry. This will be a win because it will enable a number of optimizations in the inner loop.

* Reading the whole file vs line by line: 'byLine' is really fast. It's also nice and general, as it allows reading arbitrary-size files or standard input without changes to the code. However, it's not as fast as reading the file in a single shot.

* std.algorithm.joiner: Has improved dramatically, but is still slower than a foreach loop. See: https://github.com/dlang/phobos/pull/6492

--Jon
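As an illustration of the "single shot" point (a sketch of my own, assuming the input is passed as a file argument rather than on stdin, and fits in memory): slurping the text once makes every word an immutable slice of one buffer, so the AA can use the slices as keys directly, with no per-word idup at all.

```d
// Hedged sketch: read the whole file once and count in-place slices.
// Every word returned by splitter is a slice of `text`, which is
// already immutable (a string), so it can be an AA key without idup.
void main(string[] args)
{
    import std.stdio, std.algorithm, std.range;
    import std.file : readText;

    auto text = readText(args[1]);     // one allocation for the whole input

    int[string] count;
    foreach (word; text.splitter)      // whitespace splitting; slices, no copies
        ++count[word];

    count.byKeyValue
        .array
        .sort!((a, b) => a.value > b.value)
        .each!(a => writefln("%d %s", a.value, a.key));
}
```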
Nov 04 2018