digitalmars.D.learn - Parallel reads on std.container.array.Array
- Arun Chandrasekaran (70/70) Dec 07 2017 I was wondering if std.container.array.Array supports threadsafe
- Arun Chandrasekaran (3/6) Dec 08 2017 Please ignore, this is because of the write.
- Arun Chandrasekaran (34/47) Dec 08 2017 My mistake (IO bottleneck, std.stdio.write is probably
- Kagamin (6/8) Dec 08 2017 No, your code can also fail on a system with inconsistent cache
- Arun Chandrasekaran (6/14) Dec 08 2017 I'm OK with some delay between the writes and the reads. The same
- Arun Chandrasekaran (12/12) Dec 08 2017 So I tried the same on Haswell processor with LDC 1.6.0 and it
- Arun Chandrasekaran (8/19) Dec 08 2017 Learnt (from David Nadlinger) that due to lifetime management of
I was wondering if std.container.array.Array supports threadsafe parallel reads similar to std::vector. I've created a small program for demonstration: https://github.com/carun/parallel-read-tester

It works fine, with just a couple of problems though:

1. D version takes way too long compared to C++ version.

```
bash build-and-run.sh
g++ (Ubuntu 7.2.0-8ubuntu3) 7.2.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

LDC - the LLVM D compiler (1.6.0):
  based on DMD v2.076.1 and LLVM 5.0.0
  built with LDC - the LLVM D compiler (1.6.0)
  Default target: x86_64-unknown-linux-gnu
  Host CPU: skylake
  http://dlang.org - http://wiki.dlang.org/LDC

  Registered Targets:
    aarch64    - AArch64 (little endian)
    aarch64_be - AArch64 (big endian)
    arm        - ARM
    arm64      - ARM64 (little endian)
    armeb      - ARM (big endian)
    nvptx      - NVIDIA PTX 32-bit
    nvptx64    - NVIDIA PTX 64-bit
    ppc32      - PowerPC 32
    ppc64      - PowerPC 64
    ppc64le    - PowerPC 64 LE
    thumb      - Thumb
    thumbeb    - Thumb (big endian)
    x86        - 32-bit X86: Pentium-Pro and above
    x86-64     - 64-bit X86: EM64T and AMD64
=== Starting CPP version ===
Took 3.7583 to load 2000000 items. Gonna search in parallel...
5 4000000
6 4000000
2 4000000
0 4000000
1 4000000
7 4000000
4 4000000
3 4000000
Took 7.0247 to search
=== Starting D version ===
Took 1 sec, 506 ms, 672 μs, and 4 hnsecs to load 2000000 items. Gonna search in parallel...
3 4000000
4 4000000
2 4000000
6 4000000
7 4000000
5 4000000
1 4000000
0 4000000
Took 13 secs, 53 ms, 790 μs, and 3 hnsecs to search.
```

2. I'm on an 8 CPU box and I don't seem to hit 800% CPU with D version (max 720%). However I can get 800% CPU usage with the C++ version.

3. Introducing a string in the struct Data results in "std.container.Array.reserve failed to allocate memory", whereas adding a similar std::string in the C++ struct seems to work fine.

Am I missing anything obvious here?
Also, why doesn't std.container.array support an equivalent of std::vector::erase?

Cheers,
Arun
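On the std::vector::erase question, the closest counterpart on Array appears to be `linearRemove`, which takes a range obtained from the same container. The following is a minimal sketch of that call; the helper name `eraseRange` is made up for illustration and is not part of Phobos.

```d
import std.algorithm.comparison : equal;
import std.container.array : Array;

// Hypothetical helper: removes the elements at indices [from, to) in place,
// roughly what v.erase(v.begin() + from, v.begin() + to) does in C++.
void eraseRange(ref Array!int a, size_t from, size_t to)
{
    // linearRemove expects a range sliced from the container itself.
    a.linearRemove(a[from .. to]);
}

void main()
{
    auto a = Array!int(10, 20, 30, 40, 50);
    eraseRange(a, 1, 3); // drops 20 and 30
    assert(equal(a[], [10, 40, 50]));
}
```

As with std::vector::erase, the elements after the removed span are shifted down, so the cost is linear in the number of elements that follow.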
Dec 07 2017
On Friday, 8 December 2017 at 07:34:53 UTC, Arun Chandrasekaran wrote:
> 2. I'm on an 8 CPU box and I don't seem to hit 800% CPU with D version (max 720%). However I can get 800% CPU usage with the C++ version.

Please ignore, this is because of the write.
Dec 08 2017
On Friday, 8 December 2017 at 07:34:53 UTC, Arun Chandrasekaran wrote:
> I was wondering if std.container.array.Array supports threadsafe parallel reads similar to std::vector. I've created a small program for demonstration: https://github.com/carun/parallel-read-tester
>
> It works fine, with just a couple of problems though:
>
> 1. D version takes way too long compared to C++ version.

My mistake (IO bottleneck, std.stdio.write is probably flushing?)! The timings are now close enough, within milliseconds of each other. This is not just with one run, but multiple runs. (I should probably test this on a Xeon server.)

=== Starting CPP version ===
Took 3.79253 to load 2000000 items. Gonna search in parallel...
4 400000000
1 400000000
3 400000000
2 400000000
6 400000000
7 400000000
5 400000000
0 400000000
Took 6.28018 to search
=== Starting D version ===
Took 1 sec, 474 ms, 869 μs, and 4 hnsecs to load 2000000 items. Gonna search in parallel...
0 400000000
1 400000000
2 400000000
7 400000000
6 400000000
4 400000000
3 400000000
5 400000000
Took 6 secs, 472 ms, 467 μs, and 8 hnsecs to search.

The one that puzzles me is, what's wrong with the CPP version? :) Why is it slow loading the gallery (more than twice as slow as the D counterpart)? I thought std::vector::emplace_back should do a decent job. RVO in D?

> 2. Introducing a string in the struct Data results in "std.container.Array.reserve failed to allocate memory", whereas adding a similar std::string in the C++ struct seems to work fine.

Couldn't find the reason!

> Am I missing anything obvious here?
>
> Also why doesn't std.container.array support an equivalent of std::vector::erase?
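The IO bottleneck described above can be kept out of the measurement by letting each worker store its result and printing only after the stopwatch has stopped. This is a minimal sketch under assumptions: `searchInParallel` and the odd-number count are stand-ins, not the code from the linked repository.

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.parallelism : parallel;
import std.range : iota;
import std.stdio : writeln;

// Hypothetical stand-in for the real search: each worker counts the odd
// numbers below one million and stores the result in its own slot.
size_t[] searchInParallel(size_t workers)
{
    auto counts = new size_t[](workers);
    // A work unit size of 1 hands each index to the pool separately.
    foreach (w; parallel(iota(workers), 1))
    {
        size_t n;
        foreach (i; 0 .. 1_000_000)
            n += i & 1;
        counts[w] = n; // no console output inside the timed region
    }
    return counts;
}

void main()
{
    auto sw = StopWatch(AutoStart.yes);
    auto counts = searchInParallel(8);
    sw.stop();

    // Printing happens only after the stopwatch has stopped.
    foreach (w, n; counts)
        writeln(w, " ", n);
    writeln("Took ", sw.peek, " to search.");
}
```

Because `writeln` runs after `sw.stop()`, any flushing or locking on stdout no longer shows up in the reported search time.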
Dec 08 2017
On Friday, 8 December 2017 at 07:34:53 UTC, Arun Chandrasekaran wrote:
> I was wondering if std.container.array.Array supports threadsafe parallel reads similar to std::vector.

No, your code can also fail on a system with inconsistent caches, because data written by the writing thread can remain in its cache and not reach shared memory in time, or reading threads can read from their stale cache.
Dec 08 2017
On Friday, 8 December 2017 at 10:01:14 UTC, Kagamin wrote:
> On Friday, 8 December 2017 at 07:34:53 UTC, Arun Chandrasekaran wrote:
>> I was wondering if std.container.array.Array supports threadsafe parallel reads similar to std::vector.
>
> No, your code can also fail on a system with inconsistent cache because data written by writing thread can remain in its cache and not reach shared memory in time or reading threads can read from their stale cache.

I'm OK with some delay between the writes and the reads. The same applies to the writes and reads across processes. At least between threads the impact/delay is minimal, whereas between processes it's even worse, as the page will have to be reflected in all the mapped processes.
Dec 08 2017
So I tried the same on Haswell processor with LDC 1.6.0 and it crashes

```
=== Starting D version ===
Took 1 sec, 107 ms, and 383 μs to load 1000000 items. Gonna search in parallel...
*** Error in `./dmain-ldc': double free or corruption (fasttop): 0x0000000000edc6e0 ***
*** Error in `./dmain-ldc': double free or corruption (fasttop): 0x0000000000edc6e0 ***
```

DMD on the other hand takes forever to run and doesn't complete.
Dec 08 2017
On Saturday, 9 December 2017 at 01:34:40 UTC, Arun Chandrasekaran wrote:
> So I tried the same on Haswell processor with LDC 1.6.0 and it crashes
>
> ```
> === Starting D version ===
> Took 1 sec, 107 ms, and 383 μs to load 1000000 items. Gonna search in parallel...
> *** Error in `./dmain-ldc': double free or corruption (fasttop): 0x0000000000edc6e0 ***
> *** Error in `./dmain-ldc': double free or corruption (fasttop): 0x0000000000edc6e0 ***
> ```

Learnt (from David Nadlinger) that due to lifetime management of transitory ranges, they can't be used for parallel reads. Iterating by index has solved the problem. However, accessing the items in Array results in a value copy. Is that expected? How can I fix this?

http://forum.dlang.org/post/cfhkszdbkaezprbzrnlc
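A minimal sketch of the index-based approach, with one possible way around the value copy. It assumes that `Array.opIndex` returns by `ref` in the Phobos version in use, so that taking the address of `gallery[i]` yields a pointer into the container's storage. The `Data` struct and the `collectIds` helper are hypothetical, since the thread does not show the real ones.

```d
import std.container.array : Array;
import std.parallelism : parallel;
import std.range : iota;

// Hypothetical element type; the thread's actual struct is not shown.
struct Data
{
    int id;
    double[4] features;
}

// Reads every element in parallel by index and returns the ids it saw.
// The container must not be modified while this runs.
int[] collectIds(ref Array!Data gallery)
{
    // One slot per element, so no two workers write to the same location.
    auto ids = new int[](gallery.length);
    foreach (i; parallel(iota(gallery.length)))
    {
        // `auto d = gallery[i];` would copy the whole struct. Taking the
        // address of the ref returned by opIndex (assumed) avoids the copy.
        const(Data)* d = &gallery[i];
        ids[i] = d.id;
    }
    return ids;
}

void main()
{
    Array!Data gallery;
    gallery.reserve(1_000);
    foreach (i; 0 .. 1_000)
        gallery.insertBack(Data(i));

    auto ids = collectIds(gallery);
    assert(ids[0] == 0 && ids[$ - 1] == 999);
}
```

No range over the Array is created inside the loop, so the workers do not touch the reference count that a slice such as `gallery[]` would carry. The pointer stays valid only as long as the Array is not resized.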
Dec 08 2017