digitalmars.D.learn - Parallel reads on std.container.array.Array


Arun Chandrasekaran <aruncxy gmail.com> writes:

I was wondering if std.container.array.Array supports threadsafe 
parallel reads similar to std::vector. I've created a small 
program for demonstration 
https://github.com/carun/parallel-read-tester

It works fine, apart from a couple of problems:

1. The D version takes way too long compared to the C++ version.

```
bash build-and-run.sh
g++ (Ubuntu 7.2.0-8ubuntu3) 7.2.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  
There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A 
PARTICULAR PURPOSE.

LDC - the LLVM D compiler (1.6.0):
   based on DMD v2.076.1 and LLVM 5.0.0
   built with LDC - the LLVM D compiler (1.6.0)
   Default target: x86_64-unknown-linux-gnu
   Host CPU: skylake
   http://dlang.org - http://wiki.dlang.org/LDC

   Registered Targets:
     aarch64    - AArch64 (little endian)
     aarch64_be - AArch64 (big endian)
     arm        - ARM
     arm64      - ARM64 (little endian)
     armeb      - ARM (big endian)
     nvptx      - NVIDIA PTX 32-bit
     nvptx64    - NVIDIA PTX 64-bit
     ppc32      - PowerPC 32
     ppc64      - PowerPC 64
     ppc64le    - PowerPC 64 LE
     thumb      - Thumb
     thumbeb    - Thumb (big endian)
     x86        - 32-bit X86: Pentium-Pro and above
     x86-64     - 64-bit X86: EM64T and AMD64

=== Starting CPP version ===
Took 3.7583 to load 2000000 items. Gonna search in parallel...
5 4000000
6 4000000
2 4000000
0 4000000
1 4000000
7 4000000
4 4000000
3 4000000
Took 7.0247 to search

=== Starting D version ===
Took 1 sec, 506 ms, 672 μs, and 4 hnsecs to load 2000000 items. 
Gonna search in parallel...
3 4000000
4 4000000
2 4000000
6 4000000
7 4000000
5 4000000
1 4000000
0 4000000
Took 13 secs, 53 ms, 790 μs, and 3 hnsecs to search.
```
2. I'm on an 8 CPU box and the D version doesn't seem to hit 800% 
CPU (max 720%). However, I can get 800% CPU usage with the C++ 
version.

3. Introducing a string in the struct Data results in 
"std.container.Array.reserve failed to allocate memory", whereas 
adding a similar std::string in the C++ struct seems to work fine.

Am I missing anything obvious here?

Also why doesn't std.container.array support an equivalent of 
std::vector::erase?

Cheers,
Arun

Dec 07 2017

Arun Chandrasekaran <aruncxy gmail.com> writes:

On Friday, 8 December 2017 at 07:34:53 UTC, Arun Chandrasekaran 
wrote:

 2. I'm on an 8 CPU box and I don't seem to hit 800% CPU with D 
 version (max 720%). However I can get 800% CPU usage with the 
 C++ version.

Please ignore this; it is caused by the write calls (output I/O).

Dec 08 2017

Arun Chandrasekaran <aruncxy gmail.com> writes:

On Friday, 8 December 2017 at 07:34:53 UTC, Arun Chandrasekaran 
wrote:
 I was wondering if std.container.array.Array supports 
 threadsafe parallel reads similar to std::vector. I've created 
 a small program for demonstration 
 https://github.com/carun/parallel-read-tester

 It works fine with just couple of problems though:

 1. D version takes way too long compared to C++ version.

My mistake (an IO bottleneck; std.stdio.write is probably 
flushing?)! The search timings are now close, within a few hundred 
milliseconds of each other. This holds not just for one run but 
across multiple runs. (I should probably test this on a Xeon server.)

=== Starting CPP version ===
Took 3.79253 to load 2000000 items. Gonna search in parallel...
4 400000000
1 400000000
3 400000000
2 400000000
6 400000000
7 400000000
5 400000000
0 400000000
Took 6.28018 to search

=== Starting D version ===
Took 1 sec, 474 ms, 869 μs, and 4 hnsecs to load 2000000 items. 
Gonna search in parallel...
0 400000000
1 400000000
2 400000000
7 400000000
6 400000000
4 400000000
3 400000000
5 400000000
Took 6 secs, 472 ms, 467 μs, and 8 hnsecs to search.

What puzzles me now is, what's wrong with the C++ version? :) 
Why is it so slow at loading the gallery (more than twice as slow 
as the D counterpart)? I thought std::vector::emplace_back should 
do a decent job. RVO in D?

 3. Introducing a string in the struct Data results in 
 "std.container.Array.reserve failed to allocate memory", 
 whereas adding a similar std::string in the C++ struct seems to 
 work fine.

Couldn't find the reason!

 Am I missing anything obvious here?

 Also why doesn't std.container.array support an equivalent of 
 std::vector::erase?
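
On the `std::vector::erase` question quoted above: the closest counterpart in `std.container.array.Array` appears to be `linearRemove`, which takes a range obtained from the same array. Below is a minimal sketch, assuming that is the operation wanted; `eraseRange` is a hypothetical helper name, not part of Phobos.

```d
import std.algorithm.comparison : equal;
import std.container.array : Array;

// Hypothetical helper: removes the elements at indices [from, to) in place,
// roughly like v.erase(v.begin() + from, v.begin() + to) in C++.
void eraseRange(ref Array!int arr, size_t from, size_t to)
{
    // arr[from .. to] is a range over this array; linearRemove drops
    // exactly those elements and shifts the tail down.
    arr.linearRemove(arr[from .. to]);
}

void main()
{
    auto arr = Array!int(1, 2, 3, 4, 5);

    eraseRange(arr, 1, 3);              // removes the values 2 and 3
    assert(equal(arr[], [1, 4, 5]));

    eraseRange(arr, 0, 1);              // removes a single element
    assert(arr.length == 2 && arr.front == 4);
}
```

As the name suggests, the cost is linear in the number of elements after the removed span, which matches what `std::vector::erase` does.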

Dec 08 2017

Kagamin <spam here.lot> writes:

On Friday, 8 December 2017 at 07:34:53 UTC, Arun Chandrasekaran 
wrote:
 I was wondering if std.container.array.Array supports 
 threadsafe parallel reads similar to std::vector.

No. Your code can also fail on a system with inconsistent caches, 
because data written by the writing thread can remain in its cache 
and not reach shared memory in time, or the reading threads can 
read from their own stale caches.

Dec 08 2017

Arun Chandrasekaran <aruncxy gmail.com> writes:

On Friday, 8 December 2017 at 10:01:14 UTC, Kagamin wrote:
 On Friday, 8 December 2017 at 07:34:53 UTC, Arun Chandrasekaran 
 wrote:
 I was wondering if std.container.array.Array supports 
 threadsafe parallel reads similar to std::vector.

 No, your code can also fail on a system with inconsistent cache 
 because data written by writing thread can remain in its cache 
 and not reach shared memory in time or reading threads can read 
 from their stale cache.

I'm OK with some delay between the writes and the reads. The same 
applies to writes and reads across processes. Between threads, at 
least, the impact/delay is minimal, whereas between processes it's 
even worse, as the page has to be reflected in all the mapped 
processes.

Dec 08 2017

Arun Chandrasekaran <aruncxy gmail.com> writes:

So I tried the same on a Haswell processor with LDC 1.6.0 and it 
crashes:

```
=== Starting D version ===
Took 1 sec, 107 ms, and 383 μs to load 1000000 items. Gonna 
search in parallel...
*** Error in `./dmain-ldc': double free or corruption (fasttop): 
0x0000000000edc6e0 ***
*** Error in `./dmain-ldc': double free or corruption (fasttop): 
0x0000000000edc6e0 ***
```

DMD on the other hand takes forever to run and doesn't complete.

Dec 08 2017

Arun Chandrasekaran <aruncxy gmail.com> writes:

On Saturday, 9 December 2017 at 01:34:40 UTC, Arun Chandrasekaran 
wrote:
 So I tried the same on Haswell processor with LDC 1.6.0 and it 
 crashes

 ```
 === Starting D version ===
 Took 1 sec, 107 ms, and 383 μs to load 1000000 items. Gonna 
 search in parallel...
 *** Error in `./dmain-ldc': double free or corruption 
 (fasttop): 0x0000000000edc6e0 ***
 *** Error in `./dmain-ldc': double free or corruption 
 (fasttop): 0x0000000000edc6e0 ***
 ```

Learnt (from David Nadlinger) that due to lifetime management of 
transitory ranges, they can't be used for parallel reads. 
Iterating by index has solved the problem.
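
The index-based approach might look roughly like the sketch below. It is not the code from the linked repository: the `Data` struct, the `countMatches` helper and the use of `std.parallelism` are all assumptions made for illustration. A likely explanation for the crash, also an assumption rather than something confirmed in this thread, is that every range taken from an `Array` holds its own reference-counted handle to the container, and that count is not updated atomically.

```d
import std.container.array : Array;
import std.parallelism : parallel;
import std.range : iota;

// Hypothetical stand-in for the struct used in the linked repository.
struct Data
{
    int id;
    int score;
}

// Every worker scans the whole array by index and counts matching items.
// Returns one count per worker.
size_t[] countMatches(ref Array!Data gallery, int wanted, int workers)
{
    auto counts = new size_t[](workers);

    // Work unit size 1: each worker id is handed out as its own task.
    foreach (w; parallel(iota(workers), 1))
    {
        size_t hits = 0;
        // Index access only: no Range is created inside the workers.
        foreach (i; 0 .. gallery.length)
        {
            if (gallery[i].score == wanted)
                ++hits;
        }
        counts[w] = hits; // each worker writes only its own slot
    }
    return counts;
}

void main()
{
    enum items = 10_000;

    Array!Data gallery;
    gallery.reserve(items);
    foreach (i; 0 .. items)
        gallery.insertBack(Data(i, i % 2));

    // The array is fully built before the parallel section starts.
    foreach (c; countMatches(gallery, 1, 8))
        assert(c == items / 2);
}
```

Each worker keeps its tally in a local variable and writes it to its own slot once, so no locking or atomics are needed for the counts themselves.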

However, accessing the items in the Array results in a value copy. 
Is that expected? How can I fix this?
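
One possible way around the copy, sketched under the assumption that `Array.opIndex` returns the element by reference (which is how it appears to be declared in Phobos for non-`bool` element types): the copy then comes from assigning the result to a new variable, and it can be avoided by taking a pointer or by passing the element to a function with a `ref` parameter. The `Data` struct and the `setId` helper here are hypothetical.

```d
import std.container.array : Array;

// Hypothetical element type for illustration.
struct Data
{
    int id;
}

// Hypothetical helper: receives the element by reference, so no copy is made.
void setId(ref Data d, int id)
{
    d.id = id;
}

void main()
{
    Array!Data arr;
    arr.insertBack(Data(1));

    // Assigning to a new variable copies the element;
    // changing the copy leaves the array untouched.
    auto copy = arr[0];
    copy.id = 99;
    assert(arr[0].id == 1);

    // Taking the address of the element avoids the copy.
    auto p = &arr[0];
    p.id = 42;
    assert(arr[0].id == 42);

    // Passing the element to a ref parameter also works in place.
    setId(arr[0], 7);
    assert(arr[0].id == 7);
}
```

A pointer obtained this way is only valid until the array reallocates or removes the element, so it is best kept short-lived.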

http://forum.dlang.org/post/cfhkszdbkaezprbzrnlc@forum.dlang.org

Dec 08 2017
