
digitalmars.D - fork/exec performance problem?

Johan Holmberg via Digitalmars-d <digitalmars-d puremagic.com> writes:
Hi!

I'm trying to write a simple D program to emulate "parallel -u -jN", i.e.
running a number of commands in parallel to take advantage of a multicore
machine (I'm testing on a 24-core Ubuntu machine).

I have written almost equivalent programs in C++ and D, and hoped that they
would run equally fast. But the performance of the D version degrades as the
number of commands increases, and I don't understand why. Maybe I'm using D
incorrectly? Or is it the garbage collector kicking in (even though I hope I
don't allocate much memory after the initial setup)?

My first test case consisted of a file with 85000 C/C++ compilation
commands, to be run 24 at a time in parallel. Most source files are really
small (different modules in the runtime library of a C/C++ compiler for
embedded development, built in different flavors).

If I invoke the D program 9 times with around 10000 (85000/9 to be exact)
commands each time, it performs almost on par with the C++ version. But
with all 85000 files in one invocation, the D version takes 1.5 times as
long (6min 30s --> 10min).

My programs (C++ and D) are really simple:

1) read all commands from STDIN into an array in the program
2) iterate over the array and keep N programs running at all times
3) start new programs with "fork/exec"
4) wait for finished programs with "waitpid"
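
To give an idea of the structure, here is a minimal sketch of those four
steps. This is not my actual program (which is linked below and does the
fork/exec by hand); the sketch uses std.process.spawnShell for step 3, reaps
children with POSIX waitpid() for step 4, and the names are made up:

import std.process : spawnShell;
import std.stdio;
import core.sys.posix.sys.wait : waitpid;

void main(string[] args)
{
    import std.conv : to;
    immutable maxRunning = args.length > 1 ? to!int(args[1]) : 1;

    // 1) read all commands from STDIN into an array
    string[] commands;
    foreach (line; stdin.byLine())
        commands ~= line.idup;

    size_t next = 0;
    int running = 0;

    // 2) keep maxRunning commands in flight at all times
    while (next < commands.length || running > 0)
    {
        // 3) start new programs (here via spawnShell; the returned Pid is
        //    ignored because the children are reaped with waitpid below)
        while (next < commands.length && running < maxRunning)
        {
            spawnShell(commands[next]);
            next++;
            running++;
        }

        // 4) wait for any finished child; error handling omitted
        int status;
        if (waitpid(-1, &status, 0) > 0)
            running--;
    }
}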

If I compare the start of an 85000-run and a 10000-run, the 85000-run is
slower right from the start. I don't understand why; the only difference
must be that the 85000-run has allocated a bigger array.

My D program can be viewed at:


https://bitbucket.org/holmberg556/examples/src/79ef65e389346e9957c535b77201a829af9c62f2/parallel_exec/parallel_exec_dlang.d

Any help would be appreciated.

/Johan Holmberg
Jun 14 2016
rikki cattermole <rikki cattermole.co.nz> writes:
This is more appropriate for D.learn.

A few things: disable the GC and force a collect in that while loop. Next,
you're allocating hugely. I would recommend replacing the commands variable
with some form of 'smarter' array, basically allocating in blocks instead of
just appending one element at a time. I'm not sure Appender is quite what
you want here, so it will be home made, so to speak.

My revised edition:

import std.conv;
import std.stdio;
import std.string;
import std.process;
import core.stdc.stdlib;
import core.sys.posix.unistd;
import core.sys.posix.sys.wait;

// Fork and exec "sh -c cmdline"; returns the child's pid in the parent.
int process_start(string cmdline)
{
    int pid = fork();
    if (pid == -1)
    {
        perror("fork");
        exit(1);
    }
    else if (pid == 0)
    {
        // child: run the command through the shell
        string[3] argv = ["sh", "-c", cmdline];
        execvp("sh", argv);
        _exit(126); // only reached if exec fails
    }
    else
    {
        return pid;
    }
    assert(0);
}

// Wait for any child in our process group to exit;
// returns its pid and exit status.
void process_wait(out int pid, out int status)
{
    pid = waitpid(0, &status, 0);
    if (pid == -1)
    {
        perror("waitpid");
        exit(1);
    }
}

int main(string[] argv)
{
    import core.memory;
    GC.disable;

    int maxrunning = 1;
    if (argv.length > 1)
    {
        maxrunning = to!int(argv[1]);
    }
    bool verbose = (argv.length > 2);

    string[] commands;
    foreach (line; stdin.byLine())
    {
        commands ~= line.idup;
    }

    if (verbose)
    {
        writeln("#parallel = ", maxrunning);
        writeln("#commands = ", commands.length);
        stdout.flush();
    }

    int next = 0;
    int nrunning = 0;

    while (next < commands.length || nrunning > 0)
    {
        // top up to maxrunning running children
        while (next < commands.length && nrunning < maxrunning)
        {
            process_start(commands[next]);
            next++;
            nrunning++;
        }

        // reap one finished child
        int pid;
        int exitstatus;
        process_wait(pid, exitstatus);
        nrunning--;

        if (exitstatus != 0)
        {
            writeln("ERROR: ...");
            exit(1);
        }

        GC.collect;
    }

    return 0;
}
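
As a rough sketch of the "allocating in blocks" idea (just an illustration,
with an arbitrary block size), you can reserve capacity for the commands
array in large steps so that reading 85000 lines does not mean 85000 little
append-and-grow operations:

import std.stdio;

void main()
{
    enum blockSize = 16_384; // arbitrary block size for this sketch

    string[] commands;
    commands.reserve(blockSize); // ask the runtime for one big block up front

    foreach (line; stdin.byLine())
    {
        // grow in whole blocks instead of letting each append decide
        if (commands.length == commands.capacity)
            commands.reserve(commands.length + blockSize);

        commands ~= line.idup;
    }

    writeln("#commands = ", commands.length);
}

The built-in append already over-allocates somewhat, so the win here is
mainly fewer reallocations and less garbage for the GC to track; measuring
with and without it is the only way to know if it matters for your case.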
Jun 14 2016