digitalmars.D.learn - Phobos threads performance

reply bearophile <bearophileHUGS lycos.com> writes:
I have taken a look at the Chameneos-redux multithreaded benchmark; the
explanations are at the bottom of this page:

http://shootout.alioth.debian.org/gp4/benchmark.php?test=chameneosredux&lang=all

(I think I have created a Psyco version almost 2X faster than the Python one).

This is the D + Phobos working implementation:
http://shootout.alioth.debian.org/gp4/benchmark.php?test=chameneosredux&lang=dlang&id=0

On my Windows PC with N = 1_000_000 that D version runs in about 10 seconds. My
CPU has two cores, but the CPU usage is only about 70-75% (while both the Java
and C++ versions push the cores to 100%).


This is a C++ version that I think looks very close to the D version (I think
they are both translations of the Java version):
https://alioth.debian.org/tracker/download.php/30402/411646/310955/2682/chame.cpp

To run it on Windows I have used:
ftp://sources.redhat.com/pub/pthreads-win32/prebuilt-dll-2-8-0-release/
Added files to MinGW:
  pthread.h
  sched.h
  semaphore.h
  libpthreadGC2.a
Compiled code with:
  g++ -O3 -s -mthreads chame.cpp -o chame -lpthreadGC2

Still with N = 1_000_000, this C++ code runs in about 1.13 seconds.

Do you know why the C++ version is so much faster, and why the D version
doesn't use the two cores fully?

Bye,
bearophile
Jul 20 2008
next sibling parent reply The Anh Tran <trtheanh gmail.com> writes:
Don't use Windows to write entries for the Alioth game. :(
WaitForSingleObject is _much_ slower than pthread_mutex_lock.
I'm still changing things here and there in chame.d. I hope it'll get better.

The D version allocates memory during the meeting loop.
I omitted that allocation in the C++ version.
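
To illustrate the point about allocating inside the meeting loop, here is a minimal sketch (not the actual chame.d code) that contrasts a loop allocating on every round with one reusing a single slot. `MeetingResult`, `meetAllocating` and `meetReusing` are hypothetical names, and the sketch assumes a current D2 compiler rather than the 2008 Phobos.

```d
// Hypothetical stand-in for the data exchanged when two creatures meet.
struct MeetingResult
{
    int colour;
    bool sameId;
}

// Allocating variant: one GC heap allocation per round.
int meetAllocating(int rounds)
{
    int sum = 0;
    foreach (i; 0 .. rounds)
    {
        auto r = new MeetingResult(i % 3, false); // heap allocation each round
        sum += r.colour;
    }
    return sum;
}

// Reusing variant: the same stack slot is overwritten every round.
int meetReusing(int rounds)
{
    int sum = 0;
    MeetingResult slot; // allocated once, outside the loop
    foreach (i; 0 .. rounds)
    {
        slot.colour = i % 3;
        slot.sameId = false;
        sum += slot.colour;
    }
    return sum;
}
```

Both functions compute the same sum; only the second keeps the garbage collector out of the hot loop. Whether that accounts for the gap measured in this thread is not something the sketch can show.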

Jul 21 2008
next sibling parent The Anh Tran <trtheanh gmail.com> writes:
Clarification: the accepted threadring.d was written on Windows. I let 503
threads roam free.
On my 2200 MHz Pentium M, N = 10,000,000 takes only ~10 s.
But on their P4 the result is 330 s.

If I change it to use a mutex, it gets slower on Windows
but much faster on Linux. But they haven't accepted my new solution.
Jul 21 2008
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
The Anh Tran:
 Don't use win to write Alioth game. :(
I don't fully agree. I think a sufficiently portable language has to let you compile the same program on different operating systems and get the same results. D wants to be a quite portable language, so this is one more test for the language itself. Java generally allows me to do that, as does Python. But as you may have seen, this time I have found that the Psyco version may give different results (though maybe the error is mine somewhere), so I may give up on that. Your D version works correctly on Windows too.
 WaitForSingleObject is _much_ slower than pthread_mutex_lock
 I'm still changing here & there in chame.d Hope it'll better.
So how much faster is the Tango version of yours?
 D version allocate mem during the meeting loop.
 I omitted that alloc in C++ ver.
Can't you avoid the same allocation with D?

Some notes:
- Even if most people in this D newsgroup ignore the Shootout site, a lot of people take a look at that site when they want to choose which language to use, so developing fast programs for that site is important advertising. Haskell people have understood this very well, as you can see from the amount of work put into those benchmarks; they have even changed their language to improve results in some of them:
http://www.haskell.org/haskellwiki/Great_language_shootout
- Many times I have found the Shootout site useful for learning pieces of the syntax of other languages, so I think it has a very big pedagogical purpose too, because it shows you real, non-trivial algorithms implemented in a very efficient way in a lot of different languages. So you have to write your code well, because a lot of people will learn from it.
- Very often you can find performance problems in your language by looking at how it performs compared to other languages. Here, for example, the threading in Phobos seems several times slower than the C++ version, which was posted in the meantime:
http://shootout.alioth.debian.org/gp4/benchmark.php?test=chameneosredux&lang=gpp&id=0
As you can see, the C++ version takes 16.7 s, while your D version needs 41 s.

Bye,
bearophile
Jul 21 2008
parent reply The Anh Tran <trtheanh gmail.com> writes:
I come from the Visual C++ world and have been learning D for only about
10 days, so I've screwed up many times in my D code :/
Perhaps I'll post here for peer comments first.

bearophile wrote:
 The Anh Tran:
 Don't use win to write Alioth game. :(
 I don't fully agree. I think a sufficiently portable language has to let you compile the same program on different operating systems and get the same results. D wants to be a quite portable language, so this is one more test for the language itself. Java generally allows me to do that, as does Python. But as you may have seen, this time I have found that the Psyco version may give different results (though maybe the error is mine somewhere), so I may give up on that. Your D version works correctly on Windows too.
I completely agree with you. But sometimes the same code gives a big surprise, e.g. the threadring game.
 Can't you avoid the same allocation with D?
Yes :). The D ver was targeting code beauty, not speed. :|
 - Very often you can find performance problems in your language looking at how
it performs compared to other languages. Here for example the threading in
Phobos seems various times slower than the C++ version, that in the meantime
was posted:
 http://shootout.alioth.debian.org/gp4/benchmark.php?test=chameneosredux&lang=gpp&id=0
 As you can see the C++ version takes 16.7 s, while your D version needs 41 s.
I have no idea. They only differ by 2 mem alloc calls.
Jul 21 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
The Anh Tran:
 But sometime the same code gives big surprise. Ie: threadring game.
These surprises can be useful to you for learning, and to the designers of D for debugging it (I'm assuming such libraries have to work on both operating systems).
 Can't you avoid the same allocation with D?
 Yes :). The D ver was targeting code beauty, not speed. :|
The Shootout site allows more than one version for each benchmark, if their purposes are different. For example, here you can see two D versions:
http://shootout.alioth.debian.org/gp4/benchmark.php?test=binarytrees&lang=all
One is mine, and its purpose is to have short, high-level code; the other is written for speed. The main purpose of the site is to compare speed, but it compares memory use and code complexity too.
 I have no idea. They only differ by 2 mem alloc calls.
Maybe the thread module of Phobos has some problems. But I suggest you write a version without the memory allocations, to see how it performs. D's garbage collector may be at fault here too.

Bye and thank you,
bearophile
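
One way to probe whether the garbage collector is involved is to suspend collections around the allocating loop and compare timings. The following is only a sketch under assumptions: it uses `core.memory.GC` from a current D2 druntime (not the 2008 Phobos), and `allocatingLoop` and `allocatingLoopPaused` are hypothetical stand-ins for the benchmark's meeting loop.

```d
import core.memory : GC;

// Toy workload that allocates on every iteration.
int allocatingLoop(int rounds)
{
    int total = 0;
    foreach (i; 0 .. rounds)
    {
        auto cell = new int; // GC allocation inside the loop
        *cell = i;
        total += *cell;
    }
    return total;
}

// Same workload, with automatic collections suspended while it runs.
int allocatingLoopPaused(int rounds)
{
    GC.disable();             // allocations still work; collections are deferred
    scope (exit) GC.enable(); // re-enable even if the workload throws
    return allocatingLoop(rounds);
}
```

If the paused variant is clearly faster on a realistic workload, collections are a likely contributor; if not, the threading primitives are the next suspect.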
Jul 21 2008
parent reply The Anh Tran <trtheanh gmail.com> writes:
Jul 21 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
The Anh Tran:
...
 import std.c.linux.pthread;
With that you have lost cross-OS compatibility :-] (I can't run it at the moment). Anyway, you may want to reformat your code a bit, test that it works correctly, and submit it to the Shootout to see if they like it.

Bye,
bearophile
Jul 21 2008
parent bearophile <bearophileHUGS lycos.com> writes:
I have cleaned your code and I have submitted it, but I don't know yet if it
works correctly on their PC:

https://alioth.debian.org/tracker/download.php/30402/411646/310968/2693/chamene2.d

Bye,
bearophile
Jul 22 2008
prev sibling parent The Anh Tran <trtheanh gmail.com> writes:
This is my newest threadring.d for the threadring game:
http://shootout.alioth.debian.org/gp4/benchmark.php?test=threadring&lang=all

I need comments from D experts. Many thanks.

module ThreadRing;

import std.stdio : writefln;
import std.conv : toInt;

import std.c.linux.pthread;
import std.c.stdlib : exit;

const uint NUM_THREADS = 503;
const uint STACK_SIZE = 16*1024;

int	token = -1;
bool finished = false;

extern (C)
{
	// static array: the mutexes sit next to each other, which should be
	// friendlier to the L2 cache
	pthread_mutex_t[NUM_THREADS]		mutex;
	// likewise, contiguous stacks should suit the P4's small L2 cache
	char[STACK_SIZE][NUM_THREADS]	stacks;

	void* thread_func( void *num )
	{
		int thisnode	= cast(int)num;
		int nextnode	= ( thisnode + 1 ) % NUM_THREADS;

		while (true)
		{
			pthread_mutex_lock( &(mutex[ thisnode ]) );

			if ( token > 0 ) // common case: pass the token to the next thread
			{
				token--;
				pthread_mutex_unlock( &(mutex[ nextnode ]) );
			}
			else
			{
				// token exhausted: print this thread's 1-based number and quit
				writefln( thisnode + 1 );
				exit(0);
			}
		}

		return null;
	}
}

int main(string[] args)
{
	try
	{
		token = toInt(args[1]);
	}
	catch (Exception e)	
	{
		token = 1000; // test case
	}

	pthread_t cthread;
	pthread_attr_t stack_attr;

	pthread_attr_init(&stack_attr);

	for (int i = 0; i < NUM_THREADS; i++)
	{
		pthread_mutex_init( &(mutex[ i ]), null);
		pthread_mutex_lock( &(mutex[ i ]) );

		// manual set stack space & stack size for each thread
		// stack space is allocated closely together
		pthread_attr_setstack( &stack_attr, &(stacks[i]), STACK_SIZE );

		pthread_create( &cthread, &stack_attr, &thread_func, cast(void*)i );
	}

	// start game
	pthread_mutex_unlock( &(mutex[0]) );

	// wait for result
	pthread_join( cthread, null );

	// normally not reached: the thread that finishes the game calls exit(0)
	return 0;
}
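
Since `std.c.linux.pthread` ties the program above to Linux, here is a hedged sketch of the same token-passing idea written against the portable `core.thread` and `core.sync.semaphore` modules. It assumes a current D2 compiler with druntime (these modules were not part of the 2008 Phobos), it is not the submitted benchmark entry, and `threadRing` and `makeWorker` are hypothetical names. Semaphores replace the pre-locked mutexes because unlocking a mutex from a thread that does not own it is not portable.

```d
import core.sync.semaphore : Semaphore;
import core.thread : Thread;

// Passes a token around a ring of numThreads threads and returns the
// 1-based number of the thread that holds it when it reaches zero.
int threadRing(int numThreads, int passes)
{
    // One gate per thread; a thread runs only after its gate is notified.
    auto gates = new Semaphore[numThreads];
    foreach (ref gate; gates)
        gate = new Semaphore(0);

    int token = passes;
    int winner = 0;
    bool finished = false;

    // Building the delegate in a helper gives each thread its own `self`.
    void delegate() makeWorker(int self)
    {
        return delegate void()
        {
            while (true)
            {
                gates[self].wait();
                if (finished)
                    return;
                if (token > 0)
                {
                    token--;
                    gates[(self + 1) % numThreads].notify();
                }
                else
                {
                    winner = self + 1;
                    finished = true;
                    foreach (gate; gates)
                        gate.notify(); // release the threads still waiting
                    return;
                }
            }
        };
    }

    auto threads = new Thread[numThreads];
    foreach (i; 0 .. numThreads)
    {
        threads[i] = new Thread(makeWorker(i));
        threads[i].start();
    }

    gates[0].notify(); // start the game at thread 1
    foreach (t; threads)
        t.join();
    return winner;
}
```

Calling `threadRing(503, 1000)` mirrors the test case in the program above. The sketch joins every thread instead of calling `exit`, and it does not set custom stack sizes, so its performance characteristics may differ from the pthread version.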
Jul 21 2008