digitalmars.D.learn - Phobos threads performance

bearophile <bearophileHUGS lycos.com> writes:

I have taken a look at the Chameneos-redux multithread benchmark; the
explanations are at the bottom of this page:

http://shootout.alioth.debian.org/gp4/benchmark.php?test=chameneosredux&lang=all

(I think I have created a Psyco version almost 2X faster than the Python one).

This is the D + Phobos working implementation:
http://shootout.alioth.debian.org/gp4/benchmark.php?test=chameneosredux&lang=dlang&id=0

On my Win PC with N = 1_000_000 that D version runs in about 10 seconds. My CPU
has two cores, but the CPU usage is only about 70-75% (while both the Java and C++
versions push the cores to 100%).


This is a C++ version that I think looks very close to the D version (I think
they are both translations of the Java version):
https://alioth.debian.org/tracker/download.php/30402/411646/310955/2682/chame.cpp

To run it on Windows I have used:
ftp://sources.redhat.com/pub/pthreads-win32/prebuilt-dll-2-8-0-release/
Added files to MinGW:
  pthread.h
  sched.h
  semaphore.h
  libpthreadGC2.a
Compiled code with:
  g++ -O3 -s -mthreads chame.cpp -o chame -lpthreadGC2

Still with N = 1_000_000, this C++ code runs in about 1.13 seconds.

Do you know why the C++ version is so much faster, and why the D version doesn't
use the two cores fully?

Bye,
bearophile

Jul 20 2008

The Anh Tran <trtheanh gmail.com> writes:

Don't use Windows to write programs for the Alioth game. :(
WaitForSingleObject is _much_ slower than pthread_mutex_lock.
I'm still changing things here and there in chame.d. I hope it'll get better.

The D version allocates memory during the meeting loop.
I omitted that allocation in the C++ version.
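
To make the allocation point concrete, here is a minimal sketch in C (chosen because the code in this thread ends up calling the pthread C API directly). The `Meeting` struct and both function names are hypothetical and are not taken from chame.d or chame.cpp; the sketch only shows the general idea of hoisting a per-iteration allocation out of a hot loop.

```c
#include <stdlib.h>

/* Hypothetical record exchanged at each meeting; not taken from chame.d. */
typedef struct {
    int color;
    int partner_id;
} Meeting;

/* Allocates and frees a fresh record on every iteration of the loop. */
long meet_allocating(int rounds)
{
    long checksum = 0;
    for (int i = 0; i < rounds; i++) {
        Meeting *m = malloc(sizeof *m);
        if (m == NULL)
            abort();
        m->color = i % 3;
        m->partner_id = i;
        checksum += m->color + m->partner_id;
        free(m);
    }
    return checksum;
}

/* Reuses a single record for the whole loop: same result, no heap calls. */
long meet_reusing(int rounds)
{
    Meeting m;
    long checksum = 0;
    for (int i = 0; i < rounds; i++) {
        m.color = i % 3;
        m.partner_id = i;
        checksum += m.color + m.partner_id;
    }
    return checksum;
}
```

Both functions compute the same checksum, so timing one against the other isolates the cost of the allocator calls. In a garbage-collected D program the allocating variant may additionally trigger collections, which this C sketch does not model.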


Jul 21 2008

The Anh Tran <trtheanh gmail.com> writes:

Clarification: the accepted threadring.d was written on Windows. I let 503
threads roam freely.
On my 2200 MHz Pentium M, N = 10,000,000 only costs ~10 s.
But on their P4, the result is 330 s.

If I changed it to use mutexes, it would be slower on Windows
but much faster on Linux. They haven't accepted my new solution yet, though.
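
A rough way to check claims like this on a given machine is to time the locking primitive by itself. The sketch below is an assumption-laden microbenchmark in C, not code from the thread: the function name `time_mutex_pairs` is made up, it measures only an uncontended `pthread_mutex_lock`/`pthread_mutex_unlock` pair on one thread, and it does not reproduce the Windows `WaitForSingleObject` side.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <time.h>

/* Locks and unlocks one uncontended mutex `iterations` times.
   Returns the elapsed wall-clock time in seconds; *count_out receives
   the number of completed lock/unlock pairs. */
double time_mutex_pairs(long iterations, long *count_out)
{
    pthread_mutex_t m;
    struct timespec start, end;
    long count = 0;

    pthread_mutex_init(&m, NULL);

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++) {
        pthread_mutex_lock(&m);
        count++;                      /* work done while holding the lock */
        pthread_mutex_unlock(&m);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    pthread_mutex_destroy(&m);
    *count_out = count;
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Dividing the returned time by the iteration count gives an approximate per-pair cost. A contended handoff between threads, as in threadring, will cost considerably more than this uncontended figure.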

Jul 21 2008

bearophile <bearophileHUGS lycos.com> writes:

The Anh Tran:
 Don't use win to write Alioth game. :(

I don't fully agree. I think a sufficiently portable language has to allow you to
compile the program on different operating systems and get the same
results. D aims to be a fairly portable language, so this is one more test for
the language itself.
Java generally allows me to do that, as does Python.
But as you may have seen, this time I have found that the Psyco version may give
different results (but maybe the error is mine somewhere), so I may give up on
that.
Your D version works correctly on Windows too.


 WaitForSingleObject is _much_ slower than pthread_mutex_lock
 I'm still changing here & there in chame.d Hope it'll better.

So how much faster is the Tango version of yours?


 D version allocate mem during the meeting loop.
 I omitted that alloc in C++ ver.

Can't you avoid the same allocation with D?

Some notes:
- Even if most people in this D newsgroup ignore the Shootout site, a lot of
people look at that site when they want to choose which language to use,
so developing fast programs for that site is important advertising. Haskell
people have understood this very well: you can see it from the amount of work
put into those benchmarks, and they have even changed their language to improve
results in some of them:
http://www.haskell.org/haskellwiki/Great_language_shootout
- Many times I have found the Shootout site useful for learning pieces of the
syntax of other languages. So I think it has a very big pedagogical purpose
too, because it shows you real, non-trivial algorithms implemented in a very
efficient way in a lot of different languages. So you have to write your code
well, because a lot of people will learn from it.
- Very often you can find performance problems in your language by looking at how
it performs compared to other languages. Here, for example, the threading in
Phobos seems several times slower than the C++ version, which has been posted
in the meantime:
http://shootout.alioth.debian.org/gp4/benchmark.php?test=chameneosredux&lang=gpp&id=0
As you can see, the C++ version takes 16.7 s, while your D version needs 41 s.

Bye,
bearophile

Jul 21 2008

The Anh Tran <trtheanh gmail.com> writes:

I come from the Visual C++ world and have been learning D for only about 10 days,
so I have made many mistakes in my D code :/
Perhaps I'll post it here for peer comments first.

bearophile wrote:
 The Anh Tran:
 Don't use win to write Alioth game. :(

 
 I don't fully agree. I think a portable enough language has to allow you to
compile the program on different operating systems and give you the same
results. D wants to be a quite portable language. So this is one more test for
the language itself.
 Java generally allows me to do that, as Python.
 But as you may have seen this time I have found the Psyco version may give
different results (but maybe the error is mine somewhere), so I may give up on
that.
 Your D version works correctly on Win too.

I completely agree with you. But sometimes the same code gives a big
surprise, e.g. the threadring game.


 Can't you avoid the same allocation with D?

Yes :). The D ver was targeting code beauty, not speed. :|

 - Very often you can find performance problems in your language looking at how
it performs compared to other languages. Here for example the threading in
Phobos seems various times slower than the C++ version, that in the meantime
was posted:
 http://shootout.alioth.debian.org/gp4/benchmark.php?test=chameneosredux&lang=gpp&id=0
 As you can see the C++ version takes 16.7 s, while your D version needs 41 s.

I have no idea. They only differ by 2 mem alloc calls.

Jul 21 2008

bearophile <bearophileHUGS lycos.com> writes:

The Anh Tran:
 But sometime the same code gives big surprise. Ie: threadring game.

These surprises can be useful for you to learn from, and for the designers of D
to debug it (I'm assuming such libraries have to work on both operating systems).


 Can't you avoid the same allocation with D?

 Yes :). The D ver was targeting code beauty, not speed. :|

The Shootout site allows more than one version for each benchmark, if their
purposes are different. For example, here you can see two D versions:
http://shootout.alioth.debian.org/gp4/benchmark.php?test=binarytrees&lang=all
One is mine; its purpose is to be short, high-level code. The other is written
for speed. The main purpose of the site is to compare speed, but it compares
memory use and code complexity too.


 I have no idea. They only differ by 2 mem alloc calls.

Maybe the thread module of Phobos has some problems. But I suggest you write
a version without the memory allocations, to see how it performs. The garbage
collector of D may be at fault here too.

Bye and thank you,
bearophile

Jul 21 2008

bearophile <bearophileHUGS lycos.com> writes:

The Anh Tran:
...
 import std.c.linux.pthread;

With that you have lost cross-OS compatibility :-]
(I can't run it at the moment).

Anyway, you may want to reformat your code a bit, test that it works correctly,
and submit it to the Shootout, to see if they like it.

Bye,
bearophile

Jul 21 2008

bearophile <bearophileHUGS lycos.com> writes:

I have cleaned up your code and submitted it, but I don't know yet if it
works correctly on their PC:

https://alioth.debian.org/tracker/download.php/30402/411646/310968/2693/chamene2.d

Bye,
bearophile

Jul 22 2008

The Anh Tran <trtheanh gmail.com> writes:

This is my newest threadring.d for the threadring game:
http://shootout.alioth.debian.org/gp4/benchmark.php?test=threadring&lang=all

I need comments from D experts. Many thanks.

module ThreadRing;

import std.stdio : writefln;
import std.conv : toInt;

import std.c.linux.pthread;
import std.c.stdlib : exit;

const uint NUM_THREADS = 503;
const uint STACK_SIZE = 16*1024;

int	token = -1;
bool finished = false;

extern (C)
{
	// static array, local data, should be better for L2 cache
	pthread_mutex_t[NUM_THREADS]		mutex;
	// again, local data is better for P4 small L2 cache
	char[STACK_SIZE][NUM_THREADS]	stacks;

	void* thread_func( void *num )
	{
		int thisnode	= cast(int)num;
		int nextnode	= ( thisnode + 1 ) % NUM_THREADS;

		while (true)
		{
			pthread_mutex_lock( &(mutex[ thisnode ]) );

			if ( token > 0 ) // branch prediction as taken
			{
				token--;
				pthread_mutex_unlock( &(mutex[ nextnode ]) );
			}
			else
			{
				 writefln( thisnode +1 );
				 exit(0);
			}
		}

		return null;
	}
}

int main(string[] args)
{
	try
	{
		token = toInt(args[1]);
	}
	catch (Exception e)	
	{
		token = 1000; // test case
	}

	pthread_t cthread;
	pthread_attr_t stack_attr;

	pthread_attr_init(&stack_attr);

	for (int i = 0; i < NUM_THREADS; i++)
	{
		pthread_mutex_init( &(mutex[ i ]), null);
		pthread_mutex_lock( &(mutex[ i ]) );

		// manually set the stack memory and stack size for each thread,
		// so that the stacks are allocated close together
		pthread_attr_setstack( &stack_attr, &(stacks[i]), STACK_SIZE );

		pthread_create( &cthread, &stack_attr, &thread_func, cast(void*)i );
	}

	// start game
	pthread_mutex_unlock( &(mutex[0]) );

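	// wait on the last thread created; the thread that ends the game
	// calls exit(0), so this join is not expected to return
	pthread_join( cthread, null );

	return 0;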
}
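
One portability caveat about the handoff above: as far as I know, POSIX leaves it undefined what happens when a default mutex is unlocked by a thread that did not lock it, so passing the token by unlocking the next thread's mutex relies on how a particular implementation behaves. The sketch below shows the same ring in C using POSIX semaphores, which may be posted from any thread. It is an illustration under assumptions, not the submitted program: `Ring`, `WorkerArg`, `ring_worker` and `run_ring` are made-up names, the 256 KB stack size is an arbitrary choice, and the threads shut down cleanly instead of calling `exit`.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdlib.h>

typedef struct {
    int    ring_size;
    sem_t *sems;      /* one semaphore per thread, all starting at 0 */
    int    token;     /* remaining passes */
    int    finished;  /* set once a thread sees token == 0 */
    int    winner;    /* 1-based id of that thread */
} Ring;

typedef struct { Ring *ring; int self; } WorkerArg;

static void *ring_worker(void *p)
{
    WorkerArg *arg = p;
    Ring *r = arg->ring;
    int next = (arg->self + 1) % r->ring_size;

    for (;;) {
        sem_wait(&r->sems[arg->self]);        /* wait for our turn */
        if (!r->finished) {
            if (r->token > 0) {
                r->token--;
                sem_post(&r->sems[next]);     /* pass the token on */
                continue;
            }
            r->winner = arg->self + 1;        /* token reached 0 here */
            r->finished = 1;
        }
        sem_post(&r->sems[next]);             /* propagate shutdown */
        return NULL;
    }
}

/* Runs the ring and returns the 1-based id of the finishing thread. */
int run_ring(int ring_size, int passes)
{
    Ring r = { ring_size, NULL, passes, 0, 0 };
    pthread_t *threads = malloc(ring_size * sizeof *threads);
    WorkerArg *args = malloc(ring_size * sizeof *args);
    pthread_attr_t attr;

    r.sems = malloc(ring_size * sizeof *r.sems);
    assert(threads && args && r.sems);

    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 256 * 1024);  /* small stacks suffice */

    for (int i = 0; i < ring_size; i++) {
        int rc;
        sem_init(&r.sems[i], 0, 0);
        args[i].ring = &r;
        args[i].self = i;
        rc = pthread_create(&threads[i], &attr, ring_worker, &args[i]);
        assert(rc == 0);
    }

    sem_post(&r.sems[0]);                     /* start the game */

    for (int i = 0; i < ring_size; i++)
        pthread_join(threads[i], NULL);
    for (int i = 0; i < ring_size; i++)
        sem_destroy(&r.sems[i]);
    pthread_attr_destroy(&attr);
    free(r.sems);
    free(args);
    free(threads);
    return r.winner;
}
```

Calling `run_ring(503, 1000)` corresponds to the test case in the D code above (503 threads, token 1000). Only the thread that currently holds the token touches the shared fields, and the semaphore handoff orders those accesses, so no extra lock is needed.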

Jul 21 2008
