
digitalmars.D.announce - Re: DMD 1.034 and 2.018 releases

bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:
 Can you make it faster?

Lots of people today have 2 (or even 4) cores, and the order of computation of those ops is arbitrary, so a major (nearly linear, hopefully) speedup will probably come as soon as all the cores are used. This job splitting is probably an advantage even when the ops aren't computed by asm code. I've taken a look at my code, and so far I don't see many spots where the array operations (once they actually give some speedup) can be useful (there are many other things I'd find much more useful than such ops, see my wish lists). But if the array ops are useful for enough people, then it may be worth burning some programming time to make those array ops use all 2-4+ cores.

Bye,
bearophile
Aug 09 2008
Christopher Wright <dhasenan gmail.com> writes:
bearophile wrote:
 Walter Bright:
 Can you make it faster?

Lots of people today have 2 (or even 4) cores, and the order of computation of those ops is arbitrary, so a major (nearly linear, hopefully) speedup will probably come as soon as all the cores are used. This job splitting is probably an advantage even when the ops aren't computed by asm code.

The overhead of creating a new thread for this would be significant. You'd probably be better off using a regular loop for arrays that are not huge.
Aug 09 2008
"Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
"Christopher Wright" <dhasenan gmail.com> wrote in message 
news:g7ljal$2i84$1 digitalmars.com...
 bearophile wrote:
 Walter Bright:
 Can you make it faster?

Lots of people today have 2 (or even 4) cores, and the order of computation of those ops is arbitrary, so a major (nearly linear, hopefully) speedup will probably come as soon as all the cores are used. This job splitting is probably an advantage even when the ops aren't computed by asm code.

The overhead of creating a new thread for this would be significant. You'd probably be better off using a regular loop for arrays that are not huge.

I think we could see a lot more improvement from using vector ops to perform SIMD operations. They are just begging for it.
Aug 09 2008
JAnderson <ask me.com> writes:
Christopher Wright wrote:
 bearophile wrote:
 Walter Bright:
 Can you make it faster?

Lots of people today have 2 (or even 4) cores, and the order of computation of those ops is arbitrary, so a major (nearly linear, hopefully) speedup will probably come as soon as all the cores are used. This job splitting is probably an advantage even when the ops aren't computed by asm code.

The overhead of creating a new thread for this would be significant. You'd probably be better off using a regular loop for arrays that are not huge.

I agree. I think a lot of profiling would be in order to see when each approach becomes an advantage. Then use a branch to jump to the best algorithm for the particular case (platform + length of array). Hopefully the compiler could inline the chosen algorithm so that constant-sized arrays don't pay the additional overhead. There would be a small cost for the extra branch on small dynamic arrays; one could argue that if that branch ever becomes a performance bottleneck, the program is doing lots of operations on lots of small arrays, and the user could change the design to group their small arrays into one larger array to get the performance they desire.

-Joel
Aug 09 2008
renoX <renosky free.fr> writes:
Christopher Wright wrote:
 bearophile wrote:
 Walter Bright:
 Can you make it faster?

Lots of people today have 2 (or even 4) cores, and the order of computation of those ops is arbitrary, so a major (nearly linear, hopefully) speedup will probably come as soon as all the cores are used. This job splitting is probably an advantage even when the ops aren't computed by asm code.

The overhead of creating a new thread for this would be significant.

Well, for this kind of scheme you wouldn't start a new set of threads each time! Just start a set of worker threads (one per CPU, pinned to its CPU), created at program startup, which do nothing until they are woken up when there is an operation that can be accelerated through parallelism.
 You'd probably be better off using a regular loop for arrays that are 
 not huge.

Sure, even with pre-created threads, using several CPUs induces additional startup and teardown costs, so this would be worthwhile only for loops that are 'big enough'. Another pitfall is to ensure that two CPUs don't write to the same cache line, otherwise this 'false sharing' will reduce the performance.

renoX
Sep 07 2008