digitalmars.D - WordCount performance
- bearophile (61/61) Mar 26 2008 The following little program comes from a progressive stripping down of ...
- Frits van Bommel (7/9) Mar 26 2008 The first thing that comes to mind: "For the Nth time, someone thinks
- bearophile (4/6) Mar 26 2008 Yes, I am probably talking about problems in the DMD optimizer too, here...
- Jason House (9/24) Mar 26 2008 I have a D program that's an AI engine. It competes against programs
- Saaa (1/4) Mar 26 2008 Exactly the same here.
- Derek Parnell (12/16) Mar 26 2008 Regardless of the application in question, if run time speed is the prim...
- Saaa (5/12) Mar 26 2008 What if you can't do b nor have the time to learn b and see that another...
- Silv3r (2/3) Mar 26 2008 Exactly, I wouldn't be able to put it any clearer myself. ;)
- Walter Bright (21/23) Mar 27 2008 Because the insight gained by figuring out why one is faster than the
- Saaa (11/32) Mar 27 2008 The first steps taken on this newsgroup regarding benchmarks(program spe...
- Walter Bright (3/6) Mar 27 2008 D does match the optimization level of gcc: the GDC (gnu D compiler)
- Don Clugston (7/14) Mar 27 2008 Is DMD ever likely to come close? Given the size TODO list for the D fro...
- Walter Bright (4/14) Mar 27 2008 I think it already is close. My back end long ago reached the point
- Dave (7/18) Mar 27 2008 Except for floating point optimizations, even if some loss of accuracy i...
- Saaa (4/10) Mar 27 2008 How does DMD's optimization compare?
- Walter Bright (5/8) Mar 27 2008 The backend is not only similar, it is exactly the same.
- Dave (5/13) Mar 27 2008 Any ETA on when that will be worked into the optimizer?
- Walter Bright (2/8) Mar 27 2008 Too many other things right now!
- Dave (4/15) Mar 27 2008 The shootout benchmarks seem to indicate that in the general case DMD is...
- bearophile (34/36) Mar 27 2008 Micro-benchmarks have the huge advantage that they allow you to find whe...
- bearophile (5/6) Mar 26 2008 Relax, in many real situations D (compiled with DMD) is much faster. Tha...
- Daniel (4/75) Mar 26 2008 As an exercise for the viewer, if you use SSE2, prefetch, and non-branch...
- Saaa (1/5) Mar 26 2008 Please teach me!
- TomD (24/24) Mar 27 2008 bearophile Wrote:
- Walter Bright (3/5) Mar 27 2008 To get the fastest code out of dmd:
- Walter Bright (28/28) Mar 27 2008 I did a little checking.
- Saaa (1/6) Mar 27 2008 lol
- Walter Bright (5/12) Mar 27 2008 The hilarity is that full library source was included with the compiler
- Georg Wrede (5/18) Mar 27 2008 Hmm. What kind of situations would need multiple threads simultaneously
- Walter Bright (5/9) Mar 27 2008 If it was unsynchronized, if you had two threads reading lines from the
- Frits van Bommel (7/15) Mar 27 2008 I believe his point was "What kind of program would ever want to do
- Georg Wrede (3/23) Mar 27 2008 Heck, I just spent ages formulating the same thing. You said it in a
- Walter Bright (6/6) Mar 28 2008 You guys do have a point that perhaps syncing is not needed on stdin.
- Sean Kelly (8/14) Mar 28 2008 Another option might be to check the thread count is greater than
- Walter Bright (3/9) Mar 28 2008 You have to be very careful with such schemes because they can run afoul...
- Frits van Bommel (51/61) Mar 28 2008 In this case though, if the thread count was 1 when checked it's safe to...
- Frits van Bommel (2/3) Mar 28 2008 Bugzilla'd: http://d.puremagic.com/issues/show_bug.cgi?id=1957
- Walter Bright (1/1) Mar 28 2008 Thanks for putting it in bugzilla.
- Sean Kelly (7/16) Mar 28 2008 With the way this call is implemented in Tango, it is completely safe to...
- Benji Smith (4/14) Mar 28 2008 I thought the double-checked locking idiom was only buggy in the Java
- Walter Bright (2/4) Mar 29 2008 It's a common bug in C/C++ code.
- Sean Kelly (5/18) Mar 29 2008 It's broken in C++ as well. It actually works in D however, provided on...
- Kevin Bealer (26/47) Mar 29 2008 Couldn't you two mutexes to make this work even in C++ (but I'm using D ...
- Sean Kelly (6/52) Mar 29 2008 second mutex doesn't actually protect anything, but it should be a porta...
- janderson (6/16) Mar 29 2008 Perhaps a method of disabling locking would be useful. Either by simply...
- Georg Wrede (30/42) Mar 27 2008 Yes. Now, my problem was, it should be pretty uncommon to need to have
- bearophile (10/14) Mar 27 2008 Thank you taking a look at this benchmark. I understand that D is safer ...
The following little program comes from a progressive stripping down of a program I was creating. This C and D code give the approximate count of the words in a file:

D version:

import std.c.stdio: printf, getchar, EOF;
import std.ctype: isspace;

void main() {
    int count, c;
    //OUTER:
    while (1) {
        while (1) {
            c = getchar();
            if (c == EOF)
                //break OUTER;
                goto END;
            if (!isspace(c))
                break;
        }
        count++;
        while (1) {
            c = getchar();
            if (c == EOF)
                //break OUTER;
                goto END;
            if (isspace(c))
                break;
        }
    }
    END:
    printf("%d\n", count);
}

C version:

#include <stdio.h>
#include <ctype.h>

int main() {
    int count = 0, c;
    while (1) {
        while (1) {
            c = getchar();
            if (c == EOF)
                goto END;
            if (!isspace(c))
                break;
        }
        count++;
        while (1) {
            c = getchar();
            if (c == EOF)
                goto END;
            if (isspace(c))
                break;
        }
    }
    END:
    printf("%d\n", count);
    return 0;
}

To test it, I have used a 7.5 MB file of real text. The C version (compiled with MinGW 4.2.1) is ~7.8 times faster (0.43 s instead of 3.35 s) than this very simple code compiled with DMD (1.028). If I use a named break in the D code (that OUTER) to avoid the goto, the running speed is essentially the same. On a 50 MB file of text the timings are 2.43 s and 20.74 s (C version 8.5+ times faster). Disabling the GC doesn't change the running speed of the D version.

A 7-8 times difference on such a simple program is big enough to make me curious; do you know what the problem can be? (Maybe the getchar() as a function instead of a macro?)

Bye,
bearophile
Mar 26 2008
bearophile wrote:
> To test it, I have used a 7.5 MB file of real text. The C version (compiled with MinGW 4.2.1) is ~7.8 times faster (0.43 s instead of 3.35 s) than that very simpler code compiled with DMD (1.028). If I use a named break in the D code (that OUTER), to avoid the goto, the running speed is essentially the same. On a 50 MB file of text the timings are 2.43 s and 20.74 s (C version 8.5+ times faster).

The first thing that comes to mind: "For the Nth time, someone thinks language X is 'faster' than language Y because of a comparison between compilers with completely different backends (and thus optimizers), even though there are pairs of compilers that use the same backend for either language." Try comparing DMC vs. DMD, or GCC vs. GDC...
Mar 26 2008
Frits van Bommel:
> because of a comparison between compilers with completely different backends (and thus optimizers),

Yes, I am probably talking about problems in the DMD optimizer too, here, that may need to be addressed (or explained).

Bye,
bearophile
Mar 26 2008
Frits van Bommel wrote:
> bearophile wrote:
>> To test it, I have used a 7.5 MB file of real text. The C version (compiled with MinGW 4.2.1) is ~7.8 times faster (0.43 s instead of 3.35 s) than that very simpler code compiled with DMD (1.028). If I use a named break in the D code (that OUTER), to avoid the goto, the running speed is essentially the same. On a 50 MB file of text the timings are 2.43 s and 20.74 s (C version 8.5+ times faster).
> The first thing that comes to mind: "For the Nth time, someone thinks language X is 'faster' than language Y because of a comparison between compilers with completely different backends (and thus optimizers), even though there are pairs of compilers that use the same backend for either language." Try comparing DMC vs. DMD, or GCC vs. GDC...

I have a D program that's an AI engine. It competes against programs written in other languages and with different compilers. Speed matters to me, regardless of which compiler my competition uses. One of my attractions to D (besides all the cool features) was that it compiled to native assembly like C++. I implicitly assumed I'd be getting C++ (speed) without the warts. When performance gaps are an order of magnitude, the D language seems like a horrible choice. I've thought on more than one occasion about going back to C++ or Java for the speed.
Mar 26 2008
> I have a D program that's an AI engine. It competes against programs written in other languages and with different compilers. Speed matters to me, regardless of which compiler my competition uses.

Exactly the same here.
Mar 26 2008
On Thu, 27 Mar 2008 02:37:28 +0100, Saaa wrote:
>> I have a D program that's an AI engine. It competes against programs written in other languages and with different compilers. Speed matters to me, regardless of which compiler my competition uses.
> Exactly the same here.

Regardless of the application in question, if run time speed is the primary driver then ...
(a) optimize the algorithms
(b) code it in assembler
(c) profile it
(d) jmp (a)

-- 
Derek
(skype: derek.j.parnell)
Melbourne, Australia
27/03/2008 12:42:39 PM
Mar 26 2008
> Regardless of the application in question, if run time speed is the primary driver then ...
> (a) optimize the algorithms
> (b) code it in assembler
> (c) profile it
> (d) jmp (a)

What if you can't do (b), nor have the time to learn (b), and see that another compiler (language) runs faster without the need of (b)? I love all the features D gives me and the prospect that somebody who knows assembler could optimize parts of the code later on. But I'd just wish for D to be as fast as C without the use of (b).
Mar 26 2008
Saaa Wrote:
> But, I'd just wish for D to be as fast as C without the use of b.

Exactly, I wouldn't be able to put it any clearer myself. ;)
Mar 26 2008
Saaa wrote:
> What if you can't do b nor have the time to learn b and see that another compiler(language) runs faster without the need of b?

Because the insight gained by figuring out why one is faster than the other will be very valuable to you in future projects. I know a lot of programmers who made mistaken assumptions about why one program was faster than another, assumptions that eventually wound up costing them dearly. A sampling:

1) used the wrong compiler switches
2) the 'faster' program was faster because it had a disastrous bug in it
3) thought that their benchmark was measuring loop performance, when it was actually measuring the time spent in printf or malloc, rendering all their conclusions about how to write fast loops false.
4) thought they'd discovered a unique fast algorithm, when what was actually happening was the compiler optimizer had figured out how to eliminate a redundant DIV instruction
5) the benchmark proving that language A runs faster than language B turned out to be a result peculiar to that particular benchmark, and on the real program the reverse was true

So, essentially, if you don't know *why* the benchmark produced the results it did, you don't have enough information to draw any conclusions about the language.
Mar 27 2008
>> What if you can't do b nor have the time to learn b and see that another compiler(language) runs faster without the need of b?
> Because the insight gained by figuring out why one is faster than the other will be very valuable to you in future projects. I know a lot of programmers who made mistaken assumptions about why one program was faster than another, assumptions that eventually wound up costing them dearly.

I don't really see how this is an answer to my question. I suspect this is a more general post about language benchmarks :)

> A sampling:
> 1) used the wrong compiler switches
> 2) the 'faster' program was faster because it had a disastrous bug in it
> 3) thought that their benchmark was measuring loop performance, when it was actually measuring the time spent in printf or malloc, rendering all their conclusions about how to write fast loops false.
> 4) thought they'd discovered a unique fast algorithm, when what was actually happening was the compiler optimizer had figured out how to eliminate a redundant DIV instruction
> 5) the benchmark proving that language A runs faster than language B turned out to be a result peculiar to that particular benchmark, and on the real program the reverse was true
> So, essentially, if you don't know *why* the benchmark produced the results it did, you don't have enough information to draw any conclusions about the language.

The first steps taken on this newsgroup regarding benchmarks (program speed) are generally (as should be) 1, 2 and 3. 4 is sort of what I was talking about: I would like the D compiler to match the optimization level of the gcc (or even Intel) C compiler. I am not sure whether you meant to say the opposite of this, but: you don't need to understand asm to rule out 1, 2 and 3, and know that the speed difference lies within 4 (compiler optimizations).
Mar 27 2008
Saaa wrote:
> 4 is sort of what I was talking about: I would like the D compiler to match the optimization level of (the gcc or even intel:) C compiler.

D does match the optimization level of gcc: the GDC (gnu D compiler) uses gcc's optimizer and code generator.
Mar 27 2008
Walter Bright wrote:
> Saaa wrote:
>> 4 is sort of what I was talking about: I would like the D compiler to match the optimization level of (the gcc or even intel:) C compiler.
> D does match the optimization level of gcc: the GDC (gnu D compiler) uses gcc's optimizer and code generator.

Is DMD ever likely to come close? Given the size of the TODO list for the D front-end, I'm having trouble imagining you'll get much time to work on the optimiser before 64 bit CPUs become completely dominant (forcing you to rewrite most of the code generator). Unless there are significant features of the DMC optimiser which DMD still doesn't make use of?
Mar 27 2008
Don Clugston wrote:
> Walter Bright wrote:
>> D does match the optimization level of gcc: the GDC (gnu D compiler) uses gcc's optimizer and code generator.
> Is DMD ever likely to come close? Given the size TODO list for the D front-end, I'm having trouble imagining you'll get much time to work on the optimiser before 64 bit CPUs become completely dominant (forcing you to rewrite most of the code generator).

I think it already is close. My back end long ago reached the point where it takes a lot more effort to get only tiny gains.

> Unless there are significant features of the DMC optimiser which DMD still doesn't make use of?

No.
Mar 27 2008
"Walter Bright" <newshound1 digitalmars.com> wrote in message news:fsgqvr$vtd$1 digitalmars.com...
> Don Clugston wrote:
>>> D does match the optimization level of gcc: the GDC (gnu D compiler) uses gcc's optimizer and code generator.
>> Is DMD ever likely to come close? Given the size TODO list for the D front-end, I'm having trouble imagining you'll get much time to work on the optimiser before 64 bit CPUs become completely dominant (forcing you to rewrite most of the code generator).
> I think it already is close. My back end long ago reached the point where it takes a lot more effort to get only tiny gains.

Except for floating point optimizations, even if some loss of accuracy is encountered (because temporaries are kept in FP registers, for example). I wish the DMD optimizer would do as good a job with FP as it does with int, unless I tell it not to for cases where accuracy is paramount. In that case, it would just have to fall back to what it's doing now.
Mar 27 2008
> Saaa wrote:
>> 4 is sort of what I was talking about: I would like the D compiler to match the optimization level of (the gcc or even intel:) C compiler.
> D does match the optimization level of gcc: the GDC (gnu D compiler) uses gcc's optimizer and code generator.

How does DMD's optimization compare?

And maybe a silly question, but how does optimizing D compare to optimizing C++? Or is this exactly the same, as the backend is that similar?
Mar 27 2008
Saaa wrote:
> And maybe a silly question but how does optimizing D compare to optimizing C++. Or is this exactly the same as the backend is that similar?

The backend is not only similar, it is exactly the same. Where C++ and D differ is in various opportunities present in the language semantics. For example, D has invariants so it can do some more aggressive optimizations that C++ cannot.
Mar 27 2008
"Walter Bright" <newshound1 digitalmars.com> wrote in message news:fsgrbs$12lf$2 digitalmars.com...
> Saaa wrote:
>> And maybe a silly question but how does optimizing D compare to optimizing C++. Or is this exactly the same as the backend is that similar?
> The backend is not only similar, it is exactly the same. Where C++ and D differ is in various opportunities present in the language semantics. For example, D has invariants so it can do some more aggressive optimizations that C++ cannot.

Any ETA on when that will be worked into the optimizer?

Thanks,
- Dave
Mar 27 2008
Dave wrote:
> "Walter Bright" <newshound1 digitalmars.com> wrote in message
>> Where C++ and D differ is in various opportunities present in the language semantics. For example, D has invariants so it can do some more aggressive optimizations that C++ cannot.
> Any ETA on when that will be worked into the optimizer?

Too many other things right now!
Mar 27 2008
"Saaa" <empty needmail.com> wrote in message news:fsglqt$5n2$1 digitalmars.com...
>>> 4 is sort of what I was talking about: I would like the D compiler to match the optimization level of (the gcc or even intel:) C compiler.
>> D does match the optimization level of gcc: the GDC (gnu D compiler) uses gcc's optimizer and code generator.
> How does DMD's optimization compare? And maybe a silly question but how does optimizing D compare to optimizing C++. Or is this exactly the same as the backend is that similar?

The shootout benchmarks seem to indicate that in the general case DMD is pretty close to GCC except for floating point (again, in general).
Mar 27 2008
Dave Wrote:
> The shootout benchmarks seem to indicate that in the general case DMD is pretty close to GCC except for floating point (again, in general).

Micro-benchmarks have the huge advantage that they allow you to find where a (micro) performance problem may be; the following is a stripped down benchmark from the Shootout:

D version:

import std.stdio, std.conv;

double fibo(double n) {
    if (n < 2)
        return 1.0;
    return fibo(n-2) + fibo(n-1);
}

void main(string[] args) {
    int n = args.length == 2 ? toInt(args[1]) : 1;
    printf("%.1f\n", fibo(n));
}

C version:

#include <stdio.h>
#include <stdlib.h>

double fibo(double n) {
    if (n < 2)
        return 1.0;
    return fibo(n-2) + fibo(n-1);
}

int main(int argc, char ** argv) {
    int n = argc == 2 ? atoi(argv[1]) : 1;
    printf("%.1f\n", fibo(n));
    return 0;
}

The difference (DMD vs GCC) is visible. In the case of GCC you can add the -fomit-frame-pointer compilation flag to increase performance, and you can even replace this line:

if (n < 2) return 1.0;

with:

if (__builtin_expect(n < 2, 0)) return 1.0;

In such a situation the runtime timings are (n = 37 on my very old PC) 2.36s vs 3.11s for DMD. (I don't know how much difficulty/work it is to add an expect to D.)

Bye,
bearophile
Mar 27 2008
Jason House:
> When performance gaps are an order of magnitude,

Relax, in many real situations D (compiled with DMD) is much faster. That's a very artificial example. And I'd like to know if my timings are right, because they may be just wrong after all. Trusting me blindly isn't good.

Bye,
bearophile
Mar 26 2008
bearophile Wrote:
> The following little program comes from a progressive stripping down of a program I was creating. This C and D code give the approximate count of the words in a file:
>
> D version:
>
> import std.c.stdio: printf, getchar, EOF;
> import std.ctype: isspace;
>
> void main() {
>     int count, c;
>     //OUTER:
>     while (1) {
>         while (1) {
>             c = getchar();
>             if (c == EOF)
>                 //break OUTER;
>                 goto END;
>             if (!isspace(c))
>                 break;
>         }
>         count++;
>         while (1) {
>             c = getchar();
>             if (c == EOF)
>                 //break OUTER;
>                 goto END;
>             if (isspace(c))
>                 break;
>         }
>     }
>     END:
>     printf("%d\n", count);
> }
>
> C version:
>
> #include <stdio.h>
> #include <ctype.h>
>
> int main() {
>     int count = 0, c;
>     while (1) {
>         while (1) {
>             c = getchar();
>             if (c == EOF)
>                 goto END;
>             if (!isspace(c))
>                 break;
>         }
>         count++;
>         while (1) {
>             c = getchar();
>             if (c == EOF)
>                 goto END;
>             if (isspace(c))
>                 break;
>         }
>     }
>     END:
>     printf("%d\n", count);
>     return 0;
> }
> [snip]

As an exercise for the viewer, if you use SSE2, prefetch, and non-branching bitmath you can perform this roughly 32 times as fast.

Regards,
Daniel
Mar 26 2008
> As an exercise for the viewer, if you use SSE2, prefetch, and non-branching bitmath you can perform this roughly 32 times as fast.
> Regards, Daniel

Please teach me!
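Daniel doesn't elaborate, but one piece of the idea can be sketched without the SSE2 and prefetch parts: classify each byte as space/non-space and count word starts with plain integer arithmetic instead of branches. This is a hedged illustration of the "non-branching bitmath" direction, not Daniel's actual code; the whitespace set is hard-coded rather than going through isspace().

```cpp
#include <cstddef>

// Branch-free scalar word count: a word starts wherever a non-space byte
// follows a space byte. The inner loop contains no conditional branches;
// the space test and the "word starts here" test are boolean arithmetic.
// (SSE2 would apply the same classification to 16 bytes at a time.)
static int count_words_branchless(const unsigned char* buf, std::size_t len) {
    int count = 0;
    int prev_space = 1;  // treat the position before the buffer as whitespace
    for (std::size_t i = 0; i < len; i++) {
        unsigned char c = buf[i];
        int is_space = (c == ' ') | (c == '\t') | (c == '\n') |
                       (c == '\r') | (c == '\v') | (c == '\f');
        count += prev_space & (is_space ^ 1);  // 1 exactly at a word start
        prev_space = is_space;
    }
    return count;
}
```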
Mar 26 2008
bearophile Wrote:
[...]

Seems to be a backend problem:

dmd -O wcd.d
dmc -6 wcc.c
gcc -O3 -mno-cygwin -o wccm wcc.c

On my 7.5MB iTunes XML file:

$ time ./wcd < it.xml
303412

real    0m1.078s
user    0m0.015s
sys     0m0.015s

$ time ./wcc < it.xml
303412

real    0m1.063s
user    0m0.031s
sys     0m0.000s

$ time ./wccm < it.xml
303412

real    0m0.172s
user    0m0.015s
sys     0m0.015s

Even if the usage of opt-switches is arguable here, it is rather a back-end than a language thing.
Mar 27 2008
TomD wrote:
> Seems to be a backend problem:
> dmd -O wcd.d

To get the fastest code out of dmd:

dmd -release -inline -O wcd.d
Mar 27 2008
I did a little checking.

D's and DMC's isspace(c) checks to see if the input character c is ascii. gcc's does not. So there's a little slowdown there, buying robustness.

What the real difference is, however, is that D's and DMC's getchar() is synchronized for multiple threads. This means that there's a lock/unlock going on for *every* single character read. gcc may not synchronize stdin for multiple threads, or may have found a way to avoid the locking if the start-new-thread function in the library is never called.

For the Digital Mars stdio library, the way to do fast I/O from the synchronized streams is to do it in chunks, not character-by-character. Several functions in std.stdio are available to do this - I bet you'll get a dramatic speed boost by using one of them.

In other words, from examining the generated asm for dmd and gcc it's pretty clear that the observed speed difference has essentially *nothing* to do with either the language or the back end optimizer, and everything to do with the I/O strategy. This is why it is crucial to figure out why one is getting certain results from a benchmark, instead of just assuming.

P.S. In general, it's just a bad idea to do compute performance benchmarks using I/O, because the performance will be dominated by the relatively slow I/O.

P.P.S. Back in the 80's, my compiler would regularly beat the tar out of other compilers on magazine I/O speed benchmarks. It's because my library used a large disk buffer, which made a HUGE difference when reading files from a floppy disk. My competitors never did figure that out until about 10 years later <g>.
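Walter's chunked-I/O advice can be made concrete. The sketch below (C-style C++, my own illustration rather than the std.stdio functions he refers to) reads 64 KB per library call, so any per-call stream lock is taken once per chunk instead of once per character; the word-boundary state is carried across chunks so a word split between two reads is counted once.

```cpp
#include <cstdio>
#include <cctype>
#include <cstddef>

// Count words in one buffer; *in_word carries the "currently inside a
// word" state across chunk boundaries.
static int count_words(const unsigned char* buf, std::size_t len, int* in_word) {
    int count = 0;
    for (std::size_t i = 0; i < len; i++) {
        if (isspace(buf[i])) {
            *in_word = 0;
        } else if (!*in_word) {
            *in_word = 1;
            count++;  // first non-space byte of a word
        }
    }
    return count;
}

// One fread (and one potential stream lock) per 64 KB, not per character.
static int count_stream(std::FILE* fp) {
    static unsigned char buf[64 * 1024];
    std::size_t n;
    int count = 0, in_word = 0;
    while ((n = std::fread(buf, 1, sizeof buf, fp)) > 0)
        count += count_words(buf, n, &in_word);
    return count;
}
```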
Mar 27 2008
> P.P.S. Back in the 80's, my compiler would regularly beat the tar out of other compilers on magazine I/O speed benchmarks. It's because my library used a large disk buffer, which made a HUGE difference when reading files from a floppy disk. My competitors never did figure that out until about 10 years later <g>.

lol
Mar 27 2008
Saaa wrote:
>> P.P.S. Back in the 80's, my compiler would regularly beat the tar out of other compilers on magazine I/O speed benchmarks. It's because my library used a large disk buffer, which made a HUGE difference when reading files from a floppy disk. My competitors never did figure that out until about 10 years later <g>.
> lol

The hilarity is that full library source was included with the compiler - what I was doing was in plain sight! Understanding how I/O works is still key to making high performance programs, despite there being very little discussion about it.
Mar 27 2008
Walter Bright wrote:
> I did a little checking. D's and DMC's isspace(c) checks to see if the input character c is ascii. gcc's does not. So there's a little slowdown there, buying robustness. What the real difference is, however, is that D's and DMC's getchar() is synchronized for multiple threads. This means that there's a lock/unlock going on for *every* single character read. gcc may not synchronize stdin for multiple threads, or may have found a way to avoid the locking if the start-new-thread function in the library is never called.

Hmm. What kind of situations would need multiple threads simultaneously reading from stdin? And if there aren't any, wouldn't locking be like wearing a life vest on land?

And isspace, is that really the right place to check for ascii?
Mar 27 2008
Georg Wrede wrote:
> Hmm. What kind of situations would need multiple threads simultaneously reading from stdin? And if there aren't any, wouldn't locking be like wearing a life vest on land?

If it was unsynchronized, if you had two threads reading lines from the input, the contents of the lines would be interleaved. With syncing, each thread will get complete lines.

> And isspace, is that really the right place to check for ascii?

It has to be, because its argument is an int, not a char.
Mar 27 2008
Walter Bright wrote:
> Georg Wrede wrote:
>> Hmm. What kind of situations would need multiple threads simultaneously reading from stdin? And if there aren't any, wouldn't locking be like wearing a life vest on land?
> If it was unsynchronized, if you had two threads reading lines from the input, the contents of the lines would be interleaved. With syncing, each thread will get complete lines.

I believe his point was "What kind of program would ever want to do character-level stdin I/O from multiple threads?"

My follow-up would be: "And if there *is* some exotic program that needs to do that, why can't you just require it to provide its own damn locking, and leave the rest of us with faster I/O?"

(Of course, I use Linux, so AFAIK I'm using glibc anyway)
Mar 27 2008
Frits van Bommel wrote:
> Walter Bright wrote:
>> Georg Wrede wrote:
>>> Hmm. What kind of situations would need multiple threads simultaneously reading from stdin? And if there aren't any, wouldn't locking be like wearing a life vest on land?
>> If it was unsynchronized, if you had two threads reading lines from the input, the contents of the lines would be interleaved. With syncing, each thread will get complete lines.
> I believe his point was "What kind of program would ever want to do character-level stdin I/O from multiple threads?" My follow-up would be: "And if there *is* some exotic program that needs to do that, why can't you just require it to provide its own damn locking, and leave the rest of us with faster I/O?" (Of course, I use Linux, so AFAIK I'm using glibc anyway)

Heck, I just spent ages formulating the same thing. You said it in a couple of sentences.
Mar 27 2008
You guys do have a point that perhaps syncing is not needed on stdin. But I'm a little leery of removing it, as it is needed for output, and I'm not comfortable that this is thought through thoroughly. But my other point is that this is an issue with the Digital Mars C runtime library, and has nothing whatsoever to do with D or the optimizer/backend of D.
Mar 28 2008
== Quote from Walter Bright (newshound1 digitalmars.com)'s article
> You guys do have a point that perhaps syncing is not needed on stdin. But I'm a little leery of removing it, as it is needed for output, and I'm not comfortable that this is thought through thoroughly. But my other point is that this is an issue with the Digital Mars C runtime library, and has nothing whatsoever to do with D or the optimizer/backend of D.

Another option might be to check the thread count is greater than 1 and only lock if it is. Tango has a routine called thread_needLock for this purpose, though it goes a bit farther and is true once a thread has been created through program termination. This avoids problems with the situation where you have multiple threads running and all but one terminate but memory is not yet synchronized.

Sean
Mar 28 2008
Sean Kelly wrote:
> Another option might be to check the thread count is greater than 1 and only lock if it is. Tango has a routine called thread_needLock for this purpose, though it goes a bit farther and is true once a thread has been created through program termination. This avoids problems with the situation where you have multiple threads running and all but one terminate but memory is not yet synchronized.

You have to be very careful with such schemes because they can run afoul of the notoriously difficult to comprehend "double checked locking" bug.
Mar 28 2008
Walter Bright wrote:
> Sean Kelly wrote:
>> Another option might be to check the thread count is greater than 1 and only lock if it is. Tango has a routine called thread_needLock for this purpose, though it goes a bit farther and is true once a thread has been created through program termination. This avoids problems with the situation where you have multiple threads running and all but one terminate but memory is not yet synchronized.
> You have to be very careful with such schemes because they can run afoul of the notoriously difficult to comprehend "double checked locking" bug.

In this case though, if the thread count was 1 when checked it's safe to proceed without locking as long as you don't spawn any new threads. If it wasn't when checking, but became 1 afterwards it doesn't really matter; you may not have needed to lock it but it'll work fine if you do. (Just make sure the code path that locks also (unconditionally) unlocks, and of course that the code path that doesn't lock doesn't try to unlock either[1])

Oh, and the Phobos gc also has a thread_needLock() (in phobos/internal/gc/gcx.d).

[1]
---
if (!thread_needLock()) {
    foo();
} else synchronized(X) {
    foo();
}
---
does that nicely, and is what the GC uses. (Both in Phobos and Tango)

Hmm, reading the relevant parts of the GC code just now, something occurred to me. Some background first: If allocation triggers an actual collection it checks thread_needLock() again, and locks for the duration of the collection if there's only one thread. The comment explains this is done because finalizers may start new threads. However, the lock is then released *before* using the newly-collected heap to perform the actual allocation. That makes me wonder what happens if the first thing such a new thread does is allocating some memory...

In other words:
1) The only thread starts an allocation, determining the lock is not needed.
2) There's no space to allocate, and the GC prepares for a collection.
3) The GC notices the number of threads is 1, and acquires the lock, and starts performing the collection.
4) A finalizer starts a new thread, which attempts to allocate and blocks on the GC lock held by the collector.
5) The original thread finishes the collection and *releases the lock*.
6) It then determines what memory location to return from 'new'.
7) Thread switch, the second thread acquires the GC lock (which is no longer held by the first thread) and starts its own allocation activities. Since the original thread didn't yet mark its chosen memory as allocated, the second thread picks the same memory and marks it as allocated.
8) Thread switch, the first thread finishes its allocation by marking the memory as allocated. (even though it already was marked by the second thread)
9) Both threads start using the same piece of memory as if they have the only reference to it (which should be true from either perspective, since it was just returned from 'new').
10) *KABOOM*

(I may have just proven your point about these things being very hard to get right...)
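The general shape of the bug in steps 1-10 is releasing the lock between choosing a free memory location and marking it allocated. A toy model (my own sketch, not the actual Phobos GC code) shows the fix: hold the lock across both steps, so no second thread can pick the same slot in between.

```cpp
#include <mutex>
#include <vector>
#include <cstddef>

// Toy free-slot allocator. The race described above corresponds to
// unlocking after the "find a free slot" step but before the "mark it
// used" step; here the std::lock_guard holds the mutex across both,
// so two threads can never be handed the same slot.
struct ToyAllocator {
    std::vector<bool> used;
    std::mutex m;

    explicit ToyAllocator(std::size_t n) : used(n, false) {}

    // Returns a slot index, or -1 if full.
    int alloc() {
        std::lock_guard<std::mutex> lock(m);   // held through find AND mark
        for (std::size_t i = 0; i < used.size(); i++) {
            if (!used[i]) {
                used[i] = true;                // mark before the lock drops
                return static_cast<int>(i);
            }
        }
        return -1;
    }
};
```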
Mar 28 2008
Frits van Bommel wrote:
> 10) *KABOOM*

Bugzilla'd: http://d.puremagic.com/issues/show_bug.cgi?id=1957
Mar 28 2008
== Quote from Walter Bright (newshound1 digitalmars.com)'s article
> Sean Kelly wrote:
>> Another option might be to check the thread count is greater than 1 and only lock if it is. Tango has a routine called thread_needLock for this purpose, though it goes a bit farther and is true once a thread has been created through program termination. This avoids problems with the situation where you have multiple threads running and all but one terminate but memory is not yet synchronized.
> You have to be very careful with such schemes because they can run afoul of the notoriously difficult to comprehend "double checked locking" bug.

With the way this call is implemented in Tango, it is completely safe to omit a mutex lock on a false return so long as the ostensibly locked code does not call into code that may create a thread. For something like an IO operation, this seems pretty unlikely so it's a reasonable optimization, if perhaps progressively less useful as multithreading becomes more common.

Sean
Mar 28 2008
Walter Bright wrote:Sean Kelly wrote:I thought the double-checked locking idiom was only buggy in the Java memory model (prior to being fixed in the 1.6 JDK). --benji smithAnother option might be to check the thread count is greater than 1 and only lock if it is. Tango has a routine called thread_needLock for this purpose, though it goes a bit farther and is true once a thread has been created through program termination. This avoids problems with the situation where you have multiple threads running and all but one terminate but memory is not yet synchronized.You have to be very careful with such schemes because they can run afoul of the notoriously difficult to comprehend "double checked locking" bug.
Mar 28 2008
Benji Smith wrote:I thought the double-checked locking idiom was only buggy in the Java memory model (prior to being fixed in the 1.6 JDK).It's a common bug in C/C++ code.
Mar 29 2008
== Quote from Benji Smith (benji benjismith.net)'s articleWalter Bright wrote:It's broken in C++ as well. It actually works in D however, provided one uses 'volatile' in the right places. I posted a demo impl a few years back either in this forum or in D.announce. SeanSean Kelly wrote:I thought the double-checked locking idiom was only buggy in the Java memory model (prior to being fixed in the 1.6 JDK).Another option might be to check the thread count is greater than 1 and only lock if it is. Tango has a routine called thread_needLock for this purpose, though it goes a bit farther and is true once a thread has been created through program termination. This avoids problems with the situation where you have multiple threads running and all but one terminate but memory is not yet synchronized.You have to be very careful with such schemes because they can run afoul of the notoriously difficult to comprehend "double checked locking" bug.
Mar 29 2008
Sean Kelly Wrote:== Quote from Benji Smith (benji benjismith.net)'s articleCouldn't you use two mutexes to make this work even in C++ (but I'm using D syntax for simplicity)? The second mutex doesn't actually protect anything, but it should be a portable way to include the memory barriers. Actually "portable" is a weasel word, since all C++ locking and multithreadedness is done in platform-dependent libraries.

class Foo {
private:
    Object data;
    Mutex a, b;

public:
    Object getData() {
        if (data is null) {
            a.lock();
            if (data is null) {
                Object o = null;
                {
                    b.lock();
                    o = new Object;
                    b.unlock();
                }
                data = o;
            }
            a.unlock();
        }
        return data;
    }
};

Kevin
Walter Bright wrote:It's broken in C++ as well. It actually works in D however, provided one uses 'volatile' in the right places. I posted a demo impl a few years back either in this forum or in D.announce. Sean
Sean Kelly wrote:I thought the double-checked locking idiom was only buggy in the Java memory model (prior to being fixed in the 1.6 JDK). Another option might be to check the thread count is greater than 1 and only lock if it is. Tango has a routine called thread_needLock for this purpose, though it goes a bit farther and is true once a thread has been created through program termination. This avoids problems with the situation where you have multiple threads running and all but one terminate but memory is not yet synchronized. You have to be very careful with such schemes because they can run afoul of the notoriously difficult to comprehend "double checked locking" bug.
Mar 29 2008
== Quote from Kevin Bealer (kevinbealer gmail.com)'s articleSean Kelly Wrote:second mutex doesn't actually protect anything, but it should be a portable way to include the memory barriers. Actually "portable" is a weasel word since all C++ locking and multithreadedness is done in platform-dependent libraries.== Quote from Benji Smith (benji benjismith.net)'s articleCouldn't you use two mutexes to make this work even in C++ (but I'm using D syntax for simplicity)? TheWalter Bright wrote:It's broken in C++ as well. It actually works in D however, provided one uses 'volatile' in the right places. I posted a demo impl a few years back either in this forum or in D.announce. SeanSean Kelly wrote:I thought the double-checked locking idiom was only buggy in the Java memory model (prior to being fixed in the 1.6 JDK).Another option might be to check the thread count is greater than 1 and only lock if it is. Tango has a routine called thread_needLock for this purpose, though it goes a bit farther and is true once a thread has been created through program termination. This avoids problems with the situation where you have multiple threads running and all but one terminate but memory is not yet synchronized.You have to be very careful with such schemes because they can run afoul of the notoriously difficult to comprehend "double checked locking" bug.class Foo { private: Object data; Mutex a, b; public: Object getData() { if (data is null) { a.lock(); if (data is null) { Object o = null; { b.lock(); o = new Object; b.unlock(); } data = o; } a.unlock(); } return data; } }; That seems like it might work. Sean
Mar 29 2008
Walter Bright wrote:Sean Kelly wrote:Perhaps a method of disabling locking would be useful. Either by simply providing a lock free version of the function or by being able to turn locking on/off. Either way the user would be the one who would explicitly disable it. -JoelAnother option might be to check the thread count is greater than 1 and only lock if it is. Tango has a routine called thread_needLock for this purpose, though it goes a bit farther and is true once a thread has been created through program termination. This avoids problems with the situation where you have multiple threads running and all but one terminate but memory is not yet synchronized.You have to be very careful with such schemes because they can run afoul of the notoriously difficult to comprehend "double checked locking" bug.
Mar 29 2008
Walter Bright wrote:Georg Wrede wrote:Yes. Now, my problem was, it should be pretty uncommon to need to have two (or of course more) threads reading the same input within the same program (and hugely more uncommon for stdin, since the _entry_ to stdin is implemented in a _single_ process anyway in the operating system, even if several simultaneous processes were to feed it data). Therefore there is no way a process can acquire data at a rate faster than that of a single processor. So, it would only be relevant to an application that
- is disturbed by having the input mistakenly be split into smaller than one-line pieces
- is happy with a line at a time (i.e. lines are self-sufficient) vs. needing to split it into logical entities potentially larger than a line (like a parser), any of which may be given to one of the threads.
A programmer even dreaming of implementing this most likely will implement the input itself as a single thread, dispatching the data between multiple threads. The assumptions being
- reading stdin is conceptually ill suited to multithreading
- consuming the data is slower than reading [and dispatching] it
----
Anyhow, I've never heard even rumors of multiple threads reading _stdin_. No programmer contemplating such would be newbie enough to actually try it. Besides, he'd never _expect_ reading stdio to be thread-safe, therefore he'd not even check the documentation. All in all, I still think it's wearing a life vest on land. And crapping up throughput. Hmm. What kind of situations would need multiple threads simultaneously reading from stdin? And if there aren't any, wouldn't locking be like wearing a life vest on land? If it was unsynchronized, if you had two threads reading lines from the input, the contents of the lines would be interleaved. With syncing, each thread will get complete lines. Ah, ok. (Not to be a perennial PITA,) but the original context of your statement was in a comparison between D and C++. 
Now, this might have given the casual reader the assumption that C++ doesn't have this int thing here, making C++ faster. You may want to elaborate on that for the regular reader. And isspace, is that really the right place to check for ASCII? It has to be, because its argument is an int, not a char.
Mar 27 2008
Walter Bright:gcc may not synchronize stdin for multiple threads, or may have found a way to avoid the locking if the start-new-thread function in the library is never called. Thank you for taking a look at this benchmark. I understand that D is safer regarding non-ASCII, but 8 times slower seems a lot. I think GCC has functions like fgets_unlocked, getc_unlocked, etc. (which I think the DMD standard library misses) for single-threaded programs: http://linux.die.net/man/3/fgets_unlocked http://www.opengroup.org/onlinepubs/009695399/functions/getc_unlocked.html For the Digital Mars stdio library, the way to do fast I/O from the synchronized streams is to do it in chunks, not character-by-character. Several functions in std.stdio are available to do this - I bet you'll get a dramatic speed boost by using one of them. At the moment DMD is two times slower than Clean in the number-reading benchmark, so there's space for improvements there too: http://shootout.alioth.debian.org/gp4/benchmark.php?test=sumcol&lang=all Bye, and thank you, bearophile
Mar 27 2008