
digitalmars.D.learn - gdc or ldc for faster programs?

reply Ali Çehreli <acehreli yahoo.com> writes:
Sorry for being vague and not giving the code here but a program I wrote 
about spelling-out parts of a number (in Turkish) as in "1 milyon 42" 
runs much faster with gdc.

The program integer-divides the number in a loop to find quotients and 
adds the word next to it. One obvious optimization might be to use POSIX 
div() and friends to get the quotient and the remainder at one shot but 
I made myself believe that the compilers already do that. (But still not 
sure. :o) )

I am not experienced with dub but I used --build=release-nobounds and 
verified that -O3 is used for both compilers. (I also tried building 
manually with GNU 'make' with e.g. -O5 and the results were similar.)

For a test run for 2 million numbers:

ldc: ~0.95 seconds
gdc: ~0.79 seconds
dmd: ~1.77 seconds

I am using compilers installed by Manjaro Linux's package system:

ldc: LDC - the LLVM D compiler (1.28.0):
   based on DMD v2.098.0 and LLVM 13.0.0

gdc: gdc (GCC) 11.1.0

dmd: DMD64 D Compiler v2.098.1

I've been mainly a dmd person for various reasons and was under the 
impression that ldc was the clear winner among the three. What is your 
experience? Does gdc compile faster programs in general? Would ldc win 
if I took advantage of e.g. link-time optimizations?

Ali
Jan 25 2022
next sibling parent reply Johan <j j.nl> writes:
On Tuesday, 25 January 2022 at 19:52:17 UTC, Ali Çehreli wrote:
 I am not experienced with dub but I used 
 --build=release-nobounds and verified that -O3 is used for both 
 compilers. (I also tried building manually with GNU 'make' with 
 e.g. -O5 and the results were similar.)
`-O5` does not do anything different than `-O3` for LDC.
 For a test run for 2 million numbers:

 ldc: ~0.95 seconds
 gdc: ~0.79 seconds
 dmd: ~1.77 seconds

 I am using compilers installed by Manjaro Linux's package 
 system:

 ldc: LDC - the LLVM D compiler (1.28.0):
   based on DMD v2.098.0 and LLVM 13.0.0

 gdc: dc (GCC) 11.1.0

 dmd: DMD64 D Compiler v2.098.1

 I've been mainly a dmd person for various reasons and was under 
 the impression that ldc was the clear winner among the three. 
 What is your experience? Does gdc compile faster programs in 
 general? Would ldc win if I took advantage of e.g. link-time 
 optimizations?
Tough to say. Of course DMD is not a serious contender, but I believe the difference between GDC and LDC is very small and really in the details, i.e. you'll have to look at assembly to find out the delta.

Have you tried `--enable-cross-module-inlining` with LDC?

-Johan
Jan 25 2022
next sibling parent Ali Çehreli <acehreli yahoo.com> writes:
On 1/25/22 12:01, Johan wrote:

 Have you tried `--enable-cross-module-inlining` with LDC?
Tried now. Makes no difference that I can sense, likely because there is only one module anyway. :) (But I guess it works over Phobos modules too.)

Ali
Jan 25 2022
prev sibling parent reply forkit <forkit gmail.com> writes:
On Tuesday, 25 January 2022 at 20:01:18 UTC, Johan wrote:
 Tough to say. Of course DMD is not a serious contender, but I 
 believe the difference between GDC and LDC is very small and 
 really in the details, i.e. you'll have to look at assembly to 
 find out the delta.
 Have you tried `--enable-cross-module-inlining` with LDC?

 -Johan
dmd is the best though, in terms of compilation speed without optimisation.

As I write/test A LOT of code, that time saved is very much appreciated ;-)

I hope it remains that way.
Jan 25 2022
parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Jan 25, 2022 at 11:01:57PM +0000, forkit via Digitalmars-d-learn wrote:
 On Tuesday, 25 January 2022 at 20:01:18 UTC, Johan wrote:
 
 Tough to say. Of course DMD is not a serious contender, but I
 believe the difference between GDC and LDC is very small and really
 in the details, i.e. you'll have to look at assembly to find out the
 delta.  Have you tried `--enable-cross-module-inlining` with LDC?
[...]
 dmd is the best though, in terms of compilation speed without
 optimisation.
 
 As I write/test A LOT of code, that time saved is very much
 appreciated ;-)
[...] My general approach is: use dmd for iterating the code-compile-test cycle, and use LDC for release/production builds.

T

-- 
Chance favours the prepared mind. -- Louis Pasteur
Jan 25 2022
prev sibling next sibling parent reply Adam D Ruppe <destructionator gmail.com> writes:
On Tuesday, 25 January 2022 at 19:52:17 UTC, Ali Çehreli wrote:
 ldc: ~0.95 seconds
 gdc: ~0.79 seconds
 dmd: ~1.77 seconds
Not surprising at all: gdc is excellent and underrated in the community.
Jan 25 2022
next sibling parent reply Daniel N <no public.email> writes:
On Tuesday, 25 January 2022 at 20:04:04 UTC, Adam D Ruppe wrote:
 On Tuesday, 25 January 2022 at 19:52:17 UTC, Ali Çehreli wrote:
 ldc: ~0.95 seconds
 gdc: ~0.79 seconds
 dmd: ~1.77 seconds
Maybe you can try --ffast-math on ldc.
Jan 25 2022
parent Ali Çehreli <acehreli yahoo.com> writes:
On 1/25/22 12:59, Daniel N wrote:

 Maybe you can try --ffast-math on ldc.
Did not make a difference.

Ali
Jan 25 2022
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Jan 25, 2022 at 08:04:04PM +0000, Adam D Ruppe via Digitalmars-d-learn
wrote:
 On Tuesday, 25 January 2022 at 19:52:17 UTC, Ali Çehreli wrote:
 ldc: ~0.95 seconds
 gdc: ~0.79 seconds
 dmd: ~1.77 seconds
Not surprising at all: gdc is excellent and underrated in the community.
The GCC optimizer is actually pretty darned good, comparable to LDC's. I only prefer LDC because of easier cross-compilation and a more up-to-date language version (due to GDC being tied to GCC's release cycle). But I wouldn't hesitate to use gdc if I didn't need to cross-compile or use features from the latest language version.

DMD's optimizer is miles behind LDC/GDC, sad to say. About the only thing that keeps me using dmd is its lightning-fast compilation times, ideal for iterative development. For anything performance related, DMD isn't even on my radar.

T

-- 
Doubtless it is a good thing to have an open mind, but a truly open mind should be open at both ends, like the food-pipe, with the capacity for excretion as well as absorption. -- Northrop Frye
Jan 25 2022
prev sibling parent Chris Piker <chris hoopjump.com> writes:
On Tuesday, 25 January 2022 at 20:04:04 UTC, Adam D Ruppe wrote:
 Not surprising at all: gdc is excellent and underrated in the 
 community.
The performance metrics are just a bonus. Gdc is the main reason I can get my worksite to take D seriously since we're a traditional unix shop (solaris -> linux).

The gdc crew are doing a *huge* service for the community.
Mar 10 2022
prev sibling next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Jan 25, 2022 at 11:52:17AM -0800, Ali Çehreli via Digitalmars-d-learn
wrote:
 Sorry for being vague and not giving the code here but a program I
 wrote about spelling-out parts of a number (in Turkish) as in "1
 milyon 42" runs much faster with gdc.
 
 The program integer-divides the number in a loop to find quotients and
 adds the word next to it. One obvious optimization might be to use
 POSIX div() and friends to get the quotient and the remainder at one
 shot but I made myself believe that the compilers already do that.
 (But still not sure. :o))
Don't guess at what the compilers are doing; disassemble the binary and see for yourself exactly what the difference is. Use run.dlang.io for a convenient interface that shows you exactly how the compilers translated your code. Or if you're macho, use `objdump -d` and search for _Dmain (or the specific function if you know how it's mangled).
 I am not experienced with dub but I used --build=release-nobounds and
 verified that -O3 is used for both compilers. (I also tried building
 manually with GNU 'make' with e.g. -O5 and the results were similar.)
 
 For a test run for 2 million numbers:
 
 ldc: ~0.95 seconds
 gdc: ~0.79 seconds
 dmd: ~1.77 seconds
For measurements under 1 second, I'm skeptical of the accuracy, because there could be all kinds of background noise, CPU interrupts and stuff that could be skewing the numbers. What about doing a best-of-3-runs with 20 million numbers (expected <20 seconds per run) and seeing how the numbers look?

Though having said all that, I can say at least that dmd's relatively poor performance seems in line with my previous observations. :-P The difference between ldc and gdc is harder to pinpoint; they each have different optimizers that could work better or worse than the other depending on the specifics of what the program is doing.

[...]
 I've been mainly a dmd person for various reasons and was under the
 impression that ldc was the clear winner among the three. What is your
 experience? Does gdc compile faster programs in general? Would ldc win
 if I took advantage of e.g. link-time optimizations?
[...] I'm not sure LDC is the clear winner. I only prefer LDC because LDC's architecture makes it easier for cross-compilation (with GCC/GDC you need to jump through a lot more hoops to get a working cross compiler). GDC is also tied to the GCC release cycle, and tends to be several language versions behind LDC.

Both compilers have excellent optimizers, but they are definitely different, so for some things GDC will beat LDC, and for other things LDC will beat GDC. It may depend on the specific optimization flags you use as well.

But these sorts of statements are just generalizations. The best way to find out for sure is to disassemble the executable and see for yourself what the assembly looks like. :-)

T

-- 
Public parking: euphemism for paid parking. -- Flora
Jan 25 2022
parent reply Ali Çehreli <acehreli yahoo.com> writes:
On 1/25/22 12:42, H. S. Teoh wrote:

 For a test run for 2 million numbers:

 ldc: ~0.95 seconds
 gdc: ~0.79 seconds
 dmd: ~1.77 seconds
For measurements under 1 second, I'm skeptical of the accuracy, because there could be all kinds of background noise, CPU interrupts and stuff that could be skewing the numbers. What about do a best-of-3-runs with 20 million numbers (expected <20 seconds per run) and see how the numbers look?
Makes sense. The results are similar to the 2 million run.
 But these sorts of statements are just generalizations. The best way to
 find out for sure is to disassemble the executable and see for yourself
 what the assembly looks like. :-)
I posted the program to have more eyes on the assembly. ;)

Ali
Jan 25 2022
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Jan 25, 2022 at 01:30:59PM -0800, Ali Çehreli via Digitalmars-d-learn
wrote:
[...]
 I posted the program to have more eyes on the assembly. ;)
[...] I tested the code locally, and observed, just like Ali did, that the LDC version is unambiguously slower than the gdc version by a small margin. So I decided to compare the disassembly. Due to the large number of templates in the main spellOut/spellOutImpl functions, I didn't have the time to look at all of them; I just arbitrarily picked the !(int) instantiation. And I'm seeing something truly fascinating:

- The GDC version has at its core a single idivl instruction for the / and %= operators (I surmise that the optimizer realized that both could share the same instruction because it yields both results). The function is short and compact.

- The LDC version, however, seems to go out of its way to avoid the idivl instruction, having instead a whole bunch of shr instructions and imul instructions involving magic constants -- the kind of stuff you see in bit-twiddling hacks when people try to ultra-optimize their code. There also appears to be some loop unrolling, and the function is markedly longer than the GDC version because of this.

This is very interesting because idivl is known to be one of the slower instructions, but gdc nevertheless considered it not worthwhile to replace it, whereas ldc seems obsessed about avoiding idivl at all costs. I didn't check the other instantiations, but it would appear that in this case the simpler route of just using idivl won over the complexity of trying to replace it with shr+mul.

T

-- 
Guns don't kill people. Bullets do.
Jan 25 2022
next sibling parent reply Elronnd <elronnd elronnd.net> writes:
On Tuesday, 25 January 2022 at 22:33:37 UTC, H. S. Teoh wrote:
 interesting because idivl is known to be one of the slower 
 instructions, but gdc nevertheless considered it not worthwhile 
 to replace it, whereas ldc seems obsessed about avoid idivl at 
 all costs.
Interesting indeed. Two remarks:

1. Actual performance cost of div depends a lot on hardware. IIRC on my old intel laptop it's like 40-60 cycles; on my newer amd chip it's more like 20; on my mac it's ~10. GCC may be assuming newer hardware than llvm. Could be worth popping on a -march=native -mtune=native. Also could depend on how many ports can do divs; i.e. how many of them you can have running at a time.

2. LLVM is more aggressive wrt certain optimizations than gcc, by default. Though I don't know how relevant that is at -O3.
Jan 25 2022
next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Jan 25, 2022 at 10:41:35PM +0000, Elronnd via Digitalmars-d-learn wrote:
 On Tuesday, 25 January 2022 at 22:33:37 UTC, H. S. Teoh wrote:
 interesting because idivl is known to be one of the slower
 instructions, but gdc nevertheless considered it not worthwhile to
 replace it, whereas ldc seems obsessed about avoid idivl at all
 costs.
Interesting indeed. Two remarks: 1. Actual performance cost of div depends a lot on hardware. IIRC on my old intel laptop it's like 40-60 cycles; on my newer amd chip it's more like 20; on my mac it's ~10. GCC may be assuming newer hardware than llvm. Could be worth popping on a -march=native -mtune=native. Also could depend on how many ports can do divs; i.e. how many of them you can have running at a time.
I tried `ldc2 -mcpu=native` but that did not significantly change the performance.
 2. LLVM is more aggressive wrt certain optimizations than gcc, by
 default.  Though I don't know how relevant that is at -O3.
Yeah, I've noted in the past that LDC seems to be pretty aggressive with inlining / loop unrolling, whereas GDC has a thing for vectorization and SIMD/XMM usage. The exact outcomes are a toss-up, though. Sometimes LDC wins, sometimes GDC wins. Depends on what exactly the code is doing.

T

-- 
"Outlook not so good." That magic 8-ball knows everything! I'll ask about Exchange Server next. -- (Stolen from the net)
Jan 25 2022
prev sibling parent reply Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Tuesday, 25 January 2022 at 22:41:35 UTC, Elronnd wrote:
 On Tuesday, 25 January 2022 at 22:33:37 UTC, H. S. Teoh wrote:
 interesting because idivl is known to be one of the slower 
 instructions, but gdc nevertheless considered it not 
 worthwhile to replace it, whereas ldc seems obsessed about 
 avoid idivl at all costs.
Interesting indeed. Two remarks: 1. Actual performance cost of div depends a lot on hardware. IIRC on my old intel laptop it's like 40-60 cycles; on my newer amd chip it's more like 20; on my mac it's ~10. GCC may be assuming newer hardware than llvm. Could be worth popping on a -march=native -mtune=native. Also could depend on how many ports can do divs; i.e. how many of them you can have running at a time. 2. LLVM is more aggressive wrt certain optimizations than gcc, by default. Though I don't know how relevant that is at -O3.
-O3 often chooses longer code and unrolls more aggressively, inducing higher miss rates in the instruction caches. -O2 can beat -O3 in some cases when code size is important.
Jan 31 2022
next sibling parent Elronnd <elronnd elronnd.net> writes:
On Monday, 31 January 2022 at 08:54:16 UTC, Patrick Schluter 
wrote:
 -O3 often chooses longer code and unrollsmore agressively 
 inducing higher miss rates in the instruction caches.
 -O2 can beat -O3 in some cases when code size is important.
That is generally true. My point is that GCC and Clang make different tradeoffs when told '-O2'; Clang is more aggressive than GCC at -O2. I don't know if that still holds at -O3 (I expect probably not).
Jan 31 2022
prev sibling parent reply Siarhei Siamashka <siarhei.siamashka gmail.com> writes:
On Monday, 31 January 2022 at 08:54:16 UTC, Patrick Schluter 
wrote:
 -O3 often chooses longer code and unrollsmore agressively 
 inducing higher miss rates in the instruction caches.
 -O2 can beat -O3 in some cases when code size is important.
One of the historical reasons for favoring the -O2 optimization level over -O3 was the necessity for Linux distributions to fit on a CD or DVD. Also if everyone is using -O2 optimizations, then -O3 optimizations get a lot less testing coverage and are more likely to have compiler bugs. This makes -O2 even more attractive for those who prefer safety and stability...

I think that it's a good thing that LDC is breaking out of this -O2 vs. -O3 dilemma by just mapping the "-O" option to -O3 ("aggressive optimizations"):

Setting the optimization level:
  -O   - Equivalent to -O3
  --O0 - No optimizations (default)
  --O1 - Simple optimizations
  --O2 - Good optimizations
  --O3 - Aggressive optimizations
  --O4 - Equivalent to -O3
  --O5 - Equivalent to -O3
  --Os - Like -O2 with extra optimizations for size
  --Oz - Like -Os but reduces code size further

I wonder if GDC can do the same?
Jan 31 2022
parent Iain Buclaw <ibuclaw gdcproject.org> writes:
On Monday, 31 January 2022 at 10:33:49 UTC, Siarhei Siamashka 
wrote:
 I wonder if GDC can do the same?
GDC as a front-end doesn't dictate what the optimization passes are doing, nor does it have any real control what each level means. It is only ensured that semantic doesn't break because of an optimization pass.
Mar 09 2022
prev sibling parent Ali Çehreli <acehreli yahoo.com> writes:
On 1/25/22 14:33, H. S. Teoh wrote:

 This is very interesting
Fascinating code generation and investigation! :)

Ali
Jan 25 2022
prev sibling next sibling parent Ali Çehreli <acehreli yahoo.com> writes:
On 1/25/22 11:52, Ali Çehreli wrote:

 a program I wrote about spelling-out parts of a number
Here is the program as a single module:

module spellout.spellout;

// This program was written as a code kata to spell out
// certain parts of integers as in "1 million 2 thousand
// 42". Note that this way of spelling-out numbers is not
// grammatically correct in English.

// Returns a string that contains the partly spelled-out version
// of the parameter.
//
// You must copy the returned string when needed as this function
// uses the same internal buffer for all invocations of the same
// template instance.
auto spellOut(T)(in T number_) {
  import std.array : Appender;
  import std.string : strip;
  import std.traits : Unqual;
  import std.meta : AliasSeq;

  static Appender!(char[]) result;
  result.clear;

  // We treat these specially because the algorithm below does
  // 'number = -number' and calls the same implementation
  // function. The trouble is, for example, -int.min is still a
  // negative number.
  alias problematics = AliasSeq!(
    byte, "negative 128",
    short, "negative 32 thousand 768",
    int, "negative 2 billion 147 million 483 thousand 648",
    long, "negative 9 quintillion 223 quadrillion 372 trillion"
          ~ " 36 billion 854 million 775 thousand 808");

  static assert((problematics.length % 2) == 0);

  static foreach (i, P; problematics) {
    static if (i % 2) {
      // This is a string; skip

    } else {
      // This is a problematic type
      static if (is (T == P)) {
        // Our T happens to be this problematic type
        if (number_ == T.min) {
          // and we are dealing with a problematic value
          result ~= problematics[i + 1];
          return result.data;
        }
      }
    }
  }

  auto number = cast(Unqual!T)number_;    // Thanks 'in'! :p

  if (number == 0) {
    result ~= "zero";

  } else {
    if (number < 0) {
      result ~= "negative";

      static if (T.sizeof < int.sizeof) {
        // Being careful with implicit conversions. (See the dmd
        // command line switch -preview=intpromote)
        number = cast(T)(-cast(int)number);

      } else {
        number = -number;
      }
    }

    spellOutImpl(number, result);
  }

  return result.data.strip;
}

unittest {
  assert(1_001_500.spellOut == "1 million 1 thousand 500");
  assert((-1_001_500).spellOut == "negative 1 million 1 thousand 500");
  assert(1_002_500.spellOut == "1 million 2 thousand 500");
}

import std.format : format;
import std.range : isOutputRange;

void spellOutImpl(T, O)(T number, ref O output)
if (isOutputRange!(O, char))
in (number > 0, format!"Invalid number: %s"(number)) {
  import std.range : retro;
  import std.format : formattedWrite;

  foreach (divider; dividers!T.retro) {
    const quotient = number / divider.value;

    if (quotient) {
      output.formattedWrite!" %s %s"(quotient, divider.word);
    }

    number %= divider.value;
  }
}

struct Divider(T) {
  T value;      // 1_000, 1_000_000, etc.
  string word;  // "thousand", etc
}

// Returns the words related with the provided size of an
// integral type. The parameter is number of bytes
// e.g. int.sizeof
auto words(size_t typeSize) {
  // This need not be recursive at all but it was fun using
  // recursion.
  final switch (typeSize) {
    case 1: return [ "" ];
    case 2: return words(1) ~ [ "thousand" ];
    case 4: return words(2) ~ [ "million", "billion" ];
    case 8: return words(4) ~ [ "trillion", "quadrillion", "quintillion" ];
  }
}

unittest {
  // These are relevant words for 'int' and 'uint' values:
  assert(words(4) == [ "", "thousand", "million", "billion" ]);
}

// Returns a Divider!T array associated with T
auto dividers(T)() {
  import std.range : array, enumerate;
  import std.algorithm : map;

  static const(Divider!T[]) result =
    words(T.sizeof)
    .enumerate!T
    .map!(t => Divider!T(cast(T)(10^^(t.index * 3)), t.value))
    .array;

  return result;
}

unittest {
  // Test a few entries
  assert(dividers!int[1] == Divider!int(1_000, "thousand"));
  assert(dividers!ulong[3] == Divider!ulong(1_000_000_000, "billion"));
}

void main() {
  version (test) {
    return;
  }

  import std.meta : AliasSeq;
  import std.stdio : writefln;
  import std.random : Random, uniform;
  import std.conv : to;

  static foreach (T; AliasSeq!(byte, ubyte, short, ushort,
                               int, uint, long, ulong)) {{
    // A few numbers for each type
    report(T.min);
    report((T.max / 4).to!T);  // Overcome int promotion for
                               // shorter types because I want
                               // to test with the exact type
                               // e.g. for byte.
    report(T.max);
  }}

  enum count = 2_000_000;
  writefln!"Testing with %,s random numbers"(spellOut(count));

  // Use the same seed to be fair between compilations
  enum seed = 0;
  auto rnd = Random(seed);

  ulong totalLength;

  foreach (i; 0 .. count) {
    const number = uniform(int.min, int.max, rnd);
    const result = spellOut(number);
    totalLength += result.length;
  }

  writefln!("A meaningless number to prevent the compiler from"
            ~ " removing the entire loop: %,s")(totalLength);
}

void report(T)(T number) {
  import std.stdio : writefln;
  writefln!"  %6s % ,s: %s"(T.stringof, number, spellOut(number));
}

Ali
Jan 25 2022
prev sibling next sibling parent reply Johan <j j.nl> writes:
On Tuesday, 25 January 2022 at 19:52:17 UTC, Ali Çehreli wrote:
 I am using compilers installed by Manjaro Linux's package 
 system:

 ldc: LDC - the LLVM D compiler (1.28.0):
   based on DMD v2.098.0 and LLVM 13.0.0

 gdc: dc (GCC) 11.1.0

 dmd: DMD64 D Compiler v2.098.1
What phobos version is gdc using? -Johan
Jan 25 2022
parent reply Ali Çehreli <acehreli yahoo.com> writes:
On 1/25/22 16:15, Johan wrote:
 On Tuesday, 25 January 2022 at 19:52:17 UTC, Ali Çehreli wrote:
 I am using compilers installed by Manjaro Linux's package system:

 ldc: LDC - the LLVM D compiler (1.28.0):
   based on DMD v2.098.0 and LLVM 13.0.0

 gdc: dc (GCC) 11.1.0

 dmd: DMD64 D Compiler v2.098.1
What phobos version is gdc using?
Oh! Good question. Unfortunately, I don't think Phobos modules contain that information. The following line outputs 2076L:

  pragma(msg, __VERSION__);

So, I guess I've been comparing apples to oranges but in this case an older gdc is doing pretty well.

Ali
Jan 25 2022
parent reply Iain Buclaw <ibuclaw gdcproject.org> writes:
On Wednesday, 26 January 2022 at 04:28:25 UTC, Ali Çehreli wrote:
 On 1/25/22 16:15, Johan wrote:
 On Tuesday, 25 January 2022 at 19:52:17 UTC, Ali Çehreli
wrote:
 I am using compilers installed by Manjaro Linux's package
system:
 ldc: LDC - the LLVM D compiler (1.28.0):
   based on DMD v2.098.0 and LLVM 13.0.0

 gdc: dc (GCC) 11.1.0

 dmd: DMD64 D Compiler v2.098.1
What phobos version is gdc using?
Oh! Good question. Unfortunately, I don't think Phobos modules contain that information. The following line outputs 2076L: pragma(msg, __VERSION__); So, I guess I've been comparing apples to oranges but in this case an older gdc is doing pretty well.
Doubt it. Functions such as to(), map(), etc. have pretty much remained unchanged for the last 6-7 years.

Whenever I've watched talks/demos where benchmarks were the central topic, GDC has always blown LDC out the water when it comes to matters of math. Even in more recent examples where I've been pushing for native complex to be replaced with std.complex, LDC was found to be slower with std.complex, but GDC was either equal, or faster than native (and GDC std.complex was faster than LDC).
Jan 26 2022
next sibling parent reply forkit <forkit gmail.com> writes:
On Wednesday, 26 January 2022 at 11:25:47 UTC, Iain Buclaw wrote:
 Whenever I've watched talks/demos where benchmarks were the 
 central topic, GDC has always blown LDC out the water when it 
 comes to matters of math.
 ..
https://dlang.org/blog/2020/05/14/lomutos-comeback/
Jan 26 2022
parent Iain Buclaw <ibuclaw gdcproject.org> writes:
On Wednesday, 26 January 2022 at 11:43:39 UTC, forkit wrote:
 On Wednesday, 26 January 2022 at 11:25:47 UTC, Iain Buclaw 
 wrote:
 Whenever I've watched talks/demos where benchmarks were the 
 central topic, GDC has always blown LDC out the water when it 
 comes to matters of math.
 ..
https://dlang.org/blog/2020/05/14/lomutos-comeback/
Andrei forgot to do a follow-up where one weird trick makes the gdc-compiled Lomuto's the same speed as C++ (and faster than ldc).

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96429
Jan 26 2022
prev sibling parent reply Johan <j j.nl> writes:
On Wednesday, 26 January 2022 at 11:25:47 UTC, Iain Buclaw wrote:
 On Wednesday, 26 January 2022 at 04:28:25 UTC, Ali Çehreli 
 wrote:
 On 1/25/22 16:15, Johan wrote:
 On Tuesday, 25 January 2022 at 19:52:17 UTC, Ali Çehreli
wrote:
 I am using compilers installed by Manjaro Linux's package
system:
 ldc: LDC - the LLVM D compiler (1.28.0):
   based on DMD v2.098.0 and LLVM 13.0.0

 gdc: dc (GCC) 11.1.0

 dmd: DMD64 D Compiler v2.098.1
What phobos version is gdc using?
Oh! Good question. Unfortunately, I don't think Phobos modules contain that information. The following line outputs 2076L: pragma(msg, __VERSION__); So, I guess I've been comparing apples to oranges but in this case an older gdc is doing pretty well.
Doubt it. Functions such as to(), map(), etc. have pretty much remained unchanged for the last 6-7 years.
The stdlib makes a huge difference in performance. Ali's program uses string manipulation, GC, ... much more than to() and map().

Quick test on my M1 macbook:

LDC1.27, arm64 binary (native): ~0.83s
LDC1.21, x86_64 binary (rosetta, not native to CPU instruction set): ~0.75s

Couldn't test with LDC 1.6 (dlang2.076), because it is too old and not running on M1/Monterey (?).

-Johan
Jan 26 2022
next sibling parent Ali Çehreli <acehreli yahoo.com> writes:
On 1/26/22 04:06, Johan wrote:

 The stdlib makes a huge difference in performance.
 Ali's program uses string manipulation,
Yes, on the surface, I thought my inner loop had just / and % but of course there is that formattedWrite. I will change the code to use sprintf into a static buffer (instead of the current Appender).
 GC
That shouldn't affect it because there are just about 8 allocations to be shared in the Appender.
 , ... much more than to()
Not in the 2 million loop.
 and
 map().
Only in the initialization.
 Quick test on my M1 macbook:
 LDC1.27, arm64 binary (native): ~0.83s
 LDC1.21, x86_64 binary (rosetta, not native to CPU instruction set): 
~0.75s

I think std.format gained abilities over the years. I will report back.

Ali
Jan 26 2022
prev sibling parent Steven Schveighoffer <schveiguy gmail.com> writes:
On 1/26/22 7:06 AM, Johan wrote:

 Couldn't test with LDC 1.6 (dlang2.076), because it is too old and not 
 running on M1/Monterey (?).
There was a range of macos dmd binaries that did not work after a certain MacOS. I think it had to do with the hack for TLS that apple changed, so it no longer worked.

-Steve
Jan 26 2022
prev sibling parent reply Ali Çehreli <acehreli yahoo.com> writes:
ldc shines with sprintf. And dmd surprises by being a little bit faster 
than gdc! (?)

ldc (2.098.0): ~6.2 seconds
dmd (2.098.1): ~7.4 seconds
gdc (2.076.?): ~7.5 seconds

Again, here are the versions of the compilers that are readily available 
on my system:

 ldc: LDC - the LLVM D compiler (1.28.0):
    based on DMD v2.098.0 and LLVM 13.0.0

 gdc: gdc (GCC) 11.1.0 (uses the dmd 2.076 front end)

 dmd: DMD64 D Compiler v2.098.1
They were compiled with dub run --compiler=<COMPILER> --build=release-nobounds --verbose where <COMPILER> was ldc, dmd, or gdc. I replaced formattedWrite in the code with sprintf. For example, the inner loop became foreach (divider; dividers!T.retro) { const quotient = number / divider.value; if (quotient) { output += sprintf(output, fmt!T.ptr, quotient, divider.word.ptr); } number %= divider.value; } } For completeness (and noise :/) here is the final version of the program: module spellout.spellout; // This program was written as a programming kata to spell out // certain parts of integers as in "1 million 2 thousand // 42". Note that this way of spelling-out numbers is not // grammatically correct in English. // Returns a string that contains the partly spelled-out version // of the parameter. // // You must copy the returned string when needed as this function // uses the same internal buffer for all invocations of the same // template instance. auto spellOut(T)(in T number_) { import std.string : strip; import std.traits : Unqual; import std.meta : AliasSeq; import core.stdc.stdio : sprintf; enum longestString = "negative 9 quintillion 223 quadrillion 372 trillion" ~ " 36 billion 854 million 775 thousand 808"; static char[longestString.length + 1] buffer; auto output = buffer.ptr; // We treat these specially because the algorithm below does // 'number = -number' and calls the same implementation // function. The trouble is, for example, -int.min is still a // negative number. 
alias problematics = AliasSeq!( byte, "negative 128", short, "negative 32 thousand 768", int, "negative 2 billion 147 million 483 thousand 648", long, longestString); static assert((problematics.length % 2) == 0); static foreach (i, P; problematics) { static if (i % 2) { // This is a string; skip } else { // This is a problematic type static if (is (T == P)) { // Our T happens to be this problematic type if (number_ == T.min) { // and we are dealing with a problematic value output += sprintf(output, problematics[i + 1].ptr); return buffer[0 .. (output - buffer.ptr)]; } } } } auto number = cast(Unqual!T)number_; // Thanks 'in'! :p if (number == 0) { output += sprintf(output, "zero"); } else { if (number < 0) { output += sprintf(output, "negative"); static if (T.sizeof < int.sizeof) { // Being careful with implicit conversions. (See the dmd // command line switch -preview=intpromote) number = cast(T)(-cast(int)number); } else { number = -number; } } spellOutImpl(number, output); } return buffer[0 .. (output - buffer.ptr)].strip; } unittest { assert(1_001_500.spellOut == "1 million 1 thousand 500"); assert((-1_001_500).spellOut == "negative 1 million 1 thousand 500"); assert(1_002_500.spellOut == "1 million 2 thousand 500"); } template fmt(T) { static if (is (T == long)|| is (T == ulong)) { static fmt = " %lld %s"; } else { static fmt = " %u %s"; } } import std.format : format; void spellOutImpl(T)(T number, ref char * output) in (number > 0, format!"Invalid number: %s"(number)) { import std.range : retro; import core.stdc.stdio : sprintf; foreach (divider; dividers!T.retro) { const quotient = number / divider.value; if (quotient) { output += sprintf(output, fmt!T.ptr, quotient, divider.word.ptr); } number %= divider.value; } } struct Divider(T) { T value; // 1_000, 1_000_000, etc. string word; // "thousand", etc } // Returns the words related with the provided size of an // integral type. The parameter is number of bytes // e.g. 
// int.sizeof
auto words(size_t typeSize) {
  // This need not be recursive at all but it was fun using
  // recursion.
  final switch (typeSize) {
  case 1: return [ "" ];
  case 2: return words(1) ~ [ "thousand" ];
  case 4: return words(2) ~ [ "million", "billion" ];
  case 8: return words(4) ~ [ "trillion", "quadrillion", "quintillion" ];
  }
}

unittest {
  // These are relevant words for 'int' and 'uint' values:
  assert(words(4) == [ "", "thousand", "million", "billion" ]);
}

// Returns a Divider!T array associated with T
auto dividers(T)() {
  import std.range : array, enumerate;
  import std.algorithm : map;

  static const(Divider!T[]) result =
    words(T.sizeof)
    .enumerate!T
    .map!(t => Divider!T(cast(T)(10^^(t.index * 3)), t.value))
    .array;

  return result;
}

unittest {
  // Test a few entries
  assert(dividers!int[1] == Divider!int(1_000, "thousand"));
  assert(dividers!ulong[3] == Divider!ulong(1_000_000_000, "billion"));
}

void main() {
  version (test) {
    return;
  }

  import std.meta : AliasSeq;
  import std.stdio : writefln;
  import std.random : Random, uniform;
  import std.conv : to;

  static foreach (T; AliasSeq!(byte, ubyte, short, ushort,
                               int, uint, long, ulong)) {{
    // A few numbers for each type
    report(T.min);
    report((T.max / 4).to!T);    // Overcome int promotion for
                                 // shorter types because I want
                                 // to test with the exact type
                                 // e.g. for byte.
    report(T.max);
  }}

  enum count = 20_000_000;
  writefln!"Testing with %,s random numbers"(spellOut(count));

  // Use the same seed to be fair between compilations
  enum seed = 0;
  auto rnd = Random(seed);

  ulong totalLength;

  foreach (i; 0 .. count) {
    const number = uniform(int.min, int.max, rnd);
    const result = spellOut(number);
    totalLength += result.length;
  }

  writefln!("A meaningless number to prevent the compiler from"
            ~ " removing the entire loop: %,s")(totalLength);
}

void report(T)(T number) {
  import std.stdio : writefln;
  writefln!"  %6s % ,s: %s"(T.stringof, number, spellOut(number));
}

Ali
Jan 26 2022
next sibling parent reply Siarhei Siamashka <siarhei.siamashka gmail.com> writes:
On Wednesday, 26 January 2022 at 18:00:41 UTC, Ali Çehreli wrote:
 ldc shines with sprintf. And dmd surprises by being a little bit 
 faster than gdc! (?)

 ldc (2.098.0): ~6.2 seconds
 dmd (2.098.1): ~7.4 seconds
 gdc (2.076.?): ~7.5 seconds

 Again, here are the versions of the compilers that are readily 
 available on my system:

 ldc: LDC - the LLVM D compiler (1.28.0):
    based on DMD v2.098.0 and LLVM 13.0.0

 gdc: gdc (GCC) 11.1.0 (uses the dmd 2.076 front end)
It's not DMD doing a good job here, but GDC 11 shooting itself in the foot by requiring additional esoteric command line options if you really want to produce optimized binaries. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102765 for more details.

You can try to re-run your benchmark after adding '-flto' or '-fno-weak-templates' to the GDC command line. I see a ~7% speedup for your code on my computer.
Jan 26 2022
parent reply Iain Buclaw <ibuclaw gdcproject.org> writes:
On Wednesday, 26 January 2022 at 18:39:07 UTC, Siarhei Siamashka 
wrote:
 It's not DMD doing a good job here, but GDC11 shooting itself 
 in the foot by requiring additional  esoteric command line 
 options if you really want to produce optimized binaries.
The D language shot itself in the foot by requiring templates to have weak semantics. If DMD and LDC inline weak functions, that's their bug.
Jan 26 2022
parent reply Siarhei Siamashka <siarhei.siamashka gmail.com> writes:
On Wednesday, 26 January 2022 at 18:41:51 UTC, Iain Buclaw wrote:
 The D language shot itself in the foot by requiring templates 
 to have weak semantics.

 If DMD and LDC inline weak functions, that's their bug.
As I already mentioned in the bugzilla, it would be really useful to see a practical example of DMD and LDC running into trouble because of mishandling weak templates. I was never able to find anything about "requiring templates to have weak semantics" anywhere in the Dlang documentation or on the Internet. Asking for clarification in this forum yielded no results either. Maybe I'm missing something obvious when reading the https://dlang.org/spec/template.html page?

I have no doubt that you have your own opinion about how this stuff is supposed to work, but I have no crystal ball and don't know what's happening in your head.
Jan 26 2022
parent reply =?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:
On 1/26/22 11:07, Siarhei Siamashka wrote:
 On Wednesday, 26 January 2022 at 18:41:51 UTC, Iain Buclaw wrote:
 The D language shot itself in the foot by requiring templates to have
 weak semantics.

 If DMD and LDC inline weak functions, that's their bug.
As I already mentioned in the bugzilla, it would be really useful to see a practical example of DMD and LDC running into troubles because of mishandling weak templates.
I am not experienced enough to answer, but the way I understand weak symbols, it is possible to run into trouble, though it will probably never happen in practice. When it does happen, I suspect people can find workarounds like disabling inlining.
 I was never able to find anything about
 "requiring templates to have weak semantics" anywhere in the Dlang
 documentation or on the Internet.
The truth is, some part of D's spec is the implementation. When I compile the following program (with dmd)

void foo(T)() {}

void main() {
  foo!int();
}

I see that template instantiations are linked through weak symbols:

$ nm deneme | grep foo
[...]
0000000000021380 W _D6deneme__T3fooTiZQhFNaNbNiNfZv

What I know is that weak symbols can be overridden by strong symbols during linking. Which means, if a function body that also has a weak symbol is inlined, some parts of the program may be using the inlined definition while other parts use the overriding definition. Thanks to separate compilation, the two definitions need not match, hence the violation of the one-definition rule (ODR).

Ali
Jan 27 2022
next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Jan 27, 2022 at 08:46:59AM -0800, Ali Çehreli via Digitalmars-d-learn
wrote:
[...]
 I see that template instantiations are linked through weak symbols:
 
 $ nm deneme | grep foo
 [...]
 0000000000021380 W _D6deneme__T3fooTiZQhFNaNbNiNfZv
 
 What I know is that weak symbols can be overridden by strong symbols
 during linking.
[...]

Yes, and it also means that only one copy of the symbol will make it into the executable. This is one of the ways we leverage the linker to eliminate (merge) duplicate template instantiations.


T

-- 
Claiming that your operating system is the best in the world because
more people use it is like saying McDonalds makes the best food in the
world. -- Carl B. Constantine
Jan 27 2022
prev sibling parent reply Johan Engelen <j j.nl> writes:
On Thursday, 27 January 2022 at 16:46:59 UTC, Ali Çehreli wrote:
 What I know is that weak symbols can be overridden by strong 
 symbols during linking. Which means, if a function body is 
 inlined which also has a weak symbol, some part of the program 
 may be using the inlined definition and some other parts may be 
 using the overridden definition. Thanks to separate 
 compilation, they need not match hence the violation of the 
 one-definition rule (ODR).
But the language requires ODR, so we can emit templates as weak_odr, telling the optimizer and linker that the symbols should be merged _and_ that ODR can be assumed to hold (i.e. inlining is OK). The onus of honouring ODR is on the user, not the compiler, because we allow the user to do separate compilation.

Some more detailed explanation and an example:
https://stackoverflow.com/questions/44335046/how-does-the-linker-handle-identical-template-instantiations-across-translation/44346057

-Johan
Jan 27 2022
parent reply Siarhei Siamashka <siarhei.siamashka gmail.com> writes:
On Thursday, 27 January 2022 at 18:12:18 UTC, Johan Engelen wrote:
 But the language requires ODR, so we can emit templates as 
 weak_odr, telling the optimizer and linker that the symbols 
 should be merged _and_ that ODR can be assumed to hold (i.e. 
 inlining is OK).
Thanks! This was also my impression. But the problem is that Iain Buclaw seems to disagree with us. He claims that template functions must be overridable by global functions and that this is supposed to inhibit template function inlining. Is there any independent source to back up your or Iain's claim?
 The onus of honouring ODR is on the user - not the compiler - 
 because we allow the user to do separate compilation.
My own limited experiments with various code snippets convinced me that D compilers actually try their best to prevent ODR violations, so it isn't like users can easily hurt themselves: https://forum.dlang.org/thread/cstjhjvmmibonbajwbbl@forum.dlang.org

Also, module names are added as part of function name mangling, so an accidental clash of symbol names shouldn't be very likely in a valid D project. Though I'm not absolutely sure whether this provides a sufficient safety net.
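The point about module names being part of the mangling can be seen with `.mangleof` (a sketch; the module name `app` and function `twice` are made up for illustration, and the exact mangled string can vary with attributes and compiler version):

```d
module app;

int twice(int x) { return 2 * x; }

// Prints something like _D3app5twiceFiZi at compile time.
// The "3app" component encodes the module name, so an
// identically named function in a different module gets a
// different symbol and cannot clash accidentally.
pragma(msg, twice.mangleof);
```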
Jan 27 2022
parent reply Iain Buclaw <ibuclaw gdcproject.org> writes:
On Thursday, 27 January 2022 at 20:28:40 UTC, Siarhei Siamashka 
wrote:
 On Thursday, 27 January 2022 at 18:12:18 UTC, Johan Engelen 
 wrote:
 But the language requires ODR, so we can emit templates as 
 weak_odr, telling the optimizer and linker that the symbols 
 should be merged _and_ that ODR can be assumed to hold (i.e. 
 inlining is OK).
Thanks! This was also my impression. But the problem is that Iain Buclaw seems to disagree with us. He claims that template functions must be overridable by global functions and this is supposed to inhibit template functions inlining. Is there any independent source to back up your or Iain's claim?
For example, druntime depends on this behaviour.

Template:
https://github.com/dlang/druntime/blob/a0ad8c42c15942faeeafb016e81a360113ae1b6b/src/rt/config.d#L46-L58

Regular symbol:
https://github.com/dlang/druntime/blob/a17bb23b418405e1ce8e4a317651039758013f39/test/config/src/test19433.d#L1

If we can rely on instantiated symbols to not violate ODR, then you would be able to put symbols in the .link-once section. However, all duplicates must also be in the .link-once section, or else you'll get duplicate definition errors.
Jan 28 2022
parent Siarhei Siamashka <siarhei.siamashka gmail.com> writes:
On Friday, 28 January 2022 at 18:02:27 UTC, Iain Buclaw wrote:
 For example, druntime depends on this behaviour.

 Template: 
 https://github.com/dlang/druntime/blob/a0ad8c42c15942faeeafb016e81a360113ae1b6b/src/rt/config.d#L46-L58
Ouch. From where I stand, this looks like a really ugly hack abusing both the template keyword and the mangle pragma, presumably intended to implement this part of the spec: https://dlang.org/library/rt/config.html

Moreover, these are even global variables rather than functions. Wouldn't it make more sense to use a special "weak" attribute for this particular use case? I see that there was a related discussion here: https://forum.dlang.org/post/rgmp5d$198g$1@digitalmars.com
 Regular symbol: 
 https://github.com/dlang/druntime/blob/a17bb23b418405e1ce8e4a317651039758013f39/test/config/src/test19433.d#L1

 If we can rely on instantiated symbols to not violate ODR, then 
 you would be able to put symbols in the .link-once section.  
 However all duplicates must also be in the .link-once section, 
 else you'll get duplicate definition errors.
Duplicate definition errors are surely better than something fishy silently happening under the hood. They can be solved when/if we encounter them.

That said, I can confirm that GDC 10 indeed fails with a `multiple definition of 'rt_cmdline_enabled'` linker error when trying to compile:

```D
extern(C) __gshared bool rt_cmdline_enabled = false;

void main() {
}
```

But can't GDC just use something like this in `rt/config.d` to solve the problem?

```D
version (GNU)
{
    import gcc.attribute;

    pragma(mangle, "rt_envvars_enabled") @attribute("weak")
    __gshared bool rt_envvars_enabled_ = false;
    pragma(mangle, "rt_cmdline_enabled") @attribute("weak")
    __gshared bool rt_cmdline_enabled_ = true;
    pragma(mangle, "rt_options") @attribute("weak")
    __gshared string[] rt_options_ = [];

    bool rt_envvars_enabled()() { return rt_envvars_enabled_; }
    bool rt_cmdline_enabled()() { return rt_cmdline_enabled_; }
    string[] rt_options()() { return rt_options_; }
}
else
{
    // put each variable in its own COMDAT by making them template instances
    template rt_envvars_enabled()
    {
        pragma(mangle, "rt_envvars_enabled")
        __gshared bool rt_envvars_enabled = false;
    }
    template rt_cmdline_enabled()
    {
        pragma(mangle, "rt_cmdline_enabled")
        __gshared bool rt_cmdline_enabled = true;
    }
    template rt_options()
    {
        pragma(mangle, "rt_options")
        __gshared string[] rt_options = [];
    }
}
```
Jan 28 2022
prev sibling parent reply Salih Dincer <salihdb hotmail.com> writes:
On Wednesday, 26 January 2022 at 18:00:41 UTC, Ali Çehreli wrote:
 For completeness (and noise :/) here is the final version of 
 the program:
Could you also try the following code with the same configurations?

```d
struct LongScale {
  struct ShortStack {
    short[] stack;
    size_t index;

    @property back() { return this.stack[0]; }
    @property push(short data) {
      this.stack ~= data;
      this.index++;
    }
    @property pop() {
      return this.stack[--this.index];
    }
  }
  ShortStack stack;

  this(long i) {
    long s, t = i;
    for (long e = 3; e <= 18; e += 3) {
      s = 10 ^^ e;
      stack.push = cast(short)((t % s) / (s / 1000L));
      t -= t % s;
    }
    stack.push = cast(short)(t / s);
  }

  string toString() {
    string[] scale = [" zero", "thousand", "million", "billion",
                      "trillion", "quadrillion", "quintillion"];
    string r;
    for (long e = 6; e > 0; e--) {
      auto t = stack.pop;
      r ~= t > 1 ? " " ~ to!string(t) : t ? " one" : "";
      r ~= t ? " " ~ scale[e] : "";
    }
    r ~= stack.back ? " " ~ to!string(stack.back) : "";
    return r.length ? r : scale[0];
  }
}

import std.conv, std.stdio;

void main() {
  long[] inputs = [ 741, 1_500, 2_001, 5_005, 1_250_000, 3_000_042,
                    10_000_000, 1_000_000, 2_000_000, 100_000, 200_000,
                    10_000, 20_000, 1_000, 2_000, 74, 7, 0,
                    1_999_999_999_999 ];

  foreach (long i; inputs) {
    auto OUT = LongScale(i);
    auto STR = OUT.toString[1 .. $];
    writefln!"%s"(STR);
  }
}
```
Jan 29 2022
parent reply =?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:
On 1/29/22 10:04, Salih Dincer wrote:

 Could you also try the following code with the same configurations?
The program you posted with 2 million random values:

  ldc 1.9 seconds
  gdc 2.3 seconds
  dmd 2.8 seconds

I understand such short tests are not definitive, but to get a rough idea between the two programs: the last version of my program, which used sprintf, takes less time with 2 million numbers:

  ldc 0.4 seconds
  gdc 0.5 seconds
  dmd 0.5 seconds

(And now we know gdc can go about 7% faster with additional command line switches.)

Ali
Jan 29 2022
next sibling parent max haughton <maxhaton gmail.com> writes:
On Saturday, 29 January 2022 at 18:28:06 UTC, Ali Çehreli wrote:
 On 1/29/22 10:04, Salih Dincer wrote:

 Could you also try the following code with the same
 configurations?

 The program you posted with 2 million random values:

   ldc 1.9 seconds
   gdc 2.3 seconds
   dmd 2.8 seconds

 I understand such short tests are not definitive, but to get a rough idea between the two programs: the last version of my program, which used sprintf, takes less time with 2 million numbers:

   ldc 0.4 seconds
   gdc 0.5 seconds
   dmd 0.5 seconds

 (And now we know gdc can go about 7% faster with additional command line switches.)

 Ali
You need to be compiling with PGO to test a compiler's optimizer to the maximum. Without PGO, the compilers have to assume a fairly conservative flow through the code, which means things like inlining and register allocation are effectively flying blind.
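As a sketch of what such a PGO build could look like (the file name `app.d` is a placeholder, and the exact option spellings are for recent LDC and GDC releases; they may differ between versions):

```
# LDC: instrument, run a representative workload, merge, rebuild.
ldc2 -O3 -release -fprofile-instr-generate=profile.raw app.d -of=app
./app
ldc-profdata merge profile.raw -output profile.data
ldc2 -O3 -release -fprofile-instr-use=profile.data app.d -of=app

# GDC: uses GCC's classic two-step profile flags.
gdc -O3 -fprofile-generate app.d -o app
./app
gdc -O3 -fprofile-use app.d -o app
```

The profiling run should exercise the same hot paths as the real workload, otherwise the profile can steer the optimizer the wrong way.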
Jan 29 2022
prev sibling next sibling parent Siarhei Siamashka <siarhei.siamashka gmail.com> writes:
On Saturday, 29 January 2022 at 18:28:06 UTC, Ali Çehreli wrote:
 (And now we know gdc can go about 7% faster with additional 
 command line switches.)
No, we don't know this yet ;-) That's just what I said, and I may be bullshitting. Or the configuration of my computer is significantly different from yours and the exact speedup/slowdown number may differ. So please verify it yourself.

You can edit your `dub.json` file to add the following line to it:

  "dflags-gdc": ["-fno-weak-templates"],

Then rebuild your spellout test program with gdc (just like you did before), run the benchmarks and report the results. The '-fno-weak-templates' option should show up in the gdc invocation command line.
Jan 29 2022
prev sibling parent Salih Dincer <salihdb hotmail.com> writes:
On Saturday, 29 January 2022 at 18:28:06 UTC, Ali Çehreli wrote:
 On 1/29/22 10:04, Salih Dincer wrote:

 Could you also try the following
 code with the same configurations?
 The program you posted with 2 million random values:

   ldc 1.9 seconds
   gdc 2.3 seconds
   dmd 2.8 seconds

 I understand such short tests are not definitive but to have a rough idea between two programs, the last version of my program that used sprintf with 2 million numbers takes less time...
sprintf() might be really fast, but your algorithm is definitely 2.5x faster than mine (with LDC)! I couldn't compile it with GDC. Theoretically, I might have lost the challenge :)

With love and respect...
Jan 30 2022