digitalmars.D - Speed kills
- ixid (15/15) Feb 15 2016 Every time there is a D thread on reddit it feels like the new
- Guillaume Piolat (4/10) Feb 15 2016 Something that annoyed me a bit is floating-point comparisons,
- Wyatt (5/8) Feb 15 2016 I feel like this point comes up often, and that a lot of people
- Basile B. (18/28) Feb 15 2016 Same for std.math.lround
- Jack Stouffer (3/6) Feb 15 2016 Seems like you know a lot about the subject, and I know you
- Basile B. (12/19) Feb 17 2016 In the meantime:
- Jack Stouffer (6/10) Feb 17 2016 While I don't know about the post you're talking about, I don't
- Basile B. (12/22) Feb 17 2016 That's more subtile than that.
- Basile B. (3/13) Feb 17 2016 Also, forgot to say, but an uniform API is needed to set the
- w0rp (4/4) Feb 21 2016 I think it's important that DMD gets more of the easier
- Marco Leise (7/9) Mar 07 2016 At least GCC has a codegen switch for that. A solution would
- Luc J. Bourhis (4/7) Feb 21 2016 SSE goes back to Pentium III, doesn't it? And the Pentium 4
- Guillaume Piolat (3/19) Feb 15 2016 lround and friends have been a big performance problem at times.
- Basile B. (5/7) Feb 15 2016 I didn't know this trick. It generates almost the same sse
- Guillaume Piolat (3/13) Feb 16 2016 In SSE3 you also get an instruction that does this without
- rsw0x (3/6) Feb 15 2016 if you want better codegen, don't use dmd.
- ixid (6/21) Mar 08 2016 Since I posted this thread I've learned std.algorithm.sum is 4
- jmh530 (3/8) Mar 08 2016 There was a longer discussion here
- Andrei Alexandrescu (2/7) Mar 09 2016 Whoa. What's happening there? Do we have anyone on it? -- Andrei
- Daniel Kozak via Digitalmars-d (3/10) Mar 09 2016 I guess he speaks about this one:
- cym13 (5/16) Mar 09 2016 They just don't do the same thing, sum() uses pairwise summation
- jmh530 (3/6) Mar 09 2016 That third comment about how it's not obvious which algorithm sum
- John Colvin (6/17) Mar 09 2016 Ilya has long term plans for this, but I have a short-term fix
- Andrei Alexandrescu (2/14) Mar 09 2016 Thanks much! -- Andrei
- John Colvin (3/18) Mar 09 2016 https://github.com/D-Programming-Language/phobos/pull/4069
- Jon D (40/45) Mar 09 2016 I seen a few cases while exploring D. Not all fully researched
- H. S. Teoh via Digitalmars-d (44/79) Mar 09 2016 In the case of std.algorithm.sum, the focus is on accuracy rather than
- John Colvin (3/15) Mar 09 2016 I think you'd be good at reviewing this:
- Jon D (10/11) Mar 09 2016 Turns out there are issues filed for each of the performance
Every time there is a D thread on reddit it feels like the new user is expecting mind-blowing speed from D.

https://www.reddit.com/r/programming/comments/45v03g/porterstemmerd_an_implementation_of_the_porter/

This is the most recent one, where John Colvin provided some pointers to speed it up significantly. Walter has done some good work taking the low-hanging fruit to speed up DMD code, and there is a lot of effort going on with reference-counting machinery, but I wondered whether some of the common errors people make that slow down D code can be addressed. Literals used to be a hidden speed bump, but I think that was improved; now the append operator is one of the most common culprits. Can this not be enhanced behind the scenes to perform more like Appender? Do others notice common pitfalls between the article code and what the D community then suggests, so that we can bridge the gap and naive users get faster code?
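For readers who hit this particular pitfall, here is a minimal sketch of the difference being discussed (illustrative only; the function names and sizes are made up): repeated ~= on a slice goes through the runtime's append machinery and may reallocate as it grows, while std.array.Appender manages its own buffer and amortises the growth.

    import std.array : appender;

    // Naive version: each ~= re-checks capacity and may occasionally
    // reallocate and copy.
    string buildNaive(size_t n)
    {
        string s;
        foreach (i; 0 .. n)
            s ~= 'x';
        return s;
    }

    // Appender tracks its own capacity, so the hot loop stays cheap.
    string buildWithAppender(size_t n)
    {
        auto app = appender!string();
        app.reserve(n);              // optional, avoids regrowth entirely
        foreach (i; 0 .. n)
            app.put('x');
        return app.data;
    }

Whether the compiler or runtime could do such a rewrite automatically, as asked above, is a separate question.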
Feb 15 2016
On Monday, 15 February 2016 at 13:51:38 UTC, ixid wrote:This is the most recent one where John Colvin provided some pointers to speed it up significantly. Walter has done some good work taking the low-hanging fruit to speed up DMD code and there is a lot of effort going on with reference counting machinery but I wondered if some of the common errors people make that slow down D code can be addressed?Something that annoyed me a bit is floating-point comparisons: DMD does not seem to be able to handle them from SSE registers; it will convert to the x87 FPU and do the comparison there, IIRC.
Feb 15 2016
On Monday, 15 February 2016 at 14:16:02 UTC, Guillaume Piolat wrote:Something that annoyed me a bit is floating-point comparisons, DMD does not seem to be able to handle them from SSE registers, it will convert to FPU and do the comparison there IIRC.I feel like this point comes up often, and that a lot of people have argued x87 FP should just not happen anymore. -Wyatt
Feb 15 2016
On Monday, 15 February 2016 at 14:16:02 UTC, Guillaume Piolat wrote:On Monday, 15 February 2016 at 13:51:38 UTC, ixid wrote:Same for std.math.lround: they use the FPU way, while for float and double it's only one SSE instruction. Typically, with 6 functions similar to this one: int round(float value) { asm { naked; cvtss2si EAX, XMM0; ret; } } we could get ceil/trunc/round/floor, and almost as easily fmod and hypot. Classic, but I don't get why they're not in std.math. Goddamnit, we're in 2016.This is the most recent one where John Colvin provided some pointers to speed it up significantly. Walter has done some good work taking the low-hanging fruit to speed up DMD code and there is a lot of effort going on with reference counting machinery but I wondered if some of the common errors people make that slow down D code can be addressed?Something that annoyed me a bit is floating-point comparisons, DMD does not seem to be able to handle them from SSE registers, it will convert to FPU and do the comparison there IIRC.
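For illustration, here is a sketch of what the truncating sibling of that routine might look like (hypothetical, not Phobos code; it assumes x86-64, where the first float argument arrives in XMM0, and that DMD's inline assembler accepts the mnemonic). CVTTSS2SI truncates toward zero regardless of the rounding mode, whereas CVTSS2SI above uses the current rounding mode.

    // Hypothetical truncating counterpart, x86-64 only.
    int trunc(float value)
    {
        asm
        {
            naked;
            cvttss2si EAX, XMM0;   // convert with truncation toward zero
            ret;                   // result is returned in EAX
        }
    }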
Feb 15 2016
On Monday, 15 February 2016 at 22:29:00 UTC, Basile B. wrote:we could get ceil/trunc/round/floor, also almost as easily fmod, hypoth. classic but I dont get why thery're not in std.math.Seems like you know a lot about the subject, and I know you contributed to phobos before, so how about making a PR for this :)
Feb 15 2016
On Monday, 15 February 2016 at 23:13:13 UTC, Jack Stouffer wrote:On Monday, 15 February 2016 at 22:29:00 UTC, Basile B. wrote:In the meantime: https://github.com/BBasile/iz/blob/master/import/iz/math.d Actually, when I participated in this conversation I didn't remember that it was not good on X86. Using SSE rounding is really only good on AMD64; otherwise loading the input parameter "sucks" a lot (even for a 32-bit float, since it's not directly in EAX or XMM0). Anyway, not good for Phobos. Why? While looking for documentation last night I landed on a post by Walter who explained that the library for a system programming language shouldn't be specific to an architecture.we could get ceil/trunc/round/floor, and almost as easily fmod and hypot. Classic, but I don't get why they're not in std.math.Seems like you know a lot about the subject, and I know you contributed to phobos before, so how about making a PR for this :)
Feb 17 2016
On Wednesday, 17 February 2016 at 18:26:47 UTC, Basile B. wrote:Anyway, not good for Phobos, why? When looking for documentation yesterday night I've landed on a post by Walter who explained that the library for a system programming language shouldn't be specific to an architecture.While I don't know about the post you're talking about, I don't think what Walter says applies to internal version blocks in a function. You could make it so that on AMD64, lround and friends are much faster by using those ASM routines. Also, std.math is already chock full of architecture-specific code.
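As a rough illustration of the version-block idea (a sketch only, not Phobos code; the function name and the fallback are made up), the fast inline-asm path can be compiled in only where x86-64 inline assembly is available, with a portable fallback elsewhere:

    int fastRound(float value)
    {
        version (D_InlineAsm_X86_64)
        {
            // First float argument arrives in XMM0; int result goes in EAX.
            asm
            {
                naked;
                cvtss2si EAX, XMM0;   // rounds using the current rounding mode
                ret;
            }
        }
        else
        {
            // Portable fallback for other targets.
            import std.math : lround;
            return cast(int) lround(value);
        }
    }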
Feb 17 2016
On Wednesday, 17 February 2016 at 18:50:45 UTC, Jack Stouffer wrote:On Wednesday, 17 February 2016 at 18:26:47 UTC, Basile B. wrote:That's more subtle than that. The oldest 64-bit processor (AMD64) always supports SSE. So when we do "cast(int) 0.1;" on X86_64, the backend always generates SSE instructions. The oldest 32-bit processor (X86) doesn't support SSE, maybe MMX (not sure). So when we do "cast(int) 0.1;" on X86, the backend always generates FPU instructions. This is how I understand the post I landed on. My current work always uses SSE, so it doesn't conform with the "at least available" feature.Anyway, not good for Phobos, why? When looking for documentation yesterday night I've landed on a post by Walter who explained that the library for a system programming language shouldn't be specific to an architecture.While I don't know about the post you're talking about, I don't think what Walter says applies to internal version blocks in a function. You could make it so that on AMD64, lround and friends are much faster by using those ASM routines. Also, std.math is already chock full of architecture-specific code.
Feb 17 2016
On Wednesday, 17 February 2016 at 19:01:38 UTC, Basile B. wrote:That's more subtle than that. The oldest 64-bit processor (AMD64) always supports SSE. So when we do "cast(int) 0.1;" on X86_64, the backend always generates SSE instructions. The oldest 32-bit processor (X86) doesn't support SSE, maybe MMX (not sure). So when we do "cast(int) 0.1;" on X86, the backend always generates FPU instructions. This is how I understand the post I landed on. My current work always uses SSE, so it doesn't conform with the "at least available" feature.Also, I forgot to say, but a uniform API is needed to set the rounding mode, whether SSE is used or the FPU...
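One stopgap already exists: core.stdc.fenv exposes the C99 rounding-mode interface, and on typical x86 C runtimes fesetround is expected to update both the x87 control word and MXCSR, though that is a property of the C library rather than something the D spec guarantees. A minimal sketch, assuming such a libc:

    import core.stdc.fenv : fegetround, fesetround, FE_DOWNWARD;
    import std.math : rint;
    import std.stdio : writeln;

    void main()
    {
        double x = 2.7;
        const old = fegetround();   // remember the current rounding mode
        fesetround(FE_DOWNWARD);    // round toward negative infinity
        writeln(rint(x));           // typically prints 2 instead of the default 3
        fesetround(old);            // restore the previous mode
    }

std.math.FloatingPointControl offers a similar knob on the D side, with the same open question of which of the two control registers it reaches on a given target.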
Feb 17 2016
I think it's important that DMD gets more of the easier optimisations. Most new users won't bother trying GDC or LDC, and if DMD doesn't generate fast enough code, they might leave before they try the compilers with better optimisations.
Feb 21 2016
Am Wed, 17 Feb 2016 19:55:08 +0000 schrieb Basile B. <b2.temp gmx.com>:Also, forgot to say, but an uniform API is needed to set the rounding mode, whether SSE is used or the FPU...At least GCC has a codegen switch for that. A solution would have to either set both rounding modes at once or the compilers would need to expose version MathFPU/MathSSE. -- Marco
Mar 07 2016
On Wednesday, 17 February 2016 at 19:01:38 UTC, Basile B. wrote:The oldest 32-bit processor (X86) doesn't support SSE, maybe MMX (not sure). So when we do "cast(int) 0.1;" on X86, the backend always generates FPU instructions.SSE goes back to Pentium III, doesn't it? And the Pentium 4 supported SSE3, didn't it? Is running on a Pentium II, for example, actually part of D's specification?
Feb 21 2016
On Monday, 15 February 2016 at 22:29:00 UTC, Basile B. wrote:Same for std.math.lround: they use the FPU way, while for float and double it's only one SSE instruction. Typically, with 6 functions similar to this one: int round(float value) { asm { naked; cvtss2si EAX, XMM0; ret; } } we could get ceil/trunc/round/floor, and almost as easily fmod and hypot. Classic, but I don't get why they're not in std.math. Goddamnit, we're in 2016.lround and friends have been a big performance problem at times. Every time you can use cast(int) instead, it's way faster.
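One caveat worth spelling out (a small illustration, not from the original post): cast(int) truncates toward zero, while lround rounds to the nearest integer, so the substitution is only valid where that difference doesn't matter.

    import std.math : lround;
    import std.stdio : writeln;

    void main()
    {
        double x = 2.7;
        writeln(cast(int) x);     // 2  (truncation toward zero)
        writeln(lround(x));       // 3  (rounding to nearest)
        writeln(cast(int) -2.7);  // -2 (still truncation, not flooring)
        writeln(lround(-2.7));    // -3
    }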
Feb 15 2016
On Monday, 15 February 2016 at 23:19:44 UTC, Guillaume Piolat wrote:lround and friends have been a big performance problem at times. Every time you can use cast(int) instead, it's way faster.I didn't know this trick. It generates almost the same SSE instruction (it truncates) and has the advantage of being inline-able. Is it documented somewhere? If not, it should be.
Feb 15 2016
On Monday, 15 February 2016 at 23:35:54 UTC, Basile B. wrote:On Monday, 15 February 2016 at 23:19:44 UTC, Guillaume Piolat wrote:In SSE3 you also get an instruction that does this without messing with the x87 control word: FISTTP.lround and friends have been a big performance problem at times. Everytime you can use cast(int) instead, it's way faster.I didn't know this trick. It generates almost the same sse intruction (it truncates) and has the advantage to be inline-able. Is it documented somewhere ? If not it should.
Feb 16 2016
On Monday, 15 February 2016 at 13:51:38 UTC, ixid wrote:Every time there is a D thread on reddit it feels like the new user is expecting mind-blowing speed from D. [...]If you want better codegen, don't use dmd. Use ldc; it's usually only a version-ish behind dmd.
Feb 15 2016
On Monday, 15 February 2016 at 13:51:38 UTC, ixid wrote:Every time there is a D thread on reddit it feels like the new user is expecting mind-blowing speed from D. https://www.reddit.com/r/programming/comments/45v03g/porterstemmerd_an_implementation_of_the_porter/ This is the most recent one, where John Colvin provided some pointers to speed it up significantly. Walter has done some good work taking the low-hanging fruit to speed up DMD code, and there is a lot of effort going on with reference-counting machinery, but I wondered whether some of the common errors people make that slow down D code can be addressed. Literals used to be a hidden speed bump, but I think that was improved; now the append operator is one of the most common culprits. Can this not be enhanced behind the scenes to perform more like Appender? Do others notice common pitfalls between the article code and what the D community then suggests, so that we can bridge the gap and naive users get faster code?Since I posted this thread I've learned std.algorithm.sum is 4 times slower than a naive loop sum. Even if this is for reasons of accuracy, this is exactly what I am talking about: this is a hidden iceberg of terrible performance that will reflect poorly on D. That's so slow the function needs a health warning.
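For anyone who wants to reproduce the comparison, a quick-and-dirty timing sketch (the array size and fill value are arbitrary; build with optimisations turned on for meaningful numbers):

    import std.algorithm.iteration : sum;
    import std.stdio : writeln;
    import core.time : MonoTime;

    void main()
    {
        auto data = new double[10_000_000];
        data[] = 1.0001;

        auto t0 = MonoTime.currTime;
        double naive = 0;
        foreach (x; data)
            naive += x;                // plain accumulator loop
        auto t1 = MonoTime.currTime;

        double lib = data.sum;         // pairwise summation for random-access FP ranges
        auto t2 = MonoTime.currTime;

        writeln("naive loop: ", t1 - t0, "   sum(): ", t2 - t1);
        writeln(naive, " vs ", lib);   // the totals may differ in the last digits
    }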
Mar 08 2016
On Tuesday, 8 March 2016 at 14:14:25 UTC, ixid wrote:Since I posted this thread I've learned std.algorithm.sum is 4 times slower than a naive loop sum. Even if this is for reasons of accuracy this is exactly what I am talking about: this is a hidden iceberg of terrible performance that will reflect poorly on D. That's so slow the function needs a health warning.There was a longer discussion here: https://forum.dlang.org/post/vkiwojmfjrwhigbkenaa@forum.dlang.org
Mar 08 2016
On 03/08/2016 09:14 AM, ixid wrote:Since I posted this thread I've learned std.algorithm.sum is 4 times slower than a naive loop sum. Even if this is for reasons of accuracy this is exactly what I am talking about- this is a hidden iceberg of terrible performance that will reflect poorly on D. That's so slow the function needs a health warning.Whoa. What's happening there? Do we have anyone on it? -- Andrei
Mar 09 2016
On 9.3.2016 at 14:26, Andrei Alexandrescu via Digitalmars-d wrote:On 03/08/2016 09:14 AM, ixid wrote:I guess he's speaking about this one: http://forum.dlang.org/post/mailman.4748.1456070484.22025.digitalmars-d-learn@puremagic.com Since I posted this thread I've learned std.algorithm.sum is 4 times slower than a naive loop sum. Even if this is for reasons of accuracy this is exactly what I am talking about: this is a hidden iceberg of terrible performance that will reflect poorly on D. That's so slow the function needs a health warning.Whoa. What's happening there? Do we have anyone on it? -- Andrei
Mar 09 2016
On Wednesday, 9 March 2016 at 13:26:45 UTC, Andrei Alexandrescu wrote:On 03/08/2016 09:14 AM, ixid wrote:They just don't do the same thing: sum() uses pairwise summation, which is safer as I understand it. Corresponding issue: https://issues.dlang.org/show_bug.cgi?id=15722Since I posted this thread I've learned std.algorithm.sum is 4 times slower than a naive loop sum. Even if this is for reasons of accuracy this is exactly what I am talking about: this is a hidden iceberg of terrible performance that will reflect poorly on D. That's so slow the function needs a health warning.Whoa. What's happening there? Do we have anyone on it? -- Andrei
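For readers unfamiliar with the term, this is roughly what pairwise summation does (a sketch of the idea only, not Phobos's actual implementation): split the input in half recursively and add the partial sums, which keeps the accumulated rounding error growing roughly with log n rather than with n as in a left-to-right loop.

    // Minimal illustration of pairwise summation over a slice.
    double pairwiseSum(const(double)[] a)
    {
        if (a.length <= 16)            // small base case: plain loop
        {
            double s = 0;
            foreach (x; a)
                s += x;
            return s;
        }
        const mid = a.length / 2;      // split in half and recurse
        return pairwiseSum(a[0 .. mid]) + pairwiseSum(a[mid .. $]);
    }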
Mar 09 2016
On Wednesday, 9 March 2016 at 13:42:40 UTC, cym13 wrote:They just don't do the same thing, sum() uses pairwise summation which is safer as I understand it. Corresponding issue: https://issues.dlang.org/show_bug.cgi?id=15722That third comment about how it's not obvious which algorithm sum is using is a good one.
Mar 09 2016
On Wednesday, 9 March 2016 at 13:26:45 UTC, Andrei Alexandrescu wrote:On 03/08/2016 09:14 AM, ixid wrote:Ilya has long term plans for this, but I have a short-term fix that will buy us a very large performance improvement here (if my old benchmarks were correct). Give me a few mins to prep the pull request :)Since I posted this thread I've learned std.algorithm.sum is 4 times slower than a naive loop sum. Even if this is for reasons of accuracy this is exactly what I am talking about- this is a hidden iceberg of terrible performance that will reflect poorly on D. That's so slow the function needs a health warning.Whoa. What's happening there? Do we have anyone on it? -- Andrei
Mar 09 2016
On 3/9/16 9:03 AM, John Colvin wrote:On Wednesday, 9 March 2016 at 13:26:45 UTC, Andrei Alexandrescu wrote:Thanks much! -- AndreiOn 03/08/2016 09:14 AM, ixid wrote:Ilya has long term plans for this, but I have a short-term fix that will buy us a very large performance improvement here (if my old benchmarks were correct). Give me a few mins to prep the pull request :)Since I posted this thread I've learned std.algorithm.sum is 4 times slower than a naive loop sum. Even if this is for reasons of accuracy this is exactly what I am talking about- this is a hidden iceberg of terrible performance that will reflect poorly on D. That's so slow the function needs a health warning.Whoa. What's happening there? Do we have anyone on it? -- Andrei
Mar 09 2016
On Wednesday, 9 March 2016 at 14:04:40 UTC, Andrei Alexandrescu wrote:On 3/9/16 9:03 AM, John Colvin wrote:https://github.com/D-Programming-Language/phobos/pull/4069On Wednesday, 9 March 2016 at 13:26:45 UTC, Andrei Alexandrescu wrote:Thanks much! -- AndreiOn 03/08/2016 09:14 AM, ixid wrote:Ilya has long term plans for this, but I have a short-term fix that will buy us a very large performance improvement here (if my old benchmarks were correct). Give me a few mins to prep the pull request :)[...]Whoa. What's happening there? Do we have anyone on it? -- Andrei
Mar 09 2016
On Tuesday, 8 March 2016 at 14:14:25 UTC, ixid wrote:Since I posted this thread I've learned std.algorithm.sum is 4 times slower than a naive loop sum. Even if this is for reasons of accuracy this is exactly what I am talking about: this is a hidden iceberg of terrible performance that will reflect poorly on D. That's so slow the function needs a health warning.

I've seen a few cases while exploring D. Not all fully researched (apologies for that), but since there appears to be interest in identification I'll list them.

* Lower-casing strings (likely upper-casing too), and some character type checks. Routines like toLower and asLowerCase call functions that work for all Unicode characters. I was able to create much faster versions by checking if the character was ASCII, then calling either the ASCII version or the more general version (a sketch of this check follows below). The same is true for a few routines like isNumber. Some have the ASCII check optimization built in, but not all. If this optimization is added, it might also be useful to add a few common combinations (or a generic solution, if that's feasible). For example, to check if a character is alpha-numeric, one currently ORs two tests from the standard library, isAlpha and isNumber. Putting in an ASCII optimization check requires putting it before doing the OR, rather than inside the tests being ORed.

* Large associative arrays. When associative arrays get beyond about 10 million entries, performance starts to decline. I believe this is due to resizing the arrays. It's worse with strings as keys than with integers as keys. Having a way to reserve capacity may help under some circumstances.

* Associative arrays - Converting keys to immutable versions for lookup. Associative arrays want immutable values as keys. As far as I can tell, immutable values are also required when performing a lookup, even if a new entry won't be stored. A couple of apps I've written walk through large lists of text values, naturally available as char[] because they are read from input streams. To test presence in an associative array, it's necessary to copy them to immutable strings first. I haven't fully researched this one, but my experience is that copying from char[] to string becomes a meaningful cost. On the surface, this appears to be an optimization opportunity: create the immutable strings only when actually storing a new value.

--Jon
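A minimal sketch of the ASCII-check idea from the first bullet above (illustrative only; the function name is made up, and this is not how Phobos implements it):

    import std.uni : uniToLower = toLower;

    // Cheap path for ASCII; fall back to full Unicode handling otherwise.
    dchar fastToLower(dchar c)
    {
        if (c < 0x80)
            return (c >= 'A' && c <= 'Z') ? cast(dchar)(c - 'A' + 'a') : c;
        return uniToLower(c);
    }

    unittest
    {
        assert(fastToLower('D') == 'd');
        assert(fastToLower('7') == '7');   // non-letters pass through
        assert(fastToLower('É') == 'é');   // non-ASCII goes through std.uni
    }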
Mar 09 2016
On Wed, Mar 09, 2016 at 08:30:10PM +0000, Jon D via Digitalmars-d wrote:On Tuesday, 8 March 2016 at 14:14:25 UTC, ixid wrote:Since I posted this thread I've learned std.algorithm.sum is 4 times slower than a naive loop sum. Even if this is for reasons of accuracy this is exactly what I am talking about: this is a hidden iceberg of terrible performance that will reflect poorly on D. That's so slow the function needs a health warning.

In the case of std.algorithm.sum, the focus is on accuracy rather than performance. It does some extra work to ensure maximum accuracy in the result, so it shouldn't be expected to have top performance. Granted, though, the docs could be clearer about this accuracy vs. performance tradeoff. Please file a bug on this (or better yet, submit a PR for it). In any case, 4 times slower sounds a bit excessive... it would be good to investigate why this is happening and fix it.

I've seen a few cases while exploring D. Not all fully researched (apologies for that), but since there appears to be interest in identification I'll list them.

One of the big performance killers that people are likely to run into is autodecoding for strings in all range-based algorithms. Walter has repeatedly complained about this, but so far Andrei (and some others) have resisted getting rid of autodecoding. Fortunately, you can (mostly) suppress this by wrapping your strings in std.encoding.codeUnits before operating on them. Nevertheless, it's one of those hidden gotchas that could result in poorer performance than what you could get otherwise.

Another big performance killer that I repeatedly run into is the overly-eager GC collection schedule. Some may argue against the use of the GC, period, but I have found that even with the GC, it's possible to get 30-40% speedups just by calling GC.disable() and manually triggering GC.collect() at a lower frequency than the default. This is especially true when your program performs many allocations of small objects, like (small) strings. I have observed this in at least 2-3 different memory-intensive programs of mine, including a recent experiment in writing a faster CSV parser than std.csv. Arguably, we should fix this in druntime itself so that collection cycles are run less often, though I'm not sure what implications this may have on low-memory systems. Sometimes, you don't even need to do this for the entire program; you can get big performance boosts just by wrapping GC.disable() / GC.enable() around strategic sections of allocation-intensive code.

* Lower-casing strings (likely upper-casing too), and some character type checks. Routines like toLower and asLowerCase call functions that work for all Unicode characters. I was able to create much faster versions by checking if the character was ASCII, then calling either the ASCII version or the more general version. The same is true for a few routines like isNumber. Some have the ASCII check optimization built in, but not all. If this optimization is added, it might also be useful to add a few common combinations (or a generic solution, if that's feasible). For example, to check if a character is alpha-numeric, one currently ORs two tests from the standard library, isAlpha and isNumber. Putting in an ASCII optimization check requires putting it before doing the OR, rather than inside the tests being ORed.

These sound like valuable additions to Phobos. Submit a PR, please? [...]

* Associative arrays - Converting keys to immutable versions for lookup. Associative arrays want immutable values as keys. As far as I can tell, immutable values are also required when performing a lookup, even if a new entry won't be stored. A couple of apps I've written walk through large lists of text values, naturally available as char[] because they are read from input streams. To test presence in an associative array, it's necessary to copy them to immutable strings first. I haven't fully researched this one, but my experience is that copying from char[] to string becomes a meaningful cost. On the surface, this appears to be an optimization opportunity: create the immutable strings only when actually storing a new value.[...]

This is arguably a bug. If you're only looking up a key, you should be able to pass in const(char)[] rather than immutable(char)[] (aka string). The GC performance problem I mentioned above is especially noticeable when there are large numbers of small allocations, as would be the case if you constantly have to call .idup just because AA lookups demand immutable keys. Please file an issue for this, if there isn't one already filed.

T -- Windows 95 was a joke, and Windows 98 was the punchline.
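A minimal sketch of the GC.disable()/GC.collect() batching pattern described above (the workload and the collection interval are arbitrary placeholders):

    import core.memory : GC;
    import std.conv : to;

    void main()
    {
        GC.disable();                   // no automatic collection cycles
        scope(exit) GC.enable();        // restore the default behaviour on exit

        string[] kept;
        foreach (i; 0 .. 1_000_000)
        {
            kept ~= i.to!string;        // many small allocations
            if (i % 250_000 == 0)
                GC.collect();           // collect on our own, coarser schedule
        }
    }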
Mar 09 2016
On Wednesday, 9 March 2016 at 21:01:13 UTC, H. S. Teoh wrote:On Wed, Mar 09, 2016 at 08:30:10PM +0000, Jon D via Digitalmars-d wrote:I think you'd be good at reviewing this: https://github.com/D-Programming-Language/phobos/pull/4069On Tuesday, 8 March 2016 at 14:14:25 UTC, ixid wrote:In the case of std.algorithm.sum, the focus is on accuracy rather than performance. It does some extra work to ensure maximum accuracy in the result, so it shouldn't be expected to have top performance. Granted, though, the docs could be clearer about this accuracy vs. performance tradeoff. Please file a bug on this (or better yet, submit a PR for it). In any case, 4 times slower sounds a bit excessive... it would be good to investigate why this is happening and fix it.[...]
Mar 09 2016
On Wednesday, 9 March 2016 at 20:30:10 UTC, Jon D wrote:I've seen a few cases while exploring D.Turns out there are issues filed for each of the performance issues I mentioned:

* Lower-casing strings: https://issues.dlang.org/show_bug.cgi?id=11229
* Large associative arrays: https://issues.dlang.org/show_bug.cgi?id=2504
* Associative arrays - Checking membership with mutable values (char arrays) rather than strings (immutable): https://issues.dlang.org/show_bug.cgi?id=15038
Mar 09 2016