digitalmars.D - stride in slices
- DigitalDesigns (4/4) Jun 02 2018 Proposal:
- Dmitry Olshansky (2/6) Jun 02 2018 Ranges work for this and don’t need special syntax. Just saying.
- Meta (3/7) Jun 03 2018 This is exactly what std.range.stride does. The syntax [a..b;m]
- Seb (5/14) Jun 03 2018 We even have slide which is a generalization of stride (striding
- DigitalDesigns (14/23) Jun 03 2018 So, can I do
- Paul Backus (4/17) Jun 04 2018 You can use std.algorithm.iteration.each to modify a range
- Steven Schveighoffer (8/37) Jun 04 2018 You are confusing range and array syntax. All of what you want exists,
- Dennis (14/17) Jun 04 2018 I've compared the range versions with a for-loop. For integers
- Steven Schveighoffer (7/25) Jun 04 2018 Interesting!
- Dennis (5/10) Jun 04 2018 I don't know much about this either. Clang has link-time
- Johan Engelen (17/28) Jun 06 2018 Cross-module inlining is never implicitly enabled in LDC. Not
- Ethan (12/13) Jun 04 2018 Just to drive this point home.
- Meta (29/43) Jun 04 2018 Just as an aside:
- David Bennett (9/12) Jun 04 2018 When using `dmd -inline -O -release` with an extra simd benchmark
- David Bennett (7/20) Jun 04 2018 Here's a version that works in ldc:
- Andrei Alexandrescu (6/23) Jun 05 2018 BTW I've had this thought for a long time to implement stride with a
- DigitalDesigns (6/25) Jun 04 2018 This is why I wanted to make sure! I would be using it for a
- Steven Schveighoffer (8/36) Jun 05 2018 See later postings from Ethan and others. It's a matter of optimization
- DigitalDesigns (21/62) Jun 05 2018 It would be nice if testing could be done. Maybe even profiling
- Steven Schveighoffer (14/25) Jun 05 2018 Just to point it out again, Ethan's numbers:
- Timon Gehr (12/14) Jun 05 2018 This is not even close to being true for modern CPUs. There are a lot of...
- DigitalDesigns (25/41) Jun 05 2018 Those optimizations are not part of the instruction set so are
- Adam D. Ruppe (5/6) Jun 05 2018 What are they?
- DigitalDesigns (22/28) Jun 05 2018 It doesn't matter! The issue that I said was not that ranges were
- Ethan (3/8) Jun 05 2018 This is the best part. Ranges *ARE* a language semantic.
- Steven Schveighoffer (5/10) Jun 05 2018 Again, I want to stress, ranges are not "as slow as possible", and it's
- Ethan (36/37) Jun 05 2018 ...
- DigitalDesigns (11/48) Jun 05 2018 Ok asshat! You still don't get it! I didn't say ranges would not
- Steven Schveighoffer (16/25) Jun 05 2018 Nope, he doesn't. Look at what you said:
- DigitalDesigns (34/61) Jun 05 2018 No, you have shown a few fucking cases, why are you guys
- aberba (4/9) Jun 05 2018 Does ranges not evaluate lazily on some cases. So it'll avoid
- Steven Schveighoffer (5/8) Jun 06 2018 It's "accept" not "except". You need to *accept* my world as authority.
- DigitalDesigns (10/19) Jun 07 2018 This is genius!! This made my day! I feel 40 years younger!!! It
- Ethan (3/4) Jun 05 2018 Take it to reddit. Back your arguments up with actual knowledge
- DigitalDesigns (4/8) Jun 05 2018 You are an idiot! You obviously do not understand basic logic.
- Ethan (3/4) Jun 05 2018 Take it to reddit. Back your arguments up with actual knowledge
- Timon Gehr (38/93) Jun 05 2018 I was responding to claims that for loops are basically a wrapper on
On Saturday, 2 June 2018 at 18:49:51 UTC, DigitalDesigns wrote:Proposal: [a..b;m] could be good chars.Ranges work for this and don’t need special syntax. Just saying.
Jun 02 2018
On Saturday, 2 June 2018 at 18:49:51 UTC, DigitalDesigns wrote:Proposal: [a..b;m] could be good chars.This is exactly what std.range.stride does. The syntax [a..b;m] directly translates to [a..b].stride(m).
Jun 03 2018
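Meta's equivalence, as a minimal sketch (the array contents and bounds here are invented for illustration):

```d
import std.algorithm.comparison : equal;
import std.range : stride;

void main()
{
    int[] arr = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9];
    // The proposed arr[1..8;3] would read every 3rd element of arr[1 .. 8]:
    assert(arr[1 .. 8].stride(3).equal([1, 4, 7]));
}
```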
On Sunday, 3 June 2018 at 07:30:56 UTC, Meta wrote:On Saturday, 2 June 2018 at 18:49:51 UTC, DigitalDesigns wrote:We even have slide which is a generalization of stride (striding is sliding with a window size of 1): [a..b].slide(1, m) https://dlang.org/phobos/std_range.html#slideProposal: [a..b;m] could be good chars.This is exactly what std.range.stride does. The syntax [a..b;m] directly translates to [a..b].stride(m).
Jun 03 2018
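A sketch of the stride/slide relationship described above (values invented for illustration). Note that slide(1, m) yields one-element windows rather than bare elements, so the two only compare equal after flattening the windows:

```d
import std.algorithm.comparison : equal;
import std.algorithm.iteration : joiner;
import std.range : slide, stride;

void main()
{
    int[] arr = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9];
    auto strided = arr.stride(3); // 0, 3, 6, 9
    auto slid = arr.slide(1, 3);  // [0], [3], [6], [9]
    assert(strided.equal(slid.joiner));
}
```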
On Sunday, 3 June 2018 at 07:30:56 UTC, Meta wrote:On Saturday, 2 June 2018 at 18:49:51 UTC, DigitalDesigns wrote:So, can I do X[a..b].stride(m) = 0; ? Just curious because if it is exactly the same notion then I should be able to do it, right? Of course, I'm sure another hoop could be created to jump through and it will work, will it still be exactly the same though? If there is an efficient and optimal setting so one could get the same effect, then I guess it might be a direct translation. If not then it isn't. What I am looking for is a sort of zeromemory or memset with stride. It should not allocate a new array or be significantly slower. I'd like some proof that they are "equivalent" such as a disassembly or a profiling... just to satisfy my own skepticism.Proposal: [a..b;m] could be good chars.This is exactly what std.range.stride does. The syntax [a..b;m] directly translates to [a..b].stride(m).
Jun 03 2018
On Sunday, 3 June 2018 at 11:13:52 UTC, DigitalDesigns wrote:On Sunday, 3 June 2018 at 07:30:56 UTC, Meta wrote:You can use std.algorithm.iteration.each to modify a range in-place: https://run.dlang.io/is/2jjZHhOn Saturday, 2 June 2018 at 18:49:51 UTC, DigitalDesigns wrote:So, can I do X[a..b].stride(m) = 0;Proposal: [a..b;m] could be good chars.This is exactly what std.range.stride does. The syntax [a..b;m] directly translates to [a..b].stride(m).
Jun 04 2018
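Paul's each suggestion, sketched for a mutable byte array (the sizes and stride amount are invented for illustration); a ref lambda writes through to the underlying elements, so nothing is allocated:

```d
import std.algorithm.iteration : each;
import std.range : stride;

void main()
{
    auto X = new ubyte[](12);
    X[] = 0xFF;
    // Zero every 3rd byte of the slice, in place:
    X[0 .. 12].stride(3).each!((ref b) => b = 0);
    assert(X[0] == 0 && X[3] == 0 && X[1] == 0xFF);
}
```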
On 6/3/18 7:13 AM, DigitalDesigns wrote:On Sunday, 3 June 2018 at 07:30:56 UTC, Meta wrote:X[a..b].stride(m).fill(0);On Saturday, 2 June 2018 at 18:49:51 UTC, DigitalDesigns wrote:So, can I do X[a..b].stride(m) = 0;Proposal: [a..b;m] good chars.This is exactly what std.range.stride does. The syntax [a..b;m] directly translates to [a..b].stride(m).? Just curious because if it is exactly the same notion then I should be able to do it, right?You are confusing range and array syntax. All of what you want exists, but isn't necessarily using the same syntax.Of course, I'm sure another hoop could be created to jump through and it will work, will it still be exactly the same though? If there is an efficient and optimal setting so one could get the same effect, then I guess it might be a direct translation. If not then it isn't. What I am looking for is a sort of zeromemory or memset with stride. It should not allocate a new array or be significantly slower. I'd like some proof that they are "equivalent" such as a disassembly or a profiling... just to satisfy my own skepticism.fill is what you are looking for. Note, it's not going to necessarily be as efficient, but it's likely to be close. -Steve
Jun 04 2018
On Monday, 4 June 2018 at 15:43:20 UTC, Steven Schveighoffer wrote:Note, it's not going to necessarily be as efficient, but it's likely to be close. -SteveI've compared the range versions with a for-loop. For integers and longs or high stride amounts the time is roughly equal, but for bytes with low stride amounts it can be up to twice as slow. https://run.dlang.io/is/BoTflQ 50 Mb array, type = byte, stride = 3, compiler = LDC -O4 -release For-loop 18 ms Fill(0) 33 ms each! 33 ms With stride = 13: For-loop 7.3 ms Fill(0) 7.5 ms each! 7.8 ms
Jun 04 2018
On 6/4/18 1:40 PM, Dennis wrote:On Monday, 4 June 2018 at 15:43:20 UTC, Steven Schveighoffer wrote:Interesting! BTW, do you have cross-module inlining on? I wonder if that makes a difference if you didn't have it on before. (I'm somewhat speaking from ignorance, as I've heard people talk about this limitation, but am not sure exactly when it's enabled) -SteveNote, it's not going to necessarily be as efficient, but it's likely to be close.I've compared the range versions with a for-loop. For integers and longs or high stride amounts the time is roughly equal, but for bytes with low stride amounts it can be up to twice as slow. https://run.dlang.io/is/BoTflQ 50 Mb array, type = byte, stride = 3, compiler = LDC -O4 -release For-loop 18 ms Fill(0) 33 ms each! 33 ms With stride = 13: For-loop 7.3 ms Fill(0) 7.5 ms each! 7.8 ms
Jun 04 2018
On Monday, 4 June 2018 at 18:11:47 UTC, Steven Schveighoffer wrote:BTW, do you have cross-module inlining on? I wonder if that makes a difference if you didn't have it on before. (I'm somewhat speaking from ignorance, as I've heard people talk about this limitation, but am not sure exactly when it's enabled)I don't know much about this either. Clang has link-time optimization with -O4, but looking at the --help of LDC it turns out -O4 is equivalent to -O3 for D. Maybe someone else knows?
Jun 04 2018
On Monday, 4 June 2018 at 18:47:02 UTC, Dennis wrote:On Monday, 4 June 2018 at 18:11:47 UTC, Steven Schveighoffer wrote:Cross-module inlining is never implicitly enabled in LDC. Not having it enabled is definitely something that hurts performance of LDC-generated code. Enable it with: `-enable-cross-module-inlining`. (May lead to missing symbols during linking when using templates with __FILE__ arguments.)BTW, do you have cross-module inlining on? I wonder if that makes a difference if you didn't have it on before. (I'm somewhat speaking from ignorance, as I've heard people talk about this limitation, but am not sure exactly when it's enabled)I don't know much about this either. Clang has link-time optimization with -O4, but looking at the --help of LDC it turns out -O4 is equivalent to -O3 for D. Maybe someone else knows?Clang and LDC treat `-O4` as `-O3`. To enable LTO (cross-module inlining and other big perf gains), you have to use `-flto=full` or `-flto=thin`, but it'll only give cross-module inlining for modules that have been compiled with it. Notably: default Phobos/druntime is _not_ compiled with LTO enabled. LDC 1.9.0 release packages on Github ship with a second set of LTO Phobos/druntime libs. With LDC 1.9.0, you can do `-flto=<thin|full> -defaultlib=phobos2-ldc-lto,druntime-ldc-lto`, for maximum druntime/Phobos inlining. -Johan
Jun 06 2018
On Monday, 4 June 2018 at 18:11:47 UTC, Steven Schveighoffer wrote:BTW, do you have cross-module inlining on?Just to drive this point home. https://run.dlang.io/is/nrdzb0 Manually implemented stride and fill with everything forced inline. Otherwise, the original code is unchanged. 17 ms, 891 μs, and 6 hnsecs 15 ms, 694 μs, and 1 hnsec 15 ms, 570 μs, and 9 hnsecs My new stride outperformed std.range stride, and the manual for-loop. And, because the third test uses the new stride, it also benefited. But interestingly runs ever so slightly faster...
Jun 04 2018
On Monday, 4 June 2018 at 23:08:17 UTC, Ethan wrote:On Monday, 4 June 2018 at 18:11:47 UTC, Steven Schveighoffer wrote:Just as an aside: ... pragma( inline ) property length() const { return range.length / strideCount; } pragma( inline ) property empty() const { return currFront > range.length; } pragma( inline ) property ref Elem front() { return range[ currFront ]; } pragma( inline ) void popFront() { currFront += strideCount; } ... pragma( inline ) auto stride( Range )( Range r, int a ) ... pragma( inline ) auto fill( Range, Value )( Range r, Value v ) ... pragma(inline), without any argument, does not force inlining. It actually does nothing; it just specifies that the "implementation's default behaviour" should be used. You have to annotate with pragma(inline, true) to force inlining (https://dlang.org/spec/pragma.html#inline). When I change all the pragma(inline) to pragma(inline, true), there is a non-trivial speedup: 14 ms, 517 μs, and 9 hnsecs 13 ms, 110 μs, and 1 hnsec 13 ms, 199 μs, and 9 hnsecs There's further reductions using ldc-beta: 14 ms, 520 μs, and 4 hnsecs 13 ms, 87 μs, and 2 hnsecs 12 ms, 938 μs, and 8 hnsecsBTW, do you have cross-module inlining on?Just to drive this point home. https://run.dlang.io/is/nrdzb0 Manually implemented stride and fill with everything forced inline. Otherwise, the original code is unchanged. 17 ms, 891 μs, and 6 hnsecs 15 ms, 694 μs, and 1 hnsec 15 ms, 570 μs, and 9 hnsecs My new stride outperformed std.range stride, and the manual for-loop. And, because the third test uses the new stride, it also benefited. But interestingly runs every so slightly faster...
Jun 04 2018
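The distinction Meta is pointing at, in a two-line sketch:

```d
// pragma(inline) with no argument merely restores the implementation's
// default inlining behaviour; it does not force anything.
pragma(inline)       int addDefault(int x) { return x + 1; }

// pragma(inline, true) actually forces inlining (and is an error if the
// compiler cannot inline the function).
pragma(inline, true) int addForced(int x)  { return x + 1; }
```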
On Tuesday, 5 June 2018 at 03:13:05 UTC, Meta wrote:14 ms, 520 μs, and 4 hnsecs 13 ms, 87 μs, and 2 hnsecs 12 ms, 938 μs, and 8 hnsecsWhen using `dmd -inline -O -release` with an extra simd benchmark I get: for loop: 21 ms, 291 μs, and 6 hnsecs stride/fill: 64 ms, 927 μs, and 9 hnsecs stride/each: 52 ms, 740 μs, and 8 hnsecs simd &=: 6 ms, 900 μs, and 8 hnsecs https://run.dlang.io/gist/5fe73cbf9943aa57be1101e597bb25e4?args=-inline%20-O%20-release Though the simd version does not work in ldc...
Jun 04 2018
On Tuesday, 5 June 2018 at 05:05:47 UTC, David Bennett wrote:On Tuesday, 5 June 2018 at 03:13:05 UTC, Meta wrote:Here's a version that works in ldc: https://run.dlang.io/gist/1d4bb542427fb82cc455fe9dc30185d7?compiler=ldc&args=-inline%20-O4%20-release for loop: 16 ms, 594 μs, and 1 hnsec stride/fill: 14 ms, 918 μs, and 9 hnsecs stride/each: 14 ms and 813 μs simd &=: 7 ms, 153 μs, and 6 hnsecs14 ms, 520 μs, and 4 hnsecs 13 ms, 87 μs, and 2 hnsecs 12 ms, 938 μs, and 8 hnsecsWhen using `dmd -inline -O -release` with an extra simd benchmark I get: for loop: 21 ms, 291 μs, and 6 hnsecs stride/fill: 64 ms, 927 μs, and 9 hnsecs stride/each: 52 ms, 740 μs, and 8 hnsecs simd &=: 6 ms, 900 μs, and 8 hnsecs https://run.dlang.io/gist/5fe73cbf9943aa57be1101e597bb25e4?args=-inline%20-O%20-release Though the simd version does not work in ldc...
Jun 04 2018
On 06/04/2018 07:08 PM, Ethan wrote:On Monday, 4 June 2018 at 18:11:47 UTC, Steven Schveighoffer wrote:BTW I've had this thought for a long time to implement stride with a compile-time step... never got around to implementing it. It would easily generalize the existing code without too much work. Essentially the step would be a template parameter; if that is 0, then use a run-time stride. Most of the code works unchanged.BTW, do you have cross-module inlining on?Just to drive this point home. https://run.dlang.io/is/nrdzb0 Manually implemented stride and fill with everything forced inline. Otherwise, the original code is unchanged. 17 ms, 891 μs, and 6 hnsecs 15 ms, 694 μs, and 1 hnsec 15 ms, 570 μs, and 9 hnsecs My new stride outperformed std.range stride, and the manual for-loop. And, because the third test uses the new stride, it also benefited. But interestingly runs every so slightly faster...
Jun 05 2018
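A rough sketch of that idea (hypothetical, not in Phobos): make the step a template parameter so the compiler sees a constant; a step of 0 could dispatch to the existing run-time stride, which this sketch omits for brevity:

```d
import std.range.primitives : hasLength, isRandomAccessRange;

struct CompileTimeStride(Range, size_t step)
{
    Range source;
    size_t index;

    @property bool empty() { return index >= source.length; }
    @property auto ref front() { return source[index]; }
    void popFront() { index += step; } // step is a compile-time constant here
}

auto strideCT(size_t step, Range)(Range r)
    if (step > 0 && isRandomAccessRange!Range && hasLength!Range)
{
    return CompileTimeStride!(Range, step)(r);
}

void main()
{
    import std.algorithm.comparison : equal;
    int[] a = [0, 1, 2, 3, 4, 5, 6];
    assert(a.strideCT!3.equal([0, 3, 6]));
}
```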
On Monday, 4 June 2018 at 17:40:57 UTC, Dennis wrote:On Monday, 4 June 2018 at 15:43:20 UTC, Steven Schveighoffer wrote:This is why I wanted to make sure! I would be using it for a stride of 2 and it seems it might have doubled the cost for no other reason than using ranges. Ranges are great but one can't reason about what is happening in them as easily as a direct loop so I wanted to be sure. Thanks for running the test!Note, it's not going to necessarily be as efficient, but it's likely to be close. -SteveI've compared the range versions with a for-loop. For integers and longs or high stride amounts the time is roughly equal, but for bytes with low stride amounts it can be up to twice as slow. https://run.dlang.io/is/BoTflQ 50 Mb array, type = byte, stride = 3, compiler = LDC -O4 -release For-loop 18 ms Fill(0) 33 ms each! 33 ms With stride = 13: For-loop 7.3 ms Fill(0) 7.5 ms each! 7.8 ms
Jun 04 2018
On 6/4/18 5:52 PM, DigitalDesigns wrote:On Monday, 4 June 2018 at 17:40:57 UTC, Dennis wrote:See later postings from Ethan and others. It's a matter of optimization being able to see the "whole thing". This is why for loops are sometimes better. It's not inherent with ranges, but if you use the right optimization flags, it's done as fast as if you hand-wrote it. What I've found with D (and especially LDC) is that when you give the compiler everything to work with, it can do some seemingly magic things. -SteveOn Monday, 4 June 2018 at 15:43:20 UTC, Steven Schveighoffer wrote:This is why I wanted to make sure! I would be using it for a stride of 2 and it seems it might have doubled the cost for no other reason than using ranged. Ranges are great but one can't reason about what is happening in then as easy as a direct loop so I wanted to be sure. Thanks for running the test!Note, it's not going to necessarily be as efficient, but it's likely to be close. -SteveI've compared the range versions with a for-loop. For integers and longs or high stride amounts the time is roughly equal, but for bytes with low stride amounts it can be up to twice as slow. https://run.dlang.io/is/BoTflQ 50 Mb array, type = byte, stride = 3, compiler = LDC -O4 -release For-loop 18 ms Fill(0) 33 ms each! 33 ms With stride = 13: For-loop 7.3 ms Fill(0) 7.5 ms each! 7.8 ms
Jun 05 2018
On Tuesday, 5 June 2018 at 13:05:56 UTC, Steven Schveighoffer wrote:On 6/4/18 5:52 PM, DigitalDesigns wrote:It would be nice if testing could be done. Maybe even profiling in unit tests to make sure ranges are within some margin of error (10%). One of the main reasons I don't use ranges is I simply don't have faith they will be as fast as direct encoding. While they might offer a slightly easier syntax I don't know what is going on under the hood so I can't reason about it (unless I look up the source). With a for loop, it is pretty much a wrapper on internal cpu logic so it will be near as fast as possible. I suppose in the long run ranges do have the potential to outperform since they do abstract but there is no guarantee they will even come close. Having some "proof" that they are working well would ease my mind. As this thread shows, ranges have some major issues. Imagine having some code on your machine that is very performant but on another machine in slightly different circumstances it runs poorly. Now, say it is the stride issue... One normally would not think of that being an issue so one will look in other areas and could waste time. At least with direct loops you pretty much get what you see. It is very easy for ranges to be slow but more difficult for them to be fast.On Monday, 4 June 2018 at 17:40:57 UTC, Dennis wrote:See later postings from Ethan and others. It's a matter of optimization being able to see the "whole thing". This is why for loops are sometimes better. It's not inherent with ranges, but if you use the right optimization flags, it's done as fast as if you hand-wrote it. What I've found with D (and especially LDC) is that when you give the compiler everything to work with, it can do some seemingly magic things. -SteveOn Monday, 4 June 2018 at 15:43:20 UTC, Steven Schveighoffer wrote:This is why I wanted to make sure! I would be using it for a stride of 2 and it seems it might have doubled the cost for no other reason than using ranges. 
Ranges are great but one can't reason about what is happening in them as easily as a direct loop so I wanted to be sure. Thanks for running the test!Note, it's not going to necessarily be as efficient, but it's likely to be close. -SteveI've compared the range versions with a for-loop. For integers and longs or high stride amounts the time is roughly equal, but for bytes with low stride amounts it can be up to twice as slow. https://run.dlang.io/is/BoTflQ 50 Mb array, type = byte, stride = 3, compiler = LDC -O4 -release For-loop 18 ms Fill(0) 33 ms each! 33 ms With stride = 13: For-loop 7.3 ms Fill(0) 7.5 ms each! 7.8 ms
Jun 05 2018
On 6/5/18 12:50 PM, DigitalDesigns wrote:I suppose in the long run ranges do have the potential to out perform since they do abstract but there is no guarantee they will even come close. Having some "proof" that they are working well would ease my mind. As this thread shows, ranges have some major issues.Just to point it out again, Ethan's numbers: 17 ms, 891 μs, and 6 hnsecs // for loop 15 ms, 694 μs, and 1 hnsec // fill 15 ms, 570 μs, and 9 hnsecs // each I think ranges are doing just fine. You just need the cross-module inlining turned on.Imagine having some code on your machine that is very performant but on another machine in a slightly different circumstances it runs poorly. Now, say it is the stride issue... One normally would not think of that being an issue so one will look in other areas and could waste times. At least with direct loops you pretty much get what you see. It is very easy for ranges to be slow but more difficult for them to be fast.The same can be said even with direct loops. There are no guarantees of course that one type of programming style is going to outperform another on all platforms and all compilers. My experience is that things that seem like they should be slower can sometimes be faster, and vice versa. It depends on so many factors, and the best thing to do is test and observe in each situation. -Steve
Jun 05 2018
On 05.06.2018 18:50, DigitalDesigns wrote:With a for loop, it is pretty much a wrapper on internal cpu logic so it will be near as fast as possible.This is not even close to being true for modern CPUs. There are a lot of architectural and micro-architectural details that affect performance but are not visible or accessible in your for loop. If you care about performance, you will need to test anyway, as even rather sophisticated models of CPU performance don't get everything right. Also, it is often not necessary to be "as fast as possible". It is usually more helpful to figure out where the bottleneck is for your code and concentrate optimization effort there, which you can do more effectively if you can save time and effort for the remaining parts of your program by writing simple and obviously correct range-based code, which often will be fast as well.
Jun 05 2018
On Tuesday, 5 June 2018 at 18:46:41 UTC, Timon Gehr wrote:On 05.06.2018 18:50, DigitalDesigns wrote:Those optimizations are not part of the instruction set so are irrelevant. They will occur with ranges too. For loops HAVE a direct cpu semantic! Do you doubt this? Cpu's do not have range semantics. Ranges are layers on top of compiler semantics... you act like they are equivalent, they are not! All range semantics must go through the library code then to the compiler then to cpu. For loops of all major systems languages go almost directly to cpu instructions. for(int i = 0; i < N; i++) translates into either increment and loop or jump instructions. There is absolutely no reason why any decent compiler would not use what the cpu has to offer. For loops are language semantics, Ranges are library semantics. To pretend they are equivalent is wrong and no amount of justifying will make them the same. I actually do not know of any commercially viable cpu that exists without loop semantics. I also know of no commercially viable compiler that does not wrap those instructions in a for-loop (or while, or whatever) like syntax that almost maps directly to the cpu instructions.With a for loop, it is pretty much a wrapper on internal cpu logic so it will be near as fast as possible.This is not even close to being true for modern CPUs. There are a lot of architectural and micro-architectural details that affect performance but are not visible or accessible in your for loop. If you care about performance, you will need to test anyway, as even rather sophisticated models of CPU performance don't get everything right.Also, it is often not necessary to be "as fast as possible". 
It is usually more helpful to figure out where the bottleneck is for your code and concentrate optimization effort there, which you can do more effectively if you can save time and effort for the remaining parts of your program by writing simple and obviously correct range-based code, which often will be fast as well.It's also often not necessary to be "as slow as possible". I'm not asking for about generalities but specifics. It's great to make generalizations about how things should be but I would like to know how they are. Maybe in theory ranges could be more optimal than other semantics but theory never equals practice.
Jun 05 2018
On Tuesday, 5 June 2018 at 19:05:27 UTC, DigitalDesigns wrote:For loops HAVE a direct cpu semantic! Do you doubt this?What are they? And for bonus points, are they actually used by compilers? Then the final question: is the generated code any different than inlined ranges?
Jun 05 2018
On Tuesday, 5 June 2018 at 19:18:15 UTC, Adam D. Ruppe wrote:On Tuesday, 5 June 2018 at 19:05:27 UTC, DigitalDesigns wrote:It doesn't matter! The issue that I said was not that ranges were slower but that ranges exist as an abstraction on top of language semantics! That means that they can never be faster than the language itself! Anything that a range does can never be faster than doing it by hand. This is not a hard concept to understand. I know everyone wants to defend their precious ranges but what happens is irrelevant in some particular case. Ranges could be 100x faster in some specific case but it doesn't change the fact that they are an abstraction on top of the language, not in the language. I've already pointed out, and made it clear, I will do it one last time: There is no guarantee (doesn't have to be 100% by the rule of god) by the compiler that a range's performance will even come close to hand-written code or that it won't all of a sudden change when someone decides to "optimize" it and actually makes it worse! These problems are far less likely to occur in a well-established language semantic. I can't believe people are really defending library solutions as not having these issues, but I guess that is the nature of most humans.For loops HAVE a direct cpu semantic! Do you doubt this?What are they? And for bonus points, are they actually used by compilers? Then the final question: is the generated code any different than inlined ranges?
Jun 05 2018
On Tuesday, 5 June 2018 at 22:20:08 UTC, DigitalDesigns wrote:It doesn't matter! The issue that I said was not that ranges were slower but that ranges exist on an abstract on top of language semantics! that means that they can never be faster than the language itself! Anything that a range does can never be faster than doing it by hand.This is the best part. Ranges *ARE* a language semantic. https://tour.dlang.org/tour/en/basics/ranges
Jun 05 2018
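Concretely: foreach itself is specified in terms of empty/front/popFront, so any user type with those members plugs into a built-in language construct. A minimal sketch:

```d
// The compiler lowers foreach over this struct into calls to
// empty/front/popFront; no library code is involved.
struct Evens
{
    int n;
    bool empty() const { return n >= 10; }
    int front() const { return n; }
    void popFront() { n += 2; }
}

void main()
{
    int sum;
    foreach (x; Evens(0)) // visits 0, 2, 4, 6, 8
        sum += x;
    assert(sum == 20);
}
```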
On 6/5/18 3:05 PM, DigitalDesigns wrote:It's also often not necessary to be "as slow as possible". I'm not asking for about generalities but specifics. It's great to make generalizations about how things should be but I would like to know how they are. Maybe in theory ranges could be more optimal than other semantics but theory never equals practice.Again, I want to stress, ranges are not "as slow as possible", and it's clear from the numbers posted here that they are faster than for loops in practice, at least for this example. -Steve
Jun 05 2018
On Tuesday, 5 June 2018 at 19:05:27 UTC, DigitalDesigns wrote:For loops HAVE a direct cpu semantic! Do you doubt this?... Right. If you're gonna keep running your mouth off, how about looking at some disassembly then. for(auto i=0; i<a.length; i+=strideAmount) Using ldc -O4 -release for x86_64 processors, the initialiser translates to: mov byte ptr [rbp + rcx], 0 The comparison translates to: cmp r13, rcx ja .LBB0_2 And the increment and store translates to: mov byte ptr [rbp + rcx], 0 movsxd rcx, eax add eax, 3 So. It uses four of the most basic instructions you can think of: mov, cmp, j<cond>, add. Now, what might you ask are the instructions that a range compiles down to when everything is properly inlined? The initialisation, since it's a function, pulls from the stack. mov rax, qword ptr [rsp + 16] movsxd rcx, dword ptr [rsp + 32] But the comparison looks virtually identical. cmp rax, rcx jb .LBB2_4 But how does it do the add? With some register magic. movsxd rcx, edx lea edx, [rcx + r9] Now, what that looks like it's doing to me is combining the pointer load and index increment into those two instructions. One instruction less than the flat for loop. In conclusion. The semantics you talk about are literally some of the most basic instructions in computing; and that escaping the confines of a for loop for a foreach loop can let the compiler generate more efficient code than 50-year-old compsci concepts can.
Jun 05 2018
On Tuesday, 5 June 2018 at 20:07:06 UTC, Ethan wrote:On Tuesday, 5 June 2018 at 19:05:27 UTC, DigitalDesigns wrote:Ok asshat! You still don't get it! I didn't say ranges would not compile down to the same thing! Do you have trouble understanding the English language? You don't seem to get the concept where ranges are library solutions and someone can come along at any time and modify some code and WHAM! They no longer compile down to your efficient precious instructions. That is far more unlikely to occur with language semantics. Why is that so difficult for you to understand? You sure do have an attitude for someone that has difficulty with English.For loops HAVE a direct cpu semantic! Do you doubt this?... Right. If you're gonna keep running your mouth off, how about looking at some disassembly then. for(auto i=0; i<a.length; i+=strideAmount) Using ldc -O4 -release for x86_64 processors, the initialiser translates to: mov byte ptr [rbp + rcx], 0 The comparison translates to: cmp r13, rcx ja .LBB0_2 And the increment and store translates to: mov byte ptr [rbp + rcx], 0 movsxd rcx, eax add eax, 3 So. It uses four of the most basic instructions you can think of: mov, cmp, j<cond>, add. Now, what might you ask are the instructions that a range compiles down to when everything is properly inlined? The initialisation, since it's a function, pulls from the stack. mov rax, qword ptr [rsp + 16] movsxd rcx, dword ptr [rsp + 32] But the comparison looks virtually identical. cmp rax, rcx jb .LBB2_4 But how does it do the add? With some register magic. movsxd rcx, edx lea edx, [rcx + r9] Now, what that looks like it's doing to me is combining the pointer load and index increment into those two instructions. One instruction less than the flat for loop. In conclusion. 
The semantics you talk about are literally some of the most basic instructions in computing; and that escaping the confines of a for loop for a foreach loop can let the compiler generate more efficient code than 50-year-old compsci concepts can.
Jun 05 2018
On 6/5/18 5:22 PM, DigitalDesigns wrote:On Tuesday, 5 June 2018 at 20:07:06 UTC, Ethan wrote:Nope, he doesn't. Look at what you said: "Maybe in theory ranges could be more optimal than other semantics but theory never equals practice. " And now you have been shown (multiple times) that in practice ranges in fact outperform for loops. Including the assembly to prove it (which helps with this comment: "Having some "proof" that they are working well would ease my mind.") So tone down the attitude, you got what you *clearly* asked for but seem reluctant to acknowledge. Ranges are good, for loops are good too, but not as. So maybe you should just use ranges and use the correct optimization flags and call it a day? Or else use for loops and accept that even though they may not run as quickly, they are "safer" to use since some malicious coder could come along and add in sleeps inside the std.algorithm functions. -SteveIn conclusion. The semantics you talk about are literally some of the most basic instructions in computing; and that escaping the confines of a for loop for a foreach loop can let the compiler generate more efficient code than 50-year-old compsci concepts can.Ok asshat! You still don't get it! I didn't say ranges would not compile down to the same thing! Do you have trouble understanding the English language?
Jun 05 2018
On Tuesday, 5 June 2018 at 21:35:03 UTC, Steven Schveighoffer wrote:On 6/5/18 5:22 PM, DigitalDesigns wrote:No, you have shown a few fucking cases, why are you guys attacking me for being dense? You can't prove that ranges are more optimal than direct semantics! Do it! I'd like to see you try!On Tuesday, 5 June 2018 at 20:07:06 UTC, Ethan wrote:Nope, he doesn't. Look at what you said: "Maybe in theory ranges could be more optimal than other semantics but theory never equals practice. " And now you have been shown (multiple times) that in practice ranges in fact outperform for loops. Including the assembly to prove it (which helps with this comment: "Having some "proof" that they are working well would ease my mind.")In conclusion. The semantics you talk about are literally some of the most basic instructions in computing; and that escaping the confines of a for loop for a foreach loop can let the compiler generate more efficient code than 50-year-old compsci concepts can.Ok asshat! You still don't get it! I didn't say ranges would not compile down to the same thing! Do you have trouble understanding the English language?So tone down the attitude, you got what you *clearly* asked for but seem reluctant to acknowledge. Ranges are good, for loops are good too, but not as. So maybe you should just use ranges and use the correct optimization flags and call it a day? Or else use for loops and accept that even though they may not run as quickly, they are "safer" to use since some malicious coder could come along and add in sleeps inside the std.algorithm functions. -SteveWhat it seems is that a few of you are upset because I didn't bow down to your precious range semantics and ask for clarification. At first I was jumped on then someone did some tests and found out that it wasn't so rosy like everyone thought. Of course, the work around is to force optimizations that fix the problem when the problem doesn't exist in for loops. 
Then you come along and tell me that specific cases prove the general case... that is real dense. You know, it takes two to have an attitude! I asked for information regarding stride. I got the range version; it turned out to be slower in some corner case for some bizarre reason. I was then told it required optimizations (why? It is fishy that the corner case would be 200% slower for a weird edge case) and then I was told that ranges are always faster (which is what you just said, because you act like one example proves everything). Every step of the way I am told "don't worry".

You've already stepped in the shit once and you expect me to believe everything you say? Why is it so hard to have a test suite that checks the performance of range constructs instead of just getting me to believe you? Huh? Do you really think I'm supposed to believe everything any asshat says on the net just because they want me to? Back up your beliefs, it's that simple. Provide timings for all the range functions in various combinations and give me a worst-case scenario compared to their equivalent hand-coded versions. Once you do that then I will be able to make an informed decision rather than doing what you really want, which is except your world as authority regardless of the real truth.
Jun 05 2018
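The timing dispute above can be checked directly rather than argued about. Below is a minimal sketch, not the thread's original benchmark: the function names `sumFor`/`sumRange` are invented here, and the timings only mean anything with optimizations enabled (e.g. `dmd -O -inline -release` or `ldc2 -O3`).

```d
import std.algorithm.iteration : sum;
import std.array : array;
import std.datetime.stopwatch : benchmark;
import std.range : iota, stride;
import std.stdio : writeln;

// Hand-written for loop: sum every third element.
int sumFor(int[] a)
{
    int total = 0;
    for (size_t i = 0; i < a.length; i += 3)
        total += a[i];
    return total;
}

// The same computation expressed with ranges.
int sumRange(int[] a)
{
    return a.stride(3).sum;
}

void main()
{
    auto data = iota(0, 3_000).array;
    assert(sumFor(data) == sumRange(data)); // identical results either way

    // Time both versions; numbers are only meaningful with -O etc.
    auto times = benchmark!(() => sumFor(data), () => sumRange(data))(10_000);
    writeln("for loop: ", times[0], "  range: ", times[1]);
}
```

Extending this pattern over the other std.range/std.algorithm combinations one cares about, and keeping it as a regression test, is exactly the kind of evidence being asked for.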
On Tuesday, 5 June 2018 at 22:28:44 UTC, DigitalDesigns wrote:
> [...]

Don't ranges evaluate lazily in some cases? So they'll avoid unnecessary work... and be much faster and more efficient, if I'm correct.
Jun 05 2018
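The laziness asked about here is easy to demonstrate. A minimal sketch (not from the thread): `map` builds a lazy pipeline, and the mapping function only runs for the elements actually consumed.

```d
import std.algorithm.iteration : map;
import std.array : array;
import std.range : iota, take;

void main()
{
    int calls = 0;

    // Building the pipeline does no work at all: map is lazy.
    auto squares = iota(0, 1_000_000).map!((int x) { calls++; return x * x; });
    assert(calls == 0);

    // Consuming only three elements computes only three elements,
    // even though the source range has a million of them.
    auto first3 = squares.take(3).array;
    assert(first3 == [0, 1, 4]);
    assert(calls == 3);
}
```

So the efficiency win is real when only part of the result is needed; whether a fully-consumed lazy pipeline beats a hand-written loop is the separate codegen question debated elsewhere in this thread.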
On 6/5/18 6:28 PM, DigitalDesigns wrote:
> Once you do that then I will be able to make an informed decision rather than doing what you really want, which is except your world as authority regardless of the real truth.

It's "accept" not "except". You need to *accept* my world as authority. Here, this may help: https://www.vocabulary.com/articles/chooseyourwords/accept-except/

-Steve
Jun 06 2018
On Wednesday, 6 June 2018 at 14:23:29 UTC, Steven Schveighoffer wrote:
> On 6/5/18 6:28 PM, DigitalDesigns wrote:
>> Once you do that then I will be able to make an informed decision rather than doing what you really want, which is except your world as authority regardless of the real truth.
>
> It's "accept" not "except". You need to *accept* my world as authority. Here, this may help: https://www.vocabulary.com/articles/chooseyourwords/accept-except/
>
> -Steve

This is genius!! This made my day! I feel 40 years younger!!! It took me back to 3rd grade English class. I can remember when little Lissa Lou was sitting next to me on a fine Wednesday morning! She had just used a word in the wrong context and out of the blue the gestapo barged in, shoved a grenade down her throat, then tossed her out the window! It was a fun day! We all had ice cream and danced like the little Nazis we were! Those were the good ol' days! Thanks!
Jun 07 2018
On Tuesday, 5 June 2018 at 21:22:27 UTC, DigitalDesigns wrote:
> Ok asshat!

Take it to reddit. Back your arguments up with actual knowledge and intelligence, not unfounded aggression.
Jun 05 2018
On Tuesday, 5 June 2018 at 21:52:03 UTC, Ethan wrote:
> On Tuesday, 5 June 2018 at 21:22:27 UTC, DigitalDesigns wrote:
>> Ok asshat!
>
> Take it to reddit. Back your arguments up with actual knowledge and intelligence, not unfounded aggression.

You are an idiot! You obviously do not understand basic logic. Unfounded aggression? Yep, way to see how you didn't start it! Must be nice being the bully!
Jun 05 2018
On Tuesday, 5 June 2018 at 21:54:20 UTC, DigitalDesigns wrote:
> You are an idiot!

Take it to reddit. Back your arguments up with actual knowledge and intelligence, not unfounded aggression.
Jun 05 2018
On 05.06.2018 21:05, DigitalDesigns wrote:
> On Tuesday, 5 June 2018 at 18:46:41 UTC, Timon Gehr wrote:
>> On 05.06.2018 18:50, DigitalDesigns wrote:
>>> With a for loop, it is pretty much a wrapper on internal cpu logic so it will be near as fast as possible.
>>
>> This is not even close to being true for modern CPUs. There are a lot of architectural and micro-architectural details that affect performance but are not visible or accessible in your for loop. If you care about performance, you will need to test anyway, as even rather sophisticated models of CPU performance don't get everything right.
>
> Those optimizations are not part of the instruction set so are irrelevant. They will occur with ranges too. ...

I was responding to claims that for loops are basically a wrapper on internal CPU logic and nearly as fast as possible. Both of those claims were wrong.

> For loops HAVE a direct cpu semantic! Do you doubt this? ...

You'd have to define what that means. (E.g., Google currently shows no hits for "direct CPU semantics".)

> CPUs do not have range semantics. Ranges are layers on top of compiler semantics... you act like they are equivalent, they are not!

I don't understand why you bring this up nor what you think it means. The compiler takes a program and produces some machine code that has the right behavior. Performance is usually not formally specified. In terms of resulting behavior, code with explicit for loops and range-based code may have identical semantics. Which one executes faster depends on internal details of the compiler and the target architecture, and it may change over time, e.g. between compiler releases.

> All range semantics must go through the library code, then to the compiler, then to the cpu. For loops of all major systems languages go almost directly to cpu instructions. for(int i = 0; i < N; i++) translates into either increment-and-loop or jump instructions. ...

Sure, or whatever else the compiler decides to do. It might even be translated into a memcpy call.

Even if you want to restrict yourself to use only for loops, my point stands. Write maintainable code by default and let the compiler do what it does. Then optimize further in those cases where the resulting code is actually too slow. Test for performance regressions.

> There is absolutely no reason why any decent compiler would not use what the cpu has to offer. For loops are language semantics, ranges are library semantics.

Not really. Also, irrelevant.

> To pretend they are equivalent is wrong and no amount of justifying will make them the same.

Again, I don't think this point is part of this discussion.

> I actually do not know if any commercially viable cpu exists without loop semantics.

What does it mean for a CPU to have "loop semantics"? CPUs typically have an instruction pointer register and possibly some built-in instructions to manipulate said instruction pointer. x86 has some built-in loop instructions, but I think they are just there for legacy support and not actually something you want to use in performant code.

> I also know of no commercially viable compiler that does not wrap those instructions in a for loop (or while, or whatever) like syntax that almost maps directly to the cpu instructions. ...

The compiler takes your for loop and generates some machine code. I don't think there is a "commercially viable" compiler that does not sometimes do things that are not direct. And even then, there is no very simple mapping from CPU instructions to observed performance, so the entire point is a bit moot.

>> Also, it is often not necessary to be "as fast as possible". It is usually more helpful to figure out where the bottleneck is for your code and concentrate optimization effort there, which you can do more effectively if you can save time and effort for the remaining parts of your program by writing simple and obviously correct range-based code, which often will be fast as well.
>
> It's also often not necessary to be "as slow as possible".

This seems to be quoting an imaginary person. My point is that to get even faster code, you need to spend effort and often get lower maintainability. This is not always a good trade-off, in particular if the optimization does not improve performance a lot and/or the code in question is not executed very often.

> I'm not asking about generalities but specifics. It's great to make generalizations about how things should be but I would like to know how they are.

That's a bit unspecific.

> Maybe in theory ranges could be more optimal than other semantics but theory never equals practice.

I don't know who this is addressed to. My point was entirely practical.
Jun 05 2018
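The point about the language fixing only observable behavior, not the emitted instructions, can be shown concretely. The two hypothetical functions below (names invented for illustration) are semantically identical; what either one becomes — a memcpy call, a vectorized loop, or plain increment-and-jump code — is entirely up to the compiler and its flags.

```d
import std.algorithm.mutation : copy;

// Two functions with identical observable behavior. The spec constrains
// only that behavior; the compiler may lower either one however it likes.
void copyFor(int[] dst, int[] src)
{
    for (size_t i = 0; i < src.length; i++)
        dst[i] = src[i];
}

void copyRange(int[] dst, int[] src)
{
    // std.algorithm.mutation.copy(source, target)
    src.copy(dst);
}

void main()
{
    auto a = [1, 2, 3, 4];
    auto b = new int[](4);
    auto c = new int[](4);
    copyFor(b, a);
    copyRange(c, a);
    assert(b == c && b == [1, 2, 3, 4]);
}
```

Comparing the assembly the two versions produce (e.g. with `ldc2 -O2 -output-s`, or on a disassembly viewer) is the only reliable way to see what a given compiler actually chose.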