digitalmars.D - Array append performance
- bearophile (51/51) Aug 22 2008 I have performed another small benchmark of the dynamic array append, si...
- Sean Kelly (14/36) Aug 22 2008 Unfortunately, this is expected. On every append the runtime checks to
- dsimcha (18/18) Aug 22 2008 I just ran into this yesterday porting C++ code that used push_back() he...
- bearophile (9/12) Aug 22 2008 In 2D/nD arrays you repeat such size_t many times, so it's not just 4/8 ...
- Robert Jacques (5/8) Aug 22 2008 On a similar note, there's also fast strided arrays. ( see my blog
- Jerry Quinn (5/13) Aug 22 2008 For starters, you need 8 bytes on 64 bit systems. I suspect a more impo...
- bearophile (4/5) Aug 22 2008 Can you explain why?
- superdan (5/20) Aug 22 2008 sorry to always be the one who puts the moose on the table. y'all who wa...
- dsimcha (14/16) Aug 22 2008 and data. it's a view of a portion of memory. the slice don't care where...
- Sean Kelly (5/21) Aug 23 2008 The Tango runtime is a bit different in that it checks to see if the arr...
- Sean Kelly (11/28) Aug 23 2008 Otherwise, it noticably restricts the places I can use built-in arrays.
- Steven Schveighoffer (3/45) Aug 23 2008 Unless the slice is at the beginning of an allocated block...
- Sean Kelly (4/48) Aug 25 2008 True enough :-) I suppose this is one argument in favor of a capacity
- superdan (9/37) Aug 24 2008 then we have a field hanging in there that's only valid sometimes. that ...
- Christopher Wright (8/9) Aug 24 2008 Capacity could be implemented by preallocating extra space at the end of...
- Sean Kelly (9/39) Aug 25 2008 That's the weird bit. The capacity of any copy of a slice or an array
- Denis Koroskin (6/61) Aug 25 2008 Both Java and C# strings aren't supposed to be fast on appending, that's...
- Steven Schveighoffer (11/43) Aug 23 2008 I see nothing wrong with having a library solution to this. D and
- dsimcha (13/23) Aug 23 2008 You do have a point there. Builtin arrays work well for the typical cas...
- Adam D. Ruppe (7/8) Aug 23 2008 I wouldn't rely on it. Those are just interfaces to the platform's nativ...
- Steven Schveighoffer (4/10) Aug 23 2008 Although the spec may not say it, any modern OS has a thread safe malloc...
- bearophile (7/12) Aug 23 2008 Probably a library solution makes you use even more memory (when you wan...
- Bill Baxter (8/16) Aug 23 2008 Personally I'd argue that appending efficiently is important to more
- Steven Schveighoffer (9/28) Aug 23 2008 Making all arrays (even ones that you create temporarily and never appen...
- bearophile (7/15) Aug 23 2008 In the Python world they solve this problem taking a look at real code. ...
- dsimcha (11/17) Aug 23 2008 Definitely a reasonable argument, but just to play devil's advocate, let...
- Robert Jacques (20/26) Aug 24 2008 I've written some micro benchmarks to test out the difference between an...
- bearophile (7/10) Aug 24 2008 You may want to address some of the things superdan has said.
- Robert Jacques (31/38) Aug 24 2008 Yes, stride is a key enabler for integrating 2D/3D/nD matrices with
- Benji Smith (20/38) Aug 25 2008 That's very convincing. Seems like stride ought to be in there.
- superdan (2/35) Aug 25 2008 ok, i'll bite. what exactly are you so desperately asking for?
- Benji Smith (4/7) Aug 25 2008 For strings to be implemented as objects, rather than as arrays of
- superdan (3/12) Aug 25 2008 ok i assume u mean class objects. because current arrays are objects. on...
- Benji Smith (3/16) Aug 25 2008 I'll start a new thread, so as not to hijack this one further.
- Robert Jacques (12/26) Aug 25 2008 This only works if strings are objects as there's an aliasing issue, so ...
- bearophile (4/6) Aug 25 2008 Well, D 2.x strings (and their slices) are immutable, so there is no pro...
- Denis Koroskin (5/17) Aug 25 2008 I don't follow you. Immutability of the slice implies that the whole arr...
- bearophile (12/15) Aug 25 2008 Well, only if immutability is transitive :-)
- bearophile (6/9) Aug 25 2008 Rectangular 2D arrays enjoy a stride as 4th field, while (usually 1D) st...
- Robert Jacques (16/27) Aug 25 2008 I was refering to 1D arrays with strides. For 2D/3D/ND arrays you need a...
- Robert Jacques (7/34) Aug 25 2008 Using these optimizations with const could be done with either a bit-fla...
- JAnderson (5/43) Aug 25 2008 But please don't come up with something like C#'s strings where you have...
- JAnderson (5/24) Aug 23 2008 Another solution would be to cache the capacity on the stack for the
- Lionello Lunesu (17/21) Aug 24 2008 If the others are right, D's probably using std.gc.capacity to constantl...
- Lionello Lunesu (22/22) Aug 24 2008 Nevermind. None of that explained why bearophile's program is so slow. T...
- bearophile (41/42) Aug 25 2008 Yes, there's something that deserves some improvements, probably at vari...
- Robert Jacques (3/9) Aug 25 2008 Such as in this case: http://en.wikipedia.org/wiki/D_language#Example_2
- bearophile (5/6) Aug 25 2008 For the people interested, here are few slides written by the good Hetti...
- JAnderson (3/30) Aug 25 2008 Great stuff!
- Walter Bright (63/63) Aug 26 2008 The bugzilla report on this suggests replacing []=[] with memcpy. But
- Lionello Lunesu (76/76) Aug 27 2008 The problem is that the timing for the small arrays cannot be trusted as...
- Lionello Lunesu (3/4) Aug 27 2008 D'oh! I should have let Don guess it!
- bearophile (151/152) Aug 27 2008 I have modified your code a little:
- Lionello Lunesu (6/6) Aug 27 2008 --snip--
- bearophile (23/27) Aug 27 2008 From more benchmarks I have seen that if you don't use ASM (with prefetc...
- Sean Kelly (17/44) Aug 27 2008 Why not just change that to this instead:
- bearophile (5/20) Aug 27 2008 I have benchmarked many alternative versions, your one too, but mine is ...
- Fawzi Mohamed (2/27) Aug 27 2008
- Fawzi Mohamed (16/46) Aug 27 2008 so
- bearophile (107/110) Aug 27 2008 ...
- Walter Bright (6/10) Aug 28 2008 I suspect that copying the same data over and over again is not
- Walter Bright (3/3) Aug 28 2008 Essentially, I'm having a real hard time believing that for the 4 byte
- Lionello Lunesu (12/15) Aug 28 2008 I agree with you, but I've noticed the byte[] = byte[] emits "rep movsb"...
- Walter Bright (5/21) Aug 28 2008 But for 4 bytes, how much slower could it be than memcpy, which executes...
- Lionello Lunesu (14/16) Aug 29 2008 I don't know.
- Don (16/28) Aug 28 2008 Using cpuid only once doesn't work. The rationale ultimately comes from
- Walter Bright (2/6) Aug 28 2008 Ah, that makes sense now. Thanks!
I have performed another small benchmark of the dynamic array append, similar to one I have performed in the past: a comparison between the push_back of C++ and the ~= of D, both performed after a "reserve". To perform the "reserve" in D I have used std.gc.malloc instead of the simpler:

    auto a = new int[n];
    a.length = 0;

to actually speed up the D code a little, avoiding a (fast but useless) memory cleaning.

I have used MinGW 4.2.1 and DMD v1.034, compiling the C++ with -O3 -s and the D with -O -release -inline.

The following results are with n = 100_000_000 (about 400 MB RAM used):

    D:   7.94 s
    C++: 0.60 s  (13.2X)

Note that if n is 10 or even 100 times smaller the situation doesn't change much. With n = 10_000_000:

    D:   0.83 s
    C++: 0.09 s

Note2: in the D code the final deallocation of the array is fast enough, and its allocation is fast too (you can see this by timing just the for loop in the middle of the program). The really slow part seems to be the loop itself.

Note3: if a reserve isn't used (that is, in both programs the arrays aren't pre-allocated) the difference in performance becomes smaller, about 8.6X with n = 100_000_000.

This is the asm generated by g++: http://codepad.org/hXNDH0se

Bye,
bearophile

--------------------------- C++ code:

#include <stdio.h>
#include <stdlib.h>  // for atoi
#include <vector>
using namespace std;

int main(int argc, char **argv) {
    int n = argc >= 2 ? atoi(argv[1]) : 1000;
    vector<int> a;
    a.reserve(n);
    for (int i = 0; i < n; ++i)
        a.push_back(i);
    printf("%d\n", a[n-1]);
    return 0;
}

--------------------------- D code:

import std.conv: toInt;
import std.gc: malloc, hasNoPointers;

int main(string[] args) {
    int n = args.length >= 2 ? toInt(args[1]) : 1000;
    int* aux_ptr = cast(int*) malloc(n * int.sizeof);
    hasNoPointers(aux_ptr);
    int[] a = aux_ptr[0 .. 0];
    for (int i; i < n; i++)
        a ~= i;
    printf("%d\n", a[n - 1]);
    return 0;
}
Aug 22 2008
bearophile wrote:
> I have performed another small benchmark of the dynamic array append [...]
> The following results are with n = 100_000_000 (about 400 MB RAM used):
>     D:   7.94 s
>     C++: 0.60 s  (13.2X)
> Note that if n is 10 or even 100 times smaller the situation doesn't change much.

Unfortunately, this is expected. On every append the runtime checks to see if there is enough room in the buffer for the new data, which requires a call to gc.capacity(). So best case you're paying for a bunch of math, and worst case you have to acquire the GC mutex as well.

That said, this isn't really an entirely fair comparison, since you're comparing the performance of a structured, local user-level object vs. a minimal, language-level feature. I know that D arrays are supposed to be roughly comparable to std::vector or std::string, and they are from a semantics perspective, but when the only local data they contain is a pointer and a length there's little that can be done to optimize certain behavior. And you've just happened to pinpoint the greatest performance problem with D arrays :-)

Sean
Aug 22 2008
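[In code, the append path Sean describes looks roughly like this. This is a sketch only: gcCapacity and gcMalloc stand in for the real runtime internals, and the growth policy shown is illustrative.]

    // What `a ~= elem` has to do today, per Sean's description.
    void append(T)(ref T[] a, T elem) {
        size_t needed = (a.length + 1) * T.sizeof;
        if (needed > gcCapacity(a.ptr)) {          // checked on *every* append; may lock
            T* p = cast(T*) gcMalloc(needed * 2);  // grow and copy
            p[0 .. a.length] = a[0 .. a.length];
            a = p[0 .. a.length];
        }
        a = a.ptr[0 .. a.length + 1];              // extend in place
        a[a.length - 1] = elem;
    }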
I just ran into this yesterday porting C++ code that used push_back() heavily to D. Ended up rewriting the inner loops to pre-allocate. The D code is now actually faster than the C++ code, but the inner loops are less readable.

For a whopping overhead of 4 extra bytes per array (on 32-bit), why not include the capacity field in the array itself? I think the problem here is the need to call gc.capacity(), which can require a lock, as Sean said. If you have the capacity field embedded in the array itself, you never need a lock for ~= unless the array actually needs to be realloced or is appended to by more than one thread. I've written classes before that need fast appending and therefore cache the capacity value every time the array is reallocated, and it makes it a ton faster even in single threaded code.

On the other hand, this breaks the ABI for D arrays, but D2 breaks a lot of backwards compatibility anyhow, so maybe this is still a good idea. Does anyone actually write C functions that rely on this ABI and then call them from D? Furthermore, having a capacity field would also allow reserve for builtin arrays.

Other than the 4 bytes of extra overhead, does anyone see any downside to including a capacity field? Also, is there some situation I'm not thinking of where the 4 bytes actually matter more than the speed gains in question?
Aug 22 2008
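[To make the proposal concrete, here is a minimal sketch of such a fat array; the layout, growth factor and names are illustrative, not a proposed ABI. The point is that ~= only touches the GC when the cached capacity is exhausted.]

    import std.gc;

    struct FatArray(T) {
        T* ptr;
        size_t length;
        size_t capacity;   // in elements; 0 for slices, forcing a realloc

        void opCatAssign(T elem) {     // the ~= operator
            if (length == capacity) {  // only this path needs the GC
                capacity = capacity ? capacity * 2 : 16;
                T* p = cast(T*) std.gc.malloc(capacity * T.sizeof);
                p[0 .. length] = ptr[0 .. length];
                ptr = p;
            }
            ptr[length++] = elem;
        }

        FatArray opSlice(size_t lo, size_t hi) {
            FatArray s;
            s.ptr = ptr + lo;
            s.length = hi - lo;
            s.capacity = 0;            // appending to a slice always reallocates
            return s;
        }
    }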
First of all thank you very much Sean Kelly for your clear explanation of the situation.

dsimcha:
> Other than the 4 bytes of extra overhead, does anyone see any downside to including a capacity field? Also, is there some situation I'm not thinking of where the 4 bytes actually matter more than the speed gains in question?

In 2D/nD arrays you repeat such size_t many times, so it's not just 4/8 bytes. (This particular problem can be solved with rectangular/equal-length dynamic arrays, but it may be too much for a built-in type...)

Are dynamic array appends common? I think they may be used often enough to justify such memory increase. I think an experimental approach here may be the best way to answer such a question. I mean creating a branch version of DMD where dynamic arrays have a 3-item-long defining struct (12/24 bytes) and using it to benchmark some real programs or benchmarks, to see how the performance in memory/speed of the compiler changes. How much time/work does it require to create a DMD with such a change?

Note that a similar branched DMD can be useful to test the performance of another change Walter was talking about regarding array structs: changing the len-ptr struct of the arrays to a startptr-stopptr struct.

Bye,
bearophile
Aug 22 2008
On Fri, 22 Aug 2008 22:23:04 -0400, bearophile <bearophileHUGS lycos.com> wrote:
> Note that a similar branched DMD can be useful to test the performance of another change Walter was talking about regarding array structs: changing the len-ptr struct of the arrays to a startptr-stopptr struct.

On a similar note, there's also fast strided arrays. (See my blog:
http://dobbscodetalk.com/index.php?option=com_myblog&show=Making-strided-arrays-as-fast-as-native-arrays.html&Itemid=29 )
Aug 22 2008
dsimcha Wrote:
> For a whopping overhead of 4 extra bytes per array (on 32-bit), why not include the capacity field in the array itself? [...] Other than the 4 bytes of extra overhead, does anyone see any downside to including a capacity field? Also, is there some situation I'm not thinking of where the 4 bytes actually matter more than the speed gains in question?

For starters, you need 8 bytes on 64-bit systems. I suspect a more important reason might be making slice operations more difficult to implement correctly. That said, I think I still favor it, since I find I need to do iterative appends to arrays fairly often. Otherwise, it noticeably restricts the places I can use built-in arrays.

An alternative might be a convenient simple library wrapper that provides these kinds of dynamic operations, while still allowing easy access to the underlying array. I'm thinking perhaps std.array might include an appender object that you create, append to, and then grab the underlying array from, once you're done using it. It's not hard to write, but I'm sure it would be a welcome addition to the library, even if core arrays don't give us this.

BTW, std.array seems to be missing documentation.
Aug 22 2008
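[Something like the following is all the appender Jerry describes would need. A sketch: the name `Appender` and its API are hypothetical here.]

    struct Appender(T) {
        private T[] buf;     // the full allocation
        private size_t len;  // elements actually in use

        void put(T elem) {
            if (len == buf.length)
                buf.length = buf.length ? buf.length * 2 : 16;  // amortized growth
            buf[len++] = elem;
        }

        T[] data() { return buf[0 .. len]; }  // grab the underlying array
    }

    // usage:
    //     Appender!(int) app;
    //     for (int i = 0; i < n; i++)
    //         app.put(i);
    //     int[] a = app.data();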
Jerry Quinn:
> I suspect a more important reason might be making slice operations more difficult to implement correctly.

Can you explain why?

Bye,
bearophile
Aug 22 2008
Jerry Quinn Wrote:
> For starters, you need 8 bytes on 64-bit systems. I suspect a more important reason might be making slice operations more difficult to implement correctly. That said, I think I still favor it, since I find I need to do iterative appends to arrays fairly often. Otherwise, it noticeably restricts the places I can use built-in arrays.

sorry to always be the one who puts the moose on the table. y'all who want to make capacity a field in the array are off base. today each slice has length and data. why. because all a slice is *is* length and data. it's a view of a portion of memory. the slice don't care where memory came from. capacity is part of the memory that slices come from. it would be shared by several slices. and all of a sudden you realize capacity can't be a field in the slice. can't be a field just before the actual data in the slice either. for obvious reasons.

> An alternative might be a convenient simple library wrapper that provides these kinds of dynamic operations, while still allowing easy access to the underlying array. I'm thinking perhaps std.array might include an appender object that you create, append to, and then grab the underlying array from, once you're done using it.

that makes more sense. a sort of in-memory stream thing.
Aug 22 2008
== Quote from superdan (super dan.org)'s article
> sorry to always be the one who puts the moose on the table. y'all who want to make capacity a field in the array are off base. today each slice has length and data. why. because all a slice is *is* length and data. it's a view of a portion of memory. the slice don't care where memory came from. capacity is part of the memory that slices come from. it would be shared by several slices. and all of a sudden you realize capacity can't be a field in the slice.

A capacity field would work fine with slicing if, whenever you take a slice into an array, the capacity field of the slice is set to zero implicitly, even though the length is longer than zero. This is in effect how it works now. Try calling std.gc.capacity() on a pointer to a slice. It will always return 0 b/c a slice doesn't point to the start of a block. This means that if you try to increase the size of a slice it will always be reallocated. Simply setting the capacity field of slices to 0 when they are created would replicate this in arrays w/ a capacity field.
Aug 22 2008
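[The behavior dsimcha describes is easy to check, assuming Phobos' std.gc.capacity works as stated in this thread:]

    import std.gc;

    void main() {
        int[] a = new int[100];
        int[] s = a[1 .. 50];   // interior pointer, not a block base
        assert(std.gc.capacity(a.ptr) >= 100 * int.sizeof);
        assert(std.gc.capacity(s.ptr) == 0);  // so appending to s must reallocate
    }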
== Quote from dsimcha (dsimcha yahoo.com)'s article
> A capacity field would work fine with slicing if, whenever you take a slice into an array, the capacity field of the slice is set to zero implicitly, even though the length is longer than zero. This is in effect how it works now. Try calling std.gc.capacity() on a pointer to a slice. It will always return 0 b/c a slice doesn't point to the start of a block. [...]

The Tango runtime is a bit different in that it checks to see if the array is a slice explicitly--it seems safer than relying on the GC to always return a zero capacity for an interior pointer.

Sean
Aug 23 2008
== Quote from superdan (super dan.org)'s article
> today each slice has length and data. why. because all a slice is *is* length and data. it's a view of a portion of memory. the slice don't care where memory came from. capacity is part of the memory that slices come from. it would be shared by several slices. and all of a sudden you realize capacity can't be a field in the slice.

I'm not arguing in favor of a capacity field, but it could be a field in the slice as long as it's the same value as the length. For example, it's legal to append to a slice in D right now, but doing so always reallocates because resizing a slice in place would be bad.

Sean
Aug 23 2008
"Sean Kelly" wrote== Quote from superdan (super dan.org)'s articleUnless the slice is at the beginning of an allocated block... -SteveJerry Quinn Wrote:making slice operations more difficult to implement correctly.dsimcha Wrote:I just ran into this yesterday porting C++ code that used push_back() heavily to D. Ended up rewriting the inner loops to pre-allocate. The D code is now actually faster than the C++ code, but the inner loops are less readable. For a whopping overhead of 4 extra bytes per array (on 32-bit), why not include the Other than the 4 bytes of extra overhead, does anyone see any downside to including a capacity field? Also, is there some situation I'm not thinking of where the 4 bytes actually matter more than the speed gains in question?For starters, you need 8 bytes on 64 bit systems. I suspect a more important reason might beOtherwise, it noticably restricts the places I can use built-in arrays.That said, I think I still favor it, since I find I need to do iterative appends to arrays fairly often.sorry to always be the one who puts the moose on the table. y'all who want to make capacity a field inthe array are off base.today each slice has length and data. why. because all a slice is *is* length and data. it's a view of aportion of memory. the slice don't care where memory came from. capacity is part of the memory that slices come from. it would be shared by several slices. and all of a sudden you realize capacity can't be a field in the slice. I'm not arguing in favor of a capacity field, but it could be a field in the slice as long as it's the same value as the length. For example, it's legal to append to a slice in D right now, but doing so always reallocates because resizing a slice in place would be bad.
Aug 23 2008
Steven Schveighoffer wrote:
> Unless the slice is at the beginning of an allocated block...

True enough :-) I suppose this is one argument in favor of a capacity field.

Sean
Aug 25 2008
Sean Kelly Wrote:
> I'm not arguing in favor of a capacity field, but it could be a field in the slice as long as it's the same value as the length. For example, it's legal to append to a slice in D right now, but doing so always reallocates because resizing a slice in place would be bad.

then we have a field hanging in there that's only valid sometimes. that won't quite work. will it be copyable? like when you copy a slice to another. what happens with the capacity field?

at any rate. i find it amused that aside from u nobody even missed a beat on the discussion on adding the capacity. this is pulp fiction alright.

a: "how about them capacity fields. what if we add them."
b: "would be cool."
c: "would suck goat balls."
d: "capacity don't make sense because this and that and the other."
a, b, c: "yeah, but we're still contemplating the ifs."

wake up people. there is first the problem capacity can't be implemented. u gotta figure out some approximation first. then discuss the gorram ifs.
Aug 24 2008
superdan wrote:
> wake up people. there is first the problem capacity can't be implemented. u gotta figure out some approximation first. then discuss the gorram ifs.

Capacity could be implemented by preallocating extra space at the end of an array, but aside from that, you're right; the garbage collector could use memory between the end of the array and the end of its capacity after the array was created, and there's no way you could update all the references to that array and change the stored capacity.

As for slices, the capacity of a slice would be equal to its length, so that isn't an issue.
Aug 24 2008
superdan wrote:
> then we have a field hanging in there that's only valid sometimes. that won't quite work. will it be copyable? like when you copy a slice to another. what happens with the capacity field?

That's the weird bit. The capacity of any copy of a slice or an array should be the length of the copy, so copying couldn't be done via memcpy on the array ref. More simply, you could probably just default the capacity to zero, but I don't like the idea of violating the logical invariant for array refs just to make copying / initialization easier. And the weirdness of this suggests to me (and to you as well I presume) that perhaps adding a capacity field isn't proper for built-in arrays.

Sean
Aug 25 2008
On Mon, 25 Aug 2008 22:25:42 +0400, Sean Kelly <sean invisibleduck.org> wrote:
> And the weirdness of this suggests to me (and to you as well I presume) that perhaps adding a capacity field isn't proper for built-in arrays.

Both Java and C# strings aren't supposed to be fast on appending; that's why they both have StringBuilders (which have a capacity field). The same can be applied to other array types: a library ArrayBuilder class is a good solution, I think.
Aug 25 2008
"dsimcha" wroteI just ran into this yesterday porting C++ code that used push_back() heavily to D. Ended up rewriting the inner loops to pre-allocate. The D code is now actually faster than the C++ code, but the inner loops are less readable. For a whopping overhead of 4 extra bytes per array (on 32-bit), why not include the capacity field in the array itself? I think the problem here is the need to call gc.capacity(), which can require a lock, as Sean said. If you have the capacity field embedded in the array itself, you never need a lock for ~= unless the array actually needs to be realloced or is appended to by more than one thread. I've written classes before that need fast appending and therefore cache the capacity value every time the array is reallocated, and it makes it a ton faster even in single threaded code. On the other hand, this breaks the ABI for D arrays, but D2 breaks a lot of backwards compatibility anyhow, so maybe this is still a good idea. Does anyone actually write C functions that rely on this ABI and then call them from D? Furthermore, having a capacity field would also allow reserve for builtin arrays. Other than the 4 bytes of extra overhead, does anyone see any downside to including a capacity field? Also, is there some situation I'm not thinking of where the 4 bytes actually matter more than the speed gains in question?I see nothing wrong with having a library solution to this. D and especially D2 is to the point where syntax of custom structs is really easy to use. The problem is that many times I don't append to an array or slice, why should I have to accept the cost of an extra 4-8 byte field for every slice and array that isn't going to change size (which I would argue is used more often)? The only thing missing right now is opImplicitCast (to go back and forth from a normal array). -Steve
Aug 23 2008
== Quote from Steven Schveighoffer (schveiguy yahoo.com)'s article
> I see nothing wrong with having a library solution to this. [...] The problem is that many times I don't append to an array or slice, so why should I have to accept the cost of an extra 4-8 byte field for every slice and array that isn't going to change size (which I would argue is used more often)?

You do have a point there. Builtin arrays work well for the typical cases, but no one array type could work well for every conceivable specialized case, especially if you count the case of needing minimal overhead as one possible specialized case. I've wanted to write my own specialized array classes to deal with these unusual cases for a while, but the lack of opImplicitCast is the only thing that prevents me from doing so, since I would want them to be able to play nice with builtin arrays.

A little off topic, but somewhat relevant: I wrote a scope array class for when you need to allocate a large temporary array for something like a mathematical computation or a merge sort, without constantly triggering GC or running into false pointer issues. It bypasses the GC, calls C's malloc, realloc and free, and, since it's a scope class, the memory is freed by the dtor. Does anyone know if DMD's std.c.stdlib.malloc/free/realloc are threadsafe?
Aug 23 2008
On Sat, Aug 23, 2008 at 05:50:08PM +0000, dsimcha wrote:
> Does anyone know if DMD's std.c.stdlib.malloc/free/realloc are threadsafe?

I wouldn't rely on it. Those are just interfaces to the platform's native C library, which may or may not be thread safe - the C spec doesn't say anything about threads at all (if I remember correctly).

--
Adam D. Ruppe
http://arsdnet.net
Aug 23 2008
"Adam D. Ruppe" wroteOn Sat, Aug 23, 2008 at 05:50:08PM +0000, dsimcha wrote:Although the spec may not say it, any modern OS has a thread safe malloc. You should have no problems. -SteveDoes anyone know if DMD's std.c.stdlib.malloc/free/realloc are threadsafe?I wouldn't rely on it. Those are just interfaces to the platform's native C library, which may or may not be thread safe - the C spec doesn't say anything about threads at all (if I remember correctly).
Aug 23 2008
Steven Schveighoffer:
> I see nothing wrong with having a library solution to this. [...] The problem is that many times I don't append to an array or slice, so why should I have to accept the cost of an extra 4-8 byte field for every slice and array that isn't going to change size (which I would argue is used more often)?

Probably a library solution makes you use even more memory (when you want to use it), and makes dynamic arrays less handy to use in D (but it may be better than nothing). Array append is a rather common operation in higher-level programs; for example in Python it's one of the most common operations, while the up-front creation of empty arrays is quite uncommon. I don't know how common push_back is in C++ programs. (In successive benchmarks I have seen that array append of integers with Psyco is about as fast as appending integers in D. A difference is that in Python integers are objects, around 20 bytes long.)

My idea was to do this change in a D branch, and see how programs like yours behave in the modified version: if your code at runtime has to use 100-200 bytes more than before, then I think this price may be accepted on modern PCs. On the other hand a large number of such size_t fields may be wasted when you manage tons of short strings, or when you manage many smallish nD arrays.

That problem with nD arrays can be solved with a library that manages rectangular nD arrays, avoiding the waste of space for the lengths of each row/etc. This means that another solution is to do the opposite thing: instead of adding library support for efficiently extensible arrays, such support may be for dynamically allocated fixed-size arrays... :-)

Bye,
bearophile
Aug 23 2008
On Sun, Aug 24, 2008 at 2:16 AM, Steven Schveighoffer <schveiguy yahoo.com> wrote:
> The problem is that many times I don't append to an array or slice, so why should I have to accept the cost of an extra 4-8 byte field for every slice and array that isn't going to change size (which I would argue is used more often)?

Personally I'd argue that appending efficiently is important to more programs than saving 4-8 bytes here and there. So that would suggest that lean-and-mean fixed-sized arrays should be the library solution, and arrays with a capacity field should be the built-in type.

--bb
Aug 23 2008
"Bill Baxter" wroteOn Sun, Aug 24, 2008 at 2:16 AM, Steven Schveighoffer <schveiguy yahoo.com> wrote:Making all arrays (even ones that you create temporarily and never append to) have extra data to copy will add up performance-wise. I think there are only a few spots in any of my code that I use the append operator. So 'important' it is, but common, I would argue it is not. Of course, I only present anectodal evidence :) There's also the advantage of backwards compatibility, and not having to change the D spec. -Steve"dsimcha" wroteI see nothing wrong with having a library solution to this. D and especially D2 is to the point where syntax of custom structs is really easy to use. The problem is that many times I don't append to an array or slice, why should I have to accept the cost of an extra 4-8 byte field for every slice and array that isn't going to change size (which I would argue is used more often)?Personally I'd argue that appending efficiently is important to more programs than than saving 4-8 bytes here and there. So that would suggest that lean-and-mean fixed-sized arrays should be the library solution, and arrays with a capacity field should be the built-in type.
Aug 23 2008
Steven Schveighoffer:
> Making all arrays (even ones that you create temporarily and never append to) have extra data to copy will add up performance-wise.

Benchmarks of new features help estimate such performance tradeoffs in real programs.

> I think there are only a few spots in any of my code where I use the append operator. So 'important' it is, but common, I would argue it is not. Of course, I only present anecdotal evidence :)

In the Python world they solve this problem by taking a look at real code. So some projects in the dsource repository can be mined to estimate how often appending is used in real programs. Note that a standard library is meant to be very efficient, because people use it as a source of building blocks to write normal programs, so I presume Tango/Phobos will have fewer array appends than average user-level code, so they may be excluded from such a database.

> There's also the advantage of backwards compatibility, and not having to change the D spec.

It's probably meant as a change for D 2 only.

Bye,
bearophile
Aug 23 2008
== Quote from Bill Baxter (wbaxter gmail.com)'s article
> Personally I'd argue that appending efficiently is important to more programs than saving 4-8 bytes here and there. So that would suggest that lean-and-mean fixed-sized arrays should be the library solution, and arrays with a capacity field should be the built-in type.

Definitely a reasonable argument, but just to play devil's advocate, let's assume that opImplicitCast is coming eventually. Most of the library implementations of arrays will presumably be implicitly castable to builtins, and library functions, especially standard lib functions, will probably use the builtin type as arguments. Implicit casting to builtin, in my view, is essential to using library implementations of arrays, because otherwise you have a mess of a zillion different incompatible array library implementations a la C++. If you have to create a dummy capacity field every time you implicitly cast your array to a builtin anyhow, that might defeat the purpose of a lightweight library implementation at times.
Aug 23 2008
On Sat, 23 Aug 2008 13:16:26 -0400, Steven Schveighoffer <schveiguy yahoo.com> wrote:
> The problem is that many times I don't append to an array or slice, so why should I have to accept the cost of an extra 4-8 byte field for every slice and array that isn't going to change size (which I would argue is used more often)?

I've written some micro benchmarks to test out the difference between an 8-byte struct (i.e. the current arrays) and a 16-byte struct (ptr, length, capacity, stride) with respect to function call performance. I wrote 2 benchmarks: 1) an actual function call (which calculated the exponential) and 2) an optimized struct copy (i.e. simulating pass by value).

In release -o or debug mode, (1) had performance decreases of -5.1%, 9.8% and 1.8% for passing 1, 2 and 3 structs respectively. When -inline is enabled this changes to 0.5%, 2.5% and 5% respectively.

For the (2) benchmark, I looked at using MMX and SSE2 instructions to accelerate the struct copy. Using SSE2 instructions provided a 15% performance gain using unaligned structs and an 84% gain using aligned structs. 16-byte SSE2 copies beat single 8-byte MMX copies by 9% (both aligned), though dual MMX (i.e. a 16-byte copy) still beat the built-in copy by 61% (again aligned). (All in -release -o -inline, on a Core2 CPU.)

So the performance cost, even for very small workloads, looks to be pretty minor, and with some work performance could actually increase. (I'm assuming the use of SSE/MMX/AltiVec for memcopy acceleration would be enabled by a compiler flag on 32-bit.)
Aug 24 2008
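[A sketch of benchmark (1), for anyone who wants to reproduce it. The struct layouts mirror the ones Robert names; exp() is the dummy workload, and the exact numbers will of course differ by machine and compiler flags.]

    import std.date, std.math;

    struct Slice8  { void* ptr; size_t len; }                            // today's array
    struct Slice16 { void* ptr; size_t len; size_t cap; size_t stride; } // proposed

    double work8(Slice8 a)   { return exp(cast(double) a.len); }
    double work16(Slice16 a) { return exp(cast(double) a.len); }

    void main() {
        const int N = 10_000_000;
        Slice8 a8;
        Slice16 a16;
        double sink = 0;  // keeps the loops from being optimized away
        d_time t0 = getUTCtime();
        for (int i = 0; i < N; i++) sink += work8(a8);
        d_time t1 = getUTCtime();
        for (int i = 0; i < N; i++) sink += work16(a16);
        d_time t2 = getUTCtime();
        printf("8-byte: %lld ms  16-byte: %lld ms  (%f)\n",
               t1 - t0, t2 - t1, sink);
    }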
Robert Jacques:
> I've written some micro benchmarks to test out the difference between an 8-byte struct (i.e. the current arrays) and a 16-byte struct (ptr, length, capacity, stride)

You may want to address some of the things superdan has said.

And, why the stride? I presume it's useful if you want a 2D matrix. It may be useful, but maybe it's overkill, so you may want to benchmark just (ptr, length, capacity) too.

Another useful benchmark is to test a program that manages tons of short strings too, like this one:
http://en.wikipedia.org/wiki/D_language#Example_2

Bye,
bearophile
Aug 24 2008
On Sun, 24 Aug 2008 22:14:36 -0400, bearophile <bearophileHUGS lycos.com> wrote:
> And, why the stride? I presume it's useful if you want a 2D matrix.

Yes, stride is a key enabler for integrating 2D/3D/nD matrices with built-in arrays. Other pros:

- Fast O(1) reversing of an array. Looping through an array in reverse was a common enough operation to create an additional keyword and language feature (foreach_reverse) specifically for it.
- Slicing out a struct element, i.e. slice the red channel out of an image or the first name from an Employee array.
- Accessing every Nth element. This is a common MATLAB operation and often used in visualization (i.e. a quick and simple resolution adjustment for a line plot). Also, some parallel algorithms benefit from interleaving data. This is a fairly standard multi-threading technique, particularly when processing data[x] may take longer than processing data[y] (though it's not the first thing I'd try).
- And lastly, SSE works on 16-byte structs, so why not use the extra 4 bytes?

> It may be useful, but maybe it's overkill, so you may want to benchmark just (ptr, length, capacity) too.

For a 12-byte struct, either {ptr, length, capacity} or {ptr, length, stride}, the performance decrease for -release -o -inline was -0.7%, 0.3% and 0.2% for 1, 2 and 3 arrays respectively, and -0%, 8% and 0.8% respectively for debug.

> Another useful benchmark is to test a program that manages tons of short strings too, like this one: http://en.wikipedia.org/wiki/D_language#Example_2

Okay, looking just at the main loop (writefln is really slow) the performance cost of the extra 4 bytes for a capacity field is about 1.7% (naive cost, i.e. no SSE, etc). I didn't actually implement a capacity field, so this is a worst case.

> You may want to address some of the things superdan has said.

Superdan raised the good point that if the GC is aware of array length at creation, it might compact the capacity to allocate something else. To the best of my knowledge (which isn't saying much) both the Phobos and Tango GCs are block based, and so are unaware of the array length. Also, Sean Kelly wrote the Tango GC and isn't raising any red flags, which is probably a good sign it's a valid optimization.
Aug 24 2008
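[A sketch of what those operations look like with a stride field — a hand-rolled 1D view; strides are in elements for clarity, where an actual array struct would likely use bytes:]

    struct Strided(T) {
        T* ptr;
        size_t length;
        ptrdiff_t stride;   // in elements

        T opIndex(size_t i) { return ptr[cast(ptrdiff_t) i * stride]; }

        // O(1) reverse: point at the last element and negate the stride
        Strided reversed() {
            Strided r;
            r.ptr = ptr + cast(ptrdiff_t)(length - 1) * stride;
            r.length = length;
            r.stride = -stride;
            return r;
        }

        // every nth element, MATLAB-style
        Strided every(size_t n) {
            Strided r;
            r.ptr = ptr;
            r.length = (length + n - 1) / n;
            r.stride = stride * cast(ptrdiff_t) n;
            return r;
        }
    }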
Robert Jacques wrote:
> Yes, stride is a key enabler for integrating 2D/3D/nD matrices with built-in arrays. Other pros:
> - Fast O(1) reversing of an array. [...]
> - Slicing out a struct element, i.e. slice the red channel out of an image or the first name from an Employee array.
> - Accessing every Nth element. [...]

That's very convincing. Seems like stride ought to be in there.

But it also got me thinking about what other 4-byte value could be stuffed into that extra space (satisfying the 16-byte alignment requirement), and I can't help but yearn for hashcode memoization.

Strings are, arguably, the most common types of arrays in most programs. And strings are, arguably, the most common key type in associative arrays. And every time you perform an associative array lookup, you have to re-compute the hashcode (which is silly for immutable strings, since the hashcode can't ever change).

In languages where strings are objects (please!!! please, Walter!!!) you can do things like memoize the hashcode to avoid redundant future calculations. If strings are mutable, you can mark a "dirty" bit when the string is changed, lazily recalculating a memoized hashcode based on that dirty bit. But only if you have a field for memoizing the hashcode in the first place.

So, while a stride field seems very useful for some very interesting special cases, to me it seems like the more compelling general case is for a memoized hashcode.

--benji
Aug 25 2008
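[A sketch of the memoization Benji describes, for a hypothetical string class; typeid's getHash stands in for whatever hash function the runtime actually uses:]

    class MyString {
        private char[] data;
        private hash_t cachedHash;
        private bool dirty = true;

        this(char[] s) { data = s; }

        void opIndexAssign(char c, size_t i) {
            data[i] = c;
            dirty = true;        // mutation invalidates the cached hash
        }

        hash_t toHash() {
            if (dirty) {
                cachedHash = typeid(char[]).getHash(&data);
                dirty = false;
            }
            return cachedHash;   // repeated AA lookups now skip the rehash
        }
    }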
Benji Smith Wrote:
> In languages where strings are objects (please!!! please, Walter!!!) you can do things like memoize the hashcode to avoid redundant future calculations. [...]

ok, i'll bite. what exactly are you so desperately asking for?
Aug 25 2008
superdan wrote:
>> In languages where strings are objects (please!!! please, Walter!!!)
> ok, i'll bite. what exactly are you so desperately asking for?

For strings to be implemented as objects, rather than as arrays of characters.

--benji
Aug 25 2008
Benji Smith Wrote:
> For strings to be implemented as objects, rather than as arrays of characters.

ok i assume u mean class objects. because current arrays are objects. only not of class type. and what advantages would that have. let me tell u of one disadvantage fer starters. with builtin arrays of characters you can efficiently implement classes. with classes you can't efficiently implement arrays of characters.
Aug 25 2008
superdan wrote:
> ok i assume u mean class objects. because current arrays are objects. only not of class type. and what advantages would that have. [...]

I'll start a new thread, so as not to hijack this one further.

--benji
Aug 25 2008
On Mon, 25 Aug 2008 11:41:51 -0400, Benji Smith <dlanguage benjismith.net> wrote:
> I can't help but yearn for hashcode memoization. [...] If strings are mutable, you can mark a "dirty" bit when the string is changed, lazily recalculating a memoized hashcode based on that dirty bit. But only if you have a field for memoizing the hashcode in the first place.

This only works if strings are objects, as there's an aliasing issue, so it won't work for D arrays. Consider:

    string a = "foo";
    string b = a;
    int[string] counts;
    counts[b] = 1;
    a[0] = 'b';   // b's dirty bit isn't set
    counts[b]++;  // calls counts["foo"]++ and not counts["boo"]++

> So, while a stride field seems very useful for some very interesting special cases,

Well, I would argue that games, scientific computing, throughput computing and (lightweight) databases aren't special cases.
Aug 25 2008
Robert Jacques:
> This only works if strings are objects, as there's an aliasing issue, so it won't work for D arrays.

Well, D 2.x strings (and their slices) are immutable, so there is no problem. I agree that for mutable data structures (that can be modified from references) storing the hash value is dangerous.

A compromise may be to have mutable arrays but immutable slices, which may solve the problem you show. Slices are small, so having them immutable is probably not too bad.

Bye,
bearophile
Aug 25 2008
On Mon, 25 Aug 2008 21:51:33 +0400, bearophile <bearophileHUGS lycos.com> wrote:
> Well, D 2.x strings (and their slices) are immutable, so there is no problem. [...] A compromise may be to have mutable arrays but immutable slices, which may solve the problem you show.

I don't follow you. Immutability of the slice implies that the whole array is immutable, too. Besides, how often do you take a hash of an array other than a string?
Aug 25 2008
Denis Koroskin:
> I don't follow you.

After posting my message I realized that my expressed thoughts are wrong, I am sorry. The topic looks simple, but it is not for me. This newsgroup is useful to learn intellectual humility better.

> Immutability of the slice implies that the whole array is immutable, too.

Well, only if immutability is transitive :-)

Storing the hash value for a mutable data structure that can have references is dangerous; better to avoid doing it. But if a data structure is transitively immutable then storing its hash value may be positive if you use lots of hashing (the hash value can be computed lazily, only when required; a value of 0 may mean "not computed yet". Python uses -1 for such a purpose). Finding some compromise between those two extrema seems tricky to me.

D 2.0 strings are immutable, so storing their hash value may be positive. The presence of such a stored hash value may even be activated or not by a global compilation constant given to the compiler :o)

> Besides, how often do you take a hash of an array other than a string?

In D I have never done it, so I presume not often. (In Python I use the hash value of immutable arrays often, because they are more handy than D structs.)

Simple related note: the capacity of the array must not be used to compute its hash value.

Bye,
bearophile
Aug 25 2008
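[The lazy scheme bearophile mentions is tiny. A sketch: `cached` would live in the hypothetical 4th field of an immutable string.]

    hash_t lazyHash(ref hash_t cached, char[] s) {
        if (cached == 0) {                        // 0 means "not computed yet"
            cached = typeid(char[]).getHash(&s);
            if (cached == 0)
                cached = 1;                       // keep 0 reserved as the sentinel
        }
        return cached;
    }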
Benji Smith:
> So, while a stride field seems very useful for some very interesting special cases, to me it seems like the more compelling general case is for a memoized hashcode.

Rectangular 2D arrays enjoy a stride as 4th field, while (usually 1D) strings enjoy having a hash value there. The two cases are very distinct, with rare mixed cases. So an ugly idea: if the array is 1D the field can be the hash value, if it's 2D it can contain the stride ;-) (But if superdan is right, all this discussion may need to be re-built on a different basis.)

On the other hand, the capacity field isn't much useful for 2D arrays (because you usually append to 1D arrays), so you end up with just 3 fields for 2D arrays and 4 fields for 1D arrays :-) So, if you want to keep both structures of length 4 (assuming pointers and size_t have the same size) you can add a second stride, useful for 3D arrays too ;-)

Bye,
bearophile
Aug 25 2008
On Mon, 25 Aug 2008 12:51:30 -0400, bearophile <bearophileHUGS lycos.com> wrote:

Rectangular 2D arrays enjoy a stride as a 4th field, while (usually 1D) strings enjoy having a hash value there. The two cases are very distinct, with rare mixed cases. So an ugly idea: if the array is 1D the field can be the hash value, if it's 2D it can contain the stride ;-) (But if superdan is right, all this discussion may need to be rebuilt on a different basis.) On the other hand, the capacity field isn't much use for 2D arrays (because you usually append to 1D arrays), so you end up with just 3 fields for 2D arrays and 4 fields for 1D arrays :-) So, if you want to keep both structures at length 4 (assuming pointers and size_t have the same size) you can add a second stride, useful for 3D arrays too ;-)

I was referring to 1D arrays with strides. For 2D/3D/ND arrays you need a length and a stride per dimension in order to support array slices, interfacing to both C and/or Fortran code bases, etc. So an ND array looks like

{ T* ptr; size_t[N] lengths; ptrdiff_t[N] byte_strides; }

However, regarding your idea: as caching the hash makes sense for invariant arrays and capacity would mostly be used during string building, you could overload the field based on type. However, this means neither optimization works for const arrays, which might not be a good thing, as it encourages code bloat.
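To illustrate, element addressing for that layout is just a dot product of indices and byte strides (a sketch with made-up names, not library code):

// Hedged sketch of element lookup for the N-D layout above.
struct NDArray(T, int N) {
    T* ptr;
    size_t[N] lengths;
    ptrdiff_t[N] byte_strides;

    T* elem(size_t[] idx...) {
        assert(idx.length == N);
        ptrdiff_t off = 0;
        foreach (d, i; idx) // offset = sum of index * stride per dim
            off += cast(ptrdiff_t)i * byte_strides[d];
        return cast(T*)(cast(byte*)ptr + off);
    }
}

// e.g. a row-major 3x4 float matrix m: lengths = [3, 4],
// byte_strides = [16, 4]; *m.elem(1, 2) is row 1, column 2.

A negative stride in this scheme is what gives the O(1) reversed view mentioned elsewhere in the thread, with no data movement.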
Aug 25 2008
On Mon, 25 Aug 2008 18:45:30 -0400, Robert Jacques <sandford jhu.edu> wrote:

On Mon, 25 Aug 2008 12:51:30 -0400, bearophile <bearophileHUGS lycos.com> wrote:

Rectangular 2D arrays enjoy a stride as a 4th field, while (usually 1D) strings enjoy having a hash value there. The two cases are very distinct, with rare mixed cases. So an ugly idea: if the array is 1D the field can be the hash value, if it's 2D it can contain the stride ;-) (But if superdan is right, all this discussion may need to be rebuilt on a different basis.) On the other hand, the capacity field isn't much use for 2D arrays (because you usually append to 1D arrays), so you end up with just 3 fields for 2D arrays and 4 fields for 1D arrays :-) So, if you want to keep both structures at length 4 (assuming pointers and size_t have the same size) you can add a second stride, useful for 3D arrays too ;-)

I was referring to 1D arrays with strides. For 2D/3D/ND arrays you need a length and a stride per dimension in order to support array slices, interfacing to both C and/or Fortran code bases, etc. So an ND array looks like { T* ptr; size_t[N] lengths; ptrdiff_t[N] byte_strides; } However, regarding your idea: as caching the hash makes sense for invariant arrays and capacity would mostly be used during string building, you could overload the field based on type. However, this means neither optimization works for const arrays, which might not be a good thing, as it encourages code bloat.

Using these optimizations with const could be done with either a bit-flag (in length) or by exploiting the block nature of the GC, which leads to capacities with (at least 4, I think) zero low-order bits.

P.S. byte_strides should probably go above lengths, as it gives a better memory layout.
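A sketch of that bit trick (the 16-byte block granularity is the assumption from the post above, not a documented guarantee):

// Hedged sketch: fold a flag into the capacity field's low bits,
// which stay zero if GC capacities are at least 16-byte granular.
const size_t HASH_VALID = 1;

struct ArrayExtra {
    private size_t capAndFlags;

    size_t capacity()    { return capAndFlags & ~cast(size_t)15; }
    bool hashValid()     { return (capAndFlags & HASH_VALID) != 0; }
    void markHashValid() { capAndFlags |= HASH_VALID; }

    void setCapacity(size_t c) {
        assert((c & 15) == 0); // must not clobber the flag bits
        capAndFlags = c | (capAndFlags & 15);
    }
}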
Aug 25 2008
Benji Smith wrote:

Robert Jacques wrote:

On Sun, 24 Aug 2008 22:14:36 -0400, bearophile <bearophileHUGS lycos.com> wrote:

And, why the stride? I presume it's useful if you want a 2D matrix.

Yes, stride is a key enabler for integrating 2D/3D/nD matrices with built-in arrays. Other pros:
- Fast O(1) reversing of an array. Looping through an array in reverse was a common enough operation to create an additional keyword and language feature (foreach_reverse) specifically for it.
- Slicing out a struct element. i.e. slice the red channel out of an image or the first name from an Employee array.
- Accessing every Nth element. This is a common MATLAB operation and often used in visualization (i.e. a quick and simple resolution adjustment for a line plot). Also, some parallel algorithms benefit from interleaving data. This is a fairly standard multi-threading technique, particularly when processing data[x] may take longer than processing data[y] (though it's not the first thing I'd try).

That's very convincing. Seems like stride ought to be in there. But it also got me thinking about what other 4-byte value could be stuffed into that extra space (satisfying the 16-bit alignment requirement), and I can't help but yearn for hashcode memoization. Strings are, arguably, the most common types of arrays in most programs. And strings are, arguably, the most common key-type in associative arrays. And every time you perform an associative array lookup, you have to re-compute the hashcode (which is silly for immutable strings, since the hashcode can't ever change). In languages where Strings are objects (please!!! please, Walter!!!) you can do things like memoize the hashcode to avoid redundant future calculations. If Strings are mutable, you can mark a "dirty" bit when the string is changed, lazily recalculating a memoized hashcode based on that dirty bit. But only if you have a field for memoizing the hashcode in the first place.

...to use stringbuilder to get around the very slow string pooling. I think lazily evaluating it would be a good idea. -Joel
Aug 25 2008
dsimcha wrote:

I just ran into this yesterday porting C++ code that used push_back() heavily to D. Ended up rewriting the inner loops to pre-allocate. The D code is now actually faster than the C++ code, but the inner loops are less readable. For a whopping overhead of 4 extra bytes per array (on 32-bit), why not include the capacity field in the array itself? I think the problem here is the need to call gc.capacity(), which can require a lock, as Sean said. If you have the capacity field embedded in the array itself, you never need a lock for ~= unless the array actually needs to be realloced or is appended to by more than one thread. I've written classes before that need fast appending and therefore cache the capacity value every time the array is reallocated, and it makes it a ton faster even in single threaded code. On the other hand, this breaks the ABI for D arrays, but D2 breaks a lot of backwards compatibility anyhow, so maybe this is still a good idea. Does anyone actually write C functions that rely on this ABI and then call them from D? Furthermore, having a capacity field would also allow reserve for builtin arrays. Other than the 4 bytes of extra overhead, does anyone see any downside to including a capacity field? Also, is there some situation I'm not thinking of where the 4 bytes actually matter more than the speed gains in question?

Another solution would be to cache the capacity on the stack for the array, so it's only called once. That wouldn't be quite as fast but wouldn't break the ABI.

-Joel
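For illustration, the caching pattern dsimcha describes might look like this (a hypothetical helper using std.gc.malloc as elsewhere in this thread; not how the runtime's ~= actually works):

// Hedged sketch: an appender that caches its capacity so the hot
// path never calls into the GC (and so never takes its lock).
import std.gc: malloc, hasNoPointers;

struct Appender(T) {
    private T* p;
    private size_t len, cap; // cap is the cached capacity, in elements

    void put(T x) {
        if (len == cap)      // only the rare slow path touches the GC
            grow();
        p[len++] = x;
    }

    private void grow() {
        cap = cap ? cap * 2 : 16;
        auto np = cast(T*)malloc(cap * T.sizeof);
        hasNoPointers(np);   // only safe if T contains no pointers!
        np[0 .. len] = p[0 .. len];
        p = np;
    }

    T[] data() { return p[0 .. len]; }
}

Usage would be e.g. `Appender!(int) a; for (int i; i < n; i++) a.put(i);` followed by `a.data()` - the same shape as the push_back() loops being ported.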
Aug 23 2008
"bearophile" <bearophileHUGS lycos.com> wrote in message news:g8ni6o$12k2$1 digitalmars.com...Note2: In the D code the final deallocation of the array is fast enough, and its allocation is fast too (you can see this timing just the for loop in the middle of the program). The really slow part seems to be the loop itself.If the others are right, D's probably using std.gc.capacity to constantly check the capacity of the array. Interestingly, the Phobos GC caches the pointer/size in a call to "capacity", so that requesting the capacity of the same pointer immediately returns its capacity without any look-up. See function sizeOfNoSync, in internal/gc/gcx.d. Additionally, if there are no other threads, the GC lock is not taken, so capacity() should be O(1). It would be interesting to check whether the cached pointer/size in the GC is actually being used. If it is not, it would result in a call to findSize() which in turn calls findPool(). findPool() is O(n), n = number of pools. The pooltable is already sorted, so this can be made O(log n). findSize() then does a linear search to check the size of the alloced block, which is O(m), m = size of block. Result: getting the capacity of a pointer is O(m+n). There's a lot of room for improvement right there. L.
Aug 24 2008
Nevermind. None of that explained why bearophile's program is so slow. This does: int[] ~= int is using _d_arrayappendcT (from internal/gc/gc.d), which appends an int by appending its four bytes! Blasphemy! Now we have great super-duper-efficient array operations using SSEx but a simple append of an int to an int[] is using a "rep movs" to append 4 bytes. X-( Also, the append code first checks whether the array should be resized, and if not, jumps to the actual append code. The common case should not be branching: { if (too small) goto resize; append: ... return; resize: ... goto append; } The use of _d_arrayappendcT is hardcoded in the D compiler (e2ir.c) so I don't know how to create different code paths for different primitive (sizes) other than to add many ifs to _d_arrayappendcT. IIRC, TypeInfo contained a pointer to a method for fast array comparisons. Shouldn't a similar thing be added for array appends? BTW, the capacity look-up was indeed using the cached values. Meaning, it's not "capacity" we should be focussing on, but the general append code. However, if you were to append to two different arrays in a similar loop, the performance would be terrible, since each array would overwrite the cache with its own ptr/size, resulting in the slow O(m+n) for each capacity look-up. L.
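Fleshing out that skeleton, a per-type append with a fall-through fast path might look like this (a sketch only; capacityOf and growArray are assumed helpers standing in for the GC queries described above, and the real _d_arrayappendcT differs):

// Assumed helpers, not real runtime functions:
size_t capacityOf(void* p);             // bytes available at p
void growArray(T)(ref T[] a, size_t n); // realloc/copy to hold n elements

// Hedged sketch: the common case falls straight through, and the
// element-typed store means a 4-byte append is one mov, not "rep movs".
T[] append(T)(ref T[] a, T x) {
    size_t len = a.length;
    if ((len + 1) * T.sizeof > capacityOf(a.ptr))
        goto Lresize;
Lappend:
    a = a.ptr[0 .. len + 1]; // grow the slice in place
    a.ptr[len] = x;          // a single store for small T
    return a;
Lresize:
    growArray(a, len + 1);
    goto Lappend;
}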
Aug 24 2008
Lionello Lunesu:

However, if you were to append to two different arrays in a similar loop, the performance would be terrible, since each array would overwrite the cache with its own ptr/size, resulting in the slow O(m+n) for each capacity look-up.

Yes, there's something there that deserves some improvement, probably at various levels. I have modified the benchmarks in the way you have explained (see below): with n = 4 million the running time of the C++ version is barely measurable (0.04 s), while the D version takes 32.2 s (I have not had enough patience to see it finish with n = 10 million).

Bye, bearophile

------------------ C++:

#include "stdio.h"
#include "stdlib.h"
#include <vector>
using namespace std;

int main(int argc, char **argv) {
    int n = argc >= 2 ? atoi(argv[1]) : 1000;

    vector<int> a1;
    a1.reserve(n);
    vector<int> a2;
    a2.reserve(n);

    for (int i = 0; i < n; ++i) {
        a1.push_back(i);
        a2.push_back(i);
    }

    printf("%d %d\n", a1[n - 1], a2[n - 1]);
    return 0;
}

------------------ D:

import std.conv: toInt;
import std.gc: malloc, hasNoPointers;

int main(string[] args) {
    int n = args.length >= 2 ? toInt(args[1]) : 1000;

    int* aux_ptr1 = cast(int*)malloc(n * int.sizeof);
    hasNoPointers(aux_ptr1);
    int[] a1 = aux_ptr1[0 .. 0];

    int* aux_ptr2 = cast(int*)malloc(n * int.sizeof);
    hasNoPointers(aux_ptr2);
    int[] a2 = aux_ptr2[0 .. 0];

    for (int i; i < n; i++) {
        a1 ~= i;
        a2 ~= i;
    }

    printf("%d %d\n", a1[n - 1], a2[n - 1]);
    return 0;
}
Aug 25 2008
On Mon, 25 Aug 2008 01:41:51 -0400, Lionello Lunesu <lionello lunesu.remove.com> wrote:BTW, the capacity look-up was indeed using the cached values. Meaning, it's not "capacity" we should be focussing on, but the general append code. However, if you were to append to two different arrays in a similar loop, the performance would be terrible, since each array would overwrite the cache with its own ptr/size, resulting in the slow O(m+n) for each capacity look-up.Such as in this case: http://en.wikipedia.org/wiki/D_language#Example_2
Aug 25 2008
Robert Jacques:Such as in this case: http://en.wikipedia.org/wiki/D_language#Example_2For the people interested, here are few slides written by the good Hettinger that show how dict/set/lists work in Python (it shows how the array append too works): http://www.pycon.it/static/pycon2/slides/containers.ppt Bye, bearophile
Aug 25 2008
Lionello Lunesu wrote:Nevermind. None of that explained why bearophile's program is so slow. This does: int[] ~= int is using _d_arrayappendcT (from internal/gc/gc.d), which appends an int by appending its four bytes! Blasphemy! Now we have great super-duper-efficient array operations using SSEx but a simple append of an int to an int[] is using a "rep movs" to append 4 bytes. X-( Also, the append code first checks whether the array should be resized, and if not, jumps to the actual append code. The common case should not be branching: { if (too small) goto resize; append: ... return; resize: ... goto append; } The use of _d_arrayappendcT is hardcoded in the D compiler (e2ir.c) so I don't know how to create different code paths for different primitive (sizes) other than to add many ifs to _d_arrayappendcT. IIRC, TypeInfo contained a pointer to a method for fast array comparisons. Shouldn't a similar thing be added for array appends? BTW, the capacity look-up was indeed using the cached values. Meaning, it's not "capacity" we should be focussing on, but the general append code. However, if you were to append to two different arrays in a similar loop, the performance would be terrible, since each array would overwrite the cache with its own ptr/size, resulting in the slow O(m+n) for each capacity look-up. L.Great stuff! -Joel
Aug 25 2008
The bugzilla report on this suggests replacing []=[] with memcpy. But this benchmark shows this isn't a consistent win: import std.c.string; long timer() { asm { naked ; rdtsc ; ret ; } } void test1(byte[] a, byte[] b) { a[] = b[]; } void test2(byte[] a, byte[] b) { memcpy(a.ptr, b.ptr, a.length); } void main() { for (int i = 4; i < 100_000_000; i *= 2) { auto a = new byte[i]; auto b = new byte[i]; auto start = timer(); test1(a, b); auto end = timer(); auto r1 = end - start; start = timer(); test2(a, b); end = timer(); auto r2 = end - start; printf("i: %8d,\t[]=[]: %8lld,\tmemcpy: %8lld\n", i, r1, r2); } } Running this program produces: i: 4, []=[]: 144, memcpy: 568 i: 8, []=[]: 144, memcpy: 300 i: 16, []=[]: 172, memcpy: 324 i: 32, []=[]: 204, memcpy: 344 i: 64, []=[]: 288, memcpy: 276 i: 128, []=[]: 288, memcpy: 272 i: 256, []=[]: 352, memcpy: 364 i: 512, []=[]: 372, memcpy: 424 i: 1024, []=[]: 552, memcpy: 564 i: 2048, []=[]: 684, memcpy: 1384 i: 4096, []=[]: 1344, memcpy: 1772 i: 8192, []=[]: 2900, memcpy: 3216 i: 16384, []=[]: 5292, memcpy: 5472 i: 32768, []=[]: 11496, memcpy: 10388 i: 65536, []=[]: 29484, memcpy: 27480 i: 131072, []=[]: 110464, memcpy: 67784 i: 262144, []=[]: 655580, memcpy: 562400 i: 524288, []=[]: 1204124, memcpy: 1107256 i: 1048576, []=[]: 2364588, memcpy: 2272552 i: 2097152, []=[]: 4516440, memcpy: 4417764 i: 4194304, []=[]: 8996992, memcpy: 8817176 i: 8388608, []=[]: 20223908, memcpy: 17717748 i: 16777216, []=[]: 35774952, memcpy: 36094652 i: 33554432, []=[]: 71008068, memcpy: 71246896 i: 67108864, []=[]: 142982284, memcpy: 145473300 There's not much of a consistent conclusion to be drawn from that.
Aug 26 2008
The problem is that the timing for the small arrays cannot be trusted as there's more overhead than actual code being tested. I've changed your code by doing each test 100_000_000/i times. I've also added 'cpuid' for synchronization, as Don suggested: import std.c.string; long timer() { asm { naked ; push ECX; push EBX; mov EAX, 0 ; cpuid ; pop EBX; pop ECX; rdtsc ; ret ; } } void test1(byte[] a, byte[] b) { a[] = b[]; } void test2(byte[] a, byte[] b) { memcpy(a.ptr, b.ptr, a.length); } void main() { for (int i = 4; i < 100_000_000; i *= 2) { auto b = new byte[i]; auto a = new byte[i]; test1(a,b);//pre-cache auto count = 100_000_000 / i; auto start = timer(); while (count--) test1(a, b); auto end = timer(); auto r1 = end - start; count = 100_000_000 / i; start = timer(); while (count--) test2(a, b); end = timer(); auto r2 = end - start; printf("i: %8d,\t[]=[]: %8lld,\tmemcpy: %8lld\n", i, r1, r2); } } Results on my Core2 Duo: i: 4, []=[]: 2752539660, memcpy: 429537312 i: 8, []=[]: 1004458383, memcpy: 229984623 i: 16, []=[]: 492054948, memcpy: 128590560 i: 32, []=[]: 309960657, memcpy: 135892809 i: 64, []=[]: 204971832, memcpy: 79507359 i: 128, []=[]: 131925789, memcpy: 52222626 i: 256, []=[]: 73768896, memcpy: 40184559 i: 512, []=[]: 43827948, memcpy: 28040859 i: 1024, []=[]: 29030985, memcpy: 21078432 i: 2048, []=[]: 21645027, memcpy: 18953199 i: 4096, []=[]: 18099945, memcpy: 16022259 i: 8192, []=[]: 15905007, memcpy: 15399036 i: 16384, []=[]: 15567399, memcpy: 15795747 i: 32768, []=[]: 32244507, memcpy: 30654063 i: 65536, []=[]: 30824595, memcpy: 31286475 i: 131072, []=[]: 32248179, memcpy: 30950757 i: 262144, []=[]: 34880868, memcpy: 31166667 i: 524288, []=[]: 31220802, memcpy: 35813277 i: 1048576, []=[]: 42463341, memcpy: 42121386 i: 2097152, []=[]: 122218776, memcpy: 118839150 i: 4194304, []=[]: 111864294, memcpy: 110864988 i: 8388608, []=[]: 110600226, memcpy: 107638083 i: 16777216, []=[]: 97014645, memcpy: 95789574 i: 33554432, []=[]: 79344468, memcpy: 78277374 i: 67108864, []=[]: 77437629, memcpy: 78478767 Conclusion: memcpy wins big time for i <= 8192.
Aug 27 2008
Lionello Lunesu wrote:Results on my Core2 Duo:D'oh! I should have let Don guess it! L.
Aug 27 2008
Lionello Lunesu:Conclusion: memcpy wins big time for i <= 8192.I have modified your code a little: import std.c.string: memcpy; long timer() { asm { naked; push ECX; push EBX; mov EAX, 0; cpuid; pop EBX; pop ECX; rdtsc; ret; } } void main() { const int N = 500_000_000; alias ubyte DataType; for (int i = 4; i < N; i *= 2) { auto b = new DataType[i]; auto a = new DataType[i]; a[] = b[]; // pre-cache auto count = N / i; auto start = timer(); while (count--) a[] = b[]; auto end = timer(); auto t1 = end - start; count = N / i; start = timer(); while (count--) memcpy(a.ptr, b.ptr, a.length * DataType.sizeof); end = timer(); auto t2 = end - start; count = N / i; start = timer(); while (count--) for (int j; j < i; j++) a[j] = b[j]; end = timer(); auto t3 = end - start; count = N / i; start = timer(); while (count--) { auto pa = a.ptr; auto pb = b.ptr; int nloop = i; while (nloop--) *pa++ = *pb++; } end = timer(); auto t4 = end - start; printf("i: %8d,\t[]=[]: %8lld, \tt1/t2: %.2f \tt1/t3: %.2f \tt1/t4: %.2f\n", i, t1, cast(float)t1/t2, cast(float)t1/t3, cast(float)t1/t4); } } The timings: N = 500_000_000, DataType = ubyte: i: 4, []=[]: 9919922030, t1/t2: 4.64 t1/t3: 5.64 t1/t4: 5.27 i: 8, []=[]: 4957640530, t1/t2: 4.63 t1/t3: 3.59 t1/t4: 2.93 i: 16, []=[]: 2478916670, t1/t2: 4.10 t1/t3: 2.08 t1/t4: 1.55 i: 32, []=[]: 1537003970, t1/t2: 2.28 t1/t3: 1.40 t1/t4: 0.99 i: 64, []=[]: 1019418780, t1/t2: 2.59 t1/t3: 0.97 t1/t4: 0.67 i: 128, []=[]: 659281070, t1/t2: 2.54 t1/t3: 0.60 t1/t4: 0.58 i: 256, []=[]: 364781180, t1/t2: 1.90 t1/t3: 0.35 t1/t4: 0.34 i: 512, []=[]: 217207030, t1/t2: 1.56 t1/t3: 0.21 t1/t4: 0.21 i: 1024, []=[]: 143970290, t1/t2: 1.37 t1/t3: 0.14 t1/t4: 0.14 i: 2048, []=[]: 107283300, t1/t2: 1.21 t1/t3: 0.11 t1/t4: 0.11 i: 4096, []=[]: 88801730, t1/t2: 1.11 t1/t3: 0.08 t1/t4: 0.08 i: 8192, []=[]: 79605200, t1/t2: 1.06 t1/t3: 0.07 t1/t4: 0.07 i: 16384, []=[]: 77892400, t1/t2: 0.99 t1/t3: 0.07 t1/t4: 0.07 i: 32768, []=[]: 156926780, t1/t2: 1.03 t1/t3: 0.16 t1/t4: 0.16 i: 65536, []=[]: 154375350, t1/t2: 1.01 t1/t3: 0.15 t1/t4: 0.15 i: 131072, []=[]: 154758850, t1/t2: 1.01 t1/t3: 0.15 t1/t4: 0.15 i: 262144, []=[]: 155175500, t1/t2: 1.01 t1/t3: 0.15 t1/t4: 0.15 i: 524288, []=[]: 159647160, t1/t2: 1.01 t1/t3: 0.16 t1/t4: 0.16 i: 1048576, []=[]: 832590470, t1/t2: 1.00 t1/t3: 0.79 t1/t4: 0.79 i: 2097152, []=[]: 783795470, t1/t2: 1.00 t1/t3: 0.76 t1/t4: 0.75 i: 4194304, []=[]: 828801230, t1/t2: 1.00 t1/t3: 0.79 t1/t4: 0.78 i: 8388608, []=[]: 813813550, t1/t2: 1.00 t1/t3: 0.78 t1/t4: 0.78 i: 16777216, []=[]: 772143790, t1/t2: 1.00 t1/t3: 0.77 t1/t4: 0.76 i: 33554432, []=[]: 772579250, t1/t2: 1.00 t1/t3: 0.78 t1/t4: 0.78 i: 67108864, []=[]: 840918700, t1/t2: 1.00 t1/t3: 0.84 t1/t4: 0.83 i: 134217728, []=[]: 644072850, t1/t2: 1.00 t1/t3: 0.77 t1/t4: 0.77 i: 268435456, []=[]: 446295660, t1/t2: 0.99 t1/t3: 0.79 t1/t4: 0.78 So for DataType = ubyte: i = 4: T3 better i in [8, 32768] T2 better i > 32768: T1 and T2 equally good N = 500_000_000, DataType = uint: i: 4, []=[]: 9789704410, t1/t2: 3.72 t1/t3: 6.00 t1/t4: 4.87 i: 8, []=[]: 6088621920, t1/t2: 2.25 t1/t3: 4.62 t1/t4: 3.46 i: 16, []=[]: 4049677930, t1/t2: 2.58 t1/t3: 3.31 t1/t4: 2.48 i: 32, []=[]: 2620962840, t1/t2: 2.53 t1/t3: 2.31 t1/t4: 1.67 i: 64, []=[]: 1451602120, t1/t2: 1.89 t1/t3: 1.39 t1/t4: 0.94 i: 128, []=[]: 867937320, t1/t2: 1.56 t1/t3: 0.79 t1/t4: 0.74 i: 256, []=[]: 575548140, t1/t2: 1.37 t1/t3: 0.55 t1/t4: 0.53 i: 512, []=[]: 428628560, t1/t2: 1.21 t1/t3: 0.42 t1/t4: 0.41 i: 1024, []=[]: 355175800, t1/t2: 1.11 t1/t3: 0.35 t1/t4: 
0.35 i: 2048, []=[]: 318880440, t1/t2: 1.06 t1/t3: 0.32 t1/t4: 0.31 i: 4096, []=[]: 307818740, t1/t2: 1.02 t1/t3: 0.30 t1/t4: 0.30 i: 8192, []=[]: 628989600, t1/t2: 1.03 t1/t3: 0.62 t1/t4: 0.62 i: 16384, []=[]: 618088340, t1/t2: 1.01 t1/t3: 0.61 t1/t4: 0.61 i: 32768, []=[]: 620081160, t1/t2: 1.01 t1/t3: 0.62 t1/t4: 0.61 i: 65536, []=[]: 622162940, t1/t2: 1.01 t1/t3: 0.62 t1/t4: 0.61 i: 131072, []=[]: 633820200, t1/t2: 1.00 t1/t3: 0.61 t1/t4: 0.61 i: 262144, []=[]: 3239626180, t1/t2: 1.00 t1/t3: 1.05 t1/t4: 1.04 i: 524288, []=[]: 3145529870, t1/t2: 1.00 t1/t3: 1.04 t1/t4: 1.04 i: 1048576, []=[]: 3319243820, t1/t2: 1.00 t1/t3: 1.05 t1/t4: 1.05 i: 2097152, []=[]: 3292952240, t1/t2: 1.00 t1/t3: 1.04 t1/t4: 1.03 i: 4194304, []=[]: 3159184090, t1/t2: 1.00 t1/t3: 1.04 t1/t4: 1.04 i: 8388608, []=[]: 3261583900, t1/t2: 1.00 t1/t3: 1.04 t1/t4: 1.04 i: 16777216, []=[]: 3462604700, t1/t2: 1.00 t1/t3: 1.04 t1/t4: 1.03 i: 33554432, []=[]: 2971217250, t1/t2: 0.99 t1/t3: 1.05 t1/t4: 1.05 i: 67108864, []=[]: 3136278270, t1/t2: 0.99 t1/t3: 1.03 t1/t4: 1.03 So for DataType = uint: i in [4, 16]: T3 better i in [32, 8192] T2 better i > 16384: T1 and T2 equally good N = 500_000_000, DataType = ulong: i: 4, []=[]: 12176650300, t1/t2: 2.26 t1/t3: 5.67 t1/t4: 5.10 i: 8, []=[]: 8095535110, t1/t2: 2.58 t1/t3: 4.39 t1/t4: 3.69 i: 16, []=[]: 5241679510, t1/t2: 2.53 t1/t3: 3.15 t1/t4: 2.49 i: 32, []=[]: 2904270580, t1/t2: 1.89 t1/t3: 1.83 t1/t4: 1.41 i: 64, []=[]: 1733200310, t1/t2: 1.56 t1/t3: 1.12 t1/t4: 0.85 i: 128, []=[]: 1149144250, t1/t2: 1.37 t1/t3: 1.02 t1/t4: 0.70 i: 256, []=[]: 856914580, t1/t2: 1.21 t1/t3: 0.80 t1/t4: 0.54 i: 512, []=[]: 711247630, t1/t2: 1.11 t1/t3: 0.69 t1/t4: 0.46 i: 1024, []=[]: 638308500, t1/t2: 1.06 t1/t3: 0.63 t1/t4: 0.42 i: 2048, []=[]: 614686480, t1/t2: 0.99 t1/t3: 0.57 t1/t4: 0.40 i: 4096, []=[]: 1254471740, t1/t2: 1.02 t1/t3: 0.70 t1/t4: 0.69 i: 8192, []=[]: 1238218710, t1/t2: 1.01 t1/t3: 0.69 t1/t4: 0.68 i: 16384, []=[]: 1240384550, t1/t2: 1.01 t1/t3: 0.69 t1/t4: 0.68 i: 32768, []=[]: 1244624520, t1/t2: 1.01 t1/t3: 0.69 t1/t4: 0.69 i: 65536, []=[]: 1272774050, t1/t2: 1.00 t1/t3: 0.69 t1/t4: 0.69 i: 131072, []=[]: 6532720530, t1/t2: 1.00 t1/t3: 1.05 t1/t4: 1.05 i: 262144, []=[]: 6315104980, t1/t2: 1.00 t1/t3: 1.04 t1/t4: 1.04 i: 524288, []=[]: 6663898750, t1/t2: 1.00 t1/t3: 1.05 t1/t4: 1.05 i: 1048576, []=[]: 6562860030, t1/t2: 1.00 t1/t3: 1.04 t1/t4: 1.04 i: 2097152, []=[]: 6322608930, t1/t2: 1.00 t1/t3: 1.04 t1/t4: 1.04 So for DataType = ulong: i in [4, 16]: T3 better i in [32, 4096] T2 better i > 8192: T1 and T2 equally good CPU used: Pentium Dual-Core E2180, with caches 64 KB L1 (32 KB data, 32 KB program) and 1 MB L2. But there are ways to go much faster: http://cdrom.amd.com/devconn/events/gdc_2002_amd.pdf http://files.rsdn.ru/23380/AMD_block_prefetch_paper.pdf Bye, bearophile
Aug 27 2008
--snip--

I think it's safe to conclude that memcpy (T2) is the fastest all-round solution. Only the inlined custom code can beat it. What's more, to use memcpy only one line in gc.d needs to be changed. Both T3 and T4 would need a change in the compiler.

L.
Aug 27 2008
Lionello Lunesu:

I think it's safe to conclude that memcpy (T2) is the fastest all-round solution. Only the inlined custom code can beat it. What's more, to use memcpy only one line in gc.d needs to be changed. Both T3 and T4 would need a change in the compiler.

From more benchmarks I have seen that if you don't use ASM (with prefetching, etc.) and/or more clever code, the following is the fastest for DataTypes from ubyte to ulong:

template Range(int stop) {
    static if (stop <= 0)
        alias Tuple!() Range;
    else
        alias Tuple!(Range!(stop-1), stop-1) Range;
}

switch (i) {
    case 0: break;
    case 1: foreach (j; Range!(1)) a[j] = b[j]; break;
    case 2: foreach (j; Range!(2)) a[j] = b[j]; break;
    case 3: foreach (j; Range!(3)) a[j] = b[j]; break;
    case 4: foreach (j; Range!(4)) a[j] = b[j]; break;
    case 5: foreach (j; Range!(5)) a[j] = b[j]; break;
    case 6: foreach (j; Range!(6)) a[j] = b[j]; break;
    case 7: foreach (j; Range!(7)) a[j] = b[j]; break;
    case 8: foreach (j; Range!(8)) a[j] = b[j]; break;
    default: memcpy(a.ptr, b.ptr, a.length * DataType.sizeof);
}

A possible disadvantage may come from the fact that such code requires several instructions, so it increases the traffic in the L1 code cache.

Bye, bearophile
Aug 27 2008
== Quote from bearophile (bearophileHUGS lycos.com)'s article

Lionello Lunesu:

I think it's safe to conclude that memcpy (T2) is the fastest all-round solution. Only the inlined custom code can beat it. What's more, to use memcpy only one line in gc.d needs to be changed. Both T3 and T4 would need a change in the compiler.

From more benchmarks I have seen that if you don't use ASM (with prefetching, etc.) and/or more clever code, the following is the fastest for DataTypes from ubyte to ulong:

template Range(int stop) {
    static if (stop <= 0) alias Tuple!() Range;
    else alias Tuple!(Range!(stop-1), stop-1) Range;
}

switch (i) {
    case 0: break;
    case 1: foreach (j; Range!(1)) a[j] = b[j]; break;
    case 2: foreach (j; Range!(2)) a[j] = b[j]; break;
    case 3: foreach (j; Range!(3)) a[j] = b[j]; break;
    case 4: foreach (j; Range!(4)) a[j] = b[j]; break;
    case 5: foreach (j; Range!(5)) a[j] = b[j]; break;
    case 6: foreach (j; Range!(6)) a[j] = b[j]; break;
    case 7: foreach (j; Range!(7)) a[j] = b[j]; break;
    case 8: foreach (j; Range!(8)) a[j] = b[j]; break;
    default: memcpy(a.ptr, b.ptr, a.length * DataType.sizeof);
}

A possible disadvantage may come from the fact that such code requires several instructions, so it increases the traffic in the L1 code cache.

Bye, bearophile

Why not just change that to this instead:

switch(i) {
    case 8: a[7] = b[7];
    case 7: a[6] = b[6];
    case 6: a[5] = b[5];
    case 5: a[4] = b[4];
    case 4: a[3] = b[3];
    case 3: a[2] = b[2];
    case 2: a[1] = b[1];
    case 1: a[0] = b[0];
    case 0: break;
    default: memcpy( a.ptr, b.ptr, a.length * DataType.sizeof );
}

The Range stuff is cool and all, but not really necessary :-)

Sean
Aug 27 2008
Sean Kelly:

Why not just change that to this instead:

switch(i) {
    case 8: a[7] = b[7];
    case 7: a[6] = b[6];
    case 6: a[5] = b[5];
    case 5: a[4] = b[4];
    case 4: a[3] = b[3];
    case 3: a[2] = b[2];
    case 2: a[1] = b[1];
    case 1: a[0] = b[0];
    case 0: break;
    default: memcpy( a.ptr, b.ptr, a.length * DataType.sizeof );
}

The Range stuff is cool and all, but not really necessary :-)

I have benchmarked many alternative versions, yours too, but mine is significantly faster than yours on the DMD I am using. Don't ask me why.

You can find the Range!([start,] stop[, step]) in my libs, of course ;-)

Bye, bearophile
Aug 27 2008
On 2008-08-27 17:20:55 +0200, bearophile <bearophileHUGS lycos.com> said:

Sean Kelly:

Why not just change that to this instead:

switch(i) {
    case 8: a[7] = b[7];
    case 7: a[6] = b[6];
    case 6: a[5] = b[5];
    case 5: a[4] = b[4];
    case 4: a[3] = b[3];
    case 3: a[2] = b[2];
    case 2: a[1] = b[1];
    case 1: a[0] = b[0];
    case 0: break;
    default: memcpy( a.ptr, b.ptr, a.length * DataType.sizeof );
}

The Range stuff is cool and all, but not really necessary :-)

I have benchmarked many alternative versions, yours too, but mine is significantly faster than yours on the DMD I am using. Don't ask me why.

You can find the Range!([start,] stop[, step]) in my libs, of course ;-)

Bye, bearophile

memory ordering probably, looping from 0 to 7 is faster than 7 to 0...
Aug 27 2008
On 2008-08-28 00:22:21 +0200, Fawzi Mohamed <fmohamed mac.com> said:

On 2008-08-27 17:20:55 +0200, bearophile <bearophileHUGS lycos.com> said:

Sean Kelly:

Why not just change that to this instead:

switch(i) {
    case 8: a[7] = b[7];
    case 7: a[6] = b[6];
    case 6: a[5] = b[5];
    case 5: a[4] = b[4];
    case 4: a[3] = b[3];
    case 3: a[2] = b[2];
    case 2: a[1] = b[1];
    case 1: a[0] = b[0];
    case 0: break;
    default: memcpy( a.ptr, b.ptr, a.length * DataType.sizeof );
}

The Range stuff is cool and all, but not really necessary :-)

I have benchmarked many alternative versions, yours too, but mine is significantly faster than yours on the DMD I am using. Don't ask me why.

You can find the Range!([start,] stop[, step]) in my libs, of course ;-)

Bye, bearophile

memory ordering probably, looping from 0 to 7 is faster than 7 to 0...

so

int j = 0;
switch(i) {
    case 8: a[j] = b[j]; ++j;
    case 7: a[j] = b[j]; ++j;
    case 6: a[j] = b[j]; ++j;
    case 5: a[j] = b[j]; ++j;
    case 4: a[j] = b[j]; ++j;
    case 3: a[j] = b[j]; ++j;
    case 2: a[j] = b[j]; ++j;
    case 1: a[j] = b[j];
    case 0: break;
    default: memcpy( a.ptr, b.ptr, a.length * DataType.sizeof );
}

should be better...

Fawzi
Aug 27 2008
Fawzi Mohamed:

memory ordering probably, looping from 0 to 7 is faster than 7 to 0...

I don't think that's the problem.

so ... should be better...

It is not, it seems. For reference, below is some code you can run:

import std.c.string: memcpy;

template Tuple(T...) { alias T Tuple; }

long timer() {
    asm {
        naked;
        push ECX;
        push EBX;
        mov EAX, 0;
        cpuid;
        pop EBX;
        pop ECX;
        rdtsc;
        ret;
    }
}

template Range(int stop) {
    static if (stop <= 0)
        alias Tuple!() Range;
    else
        alias Tuple!(Range!(stop-1), stop-1) Range;
}

void main() {
    const int N = 500_000_000;
    alias uint DataType;

    for (int i = 4; i < N; i *= 2) {
        auto b = new DataType[i];
        auto a = new DataType[i];
        a[] = b[]; // pre-cache

        auto count = N / i;
        auto start = timer();
        while (count--)
            a[] = b[];
        auto end = timer();
        auto t1 = end - start;

        count = N / i;
        start = timer();
        while (count--)
            memcpy(a.ptr, b.ptr, a.length * DataType.sizeof);
        end = timer();
        auto t2 = end - start;

        count = N / i;
        start = timer();
        while (count--)
            for (int j; j < i; j++)
                a[j] = b[j];
        end = timer();
        auto t3 = end - start;

        count = N / i;
        start = timer();
        while (count--) {
            auto pa = a.ptr;
            auto pb = b.ptr;
            int nloop = i;
            while (nloop--)
                *pa++ = *pb++;
        }
        end = timer();
        auto t4 = end - start;

        count = N / i;
        start = timer();
        while (count--) {
            // apparently the fastest version
            switch (i) {
                case 0: break;
                case 1: foreach (j; Range!(1)) a[j] = b[j]; break;
                case 2: foreach (j; Range!(2)) a[j] = b[j]; break;
                case 3: foreach (j; Range!(3)) a[j] = b[j]; break;
                case 4: foreach (j; Range!(4)) a[j] = b[j]; break;
                case 5: foreach (j; Range!(5)) a[j] = b[j]; break;
                case 6: foreach (j; Range!(6)) a[j] = b[j]; break;
                case 7: foreach (j; Range!(7)) a[j] = b[j]; break;
                case 8: foreach (j; Range!(8)) a[j] = b[j]; break;
                default: memcpy(a.ptr, b.ptr, a.length * DataType.sizeof);
            }
        }
        end = timer();
        auto t5 = end - start;

        count = N / i;
        start = timer();
        while (count--) {
            int j = 0;
            switch(i) {
                case 8: a[j] = b[j]; ++j;
                case 7: a[j] = b[j]; ++j;
                case 6: a[j] = b[j]; ++j;
                case 5: a[j] = b[j]; ++j;
                case 4: a[j] = b[j]; ++j;
                case 3: a[j] = b[j]; ++j;
                case 2: a[j] = b[j]; ++j;
                case 1: a[j] = b[j];
                case 0: break;
                default: memcpy( a.ptr, b.ptr, a.length * DataType.sizeof );
            }
        }
        end = timer();
        auto t6 = end - start;

        printf("i: %8d,\t[]=[]: %8lld, \tt1/t2: %.2f \tt1/t3: %.2f \tt1/t4: %.2f \tt1/t5: %.2f \tt1/t6: %.2f\n",
            i, t1, cast(float)t1/t2, cast(float)t1/t3, cast(float)t1/t4, cast(float)t1/t5, cast(float)t1/t6);
    }
}

Bye, bearophile
Aug 27 2008
Lionello Lunesu wrote:The problem is that the timing for the small arrays cannot be trusted as there's more overhead than actual code being tested. I've changed your code by doing each test 100_000_000/i times. I've also added 'cpuid' for synchronization, as Don suggested:I suspect that copying the same data over and over again is not representative of actual usage, and so is not giving useful results. I also do not understand the usage of cpuid here. I've used rdtsc for years for profiling purposes, and it has given me results that match the clock-on-the-wall results.
Aug 28 2008
Essentially, I'm having a real hard time believing that for the 4 byte case, memcpy is faster. Take a look at the actual implementation of memcpy, vs the code generated by the compiler.
Aug 28 2008
"Walter Bright" <newshound1 digitalmars.com> wrote in message news:g95mh3$2e5i$2 digitalmars.com...Essentially, I'm having a real hard time believing that for the 4 byte case, memcpy is faster. Take a look at the actual implementation of memcpy, vs the code generated by the compiler.I agree with you, but I've noticed the byte[] = byte[] emits "rep movsb" and that must be the slowest way there is to copy 4 bytes. In fact, from my demo scene experience I've learned to not use those CISC instructions like rep and movs* and loop. mov/dec/jnz always proved faster. However, I do think that it's a good idea to move the []=[] code into the run-time library. That way specialized versions can be written, like the other array operations. (Maybe it's already in the lib, but I haven't found it. I've only seen mention of OPmemcpy in e2ir.c which I assume represents the asm code for a memcpy.) L.
Aug 28 2008
Lionello Lunesu wrote:"Walter Bright" <newshound1 digitalmars.com> wrote in message news:g95mh3$2e5i$2 digitalmars.com...But for 4 bytes, how much slower could it be than memcpy, which executes a lot of setup code?Essentially, I'm having a real hard time believing that for the 4 byte case, memcpy is faster. Take a look at the actual implementation of memcpy, vs the code generated by the compiler.I agree with you, but I've noticed the byte[] = byte[] emits "rep movsb" and that must be the slowest way there is to copy 4 bytes. In fact, from my demo scene experience I've learned to not use those CISC instructions like rep and movs* and loop. mov/dec/jnz always proved faster.However, I do think that it's a good idea to move the []=[] code into the run-time library. That way specialized versions can be written, like the other array operations. (Maybe it's already in the lib, but I haven't found it. I've only seen mention of OPmemcpy in e2ir.c which I assume represents the asm code for a memcpy.)OPmemcpy is an operator supported by the compiler back end, it generates the movsb's.
Aug 28 2008
"Walter Bright" <newshound1 digitalmars.com> wrote in message news:g96sji$2s1a$1 digitalmars.com...But for 4 bytes, how much slower could it be than memcpy, which executes a lot of setup code?I don't know. I do know that I noticed a speed difference and that you asked me personally to do some more timings. And I've provided you with detailed timings, which prove, against popular belief, that memcpy is faster than rep movsb. Now apparently the conclusion is that A) "copying the same data over and over again is not representative of actual usage", and B) "[you're] having a real hard time believing that for the 4 byte case, memcpy is faster." I guess this subject is closed, but somehow I'm left with a somewhat unsatisfied feeling.. L.
Aug 29 2008
Walter Bright wrote:

Lionello Lunesu wrote:

The problem is that the timing for the small arrays cannot be trusted as there's more overhead than actual code being tested. I've changed your code by doing each test 100_000_000/i times. I've also added 'cpuid' for synchronization, as Don suggested:

I suspect that copying the same data over and over again is not representative of actual usage, and so is not giving useful results. I also do not understand the usage of cpuid here. I've used rdtsc for years for profiling purposes, and it has given me results that match the clock-on-the-wall results.

Using cpuid only once doesn't work. The rationale ultimately comes from here, I think: http://cs.smu.ca/~jamuir/rdtscpm1.pdf

But you've got me thinking. So, serialisation only happens with cpuid. But, WHEN is lack of serialisation actually a problem? On my Pentium M, rdtsc takes 13 uops, all using execution port p0. This means that anything on the other ports could execute after it. The only instructions with a latency longer than 13 are div, fdiv, the transcendental floating point instructions, and the bizarro instructions aam, fbld, fbstp, and cpuid. Pretty similar for Core2 and AMD64.

So you may be right -- as long as you don't have one of those super-long latency instructions just before your rdtsc call, cpuid is probably doing more harm than good. The conventional wisdom could well be wrong.
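For what it's worth, a related calibration step (if memory serves, also suggested in that paper) sidesteps part of the question: measure the timer's own cost with back-to-back calls and subtract it. A sketch reusing the timer() from earlier in this thread:

// Hedged sketch: calibrate out the measurement overhead before
// trusting small cycle counts, whether or not cpuid is used.
long overhead = long.max;
for (int k = 0; k < 100; k++) {
    auto t0 = timer();
    auto t1 = timer();  // nothing in between: pure timer cost
    if (t1 - t0 < overhead)
        overhead = t1 - t0;
}
// then report: cycles = (end - start) - overhead;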
Aug 28 2008
Don wrote:

Using cpuid only once doesn't work. The rationale ultimately comes from here, I think: http://cs.smu.ca/~jamuir/rdtscpm1.pdf

Ah, that makes sense now. Thanks!
Aug 28 2008