digitalmars.D - Holes in structs and opEquals
- bearophile (22/22) Mar 07 2010 This comes from a small thread that is going on in digitalmars.D.learn.
- Fawzi Mohamed (15/54) Mar 07 2010 one could argue that the unsafe operation is memset.
- bearophile (7/9) Mar 07 2010 Note that if you write a opEquals for each struct then you have the same...
- bearophile (10/10) Mar 07 2010 There are other solutions.
- Walter Bright (9/25) Mar 07 2010 Right.
- yigal chripun (8/43) Mar 07 2010 The compiler knows at compile-time what variables are initialized with "...
- Walter Bright (9/18) Mar 08 2010 It can, but I don't agree that it should. For an =void initialization,
- bearophile (6/14) Mar 08 2010 I don't fully like this design strategy, I don't like to separate so muc...
- Don (20/57) Mar 08 2010 The same issue can arise with unions. This one is pretty hard to grep fo...
- Don (3/25) Mar 07 2010 Yes. Most of the problems with struct opEquals were fixed in bug 3433.
- Walter Bright (6/10) Mar 07 2010 The "holes" are defined to be filled with 0, and are when initialized by...
- bearophile (5/8) Mar 07 2010 Thank you for your answer. I didn't read this important detail in the on...
- yigal chripun (4/24) Mar 08 2010 Ok, that sound's reasonable if you want to keep the compiler simple and ...
- Walter Bright (5/14) Mar 08 2010 The non-trivial cases are not detectable (halting problem), so it is a
- bearophile (6/8) Mar 09 2010 On the other hand I have shown few ideas to avoid this problem, that req...
- bearophile (98/98) Mar 09 2010 I guess what's holding back Walter on this improvement is the higher per...
This comes from a small thread that is going on in digitalmars.D.learn. This program asserts: import std.c.string; struct S { // 16 bytes, with a hole ushort s; double d; } void main() { S s1, s2; memset(&s1, ubyte.min, S.sizeof); memset(&s2, ubyte.max, S.sizeof); s1.s = s2.s = 0; s1.d = s2.d = 0; assert(s1 == s2); } But a correctly implemented opEquals (and opCmp) among structs has to ignore the contents of the holes, because they can be filled with anything, for example if the structs where not initialized (with =void). A correct opEquals has to work recursively (so if the struct contains a string, it has to control the string equality too). And a built-in recursive opCmp/opHash for structs is about as useful as having a correct opEquals. A correct opEquals can be a little slower, but correctness come first. In the uncommon situations where I really need max speed and I don't care of correctness I can use a memcmp(&s1, &s2, S.sizeof). (And the compiler is free to use just memcmp when a struct has no holes and doesn't contain references, associative arrays and dynamic arrays). Correctness of basic struct operators is not an optional feature, like properties or even enums. If the opEquals among structs is wrong then it's better to not have it at all. Bye, bearophile
Mar 07 2010
On 2010-03-07 15:09:53 +0100, bearophile <bearophileHUGS lycos.com> said:This comes from a small thread that is going on in digitalmars.D.learn. This program asserts: import std.c.string; struct S { // 16 bytes, with a hole ushort s; double d; } void main() { S s1, s2; memset(&s1, ubyte.min, S.sizeof); memset(&s2, ubyte.max, S.sizeof); s1.s = s2.s = 0; s1.d = s2.d = 0; assert(s1 == s2); } But a correctly implemented opEquals (and opCmp) among structs has to ignore the contents of the holes, because they can be filled with anything, for example if the structs where not initialized (with =void). A correct opEquals has to work recursively (so if the struct contains a string, it has to control the string equality too). And a built-in recursive opCmp/opHash for structs is about as useful as having a correct opEquals. A correct opEquals can be a little slower, but correctness come first. In the uncommon situations where I really need max speed and I don't care of correctness I can use a memcmp(&s1, &s2, S.sizeof). (And the compiler is free to use just memcmp when a struct has no holes and doesn't contain references, associative arrays and dynamic arrays). Correctness of basic struct operators is not an optional feature, like properties or even enums. If the opEquals among structs is wrong then it's better to not have it at all.one could argue that the unsafe operation is memset. The compiler always initializes a struct, so that what you describe should never happen in a safe program. Still as you say the following example that might indeed considered a bug: S s1=void,s2=void; s1.s=0; s1.d=0; s2.s=0; s2.d=0; assert(s1 == s2); here the assert might fail depending on the previous content of the memory. I am not sure of what is the best solution, I am not sure that defining a special comparison operation by default for each struct is the correct solution (it can be quite some bloat), please note that a user defined comparison function will not have these problems. Still I agree that traking down a bug due to this might be very ugly...
Mar 07 2010
Fawzi Mohamed:I am not sure of what is the best solution,<This is true for me too, that's why I have created this thread, to find a way to solve this problem.I am not sure that defining a special comparison operation by default for each struct is the correct solution (it can be quite some bloat), please note that a user defined comparison function will not have these problems.<Note that if you write a opEquals for each struct then you have the same code bloat. The difference is that you have had to write lot of boring code, that can contain bugs, and your program is longer. In a program you don't use == and != (or hashing) on all the structs, so in theory the compiler can add such opEquals only to the structs that are tested for equality in the program. I guess the usual modular compilation requirement of D code asks too all structs to have such opEquals. In this case the Link-Time Optimization of LLVM comes to the rescue and removed unused functions from the whole program/whole compilation block. Another possible solution is to have a single opEquals in the runtime, that works like the array sort, using run time knowledge about the types. But this is of probably slower. Bye, bearophile
Mar 07 2010
There are other solutions. For example remove the current opEquals from structs, so doing == among two structs become a syntax error. And then add a property like equable that when present beside the struct name adds a specific and correct and recursive opEquals to it. equable stuct Foo { int x; } Probably instead of equable it's better to have an attribute that adds opHash, opEquals and opCmp to a struct. This is safe, avoids most code bloat, and keeps code fast. The disadvantage is a that it adds a new attribute to the language. If the user can create new attributes, then it's surely possible to define this attribute with D code and some compile-time reflection, keeping the language simple enough. There is one or more Java libs that use attributes for this specific purpose. Bye, bearophile
Mar 07 2010
Fawzi Mohamed wrote:one could argue that the unsafe operation is memset.That's right.The compiler always initializes a struct, so that what you describe should never happen in a safe program.Right.Still as you say the following example that might indeed considered a bug: S s1=void,s2=void; s1.s=0; s1.d=0; s2.s=0; s2.d=0; assert(s1 == s2); here the assert might fail depending on the previous content of the memory. I am not sure of what is the best solution, I am not sure that defining a special comparison operation by default for each struct is the correct solution (it can be quite some bloat), please note that a user defined comparison function will not have these problems.No, it's not a bug in the language, it's a bug in the user code. Using =void is an advanced feature, and it requires the user to know what he's doing with it. That's why it isn't allowed in safe mode.Still I agree that traking down a bug due to this might be very ugly...The idea with =void initializations is that they are findable using grep, and so can be singled out for special attention when there is a problem.
Mar 07 2010
Walter Bright Wrote:Fawzi Mohamed wrote:The compiler knows at compile-time what variables are initialized with "=void". The compiler than can add a compilation error when such a struct variable is used in an equals expression. this doesn't cover use of malloc() which must be the user's responsebility. e.g. S s1=void,s2=void; s1.s=0; s1.d=0; s2.s=0; s2.d=0; assert(s1 == s2); // <- this line should not compileone could argue that the unsafe operation is memset.That's right.The compiler always initializes a struct, so that what you describe should never happen in a safe program.Right.Still as you say the following example that might indeed considered a bug: S s1=void,s2=void; s1.s=0; s1.d=0; s2.s=0; s2.d=0; assert(s1 == s2); here the assert might fail depending on the previous content of the memory. I am not sure of what is the best solution, I am not sure that defining a special comparison operation by default for each struct is the correct solution (it can be quite some bloat), please note that a user defined comparison function will not have these problems.No, it's not a bug in the language, it's a bug in the user code. Using =void is an advanced feature, and it requires the user to know what he's doing with it. That's why it isn't allowed in safe mode.Still I agree that traking down a bug due to this might be very ugly...The idea with =void initializations is that they are findable using grep, and so can be singled out for special attention when there is a problem.
Mar 07 2010
yigal chripun wrote:The compiler knows at compile-time what variables are initialized with "=void". The compiler than can add a compilation error when such a struct variable is used in an equals expression. this doesn't cover use of malloc() which must be the user's responsebility. e.g. S s1=void,s2=void; s1.s=0; s1.d=0; s2.s=0; s2.d=0; assert(s1 == s2); // <- this line should not compileIt can, but I don't agree that it should. For an =void initialization, it's the user's responsibility to initialize it properly. The use of =void implies the user knows what he's doing with it, and will take care to initialize the 'holes' as necessary. Trying to disable == for such structs is a losing battle, anyway, as the compiler could only detect the most obvious cases. Pass a reference to it to a function, store it in a data structure, etc., and all that goes away.
Mar 08 2010
Walter Bright:It can, but I don't agree that it should. For an =void initialization, it's the user's responsibility to initialize it properly. The use of =void implies the user knows what he's doing with it, and will take care to initialize the 'holes' as necessary.I don't fully like this design strategy, I don't like to separate so much the safe code from the unsafe code where the compiler refuses to give any safety. A good compiler has to help assure safety every time it can. On the other hand I agree with you that your design strategy helps keep the compiler simpler (and sometimes compilation time low), and I agree that there are times where it's hopeless to try to turn something unsafe in something a little safer. It's a waste of time that can be used for something more useful. So in the end I shut up the muzzle :-)Trying to disable == for such structs is a losing battle, anyway, as the compiler could only detect the most obvious cases. Pass a reference to it to a function, store it in a data structure, etc., and all that goes away.In theory a lint for D code can partially trace such movements to spot such troubles (in debug release it can even add to the void structs an extra bool attribute that always gets initialized, to tell apart initialized from uninitialized structs at runtime and spot such bugs, but this starts to sound a little silly). Bye, bearophile
Mar 08 2010
Walter Bright wrote:Fawzi Mohamed wrote:The same issue can arise with unions. This one is pretty hard to grep for: ----------- struct S { // 16 bytes, with a hole ushort s; double d; } union U { char [16] c; S s; } void main() { U u; u.c[]='2'; S a = S(5, 2.0); u.s.s = a.s; u.s.d = a.d; S b = u.s; assert( a == b); }one could argue that the unsafe operation is memset.That's right.The compiler always initializes a struct, so that what you describe should never happen in a safe program.Right.Still as you say the following example that might indeed considered a bug: S s1=void,s2=void; s1.s=0; s1.d=0; s2.s=0; s2.d=0; assert(s1 == s2); here the assert might fail depending on the previous content of the memory. I am not sure of what is the best solution, I am not sure that defining a special comparison operation by default for each struct is the correct solution (it can be quite some bloat), please note that a user defined comparison function will not have these problems.No, it's not a bug in the language, it's a bug in the user code. Using =void is an advanced feature, and it requires the user to know what he's doing with it. That's why it isn't allowed in safe mode.Still I agree that traking down a bug due to this might be very ugly...The idea with =void initializations is that they are findable using grep, and so can be singled out for special attention when there is a problem.
Mar 08 2010
bearophile wrote:This comes from a small thread that is going on in digitalmars.D.learn. This program asserts: import std.c.string; struct S { // 16 bytes, with a hole ushort s; double d; } void main() { S s1, s2; memset(&s1, ubyte.min, S.sizeof); memset(&s2, ubyte.max, S.sizeof); s1.s = s2.s = 0; s1.d = s2.d = 0; assert(s1 == s2); } But a correctly implemented opEquals (and opCmp) among structs has to ignore the contents of the holes, because they can be filled with anything, for example if the structs where not initialized (with =void). Correctness of basic struct operators is not an optional feature, like properties or even enums. If the opEquals among structs is wrong then it's better to not have it at all.Yes. Most of the problems with struct opEquals were fixed in bug 3433. You've found an annoying case which wasn't covered.
Mar 07 2010
bearophile wrote:But a correctly implemented opEquals (and opCmp) among structs has to ignore the contents of the holes, because they can be filled with anything,The "holes" are defined to be filled with 0, and are when initialized by the compiler. This is specifically so that memcmp can be done to compare the struct contents.for example if the structs where not initialized (with =void).If you use =void, or allocate with malloc(), it is your responsibility to deal with the holes.
Mar 07 2010
Walter Bright:The "holes" are defined to be filled with 0, and are when initialized by the compiler. This is specifically so that memcmp can be done to compare the struct contents.Thank you for your answer. I didn't read this important detail in the online documentation, I must have missed it, sorry. It's a detail that D programmers must keep in mind well. A future lint program for D will need to try to warn the programmer that using the default opEquals among uninitialized structs is wrong. Bye, bearophile
Mar 07 2010
Walter Bright Wrote:yigal chripun wrote:Ok, that sound's reasonable if you want to keep the compiler simple and fast. How about same mode though? maybe it's worth to add this check only as part of same mode, where there are less cases anyway since malloc() isn't safe, and there are no pointers. is it allowed to use "=void" in safe mode at all? another question I have: How would a user initialize the holes and doesn't it negate the bebefits of void as optimisation?The compiler knows at compile-time what variables are initialized with "=void". The compiler than can add a compilation error when such a struct variable is used in an equals expression. this doesn't cover use of malloc() which must be the user's responsebility. e.g. S s1=void,s2=void; s1.s=0; s1.d=0; s2.s=0; s2.d=0; assert(s1 == s2); // <- this line should not compileIt can, but I don't agree that it should. For an =void initialization, it's the user's responsibility to initialize it properly. The use of =void implies the user knows what he's doing with it, and will take care to initialize the 'holes' as necessary. Trying to disable == for such structs is a losing battle, anyway, as the compiler could only detect the most obvious cases. Pass a reference to it to a function, store it in a data structure, etc., and all that goes away.
Mar 08 2010
yigal chripun wrote:The non-trivial cases are not detectable (halting problem), so it is a pointless feature to add.Trying to disable == for such structs is a losing battle, anyway, as the compiler could only detect the most obvious cases. Pass a reference to it to a function, store it in a data structure, etc., and all that goes away.Ok, that sound's reasonable if you want to keep the compiler simple and fast.another question I have: How would a user initialize the holes and doesn't it negate the bebefits of void as optimisation?The point of =void is to allow the user to determine those tradeoffs rather than the compiler.
Mar 08 2010
Walter Bright:The non-trivial cases are not detectable (halting problem), so it is a pointless feature to add.On the other hand I have shown few ideas to avoid this problem, that require no magic abilities to a compiler, essentially making the opBinary("==") operator actively ignore the holes. If generic code to ignore such holes is too much slow it can be a little faster creating a specialized version for each struct. If this causes too much code bloat, then the default opBinary("==") can be removed from structs and put only when the programmer asks for it with something like a equatable property. If a programmer wants max speed ignoring the holes, such person can define a opBinary("==") that uses memcmp or something similar. Bye, bearophile
Mar 09 2010
I guess what's holding back Walter on this improvement is the higher performance of the equal done ignoring the holes compared to the more correct one. So I have done a test to see how can large is such performance difference, because with not even a bit of experimental data it's hard to discuss: import std.stdio: writeln; import std.c.string: memcmp, memset; /* Basic uniform random generator: Minimal Standard in Park and Miller (1988): "Random Number Generators: Good Ones Are Hard to Find", Comm. of the ACM, 31, 1192-1201. Parameters: m = 2^31-1, a=48271. Translated to D from Pascal code by Jesper Lund: http://www.gnu-pascal.de/crystal/gpc/en/mail1390.html */ struct Random { enum int m = int.max; enum int a = 48_271; enum int q = m / a; enum int r = m % a; int seed = 42; double nextDouble() { int k = seed / q; seed = a * (seed - k * q) - r * k; if (seed < 1) seed += m; return cast(double)seed / m; } int nextInt(int max) { int n = max + 1; int k = cast(int)(n * this.nextDouble()); return (k == n) ? k - 1 : k; } } struct SA { short s; double d; } struct SB { short s; double d; bool opBinary(string S:"==")(SB other) { return (this.s == other.s) && (this.d == other.d); } } void main() { static assert(SA.sizeof == SB.sizeof); enum Versions { basic, defined, memcmp } enum int N = 1_000; // small, to keep data in CPU cache enum int NLOOPS = 100_000; enum Versions vers = Versions.defined; // basic, defined, memcmp SA[6] sa_items = [SA(20, 5.5), SA(20, 5.5), SA(20, 6.5), SA(20, 6.5), SA(30, 6.5), SA(30, 6.5)]; memset(&(sa_items[1]), 255, SA.sizeof); sa_items[1].s = sa_items[0].s; sa_items[1].d = sa_items[0].d; memset(&(sa_items[3]), 255, SA.sizeof); sa_items[3].s = sa_items[2].s; sa_items[3].d = sa_items[2].d; memset(&(sa_items[5]), 255, SA.sizeof); sa_items[5].s = sa_items[4].s; sa_items[5].d = sa_items[4].d; auto sa1 = new SA[N]; auto sb1 = new SB[sa1.length]; auto sa2 = new SA[sa1.length]; auto sb2 = new SB[sa1.length]; auto rnd = Random(1); foreach (i; 0 .. sa1.length) { int r1 = rnd.nextInt(sa_items.length - 1); sa1[i] = sa_items[r1]; sb1[i] = SB(sa1[i].s, sa1[i].d); int r2 = rnd.nextInt(sa_items.length - 1); sa2[i] = sa_items[r2]; sb2[i] = SB(sa2[i].s, sa2[i].d); } int sum; final switch(vers) { case Versions.basic: for (int i; i < NLOOPS; i++) for (int j; j < N; j++) sum += (sa1[j] == sa2[j]); break; case Versions.defined: for (int i; i < NLOOPS; i++) for (int j; j < N; j++) sum += (sb1[j] == sb2[j]); break; case Versions.memcmp: for (int i; i < NLOOPS; i++) for (int j; j < N; j++) sum += memcmp(&(sa1[j]), &(sa2[j]), SA.sizeof) == 0; break; } writeln(sum); } Timings, N=1_000, NLOOPS=100_000, seconds: basic: 3.76 (result sum = 16400000) defined: 4.48 (result sum = 34100000) memcmp: 4.81 (result sum = 16400000) Bye, bearophile
Mar 09 2010