digitalmars.D.bugs - [Issue 4475] New: Improving the compiler 'in' associative array can return just a bool

d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=4475

           Summary: Improving the compiler 'in' associative array can
                    return just a bool
           Product: D
           Version: D2
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: DMD
        AssignedTo: nobody puremagic.com
        ReportedBy: bearophile_hugs eml.cc



This relates to pages 18 and 56 of The D Programming Language.

"foo in associativeArray" returns a pointer. So can this work in SafeD too?
Maybe SafeD could accept it when the pointer is never dereferenced and is only
tested against null, but there is a cleaner alternative.

In normal D code there is no need to write this to find the parity of x:
int parity = x & 1;

The following, more readable form can be used instead, because some stage of
the compiler is able to optimize it into the first expression:
int parity = x % 2;
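
A small sketch of the two parity forms, using the hypothetical helper names
parityMod and parityAnd. It assumes x is a signed int: the two expressions agree
for non-negative values, but for a negative odd value D's % keeps the sign of
the dividend, so replacing x % 2 with x & 1 is only a straight substitution when
x is known to be non-negative or unsigned.

```d
// Hypothetical helpers, not taken from the original report.
int parityMod(int x) { return x % 2; }   // remainder; sign follows x
int parityAnd(int x) { return x & 1; }   // lowest bit; always 0 or 1

void main()
{
    // Non-negative values: both forms give the same result.
    foreach (x; [0, 1, 2, 7, 1000])
        assert(parityMod(x) == parityAnd(x));

    // Negative odd value: the two forms differ.
    assert(parityMod(-3) == -1);
    assert(parityAnd(-3) == 1);
}
```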


The "in" for associative arrays returns a pointer for efficiency reasons, to
avoid a double lookup in some situations. But the D1 LDC compiler is now able
to optimize away two "close" associative array lookups in all situations,
performing just one lookup.

LDC is probably not able to perform this optimization if the pointer is stored
in a variable and used much later, but this is not a common usage pattern, so I
think this can be ignored.

If the compiler is able to perform this optimization, then "in" can return
a boolean, and it can be used cleanly in SafeD code too.

So in this case consider returning a boolean and improving the compiler
instead. DMD (currently v2.047) is probably not able to perform this
optimization.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Jul 16 2010
d-bugmail puremagic.com writes:




See also bug 4625

Aug 26 2010
d-bugmail puremagic.com writes:


Stewart Gordon <smjg iname.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |smjg iname.com



From a semantic point of view, in needs to continue to return a pointer in
regular D, or a boolean in SafeD.

But if it's well optimised, then in most use cases the generated code would end
up the same in both cases.

Jan 07 2012
d-bugmail puremagic.com writes:





> From a semantic point of view, in needs to continue to return a pointer in
> regular D, or a boolean in SafeD.
>
> But if it's well optimised, then in most use cases the generated code would end
> up the same in both cases.

I think "in" returning a pointer is a case of premature optimization. LDC shows
that in most real situations a compiler is able to optimize away two nearby
calls to the associative array lookup function into a single call.

So I think a better design for "in" is to always return a boolean, both in safe
and unsafe D code.
Jan 07 2012
d-bugmail puremagic.com writes:


Alex Rønne Petersen <xtzgzorex gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |xtzgzorex gmail.com



06:28:41 PST ---
I would be against making 'in' return bool for AAs. I often do:

if (auto x = foo in someAA)
    // do something with *x

Doing a lookup after checking for foo's presence in someAA is ugly compared to
this.
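
A minimal sketch contrasting the two styles under discussion, assuming an
int[string] table; the function names are hypothetical. The first uses the
pointer returned by 'in'. The second tests membership and then indexes again,
which is the form that depends on the compiler merging the two lookups.

```d
// Hypothetical names: bumpViaPointer, bumpViaIndex, counts.
void bumpViaPointer(ref int[string] counts, string key)
{
    // One lookup: 'in' yields a pointer to the stored value, or null.
    if (auto p = key in counts)
        ++*p;
}

void bumpViaIndex(ref int[string] counts, string key)
{
    // Membership test followed by a second indexing expression.
    if (key in counts)
        ++counts[key];
}

void main()
{
    int[string] counts = ["a": 1];
    bumpViaPointer(counts, "a");
    bumpViaIndex(counts, "a");
    bumpViaPointer(counts, "missing");   // key absent: nothing happens
    assert(counts["a"] == 3);
    assert("missing" !in counts);
}
```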

Jan 07 2012
d-bugmail puremagic.com writes:




06:29:23 PST ---
Furthermore, such a change would break way too much code.

Jan 07 2012
d-bugmail puremagic.com writes:





> I would be against making 'in' return bool for AAs. I often do:
>
> if (auto x = foo in someAA)
>     // do something with *x
>
> Doing a lookup after checking for foo's presence in someAA is ugly compared to
> this.

Ugly is returning a pointer in a language like D where pointers are usually not
necessary. What's bad/ugly in code like this? I think it's more readable:

if (foo in someAA) {
    // do something with someAA[foo]
}
Jan 07 2012
d-bugmail puremagic.com writes:




07:22:48 PST ---
If you need to use x multiple times inside the if statement's true branch, you
end up having to declare a variable, e.g.:

if (foo in someAA)
{
    auto x = someAA[foo];
    someFunction(otherStuff, x, x, moreStuff);
}

As opposed to:

if (auto x = foo in someAA)
    someFunction(otherStuff, *x, *x, moreStuff);

I don't see why pointers are so bad. While, yes, D is a high-level language, it


Jan 07 2012
d-bugmail puremagic.com writes:






> I don't see why pointers are so bad. While, yes, D is a high-level language, it

Pointers are not evil, but they are usually more bug-prone. An example from
simendsjo:
http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D.learn&article_id=31482

> aa["a"] = new C();
> auto c = "a" in aa;
> aa["b"] = new C();
> // Using c here is undefined as an element was added to aa

This can't happen if "in" returns a bool.
Jan 08 2012
d-bugmail puremagic.com writes:


hsteoh quickfur.ath.cx changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hsteoh quickfur.ath.cx




[...]
> aa["a"] = new C();
> auto c = "a" in aa;
> aa["b"] = new C();
> // Using c here is undefined as an element was added to aa
> This can't happen if "in" returns a bool.

Actually, that is not undefined. AA's are designed such that inserting new
elements does not invalidate pointers to existing elements. In D, because we
have a GC, even if you *delete* elements from AA's, pointers returned by 'in'
continue to be valid.

This holds even in the event of a rehash, because the pointer points to data in
a Slot, and add/remove/rehash only shuffle pointers in the Slot, it doesn't move
the Slot around in memory.
Aug 15 2013
d-bugmail puremagic.com writes:






> Actually, that is not undefined. AA's are designed such that inserting new
> elements does not invalidate pointers to existing elements.

I didn't know this. Is this stated somewhere in the D specs?

> This holds even in the event of a rehash,

Associative arrays have to grow when you keep adding key-value pairs. I presume
this is done by allocating a new, larger hash (probably 2 or 4 times larger) and
copying the data into it. In such a situation, don't the pointers to the items
become invalid? Even if the doubling is done with a realloc, it sometimes cannot
reallocate in place.

To test my theory I have written a small test program:

void main() {
    enum size_t N = 1_000_000;
    bool[immutable uint] aa;
    auto pointers = new void*[N];

    foreach (immutable uint i; 0 .. N) {
        aa[i] = true;
        pointers[i] = i in aa;
    }

    foreach (immutable uint i; 0 .. N)
        assert(pointers[i] == (i in aa));
}

It gives no errors, so I am not understanding something. But do the D specs
guarantee that this program will work in all D implementations?
Aug 15 2013
d-bugmail puremagic.com writes:





[...]
> Associative arrays have to grow when you keep adding key-value pairs. I presume
> this is done by allocating a new, larger hash (probably 2 or 4 times larger) and
> copying the data into it. In such a situation, don't the pointers to the items
> become invalid? Even if the doubling is done with a realloc, it sometimes cannot
> reallocate in place.

The reason it works is that the hash table itself doesn't contain the actual
key/value pairs; it just contains pointers to linked-lists of these key/value
pairs. So the hash table can be modified however you like, but the key/value
pairs stay in the same memory address.

This would work even if we used something other than linked-lists for the
key/value pairs, e.g., trees, because the key/value pairs would just have some
pointers to neighbouring nodes, and during AA rehash (or add/delete) all that
happens is that some of these pointers get reassigned, but the node itself
(containing the key/value pair) remains in the same memory address. This kind
of implementation avoids copying/moving of keys and values, so I'd expect any
good AA implementation to do something similar.

I'm pretty sure that it's generally expected that AA implementations should
obey the principle that iterators (i.e. pointers to key/value) are not
invalidated by add/delete, otherwise it would greatly reduce the usefulness of
AA's. I'm not too sure about this also holding for rehash, but the current AA
implementation does indeed preserve references across rehash as well (though it
does break iteration order if you trigger a rehash in the middle of iterating
over the AA -- but you won't get invalid pointers out of it).
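
The layout described above can be sketched as a toy chained table. This is an
illustration under assumed names (Node, ChainedTable), not the druntime
implementation: each pair lives in its own heap node, and a rehash only relinks
the existing nodes into a new bucket array.

```d
// Hypothetical toy table, NOT the actual druntime AA.
struct Node
{
    uint key;
    bool value;
    Node* next;   // chain link; the only field a rehash rewrites
}

struct ChainedTable
{
    Node*[] buckets;
    size_t count;

    // Returns a pointer to the stored value, or null (similar to 'in').
    bool* find(uint key)
    {
        if (buckets.length == 0)
            return null;
        for (auto n = buckets[key % buckets.length]; n !is null; n = n.next)
            if (n.key == key)
                return &n.value;
        return null;
    }

    void insert(uint key, bool value)
    {
        if (auto p = find(key)) { *p = value; return; }
        if (count >= buckets.length)
            rehash(buckets.length ? buckets.length * 2 : 4);
        auto n = new Node;          // the pair gets its own allocation
        n.key = key;
        n.value = value;
        auto i = key % buckets.length;
        n.next = buckets[i];
        buckets[i] = n;
        ++count;
    }

    // Allocates a new bucket array and relinks the existing nodes into it.
    void rehash(size_t newSize)
    {
        auto fresh = new Node*[](newSize);
        foreach (chain; buckets)
        {
            auto node = chain;
            while (node !is null)
            {
                auto next = node.next;
                auto i = node.key % newSize;
                node.next = fresh[i];   // only the link changes
                fresh[i] = node;        // the node itself is not copied
                node = next;
            }
        }
        buckets = fresh;
    }
}

void main()
{
    ChainedTable t;
    bool*[] seen;
    foreach (uint i; 0 .. 1000)
    {
        t.insert(i, true);
        seen ~= t.find(i);
    }
    // Several rehashes have happened; each earlier pointer still matches.
    foreach (uint i; 0 .. 1000)
        assert(seen[i] is t.find(i));
}
```

In this design a pointer handed out by find stays attached to the live entry
across growth, because growth never touches the node holding the pair.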
Aug 15 2013
d-bugmail puremagic.com writes:






> the hash table itself doesn't contain the
> actual key/value pairs; it just contains pointers to linked-lists of these
> key/value pairs. So the hash table can be modified however you like, but the
> key/value pairs stay in the same memory address.

I see. But that's just an implementation detail. You could design an AA that
keeps small key-value pairs in an array, plus a pointer to a chain for the
collisions; this is how I have created associative arrays in C. The D specs
can't mandate that implementation, so D code that relies on that implementation
detail goes into the realm of undefined behaviour.
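
For contrast, a sketch of a table that stores pairs inline in its array, using
hypothetical names (Entry, InlineTable). Linear probing stands in here for the
collision chain mentioned above, to keep the example short. Growing the table
copies the pairs into a new array, so a pointer taken earlier no longer refers
to the live entry.

```d
// Hypothetical inline-storage table, for illustration only.
struct Entry { uint key; bool value; bool used; }

struct InlineTable
{
    Entry[] entries;
    size_t count;

    // Returns a pointer into the current array, or null.
    bool* find(uint key)
    {
        if (entries.length == 0)
            return null;
        auto i = key % entries.length;
        foreach (_; 0 .. entries.length)
        {
            if (!entries[i].used) return null;
            if (entries[i].key == key) return &entries[i].value;
            i = (i + 1) % entries.length;   // linear probing
        }
        return null;
    }

    void insert(uint key, bool value)
    {
        if (auto p = find(key)) { *p = value; return; }
        if ((count + 1) * 2 > entries.length)   // keep at most half full
            grow();
        place(entries, key, value);
        ++count;
    }

    private static void place(Entry[] dst, uint key, bool value)
    {
        auto i = key % dst.length;
        while (dst[i].used)
            i = (i + 1) % dst.length;
        dst[i] = Entry(key, value, true);
    }

    // Growth copies every pair into a freshly allocated array.
    private void grow()
    {
        auto bigger = new Entry[](entries.length ? entries.length * 2 : 8);
        foreach (e; entries)
            if (e.used)
                place(bigger, e.key, e.value);
        entries = bigger;
    }
}

void main()
{
    InlineTable t;
    t.insert(1, true);
    auto before = t.find(1);
    foreach (uint i; 2 .. 100)
        t.insert(i, true);

    // The pair now lives in a different array; 'before' is stale.
    assert(before !is t.find(1));
    *before = false;               // writes to the old copy only
    assert(*t.find(1) == true);
}
```

With a garbage collector the stale pointer still refers to allocated memory,
but it no longer tracks the entry held by the table.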
Aug 15 2013