www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Memory Corruption with AAs

reply dsimcha <dsimcha yahoo.com> writes:
Has anyone else still been noticing difficult to reproduce memory corruption
issues in the presence of associative arrays with 2.042?  They seem to happen
very infrequently and non-deterministically.  I can only reproduce them in the
context of a large program.  However, they don't occur in 2.040 (the release
before the array stomping patch), and they are clearly a result of memory
corruption, as contents of arrays change from what I expect them to be to
completely random-looking values inside a loop that does a lot of memory
management and uses AAs heavily but doesn't modify the values.
Apr 02 2010
next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
dsimcha wrote:
 Has anyone else still been noticing difficult to reproduce memory corruption
 issues in the presence of associative arrays with 2.042?  They seem to happen
 very infrequently and non-deterministically.  I can only reproduce them in the
 context of a large program.  However, they don't occur in 2.040 (the release
 before the array stomping patch), and they are clearly a result of memory
 corruption, as contents of arrays change from what I expect them to be to
 completely random-looking values inside a loop that does a lot of memory
 management and uses AAs heavily but doesn't modify the values.
1. is it multithreaded? 2. does your code have any dangling pointers into AAs?
Apr 02 2010
parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Walter Bright (newshound1 digitalmars.com)'s article
 dsimcha wrote:
 Has anyone else still been noticing difficult to reproduce memory corruption
 issues in the presence of associative arrays with 2.042?  They seem to happen
 very infrequently and non-deterministically.  I can only reproduce them in the
 context of a large program.  However, they don't occur in 2.040 (the release
 before the array stomping patch), and they are clearly a result of memory
 corruption, as contents of arrays change from what I expect them to be to
 completely random-looking values inside a loop that does a lot of memory
 management and uses AAs heavily but doesn't modify the values.
1. is it multithreaded? 2. does your code have any dangling pointers into AAs?
The program as a whole is multithreaded, but the part where the bug occurs is an initialization routine that is executed before any threads other than the main one are launched. As far as the dangling pointers question, I don't understand how there could be dangling pointers into GC-managed memory, since if there are pointers to it, it won't be freed. (Ignoring dirty tricks that I'm not using in this case.)
Apr 02 2010
parent reply Walter Bright <newshound1 digitalmars.com> writes:
dsimcha wrote:
 == Quote from Walter Bright (newshound1 digitalmars.com)'s article
 dsimcha wrote:
 Has anyone else still been noticing difficult to reproduce memory corruption
 issues in the presence of associative arrays with 2.042?  They seem to happen
 very infrequently and non-deterministically.  I can only reproduce them in the
 context of a large program.  However, they don't occur in 2.040 (the release
 before the array stomping patch), and they are clearly a result of memory
 corruption, as contents of arrays change from what I expect them to be to
 completely random-looking values inside a loop that does a lot of memory
 management and uses AAs heavily but doesn't modify the values.
1. is it multithreaded? 2. does your code have any dangling pointers into AAs?
The program as a whole is multithreaded, but the part where the bug occurs is an initialization routine that is executed before any threads other than the main one are launched.
It should be easier to find then, by removing all the main code and everything it calls.
 
 As far as the dangling pointers question, I don't understand how there could be
 dangling pointers into GC-managed memory, since if there are pointers to it, it
 won't be freed.  (Ignoring dirty tricks that I'm not using in this case.)
What I meant was, do you save any pointers into the AAs, as in: auto p = &aa[key]; ?
Apr 02 2010
parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Walter Bright (newshound1 digitalmars.com)'s article
 dsimcha wrote:
 == Quote from Walter Bright (newshound1 digitalmars.com)'s article
 dsimcha wrote:
 Has anyone else still been noticing difficult to reproduce memory corruption
 issues in the presence of associative arrays with 2.042?  They seem to happen
 very infrequently and non-deterministically.  I can only reproduce them in the
 context of a large program.  However, they don't occur in 2.040 (the release
 before the array stomping patch), and they are clearly a result of memory
 corruption, as contents of arrays change from what I expect them to be to
 completely random-looking values inside a loop that does a lot of memory
 management and uses AAs heavily but doesn't modify the values.
1. is it multithreaded? 2. does your code have any dangling pointers into AAs?
The program as a whole is multithreaded, but the part where the bug occurs is an initialization routine that is executed before any threads other than the main one are launched.
It should be easier to find then, by removing all the main code and everything it calls.
The code has so many dependencies (both other code from the same project and libraries) and is such a mess (because it's a research prototype that evolved more than it was designed and also has all kinds of speed hacks) that it would probably be easier to try to reproduce it from scratch. I'll try tonight because I've got a long train ride with nothing else to do anyhow.
 As far as the dangling pointers question, I don't understand how there could be
 dangling pointers into GC-managed memory, since if there are pointers to it, it
 won't be freed.  (Ignoring dirty tricks that I'm not using in this case.)
What I meant was, do you save any pointers into the AAs, as in: auto p = &aa[key]; ?
No, I definitely wasn't. I almost never do this with any data structure other than an array because, even if it works for now, I consider it a horrible violation of encapsulation because you're relying on the details of how the data structure manipulates memory. This is also why, when I designed RandAA I didn't see this as an issue until you pointed it out to me.
Apr 02 2010
next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
dsimcha wrote:
 I almost never do this with any data structure other
 than an array because, even if it works for now, I consider it a horrible
 violation of encapsulation because you're relying on the details of how the
data
 structure manipulates memory.  This is also why, when I designed RandAA I
didn't
 see this as an issue until you pointed it out to me.
Andrei is working on the design of the D collection class library. After much thought and research, he finally came to the conclusion that a collection class should not allow the address of a member to be taken. I think his reasoning on the issue is pretty sound, and is consistent with your take on it.
Apr 02 2010
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 04/02/2010 03:53 PM, Walter Bright wrote:
 dsimcha wrote:
 I almost never do this with any data structure other
 than an array because, even if it works for now, I consider it a horrible
 violation of encapsulation because you're relying on the details of
 how the data
 structure manipulates memory. This is also why, when I designed RandAA
 I didn't
 see this as an issue until you pointed it out to me.
Andrei is working on the design of the D collection class library. After much thought and research, he finally came to the conclusion that a collection class should not allow the address of a member to be taken. I think his reasoning on the issue is pretty sound, and is consistent with your take on it.
I wouldn't call it research, but I agonized a fair amount over it. I think Phobos containers will all use malloc, realloc, and free for their own storage, while still being safe. Andrei
Apr 03 2010
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2010-04-03 23:21:48 -0400, Andrei Alexandrescu 
<SeeWebsiteForEmail erdani.org> said:

 On 04/02/2010 03:53 PM, Walter Bright wrote:
 dsimcha wrote:
 I almost never do this with any data structure other
 than an array because, even if it works for now, I consider it a horrible
 violation of encapsulation because you're relying on the details of
 how the data
 structure manipulates memory. This is also why, when I designed RandAA
 I didn't
 see this as an issue until you pointed it out to me.
Andrei is working on the design of the D collection class library. After much thought and research, he finally came to the conclusion that a collection class should not allow the address of a member to be taken. I think his reasoning on the issue is pretty sound, and is consistent with your take on it.
I wouldn't call it research, but I agonized a fair amount over it. I think Phobos containers will all use malloc, realloc, and free for their own storage, while still being safe.
I think this is a sound decision. And I'm not necessarily talking about using malloc, realloc, and free (even though a container capable of using realloc is certainly a plus), but the one about decoupling the container interface from any particular memory management implementation. Question: if the container's memory isn't garbage-collected, how do you implement iterators, eh, ranges so that they are still memory-safe? -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Apr 04 2010
next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Michel Fortin (michel.fortin michelf.com)'s article
 Question: if the container's memory isn't garbage-collected, how do you
 implement iterators, eh, ranges so that they are still memory-safe?
The way I'm picturing this being implemented is that a GC'd class instance exists at the top level, and then the internal implementation-detail storage that the class uses is implemented via malloc and free. This storage would get freed in the class finalizer when the instance is GC'd. In this case all you'd need to do is make the range hold a reference to the class instance so it wouldn't be GC'd.
Apr 04 2010
parent Michel Fortin <michel.fortin michelf.com> writes:
On 2010-04-04 09:45:36 -0400, dsimcha <dsimcha yahoo.com> said:

 == Quote from Michel Fortin (michel.fortin michelf.com)'s article
 Question: if the container's memory isn't garbage-collected, how do you
 implement iterators, eh, ranges so that they are still memory-safe?
The way I'm picturing this being implemented is that a GC'd class instance exists at the top level, and then the internal implementation-detail storage that the class uses is implemented via malloc and free. This storage would get freed in the class finalizer when the instance is GC'd. In this case all you'd need to do is make the range hold a reference to the class instance so it wouldn't be GC'd.
That wouldn't work with realloc: realloc copies to a new location then frees the old memory if it cannot expand in place. You can't keep the old copy allocated. I've been thinking of another method to ensure safety: don't allow expanding a container as long as there are ranges pointing to it. Easily implemented with a reference count. For instance, if you have a vector container, expanding the container would invalidate ranges. Instead of allowing ranges to become invalid and potentially dangerous, just disallow expanding the container. The range would contain a pointer to its upper and lower bound, and a pointer to the container to increment the reference count when it's copied and decrement it when it's destroyed. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Apr 04 2010
prev sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sun, 04 Apr 2010 09:28:44 -0400, Michel Fortin  
<michel.fortin michelf.com> wrote:

 On 2010-04-03 23:21:48 -0400, Andrei Alexandrescu  
 <SeeWebsiteForEmail erdani.org> said:

 On 04/02/2010 03:53 PM, Walter Bright wrote:
 dsimcha wrote:
 I almost never do this with any data structure other
 than an array because, even if it works for now, I consider it a  
 horrible
 violation of encapsulation because you're relying on the details of
 how the data
 structure manipulates memory. This is also why, when I designed RandAA
 I didn't
 see this as an issue until you pointed it out to me.
Andrei is working on the design of the D collection class library. After much thought and research, he finally came to the conclusion that a collection class should not allow the address of a member to be taken. I think his reasoning on the issue is pretty sound, and is consistent with your take on it.
I wouldn't call it research, but I agonized a fair amount over it. I think Phobos containers will all use malloc, realloc, and free for their own storage, while still being safe.
I think this is a sound decision. And I'm not necessarily talking about using malloc, realloc, and free (even though a container capable of using realloc is certainly a plus), but the one about decoupling the container interface from any particular memory management implementation. Question: if the container's memory isn't garbage-collected, how do you implement iterators, eh, ranges so that they are still memory-safe?
Another problem is if the elements of the container have references to GC-managed data. This means you have to addroot any memory you allocate with malloc. Non-reference type elements of course can use C's malloc and free. This is how Tango works. -Steve
Apr 05 2010
prev sibling parent reply Jordi <jordi rovira.cat> writes:
dsimcha wrote:
 == Quote from Walter Bright (newshound1 digitalmars.com)'s article
 dsimcha wrote:
 == Quote from Walter Bright (newshound1 digitalmars.com)'s article
 dsimcha wrote:
 Has anyone else still been noticing difficult to reproduce memory corruption
 issues in the presence of associative arrays with 2.042?  They seem to happen
 very infrequently and non-deterministically.  I can only reproduce them in the
 context of a large program.  However, they don't occur in 2.040 (the release
 before the array stomping patch), and they are clearly a result of memory
 corruption, as contents of arrays change from what I expect them to be to
 completely random-looking values inside a loop that does a lot of memory
 management and uses AAs heavily but doesn't modify the values.
1. is it multithreaded? 2. does your code have any dangling pointers into AAs?
The program as a whole is multithreaded, but the part where the bug occurs is an initialization routine that is executed before any threads other than the main one are launched.
It should be easier to find then, by removing all the main code and everything it calls.
The code has so many dependencies (both other code from the same project and libraries) and is such a mess (because it's a research prototype that evolved more than it was designed and also has all kinds of speed hacks) that it would probably be easier to try to reproduce it from scratch. I'll try tonight because I've got a long train ride with nothing else to do anyhow.
 As far as the dangling pointers question, I don't understand how there could be
 dangling pointers into GC-managed memory, since if there are pointers to it, it
 won't be freed.  (Ignoring dirty tricks that I'm not using in this case.)
What I meant was, do you save any pointers into the AAs, as in: auto p = &aa[key]; ?
No, I definitely wasn't. I almost never do this with any data structure other than an array because, even if it works for now, I consider it a horrible violation of encapsulation because you're relying on the details of how the data structure manipulates memory. This is also why, when I designed RandAA I didn't see this as an issue until you pointed it out to me.
I am having exactly the same situation. My project, which is also quite big, crashes randomly in 2.041 and 2.042 (without patches), but it works fine in 2.040. I didn't bother reporting because i thought i was doing something wrong (being a newbie in D). Mine is single-threaded, no pointers in AAs. j.
Apr 02 2010
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Jordi wrote:
 I am having exactly the same situation.
Any chance you can reduce it to a small test case?
Apr 02 2010
parent reply Jordi <jordi rovira.cat> writes:
Walter Bright wrote:
 Jordi wrote:
 I am having exactly the same situation.
Any chance you can reduce it to a small test case?
Actually i am trying to add an "autotest" mode to my project to be able to test and benchmark different compiler versions, compilation options and even GC implementations in the context of a realistic application (in size, kind of things it does and "average" quality of code). However i don't expect to be able to narrow this down in the short term, due to the relatively big amount of data involved and the randomness of its occurrence. Sometimes it is a crash somewhere gdb cannot provide info on, sometimes i get my own assert in places i am pretty sure that are not possible unless i get some corruption or wrong result from an AA. And finally, i am not entirely sure yet that it is really the compiler's fault... if i manage to have a "send-able" case i will submit it in the bugtracker. j.
Apr 02 2010
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Jordi wrote:
 Walter Bright wrote:
 Jordi wrote:
 I am having exactly the same situation.
Any chance you can reduce it to a small test case?
Actually i am trying to add an "autotest" mode to my project to be able to test and benchmark different compiler versions, compilation options and even GC implementations in the context of a realistic application (in size, kind of things it does and "average" quality of code). However i don't expect to be able to narrow this down in the short term, due to the relatively big amount of data involved and the randomness of its occurrence. Sometimes it is a crash somewhere gdb cannot provide info on, sometimes i get my own assert in places i am pretty sure that are not possible unless i get some corruption or wrong result from an AA. And finally, i am not entirely sure yet that it is really the compiler's fault... if i manage to have a "send-able" case i will submit it in the bugtracker.
Since it is single-threaded, it should crash in the same place in the same way every time. This means you can put an assert on the crashing data (even without gdb it can be found by inserting printf's), and slowly work it backward to where the data goes wrong.
Apr 02 2010
parent reply Steve Teale <steve.teale britseyeview.com> writes:
On Fri, 02 Apr 2010 23:01:48 -0700, Walter Bright wrote:
 
 Since it is single-threaded, it should crash in the same place in the
same way every time. This means you can put an assert on the crashing data (even without gdb it can be found by inserting printf's), and slowly work it backward to where the data goes wrong.
According to an article I was reading that constancy does not apply with current Linux kernels. They deliberately randomize things to make life difficult for hackers. I'll try to find it again - it's about debugging GCC itself. Steve
Apr 02 2010
parent reply Steve Teale <steve.teale britseyeview.com> writes:
On Sat, 03 Apr 2010 06:24:20 +0000, Steve Teale wrote:

 On Fri, 02 Apr 2010 23:01:48 -0700, Walter Bright wrote:
  
 Since it is single-threaded, it should crash in the same place in
 the
same way every time. This means you can put an assert on the crashing data (even without gdb it can be found by inserting printf's), and slowly work it backward to where the data goes wrong.
According to an article I was reading that constancy does not apply with current Linux kernels. They deliberately randomize things to make life difficult for hackers. I'll try to find it again - it's about debugging GCC itself. Steve
Got it - http://gcc.gnu.org/wiki/DebuggingGCC Right at the end.
Apr 02 2010
next sibling parent Jordi <jordi rovira.cat> writes:
Steve Teale wrote:
 On Sat, 03 Apr 2010 06:24:20 +0000, Steve Teale wrote:
 
 On Fri, 02 Apr 2010 23:01:48 -0700, Walter Bright wrote:
  
 Since it is single-threaded, it should crash in the same place in
 the
same way every time. This means you can put an assert on the crashing data (even without gdb it can be found by inserting printf's), and slowly work it backward to where the data goes wrong.
According to an article I was reading that constancy does not apply with current Linux kernels. They deliberately randomize things to make life difficult for hackers. I'll try to find it again - it's about debugging GCC itself. Steve
Got it - http://gcc.gnu.org/wiki/DebuggingGCC Right at the end.
Well, i don't know if this applies to my case, but it is definitely random: - Compile - Run - it crashes after some interaction. - Run the same again - crashes immediately. I am currently trying to run it in windows to see what comes out. Trying to narrow it down makes it disappear in all of my attempts so far. j.
Apr 03 2010
prev sibling parent Walter Bright <newshound1 digitalmars.com> writes:
Steve Teale wrote:
 On Sat, 03 Apr 2010 06:24:20 +0000, Steve Teale wrote:
 
 On Fri, 02 Apr 2010 23:01:48 -0700, Walter Bright wrote:
  
 Since it is single-threaded, it should crash in the same place in
 the
same way every time. This means you can put an assert on the crashing data (even without gdb it can be found by inserting printf's), and slowly work it backward to where the data goes wrong.
According to an article I was reading that constancy does not apply with current Linux kernels. They deliberately randomize things to make life difficult for hackers. I'll try to find it again - it's about debugging GCC itself. Steve
Got it - http://gcc.gnu.org/wiki/DebuggingGCC Right at the end.
It links to this page which shows how to turn it off: http://gcc.gnu.org/wiki/Randomization
Apr 03 2010
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 02 Apr 2010 13:15:36 -0400, dsimcha <dsimcha yahoo.com> wrote:

 Has anyone else still been noticing difficult to reproduce memory  
 corruption
 issues in the presence of associative arrays with 2.042?  They seem to  
 happen
 very infrequently and non-deterministically.  I can only reproduce them  
 in the
 context of a large program.  However, they don't occur in 2.040 (the  
 release
 before the array stomping patch), and they are clearly a result of memory
 corruption, as contents of arrays change from what I expect them to be to
 completely random-looking values inside a loop that does a lot of memory
 management and uses AAs heavily but doesn't modify the values.
Are you using the latest trunk for druntime, or the stock 2.042 version? Because the AAs have been changed significantly since 2.042. Not saying I'm blaming that, but let's make sure we are all on the same page. What is the AA key/value types? Do you use appending at all in your code? The AA code only uses appending in one location, that is when setting the length of its array. Can you narrow down what your code is doing when the corruption occurs? (I realize this might be impossible, but just thought I'd ask) -Steve
Apr 02 2010
prev sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 02 Apr 2010 13:15:36 -0400, dsimcha <dsimcha yahoo.com> wrote:

 Has anyone else still been noticing difficult to reproduce memory  
 corruption
 issues in the presence of associative arrays with 2.042?  They seem to  
 happen
 very infrequently and non-deterministically.  I can only reproduce them  
 in the
 context of a large program.  However, they don't occur in 2.040 (the  
 release
 before the array stomping patch), and they are clearly a result of memory
 corruption, as contents of arrays change from what I expect them to be to
 completely random-looking values inside a loop that does a lot of memory
 management and uses AAs heavily but doesn't modify the values.
I just thought of a way to rule out or not the stomping patch, as long as your code does not depend on preventing array stomping. Change your runtime's version of lifetime.d to the version before the append stomping patch: http://www.dsource.org/projects/druntime/browser/trunk/src/rt/lifetime.d?rev=251 This version should be forwards compatible with the runtime in 2.042, so you should be able to run your code with this runtime. See if the errors still occur. -Steve
Apr 02 2010