www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.learn - three little issues

reply spir <denis.spir gmail.com> writes:
Hello,

Here are three little issues I faced while implemented a lexing toolkit (see 
other post).

1. Regex match

Let us say there are three "natures" or "modes" of lexeme:
* SKIP: not even kept, just matched and dropped (eg optional spacing)
* MARK: kept, but slice is irrelevant data (eg all kinds of punctuation)
* DATA: slice is necessary data (eg constant value or symbol)

For the 2 first cases, I still need to get the size ot the matched slice, to 
advance in source by the corresponding offset. Is there a way to get this 
information without fetching the slice by calling hit()?

Also, I would like to know when Regex.hit() copies or slices.


2. reference escape

This is a little enigma I face somewhere in this module. Say S is a struct:
     ...
     auto s = S(data);
     return &s;
This code is obvioulsy wrong and the compiler gently warns me about that. But 
the variant below is allowed and more, seems towork fine:
     return &(S(data);
For me, both versions are synonym. Thus, why does the compiler accept the 
latter and why does it work? Any later use to the returned struct (recorded in 
an array) should miserably fail with segfault. (*)
Or is it that the compiler recognises the idiom and implicitely allocates the 
struct outside the local stack?
Example:

struct S { int i; }

S* newS (int i) {
     if (i < 0)
         return null;
//  auto s = S(i);
//  return &s;  // Error: escaping reference to local s
     return &(S(i));
}

unittest {
     int[] ints = [2, -2, 1, -1, 0];
     S[] structs;
     foreach (i ; ints) {
         auto p = newS(i);
         if (p) {
             structs ~= *p;      // explicite deref!
         }
     }
     assert ( structs == [S(2), S(1), S(0)] );   // pass!
}

How can this work?


3. implicite deref

But there is even more mysterious for me: if I first access the struct before 
recording it like in:

unittest {
     int[] ints = [2, -2, 1, -1, 0];
     S[] structs;
     foreach (i ; ints) {
         auto p = newS(i);
         if (p) {
             write (p.i,' ');    // implicite deref!
             structs ~= *p;      // explicite deref!
         }
     }
     assert ( structs == [S(2), S(1), S(0)] );   // pass!
}

...then the final assert fails!? But the written i's are correct ("2 1 0").
Worse, if I exchange the two deref lines:

unittest {
     int[] ints = [2, -2, 1, -1, 0];
     S[] structs;
     foreach (i ; ints) {
         auto p = newS(i);
         if (p) {
             structs ~= *p;      // explicite deref!
             write (p.i,' ');    // implicite deref!
         }
     }
     assert ( structs == [S(2), S(1), S(0)] );   // pass!
}

...then the assertion passes, but the written integers are wrong (looks like 
either garbage or an address, repeated 3 times, eg: "134518949 134518949 
134518949"; successive runs constantly produce the same value).


Denis
-- 
_________________
vita es estrany
spir.wikidot.com
Feb 06 2011
parent reply bearophile <bearophileHUGS lycos.com> writes:
spir:

 2. reference escape
 3. implicite deref
The situation is easy to understand once you know how generally what a stack frame is and how C functions are called: http://en.wikipedia.org/wiki/Stack_frame The D call stack is a contiguous-allocated backwards-single-linked list of differently-sized records, each record is a stack frame, and the whole data structure is of course managed as stack :-) When you have similar doubts I also suggest you to take a look at the asm DMD generates. Writing asm requires some work, but reading a bit of asm is something you may learn in few days or even one day. Before a D function starts, a stack frame is created. It will contain your stack-allocated struct instance. When the function ends its stack frame is destroyed virtually by moving a stack pointer, so the struct may be overwritten by other things, like by a call to writeln that creates many stack frames. If the stack frame is not overwritten and you save by *value* the stack contents, you have successfully saved your data in the array of S, but accessing virtually deleted data in the stack is a bad practice to avoid. Bye, bearophile
Feb 06 2011
parent reply spir <denis.spir gmail.com> writes:
On 02/06/2011 02:13 PM, bearophile wrote:
 Before a D function starts, a stack frame is created. It will contain your
stack-allocated struct instance. When the function ends its stack frame is
destroyed virtually by moving a stack pointer, so the struct may be overwritten
by other things, like by a call to writeln that creates many stack frames. If
the stack frame is not overwritten and you save by*value*  the stack contents,
you have successfully saved your data in the array of S, but accessing
virtually deleted data in the stack is a bad practice to avoid.
Right, I may be successful to store by value as you say, before the frame is overwritten, and so-to-say by chance. But this does not explain why the compiler refuses: // 1 auto s = S(data); return &s; and accepts: // 2 return &(S(data)); or does it? What are the supposed differences in semantics or behaviour, if any? For (naive) me, these 2 pieces of code are exactly synonym (and I would be happy with the compiler suppressing s in 1 or instead creating an intermediate var in 2, whatever it judges better). I have a third version, in the case where I need to check something in s before returning its address: // 3 auto p = &(S(data)); if ((*p).check()) return null; return p; (This is just a synopsis). I need to write it that way, else it's refused. (I mean I cannot first have an s var explicitely, check on it directly, then take it's address as return value). Another use case is where S's are in an array, else both the synopsis and the solution are analog to the last code above. Since they are struct values, I use a pointer to avoid a useless local copy. What do you think of this idiom? Is it common? Is it good at all? Real code: /** AST Node constructed from lexeme of type typeCode, if any, at current position in lexeme stream --else null. Node's constructor must expect the lexeme's slice as (only) input. */ Node node (Node) (string typeCode) if (is(Node == class)) { // Avoid useless local copy of lexeme by using pointer // (instead of local struct variable). Lexeme* pointer = &(this.lexemes[this.cursor]); if ((*pointer).typeCode == typeCode) { ++ this.cursor; return new Node((*pointer).slice); } return null; } My "spontaneous" version of this code would indeed be: Node node (Node) (string typeCode) if (is(Node == class)) { Lexeme lexeme = this.lexemes[this.cursor]; if (lexeme.typeCode == typeCode) { ++ this.cursor; return new Node(lexeme.slice); } return null; } The aim is avoiding copying pieces of the (plain text) source when lexing, parsing, constructing the AST. If I'm right in analysing my app as of now, I have, thank to D's "view" slices, exactly 0 copy from source text to AST. Meaning even AST nodes which hold a piece of the source text (strings, symbols, maybe more) actually have a view of the very original source. In any other language (or is it in my pr2vious coding style?), I would have copied at the very minimum once. Even in a dynamic language (which strings are indeed ref'ed), to create the first "slice" (in the sense of substring). Thank you, Walter! Denis -- _________________ vita es estrany spir.wikidot.com
Feb 06 2011
parent bearophile <bearophileHUGS lycos.com> writes:
spir:

 But this does not explain why the compiler refuses:
 	// 1
 	auto s = S(data);
 	return &s;
 and accepts:
 	// 2
 	return &(S(data));
 or does it?
Accepting the second is a bug in the escape analysis done by the front-end, I think. But see also what Walter has invented here: http://en.wikipedia.org/wiki/Return_value_optimization
 What are the supposed differences in semantics or behaviour, if any?
Regarding what the compiler actually does, take a look at the produced asm.
 (This is just a synopsis). I need to write it that way, else it's refused.
Don't return pointers to memory present in to-be-deleted stack frames. Bye, bearophile
Feb 06 2011