digitalmars.D.learn - Associative array issue

"Igor Kolesnik" <shadowmaan gmail.com> writes:
Hi,

I'm trying to run an example from the tutorial at
http://www.informit.com/articles/article.aspx?p=1381876&seqNum=4
Here is the code:

import std.stdio, std.string;

void main() {
   uint[string] dic;
   foreach (line; stdin.byLine) {
     string[] words = cast(string[])split(strip(line));
     foreach (word; words) {
       if (word in dic)
         continue;
       uint id = dic.length;
       dic[word] = id;
       writeln(id, '\t', word);
     }
   }
   //foreach (k,v; dic)
   //  writeln(k, '|', v);
}

When run, it behaves strangely. Here is an example of the
input/output I get:

the type of array
0       the
1       type
2       of
3       array
in d the type of array
4       in
5       d
6       the
7       type
8       of
9       array

It seems that 'word in dic' doesn't find words that are already in
the array.
If I print the contents of the 'dic' array on exit, I get the
following:

d|5
in |0
e of |3
in|4
the|6
array|9
  the|1
type|7
ty|2
of|8

Can someone help me understand what is going wrong? Am I missing 
something here?

P.S. dmd32 v2.061 on Win7 x64

Sincerely,
Igor
Jan 23 2013
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Jan 23, 2013 at 09:07:24PM +0100, Igor Kolesnik wrote:
[...]
> import std.stdio, std.string;
>
> void main() {
>   uint[string] dic;
>   foreach (line; stdin.byLine) {
>     string[] words = cast(string[])split(strip(line));
>     foreach (word; words) {
>       if (word in dic)
>         continue;
>       uint id = dic.length;
>       dic[word] = id;
>       writeln(id, '\t', word);
>     }
>   }
>   //foreach (k,v; dic)
>   //  writeln(k, '|', v);
> }
>
> When run it behaves somehow strange. Here is an example of the
> input/output I get
[...]

This is a known issue with stdin.byLine: it is a transient range, meaning
that it reuses the same buffer for each line read from the input. The
problem is that split returns slices of the line, which ultimately refer
back to the data in that buffer. By the time byLine reads the next line,
that data has been overwritten. That's why the associative array is
messed up.

There's a slight hint of this problem in your code, in the line that
starts with "string[] words = cast(string[])...". In normal D code, you
should not need to perform this kind of cast. Here it is an unsafe
operation: string is immutable(char)[], but the reused buffer returned by
byLine is *not* immutable, so by casting mutable data to immutable, you've
inadvertently introduced yourself to the buffer reuse issue in byLine. :)

The correct way to write that line is:

	string[] words = split(strip(line.idup));

which will copy the buffer, thereby ensuring it's safe to keep slices of
it in your associative array, and also return the correct type so that
no cast is necessary.


T

-- 
Notwithstanding the eloquent discontent that you have just respectfully
expressed at length against my verbal capabilities, I am afraid that I
must unfortunately bring it to your attention that I am, in fact, NOT
verbose.
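
For illustration, here is a minimal sketch of the aliasing problem and of the .idup fix described above. It is not the tutorial's code: addWords is a hypothetical helper name, an in-memory char[] buffer stands in for the buffer that stdin.byLine reuses, and the cast(uint) is an addition so that the sketch also compiles on 64-bit targets, where dic.length is wider than uint.

```d
import std.array : split;
import std.string : strip;

// Hypothetical helper: gives each previously unseen word the next free id.
// The line is copied with .idup first, so the keys stored in the
// associative array never alias the caller's (possibly reused) buffer.
void addWords(ref uint[string] dic, const(char)[] line)
{
    foreach (word; split(strip(line.idup)))
    {
        if (word in dic)
            continue;
        immutable id = cast(uint) dic.length;
        dic[word] = id;
    }
}

void main()
{
    // Part 1: a slice of a mutable buffer changes when the buffer does,
    // while an .idup copy does not.
    char[] buf = "the type".dup;
    const(char)[] slice = buf[0 .. 3];   // still points into buf
    string copy = buf[0 .. 3].idup;      // independent immutable copy
    buf[0] = 'X';                        // imitate the next read overwriting buf
    assert(slice == "Xhe");
    assert(copy == "the");

    // Part 2: with .idup, words seen earlier are still found later.
    uint[string] dic;
    addWords(dic, "the type of array");
    addWords(dic, "in d the type of array");
    assert(dic.length == 6);
    assert(dic["the"] == 0);
    assert(dic["in"] == 4);
}
```

In the original program the same idea amounts to the one-line change quoted above; the helper exists here only so that the behaviour can be checked without reading from stdin.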
Jan 23 2013
"Igor Kolesnik" <shadowmaan gmail.com> writes:
> The correct way to write that line is:
>
> 	string[] words = split(strip(line.idup));
>
> which will copy the buffer, thereby ensuring it's safe to keep slices of
> it in your associative array, and also return the correct type so that
> no cast is necessary.
>
>
> T

This makes sense. Thanks a lot!

Igor
Jan 23 2013