www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.learn - How to use sets in D?

reply mark <mark qtrac.eu> writes:
I am porting code from other languages to D as part of learning 
D, and I find I've used sets quite a lot. AFAIK D doesn't have a 
built-in set type or one in the std. lib.

However, I've been perfectly successfully using int[E] where E is 
my ElementType, and adding with set[element] = 0. I mostly only 
need add, remove, iteration, and in, with uniqueness what I care 
most about.

I know I could use bool[E] and set[element] = false, or I suppose 
container.rbtree. Would either of these--or something else 
built-in or in the std. lib.--be better?
Feb 07 2020
next sibling parent Jan =?UTF-8?B?SMO2bmln?= <hrominium gmail.com> writes:
On Friday, 7 February 2020 at 19:37:08 UTC, mark wrote:
 I am porting code from other languages to D as part of learning 
 D, and I find I've used sets quite a lot. AFAIK D doesn't have 
 a built-in set type or one in the std. lib.

 However, I've been perfectly successfully using int[E] where E 
 is my ElementType, and adding with set[element] = 0. I mostly 
 only need add, remove, iteration, and in, with uniqueness what 
 I care most about.

 I know I could use bool[E] and set[element] = false, or I 
 suppose container.rbtree. Would either of these--or something 
 else built-in or in the std. lib.--be better?
I have not that much experience, but i have asked the exact same thing in the IRC. rbtree was recommended. However, if it is not performance critical (in terms of speed or size), your solution works fine as well.
Feb 07 2020
prev sibling next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Feb 07, 2020 at 07:37:08PM +0000, mark via Digitalmars-d-learn wrote:
 I am porting code from other languages to D as part of learning D, and
 I find I've used sets quite a lot. AFAIK D doesn't have a built-in set
 type or one in the std. lib.
 
 However, I've been perfectly successfully using int[E] where E is my
 ElementType, and adding with set[element] = 0. I mostly only need add,
 remove, iteration, and in, with uniqueness what I care most about.
 
 I know I could use bool[E] and set[element] = false, or I suppose
 container.rbtree. Would either of these--or something else built-in or
 in the std. lib.--be better?
bool[E] works just fine. The bool does take up 1 byte, though, so if you're *really* want to optimize that away, you could do this: alias Unit = void[0]; enum unit = Unit.init; // Look, ma! A bona fide set! Unit[E] mySet; mySet[...] = unit; mySet.remove(...); ... // etc. Or you can wrap void[0][E] in a nice user-defined type that gives nice set-like syntax. But IMO, this is all overkill, and adds needless complexity. Just use bool[E] or std.container.rbtree. :-D T -- If you look at a thing nine hundred and ninety-nine times, you are perfectly safe; if you look at it the thousandth time, you are in frightful danger of seeing it for the first time. -- G. K. Chesterton
Feb 07 2020
parent reply mark <mark qtrac.eu> writes:
On Friday, 7 February 2020 at 22:03:00 UTC, H. S. Teoh wrote:
 On Fri, Feb 07, 2020 at 07:37:08PM +0000, mark via 
 Digitalmars-d-learn wrote:
[snip]
 bool[E] works just fine.
[snip]
 Or you can wrap void[0][E] in a nice user-defined type that 
 gives nice set-like syntax.  But IMO, this is all overkill, and 
 adds needless complexity. Just use bool[E] or 
 std.container.rbtree. :-D
I didn't think bool[E] would be a win because although it is only one byte per item, it won't align so wouldn't it end up taking 4 bytes of space anyway. The void[0][E] you showed is good, but, I'll try using the rbtree since I'd rather use an out-of-the-box collection. Thanks for the replies.
Feb 07 2020
parent reply mark <mark qtrac.eu> writes:
I use sets a lot and am now working on a program that will need 
to hold sets of 65,000+ items, so I thought I do some timings for 
the different approaches.

Here are some timings (uset uses the AA Unit approach, tset uses 
an rbtree, and aset uses an AA with bool values):

$ ./sets.d
size 50,000
uset 1 ms, 340 μs, and 8 hnsecs
tset 4 ms, 637 μs, and 1 hnsec
aset 1 ms, 402 μs, and 6 hnsecs
$ ./sets.d
size 100,000
uset 2 ms, 338 μs, and 4 hnsecs
tset 12 ms, 262 μs, and 6 hnsecs
aset 2 ms and 991 μs
$ ./sets.d
size 200,000
uset 5 ms, 971 μs, and 5 hnsecs
tset 30 ms, 675 μs, and 5 hnsecs
aset 6 ms, 74 μs, and 6 hnsecs
$ ./sets.d
size 400,000
uset 11 ms, 823 μs, and 4 hnsecs
tset 74 ms, 146 μs, and 2 hnsecs
aset 12 ms, 560 μs, and 5 hnsecs

What seems pretty clear is that for my purposes (checking 
presence or absence of membership in a set), AAs are much faster 
than rbtrees. (This is to be expected since if an AA uses a hash 
is should be O(1) vs O(lg n) for an rbtree).

Here is the test code I used:



enum SIZE = 400_000;
enum SUB_SIZE = SIZE / 10;
enum MIN_LEN = 10;
enum MAX_LEN = 50;
enum AZ = "abcdefghijklmnopqrstuvwxyz";

void main() {
     import std.stdio: writefln;

     auto data = Data.populate();
     auto sets = Sets(data.words);

     writefln("size %,d", SIZE);
     check(sets.uset, data.present, data.absent, "uset");
     check(sets.tset, data.present, data.absent, "tset");
     check(sets.aset, data.present, data.absent, "aset");
}

struct Sets {
     import std.container.rbtree: RedBlackTree;

     alias Unit = void[0];
     enum unit = Unit.init;

     Unit[string] uset;
     RedBlackTree!string tset;
     bool[string] aset;

     this(string[] words) {
         tset = new RedBlackTree!string;
         foreach (word; words) {
             uset[word] = unit;
             tset.insert(word);
             aset[word] = false;
         }
     }
}

struct Data {
     string[] words;
     string[] present;
     string[] absent;

     static Data populate() {
         Data data;
         for (int i = 0; i < SIZE; i++) {
             auto word = makeWord;
             data.words ~= word;
             if (data.present.length < SUB_SIZE)
                 data.present ~= word;
             if (data.absent.length < SUB_SIZE)
                 data.absent ~= word ~ "9";
         }
         return data;
     }
}

string makeWord() {
     import std.random: randomCover, uniform;
     import std.range: take;
     import std.string: join, split;

     enum AZS = (AZ ~ AZ ~ AZ ~ AZ ~ AZ ~ AZ).split("");
     return randomCover(AZS).take(uniform(MIN_LEN, 
MAX_LEN)).join("");
}

void check(T)(T set, string[] present, string[] absent, string 
name) {
     import std.datetime.stopwatch: AutoStart, StopWatch;
     import std.stdio: writeln;

     auto timer = StopWatch(AutoStart.yes);
     foreach (string p; present)
         assert(p in set);
     foreach (string a; absent)
         assert(a !in set);
     writeln(name, " ", timer.peek);
}
Mar 08 2020
parent Johan <j j.nl> writes:
On Sunday, 8 March 2020 at 08:43:10 UTC, mark wrote:
 Here are some timings ...
[...]

Please remember that performance testing is not trivial.
At the very least, you should be testing optimized code (-O) and 
preferably with LDC or GDC because they have a much stronger 
optimizer than DMD.
Also, `assert` is not a good way to check something in 
performance testing code. See 
https://forum.dlang.org/thread/bpbyhmrsfzirfqggnlyw forum.dlang.org

-Johan
Mar 08 2020
prev sibling parent reply Jesse Phillips <Jesse.K.Phillips+D gmail.com> writes:
On Friday, 7 February 2020 at 19:37:08 UTC, mark wrote:
 I am porting code from other languages to D as part of learning 
 D, and I find I've used sets quite a lot. AFAIK D doesn't have 
 a built-in set type or one in the std. lib.

 However, I've been perfectly successfully using int[E] where E 
 is my ElementType, and adding with set[element] = 0. I mostly 
 only need add, remove, iteration, and in, with uniqueness what 
 I care most about.

 I know I could use bool[E] and set[element] = false, or I 
 suppose container.rbtree. Would either of these--or something 
 else built-in or in the std. lib.--be better?
I think I've usually used the associative arrays, but I also think I tend to avoid using this approach but couldn't quite remember what I do instead. I believe I have started just using an array. arr ~= addMyData; arr.sort.uniq Then I make use of the algorithms here. https://dlang.org/phobos/std_algorithm_setops.html
Mar 08 2020
parent mark <mark qtrac.eu> writes:
On Sunday, 8 March 2020 at 17:58:16 UTC, Jesse Phillips wrote:
 On Friday, 7 February 2020 at 19:37:08 UTC, mark wrote:
[snip]
 I think I've usually used the associative arrays, but I also 
 think I tend to avoid using this approach but couldn't quite 
 remember what I do instead.

 I believe I have started just using an array.

 arr ~= addMyData;
 arr.sort.uniq

 Then I make use of the algorithms here.

 https://dlang.org/phobos/std_algorithm_setops.html
I can see the sense in that for use with union & intersection. However, AA's (assuming they are hashes under the hood) give O(1) for `in` whereas those algorithms like the rbtree are O(lg n) for `in`, and most of what I'm doing at the moment is `in`. I'll keep the idea in mind though, because I have use cases for intersection.
Mar 09 2020