digitalmars.D.learn - How to use sets in D?
- mark (10/10) Feb 07 2020 I am porting code from other languages to D as part of learning
- Jan =?UTF-8?B?SMO2bmln?= (6/16) Feb 07 2020 I have not that much experience, but i have asked the exact same
- H. S. Teoh (17/28) Feb 07 2020 bool[E] works just fine.
- mark (9/16) Feb 07 2020 [snip]
- Jesse Phillips (9/19) Mar 08 2020 I think I've usually used the associative arrays, but I also
- mark (8/17) Mar 09 2020 I can see the sense in that for use with union & intersection.
I am porting code from other languages to D as part of learning D, and I find I've used sets quite a lot. AFAIK D doesn't have a built-in set type or one in the std. lib. However, I've been perfectly successfully using int[E] where E is my ElementType, and adding with set[element] = 0. I mostly only need add, remove, iteration, and in, with uniqueness what I care most about. I know I could use bool[E] and set[element] = false, or I suppose container.rbtree. Would either of these--or something else built-in or in the std. lib.--be better?
Feb 07 2020
On Friday, 7 February 2020 at 19:37:08 UTC, mark wrote:I am porting code from other languages to D as part of learning D, and I find I've used sets quite a lot. AFAIK D doesn't have a built-in set type or one in the std. lib. However, I've been perfectly successfully using int[E] where E is my ElementType, and adding with set[element] = 0. I mostly only need add, remove, iteration, and in, with uniqueness what I care most about. I know I could use bool[E] and set[element] = false, or I suppose container.rbtree. Would either of these--or something else built-in or in the std. lib.--be better?I have not that much experience, but i have asked the exact same thing in the IRC. rbtree was recommended. However, if it is not performance critical (in terms of speed or size), your solution works fine as well.
Feb 07 2020
On Fri, Feb 07, 2020 at 07:37:08PM +0000, mark via Digitalmars-d-learn wrote:I am porting code from other languages to D as part of learning D, and I find I've used sets quite a lot. AFAIK D doesn't have a built-in set type or one in the std. lib. However, I've been perfectly successfully using int[E] where E is my ElementType, and adding with set[element] = 0. I mostly only need add, remove, iteration, and in, with uniqueness what I care most about. I know I could use bool[E] and set[element] = false, or I suppose container.rbtree. Would either of these--or something else built-in or in the std. lib.--be better?bool[E] works just fine. The bool does take up 1 byte, though, so if you're *really* want to optimize that away, you could do this: alias Unit = void[0]; enum unit = Unit.init; // Look, ma! A bona fide set! Unit[E] mySet; mySet[...] = unit; mySet.remove(...); ... // etc. Or you can wrap void[0][E] in a nice user-defined type that gives nice set-like syntax. But IMO, this is all overkill, and adds needless complexity. Just use bool[E] or std.container.rbtree. :-D T -- If you look at a thing nine hundred and ninety-nine times, you are perfectly safe; if you look at it the thousandth time, you are in frightful danger of seeing it for the first time. -- G. K. Chesterton
Feb 07 2020
On Friday, 7 February 2020 at 22:03:00 UTC, H. S. Teoh wrote:On Fri, Feb 07, 2020 at 07:37:08PM +0000, mark via Digitalmars-d-learn wrote:[snip]bool[E] works just fine.[snip]Or you can wrap void[0][E] in a nice user-defined type that gives nice set-like syntax. But IMO, this is all overkill, and adds needless complexity. Just use bool[E] or std.container.rbtree. :-DI didn't think bool[E] would be a win because although it is only one byte per item, it won't align so wouldn't it end up taking 4 bytes of space anyway. The void[0][E] you showed is good, but, I'll try using the rbtree since I'd rather use an out-of-the-box collection. Thanks for the replies.
Feb 07 2020
I use sets a lot and am now working on a program that will need to hold sets of 65,000+ items, so I thought I do some timings for the different approaches. Here are some timings (uset uses the AA Unit approach, tset uses an rbtree, and aset uses an AA with bool values): $ ./sets.d size 50,000 uset 1 ms, 340 μs, and 8 hnsecs tset 4 ms, 637 μs, and 1 hnsec aset 1 ms, 402 μs, and 6 hnsecs $ ./sets.d size 100,000 uset 2 ms, 338 μs, and 4 hnsecs tset 12 ms, 262 μs, and 6 hnsecs aset 2 ms and 991 μs $ ./sets.d size 200,000 uset 5 ms, 971 μs, and 5 hnsecs tset 30 ms, 675 μs, and 5 hnsecs aset 6 ms, 74 μs, and 6 hnsecs $ ./sets.d size 400,000 uset 11 ms, 823 μs, and 4 hnsecs tset 74 ms, 146 μs, and 2 hnsecs aset 12 ms, 560 μs, and 5 hnsecs What seems pretty clear is that for my purposes (checking presence or absence of membership in a set), AAs are much faster than rbtrees. (This is to be expected since if an AA uses a hash is should be O(1) vs O(lg n) for an rbtree). Here is the test code I used: enum SIZE = 400_000; enum SUB_SIZE = SIZE / 10; enum MIN_LEN = 10; enum MAX_LEN = 50; enum AZ = "abcdefghijklmnopqrstuvwxyz"; void main() { import std.stdio: writefln; auto data = Data.populate(); auto sets = Sets(data.words); writefln("size %,d", SIZE); check(sets.uset, data.present, data.absent, "uset"); check(sets.tset, data.present, data.absent, "tset"); check(sets.aset, data.present, data.absent, "aset"); } struct Sets { import std.container.rbtree: RedBlackTree; alias Unit = void[0]; enum unit = Unit.init; Unit[string] uset; RedBlackTree!string tset; bool[string] aset; this(string[] words) { tset = new RedBlackTree!string; foreach (word; words) { uset[word] = unit; tset.insert(word); aset[word] = false; } } } struct Data { string[] words; string[] present; string[] absent; static Data populate() { Data data; for (int i = 0; i < SIZE; i++) { auto word = makeWord; data.words ~= word; if (data.present.length < SUB_SIZE) data.present ~= word; if (data.absent.length < SUB_SIZE) data.absent ~= word ~ "9"; } return data; } } string makeWord() { import std.random: randomCover, uniform; import std.range: take; import std.string: join, split; enum AZS = (AZ ~ AZ ~ AZ ~ AZ ~ AZ ~ AZ).split(""); return randomCover(AZS).take(uniform(MIN_LEN, MAX_LEN)).join(""); } void check(T)(T set, string[] present, string[] absent, string name) { import std.datetime.stopwatch: AutoStart, StopWatch; import std.stdio: writeln; auto timer = StopWatch(AutoStart.yes); foreach (string p; present) assert(p in set); foreach (string a; absent) assert(a !in set); writeln(name, " ", timer.peek); }
Mar 08 2020
On Sunday, 8 March 2020 at 08:43:10 UTC, mark wrote:Here are some timings ...[...]Please remember that performance testing is not trivial. At the very least, you should be testing optimized code (-O) and preferably with LDC or GDC because they have a much stronger optimizer than DMD. Also, `assert` is not a good way to check something in performance testing code. See https://forum.dlang.org/thread/bpbyhmrsfzirfqggnlyw forum.dlang.org -Johan
Mar 08 2020
On Friday, 7 February 2020 at 19:37:08 UTC, mark wrote:I am porting code from other languages to D as part of learning D, and I find I've used sets quite a lot. AFAIK D doesn't have a built-in set type or one in the std. lib. However, I've been perfectly successfully using int[E] where E is my ElementType, and adding with set[element] = 0. I mostly only need add, remove, iteration, and in, with uniqueness what I care most about. I know I could use bool[E] and set[element] = false, or I suppose container.rbtree. Would either of these--or something else built-in or in the std. lib.--be better?I think I've usually used the associative arrays, but I also think I tend to avoid using this approach but couldn't quite remember what I do instead. I believe I have started just using an array. arr ~= addMyData; arr.sort.uniq Then I make use of the algorithms here. https://dlang.org/phobos/std_algorithm_setops.html
Mar 08 2020
On Sunday, 8 March 2020 at 17:58:16 UTC, Jesse Phillips wrote:On Friday, 7 February 2020 at 19:37:08 UTC, mark wrote:[snip]I think I've usually used the associative arrays, but I also think I tend to avoid using this approach but couldn't quite remember what I do instead. I believe I have started just using an array. arr ~= addMyData; arr.sort.uniq Then I make use of the algorithms here. https://dlang.org/phobos/std_algorithm_setops.htmlI can see the sense in that for use with union & intersection. However, AA's (assuming they are hashes under the hood) give O(1) for `in` whereas those algorithms like the rbtree are O(lg n) for `in`, and most of what I'm doing at the moment is `in`. I'll keep the idea in mind though, because I have use cases for intersection.
Mar 09 2020