digitalmars.D.learn - Help with an algorithm!

MGW (5/5) Jun 14 2017 There are two arrays of string [] mas1, mas2; Size of each about

Ivan Kazmenko (31/37) Jun 15 2017 The approaches which come to mind are:

CRAIG DILLABAUGH (7/17) Jun 15 2017 As a follow up to this, if your alphabet is reasonably small

MGW (17/17) Jun 15 2017 On Thursday, 15 June 2017 at 13:16:24 UTC, CRAIG DILLABAUGH wrote:

CRAIG DILLABAUGH (22/40) Jun 15 2017 radix sort is O(N) time, which is as fast as you can hope. But
Ivan Kazmenko (14/28) Jun 15 2017 Ugh. This can work as slow as length-of-m1 *multiplied* by

MGW <mgw yandex.ru> writes:

There are two arrays of string [] mas1, mas2; Size of each about 
5M lines. By the size they different, but lines in both match for 
95%. It is necessary to find all lines in an array of mas2 which 
differ from mas1. The principal criterion - speed. There are the 
8th core processor and it is good to involve a multithreading.

Jun 14 2017

Ivan Kazmenko <gassa mail.ru> writes:

On Thursday, 15 June 2017 at 06:06:01 UTC, MGW wrote:
 There are two arrays of string [] mas1, mas2; Size of each 
 about 5M lines. By the size they different, but lines in both 
 match for 95%. It is necessary to find all lines in an array of 
 mas2 which differ from mas1. The principal criterion - speed. 
 There are the 8th core processor and it is good to involve a 
 multithreading.

The approaches which come to mind are:

1. Sort both arrays, then traverse them in sorted order, like in 
merge step of merge sort:

     sort (mas1);
     sort (mas2);

     size_t index1 = 0;
     foreach (str2; mas2)
     {
         while (index1 < mas1.length && mas1[index1] < str2)
             index1 += 1;
         if (mas1[index1] != str2)
             writeln (str2);
     }

Sorting takes O (n log n * c) time, where n is the size of the 
arrays, and c is the expected time of two strings comparison when 
sorting.  The subsequent step is O (n * c) which is faster than 
sorting.

2. Hashing.  Just put the contents of the first array into a bool 
[string], and then, for each string from the second array, check 
whether it is contained in the associative array.  The time will 
be O (total length of all strings) multiplied by a moderate 
constant, unless the strings are designed specifically to 
generate hash collisions, in which case it will be slower.

3. Trie.  Similar to hashing, but the constant multipliers will 
be much higher unless the strings have large common prefixes.

Whether we can do faster depends on context.  For example, if the 
strings tend to all have long common prefixes, any string 
comparison will be slow, but otherwise it can be thought of as 
taking constant time.

Ivan Kazmenko.

Jun 15 2017

CRAIG DILLABAUGH <craig.dillabaugh gmail.com> writes:

On Thursday, 15 June 2017 at 11:48:54 UTC, Ivan Kazmenko wrote:
 On Thursday, 15 June 2017 at 06:06:01 UTC, MGW wrote:
 There are two arrays of string [] mas1, mas2; Size of each 
 about 5M lines. By the size they different, but lines in both 
 match for 95%. It is necessary to find all lines in an array 
 of mas2 which differ from mas1. The principal criterion - 
 speed. There are the 8th core processor and it is good to 
 involve a multithreading.

 The approaches which come to mind are:

clip
 taking constant time.

 Ivan Kazmenko.

As a follow up to this, if your alphabet is reasonably small 
perhaps could run radix sort based on the first few characters to 
split your arrays up into smaller subsets, and then use one of 
Ivan's suggestions within each subset.  Subset searches could be 
easily run in parallel.

Jun 15 2017

MGW <mgw yandex.ru> writes:

On Thursday, 15 June 2017 at 13:16:24 UTC, CRAIG DILLABAUGH wrote:

The purpose - search of changes in file system.
Sorting is a slow operation as well as hashing. Creation of a 
tree, is equally in sorting.
So far the best result:

string[] rez;

foreach(str; m2) {
	bool fFind;	int j;
	foreach(int i, s; m1) {
		if(str == s) { fFind = true; j = i; break; }
	}
	if(!fFind) { rez ~= str; }
	else       { 	m1[j] = m1[$-1]; m1.length = m1.length - 1; 	}
}
		
//  rez => rezult

How to parallel on thred?

Jun 15 2017

CRAIG DILLABAUGH <craig.dillabaugh gmail.com> writes:

On Thursday, 15 June 2017 at 13:41:07 UTC, MGW wrote:
 On Thursday, 15 June 2017 at 13:16:24 UTC, CRAIG DILLABAUGH 
 wrote:

 The purpose - search of changes in file system.
 Sorting is a slow operation as well as hashing. Creation of a 
 tree, is equally in sorting.
 So far the best result:

 string[] rez;

 foreach(str; m2) {
 	bool fFind;	int j;
 	foreach(int i, s; m1) {
 		if(str == s) { fFind = true; j = i; break; }
 	}
 	if(!fFind) { rez ~= str; }
 	else       { 	m1[j] = m1[$-1]; m1.length = m1.length - 1; 	}
 }
 		
 //  rez => rezult

 How to parallel on thred?

radix sort is O(N) time, which is as fast as you can hope. But 
given your specific problem domain (the strings are paths) an 
initial radix sort step likely won't gain you much, as everything 
is going to be sorted into a small subset of the buckets. So I 
guess you can scrap that idea.

Knowing that your strings are actually file paths I think 
building some sort of tree structure over M1 wouldn't be 
unreasonable. You say go two or three levels deep on your 
directory structure (ie nodes are labelled with directory name) 
and use that to split M1 into buckets. If some bucket has too 
many entries you could apply this recursively. Since you are only 
building a constant number of levels and the number of nodes is 
not likely to be too large you should do much better than N log N 
* c time for this step.

Then you search with the elements of M2. You should be able to do 
this in a multi-threaded way since once built, your data 
structure on M1 is read-only you could just split M2 over X 
threads and search. I am not an expert in this regard though, so 
perhaps someone better informed than I can chime in.

Since strings will tend to have long common prefix's Ivan's Trie 
idea would also work well.

Jun 15 2017

Ivan Kazmenko <gassa mail.ru> writes:

On Thursday, 15 June 2017 at 13:41:07 UTC, MGW wrote:
 On Thursday, 15 June 2017 at 13:16:24 UTC, CRAIG DILLABAUGH 
 wrote:

 The purpose - search of changes in file system.
 Sorting is a slow operation as well as hashing. Creation of a 
 tree, is equally in sorting.
 So far the best result:

 foreach(str; m2) {
 	bool fFind;	int j;
 	foreach(int i, s; m1) {
 		if(str == s) { fFind = true; j = i; break; }
 	}
 	if(!fFind) { rez ~= str; }
 	else       { 	m1[j] = m1[$-1]; m1.length = m1.length - 1; 	}
 }

Ugh.  This can work as slow as length-of-m1 *multiplied* by 
length-of-m2.  For 5,000,000 strings, it is 5,000,000 * 5,000,000 
= 25,000,000,000,000.  Granted, if you run it very often, the 
arrays are almost equal, and it's closer to linear.  But once 
there are substantial changes between two consecutive runs, this 
approach is seriously screwed.

Sorting would work in length-of-array * log(length-of-array).  
For 5,000,000 strings, it is 5,000,000 * 23 = 115,000,000.  This 
is ~217,391 times better than your method above.  May be a bit 
slower because of long common prefixes.  Anyway, a couple of 
seconds at most.  How fast you need it to be?  Did you actually 
try it?

Ivan Kazmenko.

Jun 15 2017

D Programming

C/C++ Programming

Other

digitalmars.D.learn - Help with an algorithm!