digitalmars.D.learn - D'ish similar

digitalmars.D.learn - D'ish similar_text?

Dgame (54/54) May 10 2018 I'm in need for some sort of string similarity comparision. I've

Vladimir Panteleev (4/5) May 10 2018 That looks like an implementation of Levenshtein distance. We

Dgame (14/19) May 10 2018 Oh, that could to work, thank you. I've just tested it quickly,

Vladimir Panteleev (4/6) May 10 2018 Note that this formula will give you only 50% similarity for

Dgame (6/12) May 10 2018 Hm, that does not work either. ABC and AZB have a different

Dgame <r.schuett.1987 gmail.com> writes:

I'm in need for some sort of string similarity comparision. I've 
found soundex but that didn't solved my needs. After some search 
I found a C implementation of similar_text, but that is quite 
ugly... I was able to let it work in D but it's still somewhat 
messy. Is there any D implementation of similar_text? Here's my 
current code which makes heavy use of pointer-arithmetic which 
makes it hard to understand.

----
void similar_text_similar_str(char* txt1, size_t len1, char* 
txt2, size_t len2, size_t* pos1, size_t* pos2, size_t* max) {
     char* p, q;
     char* end1 = txt1 + len1;
     char* end2 = txt2 + len2;
     size_t l;

     *max = 0;
     for (p = txt1; p < end1; p++) {
         for (q = txt2; q < end2; q++) {
             for (l = 0; (p + l < end1) && (q + l < end2) && (p[l] 
== q[l]); l++) { }
             if (l > *max) {
                 *max = l;
                 *pos1 = p - txt1;
                 *pos2 = q - txt2;
             }
         }
     }
}

size_t similar_text_similar_char(char* txt1, size_t len1, char* 
txt2, size_t len2) {
     size_t sum;
     size_t pos1, pos2, max;

     similar_text_similar_str(txt1, len1, txt2, len2, &pos1, 
&pos2, &max);
     if ((sum = max) != 0) {
         if (pos1 && pos2)  {
             sum += similar_text_similar_char(txt1, pos1, txt2, 
pos2);
         }
         if ((pos1 + max < len1) && (pos2 + max < len2)) {
             sum += similar_text_similar_char(txt1 + pos1 + max, 
len1 - pos1 - max,
                     txt2 + pos2 + max, len2 - pos2 - max);
         }
     }

     return sum;
}

size_t similar_text(in string s1, in string s2, out double 
percent) {
     immutable size_t sim = similar_text_similar_char(s1.dup.ptr, 
s1.length, s2.dup.ptr, s2.length);
     percent = sim * 200.0 / (s1.length + s2.length);

     return sim;
}
----

May 10 2018

Vladimir Panteleev <thecybershadow.lists gmail.com> writes:

On Thursday, 10 May 2018 at 20:08:04 UTC, Dgame wrote:
 void similar_text_similar_str(char* txt1, size_t len1, char*

That looks like an implementation of Levenshtein distance. We 
have one in Phobos:

https://dlang.org/library/std/algorithm/comparison/levenshtein_distance.html

May 10 2018

Dgame <r.schuett.1987 gmail.com> writes:

On Thursday, 10 May 2018 at 20:13:49 UTC, Vladimir Panteleev 
wrote:
 On Thursday, 10 May 2018 at 20:08:04 UTC, Dgame wrote:
 void similar_text_similar_str(char* txt1, size_t len1, char*

 That looks like an implementation of Levenshtein distance. We 
 have one in Phobos:

 https://dlang.org/library/std/algorithm/comparison/levenshtein_distance.html

Oh, that could to work, thank you. I've just tested it quickly, 
but that code seems to produce the same results:

----
size_t similar_text2(in string s1, in string s2, out double 
percent) {
     import std.algorithm: levenshteinDistance;

     immutable size_t distance = s1.levenshteinDistance(s2);
     immutable size_t len = s1.length + s2.length;
     percent = (len - distance) * 100.0 / len;

     return distance;
}
----

May 10 2018

Vladimir Panteleev <thecybershadow.lists gmail.com> writes:

On Thursday, 10 May 2018 at 20:32:11 UTC, Dgame wrote:
     immutable size_t len = s1.length + s2.length;
     percent = (len - distance) * 100.0 / len;

Note that this formula will give you only 50% similarity for 
"abc" and "def", i.e. two completely different strings. I suggest 
to divide by max(s1.length, s2.length) instead.

May 10 2018

Dgame <r.schuett.1987 gmail.com> writes:

On Thursday, 10 May 2018 at 20:38:12 UTC, Vladimir Panteleev 
wrote:
 On Thursday, 10 May 2018 at 20:32:11 UTC, Dgame wrote:
     immutable size_t len = s1.length + s2.length;
     percent = (len - distance) * 100.0 / len;

 Note that this formula will give you only 50% similarity for 
 "abc" and "def", i.e. two completely different strings. I 
 suggest to divide by max(s1.length, s2.length) instead.

Hm, that does not work either. ABC and AZB have a different 
outcome with both. How can I calculate the percentage with 
levenshtein? It's rather simple with similar_text since it 
returns the amount of similar chars.

May 10 2018

D Programming

C/C++ Programming

Other

digitalmars.D.learn - D'ish similar_text?