digitalmars.D.learn - Find Semantically Correct Word Splits in UTF-8 Strings

=?UTF-8?B?Ik5vcmRsw7Z3Ig==?= (25/25) Oct 01 2014 I'm looking for a way to make my algorithm

=?UTF-8?B?Ik5vcmRsw7Z3Ig==?= (21/22) Oct 01 2014 Update:

=?UTF-8?B?Ik5vcmRsw7Z3Ig==?= (5/6) Oct 01 2014 If so, how about naming it SlidingSplitter or perhaps
Kagamin (18/18) Oct 01 2014 S[] findMeaningfulWordSplit(S)(S word,
monarch_dodra (11/34) Oct 01 2014 This seems like a text-book case for a trie algorithm:
monarch_dodra (12/23) Oct 01 2014 Does that even work? takeExactly would pop up to N *codepoints*,

=?UTF-8?B?Ik5vcmRsw7Z3Ig==?= (3/5) Oct 01 2014 Your're right again :)

monarch_dodra (2/8) Oct 02 2014 Technically, it only pops. It's front/popFront that auto-decode.

=?UTF-8?B?Ik5vcmRsw7Z3Ig==?= (5/6) Oct 03 2014 Thanks again.

monarch_dodra (4/29) Oct 01 2014 Out of curiosity, why exactly isn't it working in your "native

=?UTF-8?B?Ik5vcmRsw7Z3Ig==?= (2/4) Oct 01 2014 You're right. I'll try that.

=?UTF-8?B?Ik5vcmRsw7Z3Ig==?= <per.nordlow gmail.com> writes:

I'm looking for a way to make my algorithm

     S[] findWordSplit(S)(S word,
                          HLang[] langs = [])
     {
         for (size_t i = 1; i + 1 < word.length; i++)
         {
             const first = word[0..i];
             const second = word[i..$];
             if (this.canMeanSomething(first, langs) &&
                 this.canMeanSomething(second, langs))
             {
                 return [first,
                         second];
             }
         }
         return typeof(return).init;
     }

correctly work if S is a (UTF-8) string without first, in lazy 
manner, encode word to a dstring.

Currently this algorithm works as

"carwash" => ["car", "wash"]

and I would like it to work correctly and efficient in my native 
language aswell as

"biltvätt" => ["bil", "tvätt"]

:)

Oct 01 2014

=?UTF-8?B?Ik5vcmRsw7Z3Ig==?= <per.nordlow gmail.com> writes:

On Wednesday, 1 October 2014 at 11:06:24 UTC, Nordlöw wrote:
 I'm looking for a way to make my algorithm

Update:

     S[] findMeaningfulWordSplit(S)(S word,
                                    HLang[] langs = []) if 
(isSomeString!S)
     {
         for (size_t i = 1; i + 1 < word.length; i++)
         {
             const first = word.takeExactly(i).to!string;
             const second = word.dropExactly(i).to!string;
             if (this.canMeanSomething(first, langs) &&
                 this.canMeanSomething(second, langs))
             {
                 return [first,
                         second];
             }
         }
         return typeof(return).init;
     }

works but has unfortunately O(word.length^^2) time-complexity.

Do we need a new InputRange algorithm for this?

Oct 01 2014

=?UTF-8?B?Ik5vcmRsw7Z3Ig==?= <per.nordlow gmail.com> writes:

On Wednesday, 1 October 2014 at 11:47:41 UTC, Nordlöw wrote:
 Do we need a new InputRange algorithm for this?

If so, how about naming it SlidingSplitter or perhaps 
SlidingHalver in the binary case?

I haven't thought about making this variadic on the number of 
splits (>= 1) but that could be useful in my application aswell.

Oct 01 2014

"Kagamin" <spam here.lot> writes:

     S[] findMeaningfulWordSplit(S)(S word,
                                    HLang[] langs = []) if
(isSomeString!S)
     {
         S second = word;
         for (size_t i = 1; i + 1 < word.length; i++)
         {
             second = second.dropExactly(i).to!string;
             const first = word[0..$-second.length];
             if (this.canMeanSomething(first, langs) &&
                 this.canMeanSomething(second, langs))
             {
                 return [first,
                         second];
             }
         }
         return typeof(return).init;
     }

Oct 01 2014

"monarch_dodra" <monarchdodra gmail.com> writes:

On Wednesday, 1 October 2014 at 11:47:41 UTC, Nordlöw wrote:
 On Wednesday, 1 October 2014 at 11:06:24 UTC, Nordlöw wrote:
 I'm looking for a way to make my algorithm

 Update:

     S[] findMeaningfulWordSplit(S)(S word,
                                    HLang[] langs = []) if 
 (isSomeString!S)
     {
         for (size_t i = 1; i + 1 < word.length; i++)
         {
             const first = word.takeExactly(i).to!string;
             const second = word.dropExactly(i).to!string;
             if (this.canMeanSomething(first, langs) &&
                 this.canMeanSomething(second, langs))
             {
                 return [first,
                         second];
             }
         }
         return typeof(return).init;
     }

 works but has unfortunately O(word.length^^2) time-complexity.

 Do we need a new InputRange algorithm for this?

This seems like a text-book case for a trie algorithm:
http://en.wikipedia.org/wiki/Trie

Once your "dictionary" is built, it can find *all* the splits in 
O(word.length).

...but I don't know how well it adapts to the range interface. 
Should still work though.

Also, Aho–Corasick might be relevant depending on what you want 
to do, for example, find all substrings that happen to be a word, 
regardless of what comes before or after:
http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm

Oct 01 2014

"monarch_dodra" <monarchdodra gmail.com> writes:

On Wednesday, 1 October 2014 at 11:47:41 UTC, Nordlöw wrote:
 On Wednesday, 1 October 2014 at 11:06:24 UTC, Nordlöw wrote:
 I'm looking for a way to make my algorithm

 Update:

     S[] findMeaningfulWordSplit(S)(S word,
                                    HLang[] langs = []) if 
 (isSomeString!S)
     {
         for (size_t i = 1; i + 1 < word.length; i++)
         {
             const first = word.takeExactly(i).to!string;

Does that even work? takeExactly would pop up to N *codepoints*, 
whereas your string only has N *codeunits*.

Something like:

for (auto second = str ; !second.empty ; second.popFront() )
{
     auto first = str[0 .. $ - second.length];
     ...
}
//special case str + str[$ .. $] here. (or adapt your loop)

Would also be unicode correct, without increasing the original 
complexity.

Oct 01 2014

=?UTF-8?B?Ik5vcmRsw7Z3Ig==?= <per.nordlow gmail.com> writes:

On Wednesday, 1 October 2014 at 17:09:57 UTC, monarch_dodra wrote:
 Does that even work? takeExactly would pop up to N 
 *codepoints*, whereas your string only has N *codeunits*.

Your're right again :)

If forgot that takeExactly auto-decodes.

Oct 01 2014

"monarch_dodra" <monarchdodra gmail.com> writes:

On Wednesday, 1 October 2014 at 21:34:40 UTC, Nordlöw wrote:
 On Wednesday, 1 October 2014 at 17:09:57 UTC, monarch_dodra 
 wrote:
 Does that even work? takeExactly would pop up to N 
 *codepoints*, whereas your string only has N *codeunits*.

 Your're right again :)

 If forgot that takeExactly auto-decodes.

Technically, it only pops. It's front/popFront that auto-decode.

Oct 02 2014

=?UTF-8?B?Ik5vcmRsw7Z3Ig==?= <per.nordlow gmail.com> writes:

On Thursday, 2 October 2014 at 13:21:24 UTC, monarch_dodra wrote:
 Technically, it only pops. It's front/popFront that auto-decode.

Thanks again.

I decided to try to expand D-universe by expressing this through 
a new range. For details see:

http://forum.dlang.org/thread/uzrbmjonrkixojzflbig forum.dlang.org#post-uzrbmjonrkixojzflbig:40forum.dlang.org

Oct 03 2014

"monarch_dodra" <monarchdodra gmail.com> writes:

On Wednesday, 1 October 2014 at 11:06:24 UTC, Nordlöw wrote:
 I'm looking for a way to make my algorithm

     S[] findWordSplit(S)(S word,
                          HLang[] langs = [])
     {
         for (size_t i = 1; i + 1 < word.length; i++)
         {
             const first = word[0..i];
             const second = word[i..$];
             if (this.canMeanSomething(first, langs) &&
                 this.canMeanSomething(second, langs))
             {
                 return [first,
                         second];
             }
         }
         return typeof(return).init;
     }

 correctly work if S is a (UTF-8) string without first, in lazy 
 manner, encode word to a dstring.

 Currently this algorithm works as

 "carwash" => ["car", "wash"]

 and I would like it to work correctly and efficient in my 
 native language aswell as

 "biltvätt" => ["bil", "tvätt"]

 :)

Out of curiosity, why exactly isn't it working in your "native 
language"? If you avoid decoding in your "canMeanSomething", you 
should encounter no problems.

Oct 01 2014

=?UTF-8?B?Ik5vcmRsw7Z3Ig==?= <per.nordlow gmail.com> writes:

On Wednesday, 1 October 2014 at 16:44:24 UTC, monarch_dodra wrote:
 language"? If you avoid decoding in your "canMeanSomething", 
 you should encounter no problems.

You're right. I'll try that.

Oct 01 2014

D Programming

C/C++ Programming

Other

digitalmars.D.learn - Find Semantically Correct Word Splits in UTF-8 Strings