www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Unnecessary Large Memory Usage of Levenshtein Distance in Phobos

reply =?UTF-8?B?Ik5vcmRsw7Z3Ig==?= <per.nordlow gmail.com> writes:
I justd discovered that the std.algorithm implementation of 
Levenshtein Distance requires O(m*n) memory usage.

This is not neccessary. I have a C++-implementation of 
Damerau-Levenshtein that requires only O(3*min(m,n)). Is anybody 
interested in discussion modifying std.algorithm to make use of 
this?

Here's C++ implementation:


template<class T> inline void perm3_231(T& a, T& b, T& c) {
     T _t=a; a=b; b=c; c=_t;
}
template<class T> inline pure bool is_min(const T& a) {
     return a == pnw::minof<T>();
}
template<class T> inline pure bool is_max(const T& a) {
     return a == pnw::maxof<T>();
}

template<class T, class D = size_t>
inline pure
D damerau_levenshtein(const T& s_, const T& t_,
                       D max_distance = 
std::numeric_limits<D>::max(),
                       D insert_weight = static_cast<D>(10),
                       D delete_weight = static_cast<D>(7),
                       D replace_weight = static_cast<D>(5),
                       D transposition_weight = static_cast<D>(3))
{
     // reorder s and t to minimize memory usage
     bool ook = s_.size() >= t_.size(); // argument ordering ok 
flag
     const T& s = ook ? s_ : t_; // assure \c s becomes the \em 
longest
     const T& t = ook ? t_ : s_; // assure \c t becomes the \em 
shortest

     const D m = s.size();
     const D n = t.size();

     if (m == 0) { return n; }
     if (n == 0) { return m; }

     // Adapt the algorithm to use less space, O(3*min(n,m)) 
instead of O(mn),
     // since it only requires that the previous row/column and 
current row/column be stored at
     // any one time.
#ifdef HAVE_C99_VARIABLE_LENGTH_ARRAYS
     D cc_[n+1], pc_[n+1], sc_[n+1]; // current, previous and 
second-previous column on stack
#elif HAVE_CXX11_UNIQUE_PTR
     std::unique_ptr<D[]> cc_(new D[n+1]);      // current column
     std::unique_ptr<D[]> pc_(new D[n+1]);      // previous column
     std::unique_ptr<D[]> sc_(new D[n+1]);      // second previous 
column
#else
     auto cc_ = new D[n+1];      // current column
     auto pc_ = new D[n+1];      // previous column
     auto sc_ = new D[n+1];      // second previous column
     //std::vector<D> cc_(n+1), pc_(n+1), sc_(n+1); // current, 
previous and second previous column
#endif
     D * cc = &cc_[0], * pc = &pc_[0], * sc = &sc_[0]; // pointers 
for efficient swapping

     // initialize previous column
     for (D i = 0; i < n+1; ++i) { pc[i] = i * insert_weight; }

     // second previous column \c sc will be defined in second \c 
i iteration in outer-loop

     const auto D_max = std::numeric_limits<D>::max();

     // Computing the Levenshtein distance is based on the 
observation that if we
     // reserve a matrix to hold the Levenshtein distances between 
all prefixes
     // of the first string and all prefixes of the second, then 
we can compute
     // the values in the matrix by flood filling the matrix, and 
thus find the
     // distance between the two full strings as the last value 
computed.
     // This algorithm, an example of bottom-up dynamic 
programming, is
     // discussed, with variants, in the 1974 article The 
String-to-string
     // correction problem by Robert A. Wagner and Michael 
J.Fischer.
     for (D i = 0; i < m; ++i) {
         cc[0] = i+insert_weight;
         auto tmin = D_max; // row/column total min
         for (D j = 0; j < n; ++j) {
             // TODO Use sub_dist
             //auto sub_dist = damerau_levenshtein(s[i], t[j]); // 
recurse if for example T is an std::vector<std::string>
             cc[j+1] = pnw::min(pc[j+1] + insert_weight,         
// insertion
                                cc[j] + delete_weight,           
// deletion
                                pc[j] + (s[i] == t[j] ? 0 : 
replace_weight)); // substitution

             // transposition
             if (not is_max(transposition_weight)) { // if 
transposition should be allowed
                 if (i > 0 and j > 0 and // we need at least two 
characters
                     s[i-1] == t[j] and  // and first must be 
equal second
                     s[i]   == t[j-1]    // and vice versa
                     ) {
                     cc[j+1] = std::min(cc[j+1],
                                        sc[j-1] + 
transposition_weight);
                 }
             }

             if (not is_max(max_distance)) {
                 tmin = std::min(tmin, cc[j+1]);
             }
         }

         if ((not is_max(max_distance)) and
             tmin >= max_distance) {
             // if no element is smaller than \p max_distance
             return max_distance;
         }

         if (transposition_weight) {
             perm3_231(pc, cc, sc); // rotate pointers
         } else {
             std::swap(cc, pc);
         }
     }

#if !(defined(HAVE_C99_VARIABLE_LENGTH_ARRAYS) || 
defined(HAVE_CXX11_UNIQUE_PTR))
     delete [] cc_;
     delete [] pc_;
     delete [] sc_;
#endif
     return pc[n];
}


/*! Get \em Levenshtein (Edit) Distance (LD) metric between the 
\em sequences \p s and \p t.
  * Computing LD is also called Optimal String Alignment (OSA).
  */
template<class T, class D = size_t>
inline pure
D levenshtein(const T& s, const T& t,
               D max_distance = std::numeric_limits<D>::max(),
               D insert_weight = static_cast<D>(10),
               D delete_weight = static_cast<D>(7),
               D replace_weight = static_cast<D>(5))
{
     return damerau_levenshtein(s, t, max_distance, insert_weight, 
delete_weight, replace_weight,
                                std::numeric_limits<D>::max());
}

/*! Get \em Levenshtein (Edit) Distance (LD) metric between the 
\em arrays \p s and \p t.
  * Computing LD is also called Optimal String Alignment (OSA).
  */
template<class D = size_t>
inline pure
D levenshtein(const char * s, const char * t,
               D max_distance = std::numeric_limits<D>::max(),
               D insert_weight = static_cast<D>(10),
               D delete_weight = static_cast<D>(7),
               D replace_weight = static_cast<D>(5))
{
     return levenshtein(csc(s),
                        csc(t),
                        max_distance, insert_weight, 
delete_weight, replace_weight);
}

/* ---------------------------- Group Separator 
---------------------------- */

template<class T, class D = size_t>
inline pure
D test_levenshtein_symmetry(const T& s, const T& t,
                             D max_distance = 
std::numeric_limits<D>::max())
{
     D st = levenshtein(s, t, max_distance, 
static_cast<D>(1),static_cast<D>(1),static_cast<D>(1));
     D ts = levenshtein(t, s, max_distance, 
static_cast<D>(1),static_cast<D>(1),static_cast<D>(1));
     bool sym = (st == ts); // symmetry
     return sym ? st : std::numeric_limits<D>::max();
}
May 22 2014
parent reply Max Klyga <max.klyga gmail.com> writes:
You should make a pull request with this implementation adapted to 
std.algorithm interface

On 2014-05-22 09:49:09 +0000, Nordlöw said:

 I justd discovered that the std.algorithm implementation of Levenshtein 
 Distance requires O(m*n) memory usage.
 
 This is not neccessary. I have a C++-implementation of 
 Damerau-Levenshtein that requires only O(3*min(m,n)). Is anybody 
 interested in discussion modifying std.algorithm to make use of this?
 
 Here's C++ implementation:
 
 
 template<class T> inline void perm3_231(T& a, T& b, T& c) {
      T _t=a; a=b; b=c; c=_t;
 }
 template<class T> inline pure bool is_min(const T& a) {
      return a == pnw::minof<T>();
 }
 template<class T> inline pure bool is_max(const T& a) {
      return a == pnw::maxof<T>();
 }
 
 template<class T, class D = size_t>
 inline pure
 D damerau_levenshtein(const T& s_, const T& t_,
                        D max_distance = std::numeric_limits<D>::max(),
                        D insert_weight = static_cast<D>(10),
                        D delete_weight = static_cast<D>(7),
                        D replace_weight = static_cast<D>(5),
                        D transposition_weight = static_cast<D>(3))
 {
      // reorder s and t to minimize memory usage
      bool ook = s_.size() >= t_.size(); // argument ordering ok flag
      const T& s = ook ? s_ : t_; // assure \c s becomes the \em longest
      const T& t = ook ? t_ : s_; // assure \c t becomes the \em shortest
 
      const D m = s.size();
      const D n = t.size();
 
      if (m == 0) { return n; }
      if (n == 0) { return m; }
 
      // Adapt the algorithm to use less space, O(3*min(n,m)) instead of O(mn),
      // since it only requires that the previous row/column and current 
 row/column be stored at
      // any one time.
 #ifdef HAVE_C99_VARIABLE_LENGTH_ARRAYS
      D cc_[n+1], pc_[n+1], sc_[n+1]; // current, previous and 
 second-previous column on stack
 #elif HAVE_CXX11_UNIQUE_PTR
      std::unique_ptr<D[]> cc_(new D[n+1]);      // current column
      std::unique_ptr<D[]> pc_(new D[n+1]);      // previous column
      std::unique_ptr<D[]> sc_(new D[n+1]);      // second previous column
 #else
      auto cc_ = new D[n+1];      // current column
      auto pc_ = new D[n+1];      // previous column
      auto sc_ = new D[n+1];      // second previous column
      //std::vector<D> cc_(n+1), pc_(n+1), sc_(n+1); // current, 
 previous and second previous column
 #endif
      D * cc = &cc_[0], * pc = &pc_[0], * sc = &sc_[0]; // pointers for 
 efficient swapping
 
      // initialize previous column
      for (D i = 0; i < n+1; ++i) { pc[i] = i * insert_weight; }
 
      // second previous column \c sc will be defined in second \c i 
 iteration in outer-loop
 
      const auto D_max = std::numeric_limits<D>::max();
 
      // Computing the Levenshtein distance is based on the observation 
 that if we
      // reserve a matrix to hold the Levenshtein distances between all prefixes
      // of the first string and all prefixes of the second, then we can compute
      // the values in the matrix by flood filling the matrix, and thus find the
      // distance between the two full strings as the last value computed.
      // This algorithm, an example of bottom-up dynamic programming, is
      // discussed, with variants, in the 1974 article The String-to-string
      // correction problem by Robert A. Wagner and Michael J.Fischer.
      for (D i = 0; i < m; ++i) {
          cc[0] = i+insert_weight;
          auto tmin = D_max; // row/column total min
          for (D j = 0; j < n; ++j) {
              // TODO Use sub_dist
              //auto sub_dist = damerau_levenshtein(s[i], t[j]); // 
 recurse if for example T is an std::vector<std::string>
              cc[j+1] = pnw::min(pc[j+1] + insert_weight,         // insertion
                                 cc[j] + delete_weight,           // deletion
                                 pc[j] + (s[i] == t[j] ? 0 : 
 replace_weight)); // substitution
 
              // transposition
              if (not is_max(transposition_weight)) { // if 
 transposition should be allowed
                  if (i > 0 and j > 0 and // we need at least two characters
                      s[i-1] == t[j] and  // and first must be equal second
                      s[i]   == t[j-1]    // and vice versa
                      ) {
                      cc[j+1] = std::min(cc[j+1],
                                         sc[j-1] + transposition_weight);
                  }
              }
 
              if (not is_max(max_distance)) {
                  tmin = std::min(tmin, cc[j+1]);
              }
          }
 
          if ((not is_max(max_distance)) and
              tmin >= max_distance) {
              // if no element is smaller than \p max_distance
              return max_distance;
          }
 
          if (transposition_weight) {
              perm3_231(pc, cc, sc); // rotate pointers
          } else {
              std::swap(cc, pc);
          }
      }
 
 #if !(defined(HAVE_C99_VARIABLE_LENGTH_ARRAYS) || 
 defined(HAVE_CXX11_UNIQUE_PTR))
      delete [] cc_;
      delete [] pc_;
      delete [] sc_;
 #endif
      return pc[n];
 }
 
 
 /*! Get \em Levenshtein (Edit) Distance (LD) metric between the \em 
 sequences \p s and \p t.
   * Computing LD is also called Optimal String Alignment (OSA).
   */
 template<class T, class D = size_t>
 inline pure
 D levenshtein(const T& s, const T& t,
                D max_distance = std::numeric_limits<D>::max(),
                D insert_weight = static_cast<D>(10),
                D delete_weight = static_cast<D>(7),
                D replace_weight = static_cast<D>(5))
 {
      return damerau_levenshtein(s, t, max_distance, insert_weight, 
 delete_weight, replace_weight,
                                 std::numeric_limits<D>::max());
 }
 
 /*! Get \em Levenshtein (Edit) Distance (LD) metric between the \em 
 arrays \p s and \p t.
   * Computing LD is also called Optimal String Alignment (OSA).
   */
 template<class D = size_t>
 inline pure
 D levenshtein(const char * s, const char * t,
                D max_distance = std::numeric_limits<D>::max(),
                D insert_weight = static_cast<D>(10),
                D delete_weight = static_cast<D>(7),
                D replace_weight = static_cast<D>(5))
 {
      return levenshtein(csc(s),
                         csc(t),
                         max_distance, insert_weight, delete_weight, 
 replace_weight);
 }
 
 /* ---------------------------- Group Separator ---------------------------- */
 
 template<class T, class D = size_t>
 inline pure
 D test_levenshtein_symmetry(const T& s, const T& t,
                              D max_distance = std::numeric_limits<D>::max())
 {
      D st = levenshtein(s, t, max_distance, 
 static_cast<D>(1),static_cast<D>(1),static_cast<D>(1));
      D ts = levenshtein(t, s, max_distance, 
 static_cast<D>(1),static_cast<D>(1),static_cast<D>(1));
      bool sym = (st == ts); // symmetry
      return sym ? st : std::numeric_limits<D>::max();
 }
May 22 2014
next sibling parent reply James Carr via Digitalmars-d <digitalmars-d puremagic.com> writes:
I've begun work on this and my implementation in D passes all the
std.algorithm unit tests, but because it now uses a single array instead of
a matrix, path() no longer provides the correct answer.

I'm working on trying to amend it so that there is consistency.


On Thu, May 22, 2014 at 12:26 PM, Max Klyga via Digitalmars-d <
digitalmars-d puremagic.com> wrote:

 You should make a pull request with this implementation adapted to
 std.algorithm interface


 On 2014-05-22 09:49:09 +0000, Nordl=C3=B6w said:

  I justd discovered that the std.algorithm implementation of Levenshtein
 Distance requires O(m*n) memory usage.

 This is not neccessary. I have a C++-implementation of
 Damerau-Levenshtein that requires only O(3*min(m,n)). Is anybody interes=
ted
 in discussion modifying std.algorithm to make use of this?

 Here's C++ implementation:


 template<class T> inline void perm3_231(T& a, T& b, T& c) {
      T _t=3Da; a=3Db; b=3Dc; c=3D_t;
 }
 template<class T> inline pure bool is_min(const T& a) {
      return a =3D=3D pnw::minof<T>();
 }
 template<class T> inline pure bool is_max(const T& a) {
      return a =3D=3D pnw::maxof<T>();
 }

 template<class T, class D =3D size_t>
 inline pure
 D damerau_levenshtein(const T& s_, const T& t_,
                        D max_distance =3D std::numeric_limits<D>::max(),
                        D insert_weight =3D static_cast<D>(10),
                        D delete_weight =3D static_cast<D>(7),
                        D replace_weight =3D static_cast<D>(5),
                        D transposition_weight =3D static_cast<D>(3))
 {
      // reorder s and t to minimize memory usage
      bool ook =3D s_.size() >=3D t_.size(); // argument ordering ok flag
      const T& s =3D ook ? s_ : t_; // assure \c s becomes the \em longes=
t
      const T& t =3D ook ? t_ : s_; // assure \c t becomes the \em shorte=
st
      const D m =3D s.size();
      const D n =3D t.size();

      if (m =3D=3D 0) { return n; }
      if (n =3D=3D 0) { return m; }

      // Adapt the algorithm to use less space, O(3*min(n,m)) instead of
 O(mn),
      // since it only requires that the previous row/column and current
 row/column be stored at
      // any one time.
 #ifdef HAVE_C99_VARIABLE_LENGTH_ARRAYS
      D cc_[n+1], pc_[n+1], sc_[n+1]; // current, previous and
 second-previous column on stack
 #elif HAVE_CXX11_UNIQUE_PTR
      std::unique_ptr<D[]> cc_(new D[n+1]);      // current column
      std::unique_ptr<D[]> pc_(new D[n+1]);      // previous column
      std::unique_ptr<D[]> sc_(new D[n+1]);      // second previous colum=
n
 #else
      auto cc_ =3D new D[n+1];      // current column
      auto pc_ =3D new D[n+1];      // previous column
      auto sc_ =3D new D[n+1];      // second previous column
      //std::vector<D> cc_(n+1), pc_(n+1), sc_(n+1); // current, previous
 and second previous column
 #endif
      D * cc =3D &cc_[0], * pc =3D &pc_[0], * sc =3D &sc_[0]; // pointers=
for
 efficient swapping

      // initialize previous column
      for (D i =3D 0; i < n+1; ++i) { pc[i] =3D i * insert_weight; }

      // second previous column \c sc will be defined in second \c i
 iteration in outer-loop

      const auto D_max =3D std::numeric_limits<D>::max();

      // Computing the Levenshtein distance is based on the observation
 that if we
      // reserve a matrix to hold the Levenshtein distances between all
 prefixes
      // of the first string and all prefixes of the second, then we can
 compute
      // the values in the matrix by flood filling the matrix, and thus
 find the
      // distance between the two full strings as the last value computed=
.
      // This algorithm, an example of bottom-up dynamic programming, is
      // discussed, with variants, in the 1974 article The String-to-stri=
ng
      // correction problem by Robert A. Wagner and Michael J.Fischer.
      for (D i =3D 0; i < m; ++i) {
          cc[0] =3D i+insert_weight;
          auto tmin =3D D_max; // row/column total min
          for (D j =3D 0; j < n; ++j) {
              // TODO Use sub_dist
              //auto sub_dist =3D damerau_levenshtein(s[i], t[j]); //
 recurse if for example T is an std::vector<std::string>
              cc[j+1] =3D pnw::min(pc[j+1] + insert_weight,         //
 insertion
                                 cc[j] + delete_weight,           //
 deletion
                                 pc[j] + (s[i] =3D=3D t[j] ? 0 :
 replace_weight)); // substitution

              // transposition
              if (not is_max(transposition_weight)) { // if transposition
 should be allowed
                  if (i > 0 and j > 0 and // we need at least two
 characters
                      s[i-1] =3D=3D t[j] and  // and first must be equal =
second
                      s[i]   =3D=3D t[j-1]    // and vice versa
                      ) {
                      cc[j+1] =3D std::min(cc[j+1],
                                         sc[j-1] + transposition_weight);
                  }
              }

              if (not is_max(max_distance)) {
                  tmin =3D std::min(tmin, cc[j+1]);
              }
          }

          if ((not is_max(max_distance)) and
              tmin >=3D max_distance) {
              // if no element is smaller than \p max_distance
              return max_distance;
          }

          if (transposition_weight) {
              perm3_231(pc, cc, sc); // rotate pointers
          } else {
              std::swap(cc, pc);
          }
      }

 #if !(defined(HAVE_C99_VARIABLE_LENGTH_ARRAYS) ||
 defined(HAVE_CXX11_UNIQUE_PTR))
      delete [] cc_;
      delete [] pc_;
      delete [] sc_;
 #endif
      return pc[n];
 }


 /*! Get \em Levenshtein (Edit) Distance (LD) metric between the \em
 sequences \p s and \p t.
   * Computing LD is also called Optimal String Alignment (OSA).
   */
 template<class T, class D =3D size_t>
 inline pure
 D levenshtein(const T& s, const T& t,
                D max_distance =3D std::numeric_limits<D>::max(),
                D insert_weight =3D static_cast<D>(10),
                D delete_weight =3D static_cast<D>(7),
                D replace_weight =3D static_cast<D>(5))
 {
      return damerau_levenshtein(s, t, max_distance, insert_weight,
 delete_weight, replace_weight,
                                 std::numeric_limits<D>::max());
 }

 /*! Get \em Levenshtein (Edit) Distance (LD) metric between the \em
 arrays \p s and \p t.
   * Computing LD is also called Optimal String Alignment (OSA).
   */
 template<class D =3D size_t>
 inline pure
 D levenshtein(const char * s, const char * t,
                D max_distance =3D std::numeric_limits<D>::max(),
                D insert_weight =3D static_cast<D>(10),
                D delete_weight =3D static_cast<D>(7),
                D replace_weight =3D static_cast<D>(5))
 {
      return levenshtein(csc(s),
                         csc(t),
                         max_distance, insert_weight, delete_weight,
 replace_weight);
 }

 /* ---------------------------- Group Separator
 ---------------------------- */

 template<class T, class D =3D size_t>
 inline pure
 D test_levenshtein_symmetry(const T& s, const T& t,
                              D max_distance =3D
 std::numeric_limits<D>::max())
 {
      D st =3D levenshtein(s, t, max_distance, static_cast<D>(1),static_c=
ast<
 D>(1),static_cast<D>(1));
      D ts =3D levenshtein(t, s, max_distance, static_cast<D>(1),static_c=
ast<
 D>(1),static_cast<D>(1));
      bool sym =3D (st =3D=3D ts); // symmetry
      return sym ? st : std::numeric_limits<D>::max();
 }
May 22 2014
parent reply =?UTF-8?B?Ik5vcmRsw7Z3Ig==?= <per.nordlow gmail.com> writes:
 I've begun work on this and my implementation in D passes all
I uploaded it here https://github.com/nordlow/justcxx/blob/master/levenshtein_distance.hpp I don't have time right now to look into adapting it to Phobos. But I hope it can give some clues for your work ;) /Per
May 22 2014
next sibling parent James Carr via Digitalmars-d <digitalmars-d puremagic.com> writes:
I've submitted a pull request.

https://github.com/D-Programming-Language/phobos/pull/2195


On Thu, May 22, 2014 at 8:05 PM, "Nordl=C3=B6w" <digitalmars-d puremagic.co=
m>wrote:

 I've begun work on this and my implementation in D passes all

 I uploaded it here

 https://github.com/nordlow/justcxx/blob/master/levenshtein_distance.hpp

 I don't have time right now to look into adapting it to Phobos.
 But I hope it can give some clues for your work ;)

 /Per
May 22 2014
prev sibling parent reply James Carr via Digitalmars-d <digitalmars-d puremagic.com> writes:
Hi again,

I've implemented the memory efficient version
https://github.com/D-Programming-Language/phobos/pull/2195

I was wondering if I could get some feedback please.

Thanks.


On Thu, May 22, 2014 at 8:42 PM, James Carr <jdcarrman gmail.com> wrote:

 I've submitted a pull request.

 https://github.com/D-Programming-Language/phobos/pull/2195


 On Thu, May 22, 2014 at 8:05 PM, "Nordl=C3=B6w" <digitalmars-d puremagic.=
com>wrote:
  I've begun work on this and my implementation in D passes all

 I uploaded it here

 https://github.com/nordlow/justcxx/blob/master/levenshtein_distance.hpp

 I don't have time right now to look into adapting it to Phobos.
 But I hope it can give some clues for your work ;)

 /Per
May 25 2014
parent "Brian Schott" <briancschott gmail.com> writes:
On Sunday, 25 May 2014 at 22:22:28 UTC, James Carr via 
Digitalmars-d wrote:
 Hi again,

 I've implemented the memory efficient version
 https://github.com/D-Programming-Language/phobos/pull/2195

 I was wondering if I could get some feedback please.

 Thanks.
This PR has been sitting around for over a month without anybody doing anything about it. Can we get a Phobos dev to take a look?
Jul 11 2014
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/22/14, 4:26 AM, Max Klyga wrote:
 You should make a pull request with this implementation adapted to
 std.algorithm interface
Yes please. I recall I wanted to add that optimization later but forgot about it. Andrei
May 22 2014