digitalmars.D - Unnecessary Large Memory Usage of Levenshtein Distance in Phobos
- =?UTF-8?B?Ik5vcmRsw7Z3Ig==?= (178/178) May 22 2014 I justd discovered that the std.algorithm implementation of
- Max Klyga (3/182) May 22 2014 You should make a pull request with this implementation adapted to
- James Carr via Digitalmars-d (16/208) May 22 2014 I've begun work on this and my implementation in D passes all the
- =?UTF-8?B?Ik5vcmRsw7Z3Ig==?= (5/6) May 22 2014 I uploaded it here
- James Carr via Digitalmars-d (4/11) May 22 2014 I've submitted a pull request.
- James Carr via Digitalmars-d (7/22) May 25 2014 Hi again,
- Brian Schott (4/9) Jul 11 2014 This PR has been sitting around for over a month without anybody
- Andrei Alexandrescu (4/6) May 22 2014 Yes please. I recall I wanted to add that optimization later but forgot
I justd discovered that the std.algorithm implementation of Levenshtein Distance requires O(m*n) memory usage. This is not neccessary. I have a C++-implementation of Damerau-Levenshtein that requires only O(3*min(m,n)). Is anybody interested in discussion modifying std.algorithm to make use of this? Here's C++ implementation: template<class T> inline void perm3_231(T& a, T& b, T& c) { T _t=a; a=b; b=c; c=_t; } template<class T> inline pure bool is_min(const T& a) { return a == pnw::minof<T>(); } template<class T> inline pure bool is_max(const T& a) { return a == pnw::maxof<T>(); } template<class T, class D = size_t> inline pure D damerau_levenshtein(const T& s_, const T& t_, D max_distance = std::numeric_limits<D>::max(), D insert_weight = static_cast<D>(10), D delete_weight = static_cast<D>(7), D replace_weight = static_cast<D>(5), D transposition_weight = static_cast<D>(3)) { // reorder s and t to minimize memory usage bool ook = s_.size() >= t_.size(); // argument ordering ok flag const T& s = ook ? s_ : t_; // assure \c s becomes the \em longest const T& t = ook ? t_ : s_; // assure \c t becomes the \em shortest const D m = s.size(); const D n = t.size(); if (m == 0) { return n; } if (n == 0) { return m; } // Adapt the algorithm to use less space, O(3*min(n,m)) instead of O(mn), // since it only requires that the previous row/column and current row/column be stored at // any one time. #ifdef HAVE_C99_VARIABLE_LENGTH_ARRAYS D cc_[n+1], pc_[n+1], sc_[n+1]; // current, previous and second-previous column on stack #elif HAVE_CXX11_UNIQUE_PTR std::unique_ptr<D[]> cc_(new D[n+1]); // current column std::unique_ptr<D[]> pc_(new D[n+1]); // previous column std::unique_ptr<D[]> sc_(new D[n+1]); // second previous column #else auto cc_ = new D[n+1]; // current column auto pc_ = new D[n+1]; // previous column auto sc_ = new D[n+1]; // second previous column //std::vector<D> cc_(n+1), pc_(n+1), sc_(n+1); // current, previous and second previous column #endif D * cc = &cc_[0], * pc = &pc_[0], * sc = &sc_[0]; // pointers for efficient swapping // initialize previous column for (D i = 0; i < n+1; ++i) { pc[i] = i * insert_weight; } // second previous column \c sc will be defined in second \c i iteration in outer-loop const auto D_max = std::numeric_limits<D>::max(); // Computing the Levenshtein distance is based on the observation that if we // reserve a matrix to hold the Levenshtein distances between all prefixes // of the first string and all prefixes of the second, then we can compute // the values in the matrix by flood filling the matrix, and thus find the // distance between the two full strings as the last value computed. // This algorithm, an example of bottom-up dynamic programming, is // discussed, with variants, in the 1974 article The String-to-string // correction problem by Robert A. Wagner and Michael J.Fischer. for (D i = 0; i < m; ++i) { cc[0] = i+insert_weight; auto tmin = D_max; // row/column total min for (D j = 0; j < n; ++j) { // TODO Use sub_dist //auto sub_dist = damerau_levenshtein(s[i], t[j]); // recurse if for example T is an std::vector<std::string> cc[j+1] = pnw::min(pc[j+1] + insert_weight, // insertion cc[j] + delete_weight, // deletion pc[j] + (s[i] == t[j] ? 0 : replace_weight)); // substitution // transposition if (not is_max(transposition_weight)) { // if transposition should be allowed if (i > 0 and j > 0 and // we need at least two characters s[i-1] == t[j] and // and first must be equal second s[i] == t[j-1] // and vice versa ) { cc[j+1] = std::min(cc[j+1], sc[j-1] + transposition_weight); } } if (not is_max(max_distance)) { tmin = std::min(tmin, cc[j+1]); } } if ((not is_max(max_distance)) and tmin >= max_distance) { // if no element is smaller than \p max_distance return max_distance; } if (transposition_weight) { perm3_231(pc, cc, sc); // rotate pointers } else { std::swap(cc, pc); } } #if !(defined(HAVE_C99_VARIABLE_LENGTH_ARRAYS) || defined(HAVE_CXX11_UNIQUE_PTR)) delete [] cc_; delete [] pc_; delete [] sc_; #endif return pc[n]; } /*! Get \em Levenshtein (Edit) Distance (LD) metric between the \em sequences \p s and \p t. * Computing LD is also called Optimal String Alignment (OSA). */ template<class T, class D = size_t> inline pure D levenshtein(const T& s, const T& t, D max_distance = std::numeric_limits<D>::max(), D insert_weight = static_cast<D>(10), D delete_weight = static_cast<D>(7), D replace_weight = static_cast<D>(5)) { return damerau_levenshtein(s, t, max_distance, insert_weight, delete_weight, replace_weight, std::numeric_limits<D>::max()); } /*! Get \em Levenshtein (Edit) Distance (LD) metric between the \em arrays \p s and \p t. * Computing LD is also called Optimal String Alignment (OSA). */ template<class D = size_t> inline pure D levenshtein(const char * s, const char * t, D max_distance = std::numeric_limits<D>::max(), D insert_weight = static_cast<D>(10), D delete_weight = static_cast<D>(7), D replace_weight = static_cast<D>(5)) { return levenshtein(csc(s), csc(t), max_distance, insert_weight, delete_weight, replace_weight); } /* ---------------------------- Group Separator ---------------------------- */ template<class T, class D = size_t> inline pure D test_levenshtein_symmetry(const T& s, const T& t, D max_distance = std::numeric_limits<D>::max()) { D st = levenshtein(s, t, max_distance, static_cast<D>(1),static_cast<D>(1),static_cast<D>(1)); D ts = levenshtein(t, s, max_distance, static_cast<D>(1),static_cast<D>(1),static_cast<D>(1)); bool sym = (st == ts); // symmetry return sym ? st : std::numeric_limits<D>::max(); }
May 22 2014
You should make a pull request with this implementation adapted to std.algorithm interface On 2014-05-22 09:49:09 +0000, Nordlöw said:I justd discovered that the std.algorithm implementation of Levenshtein Distance requires O(m*n) memory usage. This is not neccessary. I have a C++-implementation of Damerau-Levenshtein that requires only O(3*min(m,n)). Is anybody interested in discussion modifying std.algorithm to make use of this? Here's C++ implementation: template<class T> inline void perm3_231(T& a, T& b, T& c) { T _t=a; a=b; b=c; c=_t; } template<class T> inline pure bool is_min(const T& a) { return a == pnw::minof<T>(); } template<class T> inline pure bool is_max(const T& a) { return a == pnw::maxof<T>(); } template<class T, class D = size_t> inline pure D damerau_levenshtein(const T& s_, const T& t_, D max_distance = std::numeric_limits<D>::max(), D insert_weight = static_cast<D>(10), D delete_weight = static_cast<D>(7), D replace_weight = static_cast<D>(5), D transposition_weight = static_cast<D>(3)) { // reorder s and t to minimize memory usage bool ook = s_.size() >= t_.size(); // argument ordering ok flag const T& s = ook ? s_ : t_; // assure \c s becomes the \em longest const T& t = ook ? t_ : s_; // assure \c t becomes the \em shortest const D m = s.size(); const D n = t.size(); if (m == 0) { return n; } if (n == 0) { return m; } // Adapt the algorithm to use less space, O(3*min(n,m)) instead of O(mn), // since it only requires that the previous row/column and current row/column be stored at // any one time. #ifdef HAVE_C99_VARIABLE_LENGTH_ARRAYS D cc_[n+1], pc_[n+1], sc_[n+1]; // current, previous and second-previous column on stack #elif HAVE_CXX11_UNIQUE_PTR std::unique_ptr<D[]> cc_(new D[n+1]); // current column std::unique_ptr<D[]> pc_(new D[n+1]); // previous column std::unique_ptr<D[]> sc_(new D[n+1]); // second previous column #else auto cc_ = new D[n+1]; // current column auto pc_ = new D[n+1]; // previous column auto sc_ = new D[n+1]; // second previous column //std::vector<D> cc_(n+1), pc_(n+1), sc_(n+1); // current, previous and second previous column #endif D * cc = &cc_[0], * pc = &pc_[0], * sc = &sc_[0]; // pointers for efficient swapping // initialize previous column for (D i = 0; i < n+1; ++i) { pc[i] = i * insert_weight; } // second previous column \c sc will be defined in second \c i iteration in outer-loop const auto D_max = std::numeric_limits<D>::max(); // Computing the Levenshtein distance is based on the observation that if we // reserve a matrix to hold the Levenshtein distances between all prefixes // of the first string and all prefixes of the second, then we can compute // the values in the matrix by flood filling the matrix, and thus find the // distance between the two full strings as the last value computed. // This algorithm, an example of bottom-up dynamic programming, is // discussed, with variants, in the 1974 article The String-to-string // correction problem by Robert A. Wagner and Michael J.Fischer. for (D i = 0; i < m; ++i) { cc[0] = i+insert_weight; auto tmin = D_max; // row/column total min for (D j = 0; j < n; ++j) { // TODO Use sub_dist //auto sub_dist = damerau_levenshtein(s[i], t[j]); // recurse if for example T is an std::vector<std::string> cc[j+1] = pnw::min(pc[j+1] + insert_weight, // insertion cc[j] + delete_weight, // deletion pc[j] + (s[i] == t[j] ? 0 : replace_weight)); // substitution // transposition if (not is_max(transposition_weight)) { // if transposition should be allowed if (i > 0 and j > 0 and // we need at least two characters s[i-1] == t[j] and // and first must be equal second s[i] == t[j-1] // and vice versa ) { cc[j+1] = std::min(cc[j+1], sc[j-1] + transposition_weight); } } if (not is_max(max_distance)) { tmin = std::min(tmin, cc[j+1]); } } if ((not is_max(max_distance)) and tmin >= max_distance) { // if no element is smaller than \p max_distance return max_distance; } if (transposition_weight) { perm3_231(pc, cc, sc); // rotate pointers } else { std::swap(cc, pc); } } #if !(defined(HAVE_C99_VARIABLE_LENGTH_ARRAYS) || defined(HAVE_CXX11_UNIQUE_PTR)) delete [] cc_; delete [] pc_; delete [] sc_; #endif return pc[n]; } /*! Get \em Levenshtein (Edit) Distance (LD) metric between the \em sequences \p s and \p t. * Computing LD is also called Optimal String Alignment (OSA). */ template<class T, class D = size_t> inline pure D levenshtein(const T& s, const T& t, D max_distance = std::numeric_limits<D>::max(), D insert_weight = static_cast<D>(10), D delete_weight = static_cast<D>(7), D replace_weight = static_cast<D>(5)) { return damerau_levenshtein(s, t, max_distance, insert_weight, delete_weight, replace_weight, std::numeric_limits<D>::max()); } /*! Get \em Levenshtein (Edit) Distance (LD) metric between the \em arrays \p s and \p t. * Computing LD is also called Optimal String Alignment (OSA). */ template<class D = size_t> inline pure D levenshtein(const char * s, const char * t, D max_distance = std::numeric_limits<D>::max(), D insert_weight = static_cast<D>(10), D delete_weight = static_cast<D>(7), D replace_weight = static_cast<D>(5)) { return levenshtein(csc(s), csc(t), max_distance, insert_weight, delete_weight, replace_weight); } /* ---------------------------- Group Separator ---------------------------- */ template<class T, class D = size_t> inline pure D test_levenshtein_symmetry(const T& s, const T& t, D max_distance = std::numeric_limits<D>::max()) { D st = levenshtein(s, t, max_distance, static_cast<D>(1),static_cast<D>(1),static_cast<D>(1)); D ts = levenshtein(t, s, max_distance, static_cast<D>(1),static_cast<D>(1),static_cast<D>(1)); bool sym = (st == ts); // symmetry return sym ? st : std::numeric_limits<D>::max(); }
May 22 2014
I've begun work on this and my implementation in D passes all the std.algorithm unit tests, but because it now uses a single array instead of a matrix, path() no longer provides the correct answer. I'm working on trying to amend it so that there is consistency. On Thu, May 22, 2014 at 12:26 PM, Max Klyga via Digitalmars-d < digitalmars-d puremagic.com> wrote:You should make a pull request with this implementation adapted to std.algorithm interface On 2014-05-22 09:49:09 +0000, Nordl=C3=B6w said: I justd discovered that the std.algorithm implementation of LevenshteintedDistance requires O(m*n) memory usage. This is not neccessary. I have a C++-implementation of Damerau-Levenshtein that requires only O(3*min(m,n)). Is anybody interes=tin discussion modifying std.algorithm to make use of this? Here's C++ implementation: template<class T> inline void perm3_231(T& a, T& b, T& c) { T _t=3Da; a=3Db; b=3Dc; c=3D_t; } template<class T> inline pure bool is_min(const T& a) { return a =3D=3D pnw::minof<T>(); } template<class T> inline pure bool is_max(const T& a) { return a =3D=3D pnw::maxof<T>(); } template<class T, class D =3D size_t> inline pure D damerau_levenshtein(const T& s_, const T& t_, D max_distance =3D std::numeric_limits<D>::max(), D insert_weight =3D static_cast<D>(10), D delete_weight =3D static_cast<D>(7), D replace_weight =3D static_cast<D>(5), D transposition_weight =3D static_cast<D>(3)) { // reorder s and t to minimize memory usage bool ook =3D s_.size() >=3D t_.size(); // argument ordering ok flag const T& s =3D ook ? s_ : t_; // assure \c s becomes the \em longes=stconst T& t =3D ook ? t_ : s_; // assure \c t becomes the \em shorte=nconst D m =3D s.size(); const D n =3D t.size(); if (m =3D=3D 0) { return n; } if (n =3D=3D 0) { return m; } // Adapt the algorithm to use less space, O(3*min(n,m)) instead of O(mn), // since it only requires that the previous row/column and current row/column be stored at // any one time. #ifdef HAVE_C99_VARIABLE_LENGTH_ARRAYS D cc_[n+1], pc_[n+1], sc_[n+1]; // current, previous and second-previous column on stack #elif HAVE_CXX11_UNIQUE_PTR std::unique_ptr<D[]> cc_(new D[n+1]); // current column std::unique_ptr<D[]> pc_(new D[n+1]); // previous column std::unique_ptr<D[]> sc_(new D[n+1]); // second previous colum=for#else auto cc_ =3D new D[n+1]; // current column auto pc_ =3D new D[n+1]; // previous column auto sc_ =3D new D[n+1]; // second previous column //std::vector<D> cc_(n+1), pc_(n+1), sc_(n+1); // current, previous and second previous column #endif D * cc =3D &cc_[0], * pc =3D &pc_[0], * sc =3D &sc_[0]; // pointers=.efficient swapping // initialize previous column for (D i =3D 0; i < n+1; ++i) { pc[i] =3D i * insert_weight; } // second previous column \c sc will be defined in second \c i iteration in outer-loop const auto D_max =3D std::numeric_limits<D>::max(); // Computing the Levenshtein distance is based on the observation that if we // reserve a matrix to hold the Levenshtein distances between all prefixes // of the first string and all prefixes of the second, then we can compute // the values in the matrix by flood filling the matrix, and thus find the // distance between the two full strings as the last value computed=ng// This algorithm, an example of bottom-up dynamic programming, is // discussed, with variants, in the 1974 article The String-to-stri=second// correction problem by Robert A. Wagner and Michael J.Fischer. for (D i =3D 0; i < m; ++i) { cc[0] =3D i+insert_weight; auto tmin =3D D_max; // row/column total min for (D j =3D 0; j < n; ++j) { // TODO Use sub_dist //auto sub_dist =3D damerau_levenshtein(s[i], t[j]); // recurse if for example T is an std::vector<std::string> cc[j+1] =3D pnw::min(pc[j+1] + insert_weight, // insertion cc[j] + delete_weight, // deletion pc[j] + (s[i] =3D=3D t[j] ? 0 : replace_weight)); // substitution // transposition if (not is_max(transposition_weight)) { // if transposition should be allowed if (i > 0 and j > 0 and // we need at least two characters s[i-1] =3D=3D t[j] and // and first must be equal =ast<s[i] =3D=3D t[j-1] // and vice versa ) { cc[j+1] =3D std::min(cc[j+1], sc[j-1] + transposition_weight); } } if (not is_max(max_distance)) { tmin =3D std::min(tmin, cc[j+1]); } } if ((not is_max(max_distance)) and tmin >=3D max_distance) { // if no element is smaller than \p max_distance return max_distance; } if (transposition_weight) { perm3_231(pc, cc, sc); // rotate pointers } else { std::swap(cc, pc); } } #if !(defined(HAVE_C99_VARIABLE_LENGTH_ARRAYS) || defined(HAVE_CXX11_UNIQUE_PTR)) delete [] cc_; delete [] pc_; delete [] sc_; #endif return pc[n]; } /*! Get \em Levenshtein (Edit) Distance (LD) metric between the \em sequences \p s and \p t. * Computing LD is also called Optimal String Alignment (OSA). */ template<class T, class D =3D size_t> inline pure D levenshtein(const T& s, const T& t, D max_distance =3D std::numeric_limits<D>::max(), D insert_weight =3D static_cast<D>(10), D delete_weight =3D static_cast<D>(7), D replace_weight =3D static_cast<D>(5)) { return damerau_levenshtein(s, t, max_distance, insert_weight, delete_weight, replace_weight, std::numeric_limits<D>::max()); } /*! Get \em Levenshtein (Edit) Distance (LD) metric between the \em arrays \p s and \p t. * Computing LD is also called Optimal String Alignment (OSA). */ template<class D =3D size_t> inline pure D levenshtein(const char * s, const char * t, D max_distance =3D std::numeric_limits<D>::max(), D insert_weight =3D static_cast<D>(10), D delete_weight =3D static_cast<D>(7), D replace_weight =3D static_cast<D>(5)) { return levenshtein(csc(s), csc(t), max_distance, insert_weight, delete_weight, replace_weight); } /* ---------------------------- Group Separator ---------------------------- */ template<class T, class D =3D size_t> inline pure D test_levenshtein_symmetry(const T& s, const T& t, D max_distance =3D std::numeric_limits<D>::max()) { D st =3D levenshtein(s, t, max_distance, static_cast<D>(1),static_c=ast<D>(1),static_cast<D>(1)); D ts =3D levenshtein(t, s, max_distance, static_cast<D>(1),static_c=D>(1),static_cast<D>(1)); bool sym =3D (st =3D=3D ts); // symmetry return sym ? st : std::numeric_limits<D>::max(); }
May 22 2014
I've begun work on this and my implementation in D passes allI uploaded it here https://github.com/nordlow/justcxx/blob/master/levenshtein_distance.hpp I don't have time right now to look into adapting it to Phobos. But I hope it can give some clues for your work ;) /Per
May 22 2014
I've submitted a pull request. https://github.com/D-Programming-Language/phobos/pull/2195 On Thu, May 22, 2014 at 8:05 PM, "Nordl=C3=B6w" <digitalmars-d puremagic.co= m>wrote:I've begun work on this and my implementation in D passes allI uploaded it here https://github.com/nordlow/justcxx/blob/master/levenshtein_distance.hpp I don't have time right now to look into adapting it to Phobos. But I hope it can give some clues for your work ;) /Per
May 22 2014
Hi again, I've implemented the memory efficient version https://github.com/D-Programming-Language/phobos/pull/2195 I was wondering if I could get some feedback please. Thanks. On Thu, May 22, 2014 at 8:42 PM, James Carr <jdcarrman gmail.com> wrote:I've submitted a pull request. https://github.com/D-Programming-Language/phobos/pull/2195 On Thu, May 22, 2014 at 8:05 PM, "Nordl=C3=B6w" <digitalmars-d puremagic.=com>wrote:I've begun work on this and my implementation in D passes allI uploaded it here https://github.com/nordlow/justcxx/blob/master/levenshtein_distance.hpp I don't have time right now to look into adapting it to Phobos. But I hope it can give some clues for your work ;) /Per
May 25 2014
On Sunday, 25 May 2014 at 22:22:28 UTC, James Carr via Digitalmars-d wrote:Hi again, I've implemented the memory efficient version https://github.com/D-Programming-Language/phobos/pull/2195 I was wondering if I could get some feedback please. Thanks.This PR has been sitting around for over a month without anybody doing anything about it. Can we get a Phobos dev to take a look?
Jul 11 2014
On 5/22/14, 4:26 AM, Max Klyga wrote:You should make a pull request with this implementation adapted to std.algorithm interfaceYes please. I recall I wanted to add that optimization later but forgot about it. Andrei
May 22 2014