digitalmars.D - Fixing dub search
- aberba (15/15) Dec 28 2020 Current dub registry search is inaccurate because it uses the
- Imperatorn (2/6) Dec 28 2020 https://github.com/dlang/dub-registry/pull/481
- aberba (2/9) Dec 28 2020 I've sent him an email about using MeiliSearch instead of a hack
- Imperatorn (2/12) Dec 29 2020 👍
- sarn (13/18) Dec 28 2020 ElasticSearch also has a simple REST API and would do this job on
- aberba (15/35) Dec 29 2020 If you've looked at the very discussion you referenced, you'd
- sarn (9/17) Dec 29 2020 I read it. It's the bug report for this issue, and it's the
- Ola Fosheim Grøstad (4/8) Dec 29 2020 The easiest option is to see if an indexing service (like the
- James Blachly (2/6) Dec 30 2020 This is an excellent suggestion!
- sarn (13/19) Jan 01 2021 Here's one problem: new contributors won't have access to the
- Ola Fosheim Grøstad (7/10) Jan 01 2021 Two standard solutions:
- bachmeier (7/14) Dec 29 2020 sqlite was the first thing I thought about when I saw this
- Ola Fosheim Grøstad (8/13) Dec 29 2020 I'm not being dismissive (I also don't use Dub), but in general
- bachmeier (8/21) Dec 29 2020 Well, except that sqlite works now and has been extensively
- Ola Fosheim Grøstad (12/20) Dec 29 2020 I actually implemented Damerau–Levenshtein in Python the other
- Ola Fosheim Grøstad (6/19) Dec 29 2020 Also, keep in mind that the fuzzy search does not have to be
- Ola Fosheim Grøstad (7/11) Dec 29 2020 Here is the SLA for Algolia, they offer 99.99% and 99.999% which
- Ola Fosheim Grøstad (7/10) Dec 29 2020 It is written in Rust...
- Ola Fosheim Grøstad (9/11) Dec 29 2020 You could just use a trie for the tokens and implement
- aberba (8/18) Dec 29 2020 Read the previous GitHub discussion. They've gone through that
- aberba (4/5) Dec 29 2020 If anyone has one written in D too, we can use that as well. I
- Ola Fosheim Grøstad (11/16) Dec 29 2020 Alright, if someone wants to start on it, then I'm willing to
Current dub registry search is inaccurate because it uses the built-in MongoDB search, which isn't designed for accurate search.

To fix this, a real search engine is needed. ElasticSearch is overkill for what we need... basic accurate string search.

Solution: use MeiliSearch. It's very lightweight and fast (a 1GB VPS is more than enough). Very easy to use... just a REST API call. I already have a package for meilisearch on dub.

What's needed: a hosted running instance of MeiliSearch for use in dub search.

Since only the search functionality needs to be fixed, MeiliSearch will index a copy of all packages and re-index when they change. The MeiliSearch index will handle search queries whilst MongoDB continues to handle everything else.

I can make a PR for the MeiliSearch integration, but I need to know whether the foundation is willing to host a MeiliSearch instance for that.
Dec 28 2020
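To make the "just a REST API call" point in the opening post concrete, here is a rough Python sketch of what a search query against a self-hosted MeiliSearch instance might look like. The index name `packages`, the helper names, and the key are placeholders invented for illustration; port 7700 is MeiliSearch's default, and the `Authorization: Bearer` header is what recent releases expect (releases from around the time of this thread used an `X-Meili-API-Key` header instead). This is not the dub registry's actual integration.

```python
import json
import urllib.request

# Assumptions (not from the thread): a MeiliSearch instance on its default
# port 7700, a hypothetical index named "packages", and a placeholder key.
MEILI_URL = "http://localhost:7700"
INDEX_UID = "packages"


def build_search_request(query, api_key, limit=20):
    """Build (but do not send) a POST request for the index's search route."""
    body = json.dumps({"q": query, "limit": limit}).encode("utf-8")
    return urllib.request.Request(
        url=f"{MEILI_URL}/indexes/{INDEX_UID}/search",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Recent MeiliSearch releases; older ones used "X-Meili-API-Key".
            "Authorization": f"Bearer {api_key}",
        },
    )


def search(query, api_key):
    """Send the request and return the matching documents (the "hits" list)."""
    with urllib.request.urlopen(build_search_request(query, api_key)) as resp:
        return json.load(resp)["hits"]
```

Calling `search("vibe", "PRIVATE_KEY")` needs a running instance; `build_search_request` can be inspected without one.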
On Monday, 28 December 2020 at 10:49:58 UTC, aberba wrote:
> Current dub registry search is inaccurate because it uses the built-in MongoDB search which isn't designed for accurate search. [...]

https://github.com/dlang/dub-registry/pull/481
Dec 28 2020
On Monday, 28 December 2020 at 18:55:09 UTC, Imperatorn wrote:
> On Monday, 28 December 2020 at 10:49:58 UTC, aberba wrote:
>> Current dub registry search is inaccurate because it uses the built-in MongoDB search which isn't designed for accurate search. [...]
> https://github.com/dlang/dub-registry/pull/481

I've sent him an email about using MeiliSearch instead of a hack
Dec 28 2020
On Monday, 28 December 2020 at 19:15:33 UTC, aberba wrote:
> On Monday, 28 December 2020 at 18:55:09 UTC, Imperatorn wrote:
>> On Monday, 28 December 2020 at 10:49:58 UTC, aberba wrote:
>>> Current dub registry search is inaccurate because it uses the built-in MongoDB search which isn't designed for accurate search. [...]
>> https://github.com/dlang/dub-registry/pull/481
> I've sent him an email about using MeiliSearch instead of a hack

👍
Dec 29 2020
On Monday, 28 December 2020 at 10:49:58 UTC, aberba wrote:
> To fix this, a real search engine is needed. ElasticSearch is an overkill for what we need... basic accurate string search. Solution: use MeiliSearch. It's very lightweight and fast (1GB vps is more than enough). Very easy to use... just a REST API call. I already have a package for meilisearch on dub.

ElasticSearch also has a simple REST API and would do this job on whatever hardware we'd realistically use. I'm not a huge ES fan, personally, but do you have more reasons to dismiss it as overkill and recommend MeiliSearch instead?

The best place for discussion is here, though: https://github.com/dlang/dub-registry/issues/93

But I have to say something again: please, please, please, I beg, consider using an embedded search tool before adding an external server (or, worse, an external SaaS) as a runtime dependency to the dub registry. There are only a few thousand packages, and they don't update much. Even grepping the whole dataset every request would be fast enough (just not featureful enough).
Dec 28 2020
On Monday, 28 December 2020 at 20:49:12 UTC, sarn wrote:
> On Monday, 28 December 2020 at 10:49:58 UTC, aberba wrote:
>> To fix this, a real search engine is needed. ElasticSearch is an overkill for what we need... basic accurate string search. Solution: use MeiliSearch. It's very lightweight and fast (1GB vps is more than enough). Very easy to use... just a REST API call. I already have a package for meilisearch on dub.
> ElasticSearch also has a simple REST API and would do this job on whatever hardware we'd realistically use. I'm not a huge ES fan, personally, but do you have more reasons to dismiss it as overkill and recommend MeiliSearch instead? The best place for discussion is here, though: https://github.com/dlang/dub-registry/issues/93 But I have to say something again: please, please, please, I beg, consider using an embedded search tool before adding an external server (or, worse, an external SaaS) as a runtime dependency to the dub registry. There are only a few thousand packages, and they don't update much. Even grepping the whole dataset every request would be fast enough (just not featureful enough).

If you'd looked at the very discussion you referenced, you'd realize they went around and still came back to using MongoDB for search.

Not only is ElasticSearch built in Java, hence bloatware, it's also designed to do more than just search... hence more bloatware and overkill for just basic search. You may compare the size of ElasticSearch with MeiliSearch, which is just a small binary. MeiliSearch can run on very little RAM...

The accuracy you'd get from a search engine just isn't possible with brute force and hacks. Search is a complex problem involving stemming, plurals, synonyms, stop words, ranking, etc. You'd want to use a real search engine. And between ElasticSearch and MeiliSearch, MeiliSearch is simpler, lightweight and easy to use.
Dec 29 2020
On Tuesday, 29 December 2020 at 08:45:02 UTC, aberba wrote:
> On Monday, 28 December 2020 at 20:49:12 UTC, sarn wrote:
>> The best place for discussion is here, though: https://github.com/dlang/dub-registry/issues/93
> If you've looked at the very discussion you referenced, you'd realize they went around and still came back to using mongodb for search.

I read it. It's the bug report for this issue, and it's the discussion thread for this community project. There's no them vs you. There isn't even a final decision in that thread. MongoDB is being used now because that's what's implemented.

> And between elasticsearch and MeiliSearch, MeiliSearch is simpler, lightweight and easy to use.

Have you considered anything other than ElasticSearch, MeiliSearch and custom hacks? There are at least four other options mentioned in the thread I linked to. Maybe add MeiliSearch and your reasons for using it.
Dec 29 2020
On Tuesday, 29 December 2020 at 22:27:17 UTC, sarn wrote:
> Have you considered anything other than ElasticSearch, MeiliSearch and custom hacks? There are at least four other options mentioned in the thread I linked to. Maybe add MeiliSearch and your reasons for using it.

The easiest option is to see if an indexing service (like the mentioned Algolia) is willing to sponsor Dub as an open source project, then they get some free advertising in return.
Dec 29 2020
On 12/29/20 5:34 PM, Ola Fosheim Grøstad wrote:
> The easiest option is to see if an indexing service (like the mentioned Algolia) is willing to sponsor Dub as an open source project, then they get some free advertising in return.

This is an excellent suggestion!
Dec 30 2020
On Thursday, 31 December 2020 at 02:39:44 UTC, James Blachly wrote:
> On 12/29/20 5:34 PM, Ola Fosheim Grøstad wrote:
>> The easiest option is to see if an indexing service (like the mentioned Algolia) is willing to sponsor Dub as an open source project, then they get some free advertising in return.
> This is an excellent suggestion!

Here's one problem: new contributors won't have access to the third-party service and will need to set up their own indexes from scratch, but the scripts or whatever to do that probably won't be maintained because existing contributors won't need them. It's healthier for the project if anyone can just download the codebase and hack on it without signing up for some third party service.

By the way, I think there's a good project in here and I'm willing to contribute my 2c and maybe more, but I know I'll lose any discussion in this thread. I'm following this one: https://github.com/dlang/dub-registry/issues/93
Jan 01 2021
On Friday, 1 January 2021 at 11:02:49 UTC, sarn wrote:
> Here's one problem: new contributors won't have access to the third-party service and will need to set up their own indexes from scratch, but the scripts or whatever to do that probably

Two standard solutions:

1. make a tiny webservice for it; have one set up for production and another one for testing.
2. make a tiny local in-memory emulator for it (no need for advanced matching or ranking)

(I didn't quite get the argument about "losing")
Jan 01 2021
On Monday, 28 December 2020 at 20:49:12 UTC, sarn wrote:
> On Monday, 28 December 2020 at 10:49:58 UTC, aberba wrote:
> The best place for discussion is here, though: https://github.com/dlang/dub-registry/issues/93

In that thread you wrote

> The FTS features of DBs like Sqlite and Postgres are really nice if you're already using those DBs (otherwise other tools are more powerful). Moving all data to Sqlite or PG is obviously a whole bigger decision.

sqlite was the first thing I thought about when I saw this thread. How much data would have to be copied into an sqlite database for searching of packages? That has the advantage of more or less no dependencies, dead simple to add, claimed good results...
Dec 29 2020
On Tuesday, 29 December 2020 at 22:53:55 UTC, bachmeier wrote:
> sqlite was the first thing I thought about when I saw this thread. How much data would have to be copied into an sqlite database for searching of packages? That has the advantage of more or less no dependencies, dead simple to add, claimed good results...

I'm not being dismissive (I also don't use Dub), but in general this would not scale very well. Unless you want to do all searches locally. Also, a high quality search engine requires custom ranking, so not really sure if it is overall less work than rolling your own if you want high quality search results.

The text corpus is tiny, so there is really no point in using a generic on-disk solution?
Dec 29 2020
On Tuesday, 29 December 2020 at 23:17:33 UTC, Ola Fosheim Grøstad wrote:
> On Tuesday, 29 December 2020 at 22:53:55 UTC, bachmeier wrote:
>> sqlite was the first thing I thought about when I saw this thread. How much data would have to be copied into an sqlite database for searching of packages? That has the advantage of more or less no dependencies, dead simple to add, claimed good results...
> I'm not being dismissive (I also don't use Dub), but in general this would not scale very well. Unless you want to do all searches locally. Also, a high quality search engine requires custom ranking, so not really sure if it is overall less work than rolling your own if you want high quality search results. The text corpus is tiny, so there is really no point in using a generic on-disk solution?

Well, except that sqlite works now and has been extensively tested. I don't want to discourage anyone from rolling their own, but knowing how long things take around here, and using actuarial tables to compute my life expectancy, it's not obvious that it would impact me. That's also why adding another dependency concerns me.
Dec 29 2020
On Tuesday, 29 December 2020 at 23:32:28 UTC, bachmeier wrote:
> On Tuesday, 29 December 2020 at 23:17:33 UTC, Ola Fosheim Grøstad wrote:
> Well, except that sqlite works now and has been extensively tested. I don't want to discourage anyone from rolling their own, but knowing how long things take around here, and using actuarial tables to compute my life expectancy, it's not obvious that it would impact me. That's also why adding another dependency concerns me.

I actually implemented Damerau–Levenshtein in Python the other day in order to validate an exam question... It takes <15 minutes from scratch. A faster version on a trie can be done in an evening, debugged and tested. A full system in a weekend.

But, the advantage in using an existing online service is that you get automatic scaling and better uptime: write once, run forever...

I think such a service should be grateful if Dub provided them with:

1. cheap advertising
2. a maintained API to their service

Maybe they even will pay for the work, who knows?
Dec 29 2020
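For readers wondering what the "15 minutes from scratch" version mentioned above might look like, here is a minimal Python sketch. It is not the poster's code. It implements the restricted variant usually called optimal string alignment (OSA), which counts an adjacent transposition as one edit but does not allow a transposed pair to be edited again; the unrestricted Damerau–Levenshtein distance needs extra bookkeeping.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    # d[i][j] holds the distance between the prefixes a[:i] and b[:j].
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i  # delete all i characters of a[:i]
    for j in range(len(b) + 1):
        d[0][j] = j  # insert all j characters of b[:j]

    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Adjacent transposition, e.g. "ib" <-> "bi".
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]
```

With this, a typo such as `vbie` is one edit away from `vibe`, where plain Levenshtein would count two substitutions.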
On Tuesday, 29 December 2020 at 23:40:44 UTC, Ola Fosheim Grøstad wrote:
> On Tuesday, 29 December 2020 at 23:32:28 UTC, bachmeier wrote:
>> On Tuesday, 29 December 2020 at 23:17:33 UTC, Ola Fosheim Grøstad wrote:
>> Well, except that sqlite works now and has been extensively tested. I don't want to discourage anyone from rolling their own, but knowing how long things take around here, and using actuarial tables to compute my life expectancy, it's not obvious that it would impact me. That's also why adding another dependency concerns me.
> I actually implemented Damerau–Levenshtein in Python the other day in order to validate an exam question... It takes <15 minutes from scratch. A faster version on a trie can be done in an evening, debugged and tested. A full system in a weekend.

Also, keep in mind that the fuzzy search does not have to be crazy fast when people often search for the same stuff. Just log all search phrases and preload the caches with the most common ones. With some luck maybe 90% of all searches hit caches?
Dec 29 2020
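The "log all search phrases and preload the caches" idea from the post above could look roughly like the following Python sketch. The sample package list, the `slow_search` stand-in, and all function names are made up for illustration; a real registry would put its actual fuzzy search behind the cache and would need to clear the cache whenever the package index changes.

```python
from collections import Counter
from functools import lru_cache

# Made-up sample corpus and call counter, for illustration only.
PACKAGES = ("vibe-d", "mir-algorithm", "dub", "arsd-official")
STATS = {"slow_calls": 0}


def slow_search(phrase):
    """Stand-in for an expensive fuzzy search over the package index."""
    STATS["slow_calls"] += 1
    return tuple(name for name in PACKAGES if phrase in name)


@lru_cache(maxsize=1024)
def cached_search(phrase):
    return slow_search(phrase)


def normalize(phrase):
    """Lower-case and collapse whitespace so equivalent queries share a slot."""
    return " ".join(phrase.lower().split())


def search(phrase):
    return cached_search(normalize(phrase))


def preload(search_log, top_n=100):
    """Warm the cache with the top_n most frequent phrases from a query log."""
    counts = Counter(normalize(p) for p in search_log)
    for phrase, _ in counts.most_common(top_n):
        cached_search(phrase)
```

After `preload(log)`, repeated popular queries are answered from memory and only the long tail reaches `slow_search`.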
On Tuesday, 29 December 2020 at 23:40:44 UTC, Ola Fosheim Grøstad wrote:
> But, the advantage in using an existing online service is that you get automatic scaling and better uptime: write once, run forever... I think such a service should be grateful if Dub provided them with:

Here is the SLA for Algolia. They offer 99.99% and 99.999%, which translates to about 53 minutes and 5 minutes of downtime per year. It would be difficult (highly improbable) to compete with that for a self-hosted solution.

https://www.algolia.com/blog/for-slas-theres-no-such-thing-as-100-uptime-only-100-transparency/
Dec 29 2020
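The downtime figures quoted from the SLA follow from simple arithmetic. As a quick check, assuming a 365-day year (525,600 minutes), the allowed downtime is the unavailable fraction times the minutes in a year; the function name below is just for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, assuming a non-leap year


def downtime_minutes_per_year(availability_percent):
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.99, 99.999):
    print(f"{target}% -> {downtime_minutes_per_year(target):.2f} min/year")
```

This gives roughly 52.56 minutes for 99.99% and roughly 5.26 minutes for 99.999%, matching the rounded numbers in the post.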
On Monday, 28 December 2020 at 10:49:58 UTC, aberba wrote:
> I can make a PR for the MeiliSearch integration but I need to know whether the foundation is willing to host a MeiliSearch instance for that.

It is written in Rust...

But seriously, in-memory search is easy to implement, so it would look better if it were done in D.

An alternative is to use an existing online indexing service, probably cheaper and more scalable than setting up a dedicated service yourself.
Dec 29 2020
On Tuesday, 29 December 2020 at 11:47:11 UTC, Ola Fosheim Grøstad wrote:
> But seriously, in-memory search is easy to implement, so it would look better if it were done in D.

You could just use a trie for the tokens and implement Levenshtein-Damerau fuzzy matching on that. That is a fun exercise to do. The next fun exercise is to abstract it in a way that fits into Phobos!

(Fun fact: I've just read a bunch of suggestions for how to do this as I am spending my holiday grading exams in text search... :-P Ok, not so fun...)
Dec 29 2020
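As a rough illustration of the trie suggestion above, here is a small Python sketch (the thread would prefer D; Python is used only to keep the example short). It stores tokens in a trie and walks it while maintaining one row of the edit-distance table per node, so shared prefixes are computed once and whole subtrees are pruned when no cell in the row is within the limit. For brevity it uses plain Levenshtein distance; the transposition case of Damerau–Levenshtein would additionally need the row from two levels up. All class and method names are invented for this sketch.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # character -> TrieNode
        self.word = None    # set to the full token where a token ends


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.word = word

    def fuzzy_search(self, query, max_dist):
        """Return (token, distance) pairs with Levenshtein distance <= max_dist."""
        results = []
        # Row for the empty prefix: turning "" into query[:j] costs j inserts.
        first_row = list(range(len(query) + 1))
        if self.root.word is not None and first_row[-1] <= max_dist:
            results.append((self.root.word, first_row[-1]))
        for ch, child in self.root.children.items():
            self._walk(child, ch, query, first_row, max_dist, results)
        return sorted(results, key=lambda pair: (pair[1], pair[0]))

    def _walk(self, node, ch, query, prev_row, max_dist, results):
        # Build the table row for the trie prefix that ends in `ch`.
        row = [prev_row[0] + 1]
        for j in range(1, len(query) + 1):
            row.append(min(
                row[j - 1] + 1,                            # insertion
                prev_row[j] + 1,                           # deletion
                prev_row[j - 1] + (query[j - 1] != ch),    # match/substitution
            ))
        if node.word is not None and row[-1] <= max_dist:
            results.append((node.word, row[-1]))
        # No extension of this prefix can beat the smallest cell in the row.
        if min(row) <= max_dist:
            for next_ch, child in node.children.items():
                self._walk(child, next_ch, query, row, max_dist, results)
```

For example, `Trie(["vibe-d", "vibe", "dub", "mir", "dud"]).fuzzy_search("dub", 1)` returns the exact match `dub` followed by `dud` at distance 1.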
On Tuesday, 29 December 2020 at 11:47:11 UTC, Ola Fosheim Grøstad wrote:
> On Monday, 28 December 2020 at 10:49:58 UTC, aberba wrote:
>> I can make a PR for the MeiliSearch integration but I need to know whether the foundation is willing to host a MeiliSearch instance for that.
> It is written in Rust... But seriously, in-memory search is easy to implement, so it would look better if it were done in D. An alternative is to use an existing online indexing service, probably cheaper and more scalable than setting up a dedicated service yourself.

Read the previous GitHub discussion. They've gone down that route. Any PaaS costs more than IaaS. If cost isn't an issue then we can go with that too. But since the registry is hosted, it's quite straightforward to do ./meilisearch --master-key PRIVATE_KEY and be done with it.
Dec 29 2020
On Tuesday, 29 December 2020 at 11:47:11 UTC, Ola Fosheim Grøstad wrote:
> It is written in Rust...

If anyone has one written in D too, we can use that as well. I just want to have the embarrassing search fixed.
Dec 29 2020
On Tuesday, 29 December 2020 at 12:04:28 UTC, aberba wrote:
> On Tuesday, 29 December 2020 at 11:47:11 UTC, Ola Fosheim Grøstad wrote:
>> It is written in Rust...
> If anyone has one written in D too, we can use that as well. I just want to have the embarrassing search fixed.

Alright, if someone wants to start on it, then I'm willing to help out with suggestions and code reviews.

This is a decent starting point:

https://nlp.stanford.edu/IR-book/information-retrieval-book.html

And also Wikipedia:

https://en.wikipedia.org/wiki/Inverted_index
https://en.wikipedia.org/wiki/Approximate_string_matching
https://en.wikipedia.org/wiki/Suffix_array
https://en.wikipedia.org/wiki/Trie
https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm
Dec 29 2020