digitalmars.D - Fixing dub search

reply aberba <karabutaworld gmail.com> writes:
Current dub registry search is inaccurate because it uses the 
built-in MongoDB search which isn't designed for accurate search.

To fix this, a real search engine is needed. ElasticSearch is 
overkill for what we need: basic, accurate string search.

Solution: use MeiliSearch. It's very lightweight and fast (a 1GB 
VPS is more than enough). Very easy to use... just a REST API 
call. I already have a package for MeiliSearch on dub.


What's needed: a hosted, running instance of MeiliSearch for use 
in dub search. Since only the search functionality needs to be 
fixed, MeiliSearch will index a copy of all packages and 
re-index them when they change. The MeiliSearch index will handle 
search queries whilst MongoDB continues to handle everything else.


I can make a PR for the MeiliSearch integration, but I need to 
know whether the foundation is willing to host a MeiliSearch 
instance for that.
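
For illustration only, here is roughly what "just a REST API call" could look like. This is a minimal sketch, not the proposed PR: the server address, the key, and the index name `packages` are placeholders, and the helper names are made up. The two endpoints are MeiliSearch's document and search routes. The requests are only built here, not sent.

```python
import json
import urllib.request

# Assumptions: a MeiliSearch instance at this address; both values are placeholders.
MEILI_URL = "http://localhost:7700"
MEILI_KEY = "PRIVATE_KEY"
INDEX = "packages"  # hypothetical index name


def build_request(path, payload):
    """Build (but do not send) a JSON POST request for the MeiliSearch REST API."""
    return urllib.request.Request(
        MEILI_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Newer releases expect a bearer token; older ones used X-Meili-API-Key.
            "Authorization": "Bearer " + MEILI_KEY,
        },
        method="POST",
    )


def index_request(packages):
    """Request that adds or replaces package documents in the index."""
    return build_request(f"/indexes/{INDEX}/documents", packages)


def search_request(query, limit=20):
    """Request that runs a search query against the index."""
    return build_request(f"/indexes/{INDEX}/search", {"q": query, "limit": limit})
```

Sending one would be `urllib.request.urlopen(search_request("json"))` against a running instance. Re-indexing a changed package would be another `index_request` call with the updated document.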
Dec 28 2020
next sibling parent reply Imperatorn <johan_forsberg_86 hotmail.com> writes:
On Monday, 28 December 2020 at 10:49:58 UTC, aberba wrote:
 Current dub registry search is inaccurate because it uses the 
 built-in MongoDB search which isn't designed for accurate 
 search.

 [...]
https://github.com/dlang/dub-registry/pull/481
Dec 28 2020
parent reply aberba <karabutaworld gmail.com> writes:
On Monday, 28 December 2020 at 18:55:09 UTC, Imperatorn wrote:
 On Monday, 28 December 2020 at 10:49:58 UTC, aberba wrote:
 Current dub registry search is inaccurate because it uses the 
 built-in MongoDB search which isn't designed for accurate 
 search.

 [...]
https://github.com/dlang/dub-registry/pull/481
I've sent him an email about using MeiliSearch instead of a hack
Dec 28 2020
parent Imperatorn <johan_forsberg_86 hotmail.com> writes:
On Monday, 28 December 2020 at 19:15:33 UTC, aberba wrote:
 On Monday, 28 December 2020 at 18:55:09 UTC, Imperatorn wrote:
 On Monday, 28 December 2020 at 10:49:58 UTC, aberba wrote:
 Current dub registry search is inaccurate because it uses the 
 built-in MongoDB search which isn't designed for accurate 
 search.

 [...]
https://github.com/dlang/dub-registry/pull/481
I've sent him an email about using MeiliSearch instead of a hack
👍
Dec 29 2020
prev sibling next sibling parent reply sarn <sarn theartofmachinery.com> writes:
On Monday, 28 December 2020 at 10:49:58 UTC, aberba wrote:
 To fix this, a real search engine is needed. ElasticSearch is 
 an overkill for what we need... basic accurate string search.

 Solution: use MeiliSearch. It's very lightweight and fast (1GB 
 vps is more than enough). Very easy to use... just a REST API 
 call. I already have a package for meilisearch on dub.
ElasticSearch also has a simple REST API and would do this job on whatever hardware we'd realistically use. I'm not a huge ES fan, personally, but do you have more reasons to dismiss it as overkill and recommend MeiliSearch instead?

The best place for discussion is here, though:
https://github.com/dlang/dub-registry/issues/93

But I have to say something again: please, please, please, I beg, consider using an embedded search tool before adding an external server (or, worse, an external SaaS) as a runtime dependency to the dub registry. There are only a few thousand packages, and they don't update much. Even grepping the whole dataset every request would be fast enough (just not featureful enough).
Dec 28 2020
next sibling parent reply aberba <karabutaworld gmail.com> writes:
On Monday, 28 December 2020 at 20:49:12 UTC, sarn wrote:
 On Monday, 28 December 2020 at 10:49:58 UTC, aberba wrote:
 To fix this, a real search engine is needed. ElasticSearch is 
 an overkill for what we need... basic accurate string search.

 Solution: use MeiliSearch. It's very lightweight and fast (1GB 
 vps is more than enough). Very easy to use... just a REST API 
 call. I already have a package for meilisearch on dub.
ElasticSearch also has a simple REST API and would do this job on whatever hardware we'd realistically use. I'm not a huge ES fan, personally, but do you have more reasons to dismiss it as overkill and recommend MeiliSearch instead? The best place for discussion is here, though: https://github.com/dlang/dub-registry/issues/93 But I have to say something again: please, please, please, I beg, consider using an embedded search tool before adding an external server (or, worse, an external SaaS) as a runtime dependency to the dub registry. There are only a few thousand packages, and they don't update much. Even grepping the whole dataset every request would be fast enough (just not featureful enough).
If you've looked at the very discussion you referenced, you'd realize they went around and still came back to using MongoDB for search.

Not only is ElasticSearch built in Java, hence bloatware, it's also designed to do more than just search... hence more bloatware and overkill for just basic search. You may compare the size of ElasticSearch with MeiliSearch, which is just a small binary. MeiliSearch can run on very little RAM...

The accuracy you'd get from a search engine just isn't possible with brute force and hacks. Search is a complex problem involving stemming, plurals, synonyms, stop words, ranking, etc. You'd want to use a real search engine. And between ElasticSearch and MeiliSearch, MeiliSearch is simpler, lightweight and easy to use.
Dec 29 2020
parent reply sarn <sarn theartofmachinery.com> writes:
On Tuesday, 29 December 2020 at 08:45:02 UTC, aberba wrote:
 On Monday, 28 December 2020 at 20:49:12 UTC, sarn wrote:
 The best place for discussion is here, though:
 https://github.com/dlang/dub-registry/issues/93
If you've looked at the very discussion you referenced, you'd realize they went around and still came back to using mongodb for search.
I read it. It's the bug report for this issue, and it's the discussion thread for this community project. There's no them vs you. There isn't even a final decision in that thread. MongoDB is being used now because that's what's implemented.
 And between elasticsearch and MeiliSearch, MeiliSearch is 
 simpler, lightweight and easy to use.
Have you considered anything other than ElasticSearch, MeiliSearch and custom hacks? There are at least four other options mentioned in the thread I linked to. Maybe add MeiliSearch and your reasons for using it.
Dec 29 2020
parent reply Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Tuesday, 29 December 2020 at 22:27:17 UTC, sarn wrote:
 Have you considered anything other than ElasticSearch, 
 MeiliSearch and custom hacks?  There are at least four other 
 options mentioned in the thread I linked to.  Maybe add 
 MeiliSearch and your reasons for using it.
The easiest option is to see if an indexing service (like the mentioned Algolia) is willing to sponsor Dub as an open source project, then they get some free advertising in return.
Dec 29 2020
parent reply James Blachly <james.blachly gmail.com> writes:
On 12/29/20 5:34 PM, Ola Fosheim Grøstad wrote:
 The easiest option is to see if an indexing service (like the mentioned 
 Algolia) is willing to sponsor Dub as an open source project, then they 
 get some free advertising in return.
 
This is an excellent suggestion!
Dec 30 2020
parent reply sarn <sarn theartofmachinery.com> writes:
On Thursday, 31 December 2020 at 02:39:44 UTC, James Blachly 
wrote:
 On 12/29/20 5:34 PM, Ola Fosheim Grøstad wrote:
 The easiest option is to see if an indexing service (like the 
 mentioned Algolia) is willing to sponsor Dub as an open source 
 project, then they get some free advertising in return.
 
This is an excellent suggestion!
Here's one problem: new contributors won't have access to the third-party service and will need to set up their own indexes from scratch, but the scripts or whatever to do that probably won't be maintained because existing contributors won't need them. It's healthier for the project if anyone can just download the codebase and hack on it without signing up for some third-party service.

By the way, I think there's a good project in here and I'm willing to contribute my 2c and maybe more, but I know I'll lose any discussion in this thread. I'm following this one:
https://github.com/dlang/dub-registry/issues/93
Jan 01 2021
parent Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Friday, 1 January 2021 at 11:02:49 UTC, sarn wrote:
 Here's one problem: new contributors won't have access to the 
 third-party service and will need to set up their own indexes 
 from scratch, but the scripts or whatever to do that probably
Two standard solutions:

1. Make a tiny webservice for it, and have one set up for production and another one for testing.
2. Make a tiny local in-memory emulator for it (no need for advanced matching or ranking).

(I didn't quite get the argument about "losing".)
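
As a rough idea of how small the second option can be, here is a sketch of a local stand-in. The class and method names are hypothetical and do not mirror any real service's API. It does plain case-insensitive substring matching with no ranking.

```python
class InMemorySearch:
    """Hypothetical stand-in for a hosted search service, for local development."""

    def __init__(self):
        self._docs = {}  # package name -> description

    def index(self, name, description):
        """Add or replace one package."""
        self._docs[name] = description

    def search(self, query):
        """Return names of packages whose name or description contains every query word."""
        words = query.lower().split()
        hits = []
        for name, description in self._docs.items():
            text = (name + " " + description).lower()
            if all(word in text for word in words):
                hits.append(name)
        return sorted(hits)
```

A contributor could point the registry at this in development and at the real service in production, as long as both sit behind the same small interface.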
Jan 01 2021
prev sibling parent reply bachmeier <no spam.net> writes:
On Monday, 28 December 2020 at 20:49:12 UTC, sarn wrote:
 On Monday, 28 December 2020 at 10:49:58 UTC, aberba wrote:
 The best place for discussion is here, though:
 https://github.com/dlang/dub-registry/issues/93
In that thread you wrote
 The FTS features of DBs like Sqlite and Postgres are really 
 nice if you're already using those DBs (otherwise other tools 
 are more powerful). Moving all data to Sqlite or PG is 
 obviously a whole bigger decision.
sqlite was the first thing I thought about when I saw this thread. How much data would have to be copied into an sqlite database for searching of packages? That has the advantage of more or less no dependencies, dead simple to add, claimed good results...
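
A minimal sketch of what that could look like with SQLite's FTS5 extension, using Python's standard `sqlite3` module. It assumes the linked SQLite library was built with FTS5, and the table layout and sample rows are made up for illustration.

```python
import sqlite3

# Assumption: the SQLite library Python links against was built with FTS5.
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE packages USING fts5(name, description)")
db.executemany(
    "INSERT INTO packages (name, description) VALUES (?, ?)",
    [
        # Hypothetical sample rows, not real registry data.
        ("fastjson", "Fast JSON parsing and serialization"),
        ("tinyweb", "Small asynchronous web framework"),
        ("ndarrays", "Multidimensional arrays and numeric algorithms"),
    ],
)


def search(query):
    """Return package names matching the FTS5 query, best match first."""
    rows = db.execute(
        "SELECT name FROM packages WHERE packages MATCH ? ORDER BY rank",
        (query,),
    )
    return [name for (name,) in rows]
```

Only the name and description of each package would need to be copied in. Raw user input would probably need quoting or sanitizing before being passed to `MATCH`, since FTS5 has its own query syntax.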
Dec 29 2020
parent reply Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Tuesday, 29 December 2020 at 22:53:55 UTC, bachmeier wrote:
 sqlite was the first thing I thought about when I saw this 
 thread. How much data would have to be copied into an sqlite 
 database for searching of packages? That has the advantage of 
 more or less no dependencies, dead simple to add, claimed good 
 results...
I'm not being dismissive (I also don't use Dub), but in general this would not scale very well. Unless you want to do all searches locally. Also, a high quality search engine requires custom ranking, so not really sure if it is overall less work than rolling your own if you want high quality search results. The text corpus is tiny, so there is really no point in using a generic on-disk solution?
Dec 29 2020
parent reply bachmeier <no spam.net> writes:
On Tuesday, 29 December 2020 at 23:17:33 UTC, Ola Fosheim Grøstad 
wrote:
 On Tuesday, 29 December 2020 at 22:53:55 UTC, bachmeier wrote:
 sqlite was the first thing I thought about when I saw this 
 thread. How much data would have to be copied into an sqlite 
 database for searching of packages? That has the advantage of 
 more or less no dependencies, dead simple to add, claimed good 
 results...
I'm not being dismissive (I also don't use Dub), but in general this would not scale very well. Unless you want to do all searches locally. Also, a high quality search engine requires custom ranking, so not really sure if it is overall less work than rolling your own if you want high quality search results. The text corpus is tiny, so there is really no point in using a generic on-disk solution?
Well, except that sqlite works now and has been extensively tested. I don't want to discourage anyone from rolling their own, but knowing how long things take around here, and using actuarial tables to compute my life expectancy, it's not obvious that it would impact me. That's also why adding another dependency concerns me.
Dec 29 2020
parent reply Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Tuesday, 29 December 2020 at 23:32:28 UTC, bachmeier wrote:
 On Tuesday, 29 December 2020 at 23:17:33 UTC, Ola Fosheim 
 Grøstad wrote:
 Well, except that sqlite works now and has been extensively 
 tested. I don't want to discourage anyone from rolling their 
 own, but knowing how long things take around here, and using 
 actuarial tables to compute my life expectancy, it's not 
 obvious that it would impact me. That's also why adding another 
 dependency concerns me.
I actually implemented Damerau–Levenshtein in Python the other day in order to validate an exam question... It takes <15 minutes from scratch. A faster version on a trie can be done in an evening, debugged and tested. A full system in a weekend.

But the advantage of using an existing online service is that you get automatic scaling and better uptime: write once, run forever... I think such a service should be grateful if Dub provided them with:

1. cheap advertising
2. a maintained API to their service

Maybe they will even pay for the work, who knows?
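
For readers who have not seen it, the restricted form of Damerau–Levenshtein (often called optimal string alignment) fits in a few lines. This is a generic textbook sketch, not the code referred to above, and the function name is made up.

```python
def osa_distance(a, b):
    """Optimal string alignment distance: Levenshtein plus adjacent transpositions."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Swap of two adjacent characters counts as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

A typo such as "vibde" for "vibed" is then one edit away instead of two. The unrestricted Damerau–Levenshtein distance needs a little more bookkeeping.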
Dec 29 2020
next sibling parent Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Tuesday, 29 December 2020 at 23:40:44 UTC, Ola Fosheim Grøstad 
wrote:
 On Tuesday, 29 December 2020 at 23:32:28 UTC, bachmeier wrote:
 On Tuesday, 29 December 2020 at 23:17:33 UTC, Ola Fosheim 
 Grøstad wrote:
 Well, except that sqlite works now and has been extensively 
 tested. I don't want to discourage anyone from rolling their 
 own, but knowing how long things take around here, and using 
 actuarial tables to compute my life expectancy, it's not 
 obvious that it would impact me. That's also why adding 
 another dependency concerns me.
I actually implemented Damerau–Levenshtein in Python the other day in order to validate an exam question... It takes <15 minutes from scratch. A faster version on a trie can be done in an evening, debugged and tested. A full system in a weekend.
Also, keep in mind that the fuzzy search does not have to be crazy fast when people often search for the same stuff. Just log all search phrases and preload the caches with the most common ones. With some luck maybe 90% of all searches hit caches?
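
A sketch of that idea, assuming a slow matcher sits behind a cache. All names and the package list are placeholders, and the "slow" search here is only a substring scan standing in for a real fuzzy matcher.

```python
from functools import lru_cache

PACKAGES = ["fastjson", "tinyweb", "ndarrays"]  # hypothetical package names


def slow_fuzzy_search(phrase):
    """Placeholder for the real (slow) fuzzy matcher: a plain substring scan."""
    return tuple(name for name in PACKAGES if phrase in name)


@lru_cache(maxsize=1024)
def cached_search(phrase):
    """Serve repeated phrases from memory instead of searching again."""
    return slow_fuzzy_search(phrase)


def preload(common_phrases):
    """Warm the cache with the most frequent phrases from the search log."""
    for phrase in common_phrases:
        cached_search(phrase)
```

The cache would have to be cleared (`cached_search.cache_clear()`) whenever the package list changes.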
Dec 29 2020
prev sibling parent Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Tuesday, 29 December 2020 at 23:40:44 UTC, Ola Fosheim Grøstad 
wrote:
 But, the advantage in using an existing online service is that 
 you get automatic scaling and better uptime: write once, run 
 forever... I think such a service should be grateful if Dub 
 provided them with:
Here is the SLA for Algolia. They offer 99.99% and 99.999%, which translates to roughly 53 minutes and 5 minutes of downtime per year. It would be difficult (highly improbable) to compete with that with a self-hosted solution.
https://www.algolia.com/blog/for-slas-theres-no-such-thing-as-100-uptime-only-100-transparency/
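
The arithmetic behind those figures, as a quick check (a 365-day year is assumed, and the function name is made up):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes(availability_percent):
    """Maximum downtime per year permitted by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)
```

`downtime_minutes(99.99)` comes to about 52.6 minutes and `downtime_minutes(99.999)` to about 5.3 minutes.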
Dec 29 2020
prev sibling parent reply Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Monday, 28 December 2020 at 10:49:58 UTC, aberba wrote:
 I can make a PR for the MeiliSearch integration but I need to 
 know foundation is willing to host a MeiliSearch instance for 
 that.
It is written in Rust... But seriously, in-memory-search is easy to implement, so it would look better if it is done in D. An alternative is to use an existing online indexing service, probably cheaper and more scalable than setting up a dedicated service yourself.
Dec 29 2020
next sibling parent Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Tuesday, 29 December 2020 at 11:47:11 UTC, Ola Fosheim Grøstad 
wrote:
 But seriously, in-memory-search is easy to implement, so it 
 would look better if it is done in D.
You could just use a trie for the tokens and implement Levenshtein-Damerau fuzzy matching on that. That is a fun exercise to do. The next fun exercise is to abstract it in a way that fits into Phobos! (Fun fact: I've just read a bunch of suggestions for how to do this as I am spending my holiday grading exams in text search... :-P Ok, not so fun...)
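
A rough sketch of the trie approach, in Python rather than D. To keep it short it uses plain Levenshtein distance, without the Damerau transposition step, and all names are made up. Each trie edge adds one row to the edit-distance table, and whole subtrees are skipped once no cell in the row is within the limit.

```python
class Trie:
    def __init__(self):
        self.children = {}
        self.word = None  # set on nodes that end a stored token

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.word = word


def fuzzy_search(trie, query, max_dist):
    """Return sorted (token, distance) pairs within max_dist edits of query."""
    results = []
    first_row = list(range(len(query) + 1))
    for ch, child in trie.children.items():
        _walk(child, ch, query, first_row, max_dist, results)
    return sorted(results)


def _walk(node, ch, query, prev_row, max_dist, results):
    # Extend the edit-distance table by one row for the trie edge `ch`.
    row = [prev_row[0] + 1]
    for j in range(1, len(query) + 1):
        cost = 0 if query[j - 1] == ch else 1
        row.append(min(row[j - 1] + 1, prev_row[j] + 1, prev_row[j - 1] + cost))
    if node.word is not None and row[-1] <= max_dist:
        results.append((node.word, row[-1]))
    # Prune: if every cell exceeds max_dist, no descendant can match.
    if min(row) <= max_dist:
        for next_ch, child in node.children.items():
            _walk(child, next_ch, query, row, max_dist, results)
```

Because transpositions are not handled, a swapped pair such as "jsno" for "json" counts as two edits here. Adding the transposition case means also carrying the row before the previous one down the recursion.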
Dec 29 2020
prev sibling next sibling parent aberba <karabutaworld gmail.com> writes:
On Tuesday, 29 December 2020 at 11:47:11 UTC, Ola Fosheim Grøstad 
wrote:
 On Monday, 28 December 2020 at 10:49:58 UTC, aberba wrote:
 I can make a PR for the MeiliSearch integration but I need to 
 know foundation is willing to host a MeiliSearch instance for 
 that.
It is written in Rust... But seriously, in-memory-search is easy to implement, so it would look better if it is done in D. An alternative is to use an existing online indexing service, probably cheaper and more scalable than setting up a dedicated service yourself.
Read the previous GitHub discussion. They've gone down that route. Any PaaS costs more than IaaS. If cost isn't an issue then we can go with that too. But since the registry is already hosted, it's quite straightforward to do ./meilisearch --master-key PRIVATE_KEY and be done with it.
Dec 29 2020
prev sibling parent reply aberba <karabutaworld gmail.com> writes:
On Tuesday, 29 December 2020 at 11:47:11 UTC, Ola Fosheim Grøstad 
wrote:

 It is written in Rust...
If anyone has one written in D too, we can use that as well. I just want to have the embarrassing search fixed.
Dec 29 2020
parent Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Tuesday, 29 December 2020 at 12:04:28 UTC, aberba wrote:
 On Tuesday, 29 December 2020 at 11:47:11 UTC, Ola Fosheim 
 Grøstad wrote:

 It is written in Rust...
If anyone has one written in D too, we can use that as well. I just want to have the embarrassing search fixed.
Alright, if someone wants to start on it, then I'm willing to help out with suggestions and code reviews. This is a decent starting point:
https://nlp.stanford.edu/IR-book/information-retrieval-book.html

And also Wikipedia:
https://en.wikipedia.org/wiki/Inverted_index
https://en.wikipedia.org/wiki/Approximate_string_matching
https://en.wikipedia.org/wiki/Suffix_array
https://en.wikipedia.org/wiki/Trie
https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm
Dec 29 2020