Hacker News
The math and algorithms behind a search library (kabir.ml)
202 points by kbr on Nov 11, 2017 | 24 comments



This is an interesting way of doing text search over a collection of documents, which I suppose makes sense if you only have a small number of documents to search.

Generally, an inverted index [1] is used rather than a trie, and collection-based term weights are calculated using something like IDF [2]. You could then use something like cosine similarity to compute length-normalised rankings for documents.
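For illustration, here's a minimal sketch of that approach in TypeScript: an inverted index with TF-IDF weights and cosine-style scoring. The names (indexDocuments, search, etc.) are invented for the example, not taken from Wade, Lucene, or any other library.

    // Toy inverted index with TF-IDF weights and cosine scoring.
    // Illustrative only -- none of these names come from a real library.
    type Posting = { doc: number; tf: number };

    const index = new Map<string, Posting[]>(); // term -> postings list
    const norms: number[] = [];                 // document vector lengths
    let docCount = 0;

    const tokenize = (text: string): string[] =>
      text.toLowerCase().split(/\W+/).filter(Boolean);

    // Inverse document frequency: rarer terms get larger weights.
    const idf = (term: string): number =>
      Math.log(docCount / (index.get(term)?.length ?? 1));

    function indexDocuments(docs: string[]): void {
      docCount = docs.length;
      docs.forEach((doc, id) => {
        const counts = new Map<string, number>();
        for (const t of tokenize(doc)) counts.set(t, (counts.get(t) ?? 0) + 1);
        for (const [term, tf] of counts) {
          if (!index.has(term)) index.set(term, []);
          index.get(term)!.push({ doc: id, tf });
        }
      });
      // Precompute each document's TF-IDF vector length for normalisation.
      const sq = new Array(docCount).fill(0);
      for (const [term, postings] of index)
        for (const p of postings) sq[p.doc] += (p.tf * idf(term)) ** 2;
      sq.forEach((s: number, i: number) => (norms[i] = Math.sqrt(s)));
    }

    // Cosine score = dot(query, doc) / |doc|; the constant query norm
    // is dropped since it doesn't change the ranking.
    function search(query: string): [number, number][] {
      const scores = new Map<number, number>();
      for (const term of new Set(tokenize(query)))
        for (const p of index.get(term) ?? [])
          scores.set(p.doc, (scores.get(p.doc) ?? 0) + p.tf * idf(term) ** 2);
      return [...scores.entries()]
        .map(([d, s]): [number, number] => [d, s / norms[d]])
        .sort((a, b) => b[1] - a[1]);
    }

indexDocuments() is called once up front; search() then returns [docId, score] pairs sorted by relevance.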

There are of course many different approaches to text search, but this is the first time I've personally come across this one.

If anyone is interested in an in-depth look at how text search works, Introduction to Information Retrieval [3] is available for free and is a fantastic book.

[1] https://en.wikipedia.org/wiki/Inverted_index

[2] https://en.wikipedia.org/wiki/Tf%E2%80%93idf

[3] https://nlp.stanford.edu/IR-book/


This is how Wade was originally going to be implemented. I actually had a local version of it using an inverted index and a deterministic acyclic finite state automaton [1]. It used TF-IDF for the rankings as well.

I began to experiment with a couple of my own ideas and used a trie rather than a DAFSA because it could be stored and loaded rather than generated at runtime. I also created and used the various linear methods described in the post to rank the documents.

[1] https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_s...


Consider BM25 as a newer substitute for TF-IDF: https://en.wikipedia.org/wiki/Okapi_BM25
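For reference, the per-term piece of that formula looks roughly like this (a sketch following the Wikipedia page above; k1 and b are just the common defaults):

    // BM25 weight of one term in one document. Sketch only.
    function bm25Term(
      tf: number,        // occurrences of the term in this document
      docLen: number,    // length of this document in terms
      avgDocLen: number, // average document length in the collection
      df: number,        // number of documents containing the term
      n: number,         // total number of documents
      k1 = 1.2,
      b = 0.75,
    ): number {
      // The +1 inside the log is the common variant that avoids
      // negative weights for very frequent terms.
      const idf = Math.log((n - df + 0.5) / (df + 0.5) + 1);
      const norm = tf + k1 * (1 - b + b * (docLen / avgDocLen));
      return (idf * tf * (k1 + 1)) / norm;
    }

A document's score for a query is the sum of this weight over the query's terms.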


Finally a post on search algorithms on HN again!

Pardon my surprise, but I think it's kind of strange that we see so little of this topic on HN, given that it's one of the pillars of Computer Science, and also at the core of one of the most successful IT companies.


I think we see surprisingly little of anything search related. In terms of full-scale solutions, there is basically just Elasticsearch and Solr. When you look at the lower-level libraries available, there is very little outside the Lucene world.


You may be interested in Xapian (https://xapian.org/), which is a very solid alternative to Lucene.


There's also Sphinx search (written in C++ and not based on Lucene) and the new Manticore Search fork. (disclaimer: I was/am involved in both projects).


I believe that the search part of both ES and Solr is Apache Lucene.


Yes, it's good software, but pretty much a monoculture.


Yes, and?


Yes, and I'd also like to see more posts/projects on the combination of search and NLP (natural language processing).


Doesn't SQLite, of all things, include an (optional) native full-text search engine?


And now Vespa.


If you're a working programmer looking for a bit more depth on text searching, check out Taming Text by Ingersoll. It was written by some real pros, and it's very practical. It goes over everything from the first naive approaches to what drives many engines like Lucene today.

https://www.amazon.com/Taming-Text-Find-Organize-Manipulate/...


I'm a little surprised to not be seeing some classical search algorithms.

For online pattern searching, I'd want Boyer-Moore with the Galil rule for guaranteed linear search. [Due to an unfortunate accident in C++ history, Boost and C++17 skip the Galil rule, leading to O(nm) worst-case runtime instead of O(n+m).]
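For context, a minimal sketch of the bad-character idea (the Horspool simplification; it omits the good-suffix and Galil rules, which is exactly why its worst case stays O(nm)):

    // Boyer-Moore-Horspool: the bad-character shift only. Without the
    // good-suffix and Galil rules the worst case is O(nm), which is
    // the point being made above. Sketch, not production code.
    function horspoolSearch(text: string, pattern: string): number {
      const n = text.length, m = pattern.length;
      if (m === 0) return 0;
      if (m > n) return -1;

      // How far the window may shift, keyed by the text character
      // that lines up with the last pattern position.
      const shift = new Map<string, number>();
      for (let i = 0; i < m - 1; i++) shift.set(pattern[i], m - 1 - i);

      let pos = 0;
      while (pos <= n - m) {
        let j = m - 1;
        while (j >= 0 && text[pos + j] === pattern[j]) j--;
        if (j < 0) return pos; // full match at pos
        pos += shift.get(text[pos + m - 1]) ?? m;
      }
      return -1;
    }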

For offline search, a Burrows-Wheeler index gives query time linear in the pattern length and independent of the length of the searched document.


This isn't an exact pattern searching library. Wade originally used Boyer-Moore, but has since switched to a trie [1] structure to search for _terms_ rather than the whole query.

This allows for result documents that can contain some of the terms, all of the terms, or a prefix of a term. It's also pretty fast, as it doesn't need to run a linear search through the documents. I'm not sure what you mean by "classical" search algorithms, but a trie-based search is a pretty standard way of doing it.

[1] https://en.wikipedia.org/wiki/Trie
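To make that concrete, here's a hedged sketch of term-prefix lookup with a trie; the shape of the structure is made up for the example, not Wade's actual internals:

    // Minimal trie mapping indexed terms to the ids of documents that
    // contain them; prefix(p) returns every document containing a term
    // starting with p. Illustrative only -- not Wade's internals.
    type TrieNode = { children: Map<string, TrieNode>; docs: Set<number> };

    const newNode = (): TrieNode => ({ children: new Map(), docs: new Set() });
    const root = newNode();

    function insert(term: string, docId: number): void {
      let node = root;
      for (const ch of term) {
        if (!node.children.has(ch)) node.children.set(ch, newNode());
        node = node.children.get(ch)!;
      }
      node.docs.add(docId);
    }

    function prefix(p: string): Set<number> {
      let node: TrieNode | undefined = root;
      for (const ch of p) {
        node = node.children.get(ch);
        if (!node) return new Set(); // no term starts with p
      }
      // Collect documents from the whole subtree below the prefix.
      const out = new Set<number>();
      const stack = [node];
      while (stack.length) {
        const cur = stack.pop()!;
        cur.docs.forEach((d) => out.add(d));
        stack.push(...cur.children.values());
      }
      return out;
    }

Since a plain structure like this serialises easily, the whole index can be built ahead of time and shipped to the client, which is the store-and-load property mentioned upthread.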


I understand, thank you. (Though a lot of inexact pattern matching is done using exact pattern matching of substrings with the pigeonhole principle.)

Tries led to suffix trees led to suffix arrays led to Burrows-Wheeler transform led to FM index, which is probably the best way to do an offline search.

If you haven't looked at BWT indexing, it's worth at least checking out. It preserves the advantages you mention, with memory cost linear in the search space. It's a lot hairier to implement, unfortunately.
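For anyone curious, the transform itself is simple to state even if the FM index built on top of it is not. A naive quadratic sketch:

    // Naive Burrows-Wheeler transform: sort all rotations of the input
    // (with a sentinel) and take the last column. O(n^2 log n) -- real
    // implementations derive it from a suffix array instead.
    function bwt(s: string): string {
      const t = s + "\x00"; // sentinel smaller than any real character
      const rotations: string[] = [];
      for (let i = 0; i < t.length; i++) {
        rotations.push(t.slice(i) + t.slice(0, i));
      }
      rotations.sort();
      return rotations.map((r) => r[r.length - 1]).join("");
    }

For example, bwt("banana") comes out as "annb\0aa": the repeated characters cluster together, which is what makes the transformed string easy to compress and index.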



It's certainly worth taking a look at Fuse, another JavaScript library that does fuzzy search: http://fusejs.io/


This is neat, and reminds me a bit of the tricks used by RediSearch. I came across it on HN recently and have been geeking out on it since! http://redisearch.io


Lunr, whose name is a nod to Apache Solr, is also worth a look: https://lunrjs.com/



At the time of that post, this article covered a simpler example and only explained the trie structure; Wade didn't have the ranking system yet. It has since been rewritten to include the methods used to rank the relevance of a term to a document, and to show how they are applied in a more complex example.


There must've been some updates, otherwise



