Mesopixel

Late May
Needle in a haystack

I've always wanted to play around with full-text search using Lucene or Sphinx, but never really had any reason to do so. In addition, most of those packages require a persistent search server process to index and query from, which is extra overhead on the lean servers that I use. But since I had a bit of time these past couple of days, I've been playing around with a completely file-based full-text search library for Python called Whoosh, with my blog posts as the search corpus.

Out of the box, Whoosh has a sane setup and supports complex schemas with a variety of indexable, stored, and weighted fields. Adding documents to the search index is fast, and querying is straightforward across multiple fields. The library also has a few interesting plugins to do things like stemming (to allow searching variants of a base word), and query correction (to propose correct spellings from either a dictionary or the content in a particular field). Whoosh is one of those Python modules that Just Works™, and it is awesome.

To see it in action, try searching using the input box below. It should show snippets from each article that matches the terms, and also suggest corrected queries if no results are found (e.g. try searching for a typo of California).

Under the hood, Whoosh supports a variety of scoring functions including BM25F and basic TF-IDF. TF-IDF is actually a pretty simple and intuitive function with two parts: the term frequency (how often a search query term appears in a given document) and the document frequency (how many documents the term appears in). The rarer a term is across documents, the higher the documents containing it should score (hence the inverse document frequency), and likewise, the more times the term appears in a document, the higher that document's score.
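To make that concrete, here is a small sketch of a textbook TF-IDF score in plain Python. It uses raw term counts and a plain logarithmic IDF, which is an assumption on my part; Whoosh's own TF-IDF implementation may weight things differently. The function name and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, documents):
    """Score each tokenized document against the query with basic TF-IDF."""
    n_docs = len(documents)

    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))

    scores = []
    for doc in documents:
        term_counts = Counter(doc)
        score = 0.0
        for term in query_terms:
            if doc_freq[term] == 0:
                continue  # the term appears nowhere in the corpus
            tf = term_counts[term]                     # term frequency
            idf = math.log(n_docs / doc_freq[term])    # inverse document frequency
            score += tf * idf
        scores.append(score)
    return scores


docs = [
    "late may waterfall hike".split(),
    "needle in a haystack search".split(),
    "search search index".split(),
]
scores = tf_idf_scores(["search"], docs)
```

The first document never mentions "search" and scores zero, while the third mentions it twice and outscores the second. A term that appears in every document gets an IDF of zero in this form, so it contributes nothing to any score.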

BM25 builds on TF-IDF, but the term frequency is effectively weighted less at high frequencies (it saturates, approaching a limit instead of growing linearly), and that term frequency component also takes the length of each document into account relative to the average document length (as longer documents will generally have more terms). As a result, the same number of matching terms in a shorter document will score higher than in a longer one. BM25F extends BM25 to support scoring of terms across multiple weighted fields (e.g. title, body, etc.). Despite being decades old, BM25 seems to perform quite well!
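Both effects, saturation and length normalization, are easy to see in a small sketch of the commonly cited single-field BM25 formula. The parameter values `k1=1.2` and `b=0.75` are the usual textbook defaults and are assumed here, as are the function name and the toy documents; this is not Whoosh's BM25F code.

```python
import math
from collections import Counter


def bm25_scores(query_terms, documents, k1=1.2, b=0.75):
    """Score each tokenized document against the query with single-field BM25."""
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs

    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))

    scores = []
    for doc in documents:
        counts = Counter(doc)
        # Length normalization: longer-than-average documents get a larger
        # denominator, so each occurrence of a term counts for less.
        norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query_terms:
            tf = counts[term]
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Saturation: this fraction approaches (k1 + 1) as tf grows.
            score += idf * tf * (k1 + 1) / (tf + norm)
        scores.append(score)
    return scores


short_doc = ["search", "index"]
long_doc = ["search"] + ["filler"] * 9
repetitive_doc = ["search"] * 10
scores = bm25_scores(["search"], [short_doc, long_doc, repetitive_doc])
```

The short document beats the long one even though both contain "search" exactly once, and the document that repeats "search" ten times scores only a little over twice as high as the long one rather than ten times as high.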

One funny issue that I ran into during testing was that I could search for every month by name except for the month of "May". Stumped, I thought there was something wrong with Whoosh, until I read the docs a little closer. When using the StemmingAnalyzer, the default set of stop words (common words filtered out due to their prevalence) included "may" in the list. Removing it from that list did the trick, and all the months were fair game again. :)
