Different ways to make auto suggestions with Solr
Nowadays almost every website has a full text search box with an auto-suggestion feature, which helps users find what they are looking for by typing as few characters as possible. The example below shows what this feature looks like in Google. It progressively suggests how to complete the current word and/or phrase, and corrects typos. It’s a meaningful example, since it combines multi-term suggestions based on the most popular queries with spelling correction.
Figure 1: The way Google makes auto complete suggestions: multi-term query suggestions and spelling correction
There are different ways to make auto complete suggestions with Solr. You can find many articles and examples on the internet, but making the right choice is not always easy. The goal of this post is to compare the available options so you can identify the solution best tailored to your needs, rather than to describe any one specific approach in depth.
It’s common practice to make auto-suggestions based on the indexed data. A user is usually looking for something that can be found within the index, which is why we’d like to show words that are similar to the current query and at the same time relevant within the index. On the other hand, it is also recommended to provide query suggestions: for example, we can capture all the user queries that return more than zero results and index them in a dedicated Solr core, then use that information to make auto-suggestions as well. What actually matters is that we are going to make suggestions based on what’s inside the index; for this purpose it doesn’t matter whether the index contains user queries or “normal data”, since the solutions we are going to consider can be applied in both cases.
Some questions before starting
In order to make the right choice you should first of all ask yourself some questions:
- Which Solr version are you working with? If you’re working with an old version (1.x for example), it is worth upgrading. If you can’t upgrade, you’ll probably have fewer options to choose from, unless you’re willing to manually apply some patches.
- Do you want to make single term or multiple term suggestions? You should basically decide if you want to suggest single words which can complete the word the user has partially written, or even complete sentences.
- Do you want to filter the suggestions based on the actual search? The user could have previously selected a facet entry, narrowing the results to a specific subset. Every search should match that context, so it is common practice to have the auto-suggestions reflect the user’s filters. Unfortunately, some of the available solutions don’t support filtering at all.
- How do you want to sort the auto-suggestions? It’s important to show the best suggestion first, and each solution you are going to explore offers different sorting options.
- Do you want to make auto-suggestions based on multivalued fields? Multivalued fields are commonly used for tags, for example, since every document can have more than one tag and you may want to suggest a tag while the user is typing it.
- Do you want to make auto-suggestions based on prefix queries or even infix queries? While it’s always possible to suggest words starting with a prefix, not all the solutions are able to suggest words that contain the actual query.
- What’s the impact of each solution in terms of performance and index size? The answer depends on the index you’re working with and needs to take into account that some solutions can increase the index size, while all of them will affect performance.
Faceting using the prefix parameter
The first option is available from Solr 1.2 and is based on a special facet that includes only the values starting with the prefix the user has partially typed, by means of the facet.prefix parameter. This solution works only for single term suggestions starting with a particular prefix (not infix), and you can sort results only alphabetically or by count. It works with multivalued fields too, and it is possible to apply any filter query so that the suggestions reflect the current context of the search.
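As a rough illustration, the request for this approach could be assembled as in the following Python sketch. The function name, the base URL, the field name and the filter are hypothetical and used only for the example; facet, facet.field, facet.prefix, facet.limit, facet.mincount, facet.sort and fq are the standard Solr parameters. Keep in mind that the prefix is compared against the indexed terms, so the sketch assumes the typed text has already been normalized the same way the field is (for example lowercased).

```python
from urllib.parse import urlencode


def facet_prefix_url(base_url, field, prefix, filters=(), limit=10):
    """Build a select URL whose facet counts are the suggestions for `prefix`."""
    params = [
        ("q", "*:*"),
        ("rows", 0),                # we only need facet counts, not documents
        ("wt", "json"),
        ("facet", "true"),
        ("facet.field", field),
        ("facet.prefix", prefix),   # matched against the indexed terms
        ("facet.limit", limit),
        ("facet.mincount", 1),      # skip values with no matching documents
        # "count" sorts by frequency, "index" alphabetically
        # (these values assume Solr 1.4 or later)
        ("facet.sort", "count"),
    ]
    # Filter queries keep the suggestions inside the current search context.
    params += [("fq", fq) for fq in filters]
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical core URL, field and filter, for illustration only.
    print(facet_prefix_url("http://localhost:8983/solr/select",
                           "tags", "so", filters=["category:books"]))
```

The suggestions are then read from the facet counts of the chosen field in the response, while the document list itself is empty because rows is set to 0.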
Use of NGrams as part of the analysis chain
The second solution is available from Solr 1.3 and relies on the use of NGramFilterFactory or EdgeNGramFilterFactory as part of the analysis chain. It means you’ll have a dedicated field that can be searched with word fragments, much as you would with wildcard queries. Every word in the index is split into several NGrams; you can reduce the number of NGrams (and the size of the index) by increasing the minGramSize parameter or by switching to EdgeNGramFilterFactory, which works in only one direction, by default from the beginning edge of an input token. With NGramFilterFactory you can use both infix and prefix queries, while EdgeNGramFilterFactory supports only prefix queries. This is a really flexible way to make auto-suggestions, since it relies on a dedicated field with its own configurable analysis chain. You can easily filter your results and have them sorted by relevance, also using boosting and the eDisMax query parser. Furthermore, this solution is faster than the previous one. On the other hand, if you want to make auto-suggestions based on a field which contains many terms, consider that the index size will increase considerably: using EdgeNGrams, each term is indexed as up to (term length - minGramSize + 1) terms, capped by maxGramSize. This option works with multivalued fields too, but the index size will obviously increase even more.
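To get a feeling for how many terms end up in the index, here is a small Python sketch that imitates what the two filters produce for a single token. It is not the Lucene implementation: the function names and default sizes are picked for illustration, and the order in which the real filters emit grams may differ.

```python
def edge_ngrams(token, min_gram=1, max_gram=15):
    """Prefixes of `token` with a length between min_gram and max_gram."""
    longest = min(len(token), max_gram)
    return [token[:size] for size in range(min_gram, longest + 1)]


def ngrams(token, min_gram=1, max_gram=2):
    """All substrings of `token` with a length between min_gram and max_gram."""
    grams = []
    longest = min(len(token), max_gram)
    for size in range(min_gram, longest + 1):
        for start in range(len(token) - size + 1):
            grams.append(token[start:start + size])
    return grams


if __name__ == "__main__":
    print(edge_ngrams("solr", min_gram=2))        # ['so', 'sol', 'solr']
    print(ngrams("solr", min_gram=2, max_gram=3)) # ['so', 'ol', 'lr', 'sol', 'olr']
```

The edge variant only ever yields prefixes, which is why it supports prefix suggestions only, while the full variant also yields inner fragments such as "ol" and therefore allows infix matches at the cost of more indexed terms.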
Use of the TermsComponent
One more solution, available from Solr 1.4, is based on the TermsComponent, which provides access to the indexed terms in a field and the number of documents that match each term. This option is even faster than the previous one. You can make prefix queries using the terms.prefix parameter, or infix queries using the terms.regex parameter, available starting from Solr 3.1. Only single term suggestions are possible, and unfortunately you can’t apply any filter. Furthermore, user queries are not analyzed in any way: you access the raw indexed terms directly, so you may run into problems with whitespace or letter case.
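Since nothing analyzes the typed text, the client has to normalize it before calling the TermsComponent. The Python sketch below shows one way to do that. The function name, the handler URL and the field name are hypothetical; terms, terms.fl, terms.prefix, terms.regex and terms.limit are the component’s parameters. Trimming and lowercasing is an assumption that only holds if the field is lowercased at index time, and re.escape is used as an approximation of escaping for the Java regular expressions Solr evaluates.

```python
import re
from urllib.parse import urlencode


def terms_url(base_url, field, typed, infix=False, limit=10):
    """Build a TermsComponent request for the text typed so far."""
    # The component does no analysis, so mimic the field's analysis here.
    # Assumption: indexed terms are lowercased and carry no whitespace.
    typed = typed.strip().lower()
    params = [
        ("terms", "true"),
        ("terms.fl", field),
        ("terms.limit", limit),
        ("wt", "json"),
    ]
    if infix:
        # terms.regex needs Solr 3.1 or later; match the text anywhere.
        params.append(("terms.regex", ".*" + re.escape(typed) + ".*"))
    else:
        params.append(("terms.prefix", typed))
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical handler URL and field name, for illustration only.
    print(terms_url("http://localhost:8983/solr/terms", "name", " Sol "))
    print(terms_url("http://localhost:8983/solr/terms", "name", "ol", infix=True))
```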
Use of the Suggester
Due to the limitations of the above solutions, Solr developers have worked on a new component created exactly for this task. This option is the most recent and the recommended one, available since Solr 3.1 and based on the SpellCheckComponent, the same component you can use for spelling correction. What’s new is a SolrSpellChecker implementation for making suggestions, called Suggester, which makes use of the Lucene suggest module. It all started with the SOLR-1316 issue, from which the Suggester was created. The collate functionality was then improved with SOLR-2010. Finally, the work was completed with LUCENE-3135, which backported the Lucene suggest module used by the Solr Suggester class to the 3.x branch. This solution has its own separate index, which can be rebuilt automatically on every commit. Using collation you can have multi-term suggestions. Furthermore, it is possible to use a custom dictionary instead of the index content, which makes this solution even more flexible.
The following table contains pros and cons for each solution mentioned above, from the slowest to the fastest one. Even though the last option is the most flexible, it also requires the most tuning. More power also means more responsibility, so if your requirements are just single term suggestions with filtering and you don’t have particular performance problems, the old-fashioned facet approach works perfectly out of the box.
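Because the Suggester sits behind the SpellCheckComponent, it is queried with the usual spellcheck parameters. The Python sketch below builds such a request; it assumes a request handler (here hypothetically mapped to /suggest) that has been wired in solrconfig.xml to a SpellCheckComponent configured with the Suggester. The function name and URL are made up for the example.

```python
from urllib.parse import urlencode


def suggest_url(base_url, typed, count=5, collate=True):
    """Build a Suggester request for the text typed so far."""
    params = [
        ("spellcheck", "true"),
        ("spellcheck.q", typed),
        ("spellcheck.count", count),             # suggestions per term
        ("spellcheck.onlyMorePopular", "true"),  # favor more frequent entries
        ("wt", "json"),
    ]
    if collate:
        # Collation combines the per-term suggestions into a whole phrase,
        # which is how multi-term suggestions are obtained.
        params.append(("spellcheck.collate", "true"))
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical handler URL, for illustration only.
    print(suggest_url("http://localhost:8983/solr/suggest", "sol ser"))
```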
Figure 2: Comparison table between the mentioned ways to make auto complete suggestions with Solr
This blog entry has hopefully shown you some of the ways to make auto-suggestions with Solr, together with their pros and cons. I hope it helps you make the right choice from the beginning, tailored to your requirements. Please share any considerations I may not have covered, as well as your own experiences; we’re curious to hear how you deal with the same problems in your search applications. Leave a comment or ask a question if anything is unclear!