
A better search for Drupal


Drupal comes with a built-in search module that provides some pretty basic search options. Like everything else we do, we asked ourselves, "How can this be better?" One of our recent projects gave us the opportunity to evaluate some new options for search.

Note: This evaluation was driven by a specific set of needs and wants; my notes on them are below.

Wish list, in order of importance

  1. Facets
  2. Grouping of results by content type
  3. Autocomplete or recommendations
  4. Spellcheck 
  5. Biasing or the ability to order the results outside the natural result set
  6. Search analytics
  7. How easily can I wrap up all the configuration into an installable module or profile?

What is out there? What's Hot?

The first thing to decide on is which engine is going to power the new search. It turns out there are a few options. Through my research I found that there are some big differences and specific use cases for each of the available search options. I found this resource, which does a good job of comparing and illustrating the search options available for Drupal. As a developer, it is important to look at the industry and see what is happening in the space you are evaluating; look beyond the download counter on the module page and beyond which project has the most active commits. If my selection were based purely on those two criteria, Search API DB and Solr would be the clear winners. Those two, although good options for some, might not meet the needs of your project. What I found was that one particular engine was stirring up some waves.

Contrib search options for Drupal:

  • Acquia Solr
  • Other Solr
  • Search API with search db 
  • Sphinx
  • Elasticsearch
  • Fuzzy Search 
  • Xapian
  • Sarnia
  • Google Custom Search
  • Custom written extensions of Drupal core search
  • Fake it with views and exposed filters

What are these things?

Search API is a collection of modules that allow site builders to build out advanced full text search solutions for Drupal. Right from the project page itself:

"This module provides a framework for easily creating searches on any entity known to Drupal, using any kind of search engine. For site administrators, it is a great alternative to other search solutions, since it already incorporates faceting support and the ability to use the Views module for displaying search results, filters, etc. Also, with the Apache Solr integration, a high-performance search engine is available for this module.

Developers, on the other hand, will be impressed by the large flexibility and numerous ways of extension the module provides. Hence, the growing number of additional contrib modules, providing additional functionality or helping users customize some aspects of the search process." (Drupal.org project page)
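
The framework also exposes a programmatic query API for developers. Here is a minimal sketch of what a programmatic search might look like on a Drupal 7 site, assuming an index with the machine name 'default_node_index' and a 'type' field (both are placeholders for whatever your own configuration uses):

  // Rough sketch: querying a Search API index programmatically in Drupal 7.
  // The index machine name and the 'type' field are placeholders.
  $query = search_api_query('default_node_index');
  $query->keys('annual report');          // Full-text keywords.
  $query->condition('type', 'article');   // Limit to one content type.
  $query->range(0, 10);                   // First ten results.
  $results = $query->execute();

  // 'result count' holds the total number of matches; 'results' is keyed
  // by item ID, so each hit can be loaded with node_load() if needed.
  foreach ($results['results'] as $item_id => $item) {
    $node = node_load($item_id);
  }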

Solr is a super fast, open source, standalone search engine built on top of Lucene in Java. Solr needs to be installed and run as its own service. For a full list of features and information, check out Solr's features page. It is one of the most widely used and actively developed search engines in the Drupal space. Drupal.org uses it to power its search as well as its project browsing pages. The Drupal community has been very active around this search solution and has provided several implementation methods as well as a number of contributed modules to make it better. Acquia has provided support and development for a number of modules that use Solr, and also offers a hosted Solr solution. There are a number of hosted solutions out there if you cannot install, or do not want to support, your own Solr instance. For example, if you are on a shared hosting platform you will probably want to go with a hosted solution.
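
To give a feel for what sits behind the Drupal modules, here is an illustrative, hedged example of hitting Solr's select handler directly with Drupal 7's drupal_http_request(); the core name 'collection1' and the 'title' field are assumptions, and in practice the Search API Solr module builds these requests for you:

  // Illustrative only: a raw query against a local Solr core.
  $url = 'http://localhost:8983/solr/collection1/select?' . drupal_http_build_query(array(
    'q' => 'title:report',   // Query a single field for a keyword.
    'rows' => 10,            // Limit to ten results.
    'wt' => 'json',          // Ask Solr for a JSON response.
  ));
  $response = drupal_http_request($url);
  $data = json_decode($response->data);
  // $data->response->numFound is the match count,
  // $data->response->docs are the matching documents.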

Search API Database Search is a PHP/database solution. "It is therefore a cheap and simple alternative to backends like Solr, but can also be a great option for larger sites if you know what you're doing." (Drupal.org project page) This module is a backend for the Search API module. It provides a much stronger search than the out-of-the-box Drupal core search and can be used on any Drupal website and hosting environment. Because of its underlying technology, however, it can perform poorly on large, highly trafficked websites with lots of content, where it could potentially eat up your server's resources and slow down the site.

Sphinx is for enterprise or massive scale websites. This search engine powers Craigslist, which claims to serve over 300 million search queries per day.

"Sphinx is an open source full text search server, designed from the ground up with performance, relevance (aka search quality), and integration simplicity in mind. It's written in C++ and works on Linux (RedHat, Ubuntu, etc), Windows, MacOS, Solaris, FreeBSD, and a few other systems." (sphinxsearch.com)

Elasticsearch looks to be the new hotness. This is the product that is making waves in the search community, so big that a Solr hosting service has taken the time to address it. It is fast, feature rich, has a sexy UI, and comes with a number of extra tools that provide valuable information and functionality. Like Solr, Elasticsearch is a standalone search engine that needs to be installed on its own and connected to Drupal through contrib modules. There are also a number of hosted solutions available. This search engine was built from the ground up with the cloud in mind.

"Elasticsearch is a flexible and powerful open source, distributed, real-time search and analytics engine. Architected from the ground up for use in distributed environments where reliability and scalability are must haves. Elasticsearch gives you the ability to move easily beyond simple full-text search." (elasticsearch.org)

The extra tools that come with Elasticsearch are:

  1. Logstash, a time-based event data logger.
  2. Kibana, a data visualization tool that gives you dashboards of information in real time about the data being indexed.
  3. Marvel, a deployment and cluster management tool that provides historical and real time information about your Elasticsearch servers.
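
For comparison with the Solr sketch above, here is a similarly hedged example of a raw Elasticsearch 'match' query sent with drupal_http_request(); the index name 'content' and the 'title' field are assumptions, and in practice the Elasticsearch Connector or Search API Elasticsearch contrib modules handle this for you:

  // Illustrative only: a raw match query against a local Elasticsearch index.
  $body = json_encode(array(
    'query' => array(
      'match' => array('title' => 'annual report'),
    ),
  ));
  $response = drupal_http_request('http://localhost:9200/content/_search', array(
    'method' => 'POST',
    'data' => $body,
    'headers' => array('Content-Type' => 'application/json'),
  ));
  $data = json_decode($response->data);
  // $data->hits->total is the match count,
  // $data->hits->hits are the matching documents.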

Fuzzy Search is similar to Search API Database Search in that it is also a PHP/database solution. It can be installed on any Drupal website and integrates with Search API.

Fuzzy matching is implemented using ngrams. Each word in a node is split into three-letter (the default length) chunks, so 'apple' gets indexed as three smaller strings: 'app', 'ppl', 'ple'. The effect of this is that as long as your search matches a certain percentage of the word (configurable in the admin settings), the node will appear in the results.
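
To make the ngram idea concrete, here is a minimal sketch of that splitting in plain PHP (the function name is just for illustration; the module does its own indexing internally):

  // Split a word into overlapping letter groups (ngrams) of a given length.
  // With the default length of 3, 'apple' becomes 'app', 'ppl', 'ple'.
  function example_ngrams($word, $length = 3) {
    $ngrams = array();
    for ($i = 0; $i <= strlen($word) - $length; $i++) {
      $ngrams[] = substr($word, $i, $length);
    }
    return $ngrams;
  }

  // example_ngrams('apple') returns array('app', 'ppl', 'ple').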

Although it is a likely candidate to rival Search API Database Search, this project looks to be stale. There is no stable release for Drupal 7, the last commit was in 2013, and the module status is 'seeking new maintainer'. It does have a decent install base, with over 2,100 websites, but the lack of development is discouraging.

Xapian "is an Open Source Search Engine Library, released under the GPL. It's written in C++, with bindings to allow use from Perl, Python, PHP, Java, Tcl, C#, Ruby, Lua, Erlang and Node.js." (http://xapian.org/) Xapian's strength looks to be in document indexing, specifically large documents. Full disclosure: I did not get around to testing it, but here is a link to a video where Simon Lindsay talks about the project. You can see his part at the 11:50 mark.

Xapian is a highly adaptable toolkit which allows developers to easily add advanced indexing and search facilities to their own applications. It supports the Probabilistic Information Retrieval model and also supports a rich set of boolean query operators.

Sarnia "allows a Drupal site to interact with and display data from Solr cores with arbitrary schemas, mainly by building views. This is useful for Solr cores that index large, external (ie, non-Drupal) datasets that either aren't practical to store in Drupal or that are already indexed in Solr." (Drupal.org project page)

This looks really cool but was outside the scope of my testing. I would love to hear more about it.

Google Custom Search is different from all of the above options in that it is an embedded search engine you get from Google. It uses a crawler and your sitemap.xml data to index your website and provide Google-like searching of it. The downside is that it does not give me the control I want over what gets indexed and how results are displayed. It is, however, a great option for a quick and easy search solution.

Custom coding is always an option if you have the expertise and time available. However, Drupal is open source software with many viable search options, and it would be silly not to use or build upon a project that already exists.

Faking it with Views exposed filters is a fast, cheap, and "not a real search but sometimes good enough" (Shea McKinney) solution. If you are looking for exact keyword matching or simple filtering, this may be a less resource-intensive option. Views exposed filters should not, however, be viewed as a complete search option.

Quick elimination

Now that I know what the playing field is, it is time to make the first round of cuts. Here are some quick notes on why I chose to remove a few options from the list.

  • There are many contrib modules that provide the added functionality we want and it would be far more effort to write our own. Going fully custom won't be needed here.
  • Views with exposed filters would allow a lot of control over display of results but are field based and cause problems quickly when there are multiple content types in play.
  • Google custom search or other 3rd party crawlers do not provide enough control over display, don't support facets, and can only index publicly available content.
  • Xapian looks promising but is not as feature rich as other options and requires PHP libraries to be installed on the server.
  • Sarnia looks interesting and is built on Search API and Solr but is best used for large amounts of external Solr data and is probably more than we need.
  • Sphinx is very fast because it uses real-time indexes. Its best use case would be a site that has hundreds to thousands of new entities created an hour and needs content to be instantly searchable. Our typical use case would not have this volume of new content on a regular basis.
  • Local Solr setup is not an option as we do not have the resources to set up and maintain a Solr search server in our environment.
  • The Fuzzy Search project is looking for a new maintainer, and although this could be a good opportunity to pick up and help a contrib module, there are other more interesting projects out there.

A closer look

After reviewing all of the options above, it looks like there are a few really good choices. It was time to put them through their paces. To test, I decided to install our base distribution, which comes with a number of contributed modules and a few content types, and configure from scratch: one index for all content types, inclusion of six different field types, author and taxonomy relationships, search field biasing, facets, and autocomplete. From there I generated roughly 200 taxonomy terms and attached them to roughly 7,000 nodes of varying content types. I ran indexing immediately and selected a few nodes as my target searches. I then searched for those nodes on multiple fields using multiple keywords and compared the results against my own judgment of relevancy.

Below is a feature and cost breakdown for each of the remaining options.

Search API
Cost: FREE, as in beer.
Features:
  • Autocomplete (contrib)
  • Search live results (contrib)
  • Saved searches (contrib)
  • Range searches (contrib)
  • Search sorting (contrib)
  • Location searching (contrib)
  • Search Pages or Search Views (contrib)
  • Search statistics (contrib)
  • Multiple search indexes
  • Integrates with Views
  • Index entities immediately or on cron
  • Multiple index searching
Cons:
  • Add-on modules to the Search API module have the feel of being buggy. This is a shame since the Search API module itself is exceptionally well maintained.
  • The spellcheck project is 4 years old with no commits.

Search API + Database Search
Cost: FREE, as in beer.
Features:
  • All of the Search API features and...
  • Install anywhere you have Drupal installed
  • Result biasing
  • Search facets (contrib)
  • Portable / migratable
Cons:
  • Less accurate and powerful than Solr or Elasticsearch
  • Needs Apache Tika installed to index files

Search API + Externally hosted Solr
Cost: Cheapest: $10.00/month. Most expensive: thousands of dollars a month.
Features:
  • All of the Search API features and...
  • Fast searching
  • Result biasing
  • Search facets (contrib)
  • Portable / migratable
  • Can index files
Cons:
  • Can take some time to index many and/or large documents

Search API + Externally hosted Elasticsearch
Cost: Anywhere from $37.00/month to thousands of dollars a month.
Features:
  • All of the Search API features and...
  • Fast searching and indexing
  • Search facets (contrib)
  • Multiple transports (cURL, Guzzle, Thrift, Memcached)
  • Bonus software and monitoring tools
  • Can index files
Cons:
  • During testing I periodically dropped, created, and copied settings around to various test URLs and environments, and had some issues with the database index machine names.

Search API autocomplete vs Search API live results

Search API Autocomplete provides an autocomplete search field that, in a drop-down attached to the field, displays the keyword or matching keywords the user is typing plus the number of results each keyword would return. The live results module does roughly the same, except that it displays search results rather than keywords in the drop-down. Clicking or selecting a search result from the drop-down takes the user directly to that result's page.

Winner for use on a search page: Search API Autocomplete

Decision

The status of search in Drupal is good. There are a number of powerful and easy-to-implement options out there, and Search API is leading the way. It empowers a site to move past Drupal core search and use a full text search solution. For us, and for our clients' needs, we will be looking to build out a graduated approach. For the most common use case, a search implementation with Search API + Search API Database Search will be sufficient. For those sites that need something more robust, a migration path to Search API + Solr will be used.

Why did we choose Solr over Elasticsearch? 

Solr topped Elasticsearch by the slimmest of margins. Looking specifically at our needs, this breakdown discusses the points we valued as important.

Functionality

Our needs are simple. We want our search engine to provide accurate results, facets, excerpt snippets, and possibly the ability to index raw files. Both Solr and Elasticsearch performed these operations very well.

Performance

Testing on several remote services and standing up local instances of each search appliance, at our scale, with one index, the search performance of each was very similar. Where Elasticsearch out-performed Solr was indexing: there were indexing waits on the Solr hosts, while Elasticsearch was effectively instantaneous.

Ease of setup and use

As we decided to go with 3rd party hosting options, the setup and connection for each option was very similar. Both options have easy-to-follow configuration options in their respective modules.

3rd party options

There are several options for using 3rd party Solr hosts, including Acquia. Generally, Solr hosting has the cheaper options, but both scale up to thousands of dollars a month.

Project activity

Strictly looking at momentum in the Drupal community, Elasticsearch has come a long way recently. Solr stands out as the engine with the most development behind it and comes with a number of contributed modules and features. Solr looks to be the more mature project, but Elasticsearch is making great headway. With some great features being developed, keep your eyes on Elasticsearch and its progress.

Support

With a number of groups on campus already using Solr as their search engine of choice, it makes sense for us to use Solr as well. Not only will we be in line with the rest of the campus, but we will also have resources available should we need some extra support.
