Elasticsearch 101 - codecentric AG Blog


Introduction

Elasticsearch is a highly scalable search engine that stores data in a structure optimized for language-based searches, and it is a whole lot of fun to work with.  In this 101 I’ll give you a hands-on introduction to Elasticsearch and a glimpse at some of the key concepts.

  • open-source
  • distributed: clustering, replication, failover, and master election out of the box
  • schema-less: document-based, automated type mapping, JSON
  • RESTful API
  • highly configurable
  • sane defaults
  • data exploration
  • based on Lucene
  • runs on the JVM
  • fast

 

Running Elasticsearch

Elasticsearch is a standalone Java application, so getting up and running is a piece of cake.  Make sure you have Java ≥ 1.6.0 and that no one else is running Elasticsearch on your network.  You can either download a packaged distribution from elasticsearch.org and unpack it, or get the latest source from GitHub.

Start a node in the foreground with

 

$ bin/elasticsearch -f

 

You should see something like this

[Image: startup]

 

You can see in the output that Elasticsearch started, that it assigned the node a random name, and that the node started a cluster and elected itself as master.  The node is publishing to HTTP port 9200 (default).  We use this port to interact with the cluster.  Now you can

 

$ curl localhost:9200

 

and you get some data about your node

[Image: curl localhost]

 

Elasticsearch has a JSON-based REST API.  Administrative operations, indexing, and searching are all done with HTTP and JSON.  I use cURL to keep the examples concise.  You may find it more convenient to use a graphical HTTP client.  There are also a number of browser extensions and plugins.  If you use Google Chrome, I recommend the Sense plugin for Chrome, a JSON-aware developer console for Elasticsearch.

 

Let’s start up another node and give it the name NODE_2 as a parameter

 

$ bin/elasticsearch -f -Des.node.name=NODE_2

 

[Image: startup node 2]

 

You can see that NODE_2 started.  The node detected the other node as master and joined the cluster.  The new node publishes to HTTP port 9201.  You can talk to 9201 as well, as each node behaves the same.  Any node starting up will join this cluster if, and only if, it shares the cluster name.  In our case, as we haven’t defined a cluster name, the default name is used.
You now have an Elasticsearch cluster running with two nodes.

 

Install Head plugin

This is an optional step.  The Head plugin or elasticsearch-head is a web front end for browsing and interacting with an Elasticsearch cluster.  For more details on available plugins refer to the plugin guide.

To install it run

 

$ bin/plugin -install mobz/elasticsearch-head

 

Then open http://localhost:9200/_plugin/head/ in your browser

 

Indexing

A search engine is something like a database, with a difference in how data is stored.  The structure is loosely similar to that of a conventional relational database such as MySQL or Postgres.

Elasticsearch    –    Relational database

  • Index    –    Database
  • Type    –    Table
  • Document    –    Row
  • Field    –    Column

Elasticsearch creates an inverted index (see http://en.wikipedia.org/wiki/Inverted_index).

Everything is indexed in this data structure.  It allows you to quickly find all the documents that contain a particular word, much like the index at the back of a book.
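
To make the idea of an inverted index more concrete, here is a minimal sketch in Python.  It is a toy model and not how Lucene or Elasticsearch implement it: the documents are made up, and the only processing is lowercasing and splitting on whitespace.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


# Hypothetical documents, made up for this illustration.
docs = {
    "one": "An introduction to search engines",
    "two": "Search engines store an inverted index",
}
index = build_inverted_index(docs)

# Looking up a word is a single dictionary access instead of a scan of all texts.
print(sorted(index["search"]))    # ['one', 'two']
print(sorted(index["inverted"]))  # ['two']
```

The lookup goes from word to documents, which is exactly the direction a search needs.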

Let’s index something.  We send off an HTTP PUT with a URL made up of the index name, type name, and ID, and in the HTTP payload we supply a JSON document with the fields and values.  Notice that the field author contains a nested JSON object.

[Image: first data]

 

Indexing in Elasticsearch corresponds to create and update in CRUD.  If we index a document with an ID that already exists, the existing document is overwritten.  Index and type are required, while the ID is optional.  If we do not specify an ID, Elasticsearch will generate one.
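
If you prefer a script over cURL, the same kind of request can be put together with Python's standard library.  This is only a sketch: the index documents, the type blog, the ID one, and the field names author and word_count come from this post, while the helper name, the other field names, and all field values are assumptions made up for illustration.  The request is built but not sent, so the snippet runs without a node.

```python
import json
import urllib.request


def build_index_request(index, doc_type, doc_id, document,
                        host="http://localhost:9200"):
    """Build (but do not send) the HTTP PUT that indexes one document."""
    url = f"{host}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical document: "title", "language" and every value are assumptions.
document = {
    "title": "An example article",
    "language": "english",
    "word_count": 1200,
    "author": {"name": "Pip the Troll"},  # nested JSON object
}
request = build_index_request("documents", "blog", "one", document)
print(request.get_method(), request.full_url)

# To send it to a running node you would call:
# urllib.request.urlopen(request)
```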

 

And we get a response that verifies that the operation was successful

 

[Image: first data ack]

 

We get the name of the index, the type, and the ID.  We also get a version.  This is not a historical version: previous versions of a document are not kept.  The version is used for optimistic concurrency control and is incremented with every change.  The data we have supplied was all Elasticsearch needed.  Elasticsearch automatically created the index for us.
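
The idea behind version-based optimistic concurrency control can be sketched with a small in-memory store.  This is a toy model under my own assumptions (class and method names are made up), not Elasticsearch's implementation: a write that names the version it expects is rejected if the document has changed in the meantime.

```python
class VersionConflict(Exception):
    """Raised when the caller's expected version is stale."""


class VersionedStore:
    """Toy in-memory store that bumps a version number on every write."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, document)

    def index(self, doc_id, document, expected_version=None):
        current_version = self._docs.get(doc_id, (0, None))[0]
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"expected version {expected_version}, found {current_version}"
            )
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, document)  # old content is overwritten
        return new_version


store = VersionedStore()
v1 = store.index("one", {"word_count": 100})                       # first write
v2 = store.index("one", {"word_count": 120}, expected_version=v1)  # update
try:
    # A second writer still believes the document is at version v1.
    store.index("one", {"word_count": 90}, expected_version=v1)
    conflict = False
except VersionConflict:
    conflict = True
print(v1, v2, conflict)  # 1 2 True
```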

 

Mapping

Elasticsearch is schema-less in the sense that you don’t have to define a schema up front.  Under the hood it uses mappings, which are basically a schema, but it creates them for you, which makes working with it much easier.

To see the mapping of our indexed blog

$ curl localhost:9200/documents/_mapping

 

[Image: mapping document]

 

Here you get the mapping for the index documents.  There is the type blog and a list of all properties from our blog.  Elasticsearch automatically predicts the data types.  If Elasticsearch doesn’t predict the right type for your field, then you can supply a mapping when you create an index.
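
To give a feeling for what predicting the data types means, here is a rough sketch that derives a mapping-like structure from a JSON document.  It is a simplification and not Elasticsearch's actual logic: the type names are modeled on what a mapping response reports and should be treated as an assumption, and things like date detection in strings or arrays are ignored.

```python
def guess_field_type(value):
    """Guess a mapping entry for one JSON value (toy rules only)."""
    if isinstance(value, bool):  # check bool first: in Python, bool is an int
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "long"}
    if isinstance(value, float):
        return {"type": "double"}
    if isinstance(value, str):
        return {"type": "string"}
    if isinstance(value, dict):  # nested JSON object
        return {"properties": guess_mapping(value)}
    raise TypeError(f"unsupported value in this sketch: {value!r}")


def guess_mapping(document):
    """Guess a mapping entry for every field of a document."""
    return {field: guess_field_type(value) for field, value in document.items()}


# Hypothetical document values; the field names come from this post.
mapping = guess_mapping({"word_count": 1200, "author": {"name": "Pip the Troll"}})
print(mapping)
```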

 

Getting data

To get a document, we send an HTTP GET with the index name, type, and ID.

$ curl -XGET "http://localhost:9200/documents/blog/one"

The response looks like this

 

[Image: GET id response]

 

We get metadata and the _source field containing the JSON that we have indexed.

 

Searching

A document needs to be indexed before you can search for it. Elasticsearch refreshes its indices every second by default, so a newly indexed document becomes searchable after a short delay.

Let’s search for our data set

$ curl -XGET "http://localhost:9200/documents/blog/_search?q=_id:one"

 

[Image: result search id]

 

This is a query on the ID one.  It uses the search facilities, but as we are searching by ID, it can only return zero or one document.

 

Lucene under the hood

Elasticsearch is built on top of Lucene, a mature Java library that is proven, tested, and the best of its kind in open source search software.  Everything related to indexing and searching text is implemented in Lucene.  Elasticsearch builds an infrastructure around Lucene.  While Lucene is a great tool, it can be cumbersome to use directly, and it doesn’t provide any facilities to scale past a single node.  Elasticsearch provides an easier, more intuitive API as well as the infrastructure and operational tools for simple scalability across multiple nodes.  The REST API also allows interoperability with non-Java languages.

A shard in Elasticsearch refers to a Lucene index.  Elasticsearch by default uses five shards for each Elasticsearch index.  A document is stored in one shard.  Elasticsearch supports replica shards.  One replica is configured by default.

Let’s put in another document

 

[Image: second data]

 

Now let’s search for the term english

 

[Image: search result english]

 

We get two hits

 

[Image: result search english]

 

We can also search on a specific field.  Nested fields are addressed with a dot separator, as in the next example.

 

 

[Image: query nested]

 

We get a result that matches the author name Pip the Troll

 

[Image: result search pip]

 

In the last example we used a prefix query, so we did not provide the full value of the field.  For more types of queries refer to the elasticsearch.org guide.  This is the gist of a search engine: we don’t need to know exactly what we are searching for.  We can provide what we know and get results for what might match, in contrast to an exact lookup that only returns what must match.  We can find word stems, synonyms, and misspellings, and we can even provide autocompletion.
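
For reference, the body of a prefix query on a nested field can be sketched like this.  The field name author.name follows from this post, but the prefix value pip and the helper names are assumptions, since the exact request is not reproduced here.  The second function is only a local toy imitation of the matching, assuming the field was lowercased and split into words at index time.

```python
import json


def prefix_query(field, prefix):
    """Build the JSON body of a prefix query on one field."""
    return {"query": {"prefix": {field: prefix}}}


def matches_prefix(text, prefix):
    """Toy imitation: does any lowercased word of the text start with the prefix?"""
    return any(word.startswith(prefix) for word in text.lower().split())


body = prefix_query("author.name", "pip")  # dot separator addresses the nested field
print(json.dumps(body))

print(matches_prefix("Pip the Troll", "pip"))  # True
print(matches_prefix("Pip the Troll", "gob"))  # False
```

The body would be sent to the _search endpoint of the index, in the same way as the other search requests in this post.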

 

Clustering

You can get some information about your cluster with

$ curl -XGET "http://localhost:9200/_cluster/health"

[Image: cluster health response]

 

Alternatively, if you have installed the Head plugin, you can open http://localhost:9201/_plugin/head/.  The status is green.  We have 5 active primary shards and 10 active shards in total, because with two nodes running, the replica of each primary shard can be allocated on the other node.  Our indexed documents are therefore available on each node.

Let’s shut down our master.  Go to the console of your master node and press CTRL-C, then look at the console output of NODE_2.

 

[Image: master gone]

 

You can see that NODE_2 noticed that the master has left the cluster and elected itself as the new master.  Check the status again

$ curl -XGET "http://localhost:9201/_cluster/health"

Make sure to use port 9201 of NODE_2, not 9200.

 

[Image: cluster health yellow]

 

You can see that the cluster status is now yellow, because with only one node left the replica shards are unassigned.  The search functionality of the cluster is still working, though.  If you send our earlier search requests again, this time to port 9201 of NODE_2, you still get the search results, because everything we have indexed was available on every node.
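
The status colors can be reproduced with a small back-of-the-envelope model.  It is a sketch under simplifying assumptions, not the real allocation algorithm: it looks at a single index and only relies on the rule that two copies of the same shard never live on the same node.  The function name is made up.

```python
def cluster_health(primaries, replicas_per_primary, live_nodes):
    """Toy health model for one index with evenly spread shard copies."""
    wanted = primaries * (1 + replicas_per_primary)
    if live_nodes < 1:
        # No node left: not even the primaries can be allocated.
        return {"status": "red", "active_shards": 0, "unassigned_shards": wanted}
    # Each shard can have at most one copy per node.
    copies_per_shard = min(1 + replicas_per_primary, live_nodes)
    active = primaries * copies_per_shard
    status = "green" if active == wanted else "yellow"
    return {
        "status": status,
        "active_shards": active,
        "unassigned_shards": wanted - active,
    }


# Defaults mentioned in this post: five shards and one replica per index.
print(cluster_health(5, 1, live_nodes=2))  # green, 10 active shards
print(cluster_health(5, 1, live_nodes=1))  # yellow, 5 active, 5 unassigned
```

Yellow therefore means that all primary shards are available but some replicas have nowhere to go.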

If you start our previous master back up and check the cluster status, it will be back to green.

 

Facets

At some point you will be interested in information about the data you have indexed.  In our case we might be interested in the average number of words across all indexed blog articles.  Elasticsearch has a feature called facets that provides aggregated statistics about a query.  This is a core part of Elasticsearch and is part of the search API.  Facets are always bound to a query and provide aggregate statistics alongside the query results.  They are highly configurable and can return complex groupings of nested filters, spans of amounts, or spans of time.  Even full Elasticsearch queries can be nested inside a facet.  In Elasticsearch 1.0 this feature will be called Aggregations and is supposed to have more features and be more composable.

 

[Image: facet query]

 

Here we define a facet together with a match all query.  The facet is a predefined statistical facet for numeric fields, in this case our word_count field.
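
As a sketch, such a request body could be built as shown below, together with a local calculation of the kind of summary numbers a statistical facet reports.  The facet name word_count_stats, the helper names, and the sample word counts are assumptions made up for illustration; only the field name word_count comes from this post.

```python
import json


def statistical_facet_request(field, facet_name):
    """Build a search body: match all documents, plus one statistical facet."""
    return {
        "query": {"match_all": {}},
        "facets": {facet_name: {"statistical": {"field": field}}},
    }


def local_statistics(values):
    """Compute a few summary statistics locally, for comparison."""
    count = len(values)
    total = sum(values)
    return {
        "count": count,
        "total": total,
        "min": min(values),
        "max": max(values),
        "mean": total / count,
    }


body = statistical_facet_request("word_count", "word_count_stats")
print(json.dumps(body, indent=2))

word_counts = [1200, 800]  # hypothetical word counts of two blog articles
print(local_statistics(word_counts))
```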

 

[Image: facets response]

 

In the response we receive the query results first, followed by the facet response with statistics on our word_count field. There are further predefined facets you can choose from to get statistical information about your data. There is also Kibana, a graphical front-end for data analysis that is specifically tailored to Elasticsearch and has become very popular.

 

That’s it!

No, of course that’s not it. It’s just all I wanted to show you in this 101. I’ve introduced you to quite a few topics, but we have barely scratched the surface of what there is to discover about Elasticsearch.

The query API, for instance, has much more to offer than we have covered. There are many interesting types of queries and filters that can be used.  To get the most out of natural language searches and other complex types of data, you’ll come across analyzers. Analyzers are the tools that slice and dice text into words and word stems to create an efficient search space for natural languages. Word stemming allows Elasticsearch to find linguistically similar words. Percolators are another very interesting topic. Percolators allow you to index queries, then send documents to Elasticsearch and find out which queries match each document. In other words, the usual search operation runs in reverse. And there is even more.
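
To illustrate the reversed direction of a percolator, here is a toy model in Python.  It is only a sketch with made-up names and has nothing to do with the real percolator API: queries are registered first, and each incoming document is then checked against all of them.

```python
class ToyPercolator:
    """Store simple one-word queries and match incoming documents against them."""

    def __init__(self):
        self._queries = {}  # query name -> word that must occur in the document

    def register(self, name, term):
        """Register ("index") a query under a name."""
        self._queries[name] = term.lower()

    def percolate(self, text):
        """Return the names of all registered queries that match the document."""
        words = set(text.lower().split())
        return sorted(
            name for name, term in self._queries.items() if term in words
        )


percolator = ToyPercolator()
percolator.register("search-alert", "elasticsearch")
percolator.register("troll-alert", "troll")

# Hypothetical document text.
matches = percolator.percolate("Pip the Troll writes about Elasticsearch")
print(matches)  # ['search-alert', 'troll-alert']
```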

I hope you found this post interesting and useful on your quest to discover this awesome piece of technology. Thank you for reading, and stay in the loop for more posts to come on Elasticsearch. In the meantime I’ve put together a list of links, which you can find below. There is also a great interview with GitHub about Elasticsearch at scale.

Where to go from here

To learn more about Lucene go to the Lucene documentation or visit the Lucene wiki.

Interview with GitHub about Elasticsearch at scale

http://exploringelasticsearch.com/book/elasticsearch-at-scale-interviews/interview-with-the-github-elasticsearch-team.html