Elasticsearch tips: inserting vs. updating your index - codecentric AG Blog

:

Transforming an update-heavy Elasticsearch use case into an insert-heavy one.

Just recently i’ve had the opportunity to set up an Elasticsearch installation at a customer that had a rather unique use case, and i’d like to share my approach of that with you. This post will show why an update heavy use of Elasticsearch is a bad idea and how you could transform it into an insert heavy one, which is way faster.

Prerequisites

The requirements involved tracking the lifecycle of a document that entered the company via various input channels, and is processed by a number of automated systems. Sometimes it happens that one of these documents gets lost between steps or is misanalyzed and therefore gets lost in the system. If someone happens to inquire the status of such a lost document noone could really give a good answer on that, or attempt to fix it. That’s not a desirable state.

Fortunately the “metadata” of such a document does contain the OCR fulltext, so any kind of “storage engine” with fulltext search capabilities is needed, and that really sounds like a job for elasticsearch! It’s especially easy because we were able to hook custom code into each of these processing steps. Another great coincidence is that we can print a barcode on each document, so every process step can be truly independent of the others. This will influence my conclusion later on.

As for the general usage of this system, I would expect to have a lot of writing operations (lots of documents processed, most of them without errors) and only few reading operations (you only check when something went wrong, if at all). This will bring us to some conclusions you would not expect in a more traditional use case.

The ‘naive’ NoSQL approach

As with every Elasticsearch project I’m involved in I like to step back first and give the data model a good thought. Sure, Elasticsearch is schemaless, but that does not mean you can skip thinking about your data at all, especially not if you want acceptable performance later on. Naturally I was inclined to think of a document as a flat structure, that contains it’s various events and the respective results and timestamps. They could be thought of relations, sure, but since they are naturally tightly bound to each other (in classical terms a 1-1 Relationship, if you will) it saves you the awkwardness of joining things together.

Implemented that would mean that the first operation on an document would create (or upsert it) and each following step would update the document accordingly. I’m not quite happy with this approach since updating lots of documents all the time has the following drawbacks:

Downsides to frequent updates

  • Cost for get_then_reindex
  • Any updates you would do during the lifecycle would mostly be “partial updates”, where you only send the things that have changed to the Elasticsearch cluster. In fact, the independent software systems should really be unaware of the state updates the other systems did to avoid coupling of these systems. Elasticsearch allows us to do partial updates, but internally these are “get_then_update” operations, where the whole document is fetched, the changes are applied and then the document is indexed again. Even without disk hits one can imagine the potential performance implications if this is your main use case.

  • potential version conflicts
  • The “get_then_update” operations are not atomic, and Elasticsearch uses implicit versioning of it’s documents, so version conflicts are to be expected. They are automatically handled (by last-write-wins) and do not need to be handled by your software, but it’s another performance impact you have to be aware of.

  • need to store _source
  • Another uniqueness about the “get_then_update” update is that Elasticsearch can not use the indexed document itself but needs the original instead. This forces you to keep the _source field activated. In my case that was not an issue but it’s something to be aware of.

  • Lucene “soft deletes” and merging cost
  • On the Lucene layer, an update is actually not an update but an (atomic) “insert and delete” operation. But alas, this is still not the full truth: Deletes are soft, that means they are marked with a tombstone flag and reside in the segment. Only a merging operation will clean them up eventually. Like garbage collection for instantiated and dereferenced objects, this can lead to additional pressure on your system.

In conclusion, update operations can be considered rather expensive. Now that our application will ultimately consist of (almost) nothing but update operations, this seems like a bad idea. Let’s try changing that.

Index instead of Update

To achieve a different operation we need to split the document into its event parts – so we have got a relation going. To keep things a little denormalized we can reduce the event into a single type that contains all the possible data – and only fill the fields relevant to that:

es_update_insert

Relations in Elasticsearch

To handle relations, Elasticsearch provides us with two different mechanisms that both have their individual pros and cons: nested documents and parent-child relations. For an in depth introduction to both concepts, i’d recommend reading the Elasticsearch Guide’s chapter on modeling your data.

Without poorly replicating the description, in a nutshell, the nested documents live inside the original document type and the parent-child documents live separately in their own type, and are joined at query time. You need to be aware that parents and their children necessarily have to live on the same shard, and the parent-child ID map is held in memory.

For our specific use case  (where there are plenty of updates which are our performance concern, and search performance is actually negligible), we chose a parent-child relation as the better fit: we can truly insert a new event, without touching the original document or any of the other events. This is possible in this case because every step in the process chain does already know about the ID of the document without touching Elasticsearch. It’s a printed ID on the document, that we can reuse as an ID for our Dokument type.

In the end, the performance numbers on the hardware we had to our disposal prove us right: We are able to process a day’s worth of data in about 2 seconds!

Conclusion

While this was a relatively rare use case that you probably won’t encounter in the wild, it contains an interesting essence: Sometimes the “natural” or obvious data model goes against the inner workings of Elasticsearch, and it’s useful to remodel your data to better fit your system. Afterwards (and by that I mean the rest of the week, talking about how insanely fast one can accomplish results with elasticsearch) we were able to develop a small webapp where users can search the generated data – and were pleasantly surprised that search operations are still way faster than we anticipated!