Scaling an Elasticsearch Index - Introduction - codecentric AG Blog
A well-known design decision of Elasticsearch is that a fixed number of shards has to be specified when creating an index. It is not possible to start out with just one or only a few shards and add more shards later as the data increases.
Now what to do if we find ourselves in a situation where the capacity of an index is exhausted and we really need to extend that index? We will try to find the answer in this blog series.
First of all, let’s try to understand why Elasticsearch doesn’t provide anything like dynamic shard splitting. Quite a few other database and search engine products offer shard splitting so that extending a database or an index can be done conveniently via a single API call.
The most obvious advantage of the approach taken by Elasticsearch is that a particular shard key will always point to the same shard. There is no need to implement a complicated shard splitting algorithm, with heavyweight behind-the-scenes data migrations, high disk space requirements, and all kinds of unexpected error situations. Further reasoning, shedding some light on why shard splitting is considered a bad idea, is given here.
The recommendation for Elasticsearch users goes like this: First, estimate the capacity of a single shard by performing measurements with realistic amounts of data. Next, based on the results of those measurements, calculate the total number of shards needed to hold the expected data. Finally, overallocate a little to have some headroom in case something unexpected happens.
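The calculation behind that recommendation can be sketched in a few lines. All numbers below are made up for illustration, as are the function and parameter names:

```python
import math

def shards_needed(expected_data_gb, shard_capacity_gb, headroom=1.25):
    """Number of shards needed to hold the expected data, with a
    headroom factor for overallocation (1.25 = 25% extra capacity)."""
    return math.ceil(expected_data_gb / shard_capacity_gb * headroom)

# Example: measurements suggest a single shard comfortably handles
# ~30 GB, and we expect ~1 TB of data overall.
print(shards_needed(1000, 30))  # 42
```

Since the shard count is fixed at index creation time, this number is the one decision that cannot easily be revised later.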
Probably, just following that recommendation will satisfactorily handle many scenarios arising in practice. Also, when designing a system, we should spend some time thinking about the expected amounts of data anyway, even without any sharding involved. On the other hand, there is no guarantee that our expectations or measurements are correct – unexpected success of a startup is a typical example. We have to accept that even with careful planning an Elasticsearch index may reach its capacity. And if it happens, it is guaranteed to happen at the worst possible point in time, so we should better be prepared. The following two approaches are available:
- Extend the index by a second index and have queries cover both of them.
- Replace the index by a new, bigger index. Once all data are migrated, use the new index only.
Note that, regarding search performance, both approaches are equivalent because it doesn’t make any difference if we query a single index with, say, 50 shards or 50 indexes with one shard each. In both cases, 50 Lucene indexes are searched.
In this article, we will concentrate on the first approach, extending an index. Parts 2 and 3 of this series will cover the second approach, migrating to a new index.
Extending an index
Suppose we have an „index1“ that we want to extend by a second index „index2“. This can be achieved with the following steps:
- Create a second index „index2“ with the same mapping as „index1“.
- Direct all new documents to „index2“. If the application already uses an alias for indexing, that alias only needs to be switched from „index1“ to „index2“. If there is no such alias yet, it can be created in a preparatory step (which might involve a new deployment or restart of the application). Of course, we can do without an alias just by hardcoding „index2“ into the application, but the use of an alias is recommended for increased flexibility.
- Direct search queries to both indexes. Ideally, the application already uses a separate alias for queries and we only need to update that alias definition to point to both „index1“ and „index2“. Once again, this can also be done without an alias by specifying both indexes in search requests, but the use of an alias is recommended.
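The steps above map to a single atomic request against Elasticsearch’s `_aliases` endpoint. The following sketch only builds the request body; the alias names „docs-write“ and „docs-search“ are assumptions for illustration, and „docs-search“ is assumed to already point to „index1“:

```python
import json

# Atomic alias update: switch the write alias from index1 to index2
# and widen the search alias to cover index2 as well.
actions = {
    "actions": [
        {"remove": {"index": "index1", "alias": "docs-write"}},
        {"add":    {"index": "index2", "alias": "docs-write"}},
        {"add":    {"index": "index2", "alias": "docs-search"}},
    ]
}

# POST this body to the cluster's _aliases endpoint, e.g.:
#   curl -XPOST 'http://localhost:9200/_aliases' -d '...'
print(json.dumps(actions, indent=2))
```

Because all actions in one `_aliases` request are applied atomically, there is no window in which new documents would be indexed into „index1“ while searches already expect them elsewhere.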
Readers familiar with the ELK stack may notice that this is exactly how Elasticsearch indexes are usually managed for scenarios involving time-based data, e.g., log data. With ELK, the steps outlined above are applied regularly, e.g., once per day or per week. Extending an index in this way is a perfect fit for use cases where documents are indexed once and never changed later, where basically all you do is index and search.
Unfortunately, things are not so simple in the general case where the index needs to support requests by document ID beyond indexing, e.g., retrieving, updating or deleting a document. With multiple indexes, for each such operation we need to know which index holds the document in question. This in turn means that we have to implement an additional sharding-like mechanism in the application. Furthermore, this mechanism has to identify all already existing documents as stored in „index1“. How can we achieve that? Let’s take a look at two possible options.
Option 1: Use information already available in documents
While it’s a good shard key for fresh indexes, the document ID doesn’t qualify as a shard key in the scenario at hand: Documents with all kinds of IDs have already been stored in „index1“, so we cannot find a clear rule to distinguish them from documents directed to „index2“. Instead, a good candidate for our shard key is something like the creation date of a document. Assuming that all documents to be indexed carry a creation date field with them, we could modify our application to send all documents with a creation date later than some point in time to „index2“. All documents with an earlier creation date will be sent to „index1“. When accessing a document by ID, its creation date can be used to identify the index to use. Let’s discuss this idea.
- Already existing information is reused, so there is no need to modify any other part of the application beyond the search engine clients.
- The search engine clients of the application need to be modified to perform their own sharding via the document creation date. Unfortunately, it’s not possible to use an alias for that purpose. Thus, in contrast to ELK-type scenarios, clients cannot just index into a single „current“ index (alias) but have to be aware of different indexes.
- When reading, updating or deleting a document, the creation date of the document needs to be available. Most likely this will involve reading it from a primary database (by document ID) first, which results in an additional database call. Note that without some primary database the outlined approach is not possible at all because then there is no place to fetch the document creation date from.
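The routing rule itself is trivial once the creation date is at hand. A minimal sketch, where the cutoff timestamp is an assumption chosen for illustration:

```python
from datetime import datetime, timezone

# Documents created from this instant on are stored in index2.
# The exact cutoff is whatever moment the alias switch happened.
CUTOFF = datetime(2015, 6, 1, tzinfo=timezone.utc)

def index_for(creation_date):
    """Pick the index for a document based on its creation date."""
    return "index2" if creation_date >= CUTOFF else "index1"

old_doc = datetime(2014, 12, 24, tzinfo=timezone.utc)
new_doc = datetime(2015, 8, 15, tzinfo=timezone.utc)
print(index_for(old_doc))  # index1
print(index_for(new_doc))  # index2
```

The awkward part is not this function but obtaining `creation_date` in the first place, which, as noted above, usually means an extra round trip to the primary database.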
Option 2: Store additional indexing information in the primary database
Given that option 1 already requires the presence of a primary database, we can also consider making slightly bigger changes to the application in order to achieve a simpler result. How about just storing the respective index name directly with documents in the primary DB? Whenever needed, it can be read from there and provided to the search engine client. For entries that don’t have an index name stored in the primary database, we can just assume that the original index, „index1“, is the correct index. Let’s discuss this idea:
- Search clients can work with index names directly. They don’t have to know about artificial rules involving, e.g., creation dates.
- The index name for each document has to be stored in the primary database. This requires changing all parts of the application that store new documents in the primary database. Also, we might even have to update the database schema.
- The primary database will be directly coupled to the search index. This coupling might create additional complexity for any future changes.
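The lookup logic for this option reduces to a fallback rule. A minimal sketch, where the record layout and the column name `search_index` are assumptions for illustration:

```python
def resolve_index(record):
    """Resolve the search index for a document record loaded from the
    primary database. Records written before the index was extended
    have no 'search_index' value, so we fall back to the original
    index, index1."""
    return record.get("search_index") or "index1"

# A legacy record (stored before index2 existed) and a new one:
print(resolve_index({"id": 7}))                            # index1
print(resolve_index({"id": 8, "search_index": "index2"}))  # index2
```

The fallback is what spares us a migration of existing database rows: only documents indexed after the extension need the new column filled in.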
When there is no need to retrieve, update or delete single documents by ID, extending an index by a second one is a fairly simple and promising approach. However, if operations by document ID are required, things get more difficult. We have discussed two possible options for extending an index, but neither is particularly attractive or straightforward to implement. One more option we didn’t discuss is to simply send every request by document ID to both indexes, since only one of them will be able to act on it. While that is technically valid, let’s not consider it a serious attempt at a solution. Unfortunately, not much more comes to (my) mind.
But no need to despair! In the next parts of this series we will delve deep into the details of the second approach, migrating to a new index.