Or “Why can’t you answer a simple question?”

Client after client and user after user (on the user’s list) ask the perfectly reasonable question: “Given documents of size X, what kind of hardware do we need to run Solr?”. I shudder whenever that question is asked because our answer is inevitably “It Depends ™”. This is like asking “how big a machine will I need to run a C program?”. You have to know what the program is trying to do as well as how much data there is. The number of documents you can put on a single instance of Solr is most often limited by Java’s heap. To give you an idea how wide the range is, we at Lucidworks have seen:

  • 10M docs require 64G of Java heap. Zing was used to keep GC under control
  • 300M docs fit in 12G of Java heap.

We usually reply “We can’t say in the abstract, you have to prototype”. This isn’t laziness on our part, we reply this way because there are a number of factors that go into answering this question, and clients rarely know ahead of time what the actual characteristics of the data and search patterns will be. Answering that question usually involves prototyping anyway. I’ve personally tried at least three times to create a matrix to help answer this question and given up after a while because the answers even for the same set of documents vary so widely!

This last is somewhat counter-intuitive. To use a simplified example; say I have a corpus of 11M documents indexed. I have two fields, both string fields “type” and “id”. Type has 11 unique values, and id has 11M unique values (it’s a <uniqueKey>). For simplicity’s sake, each unique value in these fields is exactly 32 characters long. There is a certain amount of overhead for storing a string in Java, plus there is some extra information kept by Solr to, say, sort. The total memory needed by Solr to store a value for sorting is 56 bytes + (2 * characters_in_string) as I remember. So, in the 3.x code line, the RAM needed to sort by:

  • The “type” field is trivial (11 * (56 + 64) = 1,320) bytes.
  • The “id” field is, well, 1 million times that (11,000,000 * (56 + 64) = 1,320,000,000) bytes.

Now you see why there’s no simple answer. The size of the source document is almost totally irrelevant to the memory requirements in the example above, each document could be very short, 32 bytes for the “id” and 32 bytes for “type”. Of course you also have resources required for the inverted index, faceting, document caching, filter query caching, etc, etc, etc….. Each and every one of these features may require widely varying resources depending on how they’re used. “It Depends ™”.

But what about an “average” document?

Which leads to the question “OK, you can’t say specifically. But don’t you at least have some averages? Given that our documents average ### bytes, can’t you make an estimate?” Unfortunately, there’s no such thing as an “average” document. Here are a couple of examples:

  • MS Word documents. The directory listing says our Word documents average 16K. Can’t you use that number? Unfortunately not unless you can tell us how much text is in each. That 16K may be the world’s simplest Word document with 1K of formatting information, the rest text. It may be a Word document with 1K of text and the rest graphics, and Solr doesn’t index non-textual data.
  • Videos. This is even worse. We practically guarantee that 99.9% of a video will be thrown away, it’s non-textual data that isn’t indexed in Solr. And notice the subtlety here. Even though we’ll throw away almost all of the bytes, the sorting could still be as above and take a surprising amount of memory!
  • RDBMS. This is my favorite. It usually goes along the lines of “We have 7 tables, table 1 has 25M rows, 3 numeric columns and 3 text columns averaging 120 bytes, table 2 has 1M rows, 4 columns, 2 numeric and 2 textual columns averaging 64 bytes each. Table 3 has…”. Well, often in Solr, you have to denormalize the data for optimal user experience. Depending upon the database structure, each row in table 1 could be joined to 100 rows of table 2 and denormalizing the data could require that you have (25,000,000 * 100 * 64 * 2) bytes of raw data for just the two tables! Then again, table 1 and table 2 could have no join relationship at all. So trying to predict what that means for Solr in the abstract is just a good way to go mad.
    • And if you want to truly go mad, consider what it means if it turns out that some of the rows in table 1 have BLOBs that are associated text.
  • I once worked at a place where, honest, one of the “documents” was (I wouldn’t lie to you), a 23 volume specialized encyclopedia.

And this doesn’t even address how the data in these documents are used. As the example above illustrated, how the data is used is as important as its raw size.

Other factors in sizing a machine

The first example above only deals with sorting on a couple of fields. Here is a list of questions that help at least determine whether the hardware needs to be commodity PCs or supercomputers:

  • For each field you will use for sorting
    • How many unique values will there be?
    • What kind of values for each? (String? Date? Numeric?)
  • For each field you will use for faceting
    • How many unique values will there be?
    • What kinds of values?
  • How many filter queries need to be kept in the cache?
  • How many documents will you have in your corpus?
  • How big are your returned pages (i.e. how many results do you want to display at once)?
  • Will the search results be created entirely from Solr content or are you going to fetch part of the information from some other source (e.g. RDBMS, filesystem)?
  • How many documents, on average, do you expect to be deleted in each segment? (*)
  • Do you intend to have any custom caches? (*)
  • How many fields do you expect to store term vectors for and how many terms in each? (**)
  • How many fields do you expect to store norms for and how many terms in each? (**)
  • How will your users structure their queries? (***)
  • What is the query load you need to support?
  • What are acceptable response times (max/median/99th percentile)?

I’ve thrown a curve ball here with the entries marked (*) and (**). Asking a client to answer these questions, assuming they’re not already Solr/search experts, is just cruel. They’re gibberish unless and until you understand the end product (and Solr) thoroughly. Yet they’ll affect your hardware (mostly memory) requirements!

The entries marked (**) can actually be answered early in the process if and only if a client can answer questions like “Do you require phrase queries to be supported?” and “Do you require length normalization to be taken into account?”. This last is also gibberish unless you understand how Solr/Lucene scoring works.

And the entry marked (***) is just impossible to answer unless you’re either strictly forming the queries programmatically or have an existing application you can mine for queries. And even if you do have queries from an existing application, when users get on a new system the usage patterns very often change.

Another problem is that answers to these questions often aren’t forthcoming until the product managers see the effects of, say, omitting norms. Which they can’t see until a prototype is available. So one can make the “best guess” as to the answers, create a prototype and measure.

Take pity on the operations people

Somewhere in your organization is a group responsible for ordering hardware and keeping it running smoothly. I have complete sympathy when the person responsible for coordinating with this group doesn’t like the answer “We can’t tell you what hardware you need until after we prototype”. They have to buy the hardware, provision it and wake up in the middle of the night if the system slows to a crawl and try to get it back running before all the traffic hits in the morning. Asking the operations people to wait before ordering their hardware until you have a prototype running and can measure justifiably causes them to break out in hives. The (valid) fear is that they won’t get the information they need to do their job until a week before go-live. Be nice to your ops people and get the prototype going first thing.

Take pity on the project sponsors

The executives who are responsible for a LucidWorks or Solr project also break out in hives when they’re told “We won’t know what kind of machine we will need for a month or two”, and justifiably so. They have to go ask for money to pay your salary and buy hardware after all. And you’re telling them “We don’t know how much hardware we’ll need, but get the budget approved anyway”.

The best advice I can give is to offer to create a prototype as below. Fortunately, you can use Velocity Response Writer or the Lucidworks UI to see what the search results look like to get a very good idea of the kinds of searches you’ll want to support very quickly. It won’t be the UI for your product, but it will let you see what search results look like. And you can often use some piece of hardware you have lying around (or rent a Cloud machine) to run some stress tests on. Offer your sponsor a defined go/no-go prototyping project; at least the risk is known.

And the work won’t be wasted if you continue the project. The stress-test harness will be required in my opinion before go-live. The UI prototyping will be required before you have a decent user experience.

The other thing to offer your sponsor is that “Solr rocks”. We can tell you that we have clients indexing and searching billions of documents, and getting sub-second response times. To be sure, they have something other than commodity PCs running their apps, and they’ve had to shard….

Prototyping: how to get a handle on this problem

Of course it’s unacceptable to say “just put it all together and go live, you’ll figure it out then”. Fortunately, one can make reliable estimates, but this involves prototyping. Here’s what we recommend.

Take a machine you think is close to what you want to use for production and make your best guess as to how it will be used. Making a “best guess” may involve:

  • Mining any current applications for usage patterns
  • Working with your product managers to create realistic scenarios
  • Getting data, either synthesizing them or using your real documents
  • Getting queries, either synthesizing them or mining existing applications

Once this has been done, you need two numbers for your target hardware: how many queries per second you can run and how many documents you can put on that machine.

To get the first number, pick some “reasonable” number of docs. Personally I choose something on the order of 10M. Now use one of the load-testing tools (jMeter, SolrMeter) to fire off enough queries (you have to have generated sample queries!) to saturate that machine. Solr usually shows a flattening QPS rate. By that I mean you’ll hit, say, 10 (or 50 or 100) QPS and stay there. Firing off more queries will change the average response time, but the QPS rate will stay relatively constant.

Now, take say 80% of the QPS rate above and start adding documents to the Solr instance in increments of, say, 1M until the machine falls over. This can be less graceful than saturating the machine with queries, you can reach a tipping point where the response rises dramatically.

Now you have two numbers, the maximum QPS rate you can expect and the number of documents your target hardware can handle. Various monitoring tools can be used to alert you when you start getting close to either number so you can take some kind of preventative action.

Do note that there are a significant number of tuning parameters that can influence these numbers, and the exercise of understanding these early in the process will be invaluable for ongoing maintenance. And having a test harness for testing out changes you want to make for release N + 1 will be more valuable yet. Not to mention the interesting tricks that can be played with multi-core machines.

Scaling

OK, you have these magic numbers of query rates and number of documents per machine. What happens when you approach these? Then you will implement the standard Solr scaling process.

As long as the entire index fits on a single machine with reasonable performance, you can scale as necessary by simply adding more slave machines to achieve whatever QPS rate you need.

When you get near the number of documents you can host on a single machine, you need to either move to a bigger machine or shard. If you anticipate growth that will require sharding, you can start out with multiple shards (perhaps hosted on a single machine) with the intent of distributing these out to separate hardware as necessary later.

These topics are covered elsewhere, so I’ll not repeat them in any more detail here, but these are standard Solr use-cases, see:

  • http://wiki.apache.org/solr/CollectionDistribution
  • http://wiki.apache.org/solr/DistributedSearch/

And it gets worse

Say you have created a model for your usage patterns and managed to fit it into a nice spreadsheet. Now you want to take advantage of some of the nifty features in Solr 4.x. Your model is now almost, but not quite totally, useless.

There have been some remarkable improvements in Solr/Lucene memory usage with the FST-based structures in the 4.x code line (now in alpha). Here are some useful blogs:

  • For more background, see Mike McCandless’ blog posts http://blog.mikemccandless.com/2010/12/using-finite-state-transducers-in.html and http://blog.mikemccandless.com/2011/01/finite-state-transducers-part-2.html.
  • If you want to see what kind of work goes into something like this, http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html.
  • In case you can’t tell, I’m a fan of Mike’s blogs….
  • I blogged about this here: http://www.lucidimagination.com/blog/2012/04/06/memory-comparisons-between-solr-3x-and-trunk/.

Why do I digress with this? Because, on some tests I ran the memory requirements for the same operations shrank by 2/3 between 3.x and trunk/4.0 (4.0-ALPHA is out). So even if you have the right formula for calculating all this in the abstract for 3.x Solr, that information may now be misleading. Admittedly, I used a worst-case test, but this is another illustration why having a prototype-and-stress-test setup will serve you well.

And SolrCloud (Solr running with ZooKeeper for automatic distributed indexing, searching, and near-real-time) also changes the game considerably. Sometimes I just wish the developers would hold still for a while and let me catch up!

Personal rant

I spent more years than I want to admit to programming. It took me a long time to stop telling people what they wanted to hear, even if that answer was nonsense! The problem with trying to answer the sizing question is exactly the problem developers face when being asked “How long will you take to write this program?”; anything you say will be so inaccurate that it’s worse than no answer at all. The whole Agile Programming methodology explicitly rejects being able to give accurate, far-off estimates, and I’m a convert.

I think it’s far kinder to tell people up-front “I can’t give you a number, here’s the process we’ll have to go through to find a reasonable estimate” and force the work to happen than give a glib, inaccurate response that results in one or more of the following guaranteed (almost).

  • Cost inaccuracies because the hardware specified was too large or too small
  • “Go live” that fails the first day because of inadequate hardware
  • “Go live” that’s delayed because the stress testing the week before the target date showed that lots of new hardware would be required
  • Delay and expense because the requirements change as soon as the Product Manager sees what the developers have implemented (I didn’t think it would look like that)… and the PM demands changes that require more hardware
  • Ulcers in the Operations department
  • Ulcers for the project sponsor

I’ll bend a bit if you’re following a strict waterfall model. But if you’re following a strict waterfall model, you can also answer all the questions outlined above and you can get a reasonable estimate up-front.

Other resources

Grant Ingersoll has also blogged on this topic, see: http://www.lucidimagination.com/blog/2011/09/14/estimating-memory-and-storage-for-lucenesolr/

Solr Wiki “Powered By”, see: http://wiki.apache.org/solr/PublicServers. Some of the people who added information to this page were kind enough to include hardware and throughput figures.