Indexing Strategies for Multilingual Search with Solr and Rosette
As a solutions engineer at Basis Technology, I often discuss the integration of Rosette and Apache Solr with our existing and potential clients, who look to Rosette to improve Solr search in many languages (including English). Generally this integration is a fairly simple process that involves little, if any, programming. There are a number of nuances to it, however, and we usually cannot go into as much detail as I would like during our calls because there are so many other things we need to cover. This post is meant to present a more complete picture of the various ways to integrate linguistic analysis into Solr.
Rosette’s Connector to Solr
The Rosette linguistics platform provides text analysis capabilities that can greatly improve the user experience, such as language identification, linguistic analysis (tokenization, lemmatization, decompounding), and entity extraction. The platform comes with a connector to Solr (as well as to Apache Lucene and dtSearch, which I’ll discuss in another post). With the connector, all it takes to integrate Rosette into your Solr installation is copying a few jar files and changing the configuration (via schema.xml and solrconfig.xml). The connector is provided with source, so you can easily fine-tune it if need be. For example, if you want to use another project built around Lucene, such as Elasticsearch, all it takes is a few tweaks to our connector and voilà.
Getting Started: Questions To Ask Yourself
There are a few questions you have to ask yourself as you begin planning your Rosette and Solr integration. In addition to those listed below, if you have an existing search system in production, it is also helpful to look at the most common queries, the queries that bring back no results, and other search patterns particular to your users.
- Is the document collection you plan to index
  - monolingual, or
  - multilingual, but every document is monolingual, or
  - multilingual both at the document and collection level?
- How high are your query demands? How many queries per second do you wish to support?
- How many languages do you want to handle? A setup for English + Chinese will be different from one for 40 different languages.
- How many text fields do you plan to index? What is the number you get from multiplying the number of text fields by the number of languages you want to process?
- Do you want your search engine to allow filtering or faceting based on the language(s) of a document?
- Will you know the language of the query string?
- What kinds of data will you be indexing? Is it log files, news articles, technical documentation, specialized documents (e.g. patents), …?
- What kinds of users and queries will you support? Advanced users with elaborate queries, exact matching, wildcard search, …?
- How do you plan to handle multilingual documents?
Multilingual Documents
How you handle multilingual documents depends in part on your data and your application. What proportion of your data is (or is expected to be) multilingual documents? Is it sufficient to take the dominant language and mis-process the parts of the document that are not in that language?
If you want to handle each language region according to its language, you need to use Rosette Language Boundary Locator to split the document into language regions. Suppose you have to index a message containing both English and Chinese text. The boundary locator will allow you to separate the two regions and perform English linguistic processing for one, and Chinese for the other.
If the document contains multiple regions of each language—for example some English, some Chinese, followed by some more English—you may combine all the language-specific regions into a single field value, or create separate values for every region.
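The two options can be sketched as follows. This is a minimal Python illustration, assuming the boundary locator's output has already been converted into a list of (language code, text) pairs; that input shape and both function names are hypothetical and are not part of Rosette's API.

```python
def combine_regions(regions):
    """Option 1: one field value per language, joining all regions of that language."""
    grouped = {}
    for language, text in regions:
        grouped.setdefault(language, []).append(text)
    return {language: " ".join(parts) for language, parts in grouped.items()}


def separate_regions(regions):
    """Option 2: one value per region, e.g. for a multi-valued field."""
    grouped = {}
    for language, text in regions:
        grouped.setdefault(language, []).append(text)
    return grouped


# Hypothetical boundary-locator output: English, Chinese, then English again.
regions = [
    ("eng", "Meeting moved to Friday."),
    ("zhs", "会议改到星期五。"),
    ("eng", "Please confirm."),
]

combined = combine_regions(regions)
separate = separate_regions(regions)
```

Combining keeps the index simpler, while separate values preserve region boundaries, which may matter for phrase queries that should not match across regions.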
These decisions depend on your data and your users’ expectations.
Based on your answers to the questions above, you can select one of the following multilingual text indexing approaches. Note that each approach has pros and cons, and the right approach depends on your specific application and scenario. Regardless of the approach you take, by its nature, indexing multilingual document sets is more complex than indexing single language documents. But if you do the work, you will have a truly multilingual search application to look forward to at the end of this project.
I. Multi-Field Indexing
This approach is the simplest, most widely used, and the one we recommend in the vast majority of cases. The only reason not to use this approach is if you plan to support many languages or many fields, as this approach is the most costly of the three at query time in high-throughput scenarios. Depending on the existing complexity of your queries, this approach may also be prohibitive in terms of further complicating them.
In this approach, you create one Solr field per “text field, language” pair. If you are starting with Solr fields “subject” and “body,” and dealing with English and Chinese, you would expand your field set (in schema.xml) to be:
- subject_eng
- subject_zhs
- body_eng
- body_zhs
As documents are being indexed, the language identifier will determine the language of the text and move the contents into the appropriate fields. For example, a subject in Chinese will be copied into subject_zhs. Because our connector is set up to detect the language before field-level indexing (the language identifier is configured as an UpdateRequestProcessor in solrconfig.xml), you can also store the detected language value for filtering or faceting in the interface.
Another option here is to call the language identifier before sending the data to Solr, and add a language attribute to all documents in your own code. You may choose to do this if you have additional considerations for the language selection, and/or other uses for the language field.
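As a rough sketch of that client-side option, the Python below moves each text field into its language-specific counterpart before the document is sent to Solr. The `detect_language` callable stands in for whatever language identifier you call, `toy_detector` is a deliberately crude placeholder so the example runs, and the `language` field name is an assumption.

```python
TEXT_FIELDS = ("subject", "body")


def route_to_language_fields(doc, detect_language):
    """Return a copy of doc with text fields renamed to <field>_<language>."""
    text = " ".join(str(doc[f]) for f in TEXT_FIELDS if f in doc)
    language = detect_language(text)
    routed = {k: v for k, v in doc.items() if k not in TEXT_FIELDS}
    for field in TEXT_FIELDS:
        if field in doc:
            routed[f"{field}_{language}"] = doc[field]
    routed["language"] = language  # assumed field name, kept for filtering/faceting
    return routed


def toy_detector(text):
    """Placeholder only: treats any CJK ideograph as Simplified Chinese."""
    return "zhs" if any("\u4e00" <= ch <= "\u9fff" for ch in text) else "eng"


routed = route_to_language_fields(
    {"id": "1", "subject": "搜索", "body": "多语言"}, toy_detector
)
```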
Given a query, if you know its language (or a small enough subset of possibilities), you can limit the search to only the fields in the corresponding language(s). However, as you might suspect, the more languages and fields you search, the slower your search performance will become. In our experience, this is not a significant issue, but it is something you will want to keep an eye on and benchmark.
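One way to restrict a query to the relevant language fields is Solr's extended DisMax parser, whose `qf` parameter lists the fields to search. The sketch below only builds the request parameters; the helper name is hypothetical, and the field names follow the “subject” and “body” example.

```python
def build_query_params(user_query, languages, fields=("subject", "body")):
    """Build edismax parameters that search only the given languages' fields."""
    qf = " ".join(f"{field}_{lang}" for lang in languages for field in fields)
    return {"q": user_query, "defType": "edismax", "qf": qf}


known_language = build_query_params("rosette", ["eng"])
unknown_language = build_query_params("rosette", ["eng", "zhs"])
```

The second call shows how the field list grows with each additional language, which is the cost worth benchmarking.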
This approach as described thus far would not involve any code writing, only configuration changes in solrconfig.xml and schema.xml. The case in which you have to write a small customization is if your collection contains multilingual documents and you want to use Rosette Language Boundary Locator (a feature within Rosette Language Identifier) to split them into language-specific regions so that you can process each one appropriately for its language. This customization involves writing an update request processor class and referencing it in solrconfig.xml. Although this code is not included in the connector, we are happy to share some draft code and ideas to help you along.
II. Multi-Core Indexing
The second approach is using one Solr core per language. In this case, each core will be responsible for documents in the language assigned to it. This option requires some programming and a more complex configuration. Because we are using different Solr cores, the language identifier must be run before sending the documents into Solr, because the language information will be used to determine which shard, or core, a document should be directed to.
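The routing step might look roughly like this in client code. The base URL and the core names are assumptions made for illustration; the language code is expected to come from a language identifier that runs before indexing.

```python
SOLR_BASE = "http://localhost:8983/solr"  # assumed local installation
CORE_BY_LANGUAGE = {"eng": "docs_eng", "zhs": "docs_zhs"}  # hypothetical core names


def update_url_for(language, default_core=None):
    """Pick the update endpoint of the core responsible for this language."""
    core = CORE_BY_LANGUAGE.get(language, default_core)
    if core is None:
        raise ValueError(f"no core configured for language {language!r}")
    return f"{SOLR_BASE}/{core}/update"


def group_by_core(docs_with_language):
    """Group (language, doc) pairs by target URL so each core gets one batch."""
    batches = {}
    for language, doc in docs_with_language:
        batches.setdefault(update_url_for(language), []).append(doc)
    return batches


batches = group_by_core([("eng", {"id": "1"}), ("zhs", {"id": "2"}), ("eng", {"id": "3"})])
```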
One advantage of this approach is that the field names are the same in all cores, which simplifies query-time processing because you no longer need to expand queries to search the multiple <field_name>_<language_code> fields. The field names in schema.xml remain the same in each core, and the analyzers/tokenizers/token-filters are specific to the language you are indexing.
Another benefit of this method is that your queries will be parallelized across the different language cores. Multi-field indexing described above, on the other hand, where querying body_eng and body_zhs happens in a single core, is essentially sequential.
Of course there are drawbacks (in addition to managing multiple cores). First of all, it’s very challenging to process multilingual documents with this approach. Whether you split a multilingual document across multiple cores or make copies of it on every core whose language matches one of the languages of the document, you have the possibility that a query will bring back the same document from multiple cores. This makes merging and ordering the results very complex.
Second, you have to deal with the difficulty of handling document relevancy metrics. Because term frequency counts are per core, if search results are being pulled from multiple queries (say, the user searched for “Paris” and you are bringing results from the English core and the French core), you have to decide how to merge the relevancy scores presented by each core. Solr will do a basic merge for you, but it may not be what you want.
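To make the merging problem concrete, here is one naive strategy, sketched in Python: scale each core's scores by that core's top score, then keep the best-scoring copy of each document. This is only an illustration of the decision you face, not what Solr does internally, and scores from different cores are not strictly comparable even after scaling.

```python
def merge_core_results(results_by_core):
    """results_by_core maps core name -> list of (doc_id, score) hits."""
    best = {}
    for core, hits in results_by_core.items():
        if not hits:
            continue
        top = max(score for _, score in hits)
        for doc_id, score in hits:
            normalized = score / top if top > 0 else 0.0
            # Keep only the highest-scoring copy of a document seen in several cores.
            if doc_id not in best or normalized > best[doc_id][0]:
                best[doc_id] = (normalized, core)
    merged = [(doc_id, score, core) for doc_id, (score, core) in best.items()]
    return sorted(merged, key=lambda hit: (-hit[1], hit[0]))


merged = merge_core_results({
    "eng": [("a", 4.0), ("b", 2.0)],
    "fra": [("b", 9.0), ("c", 3.0)],
})
```

Document "b" appears in both cores here; the sketch keeps the copy from the core where it ranked first, which is one defensible choice among several.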
In general, we do not recommend this approach, but it’s worth considering, especially if you are already familiar and comfortable with sharding.
III. Single-Field Multilingual Indexing
The third approach is to use a single Solr core and a single Solr field for each text field. In our example, you would keep just the fields “subject” and “body.” This method is recommended when you require very high query throughput, have complex queries, and are comfortable with the considerable amount of programming it requires. We learned this approach from one of our customers.
First, you need to customize our language identification UpdateRequestProcessor to prepend language information at the beginning of your content fields (“subject” and “body” in our example), based on the detected language of the document. For example, if “Subject” contained the sentence “How do you plug Rosette into Solr?” the string that goes into the “Subject” field analyzer would be “ENG|How do you plug Rosette into Solr?”. In addition, you may want to also store the language information in a separate field, for faceting/filtering purposes.
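The real customization would live in the Java UpdateRequestProcessor, but the transformation itself is small. Here it is sketched in Python, with hypothetical helper names and an assumed `language` field for the separately stored value.

```python
def prepend_language(text, language):
    """Prefix the field content with an upper-case language code and a pipe."""
    return f"{language.upper()}|{text}"


def add_language_prefix(doc, language, fields=("subject", "body")):
    """Prefix every content field and record the language separately."""
    prefixed = dict(doc)
    for field in fields:
        if field in prefixed:
            prefixed[field] = prepend_language(prefixed[field], language)
    prefixed["language"] = language  # assumed field name for faceting/filtering
    return prefixed


doc = add_language_prefix({"id": "1", "subject": "How do you plug Rosette into Solr?"}, "eng")
```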
Second, you need to write a custom tokenizer that looks at the prepended language code, creates an appropriate language-specific RLPTokenizer (our Solr Tokenizer subclass), and returns the appropriate tokens to Solr for indexing.
The most complex piece still remains: at query time, you probably want to process the query in more than one language, at least when you don’t know the specific language of the query. In this case, your queries must have one or more prepended language codes, and your custom tokenizer must get more complex. It needs to look at one or more prepended language codes, create one language-specific RLPTokenizer per language, and return the token streams from each tokenizer, with the correct offsets, to Solr for indexing or querying.
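The dispatch logic can be sketched outside of Solr. In the Python below, two trivial tokenizers stand in for language-specific RLPTokenizer instances, and a comma-separated prefix such as "ENG,ZHS|" is an assumed convention for multiple codes. Offsets are reported relative to the text after the prefix; whether they should instead count the prefix depends on how the field is stored, which this sketch does not settle.

```python
import re


def whitespace_tokens(text):
    """Stand-in for an English tokenizer: split on whitespace, keep offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def character_tokens(text):
    """Stand-in for a Chinese tokenizer: one token per non-space character."""
    return [(ch, i, i + 1) for i, ch in enumerate(text) if not ch.isspace()]


TOKENIZERS = {"ENG": whitespace_tokens, "ZHS": character_tokens}


def tokenize_prefixed(value):
    """Read the language prefix, run one tokenizer per code, merge the streams."""
    prefix, separator, text = value.partition("|")
    if not separator:
        raise ValueError("missing language prefix")
    tokens = []
    for code in prefix.split(","):
        tokenizer = TOKENIZERS[code]  # KeyError for an unsupported language code
        tokens.extend((code, term, start, end) for term, start, end in tokenizer(text))
    return tokens


index_tokens = tokenize_prefixed("ENG|plug Rosette")
query_tokens = tokenize_prefixed("ENG,ZHS|搜索")
```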
In addition to the programming described above, there are two more challenges you will need to face. One is that because all the languages are mixed together in a single field, TF/IDF could be mixed up by (a) words shared across languages, and (b) imbalanced language representation. That is, words belonging to an underrepresented language in your corpus will count as “rare” just because the language is rare.
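A small calculation shows the imbalance effect. The corpus sizes below are invented for illustration, and the formula is a textbook inverse document frequency, log(N/df), rather than the exact formula of any particular Solr similarity.

```python
import math


def idf(total_docs, docs_with_term):
    """Textbook inverse document frequency."""
    return math.log(total_docs / docs_with_term)


# Hypothetical corpus: 9,900 English and 100 German documents in one field.
# Each word below appears in 10% of the documents of its own language.
english_df = 990
german_df = 10

# Measured within its own language, each word is equally common.
idf_eng_alone = idf(9_900, english_df)
idf_deu_alone = idf(100, german_df)

# Measured over the mixed field, the German word looks far rarer.
idf_eng_mixed = idf(10_000, english_df)
idf_deu_mixed = idf(10_000, german_df)
```

In the mixed field the German word's IDF is roughly three times the English word's (about 6.9 versus 2.3), even though both are equally common within their own languages.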
The other challenge is that querying a single field in multiple languages is quite complex and can be problematic. For instance, if a lemma (dictionary form of a word) exists in multiple languages (e.g. “Kind” means “child” in German and has several meanings in English), even if you know the language intended by the user, you cannot restrict the search to documents of that language without adding a filter query.
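Adding that filter is a matter of appending Solr's `fq` parameter to the request. The sketch below assumes the detected language was stored in a separate field named `language` with lower-case codes such as `deu`; both are assumptions, as is the helper name.

```python
def with_language_filter(params, language, language_field="language"):
    """Return a copy of the query parameters with a language filter query added."""
    filtered = dict(params)
    existing = filtered.get("fq", [])
    if isinstance(existing, str):
        existing = [existing]
    # Solr accepts several fq parameters; a list keeps any that are already set.
    filtered["fq"] = list(existing) + [f"{language_field}:{language}"]
    return filtered


german_only = with_language_filter({"q": "DEU|Kind", "df": "body"}, "deu")
```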
On the whole, this is a great solution if your query demands are very high and you are willing to deal with the complexity.
I hope this overview of the considerations and options for integrating Rosette with Solr is helpful. Now that you have the background, you may find it interesting and useful to look at this set of slides on this topic, created and presented by our Director of Product Management, Steve Kearns, at Apache Eurocon 2011.