MongoDB Text Search Tutorial - codecentric AG Blog

In my introduction to text search in MongoDB, we had a look at the basic features. Today we’ll have a closer look at the details.

API

You may have noticed that a text search is not executed with a find() command. Instead you call

db.foo.runCommand( "text", {search: "bar"} )

Remember that it is still an experimental feature. Adding it to the implementation of the find() command would have mixed critical production code with the new text search feature. When executed via a runCommand() call, text search can be run and tested in isolation.
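
If you run many text searches, a small wrapper keeps the calls readable. The following is only a minimal sketch: textSearch is a hypothetical helper name, not part of MongoDB, and the stub collection exists solely so the snippet runs outside the mongo shell. In the shell you would pass a real collection such as db.foo instead.

```javascript
// Hypothetical helper (not part of MongoDB): wraps the experimental
// "text" command so callers only pass the search string and options.
function textSearch(collection, search, options) {
  var args = { search: search };
  // Copy any optional settings next to the mandatory search string.
  for (var key in options || {}) {
    args[key] = options[key];
  }
  return collection.runCommand("text", args);
}

// Stand-in for a collection so the sketch runs without a server.
// It simply echoes what would have been sent.
var stubCollection = {
  runCommand: function (name, args) {
    return { command: name, args: args };
  }
};

var sent = textSearch(stubCollection, "bar");
console.log(JSON.stringify(sent));
```

Keeping the command name in one place also means only the helper has to change if text search later moves into find().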

I expect to see a new query operator like $text or $textsearch as soon as text search is integrated with the standard find() command.

Text Query Syntax

In the previous examples we just searched for a single word. We can do more than that. Let’s have a look at the following example:

db.foo.drop()
db.foo.ensureIndex( {txt: "text"} )
db.foo.insert( {txt: "Robots are superior to humans"} )
db.foo.insert( {txt: "Humans are weak"} )
db.foo.insert( {txt: "I, Robot - by Isaac Asimov"} )

A search for “robot” will find two documents; the same is true for “human”:

> db.foo.runCommand("text", {search: "robot"}).results.length
2
> db.foo.runCommand("text", {search: "human"}).results.length
2

When searching for multiple terms, an OR search is performed, yielding three documents in our example:

> db.foo.runCommand("text", {search: "human robot"}).results.length
3

I would have expected that the given search words are AND-ed not OR-ed.
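
The OR behaviour can be modelled in a few lines of plain JavaScript. This is a toy sketch, not MongoDB's implementation: the stem function just strips a trailing "s" as a stand-in for real stemming, and the names tokenize and orSearch are made up for this illustration.

```javascript
// Toy model of OR semantics; not MongoDB's actual implementation.
var docs = [
  "Robots are superior to humans",
  "Humans are weak",
  "I, Robot - by Isaac Asimov"
];

// Assumption: stripping a trailing "s" stands in for real stemming.
function stem(word) {
  return word.replace(/s$/, "");
}

// Lower-case the text, split on non-letters, and stem each word.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z]+/).filter(Boolean).map(stem);
}

// A document matches if it contains at least one of the search terms.
function orSearch(search) {
  var terms = tokenize(search);
  return docs.filter(function (doc) {
    var words = tokenize(doc);
    return terms.some(function (term) {
      return words.indexOf(term) !== -1;
    });
  });
}

console.log(orSearch("robot").length);       // 2
console.log(orSearch("human").length);       // 2
console.log(orSearch("human robot").length); // 3
```

Replacing some with every in orSearch would give the AND behaviour instead.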

Negation

By adding a leading minus sign to a search word, you can exclude documents containing that word. Let’s say we want all documents on “robot” but none on “humans”:

> db.foo.runCommand("text", {search: "robot -humans"})
{
        "queryDebugString" : "robot||human||||",
        "language" : "english",
        "results" : [
                {
                        "score" : 0.6666666666666666,
                        "obj" : {
                                "_id" : ObjectId("50ebc484214a1e88aaa4ada0"),
                                "txt" : "I, Robot - by Isaac Asimov"
                        }
                }
        ],
        "stats" : {
                "nscanned" : 2,
                "nscannedObjects" : 0,
                "n" : 1,
                "timeMicros" : 212
        },
        "ok" : 1
}
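
The effect of the minus sign can be sketched in plain JavaScript as follows. Again, this is a toy model under stated assumptions, not how MongoDB works internally: stemming is reduced to stripping a trailing "s", and parse and search are hypothetical names.

```javascript
// Toy model of negation; not MongoDB's actual implementation.
var docs = [
  "Robots are superior to humans",
  "Humans are weak",
  "I, Robot - by Isaac Asimov"
];

// Assumption: lower-casing and stripping a trailing "s" replaces stemming.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z]+/).filter(Boolean)
    .map(function (word) { return word.replace(/s$/, ""); });
}

// Split the query into wanted terms and excluded (leading "-") terms.
function parse(query) {
  var wanted = [], excluded = [];
  query.split(/\s+/).filter(Boolean).forEach(function (word) {
    if (word.charAt(0) === "-") {
      excluded = excluded.concat(tokenize(word.slice(1)));
    } else {
      wanted = wanted.concat(tokenize(word));
    }
  });
  return { wanted: wanted, excluded: excluded };
}

// Keep documents with at least one wanted term and no excluded term.
function search(query) {
  var q = parse(query);
  return docs.filter(function (doc) {
    var words = tokenize(doc);
    var hit = q.wanted.some(function (t) { return words.indexOf(t) !== -1; });
    var banned = q.excluded.some(function (t) { return words.indexOf(t) !== -1; });
    return hit && !banned;
  });
}

var hits = search("robot -humans");
console.log(hits); // only the Asimov title remains
```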

Phrase Search

By enclosing multiple words inside quotes (“foo bar”) you perform a phrase search. Inside a phrase, order is important and stop words are also taken into account:

> db.foo.runCommand("text", {search: '"robots are"'})
{
        "queryDebugString" : "robot||||robots are||",
        "language" : "english",
        "results" : [
                {
                        "score" : 0.6666666666666666,
                        "obj" : {
                                "_id" : ObjectId("50ebc482214a1e88aaa4ad9e"),
                                "txt" : "Robots are superior to humans"
                        }
                }
        ],
        "stats" : {
                "nscanned" : 2,
                "nscannedObjects" : 0,
                "n" : 1,
                "timeMicros" : 185
        },
        "ok" : 1
}

Please have a look at the “queryDebugString” field:

 "queryDebugString" : "robot||||robots are||"

It tells us that our search string contains one stem “robot” but also the phrase “robots are”. That’s the reason we have only one hit. Compare that to these searches:

 
> // order matters inside phrase
> db.foo.runCommand("text", {search: '"are robots"'}).results.length
0
> // no phrase search --> OR query
> db.foo.runCommand("text", {search: 'are robots'}).results.length
2
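
Why word order matters inside a phrase can be illustrated with a toy matcher. The sketch below assumes that a phrase matches when its words occur consecutively and in the same order in the document text; it ignores stemming and stop words entirely, and normalize and phraseSearch are hypothetical names.

```javascript
// Toy model of phrase matching; not MongoDB's actual implementation.
var docs = [
  "Robots are superior to humans",
  "Humans are weak",
  "I, Robot - by Isaac Asimov"
];

// Lower-case the text and reduce it to words separated by single spaces.
function normalize(text) {
  return text.toLowerCase().split(/[^a-z]+/).filter(Boolean).join(" ");
}

// Assumption: a phrase matches when its words appear consecutively and
// in the same order. Padding with spaces avoids partial-word matches.
function phraseSearch(phrase) {
  var needle = " " + normalize(phrase) + " ";
  return docs.filter(function (doc) {
    return (" " + normalize(doc) + " ").indexOf(needle) !== -1;
  });
}

console.log(phraseSearch("robots are").length); // 1
console.log(phraseSearch("are robots").length); // 0
```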

Multi Language Support

Stemming and stop word filtering are both language dependent, so you have to tell MongoDB which language to use for indexing and searching if you want something other than the default, which is English. MongoDB uses the open source Snowball stemmer, which supports a number of languages.

In order to use another language for indexing and searching, you do this when creating the index:

db.de.ensureIndex( {txt: "text"}, {default_language: "german"} )

With this setting, MongoDB assumes that all text in the field “txt” and all text searches on that collection are in German. Let’s see if it works:

> db.de.insert( {txt: "Ich bin Dein Vater, Luke." } )
> db.de.validate().keysPerIndex["text.de.$txt_text"]
2

As you can see, there are only two index keys, so stop word filtering did occur, this time with a German stop word list. (Vater is the German word for father, not a typo of Vader.) Let’s try some searches:

> db.de.runCommand("text", {search: "ich"}).results.length
0
> db.de.runCommand("text", {search: "Vater"}).results.length
1
> db.de.runCommand("text", {search: "Luke"}).results.length
1

Please note that we don’t have to give the language we are searching for because it is derived from the index. We have hits for the meaningful words “Vater” and “Luke”, but not for the stop word “ich” (which means “I”).
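
To see why only two keys end up in the index, here is a toy model of stop word filtering at index time. The stop word list is a tiny made-up sample (the real German list is much longer), stemming is left out, and indexKeys is a hypothetical name.

```javascript
// Toy model of stop word filtering; not MongoDB's actual implementation.
// Assumption: a tiny hand-written sample of German stop words.
var stopWords = {
  german: ["ich", "bin", "dein"]
};

// Return the words that would become index keys for the given language.
function indexKeys(text, language) {
  var stops = stopWords[language];
  return text.toLowerCase().split(/[^a-zäöüß]+/).filter(function (word) {
    return word !== "" && stops.indexOf(word) === -1;
  });
}

var keys = indexKeys("Ich bin Dein Vater, Luke.", "german");
console.log(keys); // [ 'vater', 'luke' ]
```

The three stop words are dropped, which leaves the two keys reported by validate().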

It is also possible to mix multiple languages in the same index. Each single document can have its own language:

db.de.insert( {language:"english", txt: "Ich bin ein Berliner" } )

If a field “language” is present, its content defines the language for stemming and stop word filtering for the indexed field(s) of that document. The word “ich” is not a stop word in English, so it is indexed now.

// default language: german -> no hits
> db.de.runCommand("text", {search: "ich"})
{
        "queryDebugString" : "||||||",
        "language" : "german",
        "results" : [ ],
        "stats" : {
                "nscanned" : 0,
                "nscannedObjects" : 0,
                "n" : 0,
                "timeMicros" : 96
        },
        "ok" : 1
}
 
// search for English -> one hit
> db.de.runCommand("text", {search: "ich", language: "english"})
{
        "queryDebugString" : "ich||||||",
        "language" : "english",
        "results" : [
                {
                        "score" : 0.625,
                        "obj" : {
                                "_id" : ObjectId("50ed163b1e27d5e73741fafb"),
                                "language" : "english",
                                "txt" : "Ich bin ein Berliner"
                        }
                }
        ],
        "stats" : {
                "nscanned" : 1,
                "nscannedObjects" : 0,
                "n" : 1,
                "timeMicros" : 161
        },
        "ok" : 1
}

What happened here? The default language for searching is German. So the first search has no result (as before). In the second search we say to search for English text (to be more precise: for index keys that were generated with an English stemmer and stop words). That’s why we find the famous sentence from JFK.
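
The idea of index keys that belong to a language can be modelled like this. It is a sketch under simplifying assumptions, not MongoDB's data structure: the stop word lists are tiny made-up samples, stemming is omitted, and insert, search and words are hypothetical names.

```javascript
// Toy model of a mixed-language index; not MongoDB's implementation.
// Assumption: tiny hand-written stop word samples per language.
var stopWords = {
  german: ["ich", "bin", "dein", "ein"],
  english: ["i", "am", "a"]
};
var defaultLanguage = "german";
var index = []; // entries of the form { language, word, doc }

// Split text into lower-case words and drop the language's stop words.
function words(text, language) {
  return text.toLowerCase().split(/[^a-zäöüß]+/).filter(function (word) {
    return word !== "" && stopWords[language].indexOf(word) === -1;
  });
}

// Index a document using its own "language" field if present.
function insert(doc) {
  var language = doc.language || defaultLanguage;
  words(doc.txt, language).forEach(function (word) {
    index.push({ language: language, word: word, doc: doc });
  });
}

// Search only the keys that were generated for the requested language.
function search(term, language) {
  language = language || defaultLanguage;
  var wanted = words(term, language);
  return index.filter(function (entry) {
    return entry.language === language && wanted.indexOf(entry.word) !== -1;
  }).map(function (entry) { return entry.doc; });
}

insert({ txt: "Ich bin Dein Vater, Luke." });
insert({ language: "english", txt: "Ich bin ein Berliner" });

console.log(search("ich").length);            // 0: stop word in German
console.log(search("ich", "english").length); // 1: indexed as English
```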

What does that mean? Well, you have a real multi-language text search at hand. You can store text messages from around the world in one collection and still search them by language.

Multiple Fields

A text index can span more than one field. If you index more than one field, each field can have its own weight. That lets you treat the indexed text parts of your document as having different importance.

> db.mail.ensureIndex( {subject: "text", body: "text"}, {weights: {subject: 10} } )
> db.mail.getIndices()
[
        ...
        {
                "v" : 0,
                "key" : {
                        "_fts" : "text",
                        "_ftsx" : 1
                },
                "ns" : "de.mail",
                "name" : "subject_text_body_text",
                "weights" : {
                        "body" : 1,
                        "subject" : 10
                },
                "default_language" : "english",
                "language_override" : "language"
        }
]

We created a text index spanning the fields “subject” and “body”, where the first got a weight of 10 and the latter the standard weight 1. Let’s see what impact these weights have:

> db.mail.insert( {subject: "Robot leader to minions", body: "Humans suck", prio: 0 } )
> db.mail.insert( {subject: "Human leader to minions", body: "Robots suck", prio: 1 } )
> db.mail.runCommand("text", {search: "robot"})
{
        "queryDebugString" : "robot||||||",
        "language" : "english",
        "results" : [
                {
                        "score" : 6.666666666666666,
                        "obj" : {
                                "_id" : ObjectId("50ed1be71e27d5e73741fafe"),
                                "subject" : "Robot leader to minions",
                                "body" : "Humans suck",
                                "prio" : 0 
                        }
                },
                {
                        "score" : 0.75,
                        "obj" : {
                                "_id" : ObjectId("50ed1bfd1e27d5e73741faff"),
                                "subject" : "Human leader to minions",
                                "body" : "Robots suck",
                                "prio" : 1
                        }
                }
        ],
        "stats" : {
                "nscanned" : 2,
                "nscannedObjects" : 0,
                "n" : 2,
                "timeMicros" : 148
        },
        "ok" : 1
}

The document with “robot” in the “subject” field has a much higher score because the weight of 10 is applied as a multiplier.
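
The scores shown in this article are consistent with a simple rule: for a term that occurs once in a field, the score is 0.5 * (1 + 1/n) times the field weight, where n is the number of words left in the field after stop word removal. That rule is inferred from the numbers above, so treat the following as a sketch rather than a description of MongoDB's scoring; fieldScore is a hypothetical name.

```javascript
// Assumption: formula inferred from the scores shown in this article,
// valid for a search term that occurs exactly once in the field.
function fieldScore(weight, tokenCount) {
  return weight * 0.5 * (1 + 1 / tokenCount);
}

// "Robot leader to minions": 3 words once "to" is dropped, weight 10.
var subjectHit = fieldScore(10, 3);

// "Robots suck": 2 words, default weight 1.
var bodyHit = fieldScore(1, 2);

console.log(subjectHit, bodyHit);
```

With a weight of 1 and three remaining words, the same rule gives about 0.667, which is in line with the scores in the earlier single-field examples.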

Filtering and Projection

You can apply additional search criteria via filtering:

> db.mail.runCommand("text", {search: "robot", filter: {prio:0} } )
{
        "queryDebugString" : "robot||||||",
        "language" : "english",
        "results" : [
                {
                        "score" : 6.666666666666666,
                        "obj" : {
                                "_id" : ObjectId("50ed22621e27d5e73741fb04"),
                                "subject" : "Robot leader to minions",
                                "body" : "Humans suck",
                                "prio" : 0
                        }
                }
        ],
        "stats" : {
                "nscanned" : 2,
                "nscannedObjects" : 2,
                "n" : 1,
                "timeMicros" : 185
        },
        "ok" : 1
}

Please note that filtering does not use an index.

If you are interested only in a subset of fields, you can use projection (similar to the aggregation framework):

> db.mail.runCommand("text", {search: "robot", project: {_id:0, prio:0} } )
{
        "queryDebugString" : "robot||||||",
        "language" : "english",
        "results" : [
                {
                        "score" : 6.666666666666666,
                        "obj" : {
                                "subject" : "Robot leader to minions",
                                "body" : "Humans suck"
                        }
                },
                {
                        "score" : 0.75,
                        "obj" : {
                                "subject" : "Human leader to minions",
                                "body" : "Robots suck"
                        }
                }
        ],
        "stats" : {
                "nscanned" : 2,
                "nscannedObjects" : 0,
                "n" : 2,
                "timeMicros" : 127
        },
        "ok" : 1
}

Filtering and projection can be combined, of course.
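
A combined call simply carries both options in the same document. The sketch below only builds that options document, using nothing but the option names shown in this article; buildTextOptions is a hypothetical helper, and the snippet runs as plain JavaScript without a server.

```javascript
// Hypothetical helper: builds the options for a text command that
// combines a filter with a projection.
function buildTextOptions(search, filter, project) {
  var options = { search: search };
  if (filter) { options.filter = filter; }
  if (project) { options.project = project; }
  return options;
}

var options = buildTextOptions("robot", { prio: 0 }, { _id: 0, prio: 0 });
console.log(JSON.stringify(options));

// In the mongo shell the document would be passed as:
//   db.mail.runCommand("text", options)
```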

Examples

All examples can be found on GitHub. Try them yourself.

Summary

With this second part on MongoDB text search, we had a look at the more interesting features of the text search capability. For a start, that’s quite a good toolbox for implementing your own search engines. I’m looking forward to your feedback.