Proper indexing strategy is important in any database, and MongoDB is no different. However, tracking which indexes exist from environment to environment can be challenging. Automatically building indexes in MongoDB from your C# code can ensure that indexes tested in QA also get built in production, lessening the burden on your DevOps team when moving through environments. There are some potential risks to automatically building indexes that I'll talk about later.
A Brief Introduction To Indexes
Indexes are special data structures that improve the speed of database read operations. Proper indexing can mean the difference between fast, stable applications and complete outages. They work mainly on the premise that data is read considerably more often than it is written: each write to the database may require the insertion or update of a key in an index (or in several). This small cost is paid on writes so that reads can use these specialized data structures to find data much faster.
Without an index, each time you want to find documents where `foo: "bar"`, the database engine has to do a considerable amount of work to find the results that match your query. The process takes subtly different forms in each type of database, but it is essentially:
- Read document off disk
- Compare the document with the query predicate
- If the document matches the predicate, add it to the result set
- If the result set has reached the batch size, return it and wait for the client.
- If there are more documents in the collection, go to step 1.
- Return the result set
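The steps above can be sketched as a toy loop in Python. An in-memory list stands in for the on-disk collection here; this is an illustration of the scan shape, not MongoDB's actual engine:

```python
def collection_scan(collection, predicate, batch_size=101):
    """Simulate a full collection scan: examine every document,
    yielding matching documents in batches of batch_size."""
    batch = []
    for document in collection:       # step 1: read each document
        if predicate(document):       # step 2: compare with the predicate
            batch.append(document)    # step 3: matches go in the result set
        if len(batch) == batch_size:  # step 4: batch full, hand it to the client
            yield batch
            batch = []
    if batch:                         # step 6: return whatever remains
        yield batch

# Ten documents, half of which match foo: "bar".
docs = [{"foo": "bar" if i % 2 == 0 else "baz", "n": i} for i in range(10)]
batches = list(collection_scan(docs, lambda d: d["foo"] == "bar", batch_size=3))
```

Note that every document is examined even though only half match; with an index, the non-matching half would never be read at all.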
While this is an extremely simplified view, this is, at a high level, how queries get executed. The process "walks" the entire collection (or table) from "beginning" to "end", sometimes pausing mid-scan when the result set hits a size limit, returning that batch and waiting for the client to request more. This can have some very negative effects: evicting the "working set" from RAM, consuming large amounts of CPU and disk I/O, and essentially starving other operations of resources. The resulting contention within the database can cause it to crash or simply stop responding to requests.
So we know indexes are good, but if they're so good, why don't we just index everything? That comes down to three main reasons:
- Indexes must be updated on each write, which takes time and resources
- Indexes take up space on disk; like the data itself, they must be persisted between restarts of the server
- Indexes take up space in memory, which is finite and even scarcer than disk space
So we don't want to just "index everything": we could end up dragging our database down with excessive work on every write, beyond just persisting the document itself.
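The write-side cost can be sketched with a toy model in Python. A sorted in-memory list stands in for a real on-disk index structure, and the field names are purely illustrative:

```python
import bisect

class SimpleIndex:
    """A sorted list of (key, doc_id) pairs, a toy stand-in for a B-tree index."""
    def __init__(self, field):
        self.field = field
        self.entries = []

    def insert(self, doc_id, doc):
        # Keeping the entries sorted is what makes later lookups fast,
        # and is exactly the extra work each write must pay for.
        bisect.insort(self.entries, (doc[self.field], doc_id))

# Four indexes on one collection means four extra updates per write.
indexes = [SimpleIndex(f) for f in ("a", "b", "c", "d")]

def write_document(doc_id, doc):
    work = 1                   # persist the document itself...
    for ix in indexes:         # ...then update every index on the collection
        ix.insert(doc_id, doc)
        work += 1
    return work

cost = write_document(1, {"a": 1, "b": 2, "c": 3, "d": 4})
```

Each additional index adds another unit of work to every single write, which is why "index everything" scales badly.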
Planning Indexes In MongoDB
Indexing in MongoDB is very similar to indexing in any other database, but there are some differences. Users coming from SQL may be used to the query planner leveraging multiple indexes in the same query; MongoDB calls this index intersection. In MongoDB, multiple indexes are generally only leveraged for predicates containing an `$or`, which allows the query engine to run two query operations in parallel. This usually means a compound index across multiple fields is more efficient, especially if the result set needs to be sorted.
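The difference can be sketched with a toy model in Python, with plain dicts standing in for real indexes (this is a conceptual illustration, not MongoDB's actual machinery):

```python
from collections import defaultdict

# Two single-field "indexes": value -> set of document ids.
team_ix = defaultdict(set)
week_ix = defaultdict(set)
# One compound "index" keyed on the (team, week) pair.
team_week_ix = defaultdict(set)

docs = {1: {"team": "A", "week": 1},
        2: {"team": "B", "week": 1},
        3: {"team": "A", "week": 2}}
for _id, d in docs.items():
    team_ix[d["team"]].add(_id)
    week_ix[d["week"]].add(_id)
    team_week_ix[(d["team"], d["week"])].add(_id)

# $or: one lookup per clause (conceptually in parallel), then a union.
or_result = team_ix["A"] | week_ix[1]

# AND with only single-field indexes: two lookups plus an intersection step.
intersection_result = team_ix["A"] & week_ix[1]

# AND with a compound index: a single lookup, no intersection needed.
compound_result = team_week_ix[("A", 1)]
```

All three AND strategies return the same documents, but the compound index gets there in one step, which is why it is usually the better choice.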
In general, it is good practice to take some time to plan out your indexes before creating them. If you follow something like the Repository Pattern, you should have fairly tight control over the queries and updates being run against each collection. Keep in mind that even updates in MongoDB need indexes to locate the documents they modify efficiently. Since we don't want to create any more indexes than we need, a balance must be struck between the cost of maintaining an index and the cost of a read with only partial index coverage.
Let's look at an example for a moment. Imagine we are creating a tool to analyze sporting events and athletes. We need several types of objects to represent our data.
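The object definitions this paragraph refers to are not shown in this copy of the post. A minimal sketch of what they might look like (in Python dataclasses rather than the post's C#; every name and field here is an assumption):

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class League:
    id: str
    name: str

@dataclass
class Team:
    id: str
    name: str
    league_id: str

@dataclass
class Player:
    id: str
    name: str
    team_id: str

@dataclass
class Game:
    id: str
    league_id: str
    week: int
    start_time: datetime
    team_ids: list = field(default_factory=list)  # the two teams playing
```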
In this situation, we have several complex relationships:
- Players can “belong” to teams and games.
- Games “belong” to weeks and leagues.
- Teams belong to leagues and games.
Based on this, we may want to query games in several ways
- By a specific id (the `_id` field)
- By time range
- By teams
- By week
- By league
- By teams and time range
- By teams and week
With this in mind, if we created an index for each query form, we could end up with 6 indexes (the `_id` index is a given in MongoDB, covering lookups by id). But in reality we may not need that many, because we can carefully choose our indexed fields and their ordering within compound indexes. For example, let's look at the following 2 index forms:
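The index definitions referenced here do not appear in this copy of the post. A plausible reconstruction, written as pymongo-style key patterns (the field names `teams`, `week`, and `eventTime` are assumptions):

```python
# pymongo-style compound index key patterns: lists of (field, direction) pairs.
teams_time_index = [("teams", 1), ("eventTime", 1)]
week_teams_index = [("week", 1), ("teams", 1)]

# Against a live collection these would be created with, e.g.:
#   games.create_index(teams_time_index)
#   games.create_index(week_teams_index)
```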
These can cover the following query forms:
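The list of covered query forms is also missing from this copy. The coverage follows from MongoDB's prefix rule: a compound index can serve any query whose filtered fields form a prefix of its key pattern. A sketch using assumed field names:

```python
def covers(index_fields, query_fields):
    """True if the queried fields are exactly a prefix of the index's keys,
    which is (roughly) when a compound index can serve the query."""
    q = set(query_fields)
    return set(index_fields[:len(q)]) == q

teams_time = ["teams", "eventTime"]   # index form 1
week_teams = ["week", "teams"]        # index form 2

assert covers(teams_time, ["teams"])               # by teams
assert covers(teams_time, ["teams", "eventTime"])  # by teams and time range
assert covers(week_teams, ["week"])                # by week
assert covers(week_teams, ["teams", "week"])       # by teams and week
assert not covers(teams_time, ["eventTime"])       # time range alone: no prefix
```

Under these assumptions, league-only and time-range-only queries would still need their own single-field indexes, which is how six candidate indexes shrink to four.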
From this, we’ve already reduced the number of indexes by 2, from 6 to 4. Depending on the amount of writing done to this collection, the savings can be significant.
It is worth noting that field order matters in compound indexes: a query can only make full use of one when the fields it filters on form a prefix of the index's key pattern. The order of fields within the query document itself, however, does not affect which index is chosen.
In my upcoming posts, I’m going to talk more about types of indexes, creating indexes in MongoDB, how to create them from code, and when you should or should not create them in code.
As always, I hope this helps and happy programming.