Handling Schema Changes in C# and MongoDB
MongoDB is well known for not having a “schema”, but this is a bit of a misnomer. It is more accurate to say that MongoDB does not enforce a schema. The literal definition of schema is:
a representation of a plan or theory in the form of an outline or model.
MongoDB does not require you to know what the data will look like before inserting it into the database. So in this sense, MongoDB does not, in fact, have a schema. However, the programs that interact with MongoDB do have a schema. We insert data into the database in a particular way, and we expect it back in that way. Perhaps our schema is extremely flexible and we have many different forms of documents, but at the end of the day, there is a schema as defined by the software we write.
One of the biggest mistakes somebody can make when dealing with a database that doesn’t enforce a schema is to put unrelated items in one collection. While the database allows this, it can make managing the data quite hard. This is a very good case of “just because you can, doesn’t mean you should.” Let’s imagine for a minute that we are implementing a blog on top of MongoDB (an example used in the MongoDB docs). There is nothing stopping us from putting the blog posts, the comments, and the users all in one monolithic collection. It just makes our code very ugly. We need a way to delineate posts from comments from users, which might merit the addition of a type field, or perhaps we filter client side based on the existence of some fields.
If we were to model this in TypeScript, it might look something like this:
Some common lookups would look something like
As you can see, there is no advantage to doing this over
The latter is considerably more self-explanatory than the former, which makes the code more readable. It also makes it much easier to do in C#. If we translate the above two examples to C#, we can see why quite easily.
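A minimal sketch of what the C# classes (POCOs) for the blog might look like, assuming the official MongoDB.Driver package. The class and property names are illustrative and not taken from the original post. The Type property exists only to support the single-collection layout; with one collection per class it can be dropped.

```csharp
using MongoDB.Bson;
using MongoDB.Bson.Serialization.Attributes;

// Hypothetical blog model; every name here is an assumption made for illustration.
public class Post
{
    [BsonId]
    public ObjectId Id { get; set; }

    // Only needed when posts, comments, and users share one collection.
    [BsonElement("type")]
    public string Type { get; set; } = "post";

    public string Title { get; set; }
    public string Body { get; set; }
    public ObjectId AuthorId { get; set; }
}

public class Comment
{
    [BsonId]
    public ObjectId Id { get; set; }

    [BsonElement("type")]
    public string Type { get; set; } = "comment";

    public ObjectId PostId { get; set; }
    public ObjectId AuthorId { get; set; }
    public string Text { get; set; }
}

public class User
{
    [BsonId]
    public ObjectId Id { get; set; }

    [BsonElement("type")]
    public string Type { get; set; } = "user";

    public string Name { get; set; }
    public string Email { get; set; }
}
```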
Now to retrieve those documents:
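A sketch of retrieval from a single shared collection, assuming a collection named "blog" and compact stand-in POCOs (all hypothetical names). Every query has to repeat the type filter, and the caller has to pick the matching class by hand.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Bson.Serialization.Attributes;
using MongoDB.Driver;

// Compact stand-ins for the blog POCOs (hypothetical names).
public class Post
{
    public ObjectId Id { get; set; }
    [BsonElement("type")] public string Type { get; set; } = "post";
    public string Title { get; set; }
}

public class Comment
{
    public ObjectId Id { get; set; }
    [BsonElement("type")] public string Type { get; set; } = "comment";
    public string Text { get; set; }
}

public static class SingleCollectionQueries
{
    // "blog" is an assumed name for the one shared collection.
    public static Task<List<Post>> GetPostsAsync(IMongoDatabase db) =>
        db.GetCollection<Post>("blog")
          .Find(p => p.Type == "post")      // the type filter must be repeated everywhere
          .ToListAsync();

    public static Task<List<Comment>> GetCommentsAsync(IMongoDatabase db) =>
        db.GetCollection<Comment>("blog")
          .Find(c => c.Type == "comment")
          .ToListAsync();
}
```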
As you can see, there is no advantage to doing this over
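For comparison, a sketch of the same lookups with one collection per class. The collection names "posts" and "comments" are assumptions. No type field is needed, and the collection name already says which class comes back.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

// Compact stand-ins for the blog POCOs (hypothetical names); no Type field required.
public class Post
{
    public ObjectId Id { get; set; }
    public string Title { get; set; }
}

public class Comment
{
    public ObjectId Id { get; set; }
    public ObjectId PostId { get; set; }
    public string Text { get; set; }
}

public static class PerTypeCollectionQueries
{
    public static Task<List<Post>> GetPostsAsync(IMongoDatabase db) =>
        db.GetCollection<Post>("posts")
          .Find(Builders<Post>.Filter.Empty)
          .ToListAsync();

    public static Task<List<Comment>> GetCommentsForPostAsync(IMongoDatabase db, ObjectId postId) =>
        db.GetCollection<Comment>("comments")
          .Find(c => c.PostId == postId)
          .ToListAsync();
}
```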
The big danger with doing a single collection in C# with MongoDB is if you accidentally do the following:
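A sketch of that kind of mistake, again assuming a shared collection named "blog" and a hypothetical Post class. The code compiles without complaint because the filter is untyped; the mismatch only shows up at runtime, when the driver tries to map comment documents onto Post.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Bson.Serialization.Attributes;
using MongoDB.Driver;

// Hypothetical POCO; comment documents carry elements this class does not declare.
public class Post
{
    public ObjectId Id { get; set; }
    [BsonElement("type")] public string Type { get; set; } = "post";
    public string Title { get; set; }
}

public static class MismatchedQuery
{
    // The collection is typed as Post, but the filter selects comments.
    public static Task<List<Post>> GetCommentsTheWrongWayAsync(IMongoDatabase db) =>
        db.GetCollection<Post>("blog")
          .Find(new BsonDocument("type", "comment"))
          .ToListAsync();
}
```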
Here we ask the driver for Post objects back, but we queried for 'type': "comment". This will cause serialization exceptions.
So we can all agree: while MongoDB does not enforce a schema, there is a schema that must be followed for the sanity of all those who interact with the database. With this in mind, let’s look at some ways of modifying that schema without causing downtime.
Performing Migrations
While I am going to talk a lot about performing migrations in C#, the methodology I discuss is applicable to most, if not all, languages that interface with MongoDB.
The C# driver team has written some functionality into the C# driver for handling these situations (docs). I don’t generally use this method, mainly because I do not like the idea of documents sitting in an “inconsistent” state. I prefer the method of updating all the documents as quickly as possible. Now there are a few types of schema changes we need to address (as laid out in the driver docs):
- A new member is added
- A member is removed
- A member is renamed
- The type of a member is changed
- The representation of a member has changed
A New Member is Added
When a new member is added, there are a few options to work with:
- Use a default value in the POCO (probably best for value types)
- Handle null (or default) values in code (probably best for reference types)
- Update all documents in the database with the appropriate value
Default Value on the POCO
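A minimal sketch of a POCO with a defaulted new member. The class name and ExistingField are hypothetical; IsFooBar is the new member. This relies on the class having a parameterless constructor, so the property initializer runs before the driver fills in whatever elements the stored document actually contains.

```csharp
using MongoDB.Bson;

// Hypothetical document class; only IsFooBar is named in the text.
public class ExistingDocumentWithNewValueField
{
    public ObjectId Id { get; set; }
    public string ExistingField { get; set; }

    // New member. Documents written before this field existed keep the
    // initializer value (true) instead of falling back to default(bool).
    public bool IsFooBar { get; set; } = true;
}
```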
Now the default for IsFooBar is not default(bool) and we can happily continue on our way. This works great for anything that is easily defaulted, like value types.
Handle Null Values in Code
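A sketch of the reference-type case. The shape of FooBar (a single Value property) and the Describe helper are assumptions for illustration; the point is only that older documents come back with a null member and the calling code has to cope with that.

```csharp
using MongoDB.Bson;

// Hypothetical reference type for the new member.
public class FooBar
{
    public string Value { get; set; }
}

public class ExistingDocumentWithNewReferenceField
{
    public ObjectId Id { get; set; }
    public string ExistingField { get; set; }

    // New member. Null for any document stored before the field existed.
    public FooBar FooBar { get; set; }
}

public static class FooBarReader
{
    // Null-tolerant access: callers never dereference a missing FooBar.
    public static string Describe(ExistingDocumentWithNewReferenceField doc) =>
        doc.FooBar?.Value ?? "(not set)";
}
```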
The default for FooBar is going to be null on deserialization. When we access this member, we must do a null check; depending on our code, null might be a valid state for FooBar. While it is possible to use the first approach in this case as well, classes are generally harder to construct in a default way, as they are usually driven by some other piece of data that does not already exist in the current document.
Update all Existing Documents with the Appropriate Value
In this example, I am simplifying the value of FooBar to be something simple; however, I expect that in the real world the process of constructing FooBar will be more complex. Our update process looks something like:
- Update the POCO to contain the field, but don’t yet use it in code
- Update the database to add the new default value
- Update the code to use the new field in the POCO
So re-using the classes from handling null values in code, our update would look something like:
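A sketch of a one-document-at-a-time backfill, assuming the FooBar class has a single Value property and that its value can be derived from ExistingField (both assumptions). The BuildFooBar rule and the class AddFooBarMigration are hypothetical.

```csharp
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

public class FooBar
{
    public string Value { get; set; }
}

public class ExistingDocumentWithNewReferenceField
{
    public ObjectId Id { get; set; }
    public string ExistingField { get; set; }
    public FooBar FooBar { get; set; }
}

public static class AddFooBarMigration
{
    // Hypothetical rule for deriving the new value from data already in the document.
    public static FooBar BuildFooBar(ExistingDocumentWithNewReferenceField doc) =>
        new FooBar { Value = doc.ExistingField ?? "unknown" };

    public static async Task<long> RunAsync(
        IMongoCollection<ExistingDocumentWithNewReferenceField> collection)
    {
        // { FooBar: null } matches documents where the field is missing or explicitly null.
        var missing = Builders<ExistingDocumentWithNewReferenceField>.Filter
            .Eq(x => x.FooBar, (FooBar)null);

        // Loads every match into memory; acceptable for a sketch or a small collection.
        var documents = await collection.Find(missing).ToListAsync();

        long updated = 0;
        foreach (var doc in documents)
        {
            var id = doc.Id;
            var result = await collection.UpdateOneAsync(
                x => x.Id == id,
                Builders<ExistingDocumentWithNewReferenceField>.Update
                    .Set(x => x.FooBar, BuildFooBar(doc)));
            updated += result.ModifiedCount;
        }
        return updated;
    }
}
```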
This isn’t the most efficient way to do the updates (using the bulk API would be faster), but it gets the job done. There is one small problem with this approach. We’ve added a new field to ExistingDocumentWithNewReferenceField, but this is live code, so unless we’ve decorated ExistingDocumentWithNewReferenceField with [BsonIgnoreExtraElements], we’ll run into errors in running code that doesn’t yet know about the new FooBar field. There are two ways around this:
- The previously mentioned [BsonIgnoreExtraElements]
- Deploy a new version of our app that knows about FooBar but doesn’t yet do anything with it
I prefer the second one for no particular reason, but the first is just as valid. Each requires forethought before the migration, so the choice is up to you.
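As a sketch of the first option, this is roughly what the already-deployed version of the class could look like, assuming it has an ExistingField member (hypothetical) and no knowledge of FooBar. With the attribute in place, the driver skips elements it cannot map instead of throwing during deserialization.

```csharp
using MongoDB.Bson;
using MongoDB.Bson.Serialization.Attributes;

// Older version of the class, still running in production and unaware of FooBar.
[BsonIgnoreExtraElements]
public class ExistingDocumentWithNewReferenceField
{
    public ObjectId Id { get; set; }
    public string ExistingField { get; set; }
}
```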
A Member is Removed
This case also requires a bit of forethought and a decision to:
- Leave the data in place but ignore it (see docs: Supporting Extra Elements, Ignoring Extra Elements)
- Remove the data from all documents
I prefer to remove the field from the database instead of just ignoring it as it can cause some confusion when looking at the data manually. To remove the data in an online fashion:
- Remove all references to the field in code, but leave the field in the POCO
- Update the database to remove the field from all documents
- Update the POCO to remove the field
The reason I opt for this method is that I can still use the strongly typed driver interfaces to remove the field(s) and don’t need to worry about fat fingering the update(s). This method makes use of the $unset operator in MongoDB to remove the field, not just set it to null. So a migration to remove our field would look like:
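A sketch of such a migration using the typed builders, assuming FooBar is still declared on the POCO as described in the steps above. The class RemoveFooBarMigration and the shape of FooBar are hypothetical.

```csharp
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

public class FooBar
{
    public string Value { get; set; }
}

// The member stays on the POCO until the migration has finished,
// so the field can be referenced with a typed expression.
public class ExistingDocumentWithNewReferenceField
{
    public ObjectId Id { get; set; }
    public string ExistingField { get; set; }
    public FooBar FooBar { get; set; }
}

public static class RemoveFooBarMigration
{
    public static async Task<long> RunAsync(
        IMongoCollection<ExistingDocumentWithNewReferenceField> collection)
    {
        // Only touch documents that still carry the field.
        var hasField = Builders<ExistingDocumentWithNewReferenceField>.Filter
            .Exists(x => x.FooBar);

        // Renders as { $unset: { FooBar: ... } }; the field is removed, not set to null.
        var unset = Builders<ExistingDocumentWithNewReferenceField>.Update
            .Unset(x => x.FooBar);

        var result = await collection.UpdateManyAsync(hasField, unset);
        return result.ModifiedCount;
    }
}
```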
Once this update completes, we can update our code to remove the FooBar field from the POCO.
A Member is Renamed
Think of this as a combination of adding and removing a member.
- Update the POCO to have the new member name
- Update code references to prefer the new member name but default to the old member name
- Update the database, moving the data from the old name to the new name
- Update the POCO to remove the old member name
This update can be very easy thanks to the $rename operator in MongoDB. Instead of having to query each document and update them one by one, we can issue a single update statement.
First, let’s look at our schema:
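A sketch of the POCO during the transition, assuming the field holds a string. The class name and the GetValue helper are hypothetical. Both names are declared so that documents deserialize whether or not they have been migrated yet, and reads prefer the new name.

```csharp
using MongoDB.Bson;

// Hypothetical document class while the rename is in flight.
public class DocumentWithRenamedField
{
    public ObjectId Id { get; set; }

    // Old name; removed from the POCO once every document has been migrated.
    public string FooBar { get; set; }

    // New name.
    public string BazBob { get; set; }

    // Prefer the new member, fall back to the old one.
    public string GetValue() => BazBob ?? FooBar;
}
```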
We are going to rename our FooBar field to BazBob, better describing the content.
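A sketch of the rename as a single update statement. The class names are hypothetical, and using nameof for the new name assumes the element name in the database matches the property name (no [BsonElement] override or naming convention in play).

```csharp
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

public class DocumentWithRenamedField
{
    public ObjectId Id { get; set; }
    public string FooBar { get; set; }
    public string BazBob { get; set; }
}

public static class RenameFooBarMigration
{
    public static async Task<long> RunAsync(
        IMongoCollection<DocumentWithRenamedField> collection)
    {
        // Field selector for the old name, plain string for the new one.
        var rename = Builders<DocumentWithRenamedField>.Update.Rename(
            x => x.FooBar,
            nameof(DocumentWithRenamedField.BazBob));

        // Only documents that still have the old field need to change.
        var hasOldField = Builders<DocumentWithRenamedField>.Filter
            .Exists(x => x.FooBar);

        var result = await collection.UpdateManyAsync(hasOldField, rename);
        return result.ModifiedCount;
    }
}
```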
Note that the Update.Rename method doesn’t take two field selectors; it takes a field selector and a string for the new name. To ensure we don’t fat finger this one either, we can make use of the nameof operator in C#. This update will apply to all documents in the collection, although it is not an atomic operation. The update on each document is atomic, but the whole update process is not. This is why our code must be tolerant of either field existing.
The Type of a Member is Changed
This process is probably the most complex of all, depending on what the current and desired types are. If the types are implicitly convertible in the .NET Runtime, you can most likely just leave the data in the database alone. If the types are not convertible, this gets a bit complex. I think the safest and easiest way to handle this is to make use of both the Rename steps and the New Member is Added steps.
- Update the POCO to have a new, temporary field with the old type
- Update the application code to handle reading from either the old or new type (perhaps with an in-memory conversion of old -> new type)
- Update the database to $rename the old field to the temporary field name
- Update the POCO to change to the new type on the existing field
- Update the database to convert from the old type (now in the temporary field) to the new type in the original field by reading the documents, converting the types in memory, and writing the converted value
- Update the database to $unset the temporary field
- Update the POCO to remove the temporary field
I have not personally had to perform this type of schema migration, so there may be craftier, more efficient ways to perform it that I am unaware of.
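As an untested sketch of the conversion step only (reading documents, converting in memory, and writing the converted value), assume the old type was string, the new type is a nullable int, and the temporary field is called FooBarOld. All of these names and types are hypothetical, and the earlier $rename is assumed to have already run.

```csharp
using System.Globalization;
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

public class DocumentWithRetypedField
{
    public ObjectId Id { get; set; }
    public string FooBarOld { get; set; } // temporary field holding the old string value
    public int? FooBar { get; set; }      // original name, now with the new type
}

public static class RetypeFooBarMigration
{
    // In-memory conversion from the old type to the new one; null when it cannot be parsed.
    public static int? ConvertOld(string oldValue) =>
        int.TryParse(oldValue, NumberStyles.Integer, CultureInfo.InvariantCulture, out var parsed)
            ? parsed
            : (int?)null;

    public static async Task<long> RunAsync(IMongoCollection<DocumentWithRetypedField> collection)
    {
        var filters = Builders<DocumentWithRetypedField>.Filter;

        // Documents that still hold an old value and have no converted value yet.
        var pending = filters.Ne(x => x.FooBarOld, (string)null)
                    & filters.Eq(x => x.FooBar, (int?)null);

        long updated = 0;
        foreach (var doc in await collection.Find(pending).ToListAsync())
        {
            var converted = ConvertOld(doc.FooBarOld);
            if (converted == null) continue; // leave unparseable values for manual review

            var id = doc.Id;
            var result = await collection.UpdateOneAsync(
                x => x.Id == id,
                Builders<DocumentWithRetypedField>.Update.Set(x => x.FooBar, converted));
            updated += result.ModifiedCount;
        }
        return updated;
    }
}
```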
The Representation of a Member is Changed
The MongoDB C# driver chooses the serializer to use for reading BSON based on the BSON type, not the member type in the POCO. That conversion happens later. This means you can typically alter the representation of a type without altering the POCO so long as the .NET Runtime has a way to perform a conversion. If the runtime lacks this conversion, you will need to follow the procedure for when the type of a member has changed.
How to Implement These Conversions in Your Application
Now that we understand the different methods needed to alter our schema and the underlying stored data, we need a way to apply them. I have chosen to use a schema version document and migration scripts that run at startup. This means deploying new migrations looks something like:
- Deploy a new app version to support any temporary or new fields
- Deploy a version with the migrations
- Deploy a new app version that removes any temporary or unneeded fields
Depending on the migration, steps 1 and 2 may be combined. For tracking the schema versions, I have a document in my database with the current schema version:
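A minimal sketch of what such a document class could look like. The class and property names are assumptions, and any fields the locking library needs on the same document are left out.

```csharp
using MongoDB.Bson.Serialization.Attributes;

// Hypothetical schema version document; one per environment.
public class SchemaVersion
{
    // The environment name doubles as the _id, so there is exactly one document per environment.
    [BsonId]
    public string Id { get; set; }

    // Assumed here to hold the version of the last migration that completed.
    public int Version { get; set; }
}
```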
My web application is generally what needs to be tolerant of migration changes mid-migration. My background workers block attempting to acquire a migration lock on startup. My migration runner follows the steps below:
- Find the migration document; if it doesn’t exist, create it using an _id of the environment name (MongoDB won’t let multiple documents have the same _id).
- Attempt to acquire a lock (using SharpLock, see Distributed Locks in C#)
- Find all migrations in the codebase by the interface IMigration
- Sort the migrations found by their version (lowest to highest)
- Skip any migrations with a version less than the version in the migration document
- Run the migrations in order
- Release the lock
This means the first worker process to acquire the lock will run all the migrations serially. Once it is done, it will release the lock. Any subsequent workers that acquire the lock will filter out all migrations that have already run and continue their startup.
The IMigration interface is quite simple:
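A sketch of one possible shape for the interface, together with a runner that follows the steps listed earlier. The member names, the "schemaVersions" collection name, and the acquireLock delegate are all assumptions; the delegate stands in for a distributed lock such as SharpLock, whose API is not shown here. The text says to skip versions less than the stored one; this sketch assumes the stored number is the last applied migration, so it skips anything at or below it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;
using System.Threading.Tasks;
using MongoDB.Driver;

// Hypothetical shape; the member names are assumptions.
public interface IMigration
{
    int Version { get; }
    Task RunAsync(IMongoDatabase database);
}

public class SchemaVersion
{
    public string Id { get; set; }   // environment name, stored as _id
    public int Version { get; set; } // assumed: last applied migration version
}

public static class MigrationRunner
{
    // Drop migrations that already ran, then order the rest lowest to highest.
    public static List<IMigration> Pending(IEnumerable<IMigration> all, int currentVersion) =>
        all.Where(m => m.Version > currentVersion).OrderBy(m => m.Version).ToList();

    // Finds concrete IMigration types; each needs a public parameterless constructor.
    public static IEnumerable<IMigration> Discover(Assembly assembly) =>
        assembly.GetTypes()
            .Where(t => typeof(IMigration).IsAssignableFrom(t) && t.IsClass && !t.IsAbstract)
            .Select(t => (IMigration)Activator.CreateInstance(t));

    // Disposing the object returned by acquireLock is assumed to release the lock.
    public static async Task RunAsync(
        IMongoDatabase db, string environment, Func<Task<IDisposable>> acquireLock)
    {
        var versions = db.GetCollection<SchemaVersion>("schemaVersions");

        // Create the version document on first run; the upsert is a no-op otherwise.
        await versions.UpdateOneAsync(
            v => v.Id == environment,
            Builders<SchemaVersion>.Update.SetOnInsert(v => v.Version, 0),
            new UpdateOptions { IsUpsert = true });

        using (await acquireLock())
        {
            // Read the version only after the lock is held, so finished work is skipped.
            var current = await versions.Find(v => v.Id == environment).SingleAsync();
            var found = Discover(typeof(MigrationRunner).Assembly);

            foreach (var migration in Pending(found, current.Version))
            {
                await migration.RunAsync(db);
                await versions.UpdateOneAsync(
                    v => v.Id == environment,
                    Builders<SchemaVersion>.Update.Set(v => v.Version, migration.Version));
            }
        }
    }
}
```

Recording the version after each migration, rather than once at the end, means a crash partway through leaves the document pointing at the last migration that actually completed.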
Conclusion
Migrations in MongoDB using the C# driver don’t have to require you to drop to the command line. They can be done with type safety and with no downtime if implemented correctly.
Happy coding!