Lessons I Learned From Managing Hundreds of Millions of Data in MongoDB

5 Jun 2024

In this post, I will share real-world experience I gained while working with hundreds of millions of documents in MongoDB.

Don't Store All Data in a Single Mongo Collection

This was the worst mistake we made, and it became the root cause of all our issues. We used to store 500 million complex, structured documents in a single collection, which caused the following side effects:

  • Index creation takes a very long time.
  • If the collection is dropped accidentally, all the data is gone.
  • Queries take longer than usual, even with indexes.
  • Counting with filters times out because of the large number of documents scanned.

More issues will likely surface as the data grows daily.
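
One mitigation is time-based partitioning: routing writes into per-month collections so that no single collection grows without bound, and an accidental drop loses only one slice of the data. Below is a minimal sketch; `partition_name`, the `events` base name, and the per-month scheme are illustrative assumptions, not a description of our production setup.

```python
from datetime import datetime

def partition_name(base: str, ts: datetime) -> str:
    """Route a document to a per-month collection, e.g. 'events_2024_06'."""
    return f"{base}_{ts.year}_{ts.month:02d}"

# Hypothetical usage with pymongo (names are illustrative):
#   db[partition_name("events", doc["registeredOn"])].insert_one(doc)
print(partition_name("events", datetime(2024, 6, 5)))  # events_2024_06
```

With a layout like this, dropping or re-indexing one month's collection never touches the rest of the data.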

Bad Document Structure

Invest some time in defining the document structure precisely. This is a critical step that rarely gets attention when starting a project. Our mistake was excluding a field entirely from some documents instead of assigning it a default value.

That ended up with slow query scans because of existence checks. MongoDB does support partial indexes, but they don't work well when queries filter on field existence.

The example below shows a null value assigned to phone instead of excluding the field from the document entirely.

  {
    "_id": "507f191e810c19729de860ea",
    "name": "John Doe",
    "age": 30,
    "email": "john.doe@example.com",
    "phone": null, // assign a default value instead of removing the field
    "isActive": true,
    "tags": ["user", "active", "new"],
    "registeredOn": "2024-04-19T15:00:00Z"
  }

A {$exists: true} filter is slow, so consider assigning a default value and indexing the field for fast query results.
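
To keep every document the same shape, missing fields can be filled with defaults before insertion. Here is a minimal sketch; the `DEFAULTS` map mirrors the example document above, and the specific fields and values are assumptions:

```python
import copy

# Illustrative defaults; pick whatever matches your schema.
DEFAULTS = {"phone": None, "tags": [], "isActive": False}

def with_defaults(doc: dict) -> dict:
    """Fill in missing fields so every document has the same shape."""
    out = dict(doc)
    for field, default in DEFAULTS.items():
        # deepcopy so mutable defaults (like the tags list) aren't shared
        out.setdefault(field, copy.deepcopy(default))
    return out

doc = with_defaults({"name": "John Doe", "age": 30})
# doc["phone"] is now None, so a query like {"phone": None} can use a
# regular index instead of a slow {"$exists": ...} check.
```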

Use Bulk Operations

Bulk operations are used to execute multiple write operations (inserts, updates, deletes) efficiently in a single command. This method can significantly reduce the number of round trips between your application and the MongoDB server, leading to performance improvements, especially when dealing with large volumes of data.

Since we were processing millions of messages from RabbitMQ queues, we handled them in batches and performed a bulk DB operation for each batch.

Here's the documentation for pymongo on how to use bulk operations for better write performance.
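
The batching itself can be as simple as slicing the message stream into fixed-size chunks. Below is a sketch; the chunk size, message shape, and collection name in the commented PyMongo calls are illustrative, though `bulk_write` with `ordered=False` is the standard PyMongo bulk API:

```python
from itertools import islice

def batches(messages, size=1000):
    """Yield fixed-size lists from a (possibly unbounded) stream of messages."""
    it = iter(messages)
    while chunk := list(islice(it, size)):
        yield chunk

# Hypothetical pymongo usage (message shape and names are assumptions):
#   from pymongo import UpdateOne
#   for batch in batches(consume_from_rabbitmq(), size=1000):
#       ops = [UpdateOne({"_id": m["id"]}, {"$set": m}, upsert=True)
#              for m in batch]
#       db.events.bulk_write(ops, ordered=False)  # one round trip per batch
```

`ordered=False` lets the server apply the operations in parallel and continue past individual failures, which is usually what you want for idempotent upserts.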

Use Indexes

Indexing plays a crucial part in DB performance. Create the required indexes on a collection based on the filters used in your codebase. This ties back to defining a valid document structure: instead of removing a field entirely, assign it a default value so queries on it can benefit from indexing.
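
In PyMongo, index keys are given as (field, direction) pairs. The specs below mirror the example document's fields and are purely illustrative; the commented `create_index` calls show how they would be applied:

```python
# Index key specs in pymongo's (field, direction) form; 1 = ascending.
email_index = [("email", 1)]
compound_index = [("isActive", 1), ("registeredOn", -1)]

# A partial index covers only documents matching a filter; in pymongo it is
# passed as partialFilterExpression (field names mirror the example document).
partial_filter = {"phone": {"$type": "string"}}

# Hypothetical creation calls:
#   db.users.create_index(email_index, unique=True)
#   db.users.create_index(compound_index)
#   db.users.create_index([("phone", 1)],
#                         partialFilterExpression=partial_filter)
```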

Instead of Count, Use Aggregation Pipeline

If you need to count documents matching a specific filter, I wouldn't recommend using count(); instead, use a combination of the $match and $count aggregation stages for faster performance.

$match filters the documents first, so only the documents that meet the specified conditions are passed down to the $count stage. This is more efficient than counting all documents and then filtering, which the legacy count() (deprecated in modern drivers) may effectively do when given no conditions.
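
The two-stage pipeline can be built as a plain list of stages. A minimal sketch; the `users` collection and the filter values in the commented call are assumptions:

```python
def count_pipeline(filters: dict) -> list:
    """Build a $match + $count pipeline equivalent to a filtered count."""
    return [{"$match": filters}, {"$count": "total"}]

pipeline = count_pipeline({"isActive": True, "age": {"$gte": 18}})

# Hypothetical execution with pymongo ($count emits no document at all
# when nothing matches, hence the default):
#   result = next(db.users.aggregate(pipeline), {"total": 0})
#   print(result["total"])
```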

Wrap Up

That's all I have experienced so far with MongoDB. I will update this article once I find more hints and best practices that help optimize performance.