A bug in the Mongoose package created serious problems in production. Lessons learned.

Alexander Ravikovich
7 min read · Apr 26, 2023

In this article, we’ll explore a true story of a bug in the Mongoose package that almost brought down production. We’ll discuss how the bug was discovered and fixed. Let’s get started!

It was the end of a typical workday when we began receiving alerts about system performance degradation. There was a significant issue that impacted numerous clients.

Some alerts were about increased response times on the APIs. Others came from MongoDB Atlas and were related to Max Process CPU and Disk IOPS.

The moment we understood something was happening, but didn’t yet know what exactly.

Investigation

High IOPS gave us some direction. We started looking for queries that could potentially lead to a disk scan. We knew that indexes could be involved…

The first thing we did was look at the Real Time and Profiler tabs in Atlas.

We saw a lot of COUNT queries that had been running for a very long time. This was strange behaviour, because we have a default 30-second timeout set up on all our services.

Fire-fighting

The first thing we did was kill all long-running queries. As soon as we did, database behaviour immediately went back to normal.
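
The exact commands aren’t important here (Atlas also lets you kill operations from its UI), but a minimal sketch of killing long-running operations from the mongo shell, assuming a 30-second threshold and a hypothetical mydb.movies namespace, could look like this:

// Sketch only: find operations running longer than 30 seconds on one
// (hypothetical) namespace and kill them.
db.currentOp({active: true, secs_running: {$gte: 30}}).inprog.forEach((op) => {
  if (op.ns === 'mydb.movies') { // hypothetical namespace, for illustration
    db.killOp(op.opid);
  }
});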

At the same time, we downloaded the slow query log files and found the problematic queries. We identified who was issuing these queries and asked him to stop until we fixed the issue.

MongoDB is back to normal

Understanding the issue

By looking at the logs, we immediately understood that we had a few issues with the indexes. We found the problematic endpoint, which had code like this:

const filter = {...};
const count = model.count(filter);
const find = model.find(filter).sort({...}).limit(...);

// Note the order: count resolves to the total, find to the documents.
const [total, results] = await Promise.all([count, find]);

The profiling information in the log file showed that these two queries used different indexes. Sometimes MongoDB even chose different indexes for the same query.

Our theories were:

  1. We have too many indexes on the collection, so the database sometimes fails to pick the right one.
  2. Because we didn’t have the right index, a really bad index was chosen. As a result, the COUNT query led to a scan of the whole collection.
  3. The FIND query worked slowly and most of the time returned a timeout error.
  4. We have a really huge collection, so scanning it uses a lot of resources.

The fix

We profiled the queries and added the necessary indexes. It helped: both queries now had good indexes, and MongoDB knew to choose them.
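
As a rough illustration of that profiling step, you can ask Mongoose to explain a query and check which index the planner picks; the filter and sort fields below are hypothetical:

// Sketch only: inspect the winning plan for a query (hypothetical fields).
const plan = await model
  .find(filter)
  .sort({createdAt: -1})
  .explain('executionStats');

console.log(plan.queryPlanner.winningPlan);         // we want an IXSCAN, not a COLLSCAN
console.log(plan.executionStats.totalDocsExamined); // should be close to the number of documents returned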

But that was not enough for us. We decided that, in this case, we are smarter than MongoDB, and forced it to use a specific index by calling .hint('index_name') programmatically in the code.
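
As a rough sketch of what that looks like (the index and field names here are made up for illustration, not our actual schema):

// Sketch only: force a specific index on the find query (hypothetical index name).
const results = await model
  .find(filter)
  .sort({createdAt: -1})
  .limit(50)
  .hint('status_1_createdAt_-1');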

We weren’t finished yet

We fixed a very specific case, but it could happen again. Our API is very flexible, and we don’t have indexes for every combination of filtering and sorting.

To protect us from situations like this one, we have default timeouts: no query to MongoDB should take more than 30 seconds. Otherwise, the database should return an error to the client.
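
With Mongoose, this kind of default can be expressed through maxTimeMS, either globally or per query. A minimal sketch, with values chosen to match our 30-second default:

// Sketch only: ask MongoDB to abort any query that runs longer than 30 seconds.
mongoose.set('maxTimeMS', 30000);

// ...or per query:
const docs = await model.find(filter).maxTimeMS(30000);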

At least, we were sure that this is what would happen. But we were mistaken…

Reproducing the issue

To understand what happened in production, I decided to try to reproduce the issue locally. The main point was to check why the timeouts didn’t work as expected: we shouldn’t have seen queries running for more than 30 seconds in production.

I created a simple Express.js app with the same Mongoose version we use in production (v5.6.7) and the same Node.js version.

To emulate long-running queries, I used a sleep function inside a $where clause. There was no need to fill the database with a huge amount of data: the sleep function runs for each document, so I inserted 10 dummy documents.

const mongoose = require('mongoose');
const Promise = require('bluebird');
mongoose.Promise = Promise;

// Ask MongoDB to abort any query that runs longer than 5 seconds.
mongoose.set('maxTimeMS', 5000);
mongoose.set('useNewUrlParser', true);

const connectionString = 'mongodb://localhost:27017/mydb';

mongoose.connect(connectionString, {
  poolSize: 2, // only 2 connections in the pool
});

const Schema = mongoose.Schema;

const MovieSchema = new Schema({
  _id: {
    type: Schema.Types.ObjectId,
  },
}, {collection: 'movies'});

const movieModel = mongoose.model('Movies', MovieSchema);

const express = require('express');
const app = express();
const port = 3003;

/**
 * Does a single find query against the database.
 */
app.get('/find', async (req, res) => {
  try {
    // The $where function runs on the MongoDB server for every document,
    // so each of the 10 documents adds ~1 second to the query.
    const findQuery = {
      $where: function () {
        return sleep(1000) || true;
      },
    };

    const movies = await movieModel.find(findQuery)
      .sort({_id: 1}).limit(10);

    res.json({movies});
  } catch (error) {
    console.log(error.stack);
    res.json({error});
  }
});

/**
 * Does a single count query against the database.
 */
app.get('/count', async (req, res) => {
  try {
    // ~50 seconds per document: this should be killed by the 5-second
    // maxTimeMS long before it finishes.
    const countQuery = {
      $where: function () {
        return sleep(50000) || true;
      },
    };

    const total = await movieModel.count(countQuery);
    res.json({total});
  } catch (error) {
    console.log(error.stack);
    res.json({error});
  }
});

/**
 * Emulates the production behaviour. We have two queries that should run
 * concurrently, so we run them with Promise.all().
 * To emulate long-running queries we use sleep() instead of filling the
 * database with a huge amount of data.
 *
 * The sleep function runs for each document in the collection.
 */
app.get('/', async (req, res) => {
  try {
    const countQuery = {
      $where: function () {
        return sleep(50000) || true;
      },
    };
    const findQuery = {
      $where: function () {
        return sleep(1000) || true;
      },
    };

    // The order of the results matches the order of the promises:
    // find() resolves to the documents, count() to the total.
    const [movies, total] = await Promise.all([
      movieModel.find(findQuery),
      movieModel.count(countQuery),
    ]);

    const response = {
      total,
      movies,
    };

    res.json(response);
  } catch (error) {
    console.log(error.stack);
    res.json({error});
  }
});

app.listen(port, () => {
  console.log(`Server listening on port ${port}`);
});

What is happening here?

  • A connection to MongoDB is created
  • A maximum of 2 connections is allowed in the pool
  • With the maxTimeMS parameter, we tell MongoDB to stop query execution after 5 seconds and return an error
  • We use Promise.all to run the two queries concurrently
  • The count query will run for 50 seconds per document
  • The find query will run for 1 second per document

When I ran it the first two times, I received the timeout error. Expected behaviour.

But when I sent the request a third time, nothing good happened. I didn’t get any response from the server. Why?

Now it became more interesting. What would happen if I tried only the count() request after a server restart? Again, no response from the server.

And what would happen if I tried only the find() query after a server restart? That one worked as expected, and I received timeouts.

What? It looks like the timeout doesn’t work on count() queries.

One last check before making that statement: I restarted the server, made multiple count() requests, and looked at the database.

// Note: currentOp reports runtime in microseconds, and the $currentOp
// stage has to be run against the admin database.
const secondsRunning = 2;

const query = [
  {"$currentOp": {"allUsers": true, "idleSessions": true}},
  {
    "$match": {
      "active": true,
      "microsecs_running": {
        "$gte": secondsRunning * 1000 * 1000,
      },
      "ns": {"$in": ["mydb.movies"]},
    }
  },
];

db.getSiblingDB("admin").aggregate(query);

There were exactly two running queries in MongoDB. They would keep running until the Node.js process was killed and the database detected the dead connections. Otherwise, each query would run until it finished.

The reason I had two running count() queries is that I had a poolSize of 2 connections. For each query, the MongoDB driver takes a connection from the pool, sends the query to MongoDB, and waits for the response.

As long as MongoDB doesn’t respond, that connection cannot be reused. In my example, any further HTTP request is “blocked” as soon as its handler tries to query the database, until a connection becomes available in the pool.

Now we fully understand what happened in production:

  • HTTP requests were made against our Node.js servers
  • We responded with a timeout error, because the timeout works for the find query, but the count query continued to run in the database
  • The client received the timeout and retried again and again
  • Each waiting count query in Node.js held a connection from the pool
  • The requests were distributed across all of our Node.js instances
  • All (or almost all) connections in the pools were taken by count queries
  • The overall throughput of the APIs dropped
  • MongoDB used a wrong index for the count query, which led to a full scan of a huge collection
  • Auto scale-up started to bring up new Node.js instances
  • More and more count queries were sent to the database
  • The database was scanning the collection over and over, in parallel
  • The database responded slower and slower because all its resources were spent on scanning a huge collection

Solution

A quick Google search, and we found out that there was a bug in Mongoose that had been fixed a long time ago. The bugfix was released in Mongoose v5.10.15, while we were running v5.6.7.

After upgrading the Mongoose package, the timeout started to work on the count query. Now the challenge was to upgrade the package in production.

Finally, timeouts work on the count query

Lessons learned

  • Blindly assuming that a package works as expected can sometimes hurt.
  • Keeping Node.js packages up to date is important. Even more than you can imagine.

Remember: there is no such thing as “unlimited” resources. There should always be some limit on everything you do against the database. A big limit is still a limit.
