Last Updated: February 25, 2016 · richthegeek

Crazy experiment: batching MongoDB reads together

Note: this isn't something you should use in production. It's just an experiment at this stage, regardless of test results. Do not use it.

The problem

For whatever reason, your application is sending a lot of individual requests to your Mongo database, and a lot of them are going to the same collection. Because of the sheer number of requests, they get in each other's way, and each one incurs the costs of encoding, decoding, and network latency (yay TCP).

The (idea for a) solution

Let's say we intercept all the cursor creations, merge requests against the same collection together, and send just a single query to Mongo, then split the results back up once we've received them. So instead of two distinct queries:

{_id: 1}, {_id: 2}

We just send one:

{_id: {$in: [1,2]}}
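
To make the mechanics concrete, here's a minimal sketch of the idea in Node.js (the names batchedFindById, pending, and flush are mine for illustration, not mongroup's actual API): lookups queue up during the current tick, a single $in query is sent, and each result is routed back to the caller that asked for it.

const pending = []; // queued {id, resolve} pairs for one collection

function batchedFindById(collection, id) {
  return new Promise(resolve => {
    pending.push({ id, resolve });
    // the first query of the tick schedules the flush; later ones piggyback
    if (pending.length === 1) setImmediate(() => flush(collection));
  });
}

function flush(collection) {
  const batch = pending.splice(0); // take everything queued so far
  const ids = batch.map(p => p.id);
  collection.find({ _id: { $in: ids } }).toArray().then(docs => {
    const byId = new Map(docs.map(d => [String(d._id), d]));
    // hand each document back to whoever requested its _id
    batch.forEach(p => p.resolve(byId.get(String(p.id)) || null));
  });
}

Calling batchedFindById(users, 1) and batchedFindById(users, 2) in the same tick would produce exactly the single $in query above.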

Does it work?

Yes! The time between the first query starting and the last one finishing is reduced by up to 70%. Or in real terms, 12ms vs 35ms.

Of course, that number is a bit of a fudge: on a larger test (8,500 records, a 47% improvement from 840ms to 450ms), the average per-query latency rises to 221% of what it was (166ms vs 75ms). (That is, the time between creating the cursor and toArray getting anything out the other end.)

Does it help mongo?

Depending on the query, yes, absolutely! In some tests it cut the load on Mongo itself from 35ms across the parallel requests to 1ms for the single batched one. In others, it had no effect. I've not yet seen it increase Mongo's load significantly, but I'm not discounting the possibility of it doing so in some instances.

Is this a panacea?

Of course not! Some queries just don't lend themselves to this sort of thing. Take, for example, the following pair:

{a: 1, b: 3}, {a: 2, b: 5}

When combined, this would become:

{a: {$in: [1, 2]}, b: {$in: [3, 5]}}

Which would match documents such as:

{a: 1, b: 5} or {a: 2, b: 3}

These spurious matches get filtered out by the receiver (see the sketch below), but they are still transferred over the wire. If the combination of fields in your query is highly selective but the individual fields are not, then batching probably won't help (and might even hurt).
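
By way of illustration, an equality-only receiver-side filter might look like this (matchesQuery and routeResults are made-up names; a real matcher would also have to understand query operators):

function matchesQuery(doc, query) {
  // equality-only check: every key in the original query must match exactly
  return Object.keys(query).every(key => doc[key] === query[key]);
}

function routeResults(docs, requests) { // requests: [{query, resolve}, ...]
  for (const req of requests) {
    // cross-product matches like {a: 1, b: 5} fail here and are dropped
    req.resolve(docs.filter(doc => matchesQuery(doc, req.query)));
  }
}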

What other problems are there?

Currently, the implementation is very naive about how it merges and checks queries. For example, any use of query operators ($lt, $in, $regex) will break it completely. There's scaffolding for checking the query objects prior to batching them, but that isn't done yet.
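
One plausible shape for that check, assuming only top-level scalar equality is safe to batch (my sketch, not mongroup's code):

// Reject anything that isn't plain top-level equality on scalar values:
// {a: {$lt: 5}}, {$or: [...]} and nested documents would all fall back
// to a normal, unbatched find. (A real check would also want to allow
// ObjectIDs and Dates, which are objects.)
function isBatchable(query) {
  return Object.keys(query).every(key => {
    const value = query[key];
    return key.charAt(0) !== '$' &&
           (value === null || typeof value !== 'object');
  });
}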

The returned cursors do not (and, as far as I can tell, cannot) support limit, skip, count, etc. Whilst it would be easy enough to implement these such that the cursors unbatch themselves, again this is not done yet.
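
For what it's worth, unbatching could look something like the hypothetical wrapper below: a plain toArray() joins the batch, while any cursor option falls back to a normal find (batchedFind is assumed to be a batching entry point like the one sketched earlier, generalised to arbitrary queries).

class BatchableCursor {
  constructor(collection, query) {
    this.collection = collection;
    this.query = query;
  }
  limit(n) {
    // opting out: issue a real cursor with the limit applied
    return this.collection.find(this.query).limit(n);
  }
  skip(n) {
    return this.collection.find(this.query).skip(n);
  }
  toArray() {
    return batchedFind(this.collection, this.query); // joins the batch
  }
}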

Can I try it out?

Yes of course! It's here: https://github.com/richthegeek/mongroup

Can I help out?

Well, you can certainly open issues, fork it, or send me PRs. If these performance gains hold up in practice, I'd love for this to be stable enough to use in production.

Suggestions, comments, criticisms?

Fire away!