MongoDB Query and Updating Corner Cases - the $type Query Operator and Arrays
<em>This post was originally published on <a href="http://blog.3t.io/post/86298733013/mongodb-query-and-updating-corner-cases">my company's blog</a></em>
First coming as a developer from a traditional RDBMS to a NoSQL database like MongoDB can be a truly liberating experience. For example, not being forced into a fixed schema and being able to dynamically add fields along the way as and where you need them is just fantastic. For me personally, another great aspect has always been the very expressive and flexible JSON-based query language in MongoDB. Compare that to the rigidness of the SQL syntax!
That said, while working on our <a href="http://3t.io/products/data-manager/">Data Manager GUI tool for MongoDB</a>, I have come across a few very tricky query and update corner cases that stem from the peculiar way arrays are handled by the MongoDB search engine. In this post, I will talk about the repercussions that these seemingly isolated peculiarities can create for certain common update commands.
The $type Query Operator and Arrays
Let’s begin by revisiting arrays and queries. Consider the following collection “address” that lists persons and their (multiple) addresses:
{
"_id" : 1,
"first_name" : "Peter",
"address" : "100 Main St, Boston, MA"
},
{
"_id" : 2,
"first_name" : "Paula",
"address" : {
"street" : "1234 Broad St",
"city" : "New York, NY"
}
},
{
"_id" : 3,
"first_name" : "Natalie",
"address" : {
"street" : "200 High St",
"city" : "Miami, FL"
}
},
{
"_id" : 4,
"first_name" : "Tim",
"address" : [
{
"street" : "400 Michigan Ave",
"city" : "Chicago, IL"
},
{
"street" : "Berliner Str. 3",
"city" : "München"
}
]
},
{
"_id" : 5,
"first_name" : "Sara",
"address" : [
{
"street" : "120 Ocean Dr",
"city" : "Miami, FL"
},
{
"street" : "Pariser Str. 10",
"city" : "Berlin"
}
]
},
{
"_id" : 6,
"first_name" : "Jake",
"address" : [
[
{
"street" : "456 Broad St",
"city" : "Providence, RI"
},
{
"street" : "1000 Marina Dr",
"city" : "Naples, FL"
}
],
[
{
"street" : "Hauptstr. 12",
"city" : "Berlin"
}
]
]
}
In MongoDB, the $type query operator lets you query for fields that have a certain type. Say, for example, you wanted to find all documents in your “address” collection where the field “first_name” is of type String. You would trivially issue a query like this:
> db.address.find({"first_name": {$type: 2}})
{ "_id" : 1, "first_name" : "Peter", "address" : "100 Main St, Boston, MA" }
{ "_id" : 2, "first_name" : "Paula", "address" : { "street" : "1234 Broad St", "city" : "New York, NY" } }
{ "_id" : 3, "first_name" : "Natalie", "address" : { "street" : "200 High St", "city" : "Miami, FL" } }
{ "_id" : 4, "first_name" : "Tim", "address" : [ { "street" : "400 Michigan Ave", "city" : "Chicago, IL" }, { "street" : "Berliner Str. 3", "city" : "München" } ] }
{ "_id" : 5, "first_name" : "Sara", "address" : [ { "street" : "120 Ocean Dr", "city" : "Miami, FL" }, { "street" : "Pariser Str. 10", "city" : "Berlin" } ] }
{ "_id" : 6, "first_name" : "Jake", "address" : [ [ { "street" : "456 Broad St", "city" : "Providence, RI" }, { "street" : "1000 Marina Dr", "city" : "Naples, FL" } ], [ { "street" : "Hauptstr. 12", "city" : "Berlin" } ] ] }
>
Since all documents have a String in their respective “first_name” fields, you are returned all documents.
Ok, suppose now you wanted to find all documents where the field “address” holds an array, i.e. where the person has several addresses. A quick glance at the documentation of $type shows that type “4” indicates an array. So, then
> db.address.find({"address": {$type: 4}})
{ "_id" : 6, "first_name" : "Jake", "address" : [ [ { "street" : "456 Broad St", "city" : "Providence, RI" }, { "street" : "1000 Marina Dr", "city" : "Naples, FL" } ], [ { "street" : "Hauptstr. 12", "city" : "Berlin" } ] ] }
>
What just happened? Why are Tim and Sara not showing up?
If you keep reading the documentation on the $type operator, you will learn that in case of an array field, “the $type operator performs the type check against the array elements and not the field”. In other words,
db.address.find({"address": {$type: 4}})
won’t return the documents where the field “address” holds an array, but only those documents where “address” holds (at least) two-dimensional arrays (i.e. arrays-in-arrays or second-level arrays).
This has always struck me as a really peculiar design choice. For exactly these semantics where you want something applied to the elements of an array, there is afterall the $elemMatch operator. So, if I had really wanted to find all two-dimensional address arrays, I would have put the query like so:
> db.address.find({"address": {$elemMatch: {$type: 4}}})
{ "_id" : 6, "first_name" : "Jake", "address" : [ [ { "street" : "456 Broad St", "city" : "Providence, RI" }, { "street" : "1000 Marina Dr", "city" : "Naples, FL" } ], [ { "street" : "Hauptstr. 12", "city" : "Berlin" } ] ] }
>
And as you can see, the same result set is returned.
$type Array Inconsistencies
Note, however, that the behavior of the $type query operator is unfortunately inconsistent between top-level arrays and concretely-named deeper-level arrays. Previously, we have learnt that for array fields, $type is actually executed against the elements rather than the array field itself. Ok, we groked that, so what should the following query then return?
db.address.find({"address.0": {$type: 4}})
Well, in each document, it should look at the first element in the address array (if the document has such an array, of course) and check whether it is again of type Array. In that case, though, by the definition in the documentation and as observed above, the $type operator should be applied to the array elements of address.0 instead of directly to the array field address.0. So, we would expect to find documents where the first element of the address array again has elements that are themselves arrays. In other words, we expect to find three-dimensional arrays. Let’s run it then:
> db.address.find({"address.0": {$type: 4}})
{ "_id" : 6, "first_name" : "Jake", "address" : [ [ { "street" : "456 Broad St", "city" : "Providence, RI" }, { "street" : "1000 Marina Dr", "city" : "Naples, FL" } ], [ { "street" : "Hauptstr. 12", "city" : "Berlin" } ] ] }
>
That is bizarre. Instead of an empty result set (as we don’t have any three-dimensional address arrays in our collection), we see Jake’s two-dimensional address array. It appears as though this time, $type was directly applied to the array field “address.0” instead of to its elements.
Maybe this is only an artefact of {$type: 4} ? Let’s investigate this further. Let’s begin by finding all documents that have an address object:
> db.address.find({"address": {$type: 3}})
{ "_id" : 2, "first_name" : "Paula", "address" : { "street" : "1234 Broad St", "city" : "New York, NY" } }
{ "_id" : 3, "first_name" : "Natalie", "address" : { "street" : "200 High St", "city" : "Miami, FL" } }
{ "_id" : 4, "first_name" : "Tim", "address" : [ { "street" : "400 Michigan Ave", "city" : "Chicago, IL" }, { "street" : "Berliner Str. 3", "city" : "München" } ] }
{ "_id" : 5, "first_name" : "Sara", "address" : [ { "street" : "120 Ocean Dr", "city" : "Miami, FL" }, { "street" : "Pariser Str. 10", "city" : "Berlin" } ] }
>
As expected, that finds Paula’s and Natalie’s address objects, as well as object elements in Tim’s and Sara’s address arrays.
So, let’s try the same thing on a concretely-named array element:
> db.address.find({"address.0": {$type: 3}})
{ "_id" : 4, "first_name" : "Tim", "address" : [ { "street" : "400 Michigan Ave", "city" : "Chicago, IL" }, { "street" : "Berliner Str. 3", "city" : "München" } ] }
{ "_id" : 5, "first_name" : "Sara", "address" : [ { "street" : "120 Ocean Dr", "city" : "Miami, FL" }, { "street" : "Pariser Str. 10", "city" : "Berlin" } ] }
>
Tim’s and Sara’s address.0 fields are indeed objects, so they are correctly returned. However, Jake’s address.0 field is an array that has elements of type object, and so should be - if the behavior of $type were consistent - also returned.
These observable inconsistencies strongly suggest that $type is only executed against array elements for top-level array fields. For nested array fields, $type seems to be no longer executed against the array elements but instead directly against the (nested) array field.
By the same token, one could now argue that the observed correct execution of db.address.find({"address": {$elemMatch: {$type: 4}}}) in the previous section is actually also inconsistent. One could expect that when $elemMatch goes and checks the individual element fields of “address”, the $type would be executed on each first-level array element. If such a first-level array element is itself an array again, by the description of $type, {$type: 4} should not be applied to that first-level array element directly but instead to its elements. So, it should really return those elements that have elements of type array. In other words, {$elemMatch: {$type: 4}} should also only return three-dimensional arrays.
How do you detect array fields then?
The official recommendation in the documentation of $type is to revert to JavaScript and run
> db.address.find( { $where : "Array.isArray(this.address)" } )
{ "_id" : 4, "first_name" : "Tim", "address" : [ { "street" : "400 Michigan Ave", "city" : "Chicago, IL" }, { "street" : "Berliner Str. 3", "city" : "München" } ] }
{ "_id" : 5, "first_name" : "Sara", "address" : [ { "street" : "120 Ocean Dr", "city" : "Miami, FL" }, { "street" : "Pariser Str. 10", "city" : "Berlin" } ] }
{ "_id" : 6, "first_name" : "Jake", "address" : [ [ { "street" : "456 Broad St", "city" : "Providence, RI" }, { "street" : "1000 Marina Dr", "city" : "Naples, FL" } ], [ { "street" : "Hauptstr. 12", "city" : "Berlin" } ] ] }
>
That works and returns all matching documents. However, this method comes with a huge performance penalty, as the JavaScript code now needs to be executed against each document in the collection (and also can’t take advantage of any indices). To gauge this performance penalty, I set up a quick testbed on my Early 2013 MacBook Pro running at 2.7 GHz and locally running a 2.6.1 MongoDB server. I created a simple “phones” collection with 1,000,000 documents of the following form:
{
"_id" : 38000425113,
"components" : {
"country" : 3,
"area" : 800,
"prefix" : 42,
"number" : 425113
},
"display" : "+3 800-425113"
}
and manually added two stripped-down documents with two-dimensional arrays:
{ "_id" : 38007654321, "components" : { "number" : [ [ 8, 0, 0 ], [ 7, 6, 5, 4, 3, 2, 1 ] ] } },
{ "_id" : 38001234567, "components" : { "number" : [ [ 8, 0, 0 ], [ 1, 2, 3, 4, 5, 6, 7 ] ] } }
So, let’s compare the $type query
> db.phones.find({"components.number": {$type: 4}})
{ "_id" : 38007654321, "components" : { "number" : [ [ 8, 0, 0 ], [ 7, 6, 5, 4, 3, 2, 1 ] ] } }
{ "_id" : 38001234567, "components" : { "number" : [ [ 8, 0, 0 ], [ 1, 2, 3, 4, 5, 6, 7 ] ] } }
> db.phones.find({"components.number": {$type: 4}}).explain()
{
"cursor" : "BasicCursor",
"isMultiKey" : false,
"n" : 2,
"nscannedObjects" : 1000001,
"nscanned" : 1000001,
"nscannedObjectsAllPlans" : 1000001,
"nscannedAllPlans" : 1000001,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 7812,
"nChunkSkips" : 0,
"millis" : 447
"server" : "Thomass-MacBook-Pro.local:27061",
"filterSet" : false
}
to the JavaScript-based query
> db.phones.find( { $where : "Array.isArray(this.components.number)" } )
{ "_id" : 38007654321, "components" : { "number" : [ [ 8, 0, 0 ], [ 7, 6, 5, 4, 3, 2, 1 ] ] } }
{ "_id" : 38001234567, "components" : { "number" : [ [ 8, 0, 0 ], [ 1, 2, 3, 4, 5, 6, 7 ] ] } }
> db.phones.find( { $where : "Array.isArray(this.components.number)" } ).explain()
{
"cursor" : "BasicCursor",
"isMultiKey" : false,
"n" : 2,
"nscannedObjects" : 1000001,
"nscanned" : 1000001,
"nscannedObjectsAllPlans" : 1000001,
"nscannedAllPlans" : 1000001,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 7831,
"nChunkSkips" : 0,
"millis" : 21623
"server" : "Thomass-MacBook-Pro.local:27061",
"filterSet" : false
}
Roughly half a second for the $type operator-based query vs over 21 seconds for the JavaScript-based work-around. While just one type of query is of course by no means a statistically sound proof, it does suffice to demonstrate that the performance of these query types will differ by orders of magnitude. That may be OK for a one-off query but certainly is not feasible for production-level code in your application.
So, {$type: 4} doesn’t work, the JavaScript workaround is way too slow. What to do? On <a href="https://jira.mongodb.org/browse/SERVER-1475">SERVER-1475</a>, a few very clever workarounds have been suggested.
You could for example test for the existence of a an embedded field “0”, like so:
> db.address.find({"address.0": {$exists: 1}})
{ "_id" : 4, "first_name" : "Tim", "address" : [ { "street" : "400 Michigan Ave", "city" : "Chicago, IL" }, { "street" : "Berliner Str. 3", "city" : "München" } ] }
{ "_id" : 5, "first_name" : "Sara", "address" : [ { "street" : "120 Ocean Dr", "city" : "Miami, FL" }, { "street" : "Pariser Str. 10", "city" : "Berlin" } ] }
{ "_id" : 6, "first_name" : "Jake", "address" : [ [ { "street" : "456 Broad St", "city" : "Providence, RI" }, { "street" : "1000 Marina Dr", "city" : "Naples, FL" } ], [ { "street" : "Hauptstr. 12", "city" : "Berlin" } ] ] }
>
That’s nice and simple but might for the (admittedly extreme) corner case, where you have a document where the field in question actually occurs as an embedded document which happens to have a field “0”, return false positives. It would also fail to detect empty arrays.
Another very clever workaround that was suggested there uses $elemMatch to check for the existence of any element:
> db.address.find({"address": {$elemMatch: {$exists: 1}}})
{ "_id" : 4, "first_name" : "Tim", "address" : [ { "street" : "400 Michigan Ave", "city" : "Chicago, IL" }, { "street" : "Berliner Str. 3", "city" : "München" } ] }
{ "_id" : 5, "first_name" : "Sara", "address" : [ { "street" : "120 Ocean Dr", "city" : "Miami, FL" }, { "street" : "Pariser Str. 10", "city" : "Berlin" } ] }
{ "_id" : 6, "first_name" : "Jake", "address" : [ [ { "street" : "456 Broad St", "city" : "Providence, RI" }, { "street" : "1000 Marina Dr", "city" : "Naples, FL" } ], [ { "street" : "Hauptstr. 12", "city" : "Berlin" } ] ] }
>
Note: If you also want to catch empty arrays, $or it with a {$size: 0}:
> db.address.find({$or: [{"address": {$size: 0}}, {"address": {$elemMatch: {$exists: 1}}}]})
Caveat: This is a really slick workaround to detect array fields. However, note that prior to 2.6, this workaround only works for top-level arrays but not for concretely-named deeper-level arrays. Let me demonstrate what I mean by that. Suppose you wanted to find all documents where the first element of “address”, i.e. “address.0”, is itself an array again.
2.4.10:
> db.address.find({"address.0": {$elemMatch: {$exists: 1}}})
>
2.6.1:
> db.address.find({"address.0": {$elemMatch: {$exists: 1}}})
{ "_id" : 6, "first_name" : "Jake", "address" : [ [ { "street" : "456 Broad St", "city" : "Providence, RI" }, { "street" : "1000 Marina Dr", "city" : "Naples, FL" } ], [ { "street" : "Hauptstr. 12", "city" : "Berlin" } ] ] }
>
See <a href="https://jira.mongodb.org/browse/SERVER-1264">SERVER-1264</a> for more details.
In 2.4.x, for deeper-level arrays, you could of course revert to using the $type operator again and exploit its inconsistent behavior as described above, but that is neither here nor there :-)
Why does that matter? Who would need to find fields of type array anyway?
So why is this important? Why would I ever need to detect arrays? Because there are a number of surprisingly common operations that are directly impacted by the presence of arrays.
1. Adding a field to documents where the field is missing
Suppose we have just realized that we need a “country” field in the embedded “address” documents (i.e. for Paula and Natalie). No worries, there is of course no fixed schema here, so we can easily add the field:
> db.address.update({},{$set: {"address.country": "US"}}, {multi: true})
WriteResult({
"nMatched" : 0,
"nUpserted" : 0,
"nModified" : 0,
"writeError" : {
"code" : 16837,
"errmsg" : "cannot use the part (address of address.country) to traverse the element ({address: \"100 Main St, Boston, MA\"})"
}
})
>
Right, so that failed because Peter’s “address” field is a String and so, obviously, the DB doesn’t know how to add a field to that. Fair enough. So, let’s refine our update query to consider only those documents where “address” is indeed of type Object:
> db.address.update({"address": {$type: 3}},{$set: {"address.country": "US"}}, {multi: true})
WriteResult({
"nMatched" : 0,
"nUpserted" : 0,
"nModified" : 0,
"writeError" : {
"code" : 16837,
"errmsg" : "cannot use the part (address of address.country) to traverse the element ({address: [ { street: \"400 Michigan Ave\", city: \"Chicago, IL\" }, { street: \"120 Ocean Dr\", city: \"Miami, FL\" } ]})"
}
}
>
Ok, so why did this fail now? Well, remember that for an array field, $type is actually applied to the array elements. So, {"address": {$type: 3}} will also return Tim’s and Sara’s documents because they both have address array elements that are of type Object. Trying then to add a field “country” to their respective “address” array fields obviously does not make sense. Hence, the complaining from the DB.
Ok, so let’s refine our update query further. Now, we’ll only try to add the field to documents where “address” is an embedded object and at the same time not an array:
> db.address.update({$and: [ {"address": {$type: 3}}, {"address": {$not: {$elemMatch: {$exists: 1}}}} ]}, {$set: {"address.country": "US"}}, {multi: true})
WriteResult({ "nMatched" : 2, "nUpserted" : 0, "nModified" : 0 })
>
Quite some hoops we had to jump through in order to be able to do something as innocent-sounding as adding a missing field to some documents :-)
2. Finding outlier documents with a certain field of a certain type
As a developer, a very common QA task is to make sure that your application always writes consistent data. So, suppose your application is to always store any address of a person in an “address” array (people with only one address would then only have a one-element address array). We could now easily find outliers:
> db.address.find({"address": {$not: {$elemMatch: {$exists: 1}}}})
{ "_id" : 1, "first_name" : "Peter", "address" : "100 Main St, Boston, MA" }
{ "_id" : 2, "first_name" : "Paula", "address" : { "street" : "1234 Broad St", "city" : "New York, NY" } }
{ "_id" : 3, "first_name" : "Natalie", "address" : { "street" : "200 High St", "city" : "Miami, FL" } }
>
3. Finding specific array element
A probably less common case where the detection of an array is important is finding specifically named (i.e. index-based) array elements across documents. This is something that we encountered while working on our <a href="http://3t.io/products/schema-explorer/">Schema Explorer tool</a>. Suppose you wanted to find all second elements in the address array fields across the documents in your collections. If you naively ran
db.address.find({"address.1": {$exists: 1}})
you would get those documents, but you could in theory also get some false positives in the case where there happened to be address objects that had a field named “1” in them. Granted, that is probably an extreme edge case, but a real concern when designing a generic tool that operates on unknown collections. So, in order to guard against those false positives, you would need to run a query like this:
> db.address.find({$and: [{"address.1": {$exists: 1}}, {"address": {$elemMatch: {$exists: 1}}}]})
{ "_id" : 4, "first_name" : "Tim", "address" : [ { "street" : "400 Michigan Ave", "city" : "Chicago, IL", "country" : "US" }, { "street" : "Berliner Str. 3", "city" : "München" } ] }
{ "_id" : 5, "first_name" : "Sara", "address" : [ { "street" : "120 Ocean Dr", "city" : "Miami, FL", "country" : "US" }, { "street" : "Pariser Str. 10", "city" : "Berlin" } ] }
{ "_id" : 6, "first_name" : "Jake", "address" : [ [ { "street" : "456 Broad St", "city" : "Providence, RI" }, { "street" : "1000 Marina Dr", "city" : "Naples, FL" } ], [ { "street" : "Hauptstr. 12", "city" : "Berlin" } ] ] }
>
Conclusion
In this blog post, we have seen how the presence of arrays can impact a number of fairly common update operations, and how to handle those cases. Overall, I think it would be really great to have a dedicated $isArray query operator. The peculiar (and at times inconsistent) way arrays are handled by the search engine currently make it unnecessarily complex to guard one’s update queries against the tricky intricacies of arrays.