ideas

speed up mongodb aggregation functions

I love MongoDB and last month I started to learn MongoDB's amazing Aggregation API. It has a lot of useful aggregation functions and it's documentation can be found here.

I tried all aggregation functions on mongo shell and I also tried explain function on aggregation functions to understand how MongoDB handles aggregation functions. It's pretty simple. For example if you want to use explain function for $geoNear you can simply call explain as:

db.places.explain().aggregate([  
   {
     $geoNear: {
        near: { type: "Point", coordinates: [ -73.99279 , 40.719296 ] },
        distanceField: "dist.calculated",
        maxDistance: 2,
        query: { type: "public" },
        includeLocs: "dist.location",
        num: 5,
        spherical: true
     }
   }
])

And mongo shell will return explain details. You have a chance to understand the magic. I wanna test how MongoDB indexes affect aggregation functions.

First of all I inserted 7.000.000 test data with simple code piece, you can get code from here. If you run this code piece on your mongo shell you will get 6.191.705 data in testData collection.

Now we are ready to test the aggregation functions. First I tested $avg function with below code:

var start = (new Date).getTime();  
db.testData.aggregate(  
    [ 
        { $match : {country : "Turkey"} }, 
        { $group : { _id : { groupedBy : "$city" }, averageIndex: { $avg: "$index"} } }
    ] 
); 
var end = (new Date).getTime();  
print("elapsed " + (end - start) + " msec");  

And elapsed time for this aggregation is 2892 milliseconds on my environment. And let's explain this query with

db.testData.explain().aggregate(  
    [ 
        { $match : {country : "Turkey"} }, 
        { $group : { _id : { groupedBy : "$city" }, averageIndex: { $avg: "$index"} } }
    ] 
); 

and you will see MongoDB uses COLLSCAN for this match function:

"winningPlan" : {
    "stage" : "COLLSCAN",
    "filter" : {
        "country" : {
                "$eq" : "Turkey"
        }
        },
            "direction" : "forward"
    },

Since aggregation scans all collection, in our example 6 million documents to calculate average latitude, it takes more than 2 seconds. Now let's create index on country field as:

db.testData.ensureIndex({ country : 1 })  

and execute average function again. And now calculating average of 6 million documents takes only 120 milliseconds! It's amazing, isn't it?

Let's explain this aggregation function after creating index. Run explain command again and you will see winningPlan:

"winningPlan" : {
    "stage" : "FETCH",
    "inputStage" : {
            "stage" : "IXSCAN",
            "keyPattern" : {
                    "country" : 1
                    },
            "indexName" : "country_1",
            "isMultiKey" : false,
            "isUnique" : false,
            "isSparse" : false,
            "isPartial" : false,
            "indexVersion" : 1,
            "direction" : "forward",
            "indexBounds" : {
                "country" : [
                    "[\"Turkey\", \"Turkey\"]"
                ]
            }
        }
    }

You'll see aggregation now uses IXSCAN for group aggregation function.

And now! Let's remove $match function and let's group by country. Since we have an index over country, it should use IXSCAN and it should be fast, right? Let's try and see what happens:

db.testData.explain().aggregate(  
    [ 
        { $group : { _id : { groupedBy : "$country" }, averageIndex: { $avg: "$index"} } }
    ] 
); 

And here is the winningPlan:

"winningPlan" : {
        "stage" : "COLLSCAN",
        "filter" : {
            "$and" : [ ]
        },
        "direction" : "forward"
        },

Why the hack MongoDB doesn't use country index? Let's read the query in English: Query is saying given a collection return ALL documents, grouped by country, and display the average index value. The main difference, when you want to group by MongoDB takes all the document and creates groups, you don't use any advantage of the indexing. It's also documented at MongoDB documentation. Also please note that, you cannot take advantages of indexing if $match or $sort function is not presented in the beginning of aggregation pipeline. Because MongoDB creates new document model after each pipeline step and there is no indexing for this dynamic document models.

When I execute this query,

var start = (new Date).getTime();  
db.testData.aggregate(  
    [ 
        { $group : { _id : { groupedBy : "$country" }, averageIndex: { $avg: "$index"} } }
    ] 
); 
var end = (new Date).getTime();  
print("elapsed " + (end - start) + " msec");  

query takes 6623 milliseconds on my environment.

If you have any questions, I'd love to try to help you! Take care!

PS: My hardware configuration is 2,3 GHz Intel Core i7, 16 GB 1600 MHz DDR3 and 500 GB Flash Storage and MongoDB performance can change on different environments.


Soner ALTIN
TAGGED IN work, tech