CopyDisable

Monday 30 May 2016

JOIN in MongoDB ? or JOIN’s kin? or something similar?

One of the sought after features in MongoDB was to have the ability to join collections. People working on RDBMS were very much familiar with joins and could not even imagine working without joins. The base of RDBMS is the relations, and join is one of the success factors of RDBMS. Also the join is the one of the major performance issues in RDBMS when we have large amount of data. MongoDB is based on document model, most of the time all the data for a record is located in a single document. So if the data is properly modelled in MongoDB the need for Joins can be avoided. For some requirements like reporting, analytics etc. it is possible that the data we need may reside in multiple collections. As MongoDB user base is growing and more and more users from RDBMS world are using MongoDB, so requirement of Join came out strongly. Starting with MongoDB version 3.2, one new aggregation framework operator $lookup was added. The $lookup operator performs an operation similar to a Join (left outer join). We can read data from one collection and merge the data with data from another collection. Prior to MongoDB 3.2, similar work had to be implemented in application code.
Let’s get our hands dirty with an example.
Suppose we have two collections:
users collection stores user’s information.
image
activity collection stores users activities.
image
Referring to RDBMS, we may think userID field in users collection as the primary key and userID field in activity collection as the foreign key Smile. So the link between these users and activity collection is the userID field.
Now suppose we got a requirement: “find username and city of the user performing each activity”. But the user’s detail information is stored in users collection, so we have to join the activity and users collections using the userID field to extract the required data.
It’s the time to leverage the power of $lookup operator. So our aggregation query will be:
> db.activity.aggregate(
{
"$lookup": {
from : "users",
localField : "userID",
foreignField: "userID",
as : "userInfo" }
})

image



from: Specifies the collection from the current database to be joined, in our example it will be the users collection.
localField: Specifies the field from the input documents, in our case it will be userID field of activity collection.
foreignField: Specifies the field from the documents of the “from” collection, in our case it will be userID from users collection.
as: Specifies the name of the new array field, each array contains the matching documents from the “from” collection. We are naming this array as userInfo.
image
From above output, we can see that the whole users document is stored within the userInfo array.
The data returned above is not looking cool, this is not the format in which we wanted the data. If we get data in the following format, it would be nice:
 UserID, Activity, UserName, City
So for that we have to use two more aggregation framework operators, $unwind and $project, let’s rewrite our aggregation query:
> db.activity.aggregate(
{
"$lookup": {
from : "users",
localField : "userID",
foreignField: "userID",
as : "userInfo" }
},
{
"$unwind": "$userInfo"
},
{
"$project": {
"UserID":"$userID",
"UserName" : "$userInfo.username",
"City" : "$userInfo.city",
"activity" : 1,
"_id": 0 }
}
)

image
Voila, required data is ready Thumbs up

Saturday 21 May 2016

Index Filters in MongoDB

MongoDB query optimizer processes queries and pick out the most efficient query plan for a query. MongoDB query system uses this plan each time the query runs. MongoDB optimizer chooses the optimal index (if indexes are available) for a query.
MongoDB optimizer works very well but sometimes we may have a better idea of which index to use for a given query. We can run the hint() method on a query to override query optimizer’s index selection process and tell the system which index should be used for the given query. So we have to specify hint() method from client side every time we want to override the index selection process. Sometimes we may have better idea about a query and the index to be used for that query and also we don’t want end user to override the index selection process by providing hint(). For all these the solution is Index Filters.
Index filter provides us a temporary (index filter do not persist after shutdown) way to inform MongoDB that a particular query type should use particular index. It determines which indexes the optimizer evaluates for a query shape (a query shape consists of the query itself, any sort criteria and any projection specifications). So if we have specified an index filter for a specific query type, then we don’t have add hint() to the same query. Also hint() is ignored by MongoDB when index filter exists for the particular query shape.
I will show one example to clarify the concept.
I have one collection users, having following data:  
{ "userID" : 1001, "name" : "Pranab Sharma", "city" : "Mumbai", "favFood" : "Chinese", "favDrink" : "beer" }
{ "userID" : 1002, "name" : "Danish Khan", "city" : "Guwahati", "favFood" : "Chinese", "favDrink" : "beer" }
{ "userID" : 1003, "name" : "Samir Das", "city" : "Mumbai", "favFood" : "Continental", "favDrink" : "milk" }
{ "userID" : 1004, "name" : "John Butler", "city" : "Mumbai", "favFood" : "Indian", "favDrink" : "vodka" }
{ "userID" : 1005, "name" : "Xi Xen", "city" : "Guwahati", "favFood" : "Chinese", "favDrink" : "wine" }
{ "userID" : 1006, "name" : "Vladimir Pulaxy", "city" : "Guwahati", "favFood" : "Chinese", "favDrink" : "beer" }
{ "userID" : 1007, "name" : "Karina Ali", "city" : "Mumbai", "favFood" : "Mexican", "favDrink" : "beer" }

This collection has two user defined indexes:
  • { "userID" : 1, "favDrink" : 1  }
  • { "userID" : 1, "city" : 1 }
Suppose we have a query which finds out the users having userID greater than equal to 1003, loves to drink beer and then sorts the result by the city field.
db.users.find({"userID" : {"$gte": 1003}, "favDrink": "beer"}).sort({"city": 1})
After running the query, if we check the execution stats of the query in MongoDB’s log (set the log level to 1 to get execution stats of the query using the command:  db.adminCommand({setParameter:1, logLevel:1} ) :
[conn1] command test.users command: find { find: "users", filter: { userID: { $gte: 1003.0 }, favDrink: "beer" }, sort: { city: 1.0 } } planSummary: IXSCAN { userID: 1.0, city: 1.0 } keysExamined:5 docsExamined:5 hasSortStage:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:2 reslen:342
From the above log line, we can see that the query optimizer has choosen the index { "userID" : 1, "city" : 1 } for our query. It scanned 5 index entries and 5 documents scanned.
Suppose, we know that if we use the index {"userID":1, "favDrink": 1 } for this query, then the system will require less number of document scanning. Let’s run the query providing this index in hint():
db.users.find({"userID" : {"$gte": 1003}, "favDrink": "beer"}).sort({"city": 1}).hint({"userID":1, "favDrink": 1 })
Now the execution stats for this query:
[conn1] command test.users command: find { find: "users", filter: { userID: { $gte: 1003.0 }, favDrink: "beer" }, sort: { city: 1.0 }, hint: { userID: 1.0, favDrink: 1.0 } } planSummary: IXSCAN { userID: 1.0, favDrink: 1.0 } keysExamined:5 docsExamined:2 hasSortStage:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:2 reslen:342
This time MongoDB scanned 2 documents, 3 documents less than our previous attempt without hint().
Say, we want to enforce the use of the index {"userID":1, "favDrink": 1 }  for our above query. So we can set an index filter Smile, instead of informing everyone to use the index {"userID":1, "favDrink": 1 } in hint().

To create an index filter, we can use the command planCacheSetFilter. This command has the follwing syntax:
db.runCommand(
   {
      planCacheSetFilter: <collection>,
      query: <query>,
      sort: <sort>,
      projection: <projection>,
      indexes: [ <index1>, <index2>, ...]
   }
)

So for our example, the command will be:
db.runCommand(
{
     planCacheSetFilter: "users",
     query: {"userID" : {"$gte": 1003}, "favDrink": "beer"},
     sort: {"city": 1},
     projection: {},
     indexes: [{"userID":1, "favDrink": 1 }]
} )


image
Our index filter is in place, now let’s run the query without hint():
db.users.find({"userID" : {"$gte": 1003}, "favDrink": "beer"}).sort({"city": 1})
image
Checking the execution status of the query:
[conn1] command test.users command: find { find: "users", filter: { userID: { $gte: 1003.0 }, favDrink: "beer" }, sort: { city: 1.0 } } planSummary: IXSCAN { userID: 1.0, favDrink: 1.0 } keysExamined:5 docsExamined:2 hasSortStage:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:2 reslen:342
Yes, we can see MongoDB used the index {"userID":1, "favDrink": 1 } for our query (as we specified in our index filter), so our index filter worked.


To check whether MongoDB really applied an index filter for our query we can use the explain() method and check the indexFilterSet field. If it is set to true, that means MongoDB had applied index filter.
image

If we can change the comparision value for userID field’s $gte and for favDrink, then also our index filter will work.
Let’s examine:
db.users.find({"userID" : {"$gte": 1001}, "favDrink": "wine"}).sort({"city": 1})
image
[conn1] command test.users command: find { find: "users", filter: { userID: { $gte: 1001.0 }, favDrink: "wine" }, sort: { city: 1.0 } } planSummary: IXSCAN { userID: 1.0, favDrink: 1.0 } keysExamined:6 docsExamined:1 hasSortStage:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:1 reslen:221
We can see that MongoDB used the index that we specified in the index filter definition. As changing the value does not change the query shape, so MongoDB used that index filter.
But if we change $gte to say $gt or $lt then our index filter will not work and MongoDB will again use the index { "userID" : 1, "city" : 1 }.
Let’s examine:
db.users.find({"userID" : {"$gt": 1003}, "favDrink": "beer"}).sort({"city": 1})
command test.users command: find { find: "users", filter: { userID: { $gt: 1003.0 }, favDrink: "beer" }, sort: { city: 1.0 } } planSummary: IXSCAN { userID: 1.0, city: 1.0 } keysExamined:4 docsExamined:4 hasSortStage:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:2 reslen:342
From above log, we can see that MongoDB used the default index { userID: 1, city: 1 }, as changing $gte to $gt changed the query shape.


Now let’s examine if providing a hint for a query works, if we have index filter in place for that query:
 db.users.find({"userID" : {"$gte": 1001}, "favDrink": "wine"}).sort({"city": 1}).hint({ userID: 1, city: 1 })
We provided the { userID: 1, city: 1 } index as a hint to our query.
[conn1] command test.users command: find { find: "users", filter: { userID: { $gte: 1001.0 }, favDrink: "wine" }, sort: { city: 1.0 }, hint: { userID: 1.0, city: 1.0 } } planSummary: IXSCAN { userID: 1.0, favDrink: 1.0 } keysExamined:6 docsExamined:1 hasSortStage:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:1 reslen:221
From the log, it is clear that MongoDB used the { userID: 1, favDrink: 1 } index, it did not consider the index { userID: 1, city: 1 } that we specified in the hint.


We can use the planCacheListFilters command to get the list of index filters for a given collection:
db.runCommand( { planCacheListFilters : "users" })
image

Also we can run the planCacheClearFilters command to remove a specific index filter or all the index filters in a collection. To remove a specific index filter we have to specify the query shape. For our example:
db.runCommand(
{
    planCacheClearFilters: "users",
    "query" : {"userID" : {"$gte" : 1003},"favDrink" : "beer"},
    "sort": {"city" : 1},
    "projection": {}
} )
image
To clear all index filters on a collection, just omit the query shape in the planCacheClearFilters command:
db.runCommand({planCacheClearFilters: "users"})
image
Index filter is a very nice tool for optimizing MongoDB experience, so try it out and enjoy Thumbs up.

Friday 20 May 2016

MongoDB Recipes: How to change MongoDB’s sort buffer size

When MongoDB could not use an index obtain the sort order for a query, then it sorts the results in memory. If the sort operation consumes more than 32 megabytes, MongoDB returns an error:
"Executor error during find command: OperationFailed: Sort operation used more than the maximum 33554432 bytes of RAM. Add an index, or specify a smaller limit."

image
As written in MongoDB document, to avoid this error we can either create an index supporting the sort operation or we can use sort() in conjunction with limit()
Also memory usage limit for sorts can be configured via the internalQueryExecMaxBlockingSortBytes parameter. In the following example I am setting the sort buffer size to 128MB:
> db.adminCommand({"setParameter": 1, "internalQueryExecMaxBlockingSortBytes" : 134217728})
image
Now my MongoDB will use 128MB memory for the sorts that could not use an index .