My MongoDB Class Notes from the M101P: MongoDB for Developers back in September 2014
Install on Ubuntu with: sudo apt-get install -y mongodb-org
Run on Ubuntu with: sudo service mongod start
MongoDB basics:
mongo is the shell used to connect to the database; mongod is the process that runs the database server (default port 27017)
By default data goes to /data/db, so we'll need to create the data directory:
sudo mkdir -p /data/db
sudo chmod 777 /data/db
# for local dev only
When you run mongo, you connect to mongodb://127.0.0.1:27017 by default
db
# to check what db you are in
db.names.insert({'name':'Will'})
Returns: WriteResult({ "nInserted" : 1 })
db.names.find()
Returns: { "_id" : ObjectId("5420f0510002277afae9f046"), "name" : "Will" }
You can use Bottle as a micro web framework and connect to pymongo
python blog.py
http://localhost:8082
JSON contains dictionaries and arrays. BSON is a superset of JSON with support for additional data types like datetimes; MongoDB stores documents as BSON.
JSON
{ name: 'value', city: 'value', interests: [{..}, {..}] }
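A rough Python illustration of the point above (the field names here are made up): a JSON document maps onto nested dicts and lists, and plain JSON cannot encode a datetime, which is one of the extra types BSON adds.

```python
import json
from datetime import datetime

# a JSON document is just nested dicts and lists
doc = {"name": "Will", "city": "NYC",
       "interests": [{"sport": "tennis"}, {"sport": "cycling"}]}
assert json.loads(json.dumps(doc)) == doc  # round-trips cleanly

# plain JSON has no datetime type -- BSON does
try:
    json.dumps({"when": datetime(2014, 9, 1)})
    encoded = True
except TypeError:
    encoded = False
assert encoded is False
```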
CRUD (Create, Read, Update, Delete) operations in MongoDB
MongoDB’s CRUD operations exist as methods/functions in the programming language APIs, not as a separate query language.
The Mongo shell is an interactive JavaScript interpreter
z = { a : 1 }
z.a
Returns 1, usually dot notation means 'a' is a property of 'z'
z["a"]
Returns 1, usually bracket notation means data lookup
Assuming database of ‘db’ and collection of ‘people’ or ‘scores’
Code:
doc = { "name" : "Will", "age" : 29, "profession" : "programmer"}
db.people.insert( doc )
> WriteResult({ "nInserted" : 1 })
doc = {"_id" : ObjectId("43243242cdffd34242342"), "student": 19, "type": "essay", "score": 88 }
Code:
db.people.find()
> { "_id" : ObjectId("542106ac0002277afae9f047"), "name" : "Will", "age" : 29, "profession" : "programmer" }
# Notice that the "_id" is filled in automatically (by default an ObjectID, but you can specify your own type if you'd like)
Can also specify criteria like:
db.scores.find( { student: 19, type: "essay" } )
# Displays all fields in result
db.scores.find( { student: 19, type: "essay"}, { "score": true, "_id": false })
# Only display score in result
Comparison (e.g. Greater Than, Less Than):
db.scores.find( { score : { $gt: 95 } } )
db.scores.find( { score : { $gt: 95, $lte : 98 }, type : "essay" } )
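A rough pure-Python sketch of what these comparison filters do (the sample documents are made up):

```python
# sample documents, analogous to a 'scores' collection
scores = [
    {"student": 19, "type": "essay", "score": 88},
    {"student": 20, "type": "essay", "score": 96},
    {"student": 21, "type": "exam",  "score": 97},
    {"student": 22, "type": "essay", "score": 98},
]

# equivalent of db.scores.find({score: {$gt: 95, $lte: 98}, type: "essay"})
matches = [d for d in scores if 95 < d["score"] <= 98 and d["type"] == "essay"]
print([d["student"] for d in matches])  # [20, 22]
```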
Comparison Exists
db.people.find( { profession: { $exists: true } } )
By Type
db.people.find( { name : { $type : 2 } } )
# Note: Type 2 is a string
Regex
db.people.find( { name: { $regex: "e$" } } )
OR
db.people.find( { $or : [ { name: { $regex : "e$" } }, { age : { $exists: true } } ] } )
AND
Below two are similar use cases of AND
db.people.find( { $and : [ { name : { $gt : "C" } }, { name : { $regex : "a" } } ] } )
db.people.find( { name : { $gt : "C", $regex : "a" } } )
ALL
db.accounts.find( { favorites : { $all : [ "pretzels", "beer" ] } } )
# Queries anything that has all the specified elements (e.g. "pretzels" and "beer" in any order)
IN
db.accounts.find( { name: { $in : [ "Howard", "Will" ] } } )
#Any document that contains either "Howard" or "Will" in the specified field (name)
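A rough Python sketch of the difference between $all and $in (sample accounts made up, matching the array examples further down):

```python
accounts = [
    {"name": "George", "favorites": ["ice cream", "pretzels"]},
    {"name": "Will",   "favorites": ["pretzels", "beer"]},
]

# $all: every listed element must appear in the array, in any order
all_match = [a["name"] for a in accounts
             if {"pretzels", "beer"} <= set(a["favorites"])]

# $in: the field's value must equal one of the listed values
in_match = [a["name"] for a in accounts if a["name"] in ["Howard", "Will"]]

print(all_match)  # ['Will']
print(in_match)   # ['Will']
```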
FindOne
db.people.findOne()
Gets a single document out (the first one in natural order)
>{
"_id" : ObjectId("542106ac0002277afae9f047"),
"name" : "Will",
"age" : 29,
"profession" : "programmer"
}
db.people.findOne({ "name": "Will" }, { "name":true, "_id": false} )
>{ "name" : "Will" }
Update
Update (warning: dangerous to use, think about using $set)
db.people.update( { name : "Smith" }, { name : "Thompson", salary: 50000})
# Updates the match on the left ("Smith") with the new data on the right ("Thompson" and "salary")
Update with $set
db.people.update( { name : "Alice" }, { $set : { age : 30 } } )
# Queries the left ("Alice") and sets the age to 30, even if age didn't exist previously
Update with $unset
db.people.update( { name : "Jones" }, { $unset : { profession : 1 } } )
# Queries the left ("Jones") and unsets the right field (profession) so it no longer is there
Update with upsert
db.people.update( { name : "George" }, { $set : { age : 40 } }, { upsert : true } )
# If no document matches "George", insert a new one with the $set fields applied
Update with multi
# If you want to update multiple documents (instead of just one), you need the additional parameter `multi`
db.people.update( { }, { $set : { title : "Dr" } }, { multi : true } )
Remove
db.people.remove( { } )
# Removes all documents from the collection one-by-one
db.people.drop()
# Removes all documents from the collection all at once (much faster)
Arrays
Make Query Pretty
db.accounts.find().pretty()
Querying Inside Arrays
db.accounts.find().pretty()
{
"_id" : ObjectId("ffjdlksfjlsa"),
"name" : "George",
"favorites" : [
"ice cream",
"pretzels"
]}
{
"_id" : ObjectId("fjdsklfjal"),
"name" : "Will",
"favorites" : [
"pretzels",
"beer"
]}
db.accounts.find( { favorites : "pretzels" } )
# Returns both documents
db.accounts.find( { favorites : "beer" } )
# Only returns object with Will
# Note: Only looks at top level depth; no recursion to find an arbitrary depth
Cursors
When a query like db.people.find() is run, a cursor is returned that iterates through the matching documents
You can hold onto cursors and iterate through them like:
cur = db.people.find(); null;
>cur.hasNext()
>cur.next()
Cursor Limit
cur.limit(5)
Cursor Sort
cur.sort( { name: -1 } );
# Reverse order by name (lexicographic)
Cursor Skip
cur.skip(2);
# Can chain together (e.g. sort and limit and skip)
cur.sort( { name: -1 }).limit(3).skip(2);
# Order Processed as first the sort, then the skip, then the limit
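A rough Python sketch of that processing order (sample names made up): sort first, then skip, then limit, regardless of the order the methods are chained in.

```python
people = [{"name": n} for n in ["Alice", "Bob", "Carol", "Dave", "Eve"]]

# cur.sort({name: -1}).limit(3).skip(2) is applied as sort -> skip -> limit
ordered = sorted(people, key=lambda d: d["name"], reverse=True)
page = ordered[2:2 + 3]  # skip 2, then take up to 3
print([d["name"] for d in page])  # ['Carol', 'Bob', 'Alice']
```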
Count
db.scores.count({ type: "exam" })
>1000
Performs the query but just counts the matching documents instead of returning them
Load JSON file
$ mongoimport -d students -c grades < grades.json
-d means to pick the database to use
-c means to pick the table/collection to use
< means from this json file
In Mongo:
> use students
> db.grades.count()
Homework 2.1:
`db.grades.find({ score: { $gte: 65} }, {student_id:1}).sort({ score:1 })`
In relational databases, you want to design for third normal form (e.g. post_id from the comment table should match the post table's id). For MongoDB, you want to design so that it's conducive to the application's data access patterns (for example, what is read together, what is written together?).
MongoDB supports Rich Documents
MongoDB does have atomic operations (work is completed for all or none) within a single document
Benefits of Embedding instead of traditional relational joins
One-to-One
One-to-Many
Many-to-Many
(e.g. a 'Books' document has _id:12, authors:[27] while an 'Authors' document has _id:27, books:[12])
Checking Indexes
We can create indexes and see how the indexes are created:
db.students.ensureIndex({'teachers':1})
How do you represent a tree? (e.g. categories of products in Amazon based on season) Products collection and Category collections, can add ancestors in Categories
When to denormalize?
GRIDFS
GRIDFS allows you to store files larger than the 16MB per-document limit
For MongoDB, you keep your keys ordered in what’s called a B-Tree (a self-balancing tree data structure that keeps data sorted and allows efficient searches)
To make an Index on student_id
field and make ascending
db.students.ensureIndex({student_id:1})
To make a compound index (i.e. index on multiple fields); made up
of student_id
(ascending) and class
(descending)
db.students.ensureIndex( {student_id:1, class:-1 })
ASC vs DESC doesn’t matter much for searching, but matters for sorting
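A rough Python sketch of the sort order a {student_id:1, class:-1} index supports (data made up): student_id ascending, and within equal student_ids, class descending. Python's stable sort lets us compose the two directions.

```python
docs = [
    {"student_id": 2, "class": "bio"},
    {"student_id": 1, "class": "art"},
    {"student_id": 1, "class": "math"},
]

# sort by the secondary key (class, descending) first...
ordered = sorted(docs, key=lambda d: d["class"], reverse=True)
# ...then by the primary key (student_id, ascending); stable sort keeps class order
ordered = sorted(ordered, key=lambda d: d["student_id"])

print([(d["student_id"], d["class"]) for d in ordered])
# [(1, 'math'), (1, 'art'), (2, 'bio')]
```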
Finding Indexes
See all indexes for current database
use school
db.system.indexes.find()
Get specific indexes for a collection
db.students.getIndexes()
Drop indexes for a collection
db.students.dropIndex( {'student_id':1} )
Single Key Index
MongoDB provides support for indexes on any field in a collection or document.
By default, all collections have an index on the _id
field
{ "_id" : ObjectId(...),
"name" : "Alice",
"age" : 27
}
db.friends.ensureIndex( { "name" : 1 } )
Compound Index
A compound index is where a single index structure holds references to multiple fields within a collection’s documents.
{ "_id" : ObjectId(...),
"item": "Banana",
"category" : ["food", "produce", "grocery"],
"stock": 4
}
Single Compound Index
If applications query on the item field, as well as on both the item and stock fields together, you can specify a single compound index to support both of these queries:
db.products.ensureIndex( { "item": 1, "stock": 1 } )
Multikey Index
To index a field that holds an array value, MongoDB adds index items for each item in the array. This multikey index allows MongoDB to return documents from queries using the value of an array.
{ userid: "xyz",
addr:
[
{ zip: "10046", ... },
{ zip: "94301", ... }
],
...
}
{ "addr.zip":1 }
The addr field contains an array of address documents; the address documents contain the zip field.
Example of multikey index:
{name: 'Andrew',
tag: ['cycling', 'tennis', 'football'],
color: 'red',
locations: ['NY', 'CA']}
You can create compound keys like tag, color (one is an array, the other is scalar)
You cannot create compound keys on tags, locations (they’re both arrays; the index would need an entry for every combination of elements, so MongoDB rejects indexes on two parallel arrays)
Indexing isn’t restricted to the first level of keys (can put on sub-arrays)
"addresses": [
{"tag": "vacation",
"phones": [
1,
2]
}...]
db.people.ensureIndex( {'addresses.tag':1} )
db.people.ensureIndex( {'addresses.phones':1} )
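A rough sketch of what a multikey index on 'addresses.tag' stores: one index entry per array element, each pointing back at the document (the data shapes here are made up for illustration).

```python
person = {
    "name": "Will",
    "addresses": [
        {"tag": "vacation", "phones": [1, 2]},
        {"tag": "home",     "phones": [3]},
    ],
}

# one index entry per array element, each referencing the owning document
index_entries = [(addr["tag"], person["name"]) for addr in person["addresses"]]
print(index_entries)  # [('vacation', 'Will'), ('home', 'Will')]
```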
Unique Index
A unique index causes MongoDB to reject all documents that contain a
duplicate value for the indexed field
To create a unique index, set the unique option to true, like:
db.stuff.ensureIndex( { thing: 1}, { unique:true } )
# By default, unique: false on indexes
"_id" is unique, even though it doesn't say it is (when you try to insert into the collection with a value already there, it returns an error)
Removing Duplicates when creating unique indexes
Sparse indexes only contain entries for documents that have the indexed field, even if the indexed field contains a null value. The index skips over any document that is missing the indexed field; the index is “sparse” because it does not include all documents of a collection
E.g. db.addresses.ensureIndex( { "xmpp_id": 1 }, { sparse: true } )
If a sparse index would result in an incomplete result set for queries and sort operations, MongoDB will not use that index unless a ‘hint()’ explicitly specifies the index
E.g. the query { x: { $exists: false } } will not use a sparse index on the x field unless explicitly hinted
Sparse indexes can be used to create unique indexes when the index key is missing from the document
{ a:1, b:2, c:5} { a:10, b:5, c:10} { a:13, b:17} { a:7, b:23}
So this is what we need a sparse index for (Setting an index on ‘c’ with sparse)
db.products.ensureIndex( {c:1}, {unique:true, sparse:true})
db.products.find().sort({c:1}).explain()
# Gets a 'cursor': 'BasicCursor'; all four documents come back (MongoDB won't use the sparse index for the sort on its own)
db.products.find().sort({c:1}).hint({c:1})
# Hinting the sparse index only gets two documents back (instead of all four, since 'c' is missing from two)
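A rough Python sketch of why the sparse index avoids the duplicate-null problem: only documents that actually have the key get index entries, so the two documents missing 'c' never collide under the unique constraint.

```python
products = [
    {"a": 1,  "b": 2,  "c": 5},
    {"a": 10, "b": 5,  "c": 10},
    {"a": 13, "b": 17},
    {"a": 7,  "b": 23},
]

# a sparse index only indexes documents where the key exists
indexed_values = [d["c"] for d in products if "c" in d]
print(indexed_values)  # [5, 10]
# no duplicates among indexed values, so unique:true succeeds
assert len(set(indexed_values)) == len(indexed_values)
```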
By default, an index is created in the foreground (blocks all other writers), which is faster. You can instead build an index in the background (slower, but doesn't block writers) with:
background:true
Explain gives you an explanation of what occurred in the query:
db.foo.find({c:1}).explain()
"cursor" : "BasicCursor"
Can be say "BtreeCursor a_1_b_1_c_1"
This means the database is using the index
Can be say "BasicCursor"
This means no index was used
"isMultiKey" : false
Is this a multikey index?
"n" : 1
Number of documents returned, say 1 that matches the query
"nscannedObjects" : 1
Number of documents scanned in the query
"indexOnly" : true
We can see that we didn't need to go to the actual collection to answer the query, just based on index
"millis" : 3
How many milliseconds it took for the query
"indexBounds" : ...
Shows the bounds that were used for the index
Say we have 3 indexes (a, b, c). When you run a query, MongoDB runs all three index plans at the same time and checks which is quickest. Once the quickest plan is determined (say index b wins), MongoDB remembers it and reuses it for roughly the next 100 queries.
Indexes help with read performance, but take hard drive space and slow down writes
db.students.totalIndexSize()
db.students.getIndexes()
db.students.dropIndex({'student_id':1})
Regular Index, has 1:1 index to documents
Sparse Index
Multikey Index
tags:[ _, _, _, _]
If you want to specifically tell MongoDB to use this index (instead of normally letting MongoDB figure it out), you can use hinting for indexes
2D
3D
The profiler writes to ‘system.profile’ with the following levels
0 = log off
1 = log only slow queries (more a "why is my query so slow" tool)
2 = log all queries (more a general debugging feature)
e.g. mongod --dbpath /usr/local/var/mongodb --profile 1 --slowms 2
e.g. db.system.profile.find({millis: {$gt: 1000}})
db.getProfilingLevel()
# returns 1
db.getProfilingStatus()
{ "was" : 1, "slowms" : 2 }
mongostat shows you the queries per second, flushes per second, etc.
Usually interested in 'idx miss %' (you want your index in memory; if this has a high number, then your index can't fit in memory so queries will be much slower)
If you can’t get the performance you want from a single server, you can deploy
multiple mongod servers (shards) with one mongos server (that routes to the shards)
E.g. say your shard key is student_id; the mongos router will then direct each query to the specific shard holding that key range
Products Table containing
name | category | manufacturer | price
ipod | tablet | Apple | $300
nexus | cell | Samsung | $200
Relational SQL’s GROUP BY would create
manufacturer | count(*)
Apple | 2
Samsung | 3
use aggregate
db.products.aggregate([
{ $group:
{
_id: "$manufacturer",
num_products:{$sum:1}
}
}
])
{ "_id" : "Amazon", "num_products" : 2 }
{ "_id" : "Apple", "num_products" : 1 }
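A rough pure-Python equivalent of this $group stage (the sample products are made up so the counts match the output above):

```python
from collections import Counter

products = [
    {"name": "Kindle", "manufacturer": "Amazon"},
    {"name": "Fire",   "manufacturer": "Amazon"},
    {"name": "iPad",   "manufacturer": "Apple"},
]

# equivalent of {$group: {_id: "$manufacturer", num_products: {$sum: 1}}}
counts = Counter(p["manufacturer"] for p in products)
result = [{"_id": m, "num_products": n} for m, n in sorted(counts.items())]
print(result)
# [{'_id': 'Amazon', 'num_products': 2}, {'_id': 'Apple', 'num_products': 1}]
```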
Similar to Unix pipes (e.g. du -s * | sort -n)
MongoDB has the following pipeline stages (these can appear more than once):
$project
$match
$group
$sort
$skip
$limit
$unwind
$out
$redact
$geoNear
From products, we do a group, then get the results back
Data is:
db.products.find().pretty()
{ "_id" : ObjectId("343l4jljdfklsf"),
"name" : "iPad 16GB Wifi",
"manufacturer" : "Apple",
"category" : "Tablets",
"price" : 499
}
Use query:
db.products.aggregate([
{$group:
{
_id : "$manufacturer",
num_products:{$sum:1}
}
}
])
What happens: $group creates one result document per distinct manufacturer; the aggregate key (num_products) is built up by $sum:1, which adds 1 for every document that falls into the group.
SQL equivalent of:
SELECT manufacturer, category, count(*) FROM products
GROUP BY manufacturer, category
MongoDB equivalent:
db.products.aggregate([
{$group:
{ _id: {
"manufacturer":"$manufacturer",
"category":"$category"},
num_products:{$sum:1}
}
}
])
returns:
manufacturer: Apple, category: Tablet
manufacturer: Apple, category: Laptop
Result: the _id key can also be a compound key (a document), grouping by both fields at once.
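A rough Python sketch of a compound group key (sample data made up): grouping on a tuple of both fields mirrors using a sub-document as _id.

```python
from collections import Counter

products = [
    {"manufacturer": "Apple",   "category": "Tablet"},
    {"manufacturer": "Apple",   "category": "Tablet"},
    {"manufacturer": "Apple",   "category": "Laptop"},
    {"manufacturer": "Samsung", "category": "Cell"},
]

# compound _id: each distinct (manufacturer, category) pair is its own group
counts = Counter((p["manufacturer"], p["category"]) for p in products)
print(counts[("Apple", "Tablet")])  # 2
```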
Aggregation options:
explain - lets you see the query plan for the aggregation (good for optimization)
allowDiskUse - controls whether or not the aggregation framework will use the hard drive; any stage of aggregation is limited to 100MB of memory (e.g. a sort might exceed 100MB and fail), so specify allowDiskUse if you're going to exceed 100MB
cursor - controls the cursor (batch) size
If you're dealing with arrays, you need to move the data out of array form and make it more flat by using $unwind to 'unjoin' the data
E.g. { a:1, b:2, c:['apple', 'pear', 'orange'] }
$unwind: '$c'
{ a:1, b:2, c:'apple'}
{ a:1, b:2, c:'pear'}
{ a:1, b:2, c:'orange'}
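A rough pure-Python sketch of what $unwind does to that document:

```python
doc = {"a": 1, "b": 2, "c": ["apple", "pear", "orange"]}

# {$unwind: "$c"} emits one copy of the document per array element
unwound = [dict(doc, c=item) for item in doc["c"]]
for d in unwound:
    print(d)
# one document per element: c = 'apple', 'pear', 'orange'
```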
Another Example of an $unwind
db.posts.aggregate([
/* unwind by tags */
{"$unwind":"$tags"},
/* now group by tags, counting each tag */
{"$group":
{"_id":"$tags",
"count":{$sum:1}
}
},
/* sort by popularity */
{"$sort":{"count":-1}},
/* show me the top 10 */
{"$limit":10},
/* change the name of _id to be tag */
{"$project":
{ _id:0,
'tag':'$_id',
'count':1
}
}
])
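A rough pure-Python equivalent of that whole pipeline (sample posts made up): unwind the tags, group/count, sort descending, limit to 10, and project {tag, count}.

```python
from collections import Counter

posts = [
    {"title": "p1", "tags": ["mongodb", "python"]},
    {"title": "p2", "tags": ["mongodb"]},
    {"title": "p3", "tags": ["python", "bottle"]},
]

# unwind + group/count + sort desc + limit 10 + project {tag, count}
counts = Counter(tag for post in posts for tag in post["tags"])
top_tags = [{"tag": t, "count": c} for t, c in counts.most_common(10)]
print(top_tags[0])  # {'tag': 'mongodb', 'count': 2}
```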
Double $unwind
/* given this document */
{
"_id": ObjectId("5890lkerjewljljflslsfja"),
"name": "T-Shirt",
"sizes": [
"Small",
"Medium",
"Large",
"X-Large"
],
"colors": [
"navy",
"black",
"orange",
"red"
]
}
/* unwinds by size, then by colors */
db.inventory.aggregate([
{$unwind: "$sizes"},
{$unwind: "$colors"},
{$group:
{
'_id': { 'size': '$sizes', 'color': '$colors'},
'count': {'$sum': 1}
}
}
])
/* each result of the double unwind is one size/color combination, e.g. */
{ "_id" : { "size" : "Small", "color" : "navy" }, "count" : 1 }
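A rough Python sketch of why the double $unwind yields a document per size/color pair: two unwinds produce the cross product of the two arrays.

```python
from itertools import product

shirt = {"name": "T-Shirt",
         "sizes": ["Small", "Medium", "Large", "X-Large"],
         "colors": ["navy", "black", "orange", "red"]}

# two $unwind stages produce one document per (size, color) combination
combos = [{"size": s, "color": c}
          for s, c in product(shirt["sizes"], shirt["colors"])]
print(len(combos))  # 16
```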
$out receives the results of the aggregation and redirects the output to a named collection.
{
"_id" : ObjectId("432ljfkldasjfldsajf"),
"first_name": "William",
"last_name": "Liu",
"points": 2,
"moves": [
5,
6,
8
]
}
db.games.aggregate([
{$group:
{
_id: {
first_name: "$first_name",
last_name: "$last_name"
},
points: {$sum: "$points"}
}
},
{$out: 'summary_results'}
])
This creates a new collection summary_results
Code:
db.summary_results.find()
{ "_id": { "first_name": "William", "last_name": "Liu" }, "points": 19 }
Application Engineering is concerned about:
Say you have your application that does writes (e.g. updates and inserts) to mongod
Mongo Shell is also another application that does inserts and updates
The w parameter means whether or not you wait for the write to be acknowledged (1 = wait, 0 = don't wait).
The j parameter is the journal option: wait for the write to be committed to the journal on disk.
To solve both of these issues, we use Replication.
Code
$mkdir -p /data/rs1 /data/rs2 /data/rs3
mongod --replSet rs1 --logpath "1.log" --dbpath /data/rs1 --port 27017 --fork
mongod --replSet rs1 --logpath "2.log" --dbpath /data/rs2 --port 27018 --fork
mongod --replSet rs1 --logpath "3.log" --dbpath /data/rs3 --port 27019 --fork
# Ties the replica set members together (run in the mongo shell)
config = { _id: "rs1", members:[
{ _id:0, host: "localhost:27017"},
{ _id:1, host: "localhost:27018"},
{ _id:2, host: "localhost:27019"} ]
};
rs.initiate(config);
rs.status()
Shard Key
There’s a range based approach to shard key
mongoexport and mongoimport are utilities for importing and exporting data to and from a MongoDB instance.
Do not use these for production backups (instead use mongodump and mongorestore for that functionality).
mongoexport
is a utility that produces either JSON or CSV of the data stored in a MongoDB instance.
mongoimport
is a utility for importing data into a MongoDB instance.