My MongoDB Class Notes from the M101P: MongoDB for Developers back in September 2014
Install on Ubuntu with: sudo apt-get install -y mongodb-org
Run on Ubuntu with: sudo service mongod start
MongoDB basics:
mongo is the shell used to connect to the database; mongod is the process that runs the database server (default port 27017)
By default data goes to /data/db, so we'll need to create the data directory:
sudo mkdir -p /data/db
sudo chmod 777 /data/db
# for local dev only
When you run mongo, you connect to mongodb://127.0.0.1:27017 by default
db
# to check what db you are in
db.names.insert({'name':'Will'})
Returns: WriteResult({ "nInserted" : 1 })
db.names.find()
Returns: { "_id" : ObjectId("5420f0510002277afae9f046"), "name" : "Will" }
You can use Bottle as a micro web framework and connect to pymongo
python blog.py
http://localhost:8082
JSON contains dictionaries and arrays. BSON is a superset of JSON with support for additional data types like datetimes; MongoDB stores documents as BSON.
JSON
{ name: 'value', city: 'value', interests: [{..}, {..}] }
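A rough Python illustration of the point above (the field names here are made up): a JSON document maps onto nested dicts and lists, and plain JSON cannot encode a datetime, which is one of the extra types BSON adds.

```python
import json
from datetime import datetime

# a JSON document is just nested dicts and lists
doc = {"name": "Will", "city": "NYC",
       "interests": [{"sport": "tennis"}, {"sport": "cycling"}]}
assert json.loads(json.dumps(doc)) == doc  # round-trips cleanly

# plain JSON has no datetime type -- BSON does
try:
    json.dumps({"when": datetime(2014, 9, 1)})
    encoded = True
except TypeError:
    encoded = False
assert encoded is False
```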
CRUD (Create, Read, Update, Delete) operations in MongoDB
MongoDB’s CRUD operations exist as methods/functions in the programming language APIs, not as a separate query language.
The Mongo shell is an interactive JavaScript interpreter
z = { a : 1 }
z.a
Returns 1, usually dot notation means 'a' is a property of 'z'
z["a"]
Returns 1, usually bracket notation means data lookup
Assuming database of ‘db’ and collection of ‘people’ or ‘scores’
Code:
doc = { "name" : "Will", "age" : 29, "profession" : "programmer"}
db.people.insert( doc )
> WriteResult({ "nInserted" : 1 })
doc = {"_id" : ObjectId("43243242cdffd34242342"), "student": 19, "type": "essay", "score": 88 }
Code:
db.people.find()
> { "_id" : ObjectId("542106ac0002277afae9f047"), "name" : "Will", "age" : 29, "profession" : "programmer" }
# Notice that the "_id" is filled in automatically (by default an ObjectID, but you can specify your own type if you'd like)
Can also specify criteria like:
db.scores.find( { student: 19, type: "essay" } )
# Displays all fields in result
db.scores.find( { student: 19, type: "essay"}, { "score": true, "_id": false })
# Only display score in result
Comparison (e.g. Greater Than, Less Than):
db.scores.find( { score : { $gt: 95 } } )
db.scores.find( { score : { $gt: 95, $lte : 98 }, type : "essay" } )
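A rough pure-Python sketch of what these comparison filters do (the sample documents are made up):

```python
# sample documents, analogous to a 'scores' collection
scores = [
    {"student": 19, "type": "essay", "score": 88},
    {"student": 20, "type": "essay", "score": 96},
    {"student": 21, "type": "exam",  "score": 97},
    {"student": 22, "type": "essay", "score": 98},
]

# equivalent of db.scores.find({score: {$gt: 95, $lte: 98}, type: "essay"})
matches = [d for d in scores if 95 < d["score"] <= 98 and d["type"] == "essay"]
print([d["student"] for d in matches])  # [20, 22]
```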
Comparison Exists
db.people.find( { profession: { $exists: true } } )
By Type
db.people.find( { name : { $type : 2 } } )
# Note: Type 2 is a string
Regex
db.people.find( { name: { $regex: "e$" } } )
OR
db.people.find( { $or : [ { name: { $regex : "e$" } }, { age : { $exists: true } } ] } )
AND
Below two are similar use cases of AND
db.people.find( { $and : [ { name : { $gt : "C" } }, { name : { $regex : "a" } } ] } )
db.people.find( { name : { $gt : "C", $regex : "a" } } )
ALL
db.accounts.find( { favorites : { $all : [ "pretzels", "beer" ] } } )
# Queries anything that has all the specified elements (e.g. "pretzels" and "beer" in any order)
IN
db.accounts.find( { name: { $in : [ "Howard", "Will" ] } } )
#Any document that contains either "Howard" or "Will" in the specified field (name)
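A rough Python sketch of the difference between $all and $in (sample accounts made up, matching the array examples further down):

```python
accounts = [
    {"name": "George", "favorites": ["ice cream", "pretzels"]},
    {"name": "Will",   "favorites": ["pretzels", "beer"]},
]

# $all: every listed element must appear in the array, in any order
all_match = [a["name"] for a in accounts
             if {"pretzels", "beer"} <= set(a["favorites"])]

# $in: the field's value must equal one of the listed values
in_match = [a["name"] for a in accounts if a["name"] in ["Howard", "Will"]]

print(all_match)  # ['Will']
print(in_match)   # ['Will']
```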
FindOne
db.people.findOne()
Gets a single document out (the first one in natural order)
>{
"_id" : ObjectId("542106ac0002277afae9f047"),
"name" : "Will",
"age" : 29,
"profession" : "programmer"
}
db.people.findOne({ "name": "Will" }, { "name":true, "_id": false} )
>{ "name" : "Will" }
Update
Update (warning: dangerous to use, think about using $set)
db.people.update( { name : "Smith" }, { name : "Thompson", salary: 50000})
# Updates the match on the left ("Smith") with the new data on the right ("Thompson" and "salary")
Update with $set
db.people.update( { name : "Alice" }, { $set : { age : 30 } } )
# Queries the left ("Alice") and sets the age to 30, even if age didn't exist previously
Update with $unset
db.people.update( { name : "Jones" }, { $unset : { profession : 1 } } )
# Queries the left ("Jones") and unsets the right field (profession) so it no longer is there
Update with upsert
db.people.update( { name : "George" }, { $set : { age : 40 } }, { upsert : true } )
# If no document matches "George", insert a new one with the $set fields applied
Update with multi
# If you want to update multiple documents (instead of just one), you need the additional parameter `multi`
db.people.update( { }, { $set : { title : "Dr" } }, { multi : true } )
Remove
db.people.remove( { } )
# Removes all documents from the collection one-by-one
db.people.drop()
# Removes all documents from the collection all at once (much faster)
Arrays
Make Query Pretty
db.accounts.find().pretty()
Querying Inside Arrays
db.accounts.find().pretty()
{
"_id" : ObjectId("ffjdlksfjlsa"),
"name" : "George",
"favorites" : [
"ice cream",
"pretzels"
]}
{
"_id" : ObjectId("fjdsklfjal"),
"name" : "Will",
"favorites" : [
"pretzels",
"beer"
]}
db.accounts.find( { favorites : "pretzels" } )
# Returns both documents
db.accounts.find( { favorites : "beer" } )
# Only returns object with Will
# Note: Only looks at top level depth; no recursion to find an arbitrary depth
Cursors
When a query like db.people.find() is run, a cursor is returned that iterates through the matching documents
You can hold onto cursors and iterate through them like:
cur = db.people.find(); null;
>cur.hasNext()
>cur.next()
Cursor Limit
cur.limit(5)
Cursor Sort
cur.sort( { name: -1 } );
# Reverse order by name (lexicographic)
Cursor Skip
cur.skip(2);
# Can chain together (e.g. sort and limit and skip)
cur.sort( { name: -1 }).limit(3).skip(2);
# Order Processed as first the sort, then the skip, then the limit
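A rough Python sketch of that processing order (sample names made up): sort first, then skip, then limit, regardless of the order the methods are chained in.

```python
people = [{"name": n} for n in ["Alice", "Bob", "Carol", "Dave", "Eve"]]

# cur.sort({name: -1}).limit(3).skip(2) is applied as sort -> skip -> limit
ordered = sorted(people, key=lambda d: d["name"], reverse=True)
page = ordered[2:2 + 3]  # skip 2, then take up to 3
print([d["name"] for d in page])  # ['Carol', 'Bob', 'Alice']
```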
Count
db.scores.count({ type: "exam" })
>1000
Performs the query but just counts the matching documents instead of returning them
Load JSON file
$ mongoimport -d students -c grades < grades.json
-d means to pick the database to use
-c means to pick the table/collection to use
< means from this json file
In Mongo:
> use students
> db.grades.count()
Homework 2.1:
`db.grades.find({ score: { $gte: 65} }, {student_id:1}).sort({ score:1 })`
In relational databases, you want to design for third normal form (e.g. post_id from the comment table should match the post table's id). For MongoDB, you want to design so that it's conducive to the application's data access patterns (for example, what is read together, what is written together?).
MongoDB supports Rich Documents
MongoDB does have atomic operations (work is completed for all or none) within a single document
Benefits of Embedding instead of traditional relational joins
One-to-One
One-to-Many
Many-to-Many
(e.g. a 'Books' document has _id:12, authors:[27] while an 'Authors' document has _id:27, books:[12])
Checking Indexes
We can create indexes and see how the indexes are created:
db.students.ensureIndex({'teachers':1})
How do you represent a tree? (e.g. categories of products in Amazon based on season) Products collection and Category collections, can add ancestors in Categories
When to denormalize?
GRIDFS
GRIDFS allows you to store files larger than the 16MB per-document limit
For MongoDB, you keep your keys ordered in what’s called a B-Tree (a self-balancing tree data structure that keeps data sorted and allows efficient searches)
To make an Index on student_id
field and make ascending
db.students.ensureIndex({student_id:1})
To make a compound index (i.e. index on multiple fields); made up
of student_id
(ascending) and class
(descending)
db.students.ensureIndex( {student_id:1, class:-1 })
ASC vs DESC doesn’t matter much for searching, but matters for sorting
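A rough Python sketch of the sort order a {student_id:1, class:-1} index supports (data made up): student_id ascending, and within equal student_ids, class descending. Python's stable sort lets us compose the two directions.

```python
docs = [
    {"student_id": 2, "class": "bio"},
    {"student_id": 1, "class": "art"},
    {"student_id": 1, "class": "math"},
]

# sort by the secondary key (class, descending) first...
ordered = sorted(docs, key=lambda d: d["class"], reverse=True)
# ...then by the primary key (student_id, ascending); stable sort keeps class order
ordered = sorted(ordered, key=lambda d: d["student_id"])

print([(d["student_id"], d["class"]) for d in ordered])
# [(1, 'math'), (1, 'art'), (2, 'bio')]
```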
Finding Indexes
See all indexes for current database
use school
db.system.indexes.find()
Get specific indexes for a collection
db.students.getIndexes()
Drop indexes for a collection
db.students.dropIndex( {'student_id':1} )
Single Key Index
MongoDB provides support for indexes on any field in a collection or document.
By default, all collections have an index on the _id
field
{ "_id" : ObjectId(...),
"name" : "Alice",
"age" : 27
}
db.friends.ensureIndex( { "name" : 1 } )
Compound Index
A compound index is where a single index structure holds references to multiple fields within a collection’s documents.
{ "_id" : ObjectId(...),
"item": "Banana",
"category" : ["food", "produce", "grocery"],
"stock": 4
}
Single Compound Index
If applications query on the item field, as well as on both the item and stock fields together, you can specify a single compound index to support both of these queries:
db.products.ensureIndex( { "item": 1, "stock": 1 } )
Multikey Index
To index a field that holds an array value, MongoDB adds index items for each item in the array. This multikey index allows MongoDB to return documents from queries using the value of an array.
{ userid: "xyz",
addr:
[
{ zip: "10046", ... },
{ zip: "94301", ... }
],
...
}
{ "addr.zip":1 }
The addr field contains an array of address documents; the address documents contain the zip field.
Example of multikey index:
{name: 'Andrew',
tag: ['cycling', 'tennis', 'football'],
color: 'red',
locations: ['NY', 'CA']}
You can create compound keys like tag, color (one is an array, the other is scalar)
You cannot create compound keys on tags, locations (they’re both arrays; the index would need an entry for every combination of elements, so MongoDB rejects indexes on two parallel arrays)
Indexing isn’t restricted to the first level of keys (can put on sub-arrays)
"addresses": [
{"tag": "vacation",
"phones": [
1,
2]
}...]
db.people.ensureIndex( {'addresses.tag':1} )
db.people.ensureIndex( {'addresses.phones':1} )
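A rough sketch of what a multikey index on 'addresses.tag' stores: one index entry per array element, each pointing back at the document (the data shapes here are made up for illustration).

```python
person = {
    "name": "Will",
    "addresses": [
        {"tag": "vacation", "phones": [1, 2]},
        {"tag": "home",     "phones": [3]},
    ],
}

# one index entry per array element, each referencing the owning document
index_entries = [(addr["tag"], person["name"]) for addr in person["addresses"]]
print(index_entries)  # [('vacation', 'Will'), ('home', 'Will')]
```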
Unique Index
A unique index causes MongoDB to reject all documents that contain a
duplicate value for the indexed field
To create a unique index, set the unique option to true, like:
db.stuff.ensureIndex( { thing: 1}, { unique:true } )
# By default, unique: false on indexes
"_id" is unique, even though it doesn't say it is (when you try to insert into the collection with a value already there, it returns an error)
Removing Duplicates when creating unique indexes
Sparse indexes only contain entries for documents that have the indexed field, even if the indexed field contains a null value. The index skips over any document that is missing the indexed field; the index is “sparse” because it does not include all documents of a collection
E.g. db.addresses.ensureIndex( { "xmpp_id": 1 }, { sparse: true } )
If a sparse index would result in an incomplete result set for queries and sort operations, MongoDB will not use that index unless a ‘hint()’ explicitly specifies the index
E.g. the query { x: { $exists: false } } will not use a sparse index on the x field unless explicitly hinted
Sparse indexes can be used to create unique indexes when the index key is missing from the document
{ a:1, b:2, c:5} { a:10, b:5, c:10} { a:13, b:17} { a:7, b:23}
So this is what we need a sparse index for (Setting an index on ‘c’ with sparse)
db.products.ensureIndex( {c:1}, {unique:true, sparse:true})
db.products.find().sort({c:1}).explain()
# Gets a 'cursor': 'BasicCursor'; all four documents come back (MongoDB won't use the sparse index for the sort on its own)
db.products.find().sort({c:1}).hint({c:1})
# Hinting the sparse index only gets two documents back (instead of all four, since 'c' is missing from two)
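A rough Python sketch of why the sparse index avoids the duplicate-null problem: only documents that actually have the key get index entries, so the two documents missing 'c' never collide under the unique constraint.

```python
products = [
    {"a": 1,  "b": 2,  "c": 5},
    {"a": 10, "b": 5,  "c": 10},
    {"a": 13, "b": 17},
    {"a": 7,  "b": 23},
]

# a sparse index only indexes documents where the key exists
indexed_values = [d["c"] for d in products if "c" in d]
print(indexed_values)  # [5, 10]
# no duplicates among indexed values, so unique:true succeeds
assert len(set(indexed_values)) == len(indexed_values)
```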
By default, an index is created in the foreground (blocks all other writers), which is faster. You can instead build an index in the background (slower, but doesn't block writers) with:
background:true
Explain gives you an explanation of what occurred in the query:
db.foo.find({c:1}).explain()
"cursor" : "BasicCursor"
Can be say "BtreeCursor a_1_b_1_c_1"
This means the database is using the index
Can be say "BasicCursor"
This means no index was used
"isMultiKey" : false
Is this a multikey index?
"n" : 1
Number of documents returned, say 1 that matches the query
"nscannedObjects" : 1
Number of documents scanned in the query
"indexOnly" : true
We can see that we didn't need to go to the actual collection to answer the query, just based on index
"millis" : 3
How many milliseconds it took for the query
"indexBounds" : ...
Shows the bounds that were used for the index
Say we have 3 indexes (a, b, c). When you run a query, MongoDB runs all three index plans at the same time and checks which is quickest. Once the quickest plan is determined (say index b wins), MongoDB remembers it and reuses it for roughly the next 100 queries.
Indexes help with read performance, but take hard drive space and slow down writes
db.students.totalIndexSize()
db.students.getIndexes()
db.students.dropIndex({'student_id':1})
Regular Index, has 1:1 index to documents
Sparse Index
Multikey Index
tags:[ _, _, _, _]
If you want to specifically tell MongoDB to use this index (instead of normally letting MongoDB figure it out), you can use hinting for indexes
2D
3D
The profiler writes to ‘system.profile’ with the following levels
0 = log off
1 = log only slow queries (more a "why is my query so slow" tool)
2 = log all queries (more a general debugging feature)
e.g. mongod --dbpath /usr/local/var/mongodb --profile 1 --slowms 2
e.g. db.system.profile.find({millis: {$gt: 1000}})
db.getProfilingLevel()
# returns 1
db.getProfilingStatus()
{ "was" : 1, "slowms" : 2 }
mongostat shows you the queries per second, flushes per second, etc.
Usually interested in 'idx miss %' (you want your index in memory; if this has a high number, then your index can't fit in memory so queries will be much slower)
If you can’t get the performance you want from a single server, you can deploy
multiple mongod servers (shards) with one mongos server (that routes to the shards)
E.g. say your shard key is student_id; the mongos router will then direct each query to the specific shard holding that key range
Products Table containing
name | category | manufacturer | price
ipod | tablet | Apple | $300
nexus | cell | Samsung | $200
Relational SQL’s GROUP BY would create
manufacturer | count(*)
Apple | 2
Samsung | 3
use aggregate
db.products.aggregate([
{ $group:
{
_id: "$manufacturer",
num_products:{$sum:1}
}
}
])
{ "_id" : "Amazon", "num_products" : 2 }
{ "_id" : "Apple", "num_products" : 1 }
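A rough pure-Python equivalent of this $group stage (the sample products are made up so the counts match the output above):

```python
from collections import Counter

products = [
    {"name": "Kindle", "manufacturer": "Amazon"},
    {"name": "Fire",   "manufacturer": "Amazon"},
    {"name": "iPad",   "manufacturer": "Apple"},
]

# equivalent of {$group: {_id: "$manufacturer", num_products: {$sum: 1}}}
counts = Counter(p["manufacturer"] for p in products)
result = [{"_id": m, "num_products": n} for m, n in sorted(counts.items())]
print(result)
# [{'_id': 'Amazon', 'num_products': 2}, {'_id': 'Apple', 'num_products': 1}]
```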
Similar to Unix pipes (e.g. du -s * | sort -n)
MongoDB has the following pipeline stages (these can appear more than once):
$project
$match
$group
$sort
$skip
$limit
$unwind
$out
$redact
$geoNear
From products, we do a group, then get the results back
Data is:
db.products.find().pretty()
{ "_id" : ObjectId("343l4jljdfklsf"),
"name" : "iPad 16GB Wifi",
"manufacturer" : "Apple",
"category" : "Tablets",
"price" : 499
}
Use query:
db.products.aggregate([
{$group:
{
_id : "$manufacturer",
num_products:{$sum:1}
}
}
])
What happens: $group creates one result document per distinct manufacturer; the aggregate key (num_products) is built up by $sum:1, which adds 1 for every document that falls into the group.
SQL equivalent of:
SELECT manufacturer, category, count(*) FROM products
GROUP BY manufacturer, category
MongoDB equivalent:
db.products.aggregate([
{$group:
{ _id: {
"manufacturer":"$manufacturer",
"category":"$category"},
num_products:{$sum:1}
}
}
])
returns:
manufacturer: Apple, category: Tablet
manufacturer: Apple, category: Laptop
Result: the _id key can also be a compound key (a document), grouping by both fields at once.
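A rough Python sketch of a compound group key (sample data made up): grouping on a tuple of both fields mirrors using a sub-document as _id.

```python
from collections import Counter

products = [
    {"manufacturer": "Apple",   "category": "Tablet"},
    {"manufacturer": "Apple",   "category": "Tablet"},
    {"manufacturer": "Apple",   "category": "Laptop"},
    {"manufacturer": "Samsung", "category": "Cell"},
]

# compound _id: each distinct (manufacturer, category) pair is its own group
counts = Counter((p["manufacturer"], p["category"]) for p in products)
print(counts[("Apple", "Tablet")])  # 2
```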
Aggregation options:
explain - lets you see the query plan for the aggregation (good for optimization)
allowDiskUse - controls whether or not the aggregation framework will use the hard drive; any stage of aggregation is limited to 100MB of memory (e.g. a sort might exceed 100MB and fail), so specify allowDiskUse if you're going to exceed 100MB
cursor - controls the cursor (batch) size
If you're dealing with arrays, you need to move the data out of array form and make it more flat by using $unwind to 'unjoin' the data
E.g. { a:1, b:2, c:['apple', 'pear', 'orange'] }
$unwind: '$c'
{ a:1, b:2, c:'apple'}
{ a:1, b:2, c:'pear'}
{ a:1, b:2, c:'orange'}
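A rough pure-Python sketch of what $unwind does to that document:

```python
doc = {"a": 1, "b": 2, "c": ["apple", "pear", "orange"]}

# {$unwind: "$c"} emits one copy of the document per array element
unwound = [dict(doc, c=item) for item in doc["c"]]
for d in unwound:
    print(d)
# one document per element: c = 'apple', 'pear', 'orange'
```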
Another Example of an $unwind
db.posts.aggregate([
/* unwind by tags */
{"$unwind":"$tags"},
/* now group by tags, counting each tag */
{"$group":
{"_id":"$tags",
"count":{$sum:1}
}
},
/* sort by popularity */
{"$sort":{"count":-1}},
/* show me the top 10 */
{"$limit":10},
/* change the name of _id to be tag */
{"$project":
{ _id:0,
'tag':'$_id',
'count':1
}
}
])
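A rough pure-Python equivalent of that whole pipeline (sample posts made up): unwind the tags, group/count, sort descending, limit to 10, and project {tag, count}.

```python
from collections import Counter

posts = [
    {"title": "p1", "tags": ["mongodb", "python"]},
    {"title": "p2", "tags": ["mongodb"]},
    {"title": "p3", "tags": ["python", "bottle"]},
]

# unwind + group/count + sort desc + limit 10 + project {tag, count}
counts = Counter(tag for post in posts for tag in post["tags"])
top_tags = [{"tag": t, "count": c} for t, c in counts.most_common(10)]
print(top_tags[0])  # {'tag': 'mongodb', 'count': 2}
```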
Double $unwind
/* given this document */
{
"_id": ObjectId("5890lkerjewljljflslsfja"),
"name": "T-Shirt",
"sizes": [
"Small",
"Medium",
"Large",
"X-Large"
],
"colors": [
"navy",
"black",
"orange",
"red"
]
}
/* unwinds by size, then by colors */
db.inventory.aggregate([
{$unwind: "$sizes"},
{$unwind: "$colors"},
{$group:
{
'_id': { 'size': '$sizes', 'color': '$colors'},
'count': {'$sum': 1}
}
}
])
/* each result of the double unwind is one size/color combination, e.g. */
{ "_id" : { "size" : "Small", "color" : "navy" }, "count" : 1 }
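A rough Python sketch of why the double $unwind yields a document per size/color pair: two unwinds produce the cross product of the two arrays.

```python
from itertools import product

shirt = {"name": "T-Shirt",
         "sizes": ["Small", "Medium", "Large", "X-Large"],
         "colors": ["navy", "black", "orange", "red"]}

# two $unwind stages produce one document per (size, color) combination
combos = [{"size": s, "color": c}
          for s, c in product(shirt["sizes"], shirt["colors"])]
print(len(combos))  # 16
```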
$out receives the results of the aggregation and redirects the output to a named collection.
{
"_id" : ObjectId("432ljfkldasjfldsajf"),
"first_name": "William",
"last_name": "Liu",
"points": 2,
"moves": [
5,
6,
8
]
}
db.games.aggregate([
{$group:
{
_id: {
first_name: "$first_name",
last_name: "$last_name"
},
points: {$sum: "$points"}
}
},
{$out: 'summary_results'}
])
This creates a new collection summary_results
Code:
db.summary_results.find()
{ "_id": { "first_name": "William", "last_name": "Liu" }, "points": 19 }
Application Engineering is concerned about:
Say you have your application that does writes (e.g. updates and inserts) to mongod
Mongo Shell is also another application that does inserts and updates
The w parameter means whether or not you wait for the write to be acknowledged (1 = wait, 0 = don't wait).
The j parameter is the journal option: wait for the write to be committed to the journal on disk.
To solve both of these issues, we use Replication.
Code
$mkdir -p /data/rs1 /data/rs2 /data/rs3
mongod --replSet rs1 --logpath "1.log" --dbpath /data/rs1 --port 27017 --fork
mongod --replSet rs1 --logpath "2.log" --dbpath /data/rs2 --port 27018 --fork
mongod --replSet rs1 --logpath "3.log" --dbpath /data/rs3 --port 27019 --fork
# Ties the replica set members together (run in the mongo shell)
config = { _id: "rs1", members:[
{ _id:0, host: "localhost:27017"},
{ _id:1, host: "localhost:27018"},
{ _id:2, host: "localhost:27019"} ]
};
rs.initiate(config);
rs.status()
Shard Key
There’s a range based approach to shard key
mongoexport and mongoimport are utilities for importing and exporting data to and from a MongoDB instance.
Do not use these for production backups (instead use mongodump and mongorestore for that functionality).
mongoexport
is a utility that produces either JSON or CSV of the data stored in a MongoDB instance.
mongoimport
is a utility for importing data into a MongoDB instance.