William Liu

MongoDB

My MongoDB Class Notes from the M101P: MongoDB for Developers back in September 2014

Install MongoDB

Install on Ubuntu with: sudo apt-get install -y mongodb-org Run on Ubuntu with: sudo service mongod start

Week 1 - What is MongoDB?

MongoDB is:

Usage

mongo is the shell to connect to database mongod is the process that starts the database server (default port of 27017)

By default data goes to ‘/data/db’, so we’ll need to create the data directory

Simple Commands

When you run mongo, you connect to: mongodb://127.0.0.1:27017 by default

Bottle and PyMongo

You can use Bottle as a micro web framework and connect to pymongo

JSON and BSON

JSON Contains dictionaries and arrays. BSON is a superset of JSON. MongoDB uses BSON; can read more about that here, which has more support for additional data types like datetimes.

JSON

Week 2 - CRUD

CRUD (Create, Read, Update, Delete) operations in MongoDB

MongoDB’s CRUD operations exist as methods/functions in the programming language API’s, not as a separate language.

Mongo Shell

Mongo shell is an interactive javascript interpreter

z = { a : 1 }
z.a
Returns 1, usually dot notation means 'a' is a property of 'z'

z["a"]
Returns 1, usually bracket notation means data lookup

CRUD Operations: Insert, Find, FindOne

Assuming database of ‘db’ and collection of ‘people’ or ‘scores’

Insert operation

Code:

doc = { "name" : "Will", "age" : 29, "profession" : "programmer"}
db.people.insert( doc )
> WriteResult({ "nInserted" : 1 })
doc = {"_id" : ObjectID("43243242cdffd34242342"), "student": 19, "type": "essay", "score": 88 }

Find operation

Code:

db.people.find()
> { "_id" : ObjectId("542106ac0002277afae9f047"), "name" : "Will", "age" : 29, "profession" : "programmer" }
# Notice that the "_id" is filled in automatically (by default an ObjectID, but you can specify your own type if you'd like)

Can also specify criteria like:

db.scores.find( { student: 19, type: "essay" )
# Displays all fields in result

db.scores.find( { student: 19, type: "essay"}, { "score": true, "_id": false })
# Only display score in result

Comparison (e.g. Greater Than, Less Than):

db.scores.find( { score : { $gt: 95 } } )
db.scores.find( { score : { $gt: 95, $lte : 98 }, type : "essay" } )

Comparison Exists

db.people.find( { profession: { $exists: true } } )

By Type

db.people.find( { name : { $type : 2 } } )
# Note: Type 2 is a string

Regex

db.people.find( { name: { $regex: e$" } } )
OR
db.people.find( { $or : [ { name: { $regex : "e$" } }, { age : { $exists: true } } ] } )

AND

Below two are similar use cases of AND

db.people.find( { $and : [ { name : { $gt : "C" } }, { name : { $regex : "a" } } ] } )
db.people.find( { name : { $gt : "C", $regex : "a" } } )

ALL

db.accounts.find( { favorites : { $all : [ "pretzels", "beer" ] } } )
# Queries anything that has all the specified elements (e.g. "pretzels" and "beer" in any order)

IN

db.accounts.find( { name: { $in : [ "Howard", "Will" ] } } )
#Any document that contains either "Howard" or "Will" in the specified field (name)

FindOne

db.people.findOne()
Gets a random document out
>{
 "_id" : ObjectId("542106ac0002277afae9f047"),
 "name" : "Will",
 "age" : 29,
 "profession" : "programmer"
}

db.people.findOne({ "name": "Will" }, { "name":true, "_id": false} )
>{ "name" : "Will" }

Update

Update (warning: dangerous to use, think about using $set)

db.people.update( { name : "Smith" }, { name : "Thompson", salary: 50000})
# Updates the match on the left ("Smith") with the new data on the right ("Thompson" and "salary")

Update with $set

db.people.update( { name : "Alice" }, { $set : { age : 30 } } )
# Queries the left ("Alice") and sets the age to 30, even if age didn't exist previously

Update with $unset

db.people.update( { name : "Jones" }, { $unset : { profession : 1 } } )
# Queries the left ("Jones") and unsets the right field (profession) so it no longer is there

Update with $upsert

db.people.update( { name : "George" }, { $set : { age : 40 } }, { $upsert : true } )
# If the person "George" with age "40" isn't there, then add this field

Update with $multi

# If you want to update multiple options (instead of just one), need the additional parameter `multi`
db.people.update( { }, { $set : { title : "Dr" } }, { multi : true } )

Remove

db.people.remove( { } )
# Removes all documents from the collection one-by-one

db.people.drop()
# Removes all documents from the collection all at once (much faster)

Arrays

Make Query Pretty

db.accounts.find().pretty()

Querying Inside Arrays

db.accounts.find().pretty()
{
  "_id" : ObjectId("ffjdlksfjlsa"),
  "name" : "George",
  "favorites" : [
    "ice cream",
    "pretzels"
    ]}
{
  "_id" : ObjectId("fjdsklfjal"),
  "name" : "Will",
  "favorites" : [
    "pretzels",
    "beer"
    ]}

db.accounts.find( { favorites : "pretzels" } )
# Returns both documents

db.accounts.find( { favorites : "beer" } )
# Only returns object with Will
# Note: Only looks at top level depth; no recursion to find an arbitrary depth

Cursors

When a query like db.people.find() is run, a cursor goes through each field

You can hold onto cursors and iterate through them like:

cur = db.people.find(); null;
>cur.hasNext()
>cur.next()

Cursor Limit

cur.limit(5)

Cursor Sort

cur.sort( { name: -1 } );
# Reverse order by name (lexographical)

Cursor Skip

cur.skip(2);
# Can chain together (e.g. sort and limit and skip)

cur.sort( { name: -1 }).limit(3).skip(2);
# Order Processed as first the sort, then the skip, then the limit

Count

db.scores.count({ type: "exam" })
>1000
Performs a query that just counts up where query matches

Load JSON file

$./mongoimport -d students -c grades < grades.json
    -d means to pick the database to use
    -c means to pick the table/collection to use
    < means from this json file

In Mongo:

$use students $db.grades.count()

Homework 2.1:

`db.grades.find({ score: { $gte: 65} }, {student_id:1}).sort({ score:1 })`

Week 3 - Application Driven Schema

In Relational Databases, you want to design for third normal form

For MongoDB, you want to design so that it’s conducive to the application’s data access patterns (for example, what is read together, what is written together?).

MongoDB supports Rich Documents

MongoDB does have Atomic operations (work is completed for all or none), can do:

Benefits of Embedding instead of traditional relational joins

MongoDB’s Relations

One-to-One

One-to-Many

Many-to-Many

Checking Indexes

We can create indexes and see how the indexes are created:

How do you represent a tree? (e.g. categories of products in Amazon based on season) Products collection and Category collections, can add ancestors in Categories

When to denormalize?

GRIDFS

GRIDFS (allows you to store more than 16MB per collection)

Week 4 - Database Performance with Indexes

Background

MongoDB Indexes

For MongoDB, you keep your keys ordered in what’s called a B-Tree (a self balancing tree data structure that keeps data sorted and allows searches)

Creating Indexes in MongoDB

To make an Index on student_id field and make ascending

db.students.ensureIndex({student_id:1})

To make a compound index (i.e. index on multiple fields); made up of student_id (ascending) and class (descending)

db.students.ensureIndex( {student_id:1, class:-1 })

ASC and DESC doesn’t matter much for searching, but matters for sort

Finding Indexes

See all indexes for current database

use school
db.system.indexes.find()

Get specific indexes for a collection

db.students.getIndexes()

Drop indexes for a collection

db.students.dropIndex( {'student_id':1} )

Types of Indexes in MongoDB

Single Key Index

MongoDB provides support for indexes on any field in a collection or document. By default, all collections have an index on the _id field

{ "_id" : ObjectID(...),
   "name" : "Alice"
   "age" : 27
}
db.friends.ensureIndex( { "name" : 1 } )

Compound Index

A compound index is where a single index structure holds references to multiple fields within a collection’s documents.

{ "_id" : ObjectId(...),
   "item": "Banana",
   "category" : ["food", "produce", "grocery"],
   "stock": 4
}

Single Compound Index

If applications query on the item field as well as query on both the item field and the stock field, you can specify a single compound index to support both of these queries:

db.products.ensureIndex( { "item": 1, "stock": 1 } )

Multikey Index

To index a field that holds an array value, MongoDB adds index items for each item in the array. This multikey index allows MongoDB to return documents from queries using the value of an array.

{ userid: "xyz",
   addr:
          [
            { zip: "10046", ... },
            { zip: "94301", ... }
           ],
   ...
}
{ "addr.zip":1 }

The addr field contains an array of address documents. The address documents contain the zip field

Example of multikey index:

{name: 'Andrew',
  tag: ['cycling', 'tennis', 'football'],
  color: 'red',
  locations: ['NY', 'CA']}

You can create compound keys like tag, color (one is an array, other is scalar) You cannot create compound keys on tags, locations (they’re both arrays, issue is it becomes too large)

Indexing isn’t restricted to the first level of keys (can put on sub-arrays)

"addresses": [
    {"tag": "vacation",
      "phones": [
        1,
        2]
     }...]
db.people.ensureIndex( {'addresses.tag':1} )
db.people.ensureIndex( {'addresses.phones':1} )

MongoDB Index Properties

Unique Index

A unique index causes MongoDB to reject all documents that contain a duplicate value for the indexed field To create a unique index, use the unique option to true, like:

db.stuff.ensureIndex( { thing: 1}, { unique:true } )
# By default, unique: false on indexes
"_id" is unique, even though it doesn't say it is (when you try to insert into the collection with a value already there, it returns an error)

Removing Duplicates when creating unique indexes

Sparse Indexes

Index Creation - Background

By default, an index is created on the foreground (blocks all other writers). This makes it:

You can specify if you want to have an index in the background

Command - Explain()

Explain gives you an explanation of what occurred in the query:

db.foo.find({c:1}).explain()
"cursor" : "BasicCursor"
Can be say "BtreeCursor a_1_b_1_c_1"
This means the database is using the index
Can be say "BasicCursor"
This means no index was used
"isMultiKey" : false
Is this a multikey index?
"n" : 1
Number of documents returned, say 1 that matches the query
"nscannedObjects" : 1
Number of documents scanned in the query
"indexOnly" : true
We can see that we didn't need to go to the actual collection to answer the query, just based on index
"millis" : 3
How many milliseconds it took for the query
"indexBounds" : ...
Shows the bounds that were used for the index

Choosing an Index

Say we have 3 indexes (a, b, c); what happens when you run a query is that mongodb runs all three indexes at the same time and checks which index is the quickest Once the quickest query is determined (say query b wins), MongoDB remembers that it’s the quickest query and will use that for say the next 100ish queries

Index Commands - Size and Dropping

Indexes help with performance, but takes hard drive space

db.students.totalIndexsize()
db.students.getIndexes()
db.students.dropIndex({'student_id':1})

Index Cardinality

Regular Index, has 1:1 index to documents

Sparse Index

Multikey Index

Hinting an Index and Natural Index

If you want to specifically tell MongoDB to use this index (instead of normally letting MongoDB figure it out), you can use hinting for indexes

Geospatial Index (2D) and Geospatial Spherical Indexing (3D)

2D

3D

Logging and Profiling

Profiling Settings

db.getProfilingLevel()
# returns 1

db.getProfilingStatus()
{ "was" : 1, "slowms" : 2 }

MongoStat

mongostat shows you say the queries per second, flushes per second, etc. Usually interested in ‘idx missed’ (you want your index in memory, if this has a high number, then your index can’t fit in memory so it’ll be much slower)

Sharding

If you can’t get the performance you want from a single server, you can deploy multiple mongod servers (shards) with one mongos server (that routes to the shards) E.g. Say your shard key is student_id, then it’ll say use a specific shard

Week 5 - Aggregation Framework

SQL Example

Products Table containing
name    | category  | manufacturer  | price
ipod    | tablet    |   Apple       | $300
nexus   | cell      |  Samsung      | $200

Relational SQL’s GROUP BY would create

manufacturer  | count(*)
Apple         | 2
Samsung       | 3

MongoDB Example

use aggregate

db.products.aggregate([
  { $group:
    {
       _id: "$manufacturer",
       num_products:{$sum:1}
     }
   }
])
{ "_id" : "Amazon", "num_products" : 2 }
{ "_id" : "Apple", "num_products" : 1 }

MongoDB Aggregation Pipeline

Similar to Unix (e.g. du -s * sort -n)

MongoDB has the following pipeline stages (these can appear more than once):

$project

$match

$group

$sort

$skip

$limit

$unwind

$out

$redact

$geonear

$match

Group by single key

From products, we do a group, then gets the results as a new collection

Data is:

db.products.find().pretty()
    {   "_id" : ObjectId("343l4jljdfklsf"),
         "name" : "iPad 16GB Wifi",
         "manufacturer" : "Apple",
         "category" : "Tablets",
         "price" : 499
    }

Use query:

db.products.aggregate([
    {$group:
      {
          _id : "$manufacturer",
          num_products:{$sum:1}
       }
     }
])

What happens:

Compound Grouping

SQL equivalent of:

SELECT manufacturer, category, count(*) FROM products
GROUP BY manufacturer, category

MongoDB equivalent:

db.products.aggregate([
    {$group:
        {  _id: {
                     "manufacturer":"$manufacturer",
                     "category":"$category"},
           num_products:{$sum:1}
        }
    }
])

returns: 
manufacturer: Apple, category: Tablet
manufacturer: Apple, category: Laptop

Result:

Aggregation Expressions Overview (During the Group Stage)

See how this compares to SQL here

Aggregation Options

Lets you change how

$unwind

If you’re dealing with arrays, you need to move it out of an array form and make it more flat by using unwind to ‘unjoin’ the data

E.g. { a:1, b:2, c:['apple', 'pear', 'orange'] }
$unwind: '$c'
{ a:1, b:2, c:'apple'}
{ a:1, b:2, c:'pear'}
{ a:1, b:2, c:'orange'}

Another Example of an $unwind

db.posts.aggregate([
  /* unwind by tags */
  {"$unwind":"$tags"},
  /* now group by tags, counting each tag */
  {"$group":
    {"_id":"$tags",
      "count":{$sum:1}
     }
   },
  /* sort by popularity */
  {"$sort":{"count":-1}},
  /* show me the top 10 */
  {"$limit":10},
  /* change the name of _id to be tag */
  {"$project":
    { _id:0,
      'tag':'$_id',
      'count':1
     }
   }
])

Double $unwind

/* given this document */
{
    "_id": ObjectId("5890lkerjewljljflslsfja"),
    "name": "T-Shirt",
    "sizes": [
        "Small",
        "Medium",
        "Large",
        "X-Large"
      ],
    "colors": [
         "navy",
         "black",
         "orange",
         "red"
       ]
 } 
/* unwinds by size, then by colors */
db.inventory.aggregate([
  {$unwind: "$sizes"},
  {$unwind: "$colors"},
  {$group:
    {
      '_id': { 'size': '$sizes', 'color': '$colors'},
      'colors': {'$sum': 1}
     }
   }
])
/* results of the double unwind is */
{ "_id": { "size": "31*20", "color": "violet" }, "count":2 }

$out

Receive the results of the aggregation and redirects the output.

{
  "_id" : ObjectId("432ljfkldasjfldsajf"),
  "first_name": "William",
  "last_name": "Liu",
  "points": 2,
  "moves": [
          5,
          6,
          8
    ]
}
db.games.aggregate([
  {$group:
    {
      _id: {
                first_name: "$first_name",
                last_name: "$last_name"
              },
       points: {$sum: "points"}
     }
   },
  {$out: 'summary_results'}
])

This creates a new collection summary_results

Code:

db.summary_results.find()
{ "_id": { "first_name": "William", "last_name": "Liu" }, "points": 19 }

Python / PyMongo

Limitations in Aggregation

Sharding

Week 6 - Replication and Sharding

Application Engineering is concerned about:

Durability of Writes

Say you have your application that does writes (e.g. updates and inserts) to mongod

Mongo Shell is also another application that does inserts and updates

Network Errors

Pymongo Drivers

Replication

To solve both of these issues, we use Replication

Types of Replica Sets

Write Consistency

Creating a Replica Set on a single node

Code

$mkdir -p /data/rs1 /data/rs2 /data/rs3
mongod --replSet rs1 --logpath "1.log" --dbpath /data/rs1 --port 27017 --fork
mongod --replSet rs1 --logpath "2.log" --dbpath /data/rs2 --port 27018 --fork
mongod --replSet rs1 --logpath "3.log" --dbpath /data/rs3 --port 27019 --fork

# Ties replica sets together in the mongos
config = { _id: "rs1", members:[
                       { _id:0, host: "localhost:27017"},
                       { _id:1, host: "localhost:27018"},
                       { _id:2, host: "localhost:27019"} ]
           };
rs.initiate(config);
rs.status()

Failover and Rollback

Sharding

Shard Key

There’s a range based approach to shard key

Mongo export and import

mongoexport and mongoimport are utilities for importing and exporting data to and from a MongoDB instance. Do not use thsese for production replications (instead use mongodump and mongorestore for this funcitonality).

Mongo Export

mongoexport is a utility that produces either JSON or CSV of the data stored in a MongoDB instance.

Mongo Import

mongoimport is a utility for importing data into a MongoDB instance.