Apache Hadoop is an open source software framework, written in Java, for big and/or unstructured data. This means that if you're working with very large data sets (e.g. 100 terabytes), then you want to consider Hadoop. It's important to note that if you don't work with very large data sets or unstructured data, then you probably do NOT need Hadoop; you're probably better off using a standard relational SQL database.
So why Hadoop? Instead of running on a single powerful server, Hadoop can run on a large cluster of commodity hardware. This means that a network of personal computers (nodes) can coordinate their processing power AND storage (see HDFS below). Hadoop allows systems to scale horizontally (i.e. more computers) instead of just vertically (i.e. a faster computer).
Hadoop is a system for ‘Big Data’. Choose Hadoop if a lot of this applies:
Hadoop is good for:
Hadoop's architecture is based on Google's MapReduce and Google File System white papers, with Hadoop consisting of MapReduce for processing and the Hadoop Distributed File System (HDFS) for storage.
MapReduce does the processing and is broken down into two pieces: mappers, which transform each input record into intermediate key/value pairs, and reducers, which aggregate the values for each key into the final result.
The general steps look like this:
1. The mapper reads the input and emits key/value pairs.
2. Hadoop shuffles and sorts the pairs so that all values for the same key reach the same reducer.
3. The reducer aggregates the values for each key and writes the output.
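As a purely illustrative sketch (a hypothetical word count in plain Python, not an actual Hadoop job), the same map → shuffle/sort → reduce flow looks like this:

```python
# Purely illustrative: the map -> shuffle/sort -> reduce flow in plain Python.
from itertools import groupby

docs = ["big data big cluster", "big data"]  # hypothetical input documents

# Map: emit a (key, value) pair for every word.
pairs = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle/sort: order the pairs by key (Hadoop does this between map and reduce).
pairs.sort(key=lambda kv: kv[0])

# Reduce: aggregate the values for each key.
counts = {key: sum(v for _, v in group) for key, group in groupby(pairs, key=lambda kv: kv[0])}
print(counts)  # {'big': 3, 'cluster': 1, 'data': 2}
```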
Note: There are specific MapReduce programs that allow higher level querying, like Hive and Pig.
The Hadoop Distributed File System (HDFS) provides redundancy and fault-tolerant storage by breaking data into chunks of about 64MB - 2GB, creating instances of each chunk (based on the replication factor setting, usually 3 instances), and spreading them across a network of computers. Say that one of the networked computers holds a piece of data you need and it goes down; there are still two other copies of the data on the network that are readily available. This is much different from many enterprise systems, where if a major server goes down, it can take anywhere from minutes to hours or days to fully restore.
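Conceptually (this is only an illustration, not how HDFS is actually implemented; the block size, node names, and round-robin placement are made up for the sketch), block splitting and replication look roughly like this:

```python
# Conceptual only: chop a file into fixed-size blocks and place each block on
# REPLICATION distinct nodes, so losing any one node never loses a block.
import itertools

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, a classic HDFS block size
REPLICATION = 3                 # the usual replication factor
nodes = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical cluster

def place_blocks(file_size):
    num_blocks = (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE
    ring = itertools.cycle(nodes)
    # Round-robin placement here; real HDFS uses rack-aware placement.
    return {block: [next(ring) for _ in range(REPLICATION)]
            for block in range(num_blocks)}

print(place_blocks(200 * 1024 * 1024))  # a 200 MB file -> 4 blocks, 3 copies each
```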
Note: There are specific HDFS data systems like HBase and Accumulo that allow you to fetch keys quickly, which are good for transactional systems.
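For example, here is a minimal sketch of key-based access to HBase from Python, assuming an HBase Thrift server is running and the happybase package is installed; the host, table, row key, and column names are all hypothetical:

```python
# Minimal sketch: key-based reads and writes against HBase via its Thrift gateway.
# Assumes `pip install happybase` and a Thrift server at thrift-host:9090.
import happybase

connection = happybase.Connection("thrift-host", port=9090)  # hypothetical host
table = connection.table("stock_quotes")                     # hypothetical table

# Write one row, then fetch it back directly by key -- no MapReduce job needed.
table.put(b"GOOG-2015-01-02", {b"price:open": b"230", b"price:close": b"240"})
row = table.row(b"GOOG-2015-01-02")
print(row[b"price:close"])
connection.close()
```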
There were a few changes from Hadoop v1 to Hadoop v2, mainly the addition of YARN and a lot more data processing applications like Pig, Hive, etc.
Hadoop 1
The main pieces of Hadoop 1.0 were MapReduce sitting on top of HDFS.
Hadoop 2
With Hadoop 2.0, we still have MapReduce and HDFS, but there is now an additional layer, YARN, that acts as a resource manager for distributed applications; YARN sits between the MapReduce and HDFS layers. A client submits a job to the YARN ResourceManager, which then automatically distributes and manages it. Along with MapReduce, we have a few other data processing options like Hive, Pig, Spark, etc.
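As a small sketch of YARN in its resource manager role, you can list the applications it is tracking through the ResourceManager's REST API (assuming the web service is reachable on its default port 8088; the hostname is hypothetical):

```python
# Minimal sketch: list the applications YARN's ResourceManager is tracking.
# Assumes the ResourceManager web service at resourcemanager-host:8088 (the default port).
import json
import urllib.request

url = "http://resourcemanager-host:8088/ws/v1/cluster/apps"
with urllib.request.urlopen(url) as resp:
    apps = json.load(resp)["apps"]["app"]

for app in apps:
    print(app["id"], app["name"], app["state"])
```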
Use Hive to interact with your data in HDFS and Amazon S3
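For instance, here is a minimal sketch of querying Hive from Python, assuming a HiveServer2 endpoint and the PyHive package; the host, table, column, and S3 bucket names are hypothetical:

```python
# Minimal sketch: run HiveQL over data that already lives in HDFS or S3.
# Assumes `pip install pyhive` and HiveServer2 at hive-host:10000.
from pyhive import hive

conn = hive.Connection(host="hive-host", port=10000, username="hadoop")
cursor = conn.cursor()

# An external table just points Hive at files already sitting in S3 (or HDFS).
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS quotes (
        symbol STRING, open_price INT, close_price INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://my-bucket/quotes/'
""")

cursor.execute("SELECT symbol, MAX(close_price - open_price) FROM quotes GROUP BY symbol")
for symbol, max_gain in cursor.fetchall():
    print(symbol, max_gain)
```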
Spark is an alternative to MapReduce processing. Instead of Hadoop's MapReduce model, Spark builds a Directed Acyclic Graph (DAG) of operations. Spark has very high performance because it works in memory. We can think of the difference this way: MapReduce reads data from disk, runs an operation, writes the result back to disk, then reads from disk again before the next operation, while Spark holds the data in memory and can run the operations without going back to disk.
To run Spark jobs, you can run in standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos.
Since Spark is just a processing replacement, you'll still need to decide what to use for data storage. Spark works well with storage solutions like HDFS, Cassandra, HBase, Hive, and S3.
You can interact with your data in an ad-hoc way through the HUE GUI.
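As a minimal PySpark sketch (assuming pyspark is installed and run against a local master, with the input inlined rather than read from storage), the same kind of per-stock calculation as the streaming example further below becomes a chain of in-memory transformations:

```python
# Minimal sketch: chained in-memory transformations; Spark builds a DAG and
# only executes it when an action (collect) is called.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("max-increase").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize([
    "Goog, 230, 240",
    "Goog, 300, 350",
    "MS, 250, 260",
    "IBM, 80, 90",
])

max_increase = (lines
    .map(lambda line: line.split(","))
    .map(lambda r: (r[0], (float(r[2]) - float(r[1])) / float(r[1])))
    .filter(lambda kv: kv[1] > 0)
    .reduceByKey(max))            # no intermediate writes to disk between steps

print(max_increase.collect())     # e.g. [('Goog', 0.1666...), ('MS', 0.04), ('IBM', 0.125)]
spark.stop()
```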
Notes
Books: Hadoop: The Definitive Guide
Streaming with Python: just use stdin and stdout.
ssh username@216.230.228.88, then enter your password for the ssh session.
Run Hadoop using: source run-hadoop.sh
run-hadoop.sh
```bash
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input ./input.txt \
    -output ./output \
    -mapper map.py \
    -reducer reduce.py \
    -file map.py \
    -file reduce.py
```
input.txt
Goog, 230, 240
Apple, 100, 98
MS, 300, 250
MS, 250, 260
MS, 270, 280
Goog, 220, 215
Goog, 300, 350
IBM, 80, 90
IBM, 90, 85
map.py
```python
#!/usr/bin/env python
import sys

# Mapper: read CSV lines of (stock, opening, closing) from stdin and emit a
# tab-separated (stock, fractional increase) pair whenever the price went up.
for line in sys.stdin:
    record = line.split(",")
    opening = int(record[1])
    closing = int(record[2])
    if closing > opening:
        change = float(closing - opening) / opening
        print('%s\t%s' % (record[0], change))
```
reduce.py
```python
#!/usr/bin/env python
import sys

# Reducer: input arrives sorted by stock, so lines for the same stock are
# adjacent; track the maximum increase and print it when the stock changes.
stock = None
max_increase = 0
for line in sys.stdin:
    next_stock, increase = line.split('\t')
    increase = float(increase)
    if next_stock == stock:  # another line for the same stock
        if increase > max_increase:
            max_increase = increase
    else:  # new stock; output result for the previous stock
        if stock is not None:  # only false on the very first line of input
            print("%s\t%f" % (stock, max_increase))
        stock = next_stock
        max_increase = increase

# print the last stock
if stock is not None:
    print("%s\t%f" % (stock, max_increase))
```
expected-output
Goog 0.166667
IBM 0.125000
MS 0.040000
output/part-00000
This is created after running source run-hadoop.sh.
Goog 0.166667
IBM 0.125000
MS 0.040000
An EMR cluster has multiple instance groups: a master instance group, a core instance group (nodes that provide HDFS storage and run tasks), and optional task instance groups (nodes that only run tasks).
You can architect your data a few different ways. Here are a few examples:
Long-running cluster
Interactive query
EMR for ETL and query engine for investigations
This takes the S3 data and splits it two ways:
Other interesting ETL setups include:
Streaming Data Processing
Logs are stored in Amazon Kinesis, which then splits out to:
You can either:
You can have a variety of data stores including:
EMR uses two IAM roles for security: a service role that the EMR service itself assumes, and an EC2 instance profile (job flow role) for the instances in the cluster.
EMR by default creates two security groups: one for the master node and one for the core/task nodes.
Bootstrap Actions configure your applications (e.g. set up core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml).
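Here is a minimal sketch of tying these pieces together when launching a cluster with boto3; the bucket, script path, instance types, and release label are hypothetical, and the two roles shown are the defaults EMR can create for the service and for the EC2 instances:

```python
# Minimal sketch: launch an EMR cluster with a bootstrap action and the two
# IAM roles (service role + EC2 instance profile). Names and paths are hypothetical.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="example-cluster",
    ReleaseLabel="emr-5.30.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}, {"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    BootstrapActions=[{
        "Name": "configure-site-files",
        "ScriptBootstrapAction": {"Path": "s3://my-bucket/bootstrap/setup.sh"},
    }],
    ServiceRole="EMR_DefaultRole",       # role assumed by the EMR service
    JobFlowRole="EMR_EC2_DefaultRole",   # instance profile for the EC2 nodes
)
print(response["JobFlowId"])
```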
You should consider using client-side encrypted objects in S3. You should also compress your data files. S3 can be used as a Landing Zone and/or as a Data Lake.
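As a small sketch of the compression point (assuming boto3 credentials are configured; the bucket and key names are hypothetical):

```python
# Minimal sketch: gzip a local data file before landing it in S3.
import gzip
import shutil
import boto3

with open("input.txt", "rb") as src, gzip.open("input.txt.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)

s3 = boto3.client("s3")
s3.upload_file("input.txt.gz", "my-landing-zone-bucket", "quotes/input.txt.gz")
```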
EMR is also HIPAA-eligible.
AWS Data Pipeline can be accessed through the console, the command line interface, or APIs.
On a Data Pipeline, the activities look like:
For example, a PigActivity can do: