William Liu

AWS Glue

AWS Glue is a managed/serverless ETL (extract, transform, load) service that lets you categorize, clean, and enrich your data. AWS Glue is made up of:

AWS Glue Data Catalog (ETL engine that generates Python or Scala code)
a scheduler that handles dependency resolution, job monitoring, and retries

AWS Glue is setup to work with semi-structured data. We work with dynamic frames, which is kind of like an Apache Spark dataframe (an abstraction to organize data into rows and columns). Each record is self-describing so no schema is required initially.

So why use AWS Glue?

You can transform and move AWS Cloud data into your data store. You can also load data from other data sources into your data warehouse or data lake. AWS lets you:

discover and catalog metadata about your data stores into a central catalog
lets you process semi-structured data (e.g. clickstream data)
populates the AWS Glue Data Catalog with table defintions from scheduled crawler programs
generate ETL scripts to transform, flatten, enrich your data
handle errors and retries automatically

Example use cases include:

use AWS Glue to catalog your S3 data, making it available to query with Amazon Athena or Amazon Redshift Spectrum (Note: crawlers help your metadata stay in sync with the underlying data)
use AWS Glue to understand your data assets

How does it work?

Let’s identify some of the components first:

AWS Glue Data Catalog - the persistent metadata store in AWS Glue (has table definitions, job definitions). There is only one AWS Glue Data Catalog per region
Classifier - defines the schema of your data; AWS Glue has common types (e.g. CSV, JSON, AVRO, XML)
Connection - a Data Catalog object that contains the properties that are required to connect to a specific data store
Crawler - a program that connects to a data store (source or target), progresses through a list of classifiers to determine the schema of your data, then creates metadata tables in the AWS Glue Data Catalog
Database - a set of associated Datta Catalog table defintions organized into a logical group
Data store - a data store is a repository for persistently storing your data.
Data source - a data source is a data store that is used as input to a process or transform
Data target - a data target is a data store that a process or transform writes to
Dynamic Frame - a distributed table that supports nested data like structures and arrays. Each record is self-describing. Each record has both data and the schema that describes the data
Job - business logic that is required to perform ETL work; made up of a transform script (Python or Scala), data sources, and data targets
Table - the metadata definition that represents your data - a table that defines the schema of your data (e.g. S3 data) A Table in the AWS Glue Data Catalog has names of columns, data type definitions, partition information.

You define jobs in AWS Glue to extract, transform, and load data from a data source to a data target.

Data Store Source

What is a data store? A data store might be a JDBC connection to an Amazon Redshift cluster, a RDS, a Kafka cluster, S3, DynamoDB, S3, etc.

To work with data store sources, you define a crawler to populate your AWS Glue Data Catalog with metadata table definitions. You need this metadata table definition for defining a job to transform your data.

AWS Glue Connections

An AWS Glue connection is a Data Catalog object that stores connection information for a specific data store.

AWS Glue

So why use AWS Glue?

How does it work?

Data Store Source

AWS Glue Connections

Related Posts