AWS Glue
AWS Glue is a managed, serverless ETL (extract, transform, load) service that lets you
categorize, clean, and enrich your data. AWS Glue is made up of:
- the AWS Glue Data Catalog (a central metadata repository)
- an ETL engine that generates Python or Scala code
- a scheduler that handles dependency resolution, job monitoring, and retries
AWS Glue is set up to work with semi-structured data. You work with dynamic frames, which are
similar to Apache Spark DataFrames (an abstraction that organizes data into rows and columns),
except that each record is self-describing, so no schema is required initially.
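As a quick illustration, here is a minimal PySpark sketch of reading a dynamic frame from the Data Catalog; the database and table names (sales_db, orders) are hypothetical placeholders:

```python
# Minimal sketch: read a DynamicFrame from the AWS Glue Data Catalog.
# "sales_db" and "orders" are hypothetical catalog names.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Each record in the DynamicFrame carries its own schema, so no
# schema needs to be declared up front.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="orders",
)

# Convert to a regular Spark DataFrame when you need DataFrame APIs.
orders_df = orders.toDF()
orders_df.printSchema()
```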
So why use AWS Glue?
You can transform and move AWS Cloud data into your data store. You can also load data from
other data sources into your data warehouse or data lake. AWS Glue lets you:
- discover and catalog metadata about your data stores in a central catalog
- process semi-structured data (e.g. clickstream data)
- populate the AWS Glue Data Catalog with table definitions from scheduled crawler programs
- generate ETL scripts to transform, flatten, and enrich your data
- handle errors and retries automatically
Example use cases include:
- use AWS Glue to catalog your S3 data, making it available to query with Amazon Athena or Amazon Redshift Spectrum
(Note: crawlers help your metadata stay in sync with the underlying data; see the sketch after this list)
- use AWS Glue to understand your data assets
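As a rough sketch of the S3-plus-Athena use case, assuming a crawler named sales-s3-crawler and a catalog database sales_db already exist (all names here are hypothetical placeholders):

```python
# Sketch: keep the catalog in sync, then query the cataloged S3 data
# with Athena. Crawler, database, and bucket names are hypothetical.
import boto3

glue = boto3.client("glue")
athena = boto3.client("athena")

# Re-run the crawler so the catalog reflects the current S3 data.
glue.start_crawler(Name="sales-s3-crawler")

# Query the crawled table; Athena reads the schema from the Data Catalog.
athena.start_query_execution(
    QueryString="SELECT * FROM orders LIMIT 10",
    QueryExecutionContext={"Database": "sales_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```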
How does it work?
Let’s identify some of the components first:
- AWS Glue Data Catalog - the persistent metadata store in AWS Glue (it holds table definitions and job definitions).
There is one AWS Glue Data Catalog per AWS account per region
- Classifier - determines the schema of your data; AWS Glue has built-in classifiers for common formats (e.g. CSV, JSON, Avro, XML)
- Connection - a Data Catalog object that contains the properties that are required to connect to a specific data store
- Crawler - a program that connects to a data store (source or target), progresses through a list of classifiers to
determine the schema of your data, then creates metadata tables in the AWS Glue Data Catalog
- Database - a set of associated Data Catalog table definitions organized into a logical group
- Data store - a repository for persistently storing your data
- Data source - a data store that is used as input to a process or transform
- Data target - a data store that a process or transform writes to
- Dynamic Frame - a distributed table that supports nested data like structures and arrays. Each record is
self-describing: it carries both the data and the schema that describes that data
- Job - business logic that is required to perform ETL work; made up of a transform script (Python or Scala),
data sources, and data targets
- Table - the metadata definition that represents your data, e.g. the schema of data stored in S3.
A Table in the AWS Glue Data Catalog holds column names, data type definitions, and partition information.
You define jobs in AWS Glue to extract, transform, and load data from a data source to a data target.
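Below is a minimal sketch of such a job script in Python; the catalog names, column mappings, and output path are all hypothetical placeholders:

```python
# Sketch of a Glue ETL job script: extract from a catalog table,
# transform with ApplyMapping, load to S3. All names are hypothetical.
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Extract: the data source is a table in the Data Catalog.
source = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders"
)

# Transform: each mapping is (source field, source type,
# target field, target type).
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("order_id", "string", "order_id", "string"),
        ("amount", "string", "amount", "double"),
    ],
)

# Load: the data target is an S3 location, written as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-etl-output/orders/"},
    format="parquet",
)

job.commit()
```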
Data Store Source
What is a data store? A data store might be a JDBC connection to an Amazon Redshift cluster, an Amazon RDS
database, a Kafka cluster, S3, DynamoDB, etc.
To work with data store sources, you define a crawler to populate your AWS Glue Data Catalog with metadata table
definitions. You need these table definitions to define a job that transforms your data.
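A minimal boto3 sketch of defining and starting such a crawler; the crawler name, IAM role, database, and S3 path are hypothetical placeholders:

```python
# Sketch: define a crawler over an S3 path so it populates the
# Data Catalog with table definitions. Names and the IAM role
# are hypothetical placeholders.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="sales-s3-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://my-raw-data/orders/"}]},
)

glue.start_crawler(Name="sales-s3-crawler")
```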
AWS Glue Connections
An AWS Glue connection is a Data Catalog object that stores connection information for a specific data store.
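For example, here is a minimal boto3 sketch of creating a JDBC connection to a Redshift cluster; the connection name, endpoint, and credentials are hypothetical placeholders:

```python
# Sketch: store JDBC connection properties for a Redshift cluster
# as a Data Catalog connection object. The endpoint and credentials
# are hypothetical placeholders.
import boto3

glue = boto3.client("glue")

glue.create_connection(
    ConnectionInput={
        "Name": "redshift-sales",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:redshift://example.us-east-1.redshift.amazonaws.com:5439/dev",
            "USERNAME": "etl_user",
            "PASSWORD": "replace-me",
        },
    }
)
```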