William Liu

AWS Glue

AWS Glue is a managed/serverless ETL (extract, transform, load) service that lets you categorize, clean, and enrich your data. AWS Glue is made up of:

AWS Glue is set up to work with semi-structured data. It uses dynamic frames, which are similar to Apache Spark DataFrames (an abstraction that organizes data into rows and columns). Each record in a dynamic frame is self-describing, so no schema is required initially.
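To make "self-describing" concrete, here is a minimal plain-Python sketch. It is not the AWS Glue API; the function names and sample records are made up for illustration. Each record carries its own field names and value types, and a schema is only computed when it is needed by scanning the records.

```python
from typing import Any, Dict, List, Set


def infer_type(value: Any) -> str:
    """Map a Python value to a simple type label (labels chosen for illustration)."""
    if isinstance(value, bool):  # check bool first, since bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if value is None:
        return "null"
    return "string"


def merge_schema(records: List[Dict[str, Any]]) -> Dict[str, Set[str]]:
    """Collect, per field, every type seen across the records."""
    schema: Dict[str, Set[str]] = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(infer_type(value))
    return schema


# Hypothetical records: the second one has an extra field and a differently typed "id".
records = [
    {"id": 1, "name": "alice"},
    {"id": "2", "name": "bob", "email": "bob@example.com"},
]

schema = merge_schema(records)
for field in sorted(schema):
    print(field, sorted(schema[field]))
```

A field that shows up with more than one type, such as "id" here, is the kind of ambiguity a dynamic frame can carry along until you decide how to resolve it.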

So why use AWS Glue?

You can transform and move AWS Cloud data into your data store. You can also load data from other data sources into your data warehouse or data lake. AWS Glue lets you:

Example use cases include:

How does it work?

Let’s identify some of the components first:

You define jobs in AWS Glue to extract, transform, and load data from a data source to a data target.
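The three stages of a job can be sketched in plain Python. This is only an illustration of the extract, transform, load pattern using the standard library, not an AWS Glue job script; the sample data, field names, and cleaning rules are assumptions.

```python
import csv
import io
import json
from typing import Dict, List

# Hypothetical source data; in a real job this would come from a data store.
SOURCE_CSV = "id,name,amount\n1,alice,10.5\n2,bob,\n3,carol,7\n"


def extract(text: str) -> List[Dict[str, str]]:
    """Read raw rows from the source (every value is still a string)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows: List[Dict[str, str]]) -> List[dict]:
    """Drop rows with a missing amount and convert fields to proper types."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records
        cleaned.append(
            {
                "id": int(row["id"]),
                "name": row["name"].title(),
                "amount": float(row["amount"]),
            }
        )
    return cleaned


def load(rows: List[dict]) -> str:
    """Serialize rows as JSON Lines; a real job would write to the data target."""
    return "\n".join(json.dumps(row) for row in rows)


output = load(transform(extract(SOURCE_CSV)))
print(output)
```

Keeping each stage as its own function mirrors how a job reads from a source, applies transformations, and writes to a target as separate steps.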

Data Store Source

What is a data store? A data store might be a JDBC connection to an Amazon Redshift cluster, an RDS database, a Kafka cluster, S3, DynamoDB, etc.

To work with data store sources, you define a crawler that populates your AWS Glue Data Catalog with metadata table definitions. You need these table definitions when you define a job to transform your data.
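As a rough sketch, a crawler that scans an S3 path could be described with the request below. The helper function, crawler name, role ARN, database name, and bucket path are all placeholders made up for this example. The dictionary keys follow the parameters of the boto3 Glue client's create_crawler call, which is shown only in comments so the snippet runs without AWS credentials.

```python
from typing import Any, Dict


def build_crawler_request(
    name: str, role_arn: str, database: str, s3_path: str
) -> Dict[str, Any]:
    """Build keyword arguments for a crawler over a single S3 path."""
    return {
        "Name": name,
        "Role": role_arn,  # IAM role the crawler assumes
        "DatabaseName": database,  # Data Catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


# All values below are hypothetical placeholders.
request = build_crawler_request(
    name="example-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    database="example_db",
    s3_path="s3://example-bucket/raw/",
)
print(request)

# With AWS credentials configured, the request could be submitted like this:
#   import boto3
#   glue = boto3.client("glue")
#   glue.create_crawler(**request)
#   glue.start_crawler(Name=request["Name"])
```

Once the crawler has run, the tables it created in the Data Catalog are what a job refers to as its source.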

AWS Glue Connections

An AWS Glue connection is a Data Catalog object that stores connection information for a specific data store.
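For illustration, the connection information for a JDBC data store might look like the sketch below. The helper function, connection name, URL, and credentials are placeholders invented for this example. The structure follows the ConnectionInput parameter of the boto3 Glue client's create_connection call, which appears only in comments so the snippet runs without an AWS account.

```python
from typing import Any, Dict


def build_connection_input(
    name: str, jdbc_url: str, username: str, password: str
) -> Dict[str, Any]:
    """Build a ConnectionInput structure for a JDBC data store."""
    return {
        "Name": name,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": jdbc_url,
            "USERNAME": username,
            "PASSWORD": password,
        },
    }


# All values below are hypothetical placeholders; do not hard-code real credentials.
connection_input = build_connection_input(
    name="example-redshift-connection",
    jdbc_url="jdbc:redshift://example-cluster.example.com:5439/dev",
    username="example_user",
    password="example_password",
)
print(connection_input["Name"], connection_input["ConnectionType"])

# With AWS credentials configured, the connection could be created like this:
#   import boto3
#   glue = boto3.client("glue")
#   glue.create_connection(ConnectionInput=connection_input)
```

Crawlers and jobs can then refer to the connection by name instead of repeating the connection details.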