AWS Athena is an interactive query service that makes it easy to analyze data in S3 using standard SQL. You don't need to set up any servers. Simply point Athena at your data in S3, define the schema, and start querying with standard SQL. Athena is best suited for ad hoc analysis and relatively simple reporting. Athena supports CSV, JSON, and Gzip-compressed files, as well as columnar formats like Apache Parquet.
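As a quick sketch of that workflow from the AWS CLI (the database, table, and output bucket below are placeholders, and this assumes the table is already defined in the Data Catalog):
# Kick off a query; Athena writes the results to the S3 output location
aws athena start-query-execution \
  --query-string "SELECT * FROM my_database.my_table LIMIT 10" \
  --query-execution-context Database=my_database \
  --result-configuration OutputLocation=s3://my-athena-results/
# The returned QueryExecutionId can be polled (see aws athena get-query-execution further down)
# and the rows fetched with:
aws athena get-query-results --query-execution-id <QueryExecutionId>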
Note that Athena is based on Presto, a distributed SQL query engine for big data: https://github.com/prestodb/presto
If you want to explore Presto more, here's how I set it up locally:
# Check for Maven
sudo apt policy maven
maven:
Installed: 3.5.2-2
Candidate: 3.5.2-2
Version table:
*** 3.5.2-2 500
500 http://archive.ubuntu.com/ubuntu bionic/universe amd64 Packages
500 http://archive.ubuntu.com/ubuntu bionic/universe i386 Packages
100 /var/lib/dpkg/status
sudo apt install maven
# Find Maven Install
ls -lsa /usr/share/maven
total 32
4 drwxr-xr-x 6 root root 4096 Oct 9 07:07 .
12 drwxr-xr-x 382 root root 12288 Jan 27 11:23 ..
4 drwxr-xr-x 2 root root 4096 Oct 9 07:07 bin
4 drwxr-xr-x 2 root root 4096 Oct 9 07:07 boot
0 lrwxrwxrwx 1 root root 10 Feb 23 2018 conf -> /etc/maven
4 drwxr-xr-x 2 root root 4096 Oct 9 07:07 lib
4 drwxr-xr-x 2 root root 4096 Oct 9 07:07 man
# Find Maven properties
ls -lsa /etc/maven
total 40
4 drwxr-xr-x 3 root root 4096 Oct 9 07:07 .
12 drwxr-xr-x 160 root root 12288 Jan 27 11:23 ..
4 drwxr-xr-x 2 root root 4096 Oct 9 07:07 logging
4 -rw-r--r-- 1 root root 220 Oct 18 2017 m2.conf
12 -rw-r--r-- 1 root root 10211 Oct 18 2017 settings.xml
4 -rw-r--r-- 1 root root 3645 Oct 18 2017 toolchains.xml
sudo apt-get update
sudo apt-get install default-jre # install Java Runtime Environment (JRE)
sudo apt-get install default-jdk # has JRE in it as well, slightly larger than JRE only
The JRE I wanted to use was in /usr/lib/jvm/java-1.11.0-openjdk-amd64, so my .bashrc has:
export JAVA_HOME="/usr/lib/jvm/java-1.11.0-openjdk-amd64"
export PATH=$PATH:$JAVA_HOME/bin
Follow the instructions here: https://prestodb.github.io/docs/current/installation/deployment.html
# Download the latest release
cd /tmp && wget https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.216/presto-server-0.216.tar.gz
tar -xvf presto-server-0.216.tar.gz
Set up the config files:
/presto-server-0.216/etc/config.properties
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8081
query.max-memory=5GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery-server.enabled=true
discovery.uri=http://localhost:8081
/presto-server-0.216/etc/node.properties
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/presto/data
/presto-server-0.216/etc/log.properties
com.facebook.presto=INFO
Set up catalogs:
/presto-server-0.216/etc/catalog/localfile.properties
connector.name=localfile
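The localfile connector also needs to be pointed at the HTTP request logs it exposes; if I'm reading the Presto docs right, that's a location property along these lines:
# example path; point this at wherever your http-request.log files actually live
presto-logs.http-request-log.location=/var/log/presto/http-request.log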
Run the server:
cd presto-server-0.216/bin && ./launcher start
# by default the HTTP port is 8080, but that port was already in use on my machine, hence 8081
You'll see a UI showing query activity at: http://localhost:8081/ui/
Usage:
Create an etc/catalog directory, where we'll put our catalogs, which allow you to connect to different data sources. Example catalogs include MySQL, Hive, and LocalFile. When you add a catalog using presto-admin, make sure to restart the Presto server.
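For example, on a cluster managed with presto-admin (see the command list further down), and assuming a mysql.properties catalog file has already been written for it:
presto-admin catalog add mysql
presto-admin server restart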
Install the CLI for Presto from here: https://prestodb.github.io/docs/current/installation/cli.html
The downloaded file can be renamed to presto, made executable with chmod +x, and then moved to /usr/local/bin.
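Roughly like this (the jar URL follows Maven Central's naming and assumes 0.216 to match the server above, so double-check the version):
cd /tmp && wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.216/presto-cli-0.216-executable.jar
mv presto-cli-0.216-executable.jar presto && chmod +x presto
sudo mv presto /usr/local/bin/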
Can run with presto, or say presto --server localhost:8081 --catalog hive --schema default.
If you run into an error, make sure to run with the debug flag (e.g. presto --server localhost:8081 --debug). The errors that I've seen have been things like a catalog being set up incorrectly.
presto> show catalogs;
Catalog
-----------
localfile
system
(2 rows)
Query 20190220_235643_00000_jjy5y, FINISHED, 1 node
http://will.data.lan:8081/ui/query.html?20190220_235643_00000_jjy5y
Splits: 19 total, 19 done (100.00%)
CPU Time: 0.1s total, 0 rows/s, 0B/s, 8% active
Per Node: 0.0 parallelism, 0 rows/s, 0B/s
Parallelism: 0.0
Peak Memory: 0B
0:02 [0 rows, 0B] [0 rows/s, 0B/s]
# show schemas from <catalog> [like <pattern>]
presto> show schemas from localfile;
Schema
--------------------
information_schema
logs
(2 rows)
Query 20190221_000330_00011_jjy5y, FINISHED, 1 node
http://will.data.lan:8081/ui/query.html?20190221_000330_00011_jjy5y
Splits: 19 total, 19 done (100.00%)
CPU Time: 0.0s total, 125 rows/s, 1.95KB/s, 5% active
Per Node: 0.1 parallelism, 6 rows/s, 103B/s
Parallelism: 0.1
Peak Memory: 82.4KB
0:00 [2 rows, 32B] [6 rows/s, 103B/s]
presto> show tables from localfile.logs;
Table
------------------
http_request_log
(1 row)
Query 20190221_185008_00029_jjy5y, FINISHED, 1 node
http://will.data.lan:8081/ui/query.html?20190221_185008_00029_jjy5y
Splits: 19 total, 19 done (100.00%)
CPU Time: 0.0s total, 111 rows/s, 3.25KB/s, 7% active
Per Node: 0.0 parallelism, 5 rows/s, 164B/s
Parallelism: 0.0
Peak Memory: 0B
0:00 [1 rows, 30B] [5 rows/s, 164B/s]
Presto-Admin is a tool for installing and managing the Presto query engine on a cluster: https://github.com/prestodb/presto-admin
I downloaded it and ran python setup.py develop to get presto-admin to work.
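In other words, something like the following (a sketch of what I did rather than the official install path):
git clone https://github.com/prestodb/presto-admin.git
cd presto-admin
sudo python setup.py develop  # installs presto-admin in development mode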
$ presto-admin
Usage: presto-admin [options] <command> [arg]
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-d, --display print detailed information about command
--extended-help print out all options, including advanced ones
-I, --initial-password-prompt
Force password prompt up-front
-p PASSWORD, --password=PASSWORD
password for use with authentication and/or sudo
Commands:
catalog add
catalog remove
collect logs
collect query_info
collect system_info
configuration deploy
configuration show
file copy
file run
package install
package uninstall
plugin add_jar
server install
server restart
server start
server status
server stop
server uninstall
server upgrade
topology show
Presto also has a JDBC driver: https://prestodb.github.io/docs/current/installation/jdbc.html
Another program to query your data is Apache Drill.
bin/drill-embedded  # start the embedded Drill shell (exit later with !quit)
0: jdbc:drill:zk=local> show schemas;
+---------------------+
| SCHEMA_NAME |
+---------------------+
| cp.default |
| dfs.default |
| dfs.root |
| dfs.tmp |
| information_schema |
| sys |
+---------------------+
6 rows selected (0.23 seconds)
0: jdbc:drill:zk=local> use cp;
+-------+---------------------------------+
| ok | summary |
+-------+---------------------------------+
| true | Default schema changed to [cp] |
+-------+---------------------------------+
1 row selected (0.082 seconds)
To query a local file or directory, run your query against the dfs file-system plugin like so (say there are Parquet files there):
0: jdbc:drill:zk=local> select * from dfs.`/usr/local/airflow/data/`;
Keep in mind that certain queries don't work (e.g. DESCRIBE to show columns will error out with an invalid schema).
A useful setting, especially when there are a lot of columns:
0: jdbc:drill:zk=local> !set maxwidth 10000
To quit, run !quit
Set up a database, table, and schema, then run your queries
Data Definition Language (DDL) is a standard for commands that define the different structures in a database. DDL statements create, modify, and remove database objects such as tables, indexes, and users. Common DDL statements are CREATE, ALTER, and DROP.
/* Create a table */
CREATE EXTERNAL TABLE IF NOT EXISTS default.elb_logs (
`request_timestamp` string,
`elb_name` string,
`request_ip` string,
`request_port` int,
`backend_ip` string,
`backend_port` int,
`request_processing_time` double,
`backend_processing_time` double,
`client_response_time` double,
`elb_response_code` string,
`backend_response_code` string,
`received_bytes` bigint,
`sent_bytes` bigint,
`request_verb` string,
`url` string,
`protocol` string,
`user_agent` string,
`ssl_cipher` string,
`ssl_protocol` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = '([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*):([0-9]*) ([.0-9]*) ([.0-9]*) ([.0-9]*) (-|[0-9]*) (-|[0-9]*) ([-0-9]*) ([-0-9]*) \"([^ ]*) ([^ ]*) (- |[^ ]*)\" (\"[^\"]*\") ([A-Z0-9-]+) ([A-Za-z0-9.-]*)$'
) LOCATION 's3://athena-examples-us-east-1/elb/plaintext/'
TBLPROPERTIES ('has_encrypted_data'='false');
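Once the table exists, you can query it from the Athena console or the CLI. A hypothetical example (the output location is a placeholder):
aws athena start-query-execution \
  --query-string "SELECT elb_name, count(*) AS requests FROM default.elb_logs GROUP BY elb_name ORDER BY requests DESC LIMIT 10" \
  --query-execution-context Database=default \
  --result-configuration OutputLocation=s3://my-athena-results/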
ALTER TABLE orders ADD
PARTITION (dt = '2016-05-14', country = 'IN') LOCATION 's3://mystorage/path/to/INDIA_14_May_2016/'
PARTITION (dt = '2016-05-15', country = 'IN') LOCATION 's3://mystorage/path/to/INDIA_15_May_2016/';
ALTER TABLE orders
DROP PARTITION (dt = '2014-05-14', country = 'IN'), PARTITION (dt = '2014-05-15', country = 'IN');
SHOW PARTITIONS my_schema.my_table;
run_date=2022-10-01
run_date=2022-10-02
SELECT * FROM "my_table$partitions" LIMIT 100;
Assuming you are partitioning on something like run_date
SELECT DISTINCT run_date, "$path" FROM my_schema.my_table
ORDER BY run_date DESC
LIMIT 1000
run_date, $path
2022-10-12, s3://mybucket/something/here/run_date=2022-10-12/files
2022-10-13, s3://mybucket/something/here/run_date=2022-10-13/files
If you don't want to drop and add, you can just change the partition location:
ALTER TABLE customers PARTITION (zip='98040', state='WA') SET LOCATION 's3://mystorage/custdata/';
You can check on a query execution from the CLI:
❯ aws athena get-query-execution --query-execution-id a6475246-xxxx-xxxx-xxxx-xxxxxxxxxxx
{
"QueryExecution": {
"QueryExecutionId": "a6475246-xxxx-xxxx-xxxx-xxxxxxxxxxx",
"Query": ...
AWS Glue is a fully managed ETL service that can categorize your data, clean it, enrich it, and move it reliably between various data stores. AWS Glue crawlers automatically infer database and table schema from your dataset, storing the associated metadata in the AWS Glue Data Catalog.
So how does Athena work with Glue? Athena supports querying datasets and data sources that are registered with the AWS Glue Data Catalog.
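For instance, a crawler that populates the Data Catalog from an S3 prefix so that Athena can query it (the crawler name, IAM role, database, and path are all placeholders):
aws glue create-crawler \
  --name my-s3-crawler \
  --role AWSGlueServiceRole-demo \
  --database-name my_database \
  --targets '{"S3Targets":[{"Path":"s3://mybucket/something/here/"}]}'
aws glue start-crawler --name my-s3-crawler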
AWS Glue Schema Registry is a feature that lets you discover, control, and evolve data stream schemas. A schema defines the structure and format of a data record. With Glue Schema Registry, you can manage and enforce schemas on your data streaming applications. Since Glue Schema Registry is part of AWS, there are convenient integrations with MSK, Kinesis Data Streams, Kinesis Data Analytics for Apache Flink, and AWS Lambda.
So what does Schema Registry solve? You can use a schema as a data contract between producers and consumers to ensure resilience for cases like upstream changes. The idea is that you can have many systems share a schema for serialization and de-serialization.
A schema defines the structure and format of a data record at a specific version. Schemas can be defined in AVRO, which has Java language support.
A registry is a logical container of schemas. Registries allow you to organize your schemas as well as manage access for which applications can access which schemas. A registry has an Amazon Resource Name (ARN) that lets you organize and set different access permissions to different schema operations.
Serde libraries are serializer and deserializer libraries used for parsing data in different formats, such as CSV, JSON, Apache Avro, Apache Parquet, and ORC.
Each schema can have multiple versions. Versioning is governed by a compatibility rule applied to the schema: when you try to register a new schema version, the check is performed against previous versions according to that rule. There are 8 compatibility modes:
NONE - No compatibility mode applies; use for development. Any new version will be accepted and there are no compatibility checks.
DISABLED - Prevents versioning for a particular schema. No new versions can be added.
BACKWARD - Recommended; allows consumers to read both the current and the previous schema version. Use this to check compatibility against the previous schema version when you delete fields or add optional fields.
BACKWARD_ALL - Allows consumers to read both the current and all previous schema versions. Use this to check compatibility against all previous schema versions when you delete fields or add optional fields.
FORWARD - Allows consumers to read both the current and the subsequent schema version, but not later versions. A use case is when your application has been written for a previous schema and should still be able to process a more recent one.
FORWARD_ALL - Allows consumers to read data written by producers of any new registered schema. Use this choice when you need to add fields or delete optional fields, and check compatibility against all previous schema versions.
FULL - Allows consumers to read data written by producers using the previous or next version of the schema, but not earlier or later versions. You can use this choice to check compatibility against the last schema version when you add or remove optional fields.
FULL_ALL - Allows consumers to read data written by producers using all previous schema versions. Use this choice to check compatibility against all previous schema versions when you add or remove optional fields.
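For example, you can set the compatibility mode when creating a schema from the CLI (the registry name, schema name, and AVRO definition here are made up):
aws glue create-schema \
  --registry-id RegistryName=my-registry \
  --schema-name orders \
  --data-format AVRO \
  --compatibility BACKWARD \
  --schema-definition '{"type":"record","name":"Order","fields":[{"name":"id","type":"string"}]}'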
The Schema Registry works as follows: when a producer registers a schema, the schema name defaults to the name of the destination it writes to (e.g. test_topic, test_stream), or the producer can provide a custom name for the schema. Producers can also add key-value pairs to the schema as metadata (e.g. source: msk_kafka_topic_A).
Once registered, the Schema Registry returns the schema version id to the serializer. Note: compatibility checks are done on the schema to ensure the new version is compatible before it is registered as the new version.
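You can inspect what the registry hands back for a schema from the CLI as well, e.g. (same made-up names as above):
aws glue get-schema-version \
  --schema-id RegistryName=my-registry,SchemaName=orders \
  --schema-version-number LatestVersion=true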
You can register a schema in two ways: manually or via auto-registration. If a REGISTRY_NAME is not provided, the new schema version goes under the default registry, default-registry. Note: there are also resource-level permissions and identity-based IAM policies you can apply.
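A manual registration of a new, backward-compatible version might look like the following from the CLI, continuing the made-up orders schema from above (the added notes field has a default, so it passes a BACKWARD check):
aws glue register-schema-version \
  --schema-id RegistryName=my-registry,SchemaName=orders \
  --schema-definition '{"type":"record","name":"Order","fields":[{"name":"id","type":"string"},{"name":"notes","type":["null","string"],"default":null}]}'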