Amazon Redshift is a fully managed, petabyte-scale data warehouse service. An Amazon Redshift data warehouse is a collection of computing resources called nodes, which are organized into a group called a cluster. Each cluster runs an Amazon Redshift engine and contains one or more databases. Redshift manages all the work of provisioning capacity, monitoring, and backing up the cluster.
Redshift creates one database when you provision a cluster, and you can create additional databases from that initial database. In each database you can create tables, load data, and run queries. Note that Redshift has its roots in PostgreSQL.
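As a minimal sketch (the database name here is hypothetical), adding a second database looks like:
-- Run from a session connected to the initial database;
-- 'my_database2' is a hypothetical name for illustration.
CREATE DATABASE my_database2;
Then connect to the new database with your SQL client and create tables there as usual.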
Amazon Redshift Spectrum allows you to run SQL queries directly against data in Amazon S3 without loading it into Redshift. You can use open data formats like CSV, TSV, Parquet, SequenceFile, and RCFile. Most databases store data in rows, but Redshift is a columnar datastore. You can create Redshift tables and also query data in S3 using Redshift Spectrum.
In an Amazon Redshift data warehouse, we have the following architecture:
Amazon Redshift is based on PostgreSQL, so most existing SQL client applications will work with minimal changes. Clients communicate with the cluster through Amazon Redshift's PostgreSQL-compatible JDBC and ODBC drivers.
A Redshift cluster consists of nodes. The node type determines the CPU, RAM, and storage capacity. A cluster has a leader node and one or more compute nodes.
The leader node distributes SQL to the compute nodes when a query references user-created tables or system tables (tables with an STL or STV prefix and system views with an SVL or SVV prefix). Some queries run only on the leader node, such as queries that reference catalog tables (tables with a PG prefix) or that do not reference any tables at all.
Some of the leader-node only functions include:
CURRENT_SCHEMA
CURRENT_SCHEMAS
HAS_DATABASE_PRIVILEGE
HAS_SCHEMA_PRIVILEGE
HAS_TABLE_PRIVILEGE
In short, the leader node parses queries and develops execution plans (a series of steps) to carry out database operations, distributes the compiled code to the compute nodes, and assigns a portion of the data to each compute node.
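To see the plan the leader node develops, you can run EXPLAIN on a query; a quick sketch (the sales table here is just the sample table used later in these notes):
-- Show the execution plan the leader node develops for a query
EXPLAIN
SELECT eventid, SUM(pricepaid)
FROM sales
GROUP BY eventid;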
PG_TABLE_DEF
With Amazon Redshift, metadata (table definition information) is stored in the PG_TABLE_DEF table. The PG_ part is leftover from PostgreSQL.
SELECT
*
FROM
PG_TABLE_DEF;
# Returns
# schemaname, tablename, column, type, encoding, distkey, sortkey, notnull
You normally don’t want to return everything.
SELECT
DISTINCT tablename
FROM
PG_TABLE_DEF
WHERE
schemaname = 'public';
To list schemas in Redshift:
select s.nspname as table_schema,
s.oid as schema_id,
u.usename as owner
from pg_catalog.pg_namespace s
join pg_catalog.pg_user u on u.usesysid = s.nspowner
order by table_schema;
-- Example output
-- table_schema  schema_id  owner
-- my_schema     1          will
This returns the schemas per database, which conceptually look like:
databases
  my_database1
    schemas
      my_schema1
      my_schema2
  my_database2
    schemas
      my_schema3
PG_NAMESPACE
SELECT
*
FROM
PG_NAMESPACE;
# returns nspname, nspowner, nspacl
Remember that you can SET search_path TO <myschema> so that unqualified table names resolve to that schema.
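A minimal sketch (the schema name is hypothetical):
-- Point the session's search path at a schema so unqualified names resolve there
SET search_path TO my_schema;
-- Check the current setting
SHOW search_path;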
With Amazon Redshift Spectrum, you can query external data.
SELECT * FROM SVV_EXTERNAL_DATABASES;
# eskind, esoptions, databasename, location, parameters
SELECT * FROM SVV_EXTERNAL_SCHEMAS;
# esoid, eskind, schemaname, esowner, databasename, esoptions
SELECT * FROM SVV_EXTERNAL_TABLES;
# schemaname, tablename, location, input_format, output_format, serialization_lib, serde_parameters, compressed, parameters
You have to create an external table in an external schema. An external schema references a database in the external data catalog and provides the IAM role ARN that authorizes your cluster to access S3.
To create an external schema, you reference a database in an external data catalog: Amazon Athena, the AWS Glue Data Catalog, or an Apache Hive metastore (e.g. on Amazon EMR). The syntax looks like:
CREATE EXTERNAL SCHEMA [IF NOT EXISTS] schema_name
FROM { [DATA CATALOG] | HIVE METASTORE }
DATABASE 'database_name'
[ REGION 'aws-region' ]
[ URI 'hive_metastore_uri' [ PORT port_number ] ]
IAM_ROLE 'iam-role-arn-string'
[ CATALOG_ROLE 'catalog-role-arn-string' ]
[ CREATE EXTERNAL DATABASE IF NOT EXISTS ]
E.g.
CREATE EXTERNAL SCHEMA my_raw_schema
FROM DATA CATALOG
DATABASE 'my_raw_staging_schema'
IAM_ROLE 'arn:aws:iam::something:role/role_name';
Where my_raw_staging_schema is your database in Athena.
To create an external table, it might look like this:
Say we have S3 data in: s3://my-bucket-name/tickit/spectrum/sales
CREATE EXTERNAL TABLE spectrum.sales(
salesid integer,
listid integer,
sellerid integer,
dateid smallint,
pricepaid decimal(8,2),
saletime timestamp)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://my-bucket-name/tickit/spectrum/sales/'
TABLE PROPERTIES ('numRows'='172000');
Additional Details here: https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-tables.html
An external table can be, say, a table defined in AWS Athena.
You can see what external tables are available by checking the system view here:
SELECT * FROM SVV_EXTERNAL_TABLES;
Some important columns are schemaname, tablename, and location.
You can create this source table in the AWS Glue Data Catalog so that you can use the data in both Athena and Redshift.
You can convert the data to Parquet with an AWS Glue job (or do this outside of the AWS tools by reading from the S3 location and then writing to another location as Parquet with some code).
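Another option is an Athena CTAS query that rewrites the data as Parquet; a sketch, where the database name, target table name, and output location are hypothetical:
-- Athena CTAS (run in Athena): rewrite a table's data as Parquet in a new S3 location.
-- 'athena_db', 'sales_parquet', and the external_location are hypothetical.
CREATE TABLE athena_db.sales_parquet
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket-name/tickit/spectrum/sales-parquet/'
) AS
SELECT * FROM athena_db.sales;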
You can query your data in S3 using Athena with a query like:
SELECT COUNT(*) FROM spectrum.sales;
There are pseudocolumns $path and $size that let you see the location and size of the data files for each row returned by a query. You have to explicitly specify these columns; they aren't returned by a SELECT *.
SELECT "$path", "$size"
FROM myschema.mytable
WHERE myid = 1;
https://docs.aws.amazon.com/redshift/latest/dg/c_SQL_commands.html
https://docs.aws.amazon.com/redshift/latest/dg/c_SQL_functions.html
If you are running a lot of Admin functionality (e.g. want the DDL for tables, views, etc), then consider looking at the scripts here: https://github.com/awslabs/amazon-redshift-utils/tree/master/src/AdminViews
A relational database presents data in a two-dimensional table of columns and rows. E.g.
RowId  EmpId  Lastname  Firstname  Salary
1      10     Smith     Joe        25000
2      12     Jones     Mary       40000
3      11     Johnson   Cathy      55000
4      20     Jones     Bob        30000
The main idea is that the two-dimensional format is an abstraction. In an actual implementation, storage hardware serializes the data into some form. Hard disks are organized into a series of fixed-size blocks (usually large enough to fit several rows of a table).
As data is inserted into the table, each row of data is assigned an internal id (RowId). This matches the common use case of a system retrieving information about a specific object (e.g. a user in a rolodex). A row-based system is designed to efficiently return data for an entire row.
A row-oriented table might serialize each row of data like this:
1:10,Smith,Joe,25000
2:12,Jones,Mary,40000
3:11,Johnson,Cathy,55000
4:20,Jones,Bob,30000
Row-based systems are not efficient at performing set-wide operations on the whole table (as opposed to operations on a small number of specific records). For example, if we want to find salaries between 30,000 and 45,000, the database has to scan the entire table.
To improve the performance of these sorts of operations (e.g. find salaries between x and y), relational databases use database indexes, which store all the values from a set of columns along with RowId pointers back to the original table. An index on salary might look like:
1:25000
2:40000
3:55000
4:30000
Indexes help with lookups, but maintaining indexes adds overhead to the system (e.g. when new data is written to the database, any attached indexes also have to be updated).
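As a sketch in generic SQL (this applies to row-oriented relational databases; Redshift itself does not support CREATE INDEX and relies on sort keys and zone maps instead; the employees table and index name are hypothetical):
-- Create an index on salary so range queries can avoid a full table scan
CREATE INDEX idx_employees_salary ON employees (salary);

-- This range query can now use the index
SELECT empid, lastname, salary
FROM employees
WHERE salary BETWEEN 30000 AND 45000;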
A column-oriented database serializes the values of a column together. For example:
10:1,12:2,11:3,20:4
Smith:1,Jones:2,Johnson:3,Jones:4
Joe:1,Mary:2,Cathy:3,Bob:4
25000:1,40000:2,55000:3,30000:4
One big difference between the systems is that in the row-based system, the primary key is the RowId field, while in the column-based system the primary key is the data itself. In practice this means that for duplicate data (e.g. 'Jones'), the column-based system can store the value only once, as in this serialized version:
…;Smith:1;Jones:2,4;Johnson:3;…
Whether a row or column data store is the better fit depends on the type of data and queries you're working with.
For data storage, you can store data locally on a regular server or in the 'cloud' with S3 and Redshift Spectrum.
With local disk storage (say a traditional SSD), storage is tied to compute. You have to make full read-only copies to scale.
With S3 and Redshift Spectrum, AWS Athena metadata is used to understand the files in S3. This decouples storage from compute. You still have to make read-only copies, but of the metadata only, so scaling is smaller and faster.
Redshift isn't meant to be a transactional database (i.e. it is not built for frequent updates and deletes). The preferred way of loading data into Redshift is to use COPY to load files from any of the following: Amazon S3, Amazon EMR, Amazon DynamoDB, or remote hosts over SSH.
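A minimal COPY sketch from S3 (the table name, bucket path, and IAM role ARN are hypothetical):
-- Bulk load tab-delimited files from S3 into an existing table
COPY sales
FROM 's3://my-bucket-name/tickit/spectrum/sales/'
IAM_ROLE 'arn:aws:iam::something:role/role_name'
DELIMITER '\t';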
So what’s different between Redshift and Athena?
Workload Management (WLM) is used to define multiple query queues and to route queries to the appropriate queues at runtime. The idea is that you might have many people running queries at the same time, and some queries might use more cluster resources for longer periods of time, which can affect the performance of other queries.
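At runtime you can route a session's queries to a specific WLM queue by setting its query group; a sketch (the query group name is hypothetical and must match a group listed in the WLM configuration):
-- Route subsequent queries in this session to the queue configured for 'reports'
SET query_group TO 'reports';
SELECT COUNT(*) FROM sales;
-- Go back to the default routing
RESET query_group;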
You manage database security by controlling which users have access to which database objects. You can grant access to objects to either user accounts or groups.
To create and modify users, you can run:
CREATE USER
ALTER USER
DROP USER
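For example (a sketch with hypothetical user names and passwords):
-- Create a user, change their password, then remove them
CREATE USER analyst_jane PASSWORD 'Str0ngPassword';
ALTER USER analyst_jane PASSWORD 'An0therPassword';
DROP USER analyst_jane;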
To get a list of current users, run:
SELECT * FROM pg_user;
#usename, usesysid, usecreatedb, usesuper, usecatupd, passwd, valuntil, useconfig
#rdsdb 1
You can make a user a superuser by giving them the CREATEUSER privilege:
CREATE USER adminuser CREATEUSER PASSWORD 'Mypassword1';
ALTER USER adminuser CREATEUSER;
Groups are collections of users that can be collectively assigned privileges for easier security maintenance.
To create groups:
CREATE GROUP mywebusers;
To list groups:
SELECT * FROM pg_group;
# groname, grosysid, grolist
# my_group, 100, {109}
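You can add or remove users from a group with ALTER GROUP (names below are hypothetical):
-- Add a user to the group, then remove them
ALTER GROUP mywebusers ADD USER analyst_jane;
ALTER GROUP mywebusers DROP USER analyst_jane;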
A database has one or more named schemas. Each schema in a database contains tables and other kinds of named objects.
By default, a database has a single schema named PUBLIC. You can use schemas to group database objects under a common name. Schema commands include:
CREATE SCHEMA
ALTER SCHEMA
DROP SCHEMA
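For example (a sketch; the schema and owner names are hypothetical):
-- Create a schema, hand ownership to another user, then drop it
CREATE SCHEMA IF NOT EXISTS reporting;
ALTER SCHEMA reporting OWNER TO analyst_jane;
DROP SCHEMA reporting;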
To list all your schemas, you can query pg_namespace:
SELECT * FROM pg_namespace;
# nspname, nspowner, nspacl
To list all of your tables that belong to a schema, query pg_table_def:
SELECT DISTINCT(tablename) FROM pg_table_def
WHERE schemaname = 'my_schema';
You can grant usage with:
GRANT USAGE ON SCHEMA myschema TO GROUP mywebgroup;
GRANT ALL ON SCHEMA myschema TO GROUP mywebgroup;
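Note that usage on a schema doesn't by itself grant access to the tables inside it; a sketch of also granting table-level read access (same hypothetical names):
-- USAGE lets the group reference the schema; SELECT is still needed per table
GRANT SELECT ON ALL TABLES IN SCHEMA myschema TO GROUP mywebgroup;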
Some performance features include:
Workload Management (WLM) - flexibly manage priorities within workloads so that short, fast-running queries won't get stuck in queues behind long-running queries
You can query for recent query errors in:
SELECT *
FROM stl_error
ORDER BY recordtime DESC
LIMIT 100;
You can create a UDF using either a SQL SELECT clause or a Python program. You can also create an AWS Lambda UDF that uses custom functions defined in Lambda as part of your SQL queries. The Lambda route can overcome some of the current limitations of Python and SQL UDFs (e.g. accessing network and storage resources).
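A sketch of registering a Lambda UDF (the function signature, Lambda function name, and IAM role ARN are hypothetical):
-- Register a scalar UDF backed by an existing Lambda function
CREATE EXTERNAL FUNCTION f_lambda_multiply (int, int)
RETURNS int
VOLATILE
LAMBDA 'my-lambda-fn'
IAM_ROLE 'arn:aws:iam::something:role/role_name';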
All UDFs are in pg_proc. You might want to see only your custom-named ones (these usually start with f_):
SELECT * FROM pg_proc
WHERE proname ILIKE 'f_%' LIMIT 100;
A scalar SQL UDF uses a SQL SELECT clause that runs when the function is called and returns a single value. To create this function, you define optional input arguments (each argument needs a data type and is referenced by $1, $2, etc.) and one return data type.
create function f_sql_greater (double precision, double precision)
returns double precision
stable
as $$
select case when $1 > $2 then $1
else $2
end
$$ language sql;
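Calling it looks just like calling a built-in function; for example, against the same sales table used for the Python UDF below:
select f_sql_greater (commission, pricepaid*0.20) from sales;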
A scalar Python UDF uses a Python program that runs when the function is called and returns a single value. This is similar to the scalar SQL UDF, except that each input argument must have a name as well as a data type. The input and return data types can be Redshift data types or the Python ANYELEMENT data type.
Creating:
create function f_py_greater (a double precision, b double precision)
returns double precision
stable
as $$
if a > b:
return float(a)
return float(b)
$$ language plpythonu;
How to use:
select f_py_greater (commission, pricepaid*0.20) from sales;
It's recommended that you name all UDFs using the prefix f_ (which is reserved exclusively for UDFs in Redshift).