William Liu

PostgreSQL


##Table of Contents

Summary

PostgreSQL is an open source relational database.

Basic Commands

Login

In the command line, just enter psql to see your options

$psql --help

List Databases

In the command line, to see what databases are available for access, do:

$psql --list

Create Database

Create a database called mydb

createdb mydb

Access Database

Login to psql

psql -U myusername

Connect to a database (e.g. db is named mydb)

psql mydb [optional user name]

Lists all Databases

Now that you’re in psql, you can also list all databases

# \list
or
# \l

                             List of databases
   Name    |  Owner   | Encoding |  Collate   |   Ctype    |   Access privileges
-----------+----------+----------+------------+------------+-----------------------
 airflow   | postgres | UTF8     | en_US.utf8 | en_US.utf8 |
 postgres  | postgres | UTF8     | en_US.utf8 | en_US.utf8 |
 template0 | postgres | UTF8     | en_US.utf8 | en_US.utf8 | =c/postgres          +
           |          |          |            |            | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.utf8 | en_US.utf8 | =c/postgres          +
           |          |          |            |            | postgres=CTc/postgres

Show Table Size

You can run a query like this to see disk usage:

SELECT nspname || '.' || relname AS "relation",
    pg_size_pretty(pg_total_relation_size(C.oid)) AS "total_size"
  FROM pg_class C
  LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
  WHERE nspname NOT IN ('pg_catalog', 'information_schema')
    AND C.relkind <> 'i'
    AND nspname !~ '^pg_toast'
  ORDER BY pg_total_relation_size(C.oid) DESC
  LIMIT 5;
Example output:

COPY
       relation       | total_size
----------------------+------------
 public.snapshots     | 823 MB
 public.invoice_items | 344 MB
 public.messages      | 267 MB
 public.topics        | 40 MB
 public.invoices      | 35 MB
(5 rows)

See Roles

Prints out a list of roles and attributes (e.g. name, superuser, member of)

 \du

Change Database Password

Change the password of a database

\password mydb

View current connection info

View current connection information

\conninfo

View Database Tables

View database tables and their owners

\dt

View Tables, Views, and Sequences

View all of the tables, views, and sequences

\z

Switch Databases

To connect to another database (e.g. db ‘jobwaffle_prod’)

\connect jobwaffle_prod
or
\c jobwaffle_prod

Expanded Output

To use expanded output, do \x

Display Triggers

To display triggers, do \dy+

Display Functions

To display functions, do \df+

Queries

While in the interactive session, remember to terminate queries with a ;

mydb=# SELECT * FROM mytable;

Exit

Exits out of the PostgreSQL command prompt

\q

SQL Help

To get help for SQL Commands

\h

PSQL Help

To get help for psql commands

\?

SQL Examples

Here’s some examples of SQL queries taken from pgexercises.

Example Schema

Given the following tables:

cd.members
memid               | integer
    surname             | character varying(200)
    firstname           | character varying(200)
    address             | character varying(300)
    zipcode             | integer
    telephone           | character varying(20)
    recommendedby       | integer
    joindate            | timestamp

cd.bookings
    facid               | integer
    memid               | integer
    starttime           | timestamp
    slots               | integer

cd.facilities
    facid               | integer
    name                | character varying(100)
    membercost          | numeric
    guestcost           | numeric
    initialoutlay       | numeric
    monthlymaintenance  | numeric

Basic Selects

Select All

Retrieve everything from a table (all columns, all rows):

SELECT *
FROM cd.facilities;

facid	name	        membercost	guestcost	initialoutlay	monthlymaintenance
0	    Tennis Court 1	5	        25	        10000	        200
1	    Tennis Court 2	5	        25	        8000	        200
2	    Badminton Court	0	        15.5        4000            50
3	    Table Tennis	0	        5           320             10
4	    Massage Room 1	35	        80	        4000	        3000
5	    Massage Room 2	35	        80	        4000	        3000
6	    Squash Court	3.5	        17.5        5000            80
7	    Snooker Table	0	        5	        450	            15
8	    Pool Table      0           5           400             15

Select Specific Columns

To select specific columns from a table, we use SELECT.

SELECT name, membercost
FROM cd.facilities;

name	            membercost
Tennis Court 1	    5
Tennis Court 2	    5
Badminton Court	    0
Table Tennis	    0
Massage Room 1	    35
Massage Room 2	    35
Squash Court	    3.5
Snooker Table	    0
Pool Table	        0

Select Specific Rows

To select specific rows from a table, we use SQL’s WHERE:

SELECT facid, name, membercost, monthlymaintenance
FROM cd.facilities
WHERE membercost > 0 AND (membercost < monthlymaintenance/50.0);

facid	name	        membercost	monthlymaintenance
4	    Massage Room 1	35	        3000
5	    Massage Room 2	35	        3000

LIKE

You can filter strings in tables (e.g. list all facilities with the word ‘Tennis’ in the name).

SELECT *
FROM cd.facilities
WHERE name LIKE '%Tennis%';

acid	name	            membercost	guestcost	initialoutlay	monthlymaintenance
0	    Tennis Court 1	    5	        25	        10000	        200
1	    Tennis Court 2	    5	        25	        8000	        200
3	    Table Tennis	    0	        5	        320             10

SIMILAR TO

SIMILAR is similar to LIKE, except you can run a regular expression.

'abc' SIMILAR TO 'abc'      # true
'abc' SIMILAR TO 'a'        # false
'abc' SIMILAR TO '%(b|d)%'  # true

https://www.postgresql.org/docs/8.3/functions-matching.html

We have:

`|` to denote alternation (either of two alternatives)
`*` to denote repetition of the previous item zero or more times
`+` to denote repetition of the previous item one or more times
`()` to group items into a single logical item
`[]` to specify a character class, like in a POSIX regular expression

Filter In List

You can filter to see if an item is in a list (e.g. facilities with ID 1, 3, 5)

SELECT *
FROM cd.facilities
WHERE facid IN (1, 3, 5);

If you return a result that is a single column, you can feed those results into an IN operator. This is called a subquery. For example:

SELECT *
FROM cd.facilities
WHERE facid IN (
    SELECT facid FROM cd.facilities
);

Select into Categories

If you want to select results into categories (e.g. facilities labeled ‘cheap’ or ‘expensive’ depending on monthly maintenacne costs > $100). You want to then use SQL CASE (the equivalent of an if/switch statement)

SELECT name,
       CASE WHEN monthlymaintenance > 100 THEN 'expensive'
            ELSE 'cheap'
       END AS cost
FROM cd.facilities

Select Dates

To select specific dates:

SELECT memid, surname, firstname, joindate
FROM cd.members
WHERE joindate > '2012-09-01';

memid	surname	            firstname	joindate
24	    Sarwin	            Ramnaresh	2012-09-01 08:44:42
26	    Jones	            Douglas	    2012-09-02 18:43:05
27	    Rumney	            Henrietta	2012-09-05 08:42:35
28	    Farrell	            David	    2012-09-15 08:22:05
29	    Worthington-Smyth	Henry	    2012-09-17 12:27:15
30	    Purview	            Millicent	2012-09-18 19:04:01
33	    Tupperware	        Hyacinth	2012-09-18 19:32:05
35	    Hunt	            John	    2012-09-19 11:32:45
36	    Crumpet	            Erica	    2012-09-22 08:36:38
37	    Smith	            Darren	    2012-09-26 18:08:45

Select CONVERT_TIMEZONE

CONVERT_TIMEZONE converts a timestamp from one time zone to another.

Format:

CONVERT_TIMEZONE ( ['source_timezone',] 'target_timezone', 'timestamp')

Example 1:

select listtime, convert_timezone('US/Pacific', listtime) from listing where listid = 16;

 listtime           |   convert_timezone
--------------------+---------------------
2008-08-24 09:36:12 | 2008-08-24 02:36:12

Example 2:

select convert_timezone('EST', 'PST', '20080305 12:25:29');

 convert_timezone
-------------------
2008-03-05 09:25:29

Select COALESCE function

COALESCE returns the first non-null argument with COALESCE(arg1, arg2, ...) with an unlimited number of args and returns the first non-null (evaluating left to right). If all arguments are null, then it will return null.

SELECT COALESCE(1, 2);  # 1
SELECT COALESCE(NULL, 2, 1)  # 2

Often this is used as a default value. For example, say you have a blog post excerpt. If there is no excerpt provided, then use the first 150 characters from the content post.

SELECT COALESCE(excerpt, LEFT(CONTENT, 150)) FROM blog_posts;

Select Order, Limit, and Distinct

Produce an ordered list of the first 10 surnames with no duplicates. We make use of ORDER, DISTINCT, LIMIT.

SELECT DISTINCT(surname)
FROM cd.members
ORDER BY surname
LIMIT 10;

surname
Bader
Baker
Boothe
Butters
Coplin
Crumpet
Dare
Farrell
GUEST
Genting

Combine results from multiple queries with Union

Create a combined list of all surnames and all facility names into a single column (for whatever reason). Use UNION.

SELECT surname
FROM cd.members
UNION
SELECT name
FROM cd.facilities

surname
Tennis Court 2
Worthington-Smyth
Badminton Court
Pinker
Dare
Bader
Mackenzie
Crumpet
Massage Room 1
Squash Court
Tracy
Hunt
Tupperware
Smith
Butters
Rownam
Baker
Genting
Purview
Coplin
Massage Room 2
Joplette
Stibbons
Rumney
Pool Table
Sarwin
Boothe
Farrell
Tennis Court 1
Snooker Table
Owen
Table Tennis
GUEST
Jones

Select Max and Min

Select the largest value of an expression.

SELECT MAX(joindate) AS latest
FROM cd.members;

latest
2012-09-26 18:08:45

Select the smallest value of an expression.

SELECT MIN(joindate) AS earliest
FROM cd.members;

earliest
2012-07-01 00:00:00

Select with Subquery

Get the first and last name of the last member(s) who signed up.

SELECT firstname, surname, joindate
FROM cd.members
WHERE joindate = (
    SELECT MAX(joindate) FROM cd.members
);

Since our subquery returns a table with a single column and single row, we can substitute this value in our WHERE clause.

Another method that isn’t quite as efficient and doesn’t handle the case where two people joined at the exact same time.

SELECT firstname, surname, joindate
FROM cd.members
ORDER BY joindate desc
LIMIT 1;

Joins and Subqueries

Relational databases are relational because of joins.

From

Remember that the output of a FROM is another table. That’s why you can do this:

SELECT _bookings.starttime
FROM cd.bookings _bookings,
     cd.members _members
WHERE
     _members.firstname = 'David'
     AND _members.surname = 'Farrell'
     AND _members.memid = _bookings.memid;

Inner Join

Say we want a list of the start times for bookings by members named ‘David Farrell’.

Use an INNER JOIN to combine two tables with matching values on both tables. You can do this a couple different styles:

# Style 1
SELECT _bookings.starttime
FROM cd.bookings _bookings
INNER JOIN cd.members _members
           ON _members.memid = _bookings.memid
WHERE _members.firstname = 'David'
      AND _members.surname = 'Farrell';

# Style 2
SELECT _bookings.starttime
FROM cd.bookings _bookings,
     cd.members _members
WHERE
     _members.firstname = 'David'
     AND _members.surname = 'Farrell'
     AND _members.memid = _bookings.memid;

starttime
2012-09-18 09:00:00
2012-09-18 17:30:00
2012-09-18 13:30:00
2012-09-18 20:00:00
2012-09-19 09:30:00
2012-09-19 15:00:00
2012-09-19 12:00:00
2012-09-20 15:30:00
2012-09-20 11:30:00
2012-09-20 14:00:00
2012-09-21 10:30:00
2012-09-21 14:00:00
2012-09-22 08:30:00
2012-09-22 17:00:00
2012-09-23 08:30:00
2012-09-23 17:30:00
2012-09-23 19:00:00
2012-09-24 08:00:00
2012-09-24 16:30:00
2012-09-24 12:30:00
2012-09-25 15:30:00
2012-09-25 17:00:00
2012-09-26 13:00:00
2012-09-26 17:00:00
2012-09-27 08:00:00
2012-09-28 11:30:00
2012-09-28 09:30:00
2012-09-28 13:00:00
2012-09-29 16:00:00
2012-09-29 10:30:00
2012-09-29 13:30:00
2012-09-29 14:30:00
2012-09-29 17:30:00
2012-09-30 14:30:00

Another example is to produce a list of the start times for bookings for tennis courts for the date ‘2012-09-21’ and return the start time and facility name pairings, ordered by the time.

SELECT books.starttime AS start,
   facil.name AS name
FROM cd.bookings books
INNER JOIN cd.facilities facil
    ON books.facid = facil.facid
WHERE
    DATE(books.starttime) = '2012-09-21'
    AND facil.name LIKE '%Tennis Court%'
ORDER BY
    books.starttime;

start	            name
2012-09-21 08:00:00	Tennis Court 1
2012-09-21 08:00:00	Tennis Court 2
2012-09-21 09:30:00	Tennis Court 1
2012-09-21 10:00:00	Tennis Court 2
2012-09-21 11:30:00	Tennis Court 2
2012-09-21 12:00:00	Tennis Court 1
2012-09-21 13:30:00	Tennis Court 1
2012-09-21 14:00:00	Tennis Court 2
2012-09-21 15:30:00	Tennis Court 1
2012-09-21 16:00:00	Tennis Court 2

##Views

###Materialized Views

Materialized Views in Postgres are like views, but they persist the results in a table-like form.

CREATE MATERIALIZED VIEW myview AS SELECT * FROM mytable;

The main difference between a materialized view and a created table is that the materialized view cannot be directly updated after it is created. The query used to create the materialized view is stored exactly the same way that a view’s query is stored. The only way to get fresh data is with:

REFRESH MATERIALIZED VIEW myview;

So why use a materialized view?

Aggregates

You can select aggregates using COUNT, SUM, and AVG, MAX, MIN. DISTINCT can be used with aggregates.

Notes

https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types

Arrays

ARRAY Ordered list of zero or more elements of any non-ARRAY type. An ARRAY is an ordered list of zero or more elements of non-ARRAY values. ARRAYs of ARRAYs are not allowed. Queries that would produce an ARRAY of ARRAYs will return an error. Instead a STRUCT must be inserted between the ARRAYs using the SELECT AS STRUCT construct.

Currently, BigQuery has two following limitations with respect to NULLs and ARRAYs: BigQuery raises an error if query result has ARRAYs which contain NULL elements, although such ARRAYs can be used inside the query. BigQuery translates NULL ARRAY into empty ARRAY in the query result, although inside the query NULL and empty ARRAYs are two distinct values.

Declaring an ARRAY type ARRAY types are declared using the angle brackets (< and >). The type of the elements of an ARRAY can be arbitrarily complex with the exception that an ARRAY cannot directly contain another ARRAY.

Format ARRAY

Structs

STRUCT Container of ordered fields each with a type (required) and field name (optional). Declaring a STRUCT type STRUCT types are declared using the angle brackets (< and >). The type of the elements of a STRUCT can be arbitrarily complex.

Format STRUCT

JSON Functions

https://www.postgresql.org/docs/9.3/functions-json.html

json_extract_path_text

Format: json_extract_path_text(from_json json, VARIADIC path_elems text[])

Example:

json_extract_path_text('{"f2":{"f3":1},"f4":{"f5":99,"f6":"foo"}}','f4', 'f6')  # returns 'foo'

Window Functions

The standard window function syntax looks like:

function (expression) OVER (
[ PARTITION BY expr_list ]
[ ORDER BY order_list [ frame_clause ] ] )

Let’s break this apart:

A function is any function like AVG, COUNT, MIN, MEDIAN, MAX, NTH_VALUE, FIRST_VALUE, LAST_VALUE, ROW_NUMBER, ``SUM`, etc. For Redshift, see list here: https://docs.aws.amazon.com/redshift/latest/dg/c_Window_functions.html

OVER defines the window specification. This is required and tells us that we’re looking at a window function instead of a normal function.

PARITION BY <expr_list> is optional and means that we want to subdivide the result set into partitions, kinda like a GROUP BY clause. If a partition clause is present, the function is calculated for the rows in each partition. If no partition clause is specified, a single partition contains the entire table and the function is computed for the entire table.

An expr_list is further broken down into

expression | column_name [, expr_list ]

ORDER BY <order_list> is optional and means that the window function is applied to the rows within each partition sorted according to the order specification in ORDER BY.

An order_list is further broken down into

expression | column_name [ ASC | DESC ]
[ NULLS FIRST | NULLS LAST ]
[, order_list ]

A frame_clause is used in aggregate functions to further refine the set of rows in a function’s window when using ORDER_BY. It provides the ability to include or exclude sets of rows within the ordered result. The frame clause does not apply to ranking functions and is not required when no ORDER BY clause is used in the OVER clause for an aggregate function. If an ORDER BY clause is used for an aggregate function, an explicit frame clause is required.

ROWS - the rows clause defines the window frame by specifying a physical offset from the current row. This clause specifies the rows in the current window or partition that the value in the current row is to be combined with. It uses arguments to specify the row position, which can be before or after the current row. The reference point for all window frames is the current row. Each row becomes the current row in turn as the window frame slides forward in the partition. We have two variations:

  1. The frame is a simple set of rows up to and including the current row

    { UNBOUNDED PRECEDING unsigned_value PRECEDING CURRENT ROW }
  2. The frame can be a set of rows between two boundaries

    {BETWEEN { UNBOUNDED PRECEDING | unsigned_value { PRECEDING | FOLLOWING } | CURRENT ROW} AND { UNBOUNDED FOLLOWING | unsigned_value { PRECEDING | FOLLOWING } | CURRENT ROW }}

UNBOUNDED PRECEDING indicates that the window starts at the first row of the partition. offset PRECEDING indicates that the window starts a number of rows equivalent to the value of offset before the current row UNBOUNDED PRECEDING is the default.

CURRENT ROW

Tips

Explain

In SQL Server (not Postgres), you can kinda see how much work is going on behind the scenes with something like:

SET STATISTICS IO ON
SET STATISTICS TIME ON
<my_query>
SET STATISTICS IO OFF
SET STATISTICS TIME OFF

In Postgres, we don’t have that exact command, but we do have explain:

EXPLAIN (ANALYZE ON, BUFFERS ON) SELECT ...

Monitoring

esides EXPLAIN, we have monitoring stats in pg_statio_* for IO stats. The data isn’t scoped to a session, but it can help in monitoring efficient queries in clean environments.

https://www.postgresql.org/docs/current/monitoring-stats.html

Optimization - Sargable Query vs Nonsargable Query

Non-sargable stands for Non Search Argument. When we do this type of query, SQL is unable to use an index. An example of this is when a function is used in the WHERE clause or a JOIN clause.

-- Non-sargable Query
SELECT EmployeeName
FROM EmployeeTest
WHERE LEFT(EmployeeName, 1) = 'A';
-- Results in performing the `LEFT` function on every single record

A sargable query stands for Searchable Argument. When we do this type of query, SQL is able to use indexes. An example of the above SQL that is faster.

-- Sargable Query
SELECT EmployeeName
FROM EmployeeTest
WHERE EmployeeName LIKE 'A%';

Other Examples of Sargable:

-- Non-Sargable
WHERE YEAR(DateOfBirth) = '1952'

-- Sargable Query
WHERE DateOfBirth >= '19520101' AND DateOfBirth < '19530101'

Summary: Use a sargable query by NOT including a function in your WHERE or JOIN clause so you can utilize indexes

So what do I do? Instead of using functions on your field (i.e. it shows up on the LEFT side of your field, instead we want this on the right hand side)

Optimization - SELECT specific columns’

Do not run SELECT * if you do not need all columns. If you just select the columns you need, it’ll be faster and you probably don’t (and shouldn’t) have an index on all columns.

Optimization - LIMIT

LIMIT your results if you do not need everything

Optimization - LEFT vs RIGHT calculation

If you do a calculation on the LEFT side vs the RIGHT side, this changes our speed:

# Non-Sargable Query because Calculation done on left side
WHERE Salary / 2 = 554677;

# Sargable Query because Calculation done on right side
WHERE Salary = (277338.5 * 2);

Indexes

In Postgres, you cannot just run a SHOW INDEXES command to the list the index information of a table or database. Instead, we have the pg_indexes view that you can query or you can run \d to command on psql to view the index information for a table.

See all indexes of a specific schema (using the pg_indexes view)

SELECT
    tablename,
    indexname,
    indexdef
FROM
    pg_indexes
WHERE
    schemaname = 'myschema'
ORDER BY
    tablename,
    indexname;

See all indexes for a table

SELECT
    indexname,
    indexdef
FROM
    pg_indexes
WHERE
    tablename = 'mytable';

Create an Index

You can create an index on specific column(s) for a table or a materialized view. There’s pros and cons to an index. It comes down to:

Include Column in Index

A lot of times when we create an index, we just think about the keyed columns. There is the optional INCLUDE clause that lets you specify a list of columns that will be included in the index as a non-key column.

Say you’re searching based off an indexed column and return just that index column. This is optimal since it hits the index (DateOfBirth) and return only that data (DateOfBirth).

SELECT DateOfBirth
FROM EmployeeTest
WHERE DateOfBirth >= '19520101' AND DateOfBirth < '19530101';

However, if you change the above to SELECT DateOfBirth, EmployeeName and EmployeeName is not an index, we now do a search for the index, then we go back and need to pull information for the EmployeeName column. If you are regularly pulling this information back, you might want to INCLUDE a column into the index.

Just take care to not add many non-key columns to an index when you have really wide columns. Also keep in mind that the index tuple cannnot exceed a certain size, otherwise data insertion fails. A non-key column duplicates data from the index’s table and increases the size of the index.

https://www.postgresql.org/docs/11/sql-createindex.html

Stale and Fragmented Indexes

Indexes store statistics (see monitoring) and uses estimations of what the data looks like in order to optimize itself. These statistics get old and the distribution of data may get old. You would need to redinex then.

https://www.postgresql.org/docs/9.3/sql-reindex.html

REINDEX

Update statistics with: https://www.postgresql.org/docs/9.3/sql-analyze.html

ANALYZE [VERBOSE} [table_name]

Updating statistics store information into the pg_statistic system catalog

Resources

Good basic and common usage: https://popsql.com/learn-sql/postgresql/