PostgreSQL

# \list
or
# \l

                             List of databases
   Name    |  Owner   | Encoding |  Collate   |   Ctype    |   Access privileges
-----------+----------+----------+------------+------------+-----------------------
 airflow   | postgres | UTF8     | en_US.utf8 | en_US.utf8 |
 postgres  | postgres | UTF8     | en_US.utf8 | en_US.utf8 |
 template0 | postgres | UTF8     | en_US.utf8 | en_US.utf8 | =c/postgres          +
           |          |          |            |            | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.utf8 | en_US.utf8 | =c/postgres          +
           |          |          |            |            | postgres=CTc/postgres

Show Table Size

You can run a query like this to see disk usage:

SELECT nspname || '.' || relname AS "relation",
    pg_size_pretty(pg_total_relation_size(C.oid)) AS "total_size"
  FROM pg_class C
  LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
  WHERE nspname NOT IN ('pg_catalog', 'information_schema')
    AND C.relkind <> 'i'
    AND nspname !~ '^pg_toast'
  ORDER BY pg_total_relation_size(C.oid) DESC
  LIMIT 5;
Example output:

COPY
       relation       | total_size
----------------------+------------
 public.snapshots     | 823 MB
 public.invoice_items | 344 MB
 public.messages      | 267 MB
 public.topics        | 40 MB
 public.invoices      | 35 MB
(5 rows)

See Roles

Prints out a list of roles and attributes (e.g. name, superuser, member of)

\du

Change Database Password

Change the password of a database

\password mydb

View current connection info

View current connection information

\conninfo

View Database Tables

View database tables and their owners

\dt

View Tables, Views, and Sequences

View all of the tables, views, and sequences

\z

Switch Databases

To connect to another database (e.g. db ‘jobwaffle_prod’)

\connect jobwaffle_prod
or
\c jobwaffle_prod

Queries

While in the interactive session, remember to terminate queries with a ;

mydb=# SELECT * FROM mytable;

Exit

Exits out of the PostgreSQL command prompt

\q

SQL Help

To get help for SQL Commands

\h

PSQL Help

To get help for psql commands

\?

SQL Examples

Here’s some examples of SQL queries taken from pgexercises.

Example Schema

Given the following tables:

cd.members
memid               | integer
    surname             | character varying(200)
    firstname           | character varying(200)
    address             | character varying(300)
    zipcode             | integer
    telephone           | character varying(20)
    recommendedby       | integer
    joindate            | timestamp

cd.bookings
    facid               | integer
    memid               | integer
    starttime           | timestamp
    slots               | integer

cd.facilities
    facid               | integer
    name                | character varying(100)
    membercost          | numeric
    guestcost           | numeric
    initialoutlay       | numeric
    monthlymaintenance  | numeric

Basic Selects

Select All

Retrieve everything from a table (all columns, all rows):

SELECT *
FROM cd.facilities;

facid	name	        membercost	guestcost	initialoutlay	monthlymaintenance
   Tennis Court 1	5	        25	        10000	        200
   Tennis Court 2	5	        25	        8000	        200
   Badminton Court	0	        15.5        4000            50
   Table Tennis	0	        5           320             10
   Massage Room 1	35	        80	        4000	        3000
   Massage Room 2	35	        80	        4000	        3000
   Squash Court	3.5	        17.5        5000            80
   Snooker Table	0	        5	        450	            15
   Pool Table      0           5           400             15

Select Specific Columns

To select specific columns from a table, we use SELECT.

SELECT name, membercost
FROM cd.facilities;

name	            membercost
Tennis Court 1	    5
Tennis Court 2	    5
Badminton Court	    0
Table Tennis	    0
Massage Room 1	    35
Massage Room 2	    35
Squash Court	    3.5
Snooker Table	    0
Pool Table	        0

Select Specific Rows

To select specific rows from a table, we use SQL’s WHERE:

SELECT facid, name, membercost, monthlymaintenance
FROM cd.facilities
WHERE membercost > 0 AND (membercost < monthlymaintenance/50.0);

facid	name	        membercost	monthlymaintenance
4	    Massage Room 1	35	        3000
5	    Massage Room 2	35	        3000

LIKE

You can filter strings in tables (e.g. list all facilities with the word ‘Tennis’ in the name).

SELECT *
FROM cd.facilities
WHERE name LIKE '%Tennis%';

acid	name	            membercost	guestcost	initialoutlay	monthlymaintenance
0	    Tennis Court 1	    5	        25	        10000	        200
1	    Tennis Court 2	    5	        25	        8000	        200
3	    Table Tennis	    0	        5	        320             10

The % represents zero, one, or multiple characters. The _ represents one, single character (e.g. LIKE 'A__')

SIMILAR TO

SIMILAR is similar to LIKE, except you can run a regular expression.

'abc' SIMILAR TO 'abc'      # true
'abc' SIMILAR TO 'a'        # false
'abc' SIMILAR TO '%(b|d)%'  # true

https://www.postgresql.org/docs/8.3/functions-matching.html

We have:

`|` to denote alternation (either of two alternatives)
`*` to denote repetition of the previous item zero or more times
`+` to denote repetition of the previous item one or more times
`()` to group items into a single logical item
`[]` to specify a character class, like in a POSIX regular expression

Filter In List

You can filter to see if an item is in a list (e.g. facilities with ID 1, 3, 5)

SELECT *
FROM cd.facilities
WHERE facid IN (1, 3, 5);

If you return a result that is a single column, you can feed those results into an IN operator. This is called a subquery. For example:

SELECT *
FROM cd.facilities
WHERE facid IN (
    SELECT facid FROM cd.facilities
);

If you want to select results into categories (e.g. facilities labeled ‘cheap’ or ‘expensive’ depending on monthly maintenacne costs > $100). You want to then use SQL CASE (the equivalent of an if/switch statement)

SELECT name,
       CASE WHEN monthlymaintenance > 100 THEN 'expensive'
            ELSE 'cheap'
       END AS cost
FROM cd.facilities

Select Dates

To select specific dates:

SELECT memid, surname, firstname, joindate
FROM cd.members
WHERE joindate > '2012-09-01';

memid	surname	            firstname	joindate
   Sarwin	            Ramnaresh	2012-09-01 08:44:42
   Jones	            Douglas	    2012-09-02 18:43:05
   Rumney	            Henrietta	2012-09-05 08:42:35
   Farrell	            David	    2012-09-15 08:22:05
   Worthington-Smyth	Henry	    2012-09-17 12:27:15
   Purview	            Millicent	2012-09-18 19:04:01
   Tupperware	        Hyacinth	2012-09-18 19:32:05
   Hunt	            John	    2012-09-19 11:32:45
   Crumpet	            Erica	    2012-09-22 08:36:38
   Smith	            Darren	    2012-09-26 18:08:45

Select CONVERT_TIMEZONE

CONVERT_TIMEZONE converts a timestamp from one time zone to another.

Format:

CONVERT_TIMEZONE ( ['source_timezone',] 'target_timezone', 'timestamp')

Example 1:

select listtime, convert_timezone('US/Pacific', listtime) from listing where listid = 16;

 listtime           |   convert_timezone
--------------------+---------------------
2008-08-24 09:36:12 | 2008-08-24 02:36:12

Example 2:

select convert_timezone('EST', 'PST', '20080305 12:25:29');

 convert_timezone
-------------------
2008-03-05 09:25:29

Select `COALESCE` function

COALESCE returns the first non-null argument with COALESCE(arg1, arg2, ...) with an unlimited number of args and returns the first non-null (evaluating left to right). If all arguments are null, then it will return null.

SELECT COALESCE(1, 2);  # 1
SELECT COALESCE(NULL, 2, 1)  # 2

Often this is used as a default value. For example, say you have a blog post excerpt. If there is no excerpt provided, then use the first 150 characters from the content post.

SELECT COALESCE(excerpt, LEFT(CONTENT, 150)) FROM blog_posts;

Select `EXTRACT` function

EXTRACT function extracts a part from a given date and returns an integer. Consider DATE_FORMAT() if you want to format a date

-- SELECT EXTRACT(<part> FROM <date>)
SELECT EXTRACT(MONTH FROM "2022-10-12")

Where the part is required and can be say HOUR, DAY, WEEK, YEAR_MONTH, etc

Select `DATE_FORMAT` function

DATE_FORMAT function formats a date into a new format

-- SELECT DATE_FORMAT(<date>, <format>)
SELECT DATE_FORMAT("2024-01-01", "%M %d %Y");

Select `IF` function

IF function returns a value if the condition is true or another value if condition is false.

-- SELECT IF(<condition>, <value_if_true>, <value_if_false>
SELECT
  IF(Quantity>10, "MORE", "LESS")
FROM OrderTails;

Select `IS NULL` and `IS NOT NULL`

IS NULL and IS NOT NULL is used to identify null and non-null values

SELECT * FROM goodreads WHERE book_title IS NULL;
SELECT * FROM goodreads WHERE book_title IS NOT NULL;

Note: When sorting a column containing NULL values, these rows ascend to the top of the result (since NULLs are the smallest values)

Select `IFNULL` function

Select `CASE` function

CASE goes through conditions and returns a value when the first condition is met

-- SELECT
--   CASE
--     WHEN condition1 THEN result1
--     WHEN condition2 THEN result2
--     ELSE result
-- END;

SELECT product_name,
  CASE
    WHEN price < 10 THEN 'Low price product'
    WHEN price > 50 THEN 'High price product'
  ELSE
    'Normal Product'
  END
FROM products;

Select Order, Limit, and Distinct

Produce an ordered list of the first 10 surnames with no duplicates. We make use of ORDER, DISTINCT, LIMIT.

SELECT DISTINCT(surname)
FROM cd.members
ORDER BY surname
LIMIT 10;

surname
Bader
Baker
Boothe
Butters
Coplin
Crumpet
Dare
Farrell
GUEST
Genting

Combine results from multiple queries with Union

Create a combined list of all surnames and all facility names into a single column (for whatever reason). Use UNION.

SELECT surname
FROM cd.members
UNION
SELECT name
FROM cd.facilities

surname
Tennis Court 2
Worthington-Smyth
Badminton Court
Pinker
Dare
Bader
Mackenzie
Crumpet
Massage Room 1
Squash Court
Tracy
Hunt
Tupperware
Smith
Butters
Rownam
Baker
Genting
Purview
Coplin
Massage Room 2
Joplette
Stibbons
Rumney
Pool Table
Sarwin
Boothe
Farrell
Tennis Court 1
Snooker Table
Owen
Table Tennis
GUEST
Jones

Select Aggregates (Sum, Max, Min, Avg, Count)

Aggregate functions perform operations across an entire column of data.

MAX

Select the highest/largest value of an expression.

SELECT MAX(joindate) AS latest
FROM cd.members;

latest
2012-09-26 18:08:45

MIN

Select the lowest/smallest value of an expression.

SELECT MIN(joindate) AS earliest
FROM cd.members;

earliest
2012-07-01 00:00:00

SUM

Add together all the values in a column

SELECT SUM(cost) AS sum_cost
FROM my.table;

AVG

Calculates the average of a group of selected values

SELECT AVG(cost) AS avg_cost
FROM my.table;

COUNT

Counts how many rows are in a column

SELECT COUNT(id) AS id_count
FROM my.table;

Aggregate grouping with `GROUP BY` and `HAVING`

If you only want to aggregate a part of a table (instead of the entire column), then you want to use the GROUP BY function.

SELECT
  category,
  SUM(spend)
FROM product_spend
GROUP BY category;

---
category    sum
electronics 1000
appliance   1200

HAVING

When using GROUP BY (i.e. aggregate function), you can filter with HAVING function. HAVING is different than WHERE since having operates on aggregate results (instead of row by row of WHERE)

SELECT COUNT(customer_id), country
FROM customers
GROUP BY country
HAVING COUNT(customer_id) > 5;

SELECT
  ticker,
  min(open)
FROM stock_prices
GROUP BY
  ticker
HAVING min(open) > 100;

Select with CTE (Common Table Expression)

A CTE (Common Table Expression) is a query within a query. You use the WITH statement to create temporary tables to store results so that you can make complicated queries more readable and maintainable. Once your main query is completed, these temporary tables disappear.

WITH genre_revenue_cte AS (
  SELECT ...
)

SELECT
  g.my_field
FROM genre_revenue_cte AS g ...

Select with Subquery

Subqueries (aka inner queries) embed one query into another. You nest queries within parentheses and create temporary tables.

Get the first and last name of the last member(s) who signed up.

SELECT firstname, surname, joindate
FROM cd.members
WHERE joindate = (
    SELECT MAX(joindate) FROM cd.members
);

Since our subquery returns a table with a single column and single row, we can substitute this value in our WHERE clause.

Another method that isn’t quite as efficient and doesn’t handle the case where two people joined at the exact same time.

SELECT firstname, surname, joindate
FROM cd.members
ORDER BY joindate desc
LIMIT 1;

Subquery vs CTE

Advantages of a CTE

Break down complex queries - You declare a CTE at the beginning of the query using WITH, enhancing code readability
If you need to reuse the same subquery result multiple times in a larger query, CTEs prevent redundant calculations
When you need to perform recursive queries (e.g. traversing hierarchical data like org structures or threaded discussions)

– Finding all employees under a certain manager WITH recursive_cte AS ( SELECT employee_id, name, manager_id FROM employees WHERE manager_id = @manager_id
```
UNION ALL

SELECT
  e.employee_id,
  e.name,
  e.manager_id
FROM employees AS e
INNER JOIN recursive_cte AS r -- The recursive cte is utilized here within the main cte
  ON e.ManagerID = r.employee_id   )
```
SELECT * FROM recursive_cte;

Advantages of a using a subquery

Single-value comparison in WHERE clauses - where you need to compare a single value to a result from another query This allows for more flexibility and precision of making on-the-fly condition adjustments based on subquery results

SELECT artist_name FROM concerts WHERE concert_revenue > ( SELECT AVG(concert_revenue) FROM concerts );
Column Creation and Aggregation - use subqueries to create new columns for real-time computations and to calculate intermediate values for aggregation functions within larger queries. This enables more sophisticated aggregation.

SELECT artist_name, genre, concert_revenue, (SELECT AVG(concert_revenue) FROM concerts) AS avg_concert_revenue, (SELECT MAX(concert_revenue) FROM concerts) AS max_concert_revenue FROM concerts;

– Subqueries often use IN, NOT IN, and EXISTS operators to filter based on conditions from another query SELECT artist_name FROM concerts WHERE artist_id IN ( SELECT artist_id FROM concert_revenue WHERE concert_revenue > 50000);
Correlated Subqueries - utilize correlated subqueries to retrieve values from the outer query For example, say you have two lists of data side by side. With correlated subqueries, you can ask questions about each item in one list and get answers from the other list that are specific to that item.

SELECT artist_name, genre, concert_revenue FROM concerts AS c1 WHERE concert_revenue = ( SELECT MAX(concert_revenue) FROM concerts AS c2 WHERE c1.genre = c2.genre );

`EXISTS` and `NOT EXISTS` for checking if any records in a subquery

You can use the EXISTS operator to test if there is any records in a subquery; returns TRUE if the sub query returns one or more records.

-- EXISTS example
SELECT customers.customer_name
FROM customers
WHERE EXISTS (
  SELECT order_id
  FROM orders
  WHERE customer_id = customers.customer_id
);

-- NOTE EXISTS example
SELET customers.customer_name
FROM customers
WHERE NOT EXISTS (
  SELECT order_id
  FROM orders
  WHERE customer_id = customers.customer_id
);

Joins and Subqueries

Relational databases are relational because of joins.

From

Remember that the output of a FROM is another table. That’s why you can do this:

SELECT _bookings.starttime
FROM cd.bookings _bookings,
     cd.members _members
WHERE
     _members.firstname = 'David'
     AND _members.surname = 'Farrell'
     AND _members.memid = _bookings.memid;

Inner Join

Say we want a list of the start times for bookings by members named ‘David Farrell’.

Use an INNER JOIN to combine two tables with matching values on both tables. You can do this a couple different styles:

# Style 1
SELECT _bookings.starttime
FROM cd.bookings _bookings
INNER JOIN cd.members _members
           ON _members.memid = _bookings.memid
WHERE _members.firstname = 'David'
      AND _members.surname = 'Farrell';

# Style 2
SELECT _bookings.starttime
FROM cd.bookings _bookings,
     cd.members _members
WHERE
     _members.firstname = 'David'
     AND _members.surname = 'Farrell'
     AND _members.memid = _bookings.memid;

starttime
2012-09-18 09:00:00
2012-09-18 17:30:00
2012-09-18 13:30:00
2012-09-18 20:00:00
2012-09-19 09:30:00
2012-09-19 15:00:00
2012-09-19 12:00:00
2012-09-20 15:30:00
2012-09-20 11:30:00
2012-09-20 14:00:00
2012-09-21 10:30:00
2012-09-21 14:00:00
2012-09-22 08:30:00
2012-09-22 17:00:00
2012-09-23 08:30:00
2012-09-23 17:30:00
2012-09-23 19:00:00
2012-09-24 08:00:00
2012-09-24 16:30:00
2012-09-24 12:30:00
2012-09-25 15:30:00
2012-09-25 17:00:00
2012-09-26 13:00:00
2012-09-26 17:00:00
2012-09-27 08:00:00
2012-09-28 11:30:00
2012-09-28 09:30:00
2012-09-28 13:00:00
2012-09-29 16:00:00
2012-09-29 10:30:00
2012-09-29 13:30:00
2012-09-29 14:30:00
2012-09-29 17:30:00
2012-09-30 14:30:00

Another example is to produce a list of the start times for bookings for tennis courts for the date ‘2012-09-21’ and return the start time and facility name pairings, ordered by the time.

SELECT books.starttime AS start,
   facil.name AS name
FROM cd.bookings books
INNER JOIN cd.facilities facil
    ON books.facid = facil.facid
WHERE
    DATE(books.starttime) = '2012-09-21'
    AND facil.name LIKE '%Tennis Court%'
ORDER BY
    books.starttime;

start	            name
2012-09-21 08:00:00	Tennis Court 1
2012-09-21 08:00:00	Tennis Court 2
2012-09-21 09:30:00	Tennis Court 1
2012-09-21 10:00:00	Tennis Court 2
2012-09-21 11:30:00	Tennis Court 2
2012-09-21 12:00:00	Tennis Court 1
2012-09-21 13:30:00	Tennis Court 1
2012-09-21 14:00:00	Tennis Court 2
2012-09-21 15:30:00	Tennis Court 1
2012-09-21 16:00:00	Tennis Court 2

##Views

###Materialized Views

Materialized Views in Postgres are like views, but they persist the results in a table-like form.

CREATE MATERIALIZED VIEW myview AS SELECT * FROM mytable;

The main difference between a materialized view and a created table is that the materialized view cannot be directly updated after it is created. The query used to create the materialized view is stored exactly the same way that a view’s query is stored. The only way to get fresh data is with:

REFRESH MATERIALIZED VIEW myview;

So why use a materialized view?

Accessing the data stored in a materialized view is often faster than accessing the underlying tables directly or through a view
However, the data is not always current (you make that tradeoff)

Aggregates

You can select aggregates using COUNT, SUM, and AVG, MAX, MIN. DISTINCT can be used with aggregates.

Notes

https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types

Arrays

ARRAY Ordered list of zero or more elements of any non-ARRAY type. An ARRAY is an ordered list of zero or more elements of non-ARRAY values. ARRAYs of ARRAYs are not allowed. Queries that would produce an ARRAY of ARRAYs will return an error. Instead a STRUCT must be inserted between the ARRAYs using the SELECT AS STRUCT construct.

Currently, BigQuery has two following limitations with respect to NULLs and ARRAYs: BigQuery raises an error if query result has ARRAYs which contain NULL elements, although such ARRAYs can be used inside the query. BigQuery translates NULL ARRAY into empty ARRAY in the query result, although inside the query NULL and empty ARRAYs are two distinct values.

Declaring an ARRAY type ARRAY types are declared using the angle brackets (< and >). The type of the elements of an ARRAY can be arbitrarily complex with the exception that an ARRAY cannot directly contain another ARRAY.

Format ARRAY

Structs

STRUCT Container of ordered fields each with a type (required) and field name (optional). Declaring a STRUCT type STRUCT types are declared using the angle brackets (< and >). The type of the elements of a STRUCT can be arbitrarily complex.

Format STRUCT

JSON Functions

https://www.postgresql.org/docs/9.3/functions-json.html

`json_extract_path_text`

Format: json_extract_path_text(from_json json, VARIADIC path_elems text[])

Example:

json_extract_path_text('{"f2":{"f3":1},"f4":{"f5":99,"f6":"foo"}}','f4', 'f6')  # returns 'foo'

Window Functions

The standard window function syntax looks like:

function (expression) OVER (
[ PARTITION BY expr_list ]
[ ORDER BY order_list [ frame_clause ] ] )

Let’s break this apart:

A function is any function like AVG, COUNT, MIN, MEDIAN, MAX, NTH_VALUE, FIRST_VALUE, LAST_VALUE, ROW_NUMBER, ``SUM`, etc. For Redshift, see list here: https://docs.aws.amazon.com/redshift/latest/dg/c_Window_functions.html

OVER defines the window specification. This is required and tells us that we’re looking at a window function instead of a normal function.

PARTITION BY <expr_list> is optional and means that we want to subdivide the result set into partitions, kinda like a GROUP BY clause. If a partition clause is present, the function is calculated for the rows in each partition. If no partition clause is specified, a single partition contains the entire table and the function is computed for the entire table.

An expr_list is further broken down into

expression | column_name [, expr_list ]

ORDER BY <order_list> is optional and means that the window function is applied to the rows within each partition sorted according to the order specification in ORDER BY.

An order_list is further broken down into

expression | column_name [ ASC | DESC ]
[ NULLS FIRST | NULLS LAST ]
[, order_list ]

A frame_clause is used in aggregate functions to further refine the set of rows in a function’s window when using ORDER_BY. It provides the ability to include or exclude sets of rows within the ordered result. The frame clause does not apply to ranking functions and is not required when no ORDER BY clause is used in the OVER clause for an aggregate function. If an ORDER BY clause is used for an aggregate function, an explicit frame clause is required.

ROWS - the rows clause defines the window frame by specifying a physical offset from the current row. This clause specifies the rows in the current window or partition that the value in the current row is to be combined with. It uses arguments to specify the row position, which can be before or after the current row. The reference point for all window frames is the current row. Each row becomes the current row in turn as the window frame slides forward in the partition. We have two variations:

The frame is a simple set of rows up to and including the current row

{ UNBOUNDED PRECEDING unsigned_value PRECEDING CURRENT ROW }
The frame can be a set of rows between two boundaries

{BETWEEN { UNBOUNDED PRECEDING | unsigned_value { PRECEDING | FOLLOWING } | CURRENT ROW} AND { UNBOUNDED FOLLOWING | unsigned_value { PRECEDING | FOLLOWING } | CURRENT ROW }}

UNBOUNDED PRECEDING indicates that the window starts at the first row of the partition. offset PRECEDING indicates that the window starts a number of rows equivalent to the value of offset before the current row UNBOUNDED PRECEDING is the default.

CURRENT ROW

What does that mean exactly?

Window functions perform an operation across a set of rows that are somehow related to the current row
Similar to GROUP BY aggregate functions, but all rows remain in the output

Use cases:

Fetch values from preceding or following rows (e.g. fetching the previous row’s value)
Assign ordinal ranks (1st, 2nd, etc) to rows based on their values’ positions in a sorted list
Running totals, moving averages

`ROW_NUMBER`

Example Table Summer_Medals

Year, Medals, Medals_RT
2004, 116, 116
2008, 125, 241
2012, 147, 388

Fetch Previous Row’s Calculations (e.g. previous rows values, running totals):

Year, Champion, Last_Champion, Reigning_Champion
1996, GER, null, false
2000, LTU, GER, false
2004, LTU, LTU, true
2008, EST, LTU, false
2012, GER, EST, false

Example ROW_NUMBER():

SELECT
  Year, Event, Country,
  ROW_NUMBER() OVER () AS Row_N
FROM Summer_Medals
WHERE
  Medal = 'Gold';

Example ‘Numbering Rows’

SELECT
 *,
 ROW_NUMBER() OVER () AS Row_N
FROM Summer_Medals
ORDER BY Row_N ASC;

Example Ordering by Year in descending order (i.e. adding ORDER BY within OVER)

SELECT
  Year, Event, Country,
  ROW_NUMBER() OVER (ORDER BY Year DESC) AS Row_N
FROM Summer_Medals
WHERE
  Medal = 'Gold';

Four Functions (`LAG`, `LEAD`, `FIRST_VALUE`, `LAST_VALUE`)

There are four functions:

Relative

LAG(column, n) returns a column’s value at the row n rows before the current row
LEAD(column, n) returns a column’s value at the row n rows after the current row

Absolute

FIRST_VALUE(column) returns the first value in the table or partition
LAST_VALUE(column) returns the last value in the table or partition

LAG()

LAG is a window function that takes a column and a number n and returns the column’s value n rows before the current row. E.g. passing in 1 as n returns the previous row’s value

Syntax: LAG(column, n) OVER ()

Example Current and Last Champions

WITH Discus_Gold AS(
  SELECT
    Year, Country AS Champion
  FROM Summer_Medals
  WHERE
    Year IN (1996, 2000, 2004, 2008, 2012)
    AND Gender = 'Men' AND Medal = 'Gold'
    AND Event = 'Discus Throw')

SELECT
  Year, Champion,
  LAG(Champion, 1) OVER
    (ORDER BY Year ASC) AS Last_Champion
FROM Discus_Gold
ORDER BY Year ASC;

Results in:

Year, Champion, Last_Champion

`FIRST_VALUE` and `LAST_VALUE`

SELECT
  Year, City,
  FIRST_VALUE(City) OVER
    (ORDER BY Year ASC) AS First_City,
  LAST_VALUE(City) OVER (
    ORDER BY Year ASC
    RANGE BETWEEN
      UNBOUNDED PRECEDING AND
      UNBOUNDED FOLLOWING
    ) AS Last_City
  FROM Hosts
  ORDER BY Year ASC;

PARTITION BY

PARTITION BY splits the table into partitions based on a column’s unique values

Similar to GROUP BY, but the results aren’t rolled into one column
Operated on separately by the window function
- ROW_NUMBER will reset for each partition
- LAG will only fetch a row’s previous value if its previous row is in the same partition

Example Adding partition by one column

WITH Discus_Gold AS (...)

SELECT
  Year, Event, Champion,
  LAG(Champion) OVER
    (PARTITION BY Event
     ORDER BY Event ASC, Year ASC) AS Last_Champion
FROM Discus_Gold
ORDER BY Event ASC, YEAR ASC;

Example Partitioning by multiple columns

WITH Country_Gold AS (
  SELECT
    DISTINCT Year, Country, Event
  FROM Summer_Medals
  WHERE
    Year IN (2008, 2012)
    AND Country IN ('CHN', 'JPN')
    AND Gender = 'Women' AND Medal = 'Gold')

SELECT
  Year, Country, Event
  ROW_NUMBER() OVER (PARTITION BY Year, Country)
FROM Country_Gold;

Results in the Row Number resetting per partition:

Year, Country, Event, Row_N
2008, CHN, +78KG (Heavyweight), 1
2008, CHN, -49KG, 2
...
2008, JPN, 48 - 55KG, 1
2008, JPN, 48 - 55KG, 2

Ranking

ROW_NUMBER is one of three ranking functions.

ROW_NUMBER() always assigns unique numbers, even if two rows’ values are the same
RANK() assigns the same number to rows with identical values, skipping over the next numbers in such cases
DENSE_RANK() assigns the same number to rows with identical values, but doesn’t skip over the next numbers

Given a Source Table of:

SELECT
  Country, COUNT(DISTINCT Year) AS Games
FROM Summer_Medals
WHERE
  Country IN ('GBR', 'DEN', 'FRA',
              'ITA', 'AUT', 'BEL',
              'NOR', 'POL', 'ESP')
GROUP BY Country
ORDER BY Games DESC;

Results in:

Country, Games
GBR, 27
DEN, 26
FRA, 26
ITA, 25
...

`ROW_NUMBER`

WITH Country_Games AS (...)

SELECT
  Country, Games
  ROW_NUMBER()
    OVER (ORDER BY Games DESC) AS Row_N
FROM Country_Games
ORDER BY Games DESC, Country ASC;

Results in:

Country, Games, Row_N
GBR, 27, 1
DEN, 26, 2
FRA, 26, 3
ITA, 25, 4

`RANK`

WITH Country_Games AS (...)

SELECT
  Country, Games
  ROW_NUMBER()
    OVER (ORDER BY Games DESC) AS Row_N,
  RANK()
    OVER (ORDER BY Games DESC) AS Rank_N
FROM Country_Games
ORDER BY Games DESC, Country ASC;

Results in:

Country, Games, Row_N, Rank_N
GBR, 27, 1, 1
DEN, 26, 2, 2
FRA, 26, 3, 2
ITA, 25, 4, 4

`DENSE_RANK`

WITH Country_Games AS (...)

SELECT
  Country, Games
  ROW_NUMBER()
    OVER (ORDER BY Games DESC) AS Row_N,
  RANK()
    OVER (ORDER BY Games DESC) AS Rank_N,
  DENSE_RANK()
    OVER (ORDER BY Games DESC) AS Dense_Rank_N

FROM Country_Games
ORDER BY Games DESC, Country ASC;

Results in:

Country, Games, Row_N, Rank_N, Dense_Rank_N
GBR, 27, 1, 1, 1
DEN, 26, 2, 2, 2
FRA, 26, 3, 2, 2
ITA, 25, 4, 4, 3

You’ll want to think about ranking with partitioning

Tips

Explain

In SQL Server (not Postgres), you can kinda see how much work is going on behind the scenes with something like:

SET STATISTICS IO ON
SET STATISTICS TIME ON
<my_query>
SET STATISTICS IO OFF
SET STATISTICS TIME OFF

In Postgres, we don’t have that exact command, but we do have explain:

EXPLAIN (ANALYZE ON, BUFFERS ON) SELECT ...

Monitoring

esides EXPLAIN, we have monitoring stats in pg_statio_* for IO stats. The data isn’t scoped to a session, but it can help in monitoring efficient queries in clean environments.

https://www.postgresql.org/docs/current/monitoring-stats.html

Optimization - Sargable Query vs Nonsargable Query

Non-sargable stands for Non Search Argument. When we do this type of query, SQL is unable to use an index. An example of this is when a function is used in the WHERE clause or a JOIN clause.

-- Non-sargable Query
SELECT EmployeeName
FROM EmployeeTest
WHERE LEFT(EmployeeName, 1) = 'A';
-- Results in performing the `LEFT` function on every single record

A sargable query stands for Searchable Argument. When we do this type of query, SQL is able to use indexes. An example of the above SQL that is faster.

-- Sargable Query
SELECT EmployeeName
FROM EmployeeTest
WHERE EmployeeName LIKE 'A%';

Other Examples of Sargable:

-- Non-Sargable
WHERE YEAR(DateOfBirth) = '1952'

-- Sargable Query
WHERE DateOfBirth >= '19520101' AND DateOfBirth < '19530101'

Summary: Use a sargable query by NOT including a function in your WHERE or JOIN clause so you can utilize indexes

So what do I do? Instead of using functions on your field (i.e. it shows up on the LEFT side of your field, instead we want this on the right hand side)

Optimization - SELECT specific columns’

Do not run SELECT * if you do not need all columns. If you just select the columns you need, it’ll be faster and you probably don’t (and shouldn’t) have an index on all columns.

Optimization - LIMIT

LIMIT your results if you do not need everything

Optimization - LEFT vs RIGHT calculation

If you do a calculation on the LEFT side vs the RIGHT side, this changes our speed:

# Non-Sargable Query because Calculation done on left side
WHERE Salary / 2 = 554677;

# Sargable Query because Calculation done on right side
WHERE Salary = (277338.5 * 2);

Indexes

In Postgres, you cannot just run a SHOW INDEXES command to the list the index information of a table or database. Instead, we have the pg_indexes view that you can query or you can run \d to command on psql to view the index information for a table.

See all indexes of a specific schema (using the pg_indexes view)

SELECT
    tablename,
    indexname,
    indexdef
FROM
    pg_indexes
WHERE
    schemaname = 'myschema'
ORDER BY
    tablename,
    indexname;

See all indexes for a table

SELECT
    indexname,
    indexdef
FROM
    pg_indexes
WHERE
    tablename = 'mytable';

Create an Index

You can create an index on specific column(s) for a table or a materialized view. There’s pros and cons to an index. It comes down to:

Con: Indexes slow down inserts (need to add additional data for indexes, figure out what indexes to add, increase storage)
Pro: Indexes speed up searches for getting data

Include Column in Index

A lot of times when we create an index, we just think about the keyed columns. There is the optional INCLUDE clause that lets you specify a list of columns that will be included in the index as a non-key column.

Say you’re searching based off an indexed column and return just that index column. This is optimal since it hits the index (DateOfBirth) and return only that data (DateOfBirth).

SELECT DateOfBirth
FROM EmployeeTest
WHERE DateOfBirth >= '19520101' AND DateOfBirth < '19530101';

However, if you change the above to SELECT DateOfBirth, EmployeeName and EmployeeName is not an index, we now do a search for the index, then we go back and need to pull information for the EmployeeName column. If you are regularly pulling this information back, you might want to INCLUDE a column into the index.

Just take care to not add many non-key columns to an index when you have really wide columns. Also keep in mind that the index tuple cannnot exceed a certain size, otherwise data insertion fails. A non-key column duplicates data from the index’s table and increases the size of the index.

https://www.postgresql.org/docs/11/sql-createindex.html

Stale and Fragmented Indexes

Indexes store statistics (see monitoring) and uses estimations of what the data looks like in order to optimize itself. These statistics get old and the distribution of data may get old. You would need to redinex then.

https://www.postgresql.org/docs/9.3/sql-reindex.html

REINDEX

Update statistics with: https://www.postgresql.org/docs/9.3/sql-analyze.html

ANALYZE [VERBOSE} [table_name]

Updating statistics store information into the pg_statistic system catalog

Resources

Good basic and common usage: https://popsql.com/learn-sql/postgresql/
DataLemur: https://datalemur.com

PostgreSQL

MAX

MIN

SUM

AVG

COUNT

HAVING

Notes

json_extract_path_text

ROW_NUMBER

Four Functions (LAG, LEAD, FIRST_VALUE, LAST_VALUE)

LAG()

FIRST_VALUE and LAST_VALUE

PARTITION BY

Ranking

ROW_NUMBER

RANK

DENSE_RANK

Related Posts

`json_extract_path_text`

`ROW_NUMBER`

Four Functions (`LAG`, `LEAD`, `FIRST_VALUE`, `LAST_VALUE`)

`FIRST_VALUE` and `LAST_VALUE`

`ROW_NUMBER`

`RANK`

`DENSE_RANK`