Summary
SQL (Structured Query Language) is the standard language for communicating with relational database management systems. There are a few variants, including Microsoft's SQL Server, MySQL, PostgreSQL, and SQLite; the instructions below use MySQL. We'll set up a MySQL server, then use a client to connect to the server and query for data.
On a Mac, use Homebrew to install the MySQL server: brew install mysql
For a GUI client, you can download MySQL Workbench, which lets you browse the data visually (as opposed to command line only).
On Ubuntu, use your package manager to install the server with sudo apt-get install mysql-server
and the workbench with sudo apt-get install mysql-workbench-community
Then run sudo /usr/bin/mysql_secure_installation, where you can set the
root password and other configuration.
You can find the MySQL config under /etc/my.cnf (it may be in another location); the file shows you useful configuration info.
Instead of the regular mysql terminal client, I actually like mycli: https://github.com/dbcli/mycli/
For the MySQL Prompt in mycli, I changed it to:
# MySQL prompt
# \t - Product type (Percona, MySQL, Mariadb)
# \u - Username
# \h - Hostname of the server
# \d - Database name
# \n - Newline
#prompt = '\t \u@\h:\d> '
prompt = "(\u@\h) [\d]> "
sudo /usr/bin/mysql_secure_installation
mysql -uroot -psome_password
$systemctl status mysql # check status of server
$systemctl start mysql # start the server
$systemctl stop mysql # stop the server
$systemctl help mysql # get help commands
$mysql < myscript.sql
$mysql # enters into the shell from bash
$mysql -u root -p # enters into the shell as root user and prompts for password
mysql> # This is what you'll see when in shell
mysql> DROP DATABASE IF EXISTS my_database_name;
mysql> CREATE DATABASE my_database_name;
mysql> USE my_database_name;
See what Users you have available:
SELECT User FROM mysql.user;
To show the list of users, we want to look at the mysql database.
mysql> use mysql;
mysql> show tables;
We’re interested in the user table on the mysql database.
mysql> describe mysql.user;
For older versions of MySQL (below 5.7), you might want to look at these fields: mysql> select host, user, password from mysql.user;
For more recent versions of MySQL (5.7+), you might want to look at these fields: mysql> select host, user, authentication_string from mysql.user;
To create a User, you can run:
CREATE USER 'custom'@'localhost' IDENTIFIED BY 'password';
You can then log in with that user, say with: mycli --user custom --password password
To update the user password:
mysql> update user set authentication_string=password('1111') where user = 'root';
To grant all privileges to a user on host ‘localhost’ with a database called ‘db_test’ do:
GRANT ALL ON db_test.* to 'tester'@'localhost';
To apply the privilege changes:
FLUSH PRIVILEGES;
mysql> status # Shows current database, current user
--------------
mysql Ver 14.14 Distrib 5.7.18, for Linux (x86_64) using EditLine wrapper
Connection id: 10
Current database:
Current user: root@localhost
SSL: Not in use
Current pager: stdout
Using outfile: ''
Using delimiter: ;
Server version: 5.7.18-0ubuntu0.16.04.1 (Ubuntu)
Protocol version: 10
Connection: Localhost via UNIX socket
Server characterset: latin1
Db characterset: latin1
Client characterset: utf8
Conn. characterset: utf8
UNIX socket: /var/run/mysqld/mysqld.sock
Uptime: 13 min 24 sec
Threads: 1 Questions: 10 Slow queries: 0 Opens: 115 Flush tables: 1 Open tables: 34 Queries per second avg: 0.012
List all the available databases
mysql> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| performance_schema |
| sys |
| willdb |
+--------------------+
List all the available tables in a database
mysql> use willdb;
mysql> show tables;
mysql> show tables;
+------------------+
| Tables_in_willdb |
+------------------+
| iris |
+------------------+
See the table structure
mysql> describe iris;
+-------------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------+-------------+------+-----+---------+-------+
| sepallength | float | YES | | NULL | |
| sepalwidth | float | YES | | NULL | |
| petallength | float | YES | | NULL | |
| petalwidth | float | YES | | NULL | |
| class | varchar(30) | YES | | NULL | |
+-------------+-------------+------+-----+---------+-------+
5 rows in set (0.00 sec)
Indexes in general speed up database reads (e.g. search) and slow down database writes. If you index a column, you create an ordering (e.g. alphabetical) through the index: instead of searching through all n rows, you might do a binary search with log-n index entries.
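The read/write tradeoff can be sketched outside SQL: a sorted index turns a full scan (O(n)) into a binary search (O(log n)), but every write must also maintain the sorted structure. A minimal Python sketch using the stdlib bisect module (the table data here is made up):

```python
import bisect

# Unindexed: rows sit in arbitrary order, so a lookup scans every row, O(n)
rows = [("Raul", 9), ("Ronald", 0), ("Messi", 0)]
match = [r for r in rows if r[0] == "Messi"]

# "Indexed": a sorted list of (key, row_position) pairs enables binary search
index = sorted((name, i) for i, (name, _) in enumerate(rows))
keys = [k for k, _ in index]
pos = bisect.bisect_left(keys, "Messi")   # O(log n) lookup
found = rows[index[pos][1]]

# The write cost: every INSERT must also insert into the index structure
rows.append(("Zlatan", 1))
bisect.insort(index, ("Zlatan", 3))       # extra maintenance work on writes
```

This is only an analogy; InnoDB actually uses B-Tree indexes (see below), but the scan-versus-search tradeoff is the same.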
List all the indexes on a table
mysql> show index from iris;
See below on topics of ‘lock contention’, ‘deadlock’ caused by indexes.
SHOW INDEX returns fields including Table, Non_unique, Key_name, Seq_in_index, Column_name, Cardinality, and Index_type.
Show database triggers:
mysql> show triggers like '%something%';
explain is often used with complex queries, query optimization, and indexes. Put EXPLAIN in front of a query that uses SELECT, UPDATE, DELETE, INSERT, or REPLACE.
mysql> explain select * from my_db.my_table;
+------+---------------+-----------------+--------+-----------------+--------+-----------+--------+--------+---------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
|------+---------------+-----------------+--------+-----------------+--------+-----------+--------+--------+---------|
| 1 | SIMPLE | my_table | ALL | <null> | <null> | <null> | <null> | 343848 | |
+------+---------------+-----------------+--------+-----------------+--------+-----------+--------+--------+---------+
Here’s an example that uses an index:
explain select property_id, mls_id from my_db.my_table;
+------+---------------+-----------------+--------+-----------------+---------+-----------+--------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
|------+---------------+-----------------+--------+-----------------+---------+-----------+--------+--------+-------------|
| 1 | SIMPLE | my_table | index | <null> | ix_cpid | 5 | <null> | 343848 | Using index |
+------+---------------+-----------------+--------+-----------------+---------+-----------+--------+--------+-------------+
Here’s an example with a join:
explain select count(*), t1.loc_city from idx.mlslv4_master t1 join
enterprise.community t2 where t1.loc_city = t2.name and t2.owner_id=44 group by t1.loc_city;
+------+---------------+---------+--------+-----------------+-------------+-----------+--------------------+--------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
|------+---------------+---------+--------+-----------------+-------------+-----------+--------------------+--------+---------------------------------|
| 1 | SIMPLE | t2 | ref | company_id | company_id | 4 | const | 179 | Using temporary; Using filesort |
| 1 | SIMPLE | t1 | ref | ix_loc_city | ix_loc_city | 33 | enterprise.t2.name | 10 | Using where; Using index |
+------+---------------+---------+--------+-----------------+-------------+-----------+--------------------+--------+---------------------------------+
You can see currently running processes a few different ways:
show full processlist
SELECT * FROM INFORMATION_SCHEMA.PROCESSLIST
Say we want to load the iris dataset. We do a curl to get the dataset, then run the below sql query.
curl http://mlr.cs.umass.edu/ml/machine-learning-databases/iris/iris.data > iris.csv
|sepallength|sepalwidth|petallength|petalwidth|class |
|-----------|----------|-----------|----------|-----------|
|5.1 |3.5 |1.4 |0.2 |Iris-setosa|
|4.9 |3 |1.4 |0.2 |Iris-setosa|
|4.7 |3.2 |1.3 |0.2 |Iris-setosa|
First create the table:
create table iris(
sepallength float,
sepalwidth float,
petallength float,
petalwidth float,
class nvarchar(30)
);
Then load the csv into it:
LOAD DATA LOCAL
#EDIT THIS PATH:
INFILE "/Users/williamliu/Desktop/iris.csv"
INTO TABLE sampdb.iris #db sampdb, table iris
FIELDS
TERMINATED BY ','
#ENCLOSED BY '"'
LINES
TERMINATED BY '\n';
#IGNORE 1 LINES;
In MySQL 5.5, you could only have one TIMESTAMP column per table that has the current timestamp as the default value for initializing the column.
From MySQL 5.6.5 and on, the above restriction is lifted so any timestamp column definition can have any combination
of default current_timestamp
and on update current_timestamp
clauses. E.g.
alter table my_table add column created_datetime timestamp default now(),
add column modified_datetime timestamp default '0000-00-00 00:00:00' ON UPDATE now();
This looks like:
show columns from my_table like '%datetime%';
+-------------------+-----------+--------+-------+---------------------+-----------------------------+
| Field | Type | Null | Key | Default | Extra |
|-------------------+-----------+--------+-------+---------------------+-----------------------------|
| created_datetime | timestamp | NO | | CURRENT_TIMESTAMP | |
| modified_datetime | timestamp | NO | | 0000-00-00 00:00:00 | on update CURRENT_TIMESTAMP |
+-------------------+-----------+--------+-------+---------------------+-----------------------------+
2 rows in set
Assuming data (say mytable from mydatabase) looks like:
# last_name, first_name, suffix, city, state, birth, death
Adams, John, , Braintree, MA, 1735-10-30, 1826-07-04
Adams, John Quincy, , Braintree, MA, 1767-07-11, 1848-02-23
Arthur, Chester A., , Fairfield, VT, 1829-10-05, 1886-11-18
Buchanan, James, , Mercersburg, PA, 1791-04-23, 1868-06-01
Bush, George H.W., , Milton, MA, 1924-06-12,
We can do the following query:
SELECT last_name, first_name, state
FROM mydatabase.mytable;
To get a running row number in the results, you can set a user variable and increment it in the SELECT. Note that you can set variables and return them in the results.
SET @n=0;
SELECT @n := @n+1 AS 'row_num', sepallength FROM sampdb.iris;
SELECT COUNT(*) FROM sampdb.president
SELECT COUNT(first_name), COUNT(death) FROM sampdb.president # Returns 42 and 38
SELECT COUNT(DISTINCT state, first_name) FROM sampdb.president
With DISTINCT, we get 42 instead of 43 total presidents, meaning there is one duplicate state and first_name combination.
SELECT AVG(death) FROM sampdb.president
SELECT SUM(age) FROM sampdb.president
SELECT MIN(age) FROM sampdb.president
SELECT MAX(age) FROM sampdb.president
SELECT ... score * 2 AS doubled_score
SELECT ... DATEDIFF(death, birth)
SELECT ... CONCAT(first_name, ' ', last_name)
SELECT ... UNIQUE(last_name, middle_name, first_name)
AS creates an alias (required for subqueries). CONCAT combines strings into one. DATEDIFF gets the difference in days. (For distinct value combinations, MySQL uses SELECT DISTINCT; there is no UNIQUE() function.)
SELECT LastName, FirstName FROM Persons WHERE LastName IS NULL
SELECT LastName, FirstName FROM Persons WHERE LastName IS NOT NULL
SELECT UnitPrice*(UnitsInStock+IFNULL(UnitsOnOrder,0)) FROM Products
SELECT COUNT(first_name) FROM sampdb.president WHERE last_name="Adams"
You can aggregate per group with GROUP BY (examples below). IN checks membership against a list of single values or tuples:
WHERE state IN ('MA', 'VA')
WHERE (last_name, state) IN (('Adams', 'MA'))
SELECT state, COUNT(*) FROM sampdb.president GROUP BY state ORDER BY state
SELECT state, COUNT(*) FROM sampdb.president
WHERE birth <= '1900-01-01'
GROUP BY state
HAVING COUNT(*) >=2
ORDER BY COUNT(*);
Sample Output:
state, count(*)
MA, 2
VT, 2
NC, 2
NY, 4
OH, 7
VA, 8
You’ll often need to CAST, e.g. from a String to a Datetime
CAST('2009-01-01' AS DATETIME)
Often you’ll need datetime functions. Check out the reference here:
https://dev.mysql.com/doc/refman/8.0/en/date-and-time-functions.html
Get the Current Date
SELECT CURDATE();
+------------+
| CURDATE() |
+------------+
| 2019-01-26 |
+------------+
SELECT CURDATE() - INTERVAL 7 DAY;
+----------------------------+
| CURDATE() - INTERVAL 7 DAY |
+----------------------------+
| 2019-01-19 |
+----------------------------+
mysql> SELECT DATE_SUB(NOW(), INTERVAL 30 DAY) AS 30DaysAgo;
+---------------------+
| 30DaysAgo |
+---------------------+
| 2018-12-27 14:09:38 |
+---------------------+
You can select a string into a date or datetime.
SELECT whatever
FROM yourtable
WHERE STR_TO_DATE(yourdatefield, '%m/%d/%Y') > CURDATE() - INTERVAL 7 DAY
Create ‘buildings’ table
CREATE TABLE buildings (
building_no INT PRIMARY KEY AUTO_INCREMENT,
building_name VARCHAR(255) NOT NULL,
address VARCHAR(255) NOT NULL
);
CREATE TABLE rooms (
room_no INT PRIMARY KEY AUTO_INCREMENT,
room_name VARCHAR(255) NOT NULL,
building_no INT NOT NULL,
FOREIGN KEY (building_no)
REFERENCES buildings (building_no)
ON DELETE CASCADE
);
Notice that we add the ON DELETE CASCADE
clause at the end of the foreign key constraint definition.
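To see the cascade in action without a MySQL server, here's a sketch using Python's built-in sqlite3. SQLite accepts the same ON DELETE CASCADE clause, but only enforces it once foreign keys are switched on with a PRAGMA (InnoDB enforces them by default); the table columns are trimmed down from the example above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: FK enforcement is off by default

conn.execute("""CREATE TABLE buildings (
    building_no INTEGER PRIMARY KEY,
    building_name TEXT NOT NULL)""")
conn.execute("""CREATE TABLE rooms (
    room_no INTEGER PRIMARY KEY,
    room_name TEXT NOT NULL,
    building_no INTEGER NOT NULL,
    FOREIGN KEY (building_no) REFERENCES buildings (building_no) ON DELETE CASCADE)""")

conn.execute("INSERT INTO buildings VALUES (1, 'HQ')")
conn.executemany("INSERT INTO rooms VALUES (?, ?, 1)", [(1, "Lobby"), (2, "Lab")])

# Deleting the building cascades to its rooms
conn.execute("DELETE FROM buildings WHERE building_no = 1")
remaining = conn.execute("SELECT COUNT(*) FROM rooms").fetchone()[0]
print(remaining)  # 0
```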
SELECT name FROM city WHERE pincode IN (SELECT pincode FROM pin WHERE zone = 'west')
SELECT name FROM city WHERE pincode IN ...
SELECT pincode FROM pin WHERE zone = 'west'
An alias (AS) is required for subqueries. There are two varieties of subqueries: correlated and non-correlated.
Non-correlated subquery means that the inner query doesn’t depend on the outer query and can run as a stand alone query
SELECT company FROM stock WHERE listed_on_exchange = (SELECT ric FROM market WHERE country='japan')
- The inner query executes first, then the outer query.
- non-correlated subqueries usually use IN
or NOT IN
Correlated subquery means that the inner query depends on the outer query
SELECT student_id, score FROM sampdb.score AS scr WHERE event_id = 3 AND score > (SELECT AVG(score) FROM sampdb.score WHERE event_id = scr.event_id GROUP BY event_id)
- Requires use of alias (AS
)
- The outer query executes first, then the inner query; this is because the inner query depends on the output of the outer query
- These queries are slower than non-correlated queries (think about doing a join instead)
- correlated subqueries usually use EXISTS
or NOT EXISTS
SELECT * FROM mysuppliers WHERE NOT EXISTS (SELECT * FROM myorders WHERE mysuppliers.supplier_id = myorders.supplier_id);
Suppose that a website contains two tables, the Customers table and the Orders table. Write a SQL query to find all customers who never order anything.
Table: Customers.
+----+-------+
| Id | Name |
+----+-------+
| 1 | Joe |
| 2 | Henry |
| 3 | Sam |
| 4 | Max |
+----+-------+
Table: Orders.
+----+------------+
| Id | CustomerId |
+----+------------+
| 1 | 3 |
| 2 | 1 |
+----+------------+
Using the above tables as example, return the following:
+-----------+
| Customers |
+-----------+
| Henry |
| Max |
+-----------+
SELECT c.Name AS Customers FROM Customers c
WHERE NOT EXISTS (SELECT * FROM Orders o WHERE o.CustomerId = c.Id);
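You can check the two equivalent formulations (NOT EXISTS, and LEFT JOIN filtered on IS NULL) against the sample data using Python's sqlite3, which accepts the same SQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (Id INTEGER PRIMARY KEY, Name TEXT)")
conn.execute("CREATE TABLE Orders (Id INTEGER PRIMARY KEY, CustomerId INTEGER)")
conn.executemany("INSERT INTO Customers VALUES (?, ?)",
                 [(1, "Joe"), (2, "Henry"), (3, "Sam"), (4, "Max")])
conn.executemany("INSERT INTO Orders VALUES (?, ?)", [(1, 3), (2, 1)])

# Formulation 1: anti-join via NOT EXISTS
not_exists = conn.execute("""
    SELECT c.Name FROM Customers c
    WHERE NOT EXISTS (SELECT * FROM Orders o WHERE o.CustomerId = c.Id)""").fetchall()

# Formulation 2: LEFT JOIN, keep only rows with no match on the right
left_join = conn.execute("""
    SELECT c.Name FROM Customers c
    LEFT JOIN Orders o ON o.CustomerId = c.Id
    WHERE o.Id IS NULL""").fetchall()

print(not_exists)  # [('Henry',), ('Max',)]
```

Both return Henry and Max, the customers with no rows in Orders.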
Used to combine rows from two or more tables. Types of joins include:
SELECT column_name(s) FROM table1 a INNER JOIN table2 b ON a.column_name=b.column_name;
SELECT column_name(s) FROM table1 a LEFT JOIN table2 b ON a.column_name=b.column_name;
SELECT column_name(s) FROM table1 a RIGHT JOIN table2 b ON a.column_name=b.column_name;
SELECT column_name(s) FROM table1 a FULL OUTER JOIN table2 b ON a.column_name=b.column_name;
SELECT column_name(s) FROM table1 UNION SELECT column_name(s) FROM table2
SELECT column_name(s) FROM table1 UNION ALL SELECT column_name(s) FROM table2
To get rows in table1 with no match in table2 (a left join excluding the intersection, on a.my_field = b.my_field):
SELECT column_name(s) FROM table1 a LEFT JOIN table2 b ON a.my_key = b.my_key WHERE b.my_key IS NULL
To get rows in table2 with no match in table1 (a right join excluding the intersection):
SELECT column_name(s) FROM table1 a RIGHT JOIN table2 b ON a.my_key = b.my_key WHERE a.my_key IS NULL
Say a table has columns (P_ID, LastName, SSN).
Add a primary key on one column (P_ID) as the constraint: ALTER TABLE MyTable ADD PRIMARY KEY (P_ID)
Add a composite primary key on (P_ID, LastName) as the constraint: ALTER TABLE MyTable ADD PRIMARY KEY (P_ID, LastName)
Drop the primary key: ALTER TABLE MyTable DROP PRIMARY KEY
Add a foreign key: ALTER TABLE MyTable ADD FOREIGN KEY (P_ID) REFERENCES OtherTable(P_ID)
Add a named foreign key on (P_ID) as the constraint: ALTER TABLE MyTable ADD CONSTRAINT my_fk FOREIGN KEY (P_ID) REFERENCES OtherTable(P_ID)
Drop the foreign key: ALTER TABLE MyTable DROP FOREIGN KEY my_fk
(A primary key cannot contain a NULL field.) If more than one key value applies to a given column, the Key column displays the one with the highest priority, in the order of PRI, UNI, and MUL.
There's also a Non_unique column, meaning 0 if the index cannot contain duplicates and 1 if it can.
DROP DATABASE MyDatabase deletes everything in the database.
DROP TABLE MyTable deletes the table.
TRUNCATE TABLE MyTable removes all rows but keeps the table.
ALTER TABLE MyTable MODIFY COLUMN MyColumn <datatype> changes a column's datatype.
ALTER TABLE MyTable DROP COLUMN MyColumn drops a column.
INSERT INTO MyDatabase.MyTable SET ThisField = ThisData
INSERT INTO MyDatabase.MyTable VALUES ('field1', 'field2', 'field3')
With the ON DUPLICATE KEY UPDATE clause, if an inserted row would cause a duplicate value in a UNIQUE index or PRIMARY KEY, MySQL runs an UPDATE of the old row instead.
An INSERT statement increases any auto-increment values, but UPDATE does not.
INSERT INTO t1 (a,b,c) VALUES (1,2,3)
ON DUPLICATE KEY UPDATE c=c+1;
UPDATE MyDatabase.MyTable SET ThisField = ThisData
DELETE FROM MyDatabase.MyTable # Careful, with no criteria this deletes every row in the table!
Write a SQL query to delete all duplicate email entries in a table named Person, keeping only unique emails based on its smallest Id.
+----+------------------+
| Id | Email |
+----+------------------+
| 1 | john@example.com |
| 2 | bob@example.com |
| 3 | john@example.com |
+----+------------------+
Id is the primary key column for this table. For example, after running your query, the above Person table should have the following rows:
+----+------------------+
| Id | Email |
+----+------------------+
| 1 | john@example.com |
| 2 | bob@example.com |
+----+------------------+
# Write your MySQL query statement below
DELETE FROM Person USING Person, Person AS PersonCopy
WHERE Person.Id > PersonCopy.Id AND Person.Email = PersonCopy.Email;
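The USING self-join is MySQL-specific syntax, but the result can be checked with an equivalent NOT IN formulation (keep the smallest Id per email) using sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (Id INTEGER PRIMARY KEY, Email TEXT)")
conn.executemany("INSERT INTO Person VALUES (?, ?)",
                 [(1, "john@example.com"), (2, "bob@example.com"),
                  (3, "john@example.com")])

# Equivalent to the MySQL self-join DELETE: keep the smallest Id per email
conn.execute("""
    DELETE FROM Person
    WHERE Id NOT IN (SELECT MIN(Id) FROM Person GROUP BY Email)""")

rows = conn.execute("SELECT Id, Email FROM Person ORDER BY Id").fetchall()
print(rows)  # [(1, 'john@example.com'), (2, 'bob@example.com')]
```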
IF(2>1, 'OK', 'NOT OK')
CASE
WHEN 2 > 1
THEN 'OK'
WHEN 1 = 1
THEN 'YEP'
ELSE
'NOT OK'
END
You can combine all of the above into something like this (Note: From SQL Server):
SELECT IVRName,
SUM(CASE WHEN Disconnection in ('Answer') THEN 1 END) AS Answered,
SUM(CASE WHEN Disconnection in ('Abandon') THEN 1 END) AS Abandoned
FROM LifeNetDW.dbo.QueueMetrics
WHERE CallTime > '2015-1-31' AND IVRName IS NOT NULL
GROUP BY IVRName
An index is a copy of selected columns from a table that can be searched efficiently (with a link to the complete row of data it was copied from).
CREATE INDEX MyIndex on MyTable (colToIndex)
CREATE INDEX MyIndex on MyTable(colToIndex1, colToIndex2, colToIndex3)
ALTER TABLE MyTable DROP INDEX MyIndex
Indexes speed up the retrieval of data from the table (the index is used to select exactly the corresponding rows without scanning the whole table), so searches are faster. Primary key and unique key indexes also help avoid duplicate row data.
Anytime you have a write query (e.g. INSERT, UPDATE, DELETE), the indexes must be updated so indexes slow down write queries.
A B-Tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-Tree is a generalization of a binary search tree in that a node can have more than two children. B-Trees are often used in relational databases.
Cardinality is the uniqueness of data values in a particular column of a table. The lower the cardinality, the more duplicated elements in a column. For example, a bit column has a cardinality of 2 (very low) while a date column might have a cardinality of, say, 100000. If every value is unique, you have maximum cardinality; at minimum cardinality, all values are the same.
Cardinality is a property that affects the ability to cluster, sort, and search data: higher cardinality means an index on that column is more selective. Generally you want to list the columns in a composite index from highest cardinality to lowest.
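Cardinality is just a distinct count, so you can estimate it for any column sample in a few lines. A sketch with made-up rows (MySQL keeps its own estimate, visible via SHOW INDEX):

```python
rows = [
    {"is_active": 1, "state": "MA", "birth": "1735-10-30"},
    {"is_active": 0, "state": "MA", "birth": "1767-07-11"},
    {"is_active": 1, "state": "VT", "birth": "1829-10-05"},
    {"is_active": 1, "state": "PA", "birth": "1791-04-23"},
]

def cardinality(rows, col):
    """Number of distinct values in a column; higher means more selective."""
    return len({r[col] for r in rows})

# Order composite index columns from highest to lowest cardinality
cols = sorted(rows[0], key=lambda c: cardinality(rows, c), reverse=True)
print(cols)  # ['birth', 'state', 'is_active']
```

Here birth (4 distinct values) would lead the composite index, and the near-boolean is_active column would come last.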
You can update the cardinality on a table with the following command:
ANALYZE TABLE my_table;
+-----------------------+---------+------------+------------+
| Table | Op | Msg_type | Msg_text |
|-----------------------+---------+------------+------------|
| my_table | analyze | status | OK |
+-----------------------+---------+------------+------------+
SHOW INDEX FROM my_table;
Before we get into lock contention, let’s bring up how binary logging works. The binary log contains ‘events’ that describe database changes (e.g. table create, updates to table data). This log is used on the master server to send events to slave servers.
Data recovery operations also use the binary log. Statements like SELECT and SHOW aren’t included in the binary log.
To enable the binary log, start the server with --log-bin[=base_name], where the default base_name is the name of the host machine followed by -bin.
mysql> show binary logs;
+---------------------+-------------+
| Log_name | File_size |
|---------------------+-------------|
| my-db-bin.000825 | 1073742262 |
| my-db-bin.000826 | 1073742059 |
Depending on your my.cnf
file, you can see where your log-bins directory is.
mysql> show binlog events limit 10;
Here you can see what’s going on with the binlog events. Are there too many inserts, updates, etc.
This is on the ‘user’ table with a PK on user_id
user_id(pk) | name | status
1 | Ronald | 0
100 | Messi | 0
10000000 | Raul | 9
Read about binary logging first, then read below about lock contention. An example of lock contention:
Query 1: DELETE FROM user WHERE status=9 (takes a long time to scan and delete)
Query 2: UPDATE user SET status=9 WHERE user_id=100
Why did Query 2 get blocked by Query 1? Think about the scenario if Query 1 did not block Query 2: the UPDATE on user_id=100 commits first, so the binary log records the statements in commit order:
UPDATE user SET status=9 WHERE user_id=100;
DELETE FROM user WHERE status=9;
When a slave replays the log in that order, the row with user_id=100 will not exist on the slave, but it will still exist on the master. So basically, the table user needs to be locked, because any following operations on that same table would cause data inconsistencies for binary logging (and thus issues with, say, data replication to slaves).
To solve this we can use next-key locking, where we lock not just the ‘modified’ rows, but also ‘scanned’ rows. The disadvantage is low concurrency.
If the status column is indexed, then Queries 1 and 2 can run in parallel (to improve concurrency).
To view deadlocked transactions, you can run:
SHOW ENGINE INNODB STATUS \G
An example might look like this:
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 320171 page no 48814 n bits 248 index `PRIMARY` of table `my_db`.`my_table` trx id 126544661030 lock_mode X locks rec but not gap waiting
*** (2) TRANSACTION:
TRANSACTION 126544669705, ACTIVE 8 sec fetching rows, thread declared inside InnoDB 2630
mysql tables in use 2, locked 2
3964 lock struct(s), heap size 472616, 34781 row lock(s)
MySQL thread id 10893466142, OS thread handle 0x7ed5f70c3700, query id 829917141049 10.10.0.10 idx System lock
UPDATE my_db.my_table
SET ready_to_delete_bit = 1
WHERE my_field = 121 AND start_datetime > NOW()
AND (another_field = 13 or this_field = 'a')
ORDER BY that_field ASC
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 320171 page no 48814 n bits 248 index `PRIMARY` of table `my_db`.`my_table` trx id 126544669705 lock_mode X locks rec but not gap
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 320171 page no 3035 n bits 200 index `PRIMARY` of table `my_db`.`my_table` trx id 126544669705 lock_mode X locks rec but not gap waiting
*** WE ROLL BACK TRANSACTION (2)
------------
Indexes and keys play an important role in deadlocks, often involving gap locks.
The idea is that InnoDB has two types of standard row-level locks: shared (S) locks and exclusive (X) locks.
InnoDB also supports multiple granularity locking (with table-level intention locks, IS and IX), which allows the coexistence of row locks and table locks.
Table-level lock type compatibility is summarized in the following matrix.
X IX S IS
X Conflict Conflict Conflict Conflict
IX Conflict Compatible Conflict Compatible
S Conflict Conflict Compatible Compatible
IS Conflict Compatible Compatible Compatible
A gap lock is a lock on a gap between index records, or a lock on the gap before the first or after the last index record.
An example of this could be: SELECT c1 FROM t WHERE c1 BETWEEN 10 AND 20 FOR UPDATE;
This prevents other transactions from inserting a value of 15 into column t.c1, whether or not there was already such a value in the column, because the gaps between all existing values in the range are locked.
A gap might span a single index value, multiple index values, or even be empty.
Gap locks are part of the tradeoff between performance and concurrency, and are used in some transaction isolation levels and not others.
See an example of MySQL deadlocks with concurrent inserts here:
http://thushw.blogspot.com/2010/11/mysql-deadlocks-with-concurrent-inserts.html
Show engine InnoDB status is also good for checking the number of row operations you have going on:
mysql> SHOW ENGINE INNODB STATUS\G
Main thread process no. 38101, id 139815377716992, state: sleeping
Number of rows inserted 334154636, updated 547122709, deleted 315490449, read 6071813255519
33.11 inserts/s, 106.77 updates/s, 89.77 deletes/s, 950883.78 reads/s
Another way you can do this is by running the below SELECT, then waiting the time difference and doing your calculation (e.g. run, then wait 5 minutes, then do your calculation):
SELECT VARIABLE_NAME, VARIABLE_VALUE
FROM information_schema.global_status
WHERE VARIABLE_NAME IN ('COM_SELECT', 'COM_INSERT', 'COM_UPDATE', 'COM_DELETE');
+-----------------+------------------+
| VARIABLE_NAME | VARIABLE_VALUE |
|-----------------+------------------|
| COM_DELETE | 153108 |
| COM_INSERT | 5667103 |
| COM_SELECT | 130384472 |
| COM_UPDATE | 1978116 |
+-----------------+------------------+
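The calculation itself is just a delta over the sampling interval. A sketch with hypothetical counter values sampled 5 minutes (300 seconds) apart:

```python
# Hypothetical COM_* counter snapshots from information_schema.global_status
sample_1 = {"COM_SELECT": 130_384_472, "COM_INSERT": 5_667_103}
sample_2 = {"COM_SELECT": 130_684_472, "COM_INSERT": 5_670_103}  # 300 s later
interval_seconds = 300

# Queries per second for each statement type over the interval
rates = {name: (sample_2[name] - sample_1[name]) / interval_seconds
         for name in sample_1}
print(rates)  # {'COM_SELECT': 1000.0, 'COM_INSERT': 10.0}
```

The counters are cumulative since server start, so a single snapshot tells you little; only the difference between two snapshots gives a rate.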
One way you can optimize: instead of having every statement run as its own transaction, you can group statements together with transactions (START TRANSACTION ... COMMIT).
A next-key lock is a combination of a record lock on the index record and a gap lock on the gap before the index record.
user_id(pk) | name | status
1 | Ronald | 0
2 | John | 1
10000000 | Raul | 9
If you need to move data across servers, you might have to do some extra work. If you're on the database server, you have access to OUTFILE and INFILE, so you can dump data out to, or load data in from, say, a csv file or a .sql file.
You can use mysqldump on the command line like this (log in, get data from a database and table with a where clause; note there is no space between -p and the password):
mysqldump -t -u myusername -pmypassword -h myhostname --port 3306 mydatabasename mytablename --where="status='Sure'"
You can also run mysql on the command line like so (run a sql query, replace tabs with ',', then dump to a csv file):
mysql -u myusername -pmypassword -h myhostname mydatabasename -e "SELECT * FROM some_table" | tr '\t' , > myData.csv
Some of these commands assume you're on the same server as the database server, which you probably won't be. Use LOAD DATA LOCAL INFILE instead if you want to upload a file to a remote server:
mysql -u myusername -pmypassword -h myhostname --database mydatabasename -e "LOAD DATA LOCAL INFILE '/home/will/myData.csv' INTO TABLE sometable LINES TERMINATED BY '\n' IGNORE 1 LINES"
Sometimes it helps to dump some data from another server into the server you need, then do a quick join and update.
update some_database.some_table a
join some_database.some_other_table b on a.id = b.id
set a.some_status=0;
Say you have a database ‘idx’ and want to see how much space each table is, run:
select TABLE_NAME as `table`, round((DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024) AS `Size (MB)`
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'idx'
ORDER BY (DATA_LENGTH + INDEX_LENGTH)
DESC;
The idea behind partitioning is to use multiple tables instead of one table. You can divide a table into many tables so that you can have old data in one sub table and new data in another table. The database is then able to optimize queries where you ask for new data knowing that they are in the second table. You can also define how the data is partitioned.
CREATE TABLE employees (
id INT NOT NULL,
fname VARCHAR(30),
lname VARCHAR(30),
hired DATE NOT NULL DEFAULT '1970-01-01',
separated DATE NOT NULL DEFAULT '9999-12-31',
job_code INT,
store_id INT
)
PARTITION BY RANGE ( YEAR(separated) ) (
PARTITION p0 VALUES LESS THAN (1991),
PARTITION p1 VALUES LESS THAN (1996),
PARTITION p2 VALUES LESS THAN (2001),
PARTITION p3 VALUES LESS THAN MAXVALUE
);
You can drop old data (pruning) with:
ALTER TABLE employees DROP PARTITION p0;
You can also speed up a query like this, since all the data will be in partition p2:
SELECT COUNT(*)
FROM employees
WHERE separated BETWEEN '2000-01-01' AND '2000-12-31'
GROUP BY store_id;
MySQL has a sys schema, which is a collection of views and procedures for MySQL admins to get insight on how a MySQL database is used: https://github.com/mysql/mysql-sys
Depending on your version (e.g. 5.6, 5.7), download the correct file (e.g. sys_56.sql) and run it as root.
cd /tmp/mysql-sys/
mysql -u root -p < ./sys_56.sql
If you're on your database server and you don't have Git, you can generate a single file using generate_sql_file.sh; you should then see it in your 'gen' directory.
mysql> use sys;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> desc host_summary;
+------------------------+---------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------------------+---------------+------+-----+---------+-------+
| host | varchar(60) | YES | | NULL | |
| statements | decimal(64,0) | YES | | NULL | |
| statement_latency | text | YES | | NULL | |
| statement_avg_latency | text | YES | | NULL | |
| table_scans | decimal(65,0) | YES | | NULL | |
| file_ios | decimal(64,0) | YES | | NULL | |
| file_io_latency | text | YES | | NULL | |
| current_connections | decimal(41,0) | YES | | NULL | |
| total_connections | decimal(41,0) | YES | | NULL | |
| unique_users | bigint(21) | NO | | 0 | |
| current_memory | text | YES | | NULL | |
| total_memory_allocated | text | YES | | NULL | |
+------------------------+---------------+------+-----+---------+-------+
12 rows in set (0.00 sec)
You can find logs in /var/log/mysql. There's a breakdown of slow.log, error.log, etc.
MySQL has a global Query Cache that is shared among all the sessions. It caches the select query along with the result set. The query has to be EXACT (no whitespace difference) in order for the cache to be used correctly. Basically, if the query is in the cache, it’ll search there first.
Remember that there are tradeoffs: faster reads but slower writes. Anytime an update happens, the cache is invalidated and must be rebuilt.
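The behavior can be sketched as a dictionary keyed by the exact query string, cleared on any write. This is an illustration only: the real query cache invalidates per table rather than wholesale, and was removed entirely in MySQL 8.0.

```python
class ToyQueryCache:
    """Illustration of exact-match query caching with write invalidation."""
    def __init__(self, run_query):
        self.run_query = run_query   # fallback that actually hits the database
        self.cache = {}
        self.hits = 0

    def select(self, sql):
        if sql in self.cache:        # exact string match, whitespace and all
            self.hits += 1
            return self.cache[sql]
        result = self.run_query(sql)
        self.cache[sql] = result
        return result

    def write(self, sql):
        self.cache.clear()           # any update invalidates the cached results
        return self.run_query(sql)

cache = ToyQueryCache(run_query=lambda sql: "rows-for:" + sql)
cache.select("SELECT * FROM iris")
cache.select("SELECT * FROM iris")    # cache hit
cache.select("SELECT  * FROM iris")   # extra space: a miss, not the same string
cache.write("UPDATE iris SET sepallength = 5.0")
cache.select("SELECT * FROM iris")    # miss again: the write cleared the cache
print(cache.hits)  # 1
```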
# check that query cache is enabled
show variables like '%have_query_cache%';
+------------------+---------+
| Variable_name | Value |
|------------------+---------|
| have_query_cache | YES |
+------------------+---------+
show status like 'qcache%';
+-------------------------+---------+
| Variable_name | Value |
|-------------------------+---------|
| Qcache_free_blocks | 0 |
| Qcache_free_memory | 0 |
| Qcache_hits | 0 |
| Qcache_inserts | 0 |
| Qcache_lowmem_prunes | 0 |
| Qcache_not_cached | 0 |
| Qcache_queries_in_cache | 0 |
| Qcache_total_blocks | 0 |
+-------------------------+---------+
show variables like '%query_cache_size%';
+------------------+---------+
| Variable_name | Value |
|------------------+---------|
| query_cache_size | 0 |
+------------------+---------+
1 row in set
You can see the binlog with:
SHOW BINARY LOGS;
...
| mysql-bin-changelog.510021 | 49349349 |
+----------------------------+-----------+
When you replicate a database to another database, you can do a few quick checks to make sure things are running right:
mysql>SHOW MASTER STATUS;
mysql>SHOW SLAVE STATUS;
Look for the field Seconds_Behind_Master. If this value is NULL, replication is broken; otherwise it's an integer of how many seconds the replica is behind.
If Seconds_Behind_Master is NULL, you'll probably see Last_SQL_Error with some error message.
Fix the error (e.g. say the disk is full) and then restart the slave server with:
mysql>START SLAVE;
Sometimes you’ll need to skip a query or two to make slave run again, so set this global variable to skip that query before starting slave:
SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1;
START SLAVE;
MySQL replication works with 3 threads. These 3 main threads exist for each source/replica connection.
A source that has multiple replicas creates one binary log dump thread for each currently connected replica, and each replica has its own replication I/O and SQL threads.
A replica uses two threads to separate reading updates from the source and executing them into independent tasks.
The source creates a thread to send the binary log contents to a replica when the replica connects.
This thread can be identified in the output of SHOW PROCESSLIST
on the source as the Binlog Dump
thread.
The binary log dump thread acquires a lock on the source’s binary log for reading each event that is to be sent to the replica. As soon as the event has been read, the lock is released, even before the event is sent to the replica.
When a START SLAVE statement is issued on a replica server, the replica creates an I/O thread, which connects to the source and asks it to send the updates recorded in its binary log.
The replication I/O thread reads the updates that the source’s Binlog Dump
thread sends and copies them to local files
that comprise the replica’s relay log.
You can STOP SLAVE IO_THREAD and START SLAVE IO_THREAD.
The replica creates an SQL thread to read the relay log that is written by the replication I/O thread and executes the transactions contained in it. Usually this is the issue with lag.
You can STOP SLAVE SQL_THREAD and START SLAVE SQL_THREAD.
You can set the slave_parallel_workers system variable to a value greater than 0 (the default is 0). When this system variable is set, the replica creates the specified number of worker threads to apply transactions, plus a coordinator thread to manage them. With this setup, transactions that fail can be retried.
You can tell whether it's the I/O thread or the SQL thread that is behind: if SHOW MASTER STATUS shows a file (e.g. mysql-bin-changelog.066552) and the replica's SHOW SLAVE STATUS shows an earlier file (e.g. mysql-bin-changelog.066548), the IO_THREAD is behind.
Replication can take place at the SQL level (statement), at the row level, or mixed (set via binlog_format).
SHOW MASTER STATUS\G;
*************************** 1. row **************************
File: laforge-bin.011
Position: 566920603
Binlog_do_db:
Binlog_ignore_db:
Binlog_do_db means to log updates only for certain databases.
Binlog_ignore_db means to log updates while ignoring certain databases.
The File and Position fields show the offset in the current binary log.
GRANT REPLICATION SLAVE ON *.* TO 'rpl_user'@'slave-host' IDENTIFIED BY 'rpl_pass';
CHANGE MASTER TO MASTER_HOST='master-host', MASTER_USER='rpl_user',
MASTER_PASSWORD='rpl_pass';
START SLAVE;
SHOW SLAVE STATUS\G;
The information_schema allows you to get information about all tables and views within a database.
The columns that this view returns are: