William Liu

Hive

Concepts

What is Hive? Hive is a data warehousing infrastructure based on Apache Hadoop (a scalable data storage and data processing system using commodity hardware). Hive lets you do ad-hoc querying and data analysis with custom functionality using User Defined Functions (UDFs). Hive is not designed for online transaction processing, but instead is best for traditional data warehousing tasks.

Components

Hive organizes data into the following components:

Note: Not necessary for tables to be partitioned or bucketed, but if done correctly can result in faster query execution

Install

For local development on Mac:

hive-site.xml

MySQL

$ mysql
mysql> CREATE DATABASE metastore;
mysql> USE metastore;
mysql> CREATE USER 'hiveuser'@'localhost' IDENTIFIED BY 'hivepassword';
mysql> GRANT SELECT,INSERT,UPDATE,DELETE,ALTER,CREATE ON metastore.* TO 'hiveuser'@'localhost';

The configs for hive. You will need to set this up locally (e.g. for MySQL):

Add the following to hive-site.xml

<property>
  <name>system:java.io.tmpdir</name>
  <value>/tmp/hive/java</value>
</property>
<property>
  <name>system:user.name</name>
  <value>${user.name}</value>
</property>

Debug

If there is an issue, you can run in debug mode:

hive -hiveconf hive.root.logger=DEBUG,console

Types

Types can be found here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types

Primitive Types can be:

Complex Types can be:

Operators on Complex Types

Array - A[n] - returns the nth element in the array Maps - M[key] returns the value in key Structs - S.x returns x field of struct S

Functions

To see the operators and functions, run the following:

SHOW FUNCTIONS;
DESCRIBE FUNCTION <function_name>;
DESCRIBE FUNCTION EXTENDED <function_name>;

Built-in Functions

Build in functions include:

User-Defined Functions

Normal user-defined functions (e.g. concat()), take in a single input row and output a single output row.

Built-in Table-Generating Functions (UDTF)

Table-generating functions transform a single input row to multiple output rows. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

For example:

These are often used with LATERAL VIEW

Running Hive

Just type in hive and then you can run your queries

hive> show tables;