Hive

Concepts

What is Hive? Hive is a data warehousing infrastructure based on Apache Hadoop (a scalable data storage and data processing system using commodity hardware). Hive lets you do ad-hoc querying and data analysis with custom functionality using User Defined Functions (UDFs). Hive is not designed for online transaction processing, but instead is best for traditional data warehousing tasks.

Components

Hive organizes data into the following components:

Databases - Namespaces function to avoid name conflict for tables, views, partitions
Tables - a group of rows of data that have the same schema
Partitions - each table has one or more partition keys, which determines how the data is stored Speeds up analysis of large sets of data
Buckets (aka Clusters) - data in each partition may be divided into buckets based on a hash function

Note: Not necessary for tables to be partitioned or bucketed, but if done correctly can result in faster query execution

Install

For local development on Mac:

Install Java 7 (I had to use a lower version of Java)
brew install hive
My Hive home is in /usr/local/Cellar/hive/3.1.1/libexec
I copied the hive-default.xml.template from /usr/local/Cellar/hive/3.1.1/libexec/conf to hive-site.xml
I edited the file so that I’d have local options

hive-site.xml

MySQL

$ mysql
mysql> CREATE DATABASE metastore;
mysql> USE metastore;
mysql> CREATE USER 'hiveuser'@'localhost' IDENTIFIED BY 'hivepassword';
mysql> GRANT SELECT,INSERT,UPDATE,DELETE,ALTER,CREATE ON metastore.* TO 'hiveuser'@'localhost';

The configs for hive. You will need to set this up locally (e.g. for MySQL):

javax.jdo.option.ConnectionURL value might be jdbc:mysql://localhost/test?useUnicode=true&useJDBCCompliantTimezoneShift=true&useLegacyDatetimeCode=false&serverTimezone=UTC
javax.jdo.option.ConnectionDriverName value might be com.mysql.jdbc.Driver
datanucleus.schema.autoCreateAll to true

Add the following to hive-site.xml

<property>
  <name>system:java.io.tmpdir</name>
  <value>/tmp/hive/java</value>
</property>
<property>
  <name>system:user.name</name>
  <value>${user.name}</value>
</property>

Debug

If there is an issue, you can run in debug mode:

hive -hiveconf hive.root.logger=DEBUG,console

Types

Types can be found here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types

Primitive Types can be:

Number (e.g. DOUBLE, FLOAT, BIGINT, INT, SMALLINT, TINYINT)
Boolean
String (e.g. STRING, VARCHAR, CHAR)

Complex Types can be:

Structs - elements within the type can be accessed using DOT (.) notation. E.g. Column c with data {a INT; b INT} can access field a with c.a
Maps - are key-value tuples where the elements are accessed using ['element name'] notation. E.g. map M has a mapping from group -> gid the gid value can be accessed using M['group']
Arrays - are indexable lists, where the elements in the array have to be the same type. Elements are accessed using the [n] notation where n is an index (zero-based) into the array. For examplel elements['a', 'b', 'c'] has A[1] return b.

Operators on Complex Types

Array - A[n] - returns the nth element in the array Maps - M[key] returns the value in key Structs - S.x returns x field of struct S

Functions

To see the operators and functions, run the following:

SHOW FUNCTIONS;
DESCRIBE FUNCTION <function_name>;
DESCRIBE FUNCTION EXTENDED <function_name>;

Built-in Functions

Build in functions include:

regex_replace(string A, string B, string C) returns the string resulting from replacing all substrings in B that match the Java regular expression syntax
cast(<expr> as <type>) converts the results of the expression to a type (e.g. cast('1' AS BIGINT)
get_json_object(string json_string, string path) extracts the JSON object from a JSON string based on the path specified and returns JSON string of the extracted JSON object. Returns NULL if the input json string is invalid.

User-Defined Functions

Normal user-defined functions (e.g. concat()), take in a single input row and output a single output row.

Built-in Table-Generating Functions (UDTF)

Table-generating functions transform a single input row to multiple output rows. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

For example:

explode(ARRAY <T> a) explodes an array to multiple rows (one row for each element from the array)
explode(MAP<Tkey,Tvalue>m) explodes a map to multiple rows. Returns a row-set with a two columns (key, value), one row for each key-value pair from the input map
posexplode(ARRAY<T> a) explodes an array to multiple rows with additional positional column of int type (position of items in the original array, starting with 0). Returns a row-set with two columns (pos, val), one row for each element from the array
inline(ARRAY<STRUCT<f1:T1...fn:Tn>> a) explodes an array of structs to multiple rows
stack breaks up n values into r rows. Each row will have n/r columns
json_tuple takes JSON string and a set of n keys, and returns a tuple of n values

These are often used with LATERAL VIEW

Running Hive

Just type in hive and then you can run your queries

hive> show tables;