What is Hive? Hive is a data warehousing infrastructure based on Apache Hadoop (a scalable data storage and data processing system using commodity hardware). Hive lets you do ad-hoc querying and data analysis with custom functionality using User Defined Functions (UDFs). Hive is not designed for online transaction processing, but instead is best for traditional data warehousing tasks.
Hive organizes data into the following components:
Note: Not necessary for tables to be partitioned or bucketed, but if done correctly can result in faster query execution
For local development on Mac:
brew install hive
/usr/local/Cellar/hive/3.1.1/libexec
hive-default.xml.template
from /usr/local/Cellar/hive/3.1.1/libexec/conf
to hive-site.xml
MySQL
$ mysql
mysql> CREATE DATABASE metastore;
mysql> USE metastore;
mysql> CREATE USER 'hiveuser'@'localhost' IDENTIFIED BY 'hivepassword';
mysql> GRANT SELECT,INSERT,UPDATE,DELETE,ALTER,CREATE ON metastore.* TO 'hiveuser'@'localhost';
The configs for hive. You will need to set this up locally (e.g. for MySQL):
javax.jdo.option.ConnectionURL
value might be jdbc:mysql://localhost/test?useUnicode=true&useJDBCCompliantTimezoneShift=true&useLegacyDatetimeCode=false&serverTimezone=UTC
javax.jdo.option.ConnectionDriverName
value might be com.mysql.jdbc.Driver
datanucleus.schema.autoCreateAll
to true
Add the following to hive-site.xml
<property>
<name>system:java.io.tmpdir</name>
<value>/tmp/hive/java</value>
</property>
<property>
<name>system:user.name</name>
<value>${user.name}</value>
</property>
If there is an issue, you can run in debug mode:
hive -hiveconf hive.root.logger=DEBUG,console
Types can be found here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types
Primitive Types can be:
DOUBLE
, FLOAT
, BIGINT
, INT
, SMALLINT
, TINYINT
)STRING
, VARCHAR
, CHAR
)Complex Types can be:
.
) notation. E.g. Column c
with data {a INT; b INT}
can access field a
with c.a
['element name']
notation. E.g.
map M
has a mapping from group
-> gid
the gid value can be accessed using M['group']
[n]
notation where n
is an index (zero-based) into the array. For examplel elements['a', 'b', 'c']
has A[1]
return b
.Array - A[n]
- returns the nth element in the array
Maps - M[key]
returns the value in key
Structs - S.x
returns x field of struct S
To see the operators and functions, run the following:
SHOW FUNCTIONS;
DESCRIBE FUNCTION <function_name>;
DESCRIBE FUNCTION EXTENDED <function_name>;
Build in functions include:
regex_replace(string A, string B, string C)
returns the string resulting from replacing all substrings in B that
match the Java regular expression syntaxcast(<expr> as <type>)
converts the results of the expression to a type (e.g. cast('1' AS BIGINT)
get_json_object(string json_string, string path)
extracts the JSON object from a JSON string based on the
path specified and returns JSON string of the extracted JSON object. Returns NULL if the input json string is invalid.Normal user-defined functions (e.g. concat()
), take in a single input row and output a single output row.
Table-generating functions transform a single input row to multiple output rows. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
For example:
explode(ARRAY <T> a)
explodes an array to multiple rows (one row for each element from the array)explode(MAP<Tkey,Tvalue>m)
explodes a map to multiple rows. Returns a row-set with a two columns (key, value), one
row for each key-value pair from the input mapposexplode(ARRAY<T> a)
explodes an array to multiple rows with additional positional column of int type
(position of items in the original array, starting with 0). Returns a row-set with two columns (pos, val), one row
for each element from the arrayinline(ARRAY<STRUCT<f1:T1...fn:Tn>> a)
explodes an array of structs to multiple rowsstack
breaks up n values into r rows. Each row will have n/r columnsjson_tuple
takes JSON string and a set of n keys, and returns a tuple of n valuesThese are often used with LATERAL VIEW
Just type in hive
and then you can run your queries
hive> show tables;