NumPy

Table of Contents

Summary
About NumPy
Array objects: ndarray
Array manipulation: copy, slice, index
Descriptive statistical methods
Universal Functions: ufuncs, vectorized operations
- Unary ufuncs (abs, exp, log, sqrt, floor, isnan)
- Binary ufuncs (add, subtract, multiple, divide, mod)
Array Iteration

Summary

NumPy takes Python's sequence data type (things like a list, which is flexible and lets you put in any data type) and takes it a step more specific by making it much more efficient at the cost of making data homogeneous (only the same dtype, like float64 or string). The idea is that with this new array type (ndarray), you can broadcast (work with different array sizes) and apply vectorized operation (i.e. no for-loops) to the entire array at the same time instead of element by element (i.e. like a scalar). This makes it really quick and there are a bunch of useful statistical methods. Because NumPy is optimized for speed, the operations (e.g. modifying sliced objects, sorting) is done inplace instead of making copies. If you want copies, you need to explicitly state it.

About Numpy

NumPy stands for Numerical Python and is a fundamental package for scientific computing in Python. With NumPy, we get array-oriented computing, which includes:

ndarray, multidimensional (n) array, whose advantages are:
- Vectorization - does batch operations without writing any for-loops
- Scalar - does operations that affect a single element at a time
- Broadcasting - allows operations with different sized arrays
NumPy does fast operations on arrays including linear algebra, statistical operations
There are tools for integrating with C, C++

Array objects with ndarray

ndarray is a N-dimensional array object, which is a collection of homogeneous data (i.e. all items must be the same type). Items can be indexed and each item takes up the same size block of memory. Note that this is not Python’s standard library of array.array, which is only one-dimensional and more limited functionality.

Vectorization - do batch operations without writing any for loops
Scalars - propogate changes to each element at a time
Broadcasting - operations between different sized arrays

Create

array (creating an ndarray)

The easiest way to create an ndarray is to use the array and passing in a sequence object (e.g. a list, array).

import numpy as np

#create ndarray by passing in a sequence
data = [6, 7.5, 8, 0, 1]
my_array = np.array(data)  #array([ 6., 7.5, 8., 0., 1. ])

#create ndarray by initializing values
one_dim = np.array([2, 3, 4], dtype='int16')  # can specify type
two_dim = np.array([[1.5, 2., 3.],  # nested sequences make multidimensions
                   [4., 5., 6.]])
three_dim = np.array([[[1., 2., 3.],
                       [4., 5., 6.]],
                      [[7., 8., 9.],
                       [10, 11, 12]]])

zeros, ones, empty

You can create an ndarray specifically with zeros (initial values of 0), ones (initial values of 1), empty (no initial values). For a single dimension, enter in a number. For multidimensional, enter a tuple.

temp = np.zeros(10)  # create a single dimension
temp  #array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

temp = np.zeros((2, 4))  # create multidimensional with tuple
#array([[0., 0., 0., 0.],
#       [0., 0., 0., 0.]])

temp = np.ones((1, 2, 3))  # create multidimensional with tuple
#array([[[1., 1., 1.],
#        [1., 1., 1.]]])

arange, linspace

arange returns the number of samples. arange is an array-valued version of Python’s range function and unless specified, is a float64 data type.

np.arange(7)
#array([0, 1, 2, 3, 4, 5, 6])

linspace is like arange, but it uses a step size instead of explicitly saying the number of samples. linspace(start, stop, num=50) returns evenly spaced numbers over a specified interval num. If endpoint=True then stop is the last sample (otherwise not included)

np.linspace(0, 5, num=6, endpoint=True)
#array([1, 1.8, 2.6, 3.4, 4.2, 5.])

Shape

You can see and modify the shape of your ndarray using a variety of tools, including: shape, ndim, reshape, ravel, t

shape

Each ndarray has a shape, a tuple indicating the size of each dimension. The first argument is the rows, second argument is the number of columns.

#shape
print one_dim.shape  #(3,)
print two_dim.shape  #(2, 3)
print three_dim.shape  #(2, 2, 3)

ndim

The ndarrays are n-dimensional, you can find the n-dimensions with ndim.

#ndim
one_dim.ndim  #1
two_dim.ndim  #2
three_dim.ndim  #3

reshape

If you want to change the shape, then do a reshape.

#reshape
np.reshape(one_dim, (1, 3))
#array([[2, 3, 4]], dtype=int16)

ravel

If you want to flatten the array, do a ravel.

#ravel
np.ravel(two_dim)
#array([1.5, 2., 3., 4., 5., 6.])

t

If you want to transpose, use t.

#Transpose (T)
temp = np.array([[1., 2., 3.], [4., 5., 6.]])
temp.T
#array([[1., 4.],
#       [2., 5.],
#       [3., 6.]])

Data Types

dtype

Each ndarray can only have one data type (i.e. homogeneous). This is stored in dtype. By default, numpy does a good job figuring it out. There is a type name (like float or int) with the number of bits per element.

one_dim.dtype  #dtype('int16')
two_dim.dtype  #dtype('float64')
three_dim.dtype  #dtype('float64')

astype

You can explicitly convert or cast an array from one dtype to another using ndarray’s astype method. If conversion fails, a TypeError will be raised. Using astype always creates a new array (a copy of the data).

default_array = np.array([1, 2, 3, 4, 5])
default_array.dtype  # dtype('int64')

float_array = default_array.astype(np.float64)
float_array.dtype  #dtype('float64')

floating point error

When doing complex calculations, you might get a floating point error because float64 and float32 arrays are only capable of approximating fractional quantities (so comparisons are only valid up to a certain number of decimal places)

Random

random returns random numbers in a given shape, specifies different seed values, shuffles, types, ranges, and distributions. There are a lot of other random functions (e.g. for multinomial, multivariate_normal), these are just a few:

random.seed sets a specific seed for the rng
permutation creates a random sequence
shuffle modifies a sequence in place by shuffling contents
rand creates array of given shape with random uniform distribution (0 to 1)
randint returns random integers from low (inclusive) to high (exclusive)
randn return a sample from the standard normal distribution (0 to 1)
binomial return sample(s) from a binomial distribution (0 to 1); like a coin toss results
normal return sample(s) from a normal (Gaussian) distribution; the ‘bell curve’
logistic returns sample(s) from a logistic distribution
chisquare returns sample(s) from a chisquare distribution
gamma returns sample(s) from a Gamma distribution
beta returns sample(s) of the Beta distribution, a special case of the Dirichlet distribution
uniform returns sample(s) from a uniform distribution

random.seed

seed saves a RandomState in the random number generator so the results are reproducable. You just set it at the beginning before calling your random numbers.

np.random.seed(123)

random.permutation

permutation creates a random sequence

np.random.permutation(10)
array([1, 7, 3, 0, 9, 2, 5, 8, 6])

random.shuffle

shuffle just shuffles the items

arr = np.arange(10)
np.random.shuffle(10)  #[1, 7, 5, 2, 9, 4, 3, 6, 0, 8]

random.rand

rand creates an array of a given shape (first arg is rows, second arg is columns) with a random uniform distribution (values 0 to 1)

np.random.rand(3, 4)
#array([[.004, .544, .077, .701],
#       [.660, .712, .748, .948],
#       [.979, .854, .973, .631]])

random.randint

randint returns a random integer based on first argument (lowest value), second argument (highest value, exclusive), and an optional size or shape of array.

np.random.randint(low=0, high=10, size=2)  # create 2 random ints 0-10
#array([5, 7])

np.random.randint(0, 100, (3, 4))  # create array (3, 4) with random ints 0-100
#array([[10, 92, 15, 72],
#       [ 7, 15, 44, 12],
#       [52, 69, 85, 74]])

random.randn

randn returns sample(s) from a standard normal distribution (0-1)

np.random.randn()  # get a single random value
#0.3706716413438257

np.random.randn(2, 4) + 1  # get an ndarray of random values
#array([[ 2.94723669,  2.52429319,  3.50433846,  1.16161333],
#       [ 0.81147123,  1.60399355, -0.12758733, -0.13248681]])

random.binomial

binomial returns sample(s) from a binomial distribution with n trials and p probability of success where n >=0 and p is in the interval [0, 1].

results = np.random.binomial(n=10, p=.5, size=1000)
results.shape(1000,)  # Results of flipping a coin 10 times, tested 1000 times

random.normal

normal returns sample(s) from a normal distribution with first arg as the mean, second arg as the standard deviation.

mu, sigma = 0, 0.1  # mean and standard deviation
np.random.normal(mu, sigma, 1000)  # get 1000 normal distribution samples

random.uniform

uniform returns sample(s) from a uniform distribution. First argument is low (default 0, inclusive), second argument is high (default 1, exclusive), and size.

np.random.uniform(-1, 0, size=1000)

Array Manipulation with Copy, Slice, Index

Copy

When working with arrays, the data is not copied because NumPy is made for fast performance and some datasets can get very large. To make a copy, you need to explicitly state .copy()

test = arr[2].copy()
test = arr[5:8].copy()

Slice and Index

There are a variety of ways to select a subset of your data or individual elements. Slicing is selecting a subset of your data. The axis 0 is y (vertical, the row) and axis 1 is x (horizontal, the column). Indexing is accessing the values of your data. Array slicing are views on the original array, meaning the data is not copied and any modifications to the view will be reflected in the source array.

One Dimensional ndarray - slice and select items

Each index is a scalar (e.g. arr[5] = 12)

arr = np.arange(10)  #array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

arr[5]  #select single element
#5

arr[5:8]  #select subset
#array([5, 6, 7])

arr[5:8] = 12  # update values in subset with scalar (i.e. broadcasted)
#array([0, 1, 2, 3, 4, 12, 12, 12, 8, 9])

Two Dimensional ndarray - select individual items

Each index is a one-dimensional array (e.g. arr[5] = [7, 8, 9]) instead of a scalar (e.g. arr[5] = 12).

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

arr[2]  #select single element
#array([7, 8, 9])

arr[0][2]  #select individual items
#3
arr[0, 2]  #select individual items
#3

Two Dimensional ndarray - slice items

Slicing lets you keep specific rows and columns. First argument is rows, second argument is columns. A : means to select all, :x anything before x, x: anything after x. What you select is inclusive and starts at 1.

Slice by row(s) only

arr[:2,]  #slice by row, keep rows 1 to 2 (inclusive)
#array([[1, 2, 3],
#       [4, 5, 6]])
    
arr[2:3,]  #slice by row, keep rows 2 to 3 (inclusive)
#array([[7, 8, 9]])

Slice by column(s) only

arr[:,:2]  #slice by column, keep cols 1 to 2 (inclusive)
#array([[1, 2],
#       [4, 5],
#       [7, 8]])

Slice by row(s) and column(s)

arr[:2, :2]
#array([[1, 2],
#       [4, 5]])

Multidimensional ndarray

Higher dimensional objects give you more options as you can slice one or more axes and also mix integers.

arr = np.array([[[1, 2, 3],
                 [4, 5, 6]],
                [[7, 8, 9],
                 [10, 11, 12]]])    
arr.shape  #(2, 2, 3)

arr[0]
#array([[1, 2, 3],
#       [4, 5, 6]])

Advanced Indexing

Advanced array indexing happens when the selection object is a non-tuple sequence object, an ndarray of data type integer or bool, or a tuple with at least one sequence object or ndarray (of data type integer or bool). There are two types of advanced indexing, integer and Boolean; these always return a copy of the data (instead of a view).

Integer indexing

Integer array indexing allows selection of items based on their n-dimensional index. This is always run as a broadcast (i.e. happens all at once with no iteration).

#Basic indexing
x = np.array([[1, 2], [3, 4], [5, 6]])
x[[0, 1, 2], [0, 1, 0]]  # for each row, select 
#array([1, 4, 5])

Boolean indexing

Boolean array indexing allows selection of items based on whether it meets a specific condition. This is always run as a broadcast (i.e. happens all at once with no iteration).

#strings
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe'])
names == 'Will'
#array([False, False,  True, False,  True, False], dtype=bool)
temp = (names == 'Will') | (names == 'Bob')
#array([ True, False,  True,  True,  True, False], dtype=bool)

#numbers
nums = np.random.rand(3, 4)

Fancy indexing

fancy indexing describes indexing with integer arrays

arr = np.empty((6, 4))  # initialize shape
for i in range(6):
    arr[i] = i    # set values
#array([[0., 0., 0., 0.],
#       [1., 1., 1., 1.],
#       [2., 2., 2., 2.],
#       [3., 3., 3., 3.],
#       [4., 4., 4., 4.],
#       [5., 5., 5., 5.]])

arr[[2, 5, 4]]  # select a subset of your rows in this order
#array([[ 2., 2., 2., 2.],
#       [ 5., 5., 5., 5.],
#       [ 4., 4., 4., 4.]])

Descriptive Statistical Methods

NumPy provides statistical methods, regardless of whether they are applying to a np.ndarray (i.e. a n-dimensional array) or a np.matrix (i.e. a 2-dimensional array).

sum calculates the sum on all the elements in the array or along an axis
mean calculates the arithmetic mean
std is the standard deviation
var is the variance with an optional degrees of freedom
min is the minimum
max is the maximum
cumsum is the cumulative sum of elements starting from 0
cumprod is the cumulative product of elements starting from 1

Aggregation and Sum

Aggregations really just reduce a lot of data (e.g. an entire array) into a smaller chunk of data (e.g. a value like the mean or sum). You can do aggregations on specific axis (0 for row, 1 for column) or on the entire array.

sum

Calculates the sum along an axis (0 = sum up each column, 1 = sum up each row)

temp = np.array([[1., 2., 3., 4.], [5., 6., 7., 8.]])
np.sum(temp, axis=0)  #array([6., 8., 10., 12.])  #add up columns
np.sum(temp, axis=1)  #array([10., 26.])  #add up rows
np.sum(temp)  # 36.0  #sum of the entire array

mean

Calculates the mean along an axis (0 = sum up each column, 1 = sum up each row)

np.mean(temp, axis=0)  #array([3., 4., 5., 6.])  #add up columns
np.mean(temp, axis=1)  #array([2.5, 6.5])  #add up rows
np.mean(temp)  #4.5  #mean of the entire array

min, max

Calculates the min or max along an axis (0 = sum up each column, 1 = sum up each row)

np.min(temp)  # 1.0
np.max(temp)  #8.0

std

Calculates the standard deviation (SD), a measure used to quantify the amount of variation or dispersion of a set of values.

np.std(temp)  #2.29128

Sorting, Unique, Set Logic

np.sort

You can sort data a number of ways (quicksort, mergesort, heapsort). This operation applies inplace.

temp = np.random.randn(8)  # sort a flattened array
temp.sort()  # sorts inplace
temp  #array([-0.211,  0.298,  0.749,  0.897,  0.949])  # sorted

temp = np.array([[6, 5, 4,], [3, 2, 1]])  # sort in ascending
temp.sort()
temp
#array([[4, 5, 6],
#       [1, 2, 3]])

np.unique

Find the unique elements in an array. Can return additional indormation like where the index is, how often the occurrence.

np.unique([1, 2, 1, 2, 3, 7, 7, 7])  # 1 dimension
#array([1, 2, 3, 7])

# 2 dimensional, also count occurrence
temp = np.array([['a', 'a', 'b'], ['b', 'c', 'd']])
x, y = np.unique(temp, return_counts=True)  # 2 dimensional
#array(['a', 'b', 'c', 'd'], dtype='|S1'), array([2, 2, 1, 1]))

np.intersect1d(arg1, arg2)

intersect1d is the sorted, unique values in both input arrays.

np.intersect1d([1, 2, 3, 4], [3, 2, 5, 6])
#array([2, 3])

np.union(arg1, arg2)

union returns the unique, sorted array of values that are in either of the two input arrays.

np.union1d([-1, 0, 1], [-2, 0, 2])
array([-2, -1,  0,  1,  2])

Linear Algebra

dot returns the dot product of two vectors
trace returns the sum along diagonals of the array
det computes the determinant of an array
eig computes the eigenvalue
inv computes the multiplicative inverse of a matrix
pinv computes the Moore-Penrose pseudoinverse of a matrix
qr computes the qr factorization of a matrix
svd singular value decomposition
solve solves a linear matrix equation or system of linear scalar equations
lstsq returns the least-squares solution to a linear matrix equation

dot

dot returns the product of two vectors

x = np.array([1, 2, 3, 4])
y = np.array([5, 6, 7, 8])

np.dot(x, y)  #70  #dot product of two arrays

solve

solve returns the solution to a linear matrix equation (i.e. 2 different equations)

a = np.array([[3,1], [1,2]])
b = np.array([9,8])
np.linalg.solve(a, b)

Matrix Operations

A np.matrix is just a 2-dimensional np.ndarray. Here are some additional functions for them that can help with matrix operations.

eye, identity

Create a square N * N identity matrix (1s on the diagonal and 0s elsewhere)

np.eye(2, dtype=int)
#array([[1, 0],
#       [0, 1]])

Universal Functions

A universal function (aka ufunc) is a function that performs elementwise operations on data in ndarrays. They’re fast vectorized wrappers for simple functions. We have Unary Universal Functions and Binary Universal Functions.

Unary Universal Functions

abs computes the absolute value element-wise for that value
sqrt computes the square root of each element
square computes the square of each element
log creates the natural logarithm (can be different bases)
ceil computes the ceiling of each element (i.e. smallest int of element)
floor computes the floor each element (i.e. largest int of element)
isnan returns a boolean array indicating whether value is NaN
isinfinite, isinf returns boolean array if finite or infinite.
cos, sin, tan applies the trigonometric functions

Binary Universal Functions

add adds corresponding elements in arrays
subtract subtracts elements in second array from first array
multiply multiplies array elements
divide, floor_divide does division or floor division (truncate remainder)
power raises first array to powers indicated in second array

Array Iteration

Sometimes you just have to iterate over an array; you can do this with nditer

np.nditer

Most basic version is to iterate over every element of an array

a = np.arange(6).reshape(2,3)
for x in np.nditer(a):
    print x,
0 1 2 3 4 5