Table of Contents
NumPy takes Python's sequence data type (things like a list, which is flexible and lets you put in any data type) and takes it a step more specific by making it much more efficient at the cost of making data homogeneous (only the same dtype
, like float64 or string). The idea is that with this new array type (ndarray
), you can broadcast (work with different array sizes) and apply vectorized operation (i.e. no for-loops) to the entire array at the same time instead of element by element (i.e. like a scalar). This makes it really quick and there are a bunch of useful statistical methods. Because NumPy is optimized for speed, the operations (e.g. modifying sliced objects, sorting) is done inplace instead of making copies. If you want copies, you need to explicitly state it.
NumPy stands for Numerical Python and is a fundamental package for scientific computing in Python. With NumPy, we get array-oriented computing, which includes:
ndarray
, multidimensional (n) array, whose advantages are:
ndarray
is a N-dimensional array object, which is a collection of homogeneous data (i.e. all items must be the same type). Items can be indexed and each item takes up the same size block of memory. Note that this is not Python’s standard library of array.array
, which is only one-dimensional and more limited functionality.
The easiest way to create an ndarray
is to use the array and passing in a sequence object (e.g. a list, array).
import numpy as np
#create ndarray by passing in a sequence
data = [6, 7.5, 8, 0, 1]
my_array = np.array(data) #array([ 6., 7.5, 8., 0., 1. ])
#create ndarray by initializing values
one_dim = np.array([2, 3, 4], dtype='int16') # can specify type
two_dim = np.array([[1.5, 2., 3.], # nested sequences make multidimensions
[4., 5., 6.]])
three_dim = np.array([[[1., 2., 3.],
[4., 5., 6.]],
[[7., 8., 9.],
[10, 11, 12]]])
You can create an ndarray specifically with zeros (initial values of 0), ones (initial values of 1), empty (no initial values). For a single dimension, enter in a number. For multidimensional, enter a tuple.
temp = np.zeros(10) # create a single dimension
temp #array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
temp = np.zeros((2, 4)) # create multidimensional with tuple
#array([[0., 0., 0., 0.],
# [0., 0., 0., 0.]])
temp = np.ones((1, 2, 3)) # create multidimensional with tuple
#array([[[1., 1., 1.],
# [1., 1., 1.]]])
arange returns the number of samples. arange
is an array-valued version of Python’s range
function and unless specified, is a float64
data type.
np.arange(7)
#array([0, 1, 2, 3, 4, 5, 6])
linspace is like arange
, but it uses a step size instead of explicitly saying the number of samples. linspace(start, stop, num=50)
returns evenly spaced numbers over a specified interval num
. If endpoint=True
then stop
is the last sample (otherwise not included)
np.linspace(0, 5, num=6, endpoint=True)
#array([1, 1.8, 2.6, 3.4, 4.2, 5.])
You can see and modify the shape of your ndarray using a variety of tools, including: shape, ndim, reshape, ravel, t
Each ndarray has a shape, a tuple indicating the size of each dimension. The first argument is the rows, second argument is the number of columns.
#shape
print one_dim.shape #(3,)
print two_dim.shape #(2, 3)
print three_dim.shape #(2, 2, 3)
The ndarrays are n-dimensional, you can find the n-dimensions with ndim.
#ndim
one_dim.ndim #1
two_dim.ndim #2
three_dim.ndim #3
If you want to change the shape, then do a reshape.
#reshape
np.reshape(one_dim, (1, 3))
#array([[2, 3, 4]], dtype=int16)
If you want to flatten the array, do a ravel.
#ravel
np.ravel(two_dim)
#array([1.5, 2., 3., 4., 5., 6.])
If you want to transpose, use t.
#Transpose (T)
temp = np.array([[1., 2., 3.], [4., 5., 6.]])
temp.T
#array([[1., 4.],
# [2., 5.],
# [3., 6.]])
Each ndarray can only have one data type (i.e. homogeneous). This is stored in dtype. By default, numpy does a good job figuring it out. There is a type name (like float
or int
) with the number of bits per element.
one_dim.dtype #dtype('int16')
two_dim.dtype #dtype('float64')
three_dim.dtype #dtype('float64')
You can explicitly convert or cast an array from one dtype to another using ndarray’s astype
method. If conversion fails, a TypeError
will be raised. Using astype
always creates a new array (a copy of the data).
default_array = np.array([1, 2, 3, 4, 5])
default_array.dtype # dtype('int64')
float_array = default_array.astype(np.float64)
float_array.dtype #dtype('float64')
When doing complex calculations, you might get a floating point error because float64
and float32
arrays are only capable of approximating fractional quantities (so comparisons are only valid up to a certain number of decimal places)
random returns random numbers in a given shape, specifies different seed values, shuffles, types, ranges, and distributions. There are a lot of other random functions (e.g. for multinomial, multivariate_normal), these are just a few:
random.seed
sets a specific seed for the rngpermutation
creates a random sequenceshuffle
modifies a sequence in place by shuffling contentsrand
creates array of given shape with random uniform distribution (0 to 1)randint
returns random integers from low (inclusive) to high (exclusive)randn
return a sample from the standard normal distribution (0 to 1)binomial
return sample(s) from a binomial distribution (0 to 1); like a coin toss resultsnormal
return sample(s) from a normal (Gaussian) distribution; the ‘bell curve’logistic
returns sample(s) from a logistic distributionchisquare
returns sample(s) from a chisquare distributiongamma
returns sample(s) from a Gamma distributionbeta
returns sample(s) of the Beta distribution, a special case of the Dirichlet distributionuniform
returns sample(s) from a uniform distributionseed saves a RandomState in the random number generator so the results are reproducable. You just set it at the beginning before calling your random numbers.
np.random.seed(123)
permutation creates a random sequence
np.random.permutation(10)
array([1, 7, 3, 0, 9, 2, 5, 8, 6])
shuffle just shuffles the items
arr = np.arange(10)
np.random.shuffle(10) #[1, 7, 5, 2, 9, 4, 3, 6, 0, 8]
rand creates an array of a given shape (first arg is rows, second arg is columns) with a random uniform distribution (values 0 to 1)
np.random.rand(3, 4)
#array([[.004, .544, .077, .701],
# [.660, .712, .748, .948],
# [.979, .854, .973, .631]])
randint returns a random integer based on first argument (lowest value), second argument (highest value, exclusive), and an optional size or shape of array.
np.random.randint(low=0, high=10, size=2) # create 2 random ints 0-10
#array([5, 7])
np.random.randint(0, 100, (3, 4)) # create array (3, 4) with random ints 0-100
#array([[10, 92, 15, 72],
# [ 7, 15, 44, 12],
# [52, 69, 85, 74]])
randn returns sample(s) from a standard normal distribution (0-1)
np.random.randn() # get a single random value
#0.3706716413438257
np.random.randn(2, 4) + 1 # get an ndarray of random values
#array([[ 2.94723669, 2.52429319, 3.50433846, 1.16161333],
# [ 0.81147123, 1.60399355, -0.12758733, -0.13248681]])
binomial returns sample(s) from a binomial distribution with n
trials and p
probability of success where n
>=0 and p
is in the interval [0, 1].
results = np.random.binomial(n=10, p=.5, size=1000)
results.shape(1000,) # Results of flipping a coin 10 times, tested 1000 times
normal returns sample(s) from a normal distribution with first arg as the mean, second arg as the standard deviation.
mu, sigma = 0, 0.1 # mean and standard deviation
np.random.normal(mu, sigma, 1000) # get 1000 normal distribution samples
uniform returns sample(s) from a uniform distribution. First argument is low (default 0, inclusive), second argument is high (default 1, exclusive), and size.
np.random.uniform(-1, 0, size=1000)
When working with arrays, the data is not copied because NumPy is made for fast performance and some datasets can get very large. To make a copy, you need to explicitly state .copy()
test = arr[2].copy()
test = arr[5:8].copy()
There are a variety of ways to select a subset of your data or individual elements. Slicing is selecting a subset of your data. The axis 0 is y (vertical, the row) and axis 1 is x (horizontal, the column). Indexing is accessing the values of your data. Array slicing are views on the original array, meaning the data is not copied and any modifications to the view will be reflected in the source array.
Each index is a scalar (e.g. arr[5] = 12
)
arr = np.arange(10) #array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
arr[5] #select single element
#5
arr[5:8] #select subset
#array([5, 6, 7])
arr[5:8] = 12 # update values in subset with scalar (i.e. broadcasted)
#array([0, 1, 2, 3, 4, 12, 12, 12, 8, 9])
Each index is a one-dimensional array (e.g. arr[5] = [7, 8, 9]
) instead of a scalar (e.g. arr[5] = 12
).
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr[2] #select single element
#array([7, 8, 9])
arr[0][2] #select individual items
#3
arr[0, 2] #select individual items
#3
Slicing lets you keep specific rows and columns. First argument is rows, second argument is columns. A :
means to select all, :x
anything before x, x:
anything after x. What you select is inclusive and starts at 1.
arr[:2,] #slice by row, keep rows 1 to 2 (inclusive)
#array([[1, 2, 3],
# [4, 5, 6]])
arr[2:3,] #slice by row, keep rows 2 to 3 (inclusive)
#array([[7, 8, 9]])
arr[:,:2] #slice by column, keep cols 1 to 2 (inclusive)
#array([[1, 2],
# [4, 5],
# [7, 8]])
arr[:2, :2]
#array([[1, 2],
# [4, 5]])
Higher dimensional objects give you more options as you can slice one or more axes and also mix integers.
arr = np.array([[[1, 2, 3],
[4, 5, 6]],
[[7, 8, 9],
[10, 11, 12]]])
arr.shape #(2, 2, 3)
arr[0]
#array([[1, 2, 3],
# [4, 5, 6]])
Advanced array indexing happens when the selection object is a non-tuple sequence object, an ndarray
of data type integer
or bool
, or a tuple with at least one sequence object or ndarray (of data type integer or bool). There are two types of advanced indexing, integer and Boolean; these always return a copy of the data (instead of a view).
Integer array indexing allows selection of items based on their n-dimensional index. This is always run as a broadcast (i.e. happens all at once with no iteration).
#Basic indexing
x = np.array([[1, 2], [3, 4], [5, 6]])
x[[0, 1, 2], [0, 1, 0]] # for each row, select
#array([1, 4, 5])
Boolean array indexing allows selection of items based on whether it meets a specific condition. This is always run as a broadcast (i.e. happens all at once with no iteration).
#strings
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe'])
names == 'Will'
#array([False, False, True, False, True, False], dtype=bool)
temp = (names == 'Will') | (names == 'Bob')
#array([ True, False, True, True, True, False], dtype=bool)
#numbers
nums = np.random.rand(3, 4)
fancy indexing describes indexing with integer arrays
arr = np.empty((6, 4)) # initialize shape
for i in range(6):
arr[i] = i # set values
#array([[0., 0., 0., 0.],
# [1., 1., 1., 1.],
# [2., 2., 2., 2.],
# [3., 3., 3., 3.],
# [4., 4., 4., 4.],
# [5., 5., 5., 5.]])
arr[[2, 5, 4]] # select a subset of your rows in this order
#array([[ 2., 2., 2., 2.],
# [ 5., 5., 5., 5.],
# [ 4., 4., 4., 4.]])
NumPy provides statistical methods, regardless of whether they are applying to a np.ndarray
(i.e. a n-dimensional array) or a np.matrix
(i.e. a 2-dimensional array).
sum
calculates the sum on all the elements in the array or along an axismean
calculates the arithmetic meanstd
is the standard deviationvar
is the variance with an optional degrees of freedommin
is the minimummax
is the maximumcumsum
is the cumulative sum of elements starting from 0cumprod
is the cumulative product of elements starting from 1Aggregations really just reduce a lot of data (e.g. an entire array) into a smaller chunk of data (e.g. a value like the mean or sum). You can do aggregations on specific axis (0 for row, 1 for column) or on the entire array.
Calculates the sum along an axis (0 = sum up each column, 1 = sum up each row)
temp = np.array([[1., 2., 3., 4.], [5., 6., 7., 8.]])
np.sum(temp, axis=0) #array([6., 8., 10., 12.]) #add up columns
np.sum(temp, axis=1) #array([10., 26.]) #add up rows
np.sum(temp) # 36.0 #sum of the entire array
Calculates the mean along an axis (0 = sum up each column, 1 = sum up each row)
np.mean(temp, axis=0) #array([3., 4., 5., 6.]) #add up columns
np.mean(temp, axis=1) #array([2.5, 6.5]) #add up rows
np.mean(temp) #4.5 #mean of the entire array
Calculates the min or max along an axis (0 = sum up each column, 1 = sum up each row)
np.min(temp) # 1.0
np.max(temp) #8.0
Calculates the standard deviation (SD), a measure used to quantify the amount of variation or dispersion of a set of values.
np.std(temp) #2.29128
You can sort data a number of ways (quicksort
, mergesort
, heapsort
). This operation applies inplace.
temp = np.random.randn(8) # sort a flattened array
temp.sort() # sorts inplace
temp #array([-0.211, 0.298, 0.749, 0.897, 0.949]) # sorted
temp = np.array([[6, 5, 4,], [3, 2, 1]]) # sort in ascending
temp.sort()
temp
#array([[4, 5, 6],
# [1, 2, 3]])
Find the unique elements in an array. Can return additional indormation like where the index is, how often the occurrence.
np.unique([1, 2, 1, 2, 3, 7, 7, 7]) # 1 dimension
#array([1, 2, 3, 7])
# 2 dimensional, also count occurrence
temp = np.array([['a', 'a', 'b'], ['b', 'c', 'd']])
x, y = np.unique(temp, return_counts=True) # 2 dimensional
#array(['a', 'b', 'c', 'd'], dtype='|S1'), array([2, 2, 1, 1]))
intersect1d
is the sorted, unique values in both input arrays.
np.intersect1d([1, 2, 3, 4], [3, 2, 5, 6])
#array([2, 3])
union
returns the unique, sorted array of values that are in either of the two input arrays.
np.union1d([-1, 0, 1], [-2, 0, 2])
array([-2, -1, 0, 1, 2])
dot
returns the dot product of two vectorstrace
returns the sum along diagonals of the arraydet
computes the determinant of an arrayeig
computes the eigenvalueinv
computes the multiplicative inverse of a matrixpinv
computes the Moore-Penrose pseudoinverse of a matrixqr
computes the qr factorization of a matrixsvd
singular value decompositionsolve
solves a linear matrix equation or system of linear scalar equationslstsq
returns the least-squares solution to a linear matrix equationdot
returns the product of two vectors
x = np.array([1, 2, 3, 4])
y = np.array([5, 6, 7, 8])
np.dot(x, y) #70 #dot product of two arrays
solve
returns the solution to a linear matrix equation (i.e. 2 different equations)
a = np.array([[3,1], [1,2]])
b = np.array([9,8])
np.linalg.solve(a, b)
A np.matrix
is just a 2-dimensional np.ndarray
. Here are some additional functions for them that can help with matrix operations.
Create a square N * N identity matrix (1s on the diagonal and 0s elsewhere)
np.eye(2, dtype=int)
#array([[1, 0],
# [0, 1]])
A universal function (aka ufunc) is a function that performs elementwise operations on data in ndarrays. They’re fast vectorized wrappers for simple functions. We have Unary Universal Functions and Binary Universal Functions.
abs
computes the absolute value element-wise for that valuesqrt
computes the square root of each elementsquare
computes the square of each elementlog
creates the natural logarithm (can be different bases)ceil
computes the ceiling of each element (i.e. smallest int of element)floor
computes the floor each element (i.e. largest int of element)isnan
returns a boolean array indicating whether value is NaN
isinfinite
, isinf
returns boolean array if finite or infinite.cos
, sin
, tan
applies the trigonometric functionsadd
adds corresponding elements in arrayssubtract
subtracts elements in second array from first arraymultiply
multiplies array elementsdivide
, floor_divide
does division or floor division (truncate remainder)power
raises first array to powers indicated in second arraySometimes you just have to iterate over an array; you can do this with nditer
Most basic version is to iterate over every element of an array
a = np.arange(6).reshape(2,3)
for x in np.nditer(a):
print x,
0 1 2 3 4 5