Pandas is built on top of NumPy and takes the ndarray a step further into high-level data structures: the Series and DataFrame objects. These data objects carry metadata such as column and row labels through an index (which itself has an index.name). There are also a lot of helper functions for loading, selecting, and chunking data. If you are using something like SQL for anything beyond a simple query, or working with a large dataset, it's time to switch to pandas.
Pandas is similar to R and follows the same split-apply-combine strategy through the groupby method. Melting is done through the melt method.
Import the pandas library (NumPy is imported as well since it is used throughout the examples):
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
axis=0: apply a method down each column.
axis=1: apply a method across each row.
So df.mean(axis=1) means take the mean of the entries horizontally across the columns (i.e. along each individual row).
![Image of Axis](https://williamqliu.github.com/images/pd_axis.jpg)
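For example (a minimal sketch with made-up values):
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df.mean(axis=0) # down each column: a -> 2.0, b -> 5.0
df.mean(axis=1) # across each row: 0 -> 2.5, 1 -> 3.5, 2 -> 4.5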
A Series is a one-dimensional array-like object containing any NumPy data type as values
as well as data labels called the index
. If no index is specified, a default one is created using integers that span 0 through N-1 (where N is the length of data).
from pandas import Series
obj = Series([4, 7, -5, 3]) #Simplest Series is an array
#If no index specified, default index is created
obj
#0 4
#1 7
#2 -5
#3 3
#dtype: int64
You can select single values or a set of values by specifying the index within []
.
obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c']) # Series with a labeled index
obj2['a'] # Get value corresponding to obj2's index 'a'
#-5
obj2[['c', 'a', 'd']] # Get values to this list of indexes
#c 3
#a -5
#d 4
#dtype: int64
You can think of a Series as a fixed-length, ordered dictionary since it maps the index
to values
. You can even pass in a Python dict to create a Series.
'b' in obj2 # True
'e' in obj2 # False
# You can pass in a Python dict to create a Series
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = Series(sdata)
obj3
#Ohio 35000
#Oregon 16000
#Texas 71000
#Utah 5000
#dtype: int64
# You can specify an index; if no matching value, then value is 'NaN'
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index=states)
obj4
#California NaN # No value found for this index
#Ohio 35000
#Oregon 16000
#Texas 71000
A DataFrame is a tabular, spreadsheet-like data structure (like Excel) and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). There is both a row and column index
.
# Create a dict of equal-length lists or NumPy arrays
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = pd.DataFrame(data)
df
# pop state year
#0 1.5 Ohio 2000
#1 1.7 Ohio 2001
#2 3.6 Ohio 2002
#3 2.4 Nevada 2001
#4 2.9 Nevada 2002
# Can specify column names, order, and index names
df2 = pd.DataFrame(data,
columns=['year', 'state', 'pop', 'debt'],
index=['one', 'two', 'three', 'four', 'five'])
df2
# year state pop debt
#one 2000 Ohio 1.5 NaN
#two 2001 Ohio 1.7 NaN
#three 2002 Ohio 3.6 NaN
#four 2001 Nevada 2.4 NaN
#five 2002 Nevada 2.9 NaN
You can see what columns are available using columns
.
df2.columns # List all columns
#Index([u'year', u'state', u'pop', u'debt'], dtype='object')
If you want to select a column, this will return a Series
; you can do this by calling the column index as a dictionary (i.e. []
) or by calling the column index as an attribute (i.e. .
).
df2['state'] # get 'state' column using dict notation, returns Series
#one Ohio
#two Ohio
#three Ohio
#four Nevada
#five Nevada
#Name: state, dtype: object
df2.state # get 'state' column using attribute notation, same as above
With ix, you can select a row by either its position (e.g. df2.ix[2]) or its label (e.g. df2.ix['three']). Since pandas is built on top of NumPy, these selections are generally views (i.e. they refer to the actual underlying values); if you want an independent copy, use copy().
df2.ix['three'] # Get row with index 'three'
#year 2002
#state Ohio
#pop 3.6
#debt NaN
#Name: three, dtype: object
df2.ix['three']['state'] # Get value of index 'three', column 'state'
#'Ohio'
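For example, if you want to work on a selection without touching df2, take an explicit copy (a quick sketch):
row = df2.ix['three'].copy() # independent copy of the row
row['state'] = 'California' # df2 itself is unchanged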
You can select and filter for specific rows and columns.
# Filter on specific column value only
df.ix[df.state == 'Ohio']
# pop state year
#0 1.5 Ohio 2000
#1 1.7 Ohio 2001
#2 3.6 Ohio 2002
# Filter by chaining (e.g. multiple columns)
df.ix[(df.state == 'Ohio') & (df.year.isin([2001, 2002]))]
# pop state year
#1 1.7 Ohio 2001
#2 3.6 Ohio 2002
# Filter on specific row
df2.ix['three']
#year 2002
#state Ohio
#pop 3.6
#debt NaN
You can also query the columns of a dataframe with a boolean expression.
df.query('year > 2001') # query takes a string expression
# pop state year
#2 3.6 Ohio 2002
#4 2.9 Nevada 2002
You can use loc to select rows based on label names (including boolean masks built from column values).
df = DataFrame({'sepallength': [1, 2, 3, 4, 5, 6, 7],
'sepalwidth': [7, 6, 5, 4, 3, 2, 1],
'class': ['Iris-setosa', 'Iris-versicolor', 'Iris-setosa', 'Iris', 'Iris', 'Iris', 'Iris']})
df.loc[df['class'] == 'Iris-setosa'] # Filter on single column value
df.loc[df['class'].isin(['Iris-setosa', 'Iris'])] # Filter multiple col values
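loc also accepts a column selection as a second argument, so you can filter rows and pick columns in one step (a small sketch using the same df):
df.loc[df['class'] == 'Iris-setosa', ['sepallength', 'sepalwidth']] # rows 0 and 2, two columns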
You can also select rows or columns by integer-based indexing using the iloc method or by using []; either way takes slice arguments for the start, stop, and optionally a step.
#Select using Rows
df2.iloc[2:4]
# year state pop debt
#three 2002 Ohio 3.6 NaN
#four 2001 Nevada 2.4 NaN
df2[2:4] #same as above, but prefer to use iloc so we are explicit
#Select using Columns
df2.iloc[:, 1:3]
# state pop
#one Ohio 1.5
#two Ohio 1.7
#three Ohio 3.6
#four Nevada 2.4
#five Nevada 2.9
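The optional step works just like a normal Python slice; for example (a quick sketch on df2):
df2.iloc[::2, :2] # every other row, first two columns
# year state
#one 2000 Ohio
#three 2002 Ohio
#five 2002 Nevada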
You can assign values similar to how you select items. The value's length must match the length of the DataFrame.
df2.ix['three', 'state'] = 'California' # Assign a single cell by row and col
df2['debt'] = 16.5 # Assign entire col the same value
df2['debt'] = [10, 9, 8, 7, 6] # Assign entire col different values
df2['debt'] = np.arange(5) # Assign col values from an array
# Assign values with a Series; uses index to find where to put values
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
df2['debt'] = val
You can create new columns simply by assigning values (e.g. np.nan as a placeholder). You can delete columns with the del keyword.
df2.columns # See columns
#Index([u'year', u'state', u'pop', u'debt'], dtype='object')
df2['eastern'] = np.nan # Create a new (empty) column
df2.columns
#Index([u'year', u'state', u'pop', u'debt', u'eastern'], dtype='object')
del df2['eastern']
df2.columns
#Index([u'year', u'state', u'pop', u'debt'], dtype='object')
You can transpose a DataFrame using T
.
df2
# year state pop debt
#one 2000 Ohio 1.5 NaN
#two 2001 Ohio 1.7 -1.2
#three 2002 Ohio 3.6 NaN
#four 2001 Nevada 2.4 -1.5
#five 2002 Nevada 2.9 -1.7
df2.T
# one two three four five
#year 2000 2001 2002 2001 2002
#state Ohio Ohio Ohio Nevada Nevada
#pop 1.5 1.7 3.6 2.4 2.9
#debt NaN -1.2 NaN -1.5 -1.7
The advantage of a DataFrame is the vectorized operations, but sometimes you just have to iterate through pieces of a DataFrame. You can do this with iterrows():
for index, row in df2.iterrows():
print index, row['year'], row['state']
# one 2000 Ohio
# two 2001 Ohio
# three 2002 Ohio
# four 2001 Nevada
# five 2002 Nevada
Series and DataFrame have an index and corresponding values. The index appears on the left (accessible through index) and the values are on the right (accessible through values). There are also corresponding names (accessible through the name attribute).
Indexes are immutable (can't be modified); the main thing to know is that later we will go into different types of indexes, including DatetimeIndex
, which gives more functions for working with datetimes.
obj = Series([4, 7, -5, 3]) # Series without specifying index
obj.index #Int64Index([0, 1, 2, 3], dtype='int64')
#Series with a specified index
obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2
#d 4
#b 7
#a -5
#c 3
#dtype: int64
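For example, trying to modify an index element in place raises an error (a quick illustration):
obj2.index[0] = 'z' # raises TypeError: Index does not support mutable operations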
Some useful index methods and properties are:
append: concatenates with additional Index objects, making a new Index
diff: gets the difference
intersection: gets the intersection
union: gets the union
isin: creates a boolean array indicating if each value is in the passed collection
delete: removes the element at this index
drop: deletes the passed values
insert: adds a new element at this index
is_unique: returns True if the Index has no duplicate values
unique: computes the array of unique values in the Index

Series, DataFrame, and Index objects have a name attribute. This allows you to get and set index names more easily.
obj2.name = 'Inches of Rain' # name for Series
obj2.index.name = 'state' # name for index
obj2
#state
#d 4
#b 7
#a -5
#c 3
#Name: Inches of Rain, dtype: int64
You can check if an index is unique.
df.index.is_unique #True
s1 = Series([10, 20, 30], index=[1, 2, 3])
s2 = Series([40, 50, 60], index=[1, 2, 4])
temp = s1.index.intersection(s2.index)
temp
#Int64Index([1, 2], dtype='int64')
You can drop entries from an axis using the drop method.
obj = Series([4.5, 7.2, -5.3, 3.6], index= ['d', 'b', 'a', 'c'])
new_obj = obj.drop(['d', 'a'])
new_obj
#b 7.2
#c 3.6
#dtype: float64
Reindex creates a new object with the data conformed to the new index, even if some index values were not present before. For missing entries you can specify a scalar fill_value (e.g. 0); otherwise the default is NaN. You can also specify a method to fill (e.g. ffill to fill values forward). To reindex columns, go see the ix method.
obj = Series([4.5, 7.2, -5.3, 3.6], index= ['d', 'b', 'a', 'c'])
obj
#d 4.5
#b 7.2
#a -5.3
#c 3.6
#dtype: float64
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)
obj2
#a -5.3
#b 7.2
#c 3.6
#d 4.5
#e 0
#dtype: float64
obj3 = Series(['blue', 'green', 'yellow'], index=[0, 2, 4])
obj3.reindex(range(6), method='ffill') # returns a new Series; obj3 itself is unchanged
#0 blue
#1 blue
#2 green
#3 green
#4 yellow
#5 yellow
#dtype: object
Pandas uses many of the same dtypes as NumPy, but there are additional types like dtype='category'. You can also forcefully assign a type with the astype method.
s = Series(['a', 'b', 'c', 'd', 'a', 'b'], dtype='category')
s.dtypes # category
df = DataFrame({'A': ['a', 'b', 'c', 'a'],
'B': ['z', 'z', 'y', 'd']})
df.dtypes
#A object
#B object
df['C'] = df['A'].astype('category')
df.dtypes
#A object
#B object
#C category
Values are returned as an ndarray (one-dimensional for a Series, two-dimensional for a DataFrame) and the dtype will be automatically chosen to match the objects.
obj = Series([4, 7, -5, 3]) #Simplest Series is an array
obj.values #array([ 4, 7, -5, 3])
A NaN
or NA
is returned if a value cannot be found (i.e. a missing value). You can search your data for these missing values using isnull
and notnull
functions.
pd.isnull(obj4)
#California True
#Ohio False
#Oregon False
#Texas False
#dtype: bool
pd.notnull(obj4)
#California False
#Ohio True
#Oregon True
#Texas True
#dtype: bool
You can replace a value with another value. You can use replace() with a lot of variations including replacing a single value with a single value, a list with a single value, a list with a list, and a dict with a dict.
mydata = pd.Series([1., -999., 2., -999., -1000., 3.])
print mydata
#0 1
#1 -999
#2 2
#3 -999
#4 -1000
#5 3
#dtype: float64
# Replace value of x with value of y
mydata.replace(-999, np.nan)
#0 1
#1 NaN
#2 2
#3 NaN
#4 -1000
#5 3
# Replace list x with value of y
mydata.replace([-999, -1000], np.nan)
#0 1
#1 NaN
#2 2
#3 NaN
#4 NaN
#5 3
#dtype: float64
# Replace list of x with list of y
mydata.replace([-999, -1000], [np.nan, 0])
#0 1
#1 NaN
#2 2
#3 NaN
#4 0
#5 3
#dtype: float64
# Replace a dict of x with a dict of y
mydata.replace({-999:np.NaN, -1000:0})
#0 1
#1 NaN
#2 2
#3 NaN
#4 0
#5 3
#dtype: float64
To check if a DataFrame has duplicated values, you can use duplicated().
df = pd.DataFrame({
'k1': ['one'] * 3 + ['two'] * 4,
'k2': [1, 1, 2, 3, 3, 4, 4]
})
df
# k1 k2
#0 one 1
#1 one 1
#2 one 2
#3 two 3
#4 two 3
#5 two 4
#6 two 4
# Return a Series of whether the row is duplicated
df.duplicated()
#0 False
#1 True
#2 False
#3 False
#4 True
#5 False
#6 True
#dtype: bool
# Return a DataFrame of unique rows
df.drop_duplicates()
# k1 k2
#0 one 1
#2 one 2
#3 two 3
#5 two 4
# You can specify what columns are used to determine if row is unique
df.drop_duplicates(['k1'])
# k1 k2
#0 one 1
#3 two 3
Pandas really starts to show how powerful it is when working with DateTimes.
convert_objects and to_datetime
First you want to make sure that the data type is correct (i.e. it is actually a datetime). You can convert dataframe objects a couple of different ways. There is a generic convert_objects that you can also use to convert numbers. If you want a more specific time function, use to_datetime.
mydf = mydf.convert_objects(convert_dates='coerce') # also convert_numeric, convert_timedeltas
mydf['SomeDateColumn'] = pd.to_datetime(mydf['SomeDateColumn'], coerce=True)
To utilize the datetime specific methods, you need to set an index that is a DatetimeIndex
.
df.set_index(pd.DatetimeIndex(df['someDate']), inplace=True)
df # say this is: 'someDate', 'cookiesEaten', 'Notes'
#2009-02-16 23:26:40, 5, someotherstuff
#2009-02-16 23:45:43, 7, someotherstuff
#2009-02-17 22:05:31, 10, someotherstuff
You can now resample the counts of, say, 'cookiesEaten'.
df['cookiesEaten'].resample('5Min', how='sum') # default how is mean
df['cookiesEaten'].resample('D', how='sum')
#2009-02-16, 12
#2009-02-17, 10
With the first argument of resample, you can specify the offset alias. For example:
D: calendar day frequency
B: business day frequency
W: weekly frequency
M: month end frequency
MS: month start frequency
Q: quarter end frequency
QS: quarter start frequency
H: hourly frequency
T: minute frequency
S: secondly frequency
U: microsecond frequency

With the second argument of resample, you can specify how. For example:
how='mean'
how='min'
how='median'
how='first'
how='last'
how='ohlc' # open, high, low, close (mainly used in finance to see the opening price, high, low, and closing price for the day)

You can apply NumPy array operations (e.g. filtering, scalar multiplication, applying math functions) and ufuncs to Series and DataFrames.
obj2 # Setup Series
#d 4
#b 7
#a -5
#c 3
#dtype: int64
obj2[obj2 > 0] # Filtering
#d 4
#b 7
#c 3
#dtype: int64
obj2 * 2 # Scalar Multiplication
#d 8
#b 14
#a -10
#c 6
#dtype: int64
np.sqrt(obj2) # Apply ufunc
#d 2.00000
#b 2.645751
#a NaN
#c 1.73205
#dtype: float64
One of the advantages of pandas is that data is automatically aligned on the index. For example, even though obj4 has the index ‘California’ that obj3 does not have, we can still do arithmetic.
obj3
#Ohio 35000
#Oregon 16000
#Texas 71000
#Utah 5000
#dtype: int64
obj4
#California NaN
#Ohio 35000
#Oregon 16000
#Texas 71000
obj3 + obj4
#California NaN
#Ohio 70000
#Oregon 32000
#Texas 142000
#Utah NaN
#dtype: float64
You can apply arithmetic methods between Series and DataFrame objects. These methods let you pass in additional arguments (like fill_value, axis).
add: the method for addition (+)
sub: the method for subtraction (-)
div: the method for division (/)
mul: the method for multiplication (*)

Adding together two DataFrames results in NA values in the locations that do not overlap.
df1 = DataFrame(np.arange(12).reshape((3, 4)), columns=['a', 'b', 'c', 'd'])
df2 = DataFrame(np.arange(20).reshape((4,5)), columns=['a', 'b', 'c', 'd', 'e'])
df1
# a b c d
#0 0 1 2 3
#1 4 5 6 7
#2 8 9 10 11
df2
# a b c d e
#0 0 1 2 3 4
#1 5 6 7 8 9
#2 10 11 12 13 14
#3 15 16 17 18 19
df1 + df2
# a b c d e
#0 0 2 4 6 NaN
#1 9 11 13 15 NaN
#2 18 20 22 24 NaN
#3 NaN NaN NaN NaN NaN
df1.add(df2, fill_value=0) # using method allows fill_value
# a b c d e
#0 0 2 4 6 4
#1 9 11 13 15 9
#2 18 20 22 24 14
#3 15 16 17 18 19
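The axis argument controls how a Series is broadcast; for example, a small sketch subtracting each row's mean from df1:
row_means = df1.mean(axis=1)
df1.sub(row_means, axis=0) # broadcast down the rows
# a b c d
#0 -1.5 -0.5 0.5 1.5
#1 -1.5 -0.5 0.5 1.5
#2 -1.5 -0.5 0.5 1.5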
You can apply NumPy's ufuncs (element-wise array methods) on Pandas objects. For example, we apply np.abs
onto a DataFrame.
df = DataFrame(np.random.randn(4, 3), columns=['a', 'b', 'c'],
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df
# a b c
#Utah 1.167988 -0.837221 -2.029371
#Ohio -0.727655 0.319954 -1.632084
#Texas 0.752019 -0.417954 -1.139173
#Oregon 0.108305 -0.588943 0.073157
np.abs(df)
# a b c
#Utah 1.167988 0.837221 2.029371
#Ohio 0.727655 0.319954 1.632084
#Texas 0.752019 0.417954 1.139173
#Oregon 0.108305 0.588943 0.073157
Besides the ufuncs above, you can apply your own custom function to a column or row by using the DataFrame apply
method. You do not have to return a scalar value; you can also return a Series with multiple values.
my_func = lambda x: x.max() - x.min()
df.apply(my_func, axis=0) #Default; applies down each column
#a 1.895643
#b 1.157175
#c 2.102528
#dtype: float64
df.apply(my_func, axis=1) #applies across each row
#Utah 3.197359
#Ohio 1.952038
#Texas 1.891192
#Oregon 0.697248
#dtype: float64
# You do not have to return a scalar value
def my_func(x):
return Series([x.min(), x.max()], index=['min', 'max'])
df.apply(my_func)
# a b c
#min -0.727655 -0.837221 -2.029371
#max 1.167988 0.319954 0.073157
If you do not want to apply based on the axis (i.e. column or row), you can instead apply your own functions across every element. If this is a Series, then you use map
. If this is a DataFrame, then use applymap
.
#For Series, use map
my_format = lambda x: '%.2f' % x #2 decimal places
df['c'].map(my_format)
#Utah -2.03
#Ohio -1.63
#Texas -1.14
#Oregon 0.07
#Name: c, dtype: object
#For DataFrames, use applymap
my_format = lambda x: '%.2f' % x #2 decimal places
df.applymap(my_format) # returns a new DataFrame; df itself is unchanged
# a b c
#Utah 1.17 -0.84 -2.03
#Ohio -0.73 0.32 -1.63
#Texas 0.75 -0.42 -1.14
#Oregon 0.11 -0.59 0.07
You may want to sort by some criteria, either the index or the values. To sort by the index, use the sort_index method. To sort by value, use the order method for a Series; for DataFrames, pass column name(s) to the by argument.
For a Series, use sort_index to sort by index and order to sort by value. You do not need to specify an axis since it is just an array. You can pass in additional parameters like a boolean value for ascending.
obj = Series(range(4), index=['d', 'a', 'b', 'c'])
obj
#d 0
#a 1
#b 2
#c 3
#dtype: int64
obj.sort_index(ascending=True) #sort by index (returns a new Series)
#a 1
#b 2
#c 3
#d 0
#dtype: int64
obj.order() #sort by value
#d 0
#a 1
#b 2
#c 3
#dtype: int64
For a DataFrame, use sort to sort by values or sort_index to sort by index (where you can specify which axis to sort). You can pass in additional parameters like a boolean value for ascending, what to do with NaN values with na_position, or multiple columns to sort on with by.
df = pd.DataFrame({'Song': ['Hey Jude', 'Help!', 'Do it', 'Let it be'],
'Weeks': [24, 7, 2, 90], 'Plays': [20, 10, 1, 1]}, index=[3, 2, 4, 1])
df
# Plays Song Weeks
#3 20 Hey Jude 24
#2 10 Help! 7
#4 1 Do it 2
#1 1 Let it be 90
# Sort by Values
df.sort(['Plays', 'Weeks'], ascending=[True, True])
# Plays Song Weeks
#4 1 Do it 2
#1 1 Let it be 90
#2 10 Help! 7
#3 20 Hey Jude 24
# Sort by Index
df.sort_index(axis=0, ascending=True) # sort index on x axis (default)
# Plays Song Weeks
#1 1 Let it be 90
#2 10 Help! 7
#3 20 Hey Jude 24
#4 1 Do it 2
df.sort_index(by=['Plays', 'Weeks'], ascending=[True, True])
# Plays Song Weeks
#4 1 Do it 2
#1 1 Let it be 90
#2 10 Help! 7
#3 20 Hey Jude 24
You can assign ranks from one through the number of valid data points. This is useful for statistical methods. By default, the rank
method breaks ties by assigning each group the mean rank. You can specify additional parameters like ascending
and method
.
obj = Series([7, -5, 7, 4, 2, 0, 4])
obj
#0 7
#1 -5
#2 7
#3 4
#4 2
#5 0
#6 4
obj.rank(ascending=True)
#0 6.5
#1 1.0
#2 6.5
#3 4.5
#4 3.0
#5 2.0
#6 4.5
#dtype: float64
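With the method parameter you can change how ties are broken; for example, method='first' assigns ranks in the order the tied values appear (a quick sketch):
obj.rank(method='first')
#0 6.0
#1 1.0
#2 7.0
#3 4.0
#4 3.0
#5 2.0
#6 5.0
#dtype: float64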
There is a pandas.Categorical object that you can assign to a Series or DataFrame. You pass in the categories as well as an optional boolean value for ordered. You can rename categories using the Categorical.rename_categories() method. To convert back to another type (say int), use the astype method.
my_choices = pd.Categorical(['Like', 'Like', 'Hate', 'Neutral'],
categories=['Like', 'Neutral', 'Hate'],
ordered=True)
my_choices
#[Like, Like, Hate, Neutral]
#Categories (3, object): [Like < Neutral < Hate]
my_choices = my_choices.add_categories(['Really Hate', 'Really Like', 'NA'])
my_choices = my_choices.remove_categories(['NA'])
my_choices
#[Like, Like, Hate, Neutral]
#Categories (5, object): [Like < Neutral < Hate < Really Hate < Really Like]
my_choices = my_choices.reorder_categories(['Really Hate', 'Hate',
'Neutral', 'Like', 'Really Like'])
my_choices
#[Like, Like, Hate, Neutral]
#Categories (5, object): [Really Hate < Hate < Neutral < Like < Really Like]
num_choices = my_choices.rename_categories([1, 2, 3, 4, 5])
num_choices
#[4, 4, 2, 3]
#Categories (5, int64): [1 < 2 < 3 < 4 < 5]
A lot of times in statistical analysis you will want to convert a categorical into a dummy/indicator variables.
my_choices
my_choices = pd.Categorical(['Like', 'Like', 'Hate', 'Neutral'],
categories=['Like', 'Neutral', 'Hate'],
ordered=True)
pd.get_dummies(my_choices)
# Like Neutral Hate
#0 1 0 0
#1 1 0 0
#2 0 0 1
#3 0 1 0
Pandas has a lot of summary statistics as methods. They include:
count: counts the number of non-NA values
describe: gives summary statistics
min, max: calculate the minimum and maximum values
quantile: calculates the quantile value (enter a value ranging from 0 to 1)
sum: calculates the sum
mean: the mean of values
median: the arithmetic median (50% quantile) of values
mad: the mean absolute deviation from the mean value
var: the sample variance of values
std: the sample standard deviation of values
skew: gets the skew
kurt: gets the kurtosis
cumsum: gets the cumulative sum
cumprod: gets the cumulative product
diff: gets the difference
pct_change: gets the percent change

df = DataFrame({ 'Item': ['Hat', 'Sword', 'Armor', 'Shoes'],
'New_Price': [7, 35, 35, 10],
'Old_Price': [9, 30, 35, 15],
'Available': [1, 1, np.nan, np.nan]})
df
# Available Item New_Price Old_Price
#0 1 Hat 7 9
#1 1 Sword 35 30
#2 NaN Armor 35 35
#3 NaN Shoes 10 15
df.sum(axis=0) #default, sums up by columns
#Available 2
#Item HatSwordArmorShoes
#New_Price 87
#Old_Price 89
#dtype: object
df[['New_Price', 'Old_Price']].sum(axis=1) #filter columns, sum rows
#0 16
#1 65
#2 70
#3 25
Gives summary statistics about the dataset.
df.describe()
# Available New_Price Old_Price
#count 2 4.000000 4.000000
#mean 1 21.750000 22.250000
#std 0 15.348724 12.257651
#min 1 7.000000 9.000000
#25% 1 9.250000 13.500000
#50% 1 22.500000 22.500000
#75% 1 35.000000 31.250000
#max 1 35.000000 35.000000
Gives the quantile ranging from 0 to 1.
df.quantile(.75)
#Available 1.00
#New_Price 35.00
#Old_Price 31.25
#dtype: float64
Get the variance.
df.var()
#Available 0.000000
#New_Price 235.583333
#Old_Price 150.250000
#dtype: float64
Get the standard deviation of the value.
df.std()
#Available 0.000000
#New_Price 15.348724
#Old_Price 12.257651
#dtype: float64
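A couple of the other methods from the list above, sketched on the numeric columns:
df[['New_Price', 'Old_Price']].cumsum() # running total down each column
df[['New_Price', 'Old_Price']].pct_change() # percent change from the previous row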
Correlation and covariance are computed from pairs of arguments. For correlation, use corr() (or the corrwith method to compute pairwise correlations) and for covariance use cov().
temp = Series([14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2])
ice_sales = Series([215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408])
df = DataFrame({ 'Temp': temp, 'Sales': ice_sales})
df.cov() #covariance
# Sales Temp
#Sales 15886.810606 484.093182
#Temp 484.093182 16.089318
df.corr() #correlation with all variables (i.e. correlation matrix)
# Sales Temp
#Sales 1.000000 0.957507
#Temp 0.957507 1.000000
df.corrwith(temp) #correlations with column 'temp'
#Sales 0.957507
#Temp 1.000000
#dtype: float64
In a one-dimensional Series, you can get the unique objects using the unique
method. You can get the number of times an item occurs with the value_counts
method. You can also see if a value is in a passed sequence with isin
.
obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
obj.unique()
#array(['c', 'a', 'd', 'b'], dtype=object)
obj.value_counts()
#c 3
#a 3
#b 2
#d 1
#dtype: int64
obj.isin(['a','e']) #usually can use this as a mask
#0 False
#1 True
#2 False
#3 True
#4 True
#5 False
#6 False
#7 False
#8 False
There are a few ways that you can deal with missing data, which appears as np.nan and Python's None data type. Methods include:
dropna: allows you to drop rows or columns based on missing values
fillna: allows you to interpolate (i.e. fill in some value)
isnull: returns boolean values indicating if missing or np.nan
notnull: is the negation of isnull

dropna removes any values that are missing or np.nan. You can specify how (e.g. how='all' means to drop only the rows whose values are all np.nan; by default a row is dropped if there is at least one np.nan or missing value). You can specify the axis with axis. You can also specify a thresh argument (i.e. keep a row only if it has at least that many non-NA observations).
#Series dropna
data = Series([1, np.nan, 3.5, np.nan, 7])
data.dropna()
#0 1.0
#2 3.5
#4 7.0
#DataFrame dropna
data = DataFrame([[1., 6.5, 3., 2.4, 3],
[1., np.nan, np.nan, 3, np.nan],
[np.nan, np.nan, np.nan, np.nan, np.nan],
[np.nan, 2.3, 4.3, 2.1, np.nan]])
data
# 0 1 2 3 4
#0 1 6.5 3.0 2.4 3
#1 1 NaN NaN 3.0 NaN
#2 NaN NaN NaN NaN NaN
#3 NaN 2.3 4.3 2.1 NaN
cleaned = data.dropna()
# 0 1 2 3 4
#0 1 6.5 3 2.4 3
cleaned = data.dropna(how='all')
# 0 1 2 3 4
#0 1 6.5 3.0 2.4 3
#1 1 NaN NaN 3.0 NaN
#3 NaN 2.3 4.3 2.1 NaN
cleaned = data.dropna(axis=1)
#Empty DataFrame
#Columns: []
#Index: [0, 1, 2, 3]
data.dropna(thresh=2) # needs at least 2 observations
# 0 1 2 3 4
#0 1 6.5 3.0 2.4 3
#1 1 NaN NaN 3.0 NaN
#3 NaN 2.3 4.3 2.1 NaN
fillna lets you interpolate values (i.e. fill in missing values with something). You can specify the axis
, method
, value
(i.e. a scalar or dict), limit
(maximum number of consecutive periods to fill), and if inplace
.
data2 = DataFrame({ 'Month': np.arange(12),
'People': [1001, 1021, 1050, 1071, np.nan, np.nan,
np.nan, 990, np.nan, 990, np.nan, 723],
'Cars': [np.nan, 432, np.nan, 322, 423, np.nan,
924, 432, 468, 568, 433, 323]})
data2.fillna(0) # fill np.nan with scalar value
# Cars Month People
#0 0 0 1001
#1 432 1 1021
#2 0 2 1050
#3 322 3 1071
#4 423 4 0
#5 0 5 0
#6 924 6 0
#7 432 7 990
#8 468 8 0
#9 568 9 990
#10 433 10 0
#11 323 11 723
data2.fillna({'Cars': -1, 'People': -5}) # Specify dict for fill
# Cars Month People
#0 -1 0 1001
#1 432 1 1021
#2 -1 2 1050
#3 322 3 1071
#4 423 4 -5
#5 -1 5 -5
#6 924 6 -5
#7 432 7 990
#8 468 8 -5
#9 568 9 990
#10 433 10 -5
#11 323 11 723
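You can also fill forward with method and cap how far a value is carried with limit (a quick sketch on data2):
data2.fillna(method='ffill', limit=2) # carry the last valid value forward, at most 2 rows
# e.g. People at rows 4 and 5 become 1071, but row 6 stays NaN because of the limit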
split-apply-combine is a process for group operations. This involves splitting the data into groups, applying a function to each group, and combining the results.
groupby accomplishes the first step of split-apply-combine; it splits the data into multiple groups and returns a grouped object (which can be a SeriesGroupBy or DataFrameGroupBy). Say we want to split by a key (e.g. key1), we are interested in a value (data1), we apply a function (e.g. the mean method), and we return this as a combined result.
df = DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
'key2': ['one', 'two', 'one', 'two', 'one'],
'data1': np.random.randn(5),
'data2': np.random.randn(5)})
df
# data1 data2 key1 key2
#0 -0.238342 -0.592247 a one
#1 0.187521 1.524641 a two
#2 -1.926102 -1.110540 b one
#3 -1.183664 0.211620 b two
#4 -0.519853 -0.754792 a one
#Step 1 of split-apply-combine; split data with key and value
grouped = df['data1'].groupby(df['key1'])
#<pandas.core.groupby.SeriesGroupBy object at 0x107638c90>
#Step 2 of split-apply-combine; apply a function
grouped.mean()
#key1
#a -0.190224
#b -1.554883
#Name: data1, dtype: float64
#Step 3 of split-apply-combine; result is a Series
type(grouped.mean())
#<class 'pandas.core.series.Series'>
Usually the data you work with is in the same dataframe so you can just specify the columns to pass in as the group keys. If you want to group by multiple levels, pass in a list.
df.groupby('key1').mean() # single key
# data1 data2
#key1
#a -0.190224 0.059201
#b -1.554883 -0.449460
df.groupby(['key1', 'key2']).mean() # multiple keys
# data1 data2
#key1 key2
#a one -0.379097 -0.673520
# two 0.187521 1.524641
#b one -1.926102 -1.110540
# two -1.183664 0.211620
You can index columns as another way of saying “select these columns” for aggregation.
#SeriesGroupBy object
df.groupby('key1')['data2'] # df['data2'].groupby(df['key1'])
#<pandas.core.groupby.SeriesGroupBy object at 0x1076815d0>
#DataFrameGroupBy object
df.groupby('key1')[['data2']] # df[['data2']].groupby(df['key1'])
#<pandas.core.groupby.DataFrameGroupBy object at 0x107666490>
#Get the means for just 'data2' column grouped by the keys, calculate mean
df.groupby(['key1', 'key2'])[['data2']].mean()
# data2
#key1 key2
#a one -0.673520
# two 1.524641
#b one -1.110540
# two 0.211620
With a GroupBy object, you can count the number of occurrences using size, which is useful for categoricals.
df.groupby(['key1', 'key2']).size() # count number of occurrences
#key1 key2
#a one 2
# two 1
#b one 1
# two 1
#dtype: int64
With a GroupBy object, you can select a specific group using get_group
.
# One Key
grouped = df.groupby('key1')
grouped.get_group('a')
# data1 data2 key1 key2
#0 -1.299062 0.214834 a one
#1 1.624373 0.301559 a two
#4 -1.427448 -1.142078 a one
# Multiple Keys
grouped = df.groupby(['key1', 'key2'])
grouped.get_group(('a', 'one'))
# data1 data2 key1 key2
#0 -1.299062 0.214834 a one
#4 -1.427448 -1.142078 a one
With a GroupBy object, you can get the aggregation with aggregate
. If you need more complex aggregations, go see apply
.
grouped = df.groupby('key1')
grouped.aggregate(np.sum)
# data1 data2
#key1
#a -1.102138 -0.625684
#b -1.694854 -1.843882
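You can also pass a list of functions to get several aggregations at once (a small sketch):
grouped['data1'].agg(['count', 'mean', 'std']) # one column per aggregation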
You can filter and get a subset of an original object using filter
(e.g. we want to get the elements that belong to a group with a group sum greater than 2).
#Filtering a simple Series
sf = Series([1, 1, 2, 3, 3, 3])
sf.groupby(sf).filter(lambda x: x.sum() > 2)
#3 3
#4 3
#5 3
#dtype: int64
#Filtering a DataFrame
df = DataFrame({'A': np.arange(8),
'B': list('aabbbbcc'),
'C': np.arange(8, 0, -1)})
df
# A B C
#0 0 a 8
#1 1 a 7
#2 2 b 6
#3 3 b 5
#4 4 b 4
#5 5 b 3
#6 6 c 2
#7 7 c 1
#Group by col 'B' and only return values where there is a count of more than 2
df.groupby('B').filter(lambda x: len(x['C']) > 2)
# A B C
#2 2 b 6
#3 3 b 5
#4 4 b 4
#5 5 b 3
You can then chain the sort method to sort the result.
df.groupby('B').filter(lambda x: len(x['C']) > 2).sort('C')
# A B C
#5 5 b 3
#4 4 b 4
#3 3 b 5
#2 2 b 6
You can iterate over a GroupBy object to return a sequence of 2-tuples containing the group name along with the chunk of data.
#Iterate with a single key
for name, group in df.groupby('key1'):
print name #a, b
print group #data1, data2, key1, key2
#a
# data1 data2 key1 key2
#0 -0.238342 -0.592247 a one
#1 0.187521 1.524641 a two
#4 -0.519853 -0.754792 a one
#b
# data1 data2 key1 key2
#2 -1.926102 -1.11054 b one
#3 -1.183664 0.21162 b two
#Iterate with multiple keys
for (k1, k2), group in df.groupby(['key1', 'key2']):
print k1, k2 #(a, one), (a, two), (b, one), (b, two)
print group #data1, data2, key1, key2
#a one
# data1 data2 key1 key2
#0 -0.238342 -0.592247 a one
#4 -0.519853 -0.754792 a one
#a two
# data1 data2 key1 key2
#1 0.187521 1.524641 a two
#b one
# data1 data2 key1 key2
#2 -1.926102 -1.11054 b one
#b two
# data1 data2 key1 key2
#3 -1.183664 0.21162 b two
#Compute a dict with the pieces of data
pieces = dict(list(df.groupby('key1')))
pieces['b']
# data1 data2 key1 key2
#2 -1.926102 -1.11054 b one
#3 -1.183664 0.21162 b two
type(pieces['b'])
#<class 'pandas.core.frame.DataFrame'>
At times you want to reshape your data (e.g. from wide to long or vice versa). Pandas has a few tools for this, including melt, pivot, pivot_table, and crosstab.
Melt ‘unpivots’ a table from wide to long format. You have 1+ variables as identifiers (id_vars) and the remaining fields fall into two variables: variable and value.
cheese = pd.DataFrame({'first_name': ['Will', 'Laura', 'Mike', 'Mandy'],
'last_name': ['Liu', 'Summers', 'Liu', 'Summers'],
'height': [5.10, 5.3, 5.9, 5.2],
'weight': [150, 120, 190, 110]})
print cheese
# first_name height last_name weight
#0 Will 5.1 Liu 150
#1 Laura 5.3 Summers 120
#2 Mike 5.9 Liu 190
#3 Mandy 5.2 Summers 110
# Melt and specify variable name
print pd.melt(cheese, id_vars=['first_name', 'last_name'])
# first_name last_name variable value
#0 Will Liu height 5.1
#1 Laura Summers height 5.3
#2 Mike Liu height 5.9
#3 Mandy Summers height 5.2
#4 Will Liu weight 150
#5 Laura Summers weight 120
#6 Mike Liu weight 190
#7 Mandy Summers weight 110
# Filter for a specific variable
print pd.melt(cheese, id_vars=['first_name', 'last_name'],
var_name='quantity', value_vars='height')
# first_name last_name quantity value
#0 Will Liu height 5.1
#1 Laura Summers height 5.3
#2 Mike Liu height 5.9
#3 Mandy Summers height 5.2
You use pivot for transforming long to wide on categorical values or numerical values. This is the opposite of melt.
df = pd.DataFrame({
'id': [25418726, 25418726, 25418731, 25418731, 25418740],
'question': ['A', 'B', 'A', 'B', 'A'],
'answer': ['V', 'W', 'X', 'Y', 'Z']
})
print df
# answer id question
#0 V 25418726 A
#1 W 25418726 B
#2 X 25418731 A
#3 Y 25418731 B
#4 Z 25418740 A
temp = df.pivot(index='id', columns='question', values='answer')
print temp
#question A B
#id
#25418726 V W
#25418731 X Y
#25418740 Z NaN
You use pivot tables to transform data from long to wide (creating a spreadsheet-like dataset). The main advantage of pivot_table over pivot is that the values have to be numerical, which lets you apply an 'aggfunc' to them (e.g. sum, mean).
print pd.pivot_table(cheese, values='height', index=['last_name', 'first_name'])
#last_name first_name
#Liu Mike 5.9
# Will 5.1
#Summers Laura 5.3
# Mandy 5.2
#Name: height, dtype: float64
print pd.pivot_table(cheese, values='height', index=['last_name'], aggfunc='mean')
#last_name
#Liu 5.50
#Summers 5.25
#Name: height, dtype: float64
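You can also spread another field across the columns with the columns argument (a quick sketch on cheese):
print pd.pivot_table(cheese, values='height', index='last_name',
columns='first_name', aggfunc='mean')
#first_name Laura Mandy Mike Will
#last_name
#Liu NaN NaN 5.9 5.1
#Summers 5.3 5.2 NaN NaN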
A crosstab creates a cross-tabulation of two or more factors, usually giving a frequency table of the factors.
pd.crosstab(cheese.last_name, cheese.first_name)
#first_name Laura Mandy Mike Will
#last_name
#Liu 0 0 1 1
#Summers 1 1 0 0
You can combine objects together with Series and DataFrame objects using various logic. You can pass in different objects including dicts, list, Series, DataFrames.
This operation concatenates (i.e. stacks together) objects along an axis and applies optional logic (axis, join, join_axes, keys). The default concatenation axis is 0 and the default join is 'outer'.
series1 = pd.Series([0, 1], index=['a', 'b'])
series2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
series3 = pd.Series([5, 6], index=['f', 'g'])
# Concat for a list of Series links the values and indexes together
pd.concat([series1, series2, series3])
#a 0
#b 1
#c 2
#d 3
#e 4
#f 5
#g 6
#dtype: int64
# Default concat is along axis=0
series4 = pd.concat([series1 * 5, series3])
#a 0
#b 5
#f 5
#g 6
#dtype: int64
# Concat along axis=1, series1 has no 'f', 'g' indexes so returns a NaN
pd.concat([series1, series4], axis=1)
# 0 1
#a 0 0
#b 1 5
#f NaN 5
#g NaN 6
# Can specify how to join
pd.concat([series1, series4], axis=1, join='inner')
# 0 1
#a 0 0
#b 1 5
# Can specify the join_axes
pd.concat([series1, series4], axis=1, join_axes=[['a', 'c', 'b', 'e']])
# 0 1
#a 0 0
#c NaN NaN
#b 1 5
#e NaN NaN
df1 = pd.DataFrame(np.arange(6).reshape(3, 2), \
index=['a', 'b', 'c'],
columns=['one', 'two'])
df1
# one two
#a 0 1
#b 2 3
#c 4 5
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), \
index=['a', 'c'],
columns=['three', 'four'])
df2
# three four
#a 5 6
#c 7 8
# Using list of objects and axis=0
pd.concat([df1, df2], axis=0) # default join='outer'
# four one three two
#a NaN 0 NaN 1
#b NaN 2 NaN 3
#c NaN 4 NaN 5
#a 6 NaN 5 NaN
#c 8 NaN 7 NaN
# Using list of objects and axis=1
pd.concat([df1, df2], axis=1, join='inner')
# one two three four
#a 0 1 5 6
#c 4 5 7 8
Append is actually just a shortcut to concat (i.e. shorter, with more assumptions filled in). An append concatenates along axis=0 using the index.
# Append with Series
s = pd.Series(np.random.randn(10), index=np.arange(10))
s1 = s[:5]
s2 = s[6:]
s1.append(s2)
# Append with DataFrame
df1.append(df2)
# four one three two
#a NaN 0 NaN 1
#b NaN 2 NaN 3
#c NaN 4 NaN 5
#a 6 NaN 5 NaN
#c 8 NaN 7 NaN
# Append can take multiple items
df1.append([df2, df1])
# four one three two
#a NaN 0 NaN 1
#b NaN 2 NaN 3
#c NaN 4 NaN 5
#a 6 NaN 5 NaN
#c 8 NaN 7 NaN
#a NaN 0 NaN 1
#b NaN 2 NaN 3
#c NaN 4 NaN 5
Merge allows you to connect rows in DataFrames based on a key and/or the index. This is similar to database join operations. There is also a join method that works much the same way (joining on the index by default). You can specify the type of join with 'how'.
df1 = pd.DataFrame({'key':['b', 'b', 'a', 'c', 'a', 'a', 'b'],
'data1':range(7)})
df1
# data1 key
#0 0 b
#1 1 b
#2 2 a
#3 3 c
#4 4 a
#5 5 a
#6 6 b
df2 = pd.DataFrame({'key':['a', 'b', 'd'],
'data2':range(3)})
df2
# data2 key
#0 0 a
#1 1 b
#2 2 d
#### Merge on Key
pd.merge(df1, df2, on='key', how='inner')
# data1 key data2
#0 0 b 1
#1 1 b 1
#2 6 b 1
#3 2 a 0
#4 4 a 0
#5 5 a 0
#### Merge on Index
pd.merge(df1, df2, left_index=True, right_index=True, how='inner')
# data1 key_x data2 key_y
#0 0 b 0 a
#1 1 b 1 b
#2 2 a 2 d
#### Merge on Key and Index
pd.merge(df1, df2, left_index=True, right_on='data2', how='inner')
# data1 key_x data2 key_y
#0 0 b 0 a
#1 1 b 1 b
#2 2 a 2 d
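For example, how='left' keeps every row of df1 and fills in NaN where there is no matching key (a quick sketch):
pd.merge(df1, df2, on='key', how='left')
# data1 key data2
#0 0 b 1
#1 1 b 1
#2 2 a 0
#3 3 c NaN
#4 4 a 0
#5 5 a 0
#6 6 b 1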
combine_first combines two DataFrame objects and defaults to the non-null values in the frame calling the method. The resulting index and columns are the union of the respective indexes and columns (i.e. df1 is prioritized; df2's values are used to fill in any missing holes).
df1 = pd.DataFrame({
'a': [1., np.nan, 5., np.nan],
'b': [np.nan, 2., np.nan, 6.],
'c': np.arange(2, 18, 4)})
df1
# a b c
#0 1 NaN 2
#1 NaN 2 6
#2 5 NaN 10
#3 NaN 6 14
df2 = pd.DataFrame({
'a': [5., 4., np.nan, 3., 7.],
'b': [np.nan, 3., 4., 6., 8.]})
df2
# a b
#0 5 NaN
#1 4 3
#2 NaN 4
#3 3 6
#4 7 8
df1.combine_first(df2)
# a b c
#0 1 NaN 2
#1 4 2 6
#2 5 4 10
#3 3 6 14
#4 7 8 NaN