Pandas is built on top of NumPy and takes the ndarray a step further into high-level data structures: the Series and DataFrame objects. These objects carry metadata like column and row names in an index (which has an index.name). There are also a lot of helper functions for loading, selecting, and chunking data. If you are using something like SQL for anything that goes beyond a simple query, or for a large dataset, it's time to switch to pandas.
Pandas is similar to R and follows the same split-apply-combine strategy through the groupby method. Melting (reshaping from wide to long) is done through the melt method.
Import the pandas library (NumPy is also used throughout these examples):
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
Use axis=0 to apply a method down each column, and axis=1 to apply a method across each row. So df.mean(axis=1) takes the mean of the entries horizontally across columns (i.e. along each individual row).
![Image of Axis](https://williamqliu.github.com/images/pd_axis.jpg)
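As a quick illustration of axis, here is a minimal sketch using a small made-up DataFrame:
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})
df.mean(axis=0) # mean of each column (down the rows)
#a     2.0
#b    20.0
#dtype: float64
df.mean(axis=1) # mean of each row (across the columns)
#0     5.5
#1    11.0
#2    16.5
#dtype: float64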
A Series is a one-dimensional array-like object containing any NumPy data type as values as well as data labels called the index. If no index is specified, a default one is created using integers that span 0 through N-1 (where N is the length of data).
from pandas import Series
obj = Series([4, 7, -5, 3]) #Simplest Series is an array
#If no index specified, default index is created
obj
#0 4
#1 7
#2 -5
#3 3
#dtype: int64
You can select single values or a set of values by specifying the index within [].
obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c']) # Series with a labeled index
obj2['a'] # Get value corresponding to obj2's index 'a'
#-5
obj2[['c', 'a', 'd']] # Get values to this list of indexes
#c 3
#a -5
#d 4
#dtype: int64
You can think of a Series as a fixed-length, ordered dictionary since it maps the index to values. You can even pass in a Python dict to create a Series.
'b' in obj2 # True
'e' in obj2 # False
# You can pass in a Python dict to create a Series
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = Series(sdata)
obj3
#Ohio 35000
#Oregon 16000
#Texas 71000
#Utah 5000
#dtype: int64
# You can specify an index; if no matching value, then value is 'NaN'
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index=states)
obj4
#California NaN #NaN since there is no 'California' key in sdata
#Ohio 35000
#Oregon 16000
#Texas 71000
#dtype: float64
A DataFrame is a tabular, spreadsheet-like data structure (like Excel) and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). There is both a row and column index.
# Create a dict of equal-length lists or NumPy arrays
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = pd.DataFrame(data)
df
# pop state year
#0 1.5 Ohio 2000
#1 1.7 Ohio 2001
#2 3.6 Ohio 2002
#3 2.4 Nevada 2001
#4 2.9 Nevada 2002
# Can specify column names, order, and index names
df2 = pd.DataFrame(data,
columns=['year', 'state', 'pop', 'debt'],
index=['one', 'two', 'three', 'four', 'five'])
df2
# year state pop debt
#one 2000 Ohio 1.5 NaN
#two 2001 Ohio 1.7 NaN
#three 2002 Ohio 3.6 NaN
#four 2001 Nevada 2.4 NaN
#five 2002 Nevada 2.9 NaN
You can see what columns are available using columns.
df2.columns # List all columns
#Index([u'year', u'state', u'pop', u'debt'], dtype='object')
If you want to select a column, this will return a Series; you can do this either with dict-like notation (i.e. df['col']) or with attribute notation (i.e. df.col).
df2['state'] # get 'state' column using dict notation, returns Series
#one Ohio
#two Ohio
#three Ohio
#four Nevada
#five Nevada
#Name: state, dtype: object
df2.state # get 'state' column using attribute notation, same as above
With ix, you can select a row by either its position (e.g. df.ix[2]) or its label (e.g. df2.ix['three']). Since this is built on top of NumPy, these selections are views (they reference the underlying data rather than copying it); if you want a copy, then use copy().
df2.ix['three'] # Get row with index 'three'
#year 2002
#state Ohio
#pop 3.6
#debt NaN
#Name: three, dtype: object
df2.ix['three']['state'] # Get value of index 'three', column 'state'
#'Ohio'
You can select and filter for specific rows and columns.
# Filter on specific column value only
df.ix[df.state == 'Ohio']
#   pop state  year
#0  1.5  Ohio  2000
#1  1.7  Ohio  2001
#2  3.6  Ohio  2002
# Filter by chaining (e.g. multiple columns)
df.ix[(df.state == 'Ohio') & (df.year.isin([2001, 2002]))]
# pop state year
#1 1.7 Ohio 2001
#2 3.6 Ohio 2002
# Filter on specific row
df2.ix['three']
#year 2002
#state Ohio
#pop 3.6
#debt NaN
You can also query the columns of a dataframe with a boolean expression.
df.query('year > 2001')
# pop state year
#2 3.6 Ohio 2002
#4 2.9 Nevada 2002
You can use loc to select rows by label, including selection with a boolean mask.
df = DataFrame({'sepallength': [1, 2, 3, 4, 5, 6, 7],
'sepalwidth': [7, 6, 5, 4, 3, 2, 1],
'class': ['Iris-setosa', 'Iris-versicolor', 'Iris-setosa', 'Iris', 'Iris', 'Iris', 'Iris']})
df.loc[df['class'] == 'Iris-setosa'] # Filter on single column value
df.loc[df['class'].isin(['Iris-setosa', 'Iris'])] # Filter multiple col values
You can also select rows or columns by integer position using the iloc method or by using []; either way takes a start, stop, and optionally a step.
#Select using Rows
df2.iloc[2:4]
# year state pop debt
#three 2002 Ohio 3.6 NaN
#four 2001 Nevada 2.4 NaN
df2[2:4] #same as above, but prefer to use iloc so we are explicit
#Select using Columns
df2.iloc[:, 1:3]
# state pop
#one Ohio 1.5
#two Ohio 1.7
#three Ohio 3.6
#four Nevada 2.4
#five Nevada 2.9
You can assign values similar to how you select items. The value's length must match the length of the DataFrame.
df2.ix['three', 'state'] = 'California' #Assign by row and col (avoids chained indexing)
df2['debt'] = 16.5 #Assign entire col the same value
df2['debt'] = [10, 9, 8, 7, 6] #Assign entire col different values
df2['debt'] = np.arange(5) # Assign col values from an array (length must match)
# Assign values with a Series; uses index to find where to put values
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
df2['debt'] = val
You can create columns by assigning values to them (e.g. a scalar like np.nan or an array). You can delete columns with the del keyword.
df2.columns # See columns
#Index([u'year', u'state', u'pop', u'debt'], dtype='object')
df2['eastern'] = np.nan # Create a new column (filled with NaN)
df2.columns
#Index([u'year', u'state', u'pop', u'debt', u'eastern'], dtype='object')
del df2['eastern']
df2.columns
#Index([u'year', u'state', u'pop', u'debt'], dtype='object')
You can transpose a DataFrame using T.
df2
# year state pop debt
#one 2000 Ohio 1.5 NaN
#two 2001 Ohio 1.7 -1.2
#three 2002 Ohio 3.6 NaN
#four 2001 Nevada 2.4 -1.5
#five 2002 Nevada 2.9 -1.7
df2.T
# one two three four five
#year 2000 2001 2002 2001 2002
#state Ohio Ohio Ohio Nevada Nevada
#pop 1.5 1.7 3.6 2.4 2.9
#debt NaN -1.2 NaN -1.5 -1.7
The advantage of a DataFrame is the vectorized operations, but sometimes you just have to iterate through pieces of a DataFrame. You can do this with iterrows().
for index, row in df2.iterrows():
    print index, row['year'], row['state']
# one 2000 Ohio
# two 2001 Ohio
# three 2002 Ohio
# four 2001 Nevada
# five 2002 Nevada
Series and DataFrame have an index and corresponding values. The index appears on the left (accessible through index) and the values are on the right (accessible through values). There are also corresponding names (accessible through the name attribute).
Indexes are immutable (can't be modified); the main thing to know is that later we will go into different types of indexes, including DatetimeIndex, which gives more functions for working with datetimes.
obj = Series([4, 7, -5, 3]) # Series without specifying index
obj.index #Int64Index([0, 1, 2, 3], dtype='int64')
#Series with a specified index
obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2
#d 4
#b 7
#a -5
#c 3
#dtype: int64
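Since an Index is immutable, trying to modify it in place raises an error (you can only assign a completely new index). A minimal sketch:
idx = obj2.index
idx[0] = 'z' # raises a TypeError: Index does not support mutable operations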
Some useful index methods and properties are:

* append concatenates with additional Index objects, producing a new Index
* diff gets the difference
* intersection gets the intersection
* union gets the union
* isin creates a boolean array indicating whether each value is in the passed collection
* delete removes the element at the given position
* drop deletes the passed values
* insert adds a new element at the given position
* is_unique returns True if the Index has no duplicate values
* unique computes the array of unique values in the Index

Series, DataFrame, and Index objects all have a name attribute. This makes it easier to get and set names for the data and the index.
obj2.name = 'Inches of Rain' # name for Series
obj2.index.name = 'state' # name for index
obj2
#state
#d 4
#b 7
#a -5
#c 3
#Name: Inches of Rain, dtype: int64
You can check whether an index is unique and compute set operations (like an intersection) between indexes.
df.index.is_unique #True
s1 = Series([10, 20, 30], index=[1, 2, 3])
s2 = Series([40, 50, 60], index=[1, 2, 4])
temp = s1.index.intersection(s2.index)
temp
#Int64Index([1, 2], dtype='int64')
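A minimal sketch of a few of the other Index methods listed above, continuing with s1 and s2:
s1.index.union(s2.index)
#Int64Index([1, 2, 3, 4], dtype='int64')
s1.index.append(s2.index) # concatenates, keeping duplicates
#Int64Index([1, 2, 3, 1, 2, 4], dtype='int64')
s1.index.insert(0, 0) # new Index with 0 inserted at position 0
#Int64Index([0, 1, 2, 3], dtype='int64')
s1.index.is_unique
#True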
You can drop entries from an axis with the drop method.
obj = Series([4.5, 7.2, -5.3, 3.6], index= ['d', 'b', 'a', 'c'])
new_obj = obj.drop(['d', 'a'])
new_obj
#b 7.2
#c 3.6
#dtype: float64
Reindex creates a new object with the data conformed to the new index, introducing missing values for labels that were not already present. For a scalar fill you can specify a fill_value (e.g. 0); otherwise the default is NaN. You can also specify a fill method (e.g. ffill to fill values forward). For a DataFrame you can reindex the rows, the columns (via the columns keyword), or both.
obj = Series([4.5, 7.2, -5.3, 3.6], index= ['d', 'b', 'a', 'c'])
obj
#d 4.5
#b 7.2
#a -5.3
#c 3.6
#dtype: float64
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)
obj2
#a -5.3
#b 7.2
#c 3.6
#d 4.5
#e 0
#dtype: float64
obj3 = Series(['blue', 'green', 'yellow'], index=[0, 2, 4])
obj3 = obj3.reindex(range(6), method='ffill')
obj3
#0 blue
#1 blue
#2 green
#3 green
#4 yellow
#5 yellow
#dtype: object
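Reindexing a DataFrame's columns works the same way; a minimal sketch with a small made-up frame:
frame = DataFrame(np.arange(9).reshape((3, 3)),
                  index=['a', 'b', 'c'],
                  columns=['Ohio', 'Texas', 'California'])
frame.reindex(columns=['Texas', 'Utah', 'California'])
#   Texas  Utah  California
#a      1   NaN           2
#b      4   NaN           5
#c      7   NaN           8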
Pandas uses many of the same dtypes as NumPy, but there are additional types like dtype='category'. You can also explicitly convert types with the astype method.
s = Series(['a', 'b', 'c', 'd', 'a', 'b'], dtype='category')
s.dtypes # category
df = DataFrame({'A': ['a', 'b', 'c', 'a'],
'B': ['z', 'z', 'y', 'd']})
df.dtypes
#A object
#B object
df['C'] = df['A'].astype('category')
df.dtypes
#A object
#B object
#C category
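astype also handles numeric conversions; a minimal sketch:
s = Series(['1', '2', '3'])
s.dtype # dtype('O'), i.e. strings
s.astype(int)
#0 1
#1 2
#2 3
#dtype: int64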
The values attribute returns the data as an ndarray (one-dimensional for a Series, two-dimensional for a DataFrame), and the dtype is automatically chosen to match the objects.
obj = Series([4, 7, -5, 3]) #Simplest Series is an array
obj.values #array([ 4, 7, -5, 3])
A NaN or NA is returned if a value cannot be found (i.e. a missing value). You can search your data for these missing values using isnull and notnull functions.
pd.isnull(obj4)
#California True
#Ohio False
#Oregon False
#Texas False
#dtype: bool
pd.notnull(obj4)
#California False
#Ohio True
#Oregon True
#Texas True
#dtype: bool
You can replace a value with another value. You can use replace() with a lot of variations including replacing a single value with a single value, a list with a single value, a list with a list, and a dict with a dict.
mydata = pd.Series([1., -999., 2., -999., -1000., 3.])
print mydata
#0 1
#1 -999
#2 2
#3 -999
#4 -1000
#5 3
#dtype: float64
# Replace value of x with value of y
mydata.replace(-999, np.nan)
#0 1
#1 NaN
#2 2
#3 NaN
#4 -1000
#5 3
# Replace list x with value of y
mydata.replace([-999, -1000], np.nan)
#0 1
#1 NaN
#2 2
#3 NaN
#4 NaN
#5 3
#dtype: float64
# Replace list of x with list of y
mydata.replace([-999, -1000], [np.nan, 0])
#0 1
#1 NaN
#2 2
#3 NaN
#4 0
#5 3
#dtype: float64
# Replace a dict of x with a dict of y
mydata.replace({-999:np.NaN, -1000:0})
#0 1
#1 NaN
#2 2
#3 NaN
#4 0
#5 3
#dtype: float64
To check if a DataFrame has duplicated values, you can use duplicated().
df = pd.DataFrame({
'k1': ['one'] * 3 + ['two'] * 4,
'k2': [1, 1, 2, 3, 3, 4, 4]
})
df
# k1 k2
#0 one 1
#1 one 1
#2 one 2
#3 two 3
#4 two 3
#5 two 4
#6 two 4
# Return a Series of whether the row is duplicated
df.duplicated()
#0 False
#1 True
#2 False
#3 False
#4 True
#5 False
#6 True
#dtype: bool
# Return a DataFrame of unique rows
df.drop_duplicates()
# k1 k2
#0 one 1
#2 one 2
#3 two 3
#5 two 4
# You can specify what columns are used to determine if row is unique
df.drop_duplicates(['k1'])
# k1 k2
#0 one 1
#3 two 3
Pandas really starts to show how powerful it is when working with DateTimes.
convert_objects and to_datetime: first you want to make sure that the data type is correct (i.e. it's actually a datetime). You can convert DataFrame columns a couple of different ways. There's a generic convert_objects that you can also use to convert numbers. If you want a more specific time function, use to_datetime.
mydf = mydf.convert_objects(convert_dates='coerce') # also convert_numeric, convert_timedeltas
mydf['SomeDateColumn'] = pd.to_datetime(mydf['SomeDateColumn'], coerce=True)
To utilize the datetime specific methods, you need to set an index that is a DatetimeIndex.
df.set_index(pd.DatetimeIndex(df['someDate']), inplace=True)
df # say this is: 'someDate', 'cookiesEaten', 'Notes'
#2009-02-16 23:26:40, 5, someotherstuff
#2009-02-16 23:45:43, 7, someotherstuff
#2009-02-17 22:05:31, 10, someotherstuff
You can now resample the counts of, say, 'cookiesEaten'.
df.resample('5Min', how='sum') # default is mean
df.resample('D', how=np.max)
#2009-02-16, 12
#2009-02-17, 10
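Other offset aliases and aggregation functions (described below) combine the same way; a minimal sketch of a few calls, output omitted:
df.resample('H', how='sum')       # hourly totals
df.resample('W', how='mean')      # weekly averages
df.resample('10Min', how='count') # counts per 10-minute bucket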
With the first argument of a resample, you can specify the offset alias. For example:
* D is for calendar day frequency
* B is for business day frequency
* W is for weekly frequency
* M is for month end frequency
* MS is for month start frequency
* Q is for quarter end frequency
* QS is for quarter start frequency
* H is for hourly frequency
* T is for minute frequency
* S is for secondly frequency
* U is for microsecond frequency

With the second argument of a resample, you can specify how. For example:

* how='mean'
* how='min'
* how='median'
* how='first'
* how='last'
* how='ohlc' # open, high, low, close (mainly used in finance to see the opening price, closing price for the day, etc.)

You can apply NumPy array operations (e.g. filtering, scalar multiplication, applying math functions) and ufuncs to Series and DataFrames.
obj2 # Setup Series
#d 4
#b 7
#a -5
#c 3
#dtype: int64
obj2[obj2 > 0] # Filtering
#d 4
#b 7
#c 3
#dtype: int64
obj2 * 2 # Scalar Multiplication
#d 8
#b 14
#a -10
#c 6
#dtype: int64
np.sqrt(obj2) # Apply ufunc
#d 2.00000
#b 2.64751
#a NaN
#c 1.73205
#dtype: float64
One of the advantages of pandas is that data is automatically aligned on the index. For example, even though obj4 has the index ‘California’ that obj3 does not have, we can still do arithmetic.
obj3
#Ohio 35000
#Oregon 16000
#Texas 71000
#Utah 5000
#dtype: int64
obj4
#California NaN
#Ohio 35000
#Oregon 16000
#Texas 71000
obj3 + obj4
#California NaN
#Ohio 70000
#Oregon 32000
#Texas 142000
#Utah NaN
#dtype: float64
You can apply arithmetic methods between Series and DataFrame objects. These methods let you pass in additional arguments (like fill_value and axis):

* add is the method for addition (+)
* sub is the method for subtraction (-)
* div is the method for division (/)
* mul is the method for multiplication (*)

Adding together two DataFrames results in NA values in the locations that do not overlap.
df1 = DataFrame(np.arange(12).reshape((3, 4)), columns=['a', 'b', 'c', 'd'])
df2 = DataFrame(np.arange(20).reshape((4,5)), columns=['a', 'b', 'c', 'd', 'e'])
df1
# a b c d
#0 0 1 2 3
#1 4 5 6 7
#2 8 9 10 11
df2
# a b c d e
#0 0 1 2 3 4
#1 5 6 7 8 9
#2 10 11 12 13 14
#3 15 16 17 18 19
df1 + df2
# a b c d e
#0 0 2 4 6 NaN
#1 9 11 13 15 NaN
#2 18 20 22 24 NaN
#3 NaN NaN NaN NaN NaN
df1.add(df2, fill_value=0) # using method allows fill_value
#    a   b   c   d   e
#0 0 2 4 6 4
#1 9 11 13 15 9
#2 18 20 22 24 14
#3 15 16 17 18 19
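The other arithmetic methods behave the same way; a minimal sketch using sub and mul on the same df1 and df2:
df1.sub(df2, fill_value=0)
#    a   b   c   d   e
#0   0   0   0   0  -4
#1  -1  -1  -1  -1  -9
#2  -2  -2  -2  -2 -14
#3 -15 -16 -17 -18 -19
df1.mul(2) # a scalar argument also works
#    a   b   c   d
#0   0   2   4   6
#1   8  10  12  14
#2  16  18  20  22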
You can apply NumPy's ufuncs (element-wise array methods) on Pandas objects. For example, we apply np.abs onto a DataFrame.
df = DataFrame(np.random.randn(4, 3), columns=['a', 'b', 'c'],
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df
# a b c
#Utah 1.167988 -0.837221 -2.029371
#Ohio -0.727655 0.319954 -1.632084
#Texas 0.752019 -0.417954 -1.139173
#Oregon 0.108305 -0.588943 0.073157
np.abs(df)
# a b c
#Utah 1.167988 0.837221 2.029371
#Ohio 0.727655 0.319954 1.632084
#Texas 0.752019 0.417954 1.139173
#Oregon 0.108305 0.588943 0.073157
Besides the ufuncs above, you can apply your own custom function to a column or row by using the DataFrame apply method. You do not have to return a scalar value; you can also return a Series with multiple values.
my_func = lambda x: x.max() - x.min()
df.apply(my_func, axis=0) #Default, applies down each column
#a 1.895643
#b 1.157175
#c 2.102528
#dtype: float64
df.apply(my_func, axis=1) #Applies across each row
#Utah 3.197359
#Ohio 1.952038
#Texas 1.891192
#Oregon 0.697248
#dtype: float64
# You do not have to return a scalar value
def my_func(x):
    return Series([x.min(), x.max()], index=['min', 'max'])
df.apply(my_func)
# a b c
#min -0.727655 -0.837221 -2.029371
#max 1.167988 0.319954 0.073157
If you do not want to apply based on the axis (i.e. column or row), you can instead apply your own functions across every element. If this is a Series, then you use map. If this is a DataFrame, then use applymap.
#For Series, use map
my_format = lambda x: '%.2f' % x #2 decimal places
df['c'].map(my_format)
#Utah -2.03
#Ohio -1.63
#Texas -1.14
#Oregon 0.07
#Name: c, dtype: object
#For DataFrames, use applymap
my_format = lambda x: '%.2f' % x #2 decimal places
df.applymap(my_format) # returns a new DataFrame of formatted strings
# a b c
#Utah 1.17 -0.84 -2.03
#Ohio -0.73 0.32 -1.63
#Texas 0.75 -0.42 -1.14
#Oregon 0.11 -0.59 0.07
You may want to sort by some criteria, either the index or the values. To sort based on the index, use the sort_index method. To sort by value, use the order method for a Series; for a DataFrame, pass parameter(s) to the by argument.
For a Series, use sort_index to sort by index and order to sort by value. You do not need to specify an axis since it is just an array. You can pass in additional parameters, like a boolean value for ascending.
obj = Series(range(4), index=['d', 'a', 'b', 'c'])
obj
#d 0
#a 1
#b 2
#c 3
#dtype: int64
obj.sort_index(ascending=True) #sort by index (returns a new, sorted Series)
#a 1
#b 2
#c 3
#d 0
#dtype: int64
obj.order() #sort by value
#d 0
#a 1
#b 2
#c 3
#dtype: int64
For a DataFrame, use sort to sort by values or sort_index to sort by index (where you can specify which axis to sort). You can pass in additional parameters, like a boolean value for ascending, what to do with NaN via na_position, or multiple columns to sort on via by.
df = pd.DataFrame({'Song': ['Hey Jude', 'Help!', 'Do it', 'Let it be'],
'Weeks': [24, 7, 2, 90], 'Plays': [20, 10, 1, 1]}, index=[3, 2, 4, 1])
df
# Plays Song Weeks
#3 20 Hey Jude 24
#2 10 Help! 7
#4 1 Do it 2
#1 1 Let it be 90
# Sort by Values
df.sort(['Plays', 'Weeks'], ascending=[True, True])
# Plays Song Weeks
#4 1 Do it 2
#1 1 Let it be 90
#2 10 Help! 7
#3 20 Hey Jude 24
# Sort by Index
df.sort_index(axis=0, ascending=True) # sort index on x axis (default)
# Plays Song Weeks
#1 1 Let it be 90
#2 10 Help! 7
#3 20 Hey Jude 24
#4 1 Do it 2
df.sort_index(by=['Plays', 'Weeks'], ascending=[True, True])
# Plays Song Weeks
#4 1 Do it 2
#1 1 Let it be 90
#2 10 Help! 7
#3 20 Hey Jude 24
You can assign ranks from one through the number of valid data points. This is useful for statistical methods. By default, the rank method breaks ties by assigning each group the mean rank. You can specify additional parameters like ascending and method.
obj = Series([7, -5, 7, 4, 2, 0, 4])
obj
#0 7
#1 -5
#2 7
#3 4
#4 2
#5 0
#6 4
obj.rank(ascending=True)
#0 6.5
#1 1.0
#2 6.5
#3 4.5
#4 3.0
#5 2.0
#6 4.5
#dtype: float64
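With method='first', ties are broken by order of appearance instead of being averaged; a minimal sketch on the same obj:
obj.rank(method='first')
#0 6.0
#1 1.0
#2 7.0
#3 4.0
#4 3.0
#5 2.0
#6 5.0
#dtype: float64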
There is a pandas.Categorical object that you can assign to a Series or DataFrame. You pass in the categories as well as an optional ordered boolean. You can rename categories using the Categorical.rename_categories() method. To convert back to another type (say int), use the astype method.
my_choices = pd.Categorical(['Like', 'Like', 'Hate', 'Neutral'],
categories=['Like', 'Neutral', 'Hate'],
ordered=True)
my_choices
#[Like, Like, Hate, Neutral]
#Categories (3, object): [Like < Neutral < Hate]
my_choices = my_choices.add_categories(['Really Hate', 'Really Like', 'NA'])
my_choices = my_choices.remove_categories(['NA'])
my_choices
#[Like, Like, Hate, Neutral]
#Categories (5, object): [Like < Neutral < Hate < Really Hate < Really Like]
my_choices = my_choices.reorder_categories(['Really Hate', 'Hate',
'Neutral', 'Like', 'Really Like'])
my_choices
#[Like, Like, Hate, Neutral]
#Categories (5, object): [Really Hate < Hate < Neutral < Like < Really Like]
num_choices = my_choices.rename_categories([1, 2, 3, 4, 5])
num_choices
#[4, 4, 2, 3]
#Categories (5, int64): [1 < 2 < 3 < 4 < 5]
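To get back to plain numbers after renaming, one option (a sketch, using the num_choices above) is to pull the values out as an array, or wrap the Categorical in a Series and use astype:
np.asarray(num_choices)
#array([4, 4, 2, 3])
pd.Series(num_choices).astype(int)
#0 4
#1 4
#2 2
#3 3
#dtype: int64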
A lot of times in statistical analysis you will want to convert a categorical into a dummy/indicator variables.
my_choices = pd.Categorical(['Like', 'Like', 'Hate', 'Neutral'],
categories=['Like', 'Neutral', 'Hate'],
ordered=True)
pd.get_dummies(my_choices)
# Like Neutral Hate
#0 1 0 0
#1 1 0 0
#2 0 0 1
#3 0 1 0
Pandas has a lot of summary statistics as methods. They include:
* count counts the number of non-NA values
* describe gives summary statistics
* min, max calculate the minimum and maximum values
* quantile calculates the quantile value (enter a value ranging from 0 to 1)
* sum calculates the sum
* mean is the mean of values
* median is the arithmetic median (50% quantile) of values
* mad is the mean absolute deviation from the mean value
* var is the sample variance of values
* std is the sample standard deviation of values
* skew gets the skewness
* kurt gets the kurtosis
* cumsum gets the cumulative sum
* cumprod gets the cumulative product
* diff gets the difference
* pct_change gets the percent change

df = DataFrame({'Item': ['Hat', 'Sword', 'Armor', 'Shoes'],
                'New_Price': [7, 35, 35, 10],
                'Old_Price': [9, 30, 35, 15],
                'Available': [1, 1, np.nan, np.nan]})
df
# Available Item New_Price Old_Price
#0 1 Hat 7 9
#1 1 Sword 35 30
#2 NaN Armor 35 35
#3 NaN Shoes 10 15
df.sum(axis=0) #default, sums up by columns
#Available 2
#Item HatSwordArmorShoes
#New_Price 87
#Old_Price 89
#dtype: object
df[['New_Price', 'Old_Price']].sum(axis=1) #filter columns, sum rows
#0 16
#1 65
#2 70
#3 25
describe gives summary statistics about the dataset.
df.describe()
#       Available  New_Price  Old_Price
#count          2   4.000000   4.000000
#mean           1  21.750000  22.250000
#std            0  15.348724  12.257651
#min            1   7.000000   9.000000
#25%            1   9.250000  13.500000
#50%            1  22.500000  22.500000
#75%            1  35.000000  31.250000
#max            1  35.000000  35.000000
quantile gives the value at the given quantile, ranging from 0 to 1.
df.quantile(.75)
#Available 1.00
#New_Price 35.00
#Old_Price 31.25
#dtype: float64
var gets the sample variance.
df.var()
#Available 0.000000
#New_Price 235.583333
#Old_Price 150.250000
#dtype: float64
std gets the sample standard deviation of the values.
df.std()
#Available 0.000000
#New_Price 15.348724
#Old_Price 12.257651
#dtype: float64
Correlation and covariance are computed from pairs of arguments. For correlation, use corr() (or the corrwith method to compute pairwise correlations with another Series or DataFrame); for covariance, use cov().
temp = Series([14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2])
ice_sales = Series([215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408])
df = DataFrame({ 'Temp': temp, 'Sales': ice_sales})
df.cov() #covariance
# Sales Temp
#Sales 15886.810606 484.093182
#Temp 484.093182 16.089318
df.corr() #correlation with all variables (i.e. correlation matrix)
# Sales Temp
#Sales 1.000000 0.957507
#Temp 0.957507 1.000000
df.corrwith(temp) #correlations with column 'temp'
#Sales 0.957507
#Temp 1.000000
#dtype: float64
In a one-dimensional Series, you can get the unique objects using the unique method. You can get the number of times an item occurs with the value_counts method. You can also see if a value is in a passed sequence with isin.
obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
obj.unique()
#array(['c', 'a', 'd', 'b'], dtype=object)
obj.value_counts()
#c 3
#a 3
#b 2
#d 1
#dtype: int64
#(ties in the counts may appear in either order)
obj.isin(['a','e']) #usually can use this as a mask
#0 False
#1 True
#2 False
#3 True
#4 True
#5 False
#6 False
#7 False
#8 False
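The boolean result can then be used as a mask to filter the Series; a minimal sketch:
mask = obj.isin(['a', 'e'])
obj[mask]
#1 a
#3 a
#4 a
#dtype: object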
There are a few ways that you can deal with missing data, which appears as np.nan or Python's None data type. Methods include:

* dropna allows you to drop rows or columns based on missing values
* fillna allows you to interpolate (i.e. fill in some value)
* isnull returns boolean values indicating whether each value is missing or np.nan
* notnull is the negation of isnull

dropna removes any values that are missing or np.nan. You can specify how (e.g. how='all' means to drop only the rows whose values are all np.nan; by default a row is dropped if it has at least one np.nan or missing value). You can specify the axis with axis. You can also specify a thresh argument (i.e. keep a row only if it has at least that many observations).
#Series dropna
data = Series([1, np.nan, 3.5, np.nan, 7])
data.dropna()
#0 1.0
#2 3.5
#4 7.0
#DataFrame dropna
data = DataFrame([[1., 6.5, 3., 2.4, 3],
[1., np.nan, np.nan, 3, np.nan],
[np.nan, np.nan, np.nan, np.nan, np.nan],
[np.nan, 2.3, 4.3, 2.1, np.nan]])
data
# 0 1 2 3 4
#0 1 6.5 3.0 2.4 3
#1 1 NaN NaN 3.0 NaN
#2 NaN NaN NaN NaN NaN
#3 NaN 2.3 4.3 2.1 NaN
cleaned = data.dropna()
# 0 1 2 3 4
#0 1 6.5 3 2.4 3
cleaned = data.dropna(how='all')
# 0 1 2 3 4
#0 1 6.5 3.0 2.4 3
#1 1 NaN NaN 3.0 NaN
#3 NaN 2.3 4.3 2.1 NaN
cleaned = data.dropna(axis=1)
#Empty DataFrame
#Columns: []
#Index: [0, 1, 2, 3]
data.dropna(thresh=2) # needs at least 2 observations
# 0 1 2 3 4
#0 1 6.5 3.0 2.4 3
#1 1 NaN NaN 3.0 NaN
#3 NaN 2.3 4.3 2.1 NaN
fillna lets you interpolate values (i.e. fill in missing values with something). You can specify the axis, method, value (i.e. a scalar or dict), limit (maximum number of consecutive periods to fill), and if inplace.
data2 = DataFrame({ 'Month': np.arange(12),
'People': [1001, 1021, 1050, 1071, np.nan, np.nan,
np.nan, 990, np.nan, 990, np.nan, 723],
'Cars': [np.nan, 432, np.nan, 322, 423, np.nan,
924, 432, 468, 568, 433, 323]})
data2.fillna(0) # fill np.nan with scalar value
# Cars Month People
#0 0 0 1001
#1 432 1 1021
#2 0 2 1050
#3 322 3 1071
#4 423 4 0
#5 0 5 0
#6 924 6 0
#7 432 7 990
#8 468 8 0
#9 568 9 990
#10 433 10 0
#11 323 11 723
data2.fillna({'Cars': -1, 'People': -5}) # Specify dict for fill
# Cars Month People
#0 -1 0 1001
#1 432 1 1021
#2 -1 2 1050
#3 322 3 1071
#4 423 4 -5
#5 -1 5 -5
#6 924 6 -5
#7 432 7 990
#8 468 8 -5
#9 568 9 990
#10 433 10 -5
#11 323 11 723
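The method and limit arguments mentioned above fill forward (or backward) a bounded number of times; a minimal sketch on the People column:
data2['People'].fillna(method='ffill', limit=2)
#0     1001
#1     1021
#2     1050
#3     1071
#4     1071
#5     1071
#6      NaN
#7      990
#8      990
#9      990
#10     990
#11     723
#Name: People, dtype: float64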
split-apply-combine is a process for group operations. This involves splitting the data into groups based on one or more keys, applying a function to each group, and combining the results into a data structure.
groupby accomplishes the first step of split-apply-combine; it splits the data into multiple groups and returns a grouped object (which can be a SeriesGroupBy or DataFrameGroupBy). Say we want to split by a key (e.g. key1), take the values we are interested in (data1), apply a function (e.g. the mean method), and return the combined result.
df = DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
'key2': ['one', 'two', 'one', 'two', 'one'],
'data1': np.random.randn(5),
'data2': np.random.randn(5)})
df
# data1 data2 key1 key2
#0 -0.238342 -0.592247 a one
#1 0.187521 1.524641 a two
#2 -1.926102 -1.110540 b one
#3 -1.183664 0.211620 b two
#4 -0.519853 -0.754792 a one
#Step 1 of split-apply-combine; split data with key and value
grouped = df['data1'].groupby(df['key1'])
#<pandas.core.groupby.SeriesGroupBy object at 0x107638c90>
#Step 2 of split-apply-combine; apply a function
grouped.mean()
#key1
#a -0.190224
#b -1.554883
#Name: data1, dtype: float64
#Step 3 of split-apply-combine; result is a Series
type(grouped.mean())
#<class 'pandas.core.series.Series'>
Usually the data you work with is in the same dataframe so you can just specify the columns to pass in as the group keys. If you want to group by multiple levels, pass in a list.
df.groupby('key1').mean() # single key
# data1 data2
#key1
#a -0.190224 0.059201
#b -1.554883 -0.449460
df.groupby(['key1', 'key2']).mean() # multiple keys
#              data1     data2
#key1 key2
#a    one  -0.379097 -0.673520
#     two   0.187521  1.524641
#b    one  -1.926102 -1.110540
#     two  -1.183664  0.211620
You can index columns as another way of saying “select these columns” for aggregation.
#SeriesGroupBy object
df.groupby('key1')['data2'] # df['data2'].groupby(df['key1'])
#<pandas.core.groupby.SeriesGroupBy object at 0x1076815d0>
#DataFrameGroupBy object
df.groupby('key1')[['data2']] # df[['data2']].groupby(df['key1'])
#<pandas.core.groupby.DataFrameGroupBy object at 0x107666490>
#Get the means for just 'data2' column grouped by the keys, calculate mean
df.groupby(['key1', 'key2'])[['data2']].mean()
# data2
#key1 key2
#a one -0.673520
# two 1.524641
#b one -1.110540
# two 0.211620
With a GroupBy object, you can count the number of occurrences using size, which is useful for categoricals.
df.groupby(['key1', 'key2']).size() # count number of occurrences
#key1 key2
#a one 2
# two 1
#b one 1
# two 1
#dtype: int64
With a GroupBy object, you can select a specific group using get_group.
# One Key
grouped = df.groupby('key1')
grouped.get_group('a')
# data1 data2 key1 key2
#0 -1.299062 0.214834 a one
#1 1.624373 0.301559 a two
#4 -1.427448 -1.142078 a one
# Multiple Keys
grouped = df.groupby(['key1', 'key2'])
grouped.get_group(('a', 'one'))
# data1 data2 key1 key2
#0 -1.299062 0.214834 a one
#4 -1.427448 -1.142078 a one
With a GroupBy object, you can get the aggregation with aggregate. If you need more complex aggregations, go see apply.
grouped = df.groupby('key1')
grouped.aggregate(np.sum)
# data1 data2
#key1
#a -1.102138 -0.625684
#b -1.694854 -1.843882
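You can also pass a list of functions, or a dict mapping columns to functions (a minimal sketch; output depends on the random data above, so it is omitted):
grouped.agg(['mean', 'sum'])                     # one result column per function
grouped.agg({'data1': 'mean', 'data2': np.sum})  # different function per column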
You can filter and get a subset of an original object using filter (e.g. we want to get the elements that belong to a group with a group sum greater than 2).
#Filtering a simple Series
sf = Series([1, 1, 2, 3, 3, 3])
sf.groupby(sf).filter(lambda x: x.sum() > 2)
#3 3
#4 3
#5 3
#dtype: int64
#Filtering a DataFrame
df = DataFrame({'A': np.arange(8),
'B': list('aabbbbcc'),
'C': np.arange(8, 0, -1)})
df
# A B C
#0 0 a 8
#1 1 a 7
#2 2 b 6
#3 3 b 5
#4 4 b 4
#5 5 b 3
#6 6 c 2
#7 7 c 1
#Group by col 'B' and only return values where there is a count of more than 2
df.groupby('B').filter(lambda x: len(x['C']) > 2)
You can chain the sort method to sort the result (here by column 'C').
df.groupby('B').filter(lambda x: len(x['C']) > 2).sort('C')
# A B C
#5 5 b 3
#4 4 b 4
#3 3 b 5
#2 2 b 6
You can iterate over a GroupBy object to return a sequence of 2-tuples containing the group name along with the chunk of data.
#Iterate with a single key
for name, group in df.groupby('key1'):
    print name #a, b
    print group #data1, data2, key1, key2
#a
# data1 data2 key1 key2
#0 -0.238342 -0.592247 a one
#1 0.187521 1.524641 a two
#4 -0.519853 -0.754792 a one
#b
# data1 data2 key1 key2
#2 -1.926102 -1.11054 b one
#3 -1.183664 0.21162 b two
#Iterate with multiple keys
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print k1, k2 #(a, one), (a, two), (b, one), (b, two)
    print group #data1, data2, key1, key2
#a one
# data1 data2 key1 key2
#0 -0.238342 -0.592247 a one
#4 -0.519853 -0.754792 a one
#a two
# data1 data2 key1 key2
#1 0.187521 1.524641 a two
#b one
# data1 data2 key1 key2
#2 -1.926102 -1.11054 b one
#b two
# data1 data2 key1 key2
#3 -1.183664 0.21162 b two
#Compute a dict with the pieces of data
pieces = dict(list(df.groupby('key1')))
pieces['b']
# data1 data2 key1 key2
#2 -1.926102 -1.11054 b one
#3 -1.183664 0.21162 b two
type(pieces['b'])
#<class 'pandas.core.frame.DataFrame'>
At times you want to reshape your data (e.g. from wide to long or vice versa). Pandas has a few tools for this, including melt, pivot, pivot_table, and crosstab.
Melt ‘unpivots’ a table from wide to long format. You have 1+ variables as identifiers (id_vars) and the remaining fields fall into two variables: variable and value.
cheese = pd.DataFrame({'first_name': ['Will', 'Laura', 'Mike', 'Mandy'],
'last_name': ['Liu', 'Summers', 'Liu', 'Summers'],
'height': [5.10, 5.3, 5.9, 5.2],
'weight': [150, 120, 190, 110]})
print cheese
# first_name height last_name weight
#0 Will 5.1 Liu 150
#1 Laura 5.3 Summers 120
#2 Mike 5.9 Liu 190
#3 Mandy 5.2 Summers 110
# Melt and specify variable name
print pd.melt(cheese, id_vars=['first_name', 'last_name'])
# first_name last_name variable value
#0 Will Liu height 5.1
#1 Laura Summers height 5.3
#2 Mike Liu height 5.9
#3 Mandy Summers height 5.2
#4 Will Liu weight 150
#5 Laura Summers weight 120
#6 Mike Liu weight 190
#7 Mandy Summers weight 110
# Filter for a specific variable
print pd.melt(cheese, id_vars=['first_name', 'last_name'],
var_name='quantity', value_vars='height')
# first_name last_name quantity value
#0 Will Liu height 5.1
#1 Laura Summers height 5.3
#2 Mike Liu height 5.9
#3 Mandy Summers height 5.2
You use pivot for transforming long to wide on categorical values or numerical values. This is the opposite of melt.
df = pd.DataFrame({
'id': [25418726, 25418726, 25418731, 25418731, 25418740],
'question': ['A', 'B', 'A', 'B', 'A'],
'answer': ['V', 'W', 'X', 'Y', 'Z']
})
print df
# answer id question
#0 V 25418726 A
#1 W 25418726 B
#2 X 25418731 A
#3 Y 25418731 B
#4 Z 25418740 A
temp = df.pivot(index='id', columns='question', values='answer')
print temp
#question A B
#id
#25418726 V W
#25418731 X Y
#25418740 Z NaN
You use pivot tables to transform data from long to wide (creating a spreadsheet-like dataset). The main advantage of pivot_table over pivot is that the values are aggregated with an aggfunc (e.g. sum, mean), so duplicate entries for the same index/column combination are allowed.
print pd.pivot_table(cheese, values='height', index=['last_name', 'first_name'])
#last_name first_name
#Liu Mike 5.9
# Will 5.1
#Summers Laura 5.3
# Mandy 5.2
#Name: height, dtype: float64
print pd.pivot_table(cheese, values='height', index=['last_name'], aggfunc='mean')
#last_name
#Liu        5.50
#Summers    5.25
#Name: height, dtype: float64
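You can also spread a second key across the columns and pick the aggfunc; a minimal sketch with the same cheese data:
print pd.pivot_table(cheese, values='weight', index='last_name',
                     columns='first_name', aggfunc='sum')
#first_name  Laura  Mandy  Mike  Will
#last_name
#Liu           NaN    NaN   190   150
#Summers       120    110   NaN   NaN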
A crosstab creates a cross-tabulation of two or more factors, usually giving a frequency table of the factors.
pd.crosstab(cheese.last_name, cheese.first_name)
#first_name Laura Mandy Mike Will
#last_name
#Liu 0 0 1 1
#Summers 1 1 0 0
You can combine objects together with Series and DataFrame objects using various logic. You can pass in different objects including dicts, list, Series, DataFrames.
This operation concatenates (i.e. stacks together) objects along an axis and applies optional logic (axis, join, join_axes, keys). The default concatenation axis is 0 and the default join is 'outer'.
series1 = pd.Series([0, 1], index=['a', 'b'])
series2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
series3 = pd.Series([5, 6], index=['f', 'g'])
# Concat for a list of Series links the values and indexes together
pd.concat([series1, series2, series3])
#a 0
#b 1
#c 2
#d 3
#e 4
#f 5
#g 6
#dtype: int64
# Default concat is along axis=0
series4 = pd.concat([series1 * 5, series3])
#a 0
#b 5
#f 5
#g 6
#dtype: int64
# Concat along axis=1, series1 has no 'f', 'g' indexes so returns a NaN
pd.concat([series1, series4], axis=1)
# 0 1
#a 0 0
#b 1 5
#f NaN 5
#g NaN 6
# Can specify how to join
pd.concat([series1, series4], axis=1, join='inner')
# 0 1
#a 0 0
#b 1 5
# Can specify the join_axes
pd.concat([series1, series4], axis=1, join_axes=[['a', 'c', 'b', 'e']])
# 0 1
#a 0 0
#c NaN NaN
#b 1 5
#e NaN NaN
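The keys argument builds a hierarchical index identifying each piece; a minimal sketch with the same Series:
pd.concat([series1, series2, series3], keys=['one', 'two', 'three'])
#one    a    0
#       b    1
#two    c    2
#       d    3
#       e    4
#three  f    5
#       g    6
#dtype: int64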
df1 = pd.DataFrame(np.arange(6).reshape(3, 2), \
index=['a', 'b', 'c'],
columns=['one', 'two'])
df1
# one two
#a 0 1
#b 2 3
#c 4 5
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), \
index=['a', 'c'],
columns=['three', 'four'])
df2
# three four
#a 5 6
#c 7 8
# Using list of objects and axis=0
pd.concat([df1, df2], axis=0, join='outer') # 'outer' (the default) keeps all columns
# four one three two
#a NaN 0 NaN 1
#b NaN 2 NaN 3
#c NaN 4 NaN 5
#a 6 NaN 5 NaN
#c 8 NaN 7 NaN
# Using list of objects and axis=1
pd.concat([df1, df2], axis=1, join='inner')
# one two three four
#a 0 1 5 6
#c 4 5 7 8
Append is just a shortcut for concatenate (i.e. shorter, with more assumptions filled in). An append concatenates along axis=0 using the index.
# Append with Series
s = pd.Series(np.random.randn(10), index=np.arange(10))
s1 = s[:5]
s2 = s[6:]
s1.append(s2)
# Append with DataFrame
df1.append(df2)
# four one three two
#a NaN 0 NaN 1
#b NaN 2 NaN 3
#c NaN 4 NaN 5
#a 6 NaN 5 NaN
#c 8 NaN 7 NaN
# Append can take multiple items
df1.append([df2, df1])
# four one three two
#a NaN 0 NaN 1
#b NaN 2 NaN 3
#c NaN 4 NaN 5
#a 6 NaN 5 NaN
#c 8 NaN 7 NaN
#a NaN 0 NaN 1
#b NaN 2 NaN 3
#c NaN 4 NaN 5
Merge allows you to connect rows in DataFrames based on a key and/or the index, similar to database join operations. There is also a DataFrame join method, which merges on the index by default. You can specify the type of join with how.
df1 = pd.DataFrame({'key':['b', 'b', 'a', 'c', 'a', 'a', 'b'],
'data1':range(7)})
df1
# data1 key
#0 0 b
#1 1 b
#2 2 a
#3 3 c
#4 4 a
#5 5 a
#6 6 b
df2 = pd.DataFrame({'key':['a', 'b', 'd'],
'data2':range(3)})
df2
# data2 key
#0 0 a
#1 1 b
#2 2 d
#### Merge on Key
pd.merge(df1, df2, on='key', how='inner')
# data1 key data2
#0 0 b 1
#1 1 b 1
#2 6 b 1
#3 2 a 0
#4 4 a 0
#5 5 a 0
#### Merge on Index
pd.merge(df1, df2, left_index=True, right_index=True, how='inner')
# data1 key_x data2 key_y
#0 0 b 0 a
#1 1 b 1 b
#2 2 a 2 d
#### Merge on Key and Index
pd.merge(df1, df2, left_index=True, right_on='data2', how='inner')
# data1 key_x data2 key_y
#0 0 b 0 a
#1 1 b 1 b
#2 2 a 2 d
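Changing how controls which non-matching keys are kept; a minimal sketch with a left join on the same frames:
pd.merge(df1, df2, on='key', how='left')
#   data1 key  data2
#0      0   b      1
#1      1   b      1
#2      2   a      0
#3      3   c    NaN
#4      4   a      0
#5      5   a      0
#6      6   b      1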
Combine_first combines two DataFrame objects, preferring the non-null values of the frame calling the method. The resulting index and columns are the union of the respective indexes and columns (i.e. df1 is prioritized, and df2 values are used to fill in any missing holes).
df1 = pd.DataFrame({
'a': [1., np.nan, 5., np.nan],
'b': [np.nan, 2., np.nan, 6.],
'c': np.arange(2, 18, 4)})
df1
# a b c
#0 1 NaN 2
#1 NaN 2 6
#2 5 NaN 10
#3 NaN 6 14
df2 = pd.DataFrame({
'a': [5., 4., np.nan, 3., 7.],
'b': [np.nan, 3., 4., 6., 8.]})
df2
# a b
#0 5 NaN
#1 4 3
#2 NaN 4
#3 3 6
#4 7 8
df1.combine_first(df2)
# a b c
#0 1 NaN 2
#1 4 2 6
#2 5 4 10
#3 3 6 14
#4 7 8 NaN