William Liu

Pandas


Table of Contents

TODO:


Summary

Pandas is built on top of NumPy and takes the ndarray a step further into high-level data structures with Series and DataFrame objects; these objects carry metadata like column and row names as an index (with an index.name). There are also many helper functions for loading, selecting, and chunking data. If you find yourself using something like SQL for anything beyond a simple query, or for a large dataset, it's time to switch to pandas.

Pandas is similar to R and follows the same split-apply-combine strategy via the groupby method. Melting (reshaping from wide to long) is done through the melt method.


Setup

Import the pandas library (NumPy is also imported since it is used throughout these examples)

from pandas import Series, DataFrame
import pandas as pd
import numpy as np

Axis

When you see df.mean(axis=1), it means take the mean of the entries horizontally across the columns (i.e. along each individual row); df.mean(axis=0) takes the mean vertically down the rows (i.e. one value per column).

![Image of Axis](https://williamqliu.github.com/images/pd_axis.jpg)
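
A minimal sketch to make the axis direction concrete (uses the imports from Setup and a small throwaway frame named df_ax):

df_ax = pd.DataFrame(np.arange(6).reshape(2, 3), columns=['a', 'b', 'c'])
df_ax
#   a  b  c
#0  0  1  2
#1  3  4  5

df_ax.mean(axis=0)  # down each column, one value per column
#a    1.5
#b    2.5
#c    3.5
#dtype: float64

df_ax.mean(axis=1)  # across each row, one value per row
#0    1.0
#1    4.0
#dtype: float64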


SERIES

A Series is a one-dimensional array-like object containing any NumPy data type as values as well as data labels called the index. If no index is specified, a default one is created using integers that span 0 through N-1 (where N is the length of data).

Creating Series

from pandas import Series

obj = Series([4, 7, -5, 3])  #Simplest Series is an array

#If no index specified, default index is created
obj
#0  4
#1  7
#2 -5
#3  3
#dtype: int64

Selecting Series Values

You can select single values or a set of values by specifying the index within [].

obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])  # a Series with a labeled index

obj2['a']  # Get value corresponding to obj2's index 'a'
#-5

obj2[['c', 'a', 'd']]  # Get values to this list of indexes
#c  3
#a -5
#d  4
#dtype: int64

Dictionary and Series Similarities

You can think of a Series as a fixed-length, ordered dictionary since it maps the index to values. You can even pass in a Python dict to create a Series.

'b' in obj2  # True
'e' in obj2  # False

# You can pass in a Python dict to create a Series
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = Series(sdata)
obj3
#Ohio     35000
#Oregon   16000
#Texas    71000
#Utah      5000
#dtype: int64

# You can specify an index; if no matching value, then value is 'NaN'
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index=states)
obj4
#California   NaN  #Since we cannot find a value for this index
#Ohio       35000
#Oregon     16000
#Texas      71000

DATAFRAME

A DataFrame is a tabular, spreadsheet-like data structure (like Excel) and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). There is both a row and column index.

Creating (from a dict of lists or arrays)

# Create a dict of equal-length lists or NumPy arrays
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = pd.DataFrame(data)

df
#   pop   state  year
#0  1.5    Ohio  2000
#1  1.7    Ohio  2001
#2  3.6    Ohio  2002
#3  2.4  Nevada  2001
#4  2.9  Nevada  2002

# Can specify column names, order, and index names
df2 = pd.DataFrame(data,
                   columns=['year', 'state', 'pop', 'debt'],
                   index=['one', 'two', 'three', 'four', 'five'])
df2
#       year     state   pop  debt
#one    2000      Ohio   1.5   NaN
#two    2001      Ohio   1.7   NaN
#three  2002      Ohio   3.6   NaN
#four   2001    Nevada   2.4   NaN
#five   2002    Nevada   2.9   NaN

columns

You can see what columns are available using columns.

df2.columns  # List all columns
#Index([u'year', u'state', u'pop', u'debt'], dtype='object')

Selecting Columns (using dict and attribute)

If you want to select a column, this will return a Series; you can do this by using dict notation (i.e. []) or attribute notation (i.e. .).

df2['state']  # get 'state' column using dict notation, returns Series
#one     Ohio
#two     Ohio
#three   Ohio
#four  Nevada
#five  Nevada
#Name: state, dtype: object

df2.state  # get 'state' column using attribute notation, same as above

Selecting Rows and Columns (ix)

With ix, you can select a row by either its position (e.g. df.ix[2]) or its label (e.g. df2.ix['three']). Since this is built on top of NumPy, these selections are generally views (which means they refer to the actual values); if you want a copy, then use copy(). Note that in newer pandas versions, ix is deprecated in favor of loc and iloc.

df2.ix['three']  # Get row with index 'three'
#year   2002
#state  Ohio
#pop     3.6
#debt    NaN
#Name: three, dtype: object

df2.ix['three']['state']  # Get value of index 'three', column 'state'
#'Ohio'
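
If you need a selection that is guaranteed to be independent of df2, a quick sketch of calling copy() on it (the variable name row is just for illustration):

row = df2.ix['three'].copy()  # an independent copy of the row
row['pop'] = 99               # modifies only the copy, not df2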

Selecting and Filtering Rows and Columns

You can select and filter for specific rows and columns.

# Filter on specific column value only
df.ix[df.state == 'Ohio']
#   pop state  year
#0  1.5  Ohio  2000
#1  1.7  Ohio  2001
#2  3.6  Ohio  2002

# Filter by chaining (e.g. multiple columns)
df.ix[(df.state == 'Ohio') & (df.year.isin([2001, 2002]))]
#  pop state  year
#1 1.7  Ohio  2001
#2 3.6  Ohio  2002

# Filter on specific row 
df2.ix['three']
#year   2002
#state  Ohio
#pop     3.6
#debt    NaN

query

You can also query the columns of a dataframe with a boolean expression.

df.query('year > 2001')
#   pop    state  year
#2  3.6     Ohio  2002
#4  2.9   Nevada  2002

Selecting Rows and Columns by label (loc)

You can use the loc position to return based off of label names.

df = DataFrame({'sepallength': [1, 2, 3, 4, 5, 6, 7],
                'sepalwidth': [7, 6, 5, 4, 3, 2, 1],
                'class': ['Iris-setosa', 'Iris-versicolor', 'Iris-setosa', 'Iris', 'Iris', 'Iris', 'Iris']})
df.loc[df['class'] == 'Iris-setosa']  # Filter on single column value
df.loc[df['class'].isin(['Iris-setosa', 'Iris'])]  # Filter multiple col values

Selecting Rows and Columns by position (iloc)

You can also select rows or columns by integer position using the iloc method or by slicing with []; either way takes a start, a stop, and optionally a step.

#Select using Rows
df2.iloc[2:4]
#      year   state  pop  debt
#three 2002    Ohio  3.6   NaN
#four  2001  Nevada  2.4   NaN

df2[2:4]  #same as above, but prefer to use iloc so we are explicit

#Select using Columns
df2.iloc[:, 1:3]
#       state  pop
#one     Ohio  1.5
#two     Ohio  1.7
#three   Ohio  3.6
#four  Nevada  2.4
#five  Nevada  2.9
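
Like regular Python slices, you can also pass a step; a small sketch using df2 from above (debt is still all NaN at this point):

df2.iloc[::2]  # every other row
#       year   state  pop  debt
#one    2000    Ohio  1.5   NaN
#three  2002    Ohio  3.6   NaN
#five   2002  Nevada  2.9   NaN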

Assigning values

You can assign values similar to how you select items. When assigning a list or array to a column, its length must match the length of the DataFrame.

df2.ix['three', 'state'] = 'California'  #Assign by row and col
df2['debt'] = 16.5  #Assign entire col the same value
df2['debt'] = [10, 9, 8, 7, 6]  #Assign entire col different values
df2['debt'] = np.arange(5)  # Assign col values from an array

# Assign values with a Series; uses index to find where to put values
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
df2['debt'] = val

Making and Deleting columns

You can create columns by assigning them values (e.g. np.nan for an empty column). You can delete columns with the del keyword.

df2.columns  # See columns
#Index([u'year', u'state', u'pop', u'debt'], dtype='object')

df2['eastern'] = np.nan  # Create a new (empty) column
df2.columns
#Index([u'year', u'state', u'pop', u'debt', u'eastern'], dtype='object')

del df2['eastern']
df2.columns
#Index([u'year', u'state', u'pop', u'debt'], dtype='object')

Transposing

You can transpose a DataFrame using T.

df2
#       year   state  pop  debt
#one    2000    Ohio  1.5   NaN
#two    2001    Ohio  1.7  -1.2
#three  2002    Ohio  3.6   NaN
#four   2001  Nevada  2.4  -1.5
#five   2002  Nevada  2.9  -1.7

df2.T
#        one   two three    four    five
#year   2000  2001  2002    2001    2002
#state  Ohio  Ohio  Ohio  Nevada  Nevada
#pop     1.5   1.7   3.6     2.4     2.9
#debt    NaN  -1.2   NaN    -1.5    -1.7

Iteration

The advantage of a DataFrame is the vectorized operations, but sometimes you just have to iterate through pieces of a DataFrame. You can do this with iterrows().

for index, row in df2.iterrows():
    print index, row['year'], row['state']
# one 2000 Ohio
# two 2001 Ohio
# three 2002 Ohio
# four 2001 Nevada
# five 2002 Nevada

OBJECT PROPERTIES

Series and DataFrame objects have an index and corresponding values. The index appears on the left (accessible through index) and the values are on the right (accessible through values). There are also corresponding names (accessible through the name attribute).

index

Indexes are immutable (can't be modified); the main thing to know is that later we will go into different types of indexes, including DatetimeIndex, which gives more functions for working with datetimes.

obj = Series([4, 7, -5, 3])  # Series without specifying index
obj.index  #Int64Index([0, 1, 2, 3], dtype='int64')

#Series with a specified index
obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2
#d  4
#b  7
#a -5
#c  3
#dtype: int64
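
Because indexes are immutable, item assignment raises an error; a quick sketch (to relabel, you would instead assign a whole new sequence to obj2.index, which is not done here so later examples keep the original labels):

obj2.index[0] = 'z'  # raises TypeError; an Index does not support item assignment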

Some useful index methods and properties are:

name

Series, DataFrames, and indexes have a name attribute. This makes it easier to label your data and identify it in output.

obj2.name = 'Inches of Rain'  # name for Series
obj2.index.name = 'state'  # name for index

obj2
#state
#d    4
#b    7
#a   -5
#c    3
#Name: Inches of Rain, dtype: int64

index.is_unique

You can check if an index is unique.

df.index.is_unique  #True

index.intersection

s1 = Series([10, 20, 30], index=[1, 2, 3])
s2 = Series([40, 50, 60], index=[1, 2, 4])
temp = s1.index.intersection(s2.index)
temp
#Int64Index([1, 2], dtype='int64')

index.drop

You can drop entries from an axis.

obj = Series([4.5, 7.2, -5.3, 3.6], index= ['d', 'b', 'a', 'c'])
new_obj = obj.drop(['d', 'a'])
new_obj
#b   7.2
#c   3.6
#dtype: float64

reindex

Reindex creates a new object with the data conformed to the new index, even if some of the new index values were not present before. For a scalar value, you can specify a fill_value (e.g. 0); otherwise the default is NaN. You can also specify a method to fill (e.g. ffill to fill values forward). You can reindex the columns of a DataFrame as well (see the reindex (columns) example below).

reindex

obj = Series([4.5, 7.2, -5.3, 3.6], index= ['d', 'b', 'a', 'c'])

obj
#d   4.5
#b   7.2
#a  -5.3
#c   3.6
#dtype: float64

reindex (fill_value)

obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)
obj2
#a  -5.3
#b   7.2
#c   3.6
#d   4.5
#e     0
#dtype: float64

reindex (method)

obj3 = Series(['blue', 'green', 'yellow'], index=[0, 2, 4])
obj3.reindex(range(6), method='ffill')
#0      blue
#1      blue
#2     green
#3     green
#4    yellow
#5    yellow
#dtype: object
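
reindex (columns)

You can also reindex the columns of a DataFrame by passing the columns keyword; a small sketch with a throwaway frame:

frame = DataFrame(np.arange(9).reshape((3, 3)),
                  index=['a', 'c', 'd'],
                  columns=['Ohio', 'Texas', 'California'])
frame.reindex(columns=['Texas', 'Utah', 'California'])
#   Texas  Utah  California
#a      1   NaN           2
#c      4   NaN           5
#d      7   NaN           8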

dtype, astype

Pandas uses many of the same dtypes as NumPy, but there are additional types like dtype=category. You can also forcefully assign a type with the astype method.

s = Series(['a', 'b', 'c', 'd', 'a', 'b'], dtype='category')
s.dtypes # category

df = DataFrame({'A': ['a', 'b', 'c', 'a'],
                'B': ['z', 'z', 'y', 'd']})
df.dtypes
#A   object
#B   object

df['C'] = df['A'].astype('category')
df.dtypes
#A   object
#B   object
#C category

values

The values attribute returns the underlying data as an ndarray (1D for a Series, 2D for a DataFrame), with the dtype automatically chosen to fit the objects.

obj = Series([4, 7, -5, 3])  #Simplest Series is an array
obj.values  #array([ 4,  7, -5,  3])

Missing Values (isnull, notnull)

A NaN or NA is returned if a value cannot be found (i.e. a missing value). You can search your data for these missing values using isnull and notnull functions.

pd.isnull(obj4)
#California    True
#Ohio         False
#Oregon       False
#Texas        False
#dtype: bool

pd.notnull(obj4)
#California   False
#Ohio          True
#Oregon        True
#Texas         True
#dtype: bool

Replacing Values (replace)

You can replace one value with another value. replace() supports a lot of variations, including replacing a single value with a single value, a list with a single value, a list with a list, and a dict with a dict.

mydata = pd.Series([1., -999., 2., -999., -1000., 3.])
print mydata
#0       1
#1    -999
#2       2
#3    -999
#4   -1000
#5       3
#dtype: float64

# Replace value of x with value of y
mydata.replace(-999, np.nan)
#0       1
#1     NaN
#2       2
#3     NaN
#4   -1000
#5       3

# Replace list x with value of y
mydata.replace([-999, -1000], np.nan)
#0     1
#1   NaN
#2     2
#3   NaN
#4   NaN
#5     3
#dtype: float64

# Replace list of x with list of y
mydata.replace([-999, -1000], [np.nan, 0])
#0     1
#1   NaN
#2     2
#3   NaN
#4     0
#5     3
#dtype: float64

# Replace a dict of x with a dict of y
mydata.replace({-999: np.nan, -1000: 0})
#0     1
#1   NaN
#2     2
#3   NaN
#4     0
#5     3
#dtype: float64

Duplicated Values

To check if a DataFrame has duplicated values, you can use duplicated().

df = pd.DataFrame({
    'k1': ['one'] * 3 + ['two'] * 4,
    'k2': [1, 1, 2, 3, 3, 4, 4]
    })
df
#    k1  k2
#0  one   1
#1  one   1
#2  one   2
#3  two   3
#4  two   3
#5  two   4
#6  two   4

# Return a Series of whether the row is duplicated
df.duplicated()
#0    False
#1     True
#2    False
#3    False
#4     True
#5    False
#6     True
#dtype: bool

# Return a DataFrame of unique rows
df.drop_duplicates()
#    k1  k2
#0  one   1
#2  one   2
#3  two   3
#5  two   4

# You can specify what columns are used to determine if row is unique
df.drop_duplicates(['k1'])
#    k1  k2
#0  one   1
#3  two   3

DATETIME / TIMESERIES

Pandas really starts to show how powerful it is when working with DateTimes.

convert_objects and to_datetime

First you want to make sure that the data type is correct (i.e. it is actually a datetime). You can convert DataFrame objects a couple of different ways. There is a generic convert_objects that can also convert numbers; if you want a more specific time function, use to_datetime.

mydf = mydf.convert_objects(convert_dates='coerce')  # also convert_numeric, convert_timedeltas

mydf['SomeDateColumn'] = pd.to_datetime(mydf['SomeDateColumn'], coerce=True)

DatetimeIndex

To utilize the datetime specific methods, you need to set an index that is a DatetimeIndex.

df.set_index(pd.DatetimeIndex(df['someDate']), inplace=True)

df  # say this is: 'someDate', 'cookiesEaten', 'Notes'
#2009-02-16 23:26:40, 5, someotherstuff
#2009-02-16 23:45:43, 7, someotherstuff
#2009-02-17 22:05:31, 10, someotherstuff

Datetime Resample

You can now resample the counts of, say, 'cookiesEaten'.

df.resample('5Min', how='sum')  # default is mean

df.resample('D', how=np.max)
#2009-02-16, 7
#2009-02-17, 10

Datetime Offset

With the first argument of a resample, you can specify the offset alias. For example:
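
A sketch against the same df as above (assuming the DatetimeIndex and numeric 'cookiesEaten' column from the earlier example); a few common offset aliases:

df.resample('5Min', how='sum')  # 5-minute bins
df.resample('H', how='sum')     # hourly
df.resample('D', how='sum')     # calendar day
df.resample('W', how='sum')     # weekly
df.resample('M', how='sum')     # month end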

Datetime How

With the second argument of a resample, you can specify how. For example:
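
A sketch of a few how values against the same df (either a string name or a NumPy-style reducer works):

df.resample('D', how='sum')    # total per day
df.resample('D', how='mean')   # average per day
df.resample('D', how='count')  # number of observations per day
df.resample('D', how=np.max)   # a NumPy reducer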


NUMERICAL OPERATIONS

NumPy Operations and uFuncs

You can apply NumPy array operations (e.g. filtering, scalar multiplication, applying math functions) and uFuncs to our Series and DataFrames.

obj2  # Setup Series
#d  4
#b  7
#a -5
#c  3

obj2[obj2 > 0]  # Filtering
#d  4
#b  7
#c  3
#dtype: int64

obj2 * 2  # Scalar Multiplication
#d   8
#b  14
#a -10
#c   6
#dtype: int64

np.sqrt(obj2)  # Apply ufunc
#d  2.00000
#b  2.64751
#a      NaN
#c  1.73205
#dtype: float64

Basic Arithmetic

One of the advantages of pandas is that data is automatically aligned on the index. For example, even though obj4 has the index ‘California’ that obj3 does not have, we can still do arithmetic.

obj3
#Ohio    35000
#Oregon  16000
#Texas   71000
#Utah     5000
#dtype: int64

obj4
#California   NaN
#Ohio       35000
#Oregon     16000
#Texas      71000

obj3 + obj4
#California   NaN
#Ohio       70000
#Oregon     32000
#Texas     142000
#Utah         NaN
#dtype: float64

Arithmetic Methods

You can apply arithmetic methods between Series and DataFrame objects. These methods let you pass in additional arguments (like fill_value and axis).

add

Adding two DataFrames together results in NA values in the locations that do not overlap.

df1 = DataFrame(np.arange(12).reshape((3, 4)), columns=['a', 'b', 'c', 'd'])
df2 = DataFrame(np.arange(20).reshape((4,5)), columns=['a', 'b', 'c', 'd', 'e'])

df1
#   a  b   c   d
#0  0  1   2   3
#1  4  5   6   7
#2  8  9  10  11

df2
#    a   b   c   d   e
#0   0   1   2   3   4
#1   5   6   7   8   9
#2  10  11  12  13  14
#3  15  16  17  18  19

df1 + df2
#    a   b   c   d   e
#0   0   2   4   6 NaN
#1   9  11  13  15 NaN
#2  18  20  22  24 NaN
#3 NaN NaN NaN NaN NaN

df1.add(df2, fill_value=0)  # using method allows fill_value
#    a   b   c   d   e
#0   0   2   4   6   4
#1   9  11  13  15   9
#2  18  20  22  24  14
#3  15  16  17  18  19

FUNCTIONS, APPLY, AND MAP

You can apply NumPy's ufuncs (element-wise array methods) on Pandas objects. For example, we apply np.abs onto a DataFrame.

df = DataFrame(np.random.randn(4, 3), columns=['a', 'b', 'c'],
               index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df
#               a         b         c
#Utah    1.167988 -0.837221 -2.029371
#Ohio   -0.727655  0.319954 -1.632084
#Texas   0.752019 -0.417954 -1.139173
#Oregon  0.108305 -0.588943  0.073157

np.abs(df)
#               a         b         c
#Utah    1.167988  0.837221  2.029371
#Ohio    0.727655  0.319954  1.632084
#Texas   0.752019  0.417954  1.139173
#Oregon  0.108305  0.588943  0.073157

apply

Besides the ufuncs above, you can apply your own custom function to a column or row by using the DataFrame apply method. You do not have to return a scalar value; you can also return a Series with multiple values.

my_func = lambda x: x.max() - x.min()

df.apply(my_func, axis=0)  #Default; apply down each column (axis 0)
#a    1.895643
#b    1.157175
#c    2.102528
#dtype: float64

df.apply(my_func, axis=1)  #Apply across each row (axis 1)
#Utah      3.197359
#Ohio      1.952038
#Texas     1.891192
#Oregon    0.697248
#dtype: float64

# You do not have to return a scalar value
def my_func(x):
    return Series([x.min(), x.max()], index=['min', 'max'])
df.apply(my_func)
#            a         b         c
#min -0.727655 -0.837221 -2.029371
#max  1.167988  0.319954  0.073157

map, applymap

If you do not want to apply based on the axis (i.e. column or row), you can instead apply your own functions across every element. If this is a Series, then you use map. If this is a DataFrame, then use applymap.

map

#For Series, use map
my_format = lambda x: '%.2f' % x  #2 decimal places
df['c'].map(my_format)
#Utah      -2.03
#Ohio      -1.63
#Texas     -1.14
#Oregon     0.07
#Name: c, dtype: object

applymap

#For DataFrames, use applymap
my_format = lambda x: '%.2f' % x  #2 decimal places
df.applymap(my_format)
#            a      b      c
#Utah     1.17  -0.84  -2.03
#Ohio    -0.73   0.32  -1.63
#Texas    0.75  -0.42  -1.14
#Oregon   0.11  -0.59   0.07

Sorting

You may want to sort by some criteria, either the index or the values. To sort based on the index, use the sort_index method. To sort by value, use the order method for a Series; for a DataFrame, pass column name(s) to the by argument.

Sort and Order with Series

For a Series, use sort_index to sort by index and order to sort by value. You do not need to specify an axis since it is just an array. You can pass in additional parameters like a boolean value for ascending.

obj = Series(range(4), index=['d', 'a', 'b', 'c'])
obj
#d    0
#a    1
#b    2
#c    3
#dtype: int64    

obj.sort_index(ascending=True)  #sort by index (returns a new, sorted Series)
#a    1
#b    2
#c    3
#d    0
#dtype: int64

obj.order()  #sort by value
#d    0
#a    1
#b    2
#c    3
#dtype: int64

Sort and Order with DataFrame

For a DataFrame, use sort to sort by values or sort_index to sort by index (where you can specify which axis to sort). You can pass in additional parameters like a boolean value for ascending, what to do with NaN via na_position, or multiple columns to sort with by (see the na_position sketch after the examples below).

df = pd.DataFrame({'Song': ['Hey Jude', 'Help!', 'Do it', 'Let it be'],
                   'Weeks': [24, 7, 2, 90],
                   'Plays': [20, 10, 1, 1]}, index=[3, 2, 4, 1])
df
#   Plays       Song  Weeks
#3     20   Hey Jude     24
#2     10      Help!      7
#4      1      Do it      2
#1      1  Let it be     90

# Sort by Values
df.sort(['Plays', 'Weeks'], ascending=[True, True])
#   Plays       Song  Weeks
#4      1      Do it      2
#1      1  Let it be     90
#2     10      Help!      7
#3     20   Hey Jude     24

# Sort by Index
df.sort_index(axis=0, ascending=True)  # sort index on x axis (default)
#   Plays       Song  Weeks
#1      1  Let it be     90
#2     10      Help!      7
#3     20   Hey Jude     24
#4      1      Do it      2

df.sort_index(by=['Plays', 'Weeks'], ascending=[True, True])
#   Plays       Song  Weeks
#4      1      Do it      2
#1      1  Let it be     90
#2     10      Help!      7
#3     20   Hey Jude     24
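
For the na_position argument mentioned above, a minimal sketch with a throwaway frame containing a NaN (note that in newer pandas versions sort is spelled sort_values):

df_nan = pd.DataFrame({'Plays': [20, np.nan, 1],
                       'Song': ['Hey Jude', 'Help!', 'Do it']})
df_nan.sort_values(by='Plays', ascending=True, na_position='first')
#   Plays      Song
#1    NaN     Help!
#2    1.0     Do it
#0   20.0  Hey Jude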

Ranking

You can assign ranks from one through the number of valid data points. This is useful for statistical methods. By default, the rank method breaks ties by assigning each group the mean rank. You can specify additional parameters like ascending and method.

obj = Series([7, -5, 7, 4, 2, 0, 4])
obj
#0    7
#1   -5
#2    7
#3    4
#4    2
#5    0
#6    4

obj.rank(ascending=True)
#0    6.5
#1    1.0
#2    6.5
#3    4.5
#4    3.0
#5    2.0
#6    4.5
#dtype: float64

Categories

There is a pandas.Categorical object that you can assign to a Series or DataFrame. You pass in the categories as well as an optional ordered boolean. You can rename categories using the Categorical.rename_categories() method. To convert back to another type (say int), use the astype method (see the sketch after the rename example below).

my_choices = pd.Categorical(['Like', 'Like', 'Hate', 'Neutral'],
                                 categories=['Like', 'Neutral', 'Hate'],
                                 ordered=True)
my_choices
#[Like, Like, Hate, Neutral]
#Categories (3, object): [Like < Neutral < Hate]

categorical.add_categories(), categorical.remove_categories()

my_choices = my_choices.add_categories(['Really Hate', 'Really Like', 'NA'])
my_choices = my_choices.remove_categories(['NA'])
my_choices
#[Like, Like, Hate, Neutral]
#Categories (5, object): [Like < Neutral < Hate < Really Hate < Really Like]

categorical.reorder_categories()

my_choices = my_choices.reorder_categories(['Really Hate', 'Hate', 
                                            'Neutral', 'Like', 'Really Like'])
my_choices
#[Like, Like, Hate, Neutral]
#Categories (5, object): [Really Hate < Hate < Neutral < Like < Really Like]

categorical.rename_categories()

num_choices = my_choices.rename_categories([1, 2, 3, 4, 5])
num_choices
#[4, 4, 2, 3]
#Categories (5, int64): [1 < 2 < 3 < 4 < 5]
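
To get an ordinary integer Series back out (as mentioned above), one approach is to wrap the Categorical in a Series and call astype; a small sketch continuing from num_choices:

Series(num_choices).astype(int)
#0    4
#1    4
#2    2
#3    3
#dtype: int64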

Dummies

A lot of times in statistical analysis you will want to convert a categorical into dummy/indicator variables.

my_choices = pd.Categorical(['Like', 'Like', 'Hate', 'Neutral'],
                            categories=['Like', 'Neutral', 'Hate'],
                            ordered=True)    
pd.get_dummies(my_choices)
#   Like  Neutral  Hate
#0     1        0     0
#1     1        0     0
#2     0        0     1
#3     0        1     0

SUMMARY STATISTICS

Pandas has a lot of summary statistics as methods. They include:

sum

df = DataFrame({ 'Item': ['Hat', 'Sword', 'Armor', 'Shoes'],
             'New_Price': [7, 35, 35, 10],
             'Old_Price': [9, 30, 35, 15],
             'Available': [1, 1, np.nan, np.nan]})
df
#   Available   Item  New_Price  Old_Price
#0          1    Hat          7          9
#1          1  Sword         35         30
#2        NaN  Armor         35         35
#3        NaN  Shoes         10         15

df.sum(axis=0)  #default, sums up by columns
#Available                     2
#Item         HatSwordArmorShoes
#New_Price                    87
#Old_Price                    89
#dtype: object

df[['New_Price', 'Old_Price']].sum(axis=1)  #filter columns, sum rows
#0    16
#1    65
#2    70
#3    25

describe

Gives summary statistics about the dataset.

df.describe()
#       Available  New_Price  Old_Price
#count          2   4.000000   4.000000
#mean           1  21.750000  22.250000
#std            0  15.348724  12.257651
#min            1   7.000000   9.000000
#25%            1   9.250000  13.500000
#50%            1  22.500000  22.500000
#75%            1  35.000000  31.250000
#max            1  35.000000  35.000000

quantile

Gives the quantile for a value between 0 and 1.

df.quantile(.75)
#Available     1.00
#New_Price    35.00
#Old_Price    31.25
#dtype: float64

var

Get the variance.

df.var()
#Available      0.000000
#New_Price    235.583333
#Old_Price    150.250000
#dtype: float64

std

Get the standard deviation of the values.

df.std()
#Available     0.000000
#New_Price    15.348724
#Old_Price    12.257651
#dtype: float64

Correlation, Covariance

Correlation and covariance are computed from pairs of values. For correlation, use corr() (or the corrwith method to compute pairwise correlations against another Series or DataFrame); for covariance, use cov().

temp = Series([14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2])
ice_sales = Series([215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408])
df = DataFrame({ 'Temp': temp, 'Sales': ice_sales})

df.cov()  #covariance
#              Sales        Temp
#Sales  15886.810606  484.093182
#Temp     484.093182   16.089318

df.corr()  #correlation with all variables (i.e. correlation matrix)
#          Sales      Temp
#Sales  1.000000  0.957507
#Temp   0.957507  1.000000

df.corrwith(temp)  # correlations of each column with the Series temp
#Sales    0.957507
#Temp     1.000000
#dtype: float64

Unique Values, Value Counts, isin

In a one-dimensional Series, you can get the unique objects using the unique method. You can get the number of times an item occurs with the value_counts method. You can also see if a value is in a passed sequence with isin.

obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

obj.unique()
#array(['c', 'a', 'd', 'b'], dtype=object)

obj.value_counts()
#c    3
#a    3
#b    2
#d    1
#dtype: int64

obj.isin(['a','e'])  #usually can use this as a mask
#0    False
#1     True
#2    False
#3     True
#4     True
#5    False
#6    False
#7    False
#8    False

Missing Data

There are a few ways that you can deal with missing data, which appears as np.nan and Python's None data type. Methods include:

dropna

dropna removes any values that are missing or np.nan. You can specify how (e.g. how='all' drops only the rows whose values are all np.nan; by default a row is dropped if it has at least one np.nan or missing value). You can specify the axis with axis. You can also specify a thresh argument (i.e. keep a row only if it has at least that many non-missing observations).

#Series dropna
data = Series([1, np.nan, 3.5, np.nan, 7])
data.dropna()
#0    1.0
#2    3.5
#4    7.0

#DataFrame dropna
data = DataFrame([[1., 6.5, 3., 2.4, 3],
                 [1., np.nan, np.nan, 3, np.nan],
                 [np.nan, np.nan, np.nan, np.nan, np.nan],
                 [np.nan, 2.3, 4.3, 2.1, np.nan]])
data
#    0    1    2    3   4
#0   1  6.5  3.0  2.4   3
#1   1  NaN  NaN  3.0 NaN
#2 NaN  NaN  NaN  NaN NaN
#3 NaN  2.3  4.3  2.1 NaN

cleaned = data.dropna()
#   0    1  2    3  4
#0  1  6.5  3  2.4  3

cleaned = data.dropna(how='all')
#    0    1    2    3   4
#0   1  6.5  3.0  2.4   3
#1   1  NaN  NaN  3.0 NaN
#3 NaN  2.3  4.3  2.1 NaN

cleaned = data.dropna(axis=1)
#Empty DataFrame
#Columns: []
#Index: [0, 1, 2, 3]

data.dropna(thresh=2)  # needs at least 2 observations
#    0    1    2    3   4
#0   1  6.5  3.0  2.4   3
#1   1  NaN  NaN  3.0 NaN
#3 NaN  2.3  4.3  2.1 NaN

fillna

fillna lets you fill in missing values with something. You can specify the axis, method, value (i.e. a scalar or dict), limit (maximum number of consecutive periods to fill), and whether to do it inplace (see the method/limit sketch after the examples below).

data2 = DataFrame({ 'Month': np.arange(12),
                    'People': [1001, 1021, 1050, 1071, np.nan, np.nan, 
                    np.nan, 990, np.nan, 990, np.nan, 723],
                    'Cars': [np.nan, 432, np.nan, 322, 423, np.nan, 
                    924, 432, 468, 568, 433, 323]})
data2.fillna(0)  # fill np.nan with scalar value
#    Cars  Month  People
#0      0      0    1001
#1    432      1    1021
#2      0      2    1050
#3    322      3    1071
#4    423      4       0
#5      0      5       0
#6    924      6       0
#7    432      7     990
#8    468      8       0
#9    568      9     990
#10   433     10       0
#11   323     11     723

data2.fillna({'Cars': -1, 'People': -5})  # Specify dict for fill
#    Cars  Month  People
#0     -1      0    1001
#1    432      1    1021
#2     -1      2    1050
#3    322      3    1071
#4    423      4      -5
#5     -1      5      -5
#6    924      6      -5
#7    432      7     990
#8    468      8      -5
#9    568      9     990
#10   433     10      -5
#11   323     11     723
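
A sketch combining the method and limit arguments on the 'People' column from data2 (fill forward, at most one consecutive gap):

data2['People'].fillna(method='ffill', limit=1)
#0     1001
#1     1021
#2     1050
#3     1071
#4     1071
#5      NaN
#6      NaN
#7      990
#8      990
#9      990
#10     990
#11     723
#Name: People, dtype: float64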

Split-Apply-Combine (i.e. groupby)

split-apply-combine is a process for group operations. This involves:

  1. Take data in a pandas object (Series, DataFrame) and split it into groups based on one or more keys.
  2. A function is applied to each group, producing a new value (e.g. sum up the values from each group)
  3. The results of all these function applications are combined into a result object. The type of object returned depends on the previous operations.

groupby

groupby accomplishes the first step of the split-apply-combine method; it splits the data into multiple groups and returns a grouped object (which can be a SeriesGroupBy or DataFrameGroupBy). Say we want to split by a key (e.g. key1), take the value we are interested in (data1), apply a function (e.g. the mean method), and return this as a combined result.

df = DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                'key2': ['one', 'two', 'one', 'two', 'one'],
                'data1': np.random.randn(5),
                'data2': np.random.randn(5)})
df
#      data1     data2 key1 key2
#0 -0.238342 -0.592247    a  one
#1  0.187521  1.524641    a  two
#2 -1.926102 -1.110540    b  one
#3 -1.183664  0.211620    b  two
#4 -0.519853 -0.754792    a  one

#Step 1 of split-apply-combine; split data with key and value
grouped = df['data1'].groupby(df['key1'])
#<pandas.core.groupby.SeriesGroupBy object at 0x107638c90>

#Step 2 of split-apply-combine; apply a function
grouped.mean()
#key1
#a   -0.190224
#b   -1.554883
#Name: data1, dtype: float64

#Step 3 of split-apply-combine; result is a Series
type(grouped.mean())
#<class 'pandas.core.series.Series'>

Usually the data you work with is in the same dataframe so you can just specify the columns to pass in as the group keys. If you want to group by multiple levels, pass in a list.

df.groupby('key1').mean()  # single key
#         data1     data2
#key1
#a    -0.190224  0.059201
#b    -1.554883 -0.449460

df.groupby(['key1', 'key2']).mean()  # multiple keys
#key1 key2
#a    one  -0.379097 -0.673520
#     two   0.187521  1.524641
#b    one  -1.926102 -1.110540
#     two  -1.183664  0.211620

GroupBy using the index

Indexing a GroupBy object with a column name (or a list of names) is another way of saying "select these columns" for aggregation.

#SeriesGroupBy object
df.groupby('key1')['data2']  # df['data2'].groupby(df['key1'])
#<pandas.core.groupby.SeriesGroupBy object at 0x1076815d0>

#DataFrameGroupBy object
df.groupby('key1')[['data2']]  # df[['data2']].groupby(df['key1'])
#<pandas.core.groupby.DataFrameGroupBy object at 0x107666490>

#Get the means for just 'data2' column grouped by the keys, calculate mean
df.groupby(['key1', 'key2'])[['data2']].mean()
#              data2
#key1 key2
#a    one  -0.673520
#     two   1.524641
#b    one  -1.110540
#     two   0.211620

size

With a GroupBy object, you can count the number of occurrences using size, which is useful for categoricals.

df.groupby(['key1', 'key2']).size()  # count number of occurrences
#key1  key2
#a     one     2
#      two     1
#b     one     1
#      two     1
#dtype: int64

get_group

With a GroupBy object, you can select a specific group using get_group.

# One Key
grouped = df.groupby('key1')
grouped.get_group('a')
#      data1     data2 key1 key2
#0 -1.299062  0.214834    a  one
#1  1.624373  0.301559    a  two
#4 -1.427448 -1.142078    a  one

# Multiple Keys
grouped = df.groupby(['key1', 'key2'])
grouped.get_group(('a', 'one'))
#      data1     data2 key1 key2
#0 -1.299062  0.214834    a  one
#4 -1.427448 -1.142078    a  one

aggregate function

With a GroupBy object, you can get the aggregation with aggregate. If you need more complex aggregations, go see apply.

grouped = df.groupby('key1')
grouped.aggregate(np.sum)
#         data1     data2
#key1
#a    -1.102138 -0.625684
#b    -1.694854 -1.843882

Filtration

You can filter and get a subset of an original object using filter (e.g. we want to get the elements that belong to a group with a group sum greater than 2).

#Filtering a simple Series
sf = Series([1, 1, 2, 3, 3, 3])
sf.groupby(sf).filter(lambda x: x.sum() > 2)
#3    3
#4    3
#5    3
#dtype: int64

#Filtering a DataFrame
df = DataFrame({'A': np.arange(8), 
                'B': list('aabbbbcc'),
                'C': np.arange(8, 0, -1)})
df
#   A  B  C
#0  0  a  8
#1  1  a  7
#2  2  b  6
#3  3  b  5
#4  4  b  4
#5  5  b  3
#6  6  c  2
#7  7  c  1

#Group by col 'B' and only return values where there is a count of more than 2
df.groupby('B').filter(lambda x: len(x['C']) > 2)
#   A  B  C
#2  2  b  6
#3  3  b  5
#4  4  b  4
#5  5  b  3

Sort

You can chain the sort method onto the result to sort it.

df.groupby('B').filter(lambda x: len(x['C']) > 2).sort('C')
#   A  B  C
#5  5  b  3
#4  4  b  4
#3  3  b  5
#2  2  b  6

GroupBy iteration

You can iterate over a GroupBy object to return a sequence of 2-tuples containing the group name along with the chunk of data.

#Iterate with a single key
for name, group in df.groupby('key1'):
    print name  #a, b
    print group  #data1, data2, key1, key2

#a
#      data1     data2 key1 key2
#0 -0.238342 -0.592247    a  one
#1  0.187521  1.524641    a  two
#4 -0.519853 -0.754792    a  one
#b
#      data1    data2 key1 key2
#2 -1.926102 -1.11054    b  one
#3 -1.183664  0.21162    b  two

#Iterate with multiple keys
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print k1, k2  #(a, one), (a, two), (b, one), (b, two)
    print group  #data1, data2, key1, key2
#a one
#      data1     data2 key1 key2
#0 -0.238342 -0.592247    a  one
#4 -0.519853 -0.754792    a  one
#a two
#      data1     data2 key1 key2
#1  0.187521  1.524641    a  two
#b one
#      data1    data2 key1 key2
#2 -1.926102 -1.11054    b  one
#b two
#      data1    data2 key1 key2
#3 -1.183664  0.21162    b  two

#Compute a dict with the pieces of data
pieces = dict(list(df.groupby('key1')))
pieces['b']
#      data1    data2 key1 key2
#2 -1.926102 -1.11054    b  one
#3 -1.183664  0.21162    b  two
type(pieces['b'])
#<class 'pandas.core.frame.DataFrame'>

Reshaping

At times you want to reshape your data (e.g. from wide to long or vice versa). Pandas has a few tools for this, including melt, pivot, pivot_table, and crosstab.

melt

Melt ‘unpivots’ a table from wide to long format. You have 1+ variables as identifiers (id_vars) and the remaining fields fall into two variables: variable and value.

cheese = pd.DataFrame({'first_name': ['Will', 'Laura', 'Mike', 'Mandy'],
                       'last_name': ['Liu', 'Summers', 'Liu', 'Summers'],
                       'height': [5.10, 5.3, 5.9, 5.2],
                       'weight': [150, 120, 190, 110]})

print cheese
#   first_name  height last_name  weight
#0        Will     5.1       Liu     150
#1       Laura     5.3   Summers     120
#2        Mike     5.9       Liu     190
#3       Mandy     5.2   Summers     110

# Melt and specify variable name
print pd.melt(cheese, id_vars=['first_name', 'last_name'])
#   first_name  last_name  variable  value
#0        Will        Liu    height    5.1
#1       Laura    Summers    height    5.3
#2        Mike        Liu    height    5.9
#3       Mandy    Summers    height    5.2
#4        Will        Liu    weight    150
#5       Laura    Summers    weight    120
#6        Mike        Liu    weight    190
#7       Mandy    Summers    weight    110

# Filter for a specific variable
print pd.melt(cheese, id_vars=['first_name', 'last_name'],
              var_name='quantity', value_vars='height')
#   first_name  last_name  quantity  value
#0        Will        Liu    height    5.1
#1       Laura    Summers    height    5.3
#2        Mike        Liu    height    5.9
#3       Mandy    Summers    height    5.2    

pivot

You use pivot for transforming long to wide on categorical values or numerical values. This is the opposite of melt.

df = pd.DataFrame({
    'id': [25418726, 25418726, 25418731, 25418731, 25418740],
    'question': ['A', 'B', 'A', 'B', 'A'],
    'answer': ['V', 'W', 'X', 'Y', 'Z']
})

print df
#  answer        id question
#0      V  25418726        A
#1      W  25418726        B
#2      X  25418731        A
#3      Y  25418731        B
#4      Z  25418740        A

temp = df.pivot(index='id', columns='question', values='answer')
print temp
#question  A    B
#id
#25418726  V    W
#25418731  X    Y
#25418740  Z  NaN

pivot_table

You use pivot tables to transform data from long to wide (creating a spreadsheet-like dataset). The main advantage of pivot_table over pivot is that it can aggregate duplicate entries with an aggfunc (e.g. sum, mean), which requires the values to be numerical.

print pd.pivot_table(cheese, values='height', index=['last_name', 'first_name'])
#last_name  first_name
#Liu        Mike          5.9
#           Will          5.1
#Summers    Laura         5.3
#           Mandy         5.2
#Name: height, dtype: float64

print pd.pivot_table(cheese, values='height', index=['last_name'], aggfunc='mean')
#last_name
#Liu          5.50
#Summers      5.25
#Name: height, dtype: float64

crosstab

A crosstab creates a cross-tabulation of two or more factors, usually giving a frequency table of the factors.

pd.crosstab(cheese.last_name, cheese.first_name)
#first_name  Laura  Mandy  Mike  Will
#last_name
#Liu             0      0     1     1
#Summers         1      1     0     0

Concatenate, Merge, Combine

You can combine Series and DataFrame objects together using various kinds of logic. You can pass in different objects including dicts, lists, Series, and DataFrames.

Concatenate

This operation concatenates (i.e. stacks together) objects along an axis and does optional logic (axis, join, join_axes, keys). The default concatenation axis is axis=0 and the default join is 'outer'.

Concat with Series

series1 = pd.Series([0, 1], index=['a', 'b'])
series2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
series3 = pd.Series([5, 6], index=['f', 'g'])

# Concat for a list of Series links the values and indexes together
pd.concat([series1, series2, series3])
#a    0
#b    1
#c    2
#d    3
#e    4
#f    5
#g    6
#dtype: int64

# Default concat is along axis=0
series4 = pd.concat([series1 * 5, series3])
#a    0
#b    5
#f    5
#g    6
#dtype: int64

# Concat along axis=1, series1 has no 'f', 'g' indexes so returns a NaN
pd.concat([series1, series4], axis=1)
#    0  1
#a   0  0
#b   1  5
#f NaN  5
#g NaN  6

# Can specify how to join
pd.concat([series1, series4], axis=1, join='inner')
#   0  1
#a  0  0
#b  1  5

# Can specify the join_axes
pd.concat([series1, series4], axis=1, join_axes=[['a', 'c', 'b', 'e']])
#    0   1
#a   0   0
#c NaN NaN
#b   1   5
#e NaN NaN

Concat with DataFrame

df1 = pd.DataFrame(np.arange(6).reshape(3, 2), \
                   index=['a', 'b', 'c'],
                   columns=['one', 'two'])
df1
#   one  two
#a    0    1
#b    2    3
#c    4    5

df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), \
                   index=['a', 'c'],
                   columns=['three', 'four'])
df2
#    three  four
#a      5     6
#c      7     8

# Using list of objects and axis=0
pd.concat([df1, df2], axis=0)  # default join='outer'
#   four  one  three  two
#a   NaN    0    NaN    1
#b   NaN    2    NaN    3
#c   NaN    4    NaN    5
#a     6  NaN      5  NaN
#c     8  NaN      7  NaN

# Using list of objects and axis=1
pd.concat([df1, df2], axis=1, join='inner')
#   one  two  three  four
#a    0    1      5     6
#c    4    5      7     8

Append

Append is just a shortcut to concatenate (i.e. shorter, with more assumptions built in). An append concatenates along axis=0 using the index.

# Append with Series
s = pd.Series(np.random.randn(10), index=np.arange(10))
s1 = s[:5]
s2 = s[6:]
s1.append(s2)

# Append with DataFrame
df1.append(df2)
#   four  one  three  two
#a   NaN    0    NaN    1
#b   NaN    2    NaN    3
#c   NaN    4    NaN    5
#a     6  NaN      5  NaN
#c     8  NaN      7  NaN

# Append can take multiple items
df1.append([df2, df1])
#   four  one  three  two
#a   NaN    0    NaN    1
#b   NaN    2    NaN    3
#c   NaN    4    NaN    5
#a     6  NaN      5  NaN
#c     8  NaN      7  NaN
#a   NaN    0    NaN    1
#b   NaN    2    NaN    3
#c   NaN    4    NaN    5

Merge

Merge allows you to connect rows in DataFrames based on either a key and/or the index. This is similar to database join operations. There is also a join method that performs a similar index-based join (see the sketch after the merge examples below). You can specify the type of join with how.

df1 = pd.DataFrame({'key':['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1':range(7)})
df1
#   data1 key
#0      0   b
#1      1   b
#2      2   a
#3      3   c
#4      4   a
#5      5   a
#6      6   b

df2 = pd.DataFrame({'key':['a', 'b', 'd'],
                    'data2':range(3)})
df2
#   data2 key
#0      0   a
#1      1   b
#2      2   d

#### Merge on Key
pd.merge(df1, df2, on='key', how='inner')
#   data1 key  data2
#0      0   b      1
#1      1   b      1
#2      6   b      1
#3      2   a      0
#4      4   a      0
#5      5   a      0

#### Merge on Index
pd.merge(df1, df2, left_index=True, right_index=True, how='inner')
#   data1 key_x  data2 key_y
#0      0     b      0     a
#1      1     b      1     b
#2      2     a      2     d

#### Merge on Key and Index
pd.merge(df1, df2, left_index=True, right_on='data2', how='inner')
#   data1 key_x  data2 key_y
#0      0     b      0     a
#1      1     b      1     b
#2      2     a      2     d
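
The index-based join method mentioned above looks like this; a minimal sketch with df1 and df2 from above (lsuffix/rsuffix are needed because both frames have a 'key' column):

df1.join(df2, how='inner', lsuffix='_x', rsuffix='_y')
#   data1 key_x  data2 key_y
#0      0     b      0     a
#1      1     b      1     b
#2      2     a      2     d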

Combine First

combine_first combines two DataFrame objects, preferring non-null values from the frame calling the method. The resulting index and columns are the union of the respective indexes and columns (i.e. df1 is prioritized, and df2's values are used to fill in any missing holes).

df1 = pd.DataFrame({
    'a': [1., np.nan, 5., np.nan],
    'b': [np.nan, 2., np.nan, 6.],
    'c': np.arange(2, 18, 4)})
df1
#    a   b   c
#0   1 NaN   2
#1 NaN   2   6
#2   5 NaN  10
#3 NaN   6  14

df2 = pd.DataFrame({
    'a': [5., 4., np.nan, 3., 7.],
    'b': [np.nan, 3., 4., 6., 8.]})
df2
#    a   b
#0   5 NaN
#1   4   3
#2 NaN   4
#3   3   6
#4   7   8

df1.combine_first(df2)
#   a   b   c
#0  1 NaN   2
#1  4   2   6
#2  5   4  10
#3  3   6  14
#4  7   8 NaN