William Liu

Learning Statistics with R


##Table of Contents

##Summary

R is an open source statistical computing program for data analysis.

##Setup

Download R and the RStudio IDE.

####Install Packages

You can expand R’s capabilities by installing packages from CRAN (the Comprehensive R Archive Network) or from other sources such as GitHub. There are a few different ways to install a package from CRAN:

Install a single package (reinstalls every time the line runs)

install.packages("ggplot2", dependencies = TRUE)  # installs package 'ggplot2'
library(ggplot2)  # install once; afterwards just load it with library()

Install a single package (checks whether it is already installed)

if (!require("ggplot2")) {
    install.packages("ggplot2", dependencies=TRUE, repos="http://cran.rstudio.com/")
    library("ggplot2")
}

Install a list of packages using a custom function

install <- function(pkg){
  new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
  if(length(new.pkg))
      install.packages(new.pkg, dependencies=TRUE, repos="http://cran.rstudio.com/")
  sapply(pkg, require, character.only=TRUE)
}
packages <- c("RODBC", "plyr", "knitr", "ggplot2")
install(packages)

Install a list of packages using the pacman package

install.packages("pacman", repos="http://cran.rstudio.com")
pacman::p_load(RODBC, plyr, knitr, ggplot2)  # so meta!

####Common Packages

Some common packages include:

####Setup a Working Directory

You can set a working directory, i.e. the location R looks in when you try to open a file by a relative path.

getwd()  # Gets the current working directory
setwd("C:/Users/wliu/Desktop")  # Sets the working directory (example path)

####Help

R has built-in help that works well in RStudio by showing you how to use specific functions.

help(myfunction)
?myfunction  # shorthand for help(myfunction)

##R Data Frames

As a bit of background, R is made up of objects and functions. Here’s an example of creating an object (my_people) and assigning it several names by passing the strings ("Will", "Laura", "Mike", "Roger") to the combine function c().

my_people<-c("Will", "Laura", "Mike", "Roger")  # variable can hold strings
my_age<-c(30, 26, 33, 27)  # variable can hold numbers (no quotes)

####Functions for Loading and Saving Files

Example functions for loading and saving data are read.csv() and write.csv(). There are also other variations, including write.table().

mydata = read.csv(file="C:\\Users\\wliu\\Desktop\\myfile.csv")  # read file
write.csv(my_dataframe, "myfile.csv", row.names=FALSE)  # write file; write.csv() always uses "," so sep is not needed
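A round trip through a file shows how the two functions fit together. This is a minimal sketch: the data frame reuses the names and ages from this guide, and a temporary file (an assumption, so nothing on disk is overwritten) stands in for a real path.

```r
# Build a small data frame to save
family <- data.frame(Name = c("Will", "Laura", "Mike", "Roger"),
                     Age = c(30, 26, 33, 27))

# Write to a temporary file so no existing file is overwritten
csv_path <- tempfile(fileext = ".csv")
write.csv(family, csv_path, row.names = FALSE)

# Read it back; the first row of the file supplies the column names
family_copy <- read.csv(csv_path)
str(family_copy)
```

Setting row.names = FALSE keeps R's row numbers out of the file, so reading it back does not produce an extra column.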

####Data Frames

Dataframes are specific objects that contain variables, like worksheets in Excel.

family<-data.frame(Name=my_people, Age=my_age)  # dataframe w/ 2 variables
family  # display contents of the entire dataframe
#    Name Age
# 1  Will  30
# 2 Laura  26
# 3  Mike  33
# 4 Roger  27
family$Name  # can reference dataframe by variable (e.g. Name)
names(family)  # can list the variables in the dataframe  # Name, Age
new_family <- family[2,c("Name")]  # rows, columns
new_family  # Laura

####list(), cbind()

The list() and cbind() functions can be used to combine variables (instead of dataframes).

family<-list(my_people, my_age)  # create one list holding both vectors
# [[1]]
# [1] "Will"  "Laura" "Mike"  "Roger"
# [[2]]
# [1] 30 26 33 27

family<-cbind(my_people, my_age)  # paste columns of data together
#      my_people my_age
# [1,] "Will"    "30"    # Notice cbind() converts the numbers to strings
# [2,] "Laura"   "26"    # cbind() is good only for combining same data types
# [3,] "Mike"    "33"
# [4,] "Roger"   "27"
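The difference between the three containers is easiest to see by checking the types after combining. The sketch below uses the same two vectors; the object names as_list, as_matrix, and as_frame are made up for illustration.

```r
my_people <- c("Will", "Laura", "Mike", "Roger")
my_age <- c(30, 26, 33, 27)

as_list <- list(my_people, my_age)                      # one list, two elements
as_matrix <- cbind(my_people, my_age)                   # one character matrix
as_frame <- data.frame(Name = my_people, Age = my_age)  # one data frame

is.numeric(as_list[[2]])   # TRUE: a list keeps each element's type
is.character(as_matrix)    # TRUE: cbind() coerced the ages to strings
is.numeric(as_frame$Age)   # TRUE: a data frame keeps per-column types
mean(as_frame$Age)         # 29
```

Because the matrix holds strings, calling mean() on its age column would not work without converting back with as.numeric() first.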

####Common Data Functions

####Creating a Custom Function

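nameOfFunction <- function(input1, input2)
{
    output <- input1 + input2
    cat("Output is:", output, "\n")  # print the sum followed by a newline
    invisible(output)  # also return the sum so it can be reused
}
nameOfFunction(2, 3)  # prints: Output is: 5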

####Data Formatting (wide and long/molten)

In the wide format, each row holds the data from one entity and each column represents a variable. Independent and dependent variables are treated alike: each gets its own column.

# Wide Format Example
# person, gender, happy_base, happy_6_months, happy_1_year
# Will,  Male, 2, 3, 4
# Laura, Female, 5, 6, 7

In the long (or molten) format, scores on different variables (here, happiness over time) are placed in a single column, with a second column identifying which variable each score belongs to.

# Long or Molten Format Example
# person, gender, variable, value
# Will,  Male, happy_base, 2
# Will,  Male, happy_6_months, 3
# Will,  Male, happy_1_year, 4
# Laura, Female, happy_base, 5
# Laura, Female, happy_6_months, 6
# Laura, Female, happy_1_year, 7

####Data Formatting with melt() and cast()

To reshape the data between wide and long formats, we can use melt() (wide to long) and cast() (long to wide) from the reshape package.
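As a rough sketch of the round trip, the happiness data from the wide format example can be melted and cast back. This assumes the reshape package is installed; the object names wide, long, and wide_again are made up for illustration.

```r
library(reshape)  # assumes the 'reshape' package is installed

# Wide format: one row per person (values from the wide format example)
wide <- data.frame(person = c("Will", "Laura"),
                   gender = c("Male", "Female"),
                   happy_base = c(2, 5),
                   happy_6_months = c(3, 6),
                   happy_1_year = c(4, 7))

# Wide -> long: keep person and gender as identifiers, stack the scores
long <- melt(wide, id.vars = c("person", "gender"),
             measure.vars = c("happy_base", "happy_6_months", "happy_1_year"))

# Long -> wide: one row per person/gender, one column per level of 'variable'
wide_again <- cast(long, person + gender ~ variable, value = "value")

long
wide_again
```

The formula in cast() reads as "rows ~ columns": everything left of the tilde identifies a row, and the levels of variable become the new columns.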

####Data Filtering with by() and subset()

To separate into different groups, we can use by() and subset().
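A small base R sketch of both functions follows. The Gender column is an assumption added for illustration (only Will and Laura have a gender stated elsewhere in this guide), and over_26 and mean_age are made-up names.

```r
family <- data.frame(Name = c("Will", "Laura", "Mike", "Roger"),
                     Age = c(30, 26, 33, 27),
                     Gender = c("Male", "Female", "Male", "Male"))  # assumed values

# subset(): keep the rows that match a condition, optionally selecting columns
over_26 <- subset(family, Age > 26, select = c(Name, Age))
over_26

# by(): split a variable by a grouping variable and apply a function to each group
mean_age <- by(family$Age, family$Gender, mean)
mean_age   # Female: 26, Male: 30
```

subset() returns a smaller data frame, while by() returns one result per group, which makes it handy for quick group summaries.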

####Factor (aka coding variable, grouping variable)

Factors are variables that take on a limited number of different values (i.e. categorical variables) and are helpful in statistical modeling. Use the functions factor() (declares that the data is categorical) and levels() (shows the different categories).

data = c(1,2,2,3,1,2,3,3,1,2,3,3,1)
fdata = factor(data, levels=c(1,2,3), labels=c("x","y","z"))  # Levels: x y z

my_months = c("January","February","March",
          "April","May","June","July","August","September",
          "October","November","December")
ordered_months = factor(my_months,levels=c("January","February","March",
                        "April","May","June","July","August","September",
                        "October","November","December"),ordered=TRUE)
ordered_months  # Levels: January < February < March < April < May < June < July < August < September < October < November < December
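To inspect a factor once it exists, levels() and table() are the usual starting points. The sketch below rebuilds the two factors; it uses R's built-in month.name constant in place of typing the months out, which is a shortcut not used in the original code.

```r
data <- c(1, 2, 2, 3, 1, 2, 3, 3, 1, 2, 3, 3, 1)
fdata <- factor(data, levels = c(1, 2, 3), labels = c("x", "y", "z"))

levels(fdata)   # "x" "y" "z"
table(fdata)    # counts per category: x = 4, y = 4, z = 5

# An ordered factor supports comparisons between its values
ordered_months <- factor(month.name, levels = month.name, ordered = TRUE)
ordered_months[3] < ordered_months[10]   # TRUE: March comes before October
```

Without ordered=TRUE the comparison would return NA with a warning, because an unordered factor has no notion of "less than".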

##Plot Graphs with ggplot2

You can do a quick plot using qplot() (quick and easy) or build a plot layer by layer using ggplot() (for more detailed plots). Plots are built up as ggplot(myData, aes(variable for x, variable for y)) + one or more geom layers (e.g. geom_point()) + theme(). Older code also uses opts(), which has since been deprecated in favor of theme(). For full documentation, see: http://docs.ggplot2.org/
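A minimal layered plot might look like the sketch below. It assumes ggplot2 is installed and reuses the names and ages from this guide; the choice of a bar chart and of theme_minimal() is only for illustration.

```r
library(ggplot2)  # assumes ggplot2 is installed

family <- data.frame(Name = c("Will", "Laura", "Mike", "Roger"),
                     Age = c(30, 26, 33, 27))

p <- ggplot(family, aes(x = Name, y = Age)) +   # data and aesthetic mapping
  geom_col() +                                   # geometry layer: one bar per person
  labs(title = "Age by person", x = "Name", y = "Age") +
  theme_minimal()                                # overall look of the plot

print(p)  # draws the plot
```

Each `+` adds one layer or setting, so swapping geom_col() for another geom changes the chart type without touching the rest.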

##Statistics

R really shines when it comes to statistical packages. Here are a few ways to run some statistical tests. We won’t get into all the details of statistics here; just know there’s probably an R package for what you need. See the statistics analysis guide for more details on which tests to use.

##Check for a normal distribution

The Shapiro-Wilk test is used to test for normality (aka normal distribution). A small p-value suggests the data depart from a normal distribution.

shapiro.test(variable)
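To see how the result reads in practice, the test can be run on simulated data. This is only a sketch: the two samples are randomly generated (an assumption for illustration), so exact p-values depend on the seed and are not shown.

```r
set.seed(42)  # make the simulated data reproducible

normal_scores <- rnorm(100, mean = 50, sd = 10)  # drawn from a normal distribution
skewed_scores <- rexp(100, rate = 1)             # drawn from a strongly skewed distribution

normal_result <- shapiro.test(normal_scores)
skewed_result <- shapiro.test(skewed_scores)

normal_result$p.value  # usually large: no evidence against normality
skewed_result$p.value  # very small: normality is doubtful
```

shapiro.test() returns a list, so the W statistic and p-value can be pulled out with $statistic and $p.value rather than read off the printed output.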

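A Q-Q plot (aka Quantile-Quantile plot) is a visual way to check for normality (aka normal distribution): points from normally distributed data fall close to a straight line.

ggplot(rexam, aes(sample = exam)) + stat_qq()  # rexam is a data frame with an exam column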

Levene’s Test tests the null hypothesis that the variances in different groups are equal (i.e. the difference between the variances is zero). It is available in the car package. If Levene’s Test is significant (i.e. p < .05), we reject the null hypothesis and conclude that the assumption of homogeneity of variances is violated. If Levene’s Test is non-significant (i.e. p >= .05), there is no evidence that the variances differ, and the assumption is treated as tenable.

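library(car)  # leveneTest() comes from the car package
leveneTest(outcome_variable, group, center = median)  # or center = mean
# where outcome_variable is the variable whose variances we want to compare
# group is a factor that defines the groups
# center can be median (the default) or mean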

##Learning Statistics with R

###Why do we learn Statistics

It is hard to evaluate evidence impartially and without pre-existing biases. The belief bias effect is the tendency to judge the strength of an argument by the plausibility of its conclusion rather than by how strongly the argument supports that conclusion. A person is more likely to accept an argument whose conclusion aligns with their values, beliefs, and prior knowledge, and to reject counterarguments to that conclusion.

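Example of a logically unsound argument with a believable conclusion: all birds can fly, pigeons can fly, therefore pigeons are birds. The conclusion happens to be true, but it does not follow from the premises (and not all birds can fly); belief bias tempts us to accept the argument anyway.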

Simpson’s paradox (aka Simpson’s reversal, Yule-Simpson effect, amalgamation paradox, reversal paradox) is a phenomenon in which a trend appears in several groups of data but disappears or reverses when the groups are combined. The best known example is the UC Berkeley gender bias case in graduate school admissions. The aggregate figures suggested that men who applied were more likely than women to be admitted. Once the departments being applied to were taken into account (different departments have different rejection rates), it turned out that women tended to apply to more competitive departments with lower rates of admission, whereas men tended to apply to less competitive departments with higher rates of admission. Examined department by department, the data showed a small but statistically significant bias in favor of women.

Another example is batting averages (David has the higher average in each year, but when the two years are combined, Derek’s average is higher)

Batter:             1995            1996            Combined
Derek Jeter         12/48   = .250  183/582 = .314  195/630 = .310
David Justice       104/411 = .253  45/140  = .321  149/551 = .270
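The reversal can be checked by recomputing the averages from the hits and at-bats in the table. This is a small sketch in base R; the matrix names hits and at_bats are made up for illustration.

```r
# Hits and at-bats taken from the table above
hits    <- rbind(Derek = c(`1995` = 12,  `1996` = 183),
                 David = c(`1995` = 104, `1996` = 45))
at_bats <- rbind(Derek = c(`1995` = 48,  `1996` = 582),
                 David = c(`1995` = 411, `1996` = 140))

yearly   <- round(hits / at_bats, 3)                    # average within each year
combined <- round(rowSums(hits) / rowSums(at_bats), 3)  # average over both years

yearly     # David is higher in both 1995 and 1996
combined   # Derek is higher once the years are pooled (0.310 vs 0.270)
```

The reversal comes from the unequal weights: most of Derek's at-bats fall in 1996, when both players hit well, while most of David's fall in 1995, when both hit poorly.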

###Introduction to Research Design

Data collection can be thought of as a kind of measurement. Operationalisation is the process by which we take a meaningful but somewhat vague concept and turn it into a precise measurement. We have the following components:

Variables

Scale

A nominal scale variable (aka categorical variable) has no particular relationship between its possible values. An example is ‘eye colour’ (e.g. blue, green, brown): none is better than another.

An ordinal scale variable has more structure than a nominal one: the values have a natural order, but the distances between them are unknown (e.g. you know who finished first in a race, but not by how much).

An interval scale variable is one where differences between numerical values are meaningful, but there isn’t a natural ‘zero’ value. E.g. it was 15 degrees yesterday and 18 degrees today, so today is 3 degrees warmer. A temperature of 0 degrees Celsius does not mean there is no temperature (it is the temperature at which water freezes). Another example is years (you can say something happened 5 years later, but saying it happened “1.0024 times later” does not make sense).

A ratio scale variable is one where zero really means zero and it’s okay to multiply and divide, e.g. someone’s response time.
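The difference between interval and ratio scales can be checked with a little arithmetic. In this sketch the temperatures are the ones from the example (taken to be in degrees Celsius), and the response times are hypothetical values chosen for illustration.

```r
# Interval scale: temperatures in degrees Celsius
yesterday_c <- 15
today_c <- 18

today_c - yesterday_c    # 3: differences are meaningful
today_c / yesterday_c    # 1.2, but "20% hotter" is not a meaningful claim...

# ...because the ratio changes when the arbitrary zero point moves (Kelvin)
yesterday_k <- yesterday_c + 273.15
today_k <- today_c + 273.15
today_k - yesterday_k    # still 3
today_k / yesterday_k    # about 1.01, not 1.2

# Ratio scale: response times in seconds (hypothetical values)
rt_a <- 0.6
rt_b <- 1.2
rt_b / rt_a              # 2: "twice as slow" is a meaningful statement
```

The difference survives the change of units while the ratio does not, which is why ratios are only meaningful when zero is a true zero.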

Continuous vs Discrete Variables

A continuous variable is one where, for any two values you can think of, it’s always logically possible to have another value in between.

A discrete variable is a variable that is not continuous (sometimes there’s nothing in between two values).

            continuous      discrete
nominal                     X
ordinal                     X
interval    X               X
ratio       X               X

Is the measurement any good?

Reliability of a measure tells you how precisely you are measuring something. How repeatable or consistent are your measurements? E.g. your ‘bathroom scale’ is very reliable.

Validity of a measure tells you how accurate the measure is (internal validity and external validity are most important)

Confounds and Artifacts

Threats to validity are confounds, artifacts, and history effects:

Independent Variable and Dependent Variable

We try to use X (the predictors) to make guesses about Y (the outcomes).

|role of the variable    |   classical name         |   modern name     |
|------------------------|--------------------------|-------------------|
|to be explained         | dependent variable (DV)  |  outcome          |
|to do the explaining    | independent variable (IV)|  predictor        |

Experimental Research vs Non-experimental Research

In Experimental Research the researcher controls all aspects of the study (manipulating the predictor variables and then allowing the outcome variable to vary naturally).

In Non-experimental Research the researcher does not have quite as much control.