William Liu

Statistical Analysis Primer



Overview

Welcome! This is a short onboarding course for new data analysts. I’ll cover the basics of statistics; no prior knowledge is necessary.

What data are we measuring?

Independent and Dependent Variables

Data is made up of variables. A variable can take on different values (such as a category or a number); y = 10 says that the variable ‘y’ has the value 10. There are two types of variables:

Categorical and Continuous Variables

There are different levels of measurement used to categorize and quantify variables. Variables can be categorized as ‘categorical’ or ‘continuous’, and quantified at different levels of measurement. As we go down the list, the measurements become more detailed and more useful for statistical analysis.

Reliability

Validity

Validity is whether an instrument actually measures what it sets out to measure (e.g. does a scale actually measure my weight?). Validity is usually divided into three forms: criterion, content, and construct validity.

How are we measuring data?

Types of Research Methods

We’re interested in correlation as well as causality (cause and effect). To test a hypothesis, we can do the following types of research.

Descriptive Statistics

Frequency Distribution (aka Histogram)

A count of how many times different values occur. The two main ways a distribution can deviate from normal are skew and kurtosis.
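
As a quick illustration (a minimal sketch in R with simulated, made-up scores), a histogram shows the frequency distribution and makes skew or heavy tails easy to spot:

```r
# Simulate 1000 roughly normal exam scores and plot their frequency distribution
set.seed(42)
scores <- rnorm(1000, mean = 65, sd = 10)
hist(scores, breaks = 20,
     main = "Frequency distribution of scores",
     xlab = "Score", ylab = "Frequency")
```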

Mode, Median, Mean

Used in a frequency distribution to describe central tendency.

Quantiles, Quartiles, Percentages

Dispersion

From Frequency to Probability

So what does the frequency tell us? Instead of thinking of it as the frequency with which values occur, think of it as how likely a value is to occur (i.e. probability). For any distribution of scores, we can calculate the probability of obtaining a given value.
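
For example, if scores are roughly normally distributed, base R’s pnorm() gives the probability of obtaining a value at or below a given score (a minimal sketch; the numbers are arbitrary):

```r
# Probability of a score of 1.96 or less on a standard normal distribution
pnorm(1.96, mean = 0, sd = 1)       # ~0.975
# Probability of a score greater than 1.96
1 - pnorm(1.96, mean = 0, sd = 1)   # ~0.025
```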

Populations and Samples

We want to find results that apply to an entire population.

Inferential Statistics

Statistical Model

We can predict values of an outcome variable based on some kind of model. All models follow the simple equation of outcome = model + error.
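
A quick R illustration of outcome = model + error, using a simple linear model on made-up data (the variable names are arbitrary):

```r
set.seed(1)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)        # made-up outcome

fit <- lm(y ~ x)                  # the 'model'
head(fitted(fit) + resid(fit))    # model + error ...
head(y)                           # ... exactly reproduces the observed outcome
```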

Null Hypothesis Significance Testing (aka NHST)

Null Hypothesis Significance Testing (aka NHST) is a method of statistical inference used for testing a hypothesis. A test result is statistically significant if it would be unlikely to occur by chance alone, according to a threshold probability (the significance level).

Type I and Type II Errors

There are two types of errors that we can make when testing a hypothesis. These two errors are linked; if we correct for one, the other is affected. You can visualize this with a confusion matrix (aka contingency table, error matrix).

Parametric Statistics

A branch of statistics that assumes the data comes from a type of probability distribution and makes inferences about the parameters of the distribution. The assumptions are:

Homogeneity and Heterogeneity of Variance

Homogeneity of variance means the variance of the outcome should be roughly the same across groups or levels of the predictor (rather than being far more spread out in one group than in another). For example, say we measured the number of hours a person’s ears rang after a concert, across multiple concerts.
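
One way to check this in R is Bartlett’s test of equal variances across groups; a minimal sketch, assuming a hypothetical data frame of hours of ear ringing per concert:

```r
# Hypothetical data: hours of ear ringing measured after each of three concerts
concerts <- data.frame(
  hours   = c(5.2, 4.8, 5.5, 5.0, 4.9,    # concert 1
              5.1, 5.4, 4.7, 5.3, 5.0,    # concert 2
              2.0, 8.1, 5.5, 1.2, 9.3),   # concert 3 (much more spread out)
  concert = factor(rep(c("c1", "c2", "c3"), each = 5))
)

# Bartlett's test: the null hypothesis is that variances are equal across groups
bartlett.test(hours ~ concert, data = concerts)
```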

Linear Regression

Data Transformations

If you cannot make the data fit a normal distribution and you have checked that the data was entered correctly, you can do the following:

Linear Regression - Covariance and Correlation

We can see the relationship between variables with covariance and the correlation coefficient.
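
In R, cov() and cor() compute these directly; a minimal sketch with made-up revision-time and exam-score data:

```r
# Made-up data: revision time (hours) and exam score for 10 students
revision <- c(2, 5, 8, 3, 10, 7, 1, 6, 9, 4)
exam     <- c(40, 55, 70, 45, 85, 65, 35, 60, 78, 50)

cov(revision, exam)                       # covariance (depends on the units of measurement)
cor(revision, exam)                       # Pearson correlation coefficient (standardized, -1 to +1)
cor(revision, exam, method = "spearman")  # rank-based alternative
```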

Linear Regression Analysis

Regression Analysis is a way of predicting an outcome variable from one predictor variable (simple regression) or from several predictor variables (multiple regression). We fit a model to our data and use it to predict values of the dependent variable from one or more independent variables.
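
A minimal sketch of simple and multiple regression with R’s lm(); the data frame and column names here are made up for illustration:

```r
# Hypothetical data: predicting album sales from advertising budget and radio airplay
set.seed(7)
albums <- data.frame(
  adverts = runif(100, 0, 1000),
  airplay = rpois(100, 20)
)
albums$sales <- 50 + 0.1 * albums$adverts + 3 * albums$airplay + rnorm(100, sd = 20)

simple   <- lm(sales ~ adverts, data = albums)            # simple regression: one predictor
multiple <- lm(sales ~ adverts + airplay, data = albums)  # multiple regression: several predictors
summary(multiple)                                         # coefficients, R^2, overall F-test
```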

Linear Regression - Selecting Predictors

If you’re building a complex model, the selection of predictors and the order in which they are entered can have a huge impact. Choose predictors methodically rather than adding a bunch and seeing what happens. Here are a few methods:

Linear Regression - How’s my model doing?

When building a model, we should check two things: 1.) how well the model fits the observed data, by looking at outliers, residuals, and influential cases, and 2.) generalization, which is how well the model generalizes to cases outside your sample.
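
Base R provides the usual diagnostics for a fitted model; a sketch continuing the hypothetical albums example above:

```r
# Refit the model, then inspect residuals and influential cases
fit <- lm(sales ~ adverts + airplay, data = albums)

rstandard(fit)[abs(rstandard(fit)) > 2]   # standardized residuals beyond +/-2 (possible outliers)
head(cooks.distance(fit))                 # influence of each case on the model as a whole
head(hatvalues(fit))                      # leverage of each case
plot(fit)                                 # the four standard diagnostic plots
```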

Linear Regression - Assumptions

Violating Linear Regression Assumptions

If assumptions are violated, then you cannot generalize your findings beyond your sample. You can try correcting the samples using:

Logistic Regression

Logistic Regression - Types of Logistic Regressions

Logistic regression is multiple regression with an outcome variable that is categorical and predictor variables that are continuous or categorical. There are two types of logistic regression:
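
A binary logistic regression in R uses glm() with a binomial family; a minimal sketch with a made-up patients data frame:

```r
# Hypothetical data: did a patient recover (1/0) given treatment dose and age?
set.seed(3)
patients <- data.frame(dose = runif(200, 0, 10), age = rnorm(200, 50, 10))
p <- plogis(-2 + 0.5 * patients$dose)          # made-up underlying probabilities
patients$recovered <- rbinom(200, 1, p)

logit_fit <- glm(recovered ~ dose + age, data = patients, family = binomial)
summary(logit_fit)                             # coefficients are on the log-odds scale
exp(coef(logit_fit))                           # odds ratios
head(predict(logit_fit, type = "response"))    # predicted probabilities
```

For an outcome with more than two categories (multinomial logistic regression), nnet::multinom() is a commonly used option.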

Comparing the Linear and Logistic Regression Models

Logistic Regression - Selecting Predictors

Using R, we can see how well the model fits the data.
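
One common approach is to compare nested models by the change in deviance; a sketch continuing the hypothetical patients example:

```r
# Does adding 'age' improve the fit over a dose-only model?
m1 <- glm(recovered ~ dose,       data = patients, family = binomial)
m2 <- glm(recovered ~ dose + age, data = patients, family = binomial)

anova(m1, m2, test = "Chisq")   # likelihood-ratio (deviance) test for the added predictor
AIC(m1, m2)                     # lower AIC suggests a better fit/complexity trade-off
```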

Logistic Regression - Assumptions

What are Logistic Regression assumptions?

Logistic Regression - Issues

What are situations where Logistic Regression can cause trouble and return invalid results?

Comparing two means (i.e. t-test)

We’ve looked at relationships between variables, but sometimes we’re interested in differences between groups of people. This is useful in making causal inferences (e.g. two groups of people, one gets a sugar pill and the other gets the actual pill). There are two different ways to expose people to an experiment, matched by two tests: the independent-means t-test and the dependent-means t-test. t-tests are basically regressions, so they share many of the same assumptions (they are parametric tests based on a normal distribution).

t-test - Independent and Dependent Means t-test

t-test - Calculations

Calculations for t-tests can be viewed as if the t-test were a linear regression (GLM). The independent t-test compares the means between two unrelated groups on the same continuous dependent variable (e.g. blood pressure of patients who were given a drug vs a control group given a placebo). The dependent t-test can be seen as a within-subjects (aka repeated-measures) test (e.g. blood pressure of patients ‘before’ vs ‘after’ they received a drug). Note: don’t worry too much about the calculations below; R will do them:
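
In R, both flavors are handled by t.test(); a minimal sketch with made-up blood-pressure values:

```r
set.seed(11)
# Independent-means: drug group vs placebo group (different people)
drug    <- rnorm(30, mean = 120, sd = 10)
placebo <- rnorm(30, mean = 128, sd = 10)
t.test(drug, placebo)                  # Welch two-sample t-test by default

# Dependent-means: the same people measured before and after treatment
before <- rnorm(30, mean = 130, sd = 8)
after  <- before - rnorm(30, mean = 5, sd = 3)
t.test(before, after, paired = TRUE)   # paired (repeated-measures) t-test
```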

One-way ANOVA

Comparing Several Means with ANOVA (Analysis of Variance GLM 1, aka One-way ANOVA)

If we want to compare more than two conditions, we use one-way ANOVA. A t-test checks whether two samples have the same mean, while ANOVA checks whether three or more groups have the same mean. ANOVA is an omnibus test, which means it tests for an overall effect across all groups and does not say which group has a different mean.
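
A one-way ANOVA in R uses aov(); a minimal sketch with made-up data for three groups:

```r
# Hypothetical data: an outcome measured under three conditions
set.seed(5)
dat <- data.frame(
  outcome = c(rnorm(20, 10), rnorm(20, 12), rnorm(20, 15)),
  group   = factor(rep(c("control", "low", "high"), each = 20))
)

fit <- aov(outcome ~ group, data = dat)
summary(fit)   # the omnibus F-test: is there *any* difference among the group means?
```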

One-way ANOVA - Assumptions

Comparing different groups (Planned Contrasts and Post Hoc Comparisons)

Planned contrasts and post hoc comparisons are two methods that tell us which groups differ in an ANOVA. This is important because the F-ratio tells us whether there is an effect, but not which group causes it. Planned contrasts and post hoc comparisons are ways of comparing different groups without inflating the familywise error rate.

Planned Contrasts (aka Planned Comparisons) - break down the variance accounted for by the model into component parts. This is done when you have a specific hypothesis to test. Some rules:

Planned contrasts can be either: Orthogonal comparisons or Non-orthogonal comparisons

- The p-values here are correlated, so you'll need to be careful how you interpret the results. Instead of a .05 probability, you might want a more conservative level before accepting that a given comparison is statistically meaningful.

Post hoc comparisons (aka post hoc tests, data mining, exploring data) - compare every group with every other (as if you were doing multiple t-tests), but use a stricter acceptance criterion so that the familywise error rate doesn’t rise above the acceptable .05. This is done when you have no specific hypothesis to test.
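
In R, two common post hoc options are Tukey’s HSD and pairwise t-tests with a p-value correction; a sketch continuing the hypothetical one-way ANOVA example above:

```r
TukeyHSD(fit)                                      # all pairwise comparisons, familywise error controlled
pairwise.t.test(dat$outcome, dat$group,
                p.adjust.method = "bonferroni")    # stricter per-comparison criterion
```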

Comparing Several Means with ANCOVA (Analysis of Covariance GLM 2)

ANCOVA is like ANOVA, but also includes covariates, which are one or more continuous variables that are not part of the main experimental manipulation but have an influence on the outcome (aka dependent) variable. We include covariates for two reasons (a sketch in R follows the list below):

  1. To reduce within-group error variance: in ANOVA, if we can explain some of the ‘unexplained’ variance (SSr) in terms of covariates, then we reduce the error variance, allowing us to more accurately assess the effect of the independent variable (SSm).
  2. Elimination of confounds: an experiment may have unmeasured variables that confound the results (i.e. variables other than the experimental manipulation that affect the outcome variable). ANCOVA removes the bias from the variables that influence the outcome variable.
    • E.g. say we look at the example of the effects of Viagra on libido; the covariates would be other things like other medication (e.g. antidepressants). ANCOVA attempts to measure these continuous variables and include them in the regression model.
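
A minimal ANCOVA sketch in R, assuming a hypothetical data frame with libido (outcome), dose (experimental factor), and partnerLibido (covariate):

```r
# Hypothetical data: libido by dose, with partner's libido as a covariate
set.seed(9)
viagra <- data.frame(
  dose          = factor(rep(c("placebo", "low", "high"), each = 10)),
  partnerLibido = rnorm(30, 4, 1)
)
viagra$libido <- 2 + 0.8 * viagra$partnerLibido + rep(c(0, 1, 2), each = 10) + rnorm(30)

# Enter the covariate first, then the experimental factor
fit <- aov(libido ~ partnerLibido + dose, data = viagra)
summary(fit)
```

In practice, Type III sums of squares via car::Anova(fit, type = 3) are often preferred for ANCOVA; that requires the car package.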

ANCOVA Assumptions

ANCOVA has the same assumptions as any linear model with these two additional considerations:

  1. Independence of the covariate and treatment effect - the unexplained variance (SSr) should only overlap with the variance explained by the covariate (and not with the variance explained by the independent variable, or else the effect is obscured).
  2. Homogeneity of regression slopes - the slopes of the different regression lines should all be equivalent (i.e. parallel across groups). For example, if there’s a positive relationship between the covariate and the outcome in one group, then we assume there’s a positive relationship in all the other groups.

ANCOVA Calculations

Factorial ANOVA

ANOVA and ANCOVA look at differences between groups with only a single independent variable (i.e. just one variable being manipulated). Factorial ANOVA looks at differences between groups with two or more independent variables. When there are two or more independent variables, it is a factorial design (so called because independent variables are sometimes known as factors). There are multiple types of this design, including:

ANOVA Naming Convention

The names of ANOVAs can seem confusing, but they are easy to break down. We’re simply saying how many independent variables there are and how they were measured. This means:

  1. The quantity of independent variables (e.g. one independent variable translates to ‘one-way independent’)
  2. Are the people being measured the same or different participants?
    • If the same participants, we say repeated measures
    • If different participants, we say independent
    • If there are two or more independent variables, then it’s possible some variables use the same participants while others use different participants so we say mixed

Independent Factorial Design ANOVA (aka Between Groups, Between-Subjects ANOVA, GLM 3)

An example of factorial ANOVA using two independent variables is looking at the effects of alcohol on mate selection at nightclubs. The hypothesis was that after alcohol has been consumed (the first independent variable), subjective perceptions of physical attractiveness would become more inaccurate. Say we’re also interested in whether this effect is different for men and women (the second independent variable). We break participants into groups by gender (male, female) and drinks (none, 2 pints, 4 pints), and measure the outcome with an independent assessment of attractiveness (say, 1 to 100).
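
A sketch of this two-way independent design in R (all values made up; attractiveness scored 1 to 100):

```r
set.seed(8)
goggles <- data.frame(
  gender  = factor(rep(c("male", "female"), each = 24)),
  alcohol = factor(rep(c("none", "2 pints", "4 pints"), times = 16))
)
goggles$attractiveness <- round(rnorm(48, mean = 60, sd = 10))

# Two independent variables and their interaction
fit <- aov(attractiveness ~ gender * alcohol, data = goggles)
summary(fit)                                      # main effects and the gender:alcohol interaction
interaction.plot(goggles$alcohol, goggles$gender,
                 goggles$attractiveness)          # visualize the interaction
```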

Repeated-Measures Factorial Designs ANOVA (aka Within Groups, Within-Subjects ANOVA, ANOVA for correlated samples, GLM 4)

Repeated measures is when the same entities participate in all conditions of the experiment. A Repeated-Measures ANOVA is like a regular ANOVA, but it violates the assumption that scores in different conditions are independent (scores are likely to be related because they’re from the same people), which will cause the F-test to lack accuracy.

Mixed Designs ANOVA (aka split-plot ANOVA, GLM 5)

Mixed designs ANOVA compares several means when there are two or more independent variables, at least one of which has been measured using the same participants (repeated measures) and at least one other using different participants (independent measures).

Non-parametric Tests

Non-parametric tests are statistical procedures that make fewer assumptions about the type of data, mostly working on the principle of ranking the data: the lowest score gets a rank of 1, the next lowest a rank of 2, and so forth. We then carry out the analysis on the ranks instead of the actual data. These tests include:

Wilcoxon’s rank-sum test

The Wilcoxon rank-sum test (aka WRS, Mann-Whitney test, Mann-Whitney-Wilcoxon test, Wilcoxon-Mann-Whitney test) is the non-parametric equivalent of an independent t-test (i.e. it is used when you want to test the differences between two conditions and different participants have been used in each condition). The theory is that you rank the data and ignore the group to which a person belonged (say we’re looking at depression levels between ecstasy and alcohol users).
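
In R this is wilcox.test() on two independent groups; a sketch with made-up depression scores:

```r
# Hypothetical depression scores for two separate groups of participants
ecstasy <- c(15, 28, 35, 16, 18, 27, 28, 28, 45, 28)
alcohol <- c(16, 15, 20, 15, 16, 13, 14, 19, 18, 18)

# Rank-sum (Mann-Whitney) test: non-parametric analogue of the independent t-test
wilcox.test(ecstasy, alcohol)
```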

Wilcoxon Signed-Rank Test

The Wilcoxon signed-rank test is used in situations where there are two sets of scores to compare, but these scores come from the same participants. This is the non-parametric equivalent of a dependent t-test. The theory is that we’re looking at the differences between scores in the two conditions you’re comparing (e.g. see the effects of two drugs, one measured on Saturday and again on Wednesday for the same participants). The main difference is that there’s a sign (positive or negative) assigned to the rank.

Kruskal-Wallis Test

The Kruskal-Wallis test looks at the differences between several independent groups. This is the non-parametric counterpart to the one-way independent ANOVA. The theory is also based on ranked data (ordering scores from lowest to highest, ignoring the group a score belongs to, with the lowest score getting a rank of 1 and going up).

Friedman’s ANOVA

Friedman’s ANOVA looks at differences between several related groups. This is used for testing differences between conditions where there are more than two conditions and the same participants have been used in all conditions (each case contributes several scores to the data). The theory is also based on ranked data.
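
The remaining non-parametric tests also have base R counterparts; a minimal sketch with made-up data:

```r
set.seed(4)
# Wilcoxon signed-rank: the same participants measured twice
saturday  <- rnorm(15, 20, 5)
wednesday <- saturday - rnorm(15, 3, 2)
wilcox.test(saturday, wednesday, paired = TRUE)

# Kruskal-Wallis: several independent groups
scores <- c(rnorm(10, 10), rnorm(10, 12), rnorm(10, 15))
group  <- factor(rep(c("a", "b", "c"), each = 10))
kruskal.test(scores ~ group)

# Friedman's ANOVA: several related conditions (rows = participants, columns = conditions)
ratings <- matrix(rnorm(30), nrow = 10, ncol = 3,
                  dimnames = list(NULL, c("cond1", "cond2", "cond3")))
friedman.test(ratings)
```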

Multivariate Analysis of Variance (MANOVA)

Matrices

You should know about Matrices before learning about MANOVA.

MANOVA - When to use MANOVA

MANOVA (multivariate analysis of variance) is used when you want to compare groups on several dependent variables (outcomes); a test that compares groups on several dependent variables is known as a multivariate test (as opposed to ANOVA, which can only be used on one dependent variable and is known as a univariate test). We can use MANOVA when there is one independent variable or when there are several.

MANOVA - Variances and Covariances

We’re interested in the ratio of the effect of the systematic variance to the unsystematic variance. In ANOVA, these variances were single values. In MANOVA, each of these is a matrix containing many variances and covariances.

MANOVA - Predicting

In ANOVA, we try to predict an outcome (dependent variable) using a combination of predictor variables (independent variables). In essence, what we’re doing with MANOVA is predicting an independent variable from a set of dependent variables.

MANOVA - Eigenvalues

Eigenvalues are the equivalent of the F-ratio in an ANOVA. Now we need to compare how large these values are relative to what we would expect by chance alone. There are a few ways to do that:

- Pillai-Bartlett trace (aka V, Pillai’s trace) is the sum of the proportion of explained variance on the discriminant functions; it’s similar to the ratio SSm/SSt, which is known as R^2. The formula is V = Σ λ_i / (1 + λ_i), where λ_i is the eigenvalue of the i-th discriminant function. This should usually be your default test statistic.
- Hotelling-Lawley trace (aka Hotelling’s T^2) is the sum of the eigenvalues for each variate. This test statistic is the sum of SSm/SSr for each of the variates, and so it compares directly to the F-ratio in ANOVA.
- Wilks’ lambda is the product of the unexplained variance on each of the variates. This test statistic represents the ratio of error variance to total variance (SSr/SSt) for each variate. Note that large eigenvalues (which represent a large experimental effect) lead to small values of Wilks’ lambda, so statistical significance is found when Wilks’ lambda is small.
- Roy’s largest root is the eigenvalue of the first variate. It’s basically the same as the Hotelling-Lawley trace, but only for the first variate. This test statistic represents the proportion of explained variance to unexplained variance (SSm/SSr) for the first discriminant function. Roy’s root represents the maximum possible between-group difference given the data and is in many cases the most powerful.
- Choosing a method:
  • If the sample sizes are small or moderate, the four approaches differ little in terms of power.
  • If group differences are concentrated on the first variate, Roy’s statistic should be the most powerful (since it takes account of only the first variate), followed by Hotelling’s trace, Wilks’ lambda, and Pillai’s trace. If groups differ along more than one variate, the power order is reversed.
  • It’s recommended to use fewer than 10 dependent variables unless sample sizes are large.
  • When sample sizes are unequal, use the Pillai-Bartlett trace since it’s the most robust to violations of assumptions. Make sure to check the homogeneity of covariance matrices; if they seem homogeneous and the assumption of multivariate normality is tenable, the Pillai-Bartlett trace is assumed to be accurate.
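
In R, manova() fits the model and summary() reports whichever of these statistics you ask for; a sketch with two made-up outcome measures:

```r
set.seed(6)
# Hypothetical data: two outcome measures for three groups
ocd <- data.frame(
  group    = factor(rep(c("CBT", "BT", "none"), each = 10)),
  actions  = rnorm(30, 5, 1),
  thoughts = rnorm(30, 14, 2)
)

fit <- manova(cbind(actions, thoughts) ~ group, data = ocd)
summary(fit, test = "Pillai")             # the usual default
summary(fit, test = "Wilks")
summary(fit, test = "Hotelling-Lawley")
summary(fit, test = "Roy")
```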

MANOVA - Assumptions

MANOVA assumptions are similar to ANOVA assumptions, but extend them to multivariate cases instead of univariate cases:

MANOVA - Which group caused the effect

You can figure out which group caused the effect a couple different ways. After a MANOVA, you can either do a discriminant function analysis or do a different univariate ANOVA for each dependent variable.

MANOVA - Robust Methods

Two robust methods for MANOVA are Munzel and Brunner’s method (aka the Munzel-Brunner rank order test) and Choi and Marden’s robust test (aka the Choi-Marden multivariate rank test, an extension of the Kruskal-Wallis one-way analysis of variance by ranks).

Factor Analysis

So the idea is that factor analysis identifies clusters of variables that relate to each other. We check that our variables aren’t related to each other too much or too little using a correlation matrix. We then check whether there are any issues (e.g. is the sample size large enough), decide how many factors we want to keep (factor extraction), and finally decide which variables go with which factors (factor loading). Lastly, we consider whether the items we have are reliable measures of what we’re trying to measure.
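
A compact sketch of that workflow in R using the built-in factanal() function (the questionnaire data, number of factors, and item names are all made up for illustration):

```r
set.seed(2)
# Hypothetical questionnaire: 6 items driven by two underlying factors
n  <- 300
f1 <- rnorm(n)
f2 <- rnorm(n)
items <- data.frame(
  q1 = f1 + rnorm(n, sd = 0.5), q2 = f1 + rnorm(n, sd = 0.5), q3 = f1 + rnorm(n, sd = 0.5),
  q4 = f2 + rnorm(n, sd = 0.5), q5 = f2 + rnorm(n, sd = 0.5), q6 = f2 + rnorm(n, sd = 0.5)
)

round(cor(items), 2)   # items should correlate with each other, but not too strongly
fa <- factanal(items, factors = 2, rotation = "varimax", scores = "regression")
fa$loadings            # which items load on which factor
head(fa$scores)        # factor scores for each participant
```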

Factor Analysis - Calculating Factor Scores

Calculating factor scores means estimating a person’s score on a factor based on their scores for the constituent variables. There are a few different methods, including:

What to do with Factor Scores

Factor Analysis - Factor Extraction

Now that you have your factor score, you need to choose a technique to explore the factors in your data; choosing the technique depends on what you want to do:

Factor Analysis - Interpretation

Factor rotation is a technique for interpreting a factor analysis by discriminating between factors; it is used after factors have been extracted and comes in two forms:

Comparing Categorical Variables

For continuous variables we can compare means, but this is useless for categorical variables (since we’d be assigning arbitrary numbers to categories, and the mean would just depend on how many categories there were). Instead, for categorical variables, we count the frequency of each category to get a contingency table, which is a tabulation of the frequencies. We use different algorithms depending on how many categorical variables there are (2, or more than 2).
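
For two categorical variables, the usual R workflow is a contingency table plus a chi-square test; a sketch with made-up cat-training data:

```r
# Hypothetical data: training method vs whether the cat danced
training <- rep(c("food", "affection"), times = c(100, 100))
danced   <- c(rep(c("yes", "no"), times = c(70, 30)),   # food group
              rep(c("yes", "no"), times = c(30, 70)))   # affection group

tab <- table(training, danced)   # contingency table of frequencies
tab
chisq.test(tab)                  # Pearson's chi-square test of independence
```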

Comparing Multiple Categorical Variables using Loglinear Analysis

Loglinear analysis is used when we want to look at more than two categorical variables. Think of this as the ANOVA for categorical variables (for every variable we have, we get a main effect, but we also get interactions between variables). For example, say we have three variables: Animal (dog or cat), Training (food as reward or affection as reward), and Dance (did they dance or not?). Again, this can also be seen as a regression model. Without getting into the math, categorical data can be expressed in the form of a linear model provided we use log values (thus loglinear analysis). The idea is that we try to fit a simpler model without any substantial loss of predictive power, using backward elimination (removing terms one at a time, hierarchically).
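
One way to sketch this in R is a Poisson GLM on the cell counts, dropping the highest-order interaction to see whether a simpler model still fits (the counts are made up; MASS::loglm() is a common alternative):

```r
# Hypothetical cell counts for Animal x Training x Dance
cells <- expand.grid(Animal   = c("cat", "dog"),
                     Training = c("food", "affection"),
                     Dance    = c("yes", "no"))
cells$count <- c(28, 10, 48, 114, 20, 14, 29, 7)   # made-up frequencies

saturated <- glm(count ~ Animal * Training * Dance, data = cells, family = poisson)
reduced   <- glm(count ~ (Animal + Training + Dance)^2, data = cells, family = poisson)

# If removing the three-way interaction doesn't significantly worsen fit, keep the simpler model
anova(reduced, saturated, test = "Chisq")
```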

Multilevel Linear Models

Sometimes data is hierarchical instead of being at a single level. This means that some variables are clustered or nested within other variables (i.e. some of our other analyses may be oversimplifications). Let’s use this example data set: