Statistical Analysis Primer
Overview
Welcome! This is a short onboarding course for new data analysts. I’ll cover the basics of statistics. No prior knowledge is necessary.
What data are we measuring?
Independent and Dependent Variables
Data is made up of variables. Variables can have any value (like a category or number). y = 10
says that the variable ‘y’ has the value of ‘10’. There are two types of variables:
- Independent Variable (aka cause, predictor) - the variable(s) does not depend on any other variable (e.g. Smoked Cigarettes Per Day, Time Spent Studying, How much you eat). This variable is thought to be the cause of some effect. In experimental research, this variable(s) is manipulated.
- Dependent Variable (aka effect, outcome) - the variable(s) that we think are an effect because the value depends on the independent variables (e.g. Has Lung Cancer, Test Score). This variable(s) is thought to be affected by the independent variable.
- Example: ‘Time Spent Studying’ causes a change in ‘Test Score’. ‘Test Score’ can’t cause a change in ‘Time Spent Studying’. ‘Test Score’ (the dependent variable) depends on ‘Time Spent Studying’ (the independent variable)
- Summary: What you’re trying to do is see if the independent variable(s) causes some kind of change in the dependent variable(s); we’re checking whether there’s a relationship, which we’ll later understand as part of the overall equation:
outcome = model + error
Categorical and Continuous Variables
There are different levels of measurement used to categorize and quantify variables. Variables can be categorized as ‘categorical’ or ‘continuous’ as well as quantified at different levels. As we go down the list, the measurements become more detailed and useful for statistical analysis.
- Categorical (aka Qualitative) - Deals with unmeasurable qualities you can’t do arithmetic with (e.g. ‘species’ could have values of ‘human’, ‘cat’, or ‘dog’). Your choices are discrete (as in you can be a human or a cat, but not both). Categoricals are further categorized as:
- Dichotomous (aka Binary) - Two distinct possibilities (e.g. pregnant or not pregnant)
- Nominal - Two or more possibilities (e.g. human, cat, dog)
- Ordinal - Two or more possibilities and there is a logical order (e.g. first, second). You know which is bigger or smaller, but not by how much (e.g. you know who won the race, but not how close the race was)
- Programming - In R, this is ‘factor()’; in Python’s pandas, this is ‘categorical’ with an optional order (see the short R example after this list)
- Continuous (aka Quantitative) - Deals with numbers (e.g. ‘length’, ‘age’). Continuous variables are further categorized as:
- Interval - Two or more possibilities, there is a logical order that you can measure (i.e. numbers that you can do arithmetic with), and there are equal intervals (e.g. Fahrenheit measurement of 40 to 50 degrees is the same difference as 50 to 60 degrees)
- Ratio - Two or more possibilities, there is a logical order that you can measure, there are equal intervals, and there is a true zero point (e.g. the weight of an object cannot weigh less than 0)
- Note: Parametric and Non-parametric: The type of data determines the type of test, parametric or non-parametric, and this affects the statistical procedures you can use. The main differences are that parametric tests require ‘interval’ or ‘ratio’ scales and use information about the mean and deviation from the mean (meaning the data have a relatively normal distribution), which gives them more statistical power. Non-parametric statistics use ‘nominal’ or ‘ordinal’ data (i.e. less information) in their calculations and are thus less powerful.
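As a minimal R sketch of the categorical types above (the species and race-place values here are made up), ‘factor()’ handles both nominal and ordered categories:
# Nominal: discrete categories with no order
species <- factor(c("human", "cat", "dog", "cat"))
levels(species)
# Ordinal: categories with a logical order (but unknown distances between them)
place <- factor(c("first", "third", "second"),
                levels = c("first", "second", "third"),
                ordered = TRUE)
place[1] < place[2]   # TRUE: we know the order, but not by how much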
Reliability
- Measurement Error - the discrepancy between the numbers we use to represent the thing we’re measuring and the actual value of measuring it directly (e.g. I measure my height with a ruler, but the measurement might be a little inaccurate)
- Reliability - whether an instrument can be interpreted consistently across different situations. The easiest method is through test-retest reliability, which tests the same group of people twice (e.g. if I weighed myself within minutes of each other, would I get the same result?)
- Counterbalancing - order in which a person participates in an experiment may affect their behavior. Counterbalancing fixes this; say there are two possible conditions (A and B), subjects are divided into two groups with Group 1 (A then B), Group 2 (B then A)
- Randomization - randomly allocates experimental units across groups; reduces confounding by dispersing at chance levels (hopefully roughly evenly)
Validity
Validity is whether an instrument actually measures what it sets out to measure (e.g. does a scale actually measure my weight?). Validity is usually divided into three forms: Criterion, Content, and Construct.
- Criterion Validity - looks at how well one set of variable(s) predicts an outcome based on information from another set of criterion variable(s). e.g. IQ tests are often validated against measures of academic performance (the criterion). Criterion Validity can be either concurrent or predictive:
- Concurrent Validity - assesses the ability to distinguish between groups (e.g. check if an AP exam can substitute for taking a college course; all students take the AP exam and the college course, then check if there is a strong correlation between exam scores and college course grades)
- Predictive Validity - assesses the ability to predict observations at a later point in time (e.g. how well does the SAT predict academic success? Determine usefulness by correlating SAT scores with first-year students’ GPAs)
- Note: Difficult because objective criteria that can be measured easily may not exist
- Content Validity (aka Logical Validity) - a non-statistical type of validity that assesses the degree to which individual items represent the construct being measured (e.g. a test of the ability to add two numbers should include a range of digit combinations, odd and even, and should not include multiplication)
- Face Validity - an estimate of whether a test appears to measure a certain criterion
- Representation Validity (aka Translation Validity) - the extent to which an abstract theoretical construct can be turned into a specific practical test
- Construct Validity - a construct is an abstraction (attribute, ability or skill) created to conceptualize a latent variable (e.g. someone’s English proficiency is a construct). Construct Validity is whether your test measures the construct adequately (e.g. how well does your test measure someone’s English proficiency). In order to have construct validity, you need both convergent and discriminant validity.
- Convergent Validity - the degree to which two measures of constructs that should be related, are in fact related (e.g. if a construct of general happiness had convergent validity, then it should correlate to a similar construct, say of marital satisfaction)
- Discriminant Validity - the degree to which two measures of constructs that should be unrelated, are in fact unrelated (e.g. if a construct of general happiness had discriminant validity, then it should not correlate with an unrelated construct, say depression)
How are we measuring data?
Types of Research Methods
We’re interested in correlation as well as causality (cause and effect). To test a hypothesis, we can do the following types of research.
- Observational/Correlational Research - observe what naturally goes on without directly interfering. Correlation suggests a relationship between two variables, but cannot prove that one caused another. Correlation does not equal causation.
- Cross-sectional Study - a snapshot of many different variables at a single point in time (e.g. measure cholesterol levels of daily walkers across two age groups, but we can’t consider past or future cholesterol levels beyond the snapshot)
- Longitudinal Study - measure variables repeatedly at different time points (e.g. we might measure workers’ job satisfaction under different managers)
- Limitations of Correlational Research
- Direction of Causality - we only view the co-occurrence of variables so we have no timeline (e.g. we have people with low self-esteem and dating anxiety, but can’t tell which came first)
- Confounding Variables (aka Tertium Quid) - extraneous factors (e.g. a correlation between breast implants and suicide; breast implants don’t cause suicide, but a third factor like low self-esteem might cause both)
- Experimental Research - manipulate one variable to see its effect on another. We do a comparison of situations (usually called treatments or conditions) in which cause is present or absent (e.g. we split students into two groups: one with motivation and the other as control, no motivation)
- Between-Subjects Design (aka independent measures, between-groups) - Participants can be part of the treatment group or the control group, but not both. Every participant is only subjected to a single treatment (e.g. motivational group gets motivation, control gets no motivation through entire experiment)
- Within-Subjects Design (aka repeated measures) - Participants are subjected to every single treatment, including the control (e.g. give positive motivation for a few weeks, then no motivation)
Descriptive Statistics
Frequency Distribution (aka Histogram)
A count of how many times different values occur. The two main ways a distribution can deviate from normal are skew and kurtosis.
- N - N is the size of the sample and n represents a subsample (e.g. number of cases within a particular group)
- skew - a lack of symmetry (tall bars are clustered at one end of the scale); positively skewed means scores are clustered at the lower end (left side) while negatively skewed means scores are clustered at the higher end (right side)
- kurtosis - the pointiness (degree that scores cluster at the ends of the distribution / the tails); positive kurtosis means many scores in the tails (aka leptokurtic distribution, heavy-tailed) while negative kurtosis means relatively thin tails (aka platykurtic distribution, light-tailed)
- Normal Distribution - a bell-shaped curve (majority of bars lie around the center of the distribution); has values of skew = 0 and kurtosis = 0
The following measures are used to describe a frequency distribution’s central tendency:
- Mode - the value that occurs most frequently. A distribution can have more than one ‘highest point’:
- bimodal - two bars that are the highest
- multimodal - more than two bars that are the highest
- Median - the middle value when ranked in order of magnitude
- Mean - the average; the sum of all scores divided by the number of scores: mean = (sum of scores) / N (see the short R example below)
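As a quick R sketch (the scores are made up), here are the three measures of central tendency; base R has no built-in mode for data, so a common table()-based one-liner is used:
scores <- c(2, 3, 3, 4, 5, 5, 5, 6, 8)
mean(scores)                      # the average
median(scores)                    # the middle value when ranked
names(which.max(table(scores)))   # the mode: the most frequently occurring value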
Quantiles, Quartiles, Percentiles
- Range - largest value minus the smallest value
- Quantiles - quantiles split data into equal portions. Quantiles include quartiles (split into four equal parts), percentiles (split into 100 equal parts), and noniles (points that split the data into nine equal parts)
- Quartiles - the three values that split sorted data into four equal parts. Quartiles are special cases of quantiles
- Lower Quartile (aka First Quartile, Q1) - the median of the lower half of the data
- Second Quartile (aka Q2) - the median, which splits our data into two equal parts
- Upper Quartile (aka Third Quartile, Q3) - the median of the upper half of the data
- Note: For discrete distributions, there is no universal agreement on selecting quartile values
- Interquartile Range (aka IQR) - exclude values at the extremes of the distribution (e.g. cut off the top and bottom 25%) and calculate the range of the middle 50% of scores.
IQR = Q3 - Q1
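A short R sketch of quartiles, the IQR and the range (made-up data):
x <- c(1, 3, 4, 5, 7, 8, 9, 12, 15)
quantile(x, probs = c(0.25, 0.50, 0.75))   # Q1, Q2 (the median), Q3
IQR(x)                                     # Q3 - Q1, the spread of the middle 50% of scores
max(x) - min(x)                            # the range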
Dispersion
- Variation (aka spread, dispersion) - a change or difference in condition, amount, level
- Variance - a particular measure of variation; how far a set of numbers are spread out around the mean or expected value; low variance means values are clumped together (zero means all values are identical), high variance means values are spread out from the mean and each other
- Unsystematic Variation (aka random variance) - random variability within individuals and/or groups (e.g. I feel better today than yesterday)
Unsystematic Variance = Measurement Error + Individual Differences
- Systematic Variation (aka between-groups variance) - variance between-groups created by a specific experimental manipulation (e.g. give bananas to reward monkeys for successfully completing tasks); doing something in one condition but not in the other condition
- Deviance - quality of fit for a model; the difference between each score and the fit (e.g. when the mean is the model, the deviance of a score is that score minus the mean). Deviances are used in the sum of squared residuals in ordinary least squares, and more generally wherever model fitting is achieved by maximum likelihood. Can be calculated as:
deviance = outcome - model
- Sum of Squared Errors (aka SS, sum of squares) - we can’t just add up all the deviances (or else the total spread is just zero, which is meaningless) so we square each deviance to get the total dispersion / total deviance of scores (i.e. gets rid of negatives)
- Variance - SS works nicely until the number of observations (n) changes, then we’d need to recalculate. Instead of using the total dispersion, we use the average dispersion, which is the variance: variance = SS / (N - 1)
- Standard Deviation - since the variance is in squared units, we take the square root of the variance (see the short R sketch after this list):
standard deviation = sqrt(SS / (N - 1))
- Test Statistic - the ratio of systematic to unsystematic variance or effect to error (i.e. the signal-to-noise). Depending on the model, we have a different name for the test statistic. Examples include:
- F-test
- Levene’s test
- Bartlett’s test
- Brown-Forsythe test
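Here is the short R sketch referenced above, walking from deviances to the standard deviation with made-up scores:
scores <- c(2, 4, 4, 4, 5, 5, 7, 9)
deviances <- scores - mean(scores)   # deviance of each score from the mean
sum(deviances)                       # adds to zero, which is why we square the deviances
ss <- sum(deviances^2)               # sum of squared errors (SS)
ss / (length(scores) - 1)            # variance (average dispersion); same as var(scores)
sqrt(ss / (length(scores) - 1))      # standard deviation; same as sd(scores)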
From Frequency to Probability
So what does the frequency tell us? Instead of thinking of it as the frequency of values occurring, think of it as how likely a value is to occur (i.e. probability). For any distribution of scores, we can calculate the probability of obtaining a given value.
- Probability Density Functions (aka probability distribution function, probability function) - just like a histogram, except that lumps are smoothed out so that we have a smooth curve. The area under curve tells us the probability of a value occurring.
- Common distributions include z-distribution, t-distribution, chi-square distribution, F-distribution
- We use a normal distribution with a mean of 0 and a standard deviation of 1 (this lets us use tabulated probabilities instead of calculating ourselves)
- To ensure a mean of 0 and a standard deviation of 1, we calculate the z-scores using:
z = (X - mean) / s
i.e. subtract the mean from each score and divide by the standard deviation
- Check tabulated values and you’ll get the probability P of a value occurring
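A minimal R sketch (made-up data) of converting scores to z-scores and reading probabilities from the standard normal distribution instead of a table:
x <- c(10, 12, 15, 18, 20, 23, 25)
z <- (x - mean(x)) / sd(x)   # standardize: mean of 0, standard deviation of 1
pnorm(1.96)                  # P(Z <= 1.96), roughly 0.975
1 - pnorm(1.96)              # P(Z > 1.96), roughly 0.025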
Populations and Samples
We want to find results that apply to an entire population.
- Population - the entire group or species we’re interested in; can be very general or very specific (e.g. humans, cats named Will)
- Sample - small subset of the population
- Degrees of Freedom - the number of values in the final calculation that are free to vary (e.g. if the mean of three values is 5 and two of the values are 4 and 6, the third must be 5)
- Sampling Variation - samples will vary because they contain different members of the population (e.g. grab random samples of people off the street, some samples you’ll get a sample/group of people that is smarter, some samples not so much)
- Sampling Distribution (aka finite-sample distribution) - the distribution of a sample statistic (based on a random sample); this tells us about the behavior of samples from the population
- Standard Error (aka SE) - a measure of how representative a sample is likely to be of the population. A large standard error means there is a lot of variability so samples might not be representative of the population. A small standard error means that most samples are likely to be an accurate reflection of the population
- Central Limit Theorem - as samples get large (greater than 30), the sampling distribution of the mean has a normal distribution with a standard deviation (the standard error) of:
standard error = s / sqrt(N)
- Confidence Intervals (for large samples) - along with the standard error, we can calculate boundaries within which we believe the population value will fall. In general, we can calculate a 95% confidence interval using the central limit theorem:
lower boundary of confidence interval = mean - (1.96 * SE)
upper boundary of confidence interval = mean + (1.96 * SE)
- Note: The multiplier differs for a 95% confidence interval (1.96, the most common), a 99% interval (2.58), or a 90% interval (1.64)
- Confidence Intervals (for small samples) - for smaller samples, we can’t calculate boundaries using the central limit theorem because the sampling distribution isn’t a normal distribution. Instead, smaller samples have a t-distribution and the boundaries are calculated with (see the short R sketch after this list):
lower boundary of confidence interval = mean - (t_(n-1) * SE)
upper boundary of confidence interval = mean + (t_(n-1) * SE)
- p-value - the probability of obtaining the observed sample results when the null hypothesis is true. If the p-value is very small (below a threshold based on the previously chosen significance level, traditionally 5% or 1%), then the null hypothesis is rejected.
- p <= 0.01 means very strong presumption against null hypothesis
- 0.01 < p <= 0.05 means strong presumption against null hypothesis
- 0.05 < p <= 0.1 means low presumption against null hypothesis
- p > 0.1 means no presumption against the null hypothesis
- Note: NHST only works if you generate your hypothesis and decide on the threshold before collecting the data (otherwise the chance of publishing a spurious result increases; at a 95% confidence level, roughly 1 result in 20 will be a false positive)
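Here is the small-sample confidence interval sketch referenced above (made-up data); t.test() reports the same interval:
x <- c(22, 25, 19, 30, 27, 24, 21, 26)
n <- length(x)
se <- sd(x) / sqrt(n)             # standard error of the mean
t_crit <- qt(0.975, df = n - 1)   # critical t value for a 95% interval
mean(x) - t_crit * se             # lower boundary
mean(x) + t_crit * se             # upper boundary
t.test(x)$conf.int                # the same 95% confidence interval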
Inferential Statistics
Statistical Model
We can predict values of an outcome variable based on some kind of model. All models follow the simple equation of: outcome = model + error
- Model - Models are made up of variables and parameters
- Variables - are measured constructs that vary across entities in the sample
- Parameters - are estimated from the data (rather than being measured) and are (usually) constants. E.g. the mean and median (which estimate the center of the distribution) and the correlation and regression coefficients (which estimate the relationship between two variables). We say ‘estimate the parameter’ or ‘parameter estimates’ because we’re working with a sample, not the entire population.
- Outcome - the value we’re trying to predict: outcome = model + error
- Error (aka deviance) - Error for an entity is the score predicted by the model subtracted from the observed score for that entity.
error = outcome - model
Null Hypothesis Significance Testing (aka NHST)
Null Hypothesis Significance Testing (aka NHST) is a method of statistical inference used for testing a hypothesis. A test result is statistically significant if it has been predicted as unlikely to have occurred by chance alone, according to a threshold probability (the significance level)
- Alternative Hypothesis (aka H1, Ha, experimental hypothesis) - the hypothesis or prediction that sample observations are influenced by some non-random cause
- Null Hypothesis (H0) - Opposite of the alternative hypothesis, refers to the default position that there is no relationship between two measured phenomena
- Directional and Nondirectional Hypothesis - hypothesis can be either directional if it predicts whether an effect is larger or smaller (e.g. if I buy cookies, I’ll eat more) or non-directional if it does not predict whether effect is larger or smaller (e.g. if I buy cookies, I’ll eat more or less).
- One-tailed test - a statistical model that tests a directional hypothesis (Note: Basically, do not ever do a one-tailed test unless you are absolutely sure of the direction)
- Two-tailed test - a statistical model that tests a non-directional hypothesis
- Sample Size and Statistical Significance - there is a close relationship between sample size and statistical significance (the p-value). The same effect will have different p-values in different sized samples (i.e. small differences can be ‘significant’ in large samples and large effects might be deemed ‘non-significant’ in small samples)
- Effect Size and Cohen’s d - statistical significance does not tell us about the importance of an effect. The solution is to measure the size of the effect, which is a quantitative measure of the strength of a phenomenon. In order to compare the mean of one sample to another, we calculate Cohen’s d (see the short R sketch after this list).
- Pearson’s correlation coefficient (aka r) - measure of strength of relationship between two variables (thus it’s an effect size)
- as an effect size, the absolute value of r ranges between 0 (no effect) and 1 (a perfect effect)
- r=.10 means a small effect
- r=.30 means a medium effect
- r=.50 means a large effect
- Note: r=.6 does not mean it has twice the effect of r=.3
- Meta-analysis - statistical methods for contrasting and combining results from different studies in the hope of identifying patterns among study results (i.e. conducting research about previous research). Meta-analysis allows us to achieve higher statistical power.
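Here is the effect-size sketch referenced above (made-up groups): Cohen’s d by hand using the usual pooled standard deviation, plus Pearson’s r:
group1 <- c(5, 7, 8, 6, 9, 7)
group2 <- c(4, 5, 6, 5, 6, 4)
n1 <- length(group1); n2 <- length(group2)
pooled_sd <- sqrt(((n1 - 1) * var(group1) + (n2 - 1) * var(group2)) / (n1 + n2 - 2))
(mean(group1) - mean(group2)) / pooled_sd   # Cohen's d
cor(c(1, 2, 3, 4, 5), c(2, 4, 5, 4, 6))     # Pearson's r as an effect size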
Type I and Type II Errors
There are two types of errors that we can make when testing a hypothesis. These two errors are linked; if we correct for one, the other is affected. You can visualize this with a confusion matrix (aka contingency table, error matrix)
- Type I error (aka false positive) - occurs when we believe there is an effect in our population, when in fact there isn’t (e.g. a doctor thinking a man is pregnant). Using conventional 95% confidence level, the probability is 5% of seeing this error. This means if we repeated our experiment, we would get this error 5 out of 100 times.
- Familywise (FWER) or Experimentwise Error Rate (EER) - the error rate across statistical tests conducted on the same data for Type I errors. This can be calculated using the following equation (assuming .05 level of significance):
familywise error = 1 - (0.95)^n
where n is the number of tests carried out.
- Bonferroni correction - the easiest and most popular way to correct the familywise error rate: divide the significance level by the number of tests. This fixes the familywise error rate, but at the loss of statistical power.
- Type II error (aka false negative, lack of Statistical Power) - opposite of Type I error, occurs when we believe there is no effect in the population when in fact there is an effect (e.g. a doctor thinks a woman is not pregnant, but she is). The maximum acceptable probability of a Type II error is 20%. This means we miss 1 in 5 genuine effects.
- Statistical Power - statistical power is the ability of a test to find an effect; the power of the test is the probability that a given test will find an effect assuming that one exists in the population. Statistical power is closely linked with the sample size. We aim for an 80% chance of detecting an effect if one exists.
- Note: For R, use the package ‘pwr’ to use power to calculate the necessary sample size. You can also calculate the power of a test after the experiment, but if you find a non-significant effect, it might be that you didn’t have enough power. If you find a significant effect, then you had enough power.
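A rough sketch of the ‘pwr’ package mentioned above (the effect size and thresholds here are hypothetical), plus a Bonferroni correction with base R’s p.adjust():
library(pwr)
# Sample size per group for an independent t-test that can detect a medium
# effect (d = 0.5) with 80% power at the .05 significance level
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.80, type = "two.sample")
# Bonferroni-corrected p-values for several tests on the same data
p.adjust(c(0.01, 0.04, 0.03, 0.20), method = "bonferroni")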
Parametric Statistics
A branch of statistics that assumes the data comes from a type of probability distribution and makes inferences about the parameters of the distribution. The assumptions are:
- normal distribution - different depending on the test you’re using (remember sample size affects how we test for normal distribution). What we want is skew = 0 and kurtosis = 0. We can eyeball this in R with ‘stat_function()’ and a Q-Q plot. To be more accurate (instead of just eyeballing), we can use ‘describe()’.
- stat_function() - draws a function over the current plot layer; for example, using the argument fun = dnorm, which returns the probability density for a given value.
- Q-Q plot (quantile-quantile plot) - a quantile is the proportion of cases we find below a certain value; using the stat = "qq" argument in qplot(), we can plot the cumulative values we have in our data against the cumulative probability of a normal distribution (i.e. data is ranked and sorted). If the data are normally distributed, the observed values will have the same distribution as the expected normal values, so we’ll get a straight diagonal line.
- describe() or stat.desc() - We can get descriptive summaries of our data with the describe() function of the ‘psych’ package or the stat.desc() function of the ‘pastecs’ package. Again, we’re interested in skew = 0 and kurtosis = 0, from which we can calculate z-scores (so that we can compare against samples that used different measures and see how likely our values of skew and kurtosis are to occur).
- z-score skewness calculation - z_skew = skew / SE(skew)
- z-score kurtosis calculation - z_kurtosis = kurtosis / SE(kurtosis)
- Note: interpreting z-scores depends on the sample size. An absolute value greater than 1.96 is significant at ‘p < .05’, greater than 2.58 at ‘p < .01’, and greater than 3.29 at ‘p < .001’
- small samples (<30) - it’s okay to look for values above 1.96
- large samples(>30 and <200) - it’s okay to look for values above 2.58
- very large samples (200+) - look at the shape of the distribution visually and at the value of the skew and kurtosis statistics instead of calculating their significance (because they are likely to be significant even when skew and kurtosis are not too different from normal)
- skew.2SE and kurt.2SE - stat.desc() gives us skew.2SE and kurt.2SE, which stand for the skew and kurtosis value divided by 2 standard errors (i.e. instead of the values above, we can say an absolute value greater than 1 is significant at ‘p < .05’, greater than 1.29 at ‘p < .01’, and greater than 1.65 at ‘p < .001’)
- Shapiro-Wilk Test of Normality - Shapiro-Wilk is a way of looking for normal distribution by checking whether the distribution as a whole deviates from a comparable normal distribution. This is represented as normtest.W (W) and normtest.p (p-value) through either the stat.desc() or shapiro.test() functions (see the short R sketch after this list)
- interval data - data should be measured at least at the interval level (tested with common sense)
- independence - different depending on the test you’re using
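Here is the normality-check sketch referenced above, using simulated data; stat.desc() needs the ‘pastecs’ package and describe() the ‘psych’ package:
set.seed(42)
scores <- rnorm(100, mean = 50, sd = 10)    # simulated, roughly normal data
shapiro.test(scores)                        # Shapiro-Wilk: a non-significant p suggests normality
qqnorm(scores); qqline(scores)              # Q-Q plot: points near the line suggest normality
# pastecs::stat.desc(scores, norm = TRUE)   # reports skew.2SE, kurt.2SE, normtest.W, normtest.p
# psych::describe(scores)                   # reports skew and kurtosis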
Homogeneity and Heterogeneity of Variance
Variances should be the same throughout the data (i.e. data is tightly packed around the mean). For example, say we measured the number of hours a person’s ear rang after a concert across multiple concerts.
- Homogeneity of Variance - After each concert, the ringing lasts about 5 hours (even if this is sometimes 10-15 hours, 20-25 hours, 40-45 hours)
- Heterogeneity of Variance - After each concert, the ringing lasts from 5 to 30 hours (say first concert is 5-10 hours, last concert is 20-50 hours)
- Levene’s test (F) - Levene’s test tests whether the variances in different groups are equal (i.e. the difference between the variances is zero). See the short R sketch after this list.
- If the test is significant at p <= .05, we can conclude that the variances are significantly different (meaning we do not have homogeneity of variance)
- If the test is non-significant at p > .05, then the variances are roughly equal and we have homogeneity of variance
- Note: in large samples, Levene’s test can be significant even when group variances are not very different; for this reason, it should be interpreted along with the variance ratio
- Hartley’s Fmax (aka variance ratio) - the ratio of the variances between the group with the biggest variance and the group with the smallest variance. This value should be smaller than the critical values.
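Here is the homogeneity-of-variance sketch referenced above (made-up data); leveneTest() comes from the ‘car’ package:
library(car)
hours   <- c(5, 6, 7, 5, 6, 10, 20, 35, 12, 28)       # ringing duration after each concert
concert <- factor(rep(c("first", "last"), each = 5))
leveneTest(hours ~ concert)       # a significant p (<= .05) means the variances differ
variances <- tapply(hours, concert, var)
max(variances) / min(variances)   # Hartley's Fmax (the variance ratio)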
Linear Regression
If you cannot make data fit into a normal distribution and you checked that the data was entered in correctly, you can do the following:
- Remove the outlier - delete the case only if you believe this is not representative (e.g. for Age someone puts 200, pregnant male)
- Data transformation - do trial and error data transformations. If you’re looking at differences between variables you must apply the same transformation to all variables. Types of transformations include:
- Log transformation log(Xi) - taking the logarithm of a set of numbers squashes the right tail of the distribution. The advantage is that this corrects for positive skew and unequal variances. The disadvantage is that you can’t take the log of zero or negative numbers (though you can add a constant first, e.g. log(Xi + 1), where the constant is the smallest number needed to make all values positive)
- Square root transformation - taking the square root brings large scores closer to the center (the square root affects larger values more than smaller values). The advantage is it corrects positive skew and unequal variances. The disadvantage is the same as log: no square root of negative numbers
- Reciprocal transformation (1/Xi) - dividing 1 by each score reduces the impact of large scores. This reverses the scores (e.g. a low score of 1 would create 1/1 = 1; a high score of 10 would create 1/10 = 0.1), so the small score becomes the bigger score. You can avoid score reversal by reversing the scores before the transformation: 1/(X highest - Xi). The advantage is this corrects positive skew and unequal variances
- Reverse Score transformation - The above transformations can correct for negative skew by reversing the scores. To do this, subtract each score from the highest score obtained or the highest score + 1 (depending if you want lowest score to be 0 or 1). Big scores have become small and small scores have become big. Make sure to reverse this or interpret the variable as reversed.
- Change the score - if transformation fails, consider replacing the score (it’s basically the lesser of two evils) with one of the following:
- Change the score to be one unit above the next highest score in the data set
- The mean plus three standard deviations (i.e. this converts back from a z-score)
- The mean plus two standard deviations (instead of three times above)
- Dummy coding - is a way of representing groups of people using only zeros and ones. To do this, we create several variables (one less than the number of groups we’re recoding) by doing:
- Count the number of groups you want to recode and subtract 1
- Create as many new (dummy) variables as step above
- Choose one of your groups as a baseline; this should be your control group or the group that represents the majority of your people (because it might be interesting to compare other groups against the majority)
- Assign the baseline group value of 0 for all of your dummy variables
- For the first dummy variable, assign value of 1 to the first group that you want to compare against the baseline and 0 for all other groups
- Repeat above step for all dummy variables
- Place all your dummy variables into analysis
- Robust test (robust statistics) - statistics that are not unduly affected by outliers or other small departures from model assumptions (i.e. if the distribution is not normal, consider using a robust test instead of a data transformation). These tests work using two concepts, the trimmed mean and the bootstrap (see the short R sketch after this list):
- trimmed mean - a mean based on the distribution of scores after you decide that some percentage of scores will be removed from each extreme (i.e. remove say 5%, 10%, or 20% of top and bottom scores before the mean is calculated)
- M-estimator - slightly different than the trimmed mean in that the M-estimator determines the amount of trimming empirically. The advantage is that we never over- or under-trim the data. The disadvantage is that sometimes M-estimators don’t give an answer.
- bootstrap - bootstrap estimates the properties of the sampling distribution from the sample data. It treats the sample data as a population from which smaller samples (bootstrap samples) are taken with replacement (i.e. puts the data back before a new sample is taken again).
- Note: R has a package called WRS that contains many robust tests; for bootstrapping, the ‘boot’ package provides boot():
my_object <- boot(data, statistic, R)
where data is the dataframe, statistic is the function you want to bootstrap (it takes the data and a vector of indices), and R is the number of bootstrap samples (usually 2,000)
- boot.ci(my_object) returns bootstrap confidence intervals; printing my_object shows the estimate of bias and the empirically derived standard error
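Here is the bootstrap sketch referenced above (made-up data); the statistic function must accept the data and a vector of indices:
library(boot)
x <- c(3, 7, 8, 12, 15, 18, 22, 40)   # a small, skewed sample
boot_mean <- function(data, indices) mean(data[indices])
my_object <- boot(data = x, statistic = boot_mean, R = 2000)
my_object                           # prints the original estimate, bias and standard error
boot.ci(my_object, type = "perc")   # percentile bootstrap confidence interval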
Linear Regression - Covariance and Correlation
We can see the relationship between variables with covariance and the correlation coefficient.
- variance - remember that variance of a single variable represents the average amount that the data vary from the mean
- covariance - covariance is a measure of how one variable changes in relation to another variable. Positive covariance means if one variable moves in a certain direction (e.g. increases), the other variable also moves in the same direction (e.g. increases). Negative covariance means if one variable moves in a certain direction (e.g. increases), the other variable moves in the opposite direction (e.g. decreases).
- deviance - remember that deviance is the difference in value vertically (i.e. the quality of fit)
- cross-product deviations - when we multiply the deviation of one variable by the corresponding deviation of another variable
- covariance - the average of the cross-product deviations (i.e. the sum of the cross-product deviations divided by the number of observations minus one, N-1)
- Note: covariance is NOT a good standardized measure of the relationship between two variables because it depends on the scales of measurement (i.e. we can’t say whether a covariance is particularly large relative to another data set unless both data sets used the same units).
- standardization - is the process of being able to compare one study to another using the same unit of measurement (e.g. we can’t compare attitude in metres)
- standard deviation - the unit of measurement that we use for standardization; it is a measure of the average deviation from the mean
- correlation coefficient - this is the standardized covariance; we multiply the standard deviation of the first variable by the standard deviation of the second variable, then divide the covariance by that product. There are two types of correlations: bivariate and partial:
- bivariate correlation - correlation between two variables. Can be:
- Pearson product-moment correlation (aka Pearson correlation coefficient) - for parametric data that requires interval data for both variables
- Spearman’s rho - for non-parametric data (so it can be used when data is non-normally distributed) and requires only ordinal data for both variables. This works by first ranking the data, then applying Pearson’s equation to the ranks.
- Kendall’s tau - for non-parametric data, similar to Spearman’s, but use this when there’s a small number of samples and there’s a lot of tied ranks (e.g. lots of ‘liked’ in ordinal ranks of: dislike, neutral, like)
- partial correlation - correlation between two variables while ‘controlling’ the effect of one or more additional variables
- Note: values range from -1 (negatively correlated) to +1 (positively correlated)
- +1 means the two variables are perfectly positively correlated (as one increases, the other increases by a proportional amount)
- 0 means no linear relationship (if one variable changes, the other stays the same)
- -1 means the two variables are perfectly negatively correlated (as one increases, the other decreases in a proportional amount)
- correlation matrix - if you want to get correlation coefficients for more than two variables (a grid of correlation coefficients)
- correlation test - if you only want a single correlation coefficient
- In R, to calculate correlations we can use the base functions cor() and cor.test(), or rcorr() from the Hmisc package (see the short R example after this list)
- causality - correlation does not give the direction of causality (i.e. correlation does not imply causation)
- third-variable problem (aka tertium quid) - causality between two variables cannot be assumed because there may be other measured or unmeasured variables affecting the results
- coefficient of determination (R^2) - a measure of the amount of variability in one variable that is shared by the other (i.e. indicates how well data fit a statistical model); this is simply the correlation coefficient squared. If we multiply this value by 100, we can say that variable A CAN (not necessarily does) account for up to X% of the variability in variable B.
- biserial and point-biserial correlation coefficient - these correlational coefficients are used when one of the two variables is dichotomous (aka binary). point-biserial correlation is used when one variable is a discrete dichotomy (i.e. dead or alive, can’t be half-dead). biserial correlation is used when that one variable is a continuous dichotomy (i.e. your grade is pass or fail, but it can have multiple levels including A+, C-, F).
- partial correlation and semi-partial correlation (aka part correlation) - a partial correlation is the relationship between two variables while controlling for the effects of a third variable on both variables in the original correlation. semi-partial correlation is the relationship between two variables while controlling for the effects of a third variable on only one of the variables in the original correlation.
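Here is the correlation example referenced above (made-up data); rcorr() from the ‘Hmisc’ package expects a matrix:
study_hours <- c(2, 4, 5, 7, 8, 10)
exam_score  <- c(55, 60, 62, 70, 74, 80)
cor(study_hours, exam_score)        # Pearson's r (use method = "spearman" or "kendall" for those)
cor.test(study_hours, exam_score)   # adds a p-value and a confidence interval
cor(study_hours, exam_score)^2      # coefficient of determination, R^2
# Hmisc::rcorr(cbind(study_hours, exam_score))   # correlation matrix plus p-values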
Linear Regression Analysis
Regression Analysis is a way of predicting an outcome variable from one predictor variable (simple regression) or from several predictor variables (multiple regression). We fit a model to our data and use it to predict values of the dependent variable from one or more independent variables.
- method of least squares - a method to find the line of best fit, which minimizes the residuals (the differences between the actual data and our model)
- regression model (aka regression line) is the line of best fit resulting from the method of least squares. Remember that even though this is the best fitting line, the model could still be a bad fit to the data
- residual sum of squares (aka sum of squared residuals) - represents the degree of inaccuracy when the best model is fitted to the data
- model sum of squares - the improvement going from the worst fit line (which just uses the mean for every value) to the best fit line (our regression model). If this value is large, then the regression model has made a big improvement on how well the outcome variable can be predicted. If this value is small, then the model isn’t much of an improvement.
- t-test (aka Student’s t-test) - is a method to test the hypothesis about the mean of a small sample from a normally distributed population when the population standard deviation is unknown. This is when you can’t compute a z-score because samples are too small or we don’t know the population variance of the population.
- t-statistic (aka t-score) - usually used in t-tests, helps us assess predictor variables in whether they improve our model.
- t-distribution (aka Student’s t-distribution) is the distribution of the t-statistic. This is a probability distribution that is used to estimate population parameters when the sample size is small and/or when the population variance is unknown.
- Akaike Information Criterion (aka AIC) - a measure of fit (like R^2), except that it penalizes the model for having more variables (whereas R^2 only fits the data better with more predictors). A bigger value means a worse fit, a smaller value means a better fit. Only compare the AIC between models of the same data (there’s no reference point; you can’t say 10 is small or big).
- Bayesian Information Criterion (aka BIC), a measure of fit like AIC, but has a larger penalty for more variables than AIC.
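A minimal simple-regression sketch in R (made-up data) showing the fit statistics discussed above:
study_hours <- c(1, 2, 3, 4, 5, 6, 7, 8)
exam_score  <- c(52, 55, 61, 60, 68, 70, 75, 79)
model <- lm(exam_score ~ study_hours)   # line of best fit via the method of least squares
summary(model)   # b coefficients with t-statistics, R^2 and adjusted R^2, overall F-test
AIC(model)       # penalized fit; only comparable across models of the same data
BIC(model)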
Linear Regression - Selecting Predictors
If you’re making a complex model, the selection of predictors and their order can have a huge impact. Choose predictors methodically; don’t just add a bunch and see what happens. Here are a few methods:
- Hierarchical Regression is where predictors are selected based on past work and the experimenter decides what order to enter the predictors into the model. Known predictors should be entered in first in order of importance in predicting the outcome.
- Forced entry is where all predictors are carefully chosen, then forced into the model simultaneously (since they all enter at once, the order doesn’t matter).
- Stepwise methods are where the predictors and their order are chosen by a mathematical criterion, with a direction of forward, backward, or both (see the short R sketch after this list).
- Forward Direction (aka Forward Selection) - An initial model is defined that contains only the constant, then the computer searches for the predictor (out of the ones available) that best predicts the outcome variable. If the model improves the ability to predict the outcome, the predictor is retained. The next predictor selected is based on the largest semi-partial correlation (i.e. the ‘new variance’: if predictor A explains 40% of the variation in the outcome variable, then there’s 60% left unexplained, and predictor B is assessed only on how much of that remaining 60% it explains). We stop selecting predictors when adding any of the remaining variables makes the AIC go up (remember lower AIC means better model)
- Backward Direction (aka Backward Elimination) - An initial model that starts with all predictor variables, removes one at a time, stops if remaining variables makes AIC go up (remember lower AIC means better model). This is the preferred method because of suppressor effects, which occurs when a predictor has an effect but only when another variable is held constant. Forward Direction is more likely than Backward Direction to exclude predictors involved in suppressor effects (and thus make a Type II error)
- Both Direction (aka stepwise) - Starts the same as Forward Direction, but whenever you add a predictor, a removal test of the least useful predictor is done to see if there’s any redundant predictors
- Note: If you used one of the above stepwise methods to get dressed on a cold day, you might put on pants first instead of underwear. It’ll see that underwear doesn’t fit now that you have pants on so it’ll skip. If you don’t mind your computer doing lots of calculations, try the all-subsets method.
- All-subsets methods is where we try every combination of predictors to see which is the best fit (using a statistic called Mallows’ Cp). You can increase accuracy, but the possibilities increase exponentially so calculations take much longer.
- Another Note: There’s a huge danger of over-fitting with too many variables so it’s important to cross-validate the model by splitting into train/test sets. Remember, the fewer predictors the better.
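Here is the stepwise sketch referenced above; base R’s step() chooses predictors by AIC (the data are simulated, so treat the output as illustrative only):
set.seed(1)
df <- data.frame(y = rnorm(50), x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
full <- lm(y ~ x1 + x2 + x3, data = df)
null <- lm(y ~ 1, data = df)
step(full, direction = "backward")                      # backward elimination
step(null, scope = formula(full), direction = "both")   # stepwise in both directions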
Linear Regression - How’s my model doing?
When making a model, we should check for two things 1.) how well the model fits the observed data through outliers, residuals, influence cases and 2.) for generalization, which is how the model generalizes to other cases outside your sample.
- Outliers and Residuals - We want to look at outliers to see if a small number of cases heavily influence the entire model. What we do is look at the residuals, which represent the error present in the model (the smaller the value, the better the fit; large values may indicate outliers).
- Unstandardized residuals (normal residuals) are measured in the same units as the outcome variable and are difficult to interpret across different models. We cannot define a universal cut-off point of what is an outlier. Instead, we need standardized residuals.
- Standardized residuals are residuals divided by an estimate of their standard deviation, which gives us the ability to compare residuals from different models using the properties of z-scores to set universal guidelines on acceptable and unacceptable values. E.g. in a normally distributed sample, 99.9% of z-scores should lie between -3.29 and 3.29; values outside this range are cause for concern and suggest the model fits those cases poorly.
- Influential Cases - We should also check to see that any influential cases aren’t greatly biasing the model. We can assess the influence of a particular case using multiple methods:
- Adjusted Predicted Value - basically, two models are created: one with a particular case and one without it. The models are then compared to see if the predicted value is the same regardless of whether the case is included. If the predicted value is the same, the model is good; if not, the model is a bad fit. The adjusted predicted value is the predicted value for the model without the case.
- DFFit is the difference between the adjusted predicted value (when the model doesn’t include a case) from the original predicted value (when the model includes the case).
- DFBeta is the difference between a parameter estimated using all cases and estimated when one case is excluded
- Studentized Residual is when the residual is divided by the standard error, which gives us a standardized value; this can be compared across different regression analyses because it is measured in standard units. It’s called the studentized residual because it follows a Student’s t-distribution. This is useful to assess the influence of a case on the ability of the model to predict that case, but doesn’t provide info on how a case influences the model as a whole (i.e. the ability to predict all cases)
- Cook’s distance is a statistic that considers the effect of a single case on the model as a whole (i.e. the overall influence of a case on the model); any values greater than 1 may be cause for concern. Use to test outliers.
- hat values (aka leverage) is another way to check the influence of the observed value of the outcome variable over the predicted values (0 means the case has no influence up to 1 meaning the case has complete influence). If no cases exert undue influence over the model, then all leverage values should be close to the average value. Some recommend investigating cases greater than twice to three times the average.
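A quick sketch of these influence diagnostics in base R, fitted on the built-in mtcars data purely for illustration:
model <- lm(mpg ~ wt + hp, data = mtcars)
rstandard(model)        # standardized residuals (absolute values above 3.29 are a concern)
rstudent(model)         # studentized residuals
cooks.distance(model)   # overall influence of each case (values > 1 may be a concern)
hatvalues(model)        # leverage; compare against the average hat value
dfbeta(model)           # change in each coefficient when a case is excluded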
Linear Regression - Assumptions
- Generalization asks the question of whether our model can generalize to other cases outside of this sample (i.e. apply to a wider population)? We need to check for these assumptions in regression analysis. Once these below assumptions are met, the coefficients and parameters of the regression equation are said to be unbiased. An unbiased model tells us that on average, the regression model from the sample is the same as the population model (but not 100%)
- Variable types - All predictor variables must be quantitative or categorical (with two categories). The outcome variable must be quantitative (measured at the interval level), continuous, and unbounded (no constraints on the variability of the outcome; e.g. if the outcome is a measure ranging from 1 to 10 yet the data collected vary between 3 and 7, then it’s constrained)
- Non-zero variance - Predictors should have some variation in value (i.e. do not have variances of 0)
- No perfect multicollinearity - Should be no perfect linear relationship between two or more predictors (and don’t correlate too highly).
- Predictors are uncorrelated with ‘external variables’ - external variables are variables that haven’t been included in the regression model but that influence the outcome variable. If external variables do correlate with the predictors, then the conclusions we draw from the model are unreliable (because other variables exist that can predict the outcome)
- Homoscedasticity - At each level of the predictor variable(s), the variance of the residual terms should be constant. The residuals at each level of the predictor(s) should have the same variance (homoscedasticity); when the variances are very unequal there is heteroscedasticity
- Independent Errors - For any two observations the residual terms should be uncorrelated (i.e. independent, sometimes called lack of autocorrelation).
- Durbin-Watson test is a test that checks for serial correlations between errors (i.e. whether adjacent residuals are correlated). The test statistic can vary between 0 and 4:
- 2 means that the residuals are uncorrelated
- <2 means a positive correlation between adjacent residuals
- >2 means a negative correlation between adjacent residuals
- <1 or >3 means definite cause for concern
- Even if value is close to 2, can still be cause for concern (this is just a quick test, still depends on sample size and model)
- Note: this test depends on the order; if you reorder the data, you’ll get a different value
- Normally Distributed Errors - assumed that the residuals in the model are random, normally distributed variables with a mean of 0, which means the differences between the model and the observed data are most frequently zero or close to zero and that differences greater than zero are rare. Note: does not mean that predictors have to be normally distributed.
- Independence - All the values of the outcome variable are independent
- Linearity - the mean values of the outcome variable for each increment of the predictor(s) lie along a straight line (i.e. this is a regression, we should be modeling a linear relationship)
- Cross-validation - A way to assess the accuracy of our model/ how well our model can predict the outcome in a different sample. If a model is generalized, it can predict another sample well. If the model is not generalized, it can’t predict another sample well.
- Adjusted R^2 - this adjusted value indicates the loss of predictive power (aka shrinkage). Software usually reports adjusted R^2 using Wherry’s equation; Stein’s formula is preferred for estimating how well the model cross-validates.
- Data splitting - Usually split data randomly to 80% train, 20% test
- Sample size - Depends on the size of effect (i.e. how well our predictors predict the outcome), how much statistical power we want and the number of predictors. As a general rough guide, check out Figure 7.10 in the book (Page 275)
- Multicollinearity - exists when there is a strong relationship between two or more predictors in a regression model. Perfect collinearity exists when at least one predictor is a perfect linear combination of the others (e.g. two predictors have a correlation coefficient of 1). As collinearity increases, these three problems arise:
- Untrustworthy b - the standard errors of the b coefficients increase as collinearity increases; big standard errors for b coefficients mean that the bs are more variable across samples, thus b is less likely to represent the population, and thus predictor equations will be unstable across samples
- limits the size of R - uncorrelated predictors each contribute unique variance, so highly correlated predictors limit how much R (and R^2) can improve
- importance of predictors - multicollinearity makes it difficult to assess the individual importance of a predictor. If the predictors are highly correlated then we can’t tell which of say two variables is important
- Testing collinearity - To test for collinearity, you can do the following:
- correlation matrix - shows relationships between pairs of variables, with anything above say .8 as an indicator of highly correlated variables that might be an issue; however, it can miss multicollinearity since it only looks at two variables at a time
- variance inflation factor (VIF) is a collinearity diagnostic that indicates whether a predictor has a strong linear relationship with the other predictor(s) and is good for spotting relationships between multiple variables. There are no hard and fast rules, but a value of 10 is a good point to start worrying, and if the average VIF is substantially greater than 1, then multicollinearity might be biasing the model
- tolerance statistic is the reciprocal of the variance inflation factor (i.e. 1/VIF). Any values below .1 are a serious concern
- Plotting - a good, quick way to check the assumptions of regression and make sure the model generalizes beyond your sample (see the short R sketch after this list)
- Graph the standardized residuals against the fitted (predicted) values.
- If the plot looks like a random array of dots, then it’s good.
- If the dots seem to get more or less spread out across the graph (like a funnel shape) then this is probably a violation of the assumption of homogeneity of variance.
- If the dots have a pattern to them (like a curved shape) then this is probably a violation of the assumption of linearity
- If the dots have a pattern and are more spread out at some points on the plot than others then this probably reflects violations of both homogeneity of variance and linearity
- Graph the histogram of the residuals
- If the histogram looks like a normal distribution (and the Q-Q plot looks like a diagonal line), then it’s good
- If the histogram looks non-normal, then things might not be good
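Here is the plotting/diagnostics sketch referenced above, again on the built-in mtcars data; durbinWatsonTest() and vif() come from the ‘car’ package:
library(car)
model <- lm(mpg ~ wt + hp, data = mtcars)
plot(model)               # residuals vs fitted, Q-Q plot, scale-location, leverage
hist(rstandard(model))    # histogram of the standardized residuals
durbinWatsonTest(model)   # independent errors (a value near 2 is good)
vif(model)                # multicollinearity (values around 10 are worrying)
1 / vif(model)            # tolerance (values below .1 are a serious concern)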
Violating Linear Regression Assumptions
If assumptions are violated, then you cannot generalize your findings beyond your sample. You can try correcting the samples using:
- If residuals show problems with heteroscedasticity or non-normality, you could try transforming the raw data (though might not affect the residuals)
- If there’s a violation of the linearity assumption, then you could do a logistic regression instead
- You can also try a robust regression (aka bootstrapping, robust statistics), which is an alternative to the least squares regression when there’s too many outliers or influential cases
Logistic Regression
Logistic Regression - Types of Logistic Regressions
Logistic Regression is multiple regression, but with an outcome variable that is categorical and predictor variables that are continuous or categorical. There are two types of logistic regression:
- binary logistic regression is used to predict an outcome variable with a binary response (e.g. tumor cancerous or benign).
- multinomial (or polychotomous) logistic regression predicts an outcome variable that has more than two categories (e.g. favorite color).
- There’s no additional formulas between these two types of logistic regression; the reason is that multinomial logistic regression just breaks the outcome variable down into a series of comparisons between two categories. Say we have three outcome categories (A, B, C) then the analysis will consist of a series of two comparisons (e.g. A vs B and A vs C) or (A vs C and B vs C) or (B vs A and B vs C); basically you have to select a baseline category.
Comparing the Linear and Logistic Regression Models
- What are we measuring? (A comparison between linear and logistic regressions)
- logit - In linear/simple regression, you predict the value of Y given X. In logistic regression, you predict the probability of Y occurring given X. You can’t use linear regression equations on a logistic regression problem (i.e. the outcome is categorical instead of continuous) unless you do some data transformation (like the logit, which is the natural log of the odds)
- log-likelihood - In linear/simple regression, we used R^2 (based on the Pearson correlation) to compare the observed values of the outcome with the values predicted by the regression model. For logistic regression, we use the log-likelihood:
log-likelihood = sum of [ Y_i * ln(P(Y_i)) + (1 - Y_i) * ln(1 - P(Y_i)) ]
which is based on summing the probabilities associated with the predicted and actual outcomes (i.e. how much unexplained information there is after the model has been fitted). A large log-likelihood statistic means a poorly fitting model because there are more unexplained observations. This is good for overall fit; for partial fit, see the R-statistic.
- maximum-likelihood estimation (MLE) is a method to estimate the parameters of a statistical model (i.e. given a sample population, it estimates what most likely would have occurred). MLE gives you the coefficients most likely to have occurred.
- deviance (aka -2LL) is related to the log-likelihood and its equation is
deviance = -2 * log-likelihood
and sometimes used instead of the log-likelihood because it has a chi-square distribution, which makes it easy to calculate the significance of the value.
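A minimal logistic-regression sketch in R (made-up pass/fail data) showing the log-likelihood and deviance discussed above:
hours  <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
passed <- c(0, 0, 0, 0, 1, 0, 1, 1, 1, 1)
model <- glm(passed ~ hours, family = binomial)   # fitted by maximum likelihood
logLik(model)      # the log-likelihood
deviance(model)    # -2LL
summary(model)     # b coefficients with z (Wald) statistics and the AIC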
Logistic Regression - Selecting Predictors
Using the following statistics, we can see how well the model fits the data and how much each predictor contributes:
- R-statistic is the partial correlation between the outcome variable and each of the predictor variables (as opposed to the log-likelihood, which is for the overall fit instead of partial); it can be between -1 (meaning that as the predictor value increases, the likelihood of the outcome occurring decreases) and 1 (meaning that as the predictor variable increases, so does the likelihood of the event occurring).
- Hosmer and Lemeshow’s RL^2 measure is the proportional reduction in the absolute value of the log-likelihood measure and thus is a measure of how much the badness of fit improves as a result of the inclusion of the predictor values. Values range from 0 (predictors are useless at predicting the outcome variable) to 1 (model predicts the outcome variable perfectly)
- Cox and Snell’s RCS^2 and Nagelkerke’s RN^2 are other ways of getting an equivalent of R^2 in linear regression.
- Information Criteria - Penalize a model for having more predictors
- Akaike Information Criterion (AIC) for logistic regression is AIC = -2LL + 2k, where -2LL is the deviance and k is the number of predictors
- Bayes Information Criterion (BIC) for logistic regression is BIC = -2LL + k * log(n), where k is the number of predictors and n is the number of cases
- How much are each predictor(s) contributing?
- t-statistic - in linear regression, the regression coefficients b and their standard errors create the t-statistic, which tells us how much a predictor is contributing
- z-statistic (aka the Wald statistic) - in logistic regression, the z-statistic tells us if the b coefficient for that predictor is significantly different than zero. If a coefficient b is much greater than zero, then that predictor is making a significant contribution to the prediction of the outcome.
z = b / (SEb)
. Be warned: when the coefficient b is large, the standard error tends to be inflated, resulting in the z-statistic being underestimated. An inflated standard error increases the probability of dismissing a predictor as non-significant when in reality it is making a significant contribution (i.e. a Type II error). The R sketch at the end of this section shows how to read these values from a fitted model.
- So the coefficient b in a logistic regression acts as an exponent rather than a multiplier; how does that work out / what does it mean?
- odds as we normally know them: the probability of something happening divided by the probability of it not happening (e.g. the probability of becoming pregnant divided by the probability of not becoming pregnant). This isn’t the same as the logistic regression’s odds ratio mentioned below.
- odds ratio is the exponential of B (i.e., e^B), which is the change in odds resulting from a unit change in the predictor. E.g. calculate the odds of becoming pregnant when a condom is not used, calculate the odds of becoming pregnant when a condom is used, then calculate the proportionate change in odds between the two.
- Formula is:
change in odds = odds after a unit change in the predictor / original odds
- if the value is greater than 1, then as the predictor increases, the odds of the outcome occurring also increase
- if the value is less than 1, then as the predictor increases, the odds of the outcome occurring decrease
- How do we know what order to put in / take out predictors in logistic regression?
- forced entry method - the default method for conducting regression; place in one block and estimate parameters for each predictor
- stepwise method - select a forward, backward, or both stepwise method (remember the issues with it though).
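A minimal R sketch, continuing the hypothetical `model` from the earlier sketch, showing where the z-statistics, odds ratios, AIC and BIC come from:

```r
summary(model)               # z-statistics (Wald) and p-values for each coefficient b
exp(coef(model))             # odds ratios: e^b for each predictor
exp(confint.default(model))  # Wald-based confidence intervals for the odds ratios
AIC(model)                   # -2LL + 2k
BIC(model)                   # -2LL + k * log(n)
```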
Logistic Regression - Assumptions
What are Logistic Regression assumptions?
- Linearity - In linear regression we assume the outcome had a linear relationship with the predictors. However, logistic regression’s outcome is categorical so our linear regression assumptions don’t apply. Instead, with logistic regression we check if there are any linear relationships between any continuous predictors and the logit of the outcome variable.
- Independence of errors - Same as linear regression where the cases of data are not related
- Multicollinearity - Same as linear regression where predictors should not be too highly correlated; can check with the tolerance statistic and VIF statistics, the eigenvalues of the scaled, uncentred cross-products matrix, the condition indices and the variance proportions.
Logistic Regression - Issues
What are situations where Logistic Regression can cause trouble and return invalid results?
- Not enough samples for specific categories (e.g. checking if people are happy, but only having an n of 1 for people who are 80-year-old, Buddhist, left-handed lesbians). To spot this, create a crosstabulation and look for really high standard errors. To fix it, collect more data.
- complete separation is where the outcome variable can be perfectly predicted by one variable or a combination of variables. This causes issues in that there’s no data in the middle probabilities (where we’re not very sure of the probability), which can produce a wide range of curves/probabilities. E.g. telling cats from burglars: there are no cats that weigh more than 15kg and no burglars that weigh less than 30kg. The issue is usually caused when there are too many variables fitted to too few cases. To fix it, collect more data or use a simpler model.
Comparing two means (i.e. t-test)
We’ve looked at relationships between variables, but sometimes we’re interested in differences between groups of people. This is useful in making causal inferences (e.g. two groups of people, one gets a sugar pill and the other gets the actual pill). There are two different ways to expose people to experiments: the independent-means t-test and the dependent-means t-test. t-tests are basically regressions, so they have much of the same assumptions (they are parametric tests based on a normal distribution).
t-test - Independent and Dependent Means t-test
- independent-means t-test (aka between groups, independent-measures, independent-samples t-test) is used when there are two experimental conditions and different participants were assigned to each condition (e.g. Group 1 gets Treatment A, Group 2 gets Treatment B). Independent t-tests also assume that scores in different treatment conditions are independent (because they come from different people) and that there is homogeneity of variance (though this only really matters if you have unequal group sizes; if it is violated, you can use Welch’s t-test, which adjusts for it, and which you should always do anyway)
- dependent-means t-test (aka within subjects, repeated measures design, matched-pairs, paired-samples t-test, paired t-test) is used when there are two experimental conditions and the same participants took part in both conditions of the experiment (e.g. Group 1 gets Treatment A, Group 2 gets Treatment B, then they swap: Group 1 gets Treatment B, Group 2 gets Treatment A). The sampling distribution of the differences between scores should be a normal distribution (not the scores themselves)
t-test - Calculations
Calculations for t-tests can be viewed as if the t-test were a linear regression (GLM). The independent t-test compares the means between two unrelated groups on the same continuous dependent variable (e.g. blood pressure of patients who were given a drug vs a control group given a placebo). The dependent t-test can be seen as a within-subjects (aka repeated-measures) test (e.g. blood pressure of patients ‘before’ vs ‘after’ they received a drug). Note: Don’t really worry about the below, R will do the calculations:
- Generalized Linear Model (GLM), (i.e. t-test as a linear regression) the t-test can be thought of as a form of linear regression. It’ll allow you to test differences between two means.
- E.g. say you have two groups (one looks at a picture of a spider while the other looks at a real spider, and you measure their anxiety as heart rate). The t-test as a GLM would set up each group (picture vs real spider) with the equation
outcome = (model) + error
. We dummy code the group variable (say Group 1 has a value of 0 and Group 2 has a value of 1). We calculate the anxiety across both groups and then test whether the difference between group means is equal to 0. The t-statistic tests if the difference between group means is significantly different from zero.
- Independent t-test - when different groups participate in different conditions, pairs of scores will differ not just because of the experiment’s manipulation, but also because of other sources of variance (e.g. IQ). Therefore, we make comparisons on a ‘per condition’ basis. By looking at the ‘per condition’ basis, we assess whether the difference between two sample means is statistically meaningful or just chance (through numerous samplings and looking at the distribution of sample means).
- Dependent t-test - since we put the same group through multiple conditions, we look at each participant’s score in condition A compared to condition B and add up the differences (which could be large or small) for all participants. We divide by the number of participants in the group to get the average difference (on average, how much each person’s score changed from condition A to condition B). We then compare (divide) this by the standard error (which represents what we would expect from random sampling alone, with no experiment). This gives us the test statistic that represents model/error.
t-statistic (aka test statistic) - t-tests produce the t-statistic, which tells us how extreme a statistical estimate is. If the experiment had any kind of effect, we expect the systematic variation to be much greater than the unsystematic variation (i.e. if t is much greater than 1, there’s an effect; if t is less than 1, there’s no effect. If t exceeds the critical value for an effect, we’re confident that this reflects an effect of our independent variable)
- effect-size and t-tests - even though a t-statistic might not be statistically significant, it doesn’t mean that our effect is unimportant. To check if an effect-size is substantive, we convert the t-statistic into the correlation coefficient r with the following equation:
r = sqrt( t^2 / (t^2 + df) )
- Reporting t-tests should involve stating the finding to which the test relates, the test statistic, the degrees of freedom, an estimate of the effect-size, and the probability of that test statistic. E.g. On average, participants experienced greater anxiety from real spiders (M = 47.00, SE = 3.18) than from pictures of spiders (M = 40.00, SE = 2.68). This difference was not significant, t(21.39) = -1.68, p > .05; however, it did represent a medium-sized effect, r = .34
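A minimal R sketch of both flavours of t-test plus the effect-size conversion above; `spider_data`, `anxiety`, `group`, `real` and `picture` are hypothetical names:

```r
# Independent-means t-test (Welch's adjustment is R's default)
ind <- t.test(anxiety ~ group, data = spider_data)

# Dependent-means (paired) t-test on wide-format columns
dep <- t.test(spider_data$real, spider_data$picture, paired = TRUE)

# Effect size r from the t-statistic and its degrees of freedom
t  <- ind$statistic
df <- ind$parameter
r  <- sqrt(t^2 / (t^2 + df))
round(r, 3)
```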
## One-way ANOVA
Comparing Several Means with ANOVA (Analysis of Variance GLM 1, aka One-way ANOVA)
If we want to compare more than two conditions, we use one-way ANOVA. t-tests checked whether two samples have the same mean while ANOVA checks if three or more groups have the same means. ANOVA is an omnibus test, which means it tests for an overall effect between all groups and does not say what group has a different mean.
- familywise (aka experimentwise) error rate is the error rate across statistical tests conducted on the same experimental data. For example, we use ANOVA instead of multiple t-tests on each pair of groups because the probability of Type I errors would quickly stack up (e.g. at a .05 level of significance with 3 pairs, the probability of no Type I error would be .95 * .95 * .95 = .857, so the familywise error rate is 1 - .857 = .143). The more groups, the greater the chance of an error.
- F-statistic (aka F-ratio, F-test) is a test statistic that is the systematic variance divided by the unsystematic variance (i.e. a measure of how much the model has improved the prediction of the outcome compared to the level of inaccuracy of the model). A large value (greater than at least 1) means a good model. Basically, it’s the ratio of the model to its error (much like the t-statistic). Say we have Groups A, B, C. The F-ratio tells us if the means of these three groups (as a whole) are different. It can’t tell us which groups are different (e.g. whether groups A and B are the same and C is different, or whether groups A, B, C are all different). The F-ratio says that the experimental manipulation has some effect, but doesn’t say what causes the effect. The F-ratio can be used to fit a multiple regression model and test the differences between the means (again, much like the t-statistic with a linear regression). The F-ratio is based on the ratio of the improvement due to the model (the model sum of squares) to the difference between the model and the observed data (the residual sum of squares).
- When creating dummy variables for ANOVA groups (e.g. Groups A, B, C), we have one less dummy variable than the number of groups in the experiment. We choose a control group (aka baseline group) that acts as a baseline for the other categories. This baseline group is usually the one with no experimental manipulation (e.g. the baseline is no viagra and the others are low dose and high dose viagra). When group sizes are unequal, the baseline group should be the one with a large number of samples.
- Assessing variation (deviance)
At every stage of the ANOVA we’re assessing variation (or deviance) from a particular model using the formula
deviation = sum((observed - model)^2)
. We calculate the fit of the most basic model, then the fit of the best model; if the model is any good then it should fit the data significantly better than the basic model.
- Total Sum of Squares (aka TSS, SST) gives us the total amount of variation. We calculate the difference between each observed data point and the grand mean. We then square these differences and add them together to get us the total sum of squares.
- For TSS, the degrees of freedom are one less than the total sample size (N-1), because the grand mean is the constant being held. The equation is
N-1
E.g. if we have 15 participants, the degrees of freedom are 14.
- grand variance is the variance between all scores, regardless of the experimental condition, and can be used to calculate the total sum of squares with the equation
SSt = grand variance * (N - 1)
- Model Sum of Squares (SSm) tells us how much of the total variation can be explained by the fact that different data points come from different groups. We calculate the difference between the mean of each group and the grand mean, square the differences, multiply each result by the number of participants within that group, then add the values for each group together.
- For SSm, the degrees of freedom are one less than the number of parameters estimated. The equation is:
k-1
E.g. with 3 groups of participants, we have 2 degrees of freedom.
- Residual Sum of Squares (SSr) tells us how much of the variation cannot be explained by the model (e.g. individual differences in weight, testosterone, etc. in participants). SSr can be seen by looking at the difference between the score obtained by a person and the mean of the group that the person belongs to.
- For SSr, the degrees of freedom are the total degrees of freedom minus the degrees of freedom for the model; this works out to N - k (the total sample size minus the number of groups). E.g. with 15 participants and 3 groups, we have 14 - 2 = 12 degrees of freedom.
- Mean Squares (MS) eliminates the bias of the number of scores (because SSm and SSr tell us the total, not the average). MS gives us the sum of squares divided by the degrees of freedom.
- MSm is the average amount of variation explained by the model (the systematic variation)
- MSr is the average amount of variation explained by extraneous variables (the unsystematic variation)
- F-ratio is the measure of the ratio of the variation explained by the model and the variation explained by unsystematic factors (i.e. how good the model is against how much error there is). The equation is:
F = MSm / MSr
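A minimal R sketch of a one-way ANOVA, assuming a hypothetical data frame `viagra` with a continuous outcome `libido` and a three-level factor `dose` (placebo, low, high):

```r
model <- aov(libido ~ dose, data = viagra)
summary(model)   # shows SSm, SSr, the mean squares, the F-ratio and its p-value
```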
One-way ANOVA - Assumptions
- Homogeneity of variance - the variances of the groups are equal. You can check this using Levene’s test, which is an ANOVA test done on the absolute differences between the observed data and the mean or median. If Levene’s test is significant (i.e. p-value is less than .05) then the variances are significantly different and we shouldn’t use ANOVA.
- Note that ANOVA is a robust test, which means it doesn’t matter if we break some assumptions (the F-ratio is still accurate). When group sizes are equal, then F-ratio is quite robust. However, when group sizes are not equal, the accuracy of F-ratio is affected by skew.
- If homogeneity of variance has been violated, you can try different data transformations or you can try a different version of the F-ratio, like Welch’s F.
- If there’s distributional problems, there are other methods like bootstrapping or trimmed means and M-estimators that can correct for it.
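A minimal sketch of the checks above, reusing the hypothetical `viagra` data; `leveneTest()` is from the `car` package and `oneway.test()` is base R's Welch F:

```r
library(car)

leveneTest(viagra$libido, viagra$dose)                        # homogeneity of variance
oneway.test(libido ~ dose, data = viagra, var.equal = FALSE)  # Welch's F-ratio
```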
Comparing different groups (Planned Contrasts and Post Hoc Comparisons)
Planned contrasts and post hoc comparisons are two methods that tell us which groups differ in an ANOVA. This is important because the F-ratio tells us if there’s an effect, but not which group causes it. Planned contrasts and post hoc comparisons are ways of comparing different groups without inflating the familywise error rate.
Planned Contrasts (aka Planned Comparisons) - break down the variance accounted for by the model into component parts. This is done when you have specific hypotheses to test. Some rules:
- If we have a control group, we compare this against the other groups
- Each comparison must compare only two ‘chunks’ of variation (e.g. low dose, high dose)
- Once a group has been singled out in a comparison, it can’t be used in another comparison (we’re slicing up the data like a cake, the same part of a cake can’t be on multiple slices)
- When we’re comparing two groups, we’re just comparing the mean of the group to the mean of the other group
- E.g. We compare the average of the placebo group to the average of the low dose and high dose groups; if the standard errors are the same, the experimental group with the highest mean (high dose) will be significantly different from the mean of the placebo group.
Planned contrasts can be either: Orthogonal comparisons or Non-orthogonal comparisons
- Orthogonal comparisons
- Now we want to answer, what groups do we compare? Instead of creating dummy variables of 0 and 1 as we would the main ANOVA, we instead assign a weight. Here’s a few basic rules for that:
- Choose sensible comparisons
- Groups coded with positive weights will be compared against groups coded with negative weights (so assign one chunk a positive, the other a negative; it’s arbitrary which one is positive or negative)
- The sum of weights for a comparison should be zero
- If a group is not involved in a comparison, it’s automatically assigned a weight of 0 (which eliminates it from all calculations)
- For a given comparison, the weights assigned to the group in one chunk of variation should be equal to the number of groups in the opposite chunk of variation
- Let’s go through an example of how to apply these:
1.) Example (Step 1): Chunk1 (Low Dose vs High Dose) vs Chunk2 (Placebo)
2.) Example (Step 2): Chunk1 (Positive weight) vs Chunk2 (Negative Weight)
3.) Example (Step 3): Chunk1 contains two groups (‘Low Dose’ and ‘High Dose’); the opposite chunk contains one group, so each gets a magnitude of 1 and (being the positive chunk) a weight of +1 (for a sum of +2). Chunk2 contains only one group (‘Placebo’); the opposite chunk contains two groups, so it gets a magnitude of 2 and a weight of -2.
4.) The sum of the weights should be zero, and it is: +1 (Low Dose) +1 (High Dose) -2 (Placebo) = 0
5.) To compare just two groups (Low Dose and High Dose), we give Placebo a weight of 0 (which eliminates it from the calculations). We now have a comparison of Chunk1 (High Dose, magnitude 1, weight +1) vs Chunk2 (Low Dose, magnitude 1, weight -1).
6.) We now have the following dummy variables:
| Group | Dummy variable 1 (comp1) | Dummy variable 2 (comp2) | Product (comp1 * comp2) |
| --- | --- | --- | --- |
| Placebo | -2 | 0 | 0 |
| Low Dose | 1 | -1 | -1 |
| High Dose | 1 | 1 | 1 |
| Total | 0 | 0 | 0 |
7.) We want to do an orthogonal comparison (basically make sure that the Total row is 0), which means our comparisons are independent (so we can use t-tests). The p-values won’t be correlated, so we won’t get familywise errors
8.) From the significance values of the t-tests we can see if the experimental groups (Low Dose, High Dose) were significantly different from the control (Placebo).
- Non-orthogonal comparisons
- Similar to orthogonal comparisons, except that the products of the weights across comparisons don’t have to sum to zero (i.e. the comparisons are not independent of each other).
- Example, you can compare across different levels, say Chunk1 is ‘High Dose’ only and Chunk2 is ‘Placebo’
| Group | Dummy variable 1 (comp1) | Dummy variable 2 (comp2) | Product (comp1 * comp2) |
| --- | --- | --- | --- |
| Placebo | -2 | -1 | 2 |
| Low Dose | 1 | 0 | 0 |
| High Dose | 1 | 1 | 1 |
| Total | 0 | 0 | 3 |
- The _p-values_ here are correlated so you'll need to be careful how to interpret the results. Instead of .05 probability, you might want to have a more conservative level before accepting that a given comparison is statistically meaningful.
- Polynomial Contrast (trend analysis) is a more complex way of looking at comparisons. The different groups usually represent a different amount of a single common variable (e.g. amount of dosage for drug) and its effect
- Linear Trend is where the group means increase proportionately (e.g. a positive linear trend is where the more of the drug we give, the more effect, looks like a linear line)
- Quadratic Trend is where the group means increase at the beginning, peak in the middle, then go down (e.g. a drug enhances performance up to a certain dose, but beyond that, the more you give the worse the performance); basically there’s one direction change. This requires at least 3 groups.
- Cubic Trend is where the group means goes up, down, up (or vice versa); basically there’s two direction changes. This requires at least 4 groups.
- Quartic Trend is where the group means go up, down, up, down (or vice versa); basically there are three direction changes. This requires at least 5 groups.
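A minimal R sketch of the planned contrasts from the Placebo / Low Dose / High Dose example above, plus a polynomial (trend) contrast; the `viagra` data frame is the same hypothetical one as before and `dose` must be a factor:

```r
# Orthogonal planned contrasts: Placebo vs both dose groups, then Low vs High
contrast1 <- c(-2, 1, 1)
contrast2 <- c(0, -1, 1)
contrasts(viagra$dose) <- cbind(contrast1, contrast2)
planned <- aov(libido ~ dose, data = viagra)
summary.lm(planned)      # t-test and p-value for each planned contrast

# Trend analysis: linear and quadratic polynomial contrasts for 3 ordered groups
contrasts(viagra$dose) <- contr.poly(3)
trend <- aov(libido ~ dose, data = viagra)
summary.lm(trend)
```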
Post hoc comparisons (aka post hoc test, data mining, exploring data)
Compare every group (like you’re doing multiple t-tests) but using a stricter acceptance criterion so that the familywise error rate doesn’t rise above the acceptable .05. This is done when you have no specific hypothesis to test.
- post hoc comparisons are kinda cheating since they consist of pairwise comparisons that compare all different combinations of the treatment groups. However, pairwise comparisons control familywise errors by correcting the level of significance for each test so that the overall Type I error rate across all comparisons remains at .05 (an R sketch of these corrections follows at the end of this section). This is accomplished in multiple ways:
- Bonferroni correction trades control of the familywise error rate for a loss of statistical power; this means that the probability of rejecting an effect that does actually exist is increased (a Type II error). For example, if we do 10 tests, we use .005 as our criterion for significance (instead of .05). The formula is
critical p = alpha / (number of comparisons)
- Holm’s method computes the p-values for all of the pairs of groups in your data, then orders them from smallest to largest. We start the same as the normal Bonferroni correction for the first comparison, but for each subsequent comparison the criterion gets bigger (i.e. less conservative). This method is stepped, which means we continue as long as comparisons are significant. If there’s a non-significant comparison, we stop and assume all remaining comparisons are also non-significant.
- Benjamini-Hochberg’s method doesn’t focus on controlling the Type I error rate like the above methods (basically, if we make a Type I error, it’s not that bad) and instead focuses on the False Discovery Rate (FDR), which is the proportion of falsely rejected null hypotheses to the total number of rejected null hypotheses. Benjamini-Hochberg tries to keep the FDR under control instead of the familywise error rate. Like Holm’s method, you compute the p-values for all pairs of groups in your data and order them similarly (smallest to largest). The difference is that this method is step-up, which means instead of working down the table we work up. We say the bottom value is non-significant and continue moving up the table until we see a significant comparison, then assume that all comparisons above it are also significant.
- So which post hoc procedure should you use? There’s these and numerous others that R doesn’t do out of the box. When deciding, consider:
- Whether the test controls the Type I error rate
- Whether the test controls the Type II error rate (i.e. has good statistical power)
- Whether the test is reliable when the test assumptions of ANOVA have been violated
- Bonferroni and Tukey’s Honest Significant Difference (HSD) tests (aka Tukey’s range test, Tukey method) both control the Type I error rate, but are conservative so they lack statistical power. Bonferroni has more power when the number of comparisons is small. Tukey is more powerful when testing large numbers of means.
- Benjamini-Hochberg has more power than Holm’s; Holm’s has more power than Bonferroni. Just remember Benjamini-Hochberg doesn’t attempt to control Type I errors.
- Warning: The above post-hoc comparisons perform badly when group sizes are unequal and when population variances are different. If this is the case, look up bootstrapping or trimmed means and M-estimators (both of which include a bootstrap). Use bootstrap to control Type I errors and M-estimators if you want more statistical power.
- Note: for some reason, the effect size r^2 specifically for ANOVAs is called eta squared and is written η^2 (often typed as n^2)
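A minimal R sketch of the post hoc corrections discussed above, again on the hypothetical `viagra` data:

```r
pairwise.t.test(viagra$libido, viagra$dose, p.adjust.method = "bonferroni")
pairwise.t.test(viagra$libido, viagra$dose, p.adjust.method = "holm")
pairwise.t.test(viagra$libido, viagra$dose, p.adjust.method = "BH")   # Benjamini-Hochberg
TukeyHSD(aov(libido ~ dose, data = viagra))                           # Tukey's HSD
```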
Comparing Several Means with ANCOVA (Analysis of Covariance GLM 2)
ANCOVA is like ANOVA, but also includes covariates, which are one or more continuous variables that are not part of the main experimental manipulation but have an influence on the outcome (aka dependent variable). We include covariates for two reasons:
- To reduce within-group error variance: In ANOVA, if we can explain some of the ‘unexplained’ variance (SSr) in terms of covariates, then we reduce the error variance, allowing us to more accurately assess the effect of the independent variable (SSm)
- Elimination of confounds: this means that an experiment has unmeasured variables that confound the results (i.e. variables other than the experimental manipulation that affect the outcome variable). ANCOVA removes the bias from the variables that influence the independent variable.
- E.g. Say we look at the example with the effects of Viagra on the libido; the covariates would be other things like medication (e.g. antidepressants). ANCOVA attempts to measure these continuous variables and include them in the regression model.
ANCOVA Assumptions
ANCOVA has the same assumptions as any linear model with these two additional considerations:
- independence of the covariate and treatment effect - the variance explained by the covariate should only overlap with the unexplained variance (SSr), not with the variance explained by the independent variable, or else the treatment effect is obscured.
- homogeneity of regression slopes - the slopes of the different regression lines should all be equivalent (i.e. parallel among groups). For example, if there’s a positive relationship between the covariate and the outcome in one group, then we assume there’s a positive relationship for all other groups
ANCOVA Calculations
- Sum of Squares (Type I, II, III, IV)
- Remember that order matters when evaluating
- Type I Sum of Squares We put one predictor into the model first (it gets evaluated), then the second (then it gets evaluated), etc. Use if the variables are completely independent of each other (unlikely); only then can Type I SS evaluate the true effect of each variable.
- Type II Sum of Squares Use if you’re interested in main effects; it gives an accurate picture of a main effect because they’re evaluated ignoring the effect of any interactions involving the main effect under consideration. However, if an interaction is present, then Type II can’t evaluate main effects (because variance from the interaction term is attributed to them)
- Type III Sum of Squares is usually the default; use this when sample sizes are unequal. However, they only work when predictors are encoded with orthogonal contrasts rather than non-orthogonal contrasts (see the R sketch at the end of this section).
- Type IV Sum of Squares is the same as Type III, but is designed for situations in which there’s missing data.
- Interpreting the covariate - Draw a scatterplot of the covariate against the outcome (e.g. participant’s libido as y, partner’s libido as x). If covariate is positive, then there’s a positive relationship where the covariate increases, so does the outcome. If covariate is negative, then there’s a negative relationship where the covariate increases, the outcome decreases.
- Partial eta squared (aka partial n^2) is an effect size measure for ANCOVA (kinda similar to eta squared (n^2) in ANOVA or r^2). This differs from eta squared in that it looks not at the proportion of total variance that a variable explains, but at the proportion of variance that a variable explains that is not explained by other variables in the analysis.
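A minimal R sketch of an ANCOVA with the partner's libido as a covariate and Type III sums of squares via `car::Anova()`; `partnerLibido` is a hypothetical column name in the hypothetical `viagra` data:

```r
library(car)

viagra$dose <- factor(viagra$dose)
contrasts(viagra$dose) <- contr.helmert(3)     # orthogonal contrasts, needed for Type III SS
ancova <- aov(libido ~ partnerLibido + dose, data = viagra)
Anova(ancova, type = "III")                    # Type III sums of squares
```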
Factorial ANOVA
ANOVA and ANCOVA look at differences between groups with only a single independent variable (i.e. just one variable being manipulated). Factorial ANOVA looks at differences between groups with two or more independent variables. When there are two or more independent variables, it is a factorial design (because independent variables are sometimes known as factors). There are multiple types of this design, including:
- independent factorial design - there are several independent variables and each has been measured using different entities
- repeated-measures (related) factorial design - there are several independent variables measured, but the same entities have been used in all conditions
- mixed design - several independent variables have been measured, some have been measured with different entities and some used the same entities
ANOVA Naming Convention
The names of ANOVAs can seem confusing, but are easy to break down. We’re simply saying how many independent variables there are and how they were measured. This means:
- The quantity of independent variables (e.g. one independent variable translates to ‘one-way independent’)
- Are the people being measured the same or different participants?
- If the same participants, we say repeated measures
- If different participants, we say independent
- If there are two or more independent variables, then it’s possible some variables use the same participants while others use different participants so we say mixed
Independent Factorial Design ANOVA (aka Between Groups, Between-Subjects ANOVA, GLM 3)
An example of Factorial ANOVA using two independent variables is looking at the effects of alcohol on mate selection at nightclubs. The hypothesis was that after alcohol has been consumed (the first independent variable), subjective perceptions of physical attractiveness would become more inaccurate. Say we’re also interested if this effect is different for men and women (this is the second independent variable). We break groups into gender (male, female), drinks (none, 2 pints, 4 pints), and measured based off an independent assessment of attractiveness (say 1 to 100).
- Calculations for a Two-way ANOVA are very similar to a One-way ANOVA, with the exception that in a Two-way ANOVA the variance that is explained by the experiment (SSm) is broken down into the following:
- SSa (aka main effect of variable A) is the variance explained by Variable A
- SSb (aka main effect of variable B) - is the variance explained by Variable B
- SSa*b (aka the interaction effect) - is the variance explained by the interaction of Variable A and Variable B
- For the F-ratio, each one of the above effects (SSa, SSb, and SSa*b) all have their own F-ratio so that we can tell if each effect happened by chance or reflects an effect of our experimental manipulations
- interaction graphs can help you interpret and visualize significant interaction effects
- Once again, we run Levene’s test to see if there are any significant differences between group variances (homogeneity of variance); for example, check if the variance in attractiveness differs across all different gender and alcohol group combinations.
- For choosing contrasts, we need to define contrasts for all of the independent variables. If we use Type II sums of squares, then we have to use an orthogonal contrast. If we want to look at the interaction effect (i.e. the effect of one independent variable at individual levels of the other independent variable), we use a technique called simple effects analysis
- Like ANOVA, do a post hoc test on the main effects (e.g. SSa, SSb) in order to see where the differences between groups are. If you want to see the interaction effect (e.g. SSa*b), then look at contrasts.
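A minimal R sketch of the alcohol/attractiveness example as a two-way independent ANOVA; `goggles` with columns `attractiveness`, `gender` and `alcohol` is a hypothetical data frame:

```r
model <- aov(attractiveness ~ gender * alcohol, data = goggles)
summary(model)   # F-ratios for the main effects (SSa, SSb) and the interaction (SSa*b)

# Interaction graph: mean attractiveness per alcohol level, one line per gender
interaction.plot(goggles$alcohol, goggles$gender, goggles$attractiveness)
```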
Repeated measures is when the same entities participate in all conditions of the experiment. A Repeated-Measures ANOVA is like a regular ANOVA, but it violates the assumption that scores in different conditions are independent (scores are likely to be related because they’re from the same people), which will cause the F-test to lack accuracy.
- Since the F-test lacks accuracy, we have to make a different assumption called the assumption of sphericity (aka circularity), which is an assumption about the structure of the covariance matrix; we assume the relationship between pairs of experimental conditions is similar (more precisely, that the variances of the differences between treatment levels are about the same). That means we calculate the differences between pairs of scores for all combinations of the treatment levels and then calculate the variance of these differences. As such, sphericity is only an issue with three or more conditions.
- compound symmetry is a stricter requirement than sphericity (if this is met, so is sphericity; if this is not met, you still need to check for sphericity). This requirement is true when both the variances across conditions are equal (same as the homogeneity of variance assumption in between-group designs) and the covariances between pairs of conditions are equal.
- Mauchly’s test is a way to assess sphericity and lets you know if the variances of the differences between conditions are equal. If Mauchly’s test is significant, then we have to be wary of the F-ratio; otherwise you’re okay. However, remember that this is also dependent on sample size: in large samples, small deviations from sphericity can be significant, while in small samples, large violations can be non-significant.
- Violating sphericity means a loss of statistical power and it also impacts post hoc tests. If sphericity is definitely not violated, use Tukey’s test. If sphericity is violated, then using the Bonferroni method is usually the most robust of the univariate techniques in terms of power and control of Type I error rate.
- epsilon is a descriptive statistic that indicates the degree to which sphericity has been violated. Its lower bound is
1/(k-1)
where k is the number of repeated-measures conditions. The closer epsilon is to 1, the more homogeneous the variances of differences (and the closer the data are to being spherical).
- If data violates the assumption of sphericity, then we can apply a correction factor that is applied to the degrees of freedom used to assess the F-ratio (or we can not use the F-ratio).
- Greenhouse-Geisser correction - use if epsilon is less than .75 or if nothing is known about sphericity (otherwise the correction is too conservative and too many false null hypotheses fail to be rejected). Go look up how to do this.
- Huynh-Feldt correction - use if the estimate is more than .75
- Use a different test other than the F-ratio like using multivariate test statistics (multivariate analysis, MANOVA) because they don’t depend on the assumption of sphericity. There’s a trade-off in power between univariate and multivariate tests. Go look up how to do this.
- Analyze the data as a multilevel model so we can interpret the model coefficients without worrying about sphericity because dummy-coding our group variables ensures that we only compare two things at once (and sphericity is only an issue when comparing three or more means). Not for the faint of heart (think about doing a multivariate test first)
- The calculations for repeated-measures ANOVA can be seen as
SSt = SSb + SSw
where SSt is the Total Variability, SSb is the Between-Participant variability and SSw is the Within-Participant Variability. The main difference is that the SSw (‘Within-Participant Sum of Squares’) is now broken down into:
- SSw looks at the variation within an actual person, not within a group of people; some of this variation is explained by the effect of the experiment and some of it is just random fluctuation
- SSm (Model Sum of Squares) is the effect of the experiment
- SSr (Residual Sum of Squares) represents the error; the variation not explained by the experiment
- An example of this: we test a group of participants (Person1, Person2, Person3, Person4, Person5, Person6), each eating multiple gross foods (NastyFoodA, NastyFoodB, NastyFoodC, NastyFoodD), and measure how long it takes them to retch. The independent variable is the type of food eaten (e.g. NastyFoodA) since that’s what we’re manipulating; the dependent variable is the time to retch (since it depends on the food). SSw represents each participant’s own tolerance level for nasty food. SSm is the effect of our manipulation (the change in food): we calculate the mean for each level of the independent variable (mean time to retch for NastyFoodA, NastyFoodB, etc.) and compare these to the overall mean of all foods. SSr is just the error. (A minimal R sketch follows at the end of this subsection.)
- To check effect size for repeated-measures designs, we calculate omega squared (w^2). This formula is slightly different than in one-way independent ANOVAs and it looks pretty scary so google it up.
- Factorial repeated-measures designs is just extending the repeated-measures designs to include a second or more independent variable. If there’s two independent measures, then it’s a two-way repeated-measures ANOVA. If there’s three independent measures, then it’s a three-way repeated-measures ANOVA, etc. Again we’ll need to be careful about interaction effects of multiple independent variables.
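A minimal R sketch of the retching example as a one-way repeated-measures ANOVA using base R's `aov()` with an `Error()` term; `food_data` in long format with columns `retch_time`, `food` and `person` is hypothetical. Note that `aov()` itself does not apply sphericity corrections; packages such as `ez` or `afex` are commonly used when Greenhouse-Geisser or Huynh-Feldt corrections are needed.

```r
# 'person' and 'food' should both be factors in the hypothetical long-format data
model <- aov(retch_time ~ food + Error(person / food), data = food_data)
summary(model)   # splits SSw into the effect of food (SSm) and the residual (SSr)
```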
Mixed Designs ANOVA (aka split-plot ANOVA, GLM 5)
Mixed Designs ANOVA is where you compare several means when there are two or more independent variables and at least one independent variable has been measured using the same participants (repeated-measures) and at least another independent variable has been measured using different participants (independent measures design).
- You can explore the data similar to a repeated-measures design ANOVA or as a multilevel model. If you’re using the ANOVA approach, check for the assumption of sphericity, then choose what you want to contrast, compute the main model (might need to run a robust version of the test), then follow up with your post hoc tests.
- At this point you’ll realize why you want to limit the number of independent variables that you include. The interpretation gets increasingly difficult with the more variables you have.
Non-parametric Tests
Non-parametric tests are statistical procedures that make fewer assumptions about the type of data, mostly on the principle of ranking the data. For example, find the lowest score (rank of 1), then next highest score (rank of 2) and so forth. We then carry out the analysis on the rank instead of the actual data. These tests include:
- Wilcoxon rank-sum test (aka Mann-Whitney test)
- Wilcoxon signed-rank test
- Kruskal-Wallis test
- Friedman’s test
- McNemar’s test (aka McNemar’s Z-test)
Wilcoxon’s rank-sum test
The Wilcoxon rank-sum test (aka WRS, Mann-Whitney test, Mann-Whitney-Wilcoxon test, Wilcoxon-Mann-Whitney test) is the non-parametric equivalent of an independent t-test (i.e. used if you want to test the differences between two conditions and different participants have been used in each condition). The theory is that you rank the data and ignore the group to which a person belongs (say we’re looking at depression levels between ecstasy and alcohol users).
- A tied rank is where multiple values are given the same rank. Say we had two scores of 6 that would’ve been ranked 3 and 4; we then take the average of the two potential ranks ((3+4)/2 = 3.5)
- We need to correct for the number of people in the group (or else larger groups would have larger rank sums), so we calculate the mean rank with this formula:
mean rank = (N(N+1))/2
where N is the number of people in the group (e.g. for a group of 10 people, the mean rank is (10 * 11)/2 = 55)
- Now for each group, calculate
W = sum of ranks - mean rank
- Monte Carlo method is an exact approach to obtaining the significance level (p-value). This means creating lots of data sets that match the sample, but instead of putting people into their correct groups, they’re put into a random group MANY times. It then checks how often a difference as large as the one in your data appears when the null hypothesis is true. This is good for small samples. Keep in mind:
- This process takes a while because it’s done MANY times; as the sample size increases, it takes longer and longer
- If you have tied ranks in the data, you can’t use this method because this is an exact approach.
- If your sample size is large (say larger than 40) or if you have tied ranks, try a normal approximation approach to calculate the significance level (p-value). This doesn’t need a normal distribution; it just assumes that the sampling distribution of the W statistic is normal, which means that the standard error can be computed that can be used to calculate a z and then a p-value.
- With a normal approximation, you have the option to do a continuity correction, which tries to smooth out the distribution (since there are tied ranks), but comes at the expense of maybe making your p-value a little too high (see the R sketch below).
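A minimal R sketch of the rank-sum test for the depression example; `drug_data` with columns `depression` and `drug` is hypothetical:

```r
# Exact p-value (only possible for small samples with no tied ranks)
wilcox.test(depression ~ drug, data = drug_data)

# Normal approximation with a continuity correction
wilcox.test(depression ~ drug, data = drug_data, exact = FALSE, correct = TRUE)
```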
#### Wilcoxon Signed-Rank Test
Wilcoxon signed-rank test is used in situations where there are two sets of scores to compare, but these scores come from the same participants. This is the non-parametric equivalent of a dependent t-test. The theory is that we’re looking at the differences between scores in the two conditions you’re comparing (e.g. see the effects of two drugs, one measured on Saturday and again on Wednesday for the same participants). The main difference is that there’s a sign (positive or negative) assigned to the rank.
- For each group, if the difference in scores are the same (from Sat to Wed) then we exclude this data from the ranking
- We make a note of the sign (positive or negative) and then rank the differences (starting with the smallest), ignoring the sign
- We deal with tied ranks in the same way as before: we average them (e.g. two scores of 6 that would’ve been ranked 3 and 4 get the average of the two potential ranks, (3+4)/2 = 3.5)
- T+ is where we collect the ranks that came from a positive difference and add them up
- T- is where we collect the ranks that came from a negative difference and add them up
- We do T+ and T- calculations for both groups (say alcohol and ecstasy)
- To calculate the significance of the test statistic T, we need:
- The mean T, which is given by
T = (n(n+1)/4)
- The standard error SEt, which is given by
SEt = sqrt( (n(n+1)(2n+1))/24 )
- With the above values, we can convert the test statistic into a z-score, which then tells us if the test is significant based on the p-value. This will tell us if there is a significant difference between depression scores on Wed and Sat for both ecstasy and alcohol.
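A minimal R sketch of the signed-rank test comparing the same participants' Saturday and Wednesday scores; the column names are hypothetical:

```r
wilcox.test(drug_data$depression_sat, drug_data$depression_wed, paired = TRUE)
```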
#### Kruskal-Wallis Test
Kruskal-Wallis test looks at the differences between several independent groups. This is the non-parametric counterpart to the one-way independent ANOVA. The theory is also based on ranked data (ordering scores from lowest to highest, ignoring the group that the score belongs to, with the lowest score getting a rank of 1 and going up).
- Once you rank the data you then collect the scores back into their groups and add up the ranks for each group. The sum of ranks for each group is denoted by Ri where i is used to denote the particular group
- Once the sum of ranks (Ri) has been calculated for each group, the test statistic H is calculated as
H = (12 / (N(N+1))) * sum(Ri^2 / ni) - 3(N+1)
where N is the total sample size and ni is the sample size of each group. H has a chi-square distribution, and for this distribution there is one value for the degrees of freedom, which is one less than the number of groups (k-1)
- You can visualize the comparisons by group really well by using a boxplot. For example, sperm count as y, number of soya meals as groups (no soya meals, 1 soya meal, 4 soya meals, 7 soya meals a week) on x
- One non-parametric post hoc procedure is to do a Wilcoxon rank-sum test on all possible comparisons; this involves taking the difference between the mean ranks of the different groups and comparing this to a value based on the value of z (corrected for the number of comparisons being done) and a constant based on the total sample size and the sample sizes of the two groups being compared. The inequality is
|Ru - Rv| >= z(alpha / (k(k-1))) * sqrt( (N(N+1)/12) * (1/nu + 1/nv) )
where Ru and Rv are the mean ranks of the two groups, nu and nv their sample sizes, N the total sample size, and k the number of groups
- Jonckheere-Terpstra test (aka Jonckheere test) is used when you think that groups are different and you want to hypothesize that there’s a trend (i.e. an ordered pattern to the medians of the groups you’re comparing).
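A minimal R sketch of the soya example; `soya_data` with columns `sperm` and the factor `soya` is hypothetical:

```r
kruskal.test(sperm ~ soya, data = soya_data)     # the omnibus test (H statistic)
boxplot(sperm ~ soya, data = soya_data)          # visualize the groups

# Post hoc: pairwise Wilcoxon rank-sum tests with a Bonferroni correction
pairwise.wilcox.test(soya_data$sperm, soya_data$soya, p.adjust.method = "bonferroni")
```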
#### Friedman’s ANOVA
Friedman’s ANOVA looks at differences between several related groups. This is used for testing differences between conditions where there are more than two conditions and the same participants have been used in all conditions (each case contributes several scores to the data). The theory is also based on ranked data.
- In your columns (say you’re measuring weight at different times), you put down the weight at different times (start, after 1 month, after 2 months). Each person is a row.
- Now go through each row and give a rank per column (e.g. if someone lost weight each month, we would have a rank of 3 for ‘start’, 2 for the first month, and 1 for the last month)
- Now add up the ranks across each column (e.g. sum rank for the ‘start’ column, sum rank for the ‘1 month’ column, etc.), this is the Ri
- Then calculate the test statistic Fr given by the equation
Fr = (12 / (Nk(k+1))) * sum(Ri^2) - 3N(k+1)
where N is the number of people and k is the number of conditions. When the number of people tested is large (bigger than 10), this test statistic has a chi-square distribution and there is one value for the degrees of freedom, which is one less than the number of groups (k-1)
- Again, you can follow up on the main analysis with post hoc tests, checking if the differences are significant.
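A minimal R sketch of Friedman's ANOVA on the weight-loss example, assuming a hypothetical wide-format data frame `diet_data` with one row per person and one column per time point:

```r
friedman.test(as.matrix(diet_data[, c("start", "month1", "month2")]))
```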
Multivariate Analysis of Variance (MANOVA)
Matrices
You should know about Matrices before learning about MANOVA.
- matrix is simply a collection of numbers arranged in columns and rows. For example, a 2*3 matrix is 2 rows, 3 columns of data.
- components (aka elements) are the values within a matrix
- square matrix is a matrix where there are an equal number of columns and rows.
- In a square matrix, the diagonal components are the values that lie on the diagonal line from the top-left element to the bottom-right element. The off-diagonal components are the other elements.
- In a square matrix, the identity matrix is the one whose diagonal elements are all equal to 1 and whose off-diagonal elements are all 0
- A row vector is a special case of a matrix where there are data for only one entity (e.g. just a row of data, like a single person’s scores on four different variables, for a 1*4 vector)
- A column vector is a special case of a matrix where there are data for only one column (e.g. just a column of data, like five participants’ scores on one variable for a 5*1 vector)
MANOVA - When to use MANOVA
MANOVA (multivariate analysis of variance) is used when you want to compare groups on several dependent variables (outcomes); when you compare against several dependent variables, this is known as a multivariate test (as opposed to ANOVA, which can only be used on one dependent variable and is known as a univariate test). We can use MANOVA when there is one independent variable or when there are several.
- Using MANOVA instead of ANOVA: The reason why we don’t do a separate ANOVA for each dependent variable is similar to why we don’t do multiple t-tests (the more tests we conduct on the same data, the more we inflate the familywise error rate, and thus the greater the chance of a Type I error). Also, if we do a bunch of ANOVAs, the relationship between dependent variables is ignored, whereas MANOVA includes all dependent variables in the same analysis and can account for the relationship between them. ANOVA can only tell us if groups differ along a single dimension, whereas MANOVA has the power to detect whether groups differ along a combination of dimensions. For example, we can’t tell happiness by the individual factors of work, social, sexual, and self-esteem, but we can by the combination of factors (i.e. MANOVA has greater power to detect an effect)
- When to use MANOVA - Since MANOVA is used to compare several dependent variables, it makes sense that it depends a lot on the correlation between dependent variables.
- Use MANOVA with highly negatively correlated dependent variables (best scenario); MANOVA works acceptably well with moderately correlated dependent variables in either direction
- Don’t use MANOVA if there is no correlation because it is useless. Also, as the correlation between dependent variables increases, the power of MANOVA decreases.
- Note: The power of MANOVA depends on a combination of the correlation between dependent variables, the effect size to be detected, and the pattern of group differences (e.g. if measures are somewhat different, if group differences are in the same direction for each measure)
MANOVA - Variances and Covariances
We’re interested in the ratio of the effect of the systematic variance to the unsystematic variance. In ANOVA, these variances were single values. In MANOVA, each of these is a matrix containing many variances and covariances.
- hypothesis SSCP (aka H, hypothesis sum of squares and cross-products matrix, model SSCP Matrix) is the matrix that represents the systematic variance (similar to the model sum of squares for all variables (SSm))
- error SSCP (aka E, error sum of squares and cross-products matrix, residual SSCP Matrix) is the matrix that represents the unsystematic variance (the residual sums of squares for all variables (SSr))
- total SSCP (aka T, total sum of squares and cross-products matrix) is the matrix that represents the total amount of variance present for each dependent variable (the total sums of squares for each dependent variable (SSt))
- Note: The above are all sum of squares and cross-products (SSCP) matrices. The sum of squares of a variable is the total squared difference between the observed values and the mean value. The cross-product is the total combined error between two variables
- MANOVA’s test statistic isn’t as easy to calculate as ANOVA’s F-ratio. For ANOVA’s test statistic, we just divide the systematic variance (SSm) by the unsystematic variance (SSr). You can’t just divide matrices so we can’t do this for MANOVA. Instead, the test statistic, which for MANOVA is called HE^-1, is calculated from multiplying the model SSCP with the inverse of the residual SSCP. Unlike ANOVA’s test statistic, we don’t get a single value and instead get a matrix with several values (with p^2 values, where p is the number of dependent variables)
MANOVA - Predicting
In ANOVA, we try to predict an outcome (dependent variable) using a combination of predictor variables (independent variables). In essence, what we’re doing with MANOVA is predicting an independent variable from a set of dependent variables.
- variates (aka latent variables, factors) are linear combinations of the dependent variables
- discriminant function variates (aka discriminant functions) is when we use linear variates to predict which group a person belongs to (i.e. we’re using them to discriminate groups of people). For example, which group of treatment did they receive (CBT, BT, no treatment)
- eigenvector (aka characteristic vector) are the vectors associated with a given matrix that are unchanged by transformation of that matrix to a diagonal matrix
- A vector is just a set of numbers that tells us the location of a line in geometric space; it has both magnitude (how long it is) and direction. For example, vector a=(8, 13).
- A diagonal matrix is simply a matrix in which the off-diagonal elements are zero, and by changing HE^-1 to a diagonal matrix we eliminate all of the off-diagonal elements (thus reducing the number of values that we must consider for significance testing)
- Example: say that supermodels make more salary (y value) as their attractiveness (x value) increases. Let’s say the two values are correlated and, when scatterplotted, form an ellipse. If we draw two lines to measure the length and height of the ellipse, then these lines are the eigenvectors of the original correlation matrix for these two variables. The two lines (one for the height, one for the width of the oval) are perpendicular (90 degrees, which means they’re independent of one another). If we add additional variables (e.g. experience of the supermodel), then the scatterplot gets additional dimension(s).
- Interpreting eigenvectors and eigenvalues - If the shape is a circle (or a sphere in 3D, etc.), the two eigenvalues are roughly equal and there is no correlation. If the ellipse flattens into what looks like a diagonal straight line (one eigenvalue is almost 0), then there is high correlation.
- eigenvalues (aka characteristic roots) are the lengths of the eigenvectors (the distance from one end of the eigenvector to the other). By looking at all the eigenvalues, we can tell the dimensions of the data (which tells us how evenly or unevenly the variances of the matrix are distributed).
MANOVA - Eigenvalues
Eigenvalues are the equivalent to the F-ratio in an ANOVA. Now we need to compare how large these values are compared to what we would expect by chance alone. We can do that a few ways.
- Pillai-Bartlett trace (aka V, Pillai’s trace) is the sum of the proportion of explained variance on the discriminant functions; it’s similar to the ratio of SSm/SSt, which is known as R^2. The formula is
V = sum( lambda_i / (1 + lambda_i) )
where lambda_i are the eigenvalues of each variate. This should usually be your default test statistic.
- Hotelling-Lawley trace (aka Hotelling’s T^2) is the sum of the eigenvalues for each variate. This test statistic is the sum of SSm/SSt for each of the variates and so it compares directly to the F-ratio in ANOVA.
- Wilk’s lambda is the product of the unexplained variance on each of the variates. This test statistic represents the ratio of error variance to total variance (SSr/SSt) for each variate. Note that large eigenvalues (which represent a large experimental effect) lead to small values of Wilk’s lambda, which means statistical significance is found when Wilk’s lambda is small.
- Roy’s largest root is the eigenvalue of the first variate. It’s basically the same as the Hotelling-Lawley trace, but only for the first variate. This test statistic represents the proportion of explained variance to unexplained variance (SSm/SSr) for the first discriminant function. Roy’s root represents the maximum possible between-group difference and is in many cases the most powerful.
- Choosing a method
- If the sample sizes are small or moderate, the four approaches differ little in terms of power.
- If group differences are concentrated on the first variate, then Roy’s statistic should be the most powerful (since it takes account of only the first variate), followed by Hotelling’s trace, Wilk’s lambda, and Pillai’s trace. If groups differ along more than one variate, the power order is reversed.
- It’s recommended to use fewer than 10 dependent variables unless sample sizes are large.
- When sample sizes are unequal, use the Pillai-Bartlett trace since it’s the most robust to violations of assumptions. Make sure to check the homogeneity of covariance matrices; if they seem homogeneous and if the assumption of multivariate normality is tenable, then the Pillai-Bartlett trace is assumed to be accurate.
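A minimal R sketch of a MANOVA with two hypothetical dependent variables (`actions`, `thoughts`) and a grouping variable `group` in a data frame `ocd_data`, showing how to request each of the four test statistics:

```r
model <- manova(cbind(actions, thoughts) ~ group, data = ocd_data)

summary(model, test = "Pillai")             # Pillai-Bartlett trace (the usual default)
summary(model, test = "Wilks")              # Wilks' lambda
summary(model, test = "Hotelling-Lawley")   # Hotelling-Lawley trace
summary(model, test = "Roy")                # Roy's largest root
```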
MANOVA - Assumptions
MANOVA Assumptions are similar to ANOVA assumptions, but extends them to multivariate cases instead of univariate cases:
- Independence: observations are statistically independent. This can be checked the same as with univariate cases (like ANOVA)
- Random Sampling: data should be randomly sampled from the population of interest and measured at an interval level. This can be checked the same as with univariate cases (like ANOVA)
- Multivariate normality: In ANOVA we assumed that our dependent variable is normally distributed within each group. In MANOVA we assume that the dependent variables (collectively) have multivariate normality within groups. This can be checked in a similar way to the univariate case (e.g. with the Shapiro test we used before)
- Homogeneity of covariance matrices: In ANOVA we assume that the variances in each group are roughly equal (i.e. homogeneity of variance). In MANOVA, we must assume that this is true for each dependent variable, but also that the correlation between any two dependent variables is the same in all groups. We can check this assumption by testing whether the population variance-covariance matrices of the different groups in the analysis are equal. We can test the assumption of equality of covariance matrices using Box’s test
- Box’s test is used to test the assumption of equality of covariance matrices; it’ll be non-significant if the matrices are the same. If sample sizes are equal, ignore Box’s test because it’s unstable. If group sizes are different, then the robustness of the MANOVA cannot be assumed. The more dependent variables you have measured and the greater the differences in sample sizes, the more distorted the probability values become.
MANOVA - Which group caused the effect
You can figure out which group caused the effect a couple different ways. After a MANOVA, you can either do a discriminant function analysis or do a different univariate ANOVA for each dependent variable.
- ANOVA - One way to figure out which group caused the effect (not recommended) is to do a univariate ANOVA for each dependent variable, followed up with contrasts. The idea behind this follow-up ANOVA approach is that the ANOVAs hide behind the MANOVA (so they don’t run into the same errors as multiple standalone ANOVAs), but it is dangerous because a significant MANOVA may reflect a significant difference for only one of the dependent variables, not all of them. If you do this approach, then Bonferroni corrections should be applied to the level at which you accept significance.
- discriminant function analysis (aka DFA, discriminant analysis) finds the linear combination(s) of the dependent variable that best separates (or discriminates) the groups. The main advantage for this is that it reduces and explains the dependent variables in terms of a set of underlying dimensions thought to reflect substantive theoretical dimensions.
- It might be useful to look at the discriminant scores, which are scores for each participant on each variate. These scores are useful because the variates that the analysis identifies may represent underlying constructs, which, once identified, make interpretation easier: you know what a participant scores on each underlying dimension.
- To find out which groups are discriminated by a variate, plot the discriminant scores. Split the vertical and horizontal axes at the midpoint and look at which groups tend to fall on either side of the line. The variate plotted on a given axis is discriminating between groups that fall on different sides of the line (i.e. the midpoint)
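Here is a minimal sketch of getting and plotting discriminant scores with scikit-learn’s LinearDiscriminantAnalysis; the data, group labels, and group means are invented for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical data: X holds the dependent variables, y the group labels
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, scale=1.0, size=(30, 2))
               for m in ([0, 0], [2, 1], [1, 3])])
y = np.repeat(["A", "B", "C"], 30)

lda = LinearDiscriminantAnalysis()
scores = lda.fit(X, y).transform(X)   # discriminant scores, one column per variate

# Plot the scores and split each axis at its midpoint to see which groups
# each variate discriminates between
for g in np.unique(y):
    plt.scatter(scores[y == g, 0], scores[y == g, 1], label=g)
plt.axhline(0)
plt.axvline(0)
plt.xlabel("Variate 1")
plt.ylabel("Variate 2")
plt.legend()
plt.show()
```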
MANOVA - Robust Methods
Two robust methods for MANOVA are Munzel and Brunner’s method (aka the Munzel-Brunner rank order test) and Choi and Marden’s robust test (aka the Choi-Marden multivariate rank test, an extension of the Kruskal-Wallis one-way analysis of variance by ranks).
Factor Analysis
So the idea is that factor analysis identifies clusters of variables that relate to each other. We check that our variables aren’t related to each other too much or too little using a correlation matrix. We then check whether there are any issues (e.g. is the sample size big enough), then decide how many factors we want to keep (factor extraction), then finally decide which variables go with which factors (factor loading). Finally, we consider whether the items we have are reliable measures of what we’re trying to measure.
- factors (aka latent variables) are things that cannot directly be measured. For example, we can’t directly measure ‘burnout’ (someone who has been working very hard on a project for a prolonged time suddenly finds themselves out of motivation). However, we can measure different aspects of burnout (get an idea of motivation, stress levels, whether the person has any new ideas)
- factor analysis is a technique for identifying groups or clusters of variables. It has three main uses:
- to understand the structure of a set of variables (e.g. try to find the latent variable for ‘intelligence’ or ‘burnout’)
- to construct a questionnaire to measure an underlying variable (e.g. create questionnaire to measure ‘burnout’)
- to reduce a dataset to a more manageable size while retaining as much of the original information as possible. We look for variables that correlate highly within a group of other variables, but do not correlate with variables outside of that group.
Examples: A common example of factor analysis is trying to assess personality traits (say extroversion-introversion). Another example might be looking at measures of profit, productivity, and workforce and seeing whether these can be reduced down to an underlying dimension of company growth.
Sample Size: This technique requires a large sample size; a rule of thumb is a bare minimum of 10 observations per variable. Rough guidelines for total sample size:
- 50 = Very Poor
- 100 = Poor
- 200 = Fair
- 300 = Good
- 500 = Very Good
- 1000+ = Excellent
- You can also check whether your sample is adequate using the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy, which can be calculated for individual variables and for all variables together and represents the ratio of the squared correlation between variables to the squared partial correlation between variables. Basically, you get back a statistic between 0 (factor analysis is unlikely to be beneficial) and 1 (factor analysis should be reliable). This value should be greater than .5 as a bare minimum (or else collect more data)
- correlation-matrix (aka R-matrix) is a table of correlation coefficients between each pair of variables (with the off-diagonal elements as the correlation coefficients). Clusters of large correlation coefficients between subsets of variables suggest that these are factors
- If correlation between variables is about .3 or lower, then they’re probably not the same underlying dimension so we can exclude them
- Bartlett’s test examines whether the population correlation matrix resembles an identity matrix (i.e. the off-diagonal elements are zero). If Bartlett’s test is significant then the correlations between variables are, overall, significantly different from zero (which is good news). A non-significant Bartlett’s test is cause for concern because we’re looking for clusters of variables that measure similar things, and if no variables correlate then there are no clusters to find.
- singularity is when variables are perfectly correlated; very high correlations (say above .8 or .9) cause multicollinearity. Both are a problem because it becomes impossible to determine the unique contribution to a factor of variables that are highly correlated. Consider eliminating one of each pair of highly correlated variables
- Remember that the total variance for a particular variable is made up of common variance (variance shared with other variables) and unique variance (which itself is made up of variance specific to that measure and error, aka random variance). When we’re looking at variance within a correlation matrix, we’re interested in finding the common underlying dimensions within the data, so we’re primarily interested in only the common variance (aka communality). If there is no random variance, then the communality is 1. A variable that shares none of its variance with other variables has a communality of 0. The issue is that in order to do the factor analysis we need to know the common variance, but we can only work out the common variance by doing the factor analysis. We can solve this circular problem by:
- Assuming the common variance of every variable is 1 and doing principal components analysis
- Estimating the common variance by estimating a communality value for each variable (e.g. using the squared multiple correlation of each variable with all the others, as in principal axis factoring)
- principal components analysis (aka PCA) is another technique for identifying groups or clusters of variables and is similar to factor analysis; the difference is that PCA assumes the communality of each variable is 1 and works directly on the original correlation matrix, whereas factor analysis incorporates more assumptions about the underlying structure and solves for the eigenvectors of a slightly adjusted matrix. Usually PCA and factor analysis differ very little.
- If doing PCA, the factor matrix is known as the component matrix
- PCA works in a similar way to MANOVA. We use a matrix to represent the relationships between variables. The linear components (aka variates, factors) of that matrix are then calculated by determining the eigenvalues of the matrix. The eigenvalues are used to calculate eigenvectors, the elements of which provide the loading of a particular variable on a particular factor (i.e. the b values). Eigenvalues are also used to measure the importance of the eigenvector.
- factor loading is the relative contribution that a variable makes to a factor. Graphically, if the factors are the axes of a graph (with each axis representing a correlation between -1 and 1) and we plot the variables along these axes, then the factor loading is the coordinate of a variable on each axis. For example, say a list of traits is reduced to two factors: ‘Consideration’ is the y axis and ‘Sociability’ is the x axis. We look at a variable (say ‘Social Skills’) and see how it correlates (Pearson correlation) with these two factors (‘Sociability’, ‘Consideration’); we get two correlations and plot this point, which is the factor loading.
- To get the substantive importance of a particular variable to a factor, square its factor loading (b).
- factor matrix (denoted A) is the matrix of factor loadings. The columns represent each factor (say ‘Sociability’, ‘Consideration’) and the rows represent the loadings of each variable (e.g. Social Skills, Interest, Selfish, Liar, etc.). A small numeric sketch of going from a correlation matrix to a loadings matrix follows this list.
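As a rough illustration of the R-matrix, eigenvalues, eigenvectors, and the loadings matrix A, here is a minimal numpy/pandas sketch on made-up data (the variable names and values are invented; this mirrors the PCA route where communalities are assumed to be 1):

```python
import numpy as np
import pandas as pd

# Hypothetical questionnaire data: rows = participants, columns = items
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(300, 6)),
                  columns=["social_skills", "interest", "talk_1",
                           "talk_2", "selfish", "liar"])

R = df.corr()                                               # the R-matrix
eigenvalues, eigenvectors = np.linalg.eigh(R.to_numpy())    # linear components of R

# Sort from largest to smallest eigenvalue (most to least important component)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Loadings (the factor/component matrix A): eigenvectors scaled by sqrt(eigenvalue),
# i.e. the correlation of each variable with each component
A = eigenvectors * np.sqrt(eigenvalues)
print(pd.DataFrame(A, index=df.columns))
```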
Factor Analysis - Calculating Factor Scores
A factor score is an estimate of a participant’s score on a factor, based on their scores on the constituent variables. There are a few different methods, including:
- weighted averages - multiply a participant’s score on each variable by that variable’s factor loading and sum them up for each factor. This method is overly simplistic and not really used, partly because it requires all the variables to use the same measurement scale.
- regression method - We calculate factor score coefficients as weights rather than using the raw factor loadings. The factor loadings are adjusted to take account of the initial correlations between variables, so differences in units of measurement and variable variances are stabilized. To get the matrix of factor score coefficients (B), we multiply the inverse of the correlation matrix (R^-1, because we can’t divide by a matrix) by the factor loading matrix (A). The resulting matrix is a measure of the unique relationship between variables and factors. The equation is:
B = R^-1*A
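A minimal numpy sketch of the regression method, with an invented loadings matrix A and made-up questionnaire data:

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows = participants, columns = questionnaire items
rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(300, 4)),
                  columns=["q1", "q2", "q3", "q4"])

R = df.corr().to_numpy()               # correlation matrix
# Suppose A is the matrix of factor loadings from the extraction step
# (here a made-up 4 variables x 2 factors example)
A = np.array([[0.8, 0.1],
              [0.7, 0.2],
              [0.1, 0.9],
              [0.2, 0.6]])

B = np.linalg.inv(R) @ A               # factor score coefficients: B = R^-1 * A
Z = (df - df.mean()) / df.std()        # standardize the raw scores
factor_scores = Z.to_numpy() @ B       # one score per participant per factor
print(factor_scores[:5])
```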
What to do with Factor Scores
- If the goal is to reduce the size of the dataset, we can now use the factor scores as the smaller subset of measurement variables (e.g. here’s an individual’s scores on this subset of measures). Any further analysis can be done on the factor scores.
- We can even use factor scores to get around collinearity (aka multicollinearity). For example, say we did a multiple regression analysis and identified sources of multicollinearity. To fix this, we can do a PCA on the predictor variables to reduce them to a smaller set of uncorrelated factors (the correlated predictors are combined into single factors). We can then rerun the analysis with the factor scores as predictor variables (see the sketch below).
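A minimal scikit-learn sketch of this idea, with invented predictors where two of them are nearly collinear:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical predictors, where x2 and x3 are nearly collinear
rng = np.random.default_rng(3)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x2 + rng.normal(scale=0.05, size=200)   # almost a copy of x2
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

Z = StandardScaler().fit_transform(X)        # PCA on standardized predictors
pca = PCA(n_components=2)                    # keep fewer, uncorrelated components
scores = pca.fit_transform(Z)

# The component scores are uncorrelated, so they can replace the original
# collinear predictors in a follow-up regression
print(np.corrcoef(scores, rowvar=False).round(3))
print(pca.explained_variance_ratio_.round(3))
```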
Factor Analysis - Factor Extraction
Now that you have your factor scores, you need to choose a technique for exploring the factors in your data; the choice of technique depends on what you want to do:
- confirmatory factor analysis - used if you have a specific hypothesis to test
- If you want to explore your data, then you can apply your findings using either the descriptive method or the inferential method.
- descriptive method is to apply our findings to just the sample collected instead of extrapolating it beyond the sample. Recommended techniques for the descriptive method are Principal Components Analysis and Principal Factors Analysis (Principal Axis Factoring)
- inferential methods generalize our findings to a population. Recommended techniques for the inferential methods include the maximum-likelihood method and Kaiser’s alpha factoring
- factor extraction is deciding which factors to keep; we don’t want to keep all our factors and are interested in keeping only the ones with a relatively large eigenvalue.
- scree plot is a plot of each eigenvalue (y axis) against the factor that it’s associated with (x axis). The reason it’s called this is because it usually looks like a rock face with a pile of debris/scree at the bottom. By plotting the eigenvalues, the relative importance of each is apparent (usually there’s a cut-off point called the point of inflexion where there’s a very steep slope). For example, if the point of inflexion happens on the third point, only select the first two points. Use when sample size is greater than 200.
- Kaiser’s criterion is usually the worst way to determine how many factors to keep; it says to keep any eigenvalue that is 1 or more, which usually overestimates the number of factors to retain.
- Jolliffe’s criterion is a way to determine how many factors to keep; it says to keep any eigenvalue that is .7 or more.
- parallel analysis is usually the best way to determine how many factors to retain (better than Kaiser’s criterion); parallel analysis takes each eigenvalue (which represents the size of the factor) and compares it against the eigenvalue for the corresponding factor in many randomly generated data sets that have the same characteristics as the data being analysed. A minimal parallel-analysis sketch follows this list.
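Here is a self-contained numpy sketch of parallel analysis; the function name, the 95th-percentile cutoff, and the number of simulated data sets are choices made for illustration, not a fixed standard.

```python
import numpy as np

def parallel_analysis(data, n_iter=100, quantile=95, seed=0):
    """Compare observed eigenvalues against eigenvalues from random data of the
    same shape; keep factors whose eigenvalue beats the random benchmark."""
    rng = np.random.default_rng(seed)
    n, p = data.shape
    observed = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]
    random_eigs = np.empty((n_iter, p))
    for i in range(n_iter):
        fake = rng.normal(size=(n, p))
        random_eigs[i] = np.sort(np.linalg.eigvalsh(np.corrcoef(fake, rowvar=False)))[::-1]
    threshold = np.percentile(random_eigs, quantile, axis=0)
    n_keep = int(np.sum(observed > threshold))
    return observed, threshold, n_keep

# Hypothetical usage on made-up (participants x items) data
data = np.random.default_rng(1).normal(size=(300, 8))
observed, threshold, n_keep = parallel_analysis(data)
print(n_keep)   # how many factors to retain
```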
Factor Analysis - Interpretation
factor rotation is a technique for making extracted factors easier to interpret by discriminating between factors (ideally each variable loads strongly on one factor and weakly on the others); it is used after factors have been extracted and comes in two forms:
- orthogonal rotation is used when any underlying factors are assumed to be independent and the factor loading is the correlation between the factor and the variable, but is also the regression coefficient (i.e. values of the correlation coefficient = values of the regression coefficient)
- oblique rotation is used when the underlying factors are assumed to be related or correlated to each other. The resulting correlations between variables and factors will differ from the corresponding regression coefficients. This results in two different sets of factor loadings: the factor structure matrix, which contains the correlation coefficients between each variable and each factor, and the factor pattern matrix, which contains the regression coefficients for each variable on each factor. The math behind this is difficult and requires a factor transformation matrix, a square matrix whose size depends on how many factors were extracted (e.g. 2 factors = 2*2 matrix). A hedged rotation example follows.
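For rotation in Python, the third-party factor_analyzer package is one option. The sketch below assumes its commonly documented interface (a FactorAnalyzer class with a rotation argument and a loadings_ attribute), and the data are random placeholders, so check the package docs for your installed version.

```python
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer   # pip install factor-analyzer

# Hypothetical questionnaire data (participants x items)
rng = np.random.default_rng(4)
df = pd.DataFrame(rng.normal(size=(300, 6)),
                  columns=[f"item_{i}" for i in range(1, 7)])

fa_orth = FactorAnalyzer(n_factors=2, rotation="varimax")  # orthogonal rotation
fa_orth.fit(df)
print(pd.DataFrame(fa_orth.loadings_, index=df.columns))

fa_obl = FactorAnalyzer(n_factors=2, rotation="oblimin")   # oblique rotation
fa_obl.fit(df)
print(pd.DataFrame(fa_obl.loadings_, index=df.columns))
```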
Comparing Categorical Variables
For continuous variables we summarize using means, but this is useless for categorical variables (the mean would depend on whatever arbitrary numbers we assigned to the categories). Instead, for categorical variables, we count the frequency of each category to get a contingency table, which is a tabulation of the frequencies. We use different tests depending on how many categorical variables there are (two, or more than two).
- An example with two categorical variables: training (whether the cat was trained using food or affection, but not both) and dance (whether the cat ended up being able to dance)
- Pearson’s chi-square statistic (aka chi-square test) is used to test the relationship between two categorical variables. This is given by the equation `chi^2 = sum over i,j of (observed_ij - model_ij)^2 / model_ij`, where `model_ij = (row total_i * column total_j) / n` is the expected frequency for each cell; this is a variation on `deviation = sum of (observed - model)^2`. Note that this is only an approximation (which works really well for large samples), but if the sample size is too small, use Fisher’s exact test, the likelihood ratio, or Yates’s continuity correction to avoid making Type I errors.
- Fisher’s exact test is a way to compute the exact probability of the chi-square statistic, useful when sample sizes are too small. It’s usually used on 2 * 2 contingency tables (two categorical variables, each with two options) and small samples, but it can be used on larger tables at the cost of very intensive calculations
- likelihood ratio statistic is an alternative to Pearson’s chi-square and is based on maximum-likelihood theory. The idea is that you collect some data and create a model for which the probability of obtaining the observed set of data is maximized, then you compare this model to the probability of obtaining those data under the null hypothesis. You then compare observed frequencies with those predicted by the model. This is roughly the same as Pearson’s chi-square, but preferred when samples are small. The equation is: `L chi^2 = 2 * sum over i,j of observed_ij * ln(observed_ij / model_ij)`
- Yates’s continuity correction is used specifically for 2 * 2 contingency tables because in these cases Pearson’s chi-square tends to make Type I errors. Be careful though: it sometimes overcorrects and produces a chi-square value that is too small. The equation is: `chi^2 = sum over i,j of (|observed_ij - model_ij| - 0.5)^2 / model_ij`
- Assumptions of chi-square test are:
- Independence of data - this means that each person, item, or entity contributes to only one cell of the contingency table; you can’t use this on a repeated-measures design (e.g. train cats with food to dance, then train the same cats with affection to see if they would dance)
- Expected frequencies should be greater than 5; in larger contingency tables it’s okay to have up to 20% of expected frequencies below 5, but the result is a loss of statistical power (which means the test might fail to detect a genuine effect). If expected frequencies fall below 5, collect more data for those particular categories.
- Interpreting the chi-square test: if the p-value is less than .05, then there is a significant relationship between your two variables (a minimal scipy sketch follows this list)
- odds ratio is a way to calculate effect size for categorical data. Odds ratios are good for 2 * 2 contingency tables and probably not useful for anything larger. Example: odds of a cat dancing after food = (number that had food and danced / number that had food but didn’t dance), say 28/10 = 2.8. Repeat for the odds of dancing after affection (let’s say that’s .421). Then the odds ratio = (odds of dancing after food / odds of dancing after affection) = (2.8/.421) = 6.65. This means an animal trained with food has 6.65 times higher odds of dancing than one trained with affection.
- If we’re interested in finding out which particular factor contributed significantly in larger contingency tables, we can use the standardized residuals.
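A minimal scipy sketch of the chi-square test, Fisher’s exact test, and the odds ratio on an invented 2 * 2 table (the cell counts are just for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical 2 x 2 contingency table for the cat example:
# rows = training (food, affection), columns = (danced, did not dance)
table = np.array([[28, 10],
                  [48, 114]])

# Pearson's chi-square; by default scipy applies Yates's correction to 2 x 2 tables
chi2, p, dof, expected = stats.chi2_contingency(table)
print(chi2, p, expected)          # check the expected frequencies are all > 5

odds_ratio, p_exact = stats.fisher_exact(table)   # exact test for small samples
print(odds_ratio, p_exact)

# Odds ratio by hand: odds of dancing after food vs after affection
odds_food = 28 / 10
odds_affection = 48 / 114
print(odds_food / odds_affection)                 # about 6.65
```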
Comparing Multiple Categorical Variables using Loglinear Analysis
loglinear analysis is used when we want to look at more than two categorical variables. Think of this as the ANOVA for categorical variables (for every variable we have, we get a main effect, and we also get interactions between variables). For example, say we have three variables: Animal (dog or cat), Training (food as reward or affection as reward), and Dance (did they dance or not?). Again, this can also be seen as a regression model. Without getting into the math, categorical data can be expressed in the form of a linear model provided we use log values (hence loglinear analysis). The idea is that we try to fit a simpler model without any substantial loss of predictive power through backward elimination (removing one term at a time, hierarchically). A hedged Poisson-GLM sketch appears at the end of this section.
- For example, with the three variables Animal (dog or cat), Training (food or affection), and Dance (can they dance or not), the model contains the following effects:
- Three main effects (Animal, Training, Dance)
- Three interactions involving two variables each (Animal * Training, Animal * Dance, Training * Dance)
- One interaction involving all three variables (Animal * Training * Dance)
- Backward elimination starts with the highest-order interaction (the one involving all three variables) and removes it. If removing it significantly changes the likelihood ratio statistic, we stop there and say that we have a significant three-way interaction. If it doesn’t change the likelihood ratio statistic, we move on to the lower-order interactions.
- Assumptions in loglinear analysis include:
- Loglinear analysis is an extension of the chi-square test so it has similar assumptions
- Each cell must be independent
- Expected frequencies should be large enough for a reliable analysis. With more than two variables, it’s okay to have up to 20% of cells with expected frequencies less than 5, but all cells must have expected frequencies greater than 1. If this assumption is broken, try collapsing the data across one of the variables
- Reporting results of a loglinear analysis - Example: The three-way loglinear analysis produced a final model that retained all effects. The likelihood ratio of this model was this and p was that. This indicated that the highest order interaction (Animal * Training * Dance) was significant. To break down this effect, separate chi-square tests on the Training and Dance variables were performed separately for dogs and cats.
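As mentioned above, a loglinear model can be fit as a Poisson GLM on the cell counts. Here is a hedged statsmodels sketch with invented counts for the Animal * Training * Dance example; dropping the three-way term and comparing deviances mimics one backward-elimination step.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats

# Hypothetical cell counts for Animal x Training x Dance
counts = pd.DataFrame({
    "animal":   ["cat"] * 4 + ["dog"] * 4,
    "training": ["food", "food", "affection", "affection"] * 2,
    "dance":    ["yes", "no"] * 4,
    "n":        [28, 10, 48, 114, 20, 29, 29, 10],
})

# The saturated loglinear model: all main effects and interactions up to the 3-way term
saturated = smf.glm("n ~ animal * training * dance", data=counts,
                    family=sm.families.Poisson()).fit()
# Reduced model: main effects plus all two-way interactions only
reduced = smf.glm("n ~ (animal + training + dance)**2", data=counts,
                  family=sm.families.Poisson()).fit()

# Likelihood ratio test for the three-way interaction
lr = reduced.deviance - saturated.deviance
print(lr, stats.chi2.sf(lr, df=1))
```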
Multilevel Linear Models
Sometimes data is hierarchical rather than sitting at a single level. This means that some variables are clustered or nested within other variables (i.e. some of our earlier analyses may be oversimplifications). Let’s use this example data set:
- Level 1 variable - Say we have a lot of children (this is the lowest level of the hierarchy)
- Level 2 variable - Say these children are organized into classrooms (this means children are nested within classes)
- Note: We can have additional levels (e.g. a level 3 variable could be the school that the classroom belongs to, and level 4 the school district that the school is in)
- The idea is that children within the same cluster (say the same classroom or the same school) are more similar to each other. This means that each case is not entirely independent. To quantify this, we use the intraclass correlation (ICC), which represents the proportion of the total variability in the outcome that is attributable to the child’s classroom. If the classroom has a big effect on the children, the ICC will be large; if the classroom has little effect on the children, the ICC will be small. Simply put, the ICC tells us how much effect the hierarchical grouping has on the outcome. A hedged statsmodels sketch appears at the end of this section.
- So why use a multilevel linear model? The benefits include:
- We don’t have to assume that the relationship between our covariate and our outcome is the same across the different groups that make up our predictor variable.
- We don’t need to make the assumption of independence.
- It’s okay to have missing data; you don’t have to correct for or impute missing data.
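A minimal statsmodels sketch of a random-intercept multilevel model for children nested within classrooms; the data, variable names, and effect sizes are simulated for illustration, and the ICC shown is the variance-partition form (classroom variance over total variance).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: test scores for children nested within classrooms
rng = np.random.default_rng(5)
n_classes, n_children = 20, 15
classroom = np.repeat(np.arange(n_classes), n_children)
study_time = rng.uniform(0, 10, size=n_classes * n_children)
class_effect = rng.normal(scale=2.0, size=n_classes)[classroom]   # classroom-level variation
score = 50 + 3 * study_time + class_effect + rng.normal(scale=5.0, size=n_classes * n_children)
df = pd.DataFrame({"score": score, "study_time": study_time, "classroom": classroom})

# Random-intercept model: children (level 1) nested within classrooms (level 2)
model = smf.mixedlm("score ~ study_time", data=df, groups=df["classroom"]).fit()
print(model.summary())

# ICC: proportion of total variance attributable to classrooms
icc = model.cov_re.iloc[0, 0] / (model.cov_re.iloc[0, 0] + model.scale)
print(icc)
```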