This is the final part of what I've learned from I Heart Stats: Learning to Love Statistics about Inferential Statistics.
Previous post: http://relguzman.blogspot.pe/2015/07/reviewing-inferential-statistics-part-1.html
Statistical Test - ANOVA (Analysis of Variance)
To use this test, the independent variable has to be "Nominal/Categorical" or "Ordinal" and the dependent variable has to be "Interval" or "Ratio". Consider an experiment where we measure the IQ (a Ratio variable) of a bunch of people categorized into three groups (B, W, H) according to their race (a Categorical variable).
We want to decide whether there is a relation between IQ and race, that is, whether IQ varies significantly by race, so our hypotheses will be:
- $H_0$: There is no significant difference in the IQ by race.
- $H_1$: There is a significant difference in the IQ by race.
Each group will have a mean value $\overline{x}_k$, and there will be two kinds of variation:
- Variation within each group (Sum of Squares within, $SS_w$)
- Variation between groups (Sum of Squares between, $SS_b$)
$$\text{Total Variation = Variation Within + Variation Between}$$
This Total Variation is just the total sum of squares (with $\overline{x}_o$ the mean of the whole dataset):
$$\text{Total Variation} = \sum_{i=1}^N (x_i - \overline{x}_o)^2$$
To test the hypothesis, we compare both variations through a value called $F$.
$$F = \frac{\text{Variation between groups}}{\text{Variation within groups}}$$
Which can be computed like this:
$$ F = \frac{\sum_{\text{groups}} N_k (\overline{x}_k - \overline{x}_o)^2 \,/\, (k - 1)}{\sum_{\text{groups}} \sum_{i = 1}^{N_k} (x_{ik} - \overline{x}_k)^2 \,/\, (N - k)} $$
Where
$$ N = \text{# of cases in the dataset} $$
$$ N_k = \text{# of cases that belong to group } k $$
$$ \overline{x}_k = \text{mean of group } k $$
$$ \overline{x}_o = \text{mean of the whole dataset} $$
$$ x_{ik} = \text{the } i\text{-th case in group } k $$
$$ k = \text{# of groups} $$
Or like this:
$$ F = \frac{MS_b}{MS_w} = \frac{SS_b / df_b}{SS_w / df_w} = \frac{SS_b / (k - 1)}{SS_w / (N - k)} $$
Where
$$ MS = \text{Mean Square} $$
$$ SS = \text{Sum of Squares} $$
$$ df = \text{Degrees of Freedom} $$
$$ k = \text{# of groups} $$
This value tells us whether the variation between groups dominates the variation within groups:
- If $F > 1$, then there may be a relationship.
- If $F < 1$, then there's no relationship.

To decide whether the difference is actually significant, we compare $F$ with a critical value from the F-distribution with $(k - 1, N - k)$ degrees of freedom.
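As a rough illustration, here is a minimal Python sketch of the same computation, using made-up IQ scores for three hypothetical groups (the numbers are invented for this example), checking the hand-computed $F$ against `scipy.stats.f_oneway`:

```python
import numpy as np
from scipy import stats

# Made-up IQ scores for three hypothetical groups (for illustration only)
groups = {
    "B": np.array([98, 102, 95, 101, 99]),
    "W": np.array([100, 97, 103, 96, 104]),
    "H": np.array([105, 99, 101, 100, 103]),
}

all_scores = np.concatenate(list(groups.values()))
grand_mean = all_scores.mean()   # mean of the whole dataset (x_o)
N = all_scores.size              # total # of cases
k = len(groups)                  # # of groups

# SS_b: variation between groups, weighted by group size
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups.values())
# SS_w: variation within each group around its own mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups.values())

F = (ss_between / (k - 1)) / (ss_within / (N - k))
print("F by hand:", F)
print("F from scipy:", stats.f_oneway(*groups.values()).statistic)
```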
Statistical Test - Regression
In order to use Regression, we need "Interval" or "Ratio" variables. With regression, we want to know the impact of an independent variable ($X$) on a dependent variable ($Y$), and also how much the dependent variable will increase for a given increase in the independent variable.
We will have the following hypotheses:
- $H_0$: $X$ does not have an impact on $Y$ (or is not a significant predictor of $Y$).
- $H_1$: $X$ does have an impact on $Y$ (this could mean that $Y$ increases as $X$ increases, or the other way around).
If the slope is $0$, that would mean that $X$ doesn't affect $Y$; so what we really want to know is whether the slope is significantly different from $0$, which tells us whether there is a relationship.
We find the "best fit line"
$$ y = b + mx $$
$$ \hat{y} = \alpha + \beta x $$
by using Ordinary Least Squares (OLS), which minimizes the sum of the squared vertical distances (the residuals, or errors) from each actual data point to the estimated line.
First we find the slope $\beta$ and the intercept $\alpha$ by computing
$$ \beta = \frac {N \sum xy - \sum x \sum y}{N \sum x^2 - (\sum x)^2} $$
$$ \alpha = \overline{y} - \beta \overline{x} $$
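As a quick sanity check, here is a small Python sketch (with made-up $x$ and $y$ values, purely for illustration) that applies these two formulas directly and compares the result with `np.polyfit`:

```python
import numpy as np

# Made-up example data (for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
N = len(x)

# Slope: beta = (N*sum(xy) - sum(x)*sum(y)) / (N*sum(x^2) - (sum(x))^2)
beta = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / (N * np.sum(x ** 2) - np.sum(x) ** 2)
# Intercept: alpha = mean(y) - beta * mean(x)
alpha = y.mean() - beta * x.mean()

print("slope beta:", beta, "intercept alpha:", alpha)
# np.polyfit(x, y, 1) returns [slope, intercept] for a degree-1 fit
print("check with np.polyfit:", np.polyfit(x, y, 1))
```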
Then we check whether the slope is significantly different from $0$ by using a t-test:
$$ t = \frac{\beta}{\sigma_{\beta}} $$
$$ \sigma_{\beta} = \frac{\sqrt{ \frac{\sum(y - \hat{y})^2}{N-2}}}{\sqrt{ \sum (x - \overline{x})^2}} $$
and $ df = N - 2 $ because we estimate two parameters, $\alpha$ and $\beta$; the test is two-tailed.
Then we compare this value of $t$ with a critical value from the t-table to draw a conclusion.
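Continuing the same sketch, here is a hedged example of this slope t-test, reusing the hypothetical `x`, `y`, `N`, `alpha`, and `beta` from the previous block:

```python
from scipy import stats

y_hat = alpha + beta * x          # points on the fitted line
# Standard error of the slope, as in the formula above
sigma_beta = np.sqrt(np.sum((y - y_hat) ** 2) / (N - 2)) / np.sqrt(np.sum((x - x.mean()) ** 2))
t = beta / sigma_beta
df = N - 2

# Two-tailed critical value at the 5% significance level
t_crit = stats.t.ppf(1 - 0.05 / 2, df)
print("t:", t, "critical value:", t_crit)
print("slope significantly different from 0?", abs(t) > t_crit)
```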
Statistical Test - Correlation
The variables must also be "Interval" or "Ratio" variables. First, we must know that covariation and correlation measure the strength of a linear relationship between $X$ and $Y$, that is, the way they vary (linearly) in relation to each other.
Covariance
$$ \sigma(X, Y) = cov(X, Y) = \frac{\sum_{i = 1}^{N} (x_i - \overline{x})(y_i - \overline{y})}{N} $$

We divide by $N$ to normalize the sum so that it does not depend on the number of cases.
- If those two variables change in the same way consistently, they are covarying.
- If they don't change in the same way consistently, covariation will be low.
- If covariance is zero they are called uncorrelated.
- If $X$ and $Y$ are independent, then their covariance is zero. The converse, however, is not generally true because correlation and covariance are measures of LINEAR DEPENDENCE between two variables.
- If $cov(X,Y) > 0$, then $Y$ tends to increase as $X$ increases, and if $cov(X,Y) < 0$ then the other way around.
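Here is a minimal Python sketch of the covariance formula above, with made-up paired observations; note that `np.cov` divides by $N - 1$ by default, so `bias=True` is passed to match the $N$ denominator used here:

```python
import numpy as np

# Made-up paired observations (for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
N = len(x)

# Covariance with the N denominator, exactly as in the formula above
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / N
print("cov by hand:", cov_xy)

# np.cov uses an N-1 denominator by default; bias=True switches it to N
print("np.cov:", np.cov(x, y, bias=True)[0, 1])
```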
The correlation coefficient, Pearson product-moment correlation or Pearson correlation (r)
One problem with the covariance calculation is that it's really sensitive to the scale of the variables you're using: if you're measuring something in the hundreds, such as IQ, you're going to get big numbers for the covariance. So one way to standardize it is to divide the covariance by the product of the standard deviations:
$$ r = \frac{cov(X, Y)}{\sigma_X \sigma_Y}$$
We must also know the possible values of $r$: it always falls between $-1$ and $+1$, where $-1$ is a perfect negative linear relationship, $0$ means no linear relationship, and $+1$ is a perfect positive linear relationship.
Then, to know the statistical significance of a value of $r$, we use the t-table again:
$$ t = \frac{r \sqrt{N-2}}{\sqrt{1 - r^2}} $$
It is again a two-tailed test, with
$$ df = N - 2 $$
Then we see if we reject the "Null Hypothesis" with the same rule as in the previous "Statistical Tests".
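To tie the correlation steps together, here is a small Python sketch (again with made-up data) that computes $r$ and its $t$ statistic by hand and checks $r$ against `scipy.stats.pearsonr`:

```python
import numpy as np
from scipy import stats

# Made-up paired observations (for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
N = len(x)

# r = cov(X, Y) / (sigma_X * sigma_Y); np.std uses the N denominator, matching cov_xy
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / N
r = cov_xy / (x.std() * y.std())

# Significance: t with N - 2 degrees of freedom, two-tailed
t = r * np.sqrt(N - 2) / np.sqrt(1 - r ** 2)
df = N - 2
t_crit = stats.t.ppf(1 - 0.05 / 2, df)

print("r by hand:", r, "r from scipy:", stats.pearsonr(x, y)[0])
print("t:", t, "critical value:", t_crit, "reject H0?", abs(t) > t_crit)
```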