Learning objectives

  • Understand hypothesis testing.

  • Understand the steps for hypothesis testing.

  • Distinguish between parametric and non-parametric tests.

  • Understand and run t-tests.

Hypothesis testing

Statistical tests of significance

  • Test whether an observed difference could have been expected to occur ‘by chance’ or whether the independent variable truly affects the outcomes of the dependent variable.

The null hypothesis and the alternative hypothesis

Null hypothesis: there is no difference between the values of the means in the populations from which the samples were drawn (the two samples belong to the same population)

  • There is no difference between the number of syllables in the populations from which the two groups of scores were drawn.

\(H_0 : \mu = \mu_0\)

Alternative hypothesis: there is a difference between the values of the means in the populations from which the samples were drawn (the two samples belong to two populations)

  • If the hypothesis test is about deciding whether a population mean, \(\mu\), is different from a specified value \(\mu_0\), the alternative hypothesis is expressed as:

\(H_A : \mu \neq \mu_0\)

Such a hypothesis test is called a two-sided test.

  • The mean number of syllables will be DIFFERENT for children with atypical language development (\(\mu\)), when compared to children with typical language development (\(\mu_0\)).

If the hypothesis test is about deciding whether a population mean, \(\mu\), is higher than a specified value \(\mu_0\), the alternative hypothesis is expressed as

\(H_A : \mu > \mu_0\)

  • The mean number of syllables will be HIGHER for children with atypical language development (\(\mu\)), when compared to children with typical language development (\(\mu_0\)).

If the hypothesis test is about deciding whether a population mean, \(\mu\), is lower than a specified value \(\mu_0\), the alternative hypothesis is expressed as

  • The mean number of syllables will be LOWER for children with atypical language development (\(\mu\)), when compared to children with typical language development (\(\mu_0\)).

\(H_A : \mu < \mu_0\)

Important: when we perform hypothesis testing, we are testing whether the data give us enough evidence to reject the null hypothesis.

Calculate a test statistic to find the probability of the observed results, on the assumption that the null hypothesis is true. The test statistic is a number calculated from the data set, which is obtained by measurements and observations or, more generally, by sampling.

Steps to do hypothesis testing

  1. Formulate the null hypothesis and the alternative hypothesis.
  • There is no difference between the mean numbers of syllables in the populations from which the two groups of scores were drawn.

\(H_0 : \mu = \mu_0\)

  • The mean number of syllables will be LOWER for the children with atypical language development than for the children with typical language development.

\(H_A : \mu < \mu_0\)

  2. Select your significance level

We have to assume that there is always a chance that the differences we observe are due to chance (sampling variation) and not due to a true difference brought about by the independent variable.

We set a probability for our observed results to occur under the null hypothesis.

A significance level of 5% means that there is a 5% probability that our observed difference is a result of chance (different sampling). Researchers in the social sciences are normally comfortable with a 5% probability of having found their observed results by chance.

\(p < 0.05\)

  3. Select a test statistic

A test statistic is a value describing the extent to which the research results differ from what the null hypothesis predicts. It is used in a hypothesis test to help you decide whether to support or reject the null hypothesis in your study: you use the test statistic to calculate the p-value of your results.

Two-tailed test

  • A researcher wants to test whether there is a difference in the average test scores of students who study in the morning versus those who study at night. We do not assume in advance which group will perform better.

\(H_A : \mu \neq \mu_0\)

  • We set the confidence level to 95%

  • \(\alpha\) is the significance level, 0.05 (5%). In a two-tailed test it cuts off both tails of the distribution, because the test statistic could have either a positive or a negative value. Each critical region cuts an area of \(\alpha/2\).

  • 1.96 and −1.96 are our critical values. If our test statistic is lower than −1.96 or higher than 1.96, we REJECT the null hypothesis with 95% confidence. The difference between the two means is not likely to be due to chance at a significance level of 5%.
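As a quick sketch, the two-tailed critical values can be obtained in R with the quantile function of the standard normal distribution (base R, no extra packages):

```r
# Two-tailed test at the 5% significance level:
# each tail contains alpha/2 of the probability mass
alpha <- 0.05
upper <- qnorm(1 - alpha / 2)  # upper critical value, about 1.96
lower <- qnorm(alpha / 2)      # lower critical value, about -1.96
c(lower, upper)
```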

One-tailed test

The number of syllables produced by children enrolled with atypical language development will be lower when compared to the number of syllables produced by children with typical language development.

\(H_A : \mu < \mu_0\)

  • We set the confidence level to 95%.

  • \(\alpha\) is the significance level, 0.05 (5%). In a one-tailed test it cuts off only one tail of the distribution, so the critical region cuts an area of \(\alpha\) (here the left tail, because we expect a lower mean).

  • The null hypothesis is rejected if the test statistic is too small.

A left-tailed test is used when the alternative hypothesis states that the true value of the parameter specified in the null hypothesis is less than the null hypothesis claims.

A right-tailed test is used when the alternative hypothesis states that the true value of the parameter specified in the null hypothesis is greater than the null hypothesis claims
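As a sketch in R: for a one-tailed test the entire \(\alpha\) sits in a single tail, so the critical value is less extreme than in the two-tailed case:

```r
alpha <- 0.05
left_crit  <- qnorm(alpha)      # left-tailed critical value, about -1.64
right_crit <- qnorm(1 - alpha)  # right-tailed critical value, about 1.64
c(left_crit, right_crit)
```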

From ScienceDirect

Types of error in significance testing

When we reject or fail to reject the null hypothesis, we do so at a given significance level (e.g., 5%). This means there is still some chance of rejecting the null hypothesis when in reality it is true, or of accepting the null hypothesis when in reality it is false.

Type I error: The null hypothesis is rejected when it is actually true (false positive)

Type II error: The null hypothesis is not rejected when it is actually false (false negative)
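The Type I error rate can be illustrated with a small simulation (a sketch with made-up numbers): when the null hypothesis is true, about 5% of tests still come out significant at the 5% level:

```r
set.seed(1)
# Simulate 2000 experiments in which H0 is TRUE:
# both samples come from the same population
p_values <- replicate(2000, {
  a <- rnorm(20, mean = 10, sd = 2)
  b <- rnorm(20, mean = 10, sd = 2)
  t.test(a, b)$p.value
})
# Proportion of false positives (Type I errors), close to 0.05
mean(p_values < 0.05)
```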

Types of error

Choosing a test

Parametric tests: A parametric test is a statistical test that makes certain assumptions about the distribution of the unknown parameter of interest, and thus the test statistic is valid under these assumptions. Parametric tests have the benefit of being precise in their assumptions, which leads to more precise inferences.

  • Data are normally distributed.

  • For some parametric tests, the populations must have equal variances.

Example: Difference between the mean scores of a group of students that learned statistics for 10 hours a week during 2 weeks, and a group of students that learned statistics for 2 hours a week during 20 weeks.

Non-parametric tests: Non-parametric tests are methods of statistical analysis that do not require the data to meet distributional assumptions (especially when the data are not normally distributed). For this reason, they are sometimes referred to as distribution-free tests.

Factors to decide which test to use:

Non-parametric tests: for ranking, ordinal variables, and numeric variables that are not normally distributed.

Non-parametric tests are less powerful than parametric tests.

Example: Participants decide whether speech produced using a mask is intelligible or not (Likert scale from 1 to 7), and this is compared to speech produced without a mask. Difference between the means of the intelligibility judgement task.

Type of design used for the study

When you conduct a hypothesis test using two random samples, you must choose the type of test based on whether the samples are dependent or independent.

Correlated samples: for repeated measures designs

Example: one group of speakers exposed to speech produced with masks and speech not produced with masks.

Independent samples: two unrelated populations

Example: one group of speakers exposed to speech produced with masks and one group of speakers exposed to speech produced without masks.
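In R, the distinction is the `paired` argument of `t.test()`. A minimal sketch with hypothetical intelligibility ratings (the sample sizes, means, and SDs below are made up):

```r
set.seed(42)
# Hypothetical intelligibility ratings (1-7 scale) from 15 listeners
mask    <- rnorm(15, mean = 4.2, sd = 0.8)
no_mask <- rnorm(15, mean = 5.1, sd = 0.8)

# Correlated samples (repeated measures): the SAME listeners rated both conditions
res_paired <- t.test(mask, no_mask, paired = TRUE)

# Independent samples: two UNRELATED groups of listeners
res_indep <- t.test(mask, no_mask, paired = FALSE)
```

Note that the paired test has n − 1 = 14 degrees of freedom, while the independent (Welch) test estimates its own degrees of freedom from the data.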

T-test

  • A t-test is an inferential statistic used to determine if there is a significant difference between the means of two groups.
  • It is used when either or both of the samples are smaller than 30.

In this case, the ratio \(\frac{\bar{X}_1 - \bar{X}_2}{\text{standard error of difference between means}}\) is not normally distributed but follows the t distribution.

The t-distribution is similar in shape to the normal distribution, but has heavier tails, meaning that it gives more probability to extreme values than the normal distribution.

The t-distribution is defined by a single parameter, the degrees of freedom (df). The degrees of freedom are a measure of the sample size, and determine the shape of the t-distribution. As the degrees of freedom increase, the t-distribution approaches the normal distribution. When the sample size is large (typically, greater than 30), the t-distribution is very similar to the normal distribution.
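The heavier tails can be checked directly in R by comparing tail probabilities of the t distribution and the normal distribution:

```r
# Probability of a value more extreme than 2 in the upper tail
pnorm(2, lower.tail = FALSE)        # standard normal: about 0.023
pt(2, df = 5, lower.tail = FALSE)   # t with 5 df: about 0.051 (heavier tail)
pt(2, df = 100, lower.tail = FALSE) # t with 100 df: close to the normal value
```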

The degrees of freedom in a t-test are calculated with the sample size and sample statistics. We will not worry about calculating degrees of freedom because R does it for us when we run a t-test.

T-test assumptions

The t-test makes two assumptions:

  • The distributions of the populations from which samples are drawn are approximately normal.

  • The distributions of the populations from which samples are drawn have equal variances.

Rationale of t-test

  1. We have two sample means. These differ to a greater or lesser extent.

  2. We have some idea of what sort of difference we believe exists between the means of the two populations from which we think these samples have come. Under the null hypothesis (that our experimental manipulation has had no effect on our subjects), we would expect the two population means to be identical (i.e., to show no difference).

  3. We compare the difference we actually have obtained, to the difference (no difference) that we would expect to obtain. If we have found a very big difference between our two sample means, there are two possibilities.

  4. Calculate t: we measure how many standard errors the observed difference is away from the expected difference under the null hypothesis. If the difference is large relative to the standard error, it suggests that the means of the two groups are significantly different.

\(t = \frac{\bar{x}_{1} - \bar{x}_{2}}{\sqrt{s^{2}_{p}\left(\frac{1}{n_{1}}+\frac{1}{n_{2}}\right)}}\)

  5. Find the critical value. The critical value of t depends on the number of degrees of freedom. The total number of degrees of freedom is Sample Size A + Sample Size B − 2.
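The steps above can be sketched in R with simulated samples (the means and sample sizes are made up); the manual pooled-variance calculation matches `t.test(..., var.equal = TRUE)`:

```r
set.seed(123)
x1 <- rnorm(20, mean = 14, sd = 1)
x2 <- rnorm(20, mean = 18, sd = 1)
n1 <- length(x1); n2 <- length(x2)

# Pooled variance
s2_p <- ((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2)

# t statistic from the formula above
t_stat <- (mean(x1) - mean(x2)) / sqrt(s2_p * (1 / n1 + 1 / n2))

# Critical value: two-tailed, 5% significance, n1 + n2 - 2 degrees of freedom
t_crit <- qt(0.975, df = n1 + n2 - 2)

abs(t_stat) > t_crit  # TRUE: reject the null hypothesis
```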

Example of t-test

Non-directional t-test (two-tailed t-test)

We have two groups of bilingual speakers: low proficiency and medium proficiency. We want to calculate whether their proficiency scores are, indeed, different.

In a non-directional t-test, a large t-score, or t-value, indicates that the groups are different, while a small t-score indicates that the groups are similar.

set.seed(123)
low_prof <- rnorm(20, mean = 14, sd = 1)
med_prof <- rnorm(20, mean = 18, sd = 1)
t.test(med_prof, low_prof, conf.level = 0.95, alternative = 'two.sided')
## 
##  Welch Two Sample t-test
## 
## data:  med_prof and low_prof
## t = 13.316, df = 37.082, p-value = 1.058e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  3.227856 4.386382
## sample estimates:
## mean of x mean of y 
##  17.94874  14.14162

The t value is 13.32 with 37.08 degrees of freedom. The critical value of the t distribution with 37 degrees of freedom, two-tailed, at the 5% significance level is approximately 2.03. A larger t value means that the group means are different. We can reject the null hypothesis because 13.32 is larger than 2.03.

You can find the t-table here.

Directional t-test

For a one-direction t-test, we can hypothesize that the students with medium-proficiency will have greater mean scores than the students with low-proficiency.

In a right-tail directional t-test a large t-score indicates that the sample mean 1 is greater than the sample mean 2.

t.test(med_prof, low_prof, conf.level = 0.95, alternative = 'greater')
## 
##  Welch Two Sample t-test
## 
## data:  med_prof and low_prof
## t = 13.316, df = 37.082, p-value = 5.288e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  3.324792      Inf
## sample estimates:
## mean of x mean of y 
##  17.94874  14.14162

In this case, the critical value of t is 1.68.

In a left-tail directional t-test, a small (large negative) t-score indicates that sample mean 1 is smaller than sample mean 2.

For a one-direction t-test, we can hypothesize that the students with low-proficiency will have lower mean scores than the students with medium-proficiency.

t.test(low_prof, med_prof, conf.level = 0.95, alternative = 'less')
## 
##  Welch Two Sample t-test
## 
## data:  low_prof and med_prof
## t = -13.316, df = 37.082, p-value = 5.288e-16
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##       -Inf -3.324792
## sample estimates:
## mean of x mean of y 
##  14.14162  17.94874

Visualizing and testing normality

The null hypothesis of the Shapiro-Wilk test is that the distribution IS normal. The test computes the W statistic, which measures whether the distribution of the observed data points across quantiles is similar to that of a normal distribution.

The W statistic comes with a probability (p-value): the probability that the W statistic takes this value by chance/fluke/accident. If the p-value is below 0.05, we reject the null hypothesis at the 5% significance threshold. That is, we say that the distribution is not normal, and the probability that we make a Type I (alpha) error is below 5%. If the p-value is above 0.05, we fail to reject the null hypothesis, i.e., the data are consistent with a normal distribution.

# pizza_time is a numeric vector of observations defined earlier
plot(density(pizza_time))

qqnorm(pizza_time)
qqline(pizza_time)

shapiro.test(pizza_time) # p > 0.05, so we fail to reject the null hypothesis that the data are normally distributed
## 
##  Shapiro-Wilk normality test
## 
## data:  pizza_time
## W = 0.99388, p-value = 0.9349