Introduction to the course

What will we do in this course?

Introduction to statistics and R
Descriptive statistics
Sampling distributions, sampling statistics and population parameters
Hypothesis testing and parametric tests
Any suggestions?

Learning objectives of Lecture 1

Understand the definition of statistics and how we do statistics.
Understand the difference between R and statistics.
Hands-on R introduction
Understand the difference between statistics and parameters
Define descriptive statistics
Identify types of variables

What is statistics?

Statistics: the discipline of collecting, organizing, analyzing, and interpreting data. This course has a lecture and a lab component.

A statistical question is a question where you expect a variety of answers and you are interested in the distribution and tendency of those answers.

Example: What is the weight of a cat (in kg)?

We sample 20 cats and get a variety of responses.

We expect a tendency

Which question is more suitable to be answered using statistical analysis?

How much does my cat weigh?
How much does Garfield the cat weigh?
How much does the average cat weigh?

How do we do statistics?

We can calculate the mean weight by hand:

Sum all the weights and divide them by the number of samples.

(4.845743 + 4.650540 + 4.558134 + 4.841231 + 3.817112 + 4.667056 + 4.226919 + 4.523914 + 4.559097 +
 4.940135 + 3.992729 + 4.432350 + 3.733077 + 3.571335 + 4.068627 + 3.685094 + 4.557106 + 3.697995 +
 3.531068 + 3.815028) / 20

## [1] 4.235715

But, fortunately, we have R to help us. R is statistical software that allows us to perform statistical analyses using built-in functions. That means we do not have to do everything by hand anymore!

mean(weight_cat)

## [1] 3.986994
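
The two approaches can be compared side by side. This is a minimal sketch, assuming the 20 weights from the hand calculation are stored in a vector; the name cat_weights is hypothetical, and the notes do not show how weight_cat itself was created, so its mean need not match.

```r
# Hypothetical vector holding the 20 weights (kg) from the hand calculation
cat_weights <- c(4.845743, 4.650540, 4.558134, 4.841231, 3.817112,
                 4.667056, 4.226919, 4.523914, 4.559097, 4.940135,
                 3.992729, 4.432350, 3.733077, 3.571335, 4.068627,
                 3.685094, 4.557106, 3.697995, 3.531068, 3.815028)

# Mean "by hand": sum of the values divided by how many values there are
by_hand <- sum(cat_weights) / length(cat_weights)

# The built-in function performs the same calculation
with_function <- mean(cat_weights)

by_hand
with_function
```

Both lines print the same number, because mean() is simply the sum divided by the number of values.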

Population and Samples

Population: The complete set of individuals.

Sample: A subset of the complete population.

The size of the sample will always be less than the total size of the population.

Example 1: Analysis of the grades in the final exam of a phonetics class.

Population: All the students in the class.

Sample: A set of 10 students.

Unbiased sample: we select the students at random from the population.

Example 2: Analysis of the use of fillers in spontaneous speech in French.

Population: All the speakers of French.

Sample: A set of French speakers (for an unbiased sample, you need randomization).


Sampling

When we select a sample of the population, we want that sample to be unbiased. For example, if we are interested in the tendency in the income of families residing in Toronto and we allow for voluntary participation, probably only families with high income will answer the survey, and the sample will thus be biased.

Random sampling: When we select samples of the population, we want to make sure that each unit has approximately an equal chance of being represented in the sample.
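
Random sampling can be sketched in R with the sample() function. The population of 1000 family ID numbers and the sample size of 50 are assumptions made for this illustration, not values from the notes.

```r
# Hypothetical population: ID numbers for 1000 families
set.seed(42)                      # makes the random draw reproducible
population_ids <- 1:1000

# Simple random sample of 50 families, drawn without replacement,
# so every family has the same chance of being selected
sampled_ids <- sample(population_ids, size = 50)

length(sampled_ids)               # 50
anyDuplicated(sampled_ids)        # 0 means no family was drawn twice
```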

Maria is conducting an experiment on how L2 Catalan speakers acquire mid vowel contrasts. She wants to know whether L2 proficiency of Catalan has an effect on accuracy of mid vowel perception. She collects data from 5 L2 Catalan speakers with low proficiency and 5 L2 Catalan speakers with high L2 proficiency. What is the population and sample in this experiment?

A. The population is all L2 Catalan speakers, and the sample is the 10 participants that Maria studied.

B. The sample is all L2 Catalan speakers, and the population is the 5 speakers with low proficiency that Maria studied.

C. The sample is all L2 Catalan speakers, and the population is the 10 participants that Maria studied.

D. The population is all L2 Catalan speakers, and the sample is the 5 speakers of high proficiency that Maria studied.

Statistics and parameters

Now we have distinguished between populations and samples. It is also important to note that the number we compute from the sample (the statistic) is used to estimate the corresponding number for the population (the parameter).

Statistic: number that represents a property of the sample.

Parameter: property of the population (Greek letters are used for parameters).

Name and description	Statistic	Parameter
Mean	$\bar{x}$ (sum of the variable in the sample divided by the size of the sample) $\frac{\sum_{i\in S} x_i}{|S|}$	$\mu$ (sum of the variable in the population divided by the size of the population) $\frac{\sum_{i\in P} x_i}{|P|}$
Variance	$s^2$ (sum of the squared distances of the variable from the sample mean, divided by the size of the sample minus 1) $\frac{\sum_{i\in S} (x_i-\bar{x})^2}{|S|-1}$	$\sigma^2$ (sum of the squared distances of the variable from the population mean, divided by the size of the population) $\frac{\sum_{i\in P} (x_i-\mu)^2}{|P|}$
Standard deviation	$s$ (“the average distance from the sample mean”) $\sqrt{s^2}$	$\sigma$ (“the average distance from the population mean”) $\sqrt{\sigma^2}$
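
The formulas in the table can be checked against R's built-in functions. This is a minimal sketch; the five measurements are hypothetical and serve only to illustrate the arithmetic.

```r
# Hypothetical measurements, used only to illustrate the formulas
x <- c(2, 3, 4, 5, 6)
n <- length(x)

x_bar <- sum(x) / n                       # mean
s2    <- sum((x - x_bar)^2) / (n - 1)     # sample variance, divides by n - 1
s     <- sqrt(s2)                         # sample standard deviation

# R's var() and sd() use the sample (n - 1) formulas
all.equal(s2, var(x))   # TRUE
all.equal(s, sd(x))     # TRUE

# If x were the entire population, the variance would divide by n instead
sigma2 <- sum((x - x_bar)^2) / n
sigma2                  # 2
```

Note that var() and sd() always divide by n - 1, so the population versions have to be computed explicitly when the data really are the whole population.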

An important question to ask is: How accurately does a statistic estimate a parameter?

Example. Determine what the key terms refer to in the following study. We want to know the average (mean) number of words per utterance produced in Spanish by toddlers enrolled in a Spanish immersion program (100 students enrolled). We randomly examined the speech of 10 toddlers. Three of those toddlers produced 2, 3, and 4 words per utterance, respectively.

Population: The population is all toddlers enrolled in the program.

Sample: The sample is the random selection of 10 toddlers.

Parameter: The parameter is the average (mean) number of words per utterance produced by all the toddlers enrolled in the Spanish immersion program, $\mu$.

Statistic: The statistic is the average (mean) number of words per utterance of the toddlers in the sample, $\bar{x}$.
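
The relation between the parameter and the statistic in this example can be sketched with simulated data. The words-per-utterance values below are randomly generated and purely hypothetical; only the population size (100) and the sample size (10) come from the example.

```r
set.seed(1)  # reproducible random numbers

# Hypothetical population: mean words per utterance for each of 100 toddlers
words_per_utterance <- round(runif(100, min = 1, max = 5), 1)

# Parameter: the mean over the whole population
mu <- mean(words_per_utterance)

# Statistic: the mean over a random sample of 10 toddlers
sampled <- sample(words_per_utterance, size = 10)
x_bar <- mean(sampled)

mu
x_bar
abs(x_bar - mu)   # how far the estimate is from the parameter
```

In a real study only x_bar is available; mu can be computed here only because the whole population was simulated.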

The Descriptive and Inferential Functions of Statistics

Descriptive statistics: measurements to describe basic features of a data set by generating summaries about data samples (distribution, tendency, variance)

Inferential statistics: uses measurements from the sample to estimate population parameters.

We use the sample statistics to infer the population parameters.


Example: Descriptive and Inferential statistics

In this example we have two groups of students that learned French using two different methods:

10 French in-class language learners
10 French language learners immersed in French in Montreal

Below are the grades for each type of learning:

in_class = c(13,18,19,15,16,18,19,15,16,14)

immersed = c(19,18,19,15,16,18,19,19,20,19)

Descriptive statistics

If we want to have summaries of the data:

Frequency distribution
- How many students gained 1, 2, 3, ..., 19, 20 marks

table(in_class)

## in_class
## 13 14 15 16 18 19 
##  1  1  2  2  2  2

table(immersed)

## immersed
## 15 16 18 19 20 
##  1  1  2  5  1

Mean
- The average mark

mean(in_class)

## [1] 16.3

mean(immersed)

## [1] 18.2

Standard deviation
- How much variation is there in each group (standard deviation)

sd(in_class)

## [1] 2.110819

sd(immersed)

## [1] 1.549193

Inferential statistics

Inferring relationships from descriptive measures.

How confident can we be of extrapolating the statistics of our samples to the larger population?

Use measures from our samples (statistics) to estimate parameters of the population.

Has the L2 immersion group done better than the L2 classroom learning group?

Hypothesis testing
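
As a preview, the question above could be approached with a two-sample t-test. This is only a sketch under the assumption that a Welch t-test is an appropriate tool for these data; the notes do not say which test is used.

```r
# Grades from the example above
in_class <- c(13, 18, 19, 15, 16, 18, 19, 15, 16, 14)
immersed <- c(19, 18, 19, 15, 16, 18, 19, 19, 20, 19)

# Welch two-sample t-test (the default of t.test() for two vectors)
result <- t.test(immersed, in_class)

result$estimate    # the two sample means
result$statistic   # t value
result$p.value     # chance of a t value at least this extreme
                   # if the two population means were equal
```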

Variables and their Classification

Independent and Dependent Variables

Independent variable: The condition that we vary.
Dependent variable: The condition that we measure.

Independent variable	Dependent variable
Type of teaching method (in-class vs. immersed)	Score on language test

Categorical or numeric variables

Categorical variables: Categorical variables represent groupings of some kind. They are sometimes recorded as numbers, but the numbers represent categories rather than actual amounts of things.
- Nominal variables: variables that have two or more categories, but which do not have an intrinsic order.
  - Grammatical category of a word (e.g., noun, adj, v, adv).
- Binary variables: variables which have only two categories or levels.
  - Knowledge of French (Yes/No)
- Ordinal variables: variables that have two or more categories, just like nominal variables, except that the categories can also be ordered or ranked.
  - Knowledge of French (None/Some/A lot)
Numeric variables: A numeric variable (also called quantitative variable) is a quantifiable characteristic whose values are numbers (except numbers which are codes standing in for categories). Numeric variables may be either continuous or discrete.
- Discrete variable: Discrete variables are numeric variables that have a countable number of values between any two values. A discrete variable is always numeric. For example, the number of customer complaints or the number of flaws or defects.
  - Number of syllables that children produce in a word, number of words that students produce per sentence
- Continuous variable: Continuous variables are numeric variables that have an infinite number of values between any two values. A continuous variable can be numeric or date/time. For example, the length of a part or the date and time a payment is received.
  - Syllable duration, time to utter a sentence
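
These variable types map onto R data types roughly as follows. This is a minimal sketch; all values and variable names are made up for illustration.

```r
# Nominal: categories with no intrinsic order
word_class <- factor(c("noun", "verb", "adj", "noun"))

# Binary: exactly two levels
knows_french <- factor(c("Yes", "No", "Yes", "Yes"), levels = c("No", "Yes"))

# Ordinal: categories with an order
french_level <- factor(c("Some", "None", "A lot", "Some"),
                       levels = c("None", "Some", "A lot"),
                       ordered = TRUE)

# Discrete numeric: counts
n_syllables <- c(2L, 3L, 1L, 4L)

# Continuous numeric: measurements (here, seconds)
syllable_duration <- c(0.182, 0.240, 0.097, 0.311)

nlevels(word_class)                 # 3
is.ordered(french_level)            # TRUE
french_level[2] < french_level[1]   # TRUE: "None" ranks below "Some"
class(n_syllables)                  # "integer"
class(syllable_duration)            # "numeric"
```

Storing ordinal data as an ordered factor lets R compare levels, which is not meaningful for nominal factors.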

Statistics in R

R language and R Studio

R is a programming language for statistical computing, data analysis, and visualization, widely used in research and data science. RStudio is a user-friendly IDE that enhances productivity with tools for writing, running, and visualizing R code.

RStudio Interface

The RStudio interface consists of several windows. You can resize the windows by dragging the grey bars between them. Below is an overview of each section.

Bottom Left: Console Window (Command Window/Line)

This is where you type commands after the > prompt, and R executes them.
It’s the most important window because this is where R actually performs actions.

Top Left: Editor Window (Script Window)

Used to edit and save collections of commands (scripts).
To open this window (if not visible), go to File → New → R Script.
Commands typed here must be sent to the console window for execution.
To run code:
- Highlight a line and click Run or press CTRL+ENTER to send it to the console.
- If nothing is highlighted, R will execute the line where the cursor is.

Top Right: Workspace/History Window

Workspace Tab:
- Displays data and values currently in R’s memory.
- Allows you to view and edit these values by clicking on them.
History Tab:
- Shows the commands that have been typed previously.

Bottom Right: Files/Plots/Packages/Help/Viewer Window

Files Tab: Open files.
Plots Tab: View plots (including previous plots).
Packages Tab: Install and load packages.
Help Tab: Use the help function.
Viewer Tab: View additional outputs, like HTML files.

Working Directory

The working directory is the folder on your computer where R operates.
When opening or saving files, R uses the working directory as the default location.
To set the working directory:
- Go to Session → Set Working Directory → Choose Directory.

Scripts

Commands can be stored in files called scripts (with .R extension, e.g., foo.R).
To run part of the script:
- Select the lines and press CTRL+ENTER or click Run in the editor window.
- If no lines are selected, the line under the cursor will be executed.

2 + 2

## [1] 4

Packages

R can do many different types of statistical and data analyses. They are organized in so-called packages or libraries. With the standard installation, most common packages are installed.

There are many more packages available on the R website. If you want to install and use a package (for example, the package called ‘geometry’), you should first install the package by clicking ‘install packages’ in the packages window and typing geometry, or by typing install.packages("geometry") in the command window.

#  install.packages("languageR")

After installing a package, you need to load it:

library(languageR)

## Warning: package 'languageR' was built under R version 4.3.3

R as a calculator

R can be used as a calculator. You can just type your equation in the command window after the >. Type 10 ^ 2 + 36.

Compute the difference between 10 and 5 and divide this by the difference between 3 and 2.
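
One way to work through the two calculations above is sketched here; the names answer1 and answer2 are chosen only for this illustration.

```r
answer1 <- 10^2 + 36             # exponent first, then addition
answer1                          # 136

answer2 <- (10 - 5) / (3 - 2)    # parentheses make the subtractions happen first
answer2                          # 5

10 - 5 / 3 - 2                   # without parentheses the division happens first
```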

You can also give numbers a name. By doing so, they become so-called variables which can be used later. For example, you can type in the command window x <- 4 .

x <- 4

You can see that x appeared in the workspace window in the top right corner, which means that R now remembers what x is.

Some people prefer to use <- instead of =. For a simple assignment like this one, they do the same thing. <- consists of two characters, namely < and -, and represents an arrow pointing at the object receiving the value of the expression.

x <- 4

You can also ask R what x is. Just type x in the command window.

x

## [1] 4

You can also do calculations with x. Type x * 5 .

x * 5

## [1] 20

If you assign x again, R will forget what value it had before. You can also assign a new value to x using the old one. Type x <- x + 10 .
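
Reassignment can be sketched as follows, using the variable x with the value 4 as above.

```r
x <- 4
x <- x + 10   # the right-hand side uses the old value (4); x then holds the result
x             # 14
```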

Now you know how to store numbers in variables. We can also create vectors using c(). A vector is a sequence of elements. We will create a numeric vector.

vector <- c(1, 2, 3, 4)

We can ask R things about vectors. Note that a single number such as x is itself a vector of length 1.

length(x) #calculate the length

## [1] 1

class(x) #class of vector

## [1] "numeric"
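
The same questions can be asked about a longer vector. In this sketch the four-element vector is stored under the hypothetical name my_vector, to avoid reusing the name of R's built-in vector() function.

```r
# Four-element numeric vector under a hypothetical name
my_vector <- c(1, 2, 3, 4)

length(my_vector)    # 4
class(my_vector)     # "numeric"

# Vectors can also hold text
my_words <- c("cat", "dog")
class(my_words)      # "character"
```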

mynums <- 10:1 # create ranges
mynums #run the two lines

##  [1] 10  9  8  7  6  5  4  3  2  1

Functions

We can call functions on the vector. Functions have the form name(). Functions perform operations on the vectors.

Within the parentheses you specify the arguments. Arguments give extra information to the function.

sum(mynums) # sum

## [1] 55

min(mynums) # smallest value (minimum)

## [1] 1

max(mynums) # largest value (maximum)

## [1] 10

range(mynums) # minimum and maximum together

## [1]  1 10

diff(range(mynums)) # range: difference between min and max

## [1] 9

Use the help function to see which values are used as default in the function by typing ?function.

?mean
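
Arguments and their defaults can be illustrated with mean() and round(). This is a minimal sketch; the scores are hypothetical.

```r
scores <- c(12, NA, 15, 18)     # hypothetical scores; NA marks a missing value

mean(scores)                    # NA, because na.rm is FALSE by default
mean(scores, na.rm = TRUE)      # 15, the missing value is dropped first
round(3.14159, digits = 2)      # 3.14
```

Naming the argument (na.rm = TRUE, digits = 2) makes the call easier to read than relying on argument position.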

You can also store the output of the function in a variable. Type x <- rnorm(100).

x <- rnorm(100)

R can also make graphs. Type plot(x) for a very simple example.

plot(x)

Subsetting

Subsetting vectors with square brackets

Use square brackets to subset elements in the vector

mynums[1]    # retrieve value at first position

## [1] 10

mynums[2] # retrieve value at second position

## [1] 9

R also allows slicing, or taking more than one entry at a time.

mynums[1:4]  # retrieve first four values

## [1] 10  9  8  7

Sometimes data is not available. This is different from not possible or null. R uses the value NA to indicate this.

x <- c(NA, 2, 3, 1) # a vector with one missing value

x[is.na(x)] # retrieve NA values

## [1] NA
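
Square brackets also accept TRUE/FALSE conditions and negative positions. The sketch below recreates the two vectors used above so that it runs on its own.

```r
mynums <- 10:1
x <- c(NA, 2, 3, 1)

mynums > 7             # a TRUE/FALSE value for each element
mynums[mynums > 7]     # 10 9 8
x[!is.na(x)]           # 2 3 1, everything except the missing value
mynums[-1]             # a negative index drops that position
```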

Subsetting vectors by names

 x <- 1:3 #create a sequence
 names(x) <- c("one", "two", "three") # set the names
 x["one"]

## one 
##   1

Data frames

Data frames are very important objects in data analysis. Data frames are 2-dimensional objects with columns and rows.

 library(languageR)

We’ll use the dataset beginningReaders from the package languageR.

beginningReaders: Visual lexical decision latencies for beginning readers (8-year-old Dutch children).

Print the first 6 rows of the data frame beginningReaders (head() shows 6 rows by default).

 head(beginningReaders)

##       Word Subject    LogRT Trial OrthLength LogFrequency LogFamilySize
## 1 avontuur     S28 7.410347   190          8     4.394449      1.609438
## 2 avontuur     S40 8.065894   298          8     4.394449      1.609438
## 3 avontuur     S37 6.739337   172          8     4.394449      1.609438
## 4 avontuur     S65 7.021976   295          8     4.394449      1.609438
## 5 avontuur     S54 7.167809    74          8     4.394449      1.609438
## 6 avontuur     S43 7.628518   446          8     4.394449      1.609438
##   ReadingScore ProportionOfErrors        PC1        PC2          PC3
## 1           39          0.0877193 -0.4267708  0.1789760  0.099767967
## 2           34          0.0877193  0.8830183 -0.5908354 -0.928297160
## 3           61          0.0877193 -1.0075239  0.3735815 -0.002442806
## 4           66          0.0877193 -0.1235864  0.1208182  0.764377023
## 5           41          0.0877193 -0.4886353 -0.9584474  0.193079488
## 6           23          0.0877193  1.8644799  0.7346544 -0.280724884
##           PC4
## 1  0.10309743
## 2 -1.65831762
## 3 -0.01558577
## 4  0.38848318
## 5 -1.28002946
## 6  0.14591764

How many columns does the data frame have?

 ncol(beginningReaders)

## [1] 13

How many rows does the data frame have?

 nrow(beginningReaders)

## [1] 7923

What are the names of the columns?

 names(beginningReaders)

##  [1] "Word"               "Subject"            "LogRT"             
##  [4] "Trial"              "OrthLength"         "LogFrequency"      
##  [7] "LogFamilySize"      "ReadingScore"       "ProportionOfErrors"
## [10] "PC1"                "PC2"                "PC3"               
## [13] "PC4"

Print a summary of the data frame

 summary(beginningReaders)

##        Word         Subject         LogRT           Trial      
##  mus     :  56   S46    : 180   Min.   :5.545   Min.   :  1.0  
##  poes    :  56   S52    : 177   1st Qu.:7.001   1st Qu.:126.0  
##  sok     :  56   S12    : 174   Median :7.329   Median :242.0  
##  oom     :  55   S63    : 174   Mean   :7.318   Mean   :247.8  
##  plein   :  55   S74    : 172   3rd Qu.:7.657   3rd Qu.:370.0  
##  sprookje:  55   S75    : 171   Max.   :8.294   Max.   :567.0  
##  (Other) :7590   (Other):6875                                  
##    OrthLength      LogFrequency   LogFamilySize    ReadingScore  
##  Min.   : 2.000   Min.   :2.079   Min.   :0.000   Min.   :10.00  
##  1st Qu.: 5.000   1st Qu.:3.584   1st Qu.:1.099   1st Qu.:29.00  
##  Median : 6.000   Median :4.143   Median :1.386   Median :46.00  
##  Mean   : 5.769   Mean   :4.218   Mean   :1.504   Mean   :47.61  
##  3rd Qu.: 7.000   3rd Qu.:4.860   3rd Qu.:1.792   3rd Qu.:65.00  
##  Max.   :11.000   Max.   :7.029   Max.   :3.807   Max.   :96.00  
##                                                                  
##  ProportionOfErrors      PC1                PC2                 PC3          
##  Min.   :0.01754    Min.   :-6.74895   Min.   :-3.571122   Min.   :-3.35566  
##  1st Qu.:0.13793    1st Qu.:-0.85291   1st Qu.:-0.461370   1st Qu.:-0.43199  
##  Median :0.19643    Median : 0.20943   Median :-0.009402   Median :-0.00131  
##  Mean   :0.21607    Mean   : 0.04015   Mean   : 0.000145   Mean   :-0.00295  
##  3rd Qu.:0.29825    3rd Qu.: 1.13255   3rd Qu.: 0.462891   3rd Qu.: 0.41571  
##  Max.   :0.49123    Max.   : 3.38085   Max.   : 4.172695   Max.   : 2.97171  
##                                                                              
##       PC4           
##  Min.   :-3.055159  
##  1st Qu.:-0.405996  
##  Median : 0.005431  
##  Mean   : 0.002121  
##  3rd Qu.: 0.412868  
##  Max.   : 2.926385  
##

To subset columns we will use the $ operator.

 head(beginningReaders$Word)

## [1] avontuur avontuur avontuur avontuur avontuur avontuur
## 184 Levels: avontuur baden balkon band barsten beek beker beton beven ... zwijgen

We can also use the square brackets to subset rows and columns

First number inside of the [] is the row, the second number is the column.

 beginningReaders[1,2] #first row second column

## [1] S28
## 59 Levels: S10 S12 S13 S14 S15 S16 S18 S2 S20 S21 S22 S26 S27 S28 S29 ... S8

 beginningReaders[2,2] #second row second column

## [1] S40
## 59 Levels: S10 S12 S13 S14 S15 S16 S18 S2 S20 S21 S22 S26 S27 S28 S29 ... S8

 beginningReaders[1,5] #first row fifth column

## [1] 8
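
The same row and column indexing can be tried on a small data frame that does not require any package. This sketch builds one from the grades used earlier in these notes; the name grades and its column names are assumptions made for the illustration.

```r
# Small data frame built from the in-class and immersed grades
grades <- data.frame(
  method = rep(c("in_class", "immersed"), each = 10),
  score  = c(13, 18, 19, 15, 16, 18, 19, 15, 16, 14,
             19, 18, 19, 15, 16, 18, 19, 19, 20, 19)
)

grades[1, 2]                        # first row, second column: 13
grades[1:3, ]                       # first three rows, all columns
grades[, "score"]                   # all rows of one column, selected by name
grades[grades$score >= 19, ]        # only the rows that meet a condition

# Mean score for one group
mean(grades$score[grades$method == "immersed"])   # 18.2
```

Leaving the row or column position empty, as in grades[1:3, ], means "keep all of them".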

Plotting

Base graphics are often constructed piecemeal, with each aspect of the plot handled separately through a particular function call. Usually you start with a plot function (such as plot, hist, or boxplot), then you use annotation functions (text, abline, points) to add to or modify your plot.

As you’ll see, most of the base plotting functions have many arguments, for example, setting the title, labels of axes, plot character, etc. Some of the parameters can be set when you call the function or they can be added later in a separate function call.
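
The piecemeal approach can be sketched as follows. The reaction times are simulated and purely hypothetical, so the example runs without any package.

```r
set.seed(123)
rt <- rnorm(200, mean = 7.3, sd = 0.5)   # hypothetical log reaction times

# 1. Start with a plotting function; title and axis label are set in the call
hist(rt,
     main = "Simulated log reaction times",
     xlab = "Log transformed reaction time")

# 2. Add to the existing plot with annotation functions
abline(v = mean(rt), lty = 2)                         # dashed vertical line at the mean
text(x = mean(rt), y = 5, labels = "mean", pos = 4)   # label to the right of the line
```

The annotation functions only work after a plot has been started, which is why hist() comes first.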

We see the dataset contains 13 columns of data.

 ncol(beginningReaders)

## [1] 13

First we’ll do a simple histogram of the LogRT column to show the distribution of the log-transformed reaction time (in ms).

Use the R command hist with the argument beginningReaders$LogRT.

Now, we can modify the x axis title to Log transformed reaction time

 hist(beginningReaders$LogRT, xlab = "Log transformed reaction time")

Next, we will do a density plot

 d = density(beginningReaders$LogRT) #returns density data
 plot(d) #plot results

Next we’ll do a scatter plot. We’ll want a scatter plot of LogRT as a function of OrthLength.

plot(LogRT~OrthLength, beginningReaders)

Note that plot, unlike hist, did NOT add a title for you automatically, and the axis labels are just the variable names.

Let’s call plot again to specify labels. (Use the up arrow to recover the previous command and save yourself some typing.) We’ll add more arguments to the call to specify labels for the 2 axes. Set xlab equal to “Orthographic length” and ylab equal to “Log transformed reaction time”.

  plot(LogRT~OrthLength, beginningReaders, xlab="Orthographic length", ylab="Log transformed reaction time")

Statistics for linguists