Learning objectives of Lecture 1
Statistics: the discipline of collecting, organizing, analyzing, and interpreting data. This course has a lecture and a lab component.
A statistical question is a question for which you expect a variety of answers and are interested in the distribution and tendency of those answers.
Example: What is the weight of a cat (in kg)?
Which of the following questions is most suitable to be answered using statistical analysis?
How much does my cat weigh?
How much does Garfield the cat weigh?
How much does the average cat weigh?
We can calculate the mean weight by hand:
(4.845743 + 4.650540 + 4.558134 + 4.841231 + 3.817112 + 4.667056 + 4.226919 + 4.523914 + 4.559097 +
4.940135 + 3.992729 + 4.432350 + 3.733077 + 3.571335 + 4.068627 + 3.685094 + 4.557106 + 3.697995 +
3.531068 + 3.815028) / 20
## [1] 4.235715
But, fortunately, we have R to help us. R is statistical software that allows us to perform statistical analyses using formulas. That means we do not have to do everything by hand anymore!
## [1] 3.986994
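The hand calculation above can be reproduced in R with `c()` and `mean()`. This is a minimal sketch: the 20 weights are copied from the expression above, and the variable name `cat_weights` is only an illustrative choice.

```r
# Cat weights in kg, copied from the hand calculation above
cat_weights <- c(4.845743, 4.650540, 4.558134, 4.841231, 3.817112,
                 4.667056, 4.226919, 4.523914, 4.559097, 4.940135,
                 3.992729, 4.432350, 3.733077, 3.571335, 4.068627,
                 3.685094, 4.557106, 3.697995, 3.531068, 3.815028)

# mean() adds the values and divides by how many there are
mean_weight <- mean(cat_weights)
print(mean_weight)

# The same computation written out step by step
manual_mean <- sum(cat_weights) / length(cat_weights)
print(manual_mean)
```

Both lines should agree with the hand-computed value, up to rounding in the last printed digit.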
Population: The complete set of individuals.
Sample: A subset of the complete population.
The size of the sample will always be less than the total size of the population.
Population: All the students in the class.
Sample: A set of 10 students.
Unbiased sample: we select the 10 students at random from the population.
Example 2: Analysis of use of fillers in spontaneous speech in French
Population: All the speakers of French.
Sample: A set of French speakers (for an unbiased sample you need randomization).
When we select a sample of the population, we want that sample to be unbiased. For example, if we are interested in the typical income of families residing in Toronto and we rely on voluntary participation, probably only families with high income will answer the survey, and the sample will thus be biased.
Random sampling: When we select samples of the population, we want to make sure that each unit has approximately an equal chance of being represented in the sample.
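As a rough illustration of random sampling, the sketch below draws 10 units from a made-up population of 100 using `sample()`. The population labels, the seed, and the variable names are all assumptions made for the example.

```r
# Fix the random seed so the same sample is drawn every time (arbitrary value)
set.seed(1)

# Hypothetical population of 100 units
population <- paste0("student_", 1:100)

# Draw 10 units; by default sample() draws without replacement,
# and every unit has the same chance of being selected
random_sample <- sample(population, size = 10)
print(random_sample)
```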
A. The population is all L2 Catalan speakers, and the sample is the 10 participants that Maria studied.
B. The sample is all L2 Catalan speakers, and the population is the 5 speakers with low proficiency that Maria studied.
C. The sample is all L2 Catalan speakers, and the population is the 10 participants that Maria studied.
D. The population is all L2 Catalan speakers, and the sample is the 5 speakers of high proficiency that Maria studied.
Now that we have distinguished between populations and samples, it is also important to note that the statistic we compute from the sample is used to estimate the corresponding population parameter.
Statistic: number that represents a property of the sample.
Parameter: property of the population (Greek letters are used for parameters).
Name | Statistic | Parameter |
---|---|---|
Mean | \(\bar{x}\) (sum of the variable over the sample divided by the size of the sample) \(\frac{\sum_{i\in S} x_i}{\lvert S\rvert}\) | \(\mu\) (sum of the variable over the population divided by the size of the population) \(\frac{\sum_{i\in P} x_i}{\lvert P\rvert}\) |
Variance | \(s^2\) (sum of the squared distances of the variable from the sample mean, divided by the size of the sample minus 1) \(\frac{\sum_{i\in S} (x_i-\bar{x})^2}{\lvert S\rvert-1}\) | \(\sigma^2\) (sum of the squared distances of the variable from the population mean, divided by the size of the population) \(\frac{\sum_{i\in P} (x_i-\mu)^2}{\lvert P\rvert}\) |
Standard deviation | \(s\) (“the average distance from the sample mean”) \(\sqrt{s^2}\) | \(\sigma\) (“the average distance from the population mean”) \(\sqrt{\sigma^2}\) |
An important question to ask is: How accurately does a statistic estimate a parameter?
Example. Determine what the key terms refer to in the following study. We want to know the average (mean) number of words per utterance produced in Spanish by toddlers enrolled in a Spanish immersion program (100 students enrolled). We randomly examined the speech of 10 toddlers. Three of those toddlers produced 2, 3, and 4 words per utterance, respectively.
Population: All toddlers enrolled in the program.
Sample: The random selection of 10 toddlers.
Parameter: The average (mean) number of words per utterance produced by all the toddlers enrolled in the Spanish immersion program, \(\mu\).
Statistic: The average (mean) number of words per utterance of the toddlers in the sample, \(\bar{x}\).
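The formulas in the table above can be checked on a tiny example. The sketch below treats only the three reported values (2, 3, and 4 words per utterance) as a hypothetical mini-sample; the full sample of 10 toddlers is not given, so this is an illustration of the formulas rather than the study's actual statistic. The variable names are assumptions.

```r
# Hypothetical mini-sample: the three reported values only
words <- c(2, 3, 4)
n <- length(words)

# Sample mean: sum of the values divided by the sample size
x_bar <- sum(words) / n

# Sample variance: squared distances from the mean, divided by n - 1
s2 <- sum((words - x_bar)^2) / (n - 1)

# Sample standard deviation: square root of the sample variance
s <- sqrt(s2)

print(c(mean = x_bar, variance = s2, sd = s))

# R's built-in functions use the same sample (n - 1) formulas
print(c(mean = mean(words), variance = var(words), sd = sd(words)))

# Dividing by n instead, as in the population formula, gives a smaller value
variance_divided_by_n <- sum((words - x_bar)^2) / n
print(variance_divided_by_n)
```

Note that `var()` and `sd()` always divide by n - 1, so they compute the statistic column of the table, not the parameter column.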
Descriptive statistics: measurements that describe basic features of a data set by summarizing the sample (distribution, tendency, variance).
Inferential statistics: uses measurements from the sample to estimate population parameters.
We use the sample statistics to infer the population parameters.
In this example we have two groups of students that learned French using two different methods:
10 French in-class language learners
10 French language learners immersed in French in Montreal
Below are the grades for each type of learning:
If we want to have summaries of the data:
Frequency distribution
## in_class
## 13 14 15 16 18 19
## 1 1 2 2 2 2
## immersed
## 15 16 18 19 20
## 1 1 2 5 1
Mean
## [1] 16.3
## [1] 18.2
Standard deviation
## [1] 2.110819
## [1] 1.549193
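The summaries above can be reproduced with `table()`, `mean()`, and `sd()`. In this sketch the two grade vectors are reconstructed from the frequency tables shown above, so the order of the individual grades is an assumption; the vector names follow the labels in the output.

```r
# Grades reconstructed from the frequency tables (original order unknown)
in_class <- c(13, 14, 15, 15, 16, 16, 18, 18, 19, 19)
immersed <- c(15, 16, 18, 18, 19, 19, 19, 19, 19, 20)

# Frequency distribution: how often each grade occurs
print(table(in_class))
print(table(immersed))

# Mean of each group
print(mean(in_class))   # 16.3
print(mean(immersed))   # 18.2

# Sample standard deviation of each group
print(sd(in_class))     # 2.110819
print(sd(immersed))     # 1.549193
```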
Inferring relationships from descriptive measures.
How confident can we be of extrapolating the statistics of our samples to the larger population?
Has the L2 immersion group done better than the L2 classroom learning group?
Independent variable: The condition that we vary.
Dependent variable: The condition that we measure.
Independent variable | Dependent variable |
---|---|
Type of teaching method (in-class vs. immersed) | Score on language test |
Categorical variable (nominal variable): Categorical variables represent groupings of some kind. They are sometimes recorded as numbers, but the numbers represent categories rather than actual amounts of things.
Numeric variables: A numeric variable (also called a quantitative variable) is a quantifiable characteristic whose values are numbers (excluding numbers that are codes standing in for categories). Numeric variables may be either continuous or discrete.
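The difference between the two kinds of variable can be sketched in R, where categorical variables are usually stored as factors. The codes, labels, and scores below are made up for illustration.

```r
# Hypothetical codes for teaching method: 1 = in-class, 2 = immersed
method_code <- c(1, 2, 1, 2)

# Turn the codes into a categorical variable (a factor) with readable labels
method <- factor(method_code, levels = c(1, 2),
                 labels = c("in-class", "immersed"))

# Hypothetical test scores: a numeric variable
score <- c(15, 19, 16, 18)

print(is.factor(method))    # the teaching method is categorical
print(is.numeric(score))    # the score is numeric
print(table(method))        # counts per category
```

Converting the codes to a factor keeps R from treating them as amounts, so a meaningless "average method" cannot be computed by accident.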
R is a programming language for statistical computing, data analysis, and visualization, widely used in research and data science. RStudio is a user-friendly IDE that enhances productivity with tools for writing, running, and visualizing R code.
The RStudio interface consists of several windows. You can resize the windows by dragging the grey bars between them. Below is an overview of each section.
Console: you type commands after the > prompt, and R executes them.
Script editor: place the cursor on a line of code and press CTRL+ENTER to send it to the console.
R can do many different types of statistical and data analyses. These are organized in so-called packages or libraries. With the standard installation, the most common packages are installed.
There are many more packages available on the R website. If you want to use a package (for example, the package called 'geometry'), you should first install it, either by clicking 'install packages' in the packages window and typing geometry, or by typing install.packages("geometry") in the command window.
After installing a package, you need to load it:
## Warning: package 'languageR' was built under R version 4.3.3
R can be used as a calculator. You can just type your equation in the command window after the >. Type 10 ^ 2 + 36.
Compute the difference between 10 and 5 and divide this by the difference between 3 and 2.
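The two calculations above can be typed as follows. The result names are only there so the values can be printed; in the console you can type the expressions directly.

```r
# Ten squared plus thirty-six
result_1 <- 10^2 + 36
print(result_1)   # 136

# Parentheses make sure both differences are computed before the division
result_2 <- (10 - 5) / (3 - 2)
print(result_2)   # 5
```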
You can also give numbers a name. By doing so, they become so-called variables which can be used later. For example, you can type in the command window x <- 4 .
You can see that x appeared in the workspace window in the top right corner, which means that R now remembers what x is.
Some people prefer to use = instead of <-. For assigning a value at the command line, they do the same thing. <- consists of two characters, namely < and -, and represents an arrow pointing at the object receiving the value of the expression.
You can also ask R what x is. Just type x in the command window.
## [1] 4
You can also do calculations with x. Type x * 5 .
## [1] 20
If you specify x again, it will forget what value it had before. You can also assign a new value to x using the old one. Type x <- x + 10 .
Now you know how to store numbers in variables. We can also create vectors using c(). A vector is a sequence of elements. We will create a numeric vector.
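The assignment steps above can be sketched as follows, using the variable x with the value 4 from the earlier example.

```r
# Store the number 4 under the name x
x <- 4
print(x)       # 4

# Use x in a calculation; x itself does not change
print(x * 5)   # 20

# Overwrite x with a new value computed from the old one
x <- x + 10
print(x)       # 14
```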
We can ask R things about the vectors.
## [1] 1
## [1] "numeric"
## [1] 10 9 8 7 6 5 4 3 2 1
We can call functions on the vector. Functions have the form name(). Functions perform operations on the vectors.
Within the parentheses you specify the arguments. Arguments give extra information to the function.
## [1] 55
## [1] 1
## [1] 10
## [1] 1 10
## [1] 1 10
## [1] 9
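The outputs above are consistent with a vector holding the numbers 10 down to 1. Under that assumption, the sketch below creates such a vector with `c()` and calls a few functions on it. The name `my_vector` is hypothetical, and the exact calls used in the lecture are not shown in the notes.

```r
# A numeric vector with the values 10 down to 1 (assumed from the output above)
my_vector <- c(10, 9, 8, 7, 6, 5, 4, 3, 2, 1)

print(class(my_vector))   # "numeric"
print(sum(my_vector))     # 55
print(min(my_vector))     # 1
print(max(my_vector))     # 10
print(range(my_vector))   # 1 10

# Distance between the largest and the smallest value
print(max(my_vector) - min(my_vector))   # 9

# An argument changes what a function does: here, the sorting direction
print(sort(my_vector, decreasing = FALSE))
```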
Use the help function to see which values are used as default in the function by typing ?function.
?mean
You can also store the output of the function in a variable. Type x <- rnorm(100).
R can also make graphs. Type plot(x) for a very simple example.
Subsetting vectors with square brackets
Use square brackets to subset elements in the vector
## [1] 10
## [1] 9
R also allows slicing, or taking more than one entry at a time.
## [1] 10 9 8 7
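Assuming the vector is the same 10-to-1 vector used earlier (the name `my_vector` is hypothetical), the subsetting operations above might look like this:

```r
my_vector <- c(10, 9, 8, 7, 6, 5, 4, 3, 2, 1)

# Single elements: positions are counted from 1
print(my_vector[1])     # 10
print(my_vector[2])     # 9

# Slicing: 1:4 selects positions 1 through 4
print(my_vector[1:4])   # 10 9 8 7
```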
x <- c(NA, 2, 3, 1) # Sometimes data is not available. This is different from "not possible" or null. R uses the value NA to indicate this.
## [1] NA
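A missing value makes many summaries return NA as well. The sketch below uses `mean()` as an example (which function produced the output above is not shown in the notes) together with the `na.rm` argument, which tells R to drop missing values first.

```r
x <- c(NA, 2, 3, 1)

# Any calculation that involves NA is itself unknown
print(mean(x))                 # NA

# is.na() marks which elements are missing
print(is.na(x))                # TRUE FALSE FALSE FALSE

# na.rm = TRUE removes the missing values before computing
mean_without_na <- mean(x, na.rm = TRUE)
print(mean_without_na)         # 2
```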
Subsetting vectors by names
## one
## 1
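Elements of a vector can carry names and be selected by them. In this sketch only the name "one" comes from the output above; the other names and the vector name are made up for illustration.

```r
# A named vector: each element gets a label
named_vector <- c(one = 1, two = 2, three = 3)

# Select an element by its name instead of its position
print(named_vector["one"])

# List all the names
print(names(named_vector))
```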
Data frames are very important objects in data analysis. Data frames are 2-dimensional objects with columns and rows.
We’ll use the dataset beginningReaders from the package languageR.
beginningReaders: Visual lexical decision latencies for beginning readers (8-year-old Dutch children).
Print the first 6 rows of the data frame beginningReaders
## Word Subject LogRT Trial OrthLength LogFrequency LogFamilySize
## 1 avontuur S28 7.410347 190 8 4.394449 1.609438
## 2 avontuur S40 8.065894 298 8 4.394449 1.609438
## 3 avontuur S37 6.739337 172 8 4.394449 1.609438
## 4 avontuur S65 7.021976 295 8 4.394449 1.609438
## 5 avontuur S54 7.167809 74 8 4.394449 1.609438
## 6 avontuur S43 7.628518 446 8 4.394449 1.609438
## ReadingScore ProportionOfErrors PC1 PC2 PC3
## 1 39 0.0877193 -0.4267708 0.1789760 0.099767967
## 2 34 0.0877193 0.8830183 -0.5908354 -0.928297160
## 3 61 0.0877193 -1.0075239 0.3735815 -0.002442806
## 4 66 0.0877193 -0.1235864 0.1208182 0.764377023
## 5 41 0.0877193 -0.4886353 -0.9584474 0.193079488
## 6 23 0.0877193 1.8644799 0.7346544 -0.280724884
## PC4
## 1 0.10309743
## 2 -1.65831762
## 3 -0.01558577
## 4 0.38848318
## 5 -1.28002946
## 6 0.14591764
How many columns does the data frame have?
## [1] 13
How many rows does the data frame have?
## [1] 7923
What are the names of the columns?
## [1] "Word" "Subject" "LogRT"
## [4] "Trial" "OrthLength" "LogFrequency"
## [7] "LogFamilySize" "ReadingScore" "ProportionOfErrors"
## [10] "PC1" "PC2" "PC3"
## [13] "PC4"
Print a summary of the data frame
## Word Subject LogRT Trial
## mus : 56 S46 : 180 Min. :5.545 Min. : 1.0
## poes : 56 S52 : 177 1st Qu.:7.001 1st Qu.:126.0
## sok : 56 S12 : 174 Median :7.329 Median :242.0
## oom : 55 S63 : 174 Mean :7.318 Mean :247.8
## plein : 55 S74 : 172 3rd Qu.:7.657 3rd Qu.:370.0
## sprookje: 55 S75 : 171 Max. :8.294 Max. :567.0
## (Other) :7590 (Other):6875
## OrthLength LogFrequency LogFamilySize ReadingScore
## Min. : 2.000 Min. :2.079 Min. :0.000 Min. :10.00
## 1st Qu.: 5.000 1st Qu.:3.584 1st Qu.:1.099 1st Qu.:29.00
## Median : 6.000 Median :4.143 Median :1.386 Median :46.00
## Mean : 5.769 Mean :4.218 Mean :1.504 Mean :47.61
## 3rd Qu.: 7.000 3rd Qu.:4.860 3rd Qu.:1.792 3rd Qu.:65.00
## Max. :11.000 Max. :7.029 Max. :3.807 Max. :96.00
##
## ProportionOfErrors PC1 PC2 PC3
## Min. :0.01754 Min. :-6.74895 Min. :-3.571122 Min. :-3.35566
## 1st Qu.:0.13793 1st Qu.:-0.85291 1st Qu.:-0.461370 1st Qu.:-0.43199
## Median :0.19643 Median : 0.20943 Median :-0.009402 Median :-0.00131
## Mean :0.21607 Mean : 0.04015 Mean : 0.000145 Mean :-0.00295
## 3rd Qu.:0.29825 3rd Qu.: 1.13255 3rd Qu.: 0.462891 3rd Qu.: 0.41571
## Max. :0.49123 Max. : 3.38085 Max. : 4.172695 Max. : 2.97171
##
## PC4
## Min. :-3.055159
## 1st Qu.:-0.405996
## Median : 0.005431
## Mean : 0.002121
## 3rd Qu.: 0.412868
## Max. : 2.926385
##
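The inspection steps above use standard functions such as `head()`, `ncol()`, `nrow()`, `names()`, and `summary()`. So that the sketch runs without the languageR package, it builds a small stand-in data frame (the hypothetical `toy_readers`) from the first three rows and first five columns printed above; with the package loaded, the same calls can be applied to beginningReaders.

```r
# Small stand-in built from the first rows printed above
toy_readers <- data.frame(
  Word       = c("avontuur", "avontuur", "avontuur"),
  Subject    = c("S28", "S40", "S37"),
  LogRT      = c(7.410347, 8.065894, 6.739337),
  Trial      = c(190, 298, 172),
  OrthLength = c(8, 8, 8)
)

print(head(toy_readers, n = 2))   # first rows (head() shows 6 by default)
print(ncol(toy_readers))          # number of columns
print(nrow(toy_readers))          # number of rows
print(names(toy_readers))         # column names
print(summary(toy_readers))       # per-column summary
```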
To subset columns, we will use the $ operator.
## [1] avontuur avontuur avontuur avontuur avontuur avontuur
## 184 Levels: avontuur baden balkon band barsten beek beker beton beven ... zwijgen
We can also use square brackets to subset rows and columns.
The first number inside the [] is the row; the second number is the column.
## [1] S28
## 59 Levels: S10 S12 S13 S14 S15 S16 S18 S2 S20 S21 S22 S26 S27 S28 S29 ... S8
## [1] S40
## 59 Levels: S10 S12 S13 S14 S15 S16 S18 S2 S20 S21 S22 S26 S27 S28 S29 ... S8
## [1] 8
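The subsetting steps above can be tried on a small stand-in data frame. The hypothetical `toy_readers` below copies the first three rows and first five columns of the output shown earlier, so the row and column positions match beginningReaders. In this sketch the text columns are plain character vectors, whereas in beginningReaders they are factors, which is why the original output also lists levels.

```r
# Small stand-in built from the first rows printed earlier
toy_readers <- data.frame(
  Word       = c("avontuur", "avontuur", "avontuur"),
  Subject    = c("S28", "S40", "S37"),
  LogRT      = c(7.410347, 8.065894, 6.739337),
  Trial      = c(190, 298, 172),
  OrthLength = c(8, 8, 8)
)

# $ selects a whole column by name
print(toy_readers$Word)

# [row, column] selects a single cell
print(toy_readers[1, 2])   # row 1, column 2 (Subject): S28
print(toy_readers[2, 2])   # row 2, column 2 (Subject): S40
print(toy_readers[1, 5])   # row 1, column 5 (OrthLength): 8
```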
Base graphics are often constructed piecemeal, with each aspect of the plot handled separately through a particular function call. Usually you start with a plot function (such as plot, hist, or boxplot), then you use annotation functions (text, abline, points) to add to or modify your plot.
As you’ll see, most of the base plotting functions have many arguments, for example, setting the title, labels of axes, plot character, etc. Some of the parameters can be set when you call the function or they can be added later in a separate function call.
We see the dataset contains 13 columns of data.
## [1] 13
First we’ll do a simple histogram of the LogRT column to show the distribution of the log-transformed reaction time (in ms).
Use the R command hist with the argument beginningReaders$LogRT.
Now, we can change the x-axis title to “Log transformed reaction time”.
Next, we will do a density plot
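The histogram and density plot steps might look like the sketch below. To keep it runnable without the languageR package, it uses simulated values in a hypothetical vector `log_rt`; the seed, sample size, mean, and standard deviation are arbitrary assumptions. With the package loaded, `log_rt` can be replaced by beginningReaders$LogRT.

```r
# Simulated stand-in for the LogRT column (values are made up)
set.seed(42)
log_rt <- rnorm(200, mean = 7.3, sd = 0.5)

# Histogram with default labels
hist(log_rt)

# Same histogram with a custom x-axis title
hist(log_rt, xlab = "Log transformed reaction time")

# Density plot: estimate the density first, then plot the estimate
rt_density <- density(log_rt)
plot(rt_density, main = "Density of log reaction time",
     xlab = "Log transformed reaction time")
```

`density()` returns a smoothed estimate of the distribution, so the curve shows the same shape as the histogram without depending on the choice of bins.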
Next we’ll do a scatter plot of LogRT as a function of OrthLength, using the plot function.
Note that plot, unlike hist, does NOT add a title for you automatically, and the axis labels default to the column names.
Let’s call plot again to specify labels. (Use the up arrow to recover the previous command and save yourself some typing.) We’ll add more arguments to the call to specify labels for the two axes. Set xlab equal to “Orthographic length” and ylab equal to “Log transformed reaction time”.
plot(LogRT~OrthLength, beginningReaders, xlab="Orthographic length", ylab="Log transformed reaction time")