Learning objectives

  • Measures of tendency and variation
  • Frequency, histograms, density plots
  • Properties of the normal distribution
  • Z-scores

Measures of tendency and variation

Central tendency

Measures of central tendency describe the most typical value in a distribution.

The mean

The mean is the sum of all the values (x) divided by the number of observations (N).

\(\frac{\sum_{i\in S} x_i}{|S|}\)

Mean of students’ scores (N = 3)

\(\displaystyle \frac{12+19+20}{3 }\)
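As a quick check, the same calculation can be done in R. This is a minimal sketch; the vector name `scores` is chosen here only for illustration.

```r
# three students' scores from the example above
scores = c(12, 19, 20)

# sum of the values divided by the number of observations
manual_mean = sum(scores) / length(scores)
manual_mean
## [1] 17

# the built-in function gives the same result
mean(scores)
## [1] 17
```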

The median

The median is a measure of central tendency computed by sorting all values from least to greatest and then selecting the middle value.

Let’s arrange the scores in a ranked order.

  • 12,13,13,14,15,15,16,16,17

The median is the value that has an equal number of observations above and below it. With nine scores, it is the fifth value in the ranked list, 15.

  • 12,13,13,14,15,15,16,16,17

If the number of observations is even, we take the mean of the two central values.

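# eight scores, so there is no single middle value
score = c(12,13,13,14,15,16,16,17)

# the two central values are 14 and 15; the name avoids masking median()
median_score = mean(c(14,15))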
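R's built-in `median()` handles both the odd and the even case, so the central values do not need to be picked out by hand. A small sketch, with vector names chosen for illustration:

```r
# nine scores: the middle (fifth) value is the median
odd_scores = c(12, 13, 13, 14, 15, 15, 16, 16, 17)
median(odd_scores)
## [1] 15

# eight scores: median() averages the two central values, 14 and 15
even_scores = c(12, 13, 13, 14, 15, 16, 16, 17)
median(even_scores)
## [1] 14.5
```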

Extreme values in our data set can have a strong influence on the mean. For instance, if one student in our dataset got a 20, this would pull the mean upwards. This could be misleading if none of the other students were anywhere near this score.

In skewed distributions, the mean is pulled toward atypical values, which distorts it as a measure of central tendency. The median is much less affected by such values.
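The effect of a single high score can be seen by adding a 20 to the nine ranked scores used for the median. This is only an illustrative sketch; the variable names are not from the original materials.

```r
scores = c(12, 13, 13, 14, 15, 15, 16, 16, 17)
with_high = c(scores, 20)   # one additional student who scored 20

# the mean moves upwards ...
mean(scores)
## [1] 14.55556
mean(with_high)
## [1] 15.1

# ... while the median stays where it was
median(scores)
## [1] 15
median(with_high)
## [1] 15
```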

The mode

The mode is the most frequent value in a data set. It is found by counting the number of times each value appears and selecting the value with the highest frequency.

gram_cat = c('adj','v','v','n','n','adj','v')

table(gram_cat)
## gram_cat
## adj   n   v 
##   2   2   3
#v is the mode
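Base R has no built-in function for the statistical mode (`mode()` returns the storage type of an object), so the most frequent value has to be read from the table. One possible sketch, which returns every category that reaches the highest count in case of a tie:

```r
gram_cat = c('adj','v','v','n','n','adj','v')
counts = table(gram_cat)

# keep the name(s) of the categories whose count equals the maximum
modes = names(counts)[counts == max(counts)]
modes
## [1] "v"
```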

Which measure of central tendency do we use?

  • It depends on the level of measurement (nominal, ordinal, interval, ratio)

    • Mean: numeric/continuous variables that are not highly skewed (e.g., syllable duration)
    • Median: ordinal (ranked) variables and skewed numeric variables (e.g., proficiency levels: low, low-medium, medium, medium-high, high)
    • Mode: categorical nominal variables (e.g., grammatical category)

Measures of variability or dispersion

Variance

The variance is the average of the squared differences from the mean. More simply, the variance summarizes how far the data typically lie from the mean.

\(\frac{\sum_{i\in S} (x_i-\bar{x})^2}{|S|-1}\)

  1. Finding deviation of each observation from the mean.

  2. Squaring each deviation.

  3. Summing the squares.

  4. Dividing by the number of observations minus one.

  5. The result is the variance (\(s^2\))

var(c(8, 9, 10, 11, 12))
## [1] 2.5
var( c(-10, 0, 10 , 20, 30))
## [1] 250
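The steps listed above can be followed by hand and compared with `var()`. This is a minimal sketch using the first vector from the example; the intermediate variable names are chosen for illustration.

```r
x = c(8, 9, 10, 11, 12)

deviations = x - mean(x)                    # step 1: deviation from the mean
squared = deviations^2                      # step 2: square each deviation
sum_of_squares = sum(squared)               # step 3: sum the squares
variance = sum_of_squares / (length(x) - 1) # step 4: divide by N - 1
variance
## [1] 2.5
```

The result matches `var(c(8, 9, 10, 11, 12))`, because `var()` also divides by N - 1.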

The standard deviation

The standard deviation is the most widely used measure of variability for continuous variables. The values for variance and standard deviation are very closely related. The standard deviation can be calculated by taking the square root of the variance (\(\sqrt{s^2}\)).

The square root of the variance is the standard deviation (\(s\)).

The standard deviation is very important when analyzing our data set. A small standard deviation indicates that the data points tend to be located near the mean value, while a large standard deviation indicates that the data points are spread further from the mean.

The standard deviation gives us an easier measure of dispersion because it is expressed in the same units as the values of the variable, whereas the units of the variance are squared.

sd(c(8, 9, 10, 11, 12))
## [1] 1.581139
sd( c(-10, 0, 10 , 20, 30))
## [1] 15.81139
# in_class is the vector of class scores printed in the next section
sd(in_class)
## [1] 2.833314
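The relationship between the two measures can be checked directly: taking the square root of `var()` gives the same number as `sd()`. A short sketch with one of the vectors used above:

```r
x = c(-10, 0, 10, 20, 30)

# square root of the variance
sd_from_var = sqrt(var(x))
sd_from_var
## [1] 15.81139

# identical (up to floating-point rounding) to the built-in sd()
all.equal(sd_from_var, sd(x))
## [1] TRUE
```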

Classification of raw data and distributions

We need to classify data before we can identify patterns.

##  [1] 11 12 13 11 13 15 18 17 16 15 11 14 15 17 16 15 14 18 19 19 20 11 12
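The code that created these scores is not shown. To reproduce the examples that use `in_class`, one option is to type the printed values in as a vector. This assumes that the printed values are the full contents of `in_class`, which is consistent with the standard deviation and the frequency table reported for it.

```r
# the 23 class scores printed above, entered by hand
in_class = c(11, 12, 13, 11, 13, 15, 18, 17, 16, 15, 11, 14,
             15, 17, 16, 15, 14, 18, 19, 19, 20, 11, 12)

length(in_class)
## [1] 23
```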

One way to examine raw data is by exploring its frequency distribution.

Frequency

  • Frequency distribution: The frequency of a value refers to the number of times that value occurs in the sample. The distribution of a sample is the pattern of frequencies, meaning the set of all possible values and the frequencies associated with these values.

One way to visualize distributions is to create a table:

  • How many students gained 1, 2, 3, … 19, 20 marks?
table(in_class)
## in_class
## 11 12 13 14 15 16 17 18 19 20 
##  4  2  2  2  4  2  2  2  2  1
  • Class intervals: If a variable takes a large number of values, then it is easier to present and handle the data by grouping the values into class intervals. The frequency of a class interval is the number of observations that occur in a particular predefined interval.
## 
##   (0,4]   (4,8]  (8,12] (12,16] (16,20] 
##       0       0       6      10       7
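The code behind this table is not shown. One way to obtain the same counts is `cut()`, which assigns each score to an interval. The break points below (every four marks from 0 to 20) are an assumption read off the interval labels, and the scores are re-entered by hand so that the block runs on its own.

```r
in_class = c(11, 12, 13, 11, 13, 15, 18, 17, 16, 15, 11, 14,
             15, 17, 16, 15, 14, 18, 19, 19, 20, 11, 12)

# intervals are open on the left and closed on the right, e.g. (8,12]
interval_counts = table(cut(in_class, breaks = seq(0, 20, by = 4)))
interval_counts
```

Because the intervals are closed on the right, a score of 12 is counted in (8,12] and a score of 16 in (12,16].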

Questions

  1. What is the frequency for the interval [13-16]?

  2. What is the frequency for the interval [5-8]?

Histograms

Frequency distributions can be visualized using histograms. Histograms are used to summarize data on an interval scale. A histogram divides the values in a data set into classes or groups. A histogram normally has bars of equal width.

  • Horizontal axis (x-axis): scores

  • Vertical axis (y-axis): count of scores. The width of the bin corresponds to the interval.

hist(in_class)

Aside from a histogram, frequency distributions can also be visualized with density curves.

A density curve is a smoothed graph of a distribution in which area represents probability. The total area under the curve is equal to 100 percent, or 1 when probabilities are written as decimals. It visualizes the distribution of the random variable, and the peaks show where values are concentrated.
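A density curve can be drawn in base R with `density()`. The sketch below overlays it on a histogram of the class scores (re-entered by hand so the block is self-contained) and roughly checks that the area under the estimated curve is close to 1. The default bandwidth is used, which is a choice made here and not something specified in these notes.

```r
in_class = c(11, 12, 13, 11, 13, 15, 18, 17, 16, 15, 11, 14,
             15, 17, 16, 15, 14, 18, 19, 19, 20, 11, 12)

# kernel density estimate with the default settings
d = density(in_class)

# freq = FALSE puts density (not counts) on the y-axis,
# so the histogram and the curve share the same scale
hist(in_class, freq = FALSE)
lines(d)

# rough numerical check: bar width times summed heights approximates the area
area = sum(d$y) * diff(d$x[1:2])
area   # close to 1
```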

Types of distribution

Normal distribution: bell-shaped curve, values lie symmetrically around its highest point (the mean) (more detailed definition of the normal distribution next week!)

Skewness

Skewness measures the asymmetry of a distribution, that is, how far it departs from a symmetrical shape.

Positive skewness: high frequencies correspond to low values of the variable, and the long tail points to the right.

Negative skewness: high frequencies correspond to high values of the variable, and the long tail points to the left.

Kurtosis

Kurtosis refers to the degree of peaking relative to the normal distribution.

Leptokurtic: A leptokurtic distribution is more peaked than the normal distribution.

Platykurtic: A platykurtic distribution has points that are widely dispersed along the x-axis, resulting in a lower peak when compared to a normal distribution.
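Skewness and kurtosis can also be expressed as numbers. Base R has no functions for them, so the sketch below defines two helpers using one common moment-based definition (divisor N throughout). The function names and the small example vectors are hypothetical, and add-on packages may apply slightly different corrections.

```r
# third central moment divided by the cubed standard deviation (divisor N)
skewness = function(x) {
  m = mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# fourth standardized moment minus 3, so that a normal distribution scores 0
excess_kurtosis = function(x) {
  m = mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2 - 3
}

right_tail = c(1, 1, 2, 2, 3, 9)   # most values low, long tail to the right
left_tail  = c(1, 7, 8, 8, 9, 9)   # most values high, long tail to the left

skewness(right_tail)   # positive
skewness(left_tail)    # negative
excess_kurtosis(c(1, 2, 3, 4, 5))   # negative: flatter than a normal curve
```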

The normal distribution

Why is the normal distribution important?

  • It is possible to predict what proportion of the population has values of a normally distributed variable in a given range.

  • A lot of variables in the population are approximately normally distributed (height, IQ, weight, logarithm of wages).

  • A lot of statistical tests use normal distributions.

Properties of a normal curve

  1. All normal curves are symmetric about the mean.

    • A normal distribution has a perfectly symmetrical shape: the curve can be divided in the middle to produce two equal halves, each the mirror image of the other.

  2. The mean, median and mode are all equal.

  3. The normal curve is unimodal. There is only one maximal point.

  4. Empirical rule.

    • In normally distributed data, we know the proportion of cases lying under the curve within a specific number of standard deviations of the mean.
    • About 68.27% of all cases fall within +/- one standard deviation from the mean.
    • About 95.45% of all cases fall within +/- two standard deviations from the mean.
    • About 99.73% of all cases fall within +/- three standard deviations from the mean.
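The proportions in the empirical rule can be computed with `pnorm()`, which returns the area under the standard normal curve to the left of a value. Subtracting the area below the lower limit from the area below the upper limit gives the proportion in between. The variable names are chosen for illustration.

```r
# proportion of cases within one, two and three standard deviations of the mean
within_1 = pnorm(1) - pnorm(-1)
within_2 = pnorm(2) - pnorm(-2)
within_3 = pnorm(3) - pnorm(-3)

within_1
## [1] 0.6826895
within_2
## [1] 0.9544997
within_3
## [1] 0.9973002
```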

Example

If the diameter of a basketball is normally distributed, with a mean of 9″, and a standard deviation of 0.5″, what is the probability that a randomly chosen basketball will have a diameter between 9.5″ and 10.5″?

Since the standard deviation is 0.5″ and the mean 9″, we are evaluating the probability that a randomly chosen ball will have a diameter between 1 and 3 standard deviations above the mean. The graphic below shows the portion of the normal distribution included between 1 and 3 SDs:

From granite.pressbooks.uk
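The worked example can also be computed in R. `pnorm()` accepts the mean and standard deviation directly, so the diameters do not need to be converted to standard deviations first. This is a sketch of the example above only, not of the comprehension questions.

```r
# area below 10.5" minus area below 9.5", for mean 9" and sd 0.5"
p_between = pnorm(10.5, mean = 9, sd = 0.5) - pnorm(9.5, mean = 9, sd = 0.5)
p_between   # about 0.157, i.e. roughly 15.7% of basketballs
```

The same value follows from the empirical rule: (99.73% - 68.27%) / 2 is about 15.7%.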

Comprehension questions:

  • What is the probability that a randomly chosen basketball will have a diameter between 10″ and 10.5″?

  • What is the probability that a randomly chosen basketball will have a diameter between 7.5″ and 8″?

  • What is the probability that a randomly chosen basketball will have a diameter larger than 8.2″?

Attention: If we have a ball that is 8.2″ and want to know the percentage of basketballs that are larger than our ball, we need to know how many standard deviations 8.2″ is from the mean.

Z-scores

A z-score measures the distance between a data point and the mean in units of the standard deviation. It tells us how many standard deviations the value is above (to the right of) or below (to the left of) the mean. Converting every score to a z-score standardizes the distribution, giving it a mean of 0 and a standard deviation of 1.

\(z = \frac{x - \mu}{ \sigma}\)

  • Values of x that are larger than the mean have positive z-scores

  • Values of x that are smaller than the mean have negative z-scores

  • If x equals the mean, x has a z-score of 0
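The formula can be applied to a whole vector at once. The sketch below uses the nine ranked scores from the median example and, as an assumption, uses the sample mean and sample standard deviation in place of \(\mu\) and \(\sigma\).

```r
scores = c(12, 13, 13, 14, 15, 15, 16, 16, 17)

# subtract the mean and divide by the standard deviation
z = (scores - mean(scores)) / sd(scores)

# scores above the mean get positive z-scores, scores below get negative ones
sign(z)

# the standardized scores have mean 0 (up to rounding) and standard deviation 1
mean(z)
sd(z)
```

The built-in `scale(scores)` returns the same values as a one-column matrix.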

Example

We have a distribution of students’ scores with mean of 18.2 and standard deviation of 0.5.

You got 19 points and you want to know how much better you did than the rest of the students.

If the value is 19, the mean is 18.2 and the standard deviation is 0.5:

z = (19-18.2)/0.5

z
## [1] 1.6

The value 19 is 1.6 standard deviations above the mean. Now we need to find out what proportion of the distribution corresponds to 1.6 standard deviations above the mean.

Once we have the z-score we can calculate the percentage of values that will fall above the score of 19.

Z-score table

Once we have standardized the values in our distribution, we can compare them to the z-score table. A z-table, also called the standard normal table, is a mathematical table that gives the proportion of values below (to the left of) a z-score in a standard normal distribution.

After calculating the standardized score, we look up the area (which is the same as the probability) in the z-score table. First, we find the first two digits of the z-score (here 1.6) in the left-hand column. Then we find the remaining digit (the second decimal place, which is 0 in our example) along the top row. The corresponding area is 0.9452, which translates into 94.5%.

The score of 19 was better than 94.5% of the students. This is the same as saying that the z-score cuts off an upper tail of 5.5%: 5.5% of the students did better than the student who got 19 points.
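The table lookup can be replaced by `pnorm()`, which returns the area to the left of a z-score in the standard normal distribution. A minimal sketch for the score of 19, with variable names chosen for illustration:

```r
z = (19 - 18.2) / 0.5

# proportion of students below the score of 19
p_below = pnorm(z)
p_below   # about 0.9452

# proportion of students above it (the upper tail)
p_above = 1 - p_below
p_above   # about 0.055
```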

How many values lie below 18?

z = (18-18.2)/0.5

z
## [1] -0.4

If we go to the z-score table, we find that 34.46% of the values lie below 18.

What score is such that only 1% of the students would fall below it? What z-value cuts off a tail containing 0.01 of the data?

A z-score of -2.33 cuts off a lower tail containing 0.99% (about 1%) of the data.

\(-2.33 = \frac{x - 18.2}{0.5}\)

x =  (-2.33 * 0.5)+18.2
x
## [1] 17.035
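Both lookups can be checked without the table. `pnorm()` goes from a score to the proportion below it, and `qnorm()` goes the other way, from a proportion to the score that cuts it off. This sketch assumes the same mean (18.2) and standard deviation (0.5) as the example.

```r
# proportion of scores below 18
p_below_18 = pnorm(18, mean = 18.2, sd = 0.5)
p_below_18   # about 0.3446

# z-score that cuts off the lowest 1% of the standard normal distribution
z_cut = qnorm(0.01)
z_cut        # about -2.326, which the table rounds to -2.33

# the corresponding score in the students' distribution
x_cut = qnorm(0.01, mean = 18.2, sd = 0.5)
x_cut        # about 17.04
```

The small difference from 17.035 comes from rounding the z-score to two decimals in the table.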

Testing for normality

  • Draw a histogram: is the data normally distributed?

  • Are the median and the mean close together?

  • A fixed proportion of the data will fall between particular values of the variable.

    • About 68% of scores will lie in the range mean +/- 1 sd
    • About 95% of scores will lie in the range mean +/- 2 sd

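Thus, for scores with a mean of 18.2 and a standard deviation of 1:

  • About 68% of scores will lie in the range 17.2 - 19.2
  • About 95% of scores will lie in the range 16.2 - 20.2
# x holds the observed scores; count how many fall inside each range
length(x[(x> 17.2) & ( x<19.2)])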
## [1] 69
length(x[(x> 16.2) & ( x<20.2)])
## [1] 96
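The vector `x` used in this check is not shown in these notes. To try the same idea, the sketch below simulates a hypothetical sample of 100 scores with `rnorm()` (the sample size, seed, mean of 18.2 and standard deviation of 1 are all assumptions) and computes the proportion of values within one and two standard deviations of the sample mean.

```r
set.seed(1)                            # arbitrary seed, only for reproducibility
x = rnorm(100, mean = 18.2, sd = 1)    # hypothetical, simulated scores

m = mean(x)
s = sd(x)

# mean() of a logical vector gives the proportion of TRUE values
within_1sd = mean(x > m - s & x < m + s)
within_2sd = mean(x > m - 2 * s & x < m + 2 * s)

within_1sd   # expected to be near 0.68 for normal data
within_2sd   # expected to be near 0.95 for normal data
```

With real data, proportions far from these values suggest that the variable is not normally distributed.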