Measures of central tendency describe the most typical value in a distribution.
The mean is the sum of all the values (x) divided by the number of observations (N).
\(\frac{\sum_{i\in S} x_i}{|S|}\)
Mean of students’ scores (N = 3)
\(\displaystyle \frac{12+19+20}{3 }\)
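The same calculation can be sketched in Python. This is a minimal illustration that uses the three scores from the example; the variable names are assumptions for illustration only.

```python
from statistics import mean

# The three student scores from the example above (N = 3)
scores = [12, 19, 20]

# Sum of the values divided by the number of observations: (12 + 19 + 20) / 3
average = mean(scores)
print(average)  # 17
```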
The median is a measure of central tendency computed by sorting all values from least to greatest and then selecting the middle value.
Let’s arrange the scores in a ranked order.
The median is the value that has an equal number of observations above and below it.
If the number of observations is even, we take the mean of the two central values.
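A short Python sketch of both cases follows. It assumes the scores are the same three values used for the mean (12, 19, 20); the fourth score of 15 in the even case is hypothetical and added only to show the averaging step.

```python
from statistics import median

# Odd number of observations: sort, then take the middle value
scores = [20, 12, 19]
ranked = sorted(scores)            # [12, 19, 20]
odd_median = median(scores)
print(ranked, odd_median)          # [12, 19, 20] 19

# Even number of observations (hypothetical extra score of 15):
# the median is the mean of the two central values, 15 and 19
even_scores = [12, 15, 19, 20]
even_median = median(even_scores)
print(even_median)                 # 17.0
```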
Extreme values in our data set can have a significant influence on the mean. For instance, if one student in our dataset scored 20, this would pull the mean upward. The result could be misleading if none of the other students were anywhere near this score.
In skewed distributions the mean takes atypical values into account, which distorts it as a measure of central tendency.
The mode is the most frequent value in a data set. It is found by counting how many times each value appears and selecting the value with the highest frequency.
## gram_cat
## adj n v
## 2 2 3
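Counting values and picking the most frequent one can be sketched as follows. The individual labels in the list are hypothetical; they were chosen only so that the counts match the table above (adj: 2, n: 2, v: 3).

```python
from collections import Counter

# Hypothetical grammatical-category labels reproducing the counts in the table
gram_cat = ["adj", "n", "v", "v", "adj", "n", "v"]

# Count how many times each value appears
counts = Counter(gram_cat)

# The mode is the value with the highest frequency
mode_value, mode_count = counts.most_common(1)[0]
print(mode_value, mode_count)  # v 3
```

Because the mode only requires counting, it also works for nominal data such as these categories, where a mean or median would not be meaningful.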
Level of measurement (nominal, ordinal, interval, ratio)
The variance is the average of the squared differences from the mean. More simply, variance describes how far, on average, the data lie from the mean, expressed in squared units.
\(\frac{\sum_{i\in S} (x_i-\bar{x})^2}{|S|-1}\)
Finding deviation of each observation from the mean.
Squaring each deviation.
Summing the squares.
Dividing by the number of observations minus one.
The result is the variance (\(s^2\)).
## [1] 2.5
## [1] 250
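The steps can be written out one by one in Python. This is a sketch under an assumption: the notes do not show the data behind the two results, so the lists below are hypothetical values that happen to give variances of 2.5 and 250. The function name `sample_variance` is also illustrative.

```python
from statistics import variance


def sample_variance(values):
    """Sample variance: sum of squared deviations divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    deviations = [x - mean for x in values]   # deviation of each observation
    squared = [d ** 2 for d in deviations]    # square each deviation
    return sum(squared) / (n - 1)             # sum the squares, divide by n - 1


# Hypothetical data chosen to match the two results shown above
small = [1, 2, 3, 4, 5]
large = [10, 20, 30, 40, 50]

print(sample_variance(small))  # 2.5
print(sample_variance(large))  # 250.0

# The standard library function uses the same n - 1 denominator
print(variance(small))         # 2.5
```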
The standard deviation is the most widely used measure of variability for continuous variables. The values for variance and standard deviation are very closely related. The standard deviation can be calculated by taking the square root of the variance (\(\sqrt{s^2}\))
The square root of the variance is the standard deviation (\(s\)).
The standard deviation is very important when analyzing our data set. A small standard deviation indicates that the data points tend to be located near the mean value, while a large standard deviation indicates that the data points are spread further from the mean.
The standard deviation gives us an easier measure of dispersion because it is expressed in the same units as the values of the variable, whereas the variance is expressed in squared units.
## [1] 1.581139
## [1] 15.81139
## [1] 2.833314
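A sketch of the square-root relationship in Python. The first two lists are hypothetical data with variances 2.5 and 250. The third list is the set of in-class scores given in the frequency-distribution part of these notes; treating it as the source of the third result is an assumption, although the numbers agree.

```python
import math
from statistics import stdev, variance

# Hypothetical data with variances 2.5 and 250
small = [1, 2, 3, 4, 5]
large = [10, 20, 30, 40, 50]

# In-class scores from the frequency-distribution example (assumed source)
in_class = [11, 12, 13, 11, 13, 15, 18, 17, 16, 15, 11, 14,
            15, 17, 16, 15, 14, 18, 19, 19, 20, 11, 12]

# The standard deviation is the square root of the variance
sd_small = math.sqrt(variance(small))
sd_large = math.sqrt(variance(large))
sd_class = stdev(in_class)

print(round(sd_small, 6))  # 1.581139
print(round(sd_large, 5))  # 15.81139
print(round(sd_class, 6))
```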
We need to classify data before we can identify patterns.
## [1] 11 12 13 11 13 15 18 17 16 15 11 14 15 17 16 15 14 18 19 19 20 11 12
One way to examine raw data is by exploring its frequency distribution.
One way to visualize distributions is to create a table:
## in_class
## 11 12 13 14 15 16 17 18 19 20
## 4 2 2 2 4 2 2 2 2 1
##
## (0,4] (4,8] (8,12] (12,16] (16,20]
## 0 0 6 10 7
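Both tables can be reproduced with a few lines of Python. The data are the in-class scores listed above; the bin edges follow the right-closed intervals shown in the second table, and the variable names are illustrative.

```python
from collections import Counter

in_class = [11, 12, 13, 11, 13, 15, 18, 17, 16, 15, 11, 14,
            15, 17, 16, 15, 14, 18, 19, 19, 20, 11, 12]

# Frequency of each distinct score, sorted by score
freq = dict(sorted(Counter(in_class).items()))
print(freq)

# Grouped frequencies for right-closed intervals (low, high]
edges = [0, 4, 8, 12, 16, 20]
binned = {}
for low, high in zip(edges, edges[1:]):
    label = f"({low},{high}]"
    binned[label] = sum(low < x <= high for x in in_class)
print(binned)
# {'(0,4]': 0, '(4,8]': 0, '(8,12]': 6, '(12,16]': 10, '(16,20]': 7}
```

Note that each interval excludes its lower edge and includes its upper edge, so a score of 12 is counted in (8,12] and not in (12,16].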
Questions
What is the frequency for the interval [13-16]?
What is the frequency for the interval [5-8]?
Frequency distributions can be visualized using histograms. Histograms are used to summarize data on an interval scale. A histogram divides the values in a data set into classes or groups. A histogram normally has bars of equal width.
Horizontal axis (x-axis): scores
Vertical axis (y-axis): count of scores. The width of the bin corresponds to the interval.
Aside from a histogram, frequency distributions can also be visualized with density curves.
A density curve is a graph that shows probability. The area under the curve is equal to 100 percent. As we usually use decimals in probabilities you can also say that the area is equal to 1 (because 100% as a decimal is 1). It visualizes the distribution of the random variable, and the peaks show where values are concentrated.
Normal distribution: bell-shaped curve, values lie symmetrically around its highest point (mean) (more detailed definition of the normal distribution next week!)
Skewness measures the asymmetry of a data set, that is, how far its distribution departs from a symmetrical shape.
Positive skewness: high frequencies correspond to low values of the variable.
Negative skewness: high frequencies correspond to high values of the variable.
Kurtosis refers to the degree of peaking relative to the normal distribution.
Leptokurtic: A leptokurtic distribution is more peaked than the normal distribution.
Platykurtic: A platykurtic distribution has extremely dispersed points along the x-axis, resulting in a lower peak when compared to a normal distribution.
Why is the normal distribution important?
It is possible to predict what proportion of the population has values of a normally distributed variable in a given range.
A lot of variables in the population are normally distributed (height, IQ, weight, logarithm of wages).
A lot of statistical tests use normal distributions.
The normal curve is unimodal. There is only one maximal point.
Empirical rule.
About 68% (68.27%) of all cases fall within +/- one standard deviation from the mean.
About 95% (95.45%) of all cases fall within +/- two standard deviations from the mean,
while about 99.7% of all cases fall within +/- three standard deviations from the mean.
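The proportions in the empirical rule can be checked against the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the variable names are illustrative.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
std_normal = NormalDist(mu=0, sigma=1)

# Proportion of cases within k standard deviations of the mean
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, share in within.items():
    print(f"within +/- {k} SD: {share:.2%}")
```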
If the diameter of a basketball is normally distributed, with a mean of 9″, and a standard deviation of 0.5″, what is the probability that a randomly chosen basketball will have a diameter between 9.5″ and 10.5″?
Since the standard deviation is 0.5″ and the mean 9″, we are evaluating the probability that a randomly chosen ball will have a diameter between 1 and 3 standard deviations above the mean. The graphic below shows the portion of the normal distribution included between 1 and 3 SDs:
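One way to check this area numerically is to subtract two cumulative probabilities. This is a sketch using `statistics.NormalDist` with the mean and standard deviation from the example; the names `diameter` and `p_between` are illustrative.

```python
from statistics import NormalDist

# Basketball diameters: mean 9 inches, standard deviation 0.5 inches
diameter = NormalDist(mu=9, sigma=0.5)

# P(9.5 < X < 10.5) = P(X < 10.5) - P(X < 9.5),
# i.e. the area between 1 and 3 standard deviations above the mean
p_between = diameter.cdf(10.5) - diameter.cdf(9.5)
print(round(p_between, 4))  # about 0.1573
```

The empirical rule gives nearly the same answer: half of the difference between the share within three standard deviations and the share within one standard deviation, (99.7% - 68.3%) / 2, is about 15.7%.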
What is the probability that a randomly chosen basketball will have a diameter between 10″ and 10.5″?
What is the probability that a randomly chosen basketball will have a diameter between 7.5″ and 8″?
What is the probability that a randomly chosen basketball will have a diameter larger than 8.2″?
Attention: If we have a ball that is 8.2″ and want to know the percentage of basketballs that are larger than our ball, we need to know how many standard deviations 8.2″ is from the mean.
A z-score measures the distance between a data point and the mean using standard deviations. The z-score tells us how many standard deviations the value is above (to the right of) or below (to the left of) the mean. A z-score is measured in units of the standard deviation. Z-scores normalize the scores of the distribution by creating a distribution with mean 0 and standard deviation 1.
\(z = \frac{x - \mu}{ \sigma}\)
Values of x that are larger than the mean have positive z-scores
Values of x that are smaller than the mean have negative z-scores
If x equals the mean, x has a z-score of 0
We have a distribution of students’ scores with mean of 18.2 and standard deviation of 0.5.
You got 19 points and you want to know how much better you did than the rest of the students.
If the value is 19, the mean is 18.2 and the standard deviation is 0.5:
## [1] 1.6
The value 19 is 1.6 standard deviations above the mean. Now we need to find out what proportion of the distribution corresponds to 1.6 standard deviations above the mean.
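The z-score formula translates directly into a small function. This is a minimal sketch; the function name `z_score` is an assumption for illustration, and the result is rounded for display because of floating-point arithmetic.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma


# Score of 19 in a distribution with mean 18.2 and standard deviation 0.5
z = z_score(19, 18.2, 0.5)
print(round(z, 2))  # 1.6

# A value equal to the mean has a z-score of 0
print(z_score(18.2, 18.2, 0.5))  # 0.0
```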
Once we have the z-score we can calculate the percentage of values that will fall above the score of 19.
Once we have standardized the values in our distribution, we can look them up in the z-score table. A z-table, also called the standard normal table, is a mathematical table that gives the percentage of values below (to the left of) a z-score in a standard normal distribution.
After calculating the standardized score, we need to look up the area (the same as the probability) using the z-score table. First, we find the row for the first two digits of the z-score on the left side of the z-table. In this case it is 1.6. Then, we find the column for the remaining digit (the second decimal place) along the top of the table, which is 0 in our example. The corresponding area is 0.9452, which translates into about 94.5%.
The score of 19 was better than 94.5% of the students. This is the same as saying that the z-score cuts off an upper tail of 5.5%, so 5.5% of the students did better than the student who got 19 points.
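The table lookup corresponds to evaluating the cumulative distribution function of the standard normal distribution. A sketch using `statistics.NormalDist` from the Python standard library, with illustrative variable names:

```python
from statistics import NormalDist

# Area to the left of z = 1.6 under the standard normal curve
below = NormalDist().cdf(1.6)

# Area to the right (the upper tail) is the complement
above = 1 - below

print(round(below, 4))  # 0.9452
print(round(above, 4))  # 0.0548
```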
How many values lie below 18?
## [1] -0.4
If we go to the z-score table, the area to the left of z = -0.4 is 0.3446, so about 34.46% of values lie below 18.
What score is such that only 1% of the students would fall below it? What z-value cuts off a tail containing 0.01 of the data?
A z-score of -2.33 cuts off a lower tail containing about 0.99% of the data.
\(-2.33 = \frac{x - 18.2}{0.5}\)
## [1] 17.035
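Solving the z-score formula for x gives \(x = \mu + z\sigma\). The sketch below applies it twice: once with the rounded table value of -2.33, and once with the z-value returned by `NormalDist().inv_cdf`, which is the inverse of the table lookup. Variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 18.2, 0.5

# Using the rounded z-value from the table
x_table = mu + (-2.33) * sigma
print(round(x_table, 3))   # 17.035

# Using the z-value that cuts off exactly 1% in the lower tail
z_cut = NormalDist().inv_cdf(0.01)
x_exact = mu + z_cut * sigma
print(round(z_cut, 3))     # about -2.326
print(round(x_exact, 3))   # about 17.037
```

The two answers differ slightly because the table value -2.33 is rounded to two decimal places.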
Draw a histogram: is the data normally distributed?
Are the median and the mean close together?
A fixed proportion of the data will fall between particular values of the variable.
Thus,
## [1] 69
## [1] 96