Chapter 8 Statistics
8.3 Mean, Median, and Mode
Learning Objectives
By the end of this section, you will be able to:
- Calculate the mode of a dataset
- Calculate the median of a dataset
- Calculate the mean of a dataset
- Contrast measures of central tendency to identify the most representative average
- Solve application problems involving mean, median, and mode
What exactly do we mean when we describe something as “average”? Is the height of an average person the height that more people share than any other? What if we line up every person in the world in order from shortest to tallest and find the person right in the middle: Is that person’s height the average? Or maybe it’s something more complicated.
Imagine a game where you and a friend are trying to guess the typical person’s height. Once the guesses are made, you bring in every person and measure their height. You and your friend each figure out how far off your guess was from the actual value, then square that number. The result is the number of points you earn for that person. After you check every height and award points accordingly, the person with the lower score wins (because a lower score means that person’s guess was, overall, closer to the actual values). Could we define the average height to be the number that you should guess to give you the smallest possible score?
Each of these three methods of determining the “average” is commonly used. They are all methods of measuring centrality (or central tendency). Centrality is just a word that describes the middle of a set of data. All give potentially different results, and all are useful for different reasons. In this section, we’ll explore each of these methods of finding the “average.”
The Mode
In our discussion of average heights, the first possible definition we offered was the height that more people share than any other. This is the mode, or the value that appears most often. If there are two modes, the data are bimodal.
Let’s look at some examples.
Example 1
In a previous example, we looked at a stem-and-leaf plot of the sale prices (in dollars) of a particular collectible trading card:
| 0 | 5 8 9 |
| 1 | 0 0 0 3 4 4 5 5 5 5 6 9 9 |
| 2 | 0 0 0 0 5 5 9 9 |
| 3 | 0 0 0 5 5 |
| 4 | 0 0 5 |
| 5 | |
| 6 | 0 |
What is the mode price?
The mode is the price that appears most often. Both 15 and 20 appear 4 times, more than any other values. So they are the modes (and we can conclude that this set of data is bimodal).
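For readers who want to check this count with software, here is a minimal Python sketch. It assumes the stem is the tens digit and the leaf is the ones digit, and the variable names (`stem_and_leaf`, `prices`, `modes`) are illustrative choices rather than part of the example.

```python
from statistics import multimode

# Sale prices (in dollars) read off the stem-and-leaf plot.
# Assumption: stem = tens digit, leaf = ones digit.
stem_and_leaf = {
    0: [5, 8, 9],
    1: [0, 0, 0, 3, 4, 4, 5, 5, 5, 5, 6, 9, 9],
    2: [0, 0, 0, 0, 5, 5, 9, 9],
    3: [0, 0, 0, 5, 5],
    4: [0, 0, 5],
    5: [],
    6: [0],
}

# Rebuild each price as 10 * stem + leaf.
prices = [10 * stem + leaf
          for stem, leaves in stem_and_leaf.items()
          for leaf in leaves]

# multimode returns every value tied for the highest count,
# in the order the values first appear in the data.
modes = multimode(prices)
print(modes)  # [15, 20]
```

Because `multimode` returns a list, a bimodal dataset like this one shows both modes instead of silently reporting only one.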
Exercise 1
In Example 6 of Section 8.2, we constructed a stem-and-leaf plot for the number of times in one minute that different crickets chirped:
| 8 | 2 2 4 5 9 9 9 |
| 9 | 1 1 2 3 7 9 |
| 10 | 0 1 2 3 3 4 4 4 5 6 6 7 9 9 |
| 11 | 5 6 |
| 12 | 0 |
What is the mode of the number of chirps in a minute?
Solution
There are two modes: 89 and 104, each of which appears 3 times.
When we have a complete list of the data or a stem-and-leaf plot, it’s pretty straightforward to find the mode; we just need to find the number that appears most often. If we’re given a frequency distribution instead, the technique is different (but just as straightforward): we’re looking for the number with the highest frequency.
Example 2
In a previous example, we created a frequency distribution of the number of siblings of conflict resolution class attendees.
| Number of Siblings | Frequency |
|---|---|
| 0 | 5 |
| 1 | 13 |
| 2 | 6 |
| 3 | 3 |
| 4 | 2 |
| 5 | 1 |
What is the mode of the number of siblings?
The mode is the value that appears the most often, which means it has the greatest frequency. Thirteen of the respondents have 1 sibling, more than any other number. So the mode is 1.
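The same lookup can be sketched in Python by storing the frequency distribution as a dictionary and picking out the value (or values) with the greatest frequency. The names `sibling_counts` and `highest_frequency` are illustrative.

```python
# Frequency distribution from the table: number of siblings -> frequency.
sibling_counts = {0: 5, 1: 13, 2: 6, 3: 3, 4: 2, 5: 1}

# The mode is the data value with the greatest frequency.
highest_frequency = max(sibling_counts.values())

# Collect every value that reaches that frequency, in case of a tie.
modes = [value for value, freq in sibling_counts.items()
         if freq == highest_frequency]

print(highest_frequency)  # 13
print(modes)              # [1]
```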
Exercise 2
In a previous exercise, you found a frequency distribution for the number of people who shared a residence with people in a sample.
| Number of People in the Residence | Frequency |
|---|---|
| 1 | 12 |
| 2 | 13 |
| 3 | 8 |
| 4 | 6 |
| 5 | 1 |
What’s the mode of the number of people in these residences?
Solution
2
What happens if there is no number in the data that appears more than once? In that case, by our definition, every data value is a mode. But according to some other definitions, the data would have no mode. In practice, though, it doesn’t really matter; if no data value appears more than once, then the mode is not helpful at all as a measure of centrality.
The Median
Let’s revisit our example of trying to identify the height of the “average” person. If we lined everyone up in order by height and found the person right in the middle, that person’s height is called the median: a value such that no more than half of the data values are less than it and no more than half are greater than it.
Let’s look at a really simple example. Consider the following list of numbers: 11, 12, 13, 13, 14. Is the first number on the list, 11, the median? There are no values less than 11 (that’s 0%), and there are four values greater than 11 (that’s 80%). Since more than 50% of the data are greater than 11, the definition is violated; it’s not the median. Here’s a chart with the rest of the data, with bolded numbers to show where the definition is violated:
| Data Value | Number of Values Below | Percentage of Values Below | Number of Values Above | Percentage of Values Above |
|---|---|---|---|---|
| 11 | 0 | 0% | 4 | **80%** |
| 12 | 1 | 20% | 3 | **60%** |
| 13 | 2 | 40% | 1 | 20% |
| 14 | 4 | **80%** | 0 | 0% |
Only 13 has no violations, so it’s the median according to the definition. In practice, we find the median just as we described in the average height example: by lining up all the data values in order from smallest to largest and picking the value in the middle. For our easy example (with data values 11, 12, 13, 13, 14), that first 13 is right in the middle; there are two values to the left and two values to the right. If there’s not one value right in the middle, we pick the two closest to the middle, then choose the number exactly halfway between them. For example, let’s say we have the data 41, 44, 46, 53. Since there is an even number of data values in our list, we can’t pick the one right in the middle. The two closest to the middle are 44 and 46, so we’ll choose the number halfway between those to be the median: 45. As this example shows, the median (unlike the mode) doesn’t have to be a number in our original set of data.
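The definition check carried out in the table can be sketched as a short Python function. The function name `satisfies_median_definition` is a hypothetical helper introduced only for illustration.

```python
def satisfies_median_definition(candidate, data):
    """Return True if no more than half of the values lie below the
    candidate and no more than half lie above it."""
    n = len(data)
    below = sum(1 for x in data if x < candidate)
    above = sum(1 for x in data if x > candidate)
    return below <= n / 2 and above <= n / 2

data = [11, 12, 13, 13, 14]

# Test each distinct data value against the definition.
passing = [v for v in sorted(set(data))
           if satisfies_median_definition(v, data)]
print(passing)  # [13]

# The halfway value 45 also satisfies the definition for 41, 44, 46, 53.
print(satisfies_median_definition(45, [41, 44, 46, 53]))  # True
```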
In the examples we’ve looked at so far, it’s been pretty easy to identify which number is right in the middle. If we had a very large dataset, though, it might be harder. Fortunately, we have some formulas to help us with that.
FORMULA
Suppose we have a set of data with [latex]n[/latex] values, ordered from smallest to largest. If [latex]n[/latex] is odd, then the median is the data value at position [latex]\frac{n+1}{2}[/latex]. If [latex]n[/latex] is even, then we find the values at positions [latex]\frac{n}{2}[/latex] and [latex]\frac{n}{2}+1[/latex]. If those values are named [latex]a[/latex] and [latex]b[/latex], then the median is defined to be [latex]\frac{a+b}{2}[/latex].
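One way to turn these position formulas into code is sketched below in Python. The function name `median_by_position` is illustrative. The only subtlety is that the formulas count positions starting from 1, while Python lists are indexed starting from 0.

```python
def median_by_position(values):
    """Median using the position formulas: (n + 1) / 2 for odd n,
    and the average of positions n / 2 and n / 2 + 1 for even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty dataset is undefined")
    if n % 2 == 1:
        # Position (n + 1) / 2 counts from 1; subtract 1 for a 0-based index.
        return ordered[(n + 1) // 2 - 1]
    a = ordered[n // 2 - 1]  # value at position n / 2
    b = ordered[n // 2]      # value at position n / 2 + 1
    return (a + b) / 2

print(median_by_position([11, 12, 13, 13, 14]))  # 13
print(median_by_position([41, 44, 46, 53]))      # 45.0
```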
Let’s put those formulas to work in an example.
Example 3
In a previous example, we looked at a stem-and-leaf plot that contained 33 sale prices (in dollars) of a particular collectible trading card:
| 0 | 5 8 9 |
| 1 | 0 0 0 3 4 4 5 5 5 5 6 9 9 |
| 2 | 0 0 0 0 5 5 9 9 |
| 3 | 0 0 0 5 5 |
| 4 | 0 0 5 |
| 5 | |
| 6 | 0 |
What is the median price?
Step 1: Since 33 is odd, the median is the data value at position [latex]\frac{n+1}{2}[/latex], where [latex]n[/latex] is the number of values in the dataset. There are 33 total values, so our formula becomes [latex]\frac{33+1}{2}=17[/latex]. That means we want to look for the 17th number in the dataset.
Step 2: We’ll want to count from the lowest value to the 17th number. We can use our stem-and-leaf plot to do this.
[latex]\begin{array} {|c|c c c c c c c c c c c c c c|} \hline & \textbf{1} & \textbf{2} & \textbf{3}\\ \text{0} & \text{5} & \text{8} & \text{9} \\ \hline & \textbf{4} & \textbf{5} & \textbf{6} & \textbf{7} & \textbf{8} & \textbf{9} & \textbf{10} & \textbf{11} & \textbf{12} & \textbf{13} & \textbf{14} & \textbf{15} & \textbf{16} \\ \text{1} & \text{0} & \text{0} & \text{0} & \text{3} & \text{4} & \text{4} & \text{5} & \text{5} & \text{5} & \text{5} & \text{6} & \text{9} & \text{9} \\ \hline & \textbf{17} \\ \text{2} & \text{0} & \text{0} & \text{0} & \text{0} & \text{5} & \text{5} & \text{9} & \text{9} \\ \hline \text{3} & \text{0} & \text{0} & \text{0} & \text{5} & \text{5} \\ \hline \text{4} & \text{0} & \text{0} & \text{5} \\ \hline \text{5} \\ \hline \text{6} & \text{0} \\ \hline \end{array}[/latex]
The 17th number is 20, so the median is 20.
You can work this problem with the Desmos statistics calculator.
Exercise 3
Consider the data given in this stem-and-leaf plot (there are 17 data values):
| 12 | 1 2 2 5 |
| 13 | 0 3 4 4 6 8 |
| 14 | 2 5 9 9 |
| 15 | 0 3 |
| 16 | |
| 17 | 0 |
What is the median of this data?
Solution
136
Now let’s tackle an example with an even number of values.
Example 4
We previously looked at the number of times different crickets (of differing species, genders, etc.) chirped in a one-minute span. Those data are again provided below:
| 89 | 97 | 82 | 102 | 84 | 99 |
| 115 | 105 | 89 | 109 | 107 | 89 |
| 101 | 109 | 116 | 103 | 100 | 91 |
| 93 | 103 | 120 | 91 | 85 | 104 |
| 104 | 82 | 106 | 92 | 104 | 106 |
Find the median.
Step 1: In order to find the median, we first need to sort the data so that they’re in order, smallest to largest:
| 82 | 82 | 84 | 85 | 89 | 89 |
| 89 | 91 | 91 | 92 | 93 | 97 |
| 99 | 100 | 101 | 102 | 103 | 103 |
| 104 | 104 | 104 | 105 | 106 | 106 |
| 107 | 109 | 109 | 115 | 116 | 120 |
Step 2: Next, we figure out how many data values we have. Counting them up, we see there are 30, which is even.
Step 3: Since we have an even number of data values, we need to find the values in positions [latex]\frac{30}{2}=15[/latex] and [latex]\frac{30}{2}+1=16[/latex]. These are 101 and 102.
Step 4: We use the formula to compute the median: [latex]\frac{101+102}{2}=101.5[/latex].
You can work this problem with the Desmos statistics calculator.
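As an alternative check, the four steps above can be sketched in Python. The list `chirps` is the unsorted data from the example; the other variable names are illustrative. The last line compares the hand computation with the `median` function from Python's standard `statistics` module.

```python
from statistics import median

# Chirps per minute, in the original (unsorted) order.
chirps = [
    89, 97, 82, 102, 84, 99,
    115, 105, 89, 109, 107, 89,
    101, 109, 116, 103, 100, 91,
    93, 103, 120, 91, 85, 104,
    104, 82, 106, 92, 104, 106,
]

ordered = sorted(chirps)   # Step 1: sort smallest to largest
n = len(ordered)           # Step 2: count the values (30, which is even)

a = ordered[n // 2 - 1]    # Step 3: position 15 (index 14)
b = ordered[n // 2]        #         position 16 (index 15)

by_hand = (a + b) / 2      # Step 4: average the two middle values
print(a, b, by_hand)       # 101 102 101.5
print(median(chirps))      # 101.5
```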
Exercise 4
This table gives the records of the Major League Baseball teams at the end of the 2019 season:
| Team | Wins | Losses |
|---|---|---|
| HOU | 107 | 55 |
| LAD | 106 | 56 |
| NYY | 103 | 59 |
| MIN | 101 | 61 |
| ATL | 97 | 65 |
| OAK | 97 | 65 |
| TBR | 96 | 66 |
| CLE | 93 | 69 |
| WSN | 93 | 69 |
| STL | 91 | 71 |
| MIL | 89 | 73 |
| NYM | 86 | 76 |
| ARI | 85 | 77 |
| BOS | 84 | 78 |
| CHC | 84 | 78 |
| PHI | 81 | 81 |
| TEX | 78 | 84 |
| SFG | 77 | 85 |
| CIN | 75 | 87 |
| CHW | 72 | 89 |
| LAA | 72 | 90 |
| COL | 71 | 91 |
| SDP | 70 | 92 |
| PIT | 69 | 93 |
| SEA | 68 | 94 |
| TOR | 67 | 95 |
| KCR | 59 | 103 |
| MIA | 57 | 105 |
| BAL | 54 | 108 |
| DET | 47 | 114 |
What is the median number of wins?
Solution
82.5
Example 5
In a previous example, we created a frequency distribution of the number of siblings of the people who attended a conflict resolution class. Let’s review those data again:
| Number of Siblings | Frequency |
|---|---|
| 0 | 5 |
| 1 | 13 |
| 2 | 6 |
| 3 | 3 |
| 4 | 2 |
| 5 | 1 |
What is the median of the number of siblings?
There are 30 data values total, so the median is between the 15th and 16th values in the ordered list. There are five 0s and thirteen 1s according to the frequency distribution, so items 1 through 5 are all 0s and items 6 through 18 are all 1s. Since both items 15 and 16 are 1s, the median is 1.
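The counting argument above amounts to keeping a running total of the frequencies until it reaches the position we want. Here is a minimal Python sketch of that idea; the helper names `value_at_position` and `median_from_frequencies` are hypothetical and introduced only for illustration.

```python
def value_at_position(freq_table, position):
    """Return the data value at a given position (counting from 1)
    in the ordered list described by the frequency table."""
    running_total = 0
    for value in sorted(freq_table):
        running_total += freq_table[value]
        if position <= running_total:
            return value
    raise ValueError("position exceeds the number of data values")

def median_from_frequencies(freq_table):
    """Apply the median position formulas to a frequency table."""
    n = sum(freq_table.values())
    if n % 2 == 1:
        return value_at_position(freq_table, (n + 1) // 2)
    a = value_at_position(freq_table, n // 2)
    b = value_at_position(freq_table, n // 2 + 1)
    return (a + b) / 2

sibling_counts = {0: 5, 1: 13, 2: 6, 3: 3, 4: 2, 5: 1}
print(value_at_position(sibling_counts, 15))    # 1
print(value_at_position(sibling_counts, 16))    # 1
print(median_from_frequencies(sibling_counts))  # 1.0
```

This avoids rebuilding the full list of 30 values, which matters more as the frequencies get larger.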
Exercise 5
You previously found a frequency distribution for the number of people who shared a residence with people in a sample. Let’s review those data again:
| Number of People in the Residence | Frequency |
|---|---|
| 1 | 12 |
| 2 | 13 |
| 3 | 8 |
| 4 | 6 |
| 5 | 1 |
What’s the median of the number of people in these residences?
Solution
2
The Mean
Recall our example of ways we could identify the “average” height of an individual. The last method we discussed was also the most complicated. It involved a game where the player guesses a height, then figures out how far off that guess is from every single person’s height. Those differences get squared and added together to get a score. Our next measure of centrality gives the lowest possible score: no other guess would beat it in the game. Given a dataset containing [latex]n[/latex] total values, the mean of the dataset is the sum of all the data values, divided by [latex]n[/latex].
This is a computation you have likely done before. In many places, including spreadsheet programs like Microsoft Excel and Google Sheets, this number is called the average. For statisticians, though, the word average has too many possible meanings, so they prefer the one we’ll use: mean.
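To see the guessing game in action, here is a small Python sketch. The five heights (in centimeters) are hypothetical values made up for illustration, and the function name `score` is likewise an assumption. The sketch computes the mean as the sum divided by the number of values, then compares its score with the scores of other guesses.

```python
def score(guess, heights):
    """Game score: the sum of squared differences between the guess
    and each actual height. Lower is better."""
    return sum((h - guess) ** 2 for h in heights)

# Hypothetical heights in centimeters (not data from the text).
heights = [150, 160, 165, 170, 190]

# The mean: sum of the data values divided by the number of values.
mean_height = sum(heights) / len(heights)
print(mean_height)                  # 167.0

print(score(mean_height, heights))  # 880.0
print(score(165, heights))          # 900 (guessing the middle value, 165)

# No whole-number guess from 140 to 200 beats the mean's score.
best_possible = all(score(mean_height, heights) <= score(g, heights)
                    for g in range(140, 201))
print(best_possible)                # True
```

Checking a range of whole-number guesses is only a spot check, not a proof, but it illustrates why the mean wins the game.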
Example 6
Compute the mean of the numbers 12, 15, 17, 18, 18, and 19.
The mean is the sum of the values divided by the number of values on the list. So we get [latex]\frac{12+15+17+18+18+19}{6} = \frac{99}{6} = 16.5[/latex].
You can work this problem with the Desmos statistics calculator.
Exercise 6
Compute the mean of the numbers 5, 8, 11, 12, 12, 12, 15, 18, and 20. Round your answer to three decimal places.
Solution
12.556
Example 7
Refer again to the frequency distribution of the number of siblings reported by people who attended a conflict resolution class:
| Number of Siblings | Frequency |
|---|---|
| 0 | 5 |
| 1 | 13 |
| 2 | 6 |
| 3 | 3 |
| 4 | 2 |
| 5 | 1 |
What is the mean of the number of siblings?
Step 1: We compute the mean by adding up all the data values and then dividing by the number of data values on the list.
Step 2: Adding up the frequencies, we get [latex]5+13+6+3+2+1=30[/latex] data values in our list.
Step 3: Now to find the sum of all the data values, we could simply reconstruct the raw data and add up all the numbers there. But there’s an easier way: remember that repeatedly adding a number to itself is the definition of multiplication. So, for example, since there are six 2s in our data, the sum of all those 2s must be [latex]6 \times 2 = 12[/latex].
Step 4: Let’s add a column to our distribution for these products:
| Number of Siblings | Frequency | (Number of Siblings) [latex]\times[/latex] (Frequency) |
|---|---|---|
| 0 | 5 | 0 |
| 1 | 13 | 13 |
| 2 | 6 | 12 |
| 3 | 3 | 9 |
| 4 | 2 | 8 |
| 5 | 1 | 5 |
Step 5: So the sum of all our data values is [latex]0+13+12+9+8+5=47[/latex]. The mean is [latex]\frac{47}{30} \approx 1.567[/latex].
You can work this problem with the Desmos statistics calculator.
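The multiply-then-add shortcut from Steps 2 through 5 can also be sketched in a few lines of Python. The variable names are illustrative.

```python
# Frequency distribution: number of siblings -> frequency.
sibling_counts = {0: 5, 1: 13, 2: 6, 3: 3, 4: 2, 5: 1}

# Step 2: the number of data values is the sum of the frequencies.
n = sum(sibling_counts.values())

# Steps 3-4: each value contributes (value x frequency) to the total.
total = sum(value * freq for value, freq in sibling_counts.items())

# Step 5: divide the total by the number of data values.
mean_siblings = total / n

print(n, total)                 # 30 47
print(round(mean_siblings, 3))  # 1.567
```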
Exercise 7
Refer again to the frequency distribution of the number of people who shared a residence with people in a sample:
| Number of People in the Residence | Frequency |
|---|---|
| 1 | 12 |
| 2 | 13 |
| 3 | 8 |
What’s the mean of the number of people in these residences?
Solution
[latex]\frac{62}{33} \approx 1.879[/latex]
As the number of data values we are considering grows, computing the mean by hand gets more and more tedious. That’s why people generally rely on technology to perform that computation.
Which Is Better: Mode, Median, or Mean?
If the mode, median, and mean all purport to measure the same thing (centrality), why do we need all three? The answer is complicated, as each measure has its own strengths and weaknesses. The mode is simple to compute, but there may be more than one. Further, if no data value appears more than once, the mode is entirely unhelpful. As for the mean and median, the main difference between these two measures is how each is affected by extreme values.
Consider this example: the mean and median of 1, 2, 3, 4, 5 are both 3. But what if the dataset is instead 1, 2, 3, 4, 10? The median is still 3, but the mean is now 4. What this example shows is that the mean is sensitive to extreme values, while the median isn’t. This knowledge can help us decide which of the two is more relevant for a given dataset. If it is important that the really high or really low values are reflected in the measure of centrality, then the mean is the better option. If very high or low values are not important, however, then we should stick with the median.
The decision between mean and median only really matters if the data are skewed. If the data are symmetric, then the mean and median are going to be approximately equal, and the distinction between them is irrelevant. If the data are skewed, the mean gets pulled in the direction of the skew (i.e., if the data are right-skewed, then the mean will be bigger than the median; if the data are left-skewed, the opposite relation is true).
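The effect of an extreme value is easy to reproduce with Python's standard `statistics` module. The first two lists come from the discussion above. The third list, `left_skewed`, is a hypothetical dataset added only to illustrate the opposite direction of skew.

```python
from statistics import mean, median

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 10]  # one extreme high value
left_skewed = [1, 7, 8, 9, 10]   # hypothetical: one extreme low value

for name, data in [("symmetric", symmetric),
                   ("right-skewed", right_skewed),
                   ("left-skewed", left_skewed)]:
    print(name, "mean:", mean(data), "median:", median(data))

# symmetric:     mean 3, median 3
# right-skewed:  mean 4, median 3  (mean pulled up, above the median)
# left-skewed:   mean 7, median 8  (mean pulled down, below the median)
```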
Example 8
For the following situations, decide which measure(s) of centrality would be best:
a) You found a used car that you like, and you want to know if the price is too high or too low. You find a list of sale prices for that make and model, and you see that the distribution of those prices is skewed to the right. Some of the prices at the high end are close to the original sale price of the car, so you guess that those cars might have really low mileage or have other enhancements added on that increased the value (but which don’t apply to the car you found).
In this situation, the high values are not comparable to the value of the car you found and we don’t want them to affect the results. Also, we’re unlikely to find many repeated values, so the mode is probably not useful. Median is best.
b) You are asked to analyze the responses to a survey. One of the questions asked, “How strongly do you agree with the statement ‘I believe my elected representatives have my best interests in mind when they vote’?” Responses are a number between 1 and 5, with 1 representing “strongly disagree” and 5 representing “strongly agree.”
Here we want to know what a typical result is. The mean doesn’t really make sense; it involves adding the numbers together, so it would treat two “strongly disagree” and two “strongly agree” responses (those add to [latex]1+1+5+5=12[/latex]) as exactly the same as four “neutral” responses ([latex]3+3+3+3=12[/latex]). But those are really different situations; the first shows a strong polarization in the responses, while the second represents strong indifference. The mode is probably the best choice (because the data are actually categorical), but the median would be good too.
c) You are asked to find the “average” household income for a zip code. Those values are skewed right.
The mode isn’t going to be useful in this situation because it’s unlikely you will find many households that have exactly the same income. The mean and median will be different because of the skew, so the choice comes down to the extreme values. Remember that the data are skewed right, so there is a tail of unusually high values. Because these high values are important for our analysis, we want them to be reflected in the results. Thus, the mean is best. That being said, the median is also useful; it allows us to say something like “50% of the households surveyed make more than” the median.
d) Think back to the situation at the beginning of this section: you want to find “average” height. The data you’ve collected seem to be distributed symmetrically.
Because we aren’t likely to find many people with exactly the same height, the mode won’t be useful. Since the data are symmetric, the mean and the median will be about the same. So it doesn’t really matter which of those two we choose.
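The survey scenario in part (b) can be checked with a short Python sketch. The two response lists are the ones described in that part; the variable names `polarized` and `neutral` are illustrative.

```python
from statistics import mean, multimode

polarized = [1, 1, 5, 5]  # two "strongly disagree", two "strongly agree"
neutral = [3, 3, 3, 3]    # four "neutral" responses

# The mean cannot tell these two situations apart.
print(mean(polarized), mean(neutral))  # 3 3

# The mode can: it reports the responses people actually chose.
print(multimode(polarized))            # [1, 5]
print(multimode(neutral))              # [3]
```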
Exercises
For the following situations, decide which measure(s) of centrality would be best:
a) The data come from a survey question where the responses range from 1 (“not interested”) to 10 (“very interested”).
b) The data are fuel efficiency measures (in miles per gallon) for various vehicles. The data are right-skewed.
c) The data are scores on an exam, and they are left-skewed.
Solution
a) Mode or median
b) Median or mean
c) Median or mean
Key Terms
- centrality: describes the middle of a set of data
- mode: the value that appears most often
- bimodal: describes the data when there are two modes
- median: the value that is greater than no more than half and less than no more than half of the values
- mean: the sum of all the data values, divided by the number of data values