How does an outlier affect the mean of a population?

1 Answer
Nov 16, 2017

Outliers tend to "pull" the mean towards them. See explanation for more details and examples.

Explanation:

Outliers, loosely speaking, are values that lie so far "away" from the rest of a data set that they look suspect.

More technically speaking, an outlier is generally any data value that lies more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile. To check this, you calculate the lower quartile (Q1), median (Q2), upper quartile (Q3), and interquartile range (Q3-Q1), and then compare each data point to #Q1 - 1.5 xx "IQR"# and #Q3 + 1.5 xx "IQR"#. If a data point falls outside those limits, it's usually marked as an outlier. Note that some people push this out farther and use 3 times the IQR. There's no hard and fast rule for this in my experience.
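To make the fence test concrete, here is a minimal Python sketch. The function name `iqr_outliers`, the parameter `k`, and the sample data are my own, chosen purely for illustration. Quartiles come from `statistics.quantiles` (Python 3.8+); other quartile conventions can give slightly different fences, especially for small data sets.

```python
from statistics import quantiles

def iqr_outliers(data, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, _q2, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [x for x in data if x < lower_fence or x > upper_fence]

# Made-up sample (an assumption, not one of the sets used in this answer).
sample = [2, 4, 4, 5, 6, 7, 8, 102]
print(iqr_outliers(sample))        # the usual 1.5 * IQR rule
print(iqr_outliers(sample, k=3))   # the stricter 3 * IQR rule
```

Passing `k=3` gives the wider fences that some people prefer.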

In any case, an outlier can dramatically affect the mean of a population as a measure of central tendency. Consider the following set of data (chosen to make calculations easy):

#S = {2, 4, 4, 6}#

For this set, we can calculate the mean #bar(x)# easily:

#bar(x) = (2+4+4+6)/4 = 16/4 = 4#

Now, let us replace the value 6 in this set with an exaggerated value that sits far away from the rest of this overly small data set, an outlier in the loose sense described above:

#S_1 = {2, 4, 4, 102}#

We can see now how this affects the new mean #bar(x)#:

#bar(x) = (2+4+4+102)/4 = 112/4 = 28#

By changing a single value, we "pulled" the mean strongly in the direction of the large outlier we just created.
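If you want to check this arithmetic yourself, a short Python sketch along these lines works. The variable names are mine, and `statistics.mean` is just one convenient way to compute the average.

```python
from statistics import mean

original = [2, 4, 4, 6]
with_outlier = [2, 4, 4, 102]   # the 6 has been replaced by 102

mean_before = mean(original)
mean_after = mean(with_outlier)

print("mean before:", mean_before)
print("mean after: ", mean_after)
print("shift:      ", mean_after - mean_before)
```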

We can see this effect again with an obviously contrived example:

#S = underbrace({100, 100, ..., 100, 100})_ "Repeated 20 times"#

Clearly, if every number in the set is 100, the mean #bar(x)# is 100 no matter how many numbers there are. Now, though, let's replace three of the 100 values with 0s:

#S_1 = {0, 0, 0, underbrace(100, 100, ..., 100, 100)_ "17 entries of 100"}#

In this instance, we can examine the new calculated mean #bar(x)#:

#bar(x) = (0 + 0 + 0 + overbrace(100 + 100 + ... + 100 + 100)^"17 entries of 100")/20 = 1700/20 = 85#

Although it doesn't seem too impressive, changing these three values to low outliers has "pulled" the mean of #S_1# downwards by 15. If you were a student who took 20 tests and scored 100 seventeen times and 0 three times, your overall average would drop to 85%, a B grade in many schools, even though the general body of your work has been outstanding!
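Both examples follow the same pattern: swapping values changes the sum, and the mean moves by that change divided by the number of values. Here is a rough Python sketch of that idea. The helper `mean_shift` is a name I made up for illustration, not a library function.

```python
def mean_shift(old_values, new_values, n):
    """Change in the mean when old_values are swapped for new_values
    in a data set that has n values in total."""
    return (sum(new_values) - sum(old_values)) / n

# First example: one 6 replaced by 102 in a set of 4 values.
print(mean_shift([6], [102], n=4))

# Second example: three 100s replaced by 0s in a set of 20 values.
print(mean_shift([100, 100, 100], [0, 0, 0], n=20))
```

The further the swapped-in value is from the one it replaces, and the smaller the data set, the larger the pull on the mean.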

It should be noted that in both of these examples I created, the mode (most common value) was unchanged as a result of swapping in an outlier, and the median (the "central" value) was unchanged as well. This demonstrates an "attractive" feature of the median as a measure of central tendency: it tends to be more insulated from wild swings that could be caused by outliers in the data.
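The same comparison can be sketched in Python for both examples. The helper `summarize` and the variable names are my own; `mean`, `median`, and `mode` come from the standard `statistics` module.

```python
from statistics import mean, median, mode

def summarize(data):
    """Collect the three measures of central tendency discussed above."""
    return {"mean": mean(data), "median": median(data), "mode": mode(data)}

small = [2, 4, 4, 6]
small_outlier = [2, 4, 4, 102]

scores = [100] * 20
scores_outlier = [0] * 3 + [100] * 17

for label, data in [
    ("S (small)", small),
    ("S_1 (small)", small_outlier),
    ("S (scores)", scores),
    ("S_1 (scores)", scores_outlier),
]:
    print(label, summarize(data))
```

In each pair only the mean moves; the median and mode stay where they were.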