Written by Peter Rosenmai on 25 Nov 2013. Last revised 13 Jan 2013.

One of the commonest ways of finding outliers in one-dimensional data is to mark as a potential outlier any point
that is more than two standard deviations, say, from the mean (I am referring to *sample* means and standard
deviations here and in what follows). But the presence of outliers is likely to have a strong effect on the mean
and the standard deviation, making this technique unreliable.

For an example of this well-known problem, try running the following R code:

require(BHH2)

x <- c(1, 2, 3, 3, 4, 4, 4, 5, 5.5, 6, 6, 6.5, 7, 7, 7.5, 8, 9, 12, 52, 90)

dotPlot(x)

mean.x <- mean(x)

sd.x <- sd(x)

lines(rep(mean.x , 2), c(0.2, 0.25)) # Vertical line at mean

lines(rep(mean.x + 2*sd.x, 2), c(0.2, 0.25)) # Vertical line at mean + 2 SD

text(mean.x , 0.3, expression(bar(x)))

text(mean.x + 2*sd.x, 0.3, expression(paste(bar(x), " + 2s")))

x <- c(1, 2, 3, 3, 4, 4, 4, 5, 5.5, 6, 6, 6.5, 7, 7, 7.5, 8, 9, 12, 52, 90)

dotPlot(x)

mean.x <- mean(x)

sd.x <- sd(x)

lines(rep(mean.x , 2), c(0.2, 0.25)) # Vertical line at mean

lines(rep(mean.x + 2*sd.x, 2), c(0.2, 0.25)) # Vertical line at mean + 2 SD

text(mean.x , 0.3, expression(bar(x)))

text(mean.x + 2*sd.x, 0.3, expression(paste(bar(x), " + 2s")))

Here's the resulting dotplot:

As you can see, the extreme value at x=90 has dragged x̄+2s, the outlier cutoff, above the point at x=52. Only the point at x=90 is therefore caught as an outlier, even though the point at x=52 is clearly also an outlier.

This occurs because the statistics of centre and distance—the mean and standard
deviation, respectively—that we're using to spot outliers… are themselves strongly
affected by outliers. The crux of the problem is that the standard deviation is based on
*squared* distances, so extreme points are much more influential than those close to the mean.

We're better off, therefore, using a measure of distance that's robust against outliers. A
good candidate for this job is the *median absolute deviation from median*, commonly shortened
to the *median absolute deviation* (MAD). It is the median of the set comprising the
absolute values of the differences between the median and each data point.

Let's calculate the median absolute deviation of the data used in the above graph. The first ingredient we'll need is the median:

> median(x)

[1] 6

[1] 6

Now get the absolute deviations from that median:

> abs(x-6)

[1] 5.0 4.0 3.0 3.0 2.0 2.0 2.0 1.0 0.5 0.0 0.0 0.5 1.0 1.0 1.5 2.0 3.0 6.0 46.0 84.0

[1] 5.0 4.0 3.0 3.0 2.0 2.0 2.0 1.0 0.5 0.0 0.0 0.5 1.0 1.0 1.5 2.0 3.0 6.0 46.0 84.0

Now for the median of those absolute deviations:

> median(abs(x-6))

[1] 2

[1] 2

So the MAD in this case is 2.

And here's the shortcut:

> mad(x, constant=1)

[1] 2

[1] 2

Now let's get the absolute deviation from the median of each point in the vector x, and let's put those deviations in terms of the MAD:

> abs(x - median(x)) / mad(x, constant=1)

[1] 2.50 2.00 1.50 1.50 1.00 1.00 1.00 0.50 0.25 0.00 0.00 0.25 0.50 0.50 0.75 1.00 1.50 3.00 23.00 42.00

[1] 2.50 2.00 1.50 1.50 1.00 1.00 1.00 0.50 0.25 0.00 0.00 0.25 0.50 0.50 0.75 1.00 1.50 3.00 23.00 42.00

Note that the two rightmost points in the above graph have MAD-denominated distances from centre of 23 and 42. These are more than seven times the maximum MAD-denominated distance of the remaining points. In other words, the MAD clearly differentiates the outliers from the remaining points.

Let's compare this to the absolute deviation from the mean in terms of the standard deviation:

> round(abs(x - mean(x)) / sd(x), 2)

[1] 0.52 0.48 0.43 0.43 0.38 0.38 0.38 0.34 0.31 0.29 0.29 0.27 0.24 0.24 0.22 0.19 0.15 0.01 1.88 3.67

[1] 0.52 0.48 0.43 0.43 0.38 0.38 0.38 0.34 0.31 0.29 0.29 0.27 0.24 0.24 0.22 0.19 0.15 0.01 1.88 3.67

This time the distances from centre of the rightmost points are 1.88 and 3.67. These are at least 3.6 times the maximum distance of the remaining points. In this instance—and as expected—the MADs-from-median approach more clearly differentiates the outliers from the remaining points than does the standard-deviations-from-mean approach.

Having calculated the MAD of a dataset, the next question is: How many MADs from the median must a point be before it becomes an outlier?

Before we consider this, let's tweak the above equation for the MAD very slightly by throwing into the
mix a *consistency constant*. This modification will ensure that for large samples the MAD provides
a good estimate of the standard deviation (more formally, the MAD becomes a
consistent estimator
of the population standard deviation). In R code, the reworked MAD definition is:

> MAD <- consistency.constant * median(abs(x - median(x)))

If you know that the underlying distribution is normal, the consistency constant should be set to 1.4826. That's the default for the mad() function, so running the following calculations on the dataset used above will, as you can see, produce identical results:

> 1.4826 * median(abs(x - median(x)))

[1] 2.9652

> mad(x)

[1] 2.9652

[1] 2.9652

> mad(x)

[1] 2.9652

This reworked form of the MAD allows us to set an outlier cutoff similar to that which we would use were we flagging outliers by considering standard deviations from the mean. The dotplot example shown above used a cutoff of 2, so let's run with that:

> x[which((abs(x - median(x)) / mad(x)) > 2)]

[1] 12 52 90

[1] 12 52 90

As you can see, this technique for finding outliers has flagged the three rightmost points in the dotplot. It's good that the points at x=52 and x=90 have been flagged. However, the point at x=12 doesn't seem to me to be an outlier—this isn't a perfect solution.

Now you may be wondering why I'm tweaking the MAD to turn it into an estimator for the population standard deviation—and then using this modified MAD to spot outliers. After all, I started this article with a demonstration of the problems that may arise when using the standard deviation to catch outliers.

But remember, I was referring there to the *sample* standard deviation. The population standard deviation is a perfectly good
measure of distance by which to catch outliers; the problem is that the sample standard deviation is a poor estimator of it in small
samples with extreme outliers. That's why I'm using the MAD to estimate the population standard deviation. I can then use an outlier
cutoff that is appropriate to the assumed underlying distribution. In the example we just considered, I assumed that the underlying distribution was
normal, so I calculated the MAD using a consistency constant of 1.4826. We may expect that approximately 95% of points taken from a
normal distribution are more than 2 standard deviations from the mean (and median), so 2 is a good outlier cutoff in this case.

Okay, so what happens if you can't assume that the underlying distribution is normal? Well, if
the distribution is symmetric (not necessarily about zero), setting the consistency constant
equal to the reciprocal of the 75th percentile of the *standard* distribution will achieve the nice
estimation property mentioned above. By "standard" I mean the distribution that results from shift-and-scale tranforming
the distribution in question to a distribution with a mean of zero and a standard deviation of 1.

There seems to be a little confusion about this last point in several of the more highly Google-ranked articles on this subject. Reading Wikipedia, one would think that the distribution in question must have zero mean, but that is neither a necessary nor sufficient condition for the above to hold. This paper fails to mention the requirement that the distribution be symmetric; the omission is repeated in this blog piece.

For example, suppose we assume that we're dealing with the uniform distribution given by f(x)=0.1 for 80≤x≤90, f(x)=0 elsewhere. We transform this to the standard uniform distribution (being that uniform distribution having a mean of 0 and a standard deviation of 1). It takes a moment's algebraic manipulation to see that the resulting distribution is given by g(x)=1/(2√3) for -√3≤x≤√3, g(x)=0 elsewhere. Since our original distribution is symmetric (about 85), its consistency constant is equal to the reciprocal of the 75th percentile of the standard uniform distribution: 2/√3.

Let's check our maths:

> x <- runif(100000, 80,90)

> mad(x, constant=2/sqrt(3))

[1] 2.895573

> sd(x)

[1] 2.89007

> mad(x, constant=2/sqrt(3))

[1] 2.895573

> sd(x)

[1] 2.89007

Good. The MAD (with consistency constant 2/√3) and the standard deviation of this random sample are agreeably close.

And what happens if the underlying distribution is unsymmetric? This is where the standard-deviations-from-mean and MADs-from-median strategies both fall flat. The problem is that these strategies apply the same cutoff—e.g. 2 sample standard deviations or 2 MADs—to both tails of a sample, even if one tail is far longer than the other. In such cases—as in the example illustrated by the dotplot shown above—you'll need to use a more sophisticated strategy for flagging outliers.

One rather obvious hack is to use what I term the *double MAD*. This is a pair of statistics: (1) the median absolute
deviation from the median of all points less than or equal to the median and (2) the median absolute
deviation from the median of all points greater than or equal to the median. Use the first of these
to calculate the distance from the median of all points less than or equal to the median; use
the second to calculate that distance for points that are greater than the median.

For example, consider this set—okay, multiset—of points: {1, 4, 4, 4, 5, 5, 5, 5, 7, 7, 8, 10, 16, 30}. As you can see, these data are very right-skewed. So let's flag outliers using a double MAD with, say, a cutoff of 3:

DoubleMAD <- function(x, zero.mad.action="warn"){

# The zero.mad.action determines the action in the event of an MAD of zero.

# Possible values: "stop", "warn", "na" and "warn and na".

x <- x[!is.na(x)]

m <- median(x)

abs.dev <- abs(x - m)

left.mad <- median(abs.dev[x<=m])

right.mad <- median(abs.dev[x>=m])

if (left.mad == 0 || right.mad == 0){

if (zero.mad.action == "stop") stop("MAD is 0")

if (zero.mad.action %in% c("warn", "warn and na")) warning("MAD is 0")

if (zero.mad.action %in% c( "na", "warn and na")){

if (left.mad == 0) left.mad <- NA

if (right.mad == 0) right.mad <- NA

}

}

return(c(left.mad, right.mad))

}

DoubleMADsFromMedian <- function(x, zero.mad.action="warn"){

# The zero.mad.action determines the action in the event of an MAD of zero.

# Possible values: "stop", "warn", "na" and "warn and na".

two.sided.mad <- DoubleMAD(x, zero.mad.action)

m <- median(x, na.rm=TRUE)

x.mad <- rep(two.sided.mad[1], length(x))

x.mad[x > m] <- two.sided.mad[2]

mad.distance <- abs(x - m) / x.mad

mad.distance[x==m] <- 0

return(mad.distance)

}

x <- c(1, 4, 4, 4, 5, 5, 5, 5, 7, 7, 8, 10, 16, 30)

print(x[DoubleMADsFromMedian(x) > 3])

# The zero.mad.action determines the action in the event of an MAD of zero.

# Possible values: "stop", "warn", "na" and "warn and na".

x <- x[!is.na(x)]

m <- median(x)

abs.dev <- abs(x - m)

left.mad <- median(abs.dev[x<=m])

right.mad <- median(abs.dev[x>=m])

if (left.mad == 0 || right.mad == 0){

if (zero.mad.action == "stop") stop("MAD is 0")

if (zero.mad.action %in% c("warn", "warn and na")) warning("MAD is 0")

if (zero.mad.action %in% c( "na", "warn and na")){

if (left.mad == 0) left.mad <- NA

if (right.mad == 0) right.mad <- NA

}

}

return(c(left.mad, right.mad))

}

DoubleMADsFromMedian <- function(x, zero.mad.action="warn"){

# The zero.mad.action determines the action in the event of an MAD of zero.

# Possible values: "stop", "warn", "na" and "warn and na".

two.sided.mad <- DoubleMAD(x, zero.mad.action)

m <- median(x, na.rm=TRUE)

x.mad <- rep(two.sided.mad[1], length(x))

x.mad[x > m] <- two.sided.mad[2]

mad.distance <- abs(x - m) / x.mad

mad.distance[x==m] <- 0

return(mad.distance)

}

x <- c(1, 4, 4, 4, 5, 5, 5, 5, 7, 7, 8, 10, 16, 30)

print(x[DoubleMADsFromMedian(x) > 3])

If you run the above code, you'll see that the flagged outliers are 1, 16 and 30.

By contrast, try flagging outliers using the ordinary MAD with an outlier cutoff of 3:

print(x[abs(x-median(x)) / mad(x, constant=1) > 3])

The flagged outliers are 10, 16 and 30. As you can see, the ordinary MAD only caught high outliers; the double MAD caught high and low outliers.

One more thing. If more than 50% of your data have identical values, your MAD will equal zero. All points in your dataset except those that equal the median will then be flagged as outliers, regardless of the level at which you've set your outlier cutoff. (By constrast, if you use the standard-deviations-from-mean approach to finding outliers, Chebyshev's inequality puts a hard limit on the percentage of points that may be flagged as outliers.) So at the very least check that you don't have too many identical data points before using the MAD to flag outliers.

My DoubleMADsFromMedian() function, above, deals with this problem via its zero.mad.action parameter:

- If zero.mad.action is "stop" and the left MAD or the right MAD is 0, execution halts.
- If zero.mad.action is "warn" and the left (right) MAD is 0, a warning is thrown. All points to the left (right) of the median will have a MAD-denominated distance from median of Inf.
- If zero.mad.action is "na" and the left (right) MAD is 0, all points to the left (right) of the median will have a MAD-denominated distance from median of NA.
- If zero.mad.action is "warn and na" and the left (right) MAD is 0, a warning is thrown. All points to the left (right) of the median will have a MAD-denominated distance from median of NA.

Leys, C., et al., Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median, Journal of Experimental Social Psychology, Volume 49, Issue 4, July 2013, pp. 764-766. *

Rousseeuw, P.J. and Croux C. (1993) Alternatives to the Median Absolute Deviation,
Journal of the American Statistical Association, December 1993, pp. 1273-1283.

* But watch out for the omission mentioned above.