## Using Mahalanobis Distance to Find Outliers

R's mahalanobis() function provides a simple means of detecting outliers in multidimensional data.

For example, suppose you have a dataframe of heights and weights:

```r
hw <- data.frame(Height.cm=c(164, 167, 168, 169, 169, 170, 170, 170, 171, 172, 172, 173, 173, 175, 176, 178),
                 Weight.kg=c( 54,  57,  58,  60,  61,  60,  61,  62,  62,  64,  62,  62,  64,  56,  66,  70))
```

When plotting these data (generated for this example using an interactive plot), you could mark as outliers those points that are, for instance, more than two (sample) standard deviations from the mean height or mean weight:

```r
is.height.outlier <- abs(scale(hw$Height.cm)) > 2
is.weight.outlier <- abs(scale(hw$Weight.kg)) > 2
pch               <- (is.height.outlier | is.weight.outlier) * 16
plot(hw, pch=pch)
```

The resulting graph:

Note that the point with height equal to 175 cm (in the bottom-right corner of the graph) has not been marked as an outlier, as it's less than 2 standard deviations from the mean height and mean weight. And yet that is the point that most clearly does not follow the linear relationship between height and weight that we see in this data. It is—arguably—the real outlier here.
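You can verify this numerically. Here's a small sketch that re-enters the data from above and checks the z-scores of the (175 cm, 56 kg) point, which sits in row 14 of the dataframe (the variable names here are mine, chosen for illustration):

```r
# Re-create the height/weight data and check the z-scores of the
# (175 cm, 56 kg) point, which is row 14.
heights <- c(164, 167, 168, 169, 169, 170, 170, 170, 171, 172, 172, 173, 173, 175, 176, 178)
weights <- c( 54,  57,  58,  60,  61,  60,  61,  62,  62,  64,  62,  62,  64,  56,  66,  70)
z.height <- abs(scale(heights))[14]  # roughly 1.12
z.weight <- abs(scale(weights))[14]  # roughly 1.33
# Both are comfortably below 2, so neither per-variable rule flags the point.
```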

By the way, the choice of scales for the above graph is somewhat misleading. A clearer picture of the effect of height on weight would have been obtained by at least letting the y scale start at zero. But I'm using this data merely to illustrate outlier detection; I hope you'll overlook this bad practice!

Now let's give mahalanobis() a spin:

```r
n.outliers   <- 2 # Mark as outliers the 2 most extreme points
m.dist.order <- order(mahalanobis(hw, colMeans(hw), cov(hw)), decreasing=TRUE)
is.outlier   <- rep(FALSE, nrow(hw))
is.outlier[m.dist.order[1:n.outliers]] <- TRUE
pch <- is.outlier * 16
plot(hw, pch=pch)
```

The above code marks as outliers the two most extreme points according to their Mahalanobis distance (also known as the generalised squared distance). This is, very roughly speaking, the distance of each point (the rows of the dataframe) from the centre of the data that the dataframe comprises, normalised by the standard deviation of each of the variables (the columns of the dataframe) and adjusted for the covariances of those variables. It may be thought of as the multidimensional analogue of the t-statistic, which is defined as (x − x̄) / s, where x̄ is the sample mean and s is the sample standard deviation. (For details, visit Wikipedia's page on Mahalanobis distance.) As you can see, this time the point in the bottom-right corner of the graph has been caught:
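To make the "squared" part concrete, here's a sketch that reproduces mahalanobis()'s output by hand: for each row x it computes (x − μ)ᵀ S⁻¹ (x − μ), where μ is the vector of column means and S is the sample covariance matrix (the variable names below are mine, not part of the earlier examples):

```r
# Reproduce mahalanobis() by hand on the height/weight data.
hw <- data.frame(Height.cm=c(164, 167, 168, 169, 169, 170, 170, 170, 171, 172, 172, 173, 173, 175, 176, 178),
                 Weight.kg=c( 54,  57,  58,  60,  61,  60,  61,  62,  62,  64,  62,  62,  64,  56,  66,  70))
mu    <- colMeans(hw)
S.inv <- solve(cov(hw))  # inverse of the sample covariance matrix
d2.by.hand <- apply(hw, 1, function(x) t(x - mu) %*% S.inv %*% (x - mu))
d2.builtin <- mahalanobis(hw, mu, cov(hw))
all.equal(unname(d2.by.hand), unname(d2.builtin))  # TRUE: same squared distances
```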

And this technique works in higher dimensions too. This code produces a 3-dimensional spinnable scatterplot:

```r
library(rgl)
hwa <- data.frame(Height.cm=c(164, 167, 168, 168, 169, 169, 169, 170, 172, 173, 175, 176, 178),
                  Weight.kg=c( 55,  57,  58,  56,  57,  61,  61,  61,  64,  62,  56,  66,  70),
                  Age      =c( 13,  12,  14,  17,  15,  14,  16,  16,  13,  15,  16,  14,  16))
m.dist.order                  <- order(mahalanobis(hwa, colMeans(hwa), cov(hwa)), decreasing=TRUE)
is.outlier                    <- rep(FALSE, nrow(hwa))
is.outlier[m.dist.order[1:2]] <- TRUE # Mark as outliers the 2 most extreme points
col                           <- is.outlier + 1
plot3d(hwa$Height.cm, hwa$Weight.kg, hwa$Age, type="s", col=col)
```

Here's a shot of the scatterplot, the red points being the outliers:

As you can see from the above code, the mahalanobis() function calculates the squared Mahalanobis distance of each row of a dataframe using a supplied vector of means and a supplied covariance matrix. You'll typically want to use it as in the examples above, passing in a vector of means and a covariance matrix that have been calculated from the dataframe under consideration. For example:

```r
m.dist <- mahalanobis(my.dataframe, colMeans(my.dataframe), cov(my.dataframe))
```

The resulting vector of distances can be used to weed out the most extreme rows of a dataframe. For example, you may want to remove the 5% of points that are the most extreme:

```r
percentage.to.remove <- 5 # Remove 5% of points
number.to.remove     <- trunc(nrow(my.dataframe) * percentage.to.remove / 100)
m.dist               <- mahalanobis(my.dataframe, colMeans(my.dataframe), cov(my.dataframe))
m.dist.order         <- order(m.dist, decreasing=TRUE)
rows.to.keep.index   <- m.dist.order[(number.to.remove+1):nrow(my.dataframe)]
my.dataframe         <- my.dataframe[rows.to.keep.index,]
```

This is often useful when you want to quickly check whether an analysis you're running is overly affected by extreme points. First run the analysis on the full dataset, then remove the most extreme points using the above technique… and then run your analysis again. If there's a big difference in the results, you may want to consider using an analysis that is more robust against outliers.
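As a sketch of that workflow, here's one way it might look on the height/weight data from above, using a simple linear regression (the model choice is mine, purely for illustration):

```r
# Fit a regression on the full data, then again with the 2 most
# extreme points (by Mahalanobis distance) removed, and compare.
hw <- data.frame(Height.cm=c(164, 167, 168, 169, 169, 170, 170, 170, 171, 172, 172, 173, 173, 175, 176, 178),
                 Weight.kg=c( 54,  57,  58,  60,  61,  60,  61,  62,  62,  64,  62,  62,  64,  56,  66,  70))
fit.full <- lm(Weight.kg ~ Height.cm, data=hw)
m.dist   <- mahalanobis(hw, colMeans(hw), cov(hw))
keep     <- order(m.dist, decreasing=TRUE)[-(1:2)]  # drop the 2 most extreme rows
fit.trim <- lm(Weight.kg ~ Height.cm, data=hw[keep, ])
# Compare the fitted slopes; a large difference suggests the extreme
# points are driving the result.
coef(fit.full)["Height.cm"]
coef(fit.trim)["Height.cm"]
```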

### Caveats

Be wary of mahalanobis() when your data exhibit nonlinear relationships, as the Mahalanobis distance equation only accounts for linear relationships. For example, try running the following code:

```r
my.dataframe <- data.frame(x=c(4,  8, 10, 16, 17, 22, 27, 33, 38, 40, 47, 48, 53, 55, 63, 71, 76, 85, 85, 92, 96),
                           y=c(6, 22, 32, 34, 42, 51, 59, 63, 64, 69, 70, 20, 70, 63, 63, 55, 46, 41, 33, 19,  6))
percentage.of.outliers        <- 10 # Mark 10% of points as outliers
number.of.outliers            <- trunc(nrow(my.dataframe) * percentage.of.outliers / 100)
m.dist                        <- mahalanobis(my.dataframe, colMeans(my.dataframe), cov(my.dataframe))
m.dist.order                  <- order(m.dist, decreasing=TRUE)
rows.not.outliers             <- m.dist.order[(number.of.outliers+1):nrow(my.dataframe)]
is.outlier                    <- rep(TRUE, nrow(my.dataframe))
is.outlier[rows.not.outliers] <- FALSE
pch                           <- is.outlier * 16
plot(x=my.dataframe$x, y=my.dataframe$y, pch=pch)
```

Here's the resulting graph:

Note that the most obvious outlier has not been detected because the relationship between the variables in the dataset under consideration is nonlinear.

Also be aware that using Mahalanobis distance to find extreme values in multidimensional data may raise issues similar to those encountered when using the t-statistic to find extreme values in one-dimensional data. In the one-dimensional case, extreme values can have such a strong effect on the mean and the standard deviation that the t-statistic no longer provides a reasonable measure of distance from centre. (Which is why it's often better to use measures of centre and distance—such as the median and the median absolute deviation—that are more resistant to outliers. See my piece on using the median absolute deviation to find outliers.) The same can happen in the multidimensional case, as Mahalanobis distance, like the t-statistic, is calculated from standard deviations and distances from the mean.
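One common mitigation, which I'll sketch here as a suggestion rather than something from the examples above, is to feed robust estimates of centre and covariance into mahalanobis(). The MASS package (shipped with standard R distributions) provides cov.rob(), which by default computes a minimum volume ellipsoid estimate:

```r
# Robust Mahalanobis distances: estimate centre and covariance with
# MASS::cov.rob so that extreme points have less influence on the
# distance calculation itself.
library(MASS)
hw <- data.frame(Height.cm=c(164, 167, 168, 169, 169, 170, 170, 170, 171, 172, 172, 173, 173, 175, 176, 178),
                 Weight.kg=c( 54,  57,  58,  60,  61,  60,  61,  62,  62,  64,  62,  62,  64,  56,  66,  70))
rob <- cov.rob(hw)  # robust centre ($center) and covariance ($cov)
m.dist.rob <- mahalanobis(hw, rob$center, rob$cov)
# The two most extreme rows by the robust distance:
order(m.dist.rob, decreasing=TRUE)[1:2]
```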