Written by Peter Rosenmai on 25 Nov 2013. Last revised 30 Nov 2013.

R's mahalanobis() function provides a simple means of detecting outliers in multidimensional data.

For example, suppose you have a dataframe of heights and weights:

hw <- data.frame(Height.cm=c(164, 167, 168, 169, 169, 170, 170, 170, 171, 172, 172, 173, 173, 175, 176, 178),

Weight.kg=c( 54, 57, 58, 60, 61, 60, 61, 62, 62, 64, 62, 62, 64, 56, 66, 70))

When plotting these data, you could mark as outliers those points that are, for instance, more than two (sample) standard deviations from the mean height or mean weight:

is.height.outlier <- abs(scale(hw$Height.cm)) > 2

is.weight.outlier <- abs(scale(hw$Weight.kg)) > 2

pch <- (is.height.outlier | is.weight.outlier) * 16

plot(hw, pch=pch)

The resulting graph:

Note that the point with height equal to 175 cm (in the bottom-right corner of the graph) has not been marked as an outlier, as it's less than 2 standard deviations from the mean height and mean weight. And yet that is the point that most clearly does not follow the linear relationship between height and weight that we see in this data. It is—arguably—the real outlier here.

By the way, the choice of scales for the above graph is somewhat misleading. A clearer picture of the effect of height on weight would have been obtained by at least letting the y scale start at zero. But I'm using this data merely to illustrate outlier detection; I hope you'll overlook this bad practice!
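For what it's worth, a version of the plot with the y axis anchored at zero might look like this (a sketch of my own):

```r
# The same height/weight scatterplot, with the y axis starting at zero
hw <- data.frame(Height.cm=c(164, 167, 168, 169, 169, 170, 170, 170, 171, 172, 172, 173, 173, 175, 176, 178),
                 Weight.kg=c(54, 57, 58, 60, 61, 60, 61, 62, 62, 64, 62, 62, 64, 56, 66, 70))
plot(hw, ylim=c(0, max(hw$Weight.kg)))
```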

Now let's give mahalanobis() a spin:

n.outliers <- 2 # Mark as outliers the 2 most extreme points

m.dist.order <- order(mahalanobis(hw, colMeans(hw), cov(hw)), decreasing=TRUE)

is.outlier <- rep(FALSE, nrow(hw))

is.outlier[m.dist.order[1:n.outliers]] <- TRUE

pch <- is.outlier * 16

plot(hw, pch=pch)

The above code marks as outliers the two most extreme points according to their Mahalanobis distance (also known as the generalised squared distance). This is, very roughly speaking, the distance of each point (the rows of the dataframe) from the centre of the data that the dataframe comprises, normalised by the standard deviation of each of the variables (the columns of the dataframe) and adjusted for the covariances of those variables. It may be thought of as the multidimensional analogue of the *t*-statistic—which is defined as (*x* − *x̄*) / *s*, where *x̄* is the sample mean and *s* is the sample standard deviation. (For details, visit Wikipedia's page on Mahalanobis distance.) As you can see, this time the point in the bottom-right corner of the graph has been caught:
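To make that analogy concrete: in one dimension, the squared Mahalanobis distance reduces to the squared z-score, ((x − x̄)/s)². A quick check of my own, using the heights from the example above:

```r
# In one dimension, mahalanobis() reduces to the squared z-score
x <- c(164, 167, 168, 169, 169, 170, 170, 170, 171, 172, 172, 173, 173, 175, 176, 178)
m.dist <- mahalanobis(matrix(x), mean(x), var(x))
z.squared <- ((x - mean(x)) / sd(x))^2
all.equal(as.numeric(m.dist), z.squared) # TRUE
```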

And this technique works in higher dimensions too. This code produces a 3-dimensional spinnable scatterplot:

library(rgl)

hwa <- data.frame(Height.cm=c(164, 167, 168, 168, 169, 169, 169, 170, 172, 173, 175, 176, 178),

Weight.kg=c( 55, 57, 58, 56, 57, 61, 61, 61, 64, 62, 56, 66, 70),

Age =c( 13, 12, 14, 17, 15, 14, 16, 16, 13, 15, 16, 14, 16))

m.dist.order <- order(mahalanobis(hwa, colMeans(hwa), cov(hwa)), decreasing=TRUE)

is.outlier <- rep(FALSE, nrow(hwa))

is.outlier[m.dist.order[1:2]] <- TRUE # Mark as outliers the 2 most extreme points

col <- is.outlier + 1

plot3d(hwa$Height.cm, hwa$Weight.kg, hwa$Age, type="s", col=col)

Here's a shot of the scatterplot, the red points being the outliers:

As you can see from the above code, the mahalanobis() function calculates the squared Mahalanobis distance of each row of a dataframe, using a supplied vector of means and a supplied covariance matrix. You'll typically want to use it as in the examples above, passing in a vector of means and a covariance matrix that have been calculated from the dataframe under consideration. For example:

m.dist <- mahalanobis(my.dataframe, colMeans(my.dataframe), cov(my.dataframe))

The resulting vector of distances can be used to weed out the most extreme rows of a dataframe. For example, you may want to remove the 5% of points that are the most extreme:

percentage.to.remove <- 5 # Remove 5% of points

number.to.remove <- trunc(nrow(my.dataframe) * percentage.to.remove / 100)

m.dist <- mahalanobis(my.dataframe, colMeans(my.dataframe), cov(my.dataframe))

m.dist.order <- order(m.dist, decreasing=TRUE)

rows.to.keep.index <- m.dist.order[(number.to.remove+1):nrow(my.dataframe)]

my.dataframe <- my.dataframe[rows.to.keep.index,]

This is often useful when you want to quickly check whether an analysis you're running is overly affected by extreme points. First run the analysis on the full dataset, then remove the most extreme points using the above technique… and then run your analysis again. If there's a big difference in the results, you may want to consider using an analysis that is more robust against outliers.
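As a rough sketch of that before-and-after workflow, using the height/weight data from earlier (the linear regression and the two-point cutoff are my own choices for illustration):

```r
# Compare a regression fit before and after dropping the most extreme points
hw <- data.frame(Height.cm=c(164, 167, 168, 169, 169, 170, 170, 170, 171, 172, 172, 173, 173, 175, 176, 178),
                 Weight.kg=c(54, 57, 58, 60, 61, 60, 61, 62, 62, 64, 62, 62, 64, 56, 66, 70))

fit.all <- lm(Weight.kg ~ Height.cm, data=hw)

m.dist.order <- order(mahalanobis(hw, colMeans(hw), cov(hw)), decreasing=TRUE)
hw.trimmed <- hw[m.dist.order[3:nrow(hw)], ] # Drop the 2 most extreme rows

fit.trimmed <- lm(Weight.kg ~ Height.cm, data=hw.trimmed)

# A large change in the slope suggests the analysis is sensitive to outliers
coef(fit.all)["Height.cm"]
coef(fit.trimmed)["Height.cm"]
```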

Be wary of mahalanobis() when your data exhibit nonlinear relationships, as the Mahalanobis distance equation only accounts for *linear* relationships. For example, try running the following code:

my.dataframe <- data.frame(x=c(4, 8, 10, 16, 17, 22, 27, 33, 38, 40, 47, 48, 53, 55, 63, 71, 76, 85, 85, 92, 96),

y=c(6, 22, 32, 34, 42, 51, 59, 63, 64, 69, 70, 20, 70, 63, 63, 55, 46, 41, 33, 19, 6))

percentage.of.outliers <- 10 # Mark 10% of points as outliers

number.of.outliers <- trunc(nrow(my.dataframe) * percentage.of.outliers / 100)

m.dist <- mahalanobis(my.dataframe, colMeans(my.dataframe), cov(my.dataframe))

m.dist.order <- order(m.dist, decreasing=TRUE)

rows.not.outliers <- m.dist.order[(number.of.outliers+1):nrow(my.dataframe)]

is.outlier <- rep(TRUE, nrow(my.dataframe))

is.outlier[rows.not.outliers] <- FALSE

pch <- is.outlier * 16

plot(x=my.dataframe$x, y=my.dataframe$y, pch=pch)

Here's the resulting graph:

Note that the most obvious outlier has not been detected because the relationship between the variables in the dataset under consideration is nonlinear.
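One possible workaround (my own sketch, not part of the original analysis) is to measure outlyingness as the residual from a model that captures the curve — here, a quadratic fit to the same data:

```r
# Flag points whose residuals from a quadratic fit are unusually large
my.dataframe <- data.frame(x=c(4, 8, 10, 16, 17, 22, 27, 33, 38, 40, 47, 48, 53, 55, 63, 71, 76, 85, 85, 92, 96),
                           y=c(6, 22, 32, 34, 42, 51, 59, 63, 64, 69, 70, 20, 70, 63, 63, 55, 46, 41, 33, 19, 6))

fit <- lm(y ~ poly(x, 2), data=my.dataframe)
is.outlier <- abs(resid(fit)) > 2 * sd(resid(fit))
pch <- is.outlier * 16
plot(x=my.dataframe$x, y=my.dataframe$y, pch=pch)
```

This catches the point at (48, 20), which sits far below the curve that the rest of the data follow.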

Also be aware that using Mahalanobis distance to find extreme values in multidimensional data may raise issues similar to those encountered when using the