Dragging Outliers Towards the Mean or Median
Written by Peter Rosenmai on 21 Jun 2014.
Here's an example of how to use R to smoothly drag towards the mean outliers that are more than a given number of standard deviations
from the mean—or median absolute deviations from the median, or whatever—so that the most extreme outliers are
dragged in the most. I don't really agree with mangling data in this way and I think the task is a trivially simple one, but I've often been asked how to do it… so here's how you might go about it.
First, for demonstation purposes, I create a dataset with some obvious outliers:
> x <- c(-43, -2, -1, 0, 0, 0.5, 2.5, 3, 3, 5, 7, 8.2, 15, 16, 70, 99)
I drag the outliers towards the mean using the standard deviation:
> cutoff <- 1.5
> severity <- 2.5
> x_adjusted <- mean(x) + drag_towards_zero(abs(scale(x)), cutoff, severity) * sd(x) * sign(scale(x))
> round(x_adjusted, 1)
[,1]
[1,] -41.6
[2,] -2.0
[3,] -1.0
[4,] 0.0
[5,] 0.0
[6,] 0.5
[7,] 2.5
[8,] 3.0
[9,] 3.0
[10,] 5.0
[11,] 7.0
[12,] 8.2
[13,] 15.0
[14,] 16.0
[15,] 66.9
[16,] 77.1
attr(,"scaled:center")
[1] 11.45
attr(,"scaled:scale")
[1] 31.73232
As you can see, the first and the last two values have been dragged towards the centre of the data. The other points were less than
1.5 standard deviations from the mean (note: cutoff = 1.5) so were unchanged. And this graph shows that the most extreme
values were dragged in the most:
> xylim <-
c(
min(x),
max(x))
>
plot(x=x, y=x_adjusted, xlim=xylim, ylim=xylim)
>
abline(a=0, b=1)
I can increase the extent to which points are dragged towards the centre by increasing the severity parameter passed to the drag_towards_zero function:
> cutoff <- 1.5
> severity <- 10
> x_adjusted <-
mean(x) +
drag_towards_zero(
abs(
scale(x)), cutoff, severity) *
sd(x) *
sign(
scale(x))
> xylim <-
c(
min(x),
max(x))
>
plot(x=x, y=x_adjusted, xlim=xylim, ylim=xylim)
>
abline(a=0, b=1)
>
lines(x=
c(-200, 200), y=
rep(
mean(x) + cutoff*
sd(x), 2), lty=2)
>
lines(x=
c(-200, 200), y=
rep(
mean(x) - cutoff*
sd(x), 2), lty=2)
Note that no matter how high you set the severity parameter, no outlier can be dragged into the non-outlier region [mean(x) - cutoff*sd(x), mean(x) + cutoff*sd(x)], shown on the graph with dashed lines. This is a good thing, as it ensures that the order of the data is preserved.
And if you only wish to drag in the high outliers, just do this:
> x_adjusted <- ifelse(x < mean(x), x, x_adjusted)
Okay, now let's use the median and the median absolute deviation from median (the MAD) as our statistics of centre and distance. I prefer these
to the mean and standard deviation when I'm dealing with outliers, as they're more robust against outliers. (But watch out for the MAD=0 problem.)
Here's the code:
> cutoff <- 1.5
> severity <- 2.5
> x_adjusted <-
median(x) +
drag_towards_zero(
abs((x -
median(x))/
mad(x)), cutoff, severity) *
mad(x) *
sign((x -
median(x))/
mad(x))
> xylim <- c(
min(x),
max(x))
>
plot(x=x, y=x_adjusted, xlim=xylim, ylim=xylim)
>
abline(a=0, b=1)
>
lines(x=
c(-200, 200), y=
rep(
median(x) + cutoff*
mad(x), 2), lty=2)
>
lines(x=
c(-200, 200), y=
rep(
median(x) - cutoff*
mad(x), 2), lty=2)
Looking at the above, I'd probably increase the cutoff to 2.5 so that the non-outlier region is
[median(x) - 2.5*mad(x), median(x) + 2.5*mad(x)].
And I'd reduce the severity parameter to 0.1, say:
> cutoff <- 2.5
> severity <- 0.1
> x_adjusted <-
median(x) +
drag_towards_zero(
abs((x -
median(x))/
mad(x)), cutoff, severity) *
mad(x) *
sign((x -
median(x))/
mad(x))
> xylim <-
c(
min(x),
max(x))
>
plot(x=x, y=x_adjusted, xlim=xylim, ylim=xylim)
>
abline(a=0, b=1)
>
lines(x=
c(-200, 200), y=
rep(
median(x) + cutoff*
mad(x), 2), lty=2)
>
lines(x=
c(-200, 200), y=
rep(
median(x) - cutoff*
mad(x), 2), lty=2)
Good. That's not quite so extreme a transformation.
Finally, here's the drag_towards_zero function:
drag_towards_zero <- function(x, cutoff, severity){
ifelse(x < cutoff, x, cutoff + log(severity*(ifelse(x < cutoff, NA, x) - cutoff) + 1)/severity)
}
And here's drag_towards_zero graphed for cutoff = 1.5 and severity = 0.5, 1 and 2:
> severity <-
c(0.5, 1.0, 2)
> x <-
seq(0, 5, 0.05)
>
plot( x=x, y=
drag_towards_zero(x, cutoff=1.5, severity[1]), type=
"l", col=
"green", ylab=
"y", xlim=
c(0, 5), ylim=
c(0, 5))
>
points(x=x, y=
drag_towards_zero(x, cutoff=1.5, severity[2]), type=
"l", col=
"red")
>
points(x=x, y=
drag_towards_zero(x, cutoff=1.5, severity[3]), type=
"l", col=
"black")
>
points(x=
c(-10, 10), y=
c(1.5, 1.5), type=
"l", lty=2)
>
legend(
"topleft", col=
c(
"green",
"red",
"black"),
paste(
"Severity =", severity), lty=1)
Note that no matter how high the severity parameter, the values that are greater than the cutoff cannot be dragged below it (the dashed line).