Exploring the World Bank's Gini Index Data with R

Let's have a look at the Gini index data available from the World Bank through R's WDI package.

For those who haven't met it before, the Gini index is an elegantly constructed measure of, typically, income inequality. A Gini index of 0 represents a perfectly equal economy; a Gini index of 100 represents a perfectly unequal economy. (To find out more about the Gini index, have a look at my Gini index calculator.)
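
To make the definition concrete, here is a minimal sketch that computes a Gini index from a vector of incomes using the mean-absolute-difference formula. The function name GiniIndex and the toy incomes are made up for illustration, and this is not how the World Bank produces its figures, which are estimated from household survey data.

```r
GiniIndex <- function(incomes){
   # Gini index on a 0-100 scale via the mean absolute difference:
   #   G = sum_i sum_j |x_i - x_j| / (2 * n^2 * mean(x))
   n         <- length(incomes)
   abs.diffs <- abs(outer(incomes, incomes, "-"))
   100 * sum(abs.diffs) / (2 * n^2 * mean(incomes))
}

GiniIndex(c(10, 10, 10, 10))   # everyone earns the same: 0
GiniIndex(c(0, 0, 0, 40))      # one person earns everything: 75
```

With only four people the most unequal split gives 75 rather than 100: under this formula the maximum is 100 * (n - 1) / n, which approaches 100 as the population grows.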

Let's search for the Gini index within the World Bank's datasets:

require(WDI)
WDIsearch('gini')

If you run the above code, you'll see that SI.POV.GINI is the stat we need. Let's take a peek at the values it has taken in post-apartheid South Africa:

> df.wb <- WDI(indicator="SI.POV.GINI", country="ZA", start=1994, end=2013)
> df.wb[order(df.wb$year),]
   iso2c      country SI.POV.GINI year
20    ZA South Africa          NA 1994
19    ZA South Africa       56.59 1995
18    ZA South Africa          NA 1996
17    ZA South Africa          NA 1997
16    ZA South Africa          NA 1998
15    ZA South Africa          NA 1999
14    ZA South Africa       57.77 2000
13    ZA South Africa          NA 2001
12    ZA South Africa          NA 2002
11    ZA South Africa          NA 2003
10    ZA South Africa          NA 2004
9     ZA South Africa          NA 2005
8     ZA South Africa       67.40 2006
7     ZA South Africa          NA 2007
6     ZA South Africa          NA 2008
5     ZA South Africa       63.14 2009
4     ZA South Africa          NA 2010
3     ZA South Africa          NA 2011
2     ZA South Africa          NA 2012
1     ZA South Africa          NA 2013

Those are grim numbers. We have a figure for South Africa for 2009, so let's compare that against the Gini index for other countries for that same year:

> df.wb <- WDI(indicator="SI.POV.GINI", country="all", start=2009, end=2009)
> df.wb <- df.wb[!is.na(df.wb$"SI.POV.GINI"),]
> df.wb[order(df.wb$"SI.POV.GINI", decreasing=TRUE),][1:10,]
    iso2c      country SI.POV.GINI year
212    ZA South Africa       63.14 2009
121    HN     Honduras       56.95 2009
80     CO     Colombia       56.67 2009
65     BR       Brazil       54.69 2009
78     CL        Chile       52.06 2009
186    PA       Panama       52.03 2009
188    PY     Paraguay       51.04 2009
84     CR   Costa Rica       50.73 2009
95     EC      Ecuador       49.43 2009
189    PE         Peru       49.05 2009

So the Gini index for South Africa appears to be worse even than for Latin America's titans of income inequality. But how many countries are in our dataset?

> nrow(df.wb)
[1] 42

Only 42! The problem is that such statistics aren't collected for every country every year. So we need to interpolate between the years that have data, and extrapolate beyond them, to expand our dataset. I've written a simple function, LinearlyInterpolateFlatExtrapolateWBData(), to do that, along with a helper that fills in a single vector:

LinearlyInterpolateFlatExtrapolate <- function(v, max.extrapolate=NA){
   # Linearly interpolates between the non-NA values of v and flat-line extrapolates
   # (repeats the nearest non-NA value) before the first and after the last of them.
   # If max.extrapolate is given, positions more than that many steps beyond the
   # first/last non-NA value are left as NA.
   #
   # Examples:
   # LinearlyInterpolateFlatExtrapolate(c(NA, NA,  1, NA, NA), NA) returns c( 1,  1,  1,  1,  1)
   # LinearlyInterpolateFlatExtrapolate(c(NA,  2, NA,  4, NA), NA) returns c( 2,  2,  3,  4,  4)
   # LinearlyInterpolateFlatExtrapolate(c(NA, NA, NA, NA, NA), NA) returns c(NA, NA, NA, NA, NA)
   # LinearlyInterpolateFlatExtrapolate(c(NA, NA,  1, NA, NA),  1) returns c(NA,  1,  1,  1, NA)

   n              <- length(v)
   indexes.non.na <- which(!is.na(v))
   n.not.na       <- length(indexes.non.na)

   # Nothing to interpolate from
   if (n.not.na == 0) return(v)

   # approx() needs at least two points for "linear", so use "constant" when there
   # is only one; rule=2 repeats the end values outside the observed range
   x <- seq_len(n)
   v <- approx(x=x, y=v, xout=x, rule=2, method=if (n.not.na == 1) "constant" else "linear")$y

   if (!is.na(max.extrapolate)){
      # Set to NA the data beyond the permitted extrapolation range
      non.na.range.min <- max(1, (min(indexes.non.na) - max.extrapolate))
      non.na.range.max <- min(n, (max(indexes.non.na) + max.extrapolate))
      v[setdiff(x, non.na.range.min:non.na.range.max)] <- NA
   }

   return(v)
}
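
To see what this does to a real series, here is a small self-contained sketch that applies the same approx() call directly to the South Africa values listed earlier. It doesn't call the function above, the variable names (za.gini, filled, capped) are only for illustration, and the cap of two years is chosen just to make the effect of max.extrapolate visible.

```r
# South Africa's SI.POV.GINI for 1994-2013, retyped from the earlier output
years   <- 1994:2013
za.gini <- rep(NA_real_, length(years))
za.gini[years == 1995] <- 56.59
za.gini[years == 2000] <- 57.77
za.gini[years == 2006] <- 67.40
za.gini[years == 2009] <- 63.14

# Linear interpolation between observations; rule=2 repeats the nearest
# observed value before the first and after the last observation
x      <- seq_along(za.gini)
filled <- approx(x=x, y=za.gini, xout=x, rule=2, method="linear")$y

# Blank out anything more than two years beyond the first/last observation
max.extrapolate <- 2
observed <- which(!is.na(za.gini))
keep     <- x >= min(observed) - max.extrapolate & x <= max(observed) + max.extrapolate
capped   <- ifelse(keep, filled, NA)

data.frame(year=years, supplied=za.gini, filled=round(filled, 2), capped=round(capped, 2))
```

Between 2006 and 2009 the filled values step down evenly (67.40, 65.98, 64.56, 63.14), while 1994 and the years from 2010 onwards simply repeat the nearest supplied figure. With the two-year cap, 2012 and 2013 go back to NA.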


LinearlyInterpolateFlatExtrapolateWBData <- function(country="all", indicator="NY.GNS.ICTR.GN.ZS", start=2000, end=NA, extra=FALSE, max.extrapolate=NA){
   # Linearly interpolates and flat-line extrapolates the World Bank data for the given
   # indicator -- and then returns the data for the given start and end years.
   # Expects a single indicator code.

   require(WDI)

   # Get the data for all available years and order by country and year
   df.wb <- WDI(country=country, indicator=indicator, start=1000, end=3000, extra=extra, cache=NULL)
   df.wb <- df.wb[order(df.wb$country, df.wb$year),]

   # Create a column that indicates whether the data were interpolated/extrapolated.
   # Look the indicator up by name rather than by column position.
   df.wb$source <- ifelse(is.na(df.wb[[indicator]]), "Interpolated/Extrapolated", "Supplied")

   # Linearly interpolate and flat-line extrapolate, one country at a time
   # (the loop variable is named so that it doesn't shadow the country argument)
   for (this.country in unique(df.wb$country)){
      rows <- df.wb$country == this.country
      df.wb[rows, indicator] <- LinearlyInterpolateFlatExtrapolate(df.wb[rows, indicator], max.extrapolate)
   }

   # Chop off the data we don't need
   if (is.na(end)) end <- 3000
   df.wb <- df.wb[start <= df.wb$year & df.wb$year <= end,]

   return(df.wb)
}

Using the above code, let's look at 2009 again, flat-line extrapolating out at most five years:

> df.wb <- LinearlyInterpolateFlatExtrapolateWBData(indicator="SI.POV.GINI", start=2009, end=2009, max.extrapolate=5)
> df.wb <- df.wb[!is.na(df.wb$"SI.POV.GINI"),]
> df.wb[order(df.wb$"SI.POV.GINI", decreasing=TRUE),][1:10,]
      iso2c                  country SI.POV.GINI year                    source
10967    SC               Seychelles      65.770 2009 Interpolated/Extrapolated
4325     KM                  Comoros      64.300 2009 Interpolated/Extrapolated
9293     NA                  Namibia      63.900 2009 Interpolated/Extrapolated
11399    ZA             South Africa      63.140 2009                  Supplied
6485     HN                 Honduras      56.950 2009                  Supplied
13505    ZM                   Zambia      56.775 2009 Interpolated/Extrapolated
4271     CO                 Colombia      56.670 2009                  Supplied
4001     CF Central African Republic      56.300 2009 Interpolated/Extrapolated
3299     BO                  Bolivia      56.290 2009 Interpolated/Extrapolated
6215     GT                Guatemala      55.890 2009 Interpolated/Extrapolated

Ah. I was wondering where Namibia and Zambia had got to.
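
With so many estimates near the top, it can help to tally how much of a ranking is measured rather than filled in. A minimal sketch, rebuilding the ten rows above by hand so that it runs offline (the data frame name top10 is only illustrative):

```r
# The top ten from the output above, retyped by hand
top10 <- data.frame(
   country = c("Seychelles", "Comoros", "Namibia", "South Africa", "Honduras",
               "Zambia", "Colombia", "Central African Republic", "Bolivia", "Guatemala"),
   SI.POV.GINI = c(65.770, 64.300, 63.900, 63.140, 56.950,
                   56.775, 56.670, 56.300, 56.290, 55.890),
   source = c("Interpolated/Extrapolated", "Interpolated/Extrapolated",
              "Interpolated/Extrapolated", "Supplied", "Supplied",
              "Interpolated/Extrapolated", "Supplied", "Interpolated/Extrapolated",
              "Interpolated/Extrapolated", "Interpolated/Extrapolated"),
   stringsAsFactors = FALSE
)

# How many of the ten are supplied figures, and how many are estimates?
table(top10$source)

# Restrict the ranking to supplied values only
top10[top10$source == "Supplied", c("country", "SI.POV.GINI")]
```

Only three of the ten (South Africa, Honduras and Colombia) are supplied figures for 2009; the rest are interpolated or carried over from nearby years, so they deserve a little more caution.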

And how many countries do we now have in our dataset?

> nrow(df.wb)
[1] 108

Much better. So let's have a look at the Gini index over time for the BRICS economies:

require(ggplot2)
countries <- c("Brazil", "China", "India", "Russian Federation", "South Africa")
df.wb <- LinearlyInterpolateFlatExtrapolateWBData(country="all", indicator="SI.POV.GINI", start=1980, end=2013, max.extrapolate=0)
df.wb <- df.wb[df.wb$country %in% countries,]
ggplot(data=df.wb, aes(x=year, y=SI.POV.GINI, group=country, colour=country)) +
      theme_bw() +
      geom_line(linewidth=2) +   # linewidth replaces size for lines from ggplot2 3.4.0; use size=2 on older versions
      ggtitle("Gini Index for the BRICS Economies") +
      xlab("Year") +
      ylab("Gini Index") +
      labs(colour="")

Figure: The Gini index over time for the BRICS economies.

And let's map the Gini index for 2012, again extrapolating out at most five years:

MapWBData <- function(indicator, year, max.extrapolate=NA){
   require(rworldmap)
   df.wb <- LinearlyInterpolateFlatExtrapolateWBData(indicator=indicator, start=year, end=year, max.extrapolate=max.extrapolate)
   sPDF  <- joinCountryData2Map(df.wb, joinCode="ISO2", nameJoinColumn="iso2c")

   map.title <- paste(indicator, "in", year, "(Flat-line extrapolating")
   map.title <- paste(map.title, ifelse(is.na(max.extrapolate),
                                        "from most recent)",
                                        paste("at most", max.extrapolate, "years)")))

   mapCountryData(sPDF, nameColumnToPlot=indicator, colourPalette="heat", missingCountryCol="grey",
                  numCats=100, mapTitle=map.title)
}
MapWBData(indicator="SI.POV.GINI", year=2012, max.extrapolate=5)

Figure: A world map of the Gini index for 2012.

Thanks to Andy South's "Beautiful world maps in R with rworldmap" for introducing me to this dataset and the rworldmap package.