## Generating Random Survival Times From Any Hazard Function

Let's get 1,000 random survival times (for use, perhaps, in a simulation) from a constant hazard function (hazard = 0.001):

hazard_fn = function(t) rep(0.001, length(t))
survival_times = random_survival_times(hazard_fn, 1000)
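The random_survival_times() function itself is supplied by the full code for this post; here is a minimal sketch of one standard way to implement it (the body below is my assumption, not the post's own code), using the inversion method: if U ~ Uniform(0, 1), then the t solving H(t) = -log(U), where H is the cumulative hazard, is a draw from the distribution with hazard h.

```r
# A sketch of random_survival_times() (the body is an assumption; the
# original post supplies its own implementation). Inversion method:
# solve H(t) = -log(U) for t, with H the cumulative hazard and
# U ~ Uniform(0, 1).
random_survival_times <- function(hazard_fn, n, upper=1e6){
  cumulative_hazard <- function(t) integrate(hazard_fn, 0, t)$value
  sapply(runif(n), function(u){
    uniroot(function(t) cumulative_hazard(t) + log(u), c(0, upper))$root
  })
}
```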

And let's check that the Kaplan-Meier curve for these survival times approximates, as expected, the curve P(t) = exp(-0.001t):

require("survival")
plot(survfit(Surv(survival_times, rep(1, length(survival_times))) ~ 1), xlab="Time",
     ylab="Survival Probability", main="Sampled and Expected Survival Curves for h(t) = 0.001")
neg_exp_fn = function(x){exp(-0.001 * x)}
curve(expr=neg_exp_fn, from=0, to=max(survival_times), add=TRUE, col="red", lwd=2)

## Atlanta Crime Map

This interactive map is based on crime data provided by the Atlanta Police Department for 2015 and 2016. That dataset is subject to considerable change over time, as crimes are often reported months after they actually occurred. I last downloaded and incorporated that data into this map on 14 April 2017.

Note that the date shown here for a crime is the date on which it occurred, not the date on which it was reported.

I built this map using Leaflet, Leaflet.Markercluster and jQuery Date Range Picker Plugin.

## Fitting an Exponential Curve to a Stepwise Survival Curve

Let's fit a function of the form f(t) = exp(-λt) to a stepwise survival curve (e.g. a Kaplan-Meier curve).

Here's the stepwise survival curve we'll be using in this demonstration:

# The data that describes the stepwise survival curve.
surv_time <- c(0.2,  0.5, 1.2,   2,   5,  6.5,    9,   15, 17.3,   35,   80,  200)
surv_prob <- c(  1, 0.94, 0.9, 0.8, 0.6, 0.57, 0.41,  0.2, 0.18, 0.07, 0.02, 0.01)

# Plot the curve.
plot_stepwise_surv_fn <- function(time, prob, col="black"){
  time <- c(0, rep(time, each=2))
  prob <- c(1, 1, rep(prob, each=2)[1:(2 * length(prob) - 1)])
  plot(x=time, y=prob, ylim=c(0, 1), type="l", col=col, xlab="Time", ylab="Probability")
}
plot_stepwise_surv_fn(surv_time, surv_prob)
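One simple way to estimate the rate — a sketch using nls(); the full post may proceed differently — is a least-squares fit of exp(-λt) (so that a positive λ is the decay rate) to the points that define the stepwise curve:

```r
# Least-squares fit of exp(-lambda * t) to the points defining the stepwise
# curve above (a sketch -- the full post may fit differently).
fit        <- nls(surv_prob ~ exp(-lambda * surv_time), start=list(lambda=0.1))
lambda_hat <- coef(fit)[["lambda"]]

# Overlay the fitted exponential on the stepwise curve plotted above.
curve(exp(-lambda_hat * x), from=0, to=max(surv_time), add=TRUE, col="red", lwd=2)
```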

## Calculating Kaplan-Meier Survival Curves and Their Confidence Intervals in SQL Server

I provide here a SQL Server script to calculate Kaplan-Meier survival curves and their confidence intervals (plain, log and log-log) for time-to-event data.

### Example 1: Customer Attrition, Ungrouped, Without Censoring

Suppose a web-application company has seen its ten customers cancel their subscriptions after 0.5, 1, 10, 10, 11, 13.5, 14, 19, 19.5 and 30 months (from the start of their respective subscriptions). Based on this data, we want to estimate the probability of a new customer remaining a customer more than, say, 12 months—and we want a confidence interval around that estimate.
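With no censoring, the Kaplan-Meier estimate reduces to the empirical survival fraction, so the answer can be sanity-checked directly: five of the ten customers lasted beyond 12 months, giving an estimate of 0.5. In R (an independent cross-check, not part of the SQL script):

```r
# Cross-check of the 12-month estimate using R's survival package.
require("survival")
months <- c(0.5, 1, 10, 10, 11, 13.5, 14, 19, 19.5, 30)
fit    <- survfit(Surv(months, rep(1, length(months))) ~ 1)
summary(fit, times=12)$surv  # 0.5
```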

We create the input table:

create table #input
(
    [Group] nvarchar(255),
    [TimeToEvent] float,
    [Event] int
);
insert into #input values('Web-App Ltd',  0.5, 1);
insert into #input values('Web-App Ltd',  1.0, 1);
insert into #input values('Web-App Ltd', 10.0, 1);
insert into #input values('Web-App Ltd', 10.0, 1);
insert into #input values('Web-App Ltd', 11.0, 1);
insert into #input values('Web-App Ltd', 13.5, 1);
insert into #input values('Web-App Ltd', 14.0, 1);
insert into #input values('Web-App Ltd', 19.0, 1);
insert into #input values('Web-App Ltd', 19.5, 1);
insert into #input values('Web-App Ltd', 30.0, 1);
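The full script is what this post provides, but the core product-limit calculation can be sketched in a few lines of T-SQL (a sketch only — the CTE and column names beyond those in #input are my own, and the confidence-interval arithmetic is omitted): at each distinct event time t, multiply the running survival probability by (1 - d/n), where d is the number of events at t and n the number still at risk.

```sql
-- Sketch of the product-limit (Kaplan-Meier) estimate over #input.
with risk_table as
(
    select i.[Group],
           i.[TimeToEvent] as t,
           sum(i.[Event]) as d,                  -- events at time t
           (select count(*) from #input i2       -- number still at risk at t
            where i2.[Group] = i.[Group]
              and i2.[TimeToEvent] >= i.[TimeToEvent]) as n
    from #input i
    group by i.[Group], i.[TimeToEvent]
)
select [Group], t,
       -- Running product of (1 - d/n), computed as exp(sum(log(.))).
       -- The case handles the final factor of zero when everyone at risk
       -- has the event (log(0) is undefined).
       case when min(1.0 - d * 1.0 / n)
                 over (partition by [Group] order by t) = 0
            then 0.0
            else exp(sum(log(nullif(1.0 - d * 1.0 / n, 0)))
                     over (partition by [Group] order by t))
       end as SurvivalProbability
from risk_table
where d > 0
order by [Group], t;
```

For the ten uncensored times above, this yields 0.5 at the last event time before month 12, matching the empirical fraction.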

## Calling SPSS Modeler from R

SPSS Modeler streams can be executed from R via input files and command-line calls. It's a hacky technique, but it works. And it's an easy way to make use of Modeler's excellent Expert Modeler functionality.

First, create an example of the data file that you want Modeler to read in. For example, if you want to run Modeler on a single time series, your data file will probably be a text file comprising a date column and a value column.

Then create and save a Modeler stream that reads in that file, fits the required model and produces an output file. This is fairly easy so I won't cover it here. But this is how it might look:
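Back in R, the pattern then looks something like this. Everything here is illustrative: the file paths are hypothetical, and the batch executable's name and flags vary by Modeler version, so check IBM's batch-mode documentation before relying on them.

```r
# Illustrative only: paths, the executable name and its flags are
# assumptions -- consult your Modeler version's batch-mode documentation.
my.series <- data.frame(date  = seq(as.Date("2010-01-01"), by="month", length.out=24),
                        value = rnorm(24))

# 1. Write the input file in the layout the stream expects.
write.csv(my.series, "C:/temp/modeler_input.csv", row.names=FALSE)

# 2. Run the saved stream in batch mode via a command-line call.
system(paste('"C:/Program Files/IBM/SPSS/Modeler/18/bin/modelerrun.exe"',
             '-stream "C:/temp/forecast.str" -execute'))

# 3. Read Modeler's output file back into R.
forecast <- read.csv("C:/temp/modeler_output.csv")
```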

## Beta Distribution PDF Grapher

Here's a D3-rendered graph of the probability density function (PDF) of the beta distribution. Move the sliders to change the shape parameters or the scale of the y-axis.
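For reference, the density being graphed is

f(x; α, β) = x^(α−1) (1 − x)^(β−1) / B(α, β),   for 0 ≤ x ≤ 1,

where B(α, β) = Γ(α)Γ(β) / Γ(α + β) is the beta function, and α and β are the two shape parameters controlled by the sliders.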

## Installing Rpy2

Here's how I installed the rpy2 module for communicating between R and Python. I did this for R version 3.0.2, Python version 3.4.1, and rpy2 version 2.4.4 on a 64-bit machine running Windows 7.

### Setting System Variables

First, I got the full pathname of my R executable by right-clicking the R icon in my Start menu and selecting Properties. It was: C:\Program Files\R\R-3.0.2\bin\x64\Rgui.exe. And it only took a moment of poking around to find the full pathname of my Python executable: C:\Anaconda3\python.exe.

Time to look at my system variables. I right-clicked on Computer in my Start menu and selected Properties; I then clicked on Advanced System Settings in the window that appeared. In the Advanced tab of the System Properties window, I clicked the Environment Variables button. The lower half of the resulting Environment Variables window showed my system variables.

The first system variable I had to deal with was Path. It didn't include the directory in which my R executable sits, so I added it: C:\Program Files\R\R-3.0.2\bin\x64\.

I then added an R_HOME system variable and set it to the top level directory of my R installation: C:\Program Files\R\R-3.0.2\.

And I added an R_USER system variable and set it to the directory that the rpy2 module would install into: C:\Anaconda3\Lib\site-packages\rpy2\.

## Estimating the Distance Between GPS Points While Accounting for Circular Error Probable (CEP)

Given two GPS points recorded as being d metres apart with circular error probable (CEP) of c1 and c2 metres respectively, the true distance between the recorded points has the distribution

σ√C

where:

• C is a non-central chi-square random variable having 2 degrees of freedom and non-centrality parameter d² / σ², and

• σ² = (c1² + c2²) / (2 ln 2), since a CEP of c corresponds to a circular normal error with standard deviation c / √(2 ln 2) in each coordinate.

(I give a proof of this easy result below.)

## Generating Random Numbers from Any Non-Negative Function

Here's some R code that generates random numbers from the probability distribution described by a given non-negative function. This can be useful when running simulations or generating datasets for testing purposes.

For example, suppose you want to generate random data from a distribution that looks something like this back-of-an-envelope sketch:
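Whatever the target shape, the standard rejection-sampling approach can be sketched as follows (the function and argument names are mine; the post's own code may differ): propose points uniformly over the support, and accept each with probability proportional to the function's height there.

```r
# Rejection sampling from an arbitrary non-negative function on
# [lower, upper] (a sketch; the original post's implementation may differ).
sample_from_fn <- function(fn, n, lower, upper){
  # Approximate the function's maximum on a fine grid.
  max_f   <- max(fn(seq(lower, upper, length.out=1000)))
  samples <- numeric(0)
  while(length(samples) < n){
    x       <- runif(n, lower, upper)
    keep    <- runif(n, 0, max_f) < fn(x)   # accept in proportion to fn(x)
    samples <- c(samples, x[keep])
  }
  samples[1:n]
}
```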

## Dragging Outliers Towards the Mean or Median

Here's an example of how to use R to smoothly drag outliers towards the mean when they lie more than a given number of standard deviations from the mean (or median absolute deviations from the median, or whatever), such that the most extreme outliers are dragged in the most. I don't really agree with mangling data in this way, and I think the task is a trivially simple one, but I've often been asked how to do it… so here's how you might go about it.

First, for demonstration purposes, I create a dataset with some obvious outliers:

> x <- c(-43, -2, -1, 0, 0, 0.5, 2.5, 3, 3, 5, 7, 8.2, 15, 16, 70, 99)
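The drag_towards_zero() helper isn't shown above; here's a sketch consistent with the output below. Values whose magnitude is within the cutoff pass through unchanged, and the excess beyond the cutoff is compressed logarithmically, with larger severity values compressing more.

```r
# A sketch of drag_towards_zero() (the original post defines its own
# version): scaled values within the cutoff pass through unchanged; the
# excess beyond the cutoff is shrunk logarithmically.
drag_towards_zero <- function(z, cutoff, severity){
  ifelse(z <= cutoff, z, cutoff + log(1 + severity * (z - cutoff)) / severity)
}
```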

I drag the outliers towards the mean using the standard deviation:

> cutoff     <- 1.5
> severity   <- 2.5
> x_adjusted <- mean(x) + drag_towards_zero(abs(scale(x)), cutoff, severity) * sd(x) * sign(scale(x))
> round(x_adjusted, 1)
       [,1]
 [1,] -41.6
 [2,]  -2.0
 [3,]  -1.0
 [4,]   0.0
 [5,]   0.0
 [6,]   0.5
 [7,]   2.5
 [8,]   3.0
 [9,]   3.0
[10,]   5.0
[11,]   7.0
[12,]   8.2
[13,]  15.0
[14,]  16.0
[15,]  66.9
[16,]  77.1
attr(,"scaled:center")
[1] 11.45
attr(,"scaled:scale")
[1] 31.73232

## Graphing Survival and Hazard Functions

Here's some R code to graph the basic survival-analysis functions—S(t), f(t), F(t), h(t) or H(t)—derived from any of their definitions.

For example:

# Define time points at which to plot
t <- seq(0, 4, 0.01)

# Define h(t), the hazard function
hazard_fn <- function(t){2*t}

# Graph S(t), the survival function derived from h(t)
y <- apply_survival_function(t, hazard_fn, supplied_fn_type="h", fn_type_to_apply="S")
plot(x=t, y=y, xlim=c(0, max(t)), ylim=c(0, max(y)), main="S(t)", ylab="Survival Probability", type="l")
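The apply_survival_function() routine is what this post provides; as a sketch of just the h-to-S conversion used above (the original handles all the pairings), use S(t) = exp(−H(t)), with the cumulative hazard H obtained by numerical integration:

```r
# A sketch of apply_survival_function() covering only the h -> S case used
# above (the original post's version handles all conversions):
# S(t) = exp(-H(t)), where H(t) is the integral of the hazard from 0 to t.
apply_survival_function <- function(t, fn, supplied_fn_type, fn_type_to_apply){
  stopifnot(supplied_fn_type == "h", fn_type_to_apply == "S")
  sapply(t, function(u) exp(-integrate(fn, 0, u)$value))
}
```

For h(t) = 2t this gives S(t) = exp(−t²), so the plotted curve decays rapidly towards zero.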

## Calculating Influence Matrices

I present here R code to calculate the influence that a set of points have upon each other, where influence is a function of (1) the distance between the points and (2) the inherent influence of the points.

### Example: Light Bulbs Shining on Light Bulbs

Consider, for example, five light bulbs with brightness given by this vector:

> brightness
[1] 3 4 2 1 2

Now, suppose that the distance between the light bulbs (in metres) is given by this distance matrix:

> distance.matrix
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    6   11    4    5
[2,]    6    0   12    9   10
[3,]   11   12    0   14   15
[4,]    4    9   14    0    4
[5,]    5   10   15    4    0

This matrix tells us, for instance, that bulbs two and three are 12 metres apart. Note that the distance matrix is symmetrical about a zero diagonal. This corresponds with the ordinary notion of distance: Any point is a zero distance from itself, and the distance from point A to point B equals the distance from point B to point A. However, my code permits non-symmetric distances: If bulb two is "uphill" from bulb three, the distance from bulb three to bulb two ([3, 2]) may be greater than the distance from bulb two to bulb three ([2, 3]).
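As a concrete influence function — an assumption for this sketch; the code lets you supply your own — take inverse-square decay, as befits light: the influence of bulb j on bulb i is brightness[j] / distance[i, j]².

```r
# Influence of bulb j on bulb i under an assumed inverse-square law:
# brightness[j] / distance[i, j]^2, with self-influence set to zero
# rather than infinity.
brightness      <- c(3, 4, 2, 1, 2)
distance.matrix <- matrix(c( 0,  6, 11,  4,  5,
                             6,  0, 12,  9, 10,
                            11, 12,  0, 14, 15,
                             4,  9, 14,  0,  4,
                             5, 10, 15,  4,  0), nrow=5, byrow=TRUE)

influence.matrix <- sweep(1 / distance.matrix^2, 2, brightness, "*")
diag(influence.matrix) <- 0
```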

## Calculating a Distance Matrix for Geographic Points Using R

Here's an example of how to calculate a distance matrix for geographic points (expressed as decimal latitudes and longitudes) using R:

> df.cities <- data.frame(name = c("New York City", "Chicago", "Los Angeles", "Atlanta"),
+                         lat  = c(       40.75170,  41.87440,      34.05420,  33.75280),
+                         lon  = c(      -73.99420, -87.63940,    -118.24100, -84.39360))
> round(GeoDistanceInMetresMatrix(df.cities) / 1000)
              New York City Chicago Los Angeles Atlanta
New York City             0    1148        3945    1204
Chicago                1148       0        2808     945
Los Angeles            3945    2808           0    3116
Atlanta                1204     945        3116       0

For example, the above distance matrix shows that the straight-line distance—accounting for curvature of the earth—between Los Angeles and NYC is 3,945 km.
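GeoDistanceInMetresMatrix() is defined in the full post; here is a compact stand-in using the haversine great-circle formula (spherical earth of radius 6,371 km, so results may differ from the table above by a few kilometres):

```r
# A stand-in for GeoDistanceInMetresMatrix() (the post's own version may
# use a different earth model): pairwise haversine distances in metres.
GeoDistanceInMetresMatrix <- function(df){
  hav <- function(lat1, lon1, lat2, lon2){
    rad <- pi / 180
    a   <- sin((lat2 - lat1) * rad / 2)^2 +
           cos(lat1 * rad) * cos(lat2 * rad) * sin((lon2 - lon1) * rad / 2)^2
    6371000 * 2 * asin(sqrt(a))   # great-circle distance in metres
  }
  n <- nrow(df)
  m <- outer(1:n, 1:n, function(i, j) hav(df$lat[i], df$lon[i], df$lat[j], df$lon[j]))
  dimnames(m) <- list(df$name, df$name)
  m
}
```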

## R's Crazy If-Else Parser Quirk

I present here what I consider to be a fiendishly weird quirk in R's code parser.

Okay, so what do you expect the following code to do?

if (TRUE){
  if (TRUE) print(1)
  else print(0)
}

Run it and you'll see that it prints the number 1, as you would expect.

Okay, now what happens when you remove the top-level if block?
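Without the enclosing braces, the parser treats `if (TRUE) print(1)` as a complete statement at the top level, so the `else` that follows is a syntax error:

```r
# At the top level, the same two lines no longer parse: the if statement is
# complete after print(1), so the stand-alone "else" is unexpected.
try(parse(text = "if (TRUE) print(1)\nelse print(0)"))
# Error: <text>:2:1: unexpected 'else'
```

Inside a braced block the parser keeps reading, so the `else` is attached to the preceding `if`; at the top level each line must stand alone, which is why the wrapped version runs and the unwrapped version doesn't.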

## Debugging with the Superassignment Operator

I show here how the <<- assignment operator may be used to debug R functions by writing local variables into the global environment.

Consider the following code:

PlotExponentialSmoother <- function(x){
  exp.smoother  <- HoltWinters(x, beta=FALSE, gamma=FALSE)
  plot.subtitle <- paste("Sum of Squared Errors:", exp.smoother$sse)
  plot(exp.smoother, sub=plot.subtitle)
}

PlotExponentialSmoother(window(sunspots, 1950))

Running that code produces a graph of sunspot activity since 1950 and an exponential smoother of those data. But there's a problem: The graph subtitle doesn't come out properly.
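Here the superassignment operator earns its keep (a sketch of the debugging step; debug.exp.smoother is a name of my choosing): copy the local model object into the global environment, rerun the call, and inspect the object at the console.

```r
# Add a superassignment to expose the local variable, then rerun the call.
PlotExponentialSmoother <- function(x){
  exp.smoother <- HoltWinters(x, beta=FALSE, gamma=FALSE)
  debug.exp.smoother <<- exp.smoother   # copy into the global environment
  plot.subtitle <- paste("Sum of Squared Errors:", exp.smoother$sse)
  plot(exp.smoother, sub=plot.subtitle)
}

# After the call, str(debug.exp.smoother) shows that the fitted model stores
# its error sum of squares as SSE (capitals), so exp.smoother$sse is NULL --
# which is why the subtitle is missing its number.
```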

## R's Flavours of Stacked Dot Plots

The humble stacked dot plot is, I think, often preferable to the histogram as a means of graphing distributions of small data sets. To gauge how closely a histogram approximates an underlying population distribution, one must take into account the number of points that the histogram is based on (the sample size). Many readers fail to do this—and all too often the sample size is not provided within the graph. However, a dot plot lets any reader make an immediate guess at how closely the graph follows the shape of the underlying distribution.

Several R functions implement stacked dot plots. But which one to use?

There's this one from the base graphics package:

stripchart(faithful$waiting, method="stack", offset=0.5, pch=1)

## Creating Datasets Interactively

The following Create2DimData() R function allows two-dimensional datasets (e.g. of heights and weights) to be created by clicking with a mouse within a plot. This can be useful if you need to throw together a dataset for demonstration purposes.

For example, try calling Create2DimData() like this:

df.points <- Create2DimData(xlim=c(0,10), ylim=c(0,5))
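The Create2DimData() function is provided by the full post; a minimal sketch of how it might work (the original version is more polished) uses base R's locator() to record mouse clicks until you right-click or press Esc:

```r
# A sketch of Create2DimData() (the original post's version is more
# polished): open an empty plot and record mouse clicks with locator().
Create2DimData <- function(xlim, ylim){
  plot(NULL, xlim=xlim, ylim=ylim, xlab="x", ylab="y")
  as.data.frame(locator(type="p"))   # plot each click as it lands
}
```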

## Using Mahalanobis Distance to Find Outliers

R's mahalanobis() function provides a simple means of detecting outliers in multidimensional data.

For example, suppose you have a dataframe of heights and weights:

hw <- data.frame(Height.cm=c(164, 167, 168, 169, 169, 170, 170, 170, 171, 172, 172, 173, 173, 175, 176, 178),
                 Weight.kg=c( 54,  57,  58,  60,  61,  60,  61,  62,  62,  64,  62,  62,  64,  56,  66,  70))

When plotting these data (generated for this example using an interactive plot), you could mark as outliers those points that are, for instance, more than two (sample) standard deviations from the mean height or mean weight:

is.height.outlier <- abs(scale(hw$Height.cm)) > 2
is.weight.outlier <- abs(scale(hw$Weight.kg)) > 2
pch               <- (is.height.outlier | is.weight.outlier) * 16
plot(hw, pch=pch)
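The per-axis rule above treats each dimension separately; mahalanobis() instead accounts for the correlation between height and weight, flagging points that are unusual as combinations. A common rule (the threshold is my choice for this sketch) flags points whose squared Mahalanobis distance exceeds the 97.5% quantile of a chi-squared distribution with 2 degrees of freedom:

```r
# Squared Mahalanobis distances of each point from the sample mean, using
# the sample covariance; flag those beyond the chi-squared(2) 97.5% quantile.
d2         <- mahalanobis(hw, colMeans(hw), cov(hw))
is.outlier <- d2 > qchisq(0.975, df=2)
plot(hw, pch=is.outlier * 16)
```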

## Using the Median Absolute Deviation to Find Outliers

One of the commonest ways of finding outliers in one-dimensional data is to mark as a potential outlier any point that is more than two standard deviations, say, from the mean (I am referring to sample means and standard deviations here and in what follows). But the presence of outliers is likely to have a strong effect on the mean and the standard deviation, making this technique unreliable.

For an example of this well-known problem, try running the following R code:

require(BHH2)
x <- c(1, 2, 3, 3, 4, 4, 4, 5, 5.5, 6, 6, 6.5, 7, 7, 7.5, 8, 9, 12, 52, 90)
dotPlot(x)
mean.x <- mean(x)
sd.x   <- sd(x)
lines(rep(mean.x         , 2), c(0.2, 0.25)) # Vertical line at mean
lines(rep(mean.x + 2*sd.x, 2), c(0.2, 0.25)) # Vertical line at mean + 2 SD
text(mean.x         , 0.3, expression(bar(x)))
text(mean.x + 2*sd.x, 0.3, expression(paste(bar(x), " + 2s")))
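The median and MAD, by contrast, barely move in response to the two extreme points. Using R's mad() (which scales by 1.4826 so that it estimates the standard deviation for normal data), a three-MAD rule (my choice of cutoff for this example) flags only the genuine outliers:

```r
# Flag points more than three (scaled) median absolute deviations from the
# median; the cutoff of three is my choice for this example.
is.outlier <- abs(x - median(x)) / mad(x) > 3
x[is.outlier]  # 52 and 90
```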

## Exploring the World Bank's Gini Index Data with R

Let's have a look at the Gini index data available from the World Bank through R's WDI package.

For those who haven't met it before, the Gini index is an elegantly constructed measure of, typically, income inequality. A Gini index of 0 represents a perfectly equal economy; a Gini index of 100 represents a perfectly unequal economy. (To find out more about the Gini index, have a look at my Gini index calculator.)

Let's search for the Gini index within the World Bank's datasets:

require(WDI)
WDIsearch('gini')
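The search turns up the indicator code SI.POV.GINI, which can then be downloaded with WDI() (this call hits the World Bank API, so it needs a network connection):

```r
# Download the Gini series for all countries (requires internet access).
require(WDI)
gini <- WDI(indicator="SI.POV.GINI", start=2000, end=2015)
head(gini)
```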

## R Code to Remove Duplicates from a SQL Server Database

It's easy to remove duplicate rows from an R dataframe using the unique() function:

> df.duplicates <- data.frame(a=c(1,2,3,3), b=c(1,2,3,3))
> df.duplicates
  a b
1 1 1
2 2 2
3 3 3
4 3 3
> unique(df.duplicates)
  a b
1 1 1
2 2 2
3 3 3

But this can be slow for large dataframes. Hideously slow, even.
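One set-based alternative, sketched here with a hypothetical table dbo.MyTable and duplicate-defining columns a and b (the post's own script may differ), deletes all but one row of each duplicate group using ROW_NUMBER():

```sql
-- Number the rows within each duplicate group, then delete every row
-- numbered 2 or higher. Deleting through the CTE removes rows from the
-- underlying table.
with numbered as
(
    select *,
           row_number() over (partition by a, b order by (select 0)) as rn
    from dbo.MyTable
)
delete from numbered
where rn > 1;
```

The `order by (select 0)` is the usual trick when it doesn't matter which row of each group survives.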