Written by Peter Rosenmai on 14 Apr 2017.

Let's get 1,000 random survival times (for use, perhaps, in a simulation) from a constant hazard function (hazard = 0.001):

hazard_fn = function(t) rep(0.001, length(t))

survival_times = random_survival_times(hazard_fn, 1000)

And let's check that the Kaplan-Meier curve for these survival times approximates, as expected, the curve P(t) = exp(-0.001t):

require("survival")

plot(survfit(Surv(survival_times, rep(1, length(survival_times)))~1), xlab="Time", ylab="Survival Probability",

main="Sampled and Expected Survival Curves for h(t) = 0.001")

neg_exp_fn = function(x){exp(-0.001 * x)}

curve(expr=neg_exp_fn, from=0, to=max(survival_times), add=TRUE, col="red", lwd=2)
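random_survival_times() is defined in the full post. For the special case of a constant hazard, survival times are simply exponential with rate equal to the hazard, so base R's rexp() gives an independent sanity check (a sketch of mine, not the post's code):

```r
# For a constant hazard h, S(t) = exp(-h t), i.e. survival times are
# exponentially distributed with rate h. Base R can generate these directly:
set.seed(1)
survival_times <- rexp(1000, rate = 0.001)
mean(survival_times)  # should be near the theoretical mean, 1 / 0.001 = 1000
```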

Written by Peter Rosenmai on 27 Dec 2016. Last revised 14 Apr 2017.

This interactive map is based on crime data provided by the Atlanta Police Department for 2015 and 2016. That dataset is subject to considerable change over time, as crimes are often reported months after they actually occurred. I last downloaded and incorporated that data into this map on 14 April 2017.

Note that the date shown here for a crime is the date on which it occurred, not the date on which it was reported.

I built this map using Leaflet, Leaflet.Markercluster and jQuery Date Range Picker Plugin.

Written by Peter Rosenmai on 27 Aug 2016.

Let's fit a function of the form f(*t*) = exp(λ*t*) to a stepwise survival curve (e.g. a Kaplan-Meier curve).

Here's the stepwise survival curve we'll be using in this demonstration:

# The data that describes the stepwise survival curve.

surv_time <- c(0.2, 0.5, 1.2, 2, 5, 6.5, 9, 15, 17.3, 35, 80, 200)

surv_prob <- c( 1, 0.94, 0.9, 0.8, 0.6, 0.57, 0.41, 0.2, 0.18, 0.07, 0.02, 0.01)

# Plot the curve.

plot_stepwise_surv_fn <- function(time, prob, col="black"){

time <- c(0, rep(time, each=2))

prob <- c(1, 1, rep(prob, each=2)[1:(2 * length(prob) - 1)])

plot(x=time, y=prob, ylim=c(0, 1), type="l", col=col, xlab="Time", ylab="Probability")

}

plot_stepwise_surv_fn(surv_time, surv_prob)

Written by Peter Rosenmai on 1 Jan 2016.

I provide here a SQL Server script to calculate Kaplan-Meier survival curves and their confidence intervals (plain, log and log-log) for time-to-event data.

Suppose a web-application company has seen its ten customers cancel their subscriptions after 0.5, 1, 10, 10, 11, 13.5, 14, 19, 19.5 and 30 months (from the start of their respective subscriptions). Based on this data, we want to estimate the probability of a new customer remaining a customer for more than, say, 12 months—and we want a confidence interval around that estimate.

We create the input table:

create table #input

(

[Group] nvarchar(255),

[TimeToEvent] float,

[Event] int

);

insert into #input values('Web-App Ltd', 0.5, 1);

insert into #input values('Web-App Ltd', 1.0, 1);

insert into #input values('Web-App Ltd', 10.0, 1);

insert into #input values('Web-App Ltd', 10.0, 1);

insert into #input values('Web-App Ltd', 11.0, 1);

insert into #input values('Web-App Ltd', 13.5, 1);

insert into #input values('Web-App Ltd', 14.0, 1);

insert into #input values('Web-App Ltd', 19.0, 1);

insert into #input values('Web-App Ltd', 19.5, 1);

insert into #input values('Web-App Ltd', 30.0, 1);
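Because every customer in this toy dataset has an observed cancellation (no censoring), the Kaplan-Meier estimate at 12 months reduces to the simple empirical proportion still subscribed, which makes for an easy cross-check of the SQL script's output. A base-R sketch, using the same ten times:

```r
# With no censoring, the Kaplan-Meier estimate S(12) is just the proportion
# of customers whose subscriptions lasted longer than 12 months.
months <- c(0.5, 1, 10, 10, 11, 13.5, 14, 19, 19.5, 30)
s_12 <- mean(months > 12)
s_12  # 0.5: five of the ten customers remained customers beyond 12 months
```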

Written by Peter Rosenmai on 12 Dec 2015.

SPSS Modeler streams can be executed from R via input files and command-line calls. It's a hacky technique, but it works. And it's an easy way to make use of Modeler's excellent Expert Modeler functionality.

First, create an example of the data file that you want Modeler to read in. For example, if you want to run Modeler on a single time series, your data file will probably be a text file comprising a date column and a value column.

Then create and save a Modeler stream that reads in that file, fits the required model and produces an output file. This is fairly easy so I won't cover it here. But this is how it might look:

Written by Peter Rosenmai on 13 Jan 2015.

Written by Peter Rosenmai on 1 Jan 2015.

Here's a D3-rendered graph of the probability density function (PDF) of the beta distribution. Move the sliders to change the shape parameters or the scale of the y-axis.

Written by Peter Rosenmai on 16 Nov 2014.

Here's how I installed the rpy2 module for communicating between R and Python. I did this for R version 3.0.2, Python version 3.4.1, and rpy2 version 2.4.4 on a 64-bit machine running Windows 7.

First, I got the full pathname of my R executable by right-clicking the *R* icon in my *Start* menu and selecting *Properties*. It was: C:\Program Files\R\R-3.0.2\bin\x64\Rgui.exe. And it only took a moment of poking around to find the full pathname of my Python executable: C:\Anaconda3\python.exe.

Time to look at my system variables. I right-clicked on *Computer* in my *Start* menu and selected *Properties*; I then clicked on *Advanced System Settings* in the window that appeared. In the *Advanced* tab of the *System Properties* window, I clicked the *Environment Variables* button. The lower half of the resulting *Environment Variables* window showed my system variables.

The first system variable I had to deal with was *Path*. It didn't include the directory in which my R executable sits, so I added it: C:\Program Files\R\R-3.0.2\bin\x64\.

I then added an *R_HOME* system variable and set it to the top level directory of my R installation: C:\Program Files\R\R-3.0.2\.

And I added an *R_USER* system variable and set it to the directory that the *rpy2* module would install into: C:\Anaconda3\Lib\site-packages\rpy2\.

Written by Peter Rosenmai on 27 Sep 2014. Last revised 12 Oct 2014.

Given two GPS points recorded as being *d* metres apart with circular error probable (CEP) of *c_{1}* and

where:

- *C* is a non-central chi-square random variable having 2 degrees of freedom and non-centrality parameter *d*^{2}/σ^{2}

(I give a proof of this easy result below.)

Written by Peter Rosenmai on 2 Aug 2014.

Here's some R code that generates random numbers from the probability distribution described by a given non-negative function. This can be useful when running simulations or generating datasets for testing purposes.

For example, suppose you want to generate random data from a distribution that looks something like this back-of-an-envelope sketch:
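A standard way to generate such draws, and quite possibly the approach the full post takes, is rejection sampling: scatter candidate points uniformly under a bounding box and keep those that fall under the curve. A minimal sketch, assuming the function is supported on a known interval with a known upper bound (the function and bounds below are illustrative):

```r
# Rejection sampling from an unnormalised, non-negative function f on [a, b].
# f_max must be an upper bound for f on that interval.
rdensity <- function(n, f, a, b, f_max) {
  samples <- numeric(0)
  while (length(samples) < n) {
    x <- runif(n, a, b)                # candidate x values
    keep <- runif(n, 0, f_max) < f(x)  # accept with probability f(x) / f_max
    samples <- c(samples, x[keep])
  }
  samples[1:n]
}

# Example: a triangular bump peaking at x = 1.
set.seed(1)
x <- rdensity(10000, function(t) pmax(0, 1 - abs(t - 1)), 0, 2, 1)
```

The acceptance rate equals the area under the curve divided by the area of the bounding box, so a tight f_max keeps the loop short.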

Written by Peter Rosenmai on 21 Jun 2014.

Here's an example of how to use R to smoothly drag towards the mean outliers that are more than a given number of standard deviations from the mean—or median absolute deviations from the median, or whatever—so that the most extreme outliers are dragged in the most. I don't really agree with mangling data in this way and I think the task is a trivially simple one, but I've often been asked how to do it… so here's how you might go about it.

First, for demonstration purposes, I create a dataset with some obvious outliers:

> x <- c(-43, -2, -1, 0, 0, 0.5, 2.5, 3, 3, 5, 7, 8.2, 15, 16, 70, 99)

I drag the outliers towards the mean using the standard deviation:

> cutoff <- 1.5

> severity <- 2.5

> x_adjusted <- mean(x) + drag_towards_zero(abs(scale(x)), cutoff, severity) * sd(x) * sign(scale(x))

> round(x_adjusted, 1)

[,1]

[1,] -41.6

[2,] -2.0

[3,] -1.0

[4,] 0.0

[5,] 0.0

[6,] 0.5

[7,] 2.5

[8,] 3.0

[9,] 3.0

[10,] 5.0

[11,] 7.0

[12,] 8.2

[13,] 15.0

[14,] 16.0

[15,] 66.9

[16,] 77.1

attr(,"scaled:center")

[1] 11.45

attr(,"scaled:scale")

[1] 31.73232

Written by Peter Rosenmai on 11 Apr 2014. Last revised 13 Jun 2015.

Here's some R code to graph the basic survival-analysis
functions—s(t), S(t), f(t), F(t), h(t) or H(t)—derived from *any* of their definitions.

For example:

# Define time points at which to plot

t <- seq(0, 4, 0.01)

# Define h(t), the hazard function

hazard_fn <- function(t){2*t}

# Graph S(t), the survival function derived from h(t)

y <- apply_survival_function(t, hazard_fn, supplied_fn_type="h", fn_type_to_apply="S")

plot(x=t, y=y, xlim=c(0, max(t)), ylim=c(0, max(y)), main="S(t)", ylab="Survival Probability", type="l")

Written by Peter Rosenmai on 22 Feb 2014.

I present here R code to calculate the influence that a set of points have upon each other, where influence is a function of (1) the distance between the points and (2) the inherent influence of the points.

Consider, for example, five light bulbs with brightness given by this vector:

> brightness

[1] 3 4 2 1 2

Now, suppose that the distance between the light bulbs (in metres) is given by this distance matrix:

> distance.matrix

[,1] [,2] [,3] [,4] [,5]

[1,] 0 6 11 4 5

[2,] 6 0 12 9 10

[3,] 11 12 0 14 15

[4,] 4 9 14 0 4

[5,] 5 10 15 4 0

This matrix tells us, for instance, that bulbs two and three are 12 metres apart. Note that the distance matrix is symmetrical about a zero diagonal. This corresponds with the ordinary notion of distance: Any point is a zero distance from itself, and the distance from point A to point B equals the distance from point B to point A. However, my code permits non-symmetric distances: If bulb two is "uphill" from bulb three, the distance from bulb two to bulb three may differ from the distance from bulb three to bulb two; that is, element [2, 3] of the matrix need not equal element [3, 2].
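The influence function itself is defined in the full post; to make the idea concrete, here is one illustrative choice of my own (not necessarily the post's): inherent influence decaying with the square of distance.

```r
brightness <- c(3, 4, 2, 1, 2)
distance.matrix <- rbind(c( 0,  6, 11,  4,  5),
                         c( 6,  0, 12,  9, 10),
                         c(11, 12,  0, 14, 15),
                         c( 4,  9, 14,  0,  4),
                         c( 5, 10, 15,  4,  0))

# influence[i, j]: influence of bulb i on bulb j, falling off with the
# square of the distance between them.
influence <- brightness / distance.matrix^2
diag(influence) <- 0  # a bulb exerts no influence on itself

round(influence[1, 2], 4)  # bulb one's influence on bulb two: 3 / 6^2
```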

Written by Peter Rosenmai on 30 Jan 2014.

Here's an example of how to calculate a distance matrix for geographic points (expressed as decimal latitudes and longitudes) using R:

> df.cities <- data.frame(name = c("New York City", "Chicago", "Los Angeles", "Atlanta"),

+ lat = c( 40.75170, 41.87440, 34.05420, 33.75280),

+ lon = c( -73.99420, -87.63940, -118.24100, -84.39360))

> round(GeoDistanceInMetresMatrix(df.cities) / 1000)

New York City Chicago Los Angeles Atlanta

New York City 0 1148 3945 1204

Chicago 1148 0 2808 945

Los Angeles 3945 2808 0 3116

Atlanta 1204 945 3116 0

For example, the above distance matrix shows that the great-circle distance—that is, the shortest distance over the surface of the earth—between Los Angeles and NYC is 3,945 km.
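GeoDistanceInMetresMatrix() is defined in the full post. As a stand-alone illustration, the classic haversine formula (not necessarily the exact formula the post uses, and assuming a spherical earth of radius 6,371 km) reproduces these figures to within a few kilometres:

```r
# Great-circle distance in kilometres between two points given as decimal
# (lat, lon) pairs, via the haversine formula on a 6371 km sphere.
haversine_km <- function(lat1, lon1, lat2, lon2) {
  to_rad <- pi / 180
  d_lat  <- (lat2 - lat1) * to_rad
  d_lon  <- (lon2 - lon1) * to_rad
  a <- sin(d_lat / 2)^2 +
       cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(d_lon / 2)^2
  2 * 6371 * asin(sqrt(a))
}

haversine_km(40.7517, -73.9942, 34.0542, -118.2410)  # NYC to LA: ~3,936 km
```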

Written by Peter Rosenmai on 17 Jan 2014.

I present here what I consider to be a fiendishly weird quirk in R's code parser.

Okay, so what do you expect the following code to do?

if (TRUE){

if (TRUE) print(1)

else print(0)

}

Run it and you'll see that it prints the number 1, as you would expect.

Okay, now what happens when you remove the top-level if block?
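A hint at the answer: inside the braces the parser keeps reading past the first line, but at the top level "if (TRUE) print(1)" is already a complete expression by the end of its line, so a following "else" has nothing to attach to. We can confirm that the two-line form no longer even parses:

```r
# Without the enclosing braces, the else is orphaned and parsing fails
# with "unexpected 'else'".
ok <- try(parse(text = "if (TRUE) print(1)\nelse print(0)"), silent = TRUE)
inherits(ok, "try-error")  # TRUE
```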

Written by Peter Rosenmai on 31 Dec 2013.

I show here how the <<- assignment operator may be used to debug R functions by writing local variables into the global environment.

Consider the following code:

PlotExponentialSmoother <- function(x){

exp.smoother <- HoltWinters(x, beta=FALSE, gamma=FALSE)

plot.subtitle <- paste("Sum of Squared Errors:", exp.smoother$sse)

plot(exp.smoother, sub=plot.subtitle)

}

PlotExponentialSmoother(window(sunspots, 1950))

Running that code produces a graph of sunspot activity since 1950 and an exponential smoother of those data. But there's a problem: The graph subtitle doesn't come out properly.

Written by Peter Rosenmai on 25 Nov 2013. Last revised 13 Jan 2014.

The humble stacked dot plot is, I think, often preferable to the histogram as a means of graphing distributions of small data sets. To gauge how closely a histogram approximates an underlying population distribution, one must take into account the number of points that the histogram is based on (the sample size). Many readers fail to do this—and all too often the sample size is not provided within the graph. However, a dot plot lets any reader make an immediate guess at how closely the graph follows the shape of the underlying distribution.

Several R functions implement stacked dot plots. But which one to use?

There's this one from the base graphics package:

stripchart(faithful$waiting, method="stack", offset=0.5, pch=1)

Written by Peter Rosenmai on 25 Nov 2013. Last revised 1 Jan 2014.

The following Create2DimData() R function allows two-dimensional datasets (e.g. of heights and weights) to be created by clicking with a mouse within a plot. This can be useful if you need to throw together a dataset for demonstration purposes.

For example, try calling Create2DimData() like this:

df.points <- Create2DimData(xlim=c(0,10), ylim=c(0,5))

Written by Peter Rosenmai on 25 Nov 2013. Last revised 30 Nov 2013.

R's mahalanobis() function provides a simple means of detecting outliers in multidimensional data.

For example, suppose you have a dataframe of heights and weights:

hw <- data.frame(Height.cm=c(164, 167, 168, 169, 169, 170, 170, 170, 171, 172, 172, 173, 173, 175, 176, 178),

Weight.kg=c( 54, 57, 58, 60, 61, 60, 61, 62, 62, 64, 62, 62, 64, 56, 66, 70))

When plotting these data (generated for this example using an interactive plot), you could mark as outliers those points that are, for instance, more than two (sample) standard deviations from the mean height or mean weight:

is.height.outlier <- abs(scale(hw$Height.cm)) > 2

is.weight.outlier <- abs(scale(hw$Weight.kg)) > 2

pch <- (is.height.outlier | is.weight.outlier) * 16

plot(hw, pch=pch)
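Neither single-dimension check catches the genuinely odd point: 175 cm tall but only 56 kg. Because mahalanobis() accounts for the correlation between height and weight, it flags that point immediately:

```r
# Same height/weight data as above.
hw <- data.frame(Height.cm=c(164, 167, 168, 169, 169, 170, 170, 170,
                             171, 172, 172, 173, 173, 175, 176, 178),
                 Weight.kg=c( 54,  57,  58,  60,  61,  60,  61,  62,
                              62,  64,  62,  62,  64,  56,  66,  70))

# Squared Mahalanobis distance of each point from the sample mean,
# using the sample covariance matrix.
d2 <- mahalanobis(hw, center = colMeans(hw), cov = cov(hw))
hw[which.max(d2), ]  # the 175 cm, 56 kg point has the largest distance
```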

Written by Peter Rosenmai on 25 Nov 2013. Last revised 13 Jan 2014.

One of the commonest ways of finding outliers in one-dimensional data is to mark as a potential outlier any point
that is more than two standard deviations, say, from the mean (I am referring to *sample* means and standard
deviations here and in what follows). But the presence of outliers is likely to have a strong effect on the mean
and the standard deviation, making this technique unreliable.

For an example of this well-known problem, try running the following R code:

require(BHH2)

x <- c(1, 2, 3, 3, 4, 4, 4, 5, 5.5, 6, 6, 6.5, 7, 7, 7.5, 8, 9, 12, 52, 90)

dotPlot(x)

mean.x <- mean(x)

sd.x <- sd(x)

lines(rep(mean.x , 2), c(0.2, 0.25)) # Vertical line at mean

lines(rep(mean.x + 2*sd.x, 2), c(0.2, 0.25)) # Vertical line at mean + 2 SD

text(mean.x , 0.3, expression(bar(x)))

text(mean.x + 2*sd.x, 0.3, expression(paste(bar(x), " + 2s")))
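One well-known robust alternative replaces the mean and standard deviation with the median and the median absolute deviation (MAD), which the two extreme values barely shift; base R's mad() makes this a one-liner:

```r
# Robust z-scores: distance from the median in units of the MAD
# (mad() applies the consistency constant 1.4826 by default).
x <- c(1, 2, 3, 3, 4, 4, 4, 5, 5.5, 6, 6, 6.5, 7, 7, 7.5, 8, 9, 12, 52, 90)
z_robust <- abs(x - median(x)) / mad(x)
x[z_robust > 2]  # flags 12, 52 and 90; the mean/SD rule above flags only 90
```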

Written by Peter Rosenmai on 17 Dec 2013. Last revised 18 Dec 2013.

Let's have a look at the Gini index data available from the World Bank through R's WDI package.

For those who haven't met it before, the Gini index is an elegantly constructed measure of, typically, income inequality. A Gini index of 0 represents a perfectly equal economy; a Gini index of 100 represents a perfectly unequal economy. (To find out more about the Gini index, have a look at my Gini index calculator.)

Let's search for the Gini index within the World Bank's datasets:

require(WDI)

WDIsearch('gini')

Written by Peter Rosenmai on 8 Dec 2013.

It's easy to remove duplicate rows from an R dataframe using the unique() function:

> df.duplicates <- data.frame(a=c(1,2,3,3), b=c(1,2,3,3))

> df.duplicates

a b

1 1 1

2 2 2

3 3 3

4 3 3

> unique(df.duplicates)

a b

1 1 1

2 2 2

3 3 3

But this can be slow for large dataframes. Hideously slow, even.
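One common faster route, not necessarily the one the full post settles on, is duplicated() applied to a single key built from the columns, rather than unique() over the whole dataframe:

```r
# De-duplicate by marking repeated rows with duplicated() on a pasted key.
df.duplicates <- data.frame(a=c(1,2,3,3), b=c(1,2,3,3))
key <- do.call(paste, df.duplicates)   # one string per row: "1 1", "2 2", ...
df.deduped <- df.duplicates[!duplicated(key), ]
df.deduped  # the same three rows that unique() returns
```

Note that a pasted key can collide if column values themselves contain the separator; a stricter separator (or duplicated() on the dataframe itself) avoids that.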
