Written by Peter Rosenmai on 27 Aug 2016.
Let's fit a function of the form f(t) = exp(λt) to a stepwise survival curve (e.g. a Kaplan Meier curve).
Here's the stepwise survival curve we'll be using in this demonstration:
Written by Peter Rosenmai on 1 Jan 2016.
I provide here a SQL Server script to calculate Kaplan Meier survival curves and their confidence intervals (plain, log and log-log) for time-to-event data.
Suppose a web-application company has seen its ten customers cancel their subscriptions after 0.5, 1, 10, 10, 11, 13.5, 14, 19, 19.5 and 30 months (from the start of their respective subscriptions). Based on this data, we want to estimate the probability of a new customer remaining a customer more than, say, 12 months—and we want a confidence interval around that estimate.
We create the input table:
Written by Peter Rosenmai on 12 Dec 2015.
SPSS Modeler streams can be executed from R via input files and command-line calls. It's a hacky technique, but it works. And it's an easy way to make use of Modeler's excellent Expert Modeler functionality.
First, create an example of the data file that you want Modeler to read in. For example, if you want to run Modeler on a single time series, your data file will probably be a text file comprising a date column and a value column.
Then create and save a Modeler stream that reads in that file, fits the required model and produces an output file. This is fairly easy so I won't cover it here. But this is how it might look:
Written by Peter Rosenmai on 13 Jan 2015.
Written by Peter Rosenmai on 1 Jan 2015.
Here's a D3-rendered graph of the probability density function (PDF) of the beta distribution. Move the sliders to change the shape parameters or the scale of the y-axis.
Written by Peter Rosenmai on 16 Nov 2014.
Here's how I installed the rpy2 module for communicating between R and Python. I did this for R version 3.0.2, Python version 3.4.1, and rpy2 version 2.4.4 on a 64-bit machine running Windows 7.
First, I got the full pathname of my R executible by right-clicking the R icon in my Start menu and selecting Properties. It was: C:\Program Files\R\R-3.0.2\bin\x64\Rgui.exe. And it only took a moment of poking around to find the full pathname of my Python executible: C:\Anaconda3\python.exe.
Time to look at my system variables. I right-clicked on Computer in my Start menu and selected Properties; I then clicked on Advanced System Settings in the window that appeared. In the Advanced tab of the System Properties window, I clicked the Environment Variables button. The lower half of the resulting Environment Variables window showed my system variables.
The first system variable I had to deal with was Path. It didn't include the directory in which my R executable sits, so I added it: C:\Program Files\R\R-3.0.2\bin\x64\.
I then added an R_HOME system variable and set it to the top level directory of my R installation: C:\Program Files\R\R-3.0.2\.
And I added an R_USER system variable and set it to the directory that the rpy2 module would install into: C:\Anaconda3\Lib\site-packages\rpy2\.
Written by Peter Rosenmai on 27 Sep 2014. Last revised 12 Oct 2014.
Given two GPS points recorded as being d metres apart with circular error probable (CEP) of c1 and c2 metres respectively, the true distance between the recorded points has the distribution
(I give a proof of this easy result below.)
Written by Peter Rosenmai on 2 Aug 2014.
Here's some R code that generates random numbers from the probability distribution described by a given non-negative function. This can be useful when running simulations or generating datasets for testing purposes.
For example, suppose you want to generate random data from a distribution that looks something like this back-of-an-envelope sketch:
Written by Peter Rosenmai on 21 Jun 2014.
Here's an example of how to use R to smoothly drag towards the mean outliers that are more than a given number of standard deviations from the mean—or median absolute deviations from the median, or whatever—so that the most extreme outliers are dragged in the most. I don't really agree with mangling data in this way and I think the task is a trivially simple one, but I've often been asked how to do it… so here's how you might go about it.
First, for demonstation purposes, I create a dataset with some obvious outliers:
I drag the outliers towards the mean using the standard deviation:
Written by Peter Rosenmai on 11 Apr 2014. Last revised 13 Jun 2015.
Here's some R code to graph the basic survival-analysis functions—s(t), S(t), f(t), F(t), h(t) or H(t)—derived from any of their definitions.
Written by Peter Rosenmai on 22 Feb 2014.
I present here R code to calculate the influence that a set of points have upon each other, where influence is a function of (1) the distance between the points and (2) the inherent influence of the points.
Consider, for example, five light bulbs with brightness given by this vector:
Now, suppose that the distance between the light bulbs (in metres) is given by this distance matrix:
This matrix tells us, for instance, that bulbs two and three are 12 metres apart. Note that the distance matrix is symmetrical about a zero diagonal. This corresponds with the ordinary notion of distance: Any point is a zero distance from itself, and the distance from point A to point B equals the distance from point B to point A. However, my code permits non-symmetric distances: If bulb two is "uphill" from bulb three, [2, 3] will be greater than [2, 3].
Written by Peter Rosenmai on 30 Jan 2014.
Here's an example of how to calculate a distance matrix for geographic points (expressed as decimal latitudes and longitudes) using R:
For example, the above distance matrix shows that the straight-line distance—accounting for curvature of the earth—between Los Angeles and NYC is 3,945 km.
Written by Peter Rosenmai on 17 Jan 2014.
I present here what I consider to be a fiendishly weird quirk in R's code parser.
Okay, so what do you expect the following code to do?
Run it and you'll see that it prints the number 1, as you would expect.
Okay, now what happens when you remove the top-level if block?
Written by Peter Rosenmai on 31 Dec 2013.
I show here how the <<- assignment operator may be used to debug R functions by writing local variables into the global environment.
Consider the following code:
Running that code produces a graph of sunspot activity since 1950 and an exponential smoother of those data. But there's a problem: The graph subtitle doesn't come out properly.
Written by Peter Rosenmai on 25 Nov 2013. Last revised 13 Jan 2014.
The humble stacked dot plot is, I think, often preferable to the histogram as a means of graphing distributions of small data sets. To gauge how closely a histogram approximates an underlying population distribution, one must take into account the number of points that the histogram is based on (the sample size). Many readers fail to do this—and all too often the sample size is not provided within the graph. However, a dot plot lets any reader make an immediate guess at how closely the graph follows the shape of the underlying distribution.
Several R functions implement stacked dot plots. But which one to use?
There's this one from the base graphics package:
Written by Peter Rosenmai on 25 Nov 2013. Last revised 1 Jan 2014.
The following Create2DimData() R function allows two-dimensional datasets (e.g. of heights and weights) to be created by clicking with a mouse within a plot. This can be useful if you need to throw together a dataset for demonstration purposes.
For example, try calling Create2DimData() like this:
Written by Peter Rosenmai on 25 Nov 2013. Last revised 30 Nov 2013.
R's mahalanobis() function provides a simple means of detecting outliers in multidimensional data.
For example, suppose you have a dataframe of heights and weights:
When plotting these data (generated for this example using an interactive plot), you could mark as outliers those points that are, for instance, more than two (sample) standard deviations from the mean height or mean weight:
Written by Peter Rosenmai on 25 Nov 2013. Last revised 13 Jan 2013.
One of the commonest ways of finding outliers in one-dimensional data is to mark as a potential outlier any point that is more than two standard deviations, say, from the mean (I am referring to sample means and standard deviations here and in what follows). But the presence of outliers is likely to have a strong effect on the mean and the standard deviation, making this technique unreliable.
For an example of this well-known problem, try running the following R code:
Written by Peter Rosenmai on 17 Dec 2013. Last revised 18 Dec 2013.
Let's have a look at the Gini index data available from the World Bank through R's WDI package.
For those who haven't met it before, the Gini index is an elegantly constructed measure of, typically, income inequality. A Gini index of 0 represents a perfectly equal economy; a Gini index of 100 represents a perfectly unequal economy. (To find out more about the Gini index, have a look at my Gini index calculator.)
Let's search for the Gini index within the World Bank's datasets:
Written by Peter Rosenmai on 8 Dec 2013.
It's easy to remove duplicate rows from an R dataframe using the unique() function:
But this can be slow for large dataframes. Hideously slow, even.