Sunday, September 20, 2009

Computing a confidence interval for ρ

Curiously, neither R nor SPSS seems to offer a simple way to compute a confidence interval for Pearson's correlation coefficient from just r and the sample size. R's base stats package includes the cor.test function, which does provide a confidence interval based on the Fisher z transformation, but it takes the full data set as input. Even then, the confidence interval depends only on the sample correlation and the sample size, so the extra information is not really needed, except to compute the sample correlation coefficient in the first place. The confidence interval can therefore just as well be computed from published correlation coefficients, without going back to the original data set.

The formula is relatively simple and can be found in any statistics textbook, but tracking it down and computing it by hand every time is cumbersome: transform r with Fisher's z = 0.5 * log((1 + r)/(1 - r)), build a normal-theory interval around z using the standard error 1/sqrt(n - 3), and transform the endpoints back to the correlation scale. Here is a short R function that does all of this.

r.cint <- function(r, n, level = .95) {
  # Fisher z transformation of r and its standard error
  z <- 0.5 * log((1 + r) / (1 - r))
  zse <- 1 / sqrt(n - 3)
  # Normal-theory interval on the z scale
  zmin <- z - zse * qnorm((1 - level) / 2, lower.tail = FALSE)
  zmax <- z + zse * qnorm((1 - level) / 2, lower.tail = FALSE)
  # Back-transform the endpoints to the correlation scale
  c((exp(2 * zmin) - 1) / (exp(2 * zmin) + 1),
    (exp(2 * zmax) - 1) / (exp(2 * zmax) + 1))
}
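For example (numbers chosen arbitrarily for illustration), r.cint(.5, 30) returns an interval of roughly .17 to .73. Incidentally, R's built-in atanh and tanh functions implement the Fisher transformation and its inverse (atanh(r) is exactly 0.5 * log((1 + r)/(1 - r))), so the same computation can be written more compactly:

```r
# Same interval, written with R's built-in atanh/tanh;
# the function name r.cint2 is just for this illustration
r.cint2 <- function(r, n, level = .95) {
  se <- 1 / sqrt(n - 3)
  tanh(atanh(r) + c(-1, 1) * se * qnorm((1 + level) / 2))
}

r.cint2(.5, 30)  # roughly 0.17 to 0.73
```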

The result can also be used as a hypothesis test, by checking whether the confidence interval includes 0 or any other hypothesized value. The conclusion is very similar but not identical to the tests reported by the SPSS CORRELATIONS procedure or R's cor.test, because those p values are based on a different test statistic (a t statistic with n - 2 degrees of freedom).
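In fact, cor.test's confidence interval is itself based on the Fisher transformation, so computing the interval directly from r and n reproduces it exactly; only the p value differs. A minimal check on simulated data (sample size and seed chosen arbitrarily):

```r
# Compare the Fisher-z interval computed from r and n alone
# with the interval reported by cor.test on the full data
set.seed(42)
n <- 25
x <- rnorm(n)
y <- 0.4 * x + rnorm(n)  # arbitrary simulated data

ct <- cor.test(x, y)
r <- unname(ct$estimate)

# Fisher-z interval using only r and n
ci <- tanh(atanh(r) + c(-1, 1) * qnorm(.975) / sqrt(n - 3))

ci          # matches ct$conf.int
ct$p.value  # but the p value comes from the t statistic instead
```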

What's the point? As is plain to see from the formula, the standard error of the z-transformed correlation depends only on the sample size (that is the point of the transformation), which means that the correlation coefficient and the sample size are the only information needed to perform a test or construct an interval.

Correlations are often reported without any discussion of sampling variability, but with a very small sample size the point estimate is very imprecise, and even an impressive r can be compatible with a modest population correlation. Similarly, a moderate observed correlation could reflect anything from a small correlation in the other direction to a strong correlation in the same direction. If nothing else, the confidence interval makes this imprecision visible and helps to interpret results based on experiments with a very small number of participants.
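To make this concrete, here are two illustrative cases (numbers chosen arbitrarily) computed with the same Fisher-z formula, written inline:

```r
# An impressive-looking r = .8 from only 10 participants
tanh(atanh(.8) + c(-1, 1) * qnorm(.975) / sqrt(10 - 3))
# roughly 0.34 to 0.95: the population correlation could be modest

# A moderate r = .3 from 12 participants
tanh(atanh(.3) + c(-1, 1) * qnorm(.975) / sqrt(12 - 3))
# roughly -0.33 to 0.75: even the sign is uncertain
```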

Bootstrapping techniques can also be used to construct a confidence interval for a correlation coefficient, but they require access to the original data set and cannot be computed from the summary statistics typically found in research reports.

PS: This post was updated in 2013 to fix a layout problem and add some clarifications.

4 comments:

  1. I just copied your function to my txt file named "handy_functions.R" :)

    ReplyDelete
  2. I wrote a Stata program with exactly the same intent. The accompanying piece for the Stata Journal is accessible to all and may be useful as a tutorial review even for people who don't use Stata. http://www.stata-journal.com/sjpdf.html?articlenum=pr0041

    ReplyDelete
  3. This website seems to advocate a test statistic based on the t-distribution: http://vassarstats.net/textbook/ch4apx.html - which one is correct here. Is the one above valid as n->Inf ?

    ReplyDelete