Fundamentals of Statistics contains material of various lectures and courses of H. Lohninger on statistics, data analysis and chemometrics.


Distribution of the Correlation Coefficient

This simulation calculates the distribution of the correlation coefficient as a function of the sample size.
Let us assume that we are monitoring a process which is described by two variables (for example, the temperature and the flow rate in a chemical reactor). Let us further assume that the two variables are uncorrelated, which implies that we will find a correlation coefficient of approximately zero if we take a large number of measurements of the two variables.

As measuring many samples is time-consuming, we measure only a small number of values (e.g. five) and calculate the correlation coefficient of these five pairs of values. The measurement of this small number of values is then repeated several times. We will see that the correlation coefficient determined from these small samples deviates considerably from the expected value of zero. If we repeat this experiment often enough, we can plot the frequencies of occurrence in a histogram showing the distribution of the correlation coefficient (see simulation at the right).
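The repeated experiment described above can be sketched in a few lines of Python. This is only an illustration and not the simulation program referred to in the text: it assumes that both variables are independent and normally distributed, and the function names, the seed and the number of repetitions are arbitrary choices.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_r(n, repeats, rng):
    """Draw `repeats` samples of n uncorrelated pairs and return their r values."""
    values = []
    for _ in range(repeats):
        # Assumption: both variables are independent standard normal
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        ys = [rng.gauss(0.0, 1.0) for _ in range(n)]
        values.append(pearson_r(xs, ys))
    return values


def histogram(values, bins=10):
    """Count values in equally wide bins covering the range -1 to +1."""
    counts = [0] * bins
    for r in values:
        index = min(int((r + 1.0) / 2.0 * bins), bins - 1)
        counts[index] += 1
    return counts


rng = random.Random(1)  # fixed seed (arbitrary) so that runs are repeatable
for n in (5, 50):
    rs = simulate_r(n, 2000, rng)
    share = sum(abs(r) > 0.5 for r in rs) / len(rs)
    print(f"n = {n:2d}: share of |r| > 0.5 = {share:.3f}, histogram = {histogram(rs)}")
```

For five pairs the histogram should be spread over almost the whole range from -1 to +1, whereas for fifty pairs nearly all values fall into the bins around zero.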

The correlation coefficient r is a random variable and therefore has a distribution of its own, which depends on the population value of the correlation coefficient ρ and the number of observations n.

From such histograms one can conclude that for a small number of observations it is quite likely that the correlation coefficient is high even though the variables are uncorrelated. A high correlation coefficient therefore does not necessarily indicate a strong correlation between two variables; it may just as well be caused by a small sample size. In particular, with only four pairs of values, every value of the correlation coefficient between -1 and +1 is equally likely to occur.
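The statement about four values can be checked against the theoretical density of r. The sketch below assumes uncorrelated, normally distributed variables (ρ = 0), for which the density is (1 - r²)^((n-4)/2) / B(1/2, (n-2)/2); the function name density_r is chosen here for illustration only.

```python
import math


def density_r(r, n):
    """Density of r for n pairs of uncorrelated, normally distributed values."""
    if n < 3:
        raise ValueError("at least three pairs are needed")
    # Beta function B(1/2, (n - 2) / 2) expressed through gamma functions
    beta = math.gamma(0.5) * math.gamma((n - 2) / 2) / math.gamma((n - 1) / 2)
    return (1.0 - r * r) ** ((n - 4) / 2) / beta


for n in (4, 5, 10, 30):
    row = ", ".join(f"{density_r(r, n):.3f}" for r in (0.0, 0.5, 0.9))
    print(f"n = {n:2d}: density at r = 0.0, 0.5, 0.9 -> {row}")
```

For n = 4 the exponent is zero, so the density is constant over the whole range. With increasing n it concentrates more and more around zero.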

As a consequence of this effect, a correlation coefficient has to be tested for significance. A rule of thumb provides a rough guideline: with 10 pairs of observations the correlation coefficient has to exceed about 0.8 to be significant; with 20 pairs this limit is around 0.5.
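Such limits can also be computed instead of being taken from a rule of thumb. The following sketch again assumes uncorrelated, normally distributed variables and a two-sided test. It integrates the density of r numerically and searches for the critical value by bisection; the names two_sided_p and critical_r as well as the step counts are choices made for this example, and the approach is meant for n of at least 4.

```python
import math


def density_r(r, n):
    """Density of r for n pairs of uncorrelated, normally distributed values."""
    beta = math.gamma(0.5) * math.gamma((n - 2) / 2) / math.gamma((n - 1) / 2)
    return (1.0 - r * r) ** ((n - 4) / 2) / beta


def two_sided_p(r_abs, n, steps=2000):
    """Probability of |r| >= r_abs under rho = 0, by midpoint integration."""
    width = (1.0 - r_abs) / steps
    area = sum(density_r(r_abs + (i + 0.5) * width, n) for i in range(steps))
    return 2.0 * area * width  # factor 2: the density is symmetric around zero


def critical_r(n, alpha=0.05):
    """Smallest |r| that is significant at level alpha, found by bisection."""
    low, high = 0.0, 1.0
    for _ in range(40):
        mid = (low + high) / 2.0
        if two_sided_p(mid, n) > alpha:
            low = mid   # tail still too large: the limit lies further out
        else:
            high = mid
    return (low + high) / 2.0


for n in (5, 10, 20, 50):
    print(f"n = {n:2d}: limit at 5 % = {critical_r(n, 0.05):.3f}, "
          f"at 1 % = {critical_r(n, 0.01):.3f}")
```

Under these assumptions the limits for 10 and 20 pairs are about 0.63 and 0.44 at the 5 % level, and about 0.76 and 0.56 at the 1 % level, so the rule of thumb is closer to the stricter level.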