Applied Statistics - Lesson 13
More Correlation Coefficients
Lesson Overview
We introduced in lesson 5 the
Pearson product moment correlation coefficient and the
Spearman rho correlation coefficient. There are more.
Remember that the Pearson product moment correlation
coefficient required quantitative (interval or ratio)
data for both x and y whereas the Spearman
rho correlation coefficient applied to ranked (ordinal)
data for both x and y. You should review
levels of measurement in
lesson 1 before we continue. It is often
the case that the data variables are not at the same level
of measurement, or that the data, instead of being
quantitative, are categorical (nominal or ordinal).
In addition to correlation coefficients based on the
product moment, and thus related to the Pearson product moment
correlation coefficient, there are other coefficients in
common use that are instead measures of association.
For the purposes of correlation coefficients we can generally
lump the interval and ratio scales together as just quantitative.
In addition, the regression of x on y is closely
related to the regression of y on x, and the
same coefficient applies. We list below in a table the
common choices which we will then discuss in turn.
Variable Y\X | Quantitative X | Ordinal X | Nominal X
---|---|---|---
Quantitative Y | Pearson r | Biserial rb | Point Biserial rpb
Ordinal Y | Biserial rb | Spearman rho/Tetrachoric rtet | Rank Biserial rrb
Nominal Y | Point Biserial rpb | Rank Biserial rrb | Phi, L, C, Lambda
Before we go on we need to clarify different types of nominal data.
Specifically, nominal data with two possible outcomes are called
dichotomous.
The point-biserial correlation coefficient, referred to as
rpb, is a special case of Pearson
in which one variable is quantitative and the other
variable is dichotomous and nominal. The calculations simplify
since typically the values 1 (presence) and 0 (absence)
are used for the dichotomous variable. This simplification
is sometimes expressed as follows:
$$r_{pb} = \frac{(\bar{Y}_1 - \bar{Y}_0)\sqrt{pq}}{\sigma_Y},$$
where $\bar{Y}_0$ and $\bar{Y}_1$
are the Y score means for data pairs with
an x score of 0 and 1, respectively,
$q = 1 - p$ and $p$ are the proportions
of data pairs with x scores of 0 and 1, respectively,
and $\sigma_Y$
is the population standard deviation for the y data.
An example usage might be to determine if one gender accomplished
some task significantly better than the other gender.
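As a rough illustration, the shortcut formula for rpb can be checked against the ordinary Pearson calculation. The sketch below uses only the Python standard library; the function names and the ten data pairs are hypothetical and chosen just for this example.

```python
import math
from statistics import fmean, pstdev

def point_biserial(x, y):
    """Point-biserial r for a 0/1 variable x and a quantitative variable y."""
    y0 = [yi for xi, yi in zip(x, y) if xi == 0]
    y1 = [yi for xi, yi in zip(x, y) if xi == 1]
    p = len(y1) / len(y)   # proportion of pairs with x = 1
    q = 1 - p              # proportion of pairs with x = 0
    # pstdev is the population standard deviation, as the formula requires
    return (fmean(y1) - fmean(y0)) * math.sqrt(p * q) / pstdev(y)

def pearson(x, y):
    """Ordinary Pearson product moment correlation coefficient."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: x is the dichotomous variable, y the task score
x = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y = [10, 14, 12, 17, 13, 15, 19, 13, 18, 16]

r_pb = point_biserial(x, y)
r = pearson(x, y)
print(round(r_pb, 4), round(r, 4))  # the two values agree
```

The agreement of the two results reflects the statement that the point-biserial is a special case of the Pearson, not a separate statistic.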
If both variables instead are nominal and dichotomous,
the Pearson simplifies even further.
First, perhaps, we need to introduce contingency tables.
A contingency table is a two dimensional table containing
frequencies by category. For this situation it will
be two by two since each variable can only take on
two values, but each dimension will exceed two when
the associated variable is not dichotomous.
In addition, column and row headings and totals are
frequently appended so that the contingency table ends up being
n + 2 by m + 2, where n and m
are the number of values each variable can take on.
The label and total row and column typically are
outside the gridded portion of the table, however.
As an example, consider the following data organized by
gender and employee classification (faculty/staff).
(HTML doesn't provide the facility to grid only
the table's interior.)
Class.\Gender | Female (0) | Male (1) | Totals
---|---|---|---
Staff | 10 | 5 | 15
Faculty | 5 | 10 | 15
Totals: | 15 | 15 | 30
Contingency tables are often coded as below to
simplify calculation of the Phi coefficient.
Y\X | 0 | 1 | Totals
---|---|---|---
1 | A | B | A + B
0 | C | D | C + D
Totals: | A + C | B + D | N
With this coding:
$$\phi = \frac{BC - AD}{\sqrt{(A+B)(C+D)(A+C)(B+D)}}.$$
For this example we obtain:
$$\phi = \frac{25 - 100}{\sqrt{15 \cdot 15 \cdot 15 \cdot 15}} = \frac{-75}{225} = -0.33,$$
indicating a slight negative correlation. Please note that this is
the Pearson correlation coefficient, just calculated in a
simplified manner. However, the extreme values of |r| = 1
can only be realized when the two row totals are equal and
the two column totals are equal. There are thus ways
of computing the maximal values, if desired.
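The phi calculation for the example table can be sketched in a few lines of Python. The function name is hypothetical, and the cell arguments follow the A, B, C, D coding given above (Staff is taken as the row coded 1).

```python
import math

def phi_coefficient(a, b, c, d):
    """Phi for a 2x2 table coded as in the text:
    row Y=1 holds A (X=0) and B (X=1); row Y=0 holds C (X=0) and D (X=1)."""
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (b * c - a * d) / denom

# Staff row: A=10, B=5; Faculty row: C=5, D=10
phi = phi_coefficient(10, 5, 5, 10)
print(round(phi, 2))  # -0.33
```

Note that the function fails with a division by zero if any row or column total is zero, a case in which phi is undefined.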
As product moment correlation coefficients,
the point biserial, phi, and Spearman rho are
all special cases of the Pearson.
However, there are correlation coefficients which are not.
Many of these are more properly called
measures of association, although
they are usually termed coefficients as well.
Three of these are similar to Phi
in that they are for nominal against nominal data,
but these do not require the data to be dichotomous.
One is called Pearson's contingency coefficient
and is termed C, whereas the second is
called Cramér's V coefficient.
Both utilize the chi-square statistic,
so they will be deferred to the next lesson.
However, the Goodman and Kruskal lambda coefficient
does not, but is another commonly used association measure.
There are two flavors, one called symmetric when the researcher
does not specify which variable is the dependent variable and
one called asymmetric which is used when such a designation
is made. We leave the details to any good statistics book.
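For readers who want a concrete picture before consulting a textbook, here is a minimal sketch of the asymmetric lambda. It assumes the usual proportional-reduction-in-error form, in which the row variable is predicted from the column variable: sum the largest cell in each column, subtract the largest row total, and divide by N minus the largest row total. The function name is hypothetical.

```python
def lambda_asymmetric(table):
    """Goodman and Kruskal lambda for predicting the row variable
    from the column variable.

    `table` is a list of rows of cell frequencies, without totals.
    """
    n = sum(sum(row) for row in table)
    max_row_total = max(sum(row) for row in table)
    # Largest cell in each column: the best guess once the column is known
    sum_col_max = sum(max(col) for col in zip(*table))
    # Undefined (division by zero) if every case falls in a single row
    return (sum_col_max - max_row_total) / (n - max_row_total)

# The gender by employee classification table from earlier in the lesson
lam = lambda_asymmetric([[10, 5], [5, 10]])
print(round(lam, 2))  # 0.33
```

Here knowing gender reduces the errors made in guessing the classification from 15 to 10, a reduction of one third.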
Another measure of association, the biserial correlation coefficient,
termed rb, is similar to the point biserial, but
pits quantitative data against ordinal data that have
an underlying continuity yet are measured discretely as two values
(dichotomous). An example might be test performance vs
anxiety, where anxiety is designated as either high or low.
Presumably, anxiety can take on any value in between, perhaps beyond,
but it may be difficult to measure. We further assume that
anxiety is normally distributed.
The formula is very similar to the point-biserial, yet different:
$$r_b = \frac{(\bar{Y}_1 - \bar{Y}_0)\,(pq/y)}{\sigma_Y},$$
where $\bar{Y}_0$ and $\bar{Y}_1$
are the Y score means for data pairs with
an x score of 0 and 1, respectively,
$q = 1 - p$ and $p$ are the proportions
of data pairs with x scores of 0 and 1, respectively,
$\sigma_Y$
is the population standard deviation for the y data,
and $y$ is the height of the standardized normal distribution
at the point z, where P(z'<z)=q and
P(z'>z)=p.
Since the ratio of the two formulas' factors, sqrt(pq) divided by
the height, is always greater than 1, the biserial is always
greater in magnitude than the point-biserial.
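The relationship between the two coefficients can be illustrated with a short sketch. It uses `NormalDist` from the Python standard library for the normal curve; the function name and the data are hypothetical.

```python
import math
from statistics import NormalDist, fmean, pstdev

def biserial_and_point_biserial(x, y):
    """Return (r_b, r_pb) for a 0/1 variable x and a quantitative variable y."""
    y0 = [yi for xi, yi in zip(x, y) if xi == 0]
    y1 = [yi for xi, yi in zip(x, y) if xi == 1]
    p = len(y1) / len(y)   # proportion of pairs with x = 1
    q = 1 - p              # proportion of pairs with x = 0
    std_normal = NormalDist()            # mean 0, standard deviation 1
    z = std_normal.inv_cdf(q)            # point with P(z' < z) = q
    height = std_normal.pdf(z)           # height of the curve at z
    diff = fmean(y1) - fmean(y0)
    sigma_y = pstdev(y)                  # population standard deviation
    r_b = diff * (p * q / height) / sigma_y
    r_pb = diff * math.sqrt(p * q) / sigma_y
    return r_b, r_pb

# Hypothetical data: x = 1 for "high anxiety", y = test score
x = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y = [10, 14, 12, 17, 13, 15, 19, 13, 18, 16]

r_b, r_pb = biserial_and_point_biserial(x, y)
print(round(r_b, 3), round(r_pb, 3))
```

With p = q = 0.5 the ratio of the two results is about 1.25, the smallest it can be. Because of this inflation the biserial can exceed 1 for strongly separated groups, which is one reason to treat it with care.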
The tetrachoric correlation coefficient, rtet,
is used when both variables are dichotomous, like the phi, but
we need also to be able to assume both variables really are
continuous and normally distributed. Thus it is applied to
ordinal vs. ordinal data which has this characteristic.
Ranks are discrete so in this manner it differs from the Spearman.
The formula involves a trigonometric function called cosine.
The cosine function, in its simplest form, is the ratio of
two side lengths in a right triangle, specifically, the length
of the side adjacent to the reference angle divided by the
length of the hypotenuse. With the contingency table coding
used for phi, the formula (angle in degrees) is:
$$r_{tet} = \cos\left(\frac{180^\circ}{1 + \sqrt{BC/AD}}\right).$$
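A minimal sketch of this cosine formula follows, assuming the same A, B, C, D cell coding used for the phi coefficient. The function name is hypothetical, and the 2x2 frequencies are reused from the gender and classification example purely to show the arithmetic.

```python
import math

def tetrachoric_cosine(a, b, c, d):
    """cos(180 degrees / (1 + sqrt(BC/AD))) with the text's cell coding.

    Fails with a division by zero if A or D is zero.
    """
    ratio = math.sqrt((b * c) / (a * d))
    angle_degrees = 180.0 / (1.0 + ratio)
    # math.cos expects radians, so convert the angle first
    return math.cos(math.radians(angle_degrees))

r_tet = tetrachoric_cosine(10, 5, 5, 10)
print(round(r_tet, 2))  # -0.5
```

For these frequencies the ratio is 0.5, the angle is 120 degrees, and the cosine is -0.5, which is larger in magnitude than the phi of -0.33 computed from the same table.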
The rank-biserial correlation coefficient, rrb,
is used for dichotomous nominal data vs rankings (ordinal).
The formula is usually expressed as
$$r_{rb} = \frac{2(\bar{Y}_1 - \bar{Y}_0)}{n},$$
where n is the number of data pairs, and
$\bar{Y}_0$ and $\bar{Y}_1$,
again, are the Y score means for data pairs with
an x score of 0 and 1, respectively.
These Y scores are ranks. This formula assumes
no tied ranks are present.
This may be the same as a Somers' D statistic for
which an online calculator is available.
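The rank-biserial formula is simple enough to sketch directly. The function name and the eight data pairs below are hypothetical; the ranks run from 1 to n with no ties, as the formula assumes.

```python
from statistics import fmean

def rank_biserial(x, ranks):
    """Rank-biserial r for a 0/1 variable x and untied ranks 1..n."""
    r0 = [r for xi, r in zip(x, ranks) if xi == 0]
    r1 = [r for xi, r in zip(x, ranks) if xi == 1]
    return 2 * (fmean(r1) - fmean(r0)) / len(ranks)

# Hypothetical data: group membership and rank on some ordered outcome
x = [0, 0, 0, 1, 0, 1, 1, 1]
ranks = [1, 2, 3, 4, 5, 6, 7, 8]

r_rb = rank_biserial(x, ranks)
print(r_rb)  # 0.875
```

When the two groups do not overlap at all in rank, for example `x = [0, 0, 0, 0, 1, 1, 1, 1]` with the same ranks, the function returns 1.0.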
It is often useful to measure a relationship regardless of
whether it is linear or not. The eta correlation ratio
or eta coefficient gives us that ability. This statistic
is interpreted similarly to the Pearson, but can never be negative.
It utilizes equal width intervals and always exceeds |r|.
However, even though r is the same whether we
regress y on x or x on y,
two possible values for eta can be obtained.
Again, the calculation goes beyond what can be presented
here at the moment.
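Although the full treatment is left for elsewhere, a rough sketch may help fix the idea. It assumes the common definition of eta as the square root of the between-interval sum of squares divided by the total sum of squares, with x grouped into equal width intervals. The function name, the number of intervals, and the data are all hypothetical.

```python
import math
from statistics import fmean

def correlation_ratio(x, y, bins):
    """Eta for y on x, with x grouped into `bins` equal width intervals."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / bins
    groups = [[] for _ in range(bins)]
    for xi, yi in zip(x, y):
        # The largest x value is placed in the last interval
        k = min(int((xi - lo) / width), bins - 1)
        groups[k].append(yi)
    grand_mean = fmean(y)
    ss_total = sum((yi - grand_mean) ** 2 for yi in y)
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2
                     for g in groups if g)
    return math.sqrt(ss_between / ss_total)

# A purely curved relationship: y = x squared on a symmetric range.
# The Pearson r is 0 for these data, because x and x**3 both sum to zero.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

eta = correlation_ratio(x, y, bins=3)
print(round(eta, 3))
```

Swapping the roles of the two arguments gives the other eta mentioned above, which will in general be a different number.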