We often use the coefficient of determination as a swift ‘measure’ of goodness of fit for our regression models. Unfortunately, there is no unique symbol for this coefficient: both R² and r² are used in the literature, almost interchangeably. Such interchangeability is also endorsed by Wikipedia (see https://en.wikipedia.org/wiki/Coefficient_of_determination), where both symbols are reported as abbreviations for this statistical index.
As an editor of several international journals, I cannot agree with this approach: the two symbols R² and r² mean two different things, and they are not necessarily interchangeable, because, depending on the setting, either of the two may be wrong or ambiguous. Let’s pay a little attention to this issue.
What are the ‘r’ and ‘R’ indices?
When studying the relationship between quantitative variables, we have two main statistical indices:
- the Pearson’s (simple linear) correlation coefficient, which is almost always indicated as r. Correlation is different from regression, as it does not assume any sort of dependency between the two quantitative variables and it is only meant to express their joint variability;
- the coefficient of multiple correlation, which is usually indicated as R and represents (definition from Wikipedia) a measure of how well a given variable can be predicted using a linear function of a set of other variables. Although R is based on correlation (it is the correlation between the observed values of the dependent variable and the predictions made by the model), it is used in the context of multiple regression, where we are studying a dependency relationship.
And what about the coefficient of determination? It is yet another concept and another index, measuring (again from Wikipedia) the proportion of the variation in the dependent variable that is predictable from the independent variable(s). As you can see, we are still in the context of regression and our aim is to describe the goodness of fit.
To start with, let’s abbreviate the coefficient of determination as CD, in order to avoid any confusion with r and R; this index can be obtained as:

$$CD_1 = \frac{SS_{reg}}{SS_{tot}}$$

or as:

$$CD_2 = 1 - \frac{SS_{res}}{SS_{tot}}$$

where SSreg is the regression sum of squares, SStot is the total sum of squares and SSres is the residual sum of squares, after a linear regression fit. The second formula is preferable: sums of squares are always non-negative and, thus, we clearly see that CD2 cannot be higher than 1 (this is less obvious for CD1).
So far, so good: we have three different indices and three different symbols, r, R and CD. But, in practice, things did not go that smoothly! The early statisticians, instead of proposing a brand new symbol for the coefficient of determination, chose to highlight the connections with r and R. For example, Sokal and Rohlf, in their very famous biometry book (page 570, 2nd edition), demonstrated that the coefficient of determination could be obtained as the square of the correlation coefficient and, thus, they used the symbol r². Later in the same book (page 660), the same authors demonstrated that the coefficient of multiple correlation R was equal to the positive square root of the coefficient of multiple determination, for which they used the symbol R².
Obviously, Sokal and Rohlf (and the authors of other textbooks) meant to say that the coefficient of determination, depending on the context, can be obtained either as the square of the correlation coefficient or as the square of the multiple correlation coefficient. A careless interpretation led to the idea that the coefficient of determination can be indicated either as R² or as r² and that the two symbols are always interchangeable. But is this really true? No, it depends on the context.
Simple linear regression
Let’s have a look at the following example: we fit a simple linear regression model to a dataset and retrieve the coefficient of determination by using the summary() method.
X <- 1:20
Y <- c(17.79, 18.81, 19.02, 14.14, 16.72, 16.05, 13.99, 13.26,
12.48, 11.33, 11.07, 9.68, 9.19, 9.44, 9.75, 7.71, 6.47,
5.22, 4.55, 7.7)
mod <- lm(Y ~ X)
summary(mod)$r.squared # Coefficient of determination, as given by R
## [1] 0.9270622
It is very easy to see that R = |r| and it is also easy to demonstrate that r² = CD1 (see, e.g., Sokal and Rohlf for a mathematical proof). Furthermore, due to the equality SStot = SSreg + SSres, it is also easy to see that CD1 = CD2. We are ready to draw our first conclusion.
SSreg <- sum((predict(mod) - mean(Y))^2)
SStot <- sum((Y - mean(Y))^2)
SSres <- sum(residuals(mod)^2)
SSreg/SStot
## [1] 0.9270622
1 - SSres/SStot
## [1] 0.9270622
r.coef <- cor(X, Y)
R.coef <- cor(Y, fitted(mod))
r.coef^2
## [1] 0.9270622
R.coef^2
## [1] 0.9270622
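Just as a double check, we can verify the equality SStot = SSreg + SSres and the equality R = |r|, by using the objects created in the box above:

# The decomposition SStot = SSreg + SSres holds
all.equal(SStot, SSreg + SSres) # TRUE
# For simple linear regression, R = |r|
all.equal(R.coef, abs(r.coef)) # TRUE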
CONCLUSION 1. For simple linear regression, the coefficient of determination is always equal to both r² and R², and both symbols are acceptable (and correct).
Polynomial regression and multiple regression
Apart from simple linear regression, for all other types of linear models, provided that an intercept is fitted, it is not, in general, true that R = |r|, while it is, in general, true that the coefficient of determination is equal to the squared coefficient of multiple correlation R². I’ll show a swift example with a polynomial regression in the box below.
mod2 <- lm(Y ~ X + I(X^2))
cor.coef <- cor(X, Y)
R.coef <- cor(Y, fitted(mod2))
# R and r are not equal
cor.coef; R.coef
## [1] -0.9628407
## [1] 0.9652451
# The coefficient of determination is R2
R.coef^2; summary(mod2)$r.squared
## [1] 0.931698
## [1] 0.931698
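As a quick check, we can also verify that, since this polynomial model contains an intercept, the two expressions CD1 and CD2 still return the same value:

# CD1 and CD2 coincide, as mod2 contains an intercept
SSreg2 <- sum((fitted(mod2) - mean(Y))^2)
SSres2 <- sum(residuals(mod2)^2)
SStot2 <- sum((Y - mean(Y))^2)
SSreg2/SStot2; 1 - SSres2/SStot2 # both equal to 0.931698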
Furthermore, when we have several predictors (e.g., multiple regression), the correlation coefficient is not unique and we could calculate as many r values as there are predictors in the model.
In the box below I show another example, where I analyse the ‘mtcars’ dataset by using multiple regression; we can see that R² = CD1 = CD2.
data(mtcars)
mod <- lm(mpg ~ wt + disp + hp, data = mtcars)
summary(mod)$r.squared # Coefficient of determination, as given by R
## [1] 0.8268361
SSreg <- sum((predict(mod) - mean(mtcars$mpg))^2)
SStot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
SSres <- sum(residuals(mod)^2)
SSreg/SStot
## [1] 0.8268361
1 - SSres/SStot
## [1] 0.8268361
R.coef <- cor(mtcars$mpg, fitted(mod))
R.coef^2
## [1] 0.8268361
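Indeed, with these data we can calculate three different r values, one per predictor, and none of their squares matches the coefficient of determination above:

# The correlation coefficient is not unique: one r value per predictor
sapply(mtcars[, c("wt", "disp", "hp")], function(x) cor(x, mtcars$mpg))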
We are now ready to draw our second conclusion.
CONCLUSION 2. With all linear models, apart from simple linear regression, the r² symbol should not be used for the coefficient of determination, because this latter index IS NOT, in general, equal to the square of the correlation coefficient. The R² symbol is a much better choice.
Linear models with no intercept
The situation becomes much more complex for linear models with no intercept. For these models, the squared multiple correlation coefficient IS NOT ALWAYS equal to the proportion of variance accounted for. Let’s look at the following example:
mod2 <- lm(Y ~ - 1 + X)
summary(mod2)$r.squared # Proportion of variance accounted for
## [1] 0.4390065
R.coef <- cor(Y, fitted(mod2))
R.coef^2
## [1] 0.9270622
In other words, the coefficient of determination IS NOT ALWAYS equal to R²; however, it can still be calculated by using either CD1 or CD2, provided that SStot, SSreg and SSres are obtained in a way that accounts for the missing intercept. Schabenberger and Pierce (2002) recommend the following equations, where the symbols clearly reflect the fact that these expressions do not return the squared multiple correlation coefficient:

$$R^2_{noint} = \frac{\sum_{i=1}^{n} \hat{y}_i^2}{\sum_{i=1}^{n} y_i^2} \quad \textrm{or} \quad R^{2*}_{noint} = 1 - \frac{SS_{res}}{\sum_{i=1}^{n} y_i^2}$$
SSreg <- sum(fitted(mod2)^2)
SStot <- sum(Y^2)
SSres <- sum(residuals(mod2)^2)
SSreg/SStot # same result as summary(mod2)$r.squared
## [1] 0.4390065
1 - SSres/SStot # the two expressions return the same value
## [1] 0.4390065
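The two expressions agree because, in any least squares fit, the residuals are orthogonal to the fitted values, so that $\sum_{i=1}^{n} y_i^2 = \sum_{i=1}^{n} \hat{y}_i^2 + SS_{res}$. We can quickly verify this decomposition with the objects created above:

# For the no-intercept fit: sum(Y^2) = sum(fitted^2) + SSres
all.equal(SStot, SSreg + SSres) # TRUE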
We are ready for our third conclusion.
CONCLUSION 3. In the case of models with no intercept, neither the r² nor the R² symbol should be used for the coefficient of determination. The proportion of variability accounted for by the model can be calculated by using a modified formula and should be reported with a different symbol (e.g., $R^2_{noint}$, $R^2_0$ or similar).
Nonlinear regression
With this class of models, we have two main problems:
- they do not have an intercept term, at least not in the usual sense. Consequently, the square of the multiple correlation coefficient does not represent the proportion of variance accounted for by the model;
- the equality SStot = SSreg + SSres may not hold and, thus, the expressions CD1 and CD2 may produce different results.
In contrast to linear models with no intercept, for nonlinear models we do not have any general modified formula that consistently returns the proportion of variance accounted for by the model (i.e., the coefficient of determination). However, Schabenberger and Pierce (2002) suggested that we can still use CD2 as a swift measure of goodness of fit, but they also proposed that we use the term ‘Pseudo-R²’ instead of R². Why ‘Pseudo’? For two good reasons:
- the ‘Pseudo-R²’ cannot exceed 1, but it may be lower than 0;
- the ‘Pseudo-R²’ cannot be interpreted as the proportion of variance explained by the model.
In R, the ‘Pseudo-R²’ can be calculated by using the R2nls() function in the ‘aomisc’ package, for nonlinear models fitted with either the nls() or the drm() function (this latter function is in the ‘drc’ package).
library(aomisc)
X <- c(0.1, 5, 7, 22, 28, 39, 46, 200)
Y <- c(1, 13.66, 14.11, 14.43, 14.78, 14.86, 14.78, 14.91)
# nls fit
model <- nls(Y ~ SSmicmen(X, Vm, K))
R2nls(model)$PseudoR2
## [1] 0.9930399
# It is not the R2, in a strict sense
R.coef <- cor(Y, fitted(model))
R.coef^2
## [1] 0.9957255
# It cannot be calculated by using the usual CD1 expression,
# because the equality SStot = SSreg + SSres does not hold,
# in general, for nonlinear models
SSreg <- sum((fitted(model) - mean(Y))^2)
SStot <- sum((Y - mean(Y))^2)
SSreg/SStot # this does not return the Pseudo-R2
# It can be calculated by using the alternative expression CD2,
# which is no longer equivalent to CD1
SSres <- sum(residuals(model)^2)
1 - SSres/SStot
## [1] 0.9930399
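By the way, if the ‘aomisc’ package is not at hand, the same value is easy to obtain with a few lines of code. The helper below is only a minimal sketch of the CD2 calculation, not the actual implementation of R2nls():

# Minimal Pseudo-R2 (CD2) helper: a sketch, not aomisc's own code
pseudoR2 <- function(fit, y) {
  SSres <- sum(residuals(fit)^2)
  SStot <- sum((y - mean(y))^2)
  1 - SSres/SStot
}
pseudoR2(model, Y) # same value as R2nls(model)$PseudoR2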
We may now come to our final conclusion.
CONCLUSION 4. With nonlinear models, we should never use either r² or R², because they are both wrong. If we need a swift measure of goodness of fit, we can use the CD2 index above, but we should not call it R², because, in general, it does not correspond to the coefficient of determination. It is better to use the term Pseudo-R².
I hope this was useful; for those who are interested in the use of the Pseudo-R² in nonlinear regression, I have already published a post at this link: https://www.statforbiology.com/2021/stat_nls_r2/ .
Thanks for reading and happy coding!
Andrea Onofri
Department of Agricultural, Food and Environmental Sciences
University of Perugia (Italy)
andrea.onofri@unipg.it
References
- Schabenberger, O., Pierce, F.J., 2002. Contemporary statistical models for the plant and soil sciences. Taylor & Francis, CRC Press.
- Sokal, R.R., Rohlf, F.J., 1981. Biometry, 2nd edition. W.H. Freeman and Company, USA.