What does the R-Squared value of a regression refer to?

1 Answer
Oct 14, 2015

The R-squared value of a linear regression is the percentage of variation in your response variable (y) explained by your model.


Let us take a dataset with an explanatory variable X and a response variable Y, and fit a linear regression of the form:

#Y = aX + b#

We obtain what we see in figure 1.

[Figure 1: scatter plot of Y against X with the fitted regression line]

We obtain an R-squared value of 0.667, which means that 66.7% of the variation in Y is explained by X. The higher the R-squared, the better your model fits your data, and the closer the observed values are to the fitted values.
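As a minimal sketch of the definition above, R-squared can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The data points below are made up purely for illustration:

```python
# Illustrative (made-up) data for a simple linear fit Y = aX + b
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Ordinary least-squares slope (a) and intercept (b)
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
ss_tot = sum((y - mean_y) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))  # → 0.997 for this near-linear toy data
```

Here the points lie almost exactly on a line, so nearly all of the variation in Y is explained by X and R-squared is close to 1.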

As a consequence, a relatively high R-squared is desirable when you want to predict data. What counts as a high R-squared? It depends on the field. For example, in ecology, it is rare to see an R-squared above 50%.

However, whenever you fit a model, take care to respect the assumptions of that model. You cannot trust the R-squared if those assumptions are not met.
Here is an example of a linear regression fitted to four different datasets. All the regressions give the same R-squared (0.667):

[Figure 2: four datasets (Anscombe's quartet) with identical fitted lines and R-squared values]
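This can be checked numerically with the four datasets from Anscombe (1973), cited below; the sketch uses a hand-rolled least-squares fit rather than a statistics library:

```python
# The four Anscombe (1973) datasets: only the first is well described
# by a straight line, yet all four give (nearly) the same R-squared.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = [
    (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
     [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]

def r_squared(xs, ys):
    """R-squared of the ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

for xs, ys in quartet:
    print(round(r_squared(xs, ys), 3))  # roughly 0.666-0.667 for all four
```

The identical summary statistics are exactly why the datasets were constructed: the number alone cannot tell you whether a linear model is appropriate.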

This means that you have to examine your data before running a linear regression. Is a linear regression the best model to adopt? If the answer is yes, as in figure 1, you can reasonably interpret the R-squared. If not (figure 2, for example), your R-squared will be unreliable.

Reference:
F. J. Anscombe (1973). Graphs in Statistical Analysis. The American Statistician, vol. 27, no. 1, pp. 17-21.