What is the residual sum of squares?

It's the remaining variance in a data set that is unaccounted for by the explainable sources of variation.

Explanation:

All data sets have what's known as a "total sum of squares" (or perhaps a "corrected total sum of squares"), usually denoted something like #SS_"Total"# or #SS_T#. This is the grand sum of all the squared data values (minus the correction factor #n bar(y)^2#, where #bar(y)# is the mean of the data, if you're using the corrected #SS_T#). #SS_T# quantifies the total amount of variance in any given data set.
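
For concreteness, here's a minimal sketch of both versions in Python with NumPy, using made-up numbers rather than any real data set:

```python
import numpy as np

# Hypothetical data values (invented for illustration).
y = np.array([1.5, 2.0, 3.1, 3.4, 4.8, 5.1])

ss_uncorrected = np.sum(y ** 2)         # grand sum of squared values
ss_total = np.sum((y - y.mean()) ** 2)  # corrected SS_T

# The two differ by exactly the correction factor n * ybar^2.
print(ss_uncorrected - len(y) * y.mean() ** 2)  # equals ss_total
```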

Using a sum-of-squares decomposition (the machinery behind regression and ANOVA), #SS_T# can be split into other sums of squares: the sources that attempt to explain where all that variance in #SS_T# comes from. These sources may be:

  • regression (line slopes, like how a server's tips increase with the price of a meal), denoted #SS_R#;
  • main effects (category averages, like how women tip more than men, female servers get more tips than male servers, etc.), denoted #SS_A#, #SS_B#, etc. (a quick numeric sketch follows this list);
  • interaction effects between two explanatory variables (like how men tip more than women if their server is female), denoted #SS_(AB)#;
  • lack of fit (systematic deviation of the model from the data, which can be separated from pure error when there are repeated observations at identical explanatory values, like a customer dining at a restaurant twice with the same server), denoted #SS_"LOF"#;
  • and many others.
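
To make one of these concrete, here is a minimal sketch (Python with NumPy, invented tip amounts) of a main-effect sum of squares #SS_A# for a single categorical factor, computed as each category mean's squared distance from the grand mean, weighted by that category's size:

```python
import numpy as np

# Hypothetical tips grouped by server sex (one categorical factor, A).
groups = {
    "female": np.array([3.0, 3.5, 4.2, 3.8]),
    "male":   np.array([2.5, 2.8, 3.1]),
}

all_tips = np.concatenate(list(groups.values()))
grand_mean = all_tips.mean()

# SS_A: weighted squared distances of the category means from the grand mean.
ss_a = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups.values())
print(ss_a)
```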

Most of the time, these explainable sources do not account for all of the total variance in the data. We certainly hope they come close, but there is almost always a little bit of variance left over that has no explainable source.

This leftover bit is called the residual sum of squares or the sum of squares due to error and is usually denoted by #SS_"Error"# or #SS_E#. It's the remaining variance in the data that can't be attributed to any of the other sources in our model.

We usually write an equation like this:

#SS_T=SS_"Source 1"+SS_"Source 2"+...+SS_E#

It's that last term, the #SS_E#, that contains all the variance in the data that has no explainable source. It's the sum of all the squared distances between each observed data point and the point the model predicts at the corresponding explanatory values: #SS_E = sum(y_i - hat(y)_i)^2#, where #y_i# is an observed value and #hat(y)_i# is the model's prediction for it. These distances are also called the residuals, hence the term "residual sum of squares". In this way, #SS_E# is the best value to help us estimate #sigma^2#, the variance of the residuals.
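
Here is a minimal sketch of the whole decomposition for the simplest case, a one-line regression model fit by least squares, where the equation reduces to #SS_T = SS_R + SS_E# (Python with NumPy; the tipping numbers are invented):

```python
import numpy as np

# Hypothetical tipping data: meal price (x) and tip amount (y).
x = np.array([10.0, 15.0, 20.0, 25.0, 30.0, 35.0])
y = np.array([ 1.5,  2.0,  3.1,  3.4,  4.8,  5.1])

# Fit tip ~ price by least squares.
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

ss_total = np.sum((y - y.mean()) ** 2)      # SS_T (corrected)
ss_reg   = np.sum((y_hat - y.mean()) ** 2)  # SS_R, explained by the line
ss_error = np.sum((y - y_hat) ** 2)         # SS_E, the residual sum of squares

print(ss_total, ss_reg + ss_error)  # equal, up to floating-point rounding
```

Whatever the line fails to explain lands in #SS_E#.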

Note: #SS_E# on its own does not estimate #sigma^2#; we must first divide #SS_E# by its degrees of freedom, #df_E#, to get our "mean squared error":

#MS_E=(SS_E)/(df_E)#

Unfortunately, explaining degrees of freedom would make this answer a lot longer, so I have left it out for the sake of keeping this response (relatively) short.
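
Still, here is that last step as a self-contained sketch (same invented tipping numbers as above); for a simple linear regression, #df_E = n - 2#, because two parameters (slope and intercept) were estimated from the data:

```python
import numpy as np

# Same hypothetical tipping data as in the regression sketch above.
x = np.array([10.0, 15.0, 20.0, 25.0, 30.0, 35.0])
y = np.array([ 1.5,  2.0,  3.1,  3.4,  4.8,  5.1])

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

ss_error = np.sum(residuals ** 2)
df_error = len(y) - 2               # n observations minus 2 fitted parameters
ms_error = ss_error / df_error      # MS_E, our estimate of sigma^2
print(ms_error)
```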