
Residual Plots for Regression Analysis…

In my last article I promised to walk through the parameters that describe the accuracy and predictions of a regression model, but before going into that we first need to understand the importance of the residual plot. Without residual plots, any discussion of regression would be incomplete.

Using residual analysis we can check whether a linear model suits the data or whether the relationship is nonlinear. Residual plots reveal unwanted residual patterns that indicate biased results, and all you need to do is inspect them visually. In residual analysis we check that the residuals are randomly scattered around zero for the entire range of fitted values.

It is crucial to check the residual plots. If your plots display unwanted patterns, you can’t trust the regression coefficients and other numeric results. I was working with one of the ML studio tools and was surprised to see that no module is available for drawing residual plots. Here I’ll show you how simple it is to use R to draw a residual plot and analyse it. Note that a residual is different from a disturbance: the residual is the observed difference between an actual value and the fitted value, while the disturbance is the unobservable error in the true relationship. I am covering only residuals here.

So how do you determine whether the residuals are random in regression analysis? It’s pretty simple: see the figure below, where they are randomly scattered around zero for the entire range of fitted values. When the residuals center on zero, they indicate that the model’s predictions are correct on average rather than systematically too high or too low.
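To make "centered on zero across the whole range" concrete, here is a minimal sketch on simulated data. The variables `x` and `y` and the coefficients are made up for illustration and have nothing to do with my dataset. The idea is to average the residuals within four slices of the fitted values and see whether each slice stays close to zero.

```r
# Simulated data; x, y and the coefficients are invented for illustration only.
set.seed(42)
x <- runif(200, min = 0, max = 10)
y <- 3 + 2 * x + rnorm(200, sd = 1)

fit <- lm(y ~ x)
res <- resid(fit)    # observed minus fitted values
fv  <- fitted(fit)

# With an intercept in the model, OLS residuals always average to zero overall,
# so the informative check is whether they stay near zero in every region.
overall_mean <- mean(res)

# Split the fitted values into four equal-sized groups and average the residuals in each.
bins <- cut(fv, breaks = quantile(fv, probs = seq(0, 1, 0.25)), include.lowest = TRUE)
bin_means <- tapply(res, bins, mean)

print(overall_mean)
print(bin_means)
```

If one slice sits clearly above or below zero while the others do not, that is the kind of systematic bias a residual plot shows visually.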

Below let me generate the residual plot for my model. Here dt2 is my dataframe, and I want to analyse it to see whether GROWTHRATE has anything to do with the population of that area (TOTPOPULAT).

fit <- lm(GROWTHRATE ~ TOTPOPULAT , data = dt2)
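Once the model is fitted, the residual plot itself takes two base R calls. The sketch below uses a simulated stand-in, `dt2_demo`, because dt2 is not reproduced here; the values in it are invented and only the column names mirror mine.

```r
# Stand-in for dt2: simulated values, NOT the data used in this article.
set.seed(1)
dt2_demo <- data.frame(TOTPOPULAT = runif(100, 1e4, 1e6))
dt2_demo$GROWTHRATE <- 300 + 0.00003 * dt2_demo$TOTPOPULAT + rnorm(100, sd = 50)

fit_demo <- lm(GROWTHRATE ~ TOTPOPULAT, data = dt2_demo)

# Residuals on the y-axis against fitted values on the x-axis.
plot(fitted(fit_demo), resid(fit_demo),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)   # dashed reference line at zero
```

With the real dataframe, the same `plot()` and `abline()` calls on `fit` should give the equivalent picture.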


Here I think the errors in the residual plot look unpredictable: near 15 the points are scattered evenly around zero, and near 25 they also look random. So this plot suggests that a linear model is a reasonable form for the relationship between GROWTHRATE and TOTPOPULAT. Random residuals alone do not tell us how strong that relationship is.

Now let’s draw another plot, this time for victim_ages against GROWTHRATE + TOTPOPULAT.

fit <- lm(victim_ages ~ GROWTHRATE + TOTPOPULAT , data = dt2)
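For a model with more than one predictor, R's built-in diagnostic plot is a handy shortcut. A minimal sketch, using the `mtcars` dataset that ships with R purely as a stand-in for dt2 (the model `mpg ~ wt + hp` is my illustrative choice, not part of this analysis):

```r
# mtcars ships with R; it only stands in for dt2 here.
fit_two <- lm(mpg ~ wt + hp, data = mtcars)

# which = 1 selects the "Residuals vs Fitted" panel of plot.lm().
plot(fit_two, which = 1)
```

The panel adds a smoothed trend line through the residuals, which makes a curve or drift away from zero easier to spot.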


Here we can see a pattern: between 26 and 27 the dots are bunched together and biased, and the data do not seem to be scattered evenly around zero.

Now, what could the problem be when a residual plot looks like this?

1. It may not be a regression problem. 

2. The problem may need a nonlinear model.

3. Some of the explanatory information has leaked into the residuals or is not present in the model.

4. Some variables are missing.

5. You may have included a correlated variable that should not be considered.

6. It is time to improve your model because there is something more that your independent variables can explain.
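As an illustration of the nonlinear case in the list above, here is a small sketch on simulated data. The curved relationship and all the numbers are assumptions chosen to make the effect visible. A straight-line fit leaves the curvature in the residuals, and adding a squared term absorbs it.

```r
# Simulated curved relationship; everything here is invented for illustration.
set.seed(7)
x <- seq(-3, 3, length.out = 150)
y <- 1 + 0.5 * x + 2 * x^2 + rnorm(150, sd = 0.5)

fit_line <- lm(y ~ x)            # straight line: misses the curvature
fit_quad <- lm(y ~ x + I(x^2))   # adds a squared term

# The straight-line residuals still track x^2; the quadratic fit's residuals do not.
cor_line <- cor(resid(fit_line), x^2)
cor_quad <- cor(resid(fit_quad), x^2)

r2_line <- summary(fit_line)$r.squared
r2_quad <- summary(fit_quad)$r.squared

print(c(cor_line = cor_line, cor_quad = cor_quad))
print(c(r2_line = r2_line, r2_quad = r2_quad))
```

Plotting `resid(fit_line)` against `fitted(fit_line)` would show a clear U shape, while the same plot for `fit_quad` should look like random scatter.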

To get a numerical summary of your model you can use the R code below:

> summary(lm(GROWTHRATE ~ TOTPOPULAT, data = dt2))

Call:
lm(formula = GROWTHRATE ~ TOTPOPULAT, data = dt2)

Residuals:
   Min     1Q Median     3Q    Max 
-426.7 -233.6 -134.7   49.3 5091.3 

Coefficients:
                Estimate   Std. Error t value  Pr(>|t|)    
(Intercept) 329.05699290  73.95042442   4.450 0.0000134 ***
TOTPOPULAT    0.00002989   0.00002350   1.272     0.205    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 579.6 on 230 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared:  0.006987,	Adjusted R-squared:  0.00267 
F-statistic: 1.618 on 1 and 230 DF,  p-value: 0.2046
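In the output above, the numbers I look at first are the p-value of the slope (0.205 for TOTPOPULAT) and the multiple R-squared (0.006987). If you want those values in code rather than reading them off the console, a sketch along these lines should work. The dataframe `demo` and its columns are simulated and hypothetical, not dt2.

```r
# Simulated data; demo, x and y are hypothetical names, not part of dt2.
set.seed(3)
demo <- data.frame(x = rnorm(120))
demo$y <- 5 + 0.8 * demo$x + rnorm(120)

s <- summary(lm(y ~ x, data = demo))

coef_table <- coef(s)                      # estimates, std. errors, t values, p-values
slope_p    <- coef_table["x", "Pr(>|t|)"]  # p-value for the slope
r2         <- s$r.squared
adj_r2     <- s$adj.r.squared
sigma_hat  <- s$sigma                      # residual standard error

print(round(c(slope_p = slope_p, r2 = r2, adj_r2 = adj_r2, sigma = sigma_hat), 4))
```

Pulling the values out this way makes it easy to compare several candidate models side by side.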

Hopefully you can see that checking your residual plots is a crucial but simple thing to do. You need random residuals.

Next, as promised, let’s go to our OLS model and understand what the underlying parameters are saying about the model and its predictions.

Happy Machine Learning!!
