Regression Basics – MSR, MSE, Cook’s Distance

How do calculate the MSR, MSE, and Cook’s Distance directly from the output dataset with model prediction? SAS will produce a general summary of ANOVA results as below, but how are these statistics calculated.

Y= Intercept + beta1x1 +beta2x2 +beta3x3; n = 49, observation 20 to 47 are omited in display.

The excel spreadsheet shows the model prediction as yhat_old, residual values as resid_old, and the cook’s distance value as cook_old. The 19th observation has a high value (1.29) using the 4/n threshold (4/49=0.08).

  • Mean Square due to Regression (MSR) =24,948,880,090
    • Column model^2 are the squares of the difference between yhat_old and mean of Y which is 186,509, as shown in the Excel;
    • Take sum of the above, which equals to 74,846,640,270, as shown in the Excel.
    • Divided by degree of freedom, which is 3 (3 parameters as x1, x2, x3 ).
    • MSR means variances that are explained by the model.
  • Mean Square Error (MSE) =139,372,106
    • Column error^2 are the squares the difference between yhat_old and Y;
    • Take sum of the above, which equals to 6,271,744,784, as shown in the Excel.
    • Divided by degree of freedom, which is total number of observation minus number of parameters, which should include intercept as a parameter (intercept, x1, x2, x3).
    • MSE means variances that are unexplained by the model.
  • Root MSE or s: The estimated standard deviation of the random error. s=sqrt(139372106)=11805.6
  • F value: MSR/MSE =179.01
  • Cook’s Distance (how the cook’s D for observation 19 is calculated)
    • yhat_new is the model prediction excluding observation 19 in the model.
    • Column diff^2 are the squares of the difference between yhat_old and yhat_new;
    • Take sum of the above , which equals to 723,926,182;
    • For the denominator, multiply MSE (139,732, 106) by the number of parameters (4), which includes the intercept as parameter (intercept, x1, x2, x3).
    • Take the division of the results from the above 2 items equals to 1.2985.
    • Cook’s Distance measures the individual observation’s influence on the model but removing the observation from the model and measure the squared difference in prediction as a portion of the MSE per parameter. The smaller the Cook’s Distance (close to 0) means that removing the observation from the model have no impact on the model.

The Excel calculation matches the SAS output.

PowerBI and R Integration – Connect SAS dataset

Our database infrastructure is in SAS. Although many data requests can be handled directly through Base SAS, but SAS doesn’t have good visualization capability.

Directly connect PowerBI with SAS raw datasets or analytical result in SAS datasets from PowerBI through R integration can efficiently demonstrate the insights of data.

Here is the R script to connect the SAS dataset to PowerBI.