Regression Basics – MSR, MSE, Cook’s Distance

How do calculate the MSR, MSE, and Cook’s Distance directly from the output dataset with model prediction? SAS will produce a general summary of ANOVA results as below, but how are these statistics calculated.

Y= Intercept + beta1x1 +beta2x2 +beta3x3; n = 49, observation 20 to 47 are omited in display.

The excel spreadsheet shows the model prediction as yhat_old, residual values as resid_old, and the cook’s distance value as cook_old. The 19th observation has a high value (1.29) using the 4/n threshold (4/49=0.08).

  • Mean Square due to Regression (MSR) =24,948,880,090
    • Column model^2 are the squares of the difference between yhat_old and mean of Y which is 186,509, as shown in the Excel;
    • Take sum of the above, which equals to 74,846,640,270, as shown in the Excel.
    • Divided by degree of freedom, which is 3 (3 parameters as x1, x2, x3 ).
    • MSR means variances that are explained by the model.
  • Mean Square Error (MSE) =139,372,106
    • Column error^2 are the squares the difference between yhat_old and Y;
    • Take sum of the above, which equals to 6,271,744,784, as shown in the Excel.
    • Divided by degree of freedom, which is total number of observation minus number of parameters, which should include intercept as a parameter (intercept, x1, x2, x3).
    • MSE means variances that are unexplained by the model.
  • Root MSE or s: The estimated standard deviation of the random error. s=sqrt(139372106)=11805.6
  • F value: MSR/MSE =179.01
  • Cook’s Distance (how the cook’s D for observation 19 is calculated)
    • yhat_new is the model prediction excluding observation 19 in the model.
    • Column diff^2 are the squares of the difference between yhat_old and yhat_new;
    • Take sum of the above , which equals to 723,926,182;
    • For the denominator, multiply MSE (139,732, 106) by the number of parameters (4), which includes the intercept as parameter (intercept, x1, x2, x3).
    • Take the division of the results from the above 2 items equals to 1.2985.
    • Cook’s Distance measures the individual observation’s influence on the model but removing the observation from the model and measure the squared difference in prediction as a portion of the MSE per parameter. The smaller the Cook’s Distance (close to 0) means that removing the observation from the model have no impact on the model.

The Excel calculation matches the SAS output.

PowerBI and R Integration – Connect SAS dataset

Our database infrastructure is in SAS. Although many data requests can be handled directly through Base SAS, but SAS doesn’t have good visualization capability.

Directly connect PowerBI with SAS raw datasets or analytical result in SAS datasets from PowerBI through R integration can efficiently demonstrate the insights of data.

Here is the R script to connect the SAS dataset to PowerBI.

SAS: Proc SQL Group By generate duplicate results and fix

I need to get a list of sum of award amount by year by student id and by award. The following codes generated duplicated results and a warning message.

Problem:

proc sql;
create table want as
select year, id, award, funding_type, type, sum(amount) as sum, 1 as count
from have
where (funding_type eq "A" and type = "Internal") or (funding_type eq "B")
group by year, funding_type, sisid, award;quit

“The query requires remerging summary statistics back with the original data. “

Fix: “In SAS SQL, in a query with a group by clause that includes extraneous columns on the select statement (i.e. columns not part of the group by and not derived from an aggregating function), SAS “remerges” the summary statistics back to the original data (with a note to that effect).”
Reference:
https://stackoverflow.com/questions/25538392/sas-proc-sql-returning-duplicate-values-of-group-by-order-by-variables

Because “type” variable is in the select statement but not in the group by statement, SAS needs to remerging the summary statistics back with the original data to get information on the “type” variable. In the following codes, I added “type” to the group by statement and the warning message was gone.

proc sql;
create table want as
select year, id, award, funding_type, type, sum(amount) as sum, 1 as count
from funding
where (funding_type eq "A" and type = "Internal") or (funding_type eq "B")
group by year, funding_type, type, id, award;
quit;

Proc SQL: Update with other table

Update statement in Data step:

Syntax:

Data master;
update master  transaction;
by key1 key2;
run;

Limitation:

  • The master data set should be sorted by the by group with no duplicated grouping id.
  • The transaction data set should be sorted by the by group with no duplicated grouping id.
  • Transaction data set should have just the column that contains the updated information.

Update using Proc SQL Update statement:

Syntax:

proc sql;
update table1 as t1
set major=(select major from table2 as t2 where t1.id = t2.id and t1.year=t2.year)
where exists (select 1 from table2 t2 where t1.id = t2.id and t1.year=t2.year);
quit;

Limitation:

  • Can’t use join statement. Can only use nested Select statement.
  • Code is a little complicated to comprehend.

SAS: Identify Lag and Lead Status by Row

For character variable.

data want;
set have;
by id;
set  have (firstobs =2 keep = index0 rename= (index0= index1))
     have (obs=1 drop = _all_ );
indexlag1 = ifc( first.id, '', lag1(index0));
index1 = ifc( last.id, '', index1);
run;

For numeric variable.

data want;
set have;
by id;
set  have (firstobs =2 keep = index0 rename= (index0= index1))
     have (obs=1 drop = _all_ );
indexlag1 = ifn( first.id, (.), lag1(index0));
index1 = ifn( last.id, (.) , index1);
run;

SAS: Identify Row Level Changes and Tagging

Sort the dataset first and identify the change using lag(var).
The example below shows if the student cumulative credits drops from previous term, the count of degree will increase by 1.

/* identify the row observation that the change took place */
data data2;
set data1;
by sisid term; 
if first.sisid then deg =1;
else if sisid = lag1(sisid) and cumm_cr < lag1(cumm_cr) then deg = 2;
run;
/* tagging the observations associated with the change */
data data3 ;
set data2;
by sisid;
retain temp;
if first.sisid then do; temp = 0; end;
if deg ne . then temp = deg;
else if deg eq . then deg= temp;
run;

Updated to

data data2;
set data1;
by sisid term; 
retain deg;
if first.sisid then deg =1;
else if sisid = lag1(sisid) and cumm_cr < lag1(cumm_cr) then deg = deg +1;
run;

SAS: Reading Census Data

Census data is exceptionally large. The 2016 census profile for Ontario is 4.5G and has more than 46,694,909 lines of records. To extract the data efficiently, StatsCan provides a csv file that identifies the starting row number for each geography. Using this file, you can compile the parameter list for the geographical area of interest at the selected geographic level, eg. province level, census division level, and census subdivision level.
Census file can be downloaded at link
Step 1: Compile parameter lists

%put &name.;
Canada Ontario Durham York Toronto Peel Halton
 %put &start.;
2 2249 7287023 9513800 12198965 20521853 25289987
 %put &end.;
2248 4495 7289269 9516046 12201211 20524099 25292233

Step 2: Extract and Compile data

%macro ext (namelst=, startlst=, endlst=);
proc datasets library=work noprint;
delete census;
quit;
%let i=1;
%do %while (%scan(&namelst., &i, ' ') ne );
%let parm1=%scan(&namelst., &i, ' ');
%let parm2=%scan(&startlst., &i, ' ');
%let parm3=%scan(&endlst., &i, ' ');
data census_&parm1.;
infile 'X:\Work\Stats Can\98-401-X2016044_ONTARIO_eng_CSV\98-401-X2016044_ONTARIO_English_CSV_data.csv'
delimiter = ',' firstobs=&parm2. obs=&parm3. TRUNCOVER  DSD LRECL=32767 ;
INFORMAT 
year 8.
geo_code  $13.
geo_level 8.
geo_name $80.
gnr 8.1
gnr_lf 8.1
quality_flag $5. 
alt_geo_code 8.
Item $50.
itemID 8.
Notes 8.
Total 8.
Male $8.
Female $8.
;
FORMAT 
year 8.
geo_code  $13.
geo_level 8.
geo_name $50.
gnr 8.1
gnr_lf 8.1
quality_flag $5. 
alt_geo_code 8.
Item $80.
itemID 8.
Notes 8.
Total 8.
Male $8.
Female $8.
;
input
year 
geo_code $
geo_level 
geo_name $
gnr
gnr_lf
quality_flag 
alt_geo_code 
Item $
itemID 
Notes 
Total
Male $
Female $
;
run;
proc append base = census data = census_&parm1.;
run;
%let i = %eval (&i +1);
%end;
%mend ext;
%ext (namelst=&name., startlst=&start., endlst=&end. );

The resulting dataset is only 4.1mb, which you can manipulate efficiently.

SAS: Convert format catalogs from 32bit to 64bit

Reference: http://support.sas.com/kb/44/047.html

Error message:
ERROR: File FORMATS.CATALOG was created for a different operating system.
Step 1: In windows 32-bit SAS, create a transport file (.cpt) with PROC CPORT and file option.

libname my32 'X:\Work\SAS\formats'; /* path where commonfmt.sas7bcat exists */
filename cat1 'X:\Work\SAS\formats\commonfmt.cpt';  /* transport file you are creating */

proc cport lib=my32 file=cat1 memtype=catalog;
   select commonfmt;
run;

The .cpt file will contain the format information of commonfmt.sas7bcat catalog file.

Step 2: In windows 64-bit SAS, unload the transport file (.cpt) using PROC CIMPORT and infile option.

libname my64 'X:\Work\SAS\format64';  /* path to store the new Formats.sas7bcat file */
filename trans1 'X:\Work\SAS\formats\commonfmt.cpt';  /* same as in Step 1 above */

proc cimport infile=trans1 lib=my64;
run;

Step 3: Check the formats in windows 64-bit SAS.

libname sasfmt  'X:\Work\SAS\format64';

PROC CATALOG CATALOG = SASFMT.COMMONFMT;
CONTENTS;
QUIT;

SAS 9.4 ODS RTF file corrupt

Recently, I updated from SAS 9.3 to 9.4. The code was working fine on 9.3. It contains a macro running do loops, producing graphs for each loop, and outputting all the graphs into a rtf file. When I ran the code in 9.4, I can’t open the file in Word. Google search recommends update the sasv9.cfg file to increase the -memsize and -memmaxsz to 6G (https://communities.sas.com/t5/ODS-and-Base-Reporting/SAS-ODS-RTF-file-corrupt/td-p/282779). Another alternative I find is to just change ODS RTF to ODS PDF without changing any code. When running the ODS PDF statements, I got the following warning, but the PDF file was generated without a problem.

Warning: Unsupported device "ACTIVEX" for LISTING destination.  Using device "ACTXIMG".

To get rid of the warning, in SAS EG, navigating to Tools/Options…/Results/Graph. In the “Graph Format” dropdown list, changing “ActiveX” to “ActiveX image”.