SQL Join: Compile Statistics Summary from Multiple Statistics Tables

In the previous blog, I presented how to compile statistic results by groups and by variable into a long table for each type of statistics analysis.   All tables share three common field: group, test and var.

Here is the list of the table with the specific statistic that want to extract and compile into a summary table.

  • stat_tstat:
    • class = ‘1’: n, mean
    • class = ‘0’: n, mean
    • class = ‘Diff (1-2)’: mean,
  • stat_ttest:
    • variances = ‘Unequal’: tValue, Probt

PPOC SQL is the most convenient procedure to pick up these values under specific condition and combine them into one table.

proc sql;
create table stat1 as
select a.n as studyn, a.mean as studymean, b.n as controln, b.mean as controlmean, c.mean as diff, a.group, a.var, d.tValue, d.probt
from stat_tstat a
left join stat_tstat b
on a.group =b.group and a.var=b.var and b.class = '1'
left join stat_tstat c
on a.group =c.group and a.var=c.var and c.class = 'Diff (1-2)'
left join stat_ttest d
on a.group = d.group and a.var=d.var and d.variances = 'Unequal'
where a.class = '0';
quit;

SAS: Stacking Statistical Output using Macros

The previous post goes through basic SAS statistics procedures, like Proc Means, Proc Ttest, and Proc Freq.  In this post, the focus is to establish a good naming convention for the statistical outputs. For each loop of PROC TTEST or PROC FREQ, the statistics are saved in separate tables by pairing group and type of statistics.  When stacking the test results, it is important to identify the type of statistics, the variable that the statistics is based on and the pairing group.  In the following macro, &in. is the group identifier, &var. is the variable identified, and &stat is the statistics procedure identifier.

%macro stat (in =, var=, stat=);
data &in.&stat.&var.;
set &in.&stat.&var.;
length group $2 test $20 var $20;
group = substr("&in.", 1, 2);
test = "&stat.";
var = "&var.";
rename &var. = response;
run;
proc append base=stat_&stat. data = &in.&stat._&var.;
run;
%mend stat;

proc datasets library= work noprint;
delete stat_prop;
run;
%stat ( in =g1, var= isretainYr1,stat=prop);
%stat ( in =g1, var= isretainYr2,stat=prop);
%stat ( in =g1, var= isretainYr3,stat=prop);
%stat ( in =g1, var= isretainYr4,stat=prop);
%stat ( in =g2, var= isretainYr1,stat=prop);
%stat ( in =g2, var= isretainYr2,stat=prop);
%stat ( in =g2, var= isretainYr3,stat=prop);
%stat ( in =g2, var= isretainYr4,stat=prop);

SAS STAT: getting output statistics (1)

SAS STAT package has many procedures that can be used to conduct specific statistical analysis with structured data sets.  I found the following use of precedures quite common and there are usually a number of statistics needed to be output for further compilation.  The caveat is to sort the data set in the desired order, so the null hypothesis is in the upper left cell (1,1) in the two-way frequency table, which normally showes the results of study group with a respone equal to 1. This way the output statistics can be interpreted more straight forward.

A. How to get output from statistics procedures?

There are different ways for outputing statistics, eg. ODS OUTPUT, OUTPUT statement or OUT=;

a. use PROC MEANS for general statistics like n, mean, min, max std;

  • use maxdec= option to adjust the decimal places needed.
  • output out=filename: N, MIN, MAX, MEAN, STD, SUMWGT

b. use PROC TTEST for two-sample independent T-test; use CLASS Variable to identify and differentiate the groups for study cases and control cases;

  • ods output
    • ttests=filename: tValue, DF, (pick Satterthwaite method which assume unequal variance);
    • statistics=filename: n, mean, LowerCLMean, UpperCLMean, StdDev, StdErr for ‘0’,’1′, Diff (1-2);
    • equality=filename: fValue, probf (equality of variances)

c. use PROC FREQ for two-sample independent T-test

  • tables statement, out option
    • out= filename: default only provide the TABLES variable, frequency count, percent of total frequency.  To include percent of row frequency  and percent of column frequency, you need to add OUTPCT in TABLES statement.
  • output statement out option
    • Output  relrisk  out = filename;
  • ods output
    • ChiSq=filename: Chi-Square, Prob
    • PdiffTest=filename: Proportion Difference Test, Wald is the default method, Proportion Difference, Z Value, One-sided Prob and two-sided prob.
    • RelativeRisks=filename: relative risk estimate, case-control (odds ratio), column 1 risk, column 2 risk, 95% L/U Confidence Limit

Overall, the ODS OUTPUT is the most versatile and powerful method to obtain statisics results from these procedure.  Check “ODS Table Names” under each procedure in the SAS STAT User Guide.  The SAS procedure assigns a name to each table that it creates.  Use ODS Table Name= filename to assign your own table name to be saved in the work library.  In order to have the output ods table, you also need to check if you have included the corresponding option(s) in the specific statement for the program generate the statistics.

Sample code for Two-sample Proportion Test (include all 3 methods of output):

%macro prop ( in =, var=, out=, weight=);
proc sort data=∈
by descending study descending &var.;
run;
ods graphics on;
ods output ChiSq = &out.chi_&var. PdiffTest=&out.pdiff_&var. RelativeRisks=&out.rr_&var.;
proc freq data=&in. order =data ;
format study grpfmt. &var. rspFmt.;
weight &weight.;
tables study*&var. / chisq measures riskdiff(equal) outpct out=&out._&var.
plots= (freqplot(twoway=groupvertical scale =percent));
output relrisk out=&out._or_&var.;
title "Proportion Test:Case - Control study of variable &var. for &out.";
run;
ods output close;
ods graphics off;
%mend prop;
%prop ( in = g1pair, var= isRetainYr1, out =g1prop, weight = weight1);

*”The Satterthwaite approximation of the standard errors differs from the Pooled method in that it does not assume that the variances of the two samples are equal. This means that if the variances are equal, the Satterthwaite approximation should give us exactly the same answer as the Pooled method.” (Reference: https://wolfweb.unr.edu/~ldyer/classes/396/PSE.pdf)

SAS STAT: Weight Statement

Weight statement can be used to exclude observation from the statistical procedure. If Weight statement is not specified, each obs has a default weight of 1.

PROC MEANS

0: Counts the obs in number of observations
<0: convert to 0 and treat like 0
missing: exclude

PROC FREQ

0: ignore obs unless specify the ZEROS option, which include obswith 0 weights.
missing: exclude

PROC TTEST

0: exclude
<0: exlude
missing: exclude

Basically, assigning a missing value to the weight field to exclude the observation and assigning 1 to the weight field to include the observation in the analysis.

SAS: Sampling using PROC SURVEYSELECT

STEP1: SEPARATE CASES AND CONTROL DATASETS WITH ALL THE VARIABLES OF INTEREST

STEP2: ESTABLISH INDEX:

%let index1 = year||entry_lvl||regist||faculty||isRetainYr0;
%let index2 = year||entry_lvl||regist||faculty||isRetainYr1;
%let index3 = year||entry_lvl||regist||faculty||isRetainYr2;
%let index4 = year||entry_lvl||regist||faculty||isRetainYr3;
  • Index are developed based on the characters you want to control. In the example, we want to control the year, the study level, the registration status (full-time or part-time) and the faculty when the student originally entering into the program;  the retention status of the student.

STEP3: GET FREQUENCY DISTRIBUTION FOR CASES AND CONTROL DATASETS BY INDICES

%macro match (sur1=, sur2=);
data cases&sur1.(keep =id index&sur2. );
set cases (keep= id &match.);
index&sur2. = compress(&&index&sur2) ;
run;
proc freq data = cases&sur1. noprint;
tables index&sur2./list missing out =casescnt&sur1. (keep = index&sur2. count rename=(count=casescnt));
run;
%mend match;
%match (sur1=y1, sur2=1);
%match (sur1=y2, sur2=2);
%match (sur1=y3, sur2=3);
%match (sur1=y4, sur2=4);
%macro match (sur1=, sur2=);
data pool&sur1.(keep =id index&sur2. );
set pool (keep= id &match.);
index&sur2. = compress(&&index&sur2) ;
run;
proc freq data = pool&sur1. noprint;
tables index&sur2./list missing out =poolcnt&sur1. (keep = index&sur2. count rename=(count=poolcnt));
run;
%mend match;
%match (sur1=y1, sur2=1);
%match (sur1=y2, sur2=2);
%match (sur1=y3, sur2=3);
%match (sur1=y4, sur2=4);

STEP4: MERGE CASES AND CONTROL BY INDEX AND CHECK THE CASES THAT HAVE NO POSSIBLE MATCHING CONROL CASES BY INDEX. IF THERE ARE MANY STUDY CASES NOT ABLE TO FIND MATCHING CONTROL CASES (eg. size of control is smaller than the sized of study cases), THEN WE NEED TO CHANGE THE INDEX.

%macro match (sur1=, sur2=);
data count&sur1.;
merge casescnt&sur1. (in =a ) poolcnt&sur1. (in =b);
by index&sur2.;
if casescnt >0;
if a and not b then poolcnt = 0;
nsize = min(casescnt, poolcnt);
run;
data count&sur1. notindex&sur1.;
set count&sur1.;
if nsize = 0 then output notindex&sur1.;
else output count&sur1.;
run;
%mend match;
%match (sur1=y1, sur2=1);
%match (sur1=y2, sur2=2);
%match (sur1=y3, sur2=3);
%match (sur1=y4, sur2=4);

STEP5: LIMIT THE CONTROL CASES WHERE THE INDICES EQUAL TO THOSE OF THE STUDY CASES. SAMPLING BASED ON THE SIZE OF THE STUDY CASES BY EACH STRATA.

%macro match (sur1= , sur2= );
proc sql;
create table pool&sur2. as
select *
from pool&sur1.
where index&sur2. in (select index&sur2. from count&sur1.);
proc sort data = pool&sur2.;
by index&sur2.;
run;
proc surveyselect data = pool&sur2.
sampsize = count&sur1.
method = srs
seed = 300001
out=selected&sur1.;
strata index&sur2.;
run;
%mend match;
%match (sur1=y1, sur2=1);
%match (sur1=y2, sur2=2);
%match (sur1=y3, sur2=3);
%match (sur1=y4, sur2=4);

*change the seed number to get different random samples.  Use the same seed number to achieve reproducible results.

 

 

SAS: Update values with the last record and keep other attributes of the first record by group

The list contains course grades for students and the year students took the course.  It is possible that students can retake the course to improve their gpa.  We want a list that the last records of the students are showing all the grades that they took in the past and if a course was taken multiple times then the most recenct grade of that course.  We also want the last records of the student to show the initial year that the students were taking these course.

The following marco used a temp variable to hold the grade of the course and retain the value for the same student and if the course is retaken, the temp variable will be assigned with new grade, and if the course grade is empty, then the value retained in the  temp variable will be assigned to the course grade.

%macro lst (in =, out=, crs=);
data &out.;
drop temp ;
set &in.;
by id;
length temp $2 firstcrsyr $4 ;
retain temp firstcrsyr ;
if first.id then do; temp = '' ; firstcrsyr = yr; end;
/* Assign TEMP when courseX is non-missing /
if &crs. ne '' then temp=&crs.;
/ When X is missing, assign the retained value of TEMP into courseX */
else if &crs. eq '' then &crs.=temp;
run;
%mend lst;
%lst (in = list1, out = list2, crs= course1);
%lst (in = list2, out = list2, crs= course2);
%lst (in = list2, out = list2, crs= course3);

SAS: Cumulative Count by Study and Control Group

This example is used to set up pairing id (PAIRID) for study group and control group. Variable INDEX is the matching character. Variable STUDY identify the group membership (0 = control group; 1= study group).  Variable ID is the unique identifier. In the SAMPLE dataset, number of membership are equal for both the control group and study group at each strata.

data pair;
set sample;
by study index;
count =1;
if first.study then pairid = 0;
pairid +count;
drop count;
run;

*Not using retain statement.

SAS: Proc SQL Where … In to subset

SQL where … in can be used to subsetting dataset which meet contain condition and the condition is in a separate dataset.

Table ONE is a large dataset and contains variable ID and other related information. Table TWO contains unique ID that of interest.  Table THREE is the subset of table ONE which match the ID value with table TWO.

proc sql;
create table THREE as
select *
from ONE
where ID in (select ID from TWO);

 

Check SAS license and components installed

proc setinit; run;

Proc setinit provides the components list with the expiration dates.

—Base SAS Software
—SAS/STAT
—SAS/GRAPH
—SAS/ETS
—SAS/FSP
—SAS/OR
—SAS/AF
—SAS/IML
—SAS/QC
—SAS/SHARE
—SAS/LAB
—SAS/ASSIST
—SAS/CONNECT
—SAS/INSIGHT
—SAS/EIS
—SAS/SHARE*NET
—MDDB Server common products
—SAS Integration Technologies
—SAS/Secure Windows
—SAS Enterprise Guide
—SAS Bridge for ESRI
—OR OPT
—OR PRS
—OR IVS
—OR LSO
—SAS/ACCESS Interface to DB2
—SAS/ACCESS Interface to Oracle
—SAS/ACCESS Interface to Sybase
—SAS/ACCESS Interface to PC Files
—SAS/ACCESS Interface to ODBC
—SAS/ACCESS Interface to OLE DB
—SAS/ACCESS Interface to Teradata
—SAS/ACCESS Interface to MySQL
—SAS/IML Studio
—SAS Workspace Server for Local Access
—SAS/ACCESS Interface to Netezza
—SAS/ACCESS Interface to Aster nCluster
—SAS/ACCESS Interface to Greenplum
—SAS/ACCESS Interface to Sybase IQ
—DataFlux Trans DB Driver
—SAS Framework Data Server
—Reserved for Dataflux

proc product_status; run; 

Get the version of the product.