December 2017 - BI Practice

EXCEL: Most Clever Lookup Formula with Multiple Conditions

I came across an excel lookup function through the following link. https://www.extendoffice.com/documents/excel/2440-excel-vlookup-multiple-criteria.html

Then I found further explanation in the following forum.

https://www.excelforum.com/excel-formulas-and-functions/1043140-explain-lookup-2-1-a2-a10-d2-b2-b10.html

The structure for the formula is:

=LOOKUP(2, 1/(COLUMN1=CONDITION1)/(COLUMN2=CONDITION2)/(COLUMN=CONDITIONn), (RESULTCOLUMN))

This formula can be used to look up both numeric value and character value that match multiple conditions. COLUMN1=CONDITION1 will produce an array of 0s and 1s; 1 divided by the array will produce an array of %Div/0s and 1s; then the array will be further divided by an array of 0s and 1s (results of column2=condition2) and so on. The lookup value is set at 2 because 2 is greater than 1, and if lookup_value (2) is not found within the lookup_vector, LOOKUP returns the last value in that array which is less than 2. The formula then returns the value in the result column using the corresponding position found in the lookup columns.

SQL Join: Compile Statistics Summary from Multiple Statistics Tables

In the previous blog, I presented how to compile statistic results by groups and by variable into a long table for each type of statistics analysis. All tables share three common field: group, test and var.

Here is the list of the table with the specific statistic that want to extract and compile into a summary table.

stat_tstat:
- class = ‘1’: n, mean
- class = ‘0’: n, mean
- class = ‘Diff (1-2)’: mean,
stat_ttest:
- variances = ‘Unequal’: tValue, Probt

PPOC SQL is the most convenient procedure to pick up these values under specific condition and combine them into one table.

proc sql;
create table stat1 as
select a.n as studyn, a.mean as studymean, b.n as controln, b.mean as controlmean, c.mean as diff, a.group, a.var, d.tValue, d.probt
from stat_tstat a
left join stat_tstat b
on a.group =b.group and a.var=b.var and b.class = '1'
left join stat_tstat c
on a.group =c.group and a.var=c.var and c.class = 'Diff (1-2)'
left join stat_ttest d
on a.group = d.group and a.var=d.var and d.variances = 'Unequal'
where a.class = '0';
quit;

SAS: Stacking Statistical Output using Macros

The previous post goes through basic SAS statistics procedures, like Proc Means, Proc Ttest, and Proc Freq. In this post, the focus is to establish a good naming convention for the statistical outputs. For each loop of PROC TTEST or PROC FREQ, the statistics are saved in separate tables by pairing group and type of statistics. When stacking the test results, it is important to identify the type of statistics, the variable that the statistics is based on and the pairing group. In the following macro, &in. is the group identifier, &var. is the variable identified, and &stat is the statistics procedure identifier.

%macro stat (in =, var=, stat=);
data &in.&stat.&var.;
set &in.&stat.&var.;
length group $2 test $20 var $20;
group = substr("&in.", 1, 2);
test = "&stat.";
var = "&var.";
rename &var. = response;
run;
proc append base=stat_&stat. data = &in.&stat._&var.;
run;
%mend stat;

proc datasets library= work noprint;
delete stat_prop;
run;
%stat ( in =g1, var= isretainYr1,stat=prop);
%stat ( in =g1, var= isretainYr2,stat=prop);
%stat ( in =g1, var= isretainYr3,stat=prop);
%stat ( in =g1, var= isretainYr4,stat=prop);
%stat ( in =g2, var= isretainYr1,stat=prop);
%stat ( in =g2, var= isretainYr2,stat=prop);
%stat ( in =g2, var= isretainYr3,stat=prop);
%stat ( in =g2, var= isretainYr4,stat=prop);

SAS STAT: getting output statistics (1)

SAS STAT package has many procedures that can be used to conduct specific statistical analysis with structured data sets. I found the following use of precedures quite common and there are usually a number of statistics needed to be output for further compilation. The caveat is to sort the data set in the desired order, so the null hypothesis is in the upper left cell (1,1) in the two-way frequency table, which normally showes the results of study group with a respone equal to 1. This way the output statistics can be interpreted more straight forward.

A. How to get output from statistics procedures?

There are different ways for outputing statistics, eg. ODS OUTPUT, OUTPUT statement or OUT=;

a. use PROC MEANS for general statistics like n, mean, min, max std;

use maxdec= option to adjust the decimal places needed.
output out=filename: N, MIN, MAX, MEAN, STD, SUMWGT

b. use PROC TTEST for two-sample independent T-test; use CLASS Variable to identify and differentiate the groups for study cases and control cases;

ods output
- ttests=filename: tValue, DF, (pick Satterthwaite method which assume unequal variance);
- statistics=filename: n, mean, LowerCLMean, UpperCLMean, StdDev, StdErr for ‘0’,’1′, Diff (1-2);
- equality=filename: fValue, probf (equality of variances)

c. use PROC FREQ for two-sample independent T-test

tables statement, out option
- out= filename: default only provide the TABLES variable, frequency count, percent of total frequency. To include percent of row frequency and percent of column frequency, you need to add OUTPCT in TABLES statement.
output statement out option
- Output relrisk out = filename;
ods output
- ChiSq=filename: Chi-Square, Prob
- PdiffTest=filename: Proportion Difference Test, Wald is the default method, Proportion Difference, Z Value, One-sided Prob and two-sided prob.
- RelativeRisks=filename: relative risk estimate, case-control (odds ratio), column 1 risk, column 2 risk, 95% L/U Confidence Limit

Overall, the ODS OUTPUT is the most versatile and powerful method to obtain statisics results from these procedure. Check “ODS Table Names” under each procedure in the SAS STAT User Guide. The SAS procedure assigns a name to each table that it creates. Use ODS Table Name= filename to assign your own table name to be saved in the work library. In order to have the output ods table, you also need to check if you have included the corresponding option(s) in the specific statement for the program generate the statistics.

Sample code for Two-sample Proportion Test (include all 3 methods of output):

%macro prop ( in =, var=, out=, weight=);
proc sort data=&in;
by descending study descending &var.;
run;
ods graphics on;
ods output ChiSq = &out.chi_&var. PdiffTest=&out.pdiff_&var. RelativeRisks=&out.rr_&var.;
proc freq data=&in. order =data ;
format study grpfmt. &var. rspFmt.;
weight &weight.;
tables study*&var. / chisq measures riskdiff(equal) outpct out=&out._&var.
plots= (freqplot(twoway=groupvertical scale =percent));
output relrisk out=&out._or_&var.;
title "Proportion Test:Case - Control study of variable &var. for &out.";
run;
ods output close;
ods graphics off;
%mend prop;
%prop ( in = g1pair, var= isRetainYr1, out =g1prop, weight = weight1);

*”The Satterthwaite approximation of the standard errors differs from the Pooled method in that it does not assume that the variances of the two samples are equal. This means that if the variances are equal, the Satterthwaite approximation should give us exactly the same answer as the Pooled method.” (Reference: https://wolfweb.unr.edu/~ldyer/classes/396/PSE.pdf)

SAS STAT: Weight Statement

Weight statement can be used to exclude observation from the statistical procedure. If Weight statement is not specified, each obs has a default weight of 1.

PROC MEANS

0: Counts the obs in number of observations
<0: convert to 0 and treat like 0
missing: exclude

PROC FREQ

0: ignore obs unless specify the ZEROS option, which include obswith 0 weights.
missing: exclude

PROC TTEST

0: exclude
<0: exlude
missing: exclude

Basically, assigning a missing value to the weight field to exclude the observation and assigning 1 to the weight field to include the observation in the analysis.

SAS: Sampling using PROC SURVEYSELECT

STEP1: SEPARATE CASES AND CONTROL DATASETS WITH ALL THE VARIABLES OF INTEREST

STEP2: ESTABLISH INDEX:

%let index1 = year||entry_lvl||regist||faculty||isRetainYr0;
%let index2 = year||entry_lvl||regist||faculty||isRetainYr1;
%let index3 = year||entry_lvl||regist||faculty||isRetainYr2;
%let index4 = year||entry_lvl||regist||faculty||isRetainYr3;

Index are developed based on the characters you want to control. In the example, we want to control the year, the study level, the registration status (full-time or part-time) and the faculty when the student originally entering into the program; the retention status of the student.

STEP3: GET FREQUENCY DISTRIBUTION FOR CASES AND CONTROL DATASETS BY INDICES

%macro match (sur1=, sur2=);
data cases&sur1.(keep =id index&sur2. );
set cases (keep= id &match.);
index&sur2. = compress(&&index&sur2) ;
run;
proc freq data = cases&sur1. noprint;
tables index&sur2./list missing out =casescnt&sur1. (keep = index&sur2. count rename=(count=casescnt));
run;
%mend match;
%match (sur1=y1, sur2=1);
%match (sur1=y2, sur2=2);
%match (sur1=y3, sur2=3);
%match (sur1=y4, sur2=4);
%macro match (sur1=, sur2=);
data pool&sur1.(keep =id index&sur2. );
set pool (keep= id &match.);
index&sur2. = compress(&&index&sur2) ;
run;
proc freq data = pool&sur1. noprint;
tables index&sur2./list missing out =poolcnt&sur1. (keep = index&sur2. count rename=(count=poolcnt));
run;
%mend match;
%match (sur1=y1, sur2=1);
%match (sur1=y2, sur2=2);
%match (sur1=y3, sur2=3);
%match (sur1=y4, sur2=4);

STEP4: MERGE CASES AND CONTROL BY INDEX AND CHECK THE CASES THAT HAVE NO POSSIBLE MATCHING CONROL CASES BY INDEX. IF THERE ARE MANY STUDY CASES NOT ABLE TO FIND MATCHING CONTROL CASES (eg. size of control is smaller than the sized of study cases), THEN WE NEED TO CHANGE THE INDEX.

%macro match (sur1=, sur2=);
data count&sur1.;
merge casescnt&sur1. (in =a ) poolcnt&sur1. (in =b);
by index&sur2.;
if casescnt >0;
if a and not b then poolcnt = 0;
nsize = min(casescnt, poolcnt);
run;
data count&sur1. notindex&sur1.;
set count&sur1.;
if nsize = 0 then output notindex&sur1.;
else output count&sur1.;
run;
%mend match;
%match (sur1=y1, sur2=1);
%match (sur1=y2, sur2=2);
%match (sur1=y3, sur2=3);
%match (sur1=y4, sur2=4);

STEP5: LIMIT THE CONTROL CASES WHERE THE INDICES EQUAL TO THOSE OF THE STUDY CASES. SAMPLING BASED ON THE SIZE OF THE STUDY CASES BY EACH STRATA.

%macro match (sur1= , sur2= );
proc sql;
create table pool&sur2. as
select *
from pool&sur1.
where index&sur2. in (select index&sur2. from count&sur1.);
proc sort data = pool&sur2.;
by index&sur2.;
run;
proc surveyselect data = pool&sur2.
sampsize = count&sur1.
method = srs
seed = 300001
out=selected&sur1.;
strata index&sur2.;
run;
%mend match;
%match (sur1=y1, sur2=1);
%match (sur1=y2, sur2=2);
%match (sur1=y3, sur2=3);
%match (sur1=y4, sur2=4);

*change the seed number to get different random samples. Use the same seed number to achieve reproducible results.