SAS: Sampling using PROC SURVEYSELECT

STEP1: SEPARATE CASES AND CONTROL DATASETS WITH ALL THE VARIABLES OF INTEREST

STEP2: ESTABLISH INDEX:

%let index1 = year||entry_lvl||regist||faculty||isRetainYr0;
%let index2 = year||entry_lvl||regist||faculty||isRetainYr1;
%let index3 = year||entry_lvl||regist||faculty||isRetainYr2;
%let index4 = year||entry_lvl||regist||faculty||isRetainYr3;
  • Index are developed based on the characters you want to control. In the example, we want to control the year, the study level, the registration status (full-time or part-time) and the faculty when the student originally entering into the program;  the retention status of the student.

STEP3: GET FREQUENCY DISTRIBUTION FOR CASES AND CONTROL DATASETS BY INDICES

%macro match (sur1=, sur2=);
data cases&sur1.(keep =id index&sur2. );
set cases (keep= id &match.);
index&sur2. = compress(&&index&sur2) ;
run;
proc freq data = cases&sur1. noprint;
tables index&sur2./list missing out =casescnt&sur1. (keep = index&sur2. count rename=(count=casescnt));
run;
%mend match;
%match (sur1=y1, sur2=1);
%match (sur1=y2, sur2=2);
%match (sur1=y3, sur2=3);
%match (sur1=y4, sur2=4);
%macro match (sur1=, sur2=);
data pool&sur1.(keep =id index&sur2. );
set pool (keep= id &match.);
index&sur2. = compress(&&index&sur2) ;
run;
proc freq data = pool&sur1. noprint;
tables index&sur2./list missing out =poolcnt&sur1. (keep = index&sur2. count rename=(count=poolcnt));
run;
%mend match;
%match (sur1=y1, sur2=1);
%match (sur1=y2, sur2=2);
%match (sur1=y3, sur2=3);
%match (sur1=y4, sur2=4);

STEP4: MERGE CASES AND CONTROL BY INDEX AND CHECK THE CASES THAT HAVE NO POSSIBLE MATCHING CONROL CASES BY INDEX. IF THERE ARE MANY STUDY CASES NOT ABLE TO FIND MATCHING CONTROL CASES (eg. size of control is smaller than the sized of study cases), THEN WE NEED TO CHANGE THE INDEX.

%macro match (sur1=, sur2=);
data count&sur1.;
merge casescnt&sur1. (in =a ) poolcnt&sur1. (in =b);
by index&sur2.;
if casescnt >0;
if a and not b then poolcnt = 0;
nsize = min(casescnt, poolcnt);
run;
data count&sur1. notindex&sur1.;
set count&sur1.;
if nsize = 0 then output notindex&sur1.;
else output count&sur1.;
run;
%mend match;
%match (sur1=y1, sur2=1);
%match (sur1=y2, sur2=2);
%match (sur1=y3, sur2=3);
%match (sur1=y4, sur2=4);

STEP5: LIMIT THE CONTROL CASES WHERE THE INDICES EQUAL TO THOSE OF THE STUDY CASES. SAMPLING BASED ON THE SIZE OF THE STUDY CASES BY EACH STRATA.

%macro match (sur1= , sur2= );
proc sql;
create table pool&sur2. as
select *
from pool&sur1.
where index&sur2. in (select index&sur2. from count&sur1.);
proc sort data = pool&sur2.;
by index&sur2.;
run;
proc surveyselect data = pool&sur2.
sampsize = count&sur1.
method = srs
seed = 300001
out=selected&sur1.;
strata index&sur2.;
run;
%mend match;
%match (sur1=y1, sur2=1);
%match (sur1=y2, sur2=2);
%match (sur1=y3, sur2=3);
%match (sur1=y4, sur2=4);

*change the seed number to get different random samples.  Use the same seed number to achieve reproducible results.

 

 

SAS: Cumulative Count by Study and Control Group

This example is used to set up pairing id (PAIRID) for study group and control group. Variable INDEX is the matching character. Variable STUDY identify the group membership (0 = control group; 1= study group).  Variable ID is the unique identifier. In the SAMPLE dataset, number of membership are equal for both the control group and study group at each strata.

data pair;
set sample;
by study index;
count =1;
if first.study then pairid = 0;
pairid +count;
drop count;
run;

*Not using retain statement.