SAS: Sampling using PROC SURVEYSELECT

STEP1: SEPARATE CASES AND CONTROL DATASETS WITH ALL THE VARIABLES OF INTEREST

STEP2: ESTABLISH INDEX:

%let index1 = year||entry_lvl||regist||faculty||isRetainYr0;
%let index2 = year||entry_lvl||regist||faculty||isRetainYr1;
%let index3 = year||entry_lvl||regist||faculty||isRetainYr2;
%let index4 = year||entry_lvl||regist||faculty||isRetainYr3;
  • Index are developed based on the characters you want to control. In the example, we want to control the year, the study level, the registration status (full-time or part-time) and the faculty when the student originally entering into the program;  the retention status of the student.

STEP3: GET FREQUENCY DISTRIBUTION FOR CASES AND CONTROL DATASETS BY INDICES

%macro match (sur1=, sur2=);
data cases&sur1.(keep =id index&sur2. );
set cases (keep= id &match.);
index&sur2. = compress(&&index&sur2) ;
run;
proc freq data = cases&sur1. noprint;
tables index&sur2./list missing out =casescnt&sur1. (keep = index&sur2. count rename=(count=casescnt));
run;
%mend match;
%match (sur1=y1, sur2=1);
%match (sur1=y2, sur2=2);
%match (sur1=y3, sur2=3);
%match (sur1=y4, sur2=4);
%macro match (sur1=, sur2=);
data pool&sur1.(keep =id index&sur2. );
set pool (keep= id &match.);
index&sur2. = compress(&&index&sur2) ;
run;
proc freq data = pool&sur1. noprint;
tables index&sur2./list missing out =poolcnt&sur1. (keep = index&sur2. count rename=(count=poolcnt));
run;
%mend match;
%match (sur1=y1, sur2=1);
%match (sur1=y2, sur2=2);
%match (sur1=y3, sur2=3);
%match (sur1=y4, sur2=4);

STEP4: MERGE CASES AND CONTROL BY INDEX AND CHECK THE CASES THAT HAVE NO POSSIBLE MATCHING CONROL CASES BY INDEX. IF THERE ARE MANY STUDY CASES NOT ABLE TO FIND MATCHING CONTROL CASES (eg. size of control is smaller than the sized of study cases), THEN WE NEED TO CHANGE THE INDEX.

%macro match (sur1=, sur2=);
data count&sur1.;
merge casescnt&sur1. (in =a ) poolcnt&sur1. (in =b);
by index&sur2.;
if casescnt >0;
if a and not b then poolcnt = 0;
nsize = min(casescnt, poolcnt);
run;
data count&sur1. notindex&sur1.;
set count&sur1.;
if nsize = 0 then output notindex&sur1.;
else output count&sur1.;
run;
%mend match;
%match (sur1=y1, sur2=1);
%match (sur1=y2, sur2=2);
%match (sur1=y3, sur2=3);
%match (sur1=y4, sur2=4);

STEP5: LIMIT THE CONTROL CASES WHERE THE INDICES EQUAL TO THOSE OF THE STUDY CASES. SAMPLING BASED ON THE SIZE OF THE STUDY CASES BY EACH STRATA.

%macro match (sur1= , sur2= );
proc sql;
create table pool&sur2. as
select *
from pool&sur1.
where index&sur2. in (select index&sur2. from count&sur1.);
proc sort data = pool&sur2.;
by index&sur2.;
run;
proc surveyselect data = pool&sur2.
sampsize = count&sur1.
method = srs
seed = 300001
out=selected&sur1.;
strata index&sur2.;
run;
%mend match;
%match (sur1=y1, sur2=1);
%match (sur1=y2, sur2=2);
%match (sur1=y3, sur2=3);
%match (sur1=y4, sur2=4);

*change the seed number to get different random samples.  Use the same seed number to achieve reproducible results.

 

 

SAS: Update values with the last record and keep other attributes of the first record by group

The list contains course grades for students and the year students took the course.  It is possible that students can retake the course to improve their gpa.  We want a list that the last records of the students are showing all the grades that they took in the past and if a course was taken multiple times then the most recenct grade of that course.  We also want the last records of the student to show the initial year that the students were taking these course.

The following marco used a temp variable to hold the grade of the course and retain the value for the same student and if the course is retaken, the temp variable will be assigned with new grade, and if the course grade is empty, then the value retained in the  temp variable will be assigned to the course grade.

%macro lst (in =, out=, crs=);
data &out.;
drop temp ;
set &in.;
by id;
length temp $2 firstcrsyr $4 ;
retain temp firstcrsyr ;
if first.id then do; temp = '' ; firstcrsyr = yr; end;
/* Assign TEMP when courseX is non-missing /
if &crs. ne '' then temp=&crs.;
/ When X is missing, assign the retained value of TEMP into courseX */
else if &crs. eq '' then &crs.=temp;
run;
%mend lst;
%lst (in = list1, out = list2, crs= course1);
%lst (in = list2, out = list2, crs= course2);
%lst (in = list2, out = list2, crs= course3);

SAS: Cumulative Count by Study and Control Group

This example is used to set up pairing id (PAIRID) for study group and control group. Variable INDEX is the matching character. Variable STUDY identify the group membership (0 = control group; 1= study group).  Variable ID is the unique identifier. In the SAMPLE dataset, number of membership are equal for both the control group and study group at each strata.

data pair;
set sample;
by study index;
count =1;
if first.study then pairid = 0;
pairid +count;
drop count;
run;

*Not using retain statement.

SAS: Proc SQL Where … In to subset

SQL where … in can be used to subsetting dataset which meet contain condition and the condition is in a separate dataset.

Table ONE is a large dataset and contains variable ID and other related information. Table TWO contains unique ID that of interest.  Table THREE is the subset of table ONE which match the ID value with table TWO.

proc sql;
create table THREE as
select *
from ONE
where ID in (select ID from TWO);

 

Data: Curl Installation and batch download PDF files from a website

cURL Installation

  1. download carcet.pem from  https://curl.haxx.se/docs/caextract.html
  2. download curl from https://curl.haxx.se/download.html (I use a windows system, so I download the binary file for windows 64 bit)
  3. extract curl to c:\curl and put carcet.pem under c:\curl\bin folder
  4. add environment variable curl with the path of curl.exe

Curl is useful for downloading files from the website.  The basic command is the following:

curl -O url

This command needs be run under the c:\curl\bin folder. The files are downloaded to C:\Users\username\

Purpose: download SAS 9.3 user guide pdf files from https://support.sas.com/documentation/onlinedoc/stat/930/

  1. In window cmd, under c:\curl\bin, run curl -o index https://support.sas.com/documentation/onlinedoc/stat/930/
    • generate the index file which contains the wrapped source html code of the webpage
  2.  open git bash, run following
      1. cd c
      2. cd curl
      3. cd bin
      4. grep -i pdf  index > list
  3. list contains the href=”*.pdf”. Use Excel text to column to get only the name of the pdf files.
  4. open list in notepad++ and at the bottom of the window, it shows “Windows (CR LF)”,  right click and select “Unix (LF)”.  This will solve the error “curl: (3) Illegal characters found in URL”
  5. Start a new bash file in notepad++ with the following code
    1. echo “Start!”
      url=https://support.sas.com/documentation/onlinedoc/stat/930/
      while read query
      do
      curl -O “$url${query}”
      echo $url${query}
      done < list
    2. save as echo
  6. in git bash, navigate to where the echo file is and run following
    1. bash echo

Check SAS license and components installed

proc setinit; run;

Proc setinit provides the components list with the expiration dates.

—Base SAS Software
—SAS/STAT
—SAS/GRAPH
—SAS/ETS
—SAS/FSP
—SAS/OR
—SAS/AF
—SAS/IML
—SAS/QC
—SAS/SHARE
—SAS/LAB
—SAS/ASSIST
—SAS/CONNECT
—SAS/INSIGHT
—SAS/EIS
—SAS/SHARE*NET
—MDDB Server common products
—SAS Integration Technologies
—SAS/Secure Windows
—SAS Enterprise Guide
—SAS Bridge for ESRI
—OR OPT
—OR PRS
—OR IVS
—OR LSO
—SAS/ACCESS Interface to DB2
—SAS/ACCESS Interface to Oracle
—SAS/ACCESS Interface to Sybase
—SAS/ACCESS Interface to PC Files
—SAS/ACCESS Interface to ODBC
—SAS/ACCESS Interface to OLE DB
—SAS/ACCESS Interface to Teradata
—SAS/ACCESS Interface to MySQL
—SAS/IML Studio
—SAS Workspace Server for Local Access
—SAS/ACCESS Interface to Netezza
—SAS/ACCESS Interface to Aster nCluster
—SAS/ACCESS Interface to Greenplum
—SAS/ACCESS Interface to Sybase IQ
—DataFlux Trans DB Driver
—SAS Framework Data Server
—Reserved for Dataflux

proc product_status; run; 

Get the version of the product.

SAS Proc Upload and Download

SAS server processes dataset at a much faster rate than the local desktop.  To send the datasets from the WORK library of the desktop to the WORK library of the server,  and get the processed datasets from WORK library of the server to the WORK library of the desktop, we need to use PROC UPLOAD and PROC DOWNLOAD.

proc update data =localdata out =serverdata; run;
…code…
proc download data=serverdata out =localdata; run;

SAS Array example

In the data set, the student courses have 20 fields and the student grades have 20 fields and the courses and grades are one to one match.  The following codes can be used to get the course name from course field and grade from the grade field.

array crs {20} crs1-crs20;
array course {20} course1-course20 ;
array grade {20} grade1-grade20;
array gr {20} gr1-gr20 ;
do i = 1 to 20;
if crs{i} ne ” then course{i} = substr(crs{i},9,10);
if grade{i} ne ” then gr{i} =scan(grade{i},5,’\’);
end;
drop crs1-crs20 grade1-grade20;

Second example produces the retention results for each student based on whether students come back next year. cumm_crYr variables the accumulative credits at the end of year X of the program; ProgramYr variables are the name of the program the students registed at the end of year X.  If the student didn’t register for the year, there will be no value in cummcrYr or programYr. Therefore, if the student has accumulative credits calculated for the next year then the student is retained one year later and the isRetainYr1 will equal to 1.  The array evaluates whether the value for cumm_Yr2/programYr2 is null, if not, then is RetainYr2 is 1, etc.

data two;
set one;
length isRetainYr1-isRetainYr4 8;
array cumm_crYr{5} cumm_crYr1-cumm_crYr5;
array isRetainYr{4} isRetainYr1-isRetainYr4;
array programYr{5} programYr1-programYr5;
do i = 1 to 4;
j=i+1;
if pgmentyr*1 +i -1 <= 2016 then do;
if cumm_crYr{j} ne . or programYr{j} ne ” then isRetainYr{i} = 1;
else isRetainYr{i} = 0; end;
end;
run;

SAS Regression Basic 1

Step 1: Organize dataset – from wide to long. Categorical variables, possible predictor/explanatory/independent variables and the outcome/response/dependent variable all in columns.

Step 2: Proc Means – Find descriptive statistics for all the numeric variables by categorical variables.

%macro mean (data=, byvar=, var=);
proc sort data = &data.;
by &byvar.;
run;
proc means data = &data.;
by &byvar.;
var &var.;
run;
%mend mean;

Step 3: Proc Sgpanel or Proc sgplot – Check the scatter plot of outcome variable and predictor variable by categorical variables.

%macro plotby (data=, title=, by= , x=, y= , lbl= );
proc sgpanel data =&data.;
title &title.;
panelby &by. / columns =2 rows =4;
scatter x = &x. y=&y. /datalabel=&lbl.;
run;
%mend plotby;
  • title macro variable needs to be in “”;
  • by mcro variable is categorical;
  • Use COLUMNS and ROWS options in the PANELBY statement to define the grid layout of the plot.  It can be determined by the number of category and the preferred size of the plot;
  • Use uniscale in the PANELBY statement if the shared column or row axis needs to be identical. The default is ALL. UNISCALE= COLUMN | ROW | ALL
  • To give label for each datapoint in the plot, use DATALABEL option in the SCATTER statement;
  • Use ROWAXIS and COLAXIS to change the default axis value;