SAS: Proc Freq for all Character Variables

Before processing the large data set, it is a good practice to check the levels of the categorical varaible and make sure records are assigned to the right category and missing categories are treated.

Use _character_ in the freqency procedure to apply the procedure to all character variables.  Be careful not to include any character variables that are numeric in nature, eg id variable in the data set.  Using “nlevels” option to provide a summary table of number of the levels, missing levels, non-missing levels for each categorical variable.  Then check the frequency table for the specific categorical variable for the values of the levels.

proc freq data= table1 nlevels;
tables _character_ / nocum nopercent out=freqcnt;
run;
Number of Variable Levels
Variable Label Levels Missing Levels Nonmissing Levels
gender Gender Code 4 0 4
Var2 label2 12 1 11
Gender Code
gender Frequency
F 495320
M 402399
U 3475
{NUL} 1

SAS STAT: getting output statistics (1)

SAS STAT package has many procedures that can be used to conduct specific statistical analysis with structured data sets.  I found the following use of precedures quite common and there are usually a number of statistics needed to be output for further compilation.  The caveat is to sort the data set in the desired order, so the null hypothesis is in the upper left cell (1,1) in the two-way frequency table, which normally showes the results of study group with a respone equal to 1. This way the output statistics can be interpreted more straight forward.

A. How to get output from statistics procedures?

There are different ways for outputing statistics, eg. ODS OUTPUT, OUTPUT statement or OUT=;

a. use PROC MEANS for general statistics like n, mean, min, max std;

  • use maxdec= option to adjust the decimal places needed.
  • output out=filename: N, MIN, MAX, MEAN, STD, SUMWGT

b. use PROC TTEST for two-sample independent T-test; use CLASS Variable to identify and differentiate the groups for study cases and control cases;

  • ods output
    • ttests=filename: tValue, DF, (pick Satterthwaite method which assume unequal variance);
    • statistics=filename: n, mean, LowerCLMean, UpperCLMean, StdDev, StdErr for ‘0’,’1′, Diff (1-2);
    • equality=filename: fValue, probf (equality of variances)

c. use PROC FREQ for two-sample independent T-test

  • tables statement, out option
    • out= filename: default only provide the TABLES variable, frequency count, percent of total frequency.  To include percent of row frequency  and percent of column frequency, you need to add OUTPCT in TABLES statement.
  • output statement out option
    • Output  relrisk  out = filename;
  • ods output
    • ChiSq=filename: Chi-Square, Prob
    • PdiffTest=filename: Proportion Difference Test, Wald is the default method, Proportion Difference, Z Value, One-sided Prob and two-sided prob.
    • RelativeRisks=filename: relative risk estimate, case-control (odds ratio), column 1 risk, column 2 risk, 95% L/U Confidence Limit

Overall, the ODS OUTPUT is the most versatile and powerful method to obtain statisics results from these procedure.  Check “ODS Table Names” under each procedure in the SAS STAT User Guide.  The SAS procedure assigns a name to each table that it creates.  Use ODS Table Name= filename to assign your own table name to be saved in the work library.  In order to have the output ods table, you also need to check if you have included the corresponding option(s) in the specific statement for the program generate the statistics.

Sample code for Two-sample Proportion Test (include all 3 methods of output):

%macro prop ( in =, var=, out=, weight=);
proc sort data=∈
by descending study descending &var.;
run;
ods graphics on;
ods output ChiSq = &out.chi_&var. PdiffTest=&out.pdiff_&var. RelativeRisks=&out.rr_&var.;
proc freq data=&in. order =data ;
format study grpfmt. &var. rspFmt.;
weight &weight.;
tables study*&var. / chisq measures riskdiff(equal) outpct out=&out._&var.
plots= (freqplot(twoway=groupvertical scale =percent));
output relrisk out=&out._or_&var.;
title "Proportion Test:Case - Control study of variable &var. for &out.";
run;
ods output close;
ods graphics off;
%mend prop;
%prop ( in = g1pair, var= isRetainYr1, out =g1prop, weight = weight1);

*”The Satterthwaite approximation of the standard errors differs from the Pooled method in that it does not assume that the variances of the two samples are equal. This means that if the variances are equal, the Satterthwaite approximation should give us exactly the same answer as the Pooled method.” (Reference: https://wolfweb.unr.edu/~ldyer/classes/396/PSE.pdf)

Survey basic frequency macro

This coding can be used to run

Case 1: Count frequency of distribution for one question and exclude a category from the counting results (eg. exclude records from the question with 88 as the answer in the frequency distribution).

%macro count  (field=);

proc freq data=indata (where = (&field ne 88));
tables var1*(&field) ;
run;
%mend count;

%count (field=q1);

Case 2: Count frequency of distribution for one question conditional on the result of a different question. In the example, the where statement will delete records with q1 equals to 88 and then delete records with q10 not equal to 2 or 3 or 4.  For the rest of the records, proc freq will run the distribution of q1.

%macro count (field1=, field2=);

proc freq data=indata (where = (&field1 ne 88 and &field2 in (2,3,4)));
tables var1*(&field1) ;
run;
%mend count;

%count (field1=q1, field2=q10);

SAS: Identfy and Remove Duplicate Records

Step 1: Check duplicate records by one field or a combination of fields.

proc freq data=temp noprint;
table var1 /out =dupl (keep =var1 count where = (count >1));
run;

proc freq data=temp noprint;
table var1*var2 /out =dupl (keep =var1 var2 count where = (count >1));
run;

Examine the output to see what circumstances can cause duplicate records.

Step 2: Remove duplicate records.

proc sort data=temp NODUPKEY;
by var1 ;
run;

proc sort data=temp NODUPKEY;
by var1 var2 ;
run;

Alternative method to output unique records and duplicated records in two separate datasets.

proc sort data =temp out=temp1;
by var1 var2;
run;
data unique dup;
set temp1;
by var1 var2;
if first.var2 and last.var2 then output unique;
else output dup;
run;

 

SAS: data extraction and merge routine (1)

/*select data from raw data set*/

data x;

set y;

where var1 = ‘xxx’ and var2 = ‘yyy’;

run;

/*check frequency on variables of interest*/

proc freq;

tables var1 var2;

run;

/*merge to get additional variables needed for analysis*/

proc sort data = x;

by var3;

proc sort data =z;

by var3;

data xz;

merge x (in =a) z;

by var3;

if a;

run;