SAS: Tagset.Excelxp title and footnote setup

A. Multiple Titles/Footnotes

title
title2
title3
footnote
footnote2
footnote3

B. Alignment: default is center.

justify=

C. Font

f=

D.Bold/Italic

bold italic

E. Font size

h=

F. Example

title1 f='Calibri' h=14pt 'Report Title';
title2 "Report Title1";
title3 "Report Title2";
footnote "Prepared by First Last on %sysfunc(date(),worddate.)";
footnote2 justify=left h=8pt bold "Note:";
footnote3 justify=left h=8pt italic "footnote3";

SAS: Date Functions and Date Value

A. Extract Month, Year information from DateTime.

  • Convert DateTime to Date using Datepart() first.
  • Use Year() and Month() function to get the Month and Year information
date=datepart(datetime);
year=year(date);
month=month(date);

B. Date Value for Date Comparison

  • ’17OCT1991’D is 11612, and is SAS date Oct. 17, 1991
  • use single quotation and D at the end
  • the day, month, or year in the date string can be replaced by macro variable to become dynamic. eg. ’01JUL&yr.” can be used for comparison of July 1 of different year.

C. Age calculation

  • Reference: https://communities.sas.com/t5/SAS-Enterprise-Guide/How-to-calculate-age-from-Date-of-birth-and-date-of-last-visit/td-p/572236
  • Reference: https://blogs.sas.com/content/iml/2017/05/15/intck-intnx-intervals-sas.html
  • intck function results whole year and yrdif function results decimal year.
  • In intck function, using the ‘continuous’ or ‘c’ option to return the number of full years.
    age = intck('Year', birth, "01JUL&yr."D, 'C');
    

    SAS: Standard Errors

    A. Different Standard Errors

    • Standard Error of the Regression Model: (denote by s) usually referred to as the standard error of the regression or “standard error of the estimate”.
    • STDI: Standard Error of the individual predicted value (y_hat)
    • STDP: Standard Error of the mean predicated value
    • STDR:Standard Error of the residual
    • STUDENT: Studentized residuals, the residual divided by its standard error

    B. Estimate function in the GLM procedure

    SAS: Normality Test

    A. Reference

    B. Data

    data BlockDesign;
    input block a b y @@;
    datalines;
    1 1 1 56 1 1 2 41
    1 2 1 50 1 2 2 36
    1 3 1 39 1 3 2 35
    2 1 1 30 2 1 2 25
    2 2 1 36 2 2 2 28
    2 3 1 33 2 3 2 30
    3 1 1 32 3 1 2 24
    3 2 1 31 3 2 2 27
    3 3 1 15 3 3 2 19
    4 1 1 30 4 1 2 25
    4 2 1 35 4 2 2 30
    4 3 1 17 4 3 2 18
    ;
    run;
    

    C. Normality Check

    • histogram
    • normal
    • plot
    • qqplot
    %macro uni (data, var);
    Title "Check Nomality and Distribution - &var. in  &data.";
    proc univariate data = &data. ;
    var &var. ;
    histogram / cfill = gray;
    run;
    proc univariate data = &data. normal plot;
    var &var.;
    qqplot &var. / normal (mu=est sigma =est);
    run;
    %mend uni;
    %uni(blockdesign, y);

    D. Output

    The p-value for Shapiro-Wilk test is 0.27, therefore we can not reject the null hypothesis that y is normally distributed.
    Q-Q Plot

    SAS: Proc Reg – Collinearity Diagnostics

    A. Reference

    B. Purpose

    • Examine whether predictors are highly collinear which can casuse problems in estimating the regression coefficients.
    • As the degree of multicollinearity increases, the coefficient estimates become unstable and the standard errors for the coefficients can be wildly inflated.

    C. SAS code

    proc reg data = cars;
    model msrp = enginesize cylinders horsepower  /  tol vif collinoint;
    run;
    quit;

    D. Notes

    • proc reg can not deal with categorical variable directly, therefore you need to create dummy variable yourself for the categorical variable.
    • tol: tolerance, the percent of variance in the predictor that cannot be accounted for by other predictors. Regress the predictor variable on the rest of the predictor variable and compute the R square. 1 minus the R square equals tolerance for the predictor.
    • vif: variance inflation factor. It is the inverse function of tolerance. Measures how much the variance of the estimated regression coefficient is “inflated” by the existence of correlation among the predictor variables in the model. A vif of 1 means no inflation at all. Exceeding 4 warrants further investigation Greater than 10 vif means serious multicollinearity and requires correction.
    • collinoint: produce intercept adjusted collinearity diagnostic. This table decomposes the correlation matrix in to linear combination of variables. The variance of each of these linear combinations is called an eigenvalue. Collinearity is assumed by finding 2 or more variables that have large proportions of variance (.50 or more) that correspond to large condition indices. A large condition index, 10 or more is and indication of instability.

    E. SAS Output

    F. Interpretation

    • Engine Size and cylinders have greater than 5 VIF.
    • The higher condition index is 5.41 with 83.7% and 90.1% of variances from for Engine Size and Cylinders. Since 5.4 is less than 10, therefore there is no multicollinearity.
    • Total eigenvalue accumulates to 3 because there are 3 predictors.

    SAS: Proc Corr – Different Correlations between variables

    A. Pearson Correlation: Between continuous variable and continuous variable.

    %macro corr (in, year);
    Title "Check Correlation - &in. group";
    proc corr data= &in._&year. 
              outp=corr_&in._&year.
              plot=(matrix  scatterplot);
          var exper salary ;
    run;
    %mend corr;
    • Variables Information
    • Simple Statistics (N, Mean, Std Dev, Sum, Min, Max, Label
    • Pearson Correlations (Correlation Coefficients, Prob)
    • Scatter Plots
    • Scatter Plots Matrix

    B. Polyserial Correlation: Between continuous variable and categorical variable.

    %macro corr (in, year);
    ods output polyserialCorr=corr_polys_&in._&year. ;
    Title "Check Polyserial Correlation - &in. group";
    proc corr data= &in._&year.  pearson polyserial;
    	with gender minority rank;
          var  exper salary;
    run;
    %mend corr;
    • Variables Information
    • Simple Statistics
    • Person Correlations
    • PolyserialCorr (Wald Test, LR Test)

    C. Polychoric Correlation: Between categorical variable and categorical variable.

    %macro corr (in, year);
    ods output measures=plcorr_&in._&year. (where=(statistic="Polychoric Correlation"));
    Title "Check Polychoric Correlation - &in. group";
    proc freq data= &in._&year.;
    tables gender*rank minority*rank / plcorr;
    run;
    %mend corr;
    • Cross-Tabular Freq Table
    • Measures of Association (Polychoric Correlation)

    D. Partial Correlation: Pearson correlation between variables while controlling other variables. Following example checks the correlation between salary and gender, controlling for rank and experience.

    %macro partial (in, year);
    Title "Check Partial Correlation for basesal and gender - &in. group";
    proc corr data=&in._&year.  plots=scatter(alpha=.20 .30);
       var salary gender ; /* treat gender as continuous */
       partial rank exper;
     run;
    %mend partial;

    SAS: Create All Possible Combination of Categorical Variables and Generate Summary Statistics for Each Combination

    A. Scenario

    The data set contains personal response on attribute 1 to attribute 4. The response can be either ‘Y’ or ‘N’. Some of the records are blank. The possible combinations for attribute 1 to 4 are 2x2x2x2, which is 16 combinations.

    I would like to know how many people belongs any one of the 16 combination of the attributes. The distribution of people among the combination of attributes indicates the prominent attributes of the population (where the distribution is dense) and the outlier attributes (where the distribution is sparse).

    B. Code

    • create table with all combination
    data combination;
    format attr1 $1. attr2 $1. attr3 $1. attr4 $1.; *Important to add format statement. The variables need to be exactly the same format as the data table that you want to run summary statistics on;
    do attr1= 'Y','N';
    do attr2 = 'Y','N';
    do attr3 = 'Y','N';
    do attr4 = 'Y','N';
    output;
    end;
    end;
    end;
    end;
    run;
    • run sum use Proc tabulate with classdata option
    proc tabulate data=data out = sum missing classdata = combination; *classdata option use the combination table created in the last step;
    class attr1 attr2 attr3 attr4;
    var count;
    table attr1*attr2*attr3*attr4, count*(sum="");
    run;

    C. Output

    table a classtable named combination
    table sum using the classtable combination

    D. Notes

    • make sure that the variables used in the class statement of the Proc tabulate procedure are included in the classdata table;
    • The output table sum automatically add the combination of null to all attr1 to attr4.
    • make sure that the variables used in classdata table have exactly the same name and format as the variables in the table that you want to run statistics on. In the above example, the attr1-4 in classdata table combination need to match the variables attr1-4 in the data table. Use Proc contents procedure on both data sets to make sure these variables are matching. Merely check the format in column attributes is not enough. If you omit the format statement when creating the classdata table combination, you will get the following error.

    Statistics Basics: Odds Ratio and Relative Risk Calculation in Excel

    I have done a post of meta-analysis using the Proc Freq to produce cross tabulation of 2 by 2 table, the odds ratio, and the relative risk (column 1, column 2). In this post, I will show how the stats for odds ratio and relative risks are calculated using Excel.

    A. SAS output

    • Frequency table
    • Odds table

    B. Excel

    • Frequency and Odds Ratio
    • Relative Risk Column 1
    • Relative Risk Column 2

    C. Interpretation

    Drug: peole who feel better/people who did not improve = 58/30 = 1.93333
    Placebo: people wo feel better/people who did not improve = 24/3 = 8
    Odds ratio: of 0.241667 means patients who take the drug are 0.24 times likely to improve.
    The probability of the improvement with the drug is 0.74 time the probability of the improvement with the placebo.
    The probability of the no improvement in the symptoms with the drug is 3.7 time the probability of no improvement with the placebo.

    SAS: Meta-Analysis CMH Example for Categorical Variable

    A. Reference

    B. Meta-Analysis

    A meta-analysis is a statistical analysis that combines the results of multiple scientific studies. Meta-analysis can be performed when there are multiple scientific studies addressing the same question, with each individual study reporting measurements that are expected to have some degree of error. The aim then is to use approaches from statistics to derive a pooled estimate closest to the unknown common truth based on how this error is perceived.

    Wikipedia
    • In meta-analysis, studies become observations.
    • Research collect data for meta-analysis by systematic review of the literature in the field, and compile data directly from the summary statistics in the publication.

    C. Problem with simply lumping the data from different studies together

    • Not consider treatment-by-study interaction
    • Assume response rates are the same in all studies.

    D. SAS Solution (follow Hamer and Simpson’s paper, but corrected the output from the paper)

    • Create data set with the results of 2 studies. B: Remitted; N:Not remitted; P: Placebo; D: Drug.
    • I have used B (Better) to indicate Remitted cases because Proc Freq test is based on column 1 and row 1 of the 2 by 2 table, so if we code R for Remitted cases then the remitted case will be in column 2 because the table is by alphabetical order and R is after N.
    • The Hamer and Simpson paper actually tested the null hypothesis for the non-effective cases rather than the effective cases.
    data chm;
    input study $ response $ trt $ cellfreq @@;
    datalines;
    study1	B	P	24	study1	N	P	3
    study1	B	D	58	study1	N	D	30
    study2	B	P	16	study2	N	P	57
    study2	B	D	2	study2	N	D	10
    ;
    run;
    • Run Cochran-Mantel-Haenszel Statistics using Proc Freq procedure with cmh option.
    proc freq data=chm;
    tables study*trt*response /cmh;
    weight cellfreq;
    run;

    E. SAS Output

    • SAS chm table
    • Frequency table
    • Cochrane-Mantel-Haenszel test

    F. Notes

    • The Mantel-Haenszel estimator of the common odds ratio assumed the estimation to be homogeneous among both studies.
    • The Mentel-Haenszel statistics tests the null hypothesis that the response rate is the same for the two treatments, after adjusting for possible differences in study response rates.
    • For Proc Freq testing options, make sure the group that you want to tested are in row 1 and column 1. It is also important to crosstab treatment as row and response as column, so the interpretation of the relative risk for the risk of improvement make sense. In Hamper and Simpon’s paper the crosstab has been transposed, therefore the relative risk output doesn’t make sense.

    G. Interpretation

    • The CMH test statistics is 4.65 with a p-value of 0.03, therefore, we can reject the null hypothesis that there is no association between treatment and response. P-value lower than 0.05 indicates that the association between treatment and response remains strong after adjusting for study.
    • Relative Risk (Column 1) equals to 0.74 which means the probability of the improvement with the drug is 0.74 time the probability of the improvement with the placebo.
    • Relative Risk (Column 2) equals to 1.51 which means the probability of no improvement in the symptoms with the drug is 1.51 times the probability of no improvement with the placebo.
    • The Breslow-Day test has a large p-value of 0.295 which indicates there is no significant difference in the odds ratios among the studies.

    * I will show the odds ratio and relative risk calculation in Excel in another post.