SAS: Normality Test

A. Reference

B. Data

data BlockDesign;
input block a b y @@;
datalines;
1 1 1 56 1 1 2 41
1 2 1 50 1 2 2 36
1 3 1 39 1 3 2 35
2 1 1 30 2 1 2 25
2 2 1 36 2 2 2 28
2 3 1 33 2 3 2 30
3 1 1 32 3 1 2 24
3 2 1 31 3 2 2 27
3 3 1 15 3 3 2 19
4 1 1 30 4 1 2 25
4 2 1 35 4 2 2 30
4 3 1 17 4 3 2 18
;
run;

C. Normality Check

  • histogram
  • normal
  • plot
  • qqplot
%macro uni (data, var);
Title "Check Nomality and Distribution - &var. in  &data.";
proc univariate data = &data. ;
var &var. ;
histogram / cfill = gray;
run;
proc univariate data = &data. normal plot;
var &var.;
qqplot &var. / normal (mu=est sigma =est);
run;
%mend uni;
%uni(blockdesign, y);

D. Output

The p-value for Shapiro-Wilk test is 0.27, therefore we can not reject the null hypothesis that y is normally distributed.
Q-Q Plot

SAS: Proc Reg – Collinearity Diagnostics

A. Reference

B. Purpose

  • Examine whether predictors are highly collinear which can casuse problems in estimating the regression coefficients.
  • As the degree of multicollinearity increases, the coefficient estimates become unstable and the standard errors for the coefficients can be wildly inflated.

C. SAS code

proc reg data = cars;
model msrp = enginesize cylinders horsepower  /  tol vif collinoint;
run;
quit;

D. Notes

  • proc reg can not deal with categorical variable directly, therefore you need to create dummy variable yourself for the categorical variable.
  • tol: tolerance, the percent of variance in the predictor that cannot be accounted for by other predictors. Regress the predictor variable on the rest of the predictor variable and compute the R square. 1 minus the R square equals tolerance for the predictor.
  • vif: variance inflation factor. It is the inverse function of tolerance. Measures how much the variance of the estimated regression coefficient is “inflated” by the existence of correlation among the predictor variables in the model. A vif of 1 means no inflation at all. Exceeding 4 warrants further investigation Greater than 10 vif means serious multicollinearity and requires correction.
  • collinoint: produce intercept adjusted collinearity diagnostic. This table decomposes the correlation matrix in to linear combination of variables. The variance of each of these linear combinations is called an eigenvalue. Collinearity is assumed by finding 2 or more variables that have large proportions of variance (.50 or more) that correspond to large condition indices. A large condition index, 10 or more is and indication of instability.

E. SAS Output

F. Interpretation

  • Engine Size and cylinders have greater than 5 VIF.
  • The higher condition index is 5.41 with 83.7% and 90.1% of variances from for Engine Size and Cylinders. Since 5.4 is less than 10, therefore there is no multicollinearity.
  • Total eigenvalue accumulates to 3 because there are 3 predictors.

SAS: Proc Corr – Different Correlations between variables

A. Pearson Correlation: Between continuous variable and continuous variable.

%macro corr (in, year);
Title "Check Correlation - &in. group";
proc corr data= &in._&year. 
          outp=corr_&in._&year.
          plot=(matrix  scatterplot);
      var exper salary ;
run;
%mend corr;
  • Variables Information
  • Simple Statistics (N, Mean, Std Dev, Sum, Min, Max, Label
  • Pearson Correlations (Correlation Coefficients, Prob)
  • Scatter Plots
  • Scatter Plots Matrix

B. Polyserial Correlation: Between continuous variable and categorical variable.

%macro corr (in, year);
ods output polyserialCorr=corr_polys_&in._&year. ;
Title "Check Polyserial Correlation - &in. group";
proc corr data= &in._&year.  pearson polyserial;
	with gender minority rank;
      var  exper salary;
run;
%mend corr;
  • Variables Information
  • Simple Statistics
  • Person Correlations
  • PolyserialCorr (Wald Test, LR Test)

C. Polychoric Correlation: Between categorical variable and categorical variable.

%macro corr (in, year);
ods output measures=plcorr_&in._&year. (where=(statistic="Polychoric Correlation"));
Title "Check Polychoric Correlation - &in. group";
proc freq data= &in._&year.;
tables gender*rank minority*rank / plcorr;
run;
%mend corr;
  • Cross-Tabular Freq Table
  • Measures of Association (Polychoric Correlation)

D. Partial Correlation: Pearson correlation between variables while controlling other variables. Following example checks the correlation between salary and gender, controlling for rank and experience.

%macro partial (in, year);
Title "Check Partial Correlation for basesal and gender - &in. group";
proc corr data=&in._&year.  plots=scatter(alpha=.20 .30);
   var salary gender ; /* treat gender as continuous */
   partial rank exper;
 run;
%mend partial;

Data: Git Bash Installation for Windows

A. Reference: Software Carpentry Unix Shell Tutorial

B. Steps

  • Download git (git-2.23.0-64-bit.exe) for windows from gitforwindows.org
  • Install git for windows to c:\program files\git
  • Select Components -> Next
    • Additional icons: On the Desktop
    • Windows Explorer integration: Git Bash Here; Git GUI Here
    • Git LFS (Large File support)
    • Associate .git* configuration files with the default text editor
    • Associate .sh files to be run with Bash
  • Choosing the default editor used by Git: Use Notepad++ as Git’s default editor -> Next
  • Adjusting your PATH environment: Use Git and optional Unix tools from the Command Prompt -> Next
  • Choosing the SSH executable: Use OpenSSH -> Next
  • Choosing HTTPs transport backend: Use the OpenSSL library -> Next
  • Configuring the line ending conversions -> Checkout Windows-style, commit Unix-style line ending -> Next
  • Configuring the terminal emulator to use with Git Bash: Use Windows’ default console window -> Next
  • Configuring extra options: Enable file system caching; Enable Git Credential Manager; -> Next -> Install
  • Set home environment: cmd -> cd users\username\documents ->set HOME “%USERPROFILE%” -> exit

C. Note

  • Windows operating system do not automatically have a Unix Shell program installed. The post shows how to install an emulator included in Git for Windows, which provides the access to both Bash shell commands as well as Git.

D. Help

Data: Export Email to Excel

Reference: https://support.office.com/en-us/article/back-up-your-email-e5845b0b-1aeb-424f-924c-aa1c33b18833

Steps:

  • open Outlook
  • Move the emails that you want to export to a folder
  • Go to the folder
  • File -> Open & Export -> Import and Export
  • Choose an action to perform: ->Export to a file -> Next
  • Create a file of type: -> Comma Separated Values
  • Select folder to export from: (The folder that you were in are highlighted) -> Next
  • Save exported file as: -> Browse -> select folder and create file name -> Next
  • The following actions will be performed: Export “Email messages” from folder: xxxxxxxx -> Map Custom Fields -> Finish
    • Subject
    • Body
    • From: (Name)
    • From: (Address)
    • From: (Type)
    • To: (Name)
    • To: (Address)
    • To: (Type)
    • BCC: (Name)
    • BCC: (Address)
    • BillingInformation
    • Categories
    • Importance
    • Mileage
    • Sensitivity

Tableau: Load Census Shape Data

A. Download data from web

  • 2016 Census Boundary File
  • select ArcGIS (.shp)
  • select census subdivisions and continue to download
  • unzip file to local directory

B. Tableau

  • Data-> New Data Source ->Spatial File-> Open the .shp file
  • Open a new sheet, drag Geometry from Measures to the canvas
  • drag Ccsname over the Colour Property in Marks card
Coloured by census consolidated subdivision name

C. Attribute

  • FID: specific to ArcGIS
  • Shape: specific to ArcGIS
  • CCSUID: Uniquely identifies a census consolidated subdivision (2-digit province/territory + 2-digit census division code +3-digit census consolidated subdivision code)
  • CSDUID: Uniquely identifies a census subdivision (2-digit province/territory + 2-digit census division code +3-digit census subdivision code)
  • CCSNAME: Census consolidated subdivision name.
  • CSDNAME: Census subdivision name.
  • PRUID: Uniquely identifies a province or territory.
  • PRNAME: Province or territory name.
  • CDUID: Uniquely identifies a census division.
  • CDNAME: Census division name.
  • CDTYPE: Census division type.
  • CSDTYPE: Census subdivision type.
  • ERUID: Uniquely identifies an economic region ( 2-digit province/territory + 2-digit economic region code
  • ERNAME: Economic region name.
  • SACCODE: 3-digit statistical area classification code
  • SACTYPE: The statistical area classification groups census subdivisions according to whether they are a component of a census metropolitan area, a census agglomeration, a census metropolitan influenced zone or the territories.
  • CMAUID: Uniquely identifies a census metropolitan area/census agglomeration.
  • CMAPUID: Uniquely identifies the provincial/territorial part of a census metropolitan area/census agglomeration.
  • CMANAME: Census metropolitan area/census agglomeration name.
  • CMATYPE: Identify whether the unit is a census metropolitan area, a tracted census agglomeration or a non-tracted census agglomeration.

SAS: Proc SQL Format

  • get variable with the required format from other table with sql join and format in one statement.
  • the example shows the new table contains the new character variable newvar1 with $1. format and the new date variable newvar2 with ddmmyy10. format
 proc sql;
create table new as
select a.* , b.var1 as newvar1 format=$1. , datepart(b.var2) as newvar2 format=ddmmyy10. 
from old1 as a 
left join lib.old2 as b
on a.var = b.var;
quit;

SAS: Create All Possible Combination of Categorical Variables and Generate Summary Statistics for Each Combination

A. Scenario

The data set contains personal response on attribute 1 to attribute 4. The response can be either ‘Y’ or ‘N’. Some of the records are blank. The possible combinations for attribute 1 to 4 are 2x2x2x2, which is 16 combinations.

I would like to know how many people belongs any one of the 16 combination of the attributes. The distribution of people among the combination of attributes indicates the prominent attributes of the population (where the distribution is dense) and the outlier attributes (where the distribution is sparse).

B. Code

  • create table with all combination
data combination;
format attr1 $1. attr2 $1. attr3 $1. attr4 $1.; *Important to add format statement. The variables need to be exactly the same format as the data table that you want to run summary statistics on;
do attr1= 'Y','N';
do attr2 = 'Y','N';
do attr3 = 'Y','N';
do attr4 = 'Y','N';
output;
end;
end;
end;
end;
run;
  • run sum use Proc tabulate with classdata option
proc tabulate data=data out = sum missing classdata = combination; *classdata option use the combination table created in the last step;
class attr1 attr2 attr3 attr4;
var count;
table attr1*attr2*attr3*attr4, count*(sum="");
run;

C. Output

table a classtable named combination
table sum using the classtable combination

D. Notes

  • make sure that the variables used in the class statement of the Proc tabulate procedure are included in the classdata table;
  • The output table sum automatically add the combination of null to all attr1 to attr4.
  • make sure that the variables used in classdata table have exactly the same name and format as the variables in the table that you want to run statistics on. In the above example, the attr1-4 in classdata table combination need to match the variables attr1-4 in the data table. Use Proc contents procedure on both data sets to make sure these variables are matching. Merely check the format in column attributes is not enough. If you omit the format statement when creating the classdata table combination, you will get the following error.

Statistics: Tools for Systematic Review and Meta-Analysis

A. Resource

B. Guidelines

  • Cooper & Hedges, 1994
  • Hedges & Olkin, 1985
  • Lipsey & Wilson, 2001
  • Borenstein, Hedges, Higgins, & Rothstein, 2008: Comprehensive Meta-Analysis Version 2.2.048

C. Review Process

  • Identification of studies
    • Name of the reviewer
    • Date of the review
    • Article: Author, date of publication, title, journal, issue number, pages, and credentials
  • General Information
    • Focus of study
    • Country of study
    • Variables being measured
    • Age range of participants
    • Location of the study
  • Study Research Questions
    • hypothesis
    • theoretical/empirical basis
  • Methods designs
    • Independent variables
    • Outcome variables
    • Measurement tools
  • Methods groups
    • Nonrandomized with treatment and control groups/repeated measures design
    • Number of groups
  • Methods sampling strategy
    • Explicitly stated/Implicit/not stated/unclear
    • sampling frame (telephone directory, electoral register, postcode, school listing)random selection/systematically/convenience
  • Sample information
    • number of participants in the study
    • if more than one group, the number of participants in each group
    • sex
    • socioeconomic status ethnicity
    • special educational need
    • region
    • control for bias from confounding variables and groups
    • baseline value for longitudinal study
  • Recruitment and consent
    • Method: letters of invitation, telephone, face-to-face
    • incentives
    • consent sought
  • Data collection
    • Methods: experimental, curriculum-based assessment, focus group, group interview, one-to-one interview, observation, self-completion questionnaire, self-completion report or diary, exams, clinical test, practical test, psychological test, school records, secondary data etc.
    • who collected the data
    • reliability
    • validity
  • Data analysis
    • statistical methods: descriptive, correlation, group differences (t test, ANOVA), growth curve analysis/multilevel modeling(HLM), structural equation modeling(SEM), path analysis, regression
  • Results and conclusion
    • Group means, SD, N, estimated effect size, appropriate SD, F, t test, significance, inverse variance weight

D. Statistics

  • Cohen’s kappa
  • Cohen’s d
  • effect size
  • aggregate/weighted mean effect size
  • 95% confidence interval: upper and lower
  • homogeneity of variance (Q statistic): Test if the mean effect size of the studies are significantly heterogeneous (p<.05), which means that there is more variability in the effect sizes than would be expected from sampling error and that the effect sized did not estimate common population mean (Lipsey & Wilson, 2001)
  • df: degrees of freedom
  • I square (%): the percentage of variability of the effect size that is attributable to true heterogeneity, that is, over and above the sampling error.
  • Outlier detection
  • mixed-effects model (consider studies as random effects): moderator analysis for heterogeneity (allow for population parameters to vary across studies, reducing the probability of committing a Type I error)
  • Proc GLM/ANOVA (consider studies as fixed effects): moderator analysis for heterogeneity
    • Region
    • Socioeconomic status
    • Geographical location
    • Education level
    • Setting
    • Language
    • sampling method
  • Statistical difference in the mean effect size of methodological feature of the study
    • confidence in effect size derivation (medium, high)
    • reliability (not reported, reported)
    • validity (not reported vs. reported
  • classic fail-safe N/Orwin’s fail-safe N: The number of missing null studies needed to bring the current mean effect size of the meta-analysis to .04. Threshhold is 5k+10, k is number of studies for the meta-analysis. If the N is greater than the 5k+10 limit then it is unlikely that publication bias poses a significant threat to the validity of findings of the meta-analysis.
    • Used to assess publication bias. eg. control for bias in studies (tightly controlled, loosely controlled, not controlled)

E. Purpose/Research Questions

  • Whether the treatment is associated with single effect or multiple effects?
  • Understand the variability of studies on the association of treatment with single or multiple effects, and explain the variable effects potentially through the study features (moderators). How do the effects of the treatment vary different study features?

F. Reference

Statistics Basics: Odds Ratio and Relative Risk Calculation in Excel

I have done a post of meta-analysis using the Proc Freq to produce cross tabulation of 2 by 2 table, the odds ratio, and the relative risk (column 1, column 2). In this post, I will show how the stats for odds ratio and relative risks are calculated using Excel.

A. SAS output

  • Frequency table
  • Odds table

B. Excel

  • Frequency and Odds Ratio
  • Relative Risk Column 1
  • Relative Risk Column 2

C. Interpretation

Drug: peole who feel better/people who did not improve = 58/30 = 1.93333
Placebo: people wo feel better/people who did not improve = 24/3 = 8
Odds ratio: of 0.241667 means patients who take the drug are 0.24 times likely to improve.
The probability of the improvement with the drug is 0.74 time the probability of the improvement with the placebo.
The probability of the no improvement in the symptoms with the drug is 3.7 time the probability of no improvement with the placebo.