Statistics: Meta-analysis – Install the dmetar Package

A. R code

install.packages("tidyverse") 
install.packages("meta") 
install.packages("metafor")
devtools::install_github("MathiasHarrer/dmetar") 
# if this fails, download the package source from GitHub, unzip it, and install from the local folder
devtools::install("C:/dmetar-master/dmetar-master")

B. Error

Error: (converted from warning) cannot remove prior installation of package ‘digest’

C. Workaround

  • Get the library location: Sys.getenv("R_LIBS_USER")
  • Close the R program completely
  • Go to the R library folder: C:\Program Files\R\R-3.6.1\library
  • Delete the "digest" folder manually
  • Rerun the code above; the error message will not appear again and the dmetar package will install successfully.

Statistics: Matrix Calculation in Excel

A. Matrix Functions in Excel

  • MMULT(): returns the matrix product of two arrays. The result has the same number of rows as array1 and the same number of columns as array2. Matrix multiplication is not commutative: A times B does not generally equal B times A, so the order matters. The multiplication is defined only when the number of columns of the first matrix equals the number of rows of the second. Multiplying an m×n matrix by an n×p matrix yields an m×p matrix: multiplying a wide matrix by a tall matrix returns a matrix with the shorter sides of both, while multiplying a tall matrix by a wide matrix returns a matrix with the longer sides of both.
  • MUNIT(): returns the identity matrix for the specified dimension.
  • MINVERSE(): returns the inverse of a given matrix. The product of a matrix and its inverse is the identity matrix. The inverse of a matrix exists only if the determinant is not zero. The inverse of a 2×2 matrix [a, b; c, d] can be calculated as:
    • Calculate the determinant (ad - bc);
    • take the reciprocal of the determinant;
    • multiply the reciprocal of the determinant by the matrix [d, -b; -c, a].
  • MDETERM(): returns the determinant of a square matrix. If the determinant is 0, the matrix is singular. For a 2×2 matrix [a, b; c, d], the determinant equals ad - bc.
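The four Excel functions above map directly onto NumPy operations; a minimal sketch outside Excel (the 2×2 values are made up for illustration):

```python
import numpy as np

# A made-up 2x2 matrix [a, b; c, d]
A = np.array([[4.0, 7.0],
              [2.0, 6.0]])

# MDETERM(): determinant, ad - bc = 4*6 - 7*2 = 10
det = np.linalg.det(A)

# MINVERSE(): (1/det) * [d, -b; -c, a] = [[0.6, -0.7], [-0.2, 0.4]]
A_inv = np.linalg.inv(A)

# MUNIT(): identity matrix for the specified dimension
I = np.eye(2)

# MMULT(): matrix product; a matrix times its inverse gives the identity
product = A @ A_inv

print(np.allclose(product, I))  # True
```

The same check, X times MINVERSE(X) equals MUNIT, is a quick way to validate the Excel array formulas.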

B. Excel Examples

  • Define matrix
Define X matrix
Define Y matrix
  • Transpose Matrix (X’)
Transpose X to X’ using array formula
  • Matrix multiplication
Multiplication of X’ and X
Multiplication of X’ and Y
  • Inverse Matrix (RULE: the product of a matrix and its inverse matrix equals the identity matrix)
Inverse matrix of X’X
  • Combination of Matrices Operations
(X’X)^(-1)X’Y

C. Understand the matrix operation in solving regression equations

  • The matrix multiplication X’X takes the sum of products for every pair of the X columns.
  • The inverse of X’X converts X’X into weights for the Xs.
  • The matrix multiplication X’Y takes the sum of products of each X column with Y. Visually, it is like expanding the space of the X matrix to the scale of Y.
  • (X’X)^(-1)X’Y gives the estimates of the intercept and coefficient(s).
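The chain (X’X)^(-1)X’Y above is the normal-equations solution to least squares. A sketch with made-up data, with NumPy standing in for the Excel array formulas:

```python
import numpy as np

# Made-up data generated exactly by y = 1 + 2*x
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])   # the column of 1s estimates the intercept
y = np.array([3.0, 5.0, 7.0, 9.0])

XtX = X.T @ X                    # X'X: sums of products of the X columns
XtY = X.T @ y                    # X'Y: sums of products of X with Y
beta = np.linalg.inv(XtX) @ XtY  # (X'X)^(-1) X'Y

print(beta)  # approximately [1, 2]: intercept 1, slope 2
```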

D. Reference

Statistics: Calculate Effect Size for Meta-Analysis

A. Web-based Effect Size Calculator

B. Examples

  • T: Treatment Group
  • C: Control Group
  • T(n): n for Treatment Group
  • C(n): n for Control Group
  • p: p-value
  • std: standard deviation
DataES
1.
Mean SES / Group / n / std
127.8 / T / 25 / 10.4
132.3 / C / 30 / 132.3
Standardized Mean Difference (d)
Means and Standard Deviation
d= -0.4466
2.
t/ T(n) / C(n)
1.68 / 10 / 12
Standardized Mean Difference (d)
T-Test, Unequal Sample Size
d= 0.7193
3.
T(n) / C(n) / p
10 / 12 / .037
Standardized Mean Difference (d)
T-Test P-Value, Unequal Sample Size
d= 0.9569
4.
r = .27 for binary variable and continuous variable
2*0.27/ sqrt(1-0.27^2)
ES=0.560829
formula from D.W. Wilson’s slides
5.
Group / Mean test score / n
1 / 55.38 / 13
2 / 59.40 / 18
3 / 75.14 / 37
4 / 88 / 22
F(3,86) = 7.05; the meta-analysis is only interested in groups 1 and 2; std not reported.
Standardized Mean Difference (d)
F-Test, 3 or more Groups
d= -0.1658
6.
2 x 2 table
n / group / % not improved / % improved
42 / T / 32% / 68%
29 / C / 37% / 63%
Standardized Mean Difference (d)
Frequency Distribution (Proportions)
d= 0.1057

7. Frequency table for the T and C groups, 60 cases in each group (means and stds not reported)
Degrees of Condition / T(n) / C(n)
0 / 15 / 20
1 / 15 / 20
2 / 15 / 10
3 / 15 / 10
Standardized Mean Difference (d)
Frequency Distribution
d = 0.305
8. regression analysis with nonequivalent comparison group design
covariates: employment status, marital status, age etc.
treatment: intervention / probation only
unstandardized regression coefficient: -.523
std for DV, severity of physical abuse: s=9.23
sample size (intervention / probation only) : n1= 125 / n2=254
Standardized Mean Difference (d)
Unstandardized Regression coefficient
Covariates adjusted ES(d)= -0.0568
SS = (125-1)*9.23^2 + (254-1)*9.23^2 = 32,117.72
df = 125 + 254 - 2 = 377
pooled variance = 32,117.72 / 377 = 85.1929
S_pooled = sqrt(85.1929) = 9.23
ES = -.523 / 9.23 = -.0567
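The conversions in examples 2, 4, and 8 are closed-form formulas that are easy to reproduce by hand; a sketch (the function names are mine, not from Wilson's calculator):

```python
import math

def d_from_t(t, n1, n2):
    """Example 2: d from an independent-samples t with unequal n."""
    return t * math.sqrt(1 / n1 + 1 / n2)

def d_from_r(r):
    """Example 4: d from a correlation r between a binary and a
    continuous variable (formula from D. W. Wilson's slides)."""
    return 2 * r / math.sqrt(1 - r ** 2)

def d_from_b(b, s1, s2, n1, n2):
    """Example 8: d from an unstandardized regression coefficient,
    standardized by the pooled SD of the dependent variable."""
    ss = (n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2
    s_pooled = math.sqrt(ss / (n1 + n2 - 2))
    return b / s_pooled

print(round(d_from_t(1.68, 10, 12), 4))                  # 0.7193
print(round(d_from_r(0.27), 4))                          # 0.5608
print(round(d_from_b(-0.523, 9.23, 9.23, 125, 254), 4))  # -0.0567
```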

C. Web-based Calculator Output

Example 1: Means and Standard Deviation
Example 2: T-Test, Unequal Sample Size
Example 3: T-Test P-Value, Unequal Sample Size
Example 5: F-Test, 3 or more Groups
Example 6: Frequency Distribution (Proportions)
Example 7: Frequency Distribution
Example 8: Unstandardized Regression Coefficient

D. Citation

Wilson, D. B. (date of version). Meta-analysis macros for SAS, SPSS, and Stata. Retrieved November 3, 2019, from http://mason.gmu.edu/~dwilsonb/ma.html

SAS: Proc Reg – Collinearity Diagnostics

A. Reference

B. Purpose

  • Examine whether predictors are highly collinear, which can cause problems in estimating the regression coefficients.
  • As the degree of multicollinearity increases, the coefficient estimates become unstable and the standard errors for the coefficients can be wildly inflated.

C. SAS code

proc reg data = cars;
model msrp = enginesize cylinders horsepower  /  tol vif collinoint;
run;
quit;

D. Notes

  • proc reg cannot handle categorical variables directly, therefore you need to create dummy variables yourself for any categorical variable.
  • tol: tolerance, the percent of variance in the predictor that cannot be accounted for by the other predictors. Regress the predictor variable on the rest of the predictor variables and compute the R square; 1 minus the R square equals the tolerance for that predictor.
  • vif: variance inflation factor, the reciprocal of tolerance. It measures how much the variance of the estimated regression coefficient is “inflated” by correlation among the predictor variables in the model. A VIF of 1 means no inflation at all; a VIF exceeding 4 warrants further investigation; a VIF greater than 10 indicates serious multicollinearity that requires correction.
  • collinoint: produces intercept-adjusted collinearity diagnostics. This table decomposes the correlation matrix into linear combinations of the variables; the variance of each of these linear combinations is called an eigenvalue. Collinearity is indicated when 2 or more variables have large proportions of variance (.50 or more) corresponding to a large condition index. A large condition index, 10 or more, is an indication of instability.
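Tolerance and VIF as defined above can be reproduced outside SAS by regressing each predictor on the others; a sketch with made-up data in which x3 is nearly a copy of x1 (NumPy, not PROC REG):

```python
import numpy as np

# Made-up predictors: x3 is almost identical to x1, so both get high VIFs
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = x1 + rng.normal(scale=0.1, size=100)
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """Regress column j on the remaining columns; VIF = 1 / (1 - R^2)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    others = np.column_stack([np.ones(len(y)), others])  # add intercept
    beta, *_ = np.linalg.lstsq(others, y, rcond=None)
    resid = y - others @ beta
    r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
    return 1 / (1 - r2)

for j in range(3):
    # tolerance is the reciprocal of VIF
    print(f"x{j+1}: tolerance={1 / vif(X, j):.3f}, VIF={vif(X, j):.1f}")
```

Here x1 and x3 should show VIFs far above 10, while the independent x2 stays near 1.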

E. SAS Output

F. Interpretation

  • Engine Size and Cylinders have VIFs greater than 5.
  • The highest condition index is 5.41, with 83.7% and 90.1% of the variance for Engine Size and Cylinders, respectively. Since 5.41 is less than 10, there is no multicollinearity.
  • The eigenvalues sum to 3 because there are 3 predictors.

Statistics: Tools for Systematic Review and Meta-Analysis

A. Resource

B. Guidelines

  • Cooper & Hedges, 1994
  • Hedges & Olkin, 1985
  • Lipsey & Wilson, 2001
  • Borenstein, Hedges, Higgins, & Rothstein, 2008: Comprehensive Meta-Analysis Version 2.2.048

C. Review Process

  • Identification of studies
    • Name of the reviewer
    • Date of the review
    • Article: Author, date of publication, title, journal, issue number, pages, and credentials
  • General Information
    • Focus of study
    • Country of study
    • Variables being measured
    • Age range of participants
    • Location of the study
  • Study Research Questions
    • hypothesis
    • theoretical/empirical basis
  • Methods designs
    • Independent variables
    • Outcome variables
    • Measurement tools
  • Methods groups
    • Nonrandomized with treatment and control groups/repeated measures design
    • Number of groups
  • Methods sampling strategy
    • Explicitly stated/Implicit/not stated/unclear
    • sampling frame (telephone directory, electoral register, postcode, school listing); selection: random/systematic/convenience
  • Sample information
    • number of participants in the study
    • if more than one group, the number of participants in each group
    • sex
    • socioeconomic status
    • ethnicity
    • special educational need
    • region
    • control for bias from confounding variables and groups
    • baseline value for longitudinal study
  • Recruitment and consent
    • Method: letters of invitation, telephone, face-to-face
    • incentives
    • consent sought
  • Data collection
    • Methods: experimental, curriculum-based assessment, focus group, group interview, one-to-one interview, observation, self-completion questionnaire, self-completion report or diary, exams, clinical test, practical test, psychological test, school records, secondary data etc.
    • who collected the data
    • reliability
    • validity
  • Data analysis
    • statistical methods: descriptive, correlation, group differences (t test, ANOVA), growth curve analysis/multilevel modeling(HLM), structural equation modeling(SEM), path analysis, regression
  • Results and conclusion
    • Group means, SD, N, estimated effect size, appropriate SD, F, t test, significance, inverse variance weight

D. Statistics

  • Cohen’s kappa
  • Cohen’s d
  • effect size
  • aggregate/weighted mean effect size
  • 95% confidence interval: upper and lower
  • homogeneity of variance (Q statistic): tests whether the effect sizes of the studies are significantly heterogeneous (p<.05), meaning there is more variability in the effect sizes than would be expected from sampling error and the effect sizes do not estimate a common population mean (Lipsey & Wilson, 2001)
  • df: degrees of freedom
  • I square (%): the percentage of variability of the effect size that is attributable to true heterogeneity, that is, over and above the sampling error.
  • Outlier detection
  • mixed-effects model (consider studies as random effects): moderator analysis for heterogeneity (allow for population parameters to vary across studies, reducing the probability of committing a Type I error)
  • Proc GLM/ANOVA (consider studies as fixed effects): moderator analysis for heterogeneity
    • Region
    • Socioeconomic status
    • Geographical location
    • Education level
    • Setting
    • Language
    • sampling method
  • Statistical difference in the mean effect size of methodological feature of the study
    • confidence in effect size derivation (medium, high)
    • reliability (not reported, reported)
    • validity (not reported vs. reported)
  • classic fail-safe N/Orwin’s fail-safe N: the number of missing null studies needed to bring the current mean effect size of the meta-analysis down to .04. The threshold is 5k+10, where k is the number of studies in the meta-analysis. If N is greater than the 5k+10 limit, it is unlikely that publication bias poses a significant threat to the validity of the findings of the meta-analysis.
    • Used to assess publication bias. eg. control for bias in studies (tightly controlled, loosely controlled, not controlled)
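The Q statistic and I² listed above follow directly from the study effect sizes and their inverse-variance weights; a minimal sketch with made-up values:

```python
# Made-up effect sizes and their sampling variances for 4 studies
effects = [0.30, 0.45, 0.10, 0.60]
variances = [0.02, 0.03, 0.025, 0.04]

weights = [1 / v for v in variances]  # inverse-variance weights
mean_es = sum(w * d for w, d in zip(weights, effects)) / sum(weights)

# Q: weighted sum of squared deviations from the mean effect size
Q = sum(w * (d - mean_es) ** 2 for w, d in zip(weights, effects))

# I^2: percent of variability beyond sampling error (floored at 0)
df = len(effects) - 1
I2 = max(0.0, (Q - df) / Q) * 100

print(round(mean_es, 4), round(Q, 3), round(I2, 1))
```

Comparing Q to a chi-square with k-1 degrees of freedom gives the heterogeneity p-value described above.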

E. Purpose/Research Questions

  • Is the treatment associated with a single effect or with multiple effects?
  • Understand the variability across studies in the association of the treatment with single or multiple effects, and explain the variable effects through study features (moderators). How do the effects of the treatment vary across study features?

F. Reference

Regression Basics – MSR, MSE, Cook’s Distance

How do you calculate the MSR, MSE, and Cook’s Distance directly from the output dataset with the model predictions? SAS produces a general summary of ANOVA results as below, but how are these statistics calculated?

Y = Intercept + beta1*x1 + beta2*x2 + beta3*x3; n = 49; observations 20 to 47 are omitted in the display.

The Excel spreadsheet shows the model predictions as yhat_old, the residuals as resid_old, and the Cook’s distance values as cook_old. Observation 19 has a high value (1.29) relative to the 4/n threshold (4/49 = 0.08).

  • Mean Square due to Regression (MSR) =24,948,880,090
    • Column model^2 contains the squares of the difference between yhat_old and the mean of Y, which is 186,509, as shown in the Excel;
    • Take the sum of the above, which equals 74,846,640,270, as shown in the Excel.
    • Divide by the degrees of freedom, which is 3 (3 slope parameters: x1, x2, x3).
    • MSR means variances that are explained by the model.
  • Mean Square Error (MSE) =139,372,106
    • Column error^2 contains the squares of the difference between yhat_old and Y;
    • Take the sum of the above, which equals 6,271,744,784, as shown in the Excel.
    • Divide by the degrees of freedom, which is the total number of observations minus the number of parameters, counting the intercept as a parameter (intercept, x1, x2, x3): 49 - 4 = 45.
    • MSE means variances that are unexplained by the model.
  • Root MSE or s: The estimated standard deviation of the random error. s=sqrt(139372106)=11805.6
  • F value: MSR/MSE =179.01
  • Cook’s Distance (how the cook’s D for observation 19 is calculated)
    • yhat_new is the model prediction excluding observation 19 in the model.
    • Column diff^2 are the squares of the difference between yhat_old and yhat_new;
    • Take the sum of the above, which equals 723,926,182;
    • For the denominator, multiply MSE (139,372,106) by the number of parameters (4), which includes the intercept as a parameter (intercept, x1, x2, x3).
    • Dividing the first result by the second gives 1.2985.
    • Cook’s Distance measures an individual observation’s influence on the model by removing the observation from the model and measuring the squared difference in predictions as a proportion of the MSE per parameter. A Cook’s Distance close to 0 means that removing the observation from the model has little impact on the model.

The Excel calculation matches the SAS output.
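The leave-one-out recipe above can be reproduced on any regression dataset; a sketch with made-up data (the document's dataset is not reproduced here), counting the intercept as a parameter as in the text:

```python
import numpy as np

# Made-up data with one deliberately influential point
rng = np.random.default_rng(1)
x = rng.normal(size=20)
y = 2 + 3 * x + rng.normal(scale=0.5, size=20)
y[0] += 8                    # shift observation 0 to make it influential

X = np.column_stack([np.ones(20), x])  # intercept + one predictor
p = X.shape[1]                          # number of parameters incl. intercept

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ beta

msr = ((yhat - y.mean()) ** 2).sum() / (p - 1)  # model df = slopes only
mse = ((y - yhat) ** 2).sum() / (len(y) - p)    # error df = n - p
f_value = msr / mse

def cooks_d(i):
    """Refit without observation i; compare predictions on all n points."""
    keep = np.arange(len(y)) != i
    beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    yhat_i = X @ beta_i
    return ((yhat - yhat_i) ** 2).sum() / (p * mse)

print(f"MSE={mse:.2f}, F={f_value:.2f}, Cook's D for obs 0: {cooks_d(0):.2f}")
```

Observation 0 should stand out against the 4/n = 0.2 threshold, mirroring how observation 19 stands out in the document's dataset.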

Eigenvalue and Eigenvector

Google Definition: 1) each of a set of values of a parameter for which a differential equation has a nonzero solution (an eigenfunction) under given conditions. 2) any number such that a given matrix minus that number times the identity matrix has a zero determinant.

reference: https://www.khanacademy.org/math/linear-algebra/alternate-bases/eigen-everything/v/linear-algebra-introduction-to-eigenvalues-and-eigenvectors

Kaiser-Guttman Criterion: ‘Eigenvalues greater than one’ (Guttman, 1954; Kaiser, 1960, 1970) is commonly used to determine the number of factors to retain. The thinking behind the criterion is “that a factor must account for at least as much variance as an individual variable” per Nunnally and Bernstein (1994).
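Definition 2 above (a matrix minus the eigenvalue times the identity has zero determinant) can be checked numerically; a sketch with a made-up 2×2 correlation-style matrix:

```python
import numpy as np

# A made-up correlation-style matrix for two variables
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])

eigenvalues, _ = np.linalg.eig(A)   # eigenvalues are 1.5 and 0.5

# Definition 2: det(A - lambda*I) is zero for each eigenvalue
for lam in eigenvalues:
    assert abs(np.linalg.det(A - lam * np.eye(2))) < 1e-9

# For a correlation matrix the eigenvalues sum to the number of variables,
# which is why the Kaiser-Guttman cutoff of 1 means "accounts for more
# variance than one variable"; here only the 1.5 factor would be retained
print(eigenvalues.sum())  # approximately 2
```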