Kaggle Competition: Men’s and Women’s Division I Basketball

Host: The NCAA and Google Cloud

  • The National Collegiate Athletic Association (NCAA®) is a non-profit organization which regulates most of the United State’s college athletic programs, now supporting nearly half a million student athletes. They officially run the US Men’s and Women’s Division I college basketball tournament, March Madness®.
  • Google Cloud offers a suite of cloud computing services that runs on the same infrastructure that Google uses internally for its end-user products, like Google Search and YouTube. Google Cloud is now the official cloud of the NCAA®.

Kaggle Page Links:

Competition Detail:

  • Kagglers who participate in this competition will use their computing power along with the NCAA®’s data to join the millions of fans who attempt to predict the outcomes of this year’s tournament.
  • The competition is split into 2 stages. In the first stage, Kagglers will rely on results of past tournaments to build and test models. In the second stage, competitors will forecast outcomes of all possible match-ups in the 2018 tournament. You don’t need to participate in the first stage to enter the second – the real competition is forecasting the 2018 results.

SAS: Concatenate Values in Multiple Rows by Group

For example, data table contains student grades  for the term. Each grade is one record and the term, program, student id information are the same  for the student.  The following data step can be used to order the data table and then concatenate the grade records into something like ‘A/A/B/B+/A’ for each student.

proc sort data= one;
by groupvar1 groupvar2 groupvar3;
run;
data two;
set one;
by groupvar1 groupvar2 groupvar3;
firstgroup = first.groupvar2;
lastgroup = last.groupvar2;
retain catvar;
if firstgroup =1 then catvar=trim(targetvar);
else catvar=catx('/',catvar, targetvar);
if lastgroup ne 1 then delete;
keep groupvar1 groupvar2 catvar;
run;

SAS: Proc Tabulate and Percentage Statistics

The following proc tabulate procedure will generate at a report on sum of undergraduate teaching and graduate teaching and FTE for each Faculty by departments (row header) and by year (column header).

proc tabulate data=data;
class year faculty dept;
var undergrad_load graduate_load fte;
table faculty, (dept all), year*(undergrad_load="UG" graduate_load="GR" fte)*(sum="");
run;

Data table contains grade, program, and incoming year of students. “Total” variable equals to 1 if the student has a grade; “ge75” is 1 if the grade is great than and equal to 75; and “ge80” is 1 if the grade is great than and equal to 80.

The following proc tabulate procedure gives a summary table of total number of counts of students with grade, students with grade great than and equal to 75 and 80.

proc tabulate missing f=6. noseps data=data;
class program progname category year level2;
var total ge75 ge80;
tables category,program*progname*level2,year*(total ge75 ge80) / rts=40 indent=3;
run;

PERCENTAGE STATISTICS

  • reppctn: report percentage (all rows and all columns add up to 100)
  • colpctn: column percentage (every column adds up to 100)
  • rowpctn: row percentage (every row adds up to 100)
  • A*pctsum: construct a custom denomination, both A and B are analysis variables, and B as in the is the denominator for calculating the percentage.

The following tabulate procedure calculates the percentage of greater and equal to 75 and 80. The outomatic SAS naming with ‘1111111’ of the percentage output reflects the number of class variables in the procedure.

proc tabulate missing f=6. noseps data=data out=dataout (drop=_type_ _table_ _page_ rename=(ge75_pctsum_1111111=ge75 ge80_pctsum_1111111=ge80));
class level1 faculty program progname category level2 year;
var total ge75 ge80;
tables category*level1*faculty*program*progname*level2*year, ge75*pctsum='ge75%'*f=8.2 ge80*pctsum='ge80%'*f=8.2 / rts=40 indent=3;
run;

Examples: with multiple class variables in row expression

  • Reppctn/Reppctsum
proc tabulate data = table1 ;
class Agegroup gender acadyear;
var heads ;
table agegroup="Age Group" all gender="Gender" all ,
(acadyear="" all)*(heads="Heads"*SUM="" heads="% Total"*reppctsum="" );
run;
  • Colpctn/Colpctsum — best for % sum for multiple class variables
proc tabulate data = table1 ;
class Agegroup gender acadyear;
var heads ;
table agegroup="Age Group" all gender="Gender" all ,
(acadyear="" all)*(heads="Heads"*SUM="" heads="% Total"*colpctsum="" );
run;
  • Rowpctn/Rowpctsum
proc tabulate data = table1 ;
class Agegroup gender acadyear;
var heads ;
table agegroup="Age Group" all gender="Gender" all ,
(acadyear="" all)*(heads="Heads"*SUM="" heads="% Total"*rowpctsum="" );
run;
  • pctsum – not working for the All column
proc tabulate data = table1 ;
class Agegroup gender acadyear;
var heads ;
table agegroup="Age Group" all gender="Gender" all ,
(acadyear="" all="All Year" )*(heads="Heads"*SUM="" heads="% Total"*pctsum < agegroup all gender all > ="" );
run;

Show % sign in the table

Reference: http://support.sas.com/kb/38/001.htmlhttp://support.sas.com/kb/38/001.html

proc format;                           
   picture fmtpct (round) low-high='009.99%';   
run;
/* use: f=fmtpct. */

SAS Proc Sgplot: Assign colors by group in Statistics Plot

Reference: https://blogs.sas.com/content/iml/2012/10/17/specify-the-colors-of-groups-in-sas-statistical-graphics.html; https://blogs.sas.com/content/graphicallyspeaking/2012/02/27/roses-are-red-violets-are-blue/

 

Original code use gplot

%macro plots;
%do i = 0 %to 1;
%if &i=0 %then %do;
title height=12pt "Regular Salary Group";
%end;
%else %do;
title height=12pt "High Salary Group";
%end;
%do j = 2 %to 4;
%if &j=2 %then %do;
title2 height=14pt "&curr_fiscal_year  Assistant Professors";
%end;
%else %if &j=3 %then %do;
title2 height=14pt "&curr_fiscal_year  Associate Professors";
%end;
%else %do;
title2 height=14pt "&curr_fiscal_year Full Professors";
%end;
proc gplot data=anno_&i(where=(rank_cd=&j)) anno=anno_&i(where=(rank_cd=&j));
plot y*x / haxis=axis1 vaxis=axis2 noframe;
symbol1 v=dot h=.6 w=.6 color='Black';
format basesal comma7.;
run;
quit;
%end;
%end;
%mend plots;
%plots;

New code with the following improvements:

  1. produce regression line with color and transparency attributes
  2. produce 95% confidence limit with color and transparency attributes
  3. produce scatter plot by gender group and use grouporder attribute to make sure fixed color is assigned to Male and Female.
  4. output the plots to pdf

 

ods pdf file="X:regression_with_gender_label.pdf";
goptions reset = global;
%macro plots;
%do i = 0 %to 1;
%if &i=0 %then %do;
title height=12pt "Regular Salary Group";
%end;
%else %do;
title height=12pt "High Salary Group";
%end;
%do j = 2 %to 4;
%if &j=2 %then %do;
title2 height=14pt "&curr_fiscal_year Assistant Professors";
%end;
%else %if &j=3 %then %do;
title2 height=14pt "&curr_fiscal_year Associate Professors";
%end;
%else %do;
title2 height=14pt "&curr_fiscal_yearFull Professors ";
%end;
proc sgplot data=log_glm (where=(rank_cd=&j. and area = "&i.")) ;
scatter x=exper y=log_sal / group=gender grouporder=ascending name='plot' markerattrs=(symbol=circlefilled);
series x=exper y=yhat / name='predict' legendlabel='ln(Predicted Log Sal)' lineattrs=(color=blue ) transparency = 0.5 ;
series x=exper y=ucl / name='upper' legendlabel='Upper Confidence Level' lineattrs=(color = lightblue) transparency = 0.5;
series x=exper y=lcl / name='lower' legendlabel='Lower Confidence Level' lineattrs=(color = lightblue) transparency = 0.5;
run;
quit;
%end;
%end;
%mend plots;
%plots
Ods pdf close;

Problem: myattrmap not reconginzed by the scatter statement though ‘attrid= myid’ doesn’t generate error.

data myattrmap;
retain id value linecolor fillcolor;
length linecolor $ 9 fillcolor $ 9;
input ID $ value $ linecolor $ fillcolor $;
datalines;
myid F blue blue
myid M red red;
run;

EXCEL: Most Clever Lookup Formula with Multiple Conditions

I came across an excel lookup function through the following link.  https://www.extendoffice.com/documents/excel/2440-excel-vlookup-multiple-criteria.html

Then I found further explanation in the following forum.

https://www.excelforum.com/excel-formulas-and-functions/1043140-explain-lookup-2-1-a2-a10-d2-b2-b10.html

The structure for the formula is:

=LOOKUP(2, 1/(COLUMN1=CONDITION1)/(COLUMN2=CONDITION2)/(COLUMN=CONDITIONn), (RESULTCOLUMN))

This formula can be used to look up both numeric value and character value that match multiple conditions.  COLUMN1=CONDITION1 will produce an array of 0s and 1s; 1 divided by the array will produce an array of %Div/0s and 1s; then the array will be further divided by an array of 0s and 1s (results of column2=condition2) and so on.  The lookup value is set at 2 because 2 is greater than 1, and if lookup_value (2) is not found within the lookup_vector, LOOKUP returns the last value in that array which is less than 2.  The formula then returns the value in the result column using the corresponding position found in the lookup columns.

 

SQL Join: Compile Statistics Summary from Multiple Statistics Tables

In the previous blog, I presented how to compile statistic results by groups and by variable into a long table for each type of statistics analysis.   All tables share three common field: group, test and var.

Here is the list of the table with the specific statistic that want to extract and compile into a summary table.

  • stat_tstat:
    • class = ‘1’: n, mean
    • class = ‘0’: n, mean
    • class = ‘Diff (1-2)’: mean,
  • stat_ttest:
    • variances = ‘Unequal’: tValue, Probt

PPOC SQL is the most convenient procedure to pick up these values under specific condition and combine them into one table.

proc sql;
create table stat1 as
select a.n as studyn, a.mean as studymean, b.n as controln, b.mean as controlmean, c.mean as diff, a.group, a.var, d.tValue, d.probt
from stat_tstat a
left join stat_tstat b
on a.group =b.group and a.var=b.var and b.class = '1'
left join stat_tstat c
on a.group =c.group and a.var=c.var and c.class = 'Diff (1-2)'
left join stat_ttest d
on a.group = d.group and a.var=d.var and d.variances = 'Unequal'
where a.class = '0';
quit;

SAS: Stacking Statistical Output using Macros

The previous post goes through basic SAS statistics procedures, like Proc Means, Proc Ttest, and Proc Freq.  In this post, the focus is to establish a good naming convention for the statistical outputs. For each loop of PROC TTEST or PROC FREQ, the statistics are saved in separate tables by pairing group and type of statistics.  When stacking the test results, it is important to identify the type of statistics, the variable that the statistics is based on and the pairing group.  In the following macro, &in. is the group identifier, &var. is the variable identified, and &stat is the statistics procedure identifier.

%macro stat (in =, var=, stat=);
data &in.&stat.&var.;
set &in.&stat.&var.;
length group $2 test $20 var $20;
group = substr("&in.", 1, 2);
test = "&stat.";
var = "&var.";
rename &var. = response;
run;
proc append base=stat_&stat. data = &in.&stat._&var.;
run;
%mend stat;

proc datasets library= work noprint;
delete stat_prop;
run;
%stat ( in =g1, var= isretainYr1,stat=prop);
%stat ( in =g1, var= isretainYr2,stat=prop);
%stat ( in =g1, var= isretainYr3,stat=prop);
%stat ( in =g1, var= isretainYr4,stat=prop);
%stat ( in =g2, var= isretainYr1,stat=prop);
%stat ( in =g2, var= isretainYr2,stat=prop);
%stat ( in =g2, var= isretainYr3,stat=prop);
%stat ( in =g2, var= isretainYr4,stat=prop);

SAS STAT: getting output statistics (1)

SAS STAT package has many procedures that can be used to conduct specific statistical analysis with structured data sets.  I found the following use of precedures quite common and there are usually a number of statistics needed to be output for further compilation.  The caveat is to sort the data set in the desired order, so the null hypothesis is in the upper left cell (1,1) in the two-way frequency table, which normally showes the results of study group with a respone equal to 1. This way the output statistics can be interpreted more straight forward.

A. How to get output from statistics procedures?

There are different ways for outputing statistics, eg. ODS OUTPUT, OUTPUT statement or OUT=;

a. use PROC MEANS for general statistics like n, mean, min, max std;

  • use maxdec= option to adjust the decimal places needed.
  • output out=filename: N, MIN, MAX, MEAN, STD, SUMWGT

b. use PROC TTEST for two-sample independent T-test; use CLASS Variable to identify and differentiate the groups for study cases and control cases;

  • ods output
    • ttests=filename: tValue, DF, (pick Satterthwaite method which assume unequal variance);
    • statistics=filename: n, mean, LowerCLMean, UpperCLMean, StdDev, StdErr for ‘0’,’1′, Diff (1-2);
    • equality=filename: fValue, probf (equality of variances)

c. use PROC FREQ for two-sample independent T-test

  • tables statement, out option
    • out= filename: default only provide the TABLES variable, frequency count, percent of total frequency.  To include percent of row frequency  and percent of column frequency, you need to add OUTPCT in TABLES statement.
  • output statement out option
    • Output  relrisk  out = filename;
  • ods output
    • ChiSq=filename: Chi-Square, Prob
    • PdiffTest=filename: Proportion Difference Test, Wald is the default method, Proportion Difference, Z Value, One-sided Prob and two-sided prob.
    • RelativeRisks=filename: relative risk estimate, case-control (odds ratio), column 1 risk, column 2 risk, 95% L/U Confidence Limit

Overall, the ODS OUTPUT is the most versatile and powerful method to obtain statisics results from these procedure.  Check “ODS Table Names” under each procedure in the SAS STAT User Guide.  The SAS procedure assigns a name to each table that it creates.  Use ODS Table Name= filename to assign your own table name to be saved in the work library.  In order to have the output ods table, you also need to check if you have included the corresponding option(s) in the specific statement for the program generate the statistics.

Sample code for Two-sample Proportion Test (include all 3 methods of output):

%macro prop ( in =, var=, out=, weight=);
proc sort data=&in;
by descending study descending &var.;
run;
ods graphics on;
ods output ChiSq = &out.chi_&var. PdiffTest=&out.pdiff_&var. RelativeRisks=&out.rr_&var.;
proc freq data=&in. order =data ;
format study grpfmt. &var. rspFmt.;
weight &weight.;
tables study*&var. / chisq measures riskdiff(equal) outpct out=&out._&var.
plots= (freqplot(twoway=groupvertical scale =percent));
output relrisk out=&out._or_&var.;
title "Proportion Test:Case - Control study of variable &var. for &out.";
run;
ods output close;
ods graphics off;
%mend prop;
%prop ( in = g1pair, var= isRetainYr1, out =g1prop, weight = weight1);

*”The Satterthwaite approximation of the standard errors differs from the Pooled method in that it does not assume that the variances of the two samples are equal. This means that if the variances are equal, the Satterthwaite approximation should give us exactly the same answer as the Pooled method.” (Reference: https://wolfweb.unr.edu/~ldyer/classes/396/PSE.pdf)

SAS STAT: Weight Statement

Weight statement can be used to exclude observation from the statistical procedure. If Weight statement is not specified, each obs has a default weight of 1.

PROC MEANS

0: Counts the obs in number of observations
<0: convert to 0 and treat like 0
missing: exclude

PROC FREQ

0: ignore obs unless specify the ZEROS option, which include obswith 0 weights.
missing: exclude

PROC TTEST

0: exclude
<0: exlude
missing: exclude

Basically, assigning a missing value to the weight field to exclude the observation and assigning 1 to the weight field to include the observation in the analysis.