Srikanth Boyina: August 2012

Thursday, August 30, 2012

Compute biserial, point biserial, and rank biserial correlations

Source -

http://support.sas.com/kb/24/991.html

%macro biserial(version, data= ,contin= ,binary= ,out=);

%if &version ne %then %put BISERIAL macro Version 2.2;

options nonotes;

* exclude observations with missing variables *;

data &out;

set &data;

where &contin>.;

if &binary>.;

run;

* compute the ranks for the continuous variable *;

proc rank data=&out out=&out ;

var &contin;

ranks r_contin;

run;

* compute proportion of binary, std of contin, and n *;

proc means data=&out noprint;

var &binary &contin;

output out=_temp_(keep=p stdy n) mean=p std=stdx stdy n=n;

run;

* sort by the binary variable *;

proc sort data=&out;

by descending &binary;

run;

* compute mean of contin and rank of contin var *;

proc means data=&out noprint;

by notsorted &binary;

var &contin r_contin;

output out=&out mean=my r_contin;

run;

* restructure the means computed in the step above *;

proc transpose data=&out out=&out(rename=(col1=my1 col2=my0));

var r_contin my;

run;

* combine the data needed to compute biserial correlation *;

data &out;

set &out(drop= _name_ _label_);

retain r1 r0 ;

if _n_=1 then do;

r1=my1;

r0=my0;

end;

else do;

set _temp_;

output;

end;

run;

* compute point biserial correlation *;

proc corr data=&data noprint outp=_temp_;

var &binary &contin;

run;

* extract the point biserial correlation from the matrix *;

data _temp_(keep=pntbisrl);

set _temp_(rename=(&contin=pntbisrl));

if _TYPE_='CORR' and &binary<>1 then output;

run;

options notes;

* compute biserial and rank biserial *;

data &out;

merge _temp_ &out;

if pntbisrl=1 then delete;

h=probit(1-p);

u=exp(-h*h/2)/sqrt(2*arcos(-1));

biserial=p*(1-p)*(my1-my0)/stdy/u;

rnkbisrl=2*(r1-r0)/n;

keep biserial pntbisrl rnkbisrl;

label biserial='Biserial Corr'

pntbisrl='Point Biserial Corr'

rnkbisrl='Rank Biserial Corr';

run;

%mend;

data k;

length x1 $ 1;

input x1 length;

event=(x1='y');

cards;

y 14.8

n 13.8

y 12.4

y 10.1

y 7.1

y 6.1

n 5.8

y 4.6

n 4.3

n 3.5

n 3.3

y 3.2

y 3.0

n 2.8

n 2.8

n 2.5

y 2.4

y 2.3

y 2.1

n 1.7

n 1.7

n 1.5

n 1.3

n 1.3

n 1.2

n 1.2

n 1.1

y 0.8

n 0.7

n 0.6

n 0.5

n 0.2

n 0.2

y 0.1

;

/* Define the BISERIAL macro */

%inc "";

%biserial(data=k, contin=LENGTH, binary=EVENT, out=out1);

*********************
data= SAS data set to be analyzed.
binary = Name of dichotomous variable which must be numeric with values 0 and 1.
contin= Name of continuous variable. Ranks of this variable will be computed to produce the rank biserial corr.
out= Output data set name.
*****************

proc print data=out1 label noobs;

title 'Point Biserial, Biserial and Rank Biserial Correlations';

run;

Wednesday, August 29, 2012

Logistic Regression - Assumptions

Residuals follow a binomial rather than a normal distribution. Normality of variables is not a stringent requirement.

A “nonlinear” (specifically an S-shaped, or sigmoidal, curve) relationship between IVs and the DV; however, this represents a linear relationship between the logit (natural log of the odds of the dependent occurring or not) and the set of IVs. See Addendum 2 for an illustration that compares probabilities, odds, and the logit.

Uses a maximum-likelihood rather than least-squares statistical model. In least squares, we select
regression coefficients that result in the smallest sum of squared differences between the observed
and the predicted values of the DV. In maximum-likelihood, the coefficients that make our observed results “most likely” are selected.

Does not assume homoscedasticity.

Assumes that there is little or no multicollinearity

Predicts the odds of an event occurring (see Addendum 1), which is based on the probability of

that event occurring. Precisely, the odds of an event occurring is: Odds=P/(1-P)
={prob. of event occurring}/[1-prob. of event occurring]
--------------------------------------------

Terminology for use with Logistic Regression -

Probability = P = probability of an event occurring (range of 0 - 1)

Odds = P/(1-P) = ratio of the probability of an event occurring to the probability of the event not occurring (range of 0 – positive infinity)

Odds ratio = Odds1/ Odds2= ratio of two odds

Logit = ln(Odds) = predicted logged odds (range of neg. infinity - pos. infinity)

******************

References:

Hosmer, D. W., & Lemeshow, S. (2000). Applied logistic regression (2nd ed.). New York: John Wiley &

Sons, Inc.

Menard, S. (1995). Applied logistic regression analysis. Thousand Oaks, CA: Sage Publications.

Pampel, F. C. (2000). Logistic regression: A primer. Thousand Oaks, CA: Sage Publications.

Quantiles into new dataset

data tt;

do i=1 to 100;

x1=rannor(0);output;

x2=rannor(0);output;

end;

run;

ODS TRACE on / listing;

proc univariate data=tt;

var x1 x2;

run;

ODS TRACE off;

proc univariate data=tt;

var x1 x2;

ODS OUTPUT Quantiles=qntls;

run;

http://rfd.uoregon.edu/files/rfd/StatisticalResources/110_ods.txt

C - Stat Calculation Code for Validation and OOT

/* STEP - 1 Modeling for Development Dataset */

proc logistic data=DEV outmodel=logist_param descending;

model depvar=&var_r. /

outroc=roc

stb

scale=none

lackfit

selection=stepwise sle=.05 sls=.001;

output out=est_dev P=score;

run;

/* KS */

proc npar1way edf data=devscr;

class depvar;

var score;

run;

/* STEP - 2 Modeling for Validation Dataset */

proc logistic descending inmodel =logist_param;

score data = VAL out = est_val;

run;

/* KS */

proc npar1way data = est_val edf;

class depvar;

var P_1;

run;

/* STEP - 3 Modeling for OOT Validation Dataset */

proc logistic descending inmodel =logist_param;

score data = OOT out = est_val_OOT;

run;

/* KS */

ODS GRAPHICS on;

proc npar1way data = est_val_OOT edf;

class depvar;

var P_1;

run;

/* STEP - 4 C Statistic for Validation and Out of Time Sample DataSet */

%macro C_stat(input,var1,var2,output);

data d nd (rename=(&var1.=&var1._2 &var2.=&var2._2));

set &input.;

if &var1. then output d; else output nd;

run;

proc sql;

create table pairs as select a.&var1.,a.&var2.,b.&var1._2,b.&var2._2

from d as a, nd as b;

quit;

data class;

set pairs;

if &var2. gt &var2._2 then concordant =1; else concordant =0;

if &var2. = &var2._2 then tied =1; else tied =0;

if &var2. lt &var2._2 then discordant =1; else discordant =0;

proc summary data = class;

output out = sum

sum(concordant)=concordant

sum(tied)=tied;

run;

data &output.;set sum;

C = (concordant+tied)/_freq_;

run;

%mend;

%C_stat(est_val,depvar,p_1,C_for_val_data);

%C_stat(est_val_OOT,depvar,p_1,C_for_oot_data);

C - Stat Calculation Code

data test(drop=i);

do i=1 to 200;

y=round(ranuni(9),1);

x=ranuni(3);output;

end;

run;

proc logistic data=test outmodel=logist_param descending;

model y=x;

output out=est_dev P=score;

run;

%macro C_stat(input,var1,var2,output);

data d nd (rename=(&var1.=&var1._2 &var2.=&var2._2));

set &input.;

if &var1. then output d; else output nd;

run;

proc sql;

create table pairs as select a.&var1.,a.&var2.,b.&var1._2,b.&var2._2

from d as a, nd as b;

quit;

data class;

set pairs;

if &var2. gt &var2._2 then concordant =1; else concordant =0;

if &var2. = &var2._2 then tied =1; else tied =0;

if &var2. lt &var2._2 then discordant =1; else discordant =0;

proc summary data = class;

output out = sum

sum(concordant)=concordant

sum(tied)=tied;

run;

data &output.;set sum;

C = (concordant+tied)/_freq_;

run;

%mend;

%C_stat(est_dev,y,score,out_c);

Friday, August 17, 2012

Logistic regression Analysis - Dev Val OOT

/*Logistic regression Analysis for development sample*/

proc logistic descending data=reg_test outest=alt.modest

outmodel=dev_model_estimate;

model =

/ selection= stepwise slstay=0.0001 /*include=13*/;

output out=calculated_scores p=prob xbeta=logit;

run;

proc npar1way data=calculated_scores;

class bad_new1;

var prob;

run;

/*KS calculation for validation sample*/

proc logistic inmodel=dev_model_estimate;

score data=reg_test_val out=calculated_scores_val;

run;

proc npar1way data= calculated_scores_val;

class bad_new1;

var P_1;

run;

/*KS calculation for OOT-validation sample*/

proc logistic inmodel=dev_model_estimate;

score data=reg_test_OOTval out=calculated_scores_OOTval;

run;

proc npar1way data= calculated_scores_OOTval;

class bad_new1;

var P_1;

run;

KS - npar1way - edf - example

data dat;

do i = 1 to 10000;

X1 = 10 + 3*rannor(0);

X2 = 20 + 2*ranuni(0);

X3 = 15 - exp(2*rannor(0));

X4 = 35 + 2*ranexp(0);

y=round(ranuni(0));

output; end; drop i;

run;

proc logistic data=dat outest=logist descending;

model y=x1 x2 x3 x4/

outroc=roc

stb

scale=none;

output out=valid P=score;

run;

proc npar1way edf data=valid;

class y;

var score;

run;

Proc output section into dataset - ods trace on

Outlier Treatment - 2 std

/**Outlier Treatment **/

data winsor_test;

do i = 1 to 98;

x = round(rannor(124));

y = round(rannor(24));

z = round(rannor(14));

output ;

end;

x = 989;

y = .;

z = 1234;

output;

drop i;

run;

options mlogic mprint;

%macro get_stat(datain);

proc means data = &datain. noprint ;

output out = cal_means(where=(_STAT_ in ('MEAN','STD')));

run;

proc transpose data = cal_means(drop = _type_ _freq_ _stat_) out = cal2;

run;

data _null_;

set cal2;

call symput(compress("Mean"

_n_),col1);

call symput(compress("Std"

_n_),col2);

call symput(compress("Vname"

_n_),_name_);

call symput(compress("Num"),_N_);

run;

data &datain._temp;

set &datain.;

%do i=1 %to &Num.;

if (&&Vname&i.. > &&Mean&i.. + ( 2* &&Std&i..)) OR

(&&Vname&i.. < &&Mean&i.. - ( 2* &&Std&i..)) then &&Vname&i.. = . ;

%end;

run;

proc means data = &datain._temp noprint;

output out = new_means (where=(_STAT_ in ('MEAN','STD')));

run;

proc transpose data = new_means(drop = _type_ _freq_ _stat_) out = tran_new;

run;

data _null_;

set tran_new;

call symput(compress("Mean"

_n_),col1);

call symput(compress("Std"

_n_),col2);

call symput(compress("Vname"

_n_),_name_);

call symput(compress("Num"),_N_);

run;

data &datain._final;

set &datain.;

%do i=1 %to &Num.;

New_&&Vname&i.. = &&Vname&i..;

if not missing(&&Vname&i..) then

DO;

if &&Vname&i.. > &&Mean&i.. + ( 2* &&Std&i..) then New_&&Vname&i.. = &&Mean&i.. + ( 2* &&Std&i..);

if &&Vname&i.. < &&Mean&i.. - ( 2* &&Std&i..) then New_&&Vname&i.. = &&Mean&i.. - ( 2* &&Std&i..);

END;

%end;

run;

%mend;

%get_stat(winsor_test);

retain char

data test;

input x $ page_char $ ;

datalines;

xyz page

1 xcv

2 xfv

3 ggg

pqr page

4 dfg

5 ddd

6 ggg

;run;

data test1;

set test;

if page_char="page" then

page=_N_;else page=0;

run;

data test2;

set test1;

RETAIN Page_v1 0;

Page_v1 = SUM(Page_v1, page);

run;

proc sort data=test2;by page_v1;run;

data test3(keep=New page_char x);

retain NEW;

set test2;

by page_v1;

if first.page_v1 = 1 then NEW = x;

else NEW = trim(NEW);

put _all_;

run;

Plots

options mprint symbolgen;

%macro univariate_plot(indata,invar,target,numgrps);

data srv_work.temp4plot (keep= &invar. &target.) Zeros4plot (keep= &invar. &target.) miss4plot (keep= &invar. &target.);

set srv_work.&indata.;

if &invar. = 0 then output zeros4plot;

if &invar. = . then output miss4plot;

else output srv_work.temp4plot;

run;

proc rank data=srv_work.temp4plot groups=&numgrps. out= srv_work.temp4plot;

Var &invar.;

ranks R_&numgrps.;

run;

data srv_work.fin4plot;

set srv_work.temp4plot;

run;

proc sql;

create table srv_work.t_collapsed as

select R_&numgrps.,

count(*) as N,

sum(&target.) as sum_&target.,

avg(&invar.) as M_&invar.,

avg(&target.) as &target.,

min(&invar.) as &invar.

from srv_work.fin4plot

group by r_&numgrps.;

quit;

SYMBOL1 COLOR=blue VALUE=Triangle HEIGHT=2 width=2 INTERPOL=NONE;

SYMBOL2 INTERPOL=NONE COLOR=red HEIGHT=2 VALUE=diamond WIDTH=2;

proc gplot data=srv_work.t_collapsed;

legend1 value=( "default_risk");

legend2 value=( "Number of observations");

plot &target.*&invar. / legend=legend1;

plot2 N*&invar./overlay legend=legend2;

run;

quit;

%mend;

/*Graph – Independent Variable Vs Dependent Variable */

/* Converting Total Number of observations into 40 Groups */

%univariate_plot(with_var,dlq_at_lm,default,40);

Winsorize macro

/*****************************************

* Source From - http://www.wrds.us/index.php/repository/view/1

******Trim or winsorize macro

*byvar = none for no byvar;

*type = delete/winsor (delete will trim, winsor will winsorize;

*dsetin = dataset to winsorize/trim;

*dsetout = dataset to output with winsorized/trimmed values;

*byvar = subsetting variables to winsorize/trim on;

****************************************/

data test;

do i=1 to 100;

x=round(abs(rannor(0)));

y=round(ranuni(0));output;

end;

x=900;y=1;output;

/*x=800;y=0;output;*/

drop i;

run;

%winsor(dsetin=Test,dsetout=TestOut,byvar=none,vars=x,type=winsor,pctl=1 99);

options mlogic mprint;

%macro winsor(dsetin=, dsetout=, byvar=none, vars=, type=winsor, pctl=);

%if &dsetout = %then %let dsetout = &dsetin;

%let varL=;

%let varH=;

%let xn=1;

%do %until ( %scan(&vars,&xn)= );

%let token = %scan(&vars,&xn);

%let varL = &varL &token.L;

%let varH = &varH &token.H;

%let xn=%EVAL(&xn + 1);

%end;

%let xn=%eval(&xn-1);

/* Assignin input data set to Xtemp */

data xtemp; set &dsetin; run;

/* if no byvar variable then it will assign to 1 */

%if &byvar = none %then %do;

data xtemp;

set xtemp;

xbyvar = 1;

run;

%let byvar = xbyvar;

%end;

proc sort data = xtemp;

by &byvar;

run;

proc univariate data = xtemp noprint;

by &byvar;

var &vars;

output out = xtemp_pctl PCTLPTS = &pctl PCTLPRE = &vars PCTLNAME = L H;

run;

data &dsetout;

merge xtemp xtemp_pctl;

by &byvar;

array trimvars{&xn} &vars;

array trimvarl{&xn} &varL;

array trimvarh{&xn} &varH;

do xi = 1 to dim(trimvars);

%if &type = winsor %then %do;

if not missing(trimvars{xi}) then do;

if (trimvars{xi} < trimvarl{xi}) then trimvars{xi} = trimvarl{xi};

if (trimvars{xi} > trimvarh{xi}) then trimvars{xi} = trimvarh{xi};

end;

%end;

%else %do;

if not missing(trimvars{xi}) then do;

if (trimvars{xi} < trimvarl{xi}) then delete;

if (trimvars{xi} > trimvarh{xi}) then delete;

end;

%end;

end;

drop &varL &varH xbyvar xi;

run;

%mend winsor;

Information Value

/*Information Value*/

%Macro IV_Numeric(Indep_Var);

proc rank data = test out = Ranks_Var ties = mean groups = 24;

var &Indep_Var.;

ranks X_rank_&Indep_Var.;

run;

%IV_Char(Ranks_Var,X_rank_&Indep_Var.);

%Mend IV_Numeric;

%Macro IV_Char(Data_Set,Cat_Var);

proc freq data =&Data_Set.(where=(y=1)) noprint;

table &Cat_Var.*y/out=freq_out_y1 (rename=(percent=percent1 count=count1)) ;

run;

proc freq data =&Data_Set.(where=(y=0)) noprint;

table &Cat_Var.*y/out=freq_out_y0 (rename=(percent=percent0 count=count0));

run;

data Freq_Out(drop=y &Cat_Var.);

merge freq_out_y0 freq_out_y1;

by &Cat_Var.;

woe=log(percent0/percent1);

iv=(percent0-percent1)*woe;

run;

proc means data= Freq_Out SUM; run;

data Freq_Out;

length Variable_Name $32.;

set Freq_Out;

Variable_Name="&Cat_Var.";

run;

proc append base = Final data = Freq_Out force;run;

%Mend IV_Char;

%Macro All_Independent_Vars();

/* Numeric Variables */

proc sql noprint;

select count(*) into :nobs from data_chars (where=( name <> "y" and type=1));

quit;run;

data _null_;

set data_chars (where=( name <> "y" and type=1));

call symput(compress("Var"

_n_),name);

run;

%do i=1 %to &nobs.;

%IV_Numeric(&&Var&i..);

%end;

/* Char Variables */

proc sql noprint;

select count(*) into :nobs from data_chars (where=( name <> "y" and type=2));

quit;run;

data _null_;

set data_chars (where=( name <> "y" and type=2));

call symput(compress("Var"

_n_),name);

run;

%do i=1 %to &nobs.;

%IV_Char(test,&&Var&i..);

%end;

proc means data=final SUM;

class Variable_Name;

var iv;output out=final_iv(drop=_type_) sum= ;

run;

%Mend All_Independent_Vars;

proc contents data=test out=data_chars(keep=name type) noprint;

run;

%All_Independent_Vars;

Different types of correlation

http://www.fgse.nova.edu/edl/secure/stats/lesson6.htm

Variable X Variable Y Type of Correlation

Interval Interval Pearson product moment (r)

Ordinal Ordinal Spearman rank coefficient (rho or p)

Nominal Ordinal Rank biserial coefficient

Nominal Interval Point biserial

Nominal Nominal Phi coefficient

Srikanth Boyina