Tuesday, May 5, 2015

logistic regression interpretation in sas

http://www.ats.ucla.edu/stat/sas/output/sas_logit_output.htm

http://pages.uoregon.edu/aarong/teaching/G4075_Outline/node16.html

Tuesday, July 1, 2014

Combinations


/*****************************************************************************/

/***************************** Overlap Code **********************************/

/*****************************************************************************/

 

libname sri "E:\Srikanth";

 

options nomprint nomlogic;

options compress=yes;

%let data_set_name=sri.data_set;

%let var=T1 T2 T3 T4 T5 T6 T7 T8;

%let r_value=5;

 

*This macro, combo(r), when executed creates all combinations of the n items taken r at a time.;

%macro combo(r);
       data combo;
            keep v1-v&r.;
            array word (*) $&maxl. w1-w&n. (&things.);   /* the n items                 */
            array rr   (*) r1-r&r.;                      /* indices of the combination  */
            array v    (*) $&maxl. v1-v&r.;              /* the current combination     */

            /* generate &r. nested DO loops so that r1 < r2 < ... < r&r. */
            %do i=1 %to &r.;
                %if &i.=1 %then %do;
               do r&i.=1 to &n.-(&r.-&i.);
                %end;
                %else %do;
               do r&i.=r%eval(&i.-1)+1 to &n.-(&r.-&i.);
                %end;
            %end;

               do k=1 to &r.;
                  v(k)=word(rr(k));
               end;
               output;

            /* close the generated DO loops */
            %do i=1 %to &r.;
               end;
            %end;
       run;
%mend combo;

 

 

 

%macro mrun(sri);
      %let i=1;
      %let thing=;

      /* build a comma-separated, quoted list of the variables in &var. */
      %do %while (%scan(&var.,&i.) ne );
          %let p&i.="%scan(&var.,&i.)";
          %if &i.=1 %then %let thing=&&p&i..;
          %else %let thing=&thing.,&&p&i..;
          %let i=%eval(&i.+1);
      %end;
      %let things=&thing.;
      %let n=%eval(&i.-1);

      /* calculate the maximum length of the variable names */
      %do m=1 %to &n.;
          %let _length&m. = %length(%scan(&var.,&m.));
      %end;

      %do m=1 %to &n.;
          %if &m.=1 %then %let copy=&&_length&m.;
          %else %let copy=&copy.,&&_length&m.;
      %end;

      %let maxl = %sysfunc(max(&copy.));
      /* end of calculation */

      /* execute the macro combo(r) */
      %combo(&sri.);

%mend mrun;

 

%mrun(&r_value.);

 

 

%put &r_value.;

 

 

%macro vars_creation;

      /* number of combinations generated by %combo */
      data _null_;
            set combo;
            call symput("tot_com",compress(_N_));
      run;

      %put &tot_com.;

      /* one macro variable per cell of COMBO: v<j>_<row> holds a variable name */
      data _null_;
            set combo;
            %do i=1 %to &r_value.;
                call symput(compress("v"||&i.||"_"||_n_),v&i.);
            %end;
      run;

      /* flag each record of the input data set for every combination */
      data test_data_final;
            keep P_1-P_&tot_com.;
            set &data_set_name.;
            %do i=1 %to &tot_com.;
                /* e.g. if (T1 and T2 and T3) then P_&i.=1; else P_&i.=0; */
                if (
                    %do j=1 %to %eval(&r_value.-1);
                        &&&&v&j._&i. and
                    %end;
                    &&&&v&r_value._&i.
                   ) then P_&i.=1; else P_&i.=0;
            %end;
      run;

      /* overlap counts for every combination */
      proc means data=test_data_final noprint;
            output out=result(drop=_freq_ _type_) sum=;
      run;

      proc transpose data=result out=trans_result;
      run;

      data final_result;
            merge trans_result combo;
      run;

%mend vars_creation;

 

%vars_creation;

 

Tuesday, March 5, 2013

Variable Selection Methods -



1) Decision Tree for Variable Selection:

2) Factor Analysis Variable Reduction Technique:

3) Principal Component Analysis for Variable Reduction:

4) Random Feature Selection

5) Variable Clustering (Proc Varclus)

6) Regression – Forward Selection

7) Regression – Backward Elimination

8) Regression – Stepwise Selection

9) Information Value Analysis

10) Variable Correlation and Association Check

11) Partial Least Square Regression



1. Decision Tree for Variable Selection:

Decision trees are produced by algorithms that identify various ways of splitting a data set into branch-like segments. Decision trees attempt to find a strong relationship between input values and target values in a group of observations that form a data set. When a set of input values is identified as having a strong relationship to a target value, all of these values are grouped in a bin that becomes a branch on the decision tree. These groupings are determined by the observed form of the relationship between the bin values and the target.



A decision tree can be used as a variable selection method. Inputs that appear in the splitting rules of a tree can be selected for building predictive models such as logistic regression. At every step, the variable with the strongest relationship to the dependent variable is used to split the data into segments, so the variables used in the splits form a natural candidate list.

Decision stumps are decision trees with a single layer: unlike a full tree with multiple layers, a stump stops after the first split. Decision stumps are usually used for population segmentation on large data sets. Occasionally, they are also used to build simple yes/no decision models for small data sets.



Method A : Degree of Impurity

Given a data table that contains attributes and the class of each record, we can measure the homogeneity (or heterogeneity) of the table based on the classes. We say a table is pure or homogeneous if it contains only a single class; if it contains several classes, the table is impure or heterogeneous. There are several indices that measure the degree of impurity quantitatively. The most widely used are entropy, the Gini index, and classification error:

Entropy = − Σ_j p_j log2(p_j)

Gini index = 1 − Σ_j p_j^2

Classification error = 1 − max_j(p_j)

All of the above formulas are functions of p_j, the probability of class j in the table.

The degrees of impurity calculated by the above indices can then be compared before and after a split. The measure of the difference in impurity is called information gain: we would like to know how much we gain if we split the data table on the values of some attribute.

Information gain is computed as the impurity of the parent table minus the weighted sum of the impurities of the subset tables, where each weight is the proportion of records taking that attribute value. Using entropy as the measure of impurity:

Information gain = Entropy(parent table D) − Σ_k (n_k / n) × Entropy(subset S_k)

For each attribute in our data, we compute the information gain, and then pick the attribute that produces the maximum information gain as the split. This iteration is continued down the tree, and the set of attributes used in the splits becomes the selected set of variables.
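As a small illustration of the calculation, the sketch below computes entropy-based information gain for one hypothetical binary split; the class counts (a parent table of 20 records split into two subsets of 10) are made up for the example.

/* Hedged illustration: entropy and information gain for one hypothetical
   binary split. All counts are invented for the example.                 */
data info_gain;
   /* parent table: 12 records of class A, 8 of class B (n = 20) */
   p1 = 12/20;  p2 = 8/20;
   ent_parent = -(p1*log2(p1) + p2*log2(p2));

   /* left subset (10 records): 9 of class A, 1 of class B */
   q1 = 9/10;   q2 = 1/10;
   ent_left = -(q1*log2(q1) + q2*log2(q2));

   /* right subset (10 records): 3 of class A, 7 of class B */
   r1 = 3/10;   r2 = 7/10;
   ent_right = -(r1*log2(r1) + r2*log2(r2));

   /* information gain = parent entropy - weighted child entropy */
   gain = ent_parent - (10/20*ent_left + 10/20*ent_right);
   put ent_parent= ent_left= ent_right= gain=;
run;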

Method B : Proc SPLIT



This approach uses PROC SPLIT, a procedure that is only available in the Enterprise Miner module. The primary function of the SPLIT procedure is to develop decision trees that classify records based on the values of a target variable in relation to a set of predictive variables. The target variable may be nominal, binary or interval.



In the process of building a tree PROC SPLIT computes variable importance scores for each predictive variable. The approach we are proposing leverages variable importance scores to identify variables with significant impact on the target variable. A higher score means higher relative importance. One can rank order the variables using the scores from PROC SPLIT and select the top N number of variables for further processing.

A more rigorous way of leveraging PROC SPLIT is to run a series of iterations. At each iteration we set aside the variables with positive importance scores, narrowing the candidate pool down to the variables that receive a zero importance score in every iteration. We then exclude those variables from the model and keep the variables that were set aside.

Example :

Consider a case with



i) Eight interval predictors: A, B, C, D, E, F, G and H



ii) Seven ordinal predictors: I, J, K, L, M, N and O



iii) A binary target variable T



SAS Code

PROC DMDB BATCH
     DATA    = INPUT_DATA
     OUT     = DMTRAIN
     DMDBCAT = CATTRAIN;
     VAR A B C D E F G H;
     CLASS I J K L M N O T;
     TARGET T;
RUN;

PROC SPLIT
     DATA          = DMTRAIN
     DMDBCAT       = CATTRAIN
     OUTIMPORTANCE = OUTPUT_DATA
     CRITERION     = GINI
     ASSESS        = LIFT
     LIFTDEPTH     = 0.1
     MAXBRANCH     = 4
     MAXDEPTH      = 10
     SUBTREE       = ASSESSMENT
     LEAFSIZE      = 500;
     INPUT A B C D E F G H / LEVEL = INTERVAL;
     INPUT I J K L M N O / LEVEL = ORDINAL;
     TARGET T / LEVEL = BINARY ORDER = DESCENDING;
RUN;



Advantage:

1. Trees are appealing because they accept several types of variables: nominal, ordinal, and interval.

2. Decision trees also have the added advantage of handling missing values. While regression models cannot use observations with missing values, decision trees can treat "missing" as an extra value and model around it.

3. Ease of interpretation

Disadvantage :

1. When the data contain no simple relationship between the inputs and the target, a simple tree is too simplistic. Even when a simple description is accurate, the description may not be the only accurate one.





2. Factor Analysis Variable Reduction Technique:



Exploratory Factor Analysis (EFA) is a technique that allows us to reduce a large number of correlated variables to a smaller number of ‘super variables’. It does this by attempting to account for the pattern of correlations between the variables in terms of a much smaller number of latent variables or factors. A latent variable is one that cannot be measured directly but is assumed to be related to a number of measurable, observable, manifest variables. These factors can be either orthogonal (independent and uncorrelated) or oblique (correlated, sharing some variance between them). EFA is used when we want to understand the relationships among a set of variables and to summarize them, rather than to test whether one variable has a significant effect on another.



Basic Issues:



Factor Analysis is not just a single technique, but part of a whole family of techniques which includes Principal Components Analysis (PCA), Exploratory Factor Analysis (EFA) and Confirmatory Factor Analysis (CFA), a more advanced technique used for testing hypotheses.



Difference between PCA and EFA:



There are some important conceptual differences between principal component analysis and factor analysis that should be understood at the outset.

• Perhaps the most important deals with the assumption of an underlying causal structure: factor analysis assumes that the covariation in the observed variables is due to the presence of one or more latent variables (factors) that exert causal influence on these observed variables. In contrast, principal component analysis makes no assumption about an underlying causal model. Principal component analysis is simply a variable reduction procedure that (typically) results in a relatively small number of components that account for most of the variance in a set of observed variables.

• PCA results in principal components that account for a maximal amount of variance in the observed variables; FA accounts for the common variance in the data (measured by the communalities of each variable).

• The component scores in PCA are linear combinations of the observed variables weighted by eigenvectors; in FA, the observed variables are linear combinations of the underlying common and unique factors.

• In PCA, the components yielded are not interpretable, i.e. they do not represent underlying ‘constructs’; in FA, the underlying constructs can be labeled and readily interpreted, given an accurate model specification.

• Factor analysis should be used when theoretical ideas about relationships between variables exist, whereas PCA should be used if the goal is to explore patterns in their data.

When should EFA be used?



We should only use EFA when there are good theoretical reasons to suspect that some set of variables is represented by a smaller set of latent variables. There should be a good number of substantial correlations in the correlation matrix; otherwise EFA may not succeed in finding an acceptable solution. An acceptable solution appears when each variable has a high loading (>0.3) on a single factor and low (or zero) loadings on all the others. (A factor loading is the degree to which a variable correlates with a factor. If a factor loading is high, say above 0.3, or very high, say above 0.6, then the variable helps to describe that factor quite well; loadings below 0.3 may be ignored.) This ideal is known as simple structure, and there are several methods of rotating the initial solution so that a simple structure is obtained.



Steps involved in the factor analysis



1. Prepare the correlation matrix of the variables and check whether it is factorable (i.e. whether EFA can be run).

2. If the matrix is factorable, determine the number of factors (by examining the eigenvalue associated with each factor and the scree plot) and extract them from the correlation matrix.

3. If necessary, rotate the factors so that a variable with a high loading on one factor has low loadings on the other extracted factors. Ideally, a variable that loads highly on one factor should have near-zero loadings on the others; rotation makes the pattern of each factor clearer.

4. Interpret the result and compute a factor score for each observation on each factor.
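A minimal PROC FACTOR sketch of these steps is given below; the data set work.survey and the variables q1-q20 are placeholder names, and the specific options (principal-factor extraction with squared multiple correlations as priors, eigenvalue-greater-than-1 retention, varimax rotation) are one reasonable set of choices rather than the only ones.

proc factor data=work.survey
            method=principal     /* principal factor extraction                             */
            priors=smc           /* squared multiple correlations as prior communalities    */
            mineigen=1           /* retain factors with eigenvalue greater than 1           */
            scree                /* scree plot to help choose the number of factors         */
            rotate=varimax       /* orthogonal rotation toward simple structure             */
            score                /* print the factor scoring coefficients                   */
            out=factor_scores;   /* input data plus estimated factor scores                 */
   var q1-q20;
run;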



Drawbacks:

• Factor analysis can only be as good as the data allows; EFA is applicable only if the variables are correlated with each other. If the variables are uncorrelated, using EFA is pointless because the number of factors will simply equal the number of variables.

• Interpreting factor analysis is based on a "heuristic", a solution that is "convenient even if not absolutely true". More than one interpretation can be made of the same data factored the same way.



3. Principal Component Analysis for Variable Reduction:



Principal component analysis (PCA) is a variable reduction procedure. It is appropriate when the number of observed variables is large and one wishes to develop a smaller number of artificial variables (called principal components) that will account for most of the variance in the observed variables. The principal components may then be used as predictor or criterion variables in subsequent analyses.



PCA is useful when some of the variables are correlated with one another, possibly because they are measuring the same characteristic. Because of this redundancy, it should be possible to reduce the observed variables to a smaller number of principal components (artificial variables) that account for most of the variance in the observed variables. Technically, a principal component is a linear combination of optimally weighted observed variables. PCA is therefore a dimension reduction method that creates variables called principal components.



For a given set of p numeric variables in a dataset, one can compute p principal components. Each principal component is a linear combination of original variables with coefficients equal to the eigenvectors of the correlation or covariance matrix. The eigenvectors are customarily taken with unit length. The principal components are sorted by the descending order of the eigenvalues, which are equal to the variance of the components.
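A minimal PROC PRINCOMP sketch is shown below; work.inputs and x1-x15 are placeholder names, and the N= limit is just an illustration.

proc princomp data=work.inputs
              out=pc_scores     /* input data plus the component scores        */
              outstat=pc_stats  /* eigenvalues and eigenvectors                */
              n=5               /* keep at most five components (optional)     */
              std;              /* standardize the scores to unit variance     */
   var x1-x15;
run;

By default the analysis is run on the correlation matrix, which sidesteps the scaling issue noted under the drawbacks below; the COV option switches it to the covariance matrix.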



Characteristics of principal components: The first component extracted in a principal component analysis accounts for a maximal amount of the total variance in the observed variables. Under typical conditions, this means the first component will be correlated with at least some (often many) of the observed variables.

The second component has two important characteristics. First, it accounts for a maximal amount of the variance that was not accounted for by the first component; under typical conditions, this means it is correlated with observed variables that did not correlate strongly with component 1. Second, it is uncorrelated with the first component: if you compute the correlation between components 1 and 2, that correlation is zero. The remaining components display the same two characteristics: each accounts for a maximal amount of the variance not accounted for by the preceding components, and each is uncorrelated with all of the preceding components. The analysis proceeds in this fashion, with each new component accounting for progressively smaller amounts of variance. Finally, the first few components are retained (typically those with eigenvalues greater than 1), which together usually explain about 70% to 80% of the total variance. When the analysis is complete, the retained components display varying degrees of correlation with the observed variables but are completely uncorrelated with one another.



Drawbacks:

• The primary disadvantage of principal components is that the interpretation can be more difficult.

• The principal components are heavily affected by the scaling of the variables.

• If the first few principal components do not account for most of the variation, there is little advantage to using them.





4. Random Feature Selection

When feature selection is performed on a very large set of variables, there is always a risk of selection bias; in other words, the modeler's subjectivity may result in a biased subset of features. To avoid selection bias and to improve model stability, we propose a random feature selection process based on bootstrapped samples of variables and data records.



The process is implemented as outlined below:



• Specify a modeling data set.

• Draw 100 random samples of a fixed proportion x% (e.g. 70%) of the modeling data set, where records are drawn with replacement.

• Specify a list of n variables; this is the initial set of variables from which we try to select features.

• Draw 100 random samples of a fixed proportion y% (e.g. 80%) from the specified list of n variables, with replacement.

• Randomly create pairs of data samples and variable samples.

• Develop 100 statistical models, one for each pair (a model on a given sample data set and a given sample variable list). Keep only statistically significant variables, as in a typical modeling exercise.

• For each model, some variables will remain in the model equation and others will not.

• For each variable, compute the percentage occurrence across the 100 models.

• The result is a list of variables, each with an associated percentage of occurrence.

• Variables with a higher percentage of occurrence are considered more important and more relevant. Select variables with a percentage occurrence greater than a predefined cut-off point (e.g. 80%).



The choice of percentage cut off points and the number of bootstrap samples are subjective. The numbers presented above are illustrative and may be different for each particular scenario.
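A simplified SAS sketch of this loop is given below. It is only an outline under assumed names: work.modeldata, the 0/1 target bad, and candidate variables x1-x50 are placeholders, only 10 iterations are run instead of 100, and the variable subsets are drawn without replacement to keep the code short.

%macro random_select(reps=10, obs_rate=0.7, var_rate=0.8);

   /* candidate variable list, one name per row */
   proc contents data=work.modeldata(keep=x1-x50) out=varlist(keep=name) noprint;
   run;

   ods select none;   /* suppress printed output; ODS OUTPUT data sets are still created */

   %do i=1 %to &reps.;

      /* bootstrap sample of observations (drawn with replacement) */
      proc surveyselect data=work.modeldata out=boot&i.
                        method=urs samprate=&obs_rate. outhits noprint;
      run;

      /* random subset of the candidate variables */
      proc surveyselect data=varlist out=vars&i.
                        method=srs samprate=&var_rate. noprint;
      run;

      proc sql noprint;
         select name into :picked separated by ' ' from vars&i.;
      quit;

      /* keep only significant variables, via stepwise selection on this pair */
      ods output ParameterEstimates=pe&i.;
      proc logistic data=boot&i. descending;
         model bad = &picked. / selection=stepwise slentry=0.15 slstay=0.15;
      run;

   %end;

   ods select all;

   /* count how many of the &reps. models kept each variable;
      divide the counts by &reps. for the percentage occurrence */
   data all_pe;
      set pe1-pe&reps.;
      where upcase(variable) ne 'INTERCEPT';
   run;

   proc freq data=all_pe order=freq;
      tables variable / nocum;
   run;

%mend random_select;

%random_select(reps=10);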



5. Variable Clustering (Proc VARCLUS)

In high-dimensional data sets, identifying irrelevant inputs is more difficult than identifying redundant inputs. A good strategy is to first reduce redundancy and then tackle irrelevancy in a lower-dimensional space. PROC VARCLUS is closely related to principal component analysis and can be used as an alternative method for eliminating redundant dimensions. This type of variable clustering finds groups of variables that are as correlated as possible among themselves and as uncorrelated as possible with variables in other clusters. If the second eigenvalue for a cluster is greater than a specified threshold, the cluster is split into two different dimensions.

The analyst can then begin selecting variables from each cluster; if a cluster contains variables that do not make any sense in the final model, the cluster can be ignored. A variable selected from a cluster should have a high correlation with its own cluster and a low correlation with the other clusters. The 1-R**2 ratio can be used to select these types of variables.

The formula for this ratio is:

1-R**2 ratio = (1 - R**2 own cluster) / (1 - R**2 next closest cluster)

Small values of this ratio indicate that the variable has a strong correlation with its own cluster and a weak correlation with the other clusters. If a cluster has several variables, two or more variables can be selected from it.



By default, PROC VARCLUS begins with all variables in a single cluster. It then repeats the following steps:

1. A cluster is chosen for splitting.

2. The chosen cluster is split into two clusters by finding the first two principal components, performing an orthoblique rotation, and assigning each variable to the rotated component with which it has the higher squared correlation.

3. Variables are iteratively reassigned to clusters to maximize the variance accounted for by the cluster components.
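A minimal PROC VARCLUS sketch is shown below; work.inputs and x1-x50 are placeholder names, and the MAXEIGEN threshold is just an illustrative choice.

proc varclus data=work.inputs
             maxeigen=0.7    /* keep splitting while a cluster's second eigenvalue exceeds 0.7 */
             outtree=tree;   /* cluster history, usable for a tree diagram                     */
   var x1-x50;
run;

The cluster summary printed by the procedure lists each variable's R-squared with its own cluster and with the next closest cluster, along with the 1-R**2 ratio, so picking the variable with the smallest ratio in each cluster follows the rule described above.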

Drawbacks:

Clustering routines do not offer any insight into the relationship between the individual variables and the dependent variable.



6. Regression – Forward Selection

Forward selection starts with no variables in the model, tests the addition of each variable using a chosen model comparison criterion, adds the variable (if any) that improves the model the most, and repeats this process until no addition improves the model.

(In a forward selection analysis we start out with no predictors in the model. Each of the available predictors is evaluated with respect to how much R2 would be increased by adding it to the model. The one which will most increase R2 will be added if it meets the statistical criterion for entry. With SAS the statistical criterion is the significance level for the increase in the R2 produced by addition of the predictor. If no predictor meets that criterion, the analysis stops. If a predictor is added, then the second step involves re-evaluating all of the available predictors which have not yet been entered into the model. If any satisfy the criterion for entry, the one which most increases R2 is added. This procedure is repeated until there remain no more predictors that are eligible for entry.

With SAS, the SLENTRY= value specifies the significance level for entry into the model; the defaults are 0.50 for forward selection and 0.15 for fully stepwise selection.)
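A minimal sketch with PROC REG is shown below; work.modeldata, y, and x1-x20 are placeholder names, and SLENTRY=0.50 simply restates the documented default.

proc reg data=work.modeldata;
   model y = x1-x20 / selection=forward slentry=0.50;   /* add predictors while the entry test passes */
run;
quit;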



7. Regression – Backward Elimination



Backward elimination starts with all candidate variables, testing the deletion of each variable using a chosen model comparison criterion, deleting the variable (if any) that improves the model the most by being deleted, and repeating this process until no further improvement is possible.



Backward elimination stops when all the variables remaining in the model produce F statistics with p-values less than the cutoff. The default cutoff, which we use below, is 0.10.
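A corresponding PROC REG sketch for backward elimination, with the same placeholder names and the default SLSTAY=0.10 cutoff:

proc reg data=work.modeldata;
   model y = x1-x20 / selection=backward slstay=0.10;   /* drop predictors whose p-values exceed 0.10 */
run;
quit;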



What method should you use: forward or backward? If you have a very large set of potential independent variables from which you wish to extract a few (i.e., you are on a fishing expedition), you should generally go forward. If, on the other hand, you have a modest-sized set of potential variables from which you wish to eliminate a few (i.e., you are fine-tuning some prior selection of variables), you should generally go backward.





8. Regression – Stepwise Selection

Stepwise regression is a semi-automated process of building a model by successively adding or removing variables based solely on the t-statistics of their estimated coefficients.

In stepwise selection, variables are added as in forward selection, but after a variable is added, all the variables in the model are candidates for removal. There are two cutoffs to be specified, SLENTRY and SLSTAY.

( Enter and remove predictors, in a stepwise manner, until there is no justifiable reason to enter or remove more.)



• Start with no predictors in the “stepwise model.”

• At each step, enter or remove a predictor based on partial F-tests (that is, the t-tests).

• Stop when no more predictors can be justifiably entered or removed from the stepwise model.

• Specify an Alpha-to-Enter (αE = 0.15) significance level.

• Specify an Alpha-to-Remove (αR = 0.15) significance level.



Step #1

• Fit each of the one-predictor models, that is, regress y on x1, regress y on x2, ..., regress y on x(p-1).

• The first predictor put in the stepwise model is the predictor that has the smallest t-test P-value (below αE = 0.15).

• If no P-value < 0.15, stop.

Step #2

• Suppose x1 was the “best” one predictor.

• Fit each of the two-predictor models with x1 in the model, that is, regress y on (x1, x2), regress y on (x1, x3), ..., and regress y on (x1, x(p-1)).

• The second predictor put in stepwise model is the predictor that has the smallest t-test P-value (below αE = 0.15).

• If no P-value < 0.15, stop.

o Suppose x2 was the “best” second predictor.

o Step back and check P-value for β1 = 0. If the P-value for β1 = 0 has become not significant (above αR = 0.15), remove x1 from the stepwise model.

Step #3

• Suppose both x1 and x2 made it into the two-predictor stepwise model.

• Fit each of the three-predictor models with x1 and x2 in the model, that is, regress y on (x1, x2, x3), regress y on (x1, x2, x4), ..., and regress y on (x1, x2, x(p-1)).

o The third predictor put in stepwise model is the predictor that has the smallest t-test P-value (below αE = 0.15).

o If no P-value < 0.15, stop.

o Step back and check P-values for β1 = 0 and β2 = 0. If either P-value has become not significant (above αR = 0.15), remove the predictor from the stepwise model.



Stopping the procedure

The procedure is stopped when adding an additional predictor does not yield a t-test P-value below αE = 0.15.
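A minimal sketch of this procedure with PROC REG, using the entry and stay levels described above; placeholder names as before. (PROC LOGISTIC accepts the same SELECTION=, SLENTRY=, and SLSTAY= options on its MODEL statement.)

proc reg data=work.modeldata;
   model y = x1-x20 / selection=stepwise slentry=0.15 slstay=0.15;   /* enter at 0.15, remove at 0.15 */
run;
quit;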



Drawbacks of stepwise regression:



• A major drawback of stepwise/forward/backward procedures is that they can leave variables in your final model that are too highly correlated with one another.

• The procedure yields a single final model, although in practice there are often several equally good models.

• It doesn’t take into account a researcher’s (business) knowledge about the predictors.

• The final model is not guaranteed to be optimal in any specified sense.



9. Information Value Analysis

Information values are commonly used in data mining and marketing analytics. Information values provide a way of quantifying the amount of information about the outcome that one gains from a predictor. Larger information values indicate that a predictor is more informative about the outcome. One rule of thumb is that information values less than 0.02 indicate that a variable is not predictive; 0.02 to 0.1 indicate weak predictive power; 0.1 to 0.3 indicate medium predictive power; and 0.3+ indicates strong predictive power.

The Information Value of x for measuring y is a number that attempts to quantify the predictive power of x in capturing y. Assume the target variable y that we are interested in measuring is a 0-1 variable (an indicator), say an indicator of accounts that will go bad in the immediate future. Now sort the entire pool by x, divide the population into 10 equal parts, and create the deciles. We are then ready to define the Information Value:



IV = Σ_i (bad_i − good_i) × ln(bad_i / good_i)



Here, i runs from 1 to 10, the deciles into which we have divided the data; bad_i is the proportion of all bad accounts in the population that are captured in the i-th decile; and good_i is likewise the proportion of good (i.e. not bad) accounts captured in the i-th decile.



Note that the variable whose effectiveness you want to measure is indeed being used: it is the variable by which the entire data set is sorted and divided into deciles.
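A minimal sketch of this decile calculation in SAS is given below; work.scoredata, the 0/1 target bad, and the predictor x are placeholder names.

/* 1. sort the population by x into 10 equal groups (deciles) */
proc rank data=work.scoredata groups=10 out=ranked;
   var x;
   ranks decile;
run;

/* 2. per decile: share of all bads and of all goods captured, and the IV contribution */
proc sql;
   create table iv_parts as
   select decile,
          sum(bad)     / (select sum(bad)     from ranked) as bad_i,
          sum(1 - bad) / (select sum(1 - bad) from ranked) as good_i,
          (calculated bad_i - calculated good_i)
             * log(calculated bad_i / calculated good_i)   as iv_part
   from ranked
   group by decile;

   /* 3. the information value is the sum of the decile contributions */
   select sum(iv_part) as information_value from iv_parts;
quit;

Note that LOG in SAS is the natural logarithm, matching the ln in the formula, and that a decile with no goods or no bads produces a missing contribution, which is exactly the drawback discussed further below.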



How does it work?

We can check that if x carries no information about y at all, the IV turns out to be zero. When you sort by x and create deciles, the deciles are as good as random with respect to y, so each decile should capture 10% of the total bads and 10% of the total goods. Hence bad_i − good_i = 0 and ln(bad_i / good_i) = 0, so the IV is zero.



On the other hand, if after sorting by x some decile has a higher or lower concentration of bads than goods, that decile is different from the overall population, and x lets us identify it. That decile will contribute a positive value to the summation that defines IV above. So for a good x variable there will be more such deciles where the proportions of goods and bads differ, and by a larger margin the more effective x is in capturing y; hence IV indeed gives a measure of the predictive power of x.



Drawbacks:

However, there is something artificial in the definition of IV above: its functional form. There are many different ways to construct the term being summed. To give some examples, Σ_i (bad_i − good_i)^2 or Σ_i |bad_i − good_i| × (n_i / n) should be equally good candidates.



The last one in particular is interesting because it has the decile proportion n_i / n in the equation, making it a consistent measure: if you divide the data into 20 parts, 30 parts and so on, the value approaches a limit, and incidentally the limit to which it converges is essentially Gini/2 under some assumptions. For IV, on the other hand, this leads to a problem: you cannot divide the data indefinitely, because you may hit segments that have no good (or no bad) accounts at all, in which case the ln term is undefined.



Also, for the same reason, it is inaccurate and unfair to compare two variables when one of them has ties, which make it impossible to unambiguously divide the population into 10 equal parts.









10. Variable Correlation and Association Check



Perhaps one of the most straightforward ways of reducing the number of predictors in a regression problem is to set up some rules-based screening criteria using bivariate and pairwise correlations. In regression analysis, you need to be careful not to include two or more predictor variables that are highly correlated with one another if you are interested in determining the true contribution of each predictor to your dependent variable. If predictors have too high a correlation with one another, slight changes in the data may result in significant changes in the coefficients, even resulting in incorrect signs. In the statistical literature, this is known as multicollinearity. A rule of thumb states that multicollinearity is likely to be a problem if the simple correlation between two variables is larger than the correlation of either or both variables with the dependent variable. To use this method for variable reduction, calculate the correlation of each predictor variable against the others (pairwise correlations) as well as the correlation of each with the dependent variable (bivariate correlations).
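A minimal PROC CORR sketch that produces both sets of correlations at once; placeholder names as before.

proc corr data=work.modeldata outp=corr_matrix nosimple;
   var y x1-x20;   /* the y row/column gives the bivariate correlations, the rest the pairwise ones */
run;

The OUTP= data set can then be scanned to flag pairs of predictors whose mutual correlation exceeds their correlations with y, per the rule of thumb above.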



Drawbacks:

Although this procedure may be easy to implement, it only offers a very simplistic view of variable selection. It does not incorporate tests for statistical significance and looks at only pairs of variables one at a time. At some point, a more multivariate approach needs to be considered.





11. Partial Least Square Regression



This procedure, partial least squares (PLS), essentially picks up where principal components analysis leaves off by simultaneously accounting for variations in the dependent variable while trying to extract those factors representing the maximum unique correlation in the predictive attributes. This is often referred to as a supervised procedure for reducing the dimensionality of the data because of the necessary linkage to the dependent variable.



Unlike classical PCA, PLS provides a linkage to the dependent variable, making it potentially useful in developing risk or marketing models. If it is acceptable in the forecasting application to use the transformed factors rather than the original variables, a decision is needed to determine how many factors to keep.



If the decision is made not to use the PLS factors directly in a regression but to use the predictor variables in their original form, then the VIP (Variable Importance for Projection) criterion may offer a promising way to make variable selection recommendations. Predictors with small PLS regression coefficients (in absolute value) make a small contribution to the response prediction. Whereas these coefficients represent the importance each predictor has in the prediction of just the dependent variable, the VIP represents the value of each predictor in fitting the PLS model for both predictors and response. If a predictor has a relatively small coefficient (in absolute value) and a small VIP, then it is a prime candidate for deletion. Wold (1994) considered a VIP value of less than 0.8 to be “small”. The analyst can therefore select a subset of the original variables for the regression based on the VIP criterion.



Syntax:



PROC PLS < options > ;
   BY variables ;
   CLASS variables ;
   MODEL dependent-variables = effects < / options > ;
   OUTPUT OUT= SAS-data-set < options > ;



Sample Code:

proc pls data=sample;
   model ls ha dt = v1-v27;
run;

A PLS model is not complete until you choose the number of factors. You can choose the number of factors by using cross validation, in which the data set is divided into two or more groups. You fit the model to all groups except one, then you check the capability of the model to predict responses for the group omitted. Repeating this for each group, you then can measure the overall capability of a given form of the model. The Predicted Residual Sum of Squares (PRESS) statistic is based on the residuals generated by this process.
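As a hedged illustration of this step, the sketch below adds leave-one-out cross validation and van der Voet's model comparison test to the sample code above; CV= and CVTEST are documented PROC PLS options, while the data set and variable names are the same placeholders as before.

proc pls data=sample cv=one cvtest;   /* leave-one-out CV; CVTEST compares models of different sizes */
   model ls ha dt = v1-v27;
run;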

Steps in PD Credit Scorecard Model Development



Step 1: Understanding the business problem

Step 2: Defining the dependent variable and understanding the relevant independent variables

Step 3: Pulling the data (dependent and independent variables) from databases

Step 4: Data cleaning and segmentation

Step 5: Sampling and model methodology selection

Step 6: Data preparation

                 A) Determining exclusion criterion for observation periods

                 B) Determining exclusion criterion for performance periods

                 C) Outlier treatment

                 D) Missing value analysis

                 E) Univariate data analysis

                 F) Bi-variate data analysis

                G) Binning and Transformation of variables

Step 7: Model building

               A) Variable selection or reduction

               B) Multicollinearity check

               C) Parameter estimation

               D) Score generation

Step 8: Model validation

              A) Assessing model discrimination/separation power

              B) Assessing model calibration power or goodness-of-fit

Step 9: Model implementation

Step 10: Periodical model monitoring and model recalibration (if required)



Friday, February 1, 2013

PROC LOGISTIC - param=ref

Example -1


 

PROC LOGISTIC DATA=sales descending;

CLASS gender (param=ref ref='M');

MODEL purchase = gender;

RUN;



descending: The DESCENDING option will model the probability that a customer places an order of $100 or more (response 1). Otherwise, by default, the response 0 would be modeled.



param=ref ref='M': The PARAM= option specifies the parameterization of the model, which in this example is reference cell coding, i.e. the females will be compared to the males (the reference group, because of ref='M').

If a categorical independent variable has more than two categories, the PARAM= option determines how its levels are coded and therefore how the parameter estimates are interpreted.





Example -2



http://www.ats.ucla.edu/stat/sas/faq/proc_logistic_coding.htm



proc logistic data = mydir.hsb2m descending;

class ses (ref='3') / param = ref ;

model hiread = write ses ;

run ;



Looking at the output (below), the coding system shown in the "Class Level Information" section is for two dummy variables, one for category 1 versus 3 and one for category 2 versus 3. Note two other things in the output below. First, the coefficients in this model are consistent with the odds ratios: exp(-0.9204) = 0.398 and exp(-0.3839) = 0.681. Second, the odds ratios from this model are the same as the odds ratios above. This is expected: SAS always uses dummy coding to compute odds ratios, so all that has changed is how the categorical variable ses is parameterized in the parameter estimates.



Class Level Information

                        Design
Class     Value      Variables

SES       1           1      0
          2           0      1
          3           0      0



              Analysis of Maximum Likelihood Estimates

                                 Standard          Wald

Parameter      DF    Estimate       Error    Chi-Square    Pr > ChiSq

Intercept       1     -7.6872      1.3697       31.4984        <.0001

WRITE           1      0.1438      0.0236       37.0981        <.0001

SES       1     1     -0.9204      0.4897        3.5328        0.0602

SES       2     1     -0.3839      0.3975        0.9330        0.3341

              Odds Ratio Estimates

                   Point          95% Wald

Effect          Estimate      Confidence Limits

WRITE         1.155       1.102       1.209

SES   1 vs 3       0.398       0.153       1.040

SES   2 vs 3       0.681       0.313       1.485

In PROC LOGISTIC, why aren't the coefficients consistent with the odds ratios?




http://www.ats.ucla.edu/stat/sas/faq/proc_logistic_coding.htm

In PROC LOGISTIC why aren't the coefficients consistent with the odds ratios?

We will start with a logistic regression model predicting the binary outcome variable hiread with the variables write and ses. The variable write is continuous, and the variable ses is categorical with three categories (1 = low, 2 = middle, 3 = high). In the code below, the class statement is used to specify that ses is a categorical variable and should be treated as such.



proc logistic data = mydir.hsb2m descending;

class ses;

model hiread = write ses ;

run ;

The "Class Level Information" section of the SAS output shows the coding used by SAS in estimating the model. This coding scheme is what is known as effect coding. (For more information see our FAQ page What is effect coding?)



Class Level Information

                        Design
Class     Value      Variables

SES       1           1      0
          2           0      1
          3          -1     -1





Analysis of Maximum Likelihood Estimates

                                 Standard          Wald
Parameter       DF    Estimate      Error    Chi-Square    Pr > ChiSq

Intercept        1     -8.1220     1.3216       37.7697        <.0001
WRITE            1      0.1438     0.0236       37.0981        <.0001
SES       1      1     -0.4856     0.2823        2.9594        0.0854
SES       2      1      0.0508     0.2290        0.0493        0.8243



Further down in the output, we find the table containing the rest of the estimates of the coefficients. For the variable ses there are two coefficients, one for each of the effect-coded variables in the model (SES 1 and SES 2). The coefficients are -0.4856 and 0.0508. If we exponentiate these coefficients we get exp(-0.4856) = 0.61533 and exp(0.0508) = 1.0521 for SES 1 and SES 2 respectively, but the odds ratios listed in the table with the heading "Odds Ratio Estimates" are 0.398 and 0.681. Why aren't the odds ratios consistent with the coefficients? The answer is that SAS uses effect coding for the coefficients but dummy coding when calculating the odds ratios. Because they are not making the same comparisons, it is possible for the coefficients in the table of estimates to be non-significant while the confidence interval around the odds ratio does not include one (or vice versa). (For more information see our FAQ What is dummy coding?)



Odds Ratio Estimates

                     Point          95% Wald
Effect            Estimate      Confidence Limits

WRITE                1.155       1.102       1.209
SES 1 vs 3           0.398       0.153       1.040
SES 2 vs 3           0.681       0.313       1.485



If we run the same analysis, but use dummy variable coding for both the parameter estimates and the odds ratios, we can get coefficients that will be consistent with the odds ratios. There are several methods that can be used to estimate a model using dummy coding for nominal level variables. In the first example below we add (ref='3') / param = ref to the class statement. This instructs SAS that for the variable ses the desired reference category is 3 (we could also use category 1 or 2 as the reference group), and then tells SAS that we want to use reference coding scheme in parameter estimates.



proc logistic data = mydir.hsb2m descending;

class ses (ref='3') / param = ref ;

model hiread = write ses ;

run ;

Looking at the output (below), the coding system shown in the "Class Level Information" section is for two dummy variables, one for category 1 versus 3 and one for category 2 versus 3. Note two other things in the output below. First, the coefficients in this model are consistent with the odds ratios: exp(-0.9204) = 0.398 and exp(-0.3839) = 0.681. Second, the odds ratios from this model are the same as the odds ratios above. This is expected: SAS always uses dummy coding to compute odds ratios, so all that has changed is how the categorical variable ses is parameterized in the parameter estimates.



Class Level Information

                        Design
Class     Value      Variables

SES       1           1      0
          2           0      1
          3           0      0



Analysis of Maximum Likelihood Estimates

                                 Standard          Wald
Parameter       DF    Estimate      Error    Chi-Square    Pr > ChiSq

Intercept        1     -7.6872     1.3697       31.4984        <.0001
WRITE            1      0.1438     0.0236       37.0981        <.0001
SES       1      1     -0.9204     0.4897        3.5328        0.0602
SES       2      1     -0.3839     0.3975        0.9330        0.3341



Odds Ratio Estimates

                     Point          95% Wald
Effect            Estimate      Confidence Limits

WRITE                1.155       1.102       1.209
SES 1 vs 3           0.398       0.153       1.040
SES 2 vs 3           0.681       0.313       1.485



Another way to use dummy coding is to create the dummy variables manually, and use them in the model statement, bypassing the class statement entirely. The code below does this. First we create two dummy variables, ses_d1 and ses_d2, which code for category 1 versus 3, and category 2 versus 3 respectively. Then we include ses_d1 and ses_d2 in the model statement. There is no need for the class statement here. The output generated by this code will not include the "Class Level Information" since the class statement was not used, however, the output will be otherwise identical to the last model.



data mydir.hsb2m;

set 'D:\data\hsb2';

if ses = 1 then ses_d1 = 1;

if ses = 2 then ses_d1 = 0;

if ses = 3 then ses_d1 = 0;



if ses = 1 then ses_d2 = 0;

if ses = 2 then ses_d2 = 1;

if ses = 3 then ses_d2 = 0;

run;



proc logistic data = mydir.hsb2m descending;

model hiread = write ses_d1 ses_d2 ;

run ;

As a final exercise, we can run the model using effect coding and check that the coefficients from this model match the coefficients from the first model. This will confirm that SAS is in fact using effect coding in the first model. The first step is to create the variables for the effect coding; below we have called them ses_e1 and ses_e2, for the differences between category 1 and the grand mean (when all other covariates equal zero) and between category 2 and the grand mean, respectively. Then we run the model with ses_e1 and ses_e2 in the model statement; the class statement is omitted entirely (since we have done the work normally done by the class statement).



data mydir.hsb2m;

set 'D:\data\hsb2';

if ses = 1 then ses_e1 = 1;

if ses = 2 then ses_e1 = 0;

if ses = 3 then ses_e1 = -1;



if ses = 1 then ses_e2 = 0;

if ses = 2 then ses_e2 = 1;

if ses = 3 then ses_e2 = -1;

run;



proc logistic data = mydir.hsb2m descending;

model hiread = write ses_e1 ses_e2;

run ;

Comparing the table of coefficients below to the coefficients in the very first table of estimates, we see that the coefficients are in fact the same. This confirms that the model in the first table was estimated using effect coding, by default. Note that the odds ratios below do not match the odds ratios in the first model, because when we use the class statement, SAS uses dummy coding to generate the odds ratios, while in this case, the odds ratios are computed directly from the estimated coefficients.



Analysis of Maximum Likelihood Estimates

                                 Standard          Wald
Parameter       DF    Estimate      Error    Chi-Square    Pr > ChiSq

Intercept        1     -8.1220     1.3216       37.7697        <.0001
WRITE            1      0.1438     0.0236       37.0981        <.0001
ses_e1           1     -0.4856     0.2823        2.9594        0.0854
ses_e2           1      0.0508     0.2290        0.0493        0.8243


Odds Ratio Estimates

                     Point          95% Wald
Effect            Estimate      Confidence Limits

WRITE                1.155       1.102       1.209
ses_e1               0.615       0.354       1.070
ses_e2               1.052       0.672       1.648