Improving the art, craft and science of economic credit risk scorecards using random forests: why credit scorers and economists should use random forests.
Sharma, Dhruv
INTRODUCTION
The aim of this paper is to outline an approach to improving credit risk scorecards using random forests. We start with the benefits of random forests compared to logistic regression, the tool used most often for credit scoring systems. We then compare the out-of-the-box performance of random forests and logistic regression on a credit card dataset, a home equity loan dataset, and a proprietary dataset. We outline an approach to improving logistic regression using the random forest. We conclude by demonstrating how powerful random forests can be, using one to develop a model with only 8 variables which is almost as good as the FICO[R] score. This highlights the fact that data sets with complex interaction terms and contents can benefit from random forest models in 2 ways: 1) clear insight into the most predictive and valuable variables, and 2) robust models which maximize predictive interactions and relationships in the data that are not detectable by traditional regression techniques.
For the purpose of this study, model performance is compared using Receiver Operating Characteristic (ROC) curves, which plot the proportion of bad loans detected vs. incorrectly classified good loans at each model cut-off. Numerically this is summarized by the area under the curve (AUC) of the ROC plot. All performance discussed is out-of-sample performance on a 30% holdout sample, with the models built on the remaining 70% of the dataset. All investigations of the data are conducted using R and the Rattle tool.
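As a concrete illustration of this protocol, the following minimal R sketch (using the randomForest and ROCR packages; the data frame df and its binary factor target bad are placeholders rather than any of the actual datasets) computes the holdout AUC for both model types:

library(randomForest)
library(ROCR)

set.seed(42)
idx <- sample(nrow(df), round(0.7 * nrow(df)))  # 70% build sample
train <- df[idx, ]
test <- df[-idx, ]

# both models out of the box
rf <- randomForest(bad ~ ., data = train)
lr <- glm(bad ~ ., data = train, family = binomial)

# score the 30% holdout
p_rf <- predict(rf, newdata = test, type = "prob")[, 2]
p_lr <- predict(lr, newdata = test, type = "response")

# area under the ROC curve on the holdout
auc <- function(p, y) performance(prediction(p, y), "auc")@y.values[[1]]
auc(p_rf, test$bad)
auc(p_lr, test$bad)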
TRADITIONAL CREDIT SCORING PITFALLS
The biggest problem with traditional credit scoring based on logistic regression techniques is that, as a scientist or economist, one cannot reliably interpret the importance of the underlying variables to the probability of a borrower experiencing financial difficulty.
The p values of the regression are not reliable because regression assumes no multicollinearity. As a result, variables which make sense from a theoretical point of view, such as cash flow surrogates, and which may have strong predictive power, can appear statistically insignificant based on p value statistics. This is a problem because credit data is notoriously correlated and biased. It is well known that 'biased estimation ... has been shown to predict and extrapolate better when predictor variables are highly correlated', a condition common in credit scoring (Overstreet, 1992).
Although modelers have used skill and judgment to work past this shortcoming, there is no way in traditional scorecards to assess the predictive value of variables in a robust and reliable manner. Thus many opportunities in the form of variables and variable interactions may be lost given the current tool.
Also, from a human factors and organizational point of view, people are biased toward testing theories they already hold and not trying things that might not seem to make sense. Our ability to develop causal models is biased and arbitrary, despite the meanings we attach to things after the fact.
The history of credit scoring literature is rife with contradictory studies, from Durand's first study in the 1930s onward, on whether income is predictive. Yet mortgage risk models have shown the debt ratio (monthly expenses/income) to be predictive, as well as months' reserves (liquid assets/monthly payment). The successes of credit scoring in the mortgage industry show that financial worth and ability-to-pay variables can be used effectively in models, along with loan to value (loan amount/property value), to assess risk. If we step back we can see that interaction variables of affordability and credit risk have proven to be valuable predictive tools. This is also consistent with the judgmental theory of credit: character (willingness to pay), capacity (ability to pay), and collateral.
The next leap in improving credit scoring is to find ways to test interaction terms in a meaningful and principled way. It stands to reason econometrically that if any variable should have an impact on human behavior in spending, consumption, and financial distress, it should be ability to pay. The measures of this are income, current debt usage, and the reserves and assets one has saved to absorb shocks or life events.
Is there a statistically reliable way to test the importance of variables relative to their predictive power?
Importance of Random Forests to Credit Risk and Economics in General
To date the majority of credit scorecards used in industry are linear models, despite the known issues of the flat maximum and multicollinearity (Wainer, 1978; Overstreet et al., 1997). Random Forests are a powerful tool for economic science as they successfully deal with correlated variables with complex interactions (Breiman, 2001).
A simple example of the power of Random Forests was given by Breiman in the binary prediction case of hepatitis mortality, in which Stanford medical school had identified variables 6, 12, 14 and 19 as most predictive of risk using logistic regression. Subsequently, using the bootstrap technique, Efron showed that none of these variables were significant in the random resampling trials he ran. The Random Forest variable importance measure, created by Breiman, showed variables 7 and 11 to be critical, and improved on the logit regression results by simplifying the model and reducing error from 17% to 12% (Breiman, 2002).
As Random Forests are non-parametric, the linear restrictions of the flat maximum do not come into play as such. That said, predictive models tend to settle into Pareto-optimal trade-offs between true positive and false positive rates which look like an asymptote, much like the flat maximum effect. The interactions of economic variables such as macroeconomic forces and affordability are too complex to be studied with simple linear regression anymore. Random Forests serve as a good estimate of the asymptote of possible predictive power in this regard, and help us get past the psychological limit we may believe to exist for predictive power, much as Roger Bannister got past the preconceived limit on the minimum time for running the mile. The way Random Forests work, building large quantities of weak classifiers on random selections of variables and growing them with out-of-sample testing, is analogous to the way humans make decisions in a marketplace (see Gigerenzer's work on "fast and frugal trees" in human judgment models). Humans each look at the data available to them, make quick inferences, and take actions based on these data. Random Forests then take votes from these large quantities of predictors and use the decisions of all the predictors to make the final decision. The fact that diverse models built on different variables and samples of data, when combined, outperform other simple linear models is profound.
That said, the critical aspects of Random Forests of interest to economic scientists are the features Breiman intended, such as those below (a minimal R sketch of the first two features follows this list):
* Random Forests never overfit the data, as they are built with out-of-sample testing for each submodel
* Variable importance (a measure of the contribution each variable makes to overall model accuracy, based on permutation tests of each variable)
* The ability to see the effects of variables on predictions (Breiman, 2002)
* Handling thousands of variables efficiently by sampling variables
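As a minimal sketch of how these features are exercised in R, assuming Breiman and Cutler's randomForest package and the train frame with factor target bad from the sketch above:

library(randomForest)

rf <- randomForest(bad ~ ., data = train,
    ntree = 500,       # many trees, each grown on a bootstrap sample
    importance = TRUE) # permutation-based variable importance

importance(rf, type = 1) # mean decrease in accuracy when each variable is permuted
varImpPlot(rf)           # variables plotted in rank order of importance

The out-of-bag observations left out of each tree's bootstrap sample supply the built-in out-of-sample testing mentioned in the first bullet.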
Random Forests help us see the true impact of complex interrelated variables. As Breiman mentioned in his Wald lecture, complex phenomena cannot be modeled well with simplified goodness-of-fit models. A more scientific approach is to build as complex a model as needed to fit the phenomenon being studied, and then to have tools like variable importance to understand the relationships inside the phenomenon being studied (Breiman, 2002). This is an important point, as economics is based on more and more complex realities.
Comparison of Random Forests to Logistic Regression
We now examine random forest performance out of the box on 3 data sets. The first is a private label credit card data set from the 2010 PAKDD contest in Pacific Asia, the second is the widely used home equity loan data set, and the third is a proprietary dataset.
1. Random Forest vs. Logistic Regression on Credit Card Data Set
Credit Card Dataset
The credit card data set has 50,000 loans, of which 13,000 are bad (serious delinquency). Using this data set, a random forest model and a logistic regression scorecard were compared out of the box. The source for the data is the Pacific-Asia Knowledge Discovery and Data Mining conference: http://sede.neurotech.com.br/PAKDD2010/
Models
Random Forest Variable Importance: The variable importance plot for random forests showed the following variables to be predictive, in rank order. According to the plot, the majority of borrower delinquency on the card can be predicted by age, monthly income, phone, payment day, type of occupation, marital status, number of dependents, area code of profession, and type of residence. Beyond these, additional variables add predictive power in some fashion through interaction effects.
Logistic regression model
Insights
Note how the regression makes personal income appear statistically insignificant, although we know from the random forest that it has a great deal of predictive power.
[GRAPHIC OMITTED]
The AUC (area under the curve) for the random forest model was .629, while for the regression model it was .60. Thus random forests showed a 5% improvement in performance over the logistic regression. By adding interaction terms suggested by the variables the random forest ranks highly, the logistic regression performance can be enhanced to match or slightly exceed random forest performance, as sketched below.
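A hedged sketch of that tuning step, reusing the auc() helper from the earlier sketch, with train and test now drawn from the credit card frame whose target is TARGET_LABEL_BAD; the specific AGE by PERSONAL_MONTHLY_INCOME interaction is illustrative rather than the exact term used in this study:

# add one interaction term suggested by the top of the importance ranking
lr2 <- glm(TARGET_LABEL_BAD ~ . + AGE:PERSONAL_MONTHLY_INCOME,
    data = train, family = binomial)
p_lr2 <- predict(lr2, newdata = test, type = "response")
auc(p_lr2, test$TARGET_LABEL_BAD) # retain the term only if holdout AUC improves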
2. Random Forest vs. Logistic Regression on Home Equity Data Set
Home Equity Dataset
The home equity data set has approximately 5,960 loans, of which 1,189 are bad (serious delinquency). Using this data set, a random forest model and a logistic regression scorecard were compared out of the box. The source for the data is the popular SAS data set: www.sasenterpriseminer.com/data/HMEQ.xls
Models
Random Forest Variable Importance: The variable importance plot for random forests showed the following variables to be predictive, in rank order: the debt ratio, age of credit history, value of the home, and delinquency history had the most predictive power according to the random forest.
[GRAPHIC OMITTED]
Insights
The regression shows the debt ratio and other variables suggested by the random forest to be statistically significant.
[GRAPHIC OMITTED]
The random forest nevertheless greatly outperforms the logistic regression scorecard on the home equity data set, showing that logistic regression is not exploiting the maximum predictive value of the variables.
The AUC of the random forest was .92, while for the logistic regression it was .78. Thus, out of the box, random forests had an 18% advantage in performance over the logistic regression. A recent study that tuned logistic regression with neural network transformations reported a logistic regression AUC of .86 (Wallinga, 2009). Thus Wallinga's approach of generalized additive neural network logistic regression, though a powerful and well-thought-out enhancement, improved performance by 28% but still did not match the out-of-the-box performance of random forests.
3. Random Forest vs. Logistic Regression on Proprietary Data set
Proprietary Dataset
The proprietary data set comprises credit data from 2008; bad loans are defined as loans which go 90 days past due or worse within 2 years on any account, tradeline, or loan. The data has 293,421 loan applicants and 19,449 bad loans.
Models
Random Forest Variable Importance: The variable importance plot for random forests showed the following variables to be predictive, in rank order: revolving line of credit utilization, debt ratio, income, age of applicant, number of 30-day delinquencies in 2 years, number of tradelines active/open (had activity within 6 months), number of 90-day delinquency tradelines in 2 years, number of 60-day delinquency tradelines in 2 years, and number of mortgage tradelines. These have the most power in predicting serious delinquency for a borrower for up to 2 years. The attributes excluded duplicate or invalid-status tradelines.
[GRAPHIC OMITTED]
Insights
The regression does not show revolving utilization to be statistically significant, while the random forest correctly identifies it as a very predictive variable and obtains maximal predictive value from the data.
Performance
Using these 8 variables, the AUC of the random forest exceeded that of logistic regression by a large margin: the random forest has an area under the curve of 0.8522 while logistic regression has an AUC of 0.6964.
[GRAPHIC OMITTED]
In addition, performance results were also computed for a popular credit score known as FICO[R]. Performance of the credit score was superior to both the regression and the random forest, with an AUC of .865.
[GRAPHIC OMITTED]
IMPLICATIONS
The fact that random forests with 8 variables can produce a model which is competitive with FICO[R] out of the box is remarkable. Logistic regression does not achieve that level of performance out of the box. This example clearly shows the random forest's superiority in scientifically rank ordering predictive variables and optimally extracting predictive value from data with multicollinearity and interactions. The advantage of random forests depends on the strength of the relationships between variables: in data sets with few interaction effects, random forests may not outperform. On large credit data sets, behavioral models, and application scoring, random forests can improve existing credit models by 5-10% by guiding the tuning of the regression; once tuned with judgment and careful testing, logistic regression can outperform random forests. The example of building a random forest that is almost as predictive as the FICO[R] score, with an AUC of .85 vs. .865, but with only 8 variables, dramatically shows the power of random forests for scientists and credit risk modelers seeking to maximize the predictive value of data.
All 8 variables conform to theoretical soundness as they relate to borrower cash flow surrogates. Econometrically, credit scoring variables can be segmented into cash flow variables, stability variables, and payment history variables (Overstreet, 1992). Removing the revolving utilization and delinquency behavior variables greatly reduced the random forest performance to be more in line with logistic regression, implying that the most predictive value is in the interaction of the utilization and delinquency behavior attributes with the other variables. Random forests will outperform when there are complex relationships and interactions between the variables that a typical regression might miss.
Explaining the Advantage of Random Forests over Logistic Regression
An explanation of how such a simple data set can be competitive with the FICO[R] score is that credit models are thought to suffer from the flat maximum effect, which implies that models built on smaller data can perform close to larger, more sophisticated linear models like logistic regression, because these regressors are insensitive to large variations in the size of the regression weights. The random forest advantage also seems to correlate with variables exhibiting interaction effects and multicollinearity, as the technique is able to determine complex relationships in the data using a bootstrap of variables and samples to build ensembles of models.
The power of random forests has profound implications for taking credit risk scorecards to the next level: optimizing credit score performance and leading to better and more robust scientific inferences about factors and how they impact phenomena ranging from financial risk to consumer behavior modeling to medical science, and perhaps even mimicking how humans think or behave in swarm intelligence.
Optimizing Credit Scorecards Using Random Forests: An approach
[GRAPHIC OMITTED: Updated credit card random forest variable importance with interaction terms]
Mainstream credit scorers can benefit from random forest models as well. One approach to optimizing existing models is to test interaction terms built from the variables identified as most predictive by random forests. For example, using the credit card data set discussed earlier, one can improve the AUC of the logistic regression to match random forests by adding interaction terms, achieving an AUC of .626. Thus logistic regression can be tuned to match the out-of-the-box performance of random forests and yield almost the same performance as the random forest model (and on some data sets, after tuning, logistic regression performs better than the random forest).
Overall process for Optimizing Existing Credit Scorecard
* SOAR (Specify data, Observe data, Analyze, and Recommend) (Brown, 2005)
* Run a random forest
* Take the top predictive fields and create interaction terms in the regression one at a time, retaining the statistically significant interactions
* Rerun the regression and compare until the regression outperforms or closely matches the random forest's out-of-sample performance
* Run conditional inference trees to identify interactions, and re-run the random forest and logit models until maximal performance is achieved
* Convert fields to factors for the logit, as binned data improves logits in general
* Multiply the scores from the random forest and the logistic regression, sum them, take the max, and compare areas under the curve (a sketch follows this list). As predicted by Hand's superscorecard literature, multiplying the 2 scores resulted in improved performance as well (Hand et al., 2002).
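A minimal sketch of the final combination step, assuming p_rf and p_lr are the holdout probabilities from the random forest and the tuned logit, with the same placeholders as the earlier sketches:

combo_mult <- p_rf * p_lr      # the superscorecard-style product of scores
combo_sum <- p_rf + p_lr       # simple sum of scores
combo_max <- pmax(p_rf, p_lr)  # higher of the two scores
auc(combo_mult, test$bad)
auc(combo_sum, test$bad)
auc(combo_max, test$bad)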
The method of using random forests, affordability data, and logistic regression, in combination with conditional inference trees, iteratively to improve logistic regression until it matches or outperforms random forests is dubbed the Sharma method. For the most comprehensive review of the credit scoring literature and this approach see Sharma, Overstreet & Beling (2009). The methods are also detailed in the Guide to Credit Scoring in R (Sharma, 2009). The pioneering work behind this was Overstreet et al. in 1992, the first theory-based free cash flow model for credit scoring, and Breiman's work on random forests, which allowed the importance of affordability data to be seen more clearly. Prior to this, most logistic regression scorecards showed income and cash flow data to be marginally predictive, as the p values were too high and erroneous due to multicollinearity. For details on the checkered history of credit scoring see Sharma, Overstreet and Beling (2009).
In terms of implementation, R was used along with the Rattle data mining software. Rattle greatly facilitated the speed and ease of running the algorithms and credit scoring once the interaction terms were added by hand-written code and run through Rattle (see Williams, 2008, for Rattle).
Extensions
On large data sets I have been able to improve logistic regressions to match the performance of random forests using trial and error, judgment, and random forest variable importance as a base for adding interaction terms. This approach is painful and time consuming. A more viable approach would be to use random forest performance as a benchmark and automatically optimize the logistic regression on out-of-sample error, testing interactions among the most predictive variables and formulas using a genetic algorithm approach, as sketched below.
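The sketch below renders the idea in its simplest form, a random rather than genetic search over pairwise interactions among the top-ranked variables; top_vars, the rf_auc benchmark, and the greedy acceptance rule are all assumptions standing in for a fuller genetic algorithm, and the placeholders train, test, bad, and auc() are those of the earlier sketches:

# propose pairwise interactions among the forest's top variables at random,
# keep those that improve holdout AUC, stop at the random forest benchmark
top_vars <- c("v1", "v2", "v3", "v4") # hypothetical names from the importance plot
rf_auc <- 0.85                        # illustrative random forest benchmark AUC
form <- bad ~ .
best <- auc(predict(glm(form, data = train, family = binomial),
    newdata = test, type = "response"), test$bad)
for (k in 1:50) {
  pair <- sample(top_vars, 2)
  cand <- update(form, as.formula(paste(". ~ . +", pair[1], ":", pair[2])))
  m <- glm(cand, data = train, family = binomial)
  a <- auc(predict(m, newdata = test, type = "response"), test$bad)
  if (a > best) { form <- cand; best <- a } # keep a helpful interaction
  if (best >= rf_auc) break                 # stopping criterion
}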
Credit scoring is a search for meaningful interaction terms, and all financial ratios are interaction terms. Hand has shown that multiplying scores always produces a better or equivalent score, and this is itself again an example of an interaction term formed by multiplying variables (Hand, 2002). By viewing financial ratios as interactions, one can widen the lens and search for optimal interactions to obtain optimal predictive power from the affordability data. Traditional regression, with its failure to handle multicollinearity, has made searching for fruitful interaction terms in credit data problematic. Also, attempting too many interactions can overfit logits. Thus a careful, knowledge-based approach is needed, which random forest variable importance measures provide. For an in-depth discussion of this, as well as the most comprehensive literature review of credit scoring and the overall approach, see Sharma, Overstreet and Beling (2009).
CONCLUSIONS
The best of both worlds can be achieved by finding ways to optimally enhance logistic regression using insights from random forest variable importance, which is a more reliable gauge of variable importance and relationships given the multicollinearity in all credit models and data. To date, I have tuned logistic regression scorecards judgmentally, using random forest variable importance to outline interaction terms to be added to the model; but the home equity dataset shows that this might not be enough, as more transformations and binning of variables might be needed to squeeze optimal performance into logistic regression. The next step is to explore interaction terms and transformations via stochastic search optimization using genetic algorithms within a bounded variable space, using random forest performance as a stopping criterion. This would best be accomplished via an automated algorithm which iterates through variable interaction and combination mining using a sample set of meaningful variables identified by the random forest as predictive, which regression p values might miss. A common example of this oversight by traditional scorecards, since the time of Durand in the 1930s, is income and affordability data, which standard regressions have shown to be non-predictive, flying in the face of common sense. The most successful predictive variables in the mortgage industry are all interaction terms (loan to value, months' reserves, and debt ratio; for an example of mortgage scoring see Avery et al., 1996). The history of credit scoring shows that finding optimal interaction terms is crucial to optimal predictive accuracy, and random forests play a vital role in being able to test meaningful variables which traditional scoring technologies such as regression failed to identify using p value tests of significance.
Human Values perspective
Credit scoring should be integrated with normative models to ensure borrower wellbeing instead of maximizing profit, as evidenced by the recent global recession in the 21st century. Credit score models, no matter how sophisticated, built to predict two years of data fail to assess the long term impact on borrower wellbeing, and that is a challenge worth studying; such knowledge will surely lead to sustainable credit markets which do not threaten democracy and have a robust micro-foundation for macro-markets in credit. In the aggregate picture, proprietary models to predict behavior are all more suboptimal than a white box credit policy which ensures borrower financial wellbeing by enforcing constraints on borrower reserves, consumption, and expenses to income over time. Competition in credit modeling will not lead to better consumer welfare, as credit is a commodity; financial institutions should not compete on credit policy for sustainable advantage but instead should compete on convenience, safer products, and customization to fit borrower life stages.
Let's hope in the future we won't need proprietary models and can live in an enlightened world where borrowers can choose safe products and know the implications of their behavior on their ability to obtain more credit, in an open white box world where behavior is then regulated by a desire to conform to standards which will make borrowers more fiscally responsible. Credit data should be democratized and held by not-for-profit entities, as it is a social good.
APPENDIX OF DATA DESCRIPTIONS AND OPEN DATA SETS
Credit Card Dataset Original Variable Descriptions
ID_CLIENT: Sequential number for the applicant (to be used as a key). Values: 1-50000, 50001-70000, 70001-90000
CLERK_TYPE: Not informed. Values: C
PAYMENT_DAY: Day of the month for bill payment, chosen by the applicant. Values: 1, 5, 10, 15, 20, 25
APPLICATION_SUBMISSION_TYPE: Indicates if the application was submitted via the internet or in person/posted. Values: Web, Carga
QUANT_ADDITIONAL_CARDS: Quantity of additional cards asked for in the same application form. Values: 1, 2, NULL
POSTAL_ADDRESS_TYPE: Indicates if the address for posting is the home address or other; encoding not informed. Values: 1, 2
SEX: Values: M = Male, F = Female
MARITAL_STATUS: Encoding not informed. Values: 1, 2, 3, 4, 5, 6, 7
QUANT_DEPENDANTS: Values: 0, 1, 2, ...
EDUCATION_LEVEL: Educational level in gradual order; encoding not informed. Values: 1, 2, 3, 4, 5
STATE_OF_BIRTH: Values: Brazilian states, XX, missing
CITY_OF_BIRTH: City of birth
NACIONALITY: Country of birth; encoding not informed but Brazil is likely to equal 1. Values: 0, 1, 2
RESIDENCIAL_STATE: State of residence
RESIDENCIAL_CITY: City of residence
RESIDENCIAL_BOROUGH: Borough of residence
FLAG_RESIDENCIAL_PHONE: Indicates if the applicant possesses a home phone. Values: Y, N
RESIDENCIAL_PHONE_AREA_CODE: Three-digit pseudo-code
RESIDENCE_TYPE: Encoding not informed; in general there are the types owned, mortgage, rented, parents, family, etc. Values: 1, 2, 3, 4, 5, NULL
MONTHS_IN_RESIDENCE: Time in the current residence in months. Values: 1, 2, ..., NULL
FLAG_MOBILE_PHONE: Indicates if the applicant possesses a mobile phone. Values: Y, N
FLAG_EMAIL: Indicates if the applicant possesses an e-mail address. Values: 0, 1
PERSONAL_MONTHLY_INCOME: Applicant's personal regular monthly income in Brazilian currency (R$)
OTHER_INCOMES: Applicant's other incomes, monthly averaged, in Brazilian currency (R$)
FLAG_VISA: Flag indicating if the applicant is a VISA credit card holder. Values: 0, 1
FLAG_MASTERCARD: Flag indicating if the applicant is a MASTERCARD credit card holder. Values: 0, 1
FLAG_DINERS: Flag indicating if the applicant is a DINERS credit card holder. Values: 0, 1
FLAG_AMERICAN_EXPRESS: Flag indicating if the applicant is an AMERICAN EXPRESS credit card holder. Values: 0, 1
FLAG_OTHER_CARDS: Despite being labeled "FLAG", this field presents three values, not explained. Values: 0, 1, NULL
QUANT_BANKING_ACCOUNTS: Values: 0, 1, 2
QUANT_SPECIAL_BANKING_ACCOUNTS: Values: 0, 1, 2
PERSONAL_ASSETS_VALUE: Total value of personal possessions such as houses, cars, etc. in Brazilian currency (R$)
QUANT_CARS: Quantity of cars the applicant possesses
COMPANY: If the applicant has supplied the name of the company where he/she formally works. Values: Y, N
PROFESSIONAL_STATE: State where the applicant works
PROFESSIONAL_CITY: City where the applicant works
PROFESSIONAL_BOROUGH: Borough where the applicant works
FLAG_PROFESSIONAL_PHONE: Indicates if the professional phone number was supplied. Values: Y, N
PROFESSIONAL_PHONE_AREA_CODE: Three-digit pseudo-code
MONTHS_IN_THE_JOB: Time in the current job in months
PROFESSION_CODE: Applicant's profession code; encoding not informed. Values: 1, 2, 3, ...
OCCUPATION_TYPE: Encoding not informed. Values: 1, 2, 3, 4, 5, NULL
MATE_PROFESSION_CODE: Mate's profession code; encoding not informed. Values: 1, 2, 3, ...
EDUCATION_LEVEL (second occurrence; read into R as EDUCATION_LEVEL.1): Mate's educational level in gradual order; encoding not informed. Values: 1, 2, 3, 4, 5
FLAG_HOME_ADDRESS_DOCUMENT: Flag indicating documental confirmation of home address. Values: 0, 1
FLAG_RG: Flag indicating documental confirmation of citizen card number. Values: 0, 1
FLAG_CPF: Flag indicating documental confirmation of tax payer status. Values: 0, 1
FLAG_INCOME_PROOF: Flag indicating documental confirmation of income. Values: 0, 1
PRODUCT: Type of credit product applied for; encoding not informed. Values: 1, 2, 7
FLAG_ACSP_RECORD: Flag indicating if the applicant has any previous credit delinquency. Values: Y, N
AGE: Applicant's age at the moment of submission
RESIDENCIAL_ZIP_3: Three most significant digits of the home zip code
PROFESSIONAL_ZIP_3: Three most significant digits of the workplace zip code
TARGET_LABEL_BAD: Target variable: BAD = 1, GOOD = 0
Source: http://sede.neurotech.com.br/PAKDD2010/
HOME EQUITY DATA SET ORIGINAL VARIABLES
Name     Model Role  Measurement Level  Description
BAD      Target      Binary             1 = defaulted on loan, 0 = paid back loan
REASON   Input       Binary             HomeImp = home improvement, DebtCon = debt consolidation
JOB      Input       Nominal            Six occupational categories
LOAN     Input       Interval           Amount of loan request
MORTDUE  Input       Interval           Amount due on existing mortgage
VALUE    Input       Interval           Value of current property
DEBTINC  Input       Interval           Debt-to-income ratio
YOJ      Input       Interval           Years at present job
DEROG    Input       Interval           Number of major derogatory reports
CLNO     Input       Interval           Number of trade lines
DELINQ   Input       Interval           Number of delinquent trade lines
CLAGE    Input       Interval           Age of oldest trade line in months
NINQ     Input       Interval           Number of recent credit inquiries
Source: www.sasenterpriseminer.com/data/HMEQ.xls
APPENDIX OF R CODE
Credit Card Data Set and interactions
cc<-read.csv("C:/Documents and Settings//My Documents/cckdd2010.csv")
cc$TARGET_LABEL_BAD<-as.factor(cc$TARGET_LABEL_BAD)
# cap the number of dependants at 13
cc$QUANT_DEPENDANTS<-ifelse(cc$QUANT_DEPENDANTS>=13,13,cc$QUANT_DEPENDANTS)
#cc$ZipDist<-as.numeric(cc$RESIDENCIAL_ZIP_3)-as.numeric(cc$PROFESSIONAL_ZIP_3)
#cc$StateDiff<-as.factor(ifelse(cc$RESIDENCIAL_STATE==cc$PROFESSIONAL_STATE,'Y','N'))
#cc$CityDiff<-as.factor(ifelse(cc$RESIDENCIAL_CITY==cc$PROFESSIONAL_CITY,'Y','N'))
#cc$BoroughDiff<-as.factor(ifelse(cc$RESIDENCIAL_BOROUGH==cc$PROFESSIONAL_BOROUGH,'Y','N'))
# flag records with missing phone area codes
cc$MissingResidentialPhoneCode<-as.factor(ifelse(is.na(cc$RESIDENCIAL_PHONE_AREA_CODE),'Y','N'))
cc$MissingProfPhoneCode<-as.factor(ifelse(is.na(cc$PROFESSIONAL_PHONE_AREA_CODE),'Y','N'))
cc<-subset(cc,select=-ID_CLIENT)
cc<-subset(cc,select=-CLERK_TYPE)
cc<-subset(cc,select=-QUANT_ADDITIONAL_CARDS)
cc<-subset(cc,select=-EDUCATION_LEVEL)
#cc<-subset(cc,select=-STATE_OF_BIRTH)
cc<-subset(cc,select=-CITY_OF_BIRTH)
#cc<-subset(cc,select=-RESIDENCIAL_STATE)
cc<-subset(cc,select=-RESIDENCIAL_CITY)
cc<-subset(cc,select=-RESIDENCIAL_BOROUGH)
cc<-subset(cc,select=-PROFESSIONAL_STATE)
cc<-subset(cc,select=-PROFESSIONAL_CITY)
cc<-subset(cc,select=-PROFESSIONAL_BOROUGH)
cc<-subset(cc,select=-FLAG_MOBILE_PHONE)
cc<-subset(cc,select=-FLAG_HOME_ADDRESS_DOCUMENT)
cc<-subset(cc,select=-FLAG_RG)
cc<-subset(cc,select=-FLAG_CPF)
cc<-subset(cc,select=-FLAG_INCOME_PROOF)
cc<-subset(cc,select=-FLAG_ACSP_RECORD)
cc<-subset(cc,select=-TARGET_LABEL_BAD.1)
cc<-subset(cc,select=-RESIDENCIAL_ZIP_3)
cc$PROFESSIONAL_ZIP_3<-as.numeric(cc$PROFESSIONAL_ZIP_3)
cc$RESIDENCIAL_PHONE_AREA_CODE[is.na(cc$RESIDENCIAL_PHONE_AREA_CODE)] <- 0
cc$PROFESSIONAL_PHONE_AREA_CODE[is.na(cc$PROFESSIONAL_PHONE_AREA_CODE)] <- 0
cc$PROFESSION_CODE<-as.numeric(cc$PROFESSION_CODE)
cc$OCCUPATION_TYPE<-as.numeric(cc$OCCUPATION_TYPE)
cc$MATE_PROFESSION_CODE<-as.numeric(cc$MATE_PROFESSION_CODE)
cc$EDUCATION_LEVEL.1<-as.numeric(cc$EDUCATION_LEVEL.1)
cc$RESIDENCE_TYPE<-as.numeric(cc$RESIDENCE_TYPE)
cc$MONTHS_IN_RESIDENCE<-as.numeric(cc$MONTHS_IN_RESIDENCE)
cc$TotIncome<-cc$PERSONAL_MONTHLY_INCOME+cc$OTHER_INCOMES
cc$OthIncomePct<-cc$OTHER_INCOMES/cc$PERSONAL_MONTHLY_INCOME
cc$MnthsSavings<-cc$PERSONAL_ASSETS_VALUE/(.01+cc$MONTHS_IN_THE_JOB*cc$TotIncome)
cc$Afford<-cc$TotIncome+cc$PERSONAL_ASSETS_VALUE
cc$IncomeToAssets<-cc$TotIncome/(cc$PERSONAL_ASSETS_VALUE+.01)
cc$i1<-cc$QUANT_DEPENDANTS*cc$AGE
cc$i2<-cc$AGE*cc$PROFESSIONAL_ZIP_3
cc$i4<-cc$PROFESSION_CODE*cc$AGE
cc$i5<-cc$RESIDENCIAL_PHONE_AREA_CODE*cc$AGE
cc$i6<-cc$RESIDENCIAL_PHONE_AREA_CODE*cc$PROFESSIONAL_PHONE_AREA_CODE
cc$i7<-cc$RESIDENCIAL_PHONE_AREA_CODE*cc$OthIncomePct
cc$i8<-cc$RESIDENCIAL_PHONE_AREA_CODE*cc$IncomeToAssets
cc$i9<-cc$RESIDENCIAL_PHONE_AREA_CODE*cc$i1
cc$i10<-cc$RESIDENCIAL_PHONE_AREA_CODE*cc$i2
cc$i11<-cc$RESIDENCIAL_PHONE_AREA_CODE*cc$i5
cc$i12<-cc$RESIDENCIAL_PHONE_AREA_CODE*cc$OTHER_INCOMES
cc$i13<-cc$QUANT_DEPENDANTS*cc$RESIDENCIAL_PHONE_AREA_CODE
cc$i14<-cc$RESIDENCIAL_PHONE_AREA_CODE*cc$RESIDENCE_TYPE
cc$i15<-cc$RESIDENCIAL_PHONE_AREA_CODE*cc$PROFESSIONAL_ZIP_3
cc$i16<-cc$PERSONAL_MONTHLY_INCOME*cc$PROFESSIONAL_ZIP_3
cc$i17<-cc$OTHER_INCOMES*cc$PROFESSIONAL_ZIP_3
cc$i18<-cc$PROFESSIONAL_ZIP_3*cc$IncomeToAssets
cc$i19<-cc$PROFESSIONAL_ZIP_3*cc$i2
cc$i20<-cc$PROFESSIONAL_ZIP_3*cc$i5
cc$j1<-cc$MONTHS_IN_RESIDENCE*cc$EDUCATION_LEVEL.1
cc$j2<-cc$MONTHS_IN_RESIDENCE*cc$QUANT_CARS
cc$j3<-cc$MARITAL_STATUS*cc$MONTHS_IN_RESIDENCE
cc$j4<-cc$QUANT_CARS*cc$i12
cc$j5<-cc$FLAG_MASTERCARD*cc$i5
cc$j6<-cc$QUANT_CARS*cc$i2
cc$j7<-cc$FLAG_MASTERCARD*cc$i10
cc$j8<-cc$QUANT_CARS*cc$i19
cc$j9<-cc$QUANT_CARS*cc$OthIncomePct
cc$j10<-cc$NACIONALITY*cc$QUANT_CARS
cc$j11<-as.factor(ifelse(cc$FLAG_RESIDENCIAL_PHONE=='Y',cc$FLAG_MASTERCARD,'O'))
cc$j12<-cc$QUANT_CARS*cc$i7
cc$j13<-cc$MARITAL_STATUS*cc$j3
cc$j14<-cc$PAYMENT_DAY*cc$j5
cc$j15<-cc$PAYMENT_DAY*cc$j7
cc$j16<-cc$QUANT_CARS*cc$OCCUPATION_TYPE
cc$j17<-cc$OCCUPATION_TYPE*cc$j9
cc$j18<-as.factor(ifelse(cc$j11=='1',cc$OCCUPATION_TYPE,'O'))
cc$j19<-cc$AGE*cc$i2
cc$j20<-cc$OthIncomePct*cc$i2
cc$j21<-cc$i2*cc$i7
cc$j22<-cc$i2*cc$i10
cc$j23<-cc$i2*cc$i15
cc$j24<-cc$i2*cc$j1
cc$j25<-cc$i2*cc$j2
cc$j26<-cc$RESIDENCE_TYPE*cc$AGE
cc$j27<-cc$RESIDENCE_TYPE*cc$i4
cc$j28<-cc$RESIDENCE_TYPE*cc$i7
cc$j29<-cc$RESIDENCE_TYPE*cc$PROFESSION_CODE
cc$j30<-cc$PROFESSION_CODE*cc$PRODUCT
cc$j31<-cc$PRODUCT*cc$i6
cc$k1<-as.factor(ifelse(cc$AGE<=18 & cc$PAYMENT_DAY<=15,'Y','N'))
cc$k2<-as.factor(ifelse(cc$AGE>18 & cc$PAYMENT_DAY<=15,'Y','N'))
cc$k3<-as.factor(ifelse(cc$AGE>21 & cc$PAYMENT_DAY>15,'Y','N'))
cc$k4<-as.factor(ifelse(cc$AGE<=21 & cc$PAYMENT_DAY>15,'Y','N'))
cc$k5<-as.factor(ifelse(cc$AGE<=46 & cc$AGE>32 & cc$j11!='O' & cc$PAYMENT_DAY<=10 & cc$SEX!='F','Y','N'))
cc$k6<-as.factor(ifelse(cc$AGE<=46 & cc$AGE>32 & cc$j11!='O' & cc$PAYMENT_DAY<=10 & cc$SEX=='F','Y','N'))
cc$k7<-as.factor(ifelse(cc$AGE<=46 & cc$AGE>32 & cc$j11!='O' & cc$PAYMENT_DAY>10 & cc$SEX!='F','Y','N'))
cc$k8<-as.factor(ifelse(cc$AGE<=46 & cc$AGE>32 & cc$j11!='O' & cc$PAYMENT_DAY>10 & cc$SEX=='F' & cc$j30<=40,'Y','N'))
cc$k8a<-as.factor(ifelse(cc$AGE<=46 & cc$AGE>32 & cc$j11!='O' & cc$PAYMENT_DAY>10 & cc$SEX=='F' & cc$j30>40,'Y','N'))
cc$k9<-as.factor(ifelse(cc$AGE<=46 & cc$AGE>32 & cc$j11=='O' & cc$MissingProfPhoneCode!='N','Y','N'))
cc$k10<-as.factor(ifelse(cc$AGE<=46 & cc$AGE>32 & cc$j11=='O' & cc$MissingProfPhoneCode=='Y','Y','N'))
cc$k11<-as.factor(ifelse(cc$AGE>46 & cc$j11=='O' & cc$FLAG_PROFESSIONAL_PHONE=='Y','Y','N'))
cc$k12<-as.factor(ifelse(cc$AGE>46 & cc$j11=='O' & cc$FLAG_PROFESSIONAL_PHONE=='N' & cc$j16<=0 & cc$PAYMENT_DAY<=20,'Y','N'))
cc$k13<-as.factor(ifelse(cc$AGE>46 & cc$j11=='O' & cc$FLAG_PROFESSIONAL_PHONE=='N' & cc$j16<=0 & cc$PAYMENT_DAY>20,'Y','N'))
#cc$k14<-as.factor(ifelse(cc$AGE>46 & cc$j11=='O' & cc$FLAG_PROFESSIONAL_PHONE=='N' & cc$j16>0,'Y','N'))
cc$k15<-as.factor(ifelse(cc$AGE>46 & cc$AGE<=52 & cc$j11!='O','Y','N'))
cc$k16<-as.factor(ifelse(cc$AGE>52 & cc$j11!='O' & cc$PAYMENT_DAY<=15 & cc$i11<=271633 & cc$j5<=1220,'Y','N'))
cc$k17<-as.factor(ifelse(cc$AGE>52 & cc$j11!='O' & cc$PAYMENT_DAY<=15 & cc$i11<=271633 & cc$j5>1220,'Y','N'))
cc$k18<-as.factor(ifelse(cc$AGE>52 & cc$j11!='O' & cc$PAYMENT_DAY<=15 & cc$i11>271633,'Y','N'))
cc$k18<-as.factor(ifelse(cc$AGE>52 & cc$j11!='O' & cc$PAYMENT_DAY>15,'Y','N')) # note: overwrites the k18 defined on the previous line
# logit, fit on all variables including the interaction terms
m<-glm(TARGET_LABEL_BAD~.,data=cc,family=binomial)
cc<-subset(cc,select=-j1)
cc<-subset(cc,select=-j2)
cc<-subset(cc,select=-j3)
cc<-subset(cc,select=-j4)
cc<-subset(cc,select=-j5)
cc<-subset(cc,select=-j6)
cc<-subset(cc,select=-j7)
cc<-subset(cc,select=-j8)
cc<-subset(cc,select=-j9)
cc<-subset(cc,select=-j10)
cc<-subset(cc,select=-j11)
cc<-subset(cc,select=-j12)
cc<-subset(cc,select=-j13)
cc<-subset(cc,select=-j14)
cc<-subset(cc,select=-j15)
cc<-subset(cc,select=-j16)
cc<-subset(cc,select=-j17)
cc<-subset(cc,select=-j18)
cc<-subset(cc,select=-j19)
cc<-subset(cc,select=-j20)
cc<-subset(cc,select=-j21)
cc<-subset(cc,select=-j22)
cc<-subset(cc,select=-j23)
cc<-subset(cc,select=-j24)
cc<-subset(cc,select=-j25)
cc<-subset(cc,select=-j26)
cc<-subset(cc,select=-j27)
cc<-subset(cc,select=-j28)
cc<-subset(cc,select=-j29)
cc<-subset(cc,select=-j30)
cc<-subset(cc,select=-j31)
cc<-subset(cc,select=-i1)
cc<-subset(cc,select=-i2)
#cc<-subset(cc,select=-i3) # i3 is never created above
cc<-subset(cc,select=-i4)
cc<-subset(cc,select=-i5)
cc<-subset(cc,select=-i6)
cc<-subset(cc,select=-i7)
cc<-subset(cc,select=-i8)
cc<-subset(cc,select=-i9)
cc<-subset(cc,select=-i10)
cc<-subset(cc,select=-i11)
cc<-subset(cc,select=-i12)
cc<-subset(cc,select=-i13)
cc<-subset(cc,select=-i14)
cc<-subset(cc,select=-i15)
cc<-subset(cc,select=-i16)
cc<-subset(cc,select=-i17)
cc<-subset(cc,select=-i18)
cc<-subset(cc,select=-i19)
cc<-subset(cc,select=-i20)
Most work done in Rattle.
Home Equity Data Set R
#sas home equity data set
#www.sasenterpriseminer.com/data/HMEQ.xls
#Wielenga, D., Lucas, B. and Georges, J. (1999), Enterprise Miner: Applying Data Mining Techniques Course Notes, SAS Institute Inc., Cary, NC.
cc<-read.csv("C:/Documents and Settings/ My Documents/HMEQ.csv")
cc$BAD<-as.factor(cc$BAD)
# loan to value: a financial ratio, itself an interaction term
cc$LTV<-(cc$LOAN+cc$MORTDUE)*100/cc$VALUE
cc$JOB<-as.factor(cc$JOB)
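A hedged continuation, not in the original appendix, showing in outline how the home equity comparison reported in the body of the paper can be reproduced on this frame; na.roughfix is the randomForest package's rough median/mode imputation, used here because randomForest does not accept missing values:

library(randomForest)
library(ROCR)
set.seed(1)
idx <- sample(nrow(cc), round(0.7 * nrow(cc))) # 70/30 split
train <- na.roughfix(cc[idx, ])
test <- na.roughfix(cc[-idx, ])
rf <- randomForest(BAD ~ ., data = train)
lr <- glm(BAD ~ ., data = train, family = binomial)
p_rf <- predict(rf, newdata = test, type = "prob")[, 2]
p_lr <- predict(lr, newdata = test, type = "response")
performance(prediction(p_rf, test$BAD), "auc")@y.values[[1]] # random forest AUC
performance(prediction(p_lr, test$BAD), "auc")@y.values[[1]] # logistic regression AUC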
REFERENCES
Avery, R.B., Bostic, R.W., Calem, P.S., and Canner, G.B. (1996). "Credit risk, credit scoring, and the performance of home mortgages," Federal Reserve Bulletin, 82(7), pp. 621-648.
Breiman, L. (2001). "Random Forests," Machine Learning, 45(1), pp. 5-32.
Breiman, L. (2002). Wald Lecture 2: Looking Inside the Black Box. Retrieved from www.stat.berkeley.edu/users/breiman/wald2002-2.pdf
Brown, D. (2005). Linear Models. Unpublished manuscript, University of Virginia.
Overstreet, G.A. and Kemp, R.S. (1986). "Managerial control in credit scoring systems," Journal of Retail Banking.
Overstreet, G.A.J., Bradley, E.L., and Kemp, R.S. (1992). "The flat-maximum effect and generic linear scoring models: a test," IMA Journal of Mathematics Applied in Business & Industry, 4(1), pp. 97-109.
Sharma, D. (2009). Guide to Credit Scoring in R. Retrieved from http://cran.r-project.org/doc/contrib/SharmaCreditScoring.pdf
Sharma, D., Overstreet, G., and Beling, P. (2009). "Not If Affordability Data Adds Value but How to Add Real Value by Leveraging Affordability Data: Enhancing the Predictive Capability of Credit Scoring Using Affordability Data." CAS (Casualty Actuarial Society) Working Paper. Retrieved from http://www.casact.org/research/wp/index.cfm?fa=workingpapers
Wielenga, D., Lucas, B., and Georges, J. (2009). Enterprise Miner: Applying Data Mining Techniques. SAS Institute Inc., Cary, NC.
Williams, G. (2008). Desktop Guide to Data Mining. http://www.togaware.com/datamining/survivor/
http://www.crc.man.ed.ac.uk/conference/archive/2009/presentations/Paper-11-Paper.pdf
Dhruv Sharma, Independent Scholar
Credit Card Logistic Regression Model
Variable                          Estimate   Std. Error  z value   Pr(>|z|)   Signif.
(Intercept)                       0.367793   1.008929     0.365    0.715457
PAYMENT_DAY                       0.019525   0.001859    10.505    <2e-16     ***
APPLICATION_SUBMISSION_TYPECarga  -0.30733   0.093499    -3.287    0.001012   **
APPLICATION_SUBMISSION_TYPEWeb    -0.09412   0.059745    -1.575    0.115164
POSTAL_ADDRESS_TYPE               0.028979   0.151555     0.191    0.848362
SEXF                              -1.01841   0.611657    -1.665    0.095913   .
SEXM                              -0.83388   0.611716    -1.363    0.172827
SEXN                              -0.96122   0.717716    -1.339    0.180481
MARITAL_STATUS                    -0.01168   0.009773    -1.195    0.231949
QUANT_DEPENDANTS                  0.020479   0.010485     1.953    0.050805   .
NACIONALITY                       0.06682    0.071782     0.931    0.351919
FLAG_RESIDENCIAL_PHONEY           -0.82105   0.720136    -1.14     0.254232
RESIDENCIAL_PHONE_AREA_CODE       0.000984   0.000398     2.473    0.013386   *
RESIDENCE_TYPE                    -0.01853   0.010731    -1.727    0.084166   .
FLAG_EMAIL                        0.017635   0.046639     0.378    0.705338
PERSONAL_MONTHLY_INCOME           5.77E-07   1.48E-06     0.39     0.69655
OTHER_INCOMES                     1.78E-05   1.83E-05     0.971    0.331516
FLAG_VISA                         0.074469   0.042835     1.738    0.082123   .
FLAG_MASTERCARD                   -0.2261    0.046838    -4.827    1.38E-06   ***
FLAG_DINERS                       0.284333   0.33461      0.85     0.395467
FLAG_AMERICAN_EXPRESS             -0.06287   0.303154    -0.207    0.835701
FLAG_OTHER_CARDS                  -0.05443   0.299613    -0.182    0.85585
QUANT_BANKING_ACCOUNTS            -0.00642   0.058358    -0.11     0.912419
QUANT_SPECIAL_BANKING_ACCOUNTS    NA         NA           NA       NA
PERSONAL_ASSETS_VALUE             -4.4E-08   3.3E-07     -0.134    0.893551
QUANT_CARS                        -0.02769   0.101239    -0.274    0.784451
COMPANYY                          -0.06863   0.031724    -2.163    0.030512   *
FLAG_PROFESSIONAL_PHONEY          0.714282   0.617823     1.156    0.247629
PROFESSIONAL_PHONE_AREA_CODE      -0.00056   0.000721    -0.782    0.434309
MONTHS_IN_THE_JOB                 -0.06383   0.055293    -1.154    0.24833
OCCUPATION_TYPE                   0.026602   0.007269     3.66     0.000252   ***
MATE_PROFESSION_CODE              -0.00679   0.004226    -1.606    0.108175
EDUCATION_LEVEL.1                 0.000307   0.019259     0.016    0.987279
PRODUCT                           0.034652   0.011976     2.894    0.00381    **
AGE                               -0.01968   0.000975    -20.193   <2e-16     ***
MissingResidentialPhoneCodeY      -0.17667   0.720139    -0.245    0.806201
MissingProfPhoneCodeY             0.804464   0.619538     1.298    0.194119
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Residual deviance: 39312 on 34964 degrees of freedom
AIC: 39384
Number of Fisher Scoring iterations: 4
Log likelihood: -19655.757 (36 df)
Null/Residual deviance difference: 906.696 (35 df)
Chi-square p-value: 0.00000000
Home Equity Logistic Regression
Variable        Estimate       Std. Error  z value  Pr(>|z|)   Signif.
(Intercept)     -17.07851715   524.8953    -0.033   0.974044
LOAN            0.000001803    1.58E-05     0.114   0.909386
MORTDUE         0.000019897    1.26E-05     1.576   0.115104
VALUE           -0.000016881   1.13E-05    -1.501   0.133474
REASONDebtCon   -0.621936258   0.635508    -0.979   0.327756
REASONHomeImp   -0.753124539   0.647779    -1.163   0.244982
JOBMgr          14.79633915    524.8937     0.028   0.977511
JOBOffice       14.22345444    524.8938     0.027   0.978382
JOBOther        14.67559173    524.8937     0.028   0.977695
JOBProfExe      14.83695589    524.8937     0.028   0.97745
JOBSales        15.91157826    524.8939     0.03    0.975817
JOBSelf         15.92432142    524.8939     0.03    0.975797
YOJ             -0.005838696   0.012261    -0.476   0.63394
DEROG           0.802276888    0.123378     6.503   7.89E-11   ***
DELINQ          0.817538124    0.085566     9.554   <2e-16     ***
CLAGE           -0.00580995    0.001307    -4.445   8.79E-06   ***
NINQ            0.155918992    0.042991     3.627   0.000287   ***
CLNO            -0.027956215   0.009666    -2.892   0.003827   **
DEBTINC         0.101303396    0.012958     7.818   5.38E-15   ***
LTV             -0.020306186   0.011472    -1.77    0.076707   .
--
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1472.1 on 2431 degrees of freedom
Residual deviance: 1090.8 on 2412 degrees of freedom
(1740 observations deleted due to missingness)
AIC: 1130.8
Number of Fisher Scoring iterations: 16
Log likelihood: -545.408 (20 df)
Null/Residual deviance difference: 381.3 (19 df)
Chi-square p-value: 0.0000000
Proprietary Data Set Logistic Regression Model
Variable                       Estimate   Std. Error  z value   Pr(>|z|)   Signif.
(Intercept)                    -1.33216   0.031749    -41.959   <2e-16     ***
trades                         -0.00552   0.001959    -2.819    0.00482    **
30 number dlq (not worse)      0.501922   0.008535     58.807   <2e-16     ***
60 number dlq (not worse)      -0.94516   0.014058    -67.233   <2e-16     ***
90 day number dlq (not worse)  0.478619   0.012085     39.605   <2e-16     ***
mtg_trd_lines                  0.095229   0.00804      11.844   <2e-16     ***
monthly income                 -3.6E-05   2.29E-06    -15.642   <2e-16     ***
age                            -0.02729   0.000655    -41.631   <2e-16     ***
revolving balance util         2.55E-05   2.95E-05     0.865    0.38731
DebtRatio                      -0.00015   3.61E-05     4.222    2.42E-05   ***
--
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 118040 on 235272 degrees of freedom
Residual deviance: 108889 on 235263 degrees of freedom
(58148 observations deleted due to missingness)
AIC: 108909
Number of Fisher Scoring iterations: 6
Log likelihood: -54444.557 (10 df)
Null/Residual deviance difference: 9151.24 (9 df)
Chi-square p-value: 0.00000000