
Article Information

  • Title: Improving the art, craft and science of economic credit risk scorecards using random forests: why credit scorers and economists should use random forests.
  • Author: Sharma, Dhruv
  • Journal: Academy of Banking Studies Journal
  • Print ISSN: 1939-2230
  • Year: 2012
  • Issue: January
  • Language: English
  • Publisher: The DreamCatchers Group, LLC
  • Abstract: The aim of this paper is to outline an approach to improving credit risk scorecards using Random Forests. We start with the benefits of random forests compared to logistic regression, the tool used most often for credit scoring systems. We then compare the out-of-the-box performance of random forests and logistic regression on a credit card dataset, a home equity loan dataset and a proprietary dataset. We outline an approach to improving logistic regression using the random forest. We conclude by demonstrating how powerful random forests can be in developing a model using 8 variables which is almost as good as the FICO[R] score. This highlights the fact that data sets with complex interaction terms and correlated variables can benefit from random forest models in 2 ways: 1) clear insight into the most predictive and valuable variables, and 2) robust models which capture predictive interactions and relationships in the data not detectable by traditional regression techniques.
  • Keywords: Algorithms; Credit ratings; Software; United States economic conditions

Improving the art, craft and science of economic credit risk scorecards using random forests: why credit scorers and economists should use random forests.


Sharma, Dhruv


INTRODUCTION

The aim of this paper is to outline an approach to improving credit risk scorecards using Random Forests. We start with the benefits of random forests compared to logistic regression, the tool used most often for credit scoring systems. We then compare the out-of-the-box performance of random forests and logistic regression on a credit card dataset, a home equity loan dataset and a proprietary dataset. We outline an approach to improving logistic regression using the random forest. We conclude by demonstrating how powerful random forests can be in developing a model using 8 variables which is almost as good as the FICO[R] score. This highlights the fact that data sets with complex interaction terms and correlated variables can benefit from random forest models in 2 ways: 1) clear insight into the most predictive and valuable variables, and 2) robust models which capture predictive interactions and relationships in the data not detectable by traditional regression techniques.

For the purpose of this study, model performance is compared using Receiver Operating Characteristic (ROC) curves, which plot the proportion of bad loans detected vs. good loans incorrectly classified at each model cut-off. Numerically this is summarized as the area under the ROC curve (AUC). All performance discussed is out-of-sample performance on a 30% hold-out sample, while the models are built on the other 70% of the dataset. All investigations of the data are conducted using R and the Rattle tool.
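As a concrete illustration, here is a minimal sketch of this evaluation protocol in R (the data frame cc and response TARGET_LABEL_BAD follow the credit card appendix below; the seed and the missing-value handling are assumptions, not the paper's exact Rattle settings):

library(randomForest)
library(ROCR)
set.seed(42)                                  # assumed; any split seed works
cc <- na.omit(cc)                             # simplest missing-value handling for this sketch
idx <- sample(nrow(cc), 0.7 * nrow(cc))       # 70% build sample
train <- cc[idx, ]; test <- cc[-idx, ]        # 30% hold-out sample
rf <- randomForest(TARGET_LABEL_BAD ~ ., data = train)
lr <- glm(TARGET_LABEL_BAD ~ ., data = train, family = binomial)
p_rf <- predict(rf, test, type = "prob")[, 2] # P(bad) from the forest
p_lr <- predict(lr, test, type = "response")  # P(bad) from the logit
auc <- function(p, y) performance(prediction(p, y), "auc")@y.values[[1]]
auc(p_rf, test$TARGET_LABEL_BAD)              # out-of-sample AUC, random forest
auc(p_lr, test$TARGET_LABEL_BAD)              # out-of-sample AUC, logistic regression
plot(performance(prediction(p_rf, test$TARGET_LABEL_BAD), "tpr", "fpr")) # ROC curve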

TRADITIONAL CREDIT SCORING PITFALLS

The biggest problem with traditional credit scoring based on logistic regression is that a scientist or economist cannot reliably interpret the importance of the underlying variables to the probability of a borrower experiencing financial difficulty.

The p values of the regression are not reliable, because regression assumes no multicollinearity. Variables which make sense from a theoretical point of view, such as cash flow surrogates, and which may have strong predictive power, can therefore appear statistically insignificant. This is a problem because credit data is notoriously correlated and biased. It is well known that biased estimation 'has been shown to predict and extrapolate better when predictor variables are highly correlated', and such correlation is common in credit scoring (Overstreet, 1992).

Although modelers have used skill and judgment to work past this shortcoming, traditional scorecards offer no way to assess the predictive value of variables in a robust and reliable manner. Many promising variables and variable interactions may thus be lost to users of the current tool.

From a human factors and organizational point of view, people are also biased toward testing the theories they already hold rather than trying things that might not seem to make sense. Our ability to develop causal models is biased and arbitrary, despite the meanings we attach to things after the fact.

The history of credit scoring literature is rife with contradictory studies, beginning with Durand's first study in the 1930s, on whether income is predictive. Yet mortgage risk models have shown the debt ratio (monthly expenses/income) to be predictive, as well as months' reserves (liquid assets/monthly payment). The successes of credit scoring in the mortgage industry show that financial worth and ability-to-pay variables can be used effectively in models, along with loan to value (loan amount/property value), to assess risk. Stepping back, we can see that interaction variables combining affordability and credit risk have proven to be valuable predictive tools. This is also consistent with the judgmental 'Cs' of credit: credit (willingness to pay), capacity (ability to pay), collateral, and character.
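Each of these ratios is itself a simple interaction (a quotient) of raw application fields; illustratively, in R (all variable names here are hypothetical):

debt_ratio <- monthly_expenses / income             # mortgage debt ratio
months_reserves <- liquid_assets / monthly_payment  # months of reserves
ltv <- loan_amount / property_value                 # loan to value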

The next leap in improving credit scoring is to find ways to test interaction terms in a meaningful and principled way. It stands to reason econometrically that if any variable should have an impact on human behavior in spending, consumption, and financial distress, it should be ability to pay. Its measures are income, current debt usage, and the reserves and assets one has saved to absorb shocks or life events.

Is there a statistically reliable way to test the importance of variables relative to their predictive power?

Importance of Random Forests to Credit Risk and Economics in general

To date the majority of credit scorecards used in industry are linear models, despite the known issues of the flat maximum and multicollinearity (Wainer, 1978; Overstreet et al., 1997). Random Forests are a powerful tool for economic science, as they successfully handle correlated variables with complex interactions (Breiman, 2001).

A simple example of the power of Random Forests was given by Breiman in the binary prediction case of hepatitis mortality, in which Stanford medical school had identified variables 6, 12, 14 and 19 as most predictive of risk using logistic regression. Using the bootstrap, Efron subsequently showed that none of these variables was significant in the random resampling trials he ran. Breiman's Random Forest variable importance measure showed variables 7 and 11 to be critical, and improved the logistic regression results by simplifying the model and reducing error from 17% to 12% (Breiman, 2002).

As Random Forests are non-parametric, the linear restrictions of the flat maximum do not come into play as such. That said, predictive models tend to approach Pareto-optimal trade-offs between true positive and false positive rates, an asymptote reminiscent of the flat maximum effect. The interactions of economic variables such as macroeconomic forces and affordability are too complex to be captured by simple linear regression. Random Forests serve as a good estimate of the asymptote of possible predictive power, and help us get past psychological limits we may believe to exist, much as Roger Bannister broke the preconceived limit on the minimum time for running the mile. The way Random Forests work, building large numbers of weak classifiers on random selections of variables and growing them with out-of-sample testing, is analogous to the way humans make decisions in a marketplace (see Gigerenzer's work on "fast and frugal trees" in human judgment models). Humans each look at the data available to them, make quick inferences, and take actions based on those data. Random Forests then take votes from these large numbers of predictors and use the decisions of all of them to make the final decision. The fact that diverse models built on different variables and samples of data outperform simple linear models when combined is profound.

That said, the critical aspects of Random Forests of interest to economic scientists are the features Breiman intended, such as the following (a short R sketch appears after this list):

* Random Forests never overfit the data, as each submodel is built with out-of-sample testing

* Variable importance (a measure of the accuracy each variable contributes to the overall model, based on permutation tests that remove variables)

* The ability to see the effects of variables on predictions (Breiman, 2002)

* Efficient handling of thousands of variables by sampling variables
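A minimal sketch of obtaining these measures with the randomForest package in R (the data frame df and factor response bad are placeholders; importance = TRUE requests the permutation-based measure):

library(randomForest)
rf <- randomForest(bad ~ ., data = df, importance = TRUE, ntree = 500)
importance(rf)                         # mean decrease in accuracy and in Gini, per variable
varImpPlot(rf)                         # variables plotted in rank order of importance
partialPlot(rf, df, x.var = "income")  # marginal effect of one variable on predictions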

Random Forests help us see the true impact of complex interrelated variables. As Breiman noted in his Wald lectures, complex phenomena cannot be modeled well by simplified goodness-of-fit models. A more scientific approach is to build a model as complex as the phenomenon being studied, and then to use tools like variable importance to understand the relationships inside it (Breiman, 2002). This is an important point, as economics deals with ever more complex realities.

Comparison of Random Forests to Logistic Regression

We now examine random forest performance out of the box on 3 data sets. The first is a private label credit card data set from the 2010 Pacific-Asia Knowledge Discovery and Data Mining (PAKDD) contest, the second is the widely used home equity loan data set, and the third is a proprietary data set.

1. Random Forest vs. Logistic Regression on Credit Card Data Set

Credit Card Dataset

The credit card data set has 50,000 loans, of which 13,000 are bad (serious delinquency). Using this data set, a random forest model and a logistic regression scorecard were compared out of the box. The source for the data is the Pacific-Asia Knowledge Discovery and Data Mining conference: http://sede.neurotech.com.br/PAKDD2010/

Models

Random Forest Variable Importance: The variable importance plot for random forests showed the following variables to be predictive in rank order.

According to the random forest plot, the majority of borrower delinquency on the card can be predicted by age, monthly income, phone, payment day, type of occupation, marital status, number of dependents, area code of profession, and type of residence. Further variables add to predictive power in some fashion through interaction effects.

Logistic regression model

Insights

Note how the regression makes personal income appear statistically insignificant, although we know from the random forest that it has a great deal of predictive power.

[GRAPHIC OMITTED]

The AUC (area under the curve) for the random forest model was .629, while that of the regression model was .60: a 5% improvement for random forests over the logistic regression. By adding interaction terms suggested by the random forest's top variables, the logistic regression can be enhanced to match or slightly exceed random forest performance.
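A hedged sketch of that tuning step, reusing the split and the auc helper from the earlier sketch (the specific interaction shown is illustrative, not the paper's chosen term; the full set of hand-coded interactions appears in the appendix):

lr2 <- glm(TARGET_LABEL_BAD ~ . + AGE:PERSONAL_MONTHLY_INCOME,
           data = train, family = binomial)   # add one RF-suggested interaction
auc(predict(lr2, test, type = "response"), test$TARGET_LABEL_BAD)  # compare to the .629 benchmark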

2. Random Forest vs. Logistic Regression on Home Equity Data Set

Home Equity Dataset

The home equity data set has 5,960 loans, of which 1,189 are bad (serious delinquency). Using this data set, a random forest model and a logistic regression scorecard were compared out of the box. The source for the data is the popular SAS data set: www.sasenterpriseminer.com/data/HMEQ.xls

Models

Random Forest Variable Importance: The variable importance plot for random forests showed the following variables to be predictive in rank order. The debt ratio, age of credit history, value of the home, and delinquency history had the most predictive power according to the random forest.

[GRAPHIC OMITTED]

Insights

The regression shows the debt ratio and other variables suggested by the random forest to be statistically significant.

[GRAPHIC OMITTED]

The random forest nevertheless greatly outperforms the logistic regression scorecard on the home equity data set, showing that logistic regression does not exploit the maximum predictive value of the variables.

The AUC of the random forest was .92, while that of the logistic regression was .78: an 18% out-of-the-box advantage for random forests. A recent study that tuned logistic regression with neural network transformations achieved an AUC of .86 (Wallinga, 2009). Wallinga's generalized additive neural network logistic regression, though a powerful and well-conceived enhancement, improved performance by 28% yet still did not match the out-of-the-box performance of random forests.

3. Random Forest vs. Logistic Regression on Proprietary Data Set

Proprietary Dataset

The proprietary data set comprises credit data from 2008; bad loans are defined as loans which go 90 days past due or worse within 2 years on any account, tradeline or loan. The data has 293,421 loan applicants and 19,449 bad loans.

Models

Random Forest Variable Importance: The variable importance plot for random forests showed the following variables to be predictive in rank order.

The revolving line of credit utilization, debt ratio, income, age of applicant, number of 30-day delinquencies in 2 years, number of active/open tradelines (activity within 6 months), number of 90-day delinquent tradelines in 2 years, number of 60-day delinquent tradelines in 2 years, and number of mortgage tradelines have the most power for predicting serious borrower delinquency over up to 2 years. The attributes excluded duplicate or invalid-status tradelines.

[GRAPHIC OMITTED]

Insights

The regression does not show revolving utilization to be statistically significant, while the random forest correctly identifies it as a very predictive variable and obtains maximal predictive value from the data.

Performance

Using these 8 variables, the AUC of the random forest exceeds that of logistic regression by a large margin: 0.8522 versus 0.6964.

[GRAPHIC OMITTED]

In addition, performance was computed for a popular credit score known as FICO[R]. The credit score was superior to both regression and random forest, with an AUC of .865.

[GRAPHIC OMITTED]

IMPLICATIONS

The fact that random forests with 8 variables can produce a model which is competitive with FICO[R] out of the box is remarkable. Logistic regression does not achieve that level of performance out of the box.

This example clearly shows the random forest's superiority in scientifically rank ordering predictive variables and in optimally extracting predictive value from data with multicollinearity and interactions. The advantage of random forests depends on the strength of the relationships between variables; in data sets with few interaction effects, random forests may not outperform. On large credit data sets, behavioral models, and application scoring, random forests can improve existing credit models by 5-10% through tuning of the regression. With judgment and careful testing, the tuned logistic regression can then outperform the random forest. The example of building a random forest that is almost as predictive as the FICO[R] score (an AUC of .85 vs. .865) with only 8 variables dramatically shows the power of random forests for scientists and credit risk modelers to maximize the predictive value of data.

All 8 variables conform to theoretical soundness, as they relate to borrower cash flow surrogates. Econometrically, credit scoring variables can be segmented into cash flow variables, stability variables, and payment history variables (Overstreet, 1992). Removing the revolving utilization and delinquency behavior variables greatly reduced random forest performance, bringing it in line with logistic regression. This implies that the most predictive value lies in the interaction of the utilization and delinquency behavior attributes with the other variables. Random forests will outperform when there are complex relationships and interactions between variables that a typical regression might miss.

Explaining the Advantage of Random Forests over Logistic Regression

One explanation of how such a simple model can be competitive with FICO[R] is that credit models are thought to suffer from the flat maximum effect, which implies that models with less data can perform close to larger, more sophisticated linear models like logistic regression, because these regressors are insensitive to large variations in the size of the regression weights. The random forest advantage also seems to correlate with interaction effects and multicollinearity among variables, as the technique determines complex relationships in the data using a bootstrap of variables and samples to build ensembles of models.

The power of random forests has profound implications for taking credit risk scorecards to the next level: optimizing credit score performance and enabling better, more robust scientific inferences about factors and how they impact phenomena ranging from financial risk to consumer behavior modeling to medical science, perhaps even mimicking how humans think or behave in swarm intelligence.

Optimizing Credit Scorecards Using Random Forests: An approach

Updated Credit Card Random Forest Variable Importance with interaction terms

Mainstream credit scorers can benefit from random forest models as well. One approach to optimizing existing models is to test interaction terms built from the variables identified as most predictive by random forests. For example, using the credit card data set discussed above, adding such interaction terms improves the AUC of the logistic regression to .626, matching the random forest. Thus logistic regression can be tuned to match the out-of-the-box performance of the random forest model (and on some data sets the tuned logistic regression performs better than the random forest).

Overall process for Optimizing Existing Credit Scorecard

* SOAR (Specify data, observe data, analyze, and recommend) (Brown, 2005)

* Run Random Forest

* Take the top predictive fields and create interaction terms in the regression one at a time, retaining statistically significant interactions

* Rerun the regression and compare until it outperforms or closely matches random forest out-of-sample performance

* Run conditional inference trees to identify interactions and re-run random forest and logit models until maximal performance is achieved.

* Convert fields to factors for logit as binned data improves logit in general

* Multiply the scores from the random forest and the logistic regression, sum them, take the max, and compare areas under the curve; as predicted by Hand's superscorecard literature, multiplying the 2 scores resulted in improved performance as well (Hand et al., 2002). A sketch of these last steps follows this list.
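A minimal sketch of the tree, binning, and score-combination steps, under the same assumptions and names as the earlier sketches (p_rf and p_lr are out-of-sample probabilities of bad from the two models; party's ctree supplies the conditional inference tree):

library(party)
plot(ctree(TARGET_LABEL_BAD ~ ., data = train))   # inspect splits for candidate interactions
train$AGE_BIN <- cut(train$AGE, breaks = c(0, 25, 35, 50, 120))  # example binning for the logit
auc(p_rf * p_lr, test$TARGET_LABEL_BAD)           # Hand's multiplied superscore
auc(p_rf + p_lr, test$TARGET_LABEL_BAD)           # summed score
auc(pmax(p_rf, p_lr), test$TARGET_LABEL_BAD)      # max of the two scores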

The method of iteratively using random forests, affordability data, and logistic regression in combination with conditional inference trees to improve logistic regression until it matches or outperforms random forests is dubbed the Sharma method. For the most comprehensive review of the credit scoring literature and this approach, see Sharma, Overstreet & Beling (2009). The methods are also detailed in the Guide to Credit Scoring in R (Sharma, 2009). The pioneering work behind this was Overstreet et al. in 1992, the first theory-based free cash flow model for credit scoring, together with Breiman's work on random forests, which allowed the importance of affordability data to be seen more clearly. Prior to this, most logistic regression scorecards showed income and cash flow data to be only marginally predictive, as the p values were too high and erroneous due to multicollinearity. For details on the checkered history of credit scoring, see Sharma, Overstreet and Beling (2009).

In terms of implementation, R was used along with the Rattle data mining software. Rattle greatly facilitated the speed and ease of running the algorithms once the interaction terms were added in hand-written code and run through Rattle (see Williams, 2008, for Rattle).

Extensions

On large data sets I have been able to improve logistic regressions to match the performance of random forests using trial and error, judgment, and random forest variable importance as a base for adding interaction terms. This approach is painful and time consuming. A more viable approach would be to use random forest performance as a benchmark and automatically optimize the logistic regression on out-of-sample error, testing interactions among the most predictive variables and formulas with a genetic algorithm.
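One possible shape of such an automated search, sketched here as a simple greedy randomized search rather than a full genetic algorithm (names reuse the earlier sketches; the candidate list and acceptance rule are assumptions):

top_vars <- c("AGE", "PERSONAL_MONTHLY_INCOME", "PAYMENT_DAY", "OCCUPATION_TYPE")
cand <- combn(top_vars, 2)                  # candidate pairwise interactions
best <- auc(p_lr, test$TARGET_LABEL_BAD)    # baseline logit AUC
target <- auc(p_rf, test$TARGET_LABEL_BAD)  # random forest benchmark
kept <- character(0)
for (k in sample(ncol(cand))) {             # visit candidates in random order
  term <- paste(cand[1, k], cand[2, k], sep = ":")
  f <- reformulate(c(".", kept, term), response = "TARGET_LABEL_BAD")
  m <- glm(f, data = train, family = binomial)
  a <- auc(predict(m, test, type = "response"), test$TARGET_LABEL_BAD)
  if (a > best) { best <- a; kept <- c(kept, term) }  # keep terms that help out of sample
  if (best >= target) break                           # stop at the RF benchmark
}
kept  # interactions worth adding to the scorecard

A genetic algorithm would replace the random visiting order with populations of candidate term sets evolved by crossover and mutation, scored by the same out-of-sample AUC fitness.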

Credit scoring is a search for meaningful interaction terms, and all financial ratios are interaction terms. Hand has shown that multiplying scores always produces a better or equivalent score, itself again an example of an interaction term formed by multiplying variables (Hand, 2002). By viewing financial ratios as interactions, one can widen the lens and search for the interactions that extract optimal predictive power from affordability data. Traditional regression, with its inability to handle multicollinearity, has made searching for fruitful interaction terms in credit data problematic. Attempting too many interactions can also overfit logits. Thus a careful, knowledge-based approach is needed, which random forest variable importance measures provide. For an in-depth discussion, as well as the most comprehensive literature review of credit scoring and the overall approach, see Sharma, Overstreet and Beling (2009).

CONCLUSIONS

The best of both worlds can be achieved by finding ways to optimally enhance logistic regression using insights from random forest variable importance, which is a more reliable gauge of variable importance and relationships given the multicollinearity in all credit models and data. To date I have tuned logistic regression scorecards judgmentally, using random forest variable importance to outline interaction terms to be added to the model; the home equity dataset shows this might not be enough, as further transformations and binning of variables might be needed to squeeze optimal performance into logistic regression. Interaction terms and transformations could instead be explored via stochastic search optimization using genetic algorithms within a bounded variable space, with random forest performance as the stopping criterion. This would best be accomplished by an automated algorithm which iterates through variable interactions and combinations, mining a sample set of meaningful variables identified as predictive by the random forest that regression p values might miss. A common example of this oversight by traditional scorecards, since the time of Durand in the 1930s, is income and affordability data, which standard regressions have shown not to be predictive, flying in the face of common sense. The most successful predictive variables in the mortgage industry are all interaction terms (loan to value, months' reserves, and debt ratio; for an example of mortgage scoring see Avery et al., 1996). The history of credit scoring shows that finding optimal interaction terms is crucial to optimal predictive accuracy, and random forests play a vital role in testing meaningful variables which traditional scoring technologies such as regression failed to identify using p value tests of significance.

Human Values perspective

Credit scoring should be integrated with normative models that ensure borrower wellbeing rather than maximizing profit, as evidenced by the recent global recession. Credit score models, no matter how sophisticated, are built to predict two years of data and fail to assess the long-term impact on borrower wellbeing; that is a challenge worth studying, and such knowledge will surely lead to sustainable credit markets which do not threaten democracy and which have a robust micro-foundation for macro-markets in credit. In the aggregate, proprietary models to predict behavior are all more suboptimal than a white box credit policy which ensures borrower financial wellbeing through constraints on borrower reserves, consumption, and expenses relative to income over time. Competition in credit modeling will not lead to better consumer welfare, as credit is a commodity; financial institutions should not compete on credit policy for sustainable advantage but instead on convenience, safer products, and customization to fit borrower life stages.

Let us hope that in the future we will not need proprietary models and can live in an enlightened world where borrowers choose safe products and know the implications of their behavior for their ability to obtain more credit: an open, white box world where behavior is regulated by a desire to conform to standards which make borrowers more fiscally responsible. Credit data should be democratized and held by not-for-profit entities, as it is a social good.

APPENDIX OF DATA DESCRIPTIONS AND OPEN DATA SETS
Credit Card Dataset Original Variable Descriptions

Var_Title                        Var_Description                                                                       Field_Content

ID_CLIENT                        Sequential number for the applicant (to be used as a key)                            1-50000, 50001-70000, 70001-90000
CLERK_TYPE                       Not informed                                                                         C
PAYMENT_DAY                      Day of the month for bill payment, chosen by the applicant                           1, 5, 10, 15, 20, 25
APPLICATION_SUBMISSION_TYPE      Indicates if the application was submitted via the internet or in person/posted      Web, Carga
QUANT_ADDITIONAL_CARDS           Quantity of additional cards asked for in the same application form                  1, 2, NULL
POSTAL_ADDRESS_TYPE              Indicates if the address for posting is the home address or other; encoding          1, 2
                                 not informed
SEX                                                                                                                   M = Male, F = Female
MARITAL_STATUS                   Encoding not informed                                                                1,2,3,4,5,6,7
QUANT_DEPENDANTS                                                                                                      0, 1, 2, ...
EDUCATION_LEVEL                  Educational level in gradual order; encoding not informed                            1,2,3,4,5
STATE_OF_BIRTH                                                                                                        Brazilian states, XX, missing
CITY_OF_BIRTH
NACIONALITY                      Country of birth; encoding not informed but Brazil is likely to equal 1              0, 1, 2
RESIDENCIAL_STATE                State of residence
RESIDENCIAL_CITY                 City of residence
RESIDENCIAL_BOROUGH              Borough of residence
FLAG_RESIDENCIAL_PHONE           Indicates if the applicant possesses a home phone                                    Y, N
RESIDENCIAL_PHONE_AREA_CODE      Three-digit pseudo-code
RESIDENCE_TYPE                   Encoding not informed; in general the types are owned, mortgage, rented,             1,2,3,4,5,NULL
                                 parents, family etc.
MONTHS_IN_RESIDENCE              Time in the current residence in months                                              1, 2, ..., NULL
FLAG_MOBILE_PHONE                Indicates if the applicant possesses a mobile phone                                  Y, N
FLAG_EMAIL                       Indicates if the applicant possesses an e-mail address                               0, 1
PERSONAL_MONTHLY_INCOME          Applicant's personal regular monthly income in Brazilian currency (R$)
OTHER_INCOMES                    Applicant's other monthly incomes, averaged, in Brazilian currency (R$)
FLAG_VISA                        Flag indicating if the applicant is a VISA credit card holder                        0, 1
FLAG_MASTERCARD                  Flag indicating if the applicant is a MASTERCARD credit card holder                  0, 1
FLAG_DINERS                      Flag indicating if the applicant is a DINERS credit card holder                      0, 1
FLAG_AMERICAN_EXPRESS            Flag indicating if the applicant is an AMERICAN EXPRESS credit card holder           0, 1
FLAG_OTHER_CARDS                 Despite being labeled "FLAG", this field presents three values, not explained        0, 1, NULL
QUANT_BANKING_ACCOUNTS                                                                                                0, 1, 2
QUANT_SPECIAL_BANKING_ACCOUNTS                                                                                        0, 1, 2
PERSONAL_ASSETS_VALUE            Total value of personal possessions such as houses, cars etc., in Brazilian
                                 currency (R$)
QUANT_CARS                       Quantity of cars the applicant possesses
COMPANY                          If the applicant has supplied the name of the company where he/she formally works    Y, N
PROFESSIONAL_STATE               State where the applicant works
PROFESSIONAL_CITY                City where the applicant works
PROFESSIONAL_BOROUGH             Borough where the applicant works
FLAG_PROFESSIONAL_PHONE          Indicates if the professional phone number was supplied                              Y, N
PROFESSIONAL_PHONE_AREA_CODE     Three-digit pseudo-code
MONTHS_IN_THE_JOB                Time in the current job in months
PROFESSION_CODE                  Applicant's profession code; encoding not informed                                   1, 2, 3, ...
OCCUPATION_TYPE                  Encoding not informed                                                                1,2,3,4,5,NULL
MATE_PROFESSION_CODE             Mate's profession code; encoding not informed                                        1, 2, 3, ...
EDUCATION_LEVEL                  Mate's educational level in gradual order; encoding not informed                     1,2,3,4,5
FLAG_HOME_ADDRESS_DOCUMENT       Flag indicating documental confirmation of home address                              0, 1
FLAG_RG                          Flag indicating documental confirmation of citizen card number                       0, 1
FLAG_CPF                         Flag indicating documental confirmation of taxpayer status                           0, 1
FLAG_INCOME_PROOF                Flag indicating documental confirmation of income                                    0, 1
PRODUCT                          Type of credit product applied for; encoding not informed                            1, 2, 7
FLAG_ACSP_RECORD                 Flag indicating if the applicant has any previous credit delinquency                 Y, N
AGE                              Applicant's age at the moment of submission
RESIDENCIAL_ZIP_3                Three most significant digits of the actual home zip code
PROFESSIONAL_ZIP_3               Three most significant digits of the actual professional zip code
TARGET_LABEL_BAD                 Target variable: BAD = 1, GOOD = 0

Source: http://sede.neurotech.com.br/PAKDD2010/

HOME EQUITY DATA SET ORIGINAL VARIABLES

Name      Model Role   Measurement Level   Description

BAD       Target       Binary              1 = defaulted on loan, 0 = paid back loan
REASON    Input        Binary              HomeImp = home improvement, DebtCon = debt consolidation
JOB       Input        Nominal             Six occupational categories
LOAN      Input        Interval            Amount of loan request
MORTDUE   Input        Interval            Amount due on existing mortgage
VALUE     Input        Interval            Value of current property
DEBTINC   Input        Interval            Debt-to-income ratio
YOJ       Input        Interval            Years at present job
DEROG     Input        Interval            Number of major derogatory reports
CLNO      Input        Interval            Number of trade lines
DELINQ    Input        Interval            Number of delinquent trade lines
CLAGE     Input        Interval            Age of oldest trade line in months
NINQ      Input        Interval            Number of recent credit inquiries

Source: www.sasenterpriseminer.com/data/HMEQ.xls


APPENDIX OF R CODE

Credit Card Data Set and interactions

cc<-read.csv("C:/Documents and Settings//My Documents/cckdd2010.csv")

cc$TARGET_LABEL_BAD<-as.factor(cc$TARGET_LABEL_BAD)

cc$QUANT_DEPENDANTS<-ifelse(cc$QUANT_DEPENDANTS>=13,13,cc$QUANT_DEPENDANTS)

#cc$ZipDist<-as.numeric(cc$RESIDENCIAL_ZIP_3)- as.numeric(cc$PROFESSIONAL_ZIP_3)

#cc$StateDiff<-as.factor(ifelse(cc$RESIDENCIAL_STATE== cc$PROFESSIONAL_STATE,'Y','N'))

#cc$CityDiff<-as.factor(ifelse(cc$RESIDENCIAL_CITY==cc$PROFESSIONAL_CITY,'Y','N'))

#cc$BoroughDiff<-as.factor(ifelse(cc$RESIDENCIAL_BOROUGH==cc$PROFESSIONAL_BOROUGH,'Y','N'))

cc$MissingResidentialPhoneCode<-as.factor(ifelse(is.na(cc$RESIDENCIAL_PHONE_AREA_CODE)==TRUE,'Y','N'))

cc$MissingProfPhoneCode<-as.factor(ifelse(is.na (cc$PROFESSIONAL_PHONE_AREA_CODE)==TRUE,'Y','N'))

cc<-subset(cc,select=-ID_CLIENT)

cc<-subset(cc,select=-CLERK_TYPE)

cc<-subset(cc,select=-QUANT_ADDITIONAL_CARDS)

cc<-subset(cc,select=-EDUCATION_LEVEL)

#cc<-subset(cc,select=-STATE_OF_BIRTH)

cc<-subset(cc,select=-CITY_OF_BIRTH)

#cc<-subset(cc,select=-RESIDENCIAL_STATE)

cc<-subset(cc,select=-RESIDENCIAL_CITY)

cc<-subset(cc,select=-RESIDENCIAL_BOROUGH)

cc<-subset(cc,select=-PROFESSIONAL_STATE)

cc<-subset(cc,select=-PROFESSIONAL_CITY)

cc<-subset(cc,select=-PROFESSIONAL_BOROUGH)

cc<-subset(cc,select=-FLAG_MOBILE_PHONE)

cc<-subset(cc,select=-FLAG_HOME_ADDRESS_DOCUMENT)

cc<-subset(cc,select=-FLAG_RG)

cc<-subset(cc,select=-FLAG_CPF)

cc<-subset(cc,select=-FLAG_INCOME_PROOF)

cc<-subset(cc,select=-FLAG_ACSP_RECORD)

cc<-subset(cc,select=-TARGET_LABEL_BAD.1)

cc<-subset(cc,select=-RESIDENCIAL_ZIP_3)

cc$PROFESSIONAL_ZIP_3<-as.numeric(cc$PROFESSIONAL_ZIP_3)

cc$RESIDENCIAL_PHONE_AREA_CODE[is.na(cc$RESIDENCIAL_PHONE_AREA_CODE)] <- 0

cc$PROFESSIONAL_PHONE_AREA_CODE[is.na(cc$PROFESSIONAL_PHONE_AREA_CODE)] <- 0

cc$PROFESSION_CODE<-as.numeric(cc$PROFESSION_CODE)

cc$OCCUPATION_TYPE<-as.numeric(cc$OCCUPATION_TYPE)

cc$MATE_PROFESSION_CODE<-as.numeric(cc$MATE_PROFESSION_CODE)

cc$EDUCATION_LEVEL.1<-as.numeric(cc$EDUCATION_LEVEL.1)

cc$RESIDENCE_TYPE<-as.numeric(cc$RESIDENCE_TYPE)

cc$MONTHS_IN_RESIDENCE<-as.numeric(cc$MONTHS_IN_RESIDENCE)

cc$TotIncome<-cc$PERSONAL_MONTHLY_INCOME+cc$OTHER_INCOMES

cc$OthIncomePct<-cc$OTHER_INCOMES/cc$PERSONAL_MONTHLY_INCOME

cc$MnthsSavings<-cc$PERSONAL_ASSETS_VALUE/(.01+cc$MONTHS_IN_THE_JOB*cc$TotIncome)

cc$Afford<-cc$TotIncome+cc$PERSONAL_ASSETS_VALUE

cc$IncomeToAssets<-cc$TotIncome/(cc$PERSONAL_ASSETS_VALUE+.01)

cc$i1<-cc$QUANT_DEPENDANTS*cc$AGE

cc$i2<-cc$AGE*cc$PROFESSIONAL_ZIP_3

cc$i4<-cc$PROFESSION_CODE*cc$AGE

cc$i5<-cc$RESIDENCIAL_PHONE_AREA_CODE*cc$AGE

cc$i6<-cc$RESIDENCIAL_PHONE_AREA_CODE*cc$PROFESSIONAL_PHONE_AREA_CODE

cc$i7<-cc$RESIDENCIAL_PHONE_AREA_CODE*cc$OthIncomePct

cc$i8<-cc$RESIDENCIAL_PHONE_AREA_CODE*cc$IncomeToAssets

cc$i9<-cc$RESIDENCIAL_PHONE_AREA_CODE*cc$i1

cc$i10<-cc$RESIDENCIAL_PHONE_AREA_CODE*cc$i2

cc$i11<-cc$RESIDENCIAL_PHONE_AREA_CODE*cc$i5

cc$i12<-cc$RESIDENCIAL_PHONE_AREA_CODE*cc$OTHER_INCOMES

cc$i13<-cc$QUANT_DEPENDANTS*cc$RESIDENCIAL_PHONE_AREA_CODE

cc$i14<-cc$RESIDENCIAL_PHONE_AREA_CODE*cc$RESIDENCE_TYPE

cc$i15<-cc$RESIDENCIAL_PHONE_AREA_CODE*cc$PROFESSIONAL_ZIP_3

cc$i16<-cc$PERSONAL_MONTHLY_INCOME*cc$PROFESSIONAL_ZIP_3

cc$i17<-cc$OTHER_INCOMES*cc$PROFESSIONAL_ZIP_3

cc$i18<-cc$PROFESSIONAL_ZIP_3*cc$IncomeToAssets

cc$i19<-cc$PROFESSIONAL_ZIP_3*cc$i2

cc$i20<-cc$PROFESSIONAL_ZIP_3*cc$i5

cc$j1<-cc$MONTHS_IN_RESIDENCE*cc$EDUCATION_LEVEL.1

cc$j2<-cc$MONTHS_IN_RESIDENCE*cc$QUANT_CARS

cc$j3<-cc$MARITAL_STATUS*cc$MONTHS_IN_RESIDENCE

cc$j4<-cc$QUANT_CARS*cc$i12

cc$j5<-cc$FLAG_MASTERCARD*cc$i5

cc$j6<-cc$QUANT_CARS*cc$i2

cc$j7<-cc$FLAG_MASTERCARD*cc$i10

cc$j8<-cc$QUANT_CARS*cc$i19

cc$j9<-cc$QUANT_CARS*cc$OthIncomePct

cc$j10<-cc$NACIONALITY*cc$QUANT_CARS

cc$j11<-as.factor(ifelse(cc$FLAG_RESIDENCIAL_PHONE=='Y',cc$FLAG_MASTERCARD,'O'))

cc$j12<-cc$QUANT_CARS*cc$i7

cc$j13<-cc$MARITAL_STATUS*cc$j3

cc$j14<-cc$PAYMENT_DAY*cc$j5

cc$j15<-cc$PAYMENT_DAY*cc$j7

cc$j16<-cc$QUANT_CARS*cc$OCCUPATION_TYPE

cc$j17<-cc$OCCUPATION_TYPE*cc$j9

cc$j18<-as.factor(ifelse(cc$j11=='1',cc$OCCUPATION_TYPE,'O'))

cc$j19<-cc$AGE*cc$i2

cc$j20<-cc$OthIncomePct*cc$i2

cc$j21<-cc$i2*cc$i7

cc$j22<-cc$i2*cc$i10

cc$j23<-cc$i2*cc$i15

cc$j24<-cc$i2*cc$j1

cc$j25<-cc$i2*cc$j2

cc$j26<-cc$RESIDENCE_TYPE*cc$AGE

cc$j27<-cc$RESIDENCE_TYPE*cc$i4

cc$j28<-cc$RESIDENCE_TYPE*cc$i7

cc$j29<-cc$RESIDENCE_TYPE*cc$PROFESSION_CODE

cc$j30<-cc$PROFESSION_CODE*cc$PRODUCT

cc$j31<-cc$PRODUCT*cc$i6

cc$k1<-as.factor(ifelse(cc$AGE<=18 & cc$PAYMENT_DAY<=15,'Y','N'))

cc$k2<-as.factor(ifelse(cc$AGE>18 & cc$PAYMENT_DAY<=15,'Y','N'))

cc$k3<-as.factor(ifelse(cc$AGE>21 & cc$PAYMENT_DAY>15,'Y','N'))

cc$k4<-as.factor(ifelse(cc$AGE<=21 & cc$PAYMENT_DAY> 15,'Y','N'))

cc$k5<-as.factor(ifelse(cc$AGE<=46 & cc$AGE>32 & cc$j11!='O' & cc$PAYMENT_DAY<=10 & cc$SEX!='F','Y','N'))

cc$k6<-as.factor(ifelse(cc$AGE<=46 & cc$AGE>32 & cc$j11!='O' & cc$PAYMENT_DAY<=10 & cc$SEX=='F','Y','N'))

cc$k7<-as.factor(ifelse(cc$AGE<=46 & cc$AGE>32 & cc$j11!='O' & cc$PAYMENT_DAY>10 & cc$SEX!='F','Y','N'))

cc$k8<-as.factor(ifelse(cc$AGE<=46 & cc$AGE>32 & cc$j11!='O' & cc$PAYMENT_DAY>10 & cc$SEX=='F' & cc$j30<=40,'Y','N'))

cc$k8a<-as.factor(ifelse(cc$AGE<=46 & cc$AGE>32 & cc$j11!='O' & cc$PAYMENT_DAY>10 & cc$SEX=='F' & cc$j30>40,'Y','N'))

cc$k9<-as.factor(ifelse(cc$AGE<=46 & cc$AGE>32 & cc$j11=='O' & cc$MissingProfPhoneCode!='N','Y','N'))

cc$k10<-as.factor(ifelse(cc$AGE<=46 & cc$AGE>32 & cc$j11=='O' & cc$MissingProfPhoneCode=='Y','Y','N'))

cc$k11<-as.factor(ifelse(cc$AGE>46 & cc$j11=='O' & cc$FLAG_PROFESSIONAL_PHONE=='Y','Y','N'))

cc$k12<-as.factor(ifelse(cc$AGE>46 & cc$j11=='O' & cc$FLAG_PROFESSIONAL_PHONE=='N' & cc$j16<=0 & cc$PAYMENT_DAY<=20,'Y','N'))

cc$k13<-as.factor(ifelse(cc$AGE>46 & cc$j11=='O' & cc$FLAG_PROFESSIONAL_PHONE=='N' & cc$j16<=0 & cc$PAYMENT_DAY>20,'Y','N'))

#cc$k14<-as.factor(ifelse(cc$AGE>46 & cc$j11=='O' & cc$FLAG_PROFESSIONAL_PHONE=='N' & cc$j16>0,'Y','N'))

cc$k15<-as.factor(ifelse(cc$AGE>46 & cc$AGE<=52 & cc$j11!='O' ,'Y','N'))

cc$k16<-as.factor(ifelse(cc$AGE>52 & cc$j11!='O' & cc$PAYMENT_DAY<=15 & cc$i11<=271633 & cc$j5<=1220 ,'Y','N'))

cc$k17<-as.factor(ifelse(cc$AGE>52 & cc$j11!='O' & cc$PAYMENT_DAY<=15 & cc$i11<=271633 & cc$j5>1220 ,'Y','N'))

cc$k18<-as.factor(ifelse(cc$AGE>52 & cc$j11!='O' & cc$PAYMENT_DAY<=15 & cc$i11>271633 ,'Y','N'))

cc$k19<-as.factor(ifelse(cc$AGE>52 & cc$j11!='O' & cc$PAYMENT_DAY>15 ,'Y','N')) # a second k18 in the source would overwrite the line above

#logit

m<-glm(TARGET_LABEL_BAD~.,data=cc,family=binomial)

cc<-subset(cc,select=-j1)

cc<-subset(cc,select=-j2)

cc<-subset(cc,select=-j3)

cc<-subset(cc,select=-j4)

cc<-subset(cc,select=-j5)

cc<-subset(cc,select=-j6)

cc<-subset(cc,select=-j7)

cc<-subset(cc,select=-j8)

cc<-subset(cc,select=-j9)

cc<-subset(cc,select=-j10)

cc<-subset(cc,select=-j11)

cc<-subset(cc,select=-j12)

cc<-subset(cc,select=-j13)

cc<-subset(cc,select=-j14)

cc<-subset(cc,select=-j15)

cc<-subset(cc,select=-j16)

cc<-subset(cc,select=-j17)

cc<-subset(cc,select=-j18)

cc<-subset(cc,select=-j19)

cc<-subset(cc,select=-j20)

cc<-subset(cc,select=-j21)

cc<-subset(cc,select=-j22)

cc<-subset(cc,select=-j23)

cc<-subset(cc,select=-j24)

cc<-subset(cc,select=-j25)

cc<-subset(cc,select=-j26)

cc<-subset(cc,select=-j27)

cc<-subset(cc,select=-j28)

cc<-subset(cc,select=-j29)

cc<-subset(cc,select=-j30)

cc<-subset(cc,select=-j31)

cc<-subset(cc,select=-i1)

cc<-subset(cc,select=-i2)

#cc<-subset(cc,select=-i3) # no i3 interaction is created above; dropping it would error

cc<-subset(cc,select=-i4)

cc<-subset(cc,select=-i5)

cc<-subset(cc,select=-i6)

cc<-subset(cc,select=-i7)

cc<-subset(cc,select=-i8)

cc<-subset(cc,select=-i9)

cc<-subset(cc,select=-i10)

cc<-subset(cc,select=-i11)

cc<-subset(cc,select=-i12)

cc<-subset(cc,select=-i13)

cc<-subset(cc,select=-i14)

cc<-subset(cc,select=-i15)

cc<-subset(cc,select=-i16)

cc<-subset(cc,select=-i17)

cc<-subset(cc,select=-i18)

cc<-subset(cc,select=-i19)

cc<-subset(cc,select=-i20)

Most work done in Rattle.

Home Equity Data Set R

#sas home equity data set

#www.sasenterpriseminer.com/data/HMEQ.xls

#Wielenga, D., Lucas, B. and Georges, J. (1999), Enterprise Miner(TM): Applying Data Mining Techniques Course Notes, SAS Institute Inc., Cary, NC.

cc<-read.csv("C:/Documents and Settings/ My Documents/HMEQ.csv")

cc$BAD<-as.factor(cc$BAD)

cc$LTV<-(cc$LOAN+cc$MORTDUE)*100/cc$VALUE

cc$JOB<-as.factor(cc$JOB)
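The appendix stops after feature preparation; a minimal sketch of the model fits reported in the body follows (70/30 split and ROCR AUC as in the earlier sketch; the seed and the missing-value handling are assumptions, not the paper's exact Rattle configuration):

library(randomForest)
library(ROCR)
set.seed(42)
cc <- na.omit(cc)                       # simplest treatment of HMEQ's many missing values
idx <- sample(nrow(cc), 0.7 * nrow(cc))
train <- cc[idx, ]; test <- cc[-idx, ]
rf <- randomForest(BAD ~ ., data = train, importance = TRUE)
lr <- glm(BAD ~ ., data = train, family = binomial)
auc <- function(p, y) performance(prediction(p, y), "auc")@y.values[[1]]
auc(predict(rf, test, type = "prob")[, 2], test$BAD)  # the body reports ~.92 for the forest
auc(predict(lr, test, type = "response"), test$BAD)   # vs ~.78 for the logit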

REFERENCES

Avery, Robert B., Raphael W. Bostic, Paul S. Calem, and Glenn Canner (1996). "Credit risk, credit scoring, and the performance of home mortgages," The Federal Reserve Bulletin, Vol. 82, No. 7, pp. 621-648.

Breiman, L. (2002). Wald Lecture 2: Looking Inside the Black Box. Retrieved from www.stat.berkeley.edu/users/breiman/wald2002-2.pdf

Brown, Don (2005). Linear Models. Unpublished manuscript, University of Virginia.

Overstreet, G.A. & Kemp, R.S. (1986). Managerial control in credit scoring systems. Journal of Retail Banking.

Overstreet, G.A.J., Bradley, E.L. & Kemp, R.S. (1992). The flat-maximum effect and generic linear scoring models: a test. IMA Journal of Mathematics Applied in Business & Industry, 4(1), 97-109.

Sharma, D. (2009). Guide to Credit Scoring in R. Retrieved from http://cran.r-project.org/doc/contrib/SharmaCreditScoring.pdf

Sharma, D., Overstreet, G. & Beling, P. (2009). Not If Affordability Data Adds Value but How to Add Real Value by Leveraging Affordability Data: Enhancing Predictive Capability of Credit Scoring Using Affordability Data. CAS (Casualty Actuarial Society) Working Paper. Retrieved from http://www.casact.org/research/wp/index.cfm?fa=workingpapers

Wielenga, D., Lucas, B. & Georges, J. (2009). Enterprise Miner(TM): Applying Data Mining Techniques. SAS Institute Inc., Cary, NC. http://www.crc.man.ed.ac.uk/conference/archive/2009/presentations/Paper-11-Paper.pdf

Williams, Graham (2008). Desktop Guide to Data Mining. Retrieved from http://www.togaware.com/datamining/survivor/

Dhruv Sharma, Independent Scholar
Logistic regression model

Variable                           Estimate    Std. Error   z value   Pr(>|z|)   Signif.

(Intercept)                         0.367793    1.008929      0.365   0.715457
PAYMENT_DAY                         0.019525    0.001859     10.505   < 2e-16    ***
APPLICATION_SUBMISSION_TYPECarga   -0.30733     0.093499     -3.287   0.001012   **
APPLICATION_SUBMISSION_TYPEWeb     -0.09412     0.059745     -1.575   0.115164
POSTAL_ADDRESS_TYPE                 0.028979    0.151555      0.191   0.848362
SEXF                               -1.01841     0.611657     -1.665   0.095913
SEXM                               -0.83388     0.611716     -1.363   0.172827
SEXN                               -0.96122     0.717716     -1.339   0.180481
MARITAL_STATUS                     -0.01168     0.009773     -1.195   0.231949
QUANT_DEPENDANTS                    0.020479    0.010485      1.953   0.050805
NACIONALITY                         0.06682     0.071782      0.931   0.351919
FLAG_RESIDENCIAL_PHONEY            -0.82105     0.720136     -1.14    0.254232
RESIDENCIAL_PHONE_AREA_CODE         0.000984    0.000398      2.473   0.013386   *
RESIDENCE_TYPE                     -0.01853     0.010731     -1.727   0.084166
FLAG_EMAIL                          0.017635    0.046639      0.378   0.705338
PERSONAL_MONTHLY_INCOME             5.77E-07    1.48E-06      0.39    0.69655
OTHER_INCOMES                       1.78E-05    1.83E-05      0.971   0.331516
FLAG_VISA                           0.074469    0.042835      1.738   0.082123
FLAG_MASTERCARD                    -0.2261      0.046838     -4.827   1.38E-06   ***
FLAG_DINERS                         0.284333    0.33461       0.85    0.395467
FLAG_AMERICAN_EXPRESS              -0.06287     0.303154     -0.207   0.835701
FLAG_OTHER_CARDS                   -0.05443     0.299613     -0.182   0.85585
QUANT_BANKING_ACCOUNTS             -0.00642     0.058358     -0.11    0.912419
QUANT_SPECIAL_BANKING_ACCOUNTS        NA          NA            NA      NA
PERSONAL_ASSETS_VALUE              -4.4E-08     3.3E-07      -0.134   0.893551
QUANT_CARS                         -0.02769     0.101239     -0.274   0.784451
COMPANYY                           -0.06863     0.031724     -2.163   0.030512   *
FLAG_PROFESSIONAL_PHONEY            0.714282    0.617823      1.156   0.247629
PROFESSIONAL_PHONE_AREA_CODE       -0.00056     0.000721     -0.782   0.434309
MONTHS_IN_THE_JOB                  -0.06383     0.055293     -1.154   0.24833
OCCUPATION_TYPE                     0.026602    0.007269      3.66    0.000252   ***
MATE_PROFESSION_CODE               -0.00679     0.004226     -1.606   0.108175
EDUCATION_LEVEL.1                   0.000307    0.019259      0.016   0.987279
PRODUCT                             0.034652    0.011976      2.894   0.00381    **
AGE                                -0.01968     0.000975    -20.193   <2e-16     ***
MissingResidentialPhoneCodeY       -0.17667     0.720139     -0.245   0.806201
MissingProfPhoneCodeY               0.804464    0.619538      1.298   0.194119

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Residual deviance: 39312 on 34964 degrees of freedom
AIC: 39384

Number of Fisher Scoring iterations: 4
Log likelihood: -19655.757 (36 df)
Null/Residual deviance difference: 906.696 (35 df)
Chi-square p-value: 0.00000000

Home Equity Logistic Regression

Variable         Estimate       Std. Error   z value   Pr(>|z|)   Signif.

(Intercept)      -17.07851715   524.8953     -0.033    0.974044
LOAN              0.000001803    1.58E-05     0.114    0.909386
MORTDUE           0.000019897    1.26E-05     1.576    0.115104
VALUE            -0.000016881    1.13E-05    -1.501    0.133474
REASONDebtCon    -0.621936258    0.635508    -0.979    0.327756
REASONHomeImp    -0.753124539    0.647779    -1.163    0.244982
JOBMgr           14.79633915    524.8937      0.028    0.977511
JOBOffice        14.22345444    524.8938      0.027    0.978382
JOBOther         14.67559173    524.8937      0.028    0.977695
JOBProfExe       14.83695589    524.8937      0.028    0.97745
JOBSales         15.91157826    524.8939      0.03     0.975817
JOBSelf          15.92432142    524.8939      0.03     0.975797
YOJ              -0.005838696    0.012261    -0.476    0.63394
DEROG             0.802276888    0.123378     6.503    7.89E-11   ***
DELINQ            0.817538124    0.085566     9.554    < 2e-16    ***
CLAGE            -0.00580995     0.001307    -4.445    8.79E-06   ***
NINQ              0.155918992    0.042991     3.627    0.000287   ***
CLNO             -0.027956215    0.009666    -2.892    0.003827
DEBTINC           0.101303396    0.012958     7.818    5.38E-15   ***
LTV              -0.020306186    0.011472    -1.77     0.076707

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1472.1 on 2431 degrees of freedom
Residual deviance: 1090.8 on 2412 degrees of freedom
(1740 observations deleted due to missingness)
AIC: 1130.8

Number of Fisher Scoring iterations: 16
Log likelihood: -545.408 (20 df)
Null/Residual deviance difference: 381.3 (19 df)
Chi-square p-value: 0.0000000

Logistic regression model

Variable                        Estimate    Std. Error   z value   Pr(>|z|)   Signif.

(Intercept)                     -1.33216    0.031749     -41.959   <2e-16     ***
trades                          -0.00552    0.001959      -2.819   0.00482    **
30 number dlq (not worse)        0.501922   0.008535      58.807   <2e-16     ***
60 number dlq (not worse)       -0.94516    0.014058     -67.233   <2e-16     ***
90 day number dlq (not worse)    0.478619   0.012085      39.605   <2e-16     ***
mtg_trd_lines                    0.095229   0.00804       11.844   <2e-16     ***
monthly income                  -3.6E-05    2.29E-06     -15.642   <2e-16     ***
age                             -0.02729    0.000655     -41.631   <2e-16     ***
revolving balance util           2.55E-05   2.95E-05       0.865   0.38731
DebtRatio                       -0.00015    3.61E-05      -4.222   2.42E-05   ***

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 118040 on 235272 degrees of freedom
Residual deviance: 108889 on 235263 degrees of freedom
(58148 observations deleted due to missingness)
AIC: 108909

Number of Fisher Scoring iterations: 6
Log likelihood: -54444.557 (10 df)
Null/Residual deviance difference: 9151.24 (9 df)
Chi-square p-value: 0.00000000