Credit Scoring: An Introduction

Credit scoring refers to the analysis lenders and banks use to decide whether or not to extend credit to a customer. It has many applications within the financial industry: it automates a substantial part of the underwriting process and allows lenders to offer more sophisticated products to end customers (risk-based pricing is one example).

What I’m going to do is show the versatility of R once you have a dataset that is ready for analysis and modelling. Obviously, as is the case with most applications, the hard part comes earlier, when you have to collect the data, define what a bad customer is, and so on.

This example also illustrates how easy it is to build your own functions to perform the computations required to fit a model. R has many functions, spread across different packages, that help you apply your own functions to datasets with a great deal of control over inputs and outputs.

The Data

The dataset comes from the Kaggle competition Give Me Some Credit and consists of 150,000 observations with 10 predictors. We read and inspect the data, and rename the columns to shorter, more understandable names of our liking.
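The code throughout this post relies on a handful of packages that are loaded but not shown in this excerpt; a reasonable set, given the functions used below, is:

library(readr)      # read_csv
library(tibble)     # as_tibble
library(dplyr)      # glimpse and the data-manipulation verbs
library(purrr)      # map, safely
library(ggplot2)    # plots
library(caret)      # createDataPartition, train, confusionMatrix, varImp
library(DescTools)  # Desc
library(Hmisc)      # cut2
library(varhandle)  # unfactor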

data <- read_csv("C:/Users/galcala/Loan/cs-training.csv")
data.tbl <- as_tibble(data.frame(data))
typeof(data.tbl)
## [1] "list"
glimpse(data.tbl)
## Observations: 150,000
## Variables: 12
## $ X1                                   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9...
## $ SeriousDlqin2yrs                     <int> 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ RevolvingUtilizationOfUnsecuredLines <dbl> 0.76612661, 0.95715102, 0...
## $ age                                  <int> 45, 40, 38, 30, 49, 74, 5...
## $ NumberOfTime30.59DaysPastDueNotWorse <int> 2, 0, 1, 0, 1, 0, 0, 0, 0...
## $ DebtRatio                            <dbl> 8.029821e-01, 1.218762e-0...
## $ MonthlyIncome                        <int> 9120, 2600, 3042, 3300, 6...
## $ NumberOfOpenCreditLinesAndLoans      <int> 13, 4, 2, 5, 7, 3, 8, 8, ...
## $ NumberOfTimes90DaysLate              <int> 0, 0, 1, 0, 0, 0, 0, 0, 0...
## $ NumberRealEstateLoansOrLines         <int> 6, 0, 0, 0, 1, 1, 3, 0, 0...
## $ NumberOfTime60.89DaysPastDueNotWorse <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NumberOfDependents                   <int> 2, 1, 0, 0, 0, 1, 0, 0, N...
data.tbl$X1 <- NULL
colnames(data.tbl) <- c("bad","revUtilUnsec","age","n30_59dpd_nw","debtRatio","mIncome","nOpenTrades",
                        "n90dlate","nREstate","n60_89dpd_nw","nDependents")

Alright, the data seems to have been read correctly. Let us split it into training and testing sets for modelling purposes and work our way through the data.
A very useful tool for inspecting variables one at a time is the Desc function from the DescTools package, as it gives you many relevant statistics as well as useful plots for understanding the distributions.

set.seed(43)
inTrain <- createDataPartition(y=data.tbl$bad,p = .8, list = FALSE)
training <- data.tbl[inTrain,]
testing <- data.tbl[-inTrain,]
Desc(training)
## ------------------------------------------------------------------------- 
## Describe training (tbl_df, tbl, data.frame):
## 
## data.frame:  120000 obs. of  11 variables
## 
##   Nr  ColName       Class    NAs            Levels
##   1   bad           integer      .                
##   2   revUtilUnsec  numeric      .                
##   3   age           integer      .                
##   4   n30_59dpd_nw  integer      .                
##   5   debtRatio     numeric      .                
##   6   mIncome       integer  23800 (19.8%)        
##   7   nOpenTrades   integer      .                
##   8   n90dlate      integer      .                
##   9   nREstate      integer      .                
##   10  n60_89dpd_nw  integer      .                
##   11  nDependents   integer   3124 (2.6%)         
## 
## 
## ------------------------------------------------------------------------- 
## 1 - bad (integer - dichotomous)
## 
##    length       n     NAs  unique
##   120'000 120'000       0       2
##            100.0%    0.0%        
## 
##       freq   perc  lci.95  uci.95'
## 0  112'011  93.3%   93.2%   93.5%
## 1    7'989   6.7%    6.5%    6.8%
## 
## ' 95%-CI Wilson

## ------------------------------------------------------------------------- 
## 2 - revUtilUnsec (numeric)
## 
##              length               n            NAs        unique
##             120'000         120'000              0       100'932
##                              100.0%           0.0%              
##                                                                 
##                 .05             .10            .25        median
##        0.0000000000    0.0028972975   0.0296432087  0.1534955235
##                                                                 
##               range              sd          vcoef           mad
##   20'514.0000000000  187.4560832479  34.7373068442  0.2190202689
##                                                                 
##             0s           mean            meanCI
##          8'783   5.3963908051      4.3357669593
##           7.3%                     6.4570146509
##                                                
##            .75            .90               .95
##   0.5570617472   0.9811474900      0.9999999000
##                                                
##            IQR           skew              kurt
##   0.5274185385  57.3508264716  4'203.3035507492
##                                                
## lowest : 0.0 (8'783), 0.00000993, 0.0000125, 0.0000143, 0.0000149
## highest: 13'400.0, 13'498.0, 17'441.0, 18'300.0, 20'514.0

## ------------------------------------------------------------------------- 
## 3 - age (integer)
## 
##    length        n    NAs  unique     0s   mean  meanCI
##   120'000  120'000      0      85      1  52.31   52.23
##             100.0%   0.0%           0.0%          52.40
##                                                        
##       .05      .10    .25  median    .75    .90     .95
##     29.00    33.00  41.00   52.00  63.00  72.00   78.00
##                                                        
##     range       sd  vcoef     mad    IQR   skew    kurt
##    109.00    14.78   0.28   16.31  22.00   0.19   -0.50
##                                                        
## lowest : 0, 21 (160), 22 (333), 23 (543), 24 (635)
## highest: 101 (2), 102, 103 (3), 107, 109 (2)

## ------------------------------------------------------------------------- 
## 4 - n30_59dpd_nw (integer)
## 
##    length        n    NAs  unique       0s   mean  meanCI
##   120'000  120'000      0      16  100'908   0.42    0.40
##             100.0%   0.0%            84.1%           0.45
##                                                          
##       .05      .10    .25  median      .75    .90     .95
##      0.00     0.00   0.00    0.00     0.00   1.00    2.00
##                                                          
##     range       sd  vcoef     mad      IQR   skew    kurt
##     98.00     4.24  10.02    0.00     0.00  22.38  511.90
##                                                          
## lowest : 0 (100'908), 1 (12'766), 2 (3'668), 3 (1'397), 4 (602)
## highest: 11, 12 (2), 13, 96 (4), 98 (216)

## ------------------------------------------------------------------------- 
## 5 - debtRatio (numeric)
## 
##               length                 n           NAs        unique
##              120'000           120'000             0        93'091
##                                 100.0%          0.0%              
##                                                                   
##                  .05               .10           .25        median
##         0.0041962801      0.0303933134  0.1746868008  0.3655131565
##                                                                   
##                range                sd         vcoef           mad
##   329'664.0000000000  1'962.6860436419  5.5812712114  0.3634767527
##                                                                   
##             0s              mean             meanCI
##          3'341    351.6557374292     340.5508877834
##           2.8%                       362.7605870751
##                                                    
##            .75               .90                .95
##   0.8686527818  1'265.0000000000   2'447.0000000000
##                                                    
##            IQR              skew               kurt
##   0.6939659810     92.7561410895  13'505.0802601446
##                                                    
## lowest : 0.0 (3'341), 0.000026, 0.0000369, 0.0000393, 0.000075
## highest: 106'885.0, 168'835.0, 220'516.0, 307'001.0, 329'664.0

## ------------------------------------------------------------------------- 
## 6 - mIncome (integer)
## 
##         length          n       NAs    unique        0s       mean
##        120'000     96'200    23'800    12'652     1'293   6'686.68
##                     80.2%     19.8%                1.1%           
##                                                                   
##            .05        .10       .25    median       .75        .90
##       1'300.00   2'000.00  3'400.00  5'400.00  8'250.00  11'666.00
##                                                                   
##          range         sd     vcoef       mad       IQR       skew
##   3'008'750.00  15'532.50      2.32  3'445.56  4'850.00     111.82
##                                                                   
##      meanCI
##    6'588.53
##    6'784.83
##            
##         .95
##   14'600.00
##            
##        kurt
##   17'882.65
##            
## lowest : 0 (1'293), 1 (488), 2 (6), 4 (2), 5 (2)
## highest: 835'040, 1'072'500, 1'560'100, 1'794'060, 3'008'750

## ------------------------------------------------------------------------- 
## 7 - nOpenTrades (integer)
## 
##    length        n    NAs  unique     0s   mean  meanCI
##   120'000  120'000      0      58  1'528   8.45    8.42
##             100.0%   0.0%           1.3%           8.48
##                                                        
##       .05      .10    .25  median    .75    .90     .95
##      2.00     3.00   5.00    8.00  11.00  15.00   18.00
##                                                        
##     range       sd  vcoef     mad    IQR   skew    kurt
##     58.00     5.15   0.61    4.45   6.00   1.21    3.05
##                                                        
## lowest : 0 (1'528), 1 (3'585), 2 (5'322), 3 (7'266), 4 (9'248)
## highest: 53, 54 (3), 56 (2), 57, 58

## ------------------------------------------------------------------------- 
## 8 - n90dlate (integer)
## 
##    length        n    NAs  unique       0s   mean  meanCI
##   120'000  120'000      0      19  113'346   0.27    0.25
##             100.0%   0.0%            94.5%           0.29
##                                                          
##       .05      .10    .25  median      .75    .90     .95
##      0.00     0.00   0.00    0.00     0.00   0.00    1.00
##                                                          
##     range       sd  vcoef     mad      IQR   skew    kurt
##     98.00     4.22  15.61    0.00     0.00  22.84  526.08
##                                                          
## lowest : 0 (113'346), 1 (4'158), 2 (1'255), 3 (537), 4 (235)
## highest: 14 (2), 15 (2), 17, 96 (4), 98 (216)

## ------------------------------------------------------------------------- 
## 9 - nREstate (integer)
## 
##    length        n    NAs  unique      0s  mean  meanCI
##   120'000  120'000      0      28  45'027  1.02    1.01
##             100.0%   0.0%           37.5%          1.02
##                                                        
##       .05      .10    .25  median     .75   .90     .95
##      0.00     0.00   0.00    1.00    2.00  2.00    3.00
##                                                        
##     range       sd  vcoef     mad     IQR  skew    kurt
##     54.00     1.14   1.12    1.48    2.00  3.79   70.98
##                                                        
## lowest : 0 (45'027), 1 (41'850), 2 (25'172), 3 (5'022), 4 (1'743)
## highest: 25 (3), 26, 29, 32, 54

## ------------------------------------------------------------------------- 
## 10 - n60_89dpd_nw (integer)
## 
##    length        n    NAs  unique       0s   mean  meanCI
##   120'000  120'000      0      12  113'950   0.24    0.22
##             100.0%   0.0%            95.0%           0.27
##                                                          
##       .05      .10    .25  median      .75    .90     .95
##      0.00     0.00   0.00    0.00     0.00   0.00    1.00
##                                                          
##     range       sd  vcoef     mad      IQR   skew    kurt
##     98.00     4.20  17.23    0.00     0.00  23.08  533.83
##                                                          
## 
##     level     freq   perc  cumfreq  cumperc
## 1       0  113'950  95.0%  113'950    95.0%
## 2       1    4'559   3.8%  118'509    98.8%
## 3       2      888   0.7%  119'397    99.5%
## 4       3      251   0.2%  119'648    99.7%
## 5       4       80   0.1%  119'728    99.8%
## 6       5       29   0.0%  119'757    99.8%
## 7       6       11   0.0%  119'768    99.8%
## 8       7        9   0.0%  119'777    99.8%
## 9       8        2   0.0%  119'779    99.8%
## 10     11        1   0.0%  119'780    99.8%
## 11     96        4   0.0%  119'784    99.8%
## 12     98      216   0.2%  120'000   100.0%

## ------------------------------------------------------------------------- 
## 11 - nDependents (integer)
## 
##    length        n    NAs  unique      0s  mean  meanCI
##   120'000  116'876  3'124      13  69'625  0.76    0.75
##              97.4%   2.6%           58.0%          0.76
##                                                        
##       .05      .10    .25  median     .75   .90     .95
##      0.00     0.00   0.00    0.00    1.00  2.00    3.00
##                                                        
##     range       sd  vcoef     mad     IQR  skew    kurt
##     20.00     1.12   1.48    0.00    1.00  1.60    3.20
##                                                        
## lowest : 0 (69'625), 1 (21'010), 2 (15'561), 3 (7'588), 4 (2'303)
## highest: 8 (20), 9 (5), 10 (4), 13, 20

#Income and Dependents have missing values (20% and 2.6%)
#6.7% delinquency rate
#Revolving highly skewed to the right. Needs trimming
#Age: We have a "0" age, 21 is our real minimum.
#n30_59dpd_nw: 84% zeros, skewed to the right
#debtRatio: 2.7% zeros. Highly skewed.
#mIncome: 1.3% zeros, highly skewed. 20% missing. Some negative.
#nOpenTrades: 1.3% zeros. slightly skewed. Integer variable.
#n90dlate: 94.4% zeros, highly skewed
#nREstate: 37.5% zeros, 33% 1s and 2+ the rest.
#n60_89dpd_nw: 94.9% zeros. skewed to the right.
#nDependents: 58% zeros. some missing (2.6%) outliers to the right.

Highly skewed data seems to be the norm, along with other undesirable properties such as censoring, missing values, and variables heavily concentrated in just one or two values (mainly zero). But this is typical of real-world credit scoring variables. What we are going to do next is explore the relationship between our target variable, bad, and each characteristic. Let us do this visually through a plot.
We are going to partition each variable into 10 groups (or fewer, if the data does not allow for 10 groups), plot the distribution of the data as bars, and plot the bad rate for each bin as a red line; a sketch of such a plotting function is shown below. These plots help us understand the relationship between each variable and the target.
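The function that produces these plots is not reproduced in this excerpt; the following is only a minimal sketch of what such a function (called graf_biv later in the post) might look like, assuming quantile binning with cut2 and a ggplot2 chart with the bin shares as bars and the bad rate as a red line.

# Hypothetical sketch of the bivariate plot used later via graf_biv(); the original
# definition is not shown in the post, so treat this as an illustration only.
graf_biv <- function(variable, datos = training, target = "bad", grupos = 10){
  df <- data.frame(x = cut2(datos[[variable]], g = grupos),   # quantile bins (fewer if ties)
                   bad = datos[[target]])
  resumen <- df %>%
    group_by(x) %>%
    dplyr::summarise(pct = n() / nrow(df), badRate = mean(bad)) %>%
    ungroup()
  ggplot(resumen, aes(x = x)) +
    geom_col(aes(y = pct)) +                                   # share of observations per bin
    geom_line(aes(y = badRate, group = 1), colour = "red") +   # bad rate per bin
    geom_point(aes(y = badRate), colour = "red") +
    xlab(variable) + ylab("share of observations / bad rate") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
}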

In a credit scoring setting, it is most common to transform each characteristic into a discrete variable through its Weight of Evidence (WoE) transformation, best explained here. The advantage of this formulation is that it gives a coding in which the predicted risk is monotone in the coded value, and it is easily translated into points to add up in a scorecard once the model is fitted.
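Concretely, for each bin i of a characteristic, the WoE computed below is

WoE_i = 100 * ln( (Good_i / TotalGood) / (Bad_i / TotalBad) )

that is, the log of the bin's share of goods relative to its share of bads; the factor of 100 is just a scaling convenience used in the code that follows.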

To do that, let us first define X, our design matrix, with the variables binned at the cut points of interest. We also deal with the NAs explicitly to obtain better results.

training$mIncome[training$mIncome<0] <- 0

X.tbl <- as_tibble(data.frame(list(bad=training$bad,
                         revUtilUnsec = cut2(training$revUtilUnsec,g=15),
                         age = cut2(training$age,g=13),
                         n30_59dpd_nw = cut2(training$n30_59dpd_nw,cuts=c(0,1,2,3)),
                         mIncome=cut2(training$mIncome,cuts=c(0,500,1250,1500,2000,3000,4000,5000,6000,7000,8000,9000)),
                         debtRatio = cut2(training$debtRatio,cuts=c(0,0.00001,.4,.6,.7,.8,.9,1)),
                         nOpenTrades = cut2(training$nOpenTrades,cuts=c(0,1,2,3,4,5)),
                         n90dlate = cut2(training$n90dlate,cuts=c(0,1,2,3)),  
                         nREstate = cut2(training$nREstate,cuts=c(0,1,2,3)),
                         n60_89dpd_nw = cut2(training$n60_89dpd_nw,cuts=c(0,1,2,3)),
                         nDependents = cut2(training$nDependents,cuts=c(0,1,2,3,4))
)
))
# Recode the NAs as "-1" so that missing values are kept as an explicit factor level
X.tbl$nDependents<-factor(X.tbl$nDependents, levels = levels(addNA(X.tbl$nDependents)), labels = c(levels(X.tbl$nDependents), -1), exclude = NULL)
X.tbl$mIncome<-factor(X.tbl$mIncome, levels = levels(addNA(X.tbl$mIncome)), labels = c(levels(X.tbl$mIncome), -1), exclude = NULL)

Now we are ready to compute the WoE of each variable and transform the variables accordingly. We are going to use two functions to do this (and show with an example that they work):

computaWoE <- function(variable, datos = X.tbl, target = "bad"){
  
  # work on generic copies of the chosen characteristic and the target
  datos$variable <- datos[[variable]]
  datos$target <- datos[[target]]
  
  # per bin: goods, bads and totals, then WoE = 100 * log of the good share over the bad share
  output <- datos %>% dplyr::mutate(totalB = sum(target), totalG = n() - totalB) %>% 
    group_by(variable) %>% dplyr::summarise(Bad = sum(target), Good = sum(1-target), Total = n(), 
                                     totalB = max(totalB), totalG = max(totalG)) %>% 
    ungroup() %>% dplyr::transmute(variable = variable, WoE = 100*log( (Good/totalG) / (Bad/totalB)))
  
  # name the first column after the characteristic itself
  columnas <- colnames(output)
  columnas[1] <- c(variable)
  colnames(output) <- columnas
  output
}

transformaWoE <- function(variable, datos = X.tbl, target = "bad"){
  fact <- datos[[variable]]
  # replace each factor level with the WoE computed for that bin
  woe <- computaWoE(variable, datos = datos, target = target)
  levels(fact)[woe[[1]]] <- woe$WoE
  fact
}

computaWoE("age")
## # A tibble: 13 x 2
##         age        WoE
##      <fctr>      <dbl>
##  1 [ 0, 32) -57.337080
##  2 [32, 37) -49.674445
##  3 [37, 41) -30.137585
##  4 [41, 45) -30.174384
##  5 [45, 48) -25.754888
##  6 [48, 51) -14.489128
##  7 [51, 54) -12.276069
##  8 [54, 58)   8.956137
##  9 [58, 61)  34.812621
## 10 [61, 64)  44.845919
## 11 [64, 69)  84.005693
## 12 [69, 75)  96.440486
## 13 [75,109] 122.843888
length(transformaWoE("debtRatio"))
## [1] 120000

It is always useful to explore the WoE of the transformed variables on the same scale, as the magnitude of the WoE values is an indication of predictive power.

# Visualize the WoE values
# (`variables`, the vector of predictor names, is not defined in this excerpt;
#  a reasonable definition is the training column names without the target)
variables <- setdiff(colnames(training), "bad")

varWoE <- list(variable = c(), indice = c(), WoE = c())
for (i in seq_along(variables)){
  extrae <- computaWoE(variables[i])
  varWoE$variable <- c(varWoE$variable, rep(variables[i], length(extrae$WoE)))
  varWoE$indice <- c(varWoE$indice, 1:length(extrae$WoE))
  varWoE$WoE <- c(varWoE$WoE, extrae$WoE)
}

varWoE.tbl <- as_tibble(data.frame(varWoE))

ggplot(varWoE.tbl,aes(x=indice,y=WoE)) +
  geom_col() + xlab("WoE of each variable, same scale") + theme(
    axis.text.x=element_blank(),
    axis.ticks.x=element_blank()) +
  facet_wrap(~variable,ncol=3,scales="free_x")

Let us construct our final dataset containing the WoE of every variable to include in our model. This can be done succinctly with the purrr package and its map function, as follows:

res <- purrr::map(variables,safely(transformaWoE)) %>% purrr::map("result")
str(res)
## List of 10
##  $ : Factor w/ 15 levels "84.5354441478967",..: 13 14 12 14 9 10 13 7 9 12 ...
##  $ : Factor w/ 13 levels "-57.3370799679725",..: 5 3 3 6 12 8 3 1 8 1 ...
##  $ : Factor w/ 4 levels "54.2260562745973",..: 3 1 2 2 1 1 1 1 1 1 ...
##  $ : Factor w/ 8 levels "-20.5158770592473",..: 6 2 2 2 2 8 2 8 4 2 ...
##  $ : Factor w/ 13 levels "53.8490868167879",..: 12 5 6 12 6 13 6 13 12 5 ...
##  $ : Factor w/ 6 levels "-154.888786590618",..: 6 5 3 6 4 6 6 3 6 6 ...
##  $ : Factor w/ 4 levels "39.3083392123289",..: 1 1 2 1 1 1 1 1 1 1 ...
##  $ : Factor w/ 4 levels "-23.1757148763687",..: 4 1 1 2 2 4 1 1 4 1 ...
##  $ : Factor w/ 4 levels "29.0045413163247",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ : Factor w/ 6 levels "14.3946159376293",..: 3 2 1 1 2 1 1 6 3 1 ...
data.tbl <- as_tibble(as.data.frame(res))
colnames(data.tbl) <- variables

#remove factors and keep them as numeric
data.tbl <- purrr::map(data.tbl,varhandle::unfactor)
data.tbl<- as_tibble(as.data.frame(data.tbl))

data.tbl$bad <- X.tbl$bad
str(data.tbl)
## Classes 'tbl_df', 'tbl' and 'data.frame':    120000 obs. of  11 variables:
##  $ revUtilUnsec: num  -88.6 -120.9 -40.4 -120.9 59.4 ...
##  $ age         : num  -25.8 -30.1 -30.1 -14.5 96.4 ...
##  $ n30_59dpd_nw: num  -161.6 54.2 -90 -90 54.2 ...
##  $ debtRatio   : num  -53.2 15.5 15.5 15.5 15.5 ...
##  $ mIncome     : num  42.7 -43.9 -30.5 42.7 -30.5 ...
##  $ nOpenTrades : num  12.39 3.66 -32.69 12.39 -12.44 ...
##  $ n90dlate    : num  39.3 39.3 -195.4 39.3 39.3 ...
##  $ nREstate    : num  -25 -23.2 -23.2 25 25 ...
##  $ n60_89dpd_nw: num  29 29 29 29 29 ...
##  $ nDependents : num  -21.2 -10.3 14.4 14.4 -10.3 ...
##  $ bad         : int  1 0 0 0 0 0 0 0 0 0 ...

It seems we are ready to build our model! We are going to use a logistic regression and look at the results and several performance metrics.

modFit <- train(as.factor(bad)~.,data=data.tbl,method="glm", family="binomial")
summary(modFit$finalModel)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4394  -0.3215  -0.2151  -0.1630   3.1864  
## 
## Coefficients:
##                Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)  -2.6024766  0.0140943 -184.648  < 2e-16 ***
## revUtilUnsec -0.0061992  0.0001377  -45.035  < 2e-16 ***
## age          -0.0037715  0.0002988  -12.621  < 2e-16 ***
## n30_59dpd_nw -0.0050920  0.0001460  -34.870  < 2e-16 ***
## debtRatio    -0.0061068  0.0006097  -10.016  < 2e-16 ***
## mIncome      -0.0028861  0.0004547   -6.348 2.18e-10 ***
## nOpenTrades   0.0017555  0.0004281    4.101 4.12e-05 ***
## n90dlate     -0.0052799  0.0001297  -40.706  < 2e-16 ***
## nREstate     -0.0054304  0.0006269   -8.662  < 2e-16 ***
## n60_89dpd_nw -0.0035892  0.0001593  -22.537  < 2e-16 ***
## nDependents  -0.0031822  0.0006932   -4.591 4.42e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 58725  on 119999  degrees of freedom
## Residual deviance: 43820  on 119989  degrees of freedom
## AIC: 43842
## 
## Number of Fisher Scoring iterations: 6

Our model fitted well. Since we have transformed the variables to their WoE, the coefficients share the same negative sign (nOpenTrades is the only exception), as a higher WoE corresponds to lower-risk individuals. From the z-values of each coefficient we can gauge each variable's importance, which is also what the varImp function reports:

varImp(modFit)
## glm variable importance
## 
##              Overall
## revUtilUnsec 100.000
## n90dlate      89.423
## n30_59dpd_nw  75.167
## n60_89dpd_nw  45.037
## age           20.815
## debtRatio     14.449
## nREstate      11.143
## mIncome        5.489
## nDependents    1.197
## nOpenTrades    0.000

The importance ranking matches the z-values above. The statistics that describe our model come from the confusionMatrix output:

pred <- predict(modFit,data.tbl)
score <-predict(modFit,data.tbl,type="prob")
colnames(score) <- c("pBueno","pMalo")

# note: recent versions of caret require the reference to be a factor,
# e.g. confusionMatrix(pred, as.factor(data.tbl$bad))
confusionMatrix(pred,data.tbl$bad)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      0      1
##          0 110942   6433
##          1   1069   1556
##                                           
##                Accuracy : 0.9375          
##                  95% CI : (0.9361, 0.9388)
##     No Information Rate : 0.9334          
##     P-Value [Acc > NIR] : 6.481e-09       
##                                           
##                   Kappa : 0.2691          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9905          
##             Specificity : 0.1948          
##          Pos Pred Value : 0.9452          
##          Neg Pred Value : 0.5928          
##              Prevalence : 0.9334          
##          Detection Rate : 0.9245          
##    Detection Prevalence : 0.9781          
##       Balanced Accuracy : 0.5926          
##                                           
##        'Positive' Class : 0               
## 

Because the sample is highly unbalanced in the target variable, the accuracy figure suggests we have fitted a very good model, when in fact it is not as strong as it seems. Kappa is a better indicator of fit for unbalanced samples; here it is not very strong, but still useful.

Next, let us visualize the fit of our model; we can use our bivariate plotting function from before on the training sample:

data.tbl$pBueno <- score$pBueno
data.tbl$pMalo <- score$pMalo
graf_biv("pMalo",datos=data.tbl,grupos=30)

Not bad! Most risk managers are familiar with the Kolmogorov-Smirnov (KS) statistic, the ROC curve and its Area Under the Curve (AUC). Let us obtain these statistics:

df.p <- data.tbl[order(data.tbl$pMalo),]
df.p$cumBad <- cumsum(df.p$bad) / sum(df.p$bad)
df.p$cumGood <- cumsum(1 - df.p$bad) / (dim(df.p)[1] - sum(df.p$bad))

ggplot(df.p,aes(cumBad,cumGood)) + geom_path(col="black") +
  geom_segment(x = 0, xend = 1, y = 0, yend = 1, col="grey") + xlim(0,1) + ylim(0,1) +
  xlab("Specificity") + ylab("Sensitivity") + ggtitle("ROC Curve")

# AUC via formula:
modFit$ROC <- 1-(sum(df.p$cumBad/length(df.p$cumBad)) - sum(df.p$cumGood/length(df.p$cumGood)) + 0.5)
print(paste0("AUC: ",round(modFit$ROC*100,2),"%"))
## [1] "AUC: 85.62%"
print(paste0("KS: ",round(max(abs(df.p$cumBad - df.p$cumGood))*100,2),"%"))
## [1] "KS: 55.27%"

Testing Data

An issue with the current code is that the WoE transformation depends on the cuts and the WoE values derived from the training sample, so we need to map the same groups in the testing dataset to the WoE values calculated on the training dataset (which can be quite cumbersome). Strictly speaking, for the two variables binned by quantiles (g=15 and g=13) the cut points should also be taken from the training data rather than recomputed on the test data. To accomplish this, first let us obtain the same groups (coded as factors in R).

testing$mIncome[testing$mIncome<0] <- 0

XTest.tbl <- as_tibble(data.frame(list(bad=testing$bad,
                         revUtilUnsec = cut2(testing$revUtilUnsec,g=15),
                         age = cut2(testing$age,g=13),
                         n30_59dpd_nw = cut2(testing$n30_59dpd_nw,cuts=c(0,1,2,3)),
                         mIncome=cut2(testing$mIncome,cuts=c(0,500,1250,1500,2000,3000,4000,5000,6000,7000,8000,9000)),
                         debtRatio = cut2(testing$debtRatio,cuts=c(0,0.00001,.4,.6,.7,.8,.9,1)),
                         nOpenTrades = cut2(testing$nOpenTrades,cuts=c(0,1,2,3,4,5)),
                         n90dlate = cut2(testing$n90dlate,cuts=c(0,1,2,3)),  
                         nREstate = cut2(testing$nREstate,cuts=c(0,1,2,3)),
                         n60_89dpd_nw = cut2(testing$n60_89dpd_nw,cuts=c(0,1,2,3)),
                         nDependents = cut2(testing$nDependents,cuts=c(0,1,2,3,4))
)
))
# Recode the NAs as "-1" so that missing values are kept as an explicit factor level
XTest.tbl$nDependents<-factor(XTest.tbl$nDependents, levels = levels(addNA(XTest.tbl$nDependents)), labels = c(levels(XTest.tbl$nDependents), -1), exclude = NULL)
XTest.tbl$mIncome<-factor(XTest.tbl$mIncome, levels = levels(addNA(XTest.tbl$mIncome)), labels = c(levels(XTest.tbl$mIncome), -1), exclude = NULL)

And from here we can map the WoE values obtained on the training dataset onto these groups:
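The code that performs this mapping is not reproduced in this excerpt. Below is a minimal sketch of one way to build the testing design matrix (here called dataT.tbl, as used further down); the helper mapWoETest is hypothetical, and for the quantile-binned variables the bin labels only line up if the test-set cut points coincide with the training ones.

# Hypothetical helper: assign to each test-set bin the WoE computed on the training data
mapWoETest <- function(variable, datosTest = XTest.tbl, datosTrain = X.tbl){
  woe <- computaWoE(variable, datos = datosTrain)                      # training WoE per bin
  fact <- datosTest[[variable]]
  levels(fact) <- as.character(woe$WoE[match(levels(fact), as.character(woe[[1]]))])
  as.numeric(as.character(fact))                                       # back to numeric WoE
}

dataT.tbl <- as_tibble(as.data.frame(purrr::map(variables, mapWoETest)))
colnames(dataT.tbl) <- variables
dataT.tbl$bad <- XTest.tbl$bad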

And we are done. We can now evaluate the model on the testing dataset to measure its effectiveness.

pred <- predict(modFit,dataT.tbl)
score <-predict(modFit,dataT.tbl,type="prob")
colnames(score) <- c("pBueno","pMalo")

confusionMatrix(pred,dataT.tbl$bad)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 27682  1674
##          1   281   363
##                                          
##                Accuracy : 0.9348         
##                  95% CI : (0.932, 0.9376)
##     No Information Rate : 0.9321         
##     P-Value [Acc > NIR] : 0.03013        
##                                          
##                   Kappa : 0.2462         
##  Mcnemar's Test P-Value : < 2e-16        
##                                          
##             Sensitivity : 0.9900         
##             Specificity : 0.1782         
##          Pos Pred Value : 0.9430         
##          Neg Pred Value : 0.5637         
##              Prevalence : 0.9321         
##          Detection Rate : 0.9227         
##    Detection Prevalence : 0.9785         
##       Balanced Accuracy : 0.5841         
##                                          
##        'Positive' Class : 0              
## 
dataT.tbl$pBueno <- score$pBueno
dataT.tbl$pMalo <- score$pMalo
graf_biv("pMalo",datos=dataT.tbl,grupos=30)

df.p <- dataT.tbl[order(dataT.tbl$pMalo),]
df.p$cumBad <- cumsum(df.p$bad) / sum(df.p$bad)
df.p$cumGood <- cumsum(1 - df.p$bad) / (dim(df.p)[1] - sum(df.p$bad))

ggplot(df.p,aes(cumBad,cumGood)) + geom_path(col="black") +
  geom_segment(x = 0, xend = 1, y = 0, yend = 1, col="grey") + xlim(0,1) + ylim(0,1) +
  xlab("Specificity") + ylab("Sensitivity") + ggtitle("ROC Curve")

# AUC via formula:
modFit$ROC <- 1-(sum(df.p$cumBad/length(df.p$cumBad)) - sum(df.p$cumGood/length(df.p$cumGood)) + 0.5)
print(paste0("AUC: ",round(modFit$ROC*100,2),"%"))
## [1] "AUC: 85.78%"
print(paste0("KS: ",round(max(abs(df.p$cumBad - df.p$cumGood))*100,2),"%"))
## [1] "KS: 56.07%"

This can be done iteratively, selecting different partitions into training and testing sets, to analyze the stability of the fit and of the coefficients.

Scaling to a scorecard

Scaling the results to obtain points that add up to a score can be done with a linear transformation of the model output, but I will leave that outside the scope of this entry. Of course, if you are interested in this step of the process, feel free to contact us to discuss it.
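For the curious, a common way to do that scaling (not part of this post's code, and with purely illustrative parameter values) is the "points to double the odds" approach: pick a reference score at some reference odds, decide how many points should double the odds, and map the model's log-odds linearly to points.

# Illustrative sketch only: standard PDO scaling of the model output to score points.
pdo         <- 20    # points to double the odds (illustrative value)
targetScore <- 600   # score assigned at the reference odds (illustrative value)
targetOdds  <- 50    # reference good:bad odds (illustrative value)

fctr   <- pdo / log(2)
offset <- targetScore - fctr * log(targetOdds)

# log-odds of being good from the fitted probabilities, then the final score
logOddsGood <- log(score$pBueno / score$pMalo)
puntos <- offset + fctr * logOddsGood
summary(puntos)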