Credit Scoring: an Introduction.
Credit scoring refers to an analysis made by lenders and banks to extended credit (or not) to customers. It has many applications within the financial industry as it automated the underwriting process substantially and allows to derive more sophisticated products to final customers (Risk based pricing is an example).
What I’m going to do is show the versatility of R to deal with a dataset ready to perform analysis and model. Obviously, as it is the case with most applications, the real deal comes when you have to collect the data, define what is a bad customers and so on.
This example also illustrated the ease to build your own functions to perform the computations required to obtain a model. R has many functions built in different packages that help you process your functions through datasets with much control on the output and inputs.
The Data
The dataset comes from the Kaggle Competition Give me some credit which consists in a small dataset of 150,000 observations with 10 predictors. We proceed to read and inspect the data, and rename our columns to more suitable and understandable names of our liking.
data <- read_csv("C:/Users/galcala/Loan/cs-training.csv")
data.tbl <- as_tibble(data.frame(data))
typeof(data.tbl)
## [1] "list"
glimpse(data.tbl)
## Observations: 150,000
## Variables: 12
## $ X1 <int> 1, 2, 3, 4, 5, 6, 7, 8, 9...
## $ SeriousDlqin2yrs <int> 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ RevolvingUtilizationOfUnsecuredLines <dbl> 0.76612661, 0.95715102, 0...
## $ age <int> 45, 40, 38, 30, 49, 74, 5...
## $ NumberOfTime30.59DaysPastDueNotWorse <int> 2, 0, 1, 0, 1, 0, 0, 0, 0...
## $ DebtRatio <dbl> 8.029821e-01, 1.218762e-0...
## $ MonthlyIncome <int> 9120, 2600, 3042, 3300, 6...
## $ NumberOfOpenCreditLinesAndLoans <int> 13, 4, 2, 5, 7, 3, 8, 8, ...
## $ NumberOfTimes90DaysLate <int> 0, 0, 1, 0, 0, 0, 0, 0, 0...
## $ NumberRealEstateLoansOrLines <int> 6, 0, 0, 0, 1, 1, 3, 0, 0...
## $ NumberOfTime60.89DaysPastDueNotWorse <int> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ NumberOfDependents <int> 2, 1, 0, 0, 0, 1, 0, 0, N...
data.tbl$X1 <- NULL
colnames(data.tbl) <- c("bad","revUtilUnsec","age","n30_59dpd_nw","debtRatio","mIncome","nOpenTrades",
"n90dlate","nREstate","n60_89dpd_nw","nDependents")
Alright, seems like our data is well read. Let us split it into training and testing for modelling purposes and work our way through the data.
A very useful tool for inspecting variables (univariate) is the Desc function from the DescTools package, as it gives you many relevant statistics and information as well as useful plots to understand the distributions.
set.seed(43)
inTrain <- createDataPartition(y=data.tbl$bad,p = .8, list = FALSE)
training <- data.tbl[inTrain,]
testing <- data.tbl[-inTrain,]
Desc(training)
## -------------------------------------------------------------------------
## Describe training (tbl_df, tbl, data.frame):
##
## data.frame: 120000 obs. of 11 variables
##
## Nr ColName Class NAs Levels
## 1 bad integer .
## 2 revUtilUnsec numeric .
## 3 age integer .
## 4 n30_59dpd_nw integer .
## 5 debtRatio numeric .
## 6 mIncome integer 23800 (19.8%)
## 7 nOpenTrades integer .
## 8 n90dlate integer .
## 9 nREstate integer .
## 10 n60_89dpd_nw integer .
## 11 nDependents integer 3124 (2.6%)
##
##
## -------------------------------------------------------------------------
## 1 - bad (integer - dichotomous)
##
## length n NAs unique
## 120'000 120'000 0 2
## 100.0% 0.0%
##
## freq perc lci.95 uci.95'
## 0 112'011 93.3% 93.2% 93.5%
## 1 7'989 6.7% 6.5% 6.8%
##
## ' 95%-CI Wilson
## -------------------------------------------------------------------------
## 2 - revUtilUnsec (numeric)
##
## length n NAs unique
## 120'000 120'000 0 100'932
## 100.0% 0.0%
##
## .05 .10 .25 median
## 0.0000000000 0.0028972975 0.0296432087 0.1534955235
##
## range sd vcoef mad
## 20'514.0000000000 187.4560832479 34.7373068442 0.2190202689
##
## 0s mean meanCI
## 8'783 5.3963908051 4.3357669593
## 7.3% 6.4570146509
##
## .75 .90 .95
## 0.5570617472 0.9811474900 0.9999999000
##
## IQR skew kurt
## 0.5274185385 57.3508264716 4'203.3035507492
##
## lowest : 0.0 (8'783), 0.00000993, 0.0000125, 0.0000143, 0.0000149
## highest: 13'400.0, 13'498.0, 17'441.0, 18'300.0, 20'514.0
## -------------------------------------------------------------------------
## 3 - age (integer)
##
## length n NAs unique 0s mean meanCI
## 120'000 120'000 0 85 1 52.31 52.23
## 100.0% 0.0% 0.0% 52.40
##
## .05 .10 .25 median .75 .90 .95
## 29.00 33.00 41.00 52.00 63.00 72.00 78.00
##
## range sd vcoef mad IQR skew kurt
## 109.00 14.78 0.28 16.31 22.00 0.19 -0.50
##
## lowest : 0, 21 (160), 22 (333), 23 (543), 24 (635)
## highest: 101 (2), 102, 103 (3), 107, 109 (2)
## -------------------------------------------------------------------------
## 4 - n30_59dpd_nw (integer)
##
## length n NAs unique 0s mean meanCI
## 120'000 120'000 0 16 100'908 0.42 0.40
## 100.0% 0.0% 84.1% 0.45
##
## .05 .10 .25 median .75 .90 .95
## 0.00 0.00 0.00 0.00 0.00 1.00 2.00
##
## range sd vcoef mad IQR skew kurt
## 98.00 4.24 10.02 0.00 0.00 22.38 511.90
##
## lowest : 0 (100'908), 1 (12'766), 2 (3'668), 3 (1'397), 4 (602)
## highest: 11, 12 (2), 13, 96 (4), 98 (216)
## -------------------------------------------------------------------------
## 5 - debtRatio (numeric)
##
## length n NAs unique
## 120'000 120'000 0 93'091
## 100.0% 0.0%
##
## .05 .10 .25 median
## 0.0041962801 0.0303933134 0.1746868008 0.3655131565
##
## range sd vcoef mad
## 329'664.0000000000 1'962.6860436419 5.5812712114 0.3634767527
##
## 0s mean meanCI
## 3'341 351.6557374292 340.5508877834
## 2.8% 362.7605870751
##
## .75 .90 .95
## 0.8686527818 1'265.0000000000 2'447.0000000000
##
## IQR skew kurt
## 0.6939659810 92.7561410895 13'505.0802601446
##
## lowest : 0.0 (3'341), 0.000026, 0.0000369, 0.0000393, 0.000075
## highest: 106'885.0, 168'835.0, 220'516.0, 307'001.0, 329'664.0
## -------------------------------------------------------------------------
## 6 - mIncome (integer)
##
## length n NAs unique 0s mean
## 120'000 96'200 23'800 12'652 1'293 6'686.68
## 80.2% 19.8% 1.1%
##
## .05 .10 .25 median .75 .90
## 1'300.00 2'000.00 3'400.00 5'400.00 8'250.00 11'666.00
##
## range sd vcoef mad IQR skew
## 3'008'750.00 15'532.50 2.32 3'445.56 4'850.00 111.82
##
## meanCI
## 6'588.53
## 6'784.83
##
## .95
## 14'600.00
##
## kurt
## 17'882.65
##
## lowest : 0 (1'293), 1 (488), 2 (6), 4 (2), 5 (2)
## highest: 835'040, 1'072'500, 1'560'100, 1'794'060, 3'008'750
## -------------------------------------------------------------------------
## 7 - nOpenTrades (integer)
##
## length n NAs unique 0s mean meanCI
## 120'000 120'000 0 58 1'528 8.45 8.42
## 100.0% 0.0% 1.3% 8.48
##
## .05 .10 .25 median .75 .90 .95
## 2.00 3.00 5.00 8.00 11.00 15.00 18.00
##
## range sd vcoef mad IQR skew kurt
## 58.00 5.15 0.61 4.45 6.00 1.21 3.05
##
## lowest : 0 (1'528), 1 (3'585), 2 (5'322), 3 (7'266), 4 (9'248)
## highest: 53, 54 (3), 56 (2), 57, 58
## -------------------------------------------------------------------------
## 8 - n90dlate (integer)
##
## length n NAs unique 0s mean meanCI
## 120'000 120'000 0 19 113'346 0.27 0.25
## 100.0% 0.0% 94.5% 0.29
##
## .05 .10 .25 median .75 .90 .95
## 0.00 0.00 0.00 0.00 0.00 0.00 1.00
##
## range sd vcoef mad IQR skew kurt
## 98.00 4.22 15.61 0.00 0.00 22.84 526.08
##
## lowest : 0 (113'346), 1 (4'158), 2 (1'255), 3 (537), 4 (235)
## highest: 14 (2), 15 (2), 17, 96 (4), 98 (216)
## -------------------------------------------------------------------------
## 9 - nREstate (integer)
##
## length n NAs unique 0s mean meanCI
## 120'000 120'000 0 28 45'027 1.02 1.01
## 100.0% 0.0% 37.5% 1.02
##
## .05 .10 .25 median .75 .90 .95
## 0.00 0.00 0.00 1.00 2.00 2.00 3.00
##
## range sd vcoef mad IQR skew kurt
## 54.00 1.14 1.12 1.48 2.00 3.79 70.98
##
## lowest : 0 (45'027), 1 (41'850), 2 (25'172), 3 (5'022), 4 (1'743)
## highest: 25 (3), 26, 29, 32, 54
## -------------------------------------------------------------------------
## 10 - n60_89dpd_nw (integer)
##
## length n NAs unique 0s mean meanCI
## 120'000 120'000 0 12 113'950 0.24 0.22
## 100.0% 0.0% 95.0% 0.27
##
## .05 .10 .25 median .75 .90 .95
## 0.00 0.00 0.00 0.00 0.00 0.00 1.00
##
## range sd vcoef mad IQR skew kurt
## 98.00 4.20 17.23 0.00 0.00 23.08 533.83
##
##
## level freq perc cumfreq cumperc
## 1 0 113'950 95.0% 113'950 95.0%
## 2 1 4'559 3.8% 118'509 98.8%
## 3 2 888 0.7% 119'397 99.5%
## 4 3 251 0.2% 119'648 99.7%
## 5 4 80 0.1% 119'728 99.8%
## 6 5 29 0.0% 119'757 99.8%
## 7 6 11 0.0% 119'768 99.8%
## 8 7 9 0.0% 119'777 99.8%
## 9 8 2 0.0% 119'779 99.8%
## 10 11 1 0.0% 119'780 99.8%
## 11 96 4 0.0% 119'784 99.8%
## 12 98 216 0.2% 120'000 100.0%
## -------------------------------------------------------------------------
## 11 - nDependents (integer)
##
## length n NAs unique 0s mean meanCI
## 120'000 116'876 3'124 13 69'625 0.76 0.75
## 97.4% 2.6% 58.0% 0.76
##
## .05 .10 .25 median .75 .90 .95
## 0.00 0.00 0.00 0.00 1.00 2.00 3.00
##
## range sd vcoef mad IQR skew kurt
## 20.00 1.12 1.48 0.00 1.00 1.60 3.20
##
## lowest : 0 (69'625), 1 (21'010), 2 (15'561), 3 (7'588), 4 (2'303)
## highest: 8 (20), 9 (5), 10 (4), 13, 20
#Income and Dependents have missing values (20% and 2.6%)
#6.7% delinquency rate
#Revolving highly skewed to the right. Needs trimming
#Age: We have a "0" age, 21 is our real minimum.
#n30_59dpd_nw: 84% zeros, skewed to the right
#debtRatio: 2.7% zeros. Highly skewed.
#mIncome: 1.3% zeros, highly skewed. 20% missing. Some negative.
#nOpenTrades: 1.3% zeros. slightly skewed. Integer variable.
#n90dlate: 94.4% zeros, highly skewed
#nREstate: 37.5% zeros, 33% 1s and 2+ the rest.
#n60_89dpd_nw: 94.9% zeros. skewed to the right.
#nDependents: 58% zeros. some missing (2.6%) outliers to the right.
Highly skewed data seems to be the norm, and such other undesired properties like censoring, missings and highly concentrated variables in only one of two attributes (mainly zero). But these are the norm for real world variables in credit scoring. What we are going to do next is to explore the relationship with our target variable, bad and each characteristic. Let us do this visually through a plot.
We are going to plot our variable partitioned in 10 groups (or less, if the data doesn’t allow for 10 groups) and plot the distribution of the data in bars and the rate of bads for each bin in a red line. These plots help us understand the relationship from our variables and our target.
In a credit scoring setting, it is most common to transform our characteristic into discrete variables through their Weight of Evidence (WoE) transformation, best explained here. The advantage from these formulation is that it guarantees monotonic relationships and it is easily transformed into points to add up in a scorecard once the model is fitted.
To do that, let us first define X, our design matrix, with the variables grouped into our values of interest. We also deal with our NAs to obtain better results.
training$mIncome[training$mIncome<0] <- 0
X.tbl <- as_tibble(data.frame(list(bad=training$bad,
revUtilUnsec = cut2(training$revUtilUnsec,g=15),
age = cut2(training$age,g=13),
n30_59dpd_nw = cut2(training$n30_59dpd_nw,cuts=c(0,1,2,3)),
mIncome=cut2(training$mIncome,cuts=c(0,500,1250,1500,2000,3000,4000,5000,6000,7000,8000,9000)),
debtRatio = cut2(training$debtRatio,cuts=c(0,0.00001,.4,.6,.7,.8,.9,1)),
nOpenTrades = cut2(training$nOpenTrades,cuts=c(0,1,2,3,4,5)),
n90dlate = cut2(training$n90dlate,cuts=c(0,1,2,3)),
nREstate = cut2(training$nREstate,cuts=c(0,1,2,3)),
n60_89dpd_nw = cut2(training$n60_89dpd_nw,cuts=c(0,1,2,3)),
nDependents = cut2(training$nDependents,cuts=c(0,1,2,3,4))
)
))
#Transformamos los "NA" en "-1" para mantener dentro de los factores
X.tbl$nDependents<-factor(X.tbl$nDependents, levels = levels(addNA(X.tbl$nDependents)), labels = c(levels(X.tbl$nDependents), -1), exclude = NULL)
X.tbl$mIncome<-factor(X.tbl$mIncome, levels = levels(addNA(X.tbl$mIncome)), labels = c(levels(X.tbl$mIncome), -1), exclude = NULL)
Now we are ready to compute and transform our variables to their respective WoE. We are going to use two functions to do this (and show that it works with an example):
computaWoE <- function(variable,datos=X.tbl,target = "bad"){
datos$variable <- datos[[variable]]
datos$target <- datos[[target]]
output <- datos %>% dplyr::mutate(totalB = sum(target), totalG = n() - totalB) %>%
group_by(variable) %>% dplyr::summarise(Bad = sum(target), Good = sum(1-target), Total = n(),
totalB = max(totalB), totalG = max(totalG)) %>%
ungroup() %>% dplyr::transmute(variable=variable,WoE = 100*log( (Good/totalG) / (Bad/totalB)))
columnas <- colnames(output)
columnas[1] <- c(variable)
colnames(output) <- columnas
output
}
transformaWoE <- function(variable,datos=X.tbl,target = "bad"){
fact <- datos[[variable]]
levels(fact)[computaWoE(variable)[[1]]] <- computaWoE(variable,datos=datos)$WoE
fact
}
computaWoE("age")
## # A tibble: 13 x 2
## age WoE
## <fctr> <dbl>
## 1 [ 0, 32) -57.337080
## 2 [32, 37) -49.674445
## 3 [37, 41) -30.137585
## 4 [41, 45) -30.174384
## 5 [45, 48) -25.754888
## 6 [48, 51) -14.489128
## 7 [51, 54) -12.276069
## 8 [54, 58) 8.956137
## 9 [58, 61) 34.812621
## 10 [61, 64) 44.845919
## 11 [64, 69) 84.005693
## 12 [69, 75) 96.440486
## 13 [75,109] 122.843888
length(transformaWoE("debtRatio"))
## [1] 120000
It is always useful to explore the transformed variables WoE in the same scale, as it is an indication os predictability.
#Visualicemos los WoE
varWoE <- list(variable = c(), indice = c(),WoE = c())
for (i in seq_along(variables)){
extrae <-computaWoE(variables[i])
varWoE$variable <- c(varWoE$variable,rep(variables[i],length(extrae$WoE)))
varWoE$indice <- c(varWoE$indice,1:length(extrae$WoE))
varWoE$WoE <- c(varWoE$WoE,extrae$WoE)
}
varWoE.tbl <- as_tibble(data.frame(varWoE))
ggplot(varWoE.tbl,aes(x=indice,y=WoE)) +
geom_col() + xlab("WoE de cada variable, misma escala") + theme(
axis.text.x=element_blank(),
axis.ticks.x=element_blank()) +
facet_wrap(~variable,ncol=3,scales="free_x")
Let us construct our final dataset which contains the WoE for every variable to include in our model, this can be doe succinctly with the purrr package in R and the map function, as follows:
res <- purrr::map(variables,safely(transformaWoE)) %>% purrr::map("result")
str(res)
## List of 10
## $ : Factor w/ 15 levels "84.5354441478967",..: 13 14 12 14 9 10 13 7 9 12 ...
## $ : Factor w/ 13 levels "-57.3370799679725",..: 5 3 3 6 12 8 3 1 8 1 ...
## $ : Factor w/ 4 levels "54.2260562745973",..: 3 1 2 2 1 1 1 1 1 1 ...
## $ : Factor w/ 8 levels "-20.5158770592473",..: 6 2 2 2 2 8 2 8 4 2 ...
## $ : Factor w/ 13 levels "53.8490868167879",..: 12 5 6 12 6 13 6 13 12 5 ...
## $ : Factor w/ 6 levels "-154.888786590618",..: 6 5 3 6 4 6 6 3 6 6 ...
## $ : Factor w/ 4 levels "39.3083392123289",..: 1 1 2 1 1 1 1 1 1 1 ...
## $ : Factor w/ 4 levels "-23.1757148763687",..: 4 1 1 2 2 4 1 1 4 1 ...
## $ : Factor w/ 4 levels "29.0045413163247",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ : Factor w/ 6 levels "14.3946159376293",..: 3 2 1 1 2 1 1 6 3 1 ...
data.tbl <- as_tibble(as.data.frame(res))
colnames(data.tbl) <- variables
#remove factors and keep them as numeric
data.tbl <- purrr::map(data.tbl,varhandle::unfactor)
data.tbl<- as_tibble(as.data.frame(data.tbl))
data.tbl$bad <- X.tbl$bad
str(data.tbl)
## Classes 'tbl_df', 'tbl' and 'data.frame': 120000 obs. of 11 variables:
## $ revUtilUnsec: num -88.6 -120.9 -40.4 -120.9 59.4 ...
## $ age : num -25.8 -30.1 -30.1 -14.5 96.4 ...
## $ n30_59dpd_nw: num -161.6 54.2 -90 -90 54.2 ...
## $ debtRatio : num -53.2 15.5 15.5 15.5 15.5 ...
## $ mIncome : num 42.7 -43.9 -30.5 42.7 -30.5 ...
## $ nOpenTrades : num 12.39 3.66 -32.69 12.39 -12.44 ...
## $ n90dlate : num 39.3 39.3 -195.4 39.3 39.3 ...
## $ nREstate : num -25 -23.2 -23.2 25 25 ...
## $ n60_89dpd_nw: num 29 29 29 29 29 ...
## $ nDependents : num -21.2 -10.3 14.4 14.4 -10.3 ...
## $ bad : int 1 0 0 0 0 0 0 0 0 0 ...
Seems we are ready to build our model!. We are going to use a logistic regression to show results and different metrics of performance.
modFit <- train(as.factor(bad)~.,data=data.tbl,method="glm", family="binomial")
summary(modFit$finalModel)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4394 -0.3215 -0.2151 -0.1630 3.1864
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.6024766 0.0140943 -184.648 < 2e-16 ***
## revUtilUnsec -0.0061992 0.0001377 -45.035 < 2e-16 ***
## age -0.0037715 0.0002988 -12.621 < 2e-16 ***
## n30_59dpd_nw -0.0050920 0.0001460 -34.870 < 2e-16 ***
## debtRatio -0.0061068 0.0006097 -10.016 < 2e-16 ***
## mIncome -0.0028861 0.0004547 -6.348 2.18e-10 ***
## nOpenTrades 0.0017555 0.0004281 4.101 4.12e-05 ***
## n90dlate -0.0052799 0.0001297 -40.706 < 2e-16 ***
## nREstate -0.0054304 0.0006269 -8.662 < 2e-16 ***
## n60_89dpd_nw -0.0035892 0.0001593 -22.537 < 2e-16 ***
## nDependents -0.0031822 0.0006932 -4.591 4.42e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 58725 on 119999 degrees of freedom
## Residual deviance: 43820 on 119989 degrees of freedom
## AIC: 43842
##
## Number of Fisher Scoring iterations: 6
Our model fitted well. Since we have transformed the variables to their WoE all Beta Coefficients have the same negative sign, as higher WoE leads to lower risk individuals. From the z-values of each coefficient we can derive the importance of the variable, but that is also shown with the varImp function:
varImp(modFit)
## glm variable importance
##
## Overall
## revUtilUnsec 100.000
## n90dlate 89.423
## n30_59dpd_nw 75.167
## n60_89dpd_nw 45.037
## age 20.815
## debtRatio 14.449
## nREstate 11.143
## mIncome 5.489
## nDependents 1.197
## nOpenTrades 0.000
Same results as above. The statistics that describe our model are shown from a confusionMatrix output:
pred <- predict(modFit,data.tbl)
score <-predict(modFit,data.tbl,type="prob")
colnames(score) <- c("pBueno","pMalo")
confusionMatrix(pred,data.tbl$bad)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 110942 6433
## 1 1069 1556
##
## Accuracy : 0.9375
## 95% CI : (0.9361, 0.9388)
## No Information Rate : 0.9334
## P-Value [Acc > NIR] : 6.481e-09
##
## Kappa : 0.2691
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9905
## Specificity : 0.1948
## Pos Pred Value : 0.9452
## Neg Pred Value : 0.5928
## Prevalence : 0.9334
## Detection Rate : 0.9245
## Detection Prevalence : 0.9781
## Balanced Accuracy : 0.5926
##
## 'Positive' Class : 0
##
As the sample is highly unbalanced in the target variable, our Accuracy leads us to believe we’ve fitted a very good model when in fact it is not as strong as it seems. Kappa is a better indicator of fitness for unbalanced samples and it is not very strong, but still useful.
Next in line is to visualize the fit of our model, we can use our previous function to plot the training sample:
data.tbl$pBueno <- score$pBueno
data.tbl$pMalo <- score$pMalo
graf_biv("pMalo",datos=data.tbl,grupos=30)
Not bad! Most Risk Managers are familiar with the Kolmogorov-Smirnov statistics, the ROC and it’s Area Under the Curve (AUC). Let us obtain these statistics:
df.p <- data.tbl[order(data.tbl$pMalo),]
df.p$cumBad <- cumsum(df.p$bad) / sum(df.p$bad)
df.p$cumGood <- cumsum(1 - df.p$bad) / (dim(df.p)[1] - sum(df.p$bad))
ggplot(df.p,aes(cumBad,cumGood)) + geom_path(col="black") +
geom_segment(x = 0, xend = 1, y = 0, yend = 1, col="grey") + xlim(0,1) + ylim(0,1) +
xlab("Specificity") + ylab("Sensitivity") + ggtitle("Curva ROC")
#ROC con fórmula:
modFit$ROC <- 1-(sum(df.p$cumBad/length(df.p$cumBad)) - sum(df.p$cumGood/length(df.p$cumGood)) + 0.5)
print(paste0("AUC: ",round(modFit$ROC*100,2),"%"))
## [1] "AUC: 85.62%"
print(paste0("KS: ",round(max(abs(df.p$cumBad - df.p$cumGood))*100,2),"%"))
## [1] "KS: 55.27%"
Testing Data
An issue with the current code is that our transformation of the data into WoE is dependent on the cuts and WoE of our training sample, thus we need to map there same groups in our testing dataset to the WoE calculated in our traning dataset (which can be quite cumbersome). To accomplish this first let us obtain the same groups (coded as factors in R).
testing$mIncome[testing$mIncome<0] <- 0
XTest.tbl <- as_tibble(data.frame(list(bad=testing$bad,
revUtilUnsec = cut2(testing$revUtilUnsec,g=15),
age = cut2(testing$age,g=13),
n30_59dpd_nw = cut2(testing$n30_59dpd_nw,cuts=c(0,1,2,3)),
mIncome=cut2(testing$mIncome,cuts=c(0,500,1250,1500,2000,3000,4000,5000,6000,7000,8000,9000)),
debtRatio = cut2(testing$debtRatio,cuts=c(0,0.00001,.4,.6,.7,.8,.9,1)),
nOpenTrades = cut2(testing$nOpenTrades,cuts=c(0,1,2,3,4,5)),
n90dlate = cut2(testing$n90dlate,cuts=c(0,1,2,3)),
nREstate = cut2(testing$nREstate,cuts=c(0,1,2,3)),
n60_89dpd_nw = cut2(testing$n60_89dpd_nw,cuts=c(0,1,2,3)),
nDependents = cut2(testing$nDependents,cuts=c(0,1,2,3,4))
)
))
#Transformamos los "NA" en "-1" para mantener dentro de los factores
XTest.tbl$nDependents<-factor(XTest.tbl$nDependents, levels = levels(addNA(XTest.tbl$nDependents)), labels = c(levels(XTest.tbl$nDependents), -1), exclude = NULL)
XTest.tbl$mIncome<-factor(XTest.tbl$mIncome, levels = levels(addNA(XTest.tbl$mIncome)), labels = c(levels(XTest.tbl$mIncome), -1), exclude = NULL)
And from here we can map the WoEs obtaining in our previous dataset:
And it’s all done. We can evaluate our model with our testing dataset to measure the effectiveness of our model.
pred <- predict(modFit,dataT.tbl)
score <-predict(modFit,dataT.tbl,type="prob")
colnames(score) <- c("pBueno","pMalo")
confusionMatrix(pred,dataT.tbl$bad)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 27682 1674
## 1 281 363
##
## Accuracy : 0.9348
## 95% CI : (0.932, 0.9376)
## No Information Rate : 0.9321
## P-Value [Acc > NIR] : 0.03013
##
## Kappa : 0.2462
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.9900
## Specificity : 0.1782
## Pos Pred Value : 0.9430
## Neg Pred Value : 0.5637
## Prevalence : 0.9321
## Detection Rate : 0.9227
## Detection Prevalence : 0.9785
## Balanced Accuracy : 0.5841
##
## 'Positive' Class : 0
##
dataT.tbl$pBueno <- score$pBueno
dataT.tbl$pMalo <- score$pMalo
graf_biv("pMalo",datos=dataT.tbl,grupos=30)
df.p <- dataT.tbl[order(dataT.tbl$pMalo),]
df.p$cumBad <- cumsum(df.p$bad) / sum(df.p$bad)
df.p$cumGood <- cumsum(1 - df.p$bad) / (dim(df.p)[1] - sum(df.p$bad))
ggplot(df.p,aes(cumBad,cumGood)) + geom_path(col="black") +
geom_segment(x = 0, xend = 1, y = 0, yend = 1, col="grey") + xlim(0,1) + ylim(0,1) +
xlab("Specificity") + ylab("Sensitivity") + ggtitle("Curva ROC")
#ROC con fórmula:
modFit$ROC <- 1-(sum(df.p$cumBad/length(df.p$cumBad)) - sum(df.p$cumGood/length(df.p$cumGood)) + 0.5)
print(paste0("AUC: ",round(modFit$ROC*100,2),"%"))
## [1] "AUC: 85.78%"
print(paste0("KS: ",round(max(abs(df.p$cumBad - df.p$cumGood))*100,2),"%"))
## [1] "KS: 56.07%"
This can be done iteratively selection different partitions into testing and training to analyze the stability of the fit and coefficients.
Scaling to a scorecard
To scale our results to obtain points to add up can be donde from a linear transformation of our results, but right now I’ll leave that out of the scope of this entry. Obviously, if you’re interested in this step of the process feel free to contact us to discuss the issue.