4.2 Data
Load the necessary packages. Install the packages that you don’t have.
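If some of these packages are missing on your machine, something like the following one-time snippet installs only the ones you don't have yet (adjust the vector if your list of packages differs):
pkgs <- c("dplyr", "psych", "caret", "lubridate", "RANN",
          "lime", "ggplot2", "ROSE", "mice")
new_pkgs <- setdiff(pkgs, rownames(installed.packages())) # packages not yet installed
if (length(new_pkgs) > 0) install.packages(new_pkgs)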
library(dplyr)
library(psych)
library(caret)
library(lubridate)
library(RANN) # Used for KNN
library(lime) # Used for variable importance
library(ggplot2)
library(ROSE) # Used for synthetic samples
library(mice) # Used to impute missing categorical values
You can download the CSV files from Kaggle, or you can read them directly from my GitHub repository, which is what I do next. We are first going to use only the file named carInsurance_train.csv.
dt <- read.csv("http://bit.ly/2V8DBkL",
               stringsAsFactors = FALSE)
4.2.1 Data exploration
Check out the names of the columns. Compare to the names in the documentation on Kaggle to make sure we are dealing with the same data set. Furthermore, please understand the meaning of each variable.
names(dt)
## [1] "Id" "Age" "Job"
## [4] "Marital" "Education" "Default"
## [7] "Balance" "HHInsurance" "CarLoan"
## [10] "Communication" "LastContactDay" "LastContactMonth"
## [13] "NoOfContacts" "DaysPassed" "PrevAttempts"
## [16] "Outcome" "CallStart" "CallEnd"
## [19] "CarInsurance"
Learn the variable classes.
sapply(dt, class)
## Id Age Job Marital
## "integer" "integer" "character" "character"
## Education Default Balance HHInsurance
## "character" "integer" "integer" "integer"
## CarLoan Communication LastContactDay LastContactMonth
## "integer" "character" "integer" "character"
## NoOfContacts DaysPassed PrevAttempts Outcome
## "integer" "integer" "integer" "character"
## CallStart CallEnd CarInsurance
## "character" "character" "integer"
CallStart and CallEnd are both character variables in the data set, but they actually represent times in hour:minute:second format. Let's convert them to the proper format using the lubridate package. Separate out the hour, minute, and second from these two variables. Also create a variable CallDuration that captures the duration of the call in seconds. Finally, convert all the character variables to factors.
dt <- dt %>%
  mutate(CallStart = lubridate::hms(CallStart), # parse "hh:mm:ss" strings into periods
         CallEnd = lubridate::hms(CallEnd),
         CallStartHour = hour(CallStart),
         CallStartMin = minute(CallStart),
         CallStartSec = second(CallStart),
         CallEndHour = hour(CallEnd),
         CallEndMin = minute(CallEnd),
         CallEndSec = second(CallEnd),
         CallDuration = period_to_seconds(CallEnd) - period_to_seconds(CallStart)) %>% # call length in seconds
  select(-CallStart, -CallEnd, -Id) %>% # drop the raw times and the Id column
  mutate_if(is.character, as.factor) # convert the remaining character variables to factors
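As a quick sanity check, we can verify that the derived call durations look sensible, for example that every call ends after it starts:
# Optional check: all call durations should be positive
summary(dt$CallDuration)
all(dt$CallDuration > 0)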
Get the descriptive statistics.
psych::describe(dt) %>%
  select(-vars, -trimmed, -mad, -range, -se) %>%
  knitr::kable(digits = 2,
               align = "c",
               caption = "Summary Statistics",
               booktabs = TRUE) # kable prints nice-looking tables.
Table: Summary Statistics

| | n | mean | sd | median | min | max | skew | kurtosis |
|---|---|---|---|---|---|---|---|---|
| Age | 4000 | 41.21 | 11.55 | 39.0 | 18 | 95 | 0.76 | 0.49 |
| Job* | 3981 | 5.42 | 3.21 | 5.0 | 1 | 11 | 0.20 | -1.27 |
| Marital* | 4000 | 2.18 | 0.63 | 2.0 | 1 | 3 | -0.15 | -0.56 |
| Education* | 3831 | 2.19 | 0.67 | 2.0 | 1 | 3 | -0.24 | -0.80 |
| Default | 4000 | 0.01 | 0.12 | 0.0 | 0 | 1 | 8.12 | 63.95 |
| Balance | 4000 | 1532.94 | 3511.45 | 551.5 | -3058 | 98417 | 9.87 | 184.73 |
| HHInsurance | 4000 | 0.49 | 0.50 | 0.0 | 0 | 1 | 0.03 | -2.00 |
| CarLoan | 4000 | 0.13 | 0.34 | 0.0 | 0 | 1 | 2.16 | 2.67 |
| Communication* | 3098 | 1.09 | 0.28 | 1.0 | 1 | 2 | 2.95 | 6.69 |
| LastContactDay | 4000 | 15.72 | 8.43 | 16.0 | 1 | 31 | 0.09 | -1.07 |
| LastContactMonth* | 4000 | 6.47 | 3.13 | 7.0 | 1 | 12 | -0.36 | -1.06 |
| NoOfContacts | 4000 | 2.61 | 3.06 | 2.0 | 1 | 43 | 5.24 | 40.12 |
| DaysPassed | 4000 | 48.71 | 106.69 | -1.0 | -1 | 854 | 2.53 | 7.52 |
| PrevAttempts | 4000 | 0.72 | 2.08 | 0.0 | 0 | 58 | 8.93 | 169.74 |
| Outcome* | 958 | 1.88 | 0.89 | 2.0 | 1 | 3 | 0.23 | -1.69 |
| CarInsurance | 4000 | 0.40 | 0.49 | 0.0 | 0 | 1 | 0.40 | -1.84 |
| CallStartHour | 4000 | 13.03 | 2.59 | 13.0 | 9 | 17 | -0.04 | -1.23 |
| CallStartMin | 4000 | 29.66 | 17.38 | 29.0 | 0 | 59 | -0.01 | -1.24 |
| CallStartSec | 4000 | 29.69 | 17.31 | 30.0 | 0 | 59 | -0.04 | -1.19 |
| CallEndHour | 4000 | 13.14 | 2.61 | 13.0 | 9 | 18 | -0.03 | -1.19 |
| CallEndMin | 4000 | 29.33 | 17.41 | 29.0 | 0 | 59 | 0.02 | -1.20 |
| CallEndSec | 4000 | 29.10 | 17.17 | 29.0 | 0 | 59 | 0.04 | -1.19 |
| CallDuration | 4000 | 350.84 | 342.24 | 232.0 | 5 | 3253 | 2.23 | 7.18 |
4.2.2 Imputing missing values
Most of the variables have no missing values. However, Job and Education have a few missing values, which we can easily impute using a predictive model. Communication has 902 missing values and Outcome has 3042 missing values. That is too many to impute, so we will instead create a separate category for the missing values.

In the case of Communication, a missing value means the communication channel is not available, so we will create a category called "Not Available". Similarly, for Outcome the missing values mean that the person was not contacted before and therefore there is no outcome, so we will create a category called "None".

Also note that DaysPassed takes a value of -1 when the person was not part of any previous campaign. Keeping it at -1 would be a mistake. Ideally, we would put \(\small \infty\) in its place; instead, we will use a very large value. As the maximum DaysPassed in the data is 854, let's use 1,000 as the upper limit.
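If you want to verify these missing-value counts yourself, something like the following does the job:
# Count the number of missing values in each column
colSums(is.na(dt))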
With that, let’s make the changes to our data set.
# Duplicate the original dataset
dt2 <- dt
First make the adjustments to Communication, Outcome, and DaysPassed.
dt2 <- dt2 %>%
  mutate(Communication = ifelse(is.na(Communication), "Not Available", Communication),
         Outcome = ifelse(is.na(Outcome), "None", Outcome),
         DaysPassed = ifelse(DaysPassed == -1, 1000, DaysPassed))
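A quick optional look at the recoded variables confirms that the new categories are present and that DaysPassed no longer contains -1:
# Optional check on the recoded variables
table(dt2$Communication, useNA = "ifany")
table(dt2$Outcome, useNA = "ifany")
summary(dt2$DaysPassed)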
Next, impute the missing values in Job and Education and use the imputed values in their place. Set a seed so the results can be reproduced later.
set.seed(8934)
miceMod <- mice::mice(subset(dt2, select = -CarInsurance),
                      method = "rf") # perform mice imputation based on random forest.
Generate the completed data.
dt3 <- mice::complete(miceMod)
Check whether we have all the values imputed now.
anyNA(dt3)
## [1] FALSE
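If we want a bit more assurance than anyNA(), we can compare the category counts of an imputed variable before and after imputation, for example for Job:
# Optional check: the imputed data should have the same categories,
# with the former NAs distributed among them
table(dt2$Job, useNA = "ifany")
table(dt3$Job)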
Now let’s look at the summary.
dt3$CarInsurance <- dt2$CarInsurance # Add the outcome back; it was excluded from the imputation
psych::describe(dt3) %>%
  select(-vars, -trimmed, -mad, -range, -se) %>%
  knitr::kable(digits = 2,
               align = "c",
               caption = "Summary Statistics\n(Post Imputation)",
               booktabs = TRUE)
Table: Summary Statistics (Post Imputation)

| | n | mean | sd | median | min | max | skew | kurtosis |
|---|---|---|---|---|---|---|---|---|
| Age | 4000 | 41.21 | 11.55 | 39.0 | 18 | 95 | 0.76 | 0.49 |
| Job* | 4000 | 5.42 | 3.21 | 5.0 | 1 | 11 | 0.20 | -1.26 |
| Marital* | 4000 | 2.18 | 0.63 | 2.0 | 1 | 3 | -0.15 | -0.56 |
| Education* | 4000 | 2.19 | 0.66 | 2.0 | 1 | 3 | -0.24 | -0.78 |
| Default | 4000 | 0.01 | 0.12 | 0.0 | 0 | 1 | 8.12 | 63.95 |
| Balance | 4000 | 1532.94 | 3511.45 | 551.5 | -3058 | 98417 | 9.87 | 184.73 |
| HHInsurance | 4000 | 0.49 | 0.50 | 0.0 | 0 | 1 | 0.03 | -2.00 |
| CarLoan | 4000 | 0.13 | 0.34 | 0.0 | 0 | 1 | 2.16 | 2.67 |
| Communication* | 4000 | 1.09 | 0.28 | 1.0 | 1 | 2 | 2.95 | 6.69 |
| LastContactDay | 4000 | 15.72 | 8.43 | 16.0 | 1 | 31 | 0.09 | -1.07 |
| LastContactMonth* | 4000 | 6.47 | 3.13 | 7.0 | 1 | 12 | -0.36 | -1.06 |
| NoOfContacts | 4000 | 2.61 | 3.06 | 2.0 | 1 | 43 | 5.24 | 40.12 |
| DaysPassed | 4000 | 809.97 | 343.85 | 1000.0 | 1 | 1000 | -1.31 | -0.18 |
| PrevAttempts | 4000 | 0.72 | 2.08 | 0.0 | 0 | 58 | 8.93 | 169.74 |
| Outcome* | 4000 | 1.88 | 0.89 | 2.0 | 1 | 3 | 0.23 | -1.69 |
| CallStartHour | 4000 | 13.03 | 2.59 | 13.0 | 9 | 17 | -0.04 | -1.23 |
| CallStartMin | 4000 | 29.66 | 17.38 | 29.0 | 0 | 59 | -0.01 | -1.24 |
| CallStartSec | 4000 | 29.69 | 17.31 | 30.0 | 0 | 59 | -0.04 | -1.19 |
| CallEndHour | 4000 | 13.14 | 2.61 | 13.0 | 9 | 18 | -0.03 | -1.19 |
| CallEndMin | 4000 | 29.33 | 17.41 | 29.0 | 0 | 59 | 0.02 | -1.20 |
| CallEndSec | 4000 | 29.10 | 17.17 | 29.0 | 0 | 59 | 0.04 | -1.19 |
| CallDuration | 4000 | 350.84 | 342.24 | 232.0 | 5 | 3253 | 2.23 | 7.18 |
| CarInsurance | 4000 | 0.40 | 0.49 | 0.0 | 0 | 1 | 0.40 | -1.84 |
Finally, we will create dummy variables for all the factors. For this, we will use the dummyVars function from the caret package. It drops the dependent variable, CarInsurance, from the data, so we will add it back afterwards. Also, the predict function returns a matrix, which we need to convert into a data frame.
dt4 <- predict(dummyVars(CarInsurance ~ .,
                         data = dt3,
                         fullRank = TRUE # Drops the reference category
                         ),
               newdata = dt3) %>%
  data.frame()

dt4$CarInsurance <- dt2$CarInsurance # Add the dependent variable back

# Convert CarInsurance to a factor with labels "No" and "Yes"
dt4 <- dt4 %>%
  mutate(CarInsurance = as.factor(ifelse(CarInsurance == 0, "No", "Yes")))
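To make sure the dummy coding worked as intended, a quick optional inspection of the resulting data frame helps, for example:
# Optional check: factors are now expanded into 0/1 dummy columns
dim(dt4)
head(names(dt4), 10)
table(dt4$CarInsurance)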
4.2.3 Training and test set
Before we build our predictive model, we will split the sample into a training set and a test set. Usually I use an 80-20 split, and as our sample size is fairly large this split is reasonable. We will build our model on the training data and then evaluate its out-of-sample performance on the test data.
For this we will use caret's createDataPartition() function. This function takes a factor as input along with the proportion for the split. In our case, we want to split the data into training and test sets while making sure that the proportion of Yes and No for CarInsurance remains the same in the two sets. The output of this function is a numeric vector whose values correspond to the row numbers we want to keep in the training set. The test set then consists of all the rows discarded by createDataPartition().
# Create a vector with row numbers corresponding to the training data
index <- caret::createDataPartition(dt4$CarInsurance,
                                    p = 0.8,
                                    list = FALSE) # caret returns a list by default.
dt4_train <- dt4[index, ]
dt4_test <- dt4[-index, ]
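As a final optional check, we can confirm that the stratified split preserved the proportion of Yes and No in both sets:
# The class proportions should be nearly identical in the two sets
prop.table(table(dt4_train$CarInsurance))
prop.table(table(dt4_test$CarInsurance))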