4.2 Data
Load the necessary packages. Install the packages that you don’t have.
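If some of these packages are missing on your machine, something like the following one-time snippet installs only the ones you don't have yet (adjust the vector if your list of packages differs):
pkgs <- c("dplyr", "psych", "caret", "lubridate", "RANN",
          "lime", "ggplot2", "ROSE", "mice")
new_pkgs <- setdiff(pkgs, rownames(installed.packages())) # packages not yet installed
if (length(new_pkgs) > 0) install.packages(new_pkgs)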
library(dplyr)
library(psych)
library(caret)
library(lubridate)
library(RANN) # Used for KNN
library(lime) # Used for variable importance
library(ggplot2)
library(ROSE) # Used for synthetic samples
library(mice) # Used to impute missing categorical values
You can download the CSV files from Kaggle, or you can read them directly from my GitHub repository, which is what I do next. We are first going to use only the file named carInsurance_train.csv.
dt <- read.csv("http://bit.ly/2V8DBkL",
               stringsAsFactors = FALSE)
4.2.1 Data exploration
Check out the names of the columns. Compare to the names in the documentation on Kaggle to make sure we are dealing with the same data set. Furthermore, please understand the meaning of each variable.
names(dt)
## [1] "Id" "Age" "Job"
## [4] "Marital" "Education" "Default"
## [7] "Balance" "HHInsurance" "CarLoan"
## [10] "Communication" "LastContactDay" "LastContactMonth"
## [13] "NoOfContacts" "DaysPassed" "PrevAttempts"
## [16] "Outcome" "CallStart" "CallEnd"
## [19] "CarInsurance"
Learn the variable classes.
sapply(dt, class)
## Id Age Job Marital
## "integer" "integer" "character" "character"
## Education Default Balance HHInsurance
## "character" "integer" "integer" "integer"
## CarLoan Communication LastContactDay LastContactMonth
## "integer" "character" "integer" "character"
## NoOfContacts DaysPassed PrevAttempts Outcome
## "integer" "integer" "integer" "character"
## CallStart CallEnd CarInsurance
## "character" "character" "integer"
CallStart and CallEnd are both character variables in the data set, but they actually represent times in hour:minute:second format. Let's convert them to the proper format using the lubridate package. Separate out the hour, minute, and second from these two variables. Also create a variable CallDuration that captures the duration of the call in seconds. Finally, convert all the character variables to factors.
dt <- dt %>%
  mutate(CallStart = lubridate::hms(CallStart), # parse "hh:mm:ss" strings into periods
         CallEnd = lubridate::hms(CallEnd),
         CallStartHour = hour(CallStart),
         CallStartMin = minute(CallStart),
         CallStartSec = second(CallStart),
         CallEndHour = hour(CallEnd),
         CallEndMin = minute(CallEnd),
         CallEndSec = second(CallEnd),
         CallDuration = period_to_seconds(CallEnd) - period_to_seconds(CallStart)) %>% # call length in seconds
  select(-CallStart, -CallEnd, -Id) %>% # drop the raw times and the Id column
  mutate_if(is.character, as.factor) # convert the remaining character variables to factors
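As a quick sanity check, we can verify that the derived call durations look sensible, for example that every call ends after it starts:
# Optional check: all call durations should be positive
summary(dt$CallDuration)
all(dt$CallDuration > 0)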
Get the descriptive statistics.
psych::describe(dt) %>%
  select(-vars, -trimmed, -mad, -range, -se) %>%
  knitr::kable(digits = 2,
               align = "c",
               caption = "Summary Statistics",
               booktabs = TRUE) # kable prints nice-looking tables.
Table: Summary Statistics

| | n | mean | sd | median | min | max | skew | kurtosis |
|---|---|---|---|---|---|---|---|---|
| Age | 4000 | 41.21 | 11.55 | 39.0 | 18 | 95 | 0.76 | 0.49 |
| Job* | 3981 | 5.42 | 3.21 | 5.0 | 1 | 11 | 0.20 | -1.27 |
| Marital* | 4000 | 2.18 | 0.63 | 2.0 | 1 | 3 | -0.15 | -0.56 |
| Education* | 3831 | 2.19 | 0.67 | 2.0 | 1 | 3 | -0.24 | -0.80 |
| Default | 4000 | 0.01 | 0.12 | 0.0 | 0 | 1 | 8.12 | 63.95 |
| Balance | 4000 | 1532.94 | 3511.45 | 551.5 | -3058 | 98417 | 9.87 | 184.73 |
| HHInsurance | 4000 | 0.49 | 0.50 | 0.0 | 0 | 1 | 0.03 | -2.00 |
| CarLoan | 4000 | 0.13 | 0.34 | 0.0 | 0 | 1 | 2.16 | 2.67 |
| Communication* | 3098 | 1.09 | 0.28 | 1.0 | 1 | 2 | 2.95 | 6.69 |
| LastContactDay | 4000 | 15.72 | 8.43 | 16.0 | 1 | 31 | 0.09 | -1.07 |
| LastContactMonth* | 4000 | 6.47 | 3.13 | 7.0 | 1 | 12 | -0.36 | -1.06 |
| NoOfContacts | 4000 | 2.61 | 3.06 | 2.0 | 1 | 43 | 5.24 | 40.12 |
| DaysPassed | 4000 | 48.71 | 106.69 | -1.0 | -1 | 854 | 2.53 | 7.52 |
| PrevAttempts | 4000 | 0.72 | 2.08 | 0.0 | 0 | 58 | 8.93 | 169.74 |
| Outcome* | 958 | 1.88 | 0.89 | 2.0 | 1 | 3 | 0.23 | -1.69 |
| CarInsurance | 4000 | 0.40 | 0.49 | 0.0 | 0 | 1 | 0.40 | -1.84 |
| CallStartHour | 4000 | 13.03 | 2.59 | 13.0 | 9 | 17 | -0.04 | -1.23 |
| CallStartMin | 4000 | 29.66 | 17.38 | 29.0 | 0 | 59 | -0.01 | -1.24 |
| CallStartSec | 4000 | 29.69 | 17.31 | 30.0 | 0 | 59 | -0.04 | -1.19 |
| CallEndHour | 4000 | 13.14 | 2.61 | 13.0 | 9 | 18 | -0.03 | -1.19 |
| CallEndMin | 4000 | 29.33 | 17.41 | 29.0 | 0 | 59 | 0.02 | -1.20 |
| CallEndSec | 4000 | 29.10 | 17.17 | 29.0 | 0 | 59 | 0.04 | -1.19 |
| CallDuration | 4000 | 350.84 | 342.24 | 232.0 | 5 | 3253 | 2.23 | 7.18 |
4.2.2 Imputing missing values
Most of the variables have no missing values. However, Job and Education have a few missing values, which we can easily impute using a predictive model. Communication has 902 missing values and Outcome has 3042 missing values. That is too many to impute, so we will instead create a separate category for the missing values.

In the case of Communication, a missing value means the communication channel is not available, so we will create a category called "Not Available". Similarly, for Outcome the missing values mean that the person was not contacted before and therefore there is no outcome, so we will create a category called "None".

Also note that DaysPassed takes a value of -1 when the person was not part of any previous campaign. Keeping it at -1 would be a mistake. Ideally, we would put \(\small \infty\) in its place; instead, we will use a very large value. As the maximum DaysPassed in the data is 854, let's use 1,000 as the upper limit.
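If you want to verify these missing-value counts yourself, something like the following does the job:
# Count the number of missing values in each column
colSums(is.na(dt))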
With that, let’s make the changes to our data set.
# Duplicate the original dataset
dt2 <- dt
First make the adjustments to Communication, Outcome, and DaysPassed.
dt2 <- dt2 %>%
  mutate(Communication = ifelse(is.na(Communication), "Not Available", Communication),
         Outcome = ifelse(is.na(Outcome), "None", Outcome),
         DaysPassed = ifelse(DaysPassed == -1, 1000, DaysPassed))
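A quick optional look at the recoded variables confirms that the new categories are present and that DaysPassed no longer contains -1:
# Optional check on the recoded variables
table(dt2$Communication, useNA = "ifany")
table(dt2$Outcome, useNA = "ifany")
summary(dt2$DaysPassed)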
Next, impute the missing values in Job and Education and use the imputed values in their place. Set a seed so the results can be reproduced later.
set.seed(8934)
miceMod <- mice::mice(subset(dt2, select = -CarInsurance),
                      method = "rf") # perform mice imputation based on random forest.
Generate the completed data.
dt3 <- mice::complete(miceMod)
Check whether we have all the values imputed now.
anyNA(dt3)
## [1] FALSE
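If we want a bit more assurance than anyNA(), we can compare the category counts of an imputed variable before and after imputation, for example for Job:
# Optional check: the imputed data should have the same categories,
# with the former NAs distributed among them
table(dt2$Job, useNA = "ifany")
table(dt3$Job)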
Now let’s look at the summary.
dt3$CarInsurance <- dt2$CarInsurance # Add the outcome back; it was excluded from the imputation
psych::describe(dt3) %>%
  select(-vars, -trimmed, -mad, -range, -se) %>%
  knitr::kable(digits = 2,
               align = "c",
               caption = "Summary Statistics\n(Post Imputation)",
               booktabs = TRUE)
Table: Summary Statistics (Post Imputation)

| | n | mean | sd | median | min | max | skew | kurtosis |
|---|---|---|---|---|---|---|---|---|
| Age | 4000 | 41.21 | 11.55 | 39.0 | 18 | 95 | 0.76 | 0.49 |
| Job* | 4000 | 5.42 | 3.21 | 5.0 | 1 | 11 | 0.20 | -1.26 |
| Marital* | 4000 | 2.18 | 0.63 | 2.0 | 1 | 3 | -0.15 | -0.56 |
| Education* | 4000 | 2.19 | 0.66 | 2.0 | 1 | 3 | -0.24 | -0.78 |
| Default | 4000 | 0.01 | 0.12 | 0.0 | 0 | 1 | 8.12 | 63.95 |
| Balance | 4000 | 1532.94 | 3511.45 | 551.5 | -3058 | 98417 | 9.87 | 184.73 |
| HHInsurance | 4000 | 0.49 | 0.50 | 0.0 | 0 | 1 | 0.03 | -2.00 |
| CarLoan | 4000 | 0.13 | 0.34 | 0.0 | 0 | 1 | 2.16 | 2.67 |
| Communication* | 4000 | 1.09 | 0.28 | 1.0 | 1 | 2 | 2.95 | 6.69 |
| LastContactDay | 4000 | 15.72 | 8.43 | 16.0 | 1 | 31 | 0.09 | -1.07 |
| LastContactMonth* | 4000 | 6.47 | 3.13 | 7.0 | 1 | 12 | -0.36 | -1.06 |
| NoOfContacts | 4000 | 2.61 | 3.06 | 2.0 | 1 | 43 | 5.24 | 40.12 |
| DaysPassed | 4000 | 809.97 | 343.85 | 1000.0 | 1 | 1000 | -1.31 | -0.18 |
| PrevAttempts | 4000 | 0.72 | 2.08 | 0.0 | 0 | 58 | 8.93 | 169.74 |
| Outcome* | 4000 | 1.88 | 0.89 | 2.0 | 1 | 3 | 0.23 | -1.69 |
| CallStartHour | 4000 | 13.03 | 2.59 | 13.0 | 9 | 17 | -0.04 | -1.23 |
| CallStartMin | 4000 | 29.66 | 17.38 | 29.0 | 0 | 59 | -0.01 | -1.24 |
| CallStartSec | 4000 | 29.69 | 17.31 | 30.0 | 0 | 59 | -0.04 | -1.19 |
| CallEndHour | 4000 | 13.14 | 2.61 | 13.0 | 9 | 18 | -0.03 | -1.19 |
| CallEndMin | 4000 | 29.33 | 17.41 | 29.0 | 0 | 59 | 0.02 | -1.20 |
| CallEndSec | 4000 | 29.10 | 17.17 | 29.0 | 0 | 59 | 0.04 | -1.19 |
| CallDuration | 4000 | 350.84 | 342.24 | 232.0 | 5 | 3253 | 2.23 | 7.18 |
| CarInsurance | 4000 | 0.40 | 0.49 | 0.0 | 0 | 1 | 0.40 | -1.84 |
Finally, we will create dummy variables for all the factors. For this, we will use the dummyVars function from the caret package. It drops the dependent variable, CarInsurance, from the data, so we will add it back afterwards. Also, the predict function returns a matrix, which we need to convert into a data frame.
dt4 <- predict(dummyVars(CarInsurance ~ .,
                         data = dt3,
                         fullRank = TRUE # Drops the reference category
                         ),
               newdata = dt3) %>%
  data.frame()

dt4$CarInsurance <- dt2$CarInsurance # Add the dependent variable back

# Convert CarInsurance to a factor with labels "No" and "Yes"
dt4 <- dt4 %>%
  mutate(CarInsurance = as.factor(ifelse(CarInsurance == 0, "No", "Yes")))
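To make sure the dummy coding worked as intended, a quick optional inspection of the resulting data frame helps, for example:
# Optional check: factors are now expanded into 0/1 dummy columns
dim(dt4)
head(names(dt4), 10)
table(dt4$CarInsurance)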
4.2.3 Training and test set
Before we build our predictive model, we will split the sample into a training set and a test set. Usually I use an 80-20 split, and as our sample size is fairly large this split is reasonable. We will build our model on the training data and then evaluate its out-of-sample performance on the test data.
For this we will use caret's createDataPartition() function. This function takes a factor as input along with the proportion for the split. In our case, we want to split the data into training and test sets while making sure that the proportion of Yes and No for CarInsurance remains the same in the two sets. The output of this function is a numeric vector whose values correspond to the row numbers we want to keep in the training set. The test set then consists of all the rows discarded by createDataPartition().
# Create a vector with row numbers corresponding to the training data
index <- caret::createDataPartition(dt4$CarInsurance,
                                    p = 0.8,
                                    list = FALSE) # caret returns a list by default.
dt4_train <- dt4[index, ]
dt4_test <- dt4[-index, ]
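As a final optional check, we can confirm that the stratified split preserved the proportion of Yes and No in both sets:
# The class proportions should be nearly identical in the two sets
prop.table(table(dt4_train$CarInsurance))
prop.table(table(dt4_test$CarInsurance))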