3.10 Support Vector Machines (SVM)

The advantage of SVM is that, although it is a linear model, we can use kernels to model data that are not linearly separable. We will use the default radial basis function (RBF) kernel. An SVM with an RBF kernel has two hyperparameters that we need to tune before estimating the model, but tuning takes a long time. Therefore, in this example I won’t actually run the tuning; I have already done it previously using the e1071 package.

We will estimate the SVM separately with the e1071 and caret packages.

First, we tune and estimate the SVM using the e1071 package.

3.10.1 Hyperparameter tuning

For an SVM with the RBF kernel, we need to tune the gamma and cost hyperparameters. Whereas cost is generic to an SVM with any kernel function, gamma is specific to the RBF kernel, which is given as follows:

\[\small K(x_i, x_j) = e^{(-\gamma||x_i-x_j||^2)}\]

where \(\small x_i\) are the sample points and \(\small x_j\) are the support vectors. \(\small \gamma\) controls the extent to which support vectors exert influence on classification of any sample point.18

Lower values of gamma give each support vector a wider radius of influence, producing a smoother decision boundary with higher in-sample error (high bias) but lower variance. Higher values of gamma make the boundary more flexible, lowering in-sample error but increasing the risk of overfitting (high variance).
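
To see how gamma governs this influence, here is a minimal sketch (the rbf_kernel function and the example points are illustrative and not part of the tutorial’s data):

# RBF kernel value for a pair of points at a fixed distance,
# evaluated for several values of gamma
rbf_kernel <- function(x_i, x_j, gamma) exp(-gamma * sum((x_i - x_j)^2))

x_i <- c(0, 0)
x_j <- c(1, 1)   # squared distance = 2
sapply(c(0.05, 0.1, 0.5, 1, 2), function(g) rbf_kernel(x_i, x_j, g))
# Larger gamma -> the kernel value decays faster with distance, so each
# support vector influences only nearby points and the boundary becomes
# more flexible.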

cost sets the penalty for misclassifying sample points. A lower cost therefore allows more misclassification during training, leading to higher in-sample error (high bias) but lower variance.

# Use multiple cores on your computer
# Change the number of cores depending on your processor

doParallel::registerDoParallel(cores = 4) 

set.seed(1234)
svm.tune <- tune.svm(quality.c ~ ., 
                     data = train_wine[, -c(12,15)],
                     gamma = c(0.05, 0.1, 0.5, 1, 2),
                     cost = 10^(0:3))

Print the best gamma and cost.

svm.tune$best.parameters
##    gamma cost
## 19     1 1000

We find that gamma = 1 and cost = 1000 give us the best accuracy. You can also print the performance table for the entire grid by running the following code (not executed in this example).

svm.tune$performances
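
If you do run the tuning, e1071 also provides a plot method for tune objects that visualizes the cross-validation error over the whole grid. A sketch (log-scaling both axes because the grids are multiplicative):

# Contour plot of cross-validation error across the gamma/cost grid
plot(svm.tune, transform.x = log10, transform.y = log10)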

3.10.2 Model estimation

Using the best hyperparameters, we can estimate the model. In the next code block I use the hyperparameters from the tuning above.

model.svm.c <- e1071::svm(quality.c ~ ., 
                          data = train_wine[, -c(12, 15)],
                          kernel = "radial",
                          gamma = 1,
                          cost = 1000)

Print the model summary. Note that the summary() function for svm() doesn’t provide any model parameters.

summary(model.svm.c)
## 
## Call:
## svm(formula = quality.c ~ ., data = train_wine[, -c(12, 15)], 
##     kernel = "radial", gamma = 1, cost = 1000)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1000 
## 
## Number of Support Vectors:  4389
## 
##  ( 1393 1899 751 346 )
## 
## 
## Number of Classes:  4 
## 
## Levels: 
##  q_5 q_6 q_7 q_3489

We have 4,389 support vectors. This suggests that the model is memorizing rather than learning, as the training data set has only 5,200 observations. This could be due to many factors, but the most critical one is that the predictor variables do not carry much information for predicting wine quality.
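
To check this impression, note that the fitted svm object stores both the support-vector count and the fitted training labels, so a quick sketch like the following (assuming the objects above are still in memory and no rows were dropped for missing values) shows the share of observations retained as support vectors and the in-sample accuracy:

# Share of training observations retained as support vectors
model.svm.c$tot.nSV / nrow(train_wine)

# In-sample (training) accuracy; values near 1 indicate memorization
mean(model.svm.c$fitted == train_wine$quality.c)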

3.10.3 Assessing the model performance

caret::confusionMatrix(reference = test_wine$quality.c,
                       predict(model.svm.c, 
                               newdata = test_wine[, -c(12, 14, 15)], 
                               type = "class"))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction q_5 q_6 q_7 q_3489
##     q_5    398  13   2      3
##     q_6     26 543  13      6
##     q_7      0  10 199      2
##     q_3489   3   1   1     77
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9383          
##                  95% CI : (0.9238, 0.9508)
##     No Information Rate : 0.4372          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9072          
##                                           
##  Mcnemar's Test P-Value : 0.1005          
## 
## Statistics by Class:
## 
##                      Class: q_5 Class: q_6 Class: q_7 Class: q_3489
## Sensitivity              0.9321     0.9577     0.9256       0.87500
## Specificity              0.9793     0.9384     0.9889       0.99586
## Pos Pred Value           0.9567     0.9235     0.9431       0.93902
## Neg Pred Value           0.9671     0.9661     0.9853       0.99095
## Prevalence               0.3292     0.4372     0.1658       0.06785
## Detection Rate           0.3069     0.4187     0.1534       0.05937
## Detection Prevalence     0.3207     0.4534     0.1627       0.06322
## Balanced Accuracy        0.9557     0.9480     0.9572       0.93543

It looks like SVM classification on this data is far better than MNL! The test-set accuracy reaches about 94%, up from a mere 54% for MNL. Recall that the no information rate is only 44%, so the SVM improves substantially over simply predicting the most common class.

3.10.4 SVM with caret

The following code tunes the model, but it is not executed in this tutorial because it takes a long time to run.

ctrl <- caret::trainControl(method = "cv",
                            number = 10)
 
set.seed(1492)
grid <- expand.grid(sigma = 10^(-1:4),
                    C = 10^(0:4))

model.svm.tune <- train(quality.c ~ ., 
                        data = train_wine[ , -c(12, 15)],
                        method = "svmRadial",
                        tuneGrid = grid,
                        trControl = ctrl)

Print the best parameter combination.

model.svm.tune$bestTune
##   sigma   C
## 8     1 100

We will use sigma = 1 and C = 100 to estimate the model. We will then use varImp in caret to get the variable importance. In order to plot the variable importance, all the predictor variables have to be numeric. As wine is a factor variable, we create an indicator (dummy) variable for it ourselves.

train_wine$white_wine = ifelse(train_wine$wine == "white", 1, 0)
test_wine$white_wine = ifelse(test_wine$wine == "white", 1, 0)

With the best parameters, estimate the model.

ctrl <- caret::trainControl(method = "cv",
                            number = 10)

grid <- expand.grid(sigma = 1, C = 100)
 
set.seed(89753)

model.svm.c1 <- train(quality.c ~ ., 
                     data = train_wine[,-c(12, 13, 15)],
                     method = "svmRadial", 
                     tuneGrid = grid, 
                     trControl = ctrl)
print(model.svm.c1)
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 5200 samples
##   12 predictor
##    4 classes: 'q_5', 'q_6', 'q_7', 'q_3489' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 4680, 4679, 4680, 4679, 4679, 4681, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.6426823  0.4405893
## 
## Tuning parameter 'sigma' was held constant at a value of 1
## 
## Tuning parameter 'C' was held constant at a value of 100

The cross-validated accuracy of the model is 64%. Next, we check the number of support vectors. Similar to the SVM fitted using e1071, we get 4,386 support vectors with caret, which uses the kernlab package under the hood.

model.svm.c1$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 100 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  1 
## 
## Number of Support Vectors : 4386 
## 
## Objective Function Value : -3306.48 -728.0289 -516.7598 -1715.672 -660.8427 -485.7019 
## Training error : 0.000192
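
If you prefer to extract the support-vector count programmatically rather than read it off the printed model, kernlab provides accessor functions for ksvm objects; a sketch (assuming model.svm.c1 is still in memory):

# Number of support vectors in the underlying kernlab ksvm object
kernlab::nSV(model.svm.c1$finalModel)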

3.10.5 Model performance

We test the SVM performance using test_wine.

caret::confusionMatrix(reference = test_wine$quality.c,
                       predict(model.svm.c1, 
                               newdata = test_wine[, -c(12, 13, 14, 15)], 
                               type = "raw"))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction q_5 q_6 q_7 q_3489
##     q_5    398  13   2      3
##     q_6     26 543  14      6
##     q_7      0  10 198      2
##     q_3489   3   1   1     77
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9375         
##                  95% CI : (0.923, 0.9501)
##     No Information Rate : 0.4372         
##     P-Value [Acc > NIR] : < 2e-16        
##                                          
##                   Kappa : 0.9061         
##                                          
##  Mcnemar's Test P-Value : 0.09136        
## 
## Statistics by Class:
## 
##                      Class: q_5 Class: q_6 Class: q_7 Class: q_3489
## Sensitivity              0.9321     0.9577     0.9209       0.87500
## Specificity              0.9793     0.9370     0.9889       0.99586
## Pos Pred Value           0.9567     0.9219     0.9429       0.93902
## Neg Pred Value           0.9671     0.9661     0.9844       0.99095
## Prevalence               0.3292     0.4372     0.1658       0.06785
## Detection Rate           0.3069     0.4187     0.1527       0.05937
## Detection Prevalence     0.3207     0.4541     0.1619       0.06322
## Balanced Accuracy        0.9557     0.9473     0.9549       0.93543

At about 93.8%, the test-set accuracy is essentially the same as that of the SVM estimated using e1071. Overall, SVM improved the model accuracy significantly over MNL.
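
If you want the two test-set accuracies side by side rather than reading them off the confusion matrices, a quick sketch (assuming both fitted models are still in memory) is:

# Illustrative comparison of test-set accuracy for the two fitted SVMs
acc_e1071 <- mean(predict(model.svm.c,  newdata = test_wine[, -c(12, 14, 15)]) ==
                    test_wine$quality.c)
acc_caret <- mean(predict(model.svm.c1, newdata = test_wine[, -c(12, 13, 14, 15)]) ==
                    test_wine$quality.c)
round(c(e1071 = acc_e1071, caret = acc_caret), 4)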

3.10.6 Variable importance

Next, let’s plot the variable importance using ggplot2.19

var.imp <- varImp(model.svm.c1, 
                  scale = TRUE)

var.imp2 <- var.imp$importance

var.imp2$Chemistry <- rownames(var.imp2)

# Transpose the data
var.imp2 <- reshape2::melt(var.imp2, id = c("Chemistry"))

ggplot(data = var.imp2, aes(x = Chemistry, y = value)) +
   geom_col(aes(fill = variable)) +
   facet_wrap(~variable) +
   scale_fill_manual(values = c("#d08770", "#ebcb8b", "#a3be8c", "#b48ead")) +
   xlab(NULL) + 
   ylab("Importance Weight") +
   theme_minimal() +
   theme(axis.text.x = element_text(angle = 60, hjust = 1),
         legend.position = "none")

As with MNL, each predictor has a different effectiveness in predicting different classes. Alcohol content is clearly the most important predictor across all the classes.
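
To check this impression numerically rather than visually, you can sort the melted importance table created above; a quick sketch:

# The five largest class-specific importance weights
head(var.imp2[order(-var.imp2$value), ], 5)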

In summary, SVM performs very well compared to MNL. Note that we have been using quality.c as our target variable, which assumes there is no ordering in quality. Thus, even with this limited information, SVM performed remarkably well.


  18. RBF is the Gaussian kernel and \(\small \gamma = 1/\sigma^2\)

  19. caret does not report the number of support vectors.