3.6 Predictor variables

Now that we have the new dependent variables, let’s take a look at the predictor variables. Because this is a predictive analysis (we do not plan to do statistical inference), we will examine the distributions and correlations of the variables and explore whether any of them need transformation.
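A quick way to judge distributions is a table of descriptive statistics that includes a skewness measure. The following is only a minimal sketch: it uses the built-in `mtcars` data as a stand-in for `wine` so that it runs on its own, and the `skewness()` helper is a hypothetical convenience function, not something defined elsewhere in this text.

```r
# Stand-in data: the built-in mtcars data frame replaces `wine` here,
# purely so the sketch runs on its own.
dat <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Sample skewness (third standardized moment); values far from 0 suggest
# a variable may benefit from a transformation such as log().
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# One row per variable, one column per statistic.
desc <- data.frame(
  mean   = sapply(dat, mean),
  sd     = sapply(dat, sd),
  median = sapply(dat, median),
  min    = sapply(dat, min),
  max    = sapply(dat, max),
  skew   = sapply(dat, skewness)
)
print(round(desc, 2))
```

With the wine data, one would presumably replace `dat` with the numeric columns, for example `wine[, -c(13, 14, 15)]`.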

For this we will first get the descriptive statistics and correlations for all the numeric variables. Table 3.1 shows the correlations.

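library(dplyr)  # provides %>% and select()

# Correlation matrix of the numeric columns (columns 13-15 dropped), rounded to 2 places
cormat <- round(cor(as.matrix(wine[, -c(13, 14, 15)])), 2)
# Blank the upper triangle for display; this coerces the matrix to character
cormat[upper.tri(cormat)] <- ""
# Drop the quality column, which would hold only its diagonal entry
cormat <- as.data.frame(cormat) %>% select(-quality)
# Short column labels V1..V11
colnames(cormat) <- paste0("V", 1:11)
# Row labels pair each short code with the variable name, e.g. "V1 : fixed_acidity"
rownames(cormat) <- paste(c(colnames(cormat), "V12"),
                          ":",
                          rownames(cormat))
print(cormat)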

Table 3.1: Correlation Coefficients
	V1	V2	V3	V4	V5	V6	V7	V8	V9	V10	V11
V1 : fixed_acidity	1
V2 : volatile_acidity	0.22	1
V3 : citric_acid	0.32	-0.38	1
V4 : residual_sugar	-0.11	-0.2	0.14	1
V5 : chlorides	0.3	0.38	0.04	-0.13	1
V6 : free_sulfur_dioxide	-0.28	-0.35	0.13	0.4	-0.2	1
V7 : total_sulfur_dioxide	-0.33	-0.41	0.2	0.5	-0.28	0.72	1
V8 : density	0.46	0.27	0.1	0.55	0.36	0.03	0.03	1
V9 : pH	-0.25	0.26	-0.33	-0.27	0.04	-0.15	-0.24	0.01	1
V10 : sulphates	0.3	0.23	0.06	-0.19	0.4	-0.19	-0.28	0.26	0.19	1
V11 : alcohol	-0.1	-0.04	-0.01	-0.36	-0.26	-0.18	-0.27	-0.69	0.12	0	1
V12 : quality	-0.08	-0.27	0.09	-0.04	-0.2	0.06	-0.04	-0.31	0.02	0.04	0.44
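A triangular table is easy to misread, so it can help to list the variable pairs sorted by absolute correlation. In Table 3.1 the largest entries are free and total sulfur dioxide (0.72) and density and alcohol (-0.69). The sketch below shows one way to produce such a ranking. It again assumes `mtcars` as a stand-in for `wine`, and the object names are illustrative only.

```r
# Stand-in data; with the wine data one would pass wine[, -c(13, 14, 15)].
dat <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]

cm <- cor(dat)

# Keep each pair once by taking the indices of the strict lower triangle.
idx <- which(lower.tri(cm), arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)

# Sort by absolute correlation, strongest first.
pairs <- pairs[order(-abs(pairs$r)), ]
rownames(pairs) <- NULL
print(head(pairs, 5))
```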

Next, we will use the ggcorrplot package, which is available on CRAN, to create a nicer-looking correlation plot.

ggcorrplot::ggcorrplot(round(cor(as.matrix(wine[, -c(13,14,15)])), 2), 
            p.mat = ggcorrplot::cor_pmat(as.matrix(wine[, -c(13,14,15)])),
            hc.order = TRUE, type = "lower",
            outline.col = "white",
            ggtheme = ggplot2::theme_minimal,
            colors = c("#cf222c", "white", "#3a2d7f")
            )

Figure 3.1: Correlation Heatmap

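In the heat map above, the crosses mark correlations that are not statistically significant at ggcorrplot's default 5% level. Most predictors are only weakly or moderately correlated with one another, so each carries largely its own information; the notable exceptions are free and total sulfur dioxide (0.72) and density and alcohol (-0.69). However, quality is noticeably correlated with only a few variables, most strongly with alcohol (0.44), density (-0.31) and volatile acidity (-0.27). This is not great news!¹²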
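The p-values behind the crosses come from a test of zero correlation for each pair of variables. As a rough illustration of what that test looks like for a single pair, the sketch below applies base R's `cor.test()` to two `mtcars` columns (a stand-in for the wine data) and compares the p-value with the 5% threshold. Note that with a large sample even a small correlation can be statistically significant, so significance should not be read as strength.

```r
# Stand-in data: mtcars replaces `wine` so the sketch runs on its own.
x <- mtcars$wt
y <- mtcars$mpg

# Test H0: the true correlation is zero (Pearson by default).
ct <- cor.test(x, y)

alpha <- 0.05  # same value as ggcorrplot's default sig.level

cat("r =", round(unname(ct$estimate), 2),
    " p-value =", signif(ct$p.value, 3),
    " significant:", ct$p.value < alpha, "\n")
```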

At this point one could consider transformations to increase the correlations between the variables. However, I am not going to do so, as it would make this exercise broader than I intend. This is left as an exercise for the reader.
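For readers who take up the transformation exercise, a simple starting point is to compare a predictor's correlation with the response before and after a transformation. The sketch below is only an assumption about how one might begin: `mtcars` stands in for `wine`, `hp` plays the role of a skewed predictor, and `mpg` plays the role of the response. The log transformation is one common choice among many, not one prescribed by this text.

```r
# Stand-in data: mtcars replaces `wine`; `hp` plays the role of a skewed
# predictor and `mpg` the role of the response.
predictor <- mtcars$hp
response  <- mtcars$mpg

r_raw <- cor(predictor, response)
r_log <- cor(log(predictor), response)  # log() requires strictly positive values

cat("raw:", round(r_raw, 2), " log:", round(r_log, 2), "\n")
```

If the transformed correlation is clearly larger in absolute value, the transformation may be worth carrying into the model; if the two are similar, the added complexity is probably not justified.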