3.8 Linear regression model
For the linear regression model, we assume that quality is a continuous variable. That is certainly a debatable assumption. However, in many practical cases a linear regression model works reasonably well, so I am starting with it.
model.lm <- caret::train(quality ~ .,
                         data = wine2[, -c(14, 15)],
                         method = "lm")
summary(model.lm)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8267 -0.5347 -0.0440 0.5187 3.4109
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.317675 0.051352 6.186 6.54e-10 ***
## fixed_acidity 0.123372 0.023010 5.362 8.54e-08 ***
## volatile_acidity -0.275112 0.015059 -18.269 < 2e-16 ***
## citric_acid -0.004644 0.012951 -0.359 0.7199
## residual_sugar 0.304732 0.031516 9.669 < 2e-16 ***
## chlorides -0.039373 0.015838 -2.486 0.0129 *
## free_sulfur_dioxide 0.194606 0.016048 12.126 < 2e-16 ***
## total_sulfur_dioxide -0.155376 0.020783 -7.476 8.65e-14 ***
## density -0.317525 0.050289 -6.314 2.90e-10 ***
## pH 0.080127 0.016593 4.829 1.40e-06 ***
## sulphates 0.120871 0.012814 9.433 < 2e-16 ***
## alcohol 0.305010 0.025422 11.998 < 2e-16 ***
## winewhite -0.421383 0.066724 -6.315 2.87e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8328 on 6484 degrees of freedom
## Multiple R-squared: 0.3077, Adjusted R-squared: 0.3065
## F-statistic: 240.2 on 12 and 6484 DF, p-value: < 2.2e-16
We get a decent model with an adjusted R-squared of 0.3065, so the predictors explain roughly 31% of the variance in quality. The model as a whole is highly significant: the p-value of the F-statistic is below 2.2e-16. Except for citric_acid (p = 0.72), all the variables are statistically significant at the 5% level.
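The figures quoted above can also be pulled out of the fitted object rather than read off the printout. The following is a minimal sketch, not the code used for this chapter: it fits a stand-in model on the built-in mtcars data because wine2 is not reproduced here. With the caret object from the text, the underlying lm fit should be available as model.lm$finalModel and could be substituted for fit.

```r
# Stand-in fit on a built-in data set (mtcars); wine2 is not available here.
# Assumption: with the caret model from the text, use
#   fit <- model.lm$finalModel
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
s <- summary(fit)

# Adjusted R-squared as a plain number
adj_r2 <- s$adj.r.squared

# p-value of the overall F-test, recomputed from the stored F-statistic
f <- s$fstatistic  # named vector: value, numdf, dendf
f_pvalue <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

# Coefficient table as a matrix; list the terms that are not significant at 5%
coefs <- coef(s)
not_sig <- rownames(coefs)[coefs[, "Pr(>|t|)"] >= 0.05]

print(adj_r2)
print(f_pvalue)
print(not_sig)
```

Working from the summary object keeps the reported numbers in sync with the model if the predictors or the data change.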
Note that I also included the variable wine when estimating the model. As it turns out, white wines have on average a lower rating than red wines, all else equal (the winewhite coefficient is -0.42).14
I would have liked to tweak this model a little to see whether any interactions are present. I leave that as an exercise for the reader.
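One possible starting point for the interaction exercise is to add a single interaction term and compare the two nested models with an F-test. The sketch below is only an illustration on the built-in mtcars data, where am stands in for wine and wt for a continuous predictor such as alcohol. The wine formula in the final comment is hypothetical and has not been run against wine2.

```r
# Stand-in data: mtcars, with am (transmission) playing the role of wine
# and wt playing the role of a continuous predictor.
dat <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))

# Main-effects model, and the same model plus one interaction term
fit_main <- lm(mpg ~ wt + hp + am, data = dat)
fit_int  <- lm(mpg ~ wt + hp + am + am:wt, data = dat)

# Nested F-test: does adding the interaction improve the fit?
cmp <- anova(fit_main, fit_int)
print(cmp)

# Hypothetical wine analogue (an assumption, not run here):
#   lm(quality ~ . + wine:alcohol, data = wine2[, -c(14, 15)])
```

A small p-value in the second row of the anova table would suggest that the slope of the continuous predictor differs between the two groups.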