3.8 Linear regression model

For the linear regression model, we assume that quality is a continuous variable. That is certainly a debatable assumption. However, in many practical cases a linear regression model works reasonably well, so I am starting with it.
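To make the assumption concrete, the model can be written roughly as follows. The notation is introduced here for illustration and is not taken from the original text: $x_{ij}$ is the value of predictor $j$ for wine $i$, $p$ is the number of predictors, and $\varepsilon_i$ is the error term.

```latex
\[
\text{quality}_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i,
\qquad i = 1, \dots, n
\]
```

Treating quality as continuous means the fitted values are not restricted to the values that the rating actually takes.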

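# Fit a linear regression with caret, using every remaining column as a predictor.
# Columns 14 and 15 of wine2 are dropped before fitting.
model.lm <- caret::train(quality ~ .,
                         data = wine2[, -c(14,15)],
                         method = "lm")
# Print the summary of the fitted linear model
summary(model.lm)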
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8267 -0.5347 -0.0440  0.5187  3.4109 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           0.317675   0.051352   6.186 6.54e-10 ***
## fixed_acidity         0.123372   0.023010   5.362 8.54e-08 ***
## volatile_acidity     -0.275112   0.015059 -18.269  < 2e-16 ***
## citric_acid          -0.004644   0.012951  -0.359   0.7199    
## residual_sugar        0.304732   0.031516   9.669  < 2e-16 ***
## chlorides            -0.039373   0.015838  -2.486   0.0129 *  
## free_sulfur_dioxide   0.194606   0.016048  12.126  < 2e-16 ***
## total_sulfur_dioxide -0.155376   0.020783  -7.476 8.65e-14 ***
## density              -0.317525   0.050289  -6.314 2.90e-10 ***
## pH                    0.080127   0.016593   4.829 1.40e-06 ***
## sulphates             0.120871   0.012814   9.433  < 2e-16 ***
## alcohol               0.305010   0.025422  11.998  < 2e-16 ***
## winewhite            -0.421383   0.066724  -6.315 2.87e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8328 on 6484 degrees of freedom
## Multiple R-squared:  0.3077, Adjusted R-squared:  0.3065 
## F-statistic: 240.2 on 12 and 6484 DF,  p-value: < 2.2e-16

We get a decent starting model: adjusted R-squared = 0.3065, so the predictors explain about 31% of the variance in quality. The model as a whole is highly significant (p-value of the F-statistic < 2.2e-16). Except for citric_acid (p = 0.72), all the predictors are statistically significant at the 5% level.
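The figures quoted above can also be pulled out of a summary object programmatically instead of being read off the printout. The following is a minimal sketch using base R's `lm` on the built-in `mtcars` data as a stand-in; the data set and the names `fit`, `adj_r2`, `f_pvalue`, and `not_sig` are assumptions made for illustration and are not part of the wine analysis.

```r
# Stand-in data: the built-in mtcars set, NOT the wine data from this chapter.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
s <- summary(fit)

# Adjusted R-squared, as reported at the bottom of the printed summary
adj_r2 <- s$adj.r.squared

# p-value of the overall F-test, recomputed from the stored statistic
# and its numerator/denominator degrees of freedom
f <- s$fstatistic
f_pvalue <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

# Coefficient table: pick out predictors that are NOT significant at the 5% level
coefs <- coef(s)
pvals <- coefs[-1, "Pr(>|t|)"]   # drop the intercept row
not_sig <- names(pvals)[pvals >= 0.05]

print(adj_r2)
print(f_pvalue)
print(not_sig)
```

For a model fitted with caret, the underlying `lm` fit is stored in the `finalModel` element of the returned object, so the same fields should be available from `summary(model.lm$finalModel)`.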

Note that I also included the variable wine in the data set when estimating the model. As it turns out, white wines have, on average, a lower rating than red wines, all else equal (the estimated coefficient on winewhite is -0.42).[14]

[14] I would have liked to tweak this model a little to check whether any interactions are present. I leave that to the reader as an exercise.
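One possible starting point for the interaction exercise is to compare an additive model with one that includes an interaction term, using a nested-model F-test. The sketch below runs on simulated data so that it is self-contained; the data frame `sim`, its made-up coefficients, and the choice of an alcohol-by-wine interaction are all assumptions for illustration, not results from the wine data.

```r
set.seed(1)
n <- 500

# Simulated stand-in data with hypothetical columns mirroring the chapter's names
sim <- data.frame(
  alcohol = rnorm(n),
  wine    = factor(sample(c("red", "white"), n, replace = TRUE))
)

# Outcome built with an alcohol-by-wine interaction (coefficients are made up)
is_white <- as.numeric(sim$wine == "white")
sim$quality <- 0.3 * sim$alcohol - 0.4 * is_white +
  0.5 * sim$alcohol * is_white + rnorm(n)

# Additive model: main effects only
fit_add <- lm(quality ~ alcohol + wine, data = sim)

# Interaction model: alcohol * wine expands to alcohol + wine + alcohol:wine
fit_int <- lm(quality ~ alcohol * wine, data = sim)

# Nested-model F-test: does adding the interaction term improve the fit?
cmp <- anova(fit_add, fit_int)
print(cmp)
print(coef(fit_int))
```

With the chapter's data, the analogous step would presumably be to add a term such as `alcohol:wine` to the formula passed to `caret::train`; which interactions are worth testing is a judgment call left to the reader.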