3.8 Linear regression model

For the linear regression model, we assume that quality is a continuous variable. That is certainly a debatable assumption. However, in many practical cases a linear regression model works reasonably well, so I am starting with it.

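# Fit an ordinary least squares model via caret, using all remaining columns as predictors.
# Columns 14 and 15 of wine2 are dropped by position before fitting.
model.lm <- caret::train(quality ~ ., 
                         data = wine2[, -c(14,15)], 
                         method = "lm")
# Print the coefficient table of the final lm fit
summary(model.lm)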

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8267 -0.5347 -0.0440  0.5187  3.4109 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           0.317675   0.051352   6.186 6.54e-10 ***
## fixed_acidity         0.123372   0.023010   5.362 8.54e-08 ***
## volatile_acidity     -0.275112   0.015059 -18.269  < 2e-16 ***
## citric_acid          -0.004644   0.012951  -0.359   0.7199    
## residual_sugar        0.304732   0.031516   9.669  < 2e-16 ***
## chlorides            -0.039373   0.015838  -2.486   0.0129 *  
## free_sulfur_dioxide   0.194606   0.016048  12.126  < 2e-16 ***
## total_sulfur_dioxide -0.155376   0.020783  -7.476 8.65e-14 ***
## density              -0.317525   0.050289  -6.314 2.90e-10 ***
## pH                    0.080127   0.016593   4.829 1.40e-06 ***
## sulphates             0.120871   0.012814   9.433  < 2e-16 ***
## alcohol               0.305010   0.025422  11.998  < 2e-16 ***
## winewhite            -0.421383   0.066724  -6.315 2.87e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8328 on 6484 degrees of freedom
## Multiple R-squared:  0.3077, Adjusted R-squared:  0.3065 
## F-statistic: 240.2 on 12 and 6484 DF,  p-value: < 2.2e-16

We get a decent starting model: adjusted R² = 0.3065, so the predictors explain roughly 31% of the variance in quality. The model as a whole is highly significant (p-value of the F-statistic < 2.2e-16). Except for citric_acid, all variables are statistically significant at the conventional 5% level.
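To see where these summary numbers come from, here is a small sketch that recomputes the adjusted R² and the F-statistic from the printed R² and degrees of freedom. The variable names (`r2`, `df_resid`, `p`) are my own, and because the printed R² is rounded to four digits, the results should only match the output up to rounding.

```r
# Values copied from the summary() output above
r2       <- 0.3077  # Multiple R-squared (rounded in the printout)
df_resid <- 6484    # residual degrees of freedom
p        <- 12      # number of slope coefficients (intercept excluded)

# Number of observations implied by the degrees of freedom
n <- df_resid + p + 1

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# Overall F-statistic: explained variance per predictor over
# unexplained variance per residual degree of freedom
f_stat <- (r2 / p) / ((1 - r2) / df_resid)

print(n)
print(adj_r2)
print(f_stat)
```

With this many observations relative to 12 predictors, the adjustment is tiny, which is why the multiple and adjusted R² are nearly identical.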

Note that I also included the variable wine as a predictor when estimating the model. The negative winewhite coefficient (-0.42) indicates that, all else equal, white wines have a lower rating on average than red wines.¹⁴

¹⁴ I would have liked to tweak this model a little to check whether any interactions are present. I leave that as an exercise for the reader.
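As a starting point for the interaction exercise, here is a minimal sketch of how one might test a single interaction with base R's `lm()`. It runs on simulated stand-in data, not on wine2: the data frame `toy`, its coefficients, and the choice of an alcohol-by-wine interaction are all assumptions made for illustration.

```r
set.seed(42)

# Simulated stand-in data; this is NOT the wine2 data set
n <- 500
toy <- data.frame(
  alcohol = rnorm(n),
  wine    = factor(sample(c("red", "white"), n, replace = TRUE))
)

# Outcome in which the alcohol slope differs between red and white
toy$quality <- 0.3 * toy$alcohol -
  0.4 * (toy$wine == "white") +
  0.2 * toy$alcohol * (toy$wine == "white") +
  rnorm(n, sd = 0.5)

# Main-effects model versus model with an alcohol-by-wine interaction
fit_main <- lm(quality ~ alcohol + wine, data = toy)
fit_int  <- lm(quality ~ alcohol * wine, data = toy)

# Nested-model F test: does the interaction term improve the fit?
cmp <- anova(fit_main, fit_int)
print(cmp)

# The interaction coefficient is labelled alcohol:winewhite
print(coef(fit_int))
```

On the real data, the same comparison could be run by replacing `toy` with the wine data and adding a term such as `alcohol:wine` to the formula, assuming those column names match the data set.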