3.7 More descriptive statistics
Let’s get more descriptive statistics in order to understand the distribution of our variables a little better.
wine_temp <- wine[,-c(13,14,15)]
desc <- as.data.frame(cbind(Mean = sapply(wine_temp, mean),
Median = sapply(wine_temp, median),
Std_Dev = sapply(wine_temp, sd),
CV = sapply(wine_temp, sd) / sapply(wine_temp, mean),
Skewness = sapply(wine_temp, skewness),
Kurtosis = sapply(wine_temp, kurtosis)))
round(desc,2)
Variable | Mean | Median | Std_Dev | CV | Skewness | Kurtosis |
---|---|---|---|---|---|---|
fixed_acidity | 7.22 | 7.00 | 1.30 | 0.18 | 1.72 | 5.05 |
volatile_acidity | 0.34 | 0.29 | 0.16 | 0.48 | 1.49 | 2.82 |
citric_acid | 0.32 | 0.31 | 0.15 | 0.46 | 0.47 | 2.39 |
residual_sugar | 5.44 | 3.00 | 4.76 | 0.87 | 1.43 | 4.35 |
chlorides | 0.06 | 0.05 | 0.04 | 0.63 | 5.40 | 50.84 |
free_sulfur_dioxide | 30.53 | 29.00 | 17.75 | 0.58 | 1.22 | 7.90 |
total_sulfur_dioxide | 115.74 | 118.00 | 56.52 | 0.49 | 0.00 | -0.37 |
density | 0.99 | 0.99 | 0.00 | 0.00 | 0.50 | 6.60 |
pH | 3.22 | 3.21 | 0.16 | 0.05 | 0.39 | 0.37 |
sulphates | 0.53 | 0.51 | 0.15 | 0.28 | 1.80 | 8.64 |
alcohol | 10.49 | 10.30 | 1.19 | 0.11 | 0.57 | -0.53 |
quality | 5.82 | 6.00 | 0.87 | 0.15 | 0.19 | 0.23 |
The most interesting column for me is the CV (coefficient of variation), which is the ratio of the standard deviation to the mean. For some variables the CV is very low (e.g., 0.00 for density and 0.05 for pH). This means that the standard deviation is extremely small compared to the mean. Clearly we need some scaling here to remove the effect of the mean. One way to do that is to mean center all the variables so that every variable has zero mean. Centering retains the low standard deviation, however. To overcome this issue, we can also divide each variable by its standard deviation. This way, we normalize the data so that every variable has mean = 0 and standard deviation = 1.[13]
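To make the two steps concrete, here is a minimal sketch on a made-up vector (the values in `x` are hypothetical and not taken from the wine data). It shows that centering alone leaves the standard deviation unchanged, while dividing by the standard deviation afterwards brings it to 1.

```r
# Hypothetical toy values, not from the wine data
x <- c(0.990, 0.992, 0.994, 0.996, 0.998)

# Step 1: mean centering shifts the mean to 0 but keeps the spread
x_centered <- x - mean(x)
mean(x_centered)   # 0 up to floating-point error
sd(x_centered)     # same as sd(x)

# Step 2: dividing by the standard deviation gives unit spread
x_scaled <- x_centered / sd(x)
sd(x_scaled)       # 1 up to floating-point error

# scale() performs both steps at once (it returns a one-column matrix)
all.equal(as.numeric(scale(x)), x_scaled)
```

By default `scale()` centers and scales every column of a data frame or matrix in this way, which is why it can be applied to several columns at once.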
Let’s scale the numeric variables.
# First create a duplicate dataset
wine2 <- wine
wine2[,c(1:12)] <- scale(wine[ , c(1:12)])
desc2 <- as.data.frame(cbind(Mean = sapply(wine2[ , c(1:12)], mean),
Median = sapply(wine2[ , c(1:12)], median),
Std.Dev = sapply(wine2[ , c(1:12)], sd),
Skewness = sapply(wine2[ , c(1:12)], skewness),
Kurtosis = sapply(wine2[ , c(1:12)], kurtosis)))
round(desc2,2)
Variable | Mean | Median | Std.Dev | Skewness | Kurtosis |
---|---|---|---|---|---|
fixed_acidity | 0 | -0.17 | 1 | 1.72 | 5.05 |
volatile_acidity | 0 | -0.30 | 1 | 1.49 | 2.82 |
citric_acid | 0 | -0.06 | 1 | 0.47 | 2.39 |
residual_sugar | 0 | -0.51 | 1 | 1.43 | 4.35 |
chlorides | 0 | -0.26 | 1 | 5.40 | 50.84 |
free_sulfur_dioxide | 0 | -0.09 | 1 | 1.22 | 7.90 |
total_sulfur_dioxide | 0 | 0.04 | 1 | 0.00 | -0.37 |
density | 0 | 0.06 | 1 | 0.50 | 6.60 |
pH | 0 | -0.05 | 1 | 0.39 | 0.37 |
sulphates | 0 | -0.14 | 1 | 1.80 | 8.64 |
alcohol | 0 | -0.16 | 1 | 0.57 | -0.53 |
quality | 0 | 0.21 | 1 | 0.19 | 0.23 |
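Scaling doesn’t affect skewness or kurtosis, because centering and dividing by the standard deviation is a linear transformation and both measures are computed from standardized values. In order to alter these two moments, we need to use a nonlinear transformation such as a logarithmic or square root transformation. I’m going to do it through trial and error.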
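The following sketch illustrates this point on simulated data. The helper `skew()` is a hypothetical base R function written for this illustration (the usual moment-based definition), and the log-normal sample is made up, not drawn from the wine data.

```r
# Moment-based sample skewness, written out in base R for illustration
# (assumption: the usual definition m3 / m2^(3/2))
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

set.seed(1)
x <- rlnorm(1000)            # simulated right-skewed data, hypothetical

skew(x)                      # clearly positive
skew(as.numeric(scale(x)))   # same value up to floating-point error
skew(log(x))                 # much closer to 0: the log is nonlinear
```

The same reasoning applies to kurtosis, since it is also computed from deviations divided by the standard deviation.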
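A normal distribution has skewness = 0 and kurtosis = 3 (that is, excess kurtosis = 0). The kurtosis column above contains negative values, which is only possible for excess kurtosis, so those values should be compared against 0 rather than 3. On that basis, total_sulfur_dioxide, pH, alcohol, and quality seem to have roughly this shape (a good idea is to plot these distributions). I am concerned about fixed_acidity, chlorides, free_sulfur_dioxide, density, and sulphates due to high kurtosis (and skewness in some cases). Let’s take their log transform first and then scale these variables.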
wine2[ , c(1, 5, 6, 8, 10)] <- scale(log(wine[ , c(1, 5, 6, 8, 10)]))
Print skewness and kurtosis.
moments::skewness(wine2[,c(1, 5, 6, 8, 10)])
fixed_acidity | 0.8889319 |
chlorides | 0.8762698 |
free_sulfur_dioxide | -0.8340045 |
density | 0.4672599 |
sulphates | 0.4048986 |
moments::kurtosis(wine2[,c(1, 5, 6, 8, 10)])
fixed_acidity | 4.896783 |
chlorides | 5.305355 |
free_sulfur_dioxide | 3.429675 |
density | 9.008338 |
sulphates | 3.701296 |
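When reading these numbers, keep the kurtosis convention in mind: `moments::kurtosis()` reports Pearson's kurtosis, for which a normal distribution scores about 3, whereas excess kurtosis subtracts 3 so that a normal distribution scores about 0. The sketch below uses two hypothetical base R helpers and a simulated normal sample to show the difference. It is an illustration of the two conventions, not the exact estimator used by any particular package.

```r
# Pearson kurtosis m4 / m2^2 (about 3 for normal data)
pearson_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

# Excess kurtosis subtracts 3, so normal data score about 0
excess_kurtosis <- function(x) pearson_kurtosis(x) - 3

set.seed(42)
z <- rnorm(10000)      # simulated normal sample, hypothetical

pearson_kurtosis(z)    # close to 3
excess_kurtosis(z)     # close to 0
```

If the earlier tables report excess kurtosis, as their negative entries suggest, subtract 3 from the values printed above before comparing them with those tables.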
Except for density, the remaining four variables benefited from the log transformation. After plotting the distribution for density, it appears that this might be because of a couple of extreme values. Although the log transformation should have reduced their influence, it didn’t work out. So I replaced the two extreme values (found in 3 observations) with the next highest observed value, 1.00369, and then took the log. As it turns out, the transformation paid off.
# replace the values > 1.00369 by 1.00369
wine2$density <- ifelse(wine$density > 1.00369,
1.00369,
wine$density)
# scale() returns a one-column matrix; as.numeric() keeps density a plain vector
wine2$density <- as.numeric(scale(log(wine2$density)))
moments::skewness(wine2$density)
## [1] -0.02258351
moments::kurtosis(wine2$density)
## [1] 2.254121
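The capping step can also be written with `pmin()`, which takes the element-wise minimum of a vector and a threshold. The sketch below uses made-up values and an assumed cap to show that it gives the same result as the `ifelse()` form; the names `d` and `cap` are hypothetical.

```r
# Hypothetical toy values; the last two play the role of extreme observations
d <- c(0.991, 0.994, 0.996, 0.998, 1.010, 1.039)
cap <- 0.998                     # assumed cap: the largest non-extreme value

capped_ifelse <- ifelse(d > cap, cap, d)
capped_pmin   <- pmin(d, cap)    # element-wise minimum, more compact

all.equal(capped_ifelse, capped_pmin)
max(capped_pmin)                 # no value exceeds the cap any more
```

Note that capping changes the data, so it is worth recording which observations were altered and why.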
Now we have a data set, wine2, which has all the transformed variables. We will use it for the rest of the analysis.
[13] Will that help with reducing skewness and kurtosis? Think about it for a moment before you read on.