3.7 More descriptive statistics

Let’s compute more descriptive statistics to understand the distributions of our variables a little better.

# Keep the first 12 columns (the numeric variables) by dropping columns 13 to 15
wine_temp <- wine[ , -c(13, 14, 15)]

desc <- as.data.frame(cbind(Mean = sapply(wine_temp, mean),
                      Median = sapply(wine_temp, median),
                      Std_Dev = sapply(wine_temp, sd),
                      CV = sapply(wine_temp, sd) / sapply(wine_temp, mean),
                      Skewness = sapply(wine_temp, skewness),
                      Kurtosis = sapply(wine_temp, kurtosis)))

round(desc,2)

Table 3.2: Detailed Summary Statistics
	Mean	Median	Std_Dev	CV	Skewness	Kurtosis
fixed_acidity	7.22	7.00	1.30	0.18	1.72	5.05
volatile_acidity	0.34	0.29	0.16	0.48	1.49	2.82
citric_acid	0.32	0.31	0.15	0.46	0.47	2.39
residual_sugar	5.44	3.00	4.76	0.87	1.43	4.35
chlorides	0.06	0.05	0.04	0.63	5.40	50.84
free_sulfur_dioxide	30.53	29.00	17.75	0.58	1.22	7.90
total_sulfur_dioxide	115.74	118.00	56.52	0.49	0.00	-0.37
density	0.99	0.99	0.00	0.00	0.50	6.60
pH	3.22	3.21	0.16	0.05	0.39	0.37
sulphates	0.53	0.51	0.15	0.28	1.80	8.64
alcohol	10.49	10.30	1.19	0.11	0.57	-0.53
quality	5.82	6.00	0.87	0.15	0.19	0.23

The most interesting column for me is the CV (coefficient of variation), which is the ratio of the standard deviation to the mean. Some variables have a very low CV (e.g., 0.00 for density and 0.05 for pH), which means that their standard deviation is extremely small compared to their mean. Clearly we need some scaling here to remove the effect of the mean. One way to do that is to mean-center all the variables so that each has zero mean. Centering leaves the low standard deviations unchanged, however. To overcome this, we can also divide each variable by its standard deviation. This way, we standardize the data so that every variable has mean = 0 and standard deviation = 1.¹³
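As a minimal sketch of this standardization, assuming a small made-up data frame `toy` (not the wine data), the following compares the CV of two columns and then checks that standardized columns end up with mean 0 and standard deviation 1. Base R's `scale()` centers each column and divides it by its standard deviation by default.

```r
# Hypothetical toy data: two columns with very different means and spreads
toy <- data.frame(a = c(10, 12, 14, 16, 18),
                  b = c(0.990, 0.992, 0.994, 0.996, 0.998))

# Coefficient of variation: standard deviation divided by the mean
cv <- sapply(toy, sd) / sapply(toy, mean)

# Standardize by hand: subtract the mean, then divide by the standard deviation
manual <- as.data.frame(lapply(toy, function(x) (x - mean(x)) / sd(x)))

# scale() does the same thing; it returns a matrix, so convert back
scaled <- as.data.frame(scale(toy))

round(cv, 4)
round(sapply(scaled, mean), 10)  # zero up to floating-point error
sapply(scaled, sd)               # one for every column
```

Column `b` mimics a variable like density: its values sit close to 1 with a tiny spread, so its CV is far smaller than that of `a`, yet after standardization both columns are on the same footing.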

Let’s scale the numeric variables.

# First create a duplicate dataset
wine2 <- wine
wine2[,c(1:12)] <- scale(wine[ , c(1:12)])

desc2 <- as.data.frame(cbind(Mean = sapply(wine2[ , c(1:12)], mean),
                      Median = sapply(wine2[ , c(1:12)], median),
                      Std.Dev = sapply(wine2[ , c(1:12)], sd),
                      Skewness = sapply(wine2[ , c(1:12)], skewness),
                      Kurtosis = sapply(wine2[ , c(1:12)], kurtosis)))

round(desc2,2)

Table 3.3: Summary Statistics of Scaled Variables
	Median	Std.Dev	Skewness	Kurtosis
fixed_acidity	-0.17	1	1.72	5.05
volatile_acidity	-0.30	1	1.49	2.82
citric_acid	-0.06	1	0.47	2.39
residual_sugar	-0.51	1	1.43	4.35
chlorides	-0.26	1	5.40	50.84
free_sulfur_dioxide	-0.09	1	1.22	7.90
total_sulfur_dioxide	0.04	1	0.00	-0.37
density	0.06	1	0.50	6.60
pH	-0.05	1	0.39	0.37
sulphates	-0.14	1	1.80	8.64
alcohol	-0.16	1	0.57	-0.53
quality	0.21	1	0.19	0.23

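Scaling doesn’t affect skewness or kurtosis, because both are computed from standardized values and so are unchanged by shifting a variable or rescaling it by a positive constant. To alter these two moments, we need a nonlinear transformation such as a logarithm or a square root. I’m going to choose one through trial and error.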
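This behavior can be checked with a short sketch. The helpers `skew()` and `kurt()` below are illustrative moment-based definitions written for this example; they are not the functions used to build the tables above, and the vector `x` is made up.

```r
# Illustrative helpers (hypothetical names): moment-based skewness and kurtosis
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}
kurt <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2   # equals 3 for a normal distribution
}

# Made-up right-skewed data
x <- c(1, 1, 2, 2, 3, 4, 6, 9, 15, 40)

z <- as.numeric(scale(x))   # linear: subtract the mean, divide by the sd
l <- log(x)                 # nonlinear

c(raw = skew(x), scaled = skew(z), logged = skew(l))
c(raw = kurt(x), scaled = kurt(z), logged = kurt(l))
```

The `raw` and `scaled` entries agree, while the `logged` entries are noticeably smaller, which is why a log or square root is worth trying on heavily right-skewed variables.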

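A normal distribution has skewness = 0 and kurtosis = 3, or equivalently excess kurtosis = 0. The negative entries in Tables 3.2 and 3.3 show that the kurtosis reported there is on the excess scale (raw kurtosis can never be negative), so values near 0 indicate a normal-like shape. total_sulfur_dioxide, pH, alcohol, and quality seem to have this shape (it is a good idea to plot these distributions to confirm). I am concerned about fixed_acidity, chlorides, free_sulfur_dioxide, density, and sulphates due to high kurtosis (and skewness in some cases). Let’s take their log transform first and then scale these variables.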

wine2[ , c(1, 5, 6, 8, 10)] <- scale(log(wine[ , c(1, 5, 6, 8, 10)]))
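Because kurtosis is reported on two scales, raw (normal = 3) and excess (normal = 0), it helps to see that the two differ only by a constant. A minimal sketch, using hand-written helpers with hypothetical names rather than any package function:

```r
# Hypothetical helpers: raw kurtosis and excess kurtosis from central moments
raw_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}
excess_kurtosis <- function(x) raw_kurtosis(x) - 3

# A flat, made-up sample with lighter tails than a normal distribution
x <- c(1, 2, 3, 4, 5)

raw_kurtosis(x)      # 1.7
excess_kurtosis(x)   # -1.3
```

A negative value is therefore only possible on the excess scale, which is a quick way to tell which convention a given table or function uses.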

Print skewness and kurtosis. Note that moments::kurtosis() reports raw kurtosis, for which a normal distribution scores 3.

moments::skewness(wine2[,c(1, 5, 6, 8, 10)])

Table 3.4: Skewness
fixed_acidity	0.8889319
chlorides	0.8762698
free_sulfur_dioxide	-0.8340045
density	0.4672599
sulphates	0.4048986

moments::kurtosis(wine2[,c(1, 5, 6, 8, 10)])

Table 3.5: Kurtosis
fixed_acidity	4.896783
chlorides	5.305355
free_sulfur_dioxide	3.429675
density	9.008338
sulphates	3.701296

Except for density, the remaining four variables benefited from the log transformation. A plot of the distribution of density suggests that a couple of extreme values are responsible. One might expect the log transformation to pull them in, but density varies over such a narrow range near 1 that the log is almost linear there and changes very little. So I replaced the two extreme values (found in 3 observations) with the next highest observed value, 1.00369, and then took the log. As it turns out, the adjustment paid off.

# Cap density at 1.00369: values above it are replaced by 1.00369
wine2$density <- pmin(wine$density, 1.00369)
# scale() returns a one-column matrix; as.numeric() keeps the column a plain vector
wine2$density <- as.numeric(scale(log(wine2$density)))

moments::skewness(wine2$density)

## [1] -0.02258351

moments::kurtosis(wine2$density)

## [1] 2.254121
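The effect of capping before taking logs can be sketched on made-up values in a density-like range. Nothing below comes from the wine data: the vector `x`, the threshold `cap`, and the helper `kurt()` (raw kurtosis) are all assumptions for illustration.

```r
# Hypothetical helper: raw kurtosis (3 for a normal distribution)
kurt <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

# Made-up density-like values plus two extreme observations
x <- c(seq(0.988, 1.002, length.out = 48), 1.0103, 1.0390)

# log() is nearly linear this close to 1, so the outliers stay outliers
kurt(x)
kurt(log(x))

# Cap everything above a chosen threshold (here the largest non-extreme value)
cap <- 1.002
x_capped <- pmin(x, cap)
kurt(log(x_capped))
```

On this toy vector the log alone leaves kurtosis very high, while capping first brings it below 3. Capping does change the data, so the threshold is a judgment call worth checking against a plot.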

Now we have a data set wine2 which has all the transformed variables. We will use it for the rest of the analysis.

¹³ Will that help with reducing skewness and kurtosis? Think about it for a moment before you read on.