A wine bottle and a grape with a few bits and bytes to provide a logo for this assignment

This paper explores which chemical components influence wine quality, using ‘Exploratory Data Analysis’ in the R programming language. It is part of the Udacity Data Analyst Nanodegree. It is written as an R-Markdown file and provides a stream-of-consciousness analysis of the data as well as a ‘Final Plots’ section at the end, which is somewhat more cut-to-the-chase.

In the data set at hand, about 5000 white wines are scored by professionals (0 being bad, 10 being excellent) and the chemical properties are recorded. After doing some preparations and loading the data set, we will go through each variable one-by-one.

1 Preparations

First, we load libraries and define some functions to achieve a consistent ‘look and feel’, along with some shortcuts for the plotting styles most commonly used in this project.

2 First Observations

We load the data and assess its structure:

## Data set structure:
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
## Data set dimension:
## [1] 4898   13

It seems that no additional post-processing will be necessary. quality may be more useful as a factor in certain use cases (we will keep this in mind). We can also see that there are 4898 observations and 13 variables.
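Treating quality as a factor could look like the following minimal sketch. The small `wine` data frame is a hypothetical stand-in with made-up scores, and `quality.factor` is an assumed column name that the original analysis does not necessarily use. The levels 3 to 9 reflect the score range observed in this data set (see Section 3.12).

```r
# Hypothetical stand-in for the full data set (quality column only).
wine <- data.frame(quality = c(6L, 5L, 7L, 3L, 9L, 6L))

# Ordered factor with one level per score from 3 to 9, so that the levels
# keep their natural order in tables and plot legends.
wine$quality.factor <- factor(wine$quality, levels = 3:9, ordered = TRUE)

nlevels(wine$quality.factor)  # number of levels
table(wine$quality.factor)    # counts per level, including empty levels
```

Keeping the integer column next to the factor leaves numeric summaries such as `mean(wine$quality)` available.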

Let’s look at the distribution of the key chemical parameters one-by-one by plotting their distribution and their statistical (summary(...)) properties.
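The per-variable output in the next section has the same shape throughout: a `summary(...)` table followed by the standard deviation. A minimal sketch of a helper that prints this is shown below. The function name `describe` is hypothetical (the original helper functions are not shown), and the input is just the first ten fixed.acidity values from the structure output above, so the numbers differ from those of the full data set.

```r
# Hypothetical helper: print the summary table and the standard deviation
# of one numeric variable; returns the standard deviation invisibly.
describe <- function(x) {
  print(summary(x))
  s <- sd(x)
  cat("Standard deviation:", s, "\n")
  invisible(s)
}

# First ten fixed.acidity values as listed in the structure output.
fixed.acidity <- c(7, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7, 6.3, 8.1)
s <- describe(fixed.acidity)
```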

3 Univariate Analysis

3.1 Fixed Acidity (Tartaric Acid) in \(\text{g}/\text{l}\)

Most acids involved with wine are fixed or nonvolatile (do not evaporate readily).

Fixed acidity seems normally distributed (\(\mu = 6.855 \frac{\text{g}}{\text{l}}, \sigma = 0.844 \frac{\text{g}}{\text{l}}\)).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200
## Standard deviation: 0.8438682

3.2 Volatile Acidity (Acetic Acid) in \(\text{g}/\text{l}\)

The amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste.

Volatile acids seem to be present in much lower quantities than fixed acids, by about an order of magnitude. Their distribution (\(\mu = 0.2782 \frac{\text{g}}{\text{l}}, \sigma = 0.1008 \frac{\text{g}}{\text{l}}\)) is slightly right-skewed: the bulk of the values sits on the left, with a longer tail on the right (the mean exceeds the median).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000
## Standard deviation: 0.1007945
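One quick way to put a number on the direction of the skew is Pearson's second skewness coefficient, which only needs the mean, median and standard deviation printed above. This is a rough sketch and not part of the original analysis; the variable names are assumptions. A positive value indicates a longer tail on the right-hand side.

```r
# Summary values for volatile acidity, copied from the output above.
vol_mean   <- 0.2782
vol_median <- 0.2600
vol_sd     <- 0.1007945

# Pearson's second skewness coefficient: 3 * (mean - median) / sd.
pearson_skew <- 3 * (vol_mean - vol_median) / vol_sd
pearson_skew
```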

3.3 Citric Acid in \(\text{g}/\text{l}\)

Found in small quantities, citric acid can add ‘freshness’ and flavor to wines.

Citric acid follows a distribution comparable to that of volatile acidity, with a slightly higher mean (\(\mu = 0.3342 \frac{\text{g}}{\text{l}}, \sigma = 0.1210 \frac{\text{g}}{\text{l}}\)). The skew seems to be smaller.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600
## Standard deviation: 0.1210198

3.4 pH

… describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale.

As expected, we find the pH value distribution with a mean between 3 and 4 (\(\mu = 3.188\)), most likely influenced by the most prevalent acids (fixed acids).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820
## Standard deviation: 0.1510006

3.5 Chloride Concentration in \(\text{g}/\text{l}\)

… the amount of salt in the wine.

The amount of salt in the wine seems normally distributed at first (\(\mu = 0.0458 \frac{\text{g}}{\text{l}}, \sigma = 0.0218 \frac{\text{g}}{\text{l}}\)) but has an extremely ‘long tail’ on the right which would ‘stretch’ a fitted bell curve. Delimited by dashed lines in the graph below is the 90% confidence interval, which the largest part of the bell curve would cover. It will be interesting to see if and how those outliers affect wine quality.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
## Standard deviation: 0.02184797
## 75%-quantile: 0.05

3.6 Free Sulfur Dioxide in \(\text{mg}/\text{l}\)

The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine.

For this chemical property, too, there are a few outliers on the right, which make the distribution slightly right-skewed and the fitted normal curve (\(\mu = 35.31 \frac{\text{mg}}{\text{l}}, \sigma = 17.01 \frac{\text{mg}}{\text{l}}\)) broader. There is one extreme outlier >150 mg/l, which we will just tabulate and not include in the graph.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00
## Standard deviation: 17.00714
## The one outlier with free sulfur dioxide > 150 mg/l:
##         X fixed.acidity volatile.acidity citric.acid residual.sugar
## 4746 4746           6.1             0.26        0.25            2.9
##      chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH
## 4746     0.047                 289                  440 0.99314 3.44
##      sulphates alcohol quality
## 4746      0.64    10.5       3
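Separating such an outlier from the plotted data can be sketched with `subset()`. The three-row `wine` data frame below is a stand-in for the full data set with only a few columns: the first two rows are the first two wines from the structure output, and the third is the outlier tabulated above. The names `outliers` and `plotted` are assumptions.

```r
# Three-row stand-in for the full data set (subset of columns).
wine <- data.frame(
  X                    = c(1L, 2L, 4746L),
  free.sulfur.dioxide  = c(45, 14, 289),
  total.sulfur.dioxide = c(170, 132, 440),
  quality              = c(6L, 6L, 3L)
)

outliers <- subset(wine, free.sulfur.dioxide > 150)   # rows to tabulate
plotted  <- subset(wine, free.sulfur.dioxide <= 150)  # rows kept for the graph
outliers
```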

3.7 Total Sulfur Dioxide in \(\text{mg}/\text{l}\)

Amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.

The total sulfur dioxide distribution (\(\mu = 138.4 \frac{\text{mg}}{\text{l}}, \sigma = 42.50 \frac{\text{mg}}{\text{l}}\)) shows that the average total sulfur dioxide content is about 4 times as high as the average free sulfur dioxide content (138.4 vs. 35.31 mg/l). The skew is less noticeable than for the free sulfur dioxide content and the “long tail” on the right is shorter.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0
## Standard deviation: 42.49806

3.8 Sulphates (\(\text{g}/\text{l}\))

… a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant.

We see a slightly right-skewed distribution of sulphate content (\(\mu = 0.4898 \frac{\text{g}}{\text{l}}, \sigma = 0.1141 \frac{\text{g}}{\text{l}}\)).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800
## Standard deviation: 0.1141258

3.9 Residual Sugar (\(\text{g}/\text{l}\))

The amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet.

The distribution of sugar in the wine is non-normal (\(\mu = 6.391 \frac{\text{g}}{\text{l}}\)). There seem to be several different local maxima, the largest at around 1.5 g/l, smaller ones at 4.5 g/l, 6.3 g/l and 8 g/l, potentially signifying the different wine flavors (sweet, dry, etc.); we will get back to those later.

There is only one sweet wine (with sugar content of over 45 g/l), which we will not show but tabulate. It seems to be a wine rated slightly above average (score 6).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800
## The only sweet wine with residual sugar > 45 g/l:
##         X fixed.acidity volatile.acidity citric.acid residual.sugar
## 2782 2782           7.8            0.965         0.6           65.8
##      chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH
## 2782     0.074                   8                  160 1.03898 3.39
##      sulphates alcohol quality
## 2782      0.69    11.7       6

3.10 Alcohol Content in %

Like the sugar content, the alcohol content follows a distribution with several local maxima (\(\mu = 10.5\%, \text{max} = 14.2\%, \text{min} = 8.0\%\)). It is to be expected that each wine flavor on the market contributes its own signature distribution to the overall distribution.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

3.11 Density in \(\text{kg}/\text{l}\)

The density of wine is close to that of water, depending on the percent alcohol and sugar content.

Wine density is rather tightly normally distributed around the density of water (\(\mu = 0.9940 \frac{\text{kg}}{\text{l}}, \sigma = 2.99 \times 10^{-3} \frac{\text{kg}}{\text{l}}\)).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390
## Standard deviation: 0.002990907

3.12 Output Variable: Quality based on Sensory Data (0-10)

The following graph shows the overall distribution of the score that was given to the wines. It ranges from 3 to 9, with 20 wines in the worst category (3), 5 in the best (9), and most wines (2198) scoring a “6”. This is slightly above both the average (5.88) and the mid-point of the scale (5), potentially showing a psychological bias of the wine testers to slightly overscore wines.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000
## Absolute frequency of output variable 'quality':
## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5
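The mean score and the most frequent score can be recomputed from the frequency table alone, which is a handy cross-check on the summary output. The sketch below uses the counts printed above; the variable names are assumptions.

```r
# Absolute frequencies per quality score, copied from the table above.
counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
            "7" = 880, "8" = 175, "9" = 5)
scores <- as.integer(names(counts))

n_wines      <- sum(counts)                    # total number of wines
mean_quality <- sum(scores * counts) / n_wines # frequency-weighted mean
mode_score   <- scores[which.max(counts)]      # most frequent score

c(n = n_wines, mean = mean_quality, mode = mode_score)
```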

4 Bivariate Analysis

4.1 Acidity Values

The following graph shows the acidity distributions, each one with a different color, all overlaid in one diagram. Notice how the distributions of citric acidity and volatile acidity are much narrower and have a lower mean, while fixed acidity has a higher mean and a broader distribution.

# Create a dataframe with the columns acidtype, acidvalue and fill it 
# with all of the values for citric.acid, volatile.acidity,
# fixed.acidity for all of the wines ...
wineacids <- wine %>%
  dplyr::select(citric.acid, volatile.acidity, fixed.acidity) %>%
  tidyr::gather("acidtype","acidvalue", 1:3)

Does one type of acid come with another type of acid? That is, are any two of the acidity values correlated? We can print the correlation matrix:

##                  fixed.acidity volatile.acidity citric.acid
## fixed.acidity       1.00000000      -0.02269729   0.2891807
## volatile.acidity   -0.02269729       1.00000000  -0.1494718
## citric.acid         0.28918070      -0.14947181   1.0000000
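A correlation matrix of this shape can be produced with base R's `cor()`. The sketch below assumes a data frame `acids` that holds only the first ten wines from the structure output, so its values will not match the full-data matrix above.

```r
# First ten wines from the structure output (acid columns only).
acids <- data.frame(
  fixed.acidity    = c(7, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7, 6.3, 8.1),
  volatile.acidity = c(0.27, 0.3, 0.28, 0.23, 0.23, 0.28, 0.32, 0.27, 0.3, 0.22),
  citric.acid      = c(0.36, 0.34, 0.4, 0.32, 0.32, 0.4, 0.16, 0.36, 0.34, 0.43)
)

# Pairwise Pearson correlations (the default method of cor()).
acid_cor <- cor(acids)
acid_cor
```

The result is symmetric with ones on the diagonal, so only the upper or lower triangle needs to be read.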

We find that fixed acidity and citric acidity are slightly positively correlated (\(r = 0.289\)); that is, where more fixed acidity is present, more citric acidity tends to be present as well.

The next plot shows this, but notice the significant spread in the middle of both acidity distributions.

However, we find that volatile acidity is slightly negatively correlated with both other types of acidity (\(r_\text{fixed} = -0.0227, r_\text{citric} = -0.1495\)), hinting that wines with more volatile acidity contain slightly less of the other two kinds of acids; for fixed acidity, the correlation is close to zero. The next two plots show this relationship.

4.2 pH vs. Acidity

pH is a measure of acidity, but are the pH value and the concentration values of each acid correlated, or does any of the acid concentrations have a particularly strong influence? We also create a new “sum-by-concentration” variable acidity where we just add the concentrations of all acids.
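A minimal sketch of the combined variable, assuming it is the plain sum of the three acid columns as described (the original code for this step is not shown). The `wine` data frame here contains just the first three wines from the structure output.

```r
# First three wines from the structure output (acid columns only).
wine <- data.frame(
  fixed.acidity    = c(7, 6.3, 8.1),
  volatile.acidity = c(0.27, 0.3, 0.28),
  citric.acid      = c(0.36, 0.34, 0.4)
)

# Sum of all three acid concentrations in g/l.
wine$acidity <- with(wine, fixed.acidity + volatile.acidity + citric.acid)
wine$acidity
```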

We find, as expected, the strongest (negative) correlation among the individual acids between pH and the most prevalent one, fixed acidity (\(r = -0.4258\)). Calculating \(r^2\), that is, how much of the variability in pH this acidity accounts for, we get \(r^2 = 0.1814\), which is rather small, but we have to take into account the chemical properties: in a wine, there might be basic molecules which counteract acid components and drive the pH up. Displayed below are the correlations between pH and the acidity variables as well as the respective scatter plots relating both variables.

## Correlation matrix between pH and ...:
##                         [,1]
## fixed.acidity    -0.42585829
## volatile.acidity -0.03191537
## citric.acid      -0.16374821
## acidity          -0.43065133
## R^2 value between pH and ...:
##                         [,1]
## fixed.acidity    0.181355284
## volatile.acidity 0.001018591
## citric.acid      0.026813477
## acidity          0.185460569
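The second table follows from the first by squaring each correlation coefficient. A short sketch using the values printed above (the variable names are assumptions):

```r
# Correlations with pH, copied from the output above.
r <- c(fixed.acidity    = -0.42585829,
       volatile.acidity = -0.03191537,
       citric.acid      = -0.16374821,
       acidity          = -0.43065133)

# Coefficient of determination for each single-predictor relationship.
r_squared <- r^2

# Share of pH variability accounted for, in percent.
round(100 * r_squared, 1)
```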

4.3 Total vs. Free Sulfur Dioxide

Analyzing the correlation between total and free sulfur dioxide content, we find a fairly strong positive correlation (\(r = 0.6155\)): wines with more total sulfur dioxide also tend to contain more free sulfur dioxide. However, the variability of the free sulfur dioxide content increases with a higher total sulfur dioxide content.

## Correlation total vs. free sulfur dioxide: 0.615501
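The correlation coefficient itself is unitless. The expected change in free sulfur dioxide per 1 mg/l of total sulfur dioxide is the slope of a least-squares line, which can be derived from \(r\) and the two standard deviations reported in Sections 3.6 and 3.7. The sketch below is not part of the original analysis, and the variable names are assumptions.

```r
r        <- 0.615501   # correlation total vs. free sulfur dioxide
sd_free  <- 17.00714   # standard deviation of free SO2 in mg/l
sd_total <- 42.49806   # standard deviation of total SO2 in mg/l

# Slope of the least-squares line free ~ total: r * sd(y) / sd(x).
slope <- r * sd_free / sd_total
slope  # mg/l of free SO2 per additional mg/l of total SO2
```

Under these numbers, one additional mg/l of total sulfur dioxide goes along with roughly a quarter of a mg/l of free sulfur dioxide on average.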

4.4 Sulphates vs. Total Sulfur Dioxide

Let’s verify to what extent the variable description holds true and how much sulphates really contribute to ‘unpleasant’ SO2 levels. We find just \(r = 0.1346\) and \(r^2 = 0.0181\), showing that the variability of the sulphate content only explains 1.8% of the variability in the total sulfur dioxide level; sulphates are therefore not a very good indicator of the total sulfur dioxide level. The correlation with free sulfur dioxide is even lower (\(r_\text{free} = 0.0592\)).

## Correlation sulphates vs. total sulfur dioxide: 0.1345624
## R^2 value sulphates vs. total sulfur dioxide: 0.01810703
## Correlation sulphates vs. free sulfur dioxide: 0.05921725

4.5 Residual Sugar vs. Alcohol

Fermentation creates alcohol from sugar; therefore, with more alcohol in the wine, we expect less sugar. Indeed, the values are negatively correlated (\(r = -0.4506\)). The graph below shows that a higher residual sugar content goes along with less alcohol, but we also note that the spread of the alcohol content is much higher for wines with lower sugar content, suggesting, perhaps, that some wines may not have had enough sugar to turn into alcohol to begin with.

## Correlation between residual sugar and alcohol content:  -0.4506312

4.6 Density vs. Sugar & Alcohol Level

Verifying the statement of the variable description that density is influenced by residual sugar and alcohol content, we analyze the correlation and find \(r_\text{residual.sugar} = 0.8390, r_\text{alcohol} = -0.7801\).

Checking \(r^2\), we find that 70.4% of the variability of density is explained by the variability in residual sugar and 60.9% by the variability in alcohol. Since these two variables are themselves correlated (\(r = -0.4506\)), their contributions overlap: the two shares cannot simply be added up, and a linear model that uses both to explain density needs to be interpreted with care.

## Correlation wine density vs. ...
##      residual.sugar    alcohol
## [1,]      0.8389665 -0.7801376
## R^2 value vs. ...
##      residual.sugar   alcohol
## [1,]      0.7038647 0.6086147
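One rough way to judge how much the overlap between residual sugar and alcohol would matter in a two-predictor linear model is the variance inflation factor (VIF). This check is an addition for illustration and is not part of the original analysis. With exactly two predictors, the VIF depends only on their mutual correlation, reported in Section 4.5.

```r
# Correlation between the two candidate predictors (Section 4.5).
r_sugar_alcohol <- -0.4506312

# Variance inflation factor for a model with exactly two predictors:
# 1 / (1 - r^2), where r is the correlation between the predictors.
vif <- 1 / (1 - r_sugar_alcohol^2)
vif
```

A value of 1 would mean no inflation at all; common rules of thumb only flag values above 5 or 10.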

5 Introducing a Categorical Variable: Wine Flavor

We create the following categorical variable to classify wines according to their flavor based on existing EU regulations:

# Create categorical variable for "wine flavor" according to EU trade 
# specifications; variable is depending on acidity and residual sugar
wine$flavor <- 
      ifelse(wine$residual.sugar <= 4 & wine$acidity >= wine$residual.sugar-2,
              "classical-dry",
      ifelse(wine$residual.sugar <= 9 & wine$acidity >= wine$residual.sugar-2,
              "dry",
      ifelse(wine$residual.sugar <= 18 & wine$acidity >= wine$residual.sugar-10,
              "half-dry",
      ifelse(wine$residual.sugar <= 45, "semi-sweet",
      ifelse(wine$residual.sugar > 45, "sweet",
      "unclassified"
      )))))

There are many (classical-)dry wines, fewer half-dry wines, only a few semi-sweet wines and just a single sweet wine.

## Distribution of wine flavors:
## 
## classical-dry           dry      half-dry    semi-sweet         sweet 
##          2097          1414          1246           140             1
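To see how the cut-offs behave, the same rule can be wrapped in a function and applied to a handful of wines. The function name `classify_flavor` is hypothetical, and the sketch drops the final "unclassified" branch, which cannot be reached for non-missing values. The example wines are the second, third and first wines from the structure output, the sweet wine from Section 3.9, and one made-up wine (12 g/l sugar, 7 g/l acidity) to hit the half-dry branch.

```r
# Same nested rule as above, as a vectorized function (hypothetical name).
classify_flavor <- function(residual.sugar, acidity) {
  ifelse(residual.sugar <= 4 & acidity >= residual.sugar - 2, "classical-dry",
  ifelse(residual.sugar <= 9 & acidity >= residual.sugar - 2, "dry",
  ifelse(residual.sugar <= 18 & acidity >= residual.sugar - 10, "half-dry",
  ifelse(residual.sugar <= 45, "semi-sweet", "sweet"))))
}

# The third entry (12, 7) is a made-up wine; the others come from the data.
sugar   <- c(1.6, 6.9, 12, 20.7, 65.8)
acidity <- c(6.94, 8.78, 7, 7.63, 9.365)

flavor <- classify_flavor(sugar, acidity)
data.frame(sugar, acidity, flavor)
```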

The following graph visualizes the classification by showing residual sugar level and acidity level. The flavor the wine is assigned is color-coded. One can see a slight “diagonal slope” at each cut-off-point between clusters signifying one particular flavor type. These are due to the restrictions put on the acidity level.

Finally, we show the sugar content distribution of each wine flavor as a histogram. It is now evident that the local maxima in the aggregated histogram are due to treating all the wines in “one bucket”. Within each wine flavor, the sugar content distribution is much more uniform.

6 Multivariate Analysis

6.1 Scatterplot Matrix

In order to explore the data further, we use a scatterplot matrix which shows

  • a graph for the relationship between any two variables in the lower left triangle, using a scatterplot for numerical data
  • rows/columns that involve a categorical variable showing faceted plots
  • distributions of single variables along the diagonal

The following correlation table shows the correlation values between any two variables again and visually highlights the highest and lowest correlation values:

6.1.1 Confirmation of previous findings (few surprises)

First, we can reconfirm our previous findings.

## Correlation of acidity with...:
##      fixed.acidity volatile.acidity citric.acid
## [1,]     0.9871787       0.07157062   0.3941434

Positive correlation; acidity is defined as the sum by weight of all acids; with fixed and citric acids usually dominating over volatile acids, the correlation is stronger for these two.

## Correlation of pH with...:
##         acidity fixed.acidity volatile.acidity citric.acid
## [1,] -0.4306513    -0.4258583      -0.03191537  -0.1637482

Negative correlation between pH and all types of acidity; however, the correlation between acidity and pH can never be perfect because pH also reflects basic chemical components. Thus, the correlation between pH and any specific type of acidity is also weaker than the correlation between the variable acidity and the corresponding type of acidity.

## Correlation of total sulfur dioxide with free sulfur dioxide:
## [1] 0.615501

Strong positive correlation because free sulfur dioxide is one component of total sulfur dioxide.

## Correlation of sulphates with total sulfur dioxide: 0.1345624

Weak positive correlation; as the description of sulphates suggests, sulphates can contribute to SO2 levels.

## Correlation of alcohol content with residual sugar: -0.4506312

Moderate negative correlation because of the chemical process of alcoholic fermentation (sugar is converted to alcohol).

## Correlation of density with residual sugar: 0.8389665

Strong positive correlation; additional sugar in the fluid leads to a higher density.

## Correlation of free sulfur dioxide with total sulfur dioxide: 0.615501

Strong positive correlation, because free sulfur dioxide is part of total sulfur dioxide.

6.1.2 New Findings Concerning Chemical Properties

## Correlation of residual sugar with free sulfur dioxide: 0.2990984

Weak positive correlation: free sulfur dioxide ‘prevents microbial growth’ and therefore might constrain alcoholic fermentation and result in more remaining residual sugar.

## Correlation of chlorides with alcohol content: -0.3601887

Weak negative correlation: ‘saltier’ wines seem to contain less alcohol. Does the salt also inhibit the fermentation process?

## Correlation of density with ...:
##      chlorides total.sulfur.dioxide
## [1,] 0.2572113            0.5298813

Positive correlation, higher for “total.sulfur.dioxide”; both chlorides and SO2 go along with a higher density of the wine.

## Correlation of acidity with density: 0.2756088

Weak positive correlation; acidity further increases the density of wine.

## Correlation of alcohol content with density: -0.7801376

Strong negative correlation: alcohol is less dense than water, and fermentation consumes sugar while releasing gas, so wines with more alcohol are lighter.

6.1.3 New Findings Concerning Quality

Inspecting the row/column ‘quality’ in the scatterplot matrix, we find particularly high absolute correlation values for the following variables:

## Correlation of quality with...:
##      volatile.acidity fixed.acidity  chlorides   alcohol
## [1,]        -0.194723    -0.1136628 -0.2099344 0.4355747
##      total.sulfur.dioxide
## [1,]           -0.1747372
  • volatile acidity is negatively correlated with quality, as the variable description suggests (“vinegar taste”)
  • fixed acidity is also negatively correlated
  • chlorides (“salts”) seem to have a negative impact on quality
  • higher alcohol content goes along with higher quality scores
  • total sulfur dioxide has a negative influence (recall the variable description “becomes noticeable over 50ppm”)
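Ranking these variables by the absolute value of their correlation with quality makes the relative strength easier to read. The sketch below uses the values printed above; the variable names are assumptions.

```r
# Correlations with quality, copied from the output above.
r_quality <- c(volatile.acidity     = -0.194723,
               fixed.acidity        = -0.1136628,
               chlorides            = -0.2099344,
               alcohol              =  0.4355747,
               total.sulfur.dioxide = -0.1747372)

# Sort by absolute strength, keeping the sign for interpretation.
ranked <- r_quality[order(abs(r_quality), decreasing = TRUE)]
ranked
```

Alcohol stands out as the only positive and by far the strongest of these correlations; the remaining four are all negative and below 0.21 in absolute value.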

6.2 Scatterplot Matrix (faceted by Quality)

We can furthermore imagine that certain properties are only relevant for certain flavors of wine. Or we might overlook some non-linear relationships. Quality can be seen as a 7-level factor variable (3-9), so we can add another discrete (color) dimension to our scatterplot to visualize distributions for different levels of quality.