1 Preparations
2 First Observations
3 Univariate Analysis
4 Bivariate Analysis
5 Introducing a Categorical Variable: Wine Flavor
6 Multivariate Analysis
7 Final Plots and Summaries
8 Reflection
9 References


This paper explores which chemical components influence wine quality, using the method of ‘Exploratory Data Analysis’ and the R programming language. It is part of the Udacity Data Analyst Nanodegree. It is written as an R Markdown file and provides a stream-of-consciousness analysis of the data as well as a ‘Final Plots’ section at the end, which is somewhat more cut-to-the-chase.

In the data set at hand, about 5000 white wines are scored by professionals (0 being bad, 10 being excellent) and the chemical properties are recorded. After doing some preparations and loading the data set, we will go through each variable one-by-one.

1 Preparations

First, we load libraries and define some functions to achieve a unique ‘look and feel’ and some shortcuts for the plotting styles most commonly used in this project.

2 First Observations

We load the data and assess its structure:

## Data set structure:

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

## Data set dimension:

## [1] 4898   13

It seems that no additional post-processing will be necessary. quality may be more useful as a factor in certain use cases (we will keep this in mind). We can also see that there are 4898 observations and 13 variables.

Let’s look at the distribution of the key chemical parameters one-by-one by plotting their distribution and their statistical (summary(...)) properties.
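The per-variable output in the next chapter (a summary plus a standard deviation) could be produced along the following lines. This is only a sketch: the helper name describe_variable is hypothetical, since the helper functions actually used are not shown, and the input is just the first ten fixed.acidity values printed by str() above, not the full data set.

```r
# Hypothetical helper: summary statistics plus standard deviation for one variable
describe_variable <- function(x) {
  stats <- c(unclass(summary(x)), SD = sd(x))
  round(stats, 4)
}

# First ten fixed.acidity values as printed by str() above (not the full data set)
fixed_acidity_head <- c(7, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7, 6.3, 8.1)
acidity_stats <- describe_variable(fixed_acidity_head)
print(acidity_stats)
```

Because only ten rows are used here, the numbers differ from the full-data summaries shown below.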

3 Univariate Analysis

3.1 Fixed Acidity (Tartaric Acid) in $\text{g}/\text{l}$

Most acids involved with wine are fixed or nonvolatile (do not evaporate readily).

Fixed acidity seems normally distributed ($\mu = 6.855 \frac{\text{g}}{\text{l}}, \sigma = 0.844 \frac{\text{g}}{\text{l}}$).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

## Standard deviation: 0.8438682

3.2 Volatile Acidity (Acetic Acid) in $\text{g}/\text{l}$

The amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste.

Volatile acids seem to be present in much lower quantities than fixed acids, by more than an order of magnitude. Their distribution ($\mu = 0.2782 \frac{\text{g}}{\text{l}}, \sigma = 0.1008 \frac{\text{g}}{\text{l}}$) is slightly right-skewed: the peak sits on the left and the longer tail extends to the right.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

## Standard deviation: 0.1007945

3.3 Citric Acid in $\text{g}/\text{l}$

Found in small quantities, citric acid can add ‘freshness’ and flavor to wines.

The amount of citric acid seems to follow a slightly higher, but otherwise comparable distribution to the volatile acidity ($\mu = 0.3342 \frac{\text{g}}{\text{l}}, \sigma = 0.1210 \frac{\text{g}}{\text{l}}$). The skew seems to be smaller.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

## Standard deviation: 0.1210198

3.4 pH

… describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale.

As expected, the pH distribution has a mean between 3 and 4 ($\mu = 3.188$), most likely driven by the most prevalent acids (fixed acids).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

## Standard deviation: 0.1510006

3.5 Chloride Concentration in $\text{g}/\text{l}$

… the amount of salt in the wine.

The amount of salt in the wine seems normally distributed at first ($\mu = 0.0458 \frac{\text{g}}{\text{l}}, \sigma = 0.0218 \frac{\text{g}}{\text{l}}$) but has an extremely long tail on the right, which would stretch a fitted bell curve. The dashed lines in the graph below delimit the central 90% interval, which covers the largest part of the bell curve. It will be interesting to see if and how those outliers affect wine quality.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

## Standard deviation: 0.02184797

## 75%-quantile: 0.05
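As a worked example, the central 90% interval of a normal distribution with the reported mean and standard deviation can be computed with qnorm(). Whether the dashed lines in the graph were derived exactly this way is an assumption; the sketch only illustrates how far the right tail reaches beyond such an interval.

```r
# Mean and standard deviation of chlorides as reported in the output above
mu    <- 0.04577
sigma <- 0.02184797

# Central 90% interval of a normal distribution with these parameters:
# the 5% and 95% quantiles, i.e. mu -/+ about 1.645 * sigma
interval_90 <- qnorm(c(0.05, 0.95), mean = mu, sd = sigma)
print(round(interval_90, 4))

# Distance of the observed maximum (0.346 g/l) from the mean, in standard deviations
max_chlorides <- 0.346
z_max <- (max_chlorides - mu) / sigma
print(round(z_max, 1))
```

The observed maximum lies many standard deviations above the mean, which is what the long right tail in the plot reflects.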

3.6 Free Sulfur Dioxide in $\text{mg}/\text{l}$

The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine.

For this chemical property, too, there are a few outliers on the right, making the fitted normal distribution curve ($\mu = 35.31 \frac{\text{mg}}{\text{l}}, \sigma = 17.0071 \frac{\text{mg}}{\text{l}}$) broader and the distribution slightly more right-skewed. There is one extreme outlier above 150 mg/l, which we will just tabulate and not include in the graph.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

## Standard deviation: 17.00714

## The one outlier with free sulfur dioxide > 150 mg/l:

##         X fixed.acidity volatile.acidity citric.acid residual.sugar
## 4746 4746           6.1             0.26        0.25            2.9
##      chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH
## 4746     0.047                 289                  440 0.99314 3.44
##      sulphates alcohol quality
## 4746      0.64    10.5       3

3.7 Total Sulfur Dioxide in $\text{mg}/\text{l}$

Amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.

The total sulfur dioxide distribution ($\mu = 138.4 \frac{\text{mg}}{\text{l}}, \sigma = 42.498 \frac{\text{mg}}{\text{l}}$) shows that the average total sulfur dioxide content is about 4 times higher than the average free sulfur dioxide content. The right skew is less noticeable than for the free sulfur dioxide content and the long tail on the right is shorter.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

## Standard deviation: 42.49806

3.8 Sulphates ($\text{g}/\text{l}$)

… a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant.

We see a slightly right-skewed distribution of sulphate content ($\mu = 0.4898 \frac{\text{g}}{\text{l}}, \sigma = 0.1141 \frac{\text{g}}{\text{l}}$).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

## Standard deviation: 0.1141258

3.9 Residual Sugar ($\text{g}/\text{l}$)

The amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet.

The distribution of sugar in the wine is non-normal ($\mu = 6.391 \frac{g}{l}$). There seem to be several different local maxima, the largest at around 1.5 g/l, smaller ones at 4.5 g/l, 6.3 g/l and 8 g/l, potentially signifying the different wine flavours (sweet, dry, etc.) - we will get back to those later.

There is only one sweet wine (with sugar content of over 45 g/l), which we will not show but tabulate. It seems to be a wine rated slightly above average (score 6).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

## The only sweet wine with residual sugar > 45 g/l:

##         X fixed.acidity volatile.acidity citric.acid residual.sugar
## 2782 2782           7.8            0.965         0.6           65.8
##      chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH
## 2782     0.074                   8                  160 1.03898 3.39
##      sulphates alcohol quality
## 2782      0.69    11.7       6

3.10 Alcohol Content in %

Like the sugar content, the alcohol content has a distribution with several local maxima ($\mu = 10.51\%, \text{max} = 14.2\%, \text{min} = 8.0\%$). It is to be expected that each wine flavor on the market contributes its own signature distribution to the overall distribution.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

3.11 Density in $\text{kg}/\text{l}$

The density of wine is close to that of water, depending on the percent alcohol and sugar content.

Wine density is fairly tightly normally distributed around the density of water ($\mu = 0.9940 \frac{\text{kg}}{\text{l}}, \sigma = 2.99 \cdot 10^{-3} \frac{\text{kg}}{\text{l}}$).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

## Standard deviation: 0.002990907

3.12 Output Variable: Quality based on Sensory Data (0-10)

The following graph shows the overall distribution of the scores given to the wines. Scores range from 3 to 9, with 20 wines in the worst category (3), 5 in the best (9), and the largest group (2198 wines) scoring a “6”. This is slightly above both the average score (5.88) and the mid-point of the scale (5), potentially showing a psychological bias of the wine testers to slightly overscore wines.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000

## Absolute frequency of output variable 'quality':

## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5
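The mean score can be cross-checked directly from the frequency table above. The sketch below rebuilds the vector of scores from the printed counts; the variable names are chosen for illustration only.

```r
# Absolute frequencies of each quality score, copied from the table above
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

# Rebuild the full vector of scores from the counts
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

n_wines       <- length(quality)
mean_quality  <- mean(quality)
share_score_6 <- quality_counts[["6"]] / n_wines

print(c(n = n_wines, mean = round(mean_quality, 3), share_6 = round(share_score_6, 3)))
```

The score 6 is the most frequent value, although it covers somewhat less than half of all wines.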

4 Bivariate Analysis

4.1 Acidity Values

The following graph shows the acidity distributions, each in a different color, all overlaid in one diagram. Notice how the distributions of citric acid and volatile acidity are much narrower and have lower means, while fixed acidity has a higher mean and a broader distribution.

# Create a dataframe with the columns acidtype, acidvalue and fill it 
# with all of the values for citric.acid, volatile.acidity,
# fixed.acidity for all of the wines ...
wineacids <- wine %>%
  dplyr::select(citric.acid, volatile.acidity, fixed.acidity) %>%
  tidyr::gather("acidtype","acidvalue", 1:3)

Does one type of acid come with another type of acid? That is, are any two of the acidity values correlated? We can print the correlation matrix:

##                  fixed.acidity volatile.acidity citric.acid
## fixed.acidity       1.00000000      -0.02269729   0.2891807
## volatile.acidity   -0.02269729       1.00000000  -0.1494718
## citric.acid         0.28918070      -0.14947181   1.0000000

We find that fixed acidity and citric acid are slightly positively correlated ($r = 0.289$); that is, where more fixed acidity is present, more citric acid tends to be present as well.

The next plot shows this, but notice the significant spread in the middle of both acidity distributions.

However, we find that volatile acidity is slightly negatively correlated with both other types of acidity ($r_\text{fixed} = -0.0227, r_\text{citric} = -0.1495$), hinting that wines with more volatile acidity contain less of the other two kinds of acid; for fixed acidity, though, the correlation is so close to zero that the relationship is negligible. The next two plots show this relationship.
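A correlation matrix like the one above is what cor() returns for a data frame of numeric columns. The sketch below uses only the first ten rows printed by str() in chapter 2, so the resulting values are not the ones reported for the full data set; the name wine_head is hypothetical.

```r
# First ten rows of the three acidity columns, as printed by str() in chapter 2
wine_head <- data.frame(
  fixed.acidity    = c(7, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7, 6.3, 8.1),
  volatile.acidity = c(0.27, 0.3, 0.28, 0.23, 0.23, 0.28, 0.32, 0.27, 0.3, 0.22),
  citric.acid      = c(0.36, 0.34, 0.4, 0.32, 0.32, 0.4, 0.16, 0.36, 0.34, 0.43)
)

# Pearson correlation matrix of all three columns
acid_cor <- cor(wine_head)
print(round(acid_cor, 3))
```

The matrix is symmetric with ones on the diagonal, so only the three off-diagonal pairs carry information.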

4.2 pH vs. Acidity

pH is a measure of acidity, but are the pH value and the concentrations of the individual acids correlated, and does any of the acid concentrations have a particularly strong influence? We also create a new “sum-by-concentration” variable acidity in which we simply add up the concentrations of all acids.

We find, as expected, that among the individual acids the strongest (negative) correlation with pH is that of the most prevalent kind, fixed acidity ($r = -0.4258$). Calculating $r^2$, i.e. how much of the variability in pH this acidity accounts for, we get $r^2 = 0.1814$, which is rather small, but we have to take into account the chemical properties: in a wine, there might be basic molecules which counteract acid components and drive the pH up. Displayed below are the correlations between pH and the acidity variables as well as the respective scatter plots relating both variables.

## Correlation between pH and ...:

##                         [,1]
## fixed.acidity    -0.42585829
## volatile.acidity -0.03191537
## citric.acid      -0.16374821
## acidity          -0.43065133

## R^2 value between pH and ...:

##                         [,1]
## fixed.acidity    0.181355284
## volatile.acidity 0.001018591
## citric.acid      0.026813477
## acidity          0.185460569
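The two tables above are related by simple squaring, and the combined acidity variable is a plain sum. The following sketch reproduces both steps from numbers printed in this document; the variable names are illustrative only.

```r
# Combined acidity of the first wine in the data set:
# fixed + volatile + citric acid concentrations (values from str() in chapter 2)
acidity_first_wine <- 7 + 0.27 + 0.36

# Correlations of pH with each acidity variable, copied from the output above
r_ph <- c(fixed.acidity = -0.42585829, volatile.acidity = -0.03191537,
          citric.acid = -0.16374821, acidity = -0.43065133)

# Squaring r gives the share of pH variability explained by each variable
r_squared <- r_ph^2
print(round(r_squared, 4))
```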

4.3 Total vs. Free Sulfur Dioxide

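Analyzing the correlation between total and free sulfur dioxide content, we find $r = 0.6155$: wines with more total sulfur dioxide also tend to contain more free sulfur dioxide. Note that $r$ is not a slope; with the standard deviations from sections 3.6 and 3.7, the least-squares slope is $r \cdot \sigma_\text{free} / \sigma_\text{total} \approx 0.25$, so an increase of total sulfur dioxide by 1 mg/l goes along with an increase of free sulfur dioxide by about 0.25 mg/l on average. In addition, the variability of the free sulfur dioxide content increases with a higher total sulfur dioxide content.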

## Correlation total vs. free sulfur dioxide: 0.615501
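A correlation coefficient and a regression slope are related through the two standard deviations. As a small worked example, using only numbers reported in this document, the least-squares slope of free on total sulfur dioxide can be sketched as follows (the variable names are illustrative):

```r
r_total_free <- 0.615501   # correlation reported above
sd_free      <- 17.00714   # standard deviation of free sulfur dioxide (section 3.6)
sd_total     <- 42.49806   # standard deviation of total sulfur dioxide (section 3.7)

# Least-squares slope of free on total sulfur dioxide: r * sd(y) / sd(x)
slope_free_on_total <- r_total_free * sd_free / sd_total
print(round(slope_free_on_total, 3))
```

The slope is expressed in mg/l of free sulfur dioxide per mg/l of total sulfur dioxide, whereas the correlation itself has no unit.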

4.4 Sulphates vs. Total Sulfur Dioxide

Let’s verify to which extent the variable description holds true and how much sulphates really contribute to ‘unpleasant’ SO2 levels. We find just $r = 0.1345$ and $r^2 = 0.0181$, showing that the variability of the sulphate content only describes 1.8% of the variability in the total sulfur dioxide level; sulphates are therefore not a very good indicator for the total sulfur dioxide level. The correlation with free sulfur dioxide is even lower ($r_\text{free} = 0.0592$).

## Correlation sulphates vs. total sulfur dioxide: 0.1345624

## R^2 value sulphates vs. total sulfur dioxide: 0.01810703

## Correlation sulphates vs. free sulfur dioxide: 0.05921725

4.5 Residual Sugar vs. Alcohol

Fermentation creates alcohol from sugar; therefore, with more alcohol in the wine, we expect less sugar. Indeed, the values are negatively correlated ($r = -0.4506$). The graph below shows that a higher residual sugar content goes along with less alcohol, but we also note that the spread of the alcohol content is much higher for wines with lower sugar content, suggesting, perhaps, that some wines may not have had enough sugar to turn into alcohol to begin with.

## Correlation between residual sugar and alcohol content:  -0.4506312

4.6 Density vs. Sugar & Alcohol Level

Verifying the statement of the variable description that density is influenced by residual sugar and alcohol content, we analyze the correlation and find $r_\text{residual.sugar} = 0.8390, r_\text{alcohol} = -0.7801$.

Checking $r^2$, we find that 70.4% of the variability of density is explained by the variability in residual sugar and 60.9% by the variability in alcohol. Since these two variables are themselves correlated ($r = -0.4506$), their contributions overlap (the two percentages add up to more than 100%), so they cannot be treated as independent predictors in a linear model explaining the density variable.

## Correlation wine density vs. ...

##      residual.sugar    alcohol
## [1,]      0.8389665 -0.7801376

## R^2 value vs. ...

##      residual.sugar   alcohol
## [1,]      0.7038647 0.6086147
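To illustrate the overlap between the two predictors, the share of density variability that both would explain together in a linear model can be estimated from the three pairwise correlations alone, using the standard two-predictor formula for $R^2$. This is a sketch based on the reported correlations, not a model fitted to the data.

```r
# Pairwise correlations reported in this chapter
r_density_sugar   <-  0.8389665
r_density_alcohol <- -0.7801376
r_sugar_alcohol   <- -0.4506312

# R^2 of a linear model with two predictors, computed from pairwise correlations
r2_combined <- (r_density_sugar^2 + r_density_alcohol^2 -
                2 * r_density_sugar * r_density_alcohol * r_sugar_alcohol) /
               (1 - r_sugar_alcohol^2)

# Simply adding the two single r^2 values exceeds 1, so it is not a valid R^2
r2_naive_sum <- r_density_sugar^2 + r_density_alcohol^2
print(round(c(combined = r2_combined, naive_sum = r2_naive_sum), 3))
```

The combined value stays below 1 but is higher than either single $r^2$, which is consistent with both variables carrying partly overlapping information about density.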

5 Introducing a Categorical Variable: Wine Flavor

We create the following categorical variable to classify wines according to their flavor based on existing EU regulations:

wines with max. 4 g/l residual sugar are considered ‘classical-dry’, acidity should be not less than 2 g/l below sugar content
wines with max. 9 g/l residual sugar are considered ‘dry’, acidity should be not less than 2 g/l below sugar content
wines with max. 18 g/l residual sugar content are considered ‘half-dry’, acidity should be not less than 10 g/l below sugar content
wines with max. 45 g/l residual sugar content are considered ‘semi-sweet’
wines above 45 g/l residual sugar content are considered ‘sweet’

# Create categorical variable for "wine flavor" according to EU trade 
# specifications; variable is depending on acidity and residual sugar
wine$flavor <- 
      ifelse(wine$residual.sugar <= 4 & wine$acidity >= wine$residual.sugar-2,
              "classical-dry",
      ifelse(wine$residual.sugar <= 9 & wine$acidity >= wine$residual.sugar-2,
              "dry",
      ifelse(wine$residual.sugar <= 18 & wine$acidity >= wine$residual.sugar-10,
              "half-dry",
      ifelse(wine$residual.sugar <= 45, "semi-sweet",
      ifelse(wine$residual.sugar > 45, "sweet",
      "unclassified"
      )))))
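To check the rule on individual wines, the same thresholds can be restated as a small scalar function and applied to rows printed earlier in this document. The function name classify_flavor is hypothetical and serves only as an illustration of the nested ifelse() above.

```r
# Hypothetical helper restating the classification rules for a single wine
classify_flavor <- function(residual_sugar, acidity) {
  if (residual_sugar <= 4 && acidity >= residual_sugar - 2) return("classical-dry")
  if (residual_sugar <= 9 && acidity >= residual_sugar - 2) return("dry")
  if (residual_sugar <= 18 && acidity >= residual_sugar - 10) return("half-dry")
  if (residual_sugar <= 45) return("semi-sweet")
  "sweet"
}

# Wine 1: sugar 20.7 g/l, acidity 7 + 0.27 + 0.36 = 7.63 g/l
flavor_wine_1 <- classify_flavor(20.7, 7 + 0.27 + 0.36)
# Wine 2: sugar 1.6 g/l, acidity 6.3 + 0.3 + 0.34 = 6.94 g/l
flavor_wine_2 <- classify_flavor(1.6, 6.3 + 0.3 + 0.34)
# Wine 2782: sugar 65.8 g/l, acidity 7.8 + 0.965 + 0.6 = 9.365 g/l
flavor_wine_2782 <- classify_flavor(65.8, 7.8 + 0.965 + 0.6)
print(c(flavor_wine_1, flavor_wine_2, flavor_wine_2782))
```

Unlike the vectorized ifelse() version, this helper handles one wine at a time and does not cover missing values.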

There are many (classical-)dry wines, fewer half-dry wines, only a few semi-sweet wines and only one single sweet wine.

## Distribution of wine flavors:

## 
## classical-dry           dry      half-dry    semi-sweet         sweet 
##          2097          1414          1246           140             1

The following graph visualizes the classification by showing residual sugar level and acidity level. The flavor the wine is assigned is color-coded. One can see a slight “diagonal slope” at each cut-off-point between clusters signifying one particular flavor type. These are due to the restrictions put on the acidity level.

Finally, we show the sugar content distribution of each wine flavor as a histogram. It is now evident, that the local maxima in the aggregated histogram are due to treating all the wines in “one bucket”. Within the sugar content distribution of their wine flavor, the distributions are much more uniform.


6 Multivariate Analysis

6.1 Scatterplot Matrix

In order to explore the data further, we use a scatterplot matrix which shows

a graph for the relationship between any two variables in the lower left triangle, using a scatterplot for numerical data
rows/columns that involve a categorical variable showing facetted plots
distributions of single variables along the diagonal

The following correlation table shows the correlation values between any two variables again and visually highlights the highest and lowest correlation values:

6.1.1 Confirmation of previous findings (little surprises)

First, we can re-confirm our previous findings.

## Correlation of acidity with...:

##      fixed.acidity volatile.acidity citric.acid
## [1,]     0.9871787       0.07157062   0.3941434

Positive correlation; acidity is defined as the sum (by weight) of all acids; since fixed acids and citric acid usually dominate over volatile acids, their correlation with acidity is stronger.

## Correlation of pH with...:

##         acidity fixed.acidity volatile.acidity citric.acid
## [1,] -0.4306513    -0.4258583      -0.03191537  -0.1637482

Negative correlation between pH and all types of acidity; however, the correlation between acidity and pH can never be perfect ($r = -1$) because pH also depends on basic chemical components. For the same reason, the correlation between pH and any specific type of acidity is weaker than the correlation between the variable acidity and the corresponding type of acidity.

## Correlation of total sulfur dioxide with free sulfur dioxide:

## [1] 0.615501

Strong positive correlation because free sulfur dioxide is one component of total sulfur dioxide (about a quarter of it on average).

## Correlation of sulphates with total sulfur dioxide: 0.1345624

Weak positive correlation; as the description of sulphates suggests, sulphates can contribute to SO2 levels.

## Correlation of alcohol content with residual sugar: -0.4506312

Moderate negative correlation because of the chemical process of alcohol fermentation (sugar is converted to alcohol).

## Correlation of density with residual sugar: 0.8389665

Strong positive correlation; additional sugar in the fluid leads to a higher density.

## Correlation of free sulfur dioxide with total sulfur dioxide: 0.615501

Strong positive correlation, because free sulfur dioxide is one component of total sulfur dioxide.

6.1.2 New Findings Concerning Chemical Properties

## Correlation of residual sugar with free sulfur dioxide: 0.2990984

Weak positive correlation: free sulfur dioxide ‘prevents microbial growth’ and might therefore constrain alcohol fermentation, resulting in more remaining residual sugar.

## Correlation of chlorides with alcohol content: -0.3601887

Weak negative correlation: ‘saltier’ wines seem to contain less alcohol. Does the salt also inhibit the fermentation process?

## Correlation of density with ...:

##      chlorides total.sulfur.dioxide
## [1,] 0.2572113            0.5298813

Positive correlation, higher for “total.sulfur.dioxide”; both chlorides and SO2 go along with a higher density of the wine.

## Correlation of acidity with density: 0.2756088

Weak positive correlation; acidity further increases the density of wine.

## Correlation of alcohol content with density: -0.7801376

Strong negative correlation; ethanol is less dense than water, and producing alcohol consumes sugar and releases gases, therefore creating a lighter brew and lowering density.

6.1.3 New Findings Concerning Quality

Inspecting the row/column ‘quality’ in the scatterplot matrix, we find particularly high absolute correlation values for the following variables:

## Correlation of quality with...:

##      volatile.acidity fixed.acidity  chlorides   alcohol
## [1,]        -0.194723    -0.1136628 -0.2099344 0.4355747
##      total.sulfur.dioxide
## [1,]           -0.1747372

volatile acidity is negatively correlated with quality, as the variable description suggests (“vinegar taste”)
fixed acidity is also negatively correlated
chlorides (“salts”) seem to have a negative impact on quality
higher alcohol content is generally associated with higher quality
total sulfur dioxide has a negative influence (recall the variable description “becomes noticeable over 50ppm”)

6.2 Scatterplot Matrix (facetted by Quality)

We can furthermore imagine, that certain qualities are only relevant for certain flavours of wine. Or we might overlook some non-linear qualities. Quality can be seen as a 7-level factor variable (3-9), so we can add another discrete (color) dimension to our scatterplot to visualize distributions for different levels of quality.

On the diagonal (especially for volatile.acidity, chlorides, density and alcohol) we can clearly see our previous observations: how different quality levels have different means of the respective chemical compound, e.g., how volatile.acidity has a local maximum for ‘low quality wine’ (level 3-4) further to the right.

We can, however, also see in the column flavor how certain variables seem to be of importance for certain wine flavors and less so for others. For example, citric.acid seems to be positively correlated with quality for classical-dry wines but less so for half-dry wines. The same is true for residual sugar, free sulfur dioxide and a few others.

We can print the correlation of all of those interesting variables with wine quality, for each subset of wines of the same flavor, with the following code.

# Show correlation of wine quality with a few highlighted variables
# for each wine flavor separately
by(wine, wine$flavor,
    function(w) {
      cor(w$quality, w[, c("volatile.acidity", "fixed.acidity", "citric.acid",
                           "chlorides", "total.sulfur.dioxide", "alcohol",
                           "residual.sugar")])
    }
  )

## wine$flavor: classical-dry
##      volatile.acidity fixed.acidity citric.acid chlorides
## [1,]       -0.1892251    -0.1747645  0.04348309 -0.190533
##      total.sulfur.dioxide  alcohol residual.sugar
## [1,]          -0.07289426 0.457975      0.1654655
## -------------------------------------------------------- 
## wine$flavor: dry
##      volatile.acidity fixed.acidity citric.acid  chlorides
## [1,]       -0.1182325   -0.04137113  0.06397823 -0.2384087
##      total.sulfur.dioxide   alcohol residual.sugar
## [1,]           -0.2598946 0.5512578     -0.1635403
## -------------------------------------------------------- 
## wine$flavor: half-dry
##      volatile.acidity fixed.acidity citric.acid  chlorides
## [1,]       -0.3065855   -0.04305342  -0.1400305 -0.1775522
##      total.sulfur.dioxide   alcohol residual.sugar
## [1,]            -0.158196 0.2469943    -0.00514009
## -------------------------------------------------------- 
## wine$flavor: semi-sweet
##      volatile.acidity fixed.acidity citric.acid  chlorides
## [1,]      -0.01971877    -0.1610847 -0.04354985 -0.2971294
##      total.sulfur.dioxide   alcohol residual.sugar
## [1,]           -0.1896219 0.1595639     0.03847214
## -------------------------------------------------------- 
## wine$flavor: sweet
##      volatile.acidity fixed.acidity citric.acid chlorides
## [1,]               NA            NA          NA        NA
##      total.sulfur.dioxide alcohol residual.sugar
## [1,]                   NA      NA             NA

6.3 Correlation between Quality and Chemical Properties by Wine Flavor

The following plot visualizes the interesting finding of the last chapter: depending on wine flavor, different chemical properties play a larger or smaller role. The first line (“any (summary)”) shows the general correlation again, as displayed in the previous section.

Blue signifies positive correlation (positive influence on wine quality), red signifies negative correlation (negative influence on wine quality). Correlations between -0.1 and 0.1 are shown in white to highlight that they will not be investigated further.
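The three color bands can be expressed as a simple binning of the correlation value. The sketch below uses cut() for this; the helper name cor_band and the label texts are hypothetical, and how the plot treats values of exactly -0.1 or 0.1 is an assumption.

```r
# Hypothetical helper: map a correlation value to the color band used in the plot
cor_band <- function(r) {
  cut(r, breaks = c(-1, -0.1, 0.1, 1),
      labels = c("red (negative)", "white (negligible)", "blue (positive)"),
      include.lowest = TRUE)
}

# Correlations with quality for classical-dry wines, copied from the output above
r_classical_dry <- c(volatile.acidity = -0.1892251, citric.acid = 0.04348309,
                     alcohol = 0.457975)
bands <- as.character(cor_band(r_classical_dry))
print(setNames(bands, names(r_classical_dry)))
```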

# here again: correlation of a few highlighted variables with wine quality
# depending on wine flavour
winequality_pre <- wine
# "quality" needs to be numeric in order to assess correlation
winequality_pre$quality <- as.numeric(as.character(winequality_pre$quality))
winequality_by_flavor <- winequality_pre %>%
    # group by flavour and calculate a bunch of correlations
    dplyr::group_by(flavor) %>%
    dplyr::summarise(cor(quality,alcohol), cor(quality,residual.sugar),
                    cor(quality,chlorides), cor(quality,volatile.acidity), 
                    cor(quality,total.sulfur.dioxide),
                    cor(quality,fixed.acidity), cor(quality,citric.acid)) %>%
    # give each correlation its own row in the data frame
    tidyr::gather("indicator", "corellation", 2:8)
winequality_all_flavors <- winequality_pre %>%
    # calculate a summary row
    dplyr::summarise(cor(quality,alcohol), cor(quality,residual.sugar),
                    cor(quality,chlorides), cor(quality,volatile.acidity), 
                    cor(quality,total.sulfur.dioxide),
                    cor(quality,fixed.acidity), cor(quality,citric.acid)) %>%
    dplyr::mutate(flavor="any (summary)") %>%
    # give each correlation its own row in the data frame
    tidyr::gather("indicator", "corellation", 1:7)
winequality <-
    # join summary row (on top) with flavor specific row (on bottom)
    dplyr::union(winequality_all_flavors, winequality_by_flavor) 
# re-order factor for display ("summary" should be on top)
winequality$flavor <- factor(winequality$flavor,
    c("classical-dry", "dry", "half-dry", "semi-sweet", "sweet", "any (summary)"))

The graph lets us make the following assumptions:

Higher alcohol content is associated with better wine quality, especially for dryer wines.
Residual sugar is viewed slightly favourably for classical-dry wines but unfavourably for dry wines; for half-dry or semi-sweet wines, no correlation could be found.
Chlorides are generally a bad thing, especially for semi-sweet wines.
The same roughly applies to volatile acidity; however, for semi-sweet wines, no correlation can be found.
Total sulfur dioxide is correlated with bad wine quality; however, for classical-dry wines, no correlation can be established.
Fixed acidity and citric acid have a small negative effect on some wine flavors, but no overall pattern can be established.

‘No correlation can be found’ does not mean that no correlation exists; maybe we just do not have the data! In the final step, we will look behind these correlation values based on wine flavor one last time in more detail.

6.4 Revisiting Distribution of Chemical Compositions

The next graph re-visits the analysis in the beginning of this paper by analyzing a few chemical variables (horizontally) for each wine flavor (vertically). One can see only minor differences:

slight left-shift of distribution of total sulfur dioxide for half-dry wine
slight left-shift of citric acid and fixed acidity for half-dry and semi-sweet wine

Otherwise, the shape of the distribution does not change; the height, however, gets lower, since fewer wines are available in the sweeter categories. With this uniformity of distributions assured, we can have more confidence that the samples are comparable.

6.5 Influence of Chemical Composition on Quality by Wine Flavor

The following graph breaks the distribution further down to show in boxplots the distribution of the selected variable (column) for each quality score (x-axis) and each flavor (row).

Added is the ‘line of best fit’, directly linked to our correlation coefficient from the table in the beginning of this chapter.

We see that correlation and linear fit are not everything: for example, across all flavors, there are some wines which are ranked very poorly but have a higher alcohol content than medium-ranked wines.

6.6 Comparing Wine Flavor Quality Trends

The above diagram allows us to see trends for each wine flavor separately, but because each wine flavor is drawn in its own panel, it does not facilitate comparing the trends of different wine flavors with each other. The next plot sacrifices detail like outliers and quantiles for each chemical compound / flavor / quality score combination and instead uses color as a visual cue, so that all trend lines can be plotted over each other for each chemical property.

We can make out a few interesting findings and trends from this graph:

Alcohol level is less important for semi-sweet wines, where it follows more of a bell-shaped curve than the rising trend seen for other flavors; also, there is a ‘dip’ in alcohol content for average wines (score 5), and wines that are scored worse actually have a higher alcohol content
Chlorides seem generally to be a problem, however, for all wines except half-dry wines some level of chlorides is actually appreciated in the medium-range of quality scores.
Sulfur Dioxide also seems to be a bad thing, however, for some medium-range dry or half-dry wines, a middle level of sulfur dioxide is appreciated.
Volatile acidity should not be too low (bad wines, scores 3-4) and not too high (average wines, scores 5-7). Half-dry wines and semi-sweet wines should have no volatile acidity at all.
The same goes for fixed acidity; however, here the differences for (classical-)dry wines are less pronounced.
Citric acid seems to have a few ‘uncanny valleys’ where a certain low citric acid content scores the wine worse, and only with a quite high citric acid content is the wine perceived as excellent.

7 Final Plots and Summaries

In this paper 4898 wines were compared relating 11 of their chemical properties with an expert rating of the wine’s quality. The wines were grouped into different wine flavors (classical-dry, dry, half-dry, semi-sweet and sweet) according to their residual sugar content and ‘total’ acidity (referring to the sum of all types of acidity concentrations: fixed, volatile and citric). The below graph shows all the wines combined with their classification (color), sugar content (x axis) and acidity (y axis). Notice, how generally, the acidity levels are spread out more for dryer wines.

The correlations of the different wine properties with each other were thoroughly analyzed. Some obvious chemical relations were identified, like the negative correlation between sugar content and alcohol or the pH value, or the (positive) correlation between pH and any of the acidity values.

However, the correlations of the different chemical properties with wine quality are the most interesting. The following plot shows the correlation of some of those properties with wine quality in the first row, as well as a break-down of those correlations by wine flavor in the remaining rows.

We see that the most influential factor for good quality is alcohol content ($r(\text{alcohol},\text{quality}) = 0.45$), and that bad wine quality is mostly due to high total sulfur dioxide levels ($r(\text{total.sulfur.dioxide},\text{quality}) = -0.19$), salt levels ($r(\text{chlorides},\text{quality}) = -0.19$) and volatile acidity levels ($r(\text{volatile.acidity},\text{quality}) = -0.18$).

Taking into consideration the break-down by flavor, we find:

Chlorides have a larger negative influence the sweeter the wine ($r_\text{semi-sweet}(\text{chlorides},\text{quality})=-0.297$)
Volatile acidity seems to have a particularly bad impact on half-dry wines ($r_\text{half-dry}(\text{volatile.acidity},\text{quality})=-0.307$)
Total sulfur dioxide seems to decrease the quality of dry wines most ($r_\text{dry}(\text{total.sulfur.dioxide},\text{quality})=-0.26$)
Alcohol content has the strongest positive impact for the driest wines ($r_\text{classical-dry}(\text{alcohol},\text{quality})=0.458$, compared with $r_\text{semi-sweet}(\text{alcohol},\text{quality})=0.16$).
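The per-flavor values listed above are ordinary Pearson correlations computed within each flavor group. The following base-R sketch shows one way to obtain such a break-down. It runs on toy data with invented values (so the resulting numbers are not the ones reported here), and the column name `flavor` is an assumption made for this example.

```r
# Toy data; values are invented for illustration only.
# `flavor` is a hypothetical column name for the wine flavor category.
wines <- data.frame(
  flavor  = rep(c("classical-dry", "semi-sweet"), each = 5),
  alcohol = c(9, 10, 11, 12, 13,   9, 10, 11, 12, 13),
  quality = c(4, 5, 6, 7, 8,       6, 5, 6, 5, 6)
)

# Overall Pearson correlation r(alcohol, quality) across all wines
r_all <- cor(wines$alcohol, wines$quality)

# Break-down by flavor: split the rows by flavor, then correlate
# alcohol with quality within each group
r_by_flavor <- sapply(split(wines, wines$flavor),
                      function(d) cor(d$alcohol, d$quality))

print(round(r_all, 3))
print(round(r_by_flavor, 3))
```

In the toy data, the overall correlation hides the fact that the relationship is strong in one group and absent in the other, which is the kind of difference the break-down by flavor is meant to expose.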

However, (linear) correlation values are not everything: looking at the mean (and standard error) of each chemical property for the wines of each flavor that received a particular score, we can definitely see some surprising non-linear trends for some of the chemical properties. See the graph below.

The most surprising finding here is that for some of the chemical properties, there seems to be an actual “sweet spot”: an ideal value rather than a linear trend. This applies to:

salt content for all wines except half-dry wines
sulfur dioxide for some medium-range dry or half-dry wines
volatile acidity for all (classical-)dry wines: it should be neither too low (bad wines scored 3-4) nor too high (average wines scored 5-7).

In addition to that, citric acid seems to follow an ‘uncanny valley’ pattern, where a certain low citric acid content makes the wine score worse and only at a quite high citric acid content is the wine perceived as excellent.
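To illustrate why a “sweet spot” is easy to miss when looking only at correlation values, here is a small base-R sketch on invented toy data. It is not part of the analysis itself: the variables `x` and `y` are placeholders for a chemical property and a quality score, and fitting a quadratic with `lm` is just one possible way to locate an optimum.

```r
# Toy data with a symmetric optimum at x = 3 (values invented):
# y = 7 - (x - 3)^2
x <- c(1, 2, 3, 4, 5)   # placeholder for a chemical property
y <- c(3, 6, 7, 6, 3)   # placeholder for a quality score

# The linear correlation is zero although y depends strongly on x
linear_r <- cor(x, y)

# A quadratic model y = b0 + b1 * x + b2 * x^2 captures the relationship
fit <- lm(y ~ x + I(x^2))
b <- coef(fit)

# The vertex of the parabola, -b1 / (2 * b2), estimates the sweet spot
sweet_spot <- -b[["x"]] / (2 * b[["I(x^2)"]])

print(linear_r)
print(sweet_spot)
```

A negative coefficient for the squared term indicates a maximum (an ideal value), while a positive one indicates a minimum.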

8 Reflection

During this project, I learned a lot about plotting with ggplot2, dplyr, tidyr etc. but also about the internal workings of R as a programming language (expression evaluation environments etc.) which I think is somewhat missing from the Data Analyst course.

What I struggled with most was getting multi-dimensional / faceted plots with additional cues like whiskers to work. Also, one can see that for the major part of the analysis, I focus on (linear) correlation, while the final plots reveal that some quadratic relationships (with minima / maxima) are present for some of the chemical compounds, which are hard to spot by looking at correlation alone. A success was, in my opinion, the idea to differentiate the analysis by wine flavor: for different flavors, indeed, some very different results were obtained.

To enrich the analysis further, one could add more data about those wines. For example: Was the same wine scored by multiple testers? How much do they disagree, and what are the patterns there? Are other indicators like color, harvest year or region of the wine indicative?

For this analysis, which is based on the premise of focusing heavily on correlation, one could of course iterate again and check the other variables, but a quick glance at the scatterplot matrix does not reveal any major patterns.

9 References

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016
[Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf
[bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib

Data Analyst Project 4 - Explorative Data Analysis with R Exploring White Wine Quality

Benjamin Soellner

October 2015