Analyzing the NYC Subway Dataset for Dependencies Between Weather Data and Subway Ridership
by Benjamin Söllner, benjamin.soellner@gmail.com
based on the Udacity.com Intro To Data Science Course
This report is provided as a Python Notebook which is part of a GitHub repository containing this course's code completed with my personal solutions. The (runnable) code segments in this notebook generate output which is discussed in this report. For certain discussions, it might be beneficial to the reader's understanding to look at the code side-by-side with this report.
How to find the relevant code in the GitHub repository: the import statements at the top of each code fragment guide you to the corresponding solution, i.e. the following line

from project_a.topic.file import f

references the file `project_a/topic/file.py`, and more precisely the function `f(...)`, in the GitHub repository. Note how this also ties in nicely with the course's structure and guides you directly to the respective topic in the `project_a` part of the Intro To Data Science Course.
There were two forum posts which helped me out very much during this project:
The first forum post / problem could only be solved after a coach appointment with Carl, who swiftly helped me out with this Gist code snippet.
Additionally, here are a few more useful references:
instead of `c(..., ...)` (R syntax), use `[..., ...]` (Python list syntax)
The visualizations were created during Problem Sets 3 and 4 of the course. The following sections reference the code which was created to produce each visualization and, moving on, describe them in detail according to the questions from the project rubric:
The following visualization is drawn from Problem 3.1 and shows a histogram of hourly rides on rainy (green) vs. non-rainy (blue) days. Depicted are the number of days (y-axis) when the hourly average number of subway rides are falling within a specific range (x-axis).
%matplotlib inline
from project_3.plot_histogram.plot_histogram import entries_histogram
import pandas as pd
from ggplot import *
turnstile_weather = pd.read_csv("project_3/plot_histogram/turnstile_data_master_with_weather.csv")
plt = entries_histogram(turnstile_weather)
print plt
We can see a non-normal distribution of the data. For non-rainy days, there are more days with a lower hourly average than for rainy days.
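For reference, this is roughly what such a histogram looks like in code. The sketch below is a simplified stand-in for `entries_histogram` (the real implementation lives in `project_3/plot_histogram/plot_histogram.py`); the column names `ENTRIESn_hourly` and `rain` are those of the turnstile dataset.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import pandas as pd

def entries_histogram_sketch(turnstile_weather):
    """Overlay histograms of hourly entries for rainy vs. non-rainy records."""
    rainy = turnstile_weather[turnstile_weather["rain"] == 1]["ENTRIESn_hourly"]
    dry = turnstile_weather[turnstile_weather["rain"] == 0]["ENTRIESn_hourly"]
    fig, ax = plt.subplots()
    ax.hist(dry, bins=20, alpha=0.5, label="no rain")
    ax.hist(rainy, bins=20, alpha=0.5, label="rain")
    ax.set_xlabel("hourly entries")
    ax.set_ylabel("frequency")
    ax.legend()
    return fig, rainy, dry

# tiny synthetic example standing in for the real CSV
df = pd.DataFrame({"ENTRIESn_hourly": [10, 200, 30, 400, 50, 600],
                   "rain":            [0,   1,  0,   1,  0,   1]})
fig, rainy, dry = entries_histogram_sketch(df)
```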
The next visualization is taken from Problem 4.1 and shows the average number of subway entries per hour (y-axis) for each weekday (x-axis).
%matplotlib inline
from project_4.exercise_1.data_visualization import plot_weather_data
import pandas as pd
from ggplot import *
turnstile_weather = pd.read_csv("project_4/exercise_1/turnstile_data_master_with_weather.csv")
turnstile_weather['datetime'] = turnstile_weather['DATEn'] + ' ' + turnstile_weather['TIMEn']
gg = plot_weather_data(turnstile_weather)
print gg
There are far fewer subway rides (entries) on weekends (Saturday and Sunday) than on working days, with Monday being the working day on which the fewest people enter the subway.
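The aggregation behind such a weekday plot can be sketched as follows (column names as in the dataset above; the actual plotting code is in `project_4/exercise_1/data_visualization.py`):

```python
import pandas as pd

def mean_entries_per_weekday(turnstile_weather):
    """Average hourly entries per weekday (0 = Monday ... 6 = Sunday)."""
    df = turnstile_weather.copy()
    df["weekday"] = pd.to_datetime(df["DATEn"]).dt.dayofweek
    return df.groupby("weekday")["ENTRIESn_hourly"].mean()

# tiny synthetic example: two Mondays, one Saturday
df = pd.DataFrame({"DATEn": ["2011-05-02", "2011-05-02", "2011-05-07"],
                   "ENTRIESn_hourly": [100, 300, 40]})
means = mean_entries_per_weekday(df)
```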
The last visualization is taken from Problem 4.2 and shows the ratio between entries & exits at various stations. Depicted are only the 10 stations where that ratio is smallest and the 10 stations where it is largest. The y-axis is cut off at 100 since unit "R070" has a much higher ratio than all the other units. The output of the program code below also shows the raw data as a print-out.
%matplotlib inline
from project_4.exercise_2.data_visualization import plot_weather_data
import pandas as pd
from ggplot import *
turnstile_weather = pd.read_csv("project_4/exercise_2/turnstile_data_master_with_weather.csv")
turnstile_weather['datetime'] = turnstile_weather['DATEn'] + ' ' + turnstile_weather['TIMEn']
gg = plot_weather_data(turnstile_weather)
print gg
The 3 stations with the most exits (vs. entries) are:
The 3 stations with the most entries (vs. exits) are:
The last two items hint at a data error: since the Roosevelt Island Tram consists only of two stations, the number of people entering the tram summed up over those two stations must be similar to those who exit the tram. We can rule out a systematic difference in data collection like tram-exits not being counted at all, since the number of exits is non-zero. There seems to be something else at play here.
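The entry/exit ratio underlying this visualization can be computed per remote unit with a simple group-by; this sketch uses the dataset's `UNIT`, `ENTRIESn_hourly` and `EXITSn_hourly` columns (the real code is in `project_4/exercise_2/data_visualization.py`):

```python
import pandas as pd

def entry_exit_ratio(turnstile_weather):
    """Total entries divided by total exits per unit, sorted ascending."""
    sums = turnstile_weather.groupby("UNIT")[["ENTRIESn_hourly", "EXITSn_hourly"]].sum()
    ratio = sums["ENTRIESn_hourly"] / sums["EXITSn_hourly"]
    return ratio.sort_values()

# synthetic example: R070 stands in for a unit with far more entries than exits
df = pd.DataFrame({"UNIT": ["R001", "R001", "R070"],
                   "ENTRIESn_hourly": [100, 100, 900],
                   "EXITSn_hourly": [100, 100, 3]})
ratios = entry_exit_ratio(df)
```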
As a statistical test, a Mann-Whitney-U-Test has been performed in Problem 3.3 of this course to find out whether more people ride the subway when it is raining vs. when it is not raining. The following code returns the results of this test in the following format:
(mean_with_rain, mean_without_rain, U, p)
from project_3.mann_whitney_u_test.mann_whitney_u import mann_whitney_plus_means
import pandas as pd
input_filename = "project_3/mann_whitney_u_test/turnstile_data_master_with_weather.csv"
turnstile_master = pd.read_csv(input_filename)
print mann_whitney_plus_means(turnstile_master)
The Mann-Whitney-U-Test is applied to data that is non-normal. Based on the histograms shown in Visualization 1, we can assume that the average number of subway riders per day is non-normally distributed (both for the subset of the data for rainy days and the subset for non-rainy days). The Mann-Whitney-U-Test as implemented here is one-tailed, but we can perform a two-tailed test by multiplying the resulting p-value by 2.
Our Null-Hypothesis in our case is: "The distributions of ridership on rainy vs. non-rainy days are not significantly different". We choose $\alpha = p_\text{critical} = 0.05$ and a two-tailed test in order to test for both lower and higher ridership on rainy days.
The test results above give us a $p_\text{one-tailed} = 0.0193$ and a $p_\text{two-tailed} = 0.0386$ with the sample means $\bar{x}_\text{rain} = 1105.44 \frac{\text{rides}}{\text{hour}}$ and $\bar{x}_\text{norain} = 1090.2788 \frac{\text{rides}}{\text{hour}}$. Based on the p-Value and our previously chosen $p_\text{critical}$ we can reject the Null-Hypothesis.
Therefore, there is a significant difference between the distributions of daily average rides per hour on rainy vs. non-rainy days: the number of people riding the subway is significantly higher ($\alpha = 0.05$) on rainy than on non-rainy days, and the distribution for rainy days is therefore more skewed to the left.
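The test itself can be reproduced with SciPy's `mannwhitneyu`; older SciPy versions return a one-sided p-value by default, while newer versions accept an explicit `alternative` argument, in which case no doubling is needed. A sketch on synthetic (non-normal) data:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.RandomState(0)
# log-normal samples as a stand-in for the skewed ridership distributions
with_rain = rng.lognormal(mean=7.1, sigma=1.0, size=500)
without_rain = rng.lognormal(mean=7.0, sigma=1.0, size=500)

# explicit two-sided test; with 'alternative' given, the p-value is already two-tailed
u, p_two_tailed = mannwhitneyu(with_rain, without_rain, alternative="two-sided")
```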
The most promising regression model could be calculated using the library statsmodels (see References) and the improved dataset which was provided for this project. The code of the regression function is therefore not part of the website submission but of the GitHub repository. Due to statsmodels' excessively long output, the listing of the output has been moved to the Appendix.
I chose to use a non-linear model with a regression based on Ordinary Least Squares, implemented using statsmodels. In preparing my feature matrix (`features`) I use the same method I would use for a simple linear model. The only difference, marked below by (*), is that I add additional columns to the matrix for those features described not by one coefficient ($a X$) but by a polynomial ($a_1 X + a_2 X^2 + \ldots + a_G X^G$) instead.

The DataFrame `features` is computed as follows:

We then prepare another DataFrame `values` containing only the one column with the response variable. With both variables in place, we can easily use `statsmodels.api` (`sm`) to calculate the model, fit it to the dataset and calculate the predicted values from the features (`prediction`).
result = sm.OLS(values, features).fit()
prediction = result.predict(features)
print result.summary()
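The (*) step of expanding a feature into polynomial columns can be illustrated as follows. To keep the sketch self-contained, it uses NumPy's least-squares solver as a stand-in for `sm.OLS` and computes $R^2$ by hand; the `hour` data and response are synthetic.

```python
import numpy as np

def add_polynomial(features, column, degree):
    """Append column^2 ... column^degree to a dict of 1-D feature arrays."""
    for g in range(2, degree + 1):
        features["%s_%d" % (column, g)] = features[column] ** g
    return features

rng = np.random.RandomState(42)
hour = rng.uniform(0, 24, size=200)
# synthetic response with a genuine cubic shape plus noise
y = 5 + 0.5 * hour - 0.3 * hour ** 2 + 0.01 * hour ** 3 + rng.normal(0, 1, 200)

features = add_polynomial({"hour": hour}, "hour", 4)
X = np.column_stack([np.ones_like(hour)] + [features[k] for k in sorted(features)])
beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

prediction = X.dot(beta)
ss_res = ((y - prediction) ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot
```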
The following features were systematically chosen for the regression model:

- The hour of the day (`hour`) was chosen as a polynomial term of degree 4 with 4 coefficients ($a_1 X + a_2 X^2 + a_3 X^3 + a_4 X^4$). My intuition was that there would be three extrema in subway ridership: a climb to a maximum in the morning rush hour, a minimum during the day and another maximum in the evening rush hour.
- The mean temperature (`meantempi`) was chosen as a polynomial term of degree 2 ($a_1 X + a_2 X^2$) based on the assumption that there is a minimum where fewer people take the subway, namely when the temperature is just right to be outside. If it is too cold or too hot, people might use the subway more often. This model yielded a higher $R^2$ value than a simple linear term.
- The mean precipitation (`meanprecipi`) as a linear term, with the assumption that people take the subway more often when it rains.
- The mean pressure (`meanpressurei`) as a linear term, guessing that high-pressure weather conditions (generally nicer weather) mean people might decide to stay out more.
- The mean wind speed (`meanwindspdi`) as a linear term, assuming that storms make people take the subway more often.
- The weekend indicator (`is_weekend`) as a linear / quasi-categorical term, assuming that there is a significant difference between the number of people taking the subway on weekends vs. on other days.
- The weather conditions (`conds`) as a (linear) categorical term using dummy variables, assuming that there is a significant difference in the number of people taking the subway during different weather conditions.
- The subway unit (`UNIT`) as a (linear) categorical term using dummy variables, assuming that there is a more complex significant difference between the riderships at different subway stations. This might overfit the model, with little possibility to add additional subway stations later (see Conclusion). I decided to use this term anyway since, using "latitude & longitude" instead, even as a polynomial term, I only got to about $R^2 = 0.28$.

The coefficients (or parameters, weights) can be found in the program output above in the `coef` column.
With $R^2 = 0.528$, we have a rather high value compared to $R^2 = 0.40$ from Problem Set 3. $R^2$ indicates how much of the total variability in the dataset is explained by our model, which here is about 53%. For a model representing human behaviour based solely on time of day, weekend/non-weekend and weather, this is a reasonable value for $R^2$.
It should be noted that $R^2$ does not signify whether our model is overfitted / biased (and might therefore not represent data outside the dataset). To gauge this, additional tests with a training set and a test set would be required.
$R^2$ also does not explain goodness-of-fit. To assess this, we would have to compare individual predicted values with values from the dataset and look for potential systematic over- or under-predictions.
A Mann-Whitney-U-Test was conducted (see Statistical Test) to compare the number of subway rides in rainy and non-rainy conditions. There was a significant difference ($\alpha = 0.05$) in the average number of subway rides per hour during rainy ($\bar{x}_\text{rain} = 1105.44 \frac{\text{rides}}{\text{hour}}$) vs. non-rainy ($\bar{x}_\text{norain} = 1090.27 \frac{\text{rides}}{\text{hour}}$) conditions; $U = 1924409167.0$, $p = 0.0193096$.
After controlling for other factors (type of rain conditions, subway unit etc.) using a regression model (see Regression), we can furthermore conclude that the rain-related coefficients have a significant impact on our regression model. Although we cannot expect to eliminate 100% of the collinearity between, e.g., weather conditions (the `conds` parameter) and numerical weather data like precipitation (`meanprecipi`), the results of this model give us a good idea that the coefficients taking rain into account do have a significant impact. As the tabulated results below show, the respective p-value (`P>|t|`) is calculated to be low (around `0.000`) for all rainy conditions except "`Light Rain`", suggesting that those conditions do indeed have significance.
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------------------
[...]
meanprecipi 63.6329 13.523 4.706 0.000 37.128 90.138
[...]
conds_Heavy Rain -760.9784 124.282 -6.123 0.000 -1004.574 -517.383
conds_Light Drizzle -728.2438 118.055 -6.169 0.000 -959.634 -496.853
conds_Light Rain 58.2079 55.368 1.051 0.293 -50.315 166.731
[...]
conds_Rain -521.3172 84.996 -6.133 0.000 -687.912 -354.723
The `coef` value describes the impact of rainy conditions on subway ridership:

- "Heavy Rain" (`conds_Heavy Rain == 1.0`) reduces the number of riders by 760.9784, but the value of `meanprecipi`, presumably positive for heavy rain, will increase the number of predicted riders again by 63.6329 per precipitation unit.
- "Rain" (`conds_Rain == 1.0`) and "Light Drizzle" (`conds_Light Drizzle == 1.0`) likewise reduce the number of riders, by 521.3172 and 728.2438 respectively, mitigated, again, by an increased value of `meanprecipi`.

In order to gauge which of the two factors (`conds_*` and `meanprecipi`) has the higher effect, we need to look at the distribution of `meanprecipi` after normalization for the subset of data for each value of the `conds_*` feature:
analysis = pd.DataFrame( {"conds_Rain": features[features["conds_Rain"] == 1.0]["meanprecipi"].describe(), \
"conds_Heavy Rain": features[features["conds_Heavy Rain"] == 1.0]["meanprecipi"].describe(), \
"conds_Light Drizzle": features[features["conds_Light Drizzle"] == 1.0]["meanprecipi"].describe() } )
print analysis
========= Analyzing mean precipitation data: ===========
conds_Heavy Rain conds_Light Drizzle conds_Rain
count 288.000000 335.000000 961.000000
mean 1.311504 -0.223961 3.339870
std 0.373432 0.052187 2.731260
min 0.533254 -0.282527 -0.282527
25% 1.043118 -0.282527 0.839172
50% 1.552981 -0.180554 1.349036
75% 1.552981 -0.180554 6.141753
max 2.470736 -0.129568 9.353893
Based on this data:

- The effect of the `conds_Heavy Rain` term is increased on average by $1.311504 \cdot 63.6329 = 83.45480$ by the `meanprecipi` term. Assuming normally distributed `meanprecipi` with roughly 95% of data points falling within two standard deviations of the mean, we can bound the effect of the `meanprecipi` term to an increase in subway ridership of $[82.4752, 84.43432]$ and, in conjunction with the `conds_Heavy Rain` feature, an overall decrease of subway ridership of $[676.54408, 678.50311]$ passengers.
- The effect of the `conds_Light Drizzle` term is changed on average by $-0.223961 \cdot 63.6329 = -14.25128$ by the `meanprecipi` term (i.e. it is, in fact, further decreased by 14.25128). With the same argument as above, we can find a 95% CI for the effect of this coefficient of $[-20.89291, -7.60967]$ and, in conjunction with `conds_Light Drizzle`, conclude an overall decrease of subway ridership of $[735.8535, 749.1367]$ passengers.
- The effect of the `conds_Rain` term is increased on average by $3.33987 \cdot 63.6329 = 12.13341$ by the `meanprecipi` term. The 95% CI assuming normally distributed `meanprecipi` data for this value is $[-7.11375, 31.97820]$, which, in conjunction with the `conds_Rain` term, yields an overall decrease of subway ridership of $[26.22975, 65.91934]$.

No, in fact fewer people do. This came as a surprise to me; maybe, generally, more people would like to stay indoors.
The previous analysis showed that rain generally leads to a decrease in the number of people using the subway. Depending on the type of rain, this effect is more or less prominent, with "heavy rain" and "light drizzle" leading to even fewer subway riders than normal "rain".
Coming to this conclusion was not an easy process, since there are cross-dependencies between the coefficients. How this potentially affected the results of this analysis is explained in the following chapter.
There are a few potential shortcomings:

Potential remaining collinearity of parameters: Parameters like `meanprecipi` are correlated with dummy variables like `conds_Rain` etc. In our case, however, I analyzed the correlation matrix and found the correlation between those values to actually not be all that high (except for `conds_Rain`). This might be due to poor data (subclassifications like `Heavy Rain` etc. being used inconsistently).
meanprecipi conds_Heavy Rain conds_Light Drizzle conds_Light Rain conds_Rain
meanprecipi 1.00000 0.10814 -0.019928 0.06396 0.50710
conds_Heavy Rain 0.10814 1.00000 -0.007337 -0.01771 -0.01252
conds_Light Drizzle -0.01993 -0.00734 1.000000 -0.01911 -0.01351
conds_Light Rain 0.06396 -0.01771 -0.019107 1.00000 -0.03260
conds_Rain 0.50710 -0.01252 -0.013509 -0.03260 1.00000
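A correlation matrix like the one above is a one-liner in pandas; the sketch below rebuilds the `meanprecipi` / `conds_Rain` relationship on synthetic data, where precipitation is by construction higher when the rain dummy is set.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(1)
rain = rng.binomial(1, 0.3, size=1000)  # stand-in for the conds_Rain dummy
# precipitation is positive mostly when the rain dummy is set, hence correlated
precip = rain * rng.uniform(0.5, 5.0, size=1000) + rng.normal(0, 0.1, 1000)

frame = pd.DataFrame({"meanprecipi": precip, "conds_Rain": rain})
corr = frame.corr()  # pairwise Pearson correlations, as in the matrix above
```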
Potential overfitting of the model: I did not separate the data into a training set and a test set to gauge potential overfitting of the model. One particular area where the model might be overfitted is the subway stations. One interesting experiment would be to remove one subway station entirely and test how well the model predicts the ridership of that station. Although a model based on latitude and longitude coordinates (see Chosen Features) has a much lower $R^2$ value, based on such a test this model might turn out to be better for a use case where the model should predict the ridership of new subway stations.
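The leave-one-station-out experiment described above could be set up with a split like this (the helper `holdout_by_unit` is hypothetical, not part of the original code):

```python
import pandas as pd

def holdout_by_unit(df, unit):
    """Train on all other units, keep one unit's rows as the test set."""
    test_mask = df["UNIT"] == unit
    return df[~test_mask], df[test_mask]

# tiny synthetic example
df = pd.DataFrame({"UNIT": ["R001", "R002", "R001", "R003"],
                   "ENTRIESn_hourly": [10, 20, 30, 40]})
train, test = holdout_by_unit(df, "R001")
```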
Missing Data: Of course, additional data could be used to predict subway ridership even more accurately: what comes to mind is data about city events, tourism, financial data to try to model the impact of shopping trips etc. or, if attainable, other traffic data, e.g., from cars or tunnels.
from project_3.other_linear_regressions.advanced_linear_regressions import predictions
import pandas as pd
input_filename = "project_3/other_linear_regressions/turnstile_weather_v2.csv"
turnstile_master = pd.read_csv(input_filename)
predictions(turnstile_master)