This machine learning project tries to identify Persons of Interest from the Enron scandal. The Enron scandal was the largest case of corporate fraud in the history of the United States. Several people were indicted or found guilty of misappropriating large sums of money. People involved in the Enron scandal are called “persons of interest” (POI) in this assignment.
We are building a machine learning algorithm that, given some financial data about an employee and some data from the so-called Enron Corpus, predicts whether that person could be a person of interest.
Many persons of interest were among the main beneficiaries of the fraud and took away huge sums of money, whether as salary, bonus, stock options, or other compensation. The Enron Corpus additionally gives an idea of which employees were in frequent contact with one another. From this data source we hope to gain insight into how information about malpractices spread through the company.
No single feature about a person (financial or email-related) can give us a clear yes-or-no answer about whether a person is “of interest” or not. Machine learning helps us discover patterns in the interplay of the features available to us. It should also allow us to predict whether a person is “of interest” (meaning: involved in the scandal) when we see new data about persons for whom we do not yet have that label.
Of the 146 observations / persons in the data set, there were two outliers:
a person “named” “TOTAL”, probably included while scraping the financial data off a table from a web page or document. As a row containing the sum of all features, it lay far outside the spectrum for every feature and was easily recognizable
a person “named” “THE TRAVEL AGENCY IN THE PARK” which I also excluded due to it not being a real “person” at all.
I removed both data points from the data set in a hard-coded preprocessing step.
Without outliers, the data set contained 144 observations and of those, 18 (12.5%) were classified as POI and 126 (87.5%) as non-POI.
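For reference, a minimal sketch of this removal step, assuming the data is loaded as the usual `data_dict` from the project's `final_project_dataset.pkl` (keys are person names, values are feature dicts):

```python
import pickle

# Load the project data set (assumed file name from the Udacity starter code).
with open("final_project_dataset.pkl", "rb") as f:
    data_dict = pickle.load(f)

# Hard-coded removal of the two non-person entries identified above.
for outlier in ("TOTAL", "THE TRAVEL AGENCY IN THE PARK"):
    data_dict.pop(outlier, None)

# Sanity check: 144 observations remain, 18 POIs and 126 non-POIs.
n_poi = sum(1 for person in data_dict.values() if person["poi"])
print(len(data_dict), n_poi, len(data_dict) - n_poi)
```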
As a preprocessing step before selecting any features, I removed (or rather “disabled”) features which had a high fraction of missing (“NA”) values. I set the NA-fraction threshold at 75%, since a threshold of 50% removed too many features and hurt performance. The following features, with their fractions of NA values, were removed:
| Feature | Fraction of NA values |
| --- | --- |
| `loan_advances` | 0.979167 |
| `director_fees` | 0.895833 |
| `restricted_stock_deferred` | 0.881944 |
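A minimal sketch of this NA-fraction filter, assuming the `data_dict` from the previous sketch and that missing values are encoded as the string "NaN" (as in the original data set):

```python
import numpy as np
import pandas as pd

# Build a DataFrame from data_dict and convert the "NaN" strings to real NaNs.
df = pd.DataFrame.from_dict(data_dict, orient="index")
df = df.replace("NaN", np.nan)

# Fraction of missing values per feature, highest first.
na_fraction = df.isna().mean().sort_values(ascending=False)

# Disable every feature whose NA fraction exceeds the 75% threshold.
THRESHOLD = 0.75
disabled_features = na_fraction[na_fraction > THRESHOLD].index.tolist()
print(disabled_features)
```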
To narrow the features down, I tried both manual feature selection after an initial exploratory analysis of the data and automatic feature selection using the `SelectKBest` preprocessor. Automatic feature selection with `SelectKBest` can be enabled or disabled using `-f True` or `-f False`. The default is `-f False` (manual feature selection), since that provided the best performance for most of the algorithms, including the one we chose.
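A rough sketch of what the automatic path (`-f True`) does, reusing the `df` from the sketch above; `feature_names` is an illustrative name, not one taken from the project code:

```python
from sklearn.feature_selection import SelectKBest, f_classif

# Candidate features: everything except the label and the email address.
feature_names = [c for c in df.columns if c not in ("poi", "email_address")]
X = df[feature_names].fillna(0).astype(float).values
y = df["poi"].astype(int).values

# Keep the 10 highest-scoring features according to the ANOVA F-test.
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)

kept = [name for name, keep in zip(feature_names, selector.get_support()) if keep]
print(kept)
```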
For selecting features manually using exploratory data analysis, I implemented a helper class which shows GUI windows with POIs and non-POIs highlighted in different colors. An example window is shown below; the window is capable of displaying multiple plots in a tabbed view. It is also possible to control certain display parameters, like the number of histogram bins or the alpha transparency, with a spinner widget.
This GUI can be shown by calling the `poi_id.py` script with the options `-g univariate_analysis` and `-g bivariate_analysis`, respectively.
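The helper class itself is GUI code and too long to reproduce here, but a minimal sketch of the kind of bivariate plot it displays (POIs vs. non-POIs in different colors), again assuming the `df` from above, would look roughly like this:

```python
import matplotlib.pyplot as plt

poi_mask = df["poi"].astype(bool)

# Non-POIs and POIs in separate colors; salary vs. bonus as an example pair.
plt.scatter(df.loc[~poi_mask, "salary"], df.loc[~poi_mask, "bonus"],
            alpha=0.6, label="non-POI")
plt.scatter(df.loc[poi_mask, "salary"], df.loc[poi_mask, "bonus"],
            alpha=0.8, label="POI")
plt.xlabel("salary")
plt.ylabel("bonus")
plt.legend()
plt.show()
```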
From this exploratory analysis, I selected the following feature set as the most indicative for POI-vs.-non-POI classification:
`salary`, `bonus`, `deferral_payments`, `loan_advances` (rejected because its fraction of NA values is >0.75), `expenses`, `exercised_stock_options`, `deferred_income`, `other`, `rate_poi_to_this_person`, `rate_this_person_to_poi`, `rate_shared_receipt_with_poi`
I compared the performance of each algorithm by evaluating it without feature scaling, with `MinMaxScaler` feature scaling, and with Principal Component Analysis (PCA) without feature scaling. The script `poi_id.py` supports manually setting which configuration to run using the options `-s on`, `-s off`, or `-s pca`. Using PCA without scaling (`-s pca`) is the default and the best-performing setting for all algorithms that were tested. For the exact performance comparison, see Algorithms.
Why is PCA run without feature scaling? I was made aware by a Udacity forum post that PCA usually underperforms when scaling is applied before it, since PCA derives its axes from the variance of the features, which changes when each feature's values are rescaled to the interval [0, 1].
I tested this hypothesis by re-running the whole simulation, once with PCA after scaling and once with PCA without scaling. The result was generally improved performance for all algorithms when scaling was not used. Only Support Vector Machines stopped working whenever scaling was disabled.
The exact numbers for PCA-with-scaling vs. PCA-without-scaling are documented in a separate PDF file as an appendix. Frankly, the result that PCA performed better without scaling puzzled me, also because one project reviewer at Udacity commented: “Something to note here, PCA’s process occurs in euclidean space and similarly to algorithms like SVM e.t.c requires scaling to work optimally.”
It should be noted that some algorithms need some sort of scaling to run at all. For example, the Support Vector Classifier, when used with a polynomial kernel among other values in `GridSearchCV`, did not terminate within 45 minutes of runtime on my machine if scaling was turned off.
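For illustration, a sketch of the kind of SVC grid search that only finished in reasonable time with scaling enabled; the parameter values here are illustrative, not the exact grid used in the project:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

pipe = Pipeline([("scale", MinMaxScaler()), ("clf", SVC())])
param_grid = {
    "clf__kernel": ["rbf", "poly"],  # the polynomial kernel is the expensive one
    "clf__C": [1, 10, 100],
    "clf__gamma": ["scale", 0.1],
}
grid = GridSearchCV(pipe, param_grid, scoring="f1", cv=5)
# grid.fit(X, y)  # X, y as prepared above; without the scaler this can run for a very long time
```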
First, I cross-checked the existing `total_*` features by verifying that the sum of their constituents matched the value present in the data set. To support this step, `poi_id.py` has an option `-l <filename>` that writes out the whole data set to a CSV file, which can then be examined and further analyzed in, e.g., MS Excel.
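A sketch of this consistency check, assuming the `df` from above and the component breakdown from the original insider-pay spreadsheet (the exact column lists are an assumption here, not taken from the project code):

```python
# Components assumed to add up to total_payments and total_stock_value.
PAYMENT_COMPONENTS = ["salary", "bonus", "long_term_incentive", "deferred_income",
                      "deferral_payments", "loan_advances", "other", "expenses",
                      "director_fees"]
STOCK_COMPONENTS = ["exercised_stock_options", "restricted_stock",
                    "restricted_stock_deferred"]

recomputed_payments = df[PAYMENT_COMPONENTS].astype(float).fillna(0).sum(axis=1)
recomputed_stock = df[STOCK_COMPONENTS].astype(float).fillna(0).sum(axis=1)

# Persons whose stored totals do not match the recomputed sums.
bad_payments = df.index[recomputed_payments != df["total_payments"].astype(float).fillna(0)]
bad_stock = df.index[recomputed_stock != df["total_stock_value"].astype(float).fillna(0)]
print(list(bad_payments), list(bad_stock))
```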
I found a few records where a `total_*` feature was set to `NA` even though some of its summands were present. For these records, I could simply recalculate the total value. For other data points, I had to perform a hard-coded “one-off” fix in the data set to ensure consistency. The whole data cleaning step is done as a manual preprocessing step. The following images show the process in detail:
I also created three new features, derived from `from_this_person_to_poi`, `from_poi_to_this_person`, and `shared_receipt_with_poi`:
\(\text{rate_poi_to_this_person} = \frac{\text{from_poi_to_this_person}}{\text{from_messages}}\)
\(\text{rate_this_person_to_poi} = \frac{\text{from_this_person_to_poi}}{\text{to_messages}}\)
\(\text{rate_shared_receipt_with_poi} = \frac{\text{shared_receipt_with_poi}}{\text{to_messages}}\)
The rationale behind creating these features is that some people write fewer emails and some write more. What we really want to look at is the fraction of someone's emails that went to (or came from) a person of interest, in order to estimate how likely they were to be involved in the malpractices of other persons of interest.
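A minimal sketch of how these rate features can be computed from the raw email counts, following the formulas above and again assuming the `df` from the earlier sketches:

```python
def safe_ratio(numerator, denominator):
    # Guard against missing counts and division by zero; fall back to 0.
    ratio = numerator.astype(float) / denominator.astype(float)
    return ratio.where(denominator.astype(float) > 0, 0.0).fillna(0.0)

df["rate_poi_to_this_person"] = safe_ratio(df["from_poi_to_this_person"], df["from_messages"])
df["rate_this_person_to_poi"] = safe_ratio(df["from_this_person_to_poi"], df["to_messages"])
df["rate_shared_receipt_with_poi"] = safe_ratio(df["shared_receipt_with_poi"], df["to_messages"])
```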
I did, however, leave the original features in the data set for the case of automatic feature selection with `SelectKBest`, and used this as a way of validating the newly created features. Since I cannot assume a direct linear correlation among those features, they should not prove an obstacle to the algorithm.
One additional disclaimer should be added concerning the email features in general: since those features essentially make use of the knowledge of whether persons, including those we assigned to the test set, are POIs (which is the label we try to predict), we make ourselves guilty of “data leakage” or “test set peeking” (see the relevant discussion in the Udacity forum or the associated thread on Quora).
How to avoid this problem? The proper way would be to re-compute the email features during each train-test-split fold, for the training and test sets separately. For the training set, one would re-compute the email features from the Enron corpus using only the subset of emails for persons in the training set. For the test set, one could compute these values based on the whole data set.
For efficiency reasons, instead of running through the email corpus every time one generates the email-features, one could generate a graph-like data structure with …
This graph (or a sub-graph during training) could be re-used to calculate the email features on the fly during each train-test-split fold. The following image illustrates the principle of this idea for sent/received emails; an implementation, however, is out of scope for this project.
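A minimal sketch of what such a graph-like structure could look like (everything here, including the hypothetical per-email update function, is illustrative and not part of the project code):

```python
from collections import defaultdict

# sender -> recipient -> number of emails sent
sent_counts = defaultdict(lambda: defaultdict(int))

def add_email(sender, recipient):
    # Called once per (sender, recipient) pair while walking the Enron corpus.
    sent_counts[sender][recipient] += 1

def rate_to_poi(person, poi_set):
    # Fraction of `person`'s sent emails that went to someone in `poi_set`.
    # During training, `poi_set` would contain only the training-set POIs,
    # which avoids the test-set peeking described above.
    sent = sent_counts[person]
    total = sum(sent.values())
    to_poi = sum(n for recipient, n in sent.items() if recipient in poi_set)
    return to_poi / total if total else 0.0
```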
The image below shows the `SelectKBest` results for the different classifiers. Every classifier used the same set of 10 automatically selected features (shown on the x-axis). The number (10 features) was chosen to make the results comparable to the 10 manually selected features. Reducing the number of features further (e.g., by rejecting more features with a high fraction of NA values) impacted performance negatively.