Enron Submission Free-Response Questions

Summarize for us the goal of this project and how machine learning is useful in trying to accomplish it. As part of your answer, give some background on the dataset and how it can be used to answer the project question. Were there any outliers in the data when you got it, and how did you handle those?

The goal of this project is to create a machine learning algorithm that will accurately identify an Enron employee as a Person of Interest ('POI'). The algorithm analyzes a set of features for each Enron employee, separated into email features (email correspondence with and without POIs) and financial features (monetary and equity compensation amounts), along with the POI label itself. By training the algorithm to focus on features that distinguish POIs from non-POIs, we hope it will be able to take a new record with the same features and label it correctly.

As mentioned above, POI is the label we will be focusing on; the dataset contains 18 POIs and 128 non-POIs, for a total of 146 employees. Each employee has data spread across 21 features, although many features are empty for various reasons (e.g. one employee is compensated only through stock, so total_payments is NaN; another employee had no recorded emails sent to anyone, so to_messages is NaN). These NaN values were converted to 0 using the pandas replace function.
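A minimal sketch of that cleanup step (the tiny data_dict here is an illustrative stand-in for the project's full dictionary, not the actual data):

    import pandas as pd

    # Illustrative stand-in for the project's data_dict: {employee: {feature: value}},
    # with missing values stored as the string 'NaN'.
    data_dict = {
        'EMPLOYEE A': {'salary': 250000, 'bonus': 'NaN', 'poi': False},
        'EMPLOYEE B': {'salary': 'NaN', 'bonus': 500000, 'poi': True},
    }

    df = pd.DataFrame.from_dict(data_dict, orient='index')

    # Replace the 'NaN' placeholder strings with 0 so the features can be treated as numeric.
    df = df.replace('NaN', 0)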

Outliers were fairly easy to correct by checking the dataset for unusual names ("TOTAL" and "THE TRAVEL AGENCY IN THE PARK" stood out) and by checking whether any records had corrupted or null values ("LOCKHART EUGENE E" had null values across all features); these three records were dropped.
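A sketch of how those three records could be dropped, assuming the same dictionary-of-dictionaries structure as above:

    # Records flagged as outliers or as effectively empty.
    outliers = ['TOTAL', 'THE TRAVEL AGENCY IN THE PARK', 'LOCKHART EUGENE E']

    for name in outliers:
        # pop() with a default avoids a KeyError if a record is already absent.
        data_dict.pop(name, None)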

What features did you end up using in your POI identifier, and what selection process did you use to pick them? Did you have to do any scaling? Why or why not? As part of the assignment, you should attempt to engineer your own feature that does not come ready-made in the dataset -- explain what feature you tried to make, and the rationale behind it. (You do not necessarily have to use it in the final analysis, only engineer and test it.) In your feature selection step, if you used an algorithm like a decision tree, please also give the feature importances of the features that you use, and if you used an automated feature selection function like SelectKBest, please report the feature scores and reasons for your choice of parameter values.

I created five new features, two of which were used in the final dataset. Three features involved ratios of POI email interaction:

  • poi_from_emails_frac: ratio of emails sent from the employee to POIs to the employee's total emails sent,
  • poi_to_emails_frac: ratio of emails received by the employee from POIs to the employee's total emails received,
  • poi_all_emails_frac: a combination of the above

The other two features were a little different (a sketch of how all five could be computed follows this list):

  • zero_values: the number of zero values divided by the total number of original features (excluding the POI label). This feature in particular seems quite valuable: POIs appear more likely to have received compensation through many different means (and thus have a lower zero_values), and vice versa.
  • email_avail: whether or not the employee had an Enron email address on record.
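A minimal sketch of how these five features could be computed for a single employee record; the helper name and the exact field handling are illustrative assumptions rather than the project's actual code:

    def add_engineered_features(record, base_features):
        """Add the five engineered features to one employee dict (values already 0-filled)."""
        to_msgs = record.get('to_messages', 0)
        from_msgs = record.get('from_messages', 0)
        from_poi = record.get('from_poi_to_this_person', 0)
        to_poi = record.get('from_this_person_to_poi', 0)

        # Email ratios, guarding against division by zero for employees with no recorded email.
        record['poi_from_emails_frac'] = to_poi / from_msgs if from_msgs else 0.0
        record['poi_to_emails_frac'] = from_poi / to_msgs if to_msgs else 0.0
        total_msgs = to_msgs + from_msgs
        record['poi_all_emails_frac'] = (to_poi + from_poi) / total_msgs if total_msgs else 0.0

        # Fraction of the original features (excluding the 'poi' label) that are zero.
        non_label = [f for f in base_features if f != 'poi']
        record['zero_values'] = sum(1 for f in non_label if record.get(f, 0) == 0) / len(non_label)

        # Whether an Enron email address is on record.
        record['email_avail'] = 0 if record.get('email_address', 0) in (0, 'NaN', '') else 1
        return record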

In the final Random Forest classifier, zero_values, poi_from_emails_frac, and email_avail were selected and used by both the Random Forest feature_importances_ ranking and SelectKBest.

I first used pandas to analyze the features, creating a correlation matrix of R-squared values. From it, I built a list of the five feature pairs with the highest R-squared values and cross-referenced these pairs against the R-squared value of each feature against the POI label. Highly correlated feature pairs are likely to be redundant and to hinder the machine learning algorithm, so in each pair the feature with the lower R-squared against POI was removed.
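A sketch of that correlation screening, assuming df is the cleaned DataFrame from earlier with only numeric columns (poi stored as 0/1); the variable names are illustrative:

    import numpy as np

    r_squared = df.corr() ** 2             # squared Pearson correlations between features
    poi_r2 = r_squared['poi'].drop('poi')  # R-squared of each feature against the POI label

    # Keep only the upper triangle so each feature pair is counted once,
    # then take the five most strongly correlated pairs.
    pair_r2 = r_squared.drop(index='poi', columns='poi')
    mask = np.triu(np.ones(pair_r2.shape, dtype=bool), k=1)
    top_pairs = pair_r2.where(mask).stack().sort_values(ascending=False).head(5)

    # In each highly correlated pair, drop the feature with the lower R-squared against POI.
    to_drop = {a if poi_r2[a] < poi_r2[b] else b for a, b in top_pairs.index}
    reduced_df = df.drop(columns=to_drop)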

I also used the RandomForest classifier to calculate feature_importances_ on the remaining features and filter even further:

  • bonus: 0.1652
  • exercised_stock_options: 0.1533
  • zero_values: 0.1021
  • poi_all_emails_frac: 0.0909
  • salary: 0.0825
  • email_avail: 0.0790
  • deferred_income: 0.0568
  • poi_from_emails_frac: 0.0486
  • total_payments: 0.0379
  • expenses: 0.0177
  • long_term_incentive: 0.0166
  • from_poi_to_this_person: 0.0140
  • shared_receipt_with_poi: 0.0000

I decided to filter out features with a feature_importances_ value below 0.02 for the final feature list; a sketch of this step follows.
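The sketch below assumes features, labels, and feature_names have been built from the reduced feature list (these names are assumptions, not the project's exact variables):

    from sklearn.ensemble import RandomForestClassifier

    # features: 2-D array of feature values; labels: 1-D array of 0/1 POI flags;
    # feature_names: the matching column names.
    clf = RandomForestClassifier(random_state=42)
    clf.fit(features, labels)

    # Pair each importance with its feature name, highest first.
    ranked = sorted(zip(clf.feature_importances_, feature_names), reverse=True)

    # Keep only features whose importance is at least 0.02.
    feature_list = ['poi'] + [name for score, name in ranked if score >= 0.02]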

SelectKBest was run across multiple k values with Random Forest, and we can see a noticeable improvement in the best_score_ value as more features are used, with the ideal k value being 8 (the scores are listed below, followed by a sketch of the sweep). Since there were only 9 features in the final list, the default k value of 10 could not be evaluated.

  • k = 1: 0.366666666667
  • k = 2: 0.375
  • k = 3: 0.503333333333
  • k = 4: 0.516333333333
  • k = 5: 0.503333333333
  • k = 6: 0.51
  • k = 7: 0.416666666667
  • k = 8: 0.53
  • k = 9: 0.403333333333
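A sketch of how that sweep could be run with a Pipeline of SelectKBest and a Random Forest inside GridSearchCV; the scoring function and cross-validation settings here are assumptions:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline

    pipe = Pipeline([
        ('SelectKBest', SelectKBest(f_classif)),
        ('random_forest', RandomForestClassifier(random_state=42)),
    ])

    # Try every k from 1 up to the 9 available features.
    grid = GridSearchCV(pipe, {'SelectKBest__k': range(1, 10)}, cv=5)
    grid.fit(features, labels)

    # Mean cross-validated score for each k, mirroring the list above.
    for k, score in zip(grid.cv_results_['param_SelectKBest__k'],
                        grid.cv_results_['mean_test_score']):
        print('k =', k, ':', score)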

MinMaxScaler was also used, but only inside the Pipeline with GridSearchCV during the final tuning process.

However, SelectKBest was still included in the pipeline below, and the final results show that its use in combination with the feature list selected from the Random Forest feature_importances_ was still helpful. The data are shown below.

What algorithm did you end up using? What other one(s) did you try? How did model performance differ between algorithms?

I created a quick function to run through various classifiers individually (Naive Bayes, Decision Trees, Random Forest, AdaBoost, SVM). Random Forest, Decision Trees and Naive Bayes did quite well, scoring well above 0.3 in both recall and precision, whereas AdaBoost and SVM were both ineffective (most likely due to the lack of tuning).
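A minimal sketch of such a comparison loop, assuming features and labels from earlier; the project's tester script is replaced here with a simple cross-validated precision/recall check:

    from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
    from sklearn.model_selection import cross_validate
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    classifiers = {
        'Naive Bayes': GaussianNB(),
        'Decision Tree': DecisionTreeClassifier(random_state=42),
        'Random Forest': RandomForestClassifier(random_state=42),
        'AdaBoost': AdaBoostClassifier(random_state=42),
        'SVM': SVC(),
    }

    for name, clf in classifiers.items():
        # Untuned, default-parameter classifiers scored on precision and recall.
        scores = cross_validate(clf, features, labels, cv=5,
                                scoring=['precision', 'recall'])
        print(name,
              'precision:', scores['test_precision'].mean(),
              'recall:', scores['test_recall'].mean())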

I ended up deciding to tune Naive Bayes (due to its simplicity and speed) and Random Forest (due to its sophistication, and the fact that I had already used its feature_importances_ attribute for feature selection). Random Forest in particular also does well when the features used are not strongly correlated with one another; my initial removal of correlated variables in pandas avoids this issue.

What does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well? How did you tune the parameters of your particular algorithm? What parameters did you tune? (Some algorithms do not have parameters that you need to tune -- if this is the case for the one you picked, identify and briefly explain how you would have done it for the model that was not your final choice or a different model that does utilize parameter tuning, e.g. a decision tree classifier).

Parameter tuning helps to improve the performance of a classifier, and the optimal parameters ultimately depend on the nature of the data as well as the needs of its users. I mentioned above that AdaBoost and SVM did not return good results; this is due to their parameters, or rather the complete lack of tuning of them.

For both Naive Bayes and Random Forest, MinMaxScaler and SelectKBest were used to transform the data before the data was fit. This was cross validated with Stratified Shuffle Split, and the parameters were all tested with GridSearchCV to find the best combination of parameters.
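A sketch of that tuning setup for the Random Forest branch (the Naive Bayes pipeline is analogous, with GaussianNB as the final step); the grid values and split settings are illustrative except where stated elsewhere in this report:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import MinMaxScaler

    pipe = Pipeline([
        ('scaler', MinMaxScaler()),
        ('SelectKBest', SelectKBest(f_classif)),
        ('random_forest', RandomForestClassifier(random_state=42)),
    ])

    param_grid = {
        'SelectKBest__k': range(1, 10),
        'random_forest__n_estimators': [5, 9, 10, 50],
        'random_forest__min_samples_split': [2, 4, 8],
        'random_forest__criterion': ['gini', 'entropy'],
    }

    # Stratified shuffle split keeps the POI/non-POI proportions in every fold;
    # scoring on F1 balances precision and recall for the small POI class.
    cv = StratifiedShuffleSplit(n_splits=100, test_size=0.3, random_state=42)
    grid = GridSearchCV(pipe, param_grid, cv=cv, scoring='f1')
    grid.fit(features, labels)
    print(grid.best_params_)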

Interestingly, with the Naive Bayes pipeline, using MinMaxScaler actually gave a worse precision score, and so it was removed. I believe this is because the GridSearchCV scoring was weighted towards the overall F1 score (the harmonic mean of precision and recall), which sacrificed precision for recall.

Tuned Naive Bayes

  • Accuracy: 0.59213
  • Precision: 0.22153
  • Recall: 0.81900
  • F1: 0.34873
  • F2: 0.53203

For the Random Forest, there were many more parameters to adjust and search over. Besides setting the random state to 42 (to avoid differing results between runs), the parameters were selected automatically through GridSearchCV. While the algorithm was considerably more complicated and took longer to run, its scores were better than those of the Naive Bayes model. The best parameters are listed below, followed by a sketch of how they map back onto the pipeline.

Random Forest Parameters

  • random_forest__n_estimators: 9
  • SelectKBest__k: 6
  • random_forest__min_samples_split: 2
  • random_forest__criterion: 'gini'
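For reference, a pipeline with those best parameters could be reconstructed directly as below (a sketch; in practice the fitted object simply comes from grid.best_estimator_):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import MinMaxScaler

    final_clf = Pipeline([
        ('scaler', MinMaxScaler()),
        ('SelectKBest', SelectKBest(f_classif, k=6)),
        ('random_forest', RandomForestClassifier(n_estimators=9,
                                                 min_samples_split=2,
                                                 criterion='gini',
                                                 random_state=42)),
    ])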

Tuned RF

  • Accuracy: 0.89793
  • Precision: 0.64739
  • Recall: 0.51500
  • F1: 0.57366
  • F2: 0.53696

Overall, I would say the Naive Bayes classifier may actually be the more practical choice, simply because of how long the Random Forest model took to train and test. The Random Forest's results, however, are clearly the stronger of the two.

Another interesting note is that the best_score_ values calculated above suggest that k = 8 is better than k = 6, and yet the latter was chosen. This is most likely because the GridSearchCV was weighted towards F1, and not towards best_score_ (the mean cross-validated score of the best_estimator_). In combination with the Random Forest parameters, it appears that k = 6 provided a better overall result.

I ran one more test with a Random Forest classifier without using SelectKBest, and the results were marginally less effective:

Tuned RF without SKB

  • Accuracy: 0.89193
  • Precision: 0.62426
  • Recall: 0.47600
  • F1: 0.54014
  • F2: 0.49974

Thus, SelectKBest managed to improve all measurements for the classifier, despite the initial Random Forest feature selection process.

What is validation, and what’s a classic mistake you can make if you do it wrong? How did you validate your analysis?

Validation is the process of setting aside a sample of data from the training set and using that sample to tune a model's parameters and evaluate its performance. A classic mistake is not sampling the data properly: the validation sample ends up biased in some way and drastically different from the training data, which distorts the parameter tuning and the reported accuracy.

Our analysis was cross validated with Stratified Shuffle Split, which shuffles the data when splitting it into folds but preserves the percentage of samples for each class. Thus, each fold contains roughly the same proportion of POIs and non-POIs as the full dataset.
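A minimal illustration of that behaviour, assuming features and labels are NumPy arrays from earlier (the number of splits and test size here are arbitrary):

    from sklearn.model_selection import StratifiedShuffleSplit

    sss = StratifiedShuffleSplit(n_splits=3, test_size=0.3, random_state=42)

    # Each split shuffles the data but keeps the POI fraction roughly equal
    # between the training and testing indices.
    for train_idx, test_idx in sss.split(features, labels):
        print('train POI fraction:', labels[train_idx].mean(),
              'test POI fraction:', labels[test_idx].mean())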

Give at least 2 evaluation metrics and your average performance for each of them. Explain an interpretation of your metrics that says something human-understandable about your algorithm’s performance.

The main evaluation metrics used were Precision and Recall.

In the context of the Enron project, precision is the percentage of correctly labelled POIs out of all predicted POIs, whether labelled correctly or not. Thus, for the Random Forest classifier, its precision of 0.647 means that when it flags an employee as a POI, it is correct 64.7% of the time.

Recall, on the other hand, is the percentage of correctly labelled POIs out of all actual POIs. For the Random Forest classifier, its recall of 0.515 means that it identifies 51.5% of the actual POIs in the data.
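As a quick illustration of these two definitions, both metrics can be computed from a handful of made-up predictions (the labels below are purely for showing the arithmetic):

    from sklearn.metrics import precision_score, recall_score

    # Made-up labels: 1 = POI, 0 = non-POI.
    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

    # Precision: of the 3 employees flagged as POIs, 2 really are POIs -> 2/3.
    print(precision_score(y_true, y_pred))

    # Recall: of the 4 actual POIs, 2 were flagged -> 2/4.
    print(recall_score(y_true, y_pred))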