Table of Contents
- Introduction
- Query and Dataframe Script
- Classification Models with GridSearch
- Voting Model Classifier
- Conclusion
Introduction
For this project, I classified Census Income data to identify part-time versus full-time workers (less than 40 hours per week versus 40+ hours). I used several classification algorithms including K-Nearest Neighbors (KNN), Logistic Regression, Support Vector Machines (SVM), and Random Forest.
I set up an AWS instance to work with the data and created a Python script to query and transform it into a workable dataframe. I evaluated each model using 3-fold cross-validation, optimized hyperparameters using GridSearch, and combined models using VotingClassifier. The best performing model was Support Vector Classification, achieving 82% accuracy.
Imports
# General
from __future__ import division  # must precede other code; true division under Python 2
import os
import json
import pickle
import random
import numpy as np
import pandas as pd
# Plotting
%matplotlib inline
from matplotlib import pyplot as plt
import statsmodels.formula.api as smf
import patsy
import seaborn as sns
# sklearn models
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import linear_model
from sklearn import metrics
from sklearn import tree
from sklearn.cross_validation import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
Query and Dataframe Script
The script below queries the census data hosted on Amazon Web Services for cloud collaboration and filters it into specific feature and response columns. Categorical features are expanded into separate indicator columns with "get_dummies", while the response column is 'hour_binned'. The administrative information (passwords, etc.) has been changed for security purposes.
import psycopg2
import pandas as pd

dbname = 'mcnulty'
user = 'USERNAME'
password = '@@@@'
host = 'ec2-51-34-173-157.us-west-2.compute.amazonaws.com'
port = '5432'

# inputs must be strings
def query_database(user, dbname='mcnulty', password='@@@@', host='ec2-51-34-173-157.us-west-2.compute.amazonaws.com', port='5432'):
    '''
    dbname: database name
    user: username
    password: password for the user
    host: public dns
    port: typically 5432 for postgresql
    returns a dataframe for the given query
    '''
    try:
        # Create connection with database
        conn = psycopg2.connect("dbname="+dbname+" user="+user+" password="+password+" host="+host+" port="+port)
        print "Connected"
        cur = conn.cursor()
        # Ask for the user's SQL query
        print "Query please: "
        input_query = raw_input()
        # Execute the query
        cur.execute(input_query)
        data = cur.fetchall()
        # Return dataframe
        df = pd.DataFrame(data)
        cur.close()
        conn.close()
        return df
    except psycopg2.Error:
        print "Connection error or query mistake"
def clean_data_x_y(df):
    '''
    df: input census dataframe (all data)
    return: processed feature matrix x and binary response y
    '''
    df_1 = df.copy()
    del df_1[0]  # delete original index column
    df_1 = df_1.dropna()
    df_1.columns = ['age','workclass','fnlwgt','education','education_years','marital_status','occupation','relationship','race','sex','capital_gain','capital_loss','hours_per_week','native_country','income','source']
    # remove whitespace from columns with string values
    string_cols = ['workclass','education','marital_status','occupation','relationship','race','sex','native_country','income']
    for col in string_cols:
        df_1[col] = df_1[col].map(str.strip)
    # binary response: 0 for under 40 hours per week, 1 for 40 or more
    df_1['hour_binned'] = [0 if i < 40 else 1 for i in df_1['hours_per_week']]
    x = df_1[['age','workclass','education','education_years','marital_status','occupation','relationship','race','sex','capital_gain','capital_loss','native_country','income']]
    y = df_1['hour_binned']
    x = pd.get_dummies(x, columns=['workclass','education','marital_status','occupation','relationship','race','sex','native_country','income'])
    return x, y
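As a minimal illustration of what the `get_dummies` call in the cleaning function does (Python 3 syntax, toy data rather than the census frame):

```python
import pandas as pd

# Toy frame mirroring the cleaning step: a numeric column kept as-is,
# a categorical column expanded into one indicator ("dummy") column per value
toy = pd.DataFrame({'age': [25, 40, 33],
                    'workclass': ['Private', 'State-gov', 'Private']})
dummies = pd.get_dummies(toy, columns=['workclass'])
print(list(dummies.columns))
# ['age', 'workclass_Private', 'workclass_State-gov']
```

Each categorical value becomes its own 0/1 column, which is what inflates the census feature matrix to 104 columns below.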
I imported the script and used its functions to query the data into a dataframe.
import project_mcnulty2class
df = project_mcnulty2class.query_database('chris')
Connected
Query please: 
Select * From Census
df.head()
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K | train |
| 1 | 3 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K | train |
| 2 | 4 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K | train |
| 3 | 5 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K | train |
| 4 | 6 | 37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K | train |
x,y = project_mcnulty2class.clean_data_x_y(df)
x.head()
|   | age | education_years | capital_gain | capital_loss | workclass_Federal-gov | workclass_Local-gov | workclass_Private | workclass_Self-emp-inc | workclass_Self-emp-not-inc | workclass_State-gov | ... | native_country_Scotland | native_country_South | native_country_Taiwan | native_country_Thailand | native_country_Trinadad&Tobago | native_country_United-States | native_country_Vietnam | native_country_Yugoslavia | income_<=50K | income_>50K |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50 | 13 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 38 | 9 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 53 | 7 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 3 | 28 | 13 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 37 | 14 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
5 rows × 104 columns
The response column "hour_binned" is now converted to binary values (0 or 1) depending on hours worked per week.
y.head()
# 0 < 40
# 1 >= 40
0    0
1    1
2    1
3    1
4    1
Name: hour_binned, dtype: int64
I separated the data into training and testing subsets:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=15)
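A quick sketch of how the split proportions work (using the current `sklearn.model_selection` path; the notebook's `sklearn.cross_validation` is the older equivalent):

```python
import numpy as np
from sklearn.model_selection import train_test_split  # modern path for the same helper

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.array([0, 1] * 5)
# test_size=0.3 holds out 30% of the rows (rounded up) for testing
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=15)
print(len(X_tr), len(X_te))  # 7 3
```

Fixing `random_state` makes the shuffle reproducible, so the same train/test rows come back on every run.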
Classification Models with GridSearch
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
Logistic Regression
Parameter C controls regularization strength: it is the inverse of the regularization penalty, so lower values of C regularize more heavily and produce lower model complexity. I tested seven different C values with the logistic model.
params_c = np.logspace(-3, 3, 7) # 10^-3 to 10^3
params_c
array([ 1.00000000e-03, 1.00000000e-02, 1.00000000e-01,
1.00000000e+00, 1.00000000e+01, 1.00000000e+02,
1.00000000e+03])
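The regularizing effect of C can be seen directly on synthetic data: smaller C shrinks the fitted coefficients toward zero (a sketch with current scikit-learn; `make_classification` data is an assumption, not the census set):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Sum of absolute coefficients grows as C relaxes the L2 penalty
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
norms = [np.abs(LogisticRegression(C=C, max_iter=1000).fit(X, y).coef_).sum()
         for C in (1e-3, 1.0, 1e3)]
print(norms)  # increasing: heavier penalty at C=0.001, lightest at C=1000
```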
# Setup grid search with model and search parameters
logreg_grid_search = GridSearchCV(LogisticRegression(), param_grid={'C': np.logspace(-3, 3, 7)})
# Fit on the training section
logreg_grid_search.fit(X_train, y_train)
GridSearchCV(cv=None, error_score='raise',
estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False),
fit_params={}, iid=True, n_jobs=1,
param_grid={'C': array([ 1.00000e-03, 1.00000e-02, 1.00000e-01, 1.00000e+00,
1.00000e+01, 1.00000e+02, 1.00000e+03])},
pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
# Predict on the test section (predicting 1 or 0)
log_pred_search_pred = logreg_grid_search.predict(X_test)
log_pred_search_pred
array([1, 1, 1, ..., 1, 0, 1])
Prediction Breakdown Analysis:
- Precision: True Positives / (True Positives + False Positives)
- Recall: True Positives / (True Positives + False Negatives)
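A worked example of these two formulas with hypothetical counts (80 true positives, 20 false positives, 10 false negatives):

```python
# Hypothetical counts, not from the census model
tp, fp, fn = 80, 20, 10
precision = tp / (tp + fp)   # 80 / 100 = 0.8
recall = tp / (tp + fn)      # 80 / 90  ≈ 0.889
# f1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(precision, round(recall, 3), round(f1, 3))  # 0.8 0.889 0.842
```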
print metrics.accuracy_score(y_test, log_pred_search_pred)
print metrics.classification_report(y_test, log_pred_search_pred)
0.797597110636
precision recall f1-score support
0 0.61 0.27 0.37 3034
1 0.82 0.95 0.88 10533
avg / total 0.77 0.80 0.77 13567
# What is the best "C" parameter
logreg_grid_search.best_params_
{'C': 1.0}
# What are the accuracy scores for all seven "C" parameters?
logreg_grid_search.grid_scores_
[mean: 0.78742, std: 0.00183, params: {'C': 0.001},
mean: 0.80126, std: 0.00063, params: {'C': 0.01},
mean: 0.80287, std: 0.00035, params: {'C': 0.10000000000000001},
mean: 0.80375, std: 0.00067, params: {'C': 1.0},
mean: 0.80284, std: 0.00046, params: {'C': 10.0},
mean: 0.80341, std: 0.00056, params: {'C': 100.0},
mean: 0.80255, std: 0.00173, params: {'C': 1000.0}]
C = 1.0 produces the best accuracy score of 0.80375. The results show there's no significant benefit to further fine-tuning C.
logreg_grid_search.best_score_
0.80375308017944025
The confusion matrix below (rows are actual class, columns are predicted class) shows:
- The model correctly predicted "0" 807 times (true negatives)
- The model predicted "1" for an actual "0" 2227 times (false positives)
- The model predicted "0" for an actual "1" 519 times (false negatives)
- The model correctly predicted "1" 10014 times (true positives)
confusion_matrix(y_test, logreg_grid_search.predict(X_test))
#          [pred 0] [pred 1]
# actual 0 [   807,    2227]
# actual 1 [   519,   10014]
array([[ 807, 2227],
[ 519, 10014]])
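The headline metrics reported above can be recovered directly from these four cells, which is a useful sanity check on how the matrix is oriented:

```python
# Cells of the reported matrix (rows = actual class, columns = predicted class)
tn, fp = 807, 2227     # actual 0: predicted 0, predicted 1
fn, tp = 519, 10014    # actual 1: predicted 0, predicted 1
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_1 = tp / (tp + fp)   # precision for class 1
recall_1 = tp / (tp + fn)      # recall for class 1
print(round(accuracy, 4), round(precision_1, 2), round(recall_1, 2))
# 0.7976 0.82 0.95
```

These match the 0.7976 accuracy and the class-1 row (0.82 precision, 0.95 recall) of the classification report above.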
K-Nearest Neighbors (KNN)
For the KNN model, the parameter being tuned is the number of "nearest neighbors" to consider. I implemented grid search with the following range:
range(3, 22, 2)
[3, 5, 7, 9, 11, 13, 15, 17, 19, 21]
knn_grid_search = GridSearchCV(KNeighborsClassifier(), param_grid={'n_neighbors': range(3, 22, 2)})
I fit the grid-search model on the training data; it identified 19 nearest neighbors as the optimal parameter, with 81.8% cross-validated accuracy.
knn_grid_search.fit(X_train, y_train)
GridSearchCV(cv=None, error_score='raise',
estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=5, p=2,
weights='uniform'),
fit_params={}, iid=True, n_jobs=1,
param_grid={'n_neighbors': [3, 5, 7, 9, 11, 13, 15, 17, 19, 21]},
pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
X_train.shape
(31654, 104)
knn_grid_search.best_score_
0.81777974347633786
knn_grid_search.best_params_
{'n_neighbors': 19}
I set up the model using the optimal parameter of 19 nearest neighbors:
knnroc = KNeighborsClassifier(n_neighbors=19)
knnroc.fit(X_train, y_train)
knnroc_pred = knnroc.predict(X_test)
print metrics.accuracy_score(y_test, knnroc_pred)
0.812781012752
I used the model to predict the probability of each outcome, which the ROC curve requires. The area under the curve (AUC) is the probability that the model ranks a randomly chosen positive example above a randomly chosen negative one. A curve hugging the upper left corner is ideal, while a diagonal line represents random guessing (AUC = 0.5).
knn_probs = knnroc.predict_proba(X_test)[:, 1]
print metrics.roc_auc_score(y_test, knn_probs)
0.765842305825
fpr, tpr, thresholds = metrics.roc_curve(y_test, knn_probs)
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
[ROC curve plot for the KNN model]
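The pairwise-ranking interpretation of AUC can be checked by hand on a tiny example (hypothetical scores, not the KNN output):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])  # predicted probability of class 1
# Of the 4 (negative, positive) pairs, 3 are ranked correctly
# (0.4 vs 0.35 is the one mistake), so AUC = 3/4
print(roc_auc_score(y_true, y_scores))  # 0.75
```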
Naive Bayes
The Naive Bayes model has no parameters to tune. It produced an accuracy score of 68%.
nb_model = GaussianNB()
nb_model.fit(X_train,y_train)
nb_model.score(X_test,y_test)
0.68755067443060369
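The model's mechanics in miniature: GaussianNB fits one Gaussian per class per feature and assigns each point to the likelier class (toy one-feature data, not the census set):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Two well-separated classes on a single feature
X = np.array([[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])
model = GaussianNB().fit(X, y)
# New points fall on either side of the class means
print(model.predict([[-1.2], [1.2]]))  # [0 1]
```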
Decision Tree
For the Decision Tree, I searched the "max_depth" parameter, which limits how many layers deep the tree can grow. The best accuracy score of roughly 81% was achieved with a max depth of approximately 4.
tree_grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid={'max_depth': np.logspace(-3, 3, 6)})
tree_grid_search.fit(X_train, y_train)
GridSearchCV(cv=None, error_score='raise',
estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=None, splitter='best'),
fit_params={}, iid=True, n_jobs=1,
param_grid={'max_depth': array([ 1.00000e-03, 1.58489e-02, 2.51189e-01, 3.98107e+00,
6.30957e+01, 1.00000e+03])},
pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
tree_grid_search.best_params_
{'max_depth': 3.9810717055349691}
tree_grid_search.best_score_
0.80975548113982432
Random Forest
For Random Forest, I searched the n_estimators parameter (the number of trees in the forest), testing values of 10, 100, and 1000. The optimal value from an earlier run was 1000. Note: the code is commented out here because of its long running time.
# forest_grid_search = GridSearchCV(RandomForestClassifier(), param_grid={'n_estimators': [10,100,1000]})
# forest_grid_search.fit(X_train, y_train)
# forest_grid_search.best_params_
# forest_grid_search.best_score_
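A runnable small-scale sketch of the same search, using the current `sklearn.model_selection` path and synthetic data so it finishes quickly (the census run above used larger n_estimators values):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV  # modern path for grid_search

X, y = make_classification(n_samples=300, random_state=0)
# Small forest sizes keep this illustrative run fast
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid={'n_estimators': [10, 50]}, cv=3)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```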
Support Vector Classification (SVC)
For SVC, the parameter "C" represents the penalty of the error term. I searched through three values (0.01, 1, 100) and found C=1 to be optimal with an 82% classification rate.
svc_grid_search = GridSearchCV(SVC(), param_grid={'C': [.01,1,100]})
svc_grid_search.fit(X_train, y_train)
GridSearchCV(cv=None, error_score='raise',
estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False),
fit_params={}, iid=True, n_jobs=1, param_grid={'C': [0.01, 1, 100]},
pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
svc_grid_search.best_params_
{'C': 1}
svc_grid_search.best_score_
0.81888544891640869
Voting Model Classifier
from sklearn.cross_validation import cross_val_score
from sklearn.ensemble import VotingClassifier
I created a variable for each model. Note: I didn't include the Decision Tree model here due to running time constraints, and the C values used for Logistic Regression (1000) and SVC (100) differ from the grid-search optima found above.
clf1 = LogisticRegression(C=1000,random_state=1)
clf2 = RandomForestClassifier(n_estimators=1000,random_state=1)
clf3 = GaussianNB()
clf4 = KNeighborsClassifier(n_neighbors=19)
clf5 = SVC(C=100)
The VotingClassifier combines the individual classification models and chooses a class by one of several voting methods. Its score method reports accuracy on the held-out test set.
Method 1: Hard Voting - Each model votes on whether someone works less than or greater than/equal to 40 hours per week. The majority vote wins. If three out of five models predict "0" (less than 40 hours), the classification results in "0".
eclf1 = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3), ('knn', clf4),('SVC',clf5)], voting='hard')
eclf1 = eclf1.fit(X_train, y_train)
print(eclf1.predict(X_test))
print(eclf1.score(X_test,y_test)) # accuracy on the test set
[1 1 1 ..., 1 0 1]
0.815950468047
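The majority-vote rule can be sketched by hand with hypothetical predictions from five models:

```python
import numpy as np

# Hypothetical 0/1 predictions from five models (rows) on four samples (columns)
votes = np.array([[0, 1, 1, 0],
                  [1, 1, 0, 0],
                  [0, 1, 1, 1],
                  [0, 1, 0, 0],
                  [1, 1, 1, 0]])
# A column mean above 0.5 means a majority of the five models voted "1"
majority = (votes.mean(axis=0) > 0.5).astype(int)
print(majority)  # [0 1 1 0]
```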
Method 2: Soft Voting - Each model outputs a probability for each class; the ensemble averages the probabilities across models and selects the class with the higher average. (SVC is excluded here because it was not fit with probability estimates enabled, which soft voting requires.)
eclf2 = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3), ('knn', clf4)],voting='soft')
eclf2 = eclf2.fit(X_train, y_train)
print(eclf2.predict(X_test))
print(eclf2.score(X_test,y_test))
[1 1 1 ..., 1 0 1]
0.797670818899
Method 3: Weighted Soft Voting - Same as soft voting, but certain models are weighted more heavily. Here, Logistic Regression has twice the weight of the other models (weights=[2,1,1,1]).
eclf3 = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3), ('knn', clf4)],voting='soft', weights=[2,1,1,1])
eclf3 = eclf3.fit(X_train, y_train)
print(eclf3.predict(X_test))
print(eclf3.score(X_test,y_test))
[1 1 1 ..., 1 0 1]
0.802830397288
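The effect of the weights can also be sketched by hand: with hypothetical probabilities, doubling one model's weight can flip the ensemble's decision.

```python
import numpy as np

# Hypothetical class-1 probabilities from four models for a single sample
probs = np.array([0.9, 0.3, 0.35, 0.3])
unweighted = probs.mean()                           # 0.4625 -> class 0
weighted = np.average(probs, weights=[2, 1, 1, 1])  # 0.55   -> class 1
print(unweighted, weighted)
```

Here the first (double-weighted) model is confident in class 1, pulling the weighted average over the 0.5 threshold even though the plain average stays below it.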
Conclusion
This project created a script to query census data from Amazon Web Services and transform it into a workable dataframe. I tested different classification models on training and testing sections of the dataset, using GridSearch to find optimal parameters. Finally, I combined all models using VotingClassifier with several voting methods.
The hard voting classifier (eclf1) produced a strong test accuracy of 81.6%, but the highest score came from the standalone Support Vector Classification model at 81.9% (cross-validated). This demonstrates that while ensemble methods can be powerful, a single well-tuned model can sometimes outperform combined approaches.
Model Performance Summary
- Support Vector Classification (SVC): 81.9% accuracy - Best performer
- Hard Voting Classifier: 81.6% accuracy
- K-Nearest Neighbors (KNN): 81.3% accuracy
- Decision Tree: 81.0% accuracy
- Logistic Regression: 80.4% accuracy
- Weighted Soft Voting: 80.3% accuracy
- Soft Voting Classifier: 79.8% accuracy
- Naive Bayes: 68.8% accuracy