Table of Contents
- Introduction
- Query and Dataframe Script
- Classification Models with GridSearch
- Voting Model Classifier
- Conclusion
Introduction
For this project, I classified Census Income data to identify part-time versus full-time workers (less than 40 hours per week versus 40+ hours). I used several classification algorithms including K-Nearest Neighbors (KNN), Logistic Regression, Support Vector Machines (SVM), and Random Forest.
I set up an AWS instance to work with the data and created a Python script to query and transform it into a workable dataframe. I evaluated each model using 3-fold cross-validation, optimized hyperparameters using GridSearch, and combined models using VotingClassifier. The best performing model was Support Vector Classification, achieving 82% accuracy.
Imports
# General
from __future__ import division  # must precede other code; true division under Python 2
import os
import json
import pickle
import random
import numpy as np
import pandas as pd
# Plotting
%matplotlib inline
from matplotlib import pyplot as plt
import statsmodels.formula.api as smf
import patsy
import seaborn as sns
# sklearn models
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import linear_model
from sklearn import metrics
from sklearn import tree
from sklearn.cross_validation import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
Query and Dataframe Script
The script below queries the census data hosted on Amazon Web Services for cloud collaboration and filters it into specific feature and response columns. Categorical features are expanded into separate indicator columns with "get_dummies", while the response column is 'hour_binned'. The administrative information (passwords, etc.) has been changed for security purposes.
import psycopg2
import pandas as pd

dbname = 'mcnulty'
user = 'USERNAME'
password = '@@@@'
host = 'ec2-51-34-173-157.us-west-2.compute.amazonaws.com'
port = '5432'

# inputs must be strings
def query_database(user, dbname='mcnulty', password='@@@@', host='ec2-51-34-173-157.us-west-2.compute.amazonaws.com', port='5432'):
    '''
    dbname: database name
    user: username
    password: password for the user
    host: public dns
    port: typically 5432 for postgresql
    returns a dataframe for the given query
    '''
    try:
        # Create connection with database
        conn = psycopg2.connect("dbname="+dbname+" user="+user+" password="+password+" host="+host+" port="+port)
        print "Connected"
        cur = conn.cursor()
        # Ask for the user's SQL query
        print "Query please: "
        input_query = raw_input()
        # Execute the query
        cur.execute(input_query)
        data = cur.fetchall()
        # Return dataframe
        df = pd.DataFrame(data)
        cur.close()
        conn.close()
        return df
    except psycopg2.Error:
        print "Connection error or query mistake"
def clean_data_x_y(df):
    '''
    df: input census dataframe (all data)
    return: processed feature matrix x and binary response y
    '''
    df_1 = df.copy()
    del df_1[0]  # delete original index column
    df_1 = df_1.dropna()
    df_1.columns = ['age','workclass','fnlwgt','education','education_years','marital_status','occupation','relationship','race','sex','capital_gain','capital_loss','hours_per_week','native_country','income','source']
    # remove whitespace from columns with string values
    string_cols = ['workclass','education','marital_status','occupation','relationship','race','sex','native_country','income']
    for col in string_cols:
        df_1[col] = df_1[col].map(str.strip)
    # binary response: 0 for under 40 hours per week, 1 for 40 or more
    df_1['hour_binned'] = [0 if i < 40 else 1 for i in df_1['hours_per_week']]
    x = df_1[['age','workclass','education','education_years','marital_status','occupation','relationship','race','sex','capital_gain','capital_loss','native_country','income']]
    y = df_1['hour_binned']
    x = pd.get_dummies(x, columns=['workclass','education','marital_status','occupation','relationship','race','sex','native_country','income'])
    return x, y
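As a minimal illustration of what the `get_dummies` call in the cleaning function does (Python 3 syntax, toy data rather than the census frame):

```python
import pandas as pd

# Toy frame mirroring the cleaning step: a numeric column kept as-is,
# a categorical column expanded into one indicator ("dummy") column per value
toy = pd.DataFrame({'age': [25, 40, 33],
                    'workclass': ['Private', 'State-gov', 'Private']})
dummies = pd.get_dummies(toy, columns=['workclass'])
print(list(dummies.columns))
# ['age', 'workclass_Private', 'workclass_State-gov']
```

Each categorical value becomes its own 0/1 column, which is what inflates the census feature matrix to 104 columns below.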
I imported the script and used its functions to query the data into a dataframe.
import project_mcnulty2class
df = project_mcnulty2class.query_database('chris')
Connected
Query please: 
Select * From Census
df.head()
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K | train |
| 1 | 3 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K | train |
| 2 | 4 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K | train |
| 3 | 5 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K | train |
| 4 | 6 | 37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K | train |
x,y = project_mcnulty2class.clean_data_x_y(df)
x.head()
|   | age | education_years | capital_gain | capital_loss | workclass_Federal-gov | workclass_Local-gov | workclass_Private | workclass_Self-emp-inc | workclass_Self-emp-not-inc | workclass_State-gov | ... | native_country_Scotland | native_country_South | native_country_Taiwan | native_country_Thailand | native_country_Trinadad&Tobago | native_country_United-States | native_country_Vietnam | native_country_Yugoslavia | income_<=50K | income_>50K |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50 | 13 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 38 | 9 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 53 | 7 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 3 | 28 | 13 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 37 | 14 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
5 rows × 104 columns
The response column "hour_binned" is now converted to binary values (0 or 1) depending on hours worked per week.
y.head()
# 0 < 40
# 1 >= 40
0    0
1    1
2    1
3    1
4    1
Name: hour_binned, dtype: int64
I separated the data into training and testing subsets:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=15)
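A quick sketch of how the split proportions work (using the current `sklearn.model_selection` path; the notebook's `sklearn.cross_validation` is the older equivalent):

```python
import numpy as np
from sklearn.model_selection import train_test_split  # modern path for the same helper

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.array([0, 1] * 5)
# test_size=0.3 holds out 30% of the rows (rounded up) for testing
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=15)
print(len(X_tr), len(X_te))  # 7 3
```

Fixing `random_state` makes the shuffle reproducible, so the same train/test rows come back on every run.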
Classification Models with GridSearch
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
Logistic Regression
Parameter C controls regularization strength: it is the inverse of the regularization penalty, so lower values of C regularize more heavily and produce lower model complexity. I tested seven different C values with the logistic model.
params_c = np.logspace(-3, 3, 7) # 10^-3 to 10^3
params_c
array([ 1.00000000e-03, 1.00000000e-02, 1.00000000e-01,
1.00000000e+00, 1.00000000e+01, 1.00000000e+02,
1.00000000e+03])
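The regularizing effect of C can be seen directly on synthetic data: smaller C shrinks the fitted coefficients toward zero (a sketch with current scikit-learn; `make_classification` data is an assumption, not the census set):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Sum of absolute coefficients grows as C relaxes the L2 penalty
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
norms = [np.abs(LogisticRegression(C=C, max_iter=1000).fit(X, y).coef_).sum()
         for C in (1e-3, 1.0, 1e3)]
print(norms)  # increasing: heavier penalty at C=0.001, lightest at C=1000
```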
# Setup grid search with model and search parameters
logreg_grid_search = GridSearchCV(LogisticRegression(), param_grid={'C': np.logspace(-3, 3, 7)})
# Fit on the training section
logreg_grid_search.fit(X_train, y_train)
GridSearchCV(cv=None, error_score='raise',
estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False),
fit_params={}, iid=True, n_jobs=1,
param_grid={'C': array([ 1.00000e-03, 1.00000e-02, 1.00000e-01, 1.00000e+00,
1.00000e+01, 1.00000e+02, 1.00000e+03])},
pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
# Predict on the test section (predicting 1 or 0)
log_pred_search_pred = logreg_grid_search.predict(X_test)
log_pred_search_pred
array([1, 1, 1, ..., 1, 0, 1])
Prediction Breakdown Analysis:
- Precision: True Positives / (True Positives + False Positives)
- Recall: True Positives / (True Positives + False Negatives)
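A worked example of these two formulas with hypothetical counts (80 true positives, 20 false positives, 10 false negatives):

```python
# Hypothetical counts, not from the census model
tp, fp, fn = 80, 20, 10
precision = tp / (tp + fp)   # 80 / 100 = 0.8
recall = tp / (tp + fn)      # 80 / 90  ≈ 0.889
# f1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(precision, round(recall, 3), round(f1, 3))  # 0.8 0.889 0.842
```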
print metrics.accuracy_score(y_test, log_pred_search_pred)
print metrics.classification_report(y_test, log_pred_search_pred)
0.797597110636
precision recall f1-score support
0 0.61 0.27 0.37 3034
1 0.82 0.95 0.88 10533
avg / total 0.77 0.80 0.77 13567
# What is the best "C" parameter
logreg_grid_search.best_params_
{'C': 1.0}
# What are the accuracy scores for all seven "C" parameters?
logreg_grid_search.grid_scores_
[mean: 0.78742, std: 0.00183, params: {'C': 0.001},
mean: 0.80126, std: 0.00063, params: {'C': 0.01},
mean: 0.80287, std: 0.00035, params: {'C': 0.10000000000000001},
mean: 0.80375, std: 0.00067, params: {'C': 1.0},
mean: 0.80284, std: 0.00046, params: {'C': 10.0},
mean: 0.80341, std: 0.00056, params: {'C': 100.0},
mean: 0.80255, std: 0.00173, params: {'C': 1000.0}]
C = 1.0 produces the best accuracy score of 0.80375. The results show there's no significant benefit to further fine-tuning C.
logreg_grid_search.best_score_
0.80375308017944025
The confusion matrix below (rows are actual class, columns are predicted class) shows:
- The model correctly predicted "0" 807 times (true negatives)
- The model predicted "1" for an actual "0" 2227 times (false positives)
- The model predicted "0" for an actual "1" 519 times (false negatives)
- The model correctly predicted "1" 10014 times (true positives)
confusion_matrix(y_test, logreg_grid_search.predict(X_test))
#          [pred 0] [pred 1]
# actual 0 [   807,    2227]
# actual 1 [   519,   10014]
array([[ 807, 2227],
[ 519, 10014]])
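The headline metrics reported above can be recovered directly from these four cells, which is a useful sanity check on how the matrix is oriented:

```python
# Cells of the reported matrix (rows = actual class, columns = predicted class)
tn, fp = 807, 2227     # actual 0: predicted 0, predicted 1
fn, tp = 519, 10014    # actual 1: predicted 0, predicted 1
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_1 = tp / (tp + fp)   # precision for class 1
recall_1 = tp / (tp + fn)      # recall for class 1
print(round(accuracy, 4), round(precision_1, 2), round(recall_1, 2))
# 0.7976 0.82 0.95
```

These match the 0.7976 accuracy and the class-1 row (0.82 precision, 0.95 recall) of the classification report above.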
K-Nearest Neighbors (KNN)
For the KNN model, the parameter being tuned is the number of "nearest neighbors" to consider. I implemented grid search with the following range:
range(3, 22, 2)
[3, 5, 7, 9, 11, 13, 15, 17, 19, 21]
knn_grid_search = GridSearchCV(KNeighborsClassifier(), param_grid={'n_neighbors': range(3, 22, 2)})
I fit the grid-search model on the training data; it identified 19 nearest neighbors as the optimal parameter, with 81.8% cross-validated accuracy.
knn_grid_search.fit(X_train, y_train)
GridSearchCV(cv=None, error_score='raise',
estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=5, p=2,
weights='uniform'),
fit_params={}, iid=True, n_jobs=1,
param_grid={'n_neighbors': [3, 5, 7, 9, 11, 13, 15, 17, 19, 21]},
pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
X_train.shape
(31654, 104)
knn_grid_search.best_score_
0.81777974347633786
knn_grid_search.best_params_
{'n_neighbors': 19}
I set up the model using the optimal parameter of 19 nearest neighbors:
knnroc = KNeighborsClassifier(n_neighbors=19)
knnroc.fit(X_train, y_train)
knnroc_pred = knnroc.predict(X_test)
print metrics.accuracy_score(y_test, knnroc_pred)
0.812781012752
I used the model to predict the probability of each outcome, which the ROC curve requires. The area under the curve (AUC) is the probability that the model ranks a randomly chosen positive example above a randomly chosen negative one. A curve hugging the upper left corner is ideal, while a diagonal line represents random guessing (AUC = 0.5).
knn_probs = knnroc.predict_proba(X_test)[:, 1]
print metrics.roc_auc_score(y_test, knn_probs)
0.765842305825
fpr, tpr, thresholds = metrics.roc_curve(y_test, knn_probs)
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
[ROC curve plot for the KNN model]
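The pairwise-ranking interpretation of AUC can be checked by hand on a tiny example (hypothetical scores, not the KNN output):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])  # predicted probability of class 1
# Of the 4 (negative, positive) pairs, 3 are ranked correctly
# (0.4 vs 0.35 is the one mistake), so AUC = 3/4
print(roc_auc_score(y_true, y_scores))  # 0.75
```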
Naive Bayes
The Naive Bayes model has no parameters to tune. It produced an accuracy score of 68%.
nb_model = GaussianNB()
nb_model.fit(X_train,y_train)
nb_model.score(X_test,y_test)
0.68755067443060369
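The model's mechanics in miniature: GaussianNB fits one Gaussian per class per feature and assigns each point to the likelier class (toy one-feature data, not the census set):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Two well-separated classes on a single feature
X = np.array([[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])
model = GaussianNB().fit(X, y)
# New points fall on either side of the class means
print(model.predict([[-1.2], [1.2]]))  # [0 1]
```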
Decision Tree
For the Decision Tree, I searched the "max_depth" parameter, which limits how many layers deep the tree can grow. The best accuracy score of roughly 81% was achieved with a max depth of approximately 4.
tree_grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid={'max_depth': np.logspace(-3, 3, 6)})
tree_grid_search.fit(X_train, y_train)
GridSearchCV(cv=None, error_score='raise',
estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=None, splitter='best'),
fit_params={}, iid=True, n_jobs=1,
param_grid={'max_depth': array([ 1.00000e-03, 1.58489e-02, 2.51189e-01, 3.98107e+00,
6.30957e+01, 1.00000e+03])},
pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
tree_grid_search.best_params_
{'max_depth': 3.9810717055349691}
tree_grid_search.best_score_
0.80975548113982432
Random Forest
For Random Forest, I searched the n_estimators parameter (the number of trees in the forest), testing values of 10, 100, and 1000. The optimal value from an earlier run was 1000. Note: the code is commented out here because of its long running time.
# forest_grid_search = GridSearchCV(RandomForestClassifier(), param_grid={'n_estimators': [10,100,1000]})
# forest_grid_search.fit(X_train, y_train)
# forest_grid_search.best_params_
# forest_grid_search.best_score_
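A runnable small-scale sketch of the same search, using the current `sklearn.model_selection` path and synthetic data so it finishes quickly (the census run above used larger n_estimators values):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV  # modern path for grid_search

X, y = make_classification(n_samples=300, random_state=0)
# Small forest sizes keep this illustrative run fast
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid={'n_estimators': [10, 50]}, cv=3)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```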
Support Vector Classification (SVC)
For SVC, the parameter "C" represents the penalty of the error term. I searched through three values (0.01, 1, 100) and found C=1 to be optimal with an 82% classification rate.
svc_grid_search = GridSearchCV(SVC(), param_grid={'C': [.01,1,100]})
svc_grid_search.fit(X_train, y_train)
GridSearchCV(cv=None, error_score='raise',
estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False),
fit_params={}, iid=True, n_jobs=1, param_grid={'C': [0.01, 1, 100]},
pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
svc_grid_search.best_params_
{'C': 1}
svc_grid_search.best_score_
0.81888544891640869
Voting Model Classifier
from sklearn.cross_validation import cross_val_score
from sklearn.ensemble import VotingClassifier
I created a variable for each model. Note: I didn't include the Decision Tree model here due to running time constraints, and the C values used for Logistic Regression (1000) and SVC (100) differ from the grid-search optima found above.
clf1 = LogisticRegression(C=1000,random_state=1)
clf2 = RandomForestClassifier(n_estimators=1000,random_state=1)
clf3 = GaussianNB()
clf4 = KNeighborsClassifier(n_neighbors=19)
clf5 = SVC(C=100)
The VotingClassifier combines the individual classification models and chooses a class by one of several voting methods. Its score method reports accuracy on the held-out test set.
Method 1: Hard Voting - Each model votes on whether someone works less than or greater than/equal to 40 hours per week. The majority vote wins. If three out of five models predict "0" (less than 40 hours), the classification results in "0".
eclf1 = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3), ('knn', clf4),('SVC',clf5)], voting='hard')
eclf1 = eclf1.fit(X_train, y_train)
print(eclf1.predict(X_test))
print(eclf1.score(X_test,y_test)) # accuracy on the test set
[1 1 1 ..., 1 0 1]
0.815950468047
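The majority-vote rule can be sketched by hand with hypothetical predictions from five models:

```python
import numpy as np

# Hypothetical 0/1 predictions from five models (rows) on four samples (columns)
votes = np.array([[0, 1, 1, 0],
                  [1, 1, 0, 0],
                  [0, 1, 1, 1],
                  [0, 1, 0, 0],
                  [1, 1, 1, 0]])
# A column mean above 0.5 means a majority of the five models voted "1"
majority = (votes.mean(axis=0) > 0.5).astype(int)
print(majority)  # [0 1 1 0]
```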
Method 2: Soft Voting - Each model outputs a probability for each class; the ensemble averages the probabilities across models and selects the class with the higher average. (SVC is excluded here because it was not fit with probability estimates enabled, which soft voting requires.)
eclf2 = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3), ('knn', clf4)],voting='soft')
eclf2 = eclf2.fit(X_train, y_train)
print(eclf2.predict(X_test))
print(eclf2.score(X_test,y_test))
[1 1 1 ..., 1 0 1]
0.797670818899
Method 3: Weighted Soft Voting - Same as soft voting, but certain models are weighted more heavily. Here, Logistic Regression has twice the weight of the other models (weights=[2,1,1,1]).
eclf3 = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3), ('knn', clf4)],voting='soft', weights=[2,1,1,1])
eclf3 = eclf3.fit(X_train, y_train)
print(eclf3.predict(X_test))
print(eclf3.score(X_test,y_test))
[1 1 1 ..., 1 0 1]
0.802830397288
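The effect of the weights can also be sketched by hand: with hypothetical probabilities, doubling one model's weight can flip the ensemble's decision.

```python
import numpy as np

# Hypothetical class-1 probabilities from four models for a single sample
probs = np.array([0.9, 0.3, 0.35, 0.3])
unweighted = probs.mean()                           # 0.4625 -> class 0
weighted = np.average(probs, weights=[2, 1, 1, 1])  # 0.55   -> class 1
print(unweighted, weighted)
```

Here the first (double-weighted) model is confident in class 1, pulling the weighted average over the 0.5 threshold even though the plain average stays below it.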
Conclusion
This project created a script to query census data from Amazon Web Services and transform it into a workable dataframe. I tested different classification models on training and testing sections of the dataset, using GridSearch to find optimal parameters. Finally, I combined all models using VotingClassifier with several voting methods.
The hard voting classifier (eclf1) produced a strong test accuracy of 81.6%, but the highest score came from the standalone Support Vector Classification model at 81.9% (cross-validated). This demonstrates that while ensemble methods can be powerful, a single well-tuned model can sometimes outperform combined approaches.
Model Performance Summary
- Support Vector Classification (SVC): 81.9% accuracy - Best performer
- Hard Voting Classifier: 81.6% accuracy
- K-Nearest Neighbors (KNN): 81.3% accuracy
- Decision Tree: 81.0% accuracy
- Logistic Regression: 80.4% accuracy
- Weighted Soft Voting: 80.3% accuracy
- Soft Voting Classifier: 79.8% accuracy
- Naive Bayes: 68.8% accuracy