This project analyzes 28,781 US-based companies by combining Crunchbase data with Twitter activity to predict operational status (operating, exits, or closed). Using NLP techniques including TF-IDF, Naive Bayes, Logistic Regression, and LSI, I achieved up to 73.6% accuracy despite dataset imbalance challenges.
Introduction
For this project, I accessed the Crunchbase and Twitter APIs to pair recent tweets with 28,781 companies and predict their operational status using natural language processing. I implemented TF-IDF, Naive Bayes, Logistic Regression, and Latent Semantic Indexing (LSI) models for classification.
# Imports
import os
import json
import random
import pickle
import nltk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
import patsy
from IPython.display import Image
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn import linear_model, metrics, tree
from sklearn.metrics import (accuracy_score, auc, classification_report,
                             confusion_matrix, roc_auc_score, roc_curve)
%matplotlib inline
The Data

Crunchbase Data
- 28,781 companies
- US-based
- Twitter handles (parsed from each company's Twitter URL)
- Classification column (status):
  - Operating: ~21,000 companies
  - Exits: ~6,500 companies
  - Closed: ~1,500 companies

Twitter Data
- Last 10 tweets per company, joined into one long text string
- Collected with the Tweepy library via the Twitter API
Data Processing
# specific imports
import time
import tweepy
import requests
from requests_oauthlib import OAuth1
import cnfg
from os.path import expanduser
import pickle
# Accessing Twitter API
home = expanduser("~")
config = cnfg.load(home + "/.twitter_config")

def auth_twitter():
    auth = tweepy.OAuthHandler(config["consumer_key"], config["consumer_secret"])
    auth.set_access_token(config["access_token"], config["access_token_secret"])
    # wait_on_rate_limit makes tweepy sleep through rate-limit windows
    api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
    return api
# Import Crunchbase Data & Filter Twitter Handle From URL, save new DF to csv
df = pd.read_csv("crunchbase_export_csv.csv")
df = df.dropna(subset=['twitter_url'], how='all')
df["twitter_handle"] = df.twitter_url.map(lambda x: x.split("/")[-1].replace("@", "").strip())
df.to_csv("cruchbase.csv", index=False)
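The split-and-strip logic above can be sanity-checked on a couple of sample URLs. A minimal standalone sketch, where `extract_handle` is a hypothetical helper mirroring the lambda:

```python
def extract_handle(url):
    # Mirror of the mapping above: keep the last URL path segment,
    # drop any leading "@", and trim whitespace
    return url.split("/")[-1].replace("@", "").strip()

urls = ["https://www.twitter.com/OroCRM",
        "https://twitter.com/@atlasobscura"]
print([extract_handle(u) for u in urls])  # ['OroCRM', 'atlasobscura']
```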
# Handle list variable
handle_list = df.twitter_handle.tolist()
# Function to collect and concatenate the last ten tweets for a handle;
# rate-limit waiting is configured on the API object, not the Cursor
def get_tweet_text(api, screen_name, limit=10):
    combined_text = ''
    for tweet in tweepy.Cursor(api.user_timeline,
                               screen_name=screen_name).items(limit):
        combined_text += ' ' + tweet.text
    return combined_text
The code below calls the Twitter API to collect the last ten tweets for each Twitter handle, with error handling built in. Twitter's rate limits made this step a significant bottleneck that took considerable time to execute.
# API Authentication
twitter_api = auth_twitter()

counter = 0
combined_tweets = []
while counter < len(handle_list):
    handle = handle_list[counter]
    try:
        tweet_text = get_tweet_text(twitter_api, handle)
        combined_tweets.append((handle, tweet_text))
    except tweepy.TweepError:
        # Protected, suspended, or deleted accounts raise TweepError;
        # record an empty string so rows stay aligned with the handle list
        combined_tweets.append((handle, ''))
    counter += 1

print(counter)
print(len(combined_tweets))
print(combined_tweets[-1])
# Save as a pickle file (binary mode)
with open('finaltweets6.pkl', 'wb') as picklefile:
    pickle.dump(combined_tweets, picklefile)
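As a sanity check, the save/load pattern round-trips cleanly when both sides use binary mode. A minimal standalone sketch with a temp file:

```python
import os
import pickle
import tempfile

# Minimal round-trip of the pickling pattern above; binary mode
# ('wb'/'rb') is required for pickle files in Python 3
data = [("acme", "some tweet text"), ("globex", "")]
path = os.path.join(tempfile.mkdtemp(), "tweets.pkl")
with open(path, "wb") as f:
    pickle.dump(data, f)
with open(path, "rb") as f:
    restored = pickle.load(f)
print(restored == data)  # True
```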
Load tweet data and crunchbase data:
with open('finaltweets6.pkl', 'rb') as picklefile:
    comptweet1 = pickle.load(picklefile)
# separate into just handle and tweets
tweets = pd.DataFrame(comptweet1, columns=["twitter_handle", "tweets"])
# load crunchbase data with created handle column (done above)
dfcrunch = pd.read_csv('cruchbase.csv')
dfcrunch.head()
|   | company_name | domain | country_code | state_code | region | city | status | short_description | category_list | funding_rounds | ... | first_funding_on | last_funding_on | closed_on | email | phone | cb_url | twitter_url | facebook_url | uuid | twitter_handle |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Oro Inc. | orocrm.com | USA | CA | Los Angeles | Los Angeles | operating | Open Source Tools for Businesses (CRM, B2B eCo... | B2B\|CRM\|E-Commerce\|Enterprise Software\|Open So... | 1 | ... | 2016-03-03 | 2016-03-03 | NaN | info@orocrm.com | NaN | https://www.crunchbase.com/organization/oro | https://www.twitter.com/OroCRM | http://www.facebook.com/pages/Oro-CRM/42898686... | b667f47f-c810-7714-b6aa-c34ab3c8d1c2 | OroCRM |
| 1 | Atlas Obscura | atlasobscura.com | USA | NY | New York City | Brooklyn | operating | Atlas Obscura is a travel guide with articles,... | Leisure\|Travel & Tourism | 2 | ... | 2015-02-24 | 2016-03-02 | NaN | info@atlasobscura.com | NaN | https://www.crunchbase.com/organization/atlas-... | https://www.twitter.com/atlasobscura | http://www.facebook.com/atlasobscura | 24af462f-7100-9130-db5c-0a38e960efc1 | atlasobscura |
| 2 | Blippar | blippar.com | USA | NY | New York City | New York | operating | Blippar is the world's leading image-recogniti... | Advertising\|Augmented Reality\|Computer Vision\|... | 3 | ... | 2012-01-03 | 2016-03-02 | NaN | info@blippar.com | NaN | https://www.crunchbase.com/organization/blippar | https://www.twitter.com/blippar | http://www.facebook.com/blippar | 06f6c176-35cf-acdf-171e-240d23bc5fac | blippar |
| 3 | Calyxt | calyxt.com | USA | MN | MN - Other | Minnesota City | closed | Calyxt is an Agbiotech company focused on deve... | NaN | 1 | ... | 2016-03-02 | 2016-03-02 | NaN | NaN | NaN | https://www.crunchbase.com/organization/calyxt | https://www.twitter.com/calyxt_inc | NaN | 0a4d4b3d-ac4d-566f-923f-5ca1995296ee | calyxt_inc |
| 4 | Coop Fuels | coopfuels.com | USA | DC | Washington, D.C. | Washington | operating | Retailer of E85 & Renewable Diesel | Clean Technology\|Fuels\|Internet\|Renewable Ener... | 1 | ... | 2016-03-02 | 2016-03-02 | NaN | NaN | NaN | https://www.crunchbase.com/organization/coop-f... | https://www.facebook.com/coopfuels | https://www.twitter.com/coopfuels | 93366aa5-6417-4599-260d-897eaf9d65c5 | coopfuels |

5 rows × 22 columns
Merge Crunchbase and Tweet Dataframes:
# Inner merge on the index: assumes the tweet rows preserve dfcrunch's row order
company_df = pd.merge(tweets, dfcrunch, how="inner", left_index=True, right_index=True)
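A toy version of this index-based merge (with small hypothetical frames) shows the alignment assumption: rows are matched purely by position, so both frames must keep the same company order:

```python
import pandas as pd

# Hypothetical two-row frames standing in for the tweet and Crunchbase data
left = pd.DataFrame({"twitter_handle": ["orocrm", "blippar"],
                     "tweets": ["tweet text a", "tweet text b"]})
right = pd.DataFrame({"company_name": ["Oro Inc.", "Blippar"],
                      "status": ["operating", "operating"]})
# Rows are paired by index (0 with 0, 1 with 1), not by handle
merged = pd.merge(left, right, how="inner", left_index=True, right_index=True)
print(merged.shape)  # (2, 4)
```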
Create target column labeling IPOs and Acquired companies as "Exits":
company_df["target"] = company_df.status.map({"operating": "operating",
"closed": "closed",
"ipo": "exits",
"acquired": "exits"})
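One property of `Series.map` worth noting: any status value missing from the dict silently becomes NaN rather than raising, so it pays to check for unmapped rows. A toy illustration (the `"was_acquired"` value is hypothetical):

```python
import pandas as pd

status = pd.Series(["operating", "ipo", "acquired", "closed", "was_acquired"])
target = status.map({"operating": "operating", "closed": "closed",
                     "ipo": "exits", "acquired": "exits"})
# "was_acquired" is not a key in the dict, so it maps to NaN
print(target.isna().sum())  # 1
```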
Analyze the Target Split - The distribution is uneven, which presents a modeling challenge:
company_df.target.value_counts()
operating    20955
exits         6377
closed        1449
Name: target, dtype: int64
Create the predictor "X" and response "y". Because this project relies entirely on natural language processing, the single predictor column is "tweets".
X = company_df.tweets
y = company_df.target
Natural Language Processing Techniques
Baseline: TF-IDF with Naive Bayes
First I split the raw data into training and testing portions:
X_train, X_test, y_train, y_test = train_test_split(company_df.tweets, y, test_size=0.2)
print(X_test.shape, X_train.shape, y_test.shape, y_train.shape)
(5757,) (23024,) (5757,) (23024,)
Then I transform X_train with a TF-IDF vectorizer. "TF" (term frequency) measures how often a word appears in a document, while "IDF" (inverse document frequency) down-weights words that are common across the whole corpus.
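A two-document toy corpus makes the weighting concrete: a word shared by both documents ("startup") ends up with a lower weight than a word unique to one document ("raises"):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["startup raises funding", "startup shuts down"]
vec = TfidfVectorizer()
weights = vec.fit_transform(docs).toarray()
vocab = vec.vocabulary_
# "startup" appears in both docs (low IDF); "raises" in only one (high IDF)
print(weights[0][vocab["raises"]] > weights[0][vocab["startup"]])  # True
```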
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
# Set up the vectorizer: English stop words, unigrams and bigrams, and a
# token pattern that keeps only alphabetic tokens of at least two letters
vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2), analyzer='word',
                             token_pattern=r'\b[a-z][a-z]+\b')
# Transform the X_train data
X_train_vect = vectorizer.fit_transform(X_train)
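The token pattern can be checked in isolation with `re`: it keeps only lowercase alphabetic tokens of at least two letters, dropping numbers, symbols, and single characters (the vectorizer lowercases text before tokenizing, so lowercase-only is safe):

```python
import re

pattern = re.compile(r"\b[a-z][a-z]+\b")
# "$5m" and the single letter "a" are dropped by the pattern
print(pattern.findall("we raised $5m in a seed round"))
# ['we', 'raised', 'in', 'seed', 'round']
```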
Fit the vectorized training data predictors (X_train_vect) on the training data response (y_train):
nb_clf = MultinomialNB().fit(X_train_vect, y_train)
# Transform the X_test data
X_test_vect = vectorizer.transform(X_test)
The transformation produced a much larger sparse matrix with roughly 1.4 million unigram and bigram columns:
print(X_test_vect.shape)
(5757, 1408956)
Create a variable for vectorized test predictions:
y_pred_nb = nb_clf.predict(X_test_vect)
The model's initial performance was disappointing. It predicted every company as "operating" (third column in confusion matrix below):
# nb_clf.classes_ ==> array(['closed', 'exits', 'operating'], dtype='|S9')
confusion_matrix(y_test, y_pred_nb)
array([[ 0, 0, 288],
[ 0, 0, 1296],
[ 0, 0, 4173]])
By predicting every company as "operating", the model achieved an accuracy rate of 72%:
nb_clf.score(X_test_vect, y_test)
0.7248566961959354
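That 72% is exactly the majority-class rate: with 4,173 of 5,757 test companies labeled "operating", always predicting "operating" scores the same. A quick check from the confusion matrix above:

```python
# Majority-class baseline: support of "operating" over total test examples
print(4173 / 5757)  # ≈ 0.7249, matching the score above
```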
Classification Report:
print(metrics.accuracy_score(y_test, y_pred_nb))
print(metrics.classification_report(y_test, y_pred_nb))
0.724856696196
precision recall f1-score support
closed 0.00 0.00 0.00 288
exits 0.00 0.00 0.00 1296
operating 0.72 1.00 0.84 4173
avg / total 0.53 0.72 0.61 5757
Advanced Approach: gensim
Imports:
# gensim
from gensim import corpora, models, similarities, matutils
# sklearn
from sklearn import datasets
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
# logging for gensim (set to INFO)
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# Create a CountVectorizer for parsing/counting words
count_vectorizer = CountVectorizer(analyzer='word',
ngram_range=(1,2), stop_words='english',
token_pattern='\\b[a-z][a-z]+\\b')
Create the term-document matrix. Transpose it so the terms are the rows:
ng_vecs_x = count_vectorizer.fit_transform(company_df.tweets)
# Transpose so terms are rows and documents are columns (gensim's convention);
# fitting once keeps the vocabulary identical for both matrices
ng_vecs = ng_vecs_x.transpose()
ng_vecs.shape
(1705443, 28781)
Convert sparse matrix of counts to a gensim corpus:
corpus = matutils.Sparse2Corpus(ng_vecs)
Map matrix rows to words (tokens):
# We need a mapping (dict) of row id to word (token) for later use by gensim:
id2word = dict((v, k) for k, v in count_vectorizer.vocabulary_.items())
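The inversion is just dict reversal: CountVectorizer stores word → column index, while gensim wants column index → word. A toy (hypothetical) three-word vocabulary:

```python
# Hypothetical vocabulary as CountVectorizer would store it
vocabulary_ = {"startup": 0, "funding": 1, "acquired": 2}
# Invert to the id -> token mapping gensim expects
id2word = {v: k for k, v in vocabulary_.items()}
print(id2word)  # {0: 'startup', 1: 'funding', 2: 'acquired'}
```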
Logistic Regression
Train/test split using ng_vecs_x (the document-by-term count matrix). Here we are splitting data that is already vectorized:
X_train, X_test, y_train, y_test = train_test_split(ng_vecs_x, company_df.target)
Model with Logistic Regression and predict on the testing data:
from sklearn.linear_model import LogisticRegression
logreg_clf = LogisticRegression().fit(X_train, y_train)
Unlike TF-IDF with Naive Bayes, this model made predictions in every target category and reached a slightly higher accuracy of 73.6%:
y_pred_reg = logreg_clf.predict(X_test)
confusion_matrix(y_test, y_pred_reg)
array([[ 12, 36, 324],
[ 15, 328, 1255],
[ 21, 246, 4959]])
print(metrics.accuracy_score(y_test, y_pred_reg))
print(metrics.classification_report(y_test, y_pred_reg))
0.736381322957
precision recall f1-score support
closed 0.25 0.03 0.06 372
exits 0.54 0.21 0.30 1598
operating 0.76 0.95 0.84 5226
avg / total 0.68 0.74 0.68 7196
Naive Bayes
nb_clf2 = MultinomialNB().fit(X_train, y_train)
y_pred_reg = nb_clf2.predict(X_test)
confusion_matrix(y_test, y_pred_reg)
array([[ 1, 60, 311],
[ 3, 284, 1311],
[ 8, 338, 4880]])
print(metrics.accuracy_score(y_test, y_pred_reg))
print(metrics.classification_report(y_test, y_pred_reg))
0.717759866593
precision recall f1-score support
closed 0.08 0.00 0.01 372
exits 0.42 0.18 0.25 1598
operating 0.75 0.93 0.83 5226
avg / total 0.64 0.72 0.66 7196
K-Nearest Neighbors (KNN)
# Fit KNN classifier to training set
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=3, p=2,
weights='uniform')
y_pred_knn = knn.predict(X_test)
# Score against test set
knn.score(X_test, y_test)
0.61381322957198448
confusion_matrix(y_test, y_pred_knn)
array([[ 21, 41, 310],
[ 31, 280, 1287],
[ 63, 1047, 4116]])
print(metrics.accuracy_score(y_test, y_pred_knn))
print(metrics.classification_report(y_test, y_pred_knn))
0.613813229572
precision recall f1-score support
closed 0.18 0.06 0.09 372
exits 0.20 0.18 0.19 1598
operating 0.72 0.79 0.75 5226
avg / total 0.58 0.61 0.59 7196
LSI (Latent Semantic Indexing)
LSI requires one more preprocessing step: we need to compute TF-IDF weights from our term-document count matrix:
# Create a TFIDF transformer from our word counts (equivalent to "fit" in sklearn)
tfidf = models.TfidfModel(corpus)
Create a TF-IDF vector for all documents from the original corpus:
# Create a TFIDF vector for all documents from the original corpus ("transform" in sklearn)
tfidf_corpus = tfidf[corpus]
SVD
Now that we've taken care of the TF-IDF bit, let's crank through the SVD and build the LSI space:
# Build an LSI space from the input TFIDF matrix, mapping of row id to word, and num_topics
# num_topics is the number of dimensions to reduce to after the SVD
# Analogous to "fit" in sklearn, it primes an LSI space
lsi = models.LsiModel(tfidf_corpus, id2word=id2word, num_topics=300)
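Under the hood, LsiModel performs a truncated SVD of the TF-IDF term-document matrix. A numpy sketch on a tiny hypothetical 4-term × 3-document matrix shows the reduction (illustration only, not gensim's incremental algorithm):

```python
import numpy as np

# Tiny stand-in for the TF-IDF term-document matrix (terms x documents)
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # analogous to num_topics
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # one k-dim vector per document
print(doc_vecs.shape)  # (3, 2)
```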
Now that we have a trained LSI space, we want to do the transform step to figure out where all of the original documents lie in that num_topics=300 dimensional space:
# Retrieve vectors for the original tfidf corpus in the LSI space ("transform" in sklearn)
lsi_corpus = lsi[tfidf_corpus]
lsi_vecs = lsi_corpus  # alias used below; no need to run the transform twice
# Dump the resulting document vectors into a list so we can take a look
doc_vecs = [doc for doc in lsi_corpus]
Machine Learning with LSI Vectors
Each document is now a dense 300-dimensional vector, so we can apply any standard machine learning model to our documents.
# Convert the gensim-style corpus vecs to a numpy array for sklearn manipulations
X = matutils.corpus2dense(lsi_vecs, num_terms=300).transpose()
X.shape
(28781, 300)
# splitting vectorized data
X_train, X_test, y_train, y_test = train_test_split(X, company_df.target)
print(X_test.shape, X_train.shape, y_test.shape, y_train.shape)
(7196, 300) (21585, 300) (7196,) (21585,)
# Model with Logistic Reg
from sklearn.linear_model import LogisticRegression
nb_reg = LogisticRegression().fit(X_train, y_train)
y_pred_reg = nb_reg.predict(X_test)
confusion_matrix(y_test, y_pred_reg)
array([[ 3, 4, 355],
[ 2, 34, 1556],
[ 0, 15, 5227]])
print(metrics.accuracy_score(y_test, y_pred_reg))
print(metrics.classification_report(y_test, y_pred_reg))
0.731517509728
precision recall f1-score support
closed 0.60 0.01 0.02 362
exits 0.64 0.02 0.04 1592
operating 0.73 1.00 0.84 5242
avg / total 0.71 0.73 0.63 7196
Conclusion
In this project, I successfully integrated data from Crunchbase and Twitter APIs to analyze 28,781 companies and predict their operational status. Through implementing multiple NLP techniques including TF-IDF, Naive Bayes, Logistic Regression, and LSI, I achieved varying levels of accuracy:
- Baseline TF-IDF with Naive Bayes: 72.5% accuracy
- Logistic Regression with Count Vectorizer: 73.6% accuracy
- LSI with Logistic Regression: 73.2% accuracy
The primary challenge was the imbalanced dataset, with 73% of companies in the "operating" category. Future improvements could include implementing SMOTE for class balancing, feature engineering with additional company metadata, and exploring deep learning approaches with word embeddings.
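As a first pass at the rebalancing idea, scikit-learn's built-in `class_weight='balanced'` option reweights the loss inversely to class frequency without any extra dependency (SMOTE itself lives in the separate imbalanced-learn package). A hedged sketch on synthetic data with a similar 73/22/5 class split, not run on the actual tweet features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the tweet features, with the same rough class skew
X, y = make_classification(n_samples=3000, n_classes=3, n_informative=5,
                           weights=[0.73, 0.22, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
# Accuracy on held-out synthetic data
print(clf.score(X_te, y_te))
```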