
Predicting Company Status with NLP and Twitter Data Analysis

This project analyzes 28,781 US-based companies by combining Crunchbase data with Twitter activity to predict operational status (operating, exits, or closed). Using NLP techniques including TF-IDF, Naive Bayes, Logistic Regression, and LSI, I achieved up to 73.6% accuracy despite dataset imbalance challenges.

Table of Contents

  1. Introduction
  2. The Data
  3. Data Processing
  4. Natural Language Processing Techniques
  5. Conclusion

Introduction

For this project, I accessed the Crunchbase and Twitter APIs to merge recent tweets from 28,781 companies and predict their operational status using natural language processing techniques. I implemented TF-IDF, Naive Bayes, Logistic Regression, and Latent Semantic Indexing (LSI) models for classification.

# Imports
from __future__ import division  # Python 2: make / behave as true division; must come early

import os
import json
import pickle
import random
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import patsy
from IPython.display import Image

from sklearn import linear_model, metrics, tree
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older versions
from sklearn.linear_model import LinearRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (accuracy_score, auc, classification_report,
                             confusion_matrix, roc_auc_score, roc_curve)

%matplotlib inline

The Data

Crunchbase and Twitter Company Logos

Crunchbase Data

  • 28,781 companies
    • US based
    • Twitter handles available
  • Classification column (status)
    • Operating: ~21,000 companies
    • Exits: ~6,500 companies
    • Closed: ~1,500 companies

Twitter Data

  • Joined each company's last 10 tweets
    • into one long text string per company
    • collected with the Tweepy client for the Twitter API

Data Processing

# specific imports
import time
import tweepy 
import requests
from requests_oauthlib import OAuth1
import cnfg
from os.path import expanduser
import pickle
# Accessing Twitter API
home = expanduser("~")
config = cnfg.load(home + "/.twitter_config") 
def auth_twitter():
    auth = tweepy.OAuthHandler(config["consumer_key"], config["consumer_secret"])
    auth.set_access_token(config["access_token"], config["access_token_secret"])
    # rate-limit handling is configured on the API object itself
    api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
    return api
# Import Crunchbase Data & Filter Twitter Handle From URL, save new DF to csv
df = pd.read_csv("crunchbase_export_csv.csv")
df = df.dropna(subset=['twitter_url'], how='all')    
df["twitter_handle"] = df.twitter_url.map(lambda x: x.split("/")[-1].replace("@", "").strip())
df.to_csv("cruchbase.csv", index=False)
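As a sanity check, the extraction logic above can be wrapped in a small helper (hypothetical, not part of the original notebook; it adds an rstrip("/") guard because a trailing slash would make the original lambda return an empty handle):

```python
# Hypothetical helper mirroring the twitter_url -> handle extraction used above
def extract_handle(url):
    """Take the last path segment of a Twitter URL and strip any leading '@'."""
    return url.rstrip("/").split("/")[-1].replace("@", "").strip()

print(extract_handle("https://www.twitter.com/OroCRM"))      # OroCRM
print(extract_handle("https://twitter.com/@atlasobscura/"))  # atlasobscura
```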

# Handle List Variable
handle_list = [row for row in df.twitter_handle]
# Function to collect and join the last ten tweets for a handle
def get_tweet_text(api, screen_name, limit=10):
    combined_text = ''
    # wait_on_rate_limit is an API-constructor option, not a Cursor argument
    for tweet in tweepy.Cursor(api.user_timeline,
                               screen_name=screen_name).items(limit):
        combined_text += '  ' + tweet.text
    return combined_text

The code below calls the Twitter API to collect the last ten tweets for each Twitter handle, with error handling built in. Against Twitter's rate limits, collecting tweets for all 28,781 handles was by far the biggest bottleneck in the project.

# API Authentication     
twitter_api = auth_twitter() 
counter = 0
combined_tweets = []

while counter < len(handle_list):
    handle = handle_list[counter]
    try:
        tweet_text = get_tweet_text(twitter_api, handle)
    except tweepy.TweepError:
        tweet_text = ''   # protected, deleted, or otherwise unavailable account
    combined_tweets.append((handle, tweet_text))
    counter += 1
        
print counter        
print len(combined_tweets)
print combined_tweets[28780]

# save as pickle file (binary mode)
with open('finaltweets6.pkl', 'wb') as picklefile:
    pickle.dump(combined_tweets, picklefile)

Load tweet data and crunchbase data:

with open('finaltweets6.pkl', 'rb') as picklefile:
    comptweet1 = pickle.load(picklefile)
# separate into just handle and tweets
tweets = pd.DataFrame(comptweet1, columns=["twitter_handle", "tweets"])
# load crunchbase data with created handle column (done above)
dfcrunch = pd.read_csv('cruchbase.csv')
dfcrunch.head()
  company_name    domain            state_code  city            status     twitter_handle
0 Oro Inc.        orocrm.com        CA          Los Angeles     operating  OroCRM
1 Atlas Obscura   atlasobscura.com  NY          Brooklyn        operating  atlasobscura
2 Blippar         blippar.com       NY          New York        operating  blippar
3 Calyxt          calyxt.com        MN          Minnesota City  closed     calyxt_inc
4 Coop Fuels      coopfuels.com     DC          Washington      operating  coopfuels

5 rows × 22 columns (other columns include country_code, region, short_description,
category_list, funding_rounds, funding dates, contact details, the crunchbase/twitter/facebook
URLs, and uuid)

Merge Crunchbase and Tweet Dataframes:

company_df = pd.merge(tweets, dfcrunch, how="inner", left_index=True, right_index=True)

Create target column labeling IPOs and Acquired companies as "Exits":

company_df["target"] = company_df.status.map({"operating": "operating", 
                       "closed": "closed", 
                       "ipo": "exits", 
                       "acquired": "exits"})

Analyze the Target Split - The distribution is uneven, which presents a modeling challenge:

company_df.target.value_counts()
operating    20955
exits         6377
closed        1449
Name: target, dtype: int64
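These counts set the bar for any classifier: always predicting "operating" is already right about 73% of the time, so a useful model must beat that majority-class baseline.

```python
# Majority-class baseline implied by the class counts above
counts = {"operating": 20955, "exits": 6377, "closed": 1449}
total = sum(counts.values())            # 28781 companies in all
baseline = counts["operating"] / total  # fraction a majority-class guesser gets right
print(round(baseline, 3))               # 0.728
```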

Create the predictor "X" and response "y" variables. Because this project relies solely on natural language processing, the only predictor column is "tweets".

X = company_df.tweets
y = company_df.target

Natural Language Processing Techniques

Baseline: TF-IDF with Naive Bayes

First I split the raw data into training and testing portions:

X_train, X_test, y_train, y_test = train_test_split(company_df.tweets, y, test_size=0.2)
print X_test.shape, X_train.shape, y_test.shape, y_train.shape
(5757,) (23024,) (5757,) (23024,)

Then I transform X_train with scikit-learn's TfidfVectorizer. "TF" (term frequency) counts how often a word appears in a document, while "IDF" (inverse document frequency) down-weights words that appear in many documents across the corpus, so distinctive terms carry more weight than boilerplate.
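To make the weighting concrete, here is a toy example on an invented three-document corpus (using the current scikit-learn API; the original notebook ran on an older version): a word shared by several documents receives a lower IDF, and hence a lower TF-IDF weight, than a word unique to one document.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny invented corpus: "startup" appears in two documents,
# "funding" and "closes" in only one each.
docs = ["startup raises funding",
        "startup launches product",
        "company closes office"]
vec = TfidfVectorizer()
vec.fit(docs)
# map each word to its learned IDF weight
idf = {word: vec.idf_[col] for word, col in vec.vocabulary_.items()}
# idf["startup"] < idf["funding"]: the shared word is down-weighted
```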

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Setup Vectorizer with parameters
vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1,2), analyzer='word',
                                  token_pattern='\\b[a-z][a-z]+\\b')
# Transform the X_train data
X_train_vect = vectorizer.fit_transform(X_train)

Fit the vectorized training data predictors (X_train_vect) on the training data response (y_train):

nb_clf = MultinomialNB().fit(X_train_vect, y_train)
# Transform the X_test data
X_test_vect = vectorizer.transform(X_test)

The transformation produced a much larger sparse matrix with roughly 1.4 million columns, one per unigram or bigram feature:

print X_test_vect.shape
(5757, 1408956)

Create a variable for vectorized test predictions:

y_pred_nb = nb_clf.predict(X_test_vect)

The model's initial performance was disappointing: it predicted "operating" for every company (the third column in the confusion matrix below):

# nb_clf.classes_ ==> array(['closed', 'exits', 'operating'], dtype='|S9')
confusion_matrix(y_test, y_pred_nb)
array([[   0,    0,  288],
       [   0,    0, 1296],
       [   0,    0, 4173]])

By predicting every company as "operating", the model achieved 72.5% accuracy, no better than the majority-class baseline:

nb_clf.score(X_test_vect, y_test)
0.7248566961959354

Classification Report:

print metrics.accuracy_score(y_test, y_pred_nb)
print metrics.classification_report(y_test, y_pred_nb)
0.724856696196
             precision    recall  f1-score   support

     closed       0.00      0.00      0.00       288
      exits       0.00      0.00      0.00      1296
  operating       0.72      1.00      0.84      4173

avg / total       0.53      0.72      0.61      5757
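That roughly 72% accuracy is exactly the majority-class baseline. A quick way to confirm such a floor (a check not in the original notebook) is scikit-learn's DummyClassifier, shown here on toy data with a similar imbalance:

```python
from sklearn.dummy import DummyClassifier

# Toy data mimicking the imbalance: 7 "operating", 2 "exits", 1 "closed"
X_toy = [[0]] * 10
y_toy = ["operating"] * 7 + ["exits"] * 2 + ["closed"]
# always predicts the most frequent class, ignoring the features entirely
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
print(baseline.score(X_toy, y_toy))  # 0.7 -- any real model must beat this
```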

Advanced Approach: Gensim

Imports:

# gensim
from gensim import corpora, models, similarities, matutils
# sklearn
from sklearn import datasets
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
# logging for gensim (set to INFO)
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# Create a CountVectorizer for parsing/counting words
count_vectorizer = CountVectorizer(analyzer='word',
                                  ngram_range=(1,2), stop_words='english',
                                  token_pattern='\\b[a-z][a-z]+\\b')

Create the term-document matrix, then transpose it so the terms are the rows (the orientation gensim expects); there is no need to fit the vectorizer twice:

ng_vecs_x = count_vectorizer.fit_transform(company_df.tweets)  # documents x terms, for sklearn
ng_vecs = ng_vecs_x.transpose()                                # terms x documents, for gensim
ng_vecs.shape
(1705443, 28781)

Convert sparse matrix of counts to a gensim corpus:

corpus = matutils.Sparse2Corpus(ng_vecs)

Map matrix rows to words (tokens):

# We need to save a mapping (dict) of row id to word (token) for later use by gensim:
id2word = dict((v, k) for k, v in count_vectorizer.vocabulary_.iteritems())
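A small self-contained illustration of this inversion (written for Python 3, where dict.items replaces the iteritems used above):

```python
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
cv.fit(["the cat sat", "the dog ran"])
# vocabulary_ maps word -> column id; gensim wants the reverse mapping
id2word = {col: word for word, col in cv.vocabulary_.items()}
# e.g. id2word[cv.vocabulary_["cat"]] == "cat"
```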

Logistic Regression

Train/test/split using ng_vecs (Count Vect). Here, we are splitting vectorized data:

X_train, X_test, y_train, y_test = train_test_split(ng_vecs_x, company_df.target)

Model with Logistic Reg and predict on Testing data:

from sklearn.linear_model import LogisticRegression
nb_nb = LogisticRegression().fit(X_train, y_train)

Unlike the TF-IDF Naive Bayes baseline, logistic regression made predictions in every target category and achieved a slightly higher accuracy of 73.6%:

y_pred_reg = nb_nb.predict(X_test)
confusion_matrix(y_test, y_pred_reg)
array([[  12,   36,  324],
       [  15,  328, 1255],
       [  21,  246, 4959]])
print metrics.accuracy_score(y_test, y_pred_reg)
print metrics.classification_report(y_test, y_pred_reg)
0.736381322957
             precision    recall  f1-score   support

     closed       0.25      0.03      0.06       372
      exits       0.54      0.21      0.30      1598
  operating       0.76      0.95      0.84      5226

avg / total       0.68      0.74      0.68      7196
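Given the imbalance, one inexpensive mitigation (not used in the original run) is logistic regression's class_weight="balanced" option, which reweights each class inversely to its frequency. A sketch on invented toy data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
# Imbalanced toy data: 90 majority samples centered at 0, 10 minority at 1.5
X = np.vstack([rng.normal(0.0, 1.0, (90, 2)), rng.normal(1.5, 1.0, (10, 2))])
y = np.array([0] * 90 + [1] * 10)

plain = LogisticRegression().fit(X, y)
balanced = LogisticRegression(class_weight="balanced").fit(X, y)
# The balanced model flags at least as many minority samples as the plain one,
# usually at some cost in overall accuracy.
```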

Naive Bayes

nb_nb = MultinomialNB().fit(X_train, y_train)
y_pred_reg = nb_nb.predict(X_test)
confusion_matrix(y_test, y_pred_reg)
array([[   1,   60,  311],
       [   3,  284, 1311],
       [   8,  338, 4880]])
print metrics.accuracy_score(y_test, y_pred_reg)
print metrics.classification_report(y_test, y_pred_reg)
0.717759866593
             precision    recall  f1-score   support

     closed       0.08      0.00      0.01       372
      exits       0.42      0.18      0.25      1598
  operating       0.75      0.93      0.83      5226

avg / total       0.64      0.72      0.66      7196

KNN with Count Vectorizer

# Fit KNN classifier to training set 
knn = KNeighborsClassifier(n_neighbors=3)  
knn.fit(X_train, y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')
y_pred_knn = knn.predict(X_test)
# Score against test set
knn.score(X_test, y_test)
0.61381322957198448
confusion_matrix(y_test, y_pred_knn)
array([[  21,   41,  310],
       [  31,  280, 1287],
       [  63, 1047, 4116]])
print metrics.accuracy_score(y_test, y_pred_knn)
print metrics.classification_report(y_test, y_pred_knn)
0.613813229572
             precision    recall  f1-score   support

     closed       0.18      0.06      0.09       372
      exits       0.20      0.18      0.19      1598
  operating       0.72      0.79      0.75      5226

avg / total       0.58      0.61      0.59      7196

LSI (Latent Semantic Indexing)

LSI requires one more preprocessing step: computing TF-IDF weights from the word-count term-document matrix:

# Create a TFIDF transformer from our word counts (equivalent to "fit" in sklearn)
tfidf = models.TfidfModel(corpus)

Create a TF-IDF vector for all documents from the original corpus:

# Create a TFIDF vector for all documents from the original corpus ("transform" in sklearn)
tfidf_corpus = tfidf[corpus]

SVD

Now that we've taken care of the TF-IDF bit, let's crank through the SVD and build the LSI space:

# Build an LSI space from the input TFIDF matrix, mapping of row id to word, and num_topics
# num_topics is the number of dimensions to reduce to after the SVD
# Analogous to "fit" in sklearn, it primes an LSI space
lsi = models.LsiModel(tfidf_corpus, id2word=id2word, num_topics=300)

Now that we have a trained LSI space, we want to do the transform step to figure out where all of the original documents lie in that num_topics=300 dimensional space:

# Retrieve vectors for the original tfidf corpus in the LSI space ("transform" in sklearn)
lsi_vecs = lsi[tfidf_corpus]
# Dump the resulting document vectors into a list so we can take a look
doc_vecs = [doc for doc in lsi_vecs]
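For readers on scikit-learn, the TF-IDF → SVD pipeline that gensim's LsiModel performs can be sketched with TruncatedSVD (an equivalent formulation on an invented toy corpus, not the code used above; 2 topics stand in for the 300 used here):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

docs = ["startup raises seed funding",
        "startup launches new product",
        "company closes regional office",
        "company acquired by larger rival"]
# TF-IDF followed by truncated SVD is exactly LSI/LSA
lsa = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2, random_state=0))
doc_vecs = lsa.fit_transform(docs)
# doc_vecs.shape == (4, 2): one dense topic-space vector per document
```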

Machine Learning with LSI Vectors

We now have dense 300-dimensional vectors for our documents, so we can apply any standard machine learning algorithm to them:

# Convert the gensim-style corpus vecs to a numpy array for sklearn manipulations
X = matutils.corpus2dense(lsi_vecs, num_terms=300).transpose()
X.shape
(28781, 300)
# splitting vectorized data
X_train, X_test, y_train, y_test = train_test_split(X, company_df.target)
print X_test.shape, X_train.shape, y_test.shape, y_train.shape
(7196, 300) (21585, 300) (7196,) (21585,)
# Model with Logistic Reg
from sklearn.linear_model import LogisticRegression
nb_reg = LogisticRegression().fit(X_train, y_train)
y_pred_reg = nb_reg.predict(X_test)
confusion_matrix(y_test, y_pred_reg)
array([[   3,    4,  355],
       [   2,   34, 1556],
       [   0,   15, 5227]])
print metrics.accuracy_score(y_test, y_pred_reg)
print metrics.classification_report(y_test, y_pred_reg)
0.731517509728
             precision    recall  f1-score   support

     closed       0.60      0.01      0.02       362
      exits       0.64      0.02      0.04      1592
  operating       0.73      1.00      0.84      5242

avg / total       0.71      0.73      0.63      7196

Conclusion

In this project, I successfully integrated data from Crunchbase and Twitter APIs to analyze 28,781 companies and predict their operational status. Through implementing multiple NLP techniques including TF-IDF, Naive Bayes, Logistic Regression, and LSI, I achieved varying levels of accuracy:

  • Baseline TF-IDF with Naive Bayes: 72.5% accuracy
  • Logistic Regression with Count Vectorizer: 73.6% accuracy
  • LSI with Logistic Regression: 73.2% accuracy

The primary challenge was the imbalanced dataset: roughly 73% of companies fall in the "operating" category, so raw accuracy barely beats the majority-class baseline. Future improvements could include SMOTE or class weighting to address the imbalance, feature engineering with additional company metadata, and deep learning approaches built on word embeddings.