Can machine learning predict which terrorist organization carried out an attack based solely on the attack's characteristics? During my time at StartupML, I tackled this question using the Global Terrorism Database—a comprehensive dataset of over 150,000 terrorist incidents spanning decades.
The challenge was fascinating: thousands of attacks in the database remain unattributed. By analyzing patterns in attack types, weapons used, locations, targets, and other features, I built predictive models that achieved over 94% accuracy in identifying responsible groups. This post walks through the complete process—from data cleaning to model evaluation—and the insights gained along the way.
Imports
%matplotlib inline
import os
import json
import pickle
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
pd.set_option('display.max_info_columns', 200)
Part 1: Problem Statement
The Global Terrorism Database (GTD) is an open-source database including information on terrorist events around the world from 1970 through 2014. Some portion of the attacks have not been attributed to a particular terrorist group.
The goal: Use attack type, weapons used, description of the attack, and other features to build a model that can predict what group may have been responsible for an incident.
Part 2: Data Extraction & Loading
The globalterrorismdb_0615dist.xlsx file can be downloaded from the Global Terrorism Database. After conversion to CSV format, I loaded and filtered the data.
terrorism = pd.read_csv('terrorist_data.csv')
Filtering by Year
I filtered the dataset to include only information starting from 1998 because many of the GTD Codebook features only apply to terrorism incidents occurring after 1997.
terrorism_98 = terrorism[terrorism.iyear >= 1998]
Handling Unknown Values
Several features encode unknown values as -9 (binary flags) or -99 (counts); these were replaced with np.nan:
def fill_nines(df):
    missing_value_as_neg_nines = ["ishostkid", "property", "INT_LOG", "INT_IDEO", "INT_MISC", "INT_ANY"]
    for col in missing_value_as_neg_nines:
        df.loc[df[col] == -9, col] = np.nan
    missing_value_as_neg_ninetynines = ["nhostkid", "nhostkidus", "nhours", "nperpcap", "nperps"]
    for col in missing_value_as_neg_ninetynines:
        df.loc[df[col] == -99, col] = np.nan
    return df
terrorism_98 = fill_nines(terrorism_98)
Filtering Target Groups
The gname column contains the name of the terrorist group that carried out the attack. I created two datasets:
- Out-of-sample data where gname is "Unknown" (used later for predictions)
- Training data limited to groups appearing more than 50 times
# create out of sample dataframe where gname is unknown or blank
terrorism_98_nognames = terrorism_98[terrorism_98.gname == "Unknown"]
# Change existing terrorism_98 dataframe to only include rows that include a gname
terrorism_98 = terrorism_98[terrorism_98.gname != "Unknown"]
# Filter for gname responses that at least occur 50 times
gname_counts = terrorism_98.gname.value_counts()
gnames_over_threshold = gname_counts[gname_counts > 50].index
terrorism_98 = terrorism_98[terrorism_98.gname.isin(gnames_over_threshold)]
print("number of unique gnames:", terrorism_98.gname.unique().size)
print("number of unique gnames in out-of-sample data:", terrorism_98_nognames.gname.unique().size)
Feature Selection
I created a function to filter features by the quantity of null values, keeping only features with less than 20% missing data:
def get_feature_subset_nulls(df, response_col, threshold=0.2):
    """
    Returns features whose share of missing values is below the threshold (default 20%).
    """
    # Applying per column: defined by axis=0
    feature_null_counts = df.apply(lambda col: col.isnull().sum(), axis=0)
    num_rows = len(df)
    features = feature_null_counts[feature_null_counts < (threshold * num_rows)].index.difference([response_col])
    # Exclude features that are text (these end with _txt)
    features = features[~features.str.endswith("_txt")]
    return features
features = get_feature_subset_nulls(terrorism_98, "gname")
Defining Feature Sets
I separated features into numeric and text categories:
numeric_features = ['INT_ANY','INT_IDEO','INT_LOG','INT_MISC',
'attacktype1','claimed', 'country', 'crit1',
'crit2', 'crit3', 'doubtterr','extended',
'guncertain1','iday','imonth','ishostkid',
'iyear','multiple','natlty1', 'nkill',
'nkillter', 'nkillus', 'nperpcap','nperps',
'nwound', 'nwoundte', 'nwoundus', 'property',
'region', 'specificity', 'success', 'suicide',
'targsubtype1', 'targtype1', 'vicinity',
'weapsubtype1', 'weaptype1']
text_features = ['provstate','location','city',
'summary','target1','target2',
'scite1','dbsource','corp1']
all_features = numeric_features + text_features
Part 3: Data Cleaning (Fill Null Values)
Binary Features
For binary features, I filled null values with probability-weighted 0s and 1s based on the existing distribution:
def binary_fill(df, fill_values):
    binary_columns_notfull = ["ishostkid", "property", "INT_LOG", "INT_IDEO", "INT_MISC", "INT_ANY", "guncertain1"]
    for col in binary_columns_notfull:
        if col not in fill_values:
            # Proportion of 0s among the observed values
            col_pd = df[col].value_counts() / df[col].value_counts().sum()
            fill_values[col] = col_pd[0]
        df.loc[df[col].isnull(), col] = np.random.choice(
            [0, 1], size=df[col].isnull().sum(), p=[fill_values[col], 1 - fill_values[col]])
    return df
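To illustrate the idea on toy data (not the GTD itself): sampling 0s and 1s with the observed proportions keeps the column's distribution roughly intact after filling.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
# Toy binary column: 70% ones among observed values, plus 300 missing entries
s = pd.Series([1.0] * 700 + [0.0] * 300 + [np.nan] * 300)
p_zero = (s == 0).sum() / s.notnull().sum()  # observed P(0) = 0.3
s.loc[s.isnull()] = np.random.choice([0, 1], size=s.isnull().sum(), p=[p_zero, 1 - p_zero])
print(s.mean())  # stays close to the observed mean of 0.7
```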
Range Features
For categorical range features, null values were replaced with random choices weighted by their probability distribution:
def range_fill(df, fill_values):
    range_columns_notfull = ["targsubtype1", "weapsubtype1", "natlty1", "specificity"]
    for col in range_columns_notfull:
        if col not in fill_values:
            col_dist = df[col].value_counts() / df[col].value_counts().sum()
            fill_values[col] = col_dist
        col_val_missing = df[col].isnull()
        df.loc[col_val_missing, col] = np.random.choice(
            fill_values[col].index, size=col_val_missing.sum(), p=fill_values[col].values)
    return df
Numeric Value Features
For numeric features with continuous values, null values were filled with the median:
def numeric_fill(df, fill_values):
    numeric_columns_notfull = ["nkill", "nkillus", "nkillter", "nwound", "nwoundus", "nwoundte", "nperpcap", "nperps"]
    for col in numeric_columns_notfull:
        if col not in fill_values:
            fill_values[col] = df[col].median()
        df.loc[df[col].isnull(), col] = fill_values[col]
    return df
Combined Transformation Function
def gen_numeric_features(features):
    # Fill values are computed once and cached in these dicts, so the
    # same values are reused on every later transform
    binary_fill_values = {}
    range_fill_values = {}
    numeric_fill_values = {}
    def transform_numeric_features(df):
        df = df[features]
        df = binary_fill(df, binary_fill_values)
        df = range_fill(df, range_fill_values)
        df = numeric_fill(df, numeric_fill_values)
        return df
    return transform_numeric_features
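A detail worth noting: because the fill-value dicts are captured by the closure, they are populated from the first DataFrame the transformer sees (the training data) and then reused verbatim on the test and out-of-sample data, avoiding leakage of test-set statistics. A minimal, self-contained sketch of this pattern with a simple median filler and toy data (column name `nkill` is just illustrative):

```python
import numpy as np
import pandas as pd

def gen_median_filler():
    # fill_values is captured by the closure: it is populated from the
    # first DataFrame seen, then reused on every later transform
    fill_values = {}
    def transform(df):
        df = df.copy()
        for col in df.columns:
            if col not in fill_values:
                fill_values[col] = df[col].median()
            df.loc[df[col].isnull(), col] = fill_values[col]
        return df
    return transform

fill = gen_median_filler()
train_df = pd.DataFrame({"nkill": [0.0, 2.0, np.nan, 4.0]})
new_df = pd.DataFrame({"nkill": [np.nan, 1.0]})
print(fill(train_df)["nkill"].tolist())  # [0.0, 2.0, 2.0, 4.0] -- train median is 2.0
print(fill(new_df)["nkill"].tolist())    # [2.0, 1.0] -- reuses the cached 2.0, not the new median
```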
Part 4: Model Setup
Combining Feature Extraction Steps
I used FeatureUnion to apply transformers in parallel and concatenate results, combining numeric and text features into a single document-term matrix. This approach leverages both structured data and textual information for predictions.
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import CountVectorizer

pipelines = []

def get_feature_text(col):
    def get_text(df):
        return df[col].fillna("")
    return get_text

# Build a document-term matrix using CountVectorizer for ALL text columns
for col in text_features:
    get_text_ft = FunctionTransformer(get_feature_text(col), validate=False)
    pipelines.append(make_pipeline(get_text_ft, CountVectorizer(decode_error='ignore')))

# Add the numeric data into the pipeline
get_numeric_features = gen_numeric_features(numeric_features)
pipelines.append(FunctionTransformer(get_numeric_features, validate=False))

union = make_union(*pipelines)
Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(terrorism_98[all_features], terrorism_98.gname,
test_size=0.3, random_state=15)
Creating Document-Term Matrices
X_train_dtm = union.fit_transform(X_train)
X_test_dtm = union.transform(X_test)
# Out of sample data for predictions
x_new_dtm = union.transform(terrorism_98_nognames)
Part 5: Model Evaluation
I evaluated multiple machine learning algorithms to identify the best performer:
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
Logistic Regression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report

logreg_grid_search = GridSearchCV(LogisticRegression(), param_grid={'C': np.logspace(-3, 3, 7)})
logreg_grid_search.fit(X_train_dtm, y_train)
log_pred_search_pred = logreg_grid_search.predict(X_test_dtm)
print(metrics.accuracy_score(y_test, log_pred_search_pred))
logreg_grid_search.best_params_
logreg_grid_search.best_score_
confusion_matrix(y_test, log_pred_search_pred)
print(classification_report(y_test, log_pred_search_pred))
K-Nearest Neighbors (KNN)
knn_grid_search = GridSearchCV(KNeighborsClassifier(), param_grid={'n_neighbors': range(3, 22, 2)})
knn_grid_search.fit(X_train_dtm, y_train)
knn_grid_search.best_score_
knn_grid_search.best_params_
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_dtm, y_train)
knn_pred = knn.predict(X_test_dtm)
print(metrics.accuracy_score(y_test, knn_pred))
Naive Bayes
nb_model = GaussianNB()
nb_model.fit(X_train_dtm.toarray(),y_train)
nb_model.score(X_test_dtm.toarray(),y_test)
Decision Tree
# max_depth must be a positive integer, so cast the log-spaced grid to ints
tree_grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid={'max_depth': np.logspace(0, 3, 6).astype(int)})
tree_grid_search.fit(X_train_dtm, y_train)
tree_grid_search.best_params_
tree_grid_search.best_score_
Random Forest
forest_grid_search = GridSearchCV(RandomForestClassifier(), param_grid={'n_estimators': [10,100,1000]})
forest_grid_search.fit(X_train_dtm, y_train)
forest_grid_search.best_params_
forest_grid_search.best_score_
Support Vector Classifier (SVC)
svc_grid_search = GridSearchCV(SVC(), param_grid={'C': [.01,1,100]})
svc_grid_search.fit(X_train_dtm, y_train)
svc_grid_search.best_params_
svc_grid_search.best_score_
Ensemble: Voting Classifier
I also tested ensemble methods combining multiple models:
from sklearn.ensemble import VotingClassifier
clf1 = LogisticRegression(C=1000,random_state=1)
clf2 = RandomForestClassifier(n_estimators=1000,random_state=1)
clf3 = KNeighborsClassifier(n_neighbors=5)
clf4 = SVC(C=100)
# Hard voting: Majority vote
eclf1 = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('knn', clf3),('SVC',clf4)], voting='hard')
eclf1 = eclf1.fit(X_train_dtm, y_train)
print(eclf1.predict(X_test_dtm))
print(eclf1.score(X_test_dtm,y_test))
# Soft voting: Average probabilities
eclf2 = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('knn', clf3)],voting='soft')
eclf2 = eclf2.fit(X_train_dtm, y_train)
print(eclf2.predict(X_test_dtm))
print(eclf2.score(X_test_dtm,y_test))
Part 6: Insights & Conclusion
Model Rankings by Accuracy Score
- Decision Tree Grid Search: 0.9478
- Random Forest Grid Search: 0.9472
- Logistic Regression Grid Search: 0.9337
- Voting Classifier (Hard): 0.9328
- Voting Classifier (Soft): 0.9265
- SVC Grid Search: 0.8856
- Naive Bayes: 0.8384
- KNN Grid Search: 0.7218
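For reference, the ranking above can be tabulated directly from the reported scores; a small sketch using pandas:

```python
import pandas as pd

# Accuracy scores reported above, collected for a sorted comparison
scores = pd.Series({
    "Decision Tree Grid Search": 0.9478,
    "Random Forest Grid Search": 0.9472,
    "Logistic Regression Grid Search": 0.9337,
    "Voting Classifier (Hard)": 0.9328,
    "Voting Classifier (Soft)": 0.9265,
    "SVC Grid Search": 0.8856,
    "Naive Bayes": 0.8384,
    "KNN Grid Search": 0.7218,
})
print(scores.sort_values(ascending=False))
```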
Key Finding: These machine learning models classified the responsible terrorist group with remarkably high accuracy; the grid-searched decision tree performed best at 94.78%.
Predictions on Unknown Attacks
I generated predictions for the out-of-sample data (attacks with unknown attribution):
y_new_logreg = logreg_grid_search.predict(x_new_dtm)
y_new_knn = knn.predict(x_new_dtm)
y_new_nb = nb_model.predict(x_new_dtm.toarray())
y_new_tree = tree_grid_search.predict(x_new_dtm)
y_new_forest = forest_grid_search.predict(x_new_dtm)
y_new_svc = svc_grid_search.predict(x_new_dtm)
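Since no ground truth exists for these attacks, one way to gauge confidence in an attribution is to check how many of the models agree on it. A minimal sketch with toy prediction arrays standing in for the real ones (the group names here are purely illustrative):

```python
import pandas as pd

# Toy per-model predictions standing in for y_new_logreg, y_new_tree, y_new_forest
preds = pd.DataFrame({
    "logreg": ["Taliban", "ISIL", "Boko Haram"],
    "tree":   ["Taliban", "ISIL", "Al-Shabaab"],
    "forest": ["Taliban", "ISIL", "Boko Haram"],
})
# Row-wise majority vote, plus how many of the models agreed on it
consensus = preds.mode(axis=1)[0]
agreement = preds.eq(consensus, axis=0).sum(axis=1)
print(pd.DataFrame({"consensus": consensus, "agreement": agreement}))
```

Attacks where all models converge on the same group would be the most defensible attributions; splits flag cases needing human review.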
Limitations & Considerations
It's important to note that these models were trained on data from 1998 onward, with groups filtered to those appearing more than 50 times in the dataset. This filtering ensures sufficient training examples per class but limits the models' ability to identify smaller or newer terrorist organizations.
Final Thoughts
This project demonstrated the power of combining structured numeric data with unstructured text features to solve complex classification problems. The high accuracy achieved suggests that terrorist groups exhibit identifiable patterns in their attack characteristics—patterns that machine learning can detect and use for attribution.
While the models show strong performance on historical data, real-world application would require continuous updates as new groups emerge and existing groups evolve their tactics. The framework developed here provides a solid foundation for ongoing analysis and could potentially assist intelligence agencies in attributing attacks where responsibility is unclear or disputed.