Can machine learning predict which terrorist organization carried out an attack based solely on the attack's characteristics? During my time at StartupML, I tackled this question using the Global Terrorism Database—a comprehensive dataset of over 150,000 terrorist incidents spanning decades.
The challenge was fascinating: thousands of attacks in the database remain unattributed. By analyzing patterns in attack types, weapons used, locations, targets, and other features, I built predictive models that achieved over 94% accuracy in identifying responsible groups. This post walks through the complete process—from data cleaning to model evaluation—and the insights gained along the way.
Imports
%matplotlib inline
import os
import json
import pickle
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
pd.set_option('display.max_info_columns', 200)
Part 1: Problem Statement
The Global Terrorism Database (GTD) is an open-source database including information on terrorist events around the world from 1970 through 2014. Some portion of the attacks have not been attributed to a particular terrorist group.
The goal: Use attack type, weapons used, description of the attack, and other features to build a model that can predict what group may have been responsible for an incident.
Part 2: Data Extraction & Loading
The globalterrorismdb_0615dist.xlsx file can be downloaded from the Global Terrorism Database. After conversion to CSV format, I loaded and filtered the data.
terrorism = pd.read_csv('terrorist_data.csv')
Filtering by Year
I filtered the dataset to include only information starting from 1998 because many of the GTD Codebook features only apply to terrorism incidents occurring after 1997.
terrorism_98 = terrorism[terrorism.iyear >= 1998]
Handling Unknown Values
Several features encode unknown values as -9 (binary flags) or -99 (counts); these were replaced with np.nan:
def fill_nines(df):
    missing_value_as_neg_nines = ["ishostkid", "property", "INT_LOG", "INT_IDEO", "INT_MISC", "INT_ANY"]
    for col in missing_value_as_neg_nines:
        df.loc[df[col] == -9, col] = np.nan
    missing_value_as_neg_ninetynines = ["nhostkid", "nhostkidus", "nhours", "nperpcap", "nperps"]
    for col in missing_value_as_neg_ninetynines:
        df.loc[df[col] == -99, col] = np.nan
    return df
terrorism_98 = fill_nines(terrorism_98)
Filtering Target Groups
The gname column contains the name of the terrorist group that carried out the attack. I created two datasets:
- Out-of-sample data where gname is "Unknown" (used later for predictions)
- Training data limited to groups appearing more than 50 times
# create out of sample dataframe where gname is unknown or blank
terrorism_98_nognames = terrorism_98[terrorism_98.gname == "Unknown"]
# Change existing terrorism_98 dataframe to only include rows that include a gname
terrorism_98 = terrorism_98[terrorism_98.gname != "Unknown"]
# Filter for gname responses that at least occur 50 times
gname_counts = terrorism_98.gname.value_counts()
gnames_over_threshold = gname_counts[gname_counts > 50].index
terrorism_98 = terrorism_98[terrorism_98.gname.isin(gnames_over_threshold)]
print("number of unique gnames:", terrorism_98.gname.unique().size)
print("number of unique gnames in out-of-sample data:", terrorism_98_nognames.gname.unique().size)
Feature Selection
I created a function to filter features by the quantity of null values, keeping only features with less than 20% missing data:
def get_feature_subset_nulls(df, response_col, threshold=0.2):
    """
    Returns features whose share of missing values is below the threshold (default 20%).
    """
    # Applying per column: defined by axis=0
    feature_null_counts = df.apply(lambda col: col.isnull().sum(), axis=0)
    num_rows = len(df)
    features = feature_null_counts[feature_null_counts < (threshold * num_rows)].index.difference([response_col])
    # Exclude features that are text (these end with _txt)
    features = features[~features.str.endswith("_txt")]
    return features
features = get_feature_subset_nulls(terrorism_98, "gname")
Defining Feature Sets
I separated features into numeric and text categories:
numeric_features = ['INT_ANY','INT_IDEO','INT_LOG','INT_MISC',
'attacktype1','claimed', 'country', 'crit1',
'crit2', 'crit3', 'doubtterr','extended',
'guncertain1','iday','imonth','ishostkid',
'iyear','multiple','natlty1', 'nkill',
'nkillter', 'nkillus', 'nperpcap','nperps',
'nwound', 'nwoundte', 'nwoundus', 'property',
'region', 'specificity', 'success', 'suicide',
'targsubtype1', 'targtype1', 'vicinity',
'weapsubtype1', 'weaptype1']
text_features = ['provstate','location','city',
'summary','target1','target2',
'scite1','dbsource','corp1']
all_features = numeric_features + text_features
Part 3: Data Cleaning (Fill Null Values)
Binary Features
For binary features, I filled null values with probability-weighted 0s and 1s based on the existing distribution:
def binary_fill(df, fill_values):
    binary_columns_notfull = ["ishostkid", "property", "INT_LOG", "INT_IDEO", "INT_MISC", "INT_ANY", "guncertain1"]
    for col in binary_columns_notfull:
        if col not in fill_values:
            # Proportion of 0s among the observed values
            col_pd = df[col].value_counts() / df[col].value_counts().sum()
            fill_values[col] = col_pd[0]
        df.loc[df[col].isnull(), col] = np.random.choice(
            [0, 1], size=df[col].isnull().sum(), p=[fill_values[col], 1 - fill_values[col]])
    return df
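To illustrate the idea on toy data (not the GTD itself): sampling 0s and 1s with the observed proportions keeps the column's distribution roughly intact after filling.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
# Toy binary column: 70% ones among observed values, plus 300 missing entries
s = pd.Series([1.0] * 700 + [0.0] * 300 + [np.nan] * 300)
p_zero = (s == 0).sum() / s.notnull().sum()  # observed P(0) = 0.3
s.loc[s.isnull()] = np.random.choice([0, 1], size=s.isnull().sum(), p=[p_zero, 1 - p_zero])
print(s.mean())  # stays close to the observed mean of 0.7
```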
Range Features
For categorical range features, null values were replaced with random choices weighted by their probability distribution:
def range_fill(df, fill_values):
    range_columns_notfull = ["targsubtype1", "weapsubtype1", "natlty1", "specificity"]
    for col in range_columns_notfull:
        if col not in fill_values:
            col_dist = df[col].value_counts() / df[col].value_counts().sum()
            fill_values[col] = col_dist
        col_val_missing = df[col].isnull()
        df.loc[col_val_missing, col] = np.random.choice(
            fill_values[col].index, size=col_val_missing.sum(), p=fill_values[col].values)
    return df
Numeric Value Features
For numeric features with continuous values, null values were filled with the median:
def numeric_fill(df, fill_values):
    numeric_columns_notfull = ["nkill", "nkillus", "nkillter", "nwound", "nwoundus", "nwoundte", "nperpcap", "nperps"]
    for col in numeric_columns_notfull:
        if col not in fill_values:
            fill_values[col] = df[col].median()
        df.loc[df[col].isnull(), col] = fill_values[col]
    return df
Combined Transformation Function
def gen_numeric_features(features):
    # Fill values are computed once and cached in these dicts, so the
    # same values are reused on every later transform
    binary_fill_values = {}
    range_fill_values = {}
    numeric_fill_values = {}
    def transform_numeric_features(df):
        df = df[features]
        df = binary_fill(df, binary_fill_values)
        df = range_fill(df, range_fill_values)
        df = numeric_fill(df, numeric_fill_values)
        return df
    return transform_numeric_features
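A detail worth noting: because the fill-value dicts are captured by the closure, they are populated from the first DataFrame the transformer sees (the training data) and then reused verbatim on the test and out-of-sample data, avoiding leakage of test-set statistics. A minimal, self-contained sketch of this pattern with a simple median filler and toy data (column name `nkill` is just illustrative):

```python
import numpy as np
import pandas as pd

def gen_median_filler():
    # fill_values is captured by the closure: it is populated from the
    # first DataFrame seen, then reused on every later transform
    fill_values = {}
    def transform(df):
        df = df.copy()
        for col in df.columns:
            if col not in fill_values:
                fill_values[col] = df[col].median()
            df.loc[df[col].isnull(), col] = fill_values[col]
        return df
    return transform

fill = gen_median_filler()
train_df = pd.DataFrame({"nkill": [0.0, 2.0, np.nan, 4.0]})
new_df = pd.DataFrame({"nkill": [np.nan, 1.0]})
print(fill(train_df)["nkill"].tolist())  # [0.0, 2.0, 2.0, 4.0] -- train median is 2.0
print(fill(new_df)["nkill"].tolist())    # [2.0, 1.0] -- reuses the cached 2.0, not the new median
```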
Part 4: Model Setup
Combining Feature Extraction Steps
I used FeatureUnion to apply transformers in parallel and concatenate results, combining numeric and text features into a single document-term matrix. This approach leverages both structured data and textual information for predictions.
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import CountVectorizer

pipelines = []

def get_feature_text(col):
    def get_text(df):
        return df[col].fillna("")
    return get_text

# Build a document-term matrix using CountVectorizer for ALL text columns
for col in text_features:
    get_text_ft = FunctionTransformer(get_feature_text(col), validate=False)
    pipelines.append(make_pipeline(get_text_ft, CountVectorizer(decode_error='ignore')))

# Add the numeric data into the pipeline
get_numeric_features = gen_numeric_features(numeric_features)
pipelines.append(FunctionTransformer(get_numeric_features, validate=False))

union = make_union(*pipelines)
Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(terrorism_98[all_features], terrorism_98.gname,
test_size=0.3, random_state=15)
Creating Document-Term Matrices
X_train_dtm = union.fit_transform(X_train)
X_test_dtm = union.transform(X_test)
# Out of sample data for predictions
x_new_dtm = union.transform(terrorism_98_nognames)
Part 5: Model Evaluation
I evaluated multiple machine learning algorithms to identify the best performer:
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
Logistic Regression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report

logreg_grid_search = GridSearchCV(LogisticRegression(), param_grid={'C': np.logspace(-3, 3, 7)})
logreg_grid_search.fit(X_train_dtm, y_train)
log_pred_search_pred = logreg_grid_search.predict(X_test_dtm)
print(metrics.accuracy_score(y_test, log_pred_search_pred))
logreg_grid_search.best_params_
logreg_grid_search.best_score_
confusion_matrix(y_test, log_pred_search_pred)
print(classification_report(y_test, log_pred_search_pred))
K-Nearest Neighbors (KNN)
knn_grid_search = GridSearchCV(KNeighborsClassifier(), param_grid={'n_neighbors': range(3, 22, 2)})
knn_grid_search.fit(X_train_dtm, y_train)
knn_grid_search.best_score_
knn_grid_search.best_params_
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_dtm, y_train)
knn_pred = knn.predict(X_test_dtm)
print(metrics.accuracy_score(y_test, knn_pred))
Naive Bayes
nb_model = GaussianNB()
nb_model.fit(X_train_dtm.toarray(),y_train)
nb_model.score(X_test_dtm.toarray(),y_test)
Decision Tree
# max_depth must be a positive integer, so cast the log-spaced grid to ints
tree_grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid={'max_depth': np.logspace(0, 3, 6).astype(int)})
tree_grid_search.fit(X_train_dtm, y_train)
tree_grid_search.best_params_
tree_grid_search.best_score_
Random Forest
forest_grid_search = GridSearchCV(RandomForestClassifier(), param_grid={'n_estimators': [10,100,1000]})
forest_grid_search.fit(X_train_dtm, y_train)
forest_grid_search.best_params_
forest_grid_search.best_score_
Support Vector Classifier (SVC)
svc_grid_search = GridSearchCV(SVC(), param_grid={'C': [.01,1,100]})
svc_grid_search.fit(X_train_dtm, y_train)
svc_grid_search.best_params_
svc_grid_search.best_score_
Ensemble: Voting Classifier
I also tested ensemble methods combining multiple models:
from sklearn.ensemble import VotingClassifier
clf1 = LogisticRegression(C=1000,random_state=1)
clf2 = RandomForestClassifier(n_estimators=1000,random_state=1)
clf3 = KNeighborsClassifier(n_neighbors=5)
clf4 = SVC(C=100)
# Hard voting: Majority vote
eclf1 = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('knn', clf3),('SVC',clf4)], voting='hard')
eclf1 = eclf1.fit(X_train_dtm, y_train)
print(eclf1.predict(X_test_dtm))
print(eclf1.score(X_test_dtm,y_test))
# Soft voting: Average probabilities
eclf2 = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('knn', clf3)],voting='soft')
eclf2 = eclf2.fit(X_train_dtm, y_train)
print(eclf2.predict(X_test_dtm))
print(eclf2.score(X_test_dtm,y_test))
Part 6: Insights & Conclusion
Model Rankings by Accuracy Score
- Decision Tree Grid Search: 0.9478
- Random Forest Grid Search: 0.9472
- Logistic Regression Grid Search: 0.9337
- Voting Classifier (Hard): 0.9328
- Voting Classifier (Soft): 0.9265
- SVC Grid Search: 0.8856
- Naive Bayes: 0.8384
- KNN Grid Search: 0.7218
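For reference, the ranking above can be tabulated directly from the reported scores; a small sketch using pandas:

```python
import pandas as pd

# Accuracy scores reported above, collected for a sorted comparison
scores = pd.Series({
    "Decision Tree Grid Search": 0.9478,
    "Random Forest Grid Search": 0.9472,
    "Logistic Regression Grid Search": 0.9337,
    "Voting Classifier (Hard)": 0.9328,
    "Voting Classifier (Soft)": 0.9265,
    "SVC Grid Search": 0.8856,
    "Naive Bayes": 0.8384,
    "KNN Grid Search": 0.7218,
})
print(scores.sort_values(ascending=False))
```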
Key Finding: These machine learning models classified the responsible terrorist group with remarkably high accuracy; the grid-searched decision tree performed best at 94.78%.
Predictions on Unknown Attacks
I generated predictions for the out-of-sample data (attacks with unknown attribution):
y_new_logreg = logreg_grid_search.predict(x_new_dtm)
y_new_knn = knn.predict(x_new_dtm)
y_new_nb = nb_model.predict(x_new_dtm.toarray())
y_new_tree = tree_grid_search.predict(x_new_dtm)
y_new_forest = forest_grid_search.predict(x_new_dtm)
y_new_svc = svc_grid_search.predict(x_new_dtm)
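Since no ground truth exists for these attacks, one way to gauge confidence in an attribution is to check how many of the models agree on it. A minimal sketch with toy prediction arrays standing in for the real ones (the group names here are purely illustrative):

```python
import pandas as pd

# Toy per-model predictions standing in for y_new_logreg, y_new_tree, y_new_forest
preds = pd.DataFrame({
    "logreg": ["Taliban", "ISIL", "Boko Haram"],
    "tree":   ["Taliban", "ISIL", "Al-Shabaab"],
    "forest": ["Taliban", "ISIL", "Boko Haram"],
})
# Row-wise majority vote, plus how many of the models agreed on it
consensus = preds.mode(axis=1)[0]
agreement = preds.eq(consensus, axis=0).sum(axis=1)
print(pd.DataFrame({"consensus": consensus, "agreement": agreement}))
```

Attacks where all models converge on the same group would be the most defensible attributions; splits flag cases needing human review.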
Limitations & Considerations
It's important to note that these models were trained on data from 1998 onward, with groups filtered to those appearing more than 50 times in the dataset. This filtering ensures sufficient training examples per class but limits the models' ability to identify smaller or newer terrorist organizations.
Final Thoughts
This project demonstrated the power of combining structured numeric data with unstructured text features to solve complex classification problems. The high accuracy achieved suggests that terrorist groups exhibit identifiable patterns in their attack characteristics—patterns that machine learning can detect and use for attribution.
While the models show strong performance on historical data, real-world application would require continuous updates as new groups emerge and existing groups evolve their tactics. The framework developed here provides a solid foundation for ongoing analysis and could potentially assist intelligence agencies in attributing attacks where responsibility is unclear or disputed.