Analyzing NYC Subway Traffic Patterns with Python Time Series Analysis

I applied Python time series analysis to NYC MTA turnstile data to identify which subway line saw the most traffic over a one-week period in January 2016. This project showcases data wrangling techniques, time series manipulation, and visualization of large-scale public transit data.

Table of Contents

  1. Introduction
  2. Data
  3. Line Data Function
  4. Plotting Line Activity
  5. Conclusion

Introduction

For this project, I worked with NYC's MTA turnstile data to analyze subway traffic patterns. The goal was straightforward: determine which line experiences the highest volume of commuter activity during a typical week. I chose January 2016 data, a period representing normal winter commuting patterns without holiday anomalies.

The technical challenge here wasn't just parsing ~20,000 rows of data, but handling inconsistent audit intervals, negative counter values from turnstile resets, and distributing activity across multiple lines serving the same station.

Imports

from datetime import datetime, timedelta
import csv
import numpy as np
import requests
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
%matplotlib inline
from IPython.display import Image

Data

The NYC MTA provides turnstile data in CSV format with approximately 20,000 rows per week. Here's what a sample looks like:

[Image: NYC Turnstile Data Sample]

Understanding the data structure was critical. Each row represents an audit event from a specific turnstile device:

  • C/A = Control Area (A002)
  • UNIT = Remote Unit for a station (R051)
  • SCP = Subunit Channel Position, the specific address for a device (02-00-00)
  • STATION = The station name where the device is located
  • LINENAME = All train lines accessible at this station. For example, LINENAME 456NQR indicates service for the 4, 5, 6, N, Q, and R trains.
  • DIVISION = The original line the station belonged to: BMT, IRT, or IND
  • DATE = Date in MM-DD-YY format
  • TIME = Time in hh:mm:ss format for a scheduled audit event
  • DESC = Describes the audit type, typically "REGULAR" (occurs every 4 hours)
    1. Audits may occur at intervals greater than 4 hours due to maintenance or troubleshooting
    2. "RECOVR AUD" entries indicate recovered missed audits
  • ENTRIES = Cumulative entry register value for the device
  • EXITS = Cumulative exit register value for the device
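A minimal sketch of unpacking one such row into the fields described above (the station and register values here are illustrative, not taken from the real file):

```python
import csv
import io
from datetime import datetime

# One illustrative data row in the MTA column order:
# C/A, UNIT, SCP, STATION, LINENAME, DIVISION, DATE, TIME, DESC, ENTRIES, EXITS
raw = "A002,R051,02-00-00,59 ST,456NQR,BMT,01/09/2016,03:00:00,REGULAR,5317608,1796892"

row = next(csv.reader(io.StringIO(raw)))

unit, scp, station = row[1], row[2], row[3]
lines = row[4]                       # each character is one train line
turnstile_id = (scp, unit, station)  # uniquely identifies one device
when = datetime.strptime(row[6] + " " + row[7], "%m/%d/%Y %H:%M:%S")
entries, exits = int(row[9]), int(row[10])

print(turnstile_id, list(lines), when, entries, exits)
```

Note how LINENAME expands into individual lines simply by iterating over its characters.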

Line Data Function

The core of my analysis lies in this function. I needed to calculate activity per line while handling several data quality issues:

The heart of the calculation is:

averAct = (difEntries + difExits) / numLines

For each turnstile (identified by SCP, UNIT, and STATION), the read_csv function computes the change in the entry and exit registers over an audit interval, then splits that activity evenly across every line serving the station and aggregates it into each line's total.
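A quick worked example of this even split, using the six-line 456NQR station from the LINENAME example above (the delta values are illustrative):

```python
# Register deltas for one turnstile over one ~4-hour audit interval
dif_entries = 90
dif_exits = 30
lines = "456NQR"  # six lines serve this station

# Integer division mirrors the original Python 2 behaviour of `/` on ints
avg_activity = (dif_entries + dif_exits) // len(lines)

print(avg_activity)  # each of the six lines is credited 20
```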

I implemented several data cleaning steps:

  • Treated small negative entry/exit differences as counter resets and took their absolute values
  • Set differences below -15,000 to zero (likely erroneous readings)
  • Normalized audit times to consistent 4-hour windows (3:00, 7:00, 11:00, 15:00, 19:00, 23:00)

def read_csv(filename):
    prevTurnstileID = None
    prevEntries = prevExits = 0
    lineData = {}
    ideal_time = (3, 7, 11, 15, 19, 23)
    with open(filename, newline='') as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            unit = row[1]
            scp = row[2]
            station = row[3]
            lines = row[4]
            date = row[6]
            audit_time = row[7]
            turnstileID = (scp, unit, station)
            date_time = datetime.strptime(date + " " + audit_time, "%m/%d/%Y %H:%M:%S")
            entries = int(row[9])
            exits = int(row[10])
            # If this row comes from the same turnstile as the previous row,
            # the register differences give the entries and exits that
            # occurred during this audit interval.
            if turnstileID == prevTurnstileID:
                difEntries = entries - prevEntries
                difExits = exits - prevExits
                if difEntries < -15000:
                    print("DifEntries: %s" % difEntries)
                    difEntries = 0                # implausibly large drop: discard
                elif difEntries < 0:
                    difEntries = abs(difEntries)  # counter reset
                if difExits < -15000:
                    print("DifExits: %s" % difExits)
                    difExits = 0
                elif difExits < 0:
                    difExits = abs(difExits)
                # Split the interval's activity evenly across every line
                # serving this station (integer division, as in the original).
                numLines = len(lines)
                averAct = (difEntries + difExits) // numLines
                # Snap the audit timestamp onto the nearest scheduled
                # 4-hour window.
                if date_time.minute >= 30:
                    date_time += timedelta(hours=1)
                    date_time = date_time.replace(minute=0, second=0)
                elif date_time.minute != 0:
                    date_time -= timedelta(hours=1)
                    date_time = date_time.replace(minute=0, second=0)
                if date_time.hour not in ideal_time:
                    if date_time.hour < 3:
                        date_time = date_time.replace(hour=3)
                    else:
                        for val in ideal_time:
                            if abs(date_time.hour - val) <= 2:
                                date_time = date_time.replace(hour=val)
                                break
                key = str(date_time)
                for line in lines:
                    lineData.setdefault(line, {})
                    lineData[line][key] = lineData[line].get(key, 0) + averAct
            prevTurnstileID = turnstileID
            prevEntries = entries
            prevExits = exits
    return lineData
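The cleaning and window-snapping rules buried inside read_csv can be pulled out and sanity-checked in isolation. Here is a sketch of the same logic as standalone helpers (the function names are mine, not part of the original code):

```python
from datetime import datetime, timedelta

IDEAL_HOURS = (3, 7, 11, 15, 19, 23)

def clean_diff(diff):
    """Mirror the cleaning rules: huge negatives are discarded as
    erroneous, smaller negatives are treated as counter resets."""
    if diff < -15000:
        return 0          # implausibly large drop: likely a bad reading
    if diff < 0:
        return abs(diff)  # counter reset; the magnitude is still meaningful
    return diff

def snap_to_window(dt):
    """Round an audit timestamp onto the nearest scheduled 4-hour slot."""
    if dt.minute >= 30:
        dt = (dt + timedelta(hours=1)).replace(minute=0, second=0)
    elif dt.minute != 0:
        dt = (dt - timedelta(hours=1)).replace(minute=0, second=0)
    if dt.hour not in IDEAL_HOURS:
        if dt.hour < 3:
            dt = dt.replace(hour=3)
        else:
            for h in IDEAL_HOURS:
                if abs(dt.hour - h) <= 2:
                    dt = dt.replace(hour=h)
                    break
    return dt

print(clean_diff(-42))      # small negative: counter reset, keep magnitude
print(clean_diff(-20000))   # huge negative: discarded
print(snap_to_window(datetime(2016, 1, 9, 12, 45)))  # lands on an ideal hour
```

Factoring the rules out this way makes them easy to unit-test without a full CSV pass.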

The output is a nested dictionary structure where each line maps to timestamps and their corresponding activity values:

lineData={'1': {'2016-01-11 07:00:00': 74, '2016-01-10 15:00:00': 320, '2016-01-11 11:00:00': 776, '2016-01-11 03:00:00': 23, '2016-01-10 19:00:00': 317, '2016-01-10 23:00:00': 139},
          'N': {'2016-01-09 07:00:00': 3, '2016-01-09 23:00:00': 51, '2016-01-09 15:00:00': 55, '2016-01-09 19:00:00': 75, '2016-01-09 11:00:00': 31, '2016-01-10 03:00:00': 17},
          'Q': {'2016-01-09 07:00:00': 3, '2016-01-09 23:00:00': 51, '2016-01-09 15:00:00': 55, '2016-01-09 19:00:00': 75, '2016-01-09 11:00:00': 31, '2016-01-10 03:00:00': 17},
          '3': {'2016-01-09 19:00:00': 597, '2016-01-09 23:00:00': 403, '2016-01-12 19:00:00': 691, '2016-01-09 07:00:00': 229, '2016-01-13 03:00:00': 75, '2016-01-09 15:00:00': 513, '2016-01-15 08:00:00': 189, '2016-01-14 08:00:00': 180, '2016-01-15 04:00:00': 25, '2016-01-14 04:00:00': 15, '2016-01-14 12:00:00': 390, '2016-01-15 00:00:00': 125, '2016-01-09 11:00:00': 427, '2016-01-12 23:00:00': 295, '2016-01-14 16:00:00': 312, '2016-01-14 20:00:00': 393, '2016-01-10 03:00:00': 226},
          'R': {'2016-01-09 07:00:00': 3, '2016-01-09 23:00:00': 51, '2016-01-09 15:00:00': 55, '2016-01-09 19:00:00': 75, '2016-01-09 11:00:00': 31, '2016-01-10 03:00:00': 17},
          '5': {'2016-01-11 12:00:00': 313, '2016-01-10 07:00:00': 46, '2016-01-09 07:00:00': 83, '2016-01-12 04:00:00': 9, '2016-01-09 23:00:00': 314, '2016-01-09 15:00:00': 494, '2016-01-11 08:00:00': 378, '2016-01-09 19:00:00': 487, '2016-01-11 20:00:00': 302, '2016-01-12 00:00:00': 74, '2016-01-12 12:00:00': 307, '2016-01-11 16:00:00': 309, '2016-01-09 11:00:00': 358, '2016-01-12 08:00:00': 447, '2016-01-10 03:00:00': 131},
          '4': {'2016-01-09 07:00:00': 3, '2016-01-09 23:00:00': 51, '2016-01-09 15:00:00': 55, '2016-01-09 19:00:00': 75, '2016-01-09 11:00:00': 31, '2016-01-10 03:00:00': 17},
          '7': {'2016-01-13 15:00:00': 490, '2016-01-12 19:00:00': 1048, '2016-01-13 03:00:00': 97, '2016-01-12 23:00:00': 490, '2016-01-13 07:00:00': 148, '2016-01-13 11:00:00': 794},
          '6': {'2016-01-09 07:00:00': 3, '2016-01-09 23:00:00': 51, '2016-01-09 15:00:00': 55, '2016-01-09 19:00:00': 75, '2016-01-09 11:00:00': 31, '2016-01-10 03:00:00': 17},
          '2': {'2016-01-09 19:00:00': 412, '2016-01-09 23:00:00': 263, '2016-01-09 07:00:00': 80, '2016-01-09 15:00:00': 439, '2016-01-10 07:00:00': 46, '2016-01-15 08:00:00': 189, '2016-01-14 08:00:00': 180, '2016-01-15 04:00:00': 25, '2016-01-14 04:00:00': 15, '2016-01-14 12:00:00': 390, '2016-01-15 00:00:00': 125, '2016-01-09 11:00:00': 327, '2016-01-14 16:00:00': 312, '2016-01-14 20:00:00': 393, '2016-01-10 03:00:00': 114}}
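Given this structure, ranking lines by volume is straightforward: sum each line's per-window activity and take the maximum. A sketch against a toy subset of lineData (illustrative numbers):

```python
# Toy subset shaped like the lineData output above
line_data = {
    "1": {"2016-01-11 07:00:00": 74, "2016-01-10 15:00:00": 320},
    "7": {"2016-01-12 19:00:00": 1048, "2016-01-13 11:00:00": 794},
    "N": {"2016-01-09 07:00:00": 3, "2016-01-09 19:00:00": 75},
}

# Total activity per line, then the busiest line overall
totals = {line: sum(windows.values()) for line, windows in line_data.items()}
busiest = max(totals, key=totals.get)

print(busiest, totals[busiest])  # 7 1842
```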

Plotting Line Activity

For visualization, I mapped each line to its official MTA colors and created distinct line styles for clarity when multiple lines share the same color:

lineColors={"A":"#2850ad","C":"#2850ad","E":"#2850ad",
            "B":"#ff6319","D":"#ff6319","F":"#ff6319","M":"#ff6319",
            "G":"#6cbe45",
            "L":"#a7a9ac",
            "J":"#996633","Z":"#996633",
            "N":"#fccc0a","Q":"#fccc0a","R":"#fccc0a",
            "1":"#ee352e","2":"#ee352e","3":"#ee352e",
            "4":"#00933c","5":"#00933c","6":"#00933c",
            "7":"#b933ad",
            "S":"#808183"}
lineStyles={"A":"-","C":":","E":"-.",
            "B":"-","D":":","F":"-.","M":"--",
            "G":"-",
            "L":"-",
            "J":"-","Z":":",
            "N":"-","Q":":","R":"-.",
            "1":"-","2":":","3":"-.",
            "4":"-","5":":","6":"-.",
            "7":"-",
            "S":"-"}

I built a helper function to transform the lineData dictionary into plottable time series:

plotList = []

def addLine(thisLine):
    # Convert the timestamp strings back into datetime objects
    lineTimeData = {}
    for eachTime in lineData[thisLine]:
        tempTime = datetime.strptime(eachTime, "%Y-%m-%d %H:%M:%S")
        lineTimeData[tempTime] = lineData[thisLine][eachTime]
    timeList = sorted(lineTimeData)
    activityList = [lineTimeData[eachTime] for eachTime in timeList]
    labelName = 'Line ' + thisLine
    thisPlot, = plt.plot(timeList, activityList, linewidth=2,
                         color=lineColors[thisLine], label=labelName,
                         linestyle=lineStyles[thisLine])
    plotList.append(thisPlot)

The main execution code ties everything together - reading the CSV, processing each line, and generating the visualization:

filename = './turnstile_160109.csv'
lineData = read_csv(filename)

for eachLine in lineData:
    addLine(eachLine)

plt.legend(handles=plotList, bbox_to_anchor=(1, 1), bbox_transform=plt.gcf().transFigure)
plt.grid()
plt.suptitle('Total Activity Per NYC Metro Line', fontsize=28)
plt.xlabel('Time (Days)', fontsize=18)
plt.ylabel('Activity (Entries + Exits)', fontsize=18)
plt.axis('tight')
plt.show()

[Image: Total Activity Per NYC Metro Line]

Here's a detailed view of the activity patterns:

[Image: Activity Line Plot Detail]

Key observations from the visualization:

  • Y Axis: Activity ranges from 0 to 35,000 combined entries and exits
  • X Axis: Timeline spans the first week of January 2016
  • Legend: Each line is color-coded according to official MTA branding

Conclusion

The data clearly shows that NYC's N line dominates in terms of raw commuter volume during this period. The N line's route explains this: it connects Queens to Manhattan's midtown, passes through Times Square (42nd Street) and Herald Square (34th Street), continues down to Union Square (14th Street), and terminates in Brooklyn, hitting virtually every major business and transit hub in the city.

From a data analysis perspective, this project demonstrates the importance of robust data cleaning when working with real-world sensor data. Turnstile counters reset, audits miss their scheduled intervals, and stations serve multiple lines; the processing pipeline had to address each of these challenges.

Future enhancements I'm considering:

  • Refining the activity distribution algorithm for multi-line stations (currently using simple averaging)
  • Modeling line dependencies and transfer patterns between lines
  • Expanding the analysis to cover full monthly or yearly cycles to identify seasonal trends
  • Incorporating external factors like weather, events, and service disruptions