Out of the Sea

Identifying Critical Tweets During Disasters

This is a collaborative, client-oriented project that I worked on with Javier Martinez and Alexander Nguyen, at the request of an organization involved in contracting work for FEMA (Federal Emergency Management Agency). This work was not compensated, and is not proprietary.

Our goal in this project was to take initial steps toward designing and implementing a web tool or app for tracking developments during a disaster in close to real time. While traditional alerting methods rely on official sources (e.g., USGS), we were tasked with using social media activity to identify these events and raise an alert when an event first occurs. The primary question we examine here is: given a sea of text content from social media platforms, how do you identify the information that is relevant to emergency response personnel? And what sort of implementation would be valuable?

Importing Libraries

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from twython import Twython
from tqdm import tqdm
from time import sleep

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.naive_bayes import MultinomialNB
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

import re
from wordcloud import WordCloud
import pickle
In [4]:
pd.set_option('display.max_rows',1000)

Data Collection

Our initial plan was to map incoming tweets in real time using Twitter, Facebook, Instagram, Snapchat, etc. We quickly found that the APIs for these social media platforms have become much more restrictive than they used to be. The limits on how many tweets we could pull at one time made it infeasible to build a dataset large enough to train a model, and we could not get geolocation data. We needed a different approach.

We found a useful dataset on the website CrisisLex, which collects datasets specific to NLP applications in disaster scenarios. This dataset contained tweet IDs for all geotagged tweets (6 million+) from affected areas of the Eastern Seaboard during the 11-day period surrounding Hurricane Sandy's landfall (10/22/2012-11/2/2012). The tweets include all content, not just disaster-related tweets.

Specific tweet IDs can be used to pull the full dictionary for each tweet, and pulling by ID allows you to collect up to 900 tweets every fifteen minutes, much more than the normal limit. The data also include geotags with precise coordinates of the sending location and timestamps to the second, which allows live mapping and closely simulates the kind of information we would expect FEMA to have access to in a real-world application.

Of all disaster types, hurricanes are probably the best suited to producing a generalizable lexicon, as they often involve a combination of flooding, fires, building damage/collapse, wind strong enough to down trees, explosions, injuries, deaths, trapped or stranded individuals, etc. The training data could, of course, be expanded to include a variety of disaster types in the future.

Further Information on the CrisisLex Sandy Tweet ID Dataset:

  • CrisisLex: SandyHurricaneGeoT1 Geo-Located tweets from the 2012 Sandy Hurricane
  • Contents: tweet IDs for 6,556,328 tweets, representing all tweets from October 22nd, 2012 (the day Sandy formed) until November 2nd, 2012 (the day it dissipated).
  • Sampling method: tweets were geotagged and located in Washington DC or one of 13 US states affected by Sandy: Connecticut, Delaware, Massachusetts, Maryland, New Jersey, New York, North Carolina, Ohio, Pennsylvania, Rhode Island, South Carolina, Virginia, West Virginia. This filter was based on a set of bounding boxes that covered the desired area, which also covered small parts of adjacent states.
  • Labels: no labels. The corpus contains tweets both relevant and irrelevant to Hurricane Sandy (no content-based filter was applied).
  • Data format: comma-separated values (.csv) files containing the tweet ID, the timestamp of the tweet, and a field indicating whether the tweet contains the word "sandy".

Cleaning Sandy ID List

In [562]:
#Imports the list of 6 Million IDs
data = pd.read_csv('../../release.txt',sep= ' ', header = None)
In [667]:
data.head()
Out[667]:
0
0 tag:search.twitter.com,2005:260244087901413376...
1 tag:search.twitter.com,2005:260244088203403264...
2 tag:search.twitter.com,2005:260244088161439744...
3 tag:search.twitter.com,2005:260244088819945472...
4 tag:search.twitter.com,2005:260244089080004609...
In [564]:
data.shape
Out[564]:
(6554744, 1)
In [668]:
#Split data into respective columns, create datetime column, drop unnecessary columns

df = data[0].map(lambda x: x.split('\t'))
df = pd.DataFrame(df)
df['timestamp'] = df[0].map(lambda x: x[1])
df['tweet_id'] = df[0].map(lambda x: x[0])
df['bool'] = df[0].map(lambda x: x[2])
df = df.drop(columns=0)
df['tweet_id'] = df['tweet_id'].map(lambda x: x.split(':')[2])
df['datetime'] = pd.to_datetime(df['timestamp'])
df = df.drop(columns=['timestamp','bool'])
In [576]:
df.head()
Out[576]:
tweet_id datetime
0 260244087901413376 2012-10-22 05:00:00
1 260244088203403264 2012-10-22 05:00:00
2 260244088161439744 2012-10-22 05:00:00
3 260244088819945472 2012-10-22 05:00:00
4 260244089080004609 2012-10-22 05:00:00

Building ID List for Sandy Training Set

We wanted to simulate the sort of access to Twitter that FEMA would have during a crisis, i.e., all geotagged and timestamped tweets within some window of time. We also anticipated that the classes in our final model would be unbalanced, because genuinely critical disaster tweets are quite rare. We therefore wanted to sample from the period that would contain as many of these as possible, so that we would have more of them to train on and rely less on bootstrapping.

Accordingly, we chose to sample from the window surrounding the landfall of Hurricane Sandy in New Jersey and New York (~8 PM ET, October 29th, 2012). We calculated that, given the rate limits, we could reasonably aim to pull about 180,000 tweets for the training set. We therefore pulled all tweets from the list for the three-hour period spanning from one hour before landfall to two hours afterward, approximately 7 PM to 10 PM that night.

In [325]:
dftime = df.sort_values('datetime')
dftime = dftime.reset_index(drop=True)

#Picked the time index corresponding approximately to landfall of the hurricane
#(timestamps are UTC, so 2012-10-30 00:00 UTC is about 8 PM EDT on 10/29)
dftime[dftime['datetime']=='2012-10-30 00:00:01'].head()

#creates our major id list, from approximate time of landfall in NJ to about 3 hours later 
#(i.e., 180000 tweets down the timestamp-sorted ID list), all geotagged tweets in that time
#continuous timespan also allows us to show complete minute to minute mapping visualization
sandy_id_time = dftime.loc[4428365:4608434,:]
sandy_id_time.to_csv('./csvs/sandy_train_ids.csv',index=False)

Pulling Tweets with Twython

Each of us created a Twitter developer account and submitted an application for the project. We each created two sets of Twitter API app keys so that we could pull tweets in tandem, allowing for higher-volume data collection.

In [12]:
CONSUMER_KEY = 'INSERT KEY HERE'
CONSUMER_SECRET = 'INSERT KEY HERE'

OAUTH_TOKEN = 'INSERT KEY HERE'
OAUTH_SECRET = 'INSERT KEY HERE'
In [13]:
twitter = Twython(CONSUMER_KEY, CONSUMER_SECRET, OAUTH_TOKEN, OAUTH_SECRET)

Pulling a Single Tweet

In [14]:
sandy_train_ids['tweet_id'][2]
Out[14]:
260244088161439744
In [16]:
twitter.show_status(id='260244088161439744')
Out[16]:
{'created_at': 'Mon Oct 22 05:00:00 +0000 2012',
 'id': 260244088161439744,
 'id_str': '260244088161439744',
 'text': '@NOT_savinHOES Not r yu upp',
 'truncated': False,
 'entities': {'hashtags': [],
  'symbols': [],
  'user_mentions': [{'screen_name': 'NOT_savinHOES',
    'name': '01.18🀸🏽\u200d♀️',
    'id': 293455555,
    'id_str': '293455555',
    'indices': [0, 14]}],
  'urls': []},
 'source': '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_user_id': 293455555,
 'in_reply_to_user_id_str': '293455555',
 'in_reply_to_screen_name': 'NOT_savinHOES',
 'user': {'id': 401231570,
  'id_str': '401231570',
  'name': 'Jay 🀷🏽\u200d♂️',
  'screen_name': 'JayyLive202',
  'location': 'Washington, DC, USA',
  'description': 'ΠΊΞΉΠΈg ΚΞ±ΠΌΡ”Ρ•πŸŽ©πŸ† 25. D[M]V | πŸ‘»:OfficialJaymes πŸ“ΈInsta:Oh.ThatsAlexx',
  'url': 'https://t.co/a6Q76c6YNl',
  'entities': {'url': {'urls': [{'url': 'https://t.co/a6Q76c6YNl',
      'expanded_url': 'https://link.dosh.cash/JAMESR117',
      'display_url': 'link.dosh.cash/JAMESR117',
      'indices': [0, 23]}]},
   'description': {'urls': []}},
  'protected': False,
  'followers_count': 803,
  'friends_count': 576,
  'listed_count': 0,
  'created_at': 'Sun Oct 30 07:41:09 +0000 2011',
  'favourites_count': 326,
  'utc_offset': None,
  'time_zone': None,
  'geo_enabled': True,
  'verified': False,
  'statuses_count': 18359,
  'lang': 'en',
  'contributors_enabled': False,
  'is_translator': False,
  'is_translation_enabled': False,
  'profile_background_color': 'C0DEED',
  'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png',
  'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png',
  'profile_background_tile': True,
  'profile_image_url': 'http://pbs.twimg.com/profile_images/1070198512380461056/zthonwAC_normal.jpg',
  'profile_image_url_https': 'https://pbs.twimg.com/profile_images/1070198512380461056/zthonwAC_normal.jpg',
  'profile_banner_url': 'https://pbs.twimg.com/profile_banners/401231570/1543990185',
  'profile_link_color': '0084B4',
  'profile_sidebar_border_color': 'FFFFFF',
  'profile_sidebar_fill_color': 'DDEEF6',
  'profile_text_color': '333333',
  'profile_use_background_image': True,
  'has_extended_profile': True,
  'default_profile': False,
  'default_profile_image': False,
  'following': False,
  'follow_request_sent': False,
  'notifications': False,
  'translator_type': 'none'},
 'geo': {'type': 'Point', 'coordinates': [40.2371544, -76.8206691]},
 'coordinates': {'type': 'Point', 'coordinates': [-76.8206691, 40.2371544]},
 'place': {'id': 'b8ce2948ffafff5f',
  'url': 'https://api.twitter.com/1.1/geo/id/b8ce2948ffafff5f.json',
  'place_type': 'city',
  'name': 'Bressler-Enhaut-Oberlin',
  'full_name': 'Bressler-Enhaut-Oberlin, PA',
  'country_code': 'US',
  'country': 'United States',
  'contained_within': [],
  'bounding_box': {'type': 'Polygon',
   'coordinates': [[[-76.831479, 40.22417],
     [-76.811937, 40.22417],
     [-76.811937, 40.242082],
     [-76.831479, 40.242082]]]},
  'attributes': {}},
 'contributors': None,
 'is_quote_status': False,
 'retweet_count': 0,
 'favorite_count': 0,
 'favorited': False,
 'retweeted': False,
 'lang': 'en'}

Building a Pull Loop

We were aware that the cap was 900 tweets per 15 minutes. While we could set up a Twython call to run through some number of indices in our ID list, once it hit the 900-call limit it would continue to churn through indices without actually retrieving anything. The 900 count also includes the significant percentage (~25%) of these old Sandy tweets that have since been deleted and yield no information; they still count as API calls. So, in effect, we would move 900 indices down the tweet ID list per pull, regardless of how many tweets came back.

We needed a way to automate looping through this pull process at least a few times so as not to end up with some ridiculous number of csvs, and so we could leave things running. In order to achieve this, we needed the loop to know where to pick up on each new pull.

In [56]:
#Goes through a block of 900 indices from the tweet list, from some start index.
#Pulls tweet dictionary if tweet exists and adds to lst, otherwise continues to next index.
#Returns a list with the start index for the next pull, and the lst containing all the tweet dicts.
#Tqdm allows us to track the progress visually with each pull, as seen below.

def tweet_pull(start_index):
    lst = []
    for i in tqdm(range(start_index,start_index+900)):
        tweet = None
        try:
            dct = twitter.show_status(id=str(sandy_train_ids['tweet_id'][i]))
            lst.append(dct)
            tweet = dct['text']
        except Exception:
            pass #deleted or unavailable tweet; it still counts against the rate limit
        #Record the text (or None) next to the ID; .at replaces the deprecated set_value
        sandy_train_ids.at[i, 'tweet_texts'] = tweet
    return [start_index+900,lst]
In [59]:
#While loop for tweet pulling
#Simply set count = [index you want to start at], and set while count < [index you want to end at] (multiple of 900, ideally)
#The while loop will run through all the indices in pulls of 900, shifting the start index up 900 each time,
#and sleeping 15 minutes after each pull to make sure we are never drawing on empty.
#The pulls are added to a single list of dictionaries that can be converted into a df.

count = 0
tweets = []
while count < 3600:
    pull = tweet_pull(count)
    tweets.extend(pull[1])
    count = pull[0]
    sleep(900) #15 minute limit
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 900/900 [01:19<00:00, 13.37it/s]
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 900/900 [01:18<00:00, 11.41it/s]
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 900/900 [01:25<00:00, 10.57it/s]
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 900/900 [01:29<00:00, 10.06it/s]
In [61]:
sandy = pd.DataFrame(tweets)
sandy.head()
Out[61]:
contributors coordinates created_at entities extended_entities favorite_count favorited geo id id_str ... lang place possibly_sensitive possibly_sensitive_appealable retweet_count retweeted source text truncated user
0 None None Mon Oct 22 05:00:00 +0000 2012 {'hashtags': [{'text': 'ilovemaggiesmith', 'in... NaN 0 False None 260244087901413376 260244087901413376 ... en {'id': 'c55500e8cd2a1c64', 'url': 'https://api... NaN NaN 0 False <a href="http://twitter.com" rel="nofollow">Tw... "I suppose she has an appropriate costume for ... False {'id': 24753438, 'id_str': '24753438', 'name':...
1 None {'type': 'Point', 'coordinates': [-76.8206691,... Mon Oct 22 05:00:00 +0000 2012 {'hashtags': [], 'symbols': [], 'user_mentions... NaN 0 False {'type': 'Point', 'coordinates': [40.2371544, ... 260244088161439744 260244088161439744 ... en {'id': 'b8ce2948ffafff5f', 'url': 'https://api... NaN NaN 0 False <a href="http://twitter.com/download/android" ... @NOT_savinHOES Not r yu upp False {'id': 401231570, 'id_str': '401231570', 'name...
2 None {'type': 'Point', 'coordinates': [-79.20266541... Mon Oct 22 05:00:00 +0000 2012 {'hashtags': [], 'symbols': [], 'user_mentions... NaN 0 False {'type': 'Point', 'coordinates': [34.69318931,... 260244088819945472 260244088819945472 ... en {'id': '6057f1e35bcc6c20', 'url': 'https://api... NaN NaN 0 False <a href="http://twitter.com/download/android" ... Hit and Run is so sad.. False {'id': 123368790, 'id_str': '123368790', 'name...
3 None {'type': 'Point', 'coordinates': [-71.04264063... Mon Oct 22 05:00:00 +0000 2012 {'hashtags': [], 'symbols': [], 'user_mentions... NaN 0 False {'type': 'Point', 'coordinates': [42.44167162,... 260244089080004609 260244089080004609 ... en {'id': '75f5a403163f6f95', 'url': 'https://api... NaN NaN 0 False <a href="http://twitter.com/download/iphone" r... Who's up? False {'id': 47812293, 'id_str': '47812293', 'name':...
4 None {'type': 'Point', 'coordinates': [-80.08961896... Mon Oct 22 05:00:00 +0000 2012 {'hashtags': [], 'symbols': [], 'user_mentions... NaN 0 False {'type': 'Point', 'coordinates': [42.09464892,... 260244089985957888 260244089985957888 ... en {'id': '29aaa88d9fe74b50', 'url': 'https://api... NaN NaN 0 False <a href="http://twitter.com/download/android" ... @augustushazel idk I'm just ugly or annoying o... False {'id': 274750107, 'id_str': '274750107', 'name...

5 rows Γ— 26 columns

In [62]:
sandy.shape
Out[62]:
(2569, 26)
In [63]:
sandy.to_csv('example_pull.csv')

We coordinated to split up the job of running these pull loops, working through a total of 180,000 tweet IDs from the landfall of Sandy. We used Google Colab and Google Cloud compute to run our pull loops over long stretches, collecting the data into a handful of CSVs, which we combined to create our main dataset for model building.

Combining All the Pulls

In [174]:
data1 = pd.read_csv('alexpulls.csv')
data2 = pd.read_csv('eamonpulls.csv')
data3 = pd.read_csv('javipulls.csv')

data = pd.concat([data1,data2,data3], ignore_index=True)
data = data.drop(columns=['Unnamed: 0'])
data.dropna(subset=['id','text','created_at'], inplace=True)
In [ ]:
data.to_csv('./csvs/sandy_landfall.csv')

Filtering for Disaster Tweets: Figure Eight Model

One of the first issues we ran into with our newly collected dataset was scale. We knew we would have to manually label the critically relevant tweets, but it was infeasible to search through over 100k tweets to do this. So how could we whittle the larger body of tweets down to a selection of at least disaster-related tweets that we could then go through manually?

We discovered a dataset on the website Figure Eight that had also been posted to Kaggle. It was essentially a dataset of tweets from the time and location of recent disaster scenarios which had been labeled for whether they referred to the actual disaster or not. So we figured that by training on this dataset first, we could develop a way to whittle our list down to at least disaster-related tweets.

In [ ]:
#Cleaning Kaggle Data (remove http addresses with regex, lowercase the text)
df = pd.read_csv('./csvs/figure_eight_dataset.csv')
df['text'] = df['text'].apply(lambda x: re.split('http:\/\/.*', str(x))[0])
df['text'] = df['text'].apply(lambda x: re.split('https:\/\/.*', str(x))[0])
df = df[df.duplicated('text')==False]
df['text']= df['text'].map(lambda x: x.lower())

Train Test Split

In [ ]:
X = df['text']
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y , stratify=y, random_state=24)

Logistic Regression GridSearch, Multiple Vectorizers

We ran Logistic Regression GridSearches with multiple vectorizers (Tfidf, Hashing, Count) and found that CountVectorizer performed the best.

In [ ]:
pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('model', LogisticRegression())
])

params = {
    'vect__min_df':[2, 4, 6],
    'vect__ngram_range':[(1,2),(1,3)],
    'vect__stop_words':[None, 'english'],
    'model__penalty':['l1','l2'],
    'model__C':[0.01, 0.1 ,1]
}

gs = GridSearchCV(pipe, params, cv=5, verbose=2, n_jobs=-1)

gs.fit(X_train, y_train)

print('Best Params: ',gs.best_params_)
print('Best Estimator Score Train: ', gs.best_estimator_.score(X_train, y_train))
print('Best Estimator Score Test: ', gs.best_estimator_.score(X_test, y_test))
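The "multiple vectorizers" comparison can also be folded into a single grid search by treating the vectorizer itself as a searchable pipeline step. A minimal sketch of that pattern (illustrative grids, not the exact cells we ran):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('model', LogisticRegression(solver='liblinear'))
])

#GridSearchCV can swap out the 'vect' step itself, just like any other hyperparameter
params = {
    'vect': [CountVectorizer(min_df=2, ngram_range=(1, 2)),
             TfidfVectorizer(min_df=2, ngram_range=(1, 2)),
             HashingVectorizer(ngram_range=(1, 2))],
    'model__C': [0.1, 1]
}

gs = GridSearchCV(pipe, params, cv=5, n_jobs=-1)
gs.fit(X_train, y_train)
print(gs.best_params_['vect'])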

Random Forest GridSearch, Multiple Vectorizers

In [ ]:
pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('model', RandomForestClassifier() )
])

params = {
    'vect__min_df':[2,4,6],
    'vect__stop_words':[None ,'english'],
    'vect__ngram_range':[(1,2),(1,3)],
    'model__n_estimators':[75, 200, 500],
    'model__max_depth':[5, 25, 75],
    'model__min_samples_split':[2,3,4]
}

gs = GridSearchCV(pipe, params, cv=5, verbose=2, n_jobs=-1)

gs.fit(X_train, y_train)

print('Best Params: ',gs.best_params_)
print('Best Estimator Score Train: ', gs.best_estimator_.score(X_train, y_train))
print('Best Estimator Score Test: ', gs.best_estimator_.score(X_test, y_test))

SVM GridSearch, Multiple Vectorizers

In [ ]:
pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('model', svm.SVC())
])

params = {
    'vect__min_df':[2,4,6],
    'vect__stop_words':[None ,'english'],
    'model__kernel':['rbf','poly'],
    'model__C':[.1, 1, 10]
}

gs = GridSearchCV(pipe, params, cv=5, verbose=2, n_jobs=-1)

gs.fit(X_train, y_train)

print('Best Params: ',gs.best_params_)
print('Best Estimator Score Train: ', gs.best_estimator_.score(X_train, y_train))
print('Best Estimator Score Test: ', gs.best_estimator_.score(X_test, y_test))

Naive Bayes GridSearch, Multiple Vectorizers

In [ ]:
pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('model', MultinomialNB())
])

params = {
    'vect__min_df':[1,2,4, 6],
    'vect__stop_words':[None, 'english'],
    'model__alpha': [0.1,1,10]
}

gs = GridSearchCV(pipe, params, cv=5, verbose=2, n_jobs=-1)

gs.fit(X_train, y_train)

print('Best Params: ',gs.best_params_)
print('Best Estimator Score Train: ', gs.best_estimator_.score(X_train, y_train))
print('Best Estimator Score Test: ', gs.best_estimator_.score(X_test, y_test))

XGBoost GridSearch, Multiple Vectorizers

We ran a number of different boosting classifiers with multiple vectorizers, of which XGBoost performed the best.

In [ ]:
pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('model', XGBClassifier())
])

params = {
    'vect__min_df':[2,4,6],
    'vect__stop_words':[None,'english'],
    'model__n_estimators': [700, 1500],
    'model__min_samples_split':[2,4,6],
    'model__max_depth':[3,5]
}

gs = GridSearchCV(pipe, params, cv=5, verbose=2, n_jobs=-1)

gs.fit(X_train, y_train)

print('Best Params: ',gs.best_params_)
print('Best Estimator Score Train: ', gs.best_estimator_.score(X_train, y_train))
print('Best Estimator Score Test: ', gs.best_estimator_.score(X_test, y_test))
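For reference, a sketch of what one of the other boosting comparisons might have looked like, using sklearn's GradientBoostingClassifier (imported at the top of the notebook but not shown in a cell here); the grid values are illustrative, not the exact grid we ran:

pipe = Pipeline([
    ('vect', CountVectorizer(min_df=2)),
    ('model', GradientBoostingClassifier())
])

params = {
    'model__n_estimators': [200, 500],
    'model__learning_rate': [0.05, 0.1],
    'model__max_depth': [3, 5]
}

gs = GridSearchCV(pipe, params, cv=5, verbose=2, n_jobs=-1)
gs.fit(X_train, y_train)

print('Best Params: ', gs.best_params_)
print('Best Estimator Score Test: ', gs.best_estimator_.score(X_test, y_test))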

VotingClassifier with best LogReg, XGBoost, RandomForest (Final Model Choice)

Ultimately we went with a VotingClassifier that combined the predictive input of multiple models (LogReg, XGBoost, and Random Forest).

In [ ]:
model = Pipeline([
        ('count_vect', CountVectorizer(min_df=2,
                                  ngram_range=(1, 3))),
        ('clf', VotingClassifier(estimators=[("pip1", LogisticRegression(penalty='l2', C=0.1)),
                                             ("pip2", XGBClassifier(n_estimators=1500, min_samples_split = 2, max_depth= 3)),
                                             #("pip3", svm.SVC(kernel='rbf',C=10,probability=True)),
                                             #("pip4", MultinomialNB(alpha=1)), 

                                             ("pip5", RandomForestClassifier(max_depth=75,
                                                                             min_samples_split=4,
                                                                             n_estimators=200))],voting='soft'))
         ])
model.fit(X_train, y_train)

Our cross-val score for this model was 0.802, against a baseline accuracy of about 0.58. This is the best result we saw in this modeling process.
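That figure comes from cross-validating the full pipeline on the training split; a sketch of the computation (the fold count of 5 is assumed):

from sklearn.model_selection import cross_val_score

#Mean cross-validated accuracy of the vectorizer + VotingClassifier pipeline
scores = cross_val_score(model, X_train, y_train, cv=5, n_jobs=-1)
print(scores.mean())  #reported above as ~0.802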

Training the VotingClassifier on the Figure Eight Dataset

In [ ]:
vectorizer = CountVectorizer(min_df=2, ngram_range = (1,3))

X_train_features = vectorizer.fit_transform(X_train)
X_test_features = vectorizer.transform(X_test)


model = VotingClassifier(estimators=[("pip1", LogisticRegression(penalty='l2', C=0.1)),
                                  ("pip2", XGBClassifier(n_estimators=1500, min_samples_split = 2, max_depth= 3)),
                                  ("pip3", RandomForestClassifier(max_depth=75, min_samples_split=4, n_estimators=200))]
                                    ,voting='soft')
model.fit(X_train_features, y_train)

print('Best Estimator Score Train: ', model.score(X_train_features, y_train))
print('Best Estimator Score Test: ', model.score(X_test_features, y_test))

Confusion Matrix

In [ ]:
predictions = model.predict(X_test_features)

def make_nice_conmat(y_test, preds):

    cmat = confusion_matrix(y_test, preds)
    print(f'Accuracy: {accuracy_score(y_test, preds)}')
    print(classification_report(y_test, preds))
    return pd.DataFrame(cmat, columns=['Predicted ' + str(i) for i in ['Regular Tweets','Disaster Tweets']],\
            index=['Actual ' + str(i) for i in ['Regular Tweets','Disaster Tweets']])

make_nice_conmat(y_test, predictions)

Train VotingClassifier on Entire Figure 8 Dataset

In [ ]:
vectorizer2 = CountVectorizer(min_df=2, ngram_range = (1,3))

X_features = vectorizer2.fit_transform(X)


model2 = VotingClassifier(estimators=[("pip1", LogisticRegression(penalty='l2', C=0.1)),
                                  ("pip2", XGBClassifier(n_estimators=1500, min_samples_split = 2, max_depth= 3)),
                                  ("pip3", RandomForestClassifier(max_depth=75, min_samples_split=4, n_estimators=200))]
                                    ,voting='soft')
model2.fit(X_features, y)
In [ ]:
model2.score(X_features,y)
In [ ]:
predictions2 = model2.predict(X_features)
df['predictions'] = predictions2

Filtering for Disaster Tweets: Crisis Words

We were less than completely satisfied with the output of the Kaggle model; in particular, looking closely at the dataset, we felt that some of the labeling was suspect or flat-out incorrect. We had to time-box ourselves to a degree in this project, but we decided that a keyword list could be useful for further filtering for tweets of disaster relevance. We felt it was better to cast a wider net, as we would be loath to miss a truly critical tweet.

We located a long, standardized list of disaster-related keywords on the CrisisNLP website, imported it, and added some words of our own. The CrisisNLP list appeared to be designed for circumstances in which the words are pre-tokenized; since we could filter on substrings of the complete tweet texts, we condensed the list somewhat to avoid redundancy.
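The filter itself is just a substring check of each tweet against this list. A minimal sketch of the idea (the application cells further below implement the same test with a sum of str.find calls; the DataFrame and column names in the usage comment are hypothetical):

def contains_crisis_word(text, keywords):
    #Returns 1 if any crisis keyword appears as a substring of the (already lowercased) tweet text
    return int(any(word in text for word in keywords))

#Hypothetical usage on a DataFrame with a lowercased 'text' column:
#df['keyword_flag'] = df['text'].map(lambda t: contains_crisis_word(t, keys_slist))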

In [589]:
#This is where we added words of our own to the single word items in the existing CrisisNLP list.  
#This is the list we will use to filter.  An enterprising user of our model could edit this list themselves very easily:
keys_slist = [
 '911',
 'affected',
 'aftermath',
 'ambulance',
 'arrest',
 'attack',
 'authorities',
 'blast',
 'blood',
 'body',
 'bodies',
 'bomber',
 'bombing',
 'braces',
 'buried',
 'burn',
 'casualties',
 'cleanup',
 'collapse',
 'collapsed',
 'conditions',
 'crash',
 'crisis',
 'damage',
 'danger',
 'dead',
 'deadly',
 'death',
 'destroyed',
 'destruction',
 'devastating',
 'disaster',
 'displaced',
 'donate',
 'dozens',
 'dramatic',
 'drown',
 'emergency',
 'enforcement',
 'evacu',
 'events',
 'explosion',
 'fallen',
 'fatalities',
 'fire',
 'flood',
 'flooding',
 'floodwaters',
 'footage',
 'gun',
 'help!',
 'hurricane',
 'imminent',
 'impacted',
 'injured',
 'injuries',
 'inundated',
 'investigation',
 'killed',
 'landfall',
 'levy',
 'looting',
 'magnitude',
 'massive',
 'military',
 'missing',
 'nursing',
 'outage',
 'paramedic',
 'prayers',
 'praying',
 'ravaged',
 'recede',
 'recover',
 'redcross',
 'relief',
 'rescue',
 'rescuers',
 'residents',
 'responders',
 'rubble',
 'saddened',
 'safety',
 'scream',
 'seismic',
 'seizure',
 'shelter',
 'shooter',
 'shooting',
 'shot',
 'soldier',
 'storm',
 'stream',
 'surviving',
 'survivor',
 'terrifying',
 'terror',
 'toll',
 'tornado',
 'torrential',
 'toxins',
 'tragedy',
 'tragic',
 'troops',
 'twister',
 'unaccounted',
 'urgent',
 'victims',
 'volunteers',
 'warning',
 'wounded']

Apply Filters to Generate List of Disaster Tweets

Apply Figure 8 Model to Sandy Landfall Tweets

In [ ]:
data = pd.read_csv('./csvs/sandy_landfall.csv')

#Cleaning
data.dropna(subset=['id','text','created_at'], inplace=True)
data['text'] = data['text'].str.replace('[^\w\s#@/:%.,_-]', '', flags=re.UNICODE)
data['text'] = data['text'].map(lambda x: x.lower())
data['text'] = data['text'].apply(lambda x: re.split('http:\/\/.*', str(x))[0])
data['text'] = data['text'].apply(lambda x: re.split('https:\/\/.*', str(x))[0])
data = data[data.duplicated('text')==False]

#Apply VotingClassifier model to data
with open('kaggle_model_2.pkl', 'rb') as file:
    model = pickle.load(file)
data['predicted']= model.predict(data['text'])

Apply Keyword Filter to Sandy Landfall Tweets

In [ ]:
#Isolates tweets that have not been predicted as disaster tweets by the Figure 8 model
non_predicted = data[data['predicted']==0]
predicted = data[data['predicted']==1]

#Recall that keys_slist is the list of crisis keywords defined explicitly above
#Map through the tweets not labeled as disaster and flag them as disaster if they contain any keyword
#(str.find returns -1 when a keyword is absent, so adding 1 makes the sum positive only if at least one keyword appears)
non_predicted['predicted'] = non_predicted['text'].map(lambda x: 1 if sum([x.find(i) + 1 for i in keys_slist])>0 else 0)

#Combines keyword flags and Figure 8 model flags to produce the set of all disaster tweets
keywords =non_predicted[non_predicted['predicted']==1]
disaster_tweets = pd.concat([keywords,predicted], ignore_index=True)

#Produces set of all regular tweets
regular_tweets = non_predicted[non_predicted['predicted']==0]
In [ ]:
disaster_tweets.to_csv('./csvs/disaster_tweets.csv')
regular_tweets.to_csv('./csvs/regular_tweets.csv')

Identifying Critical Tweets Among Disaster Tweets

We combed through the approximately 9,000 tweets identified by the Figure Eight model and keyword filtering as disaster tweets. We manually identified about 900 that we felt met the criteria for "critical", i.e., novel information that could be immediately relevant to emergency responders.

In [773]:
#Reload disaster tweets with labels now included
disaster_labeled = pd.read_csv('./csvs/manual_tags_final.csv')
In [774]:
disaster_labeled.head()
Out[774]:
id text created_at coordinates geo place user entities in_reply_to_user_id lang predicted tag
0 263097193563566080 rt @passantino: wow: floodwaters inundate grou... Tue Oct 30 01:57:13 +0000 2012 {'type': 'Point', 'coordinates': [-80.7245571,... {'type': 'Point', 'coordinates': [41.025018, -... {'id': 'de599025180e2ee7', 'url': 'https://api... {'id': 373792493, 'id_str': '373792493', 'name... {'hashtags': [], 'symbols': [], 'user_mentions... NaN en 1.0 1.0
1 263097199213285377 some folks maybe feelin lonely bein in a storm... Tue Oct 30 01:57:15 +0000 2012 {'type': 'Point', 'coordinates': [-76.8325555,... {'type': 'Point', 'coordinates': [38.89184242,... {'id': '19f2fcdf0d209467', 'url': 'https://api... {'id': 250905822, 'id_str': '250905822', 'name... {'hashtags': [], 'symbols': [], 'user_mentions... NaN en 1.0 0.0
2 263097201469845504 #itjustgotreal ... #iphonealerts @ #hurricane... Tue Oct 30 01:57:15 +0000 2012 NaN NaN {'id': 'b6ea2e341ba4356f', 'url': 'https://api... {'id': 112052977, 'id_str': '112052977', 'name... {'hashtags': [{'text': 'itJustGotReal', 'indic... NaN und 1.0 0.0
3 263097206125518849 @ahurricanesandy hey #sandy get your ass down ... Tue Oct 30 01:57:16 +0000 2012 {'type': 'Point', 'coordinates': [-79.99661002... {'type': 'Point', 'coordinates': [35.97201904,... {'id': 'aef8c3da277ca498', 'url': 'https://api... {'id': 266676966, 'id_str': '266676966', 'name... {'hashtags': [{'text': 'Sandy', 'indices': [21... 364217289.0 en 1.0 0.0
4 263097208923099136 everytime the wind picks up it sounds like som... Tue Oct 30 01:57:17 +0000 2012 {'type': 'Point', 'coordinates': [-80.4487703,... {'type': 'Point', 'coordinates': [37.213948, -... {'id': '820684853e0f1eb6', 'url': 'https://api... {'id': 479075523, 'id_str': '479075523', 'name... {'hashtags': [{'text': 'hurricanessuck', 'indi... NaN en 1.0 0.0

The tag column indicates whether the tweet was manually tagged as "critical". This is based on our first tagging run, in which we cast a relatively wide net. The predicted column indicates whether the tweet is disaster related (all of these were, of course), in case we want to concatenate with our regular tweets data (predicted = 0) later on.

The above is also a good look at what our tweet data dictionaries contain. Not everything is immediately relevant to the project here, but we left in anything that might be used predictively later on. Obviously, we have the specific ID that we can use to look up the tweet online, or use as an index. We also have:

  • the text of the tweet
  • the timestamp
  • the geo-coordinate information
  • a place dictionary that contains information about the area including the city/neighborhood of origin
  • a dictionary of information about the user sending the tweet
  • a dictionary that grabs any hashtags the tweet contains
  • a column that allows us to tell if the tweet was a reply or not
  • the language of the tweet

Labeled Disaster Tweets - Data Cleaning

In [776]:
#Check if we missed or deleted any cells while tagging manually
disaster_labeled[disaster_labeled['tag'].isnull()==True]
Out[776]:
id text created_at coordinates geo place user entities in_reply_to_user_id lang predicted tag
30 263097391169826817 @rwzombie fuck #hurricanesandy keep voting Tue Oct 30 01:58:00 +0000 2012 {'type': 'Point', 'coordinates': [-74.95308309... {'type': 'Point', 'coordinates': [40.04676075,... {'id': '31fbce652077706d', 'url': 'https://api... {'id': 25303398, 'id_str': '25303398', 'name':... {'hashtags': [{'text': 'hurricanesandy', 'indi... 43469093.0 en 1.0 NaN
In [777]:
disaster_labeled['tag'] = disaster_labeled['tag'].fillna(0)
disaster_labeled['tag'].value_counts()
Out[777]:
0.0    8694
1.0     836
Name: tag, dtype: int64
In [780]:
disaster_labeled.dropna(subset=['id','text','created_at'],inplace=True)

#english language only
disaster_labeled = disaster_labeled[disaster_labeled['lang']=='en']
#create readable datetime column and sort by datetime
disaster_labeled['datetime'] = pd.to_datetime(disaster_labeled['created_at'])
disaster_labeled = disaster_labeled.sort_values('datetime').reset_index(drop=True)

#Selects columns of interest
disaster_labeled = disaster_labeled[['id','text','datetime','geo','predicted','tag']]

#remove retweets (begins with rt)
disaster_labeled['text'] = disaster_labeled['text'].map(lambda x: np.nan if x.find('rt')==0 else x)
disaster_labeled.dropna(subset=['text'],inplace=True)

#remove retweets (contains rt elsewhere)
disaster_labeled['text'] = disaster_labeled['text'].map(lambda x: np.nan if 'rt' in x.split(' ') else x)
disaster_labeled.dropna(subset=['text'],inplace=True)
In [865]:
disaster_labeled.head()
Out[865]:
id text datetime geo predicted tag
0 2.630677e+17 @godseyg i was arrested about 36 hours later. ... 2012-10-30 00:00:02 {'type': 'Point', 'coordinates': [39.12081393,... 1.0 0.0
1 2.630677e+17 wish i was wiff my love during this disaster #... 2012-10-30 00:00:04 {'type': 'Point', 'coordinates': [41.28866479,... 1.0 0.0
2 2.630677e+17 im a hurricane vet.. so #hurricanesandy isnt a... 2012-10-30 00:00:04 {'type': 'Point', 'coordinates': [39.12088316,... 1.0 0.0
3 2.630677e+17 the scorpions - rock you like a hurricane #201... 2012-10-30 00:00:04 {'type': 'Point', 'coordinates': [42.60095122,... 1.0 0.0
4 2.630677e+17 lights out @ frankenstorm apocalypse - hurrica... 2012-10-30 00:00:06 {'type': 'Point', 'coordinates': [40.79093941,... 1.0 0.0
In [798]:
disaster_labeled.to_csv('./csvs/disaster_labeled.csv',index=False)

disaster_labeled[disaster_labeled['tag']==1].to_csv('./csvs/critical.csv', index=False)
disaster_labeled[disaster_labeled['tag']==0].to_csv('./csvs/disaster_nonrel.csv', index=False)

Weighting and Bootstrapping

After manually tagging, we had roughly a 9:1 ratio of non-critical to critical disaster tweets, i.e., a baseline accuracy of about 90%. To address this we oversampled the tweets we had labeled critical (bootstrapping), expanding the ~900 critical tweets to 9,000 and balancing the classes evenly.

While manually tagging, we had often felt that some of the tweets we labeled critical were borderline, while others were immediate, dire, and potentially highly useful. We felt this should be reflected in our bootstrapping by using weighted probabilities in the oversampling. We therefore went through the ~900 critical tweets and weighted them 1, 3, 5, or 10 based on their degree of relevance to emergency personnel who might be scanning Twitter for information, and then normalized these weights to sum to 1 to create sampling probabilities for the bootstrap.

In [ ]:
weighted = pd.read_csv('./csvs/weighted_critical.csv')

#Converts manually assigned weights to bootstrap weighted probabilities
weighted['weight']= weighted['weight'].map(lambda x: x/weighted['weight'].sum())

boot = weighted.sample(9000,replace=True, weights=weighted['weight'])
boot.drop(columns = ['weight'],inplace=True)

nonrel = pd.read_csv('./csvs/disaster_nonrel.csv')
nonrel.drop(columns='Unnamed: 0', inplace=True)

disaster = pd.concat([nonrel, boot], ignore_index=True)
#So now we have a training set composed of roughly equal classes - half are irrelevant disaster-related tweets, 
#and the other half are critical tweets that have been bootstrapped with weights to appear multiple times.

We tried a few different things, but ended up settling on a VotingClassifier similar to our earlier Figure Eight model, combining trained LogReg, XGBoost, and Random Forest classifiers.

In [ ]:
X = disaster['text']
y = disaster['tag']

critical_from_disaster_model = Pipeline([
        ('count_vect', CountVectorizer(min_df=2,
                                  ngram_range=(1, 3))),
        ('clf', VotingClassifier(estimators=[("pip1", LogisticRegression(penalty='l2', C=0.1)),
                                  ("pip2", XGBClassifier(n_estimators=1500, min_samples_split = 2, max_depth= 3)),
                                  ("pip3", RandomForestClassifier(max_depth=75, min_samples_split=4, n_estimators=200))]
                                    ,voting='soft'))
         ])
critical_from_disaster_model.fit(X, y)

We observed cross-validation scores in the mid-to-high 90s with this training process; however, this was largely an artifact of heavily weighted, bootstrapped duplicate tweets appearing in both the train and test folds. The only way to really assess performance is to run the model on new test data and see how its predictions hold up.
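One way to get a less optimistic estimate is to hold out a slice of the original labeled tweets before any oversampling, bootstrap only the training portion, and score on the untouched holdout. A sketch of that pattern, assuming the weighted and nonrel frames defined above carry the same text and tag columns (plus the weight column on the critical side):

from sklearn.base import clone

#Split the original labeled tweets BEFORE any oversampling
crit_train, crit_test = train_test_split(weighted, test_size=0.2, random_state=37)
non_train, non_test = train_test_split(nonrel, test_size=0.2, random_state=37)

#Bootstrap only the training-side critical tweets
crit_boot = crit_train.sample(9000, replace=True, weights=crit_train['weight'])

train = pd.concat([non_train, crit_boot.drop(columns=['weight'])], ignore_index=True)
holdout = pd.concat([non_test, crit_test.drop(columns=['weight'])], ignore_index=True)

eval_model = clone(critical_from_disaster_model)
eval_model.fit(train['text'], train['tag'])

#The holdout stays imbalanced, so recall on the critical class matters more than raw accuracy
preds = eval_model.predict(holdout['text'])
print(classification_report(holdout['tag'], preds))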

WordCloud: Words that Help Identify Critical Tweets Among Disaster Tweets

As one would expect, the words here are somewhat specific to Hurricane Sandy. Continuing to train on a wider variety of disasters (wildfires, tornadoes, mass shootings, earthquakes, and perhaps even everyday emergencies) would allow the model to generalize further.
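A minimal sketch of one way to produce a cloud like this from the tweets tagged critical, using the wordcloud package imported at the top (an illustration, not necessarily the exact cell we ran):

critical_text = ' '.join(disaster_labeled[disaster_labeled['tag'] == 1]['text'])

#WordCloud handles tokenization and its own stopword list; collocations=False avoids paired-word duplicates
cloud = WordCloud(width=800, height=400, background_color='white',
                  collocations=False).generate(critical_text)

plt.figure(figsize=(12, 6))
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()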

Testing 2-Phase Model {1. Disaster Filter, 2. Predict Critical} on New Data

Building ID List for Sandy Test Set

For our test set, we wanted to sample from a wider section of the hurricane's course, aiming for about 40,000 tweets. We decided to randomly select these from the roughly 2 million tweets in the post-landfall window, i.e., from the end of our training window (10 PM on 10/29) through the last timestamp in the ID list (11/2).

In [535]:
#creates our test set id list
#Random selection of tweets over the rest of the hurricane, so we can show geographic progression in mapping as well
#40,500 was chosen as n because it is a convenient multiple of 2700 (900*3) for the tweet pulls

sandy_random = dftime.loc[4608435:6000000,:]
sandy_random = sandy_random.sample(40500,replace=False,random_state=37)
sandy_random = sandy_random.sort_values('datetime')
sandy_random = sandy_random.reset_index()
In [551]:
sandy_random.to_csv('./csvs/sandy_random.csv',index=False)

As with the main set, we ran tweet pulls in tandem and combined the resulting csvs to complete the test set.

Cleaning the Test Data

In [758]:
testset = pd.read_csv('./csvs/testset.csv')

testset.dropna(subset=['id','text','created_at'],inplace=True)

#english language only
testset = testset[testset['lang']=='en']

#create readable datetime column and sort by datetime
testset['datetime'] = pd.to_datetime(testset['created_at'])
testset = testset.sort_values('datetime').reset_index(drop=True)

#Selects columns of interest
testset = testset[['id','text','datetime','geo']]

#remove retweets (begins with rt)
testset['text'] = testset['text'].map(lambda x: np.nan if x.find('rt')==0 else x)
testset.dropna(subset=['text'],inplace=True)

#remove retweets (contains rt elsewhere)
testset['text'] = testset['text'].map(lambda x: np.nan if 'rt' in x.split(' ') else x)
testset.dropna(subset=['text'],inplace=True)

testset['id'] = testset['id'].map(lambda x: int(x))

testset['text'] = testset['text'].str.replace('[^\w\s#@/:%.,_-]', '', flags=re.UNICODE)
testset['text']= testset['text'].map(lambda x: x.lower())
testset['text'] = testset['text'].apply(lambda x: re.split('http:\/\/.*', str(x))[0])
testset['text'] = testset['text'].apply(lambda x: re.split('https:\/\/.*', str(x))[0])

testset.to_csv('./csvs/testset.csv',index=False)

Running the Filter/Models on the Test Data to Identify Critical Tweets

In [ ]:
#Run the Figure 8 model (the VotingClassifier trained on the Figure Eight data, here loaded as kaggle_model)
#and separate the test set into regular and disaster tweets.
testset['is_disaster']= kaggle_model.predict(testset['text'])
test_regular = testset[testset['is_disaster']==0]
test_disaster = testset[testset['is_disaster']==1]

#Apply keyword filter to tweets identified as regular by Figure 8
#to identify additional disaster tweets, and combine to collect all disaster tweets
test_regular['is_disaster'] = test_regular['text'].map(lambda x: 1 if sum([x.find(i) + 1 for i in keys_slist])>0 else 0)
contains_keywords = test_regular[test_regular['is_disaster']==1]
test_disaster = pd.concat([contains_keywords,test_disaster], ignore_index=True)

#Run Critical Tweet Identifier model on disaster tweets
test_disaster['is_critical']= critical_from_disaster_model.predict(test_disaster['text'])

#Save all regular tweets (not disaster related)
test_regular.to_csv('./csvs/test_regular.csv',index=False)

#Save all disaster-related tweets that are not relevant
test_disaster_nonrel = test_disaster[test_disaster['is_critical'] == 0]
test_disaster_nonrel.to_csv('./csvs/test_disaster_nonrel.csv',index=False)

#Save all tweets that were identified as disaster AND critical
test_critical = test_disaster[test_disaster['is_critical'] == 1]
test_critical.to_csv('./csvs/test_critical.csv',index=False)

Results on Test Data

Out of the 25,000 or so tweets in the cleaned test set, our two-phase model identified about 50 as potentially critical; you can review these below. Given the limitations we had to work with in terms of bootstrapping, and simply not having that much data for the problem, we were quite pleased with the performance. Most of the tweets identified as critical do seem to correspond to situations where the hurricane is actively creating problems, and you sense immediacy in most of them.

Comparing this to a selection of the non-relevant disaster tweets and the regular tweets (seen below), we can see that the model does a good job of homing in on real issues while ignoring other mentions of the storm. This is a remarkable result given the constraints, and we have reason to expect that the model could improve considerably with an appropriately large and diverse dataset.

Tweets That Were Identified as Critical

In [13]:
for i in test_critical['text']:
    print(i)
    print('----------')
@wwegames @wwe @cmpunk step outside my house and swim in the flood
----------
the streets could be flooded and houses could be floating and coach would be like 5:30am practice on main street dress warm
----------
everywhere is flooded
----------
awesome my street is flooded..
----------
im alive no damage either a tree across the street broke. we have power
----------
the remains of the tree that fell on tracks by stony brook yesterday, was mostly dead already.
----------
no tree damage this morning @ mcdevitt field. the field is ok, just several leaves amp small branches down.
----------
@primetime_lerch @realfrankgeib24 @abegotwheels @a_stump_ swimming through the flooded half of red tail
----------
dragon #sandy ripped off roof #alexandria apt. building so loud, child thought it was a dragon no injuries @wusa9
----------
 rt @dopeasjordan: a wire just exploded outside on my block and electrocuted and set a dog on fire, w o w.
----------
woke up w no power just candles on couldnt text or make calls or use data lights out everywhere trees down felt like i was in #walkingdead
----------
creek road and 926 flooded #wcugis
----------
flooded fdr #sandy #frankenstorm @bsheridan  @ fdr drive
----------
whats left of our cherry tree... #hurricanesandy #picfx @ casa de urena
----------
my next door neighbors tree is threatening to kill our electricity #hurricanesandy
----------
welcome to l-town #hurricanesandy #boat #in #a #tree  @ water logged l-town
----------
fire on my street. geeze. probably the wesley kids. ahaha. #justassuming
----------
wait jones beach theater is flooded, it better be ready for one direction this summer
----------
tree down on carroll st. right in front of where that brownstone collapsed earlier this year.
----------
drove around town to see the damage amp where all these power company trucks are its bad up here. power company gets de
----------
520 steps to my powerless 26th floor apartment, just 12 steps to my powerful recover @ 46th street clubhouse
----------
@joshelliottabc live across the street from hospital 1st ave and 33rd flooded
----------
tree on house after sandy. no damage inside. thank god.
----------
lots of trees down in prospect park #sandy  @ prospect park west
----------
@fdny: qns 6-6 breezy point fire, 50 homes completely destroyed by fire ouch....
----------
front of building destroyed by sandy: a chelsea building near the corner of 8th amp w. 14th st. partially...
----------
@shawnmichaels 2 straight days off, 2 confirmed deaths in baltimore, md. 200,000 no power. #222 #sandy
----------
still no power at hemlock and spruce st in north andover, #nationalgrid #sandy
----------
river flooding 71st street @ bobby wagner walk
----------
access to hoboken from jersey city next to impossible. grove street. marin blvd., jersey ave flooded.
----------
in the movie 2012 nyc floods... nyc is flooded right now #freaky
----------
@severnaprkpatch tree hit house in berrywood south
----------
@jzanetti25 a tree fell on the house and killed two kids
----------
destruction around coned, the water actually moved cars around in the street. @ coned 750 east 16 street
----------
flood warning issued for glen burnie, md
----------
trees down in central park. #sandy  @ rustic overlook
----------
#sandy update: no power, cell, or landline service in huntington station, #longisland since 4pm yesterday. trees partially blocking roads
----------
tree blocking street in wantagh, ny #sandy
----------
tree in the middle of pierce rd #sandy2012  @ pierce rd
----------
@sandrabookman7 heres a tree down shot, taken at murdock n neried
----------
major tree damage in theodore roosevelt park. #ues #sandynyc  @ theodore roosevelt park museum park
----------
bloomberg: zone a evacuation order is in effect until all clear from building inspectors, which has not happened yet.
----------
@newzphotos grant rd tree on a car also...three calls
----------
downed trees across lubber street at quaker path in stony brook, n.y. @ quaker path
----------
just passed a massive tree down on the grounds of the museum of natural history. audible gasp from the unusually quiet bus passengers.
#nyc
----------
water street flooded :/ hurricane sandy can suck it : #hurricane #hurricanesandy #water #s @ long warf park
----------
stay away from east avenue and wall street truck just took out all power line
----------
closed dirty alley or street request at 2217 etting st baltimore
----------
in numbers: 27,373 @coned customers in chelsea without power 96,257 in cooper square 21,281 in kips bay, for starters. #sandy
----------
#wellington closed due to flooding on rt 18 both eb/wb between sr 58 and oh-301 #traffic
----------
#cuyahoga closed due to flooding on w 150th st north of i 480 #traffic
----------
In [14]:
for i in test_disaster_nonrel['text'].head(50):
    print(i)
    print('----------')
on a day when i am afraid, i will trust you, god -- psalm 56:3 pray for those enduring #hurricane #sandy
----------
fuck #hurricanesandy
----------
... just spent 2 hours chillin in my car. now im back in the house praying my phone lasts me the whole night lmao i hate not having power
----------
@kmbutler4 like the wizard of oz lol i really meant kansas remember the tornado and the flying monkeys lol
----------
@annanicolexxo i will when she gets back from the tests. im having a mini orgasm because im in an emergency room gt.lt
----------
@verizonwireless boy you guys are gonna hear it tomorrow when this hurricane is over...
----------
@steve_whosoever @joshuanason hi steve, joshua is from ri, im from pa. im sure youve been through a lot of hurricanes living on the gulf
----------
@allen_strk: really excited to see how the teams play out for survivor series. they suck
----------
the real risks from hurricane #sandy:
----------
so im without electricity and very limited service. a prayer to everyone who is being affected or has been affected badly.
----------
i dont know whats worse this damn storm or my brothers snoring #fml
----------
@msoblount: @princephillyock working in the dark at arcadia university ...persevering through the hurricane . love my son
----------
this one time, at hurricane camp, i stuck a package of ramen noodles in
my vagina #fucksandy #survival
----------
getting thru the blackout #hurricanesandy caused here in #nyc  with the help of @arminvanbuurens @djmagazine 2012 awards set #trancefamily
----------
power outages: one more pragmatic reason to be #vegan. #lessfunkyfridgeforthewin #sandy
----------
@dailydeadnews evil dead remake looks intense. im excited bout carrie. but thoughts on lords of salem trailer was, ehhhh...
----------
@aaronwanat yes its a hurricane that hasnt really affected western ny but the bigger cities
----------
hoodlums #troublemakers #introuble #hurricanesandy  @ weldin hall
----------
praying for everyone at #nyu hospital. #sandy
----------
uff, so much bad news. i feel like were sitting pretty while the rest of the city burns/floods. you can do it, new york
----------
next hurricane should be called hurricane taylor sounds cute : hehe
----------
#NAME?
----------
@madmaxjr_ man stfu with that dead shit
----------
want to sleep with my window open, but the wind is going crazy #hurricaneproblems
----------
through this freak storm i still find something to make me relaxed and smile @jcovv @kpod23 see you two thanksgiving
----------
this big bad hurricane had better at least knock out the power at rio, because im not feeling class tomorrow #dontlikethecold
----------
@kasiyan93 its not that would be a tragedy calllin it tho
----------
thanks for all the well wishes, plz know i am in a low risk area and many around me could use prayers more. lt3 @andrewneylon @ericneedleman
----------
dear hurricane sandy,
can you take obama with you
sincerely, smart americans
----------
waiting for the power to go out. @ frankenstorm apocalypse - hurricane sandy w/ 404 others
----------
@nbfirefighter90 your such a tard
----------
you will never find new yorks times square so empty -ever #hurricane sandy# at times square pic
----------
pokemon song/britney sing-a-long, apples to apples, and destroying our beer supply #survival #sandy #frankenstorm @kimmiefg @krystinlemieux
----------
@stephgrillz angy poooos house is getting destroyed
----------
power. i want it. #frankenstorm
----------
the hurricane sandy twitter parody is terrible
----------
@billycury: top 3 fav games this week: borderlands 2, the walking dead, skyrim dusted it off #billystopgames #woohwaah
----------
storms are creepy i need my bella
----------
sandbags in place at the 205 north location. praying that all of that was unnecessary. #noflood
----------
enjoying my power outage evening. thank for a roof over my head.
----------
blasting @hardwell #thecreator
----------
@sorrynotsoorry we arent allowed to drive unless its an emergency.
----------
@willwilkinson: no valid government-issued picture id, no rescue. but you cant buy cigarrettes without an id, so ....
----------
thoughts are with staff amp 200 patients evacuating nyu hospital after generator failure. #hurricanesandy #nyc
----------
this hurricane wasnt even bad
----------
the walking dead is 75% off at @steam_games get it before nov 1
----------
@ladygaga thanks so much for ur prayers still have a long way to go with storm just hanging in there love you
----------
i hope everyones okay during this storm. my power went out but have the generator goiing.
----------
to bad hurricane sandy wasnt big enough to wipeout the whole world.
----------
everyone stay safe #scbd #hurricanesandy #goodnight
----------

Tweets That Were Identified as Regular, Non-Disaster

In [17]:
for i in test_regular['text'].head(50):
    print(i)
    print('----------')
didnt even text me
----------
i dont understand bitch
----------
@marylandprobz i gotta get one of those where can i
----------
couldnt be more excited
----------
@hitmanholla @scizas word stone cold.open a can of whip ass on dat boy and dump budweiser on him after it...lol
----------
ride or die .. we gon make it out this hood
----------
i dont understand bitch lol
----------
knock knock bitch im in the house now whats up
----------
@rm9_era lmao why you wanna do that haha
----------
i just messaged you cause i was bored yea im sure thats the reason
----------
@kweadilovewale yo ijus turned my phone on. my power is out so im tryin save battery
----------
@_percyyy good one -_-
----------
listening to @brothalynchhung season of da siccness right now.
----------
@mikepereira would he be considered a defenseless receiver
----------
just went outside to check out sandy, pretty scary stuff, trees on top of cars etc.
----------
#100thingsaboutme if i could wear a tiedye shirt and sweats every single day of my life, you had better believe i would.
----------
@leeeexis @xburke13x  omggg.. where are these pics
----------
@classictwenty3 he makes me upset
----------
@rookiewriter8 seems like hk is the safest place to be tonight #sandy
----------
how tf you accidentally vote for romney  lol
----------
, she a lil bop
----------
laid 3 little angels down to bed, i just pray demons dont wake up in there places  for school tomorrow lol help aunt jesse lol
----------
@kradick22 im not complaining it was a simple tweet
----------
@darth_talbert its going on right now lol
----------
@halenspencer yes and the boys :
----------
@xxjransom2xx @thagavster youre damn right just the thought of you coming here gets me rock solid.
----------
the haves and the have nots in manhattan. the view from my blacked out hotel room on west 16th.
----------
sandy turn off de lights
----------
hearing that nyc has plunged into darkness puts my stomach in a knot
----------
her leo self better not take advantage of his pisces kindness
----------
its crazy how bored a person can get from doing absolutely nothing
----------
@burnymacc @roxinoel94 i love my jack haha
----------
lets just pretend that this was all just a dream.
----------
i didnt change. you just never knew me.
----------
temp: 39.7f  wind:n at 8.7kts barometer: 992.2mb and falling quickly rain since midnight: 1.29in relative humidity: 95%  #wvwx
----------
@jboochh high tides over water should be receding now  good to know everyones safe
----------
i cant let these little things slip out of my mouth... cause its you, oh its you, its you they add up tolt3
----------
guess its time to go to bed now.  hopefully the power will be back soon.
----------
@mgee_x3  family guyyyyy
----------
@katie_newport @allisonrockey yay were good. were having a brown out-low flickering lights. its like 1873 up here.
----------
please think before you speak
----------
eff too bad this episode was on the other night
----------
@23solesandhoes she not ugly tho
----------
fail.
----------
she gonna say im late
----------
one ice down.... one to go #drunkyyyy
----------
@boxing he should rematch quartey, tito, or mosley.
----------
@johnpattwell yeah same, now we need to drink cause ive got nothing else to do
----------
choke my chicken, slap my wanker, a girl lets me watch her with her tits out when im spankin then i thank her
----------
i wish i had friends to stay up with #sigh
----------

Create Combined Landfall Dataset with Class Labeling

We wanted to build a cleaned, combined dataset of our Sandy landfall tweets so that we could use their geotags to map how disaster and critical tweets stand out amid the sea of all tweets. For this we return to our main dataset.

In [813]:
#This is the portion of Sandy Landfall Tweets (our main pull) that was not labeled 
#as disaster tweets by either the Figure 8 Model or the Keyword Filtering - so regular tweets.
regular_tweets = pd.read_csv('./csvs/regular_final.csv')
In [814]:
regular_tweets.head()
Out[814]:
id text created_at coordinates geo place user entities in_reply_to_user_id lang predicted tag
0 263097187775434752 if anybody needs a 2nd shift job and can pass ... Tue Oct 30 01:57:12 +0000 2012 NaN NaN NaN {'id': 146665477, 'id_str': '146665477', 'name... {'hashtags': [], 'symbols': [], 'user_mentions... NaN en 0 0
1 263097188446523393 of course when the voice is on, my tv decides ... Tue Oct 30 01:57:12 +0000 2012 {'type': 'Point', 'coordinates': [-81.33815822... {'type': 'Point', 'coordinates': [41.15005871,... {'id': '45a0ea3329c38f9f', 'url': 'https://api... {'id': 65144874, 'id_str': '65144874', 'name':... {'hashtags': [{'text': 'ThanksSandy', 'indices... NaN en 0 0
2 263097191260901378 @ken_fedor oh hell yeah. we need too Tue Oct 30 01:57:13 +0000 2012 {'type': 'Point', 'coordinates': [-74.10466037... {'type': 'Point', 'coordinates': [40.87194808,... {'id': '86fc60f26e1639cc', 'url': 'https://api... {'id': 331299957, 'id_str': '331299957', 'name... {'hashtags': [], 'symbols': [], 'user_mentions... NaN en 0 0
3 263097191881646080 ok... enough sandy, time to go away. no real... Tue Oct 30 01:57:13 +0000 2012 {'type': 'Point', 'coordinates': [-77.092894, ... {'type': 'Point', 'coordinates': [38.978183, -... {'id': '864ff125241f172f', 'url': 'https://api... {'id': 302586627, 'id_str': '302586627', 'name... {'hashtags': [], 'symbols': [], 'user_mentions... NaN en 0 0
4 263097193530023936 niggas not loyal Tue Oct 30 01:57:13 +0000 2012 {'type': 'Point', 'coordinates': [-81.6315227,... {'type': 'Point', 'coordinates': [41.5384851, ... {'id': '0eb9676d24b211f1', 'url': 'https://api... {'id': 390423015, 'id_str': '390423015', 'name... {'hashtags': [], 'symbols': [], 'user_mentions... NaN en 0 0

Regular Tweets - Data Cleaning

In [815]:
#creates target column for regular tweets, all of which should be 0
regular_tweets['tag'] = np.zeros(len(regular_tweets))
In [816]:
regular_tweets.shape
Out[816]:
(105804, 12)
In [817]:
regular_tweets.isnull().sum()
Out[817]:
id                         0
text                       4
created_at                 2
coordinates            10378
geo                    10378
place                   3170
user                      24
entities                   2
in_reply_to_user_id    69816
lang                      27
predicted                  0
tag                        0
dtype: int64
In [818]:
regular_tweets.dropna(subset=['text','created_at'],inplace=True)
In [819]:
#english language only
regular_tweets = regular_tweets[regular_tweets['lang']=='en']

#create readable datetime column and sort by datetime
regular_tweets['datetime'] = pd.to_datetime(regular_tweets['created_at'])
regular_tweets = regular_tweets.sort_values('datetime').reset_index(drop=True)

#Selects columns of interest
regular_tweets = regular_tweets[['id','text','datetime','geo','predicted','tag']]

#remove retweets (begins with rt)
regular_tweets['text'] = regular_tweets['text'].map(lambda x: np.nan if x.find('rt')==0 else x)
regular_tweets.dropna(subset=['text'],inplace=True)

#remove retweets (contains rt elsewhere)
regular_tweets['text'] = regular_tweets['text'].map(lambda x: np.nan if 'rt' in x.split(' ') else x)
regular_tweets.dropna(subset=['text'],inplace=True)

regular_tweets['id'] = regular_tweets['id'].map(lambda x: int(x))

Combine Regular and Disaster

In [820]:
disaster_labeled = pd.read_csv('./csvs/disaster_labeled.csv')
In [821]:
disaster_labeled.head()
Out[821]:
id text datetime geo predicted tag
0 2.630677e+17 @godseyg i was arrested about 36 hours later. ... 2012-10-30 00:00:02 {'type': 'Point', 'coordinates': [39.12081393,... 1.0 0.0
1 2.630677e+17 wish i was wiff my love during this disaster #... 2012-10-30 00:00:04 {'type': 'Point', 'coordinates': [41.28866479,... 1.0 0.0
2 2.630677e+17 im a hurricane vet.. so #hurricanesandy isnt a... 2012-10-30 00:00:04 {'type': 'Point', 'coordinates': [39.12088316,... 1.0 0.0
3 2.630677e+17 the scorpions - rock you like a hurricane #201... 2012-10-30 00:00:04 {'type': 'Point', 'coordinates': [42.60095122,... 1.0 0.0
4 2.630677e+17 lights out @ frankenstorm apocalypse - hurrica... 2012-10-30 00:00:06 {'type': 'Point', 'coordinates': [40.79093941,... 1.0 0.0
In [822]:
#Combines disaster-related and regular tweets into a cleaned dataset containing all three classes
#(Regular, Disaster Non-Critical, and Disaster Critical).
#The convenience of this dataset is that it can easily be given a single class column labeling all three types,
#which will be useful for the geo-mapping visualizations.
sandy_combined = pd.concat([disaster_labeled,regular_tweets],ignore_index=True)
sandy_combined['datetime'] = pd.to_datetime(sandy_combined['datetime'])
sandy_combined = sandy_combined.sort_values('datetime')
sandy_combined['id'] = sandy_combined['id'].map(lambda x: int(x))
In [830]:
sandy_combined.to_csv('./csvs/sandy_combined.csv')

Geomapping Landfall Data in Tableau

For the first geomap, we want to map the full body of tweets (regular, disaster, and critical) from the main pull, i.e. all geotagged tweets from the northeastern seaboard from the time of Sandy's landfall in NJ/NY to about 15 hours later. Some additional processing is needed, and not all tweets in our pull have precise coordinates, although most do.

Geomap Preprocessing

In [831]:
sandy_geomap = pd.read_csv('./csvs/sandy_combined.csv')
In [833]:
#Remove all rows that don't have coordinates
sandy_geomap = sandy_geomap[sandy_geomap['geo'].notnull()]

#Splits geo column into latitude and longitude columns
sandy_geomap['geo'] = sandy_geomap['geo'].map(lambda x: x.split('[')[1].split(']')[0])
sandy_geomap['latitude'] = sandy_geomap['geo'].map(lambda x: x.split(',')[0])
sandy_geomap['longitude'] = sandy_geomap['geo'].map(lambda x: x.split(',')[1])
sandy_geomap['latitude'] = sandy_geomap['latitude'].map(lambda x: float(x))
sandy_geomap['longitude'] = sandy_geomap['longitude'].map(lambda x: float(x))

#Condenses to relevant information for geomapping
sandy_geomap = sandy_geomap[['id','text','latitude','longitude','datetime','predicted','tag']]
In [836]:
sandy_geomap.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 94043 entries, 0 to 104272
Data columns (total 7 columns):
id           94043 non-null int64
text         94043 non-null object
latitude     94043 non-null float64
longitude    94043 non-null float64
datetime     94043 non-null object
predicted    94043 non-null float64
tag          94043 non-null float64
dtypes: float64(4), int64(1), object(2)
memory usage: 5.7+ MB
In [847]:
sandy_geomap['class'] = sandy_geomap['predicted'] + sandy_geomap['tag']
In [848]:
sandy_geomap['class'].value_counts()
Out[848]:
0.0    86060
1.0     7297
2.0      686
Name: class, dtype: int64
In [849]:
sandy_geomap.to_csv('./csvs/sandy_geomap.csv',index=False)
In [850]:
len(sandy_geomap)
Out[850]:
94043
In [867]:
sandy_geomap.head()
Out[867]:
id text latitude longitude datetime predicted tag class
0 263067699821821952 i should probably start doing my hw 43.104663 -75.127981 2012-10-30 00:00:01 0.0 0.0 0.0
1 263067700270596096 good thing i practiced my i totally understand... 39.166361 -84.606048 2012-10-30 00:00:01 0.0 0.0 0.0
2 263067699133947904 @day_hammonds you already know 34.997820 -80.090215 2012-10-30 00:00:01 0.0 0.0 0.0
3 263067696235692032 @tiffany_niccole im sitting at this gate a6 to... 39.998155 -82.884330 2012-10-30 00:00:01 0.0 0.0 0.0
4 263067699213660160 girls who go commandogtgt 39.923918 -75.173551 2012-10-30 00:00:01 0.0 0.0 0.0
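
As an optional convenience (not a step we ran above), the numeric class codes could be mapped to readable names before the Tableau export so the legend is self-explanatory. This is a minimal sketch; the class_label column name is just illustrative.

#Optional: translate the numeric class codes into readable names for the Tableau legend.
#0 = regular, 1 = disaster non-critical, 2 = disaster critical; 'class_label' is illustrative.
class_names = {0.0: 'Regular', 1.0: 'Disaster Non-Critical', 2.0: 'Disaster Critical'}
sandy_geomap['class_label'] = sandy_geomap['class'].map(class_names)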

Real-Time Mapping Disaster Critical Tweets in the NYC Area during Hurricane Sandy Landfall (Tableau)

Live Streaming Disaster Tweets via Twython

To close, we wanted to return to the root purpose of the project and see what we could do about applying our model to real-time tweets. The primary difficulty was simply obtaining the tweets, but we were able to develop a method using Twython.

In [1]:
#This set of code uses Twython to stream tweets containing disaster keywords in real time.
#It also batches incoming tweets (here, 10 at a time) into dataframes and saves each batch
#to a timestamped CSV for potential analysis with our model.

from twython import TwythonStreamer
import pandas as pd
import datetime

class MyStreamer(TwythonStreamer):
    def __init__(self, *args, batch_size=10, **kwargs):
        super().__init__(*args, **kwargs)
        self.batch_size = batch_size
        self.rows = []          #accumulator for the current batch of tweets

    def on_success(self, data):
        print(data['text'])
        self.rows.append({'id': data['id'],
                          'text': data['text'],
                          'geo': data['geo'],
                          'created_at': data['created_at']})
        print(len(self.rows))
        if len(self.rows) == self.batch_size:
            #write the completed batch to a timestamped CSV, then reset the accumulator
            now = datetime.datetime.now()
            df = pd.DataFrame(self.rows, columns=['id', 'text', 'geo', 'created_at'])
            df.to_csv('./Live_tweets/Live Tweets ' + str(now) + '.csv', index=False)
            print('CSV SAVED')
            self.rows = []

stream = MyStreamer(app_key='INSERT KEY HERE',
                    app_secret='INSERT KEY HERE',
                    oauth_token='INSERT KEY HERE',
                    oauth_token_secret='INSERT KEY HERE')


stream.statuses.filter(track=['fire'])
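
To connect this back to the classification work, each saved batch could be scored with the trained model. Below is a minimal sketch assuming the fitted text-classification pipeline was pickled earlier; the model path is a hypothetical placeholder, not an actual project file.

#Minimal sketch: score the most recently saved batch of live tweets with a trained pipeline.
import glob
import pickle
import pandas as pd

with open('./models/disaster_pipeline.pkl', 'rb') as f:        #hypothetical path
    pipeline = pickle.load(f)

#grab the most recently saved batch written by the streamer above
latest_batch = sorted(glob.glob('./Live_tweets/*.csv'))[-1]
batch = pd.read_csv(latest_batch)

#flagged tweets could then be routed to the geomap or to human reviewers
batch['predicted'] = pipeline.predict(batch['text'])
print(batch[batch['predicted'] == 1]['text'])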

Conclusions

Our primary takeaways in this project thus far are as follows:

Despite initial setbacks, we arrived at a satisfying and effective two-phase process for identifying critical disaster tweets out of the sea of all incoming tweets on social media, and we were pleased with how well it isolated potentially relevant tweets in our test data. We also successfully visualized what an interface for receiving geotagged tweets in real time might look like, although we expect a more integrated and multi-functional mapping tool than Tableau might be necessary for real-world implementation. Finally, we were able to demonstrate proof of concept for live-streaming capture of tweets.

Given unrestricted access to the data available via Twitter, let alone Facebook, Snapchat, or Instagram (access that FEMA or a similar organization would likely have in the hypothetical scenario where they implemented this process), we feel we have demonstrated that it would be entirely possible to build a useful and accurate geo-feed of emergency response information for the area of an ongoing disaster.

An ideal implementation would perhaps be to continually label critical tweets after the fact and progressively train on different types of emergency tweets over many documents and events. Emergency personnel whose job it is to review the incoming tweets that the existing model has identified as critical could also manually label those tweets as truly critical or not as they come in, with that labeling feeding back to actively improve the model's performance over time. With enough development, we could reach a point where we have specialized models for different disaster types, which emergency personnel could turn on as appropriate once a disaster scenario is live.
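
As a rough illustration of that feedback loop (not code from the project), newly reviewed tweets could periodically be appended to the training data and the pipeline refit. The file paths and column names below are hypothetical placeholders; the pipeline mirrors the kind of text classifier used earlier in the project.

#Rough sketch of the human-in-the-loop retraining idea described above.
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train = pd.read_csv('./csvs/critical_training.csv')         #existing labeled tweets (hypothetical path)
reviewed = pd.read_csv('./csvs/reviewed_batch.csv')         #newly reviewed live tweets (hypothetical path)
combined = pd.concat([train, reviewed], ignore_index=True)

#refit the text-classification pipeline on the expanded training set
pipe = Pipeline([('tfidf', TfidfVectorizer(stop_words='english')),
                 ('clf', LogisticRegression(solver='liblinear'))])
pipe.fit(combined['text'], combined['critical'])            #'critical' label column is hypothetical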

All of this bodes well for future expansion of the project. With more time, other directions we might take include incorporating another social media platform, or locating a dataset of tweets from a disaster other than a hurricane to diversify our filtering process. We would also like to try to improve predictive accuracy with Word2Vec, which is well equipped to locate similar types of tweets (e.g. critical tweets) without explicit labeling.
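
As an exploratory sketch of that Word2Vec idea, each tweet could be embedded as the average of its word vectors and ranked by cosine similarity to tweets already tagged critical. This uses the gensim library, which is not part of the project so far; the tokenization is deliberately naive and purely illustrative.

#Sketch: rank tweets by embedding similarity to the centroid of known critical tweets.
import numpy as np
import pandas as pd
from gensim.models import Word2Vec

tweets = pd.read_csv('./csvs/sandy_combined.csv').dropna(subset=['text'])
tokens = tweets['text'].str.split()                     #naive whitespace tokenization

w2v = Word2Vec(sentences=tokens.tolist(), vector_size=100, window=5, min_count=2)

def tweet_vector(words):
    #average the vectors of in-vocabulary words; fall back to a zero vector
    vecs = [w2v.wv[w] for w in words if w in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

vectors = np.vstack(tokens.map(tweet_vector).tolist())

#centroid of tweets already tagged critical, then cosine similarity of every tweet to it
centroid = vectors[(tweets['tag'] == 1).to_numpy()].mean(axis=0)
sims = vectors @ centroid / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(centroid) + 1e-9)
tweets['similarity_to_critical'] = sims
print(tweets.sort_values('similarity_to_critical', ascending=False)['text'].head())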