The key feature of our project was modeling the change in Twitter users’ moods as a time series, which allowed us to use a vector autoregression model to infer information about stock prices.
It worked (user mood had minor predictive power on several stocks), and we will revisit the model after more data is collected.
My friend and collaborator Adam Delora and I took a course in data mining last semester, taught by Abdullah Mueen, PhD. As is typical with advanced courses, we had a semester project of considerable scale. Adam, being rather into finance (he has a background in economics), wanted to play with stock market data, and I, being rather into social network semantics, wanted to do something with data from a network. We decided to combine our powers: we would model information about network users’ moods to attempt to predict several stocks, then expand the project if we found something interesting.
We narrowed our choices down to Twitter as a data source due to its ease of access. Twitter has built a robust, easy-to-use API that many academic groups have used for research, and I hope that our dataset will be useful for other projects as well. Twitter has roughly 284 million monthly active users and sees about 500 million tweets per day, giving researchers a vast amount of text, network, and geolocation data to play with.
At a high level, we used latent semantic indexing (LSI) to extract a set of topics for each hour over a month’s worth of tweets, then assigned each hour a continuous semantic “score” using the AFINN database. This allowed us to use a vector autoregressive model to investigate the relationship between user mood and stock price. The AFINN database assigns coded valence values to common words so that the “mood” of a set of words can be quantified, and LSI reduces the space of each hour’s tweets, roughly 200,000 of them, cutting down on noise and other issues with AFINN.
Stock collection and preprocessing
NASDAQ market data was collected over a slightly longer period, 2014-08-30 – 2014-11-09. We tracked the following stocks:
- Apple (aapl)
- Amazon (amzn)
- Facebook (fb)
- Google (goog)
- Microsoft (msft)
- Twitter (twtr)
Stock data were analyzed using their hourly closing price.
Tweet collection and preprocessing
We collected tweets using the public Twitter Streaming API between 2014-10-17 and 2014-11-09, tracking words related to various tech stocks indexed by NASDAQ. Tweets were stored in a MongoDB database.
Tweet text was preprocessed to remove common stopwords and punctuation. Tweets were grouped into one-hour bins, and each bin was represented as a bag of words.
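A minimal sketch of this preprocessing step in pure Python (the actual pipeline used Gensim; the stopword list below is a tiny illustrative subset, not the one we used):

```python
import re
from collections import Counter

# Illustrative subset of a stopword list; the real pipeline used Gensim's.
STOPWORDS = {"the", "a", "an", "and", "or", "is", "to", "of", "in", "on", "i", "my"}

def tweet_to_tokens(text):
    """Lowercase, strip punctuation, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def bin_to_bow(tweets):
    """Represent one hour's tweets as a single bag-of-words."""
    bow = Counter()
    for text in tweets:
        bow.update(tweet_to_tokens(text))
    return bow

bow = bin_to_bow(["AAPL is on the move", "I love my new Apple watch!"])
```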
Latent semantic indexing was performed on the one-hour bins of tweets, giving a total of 218 hours included in analysis.
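Under the hood, LSI is a truncated SVD of the term-document matrix. A small NumPy sketch of the idea (the project used Gensim’s LsiModel; the terms and counts here are invented toy data):

```python
import numpy as np

# Toy term-document matrix: rows are terms, columns are hourly bins.
terms = ["apple", "iphone", "earnings", "pizza"]
X = np.array([
    [5.0, 4.0, 0.0],
    [3.0, 5.0, 0.0],
    [4.0, 3.0, 1.0],
    [0.0, 0.0, 6.0],
])

# Truncated SVD: keep the top-k singular vectors as "topics".
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
topics = U[:, :k]                          # term loadings per topic
doc_coords = (np.diag(s[:k]) @ Vt[:k]).T   # hourly bins in topic space

# Terms with the largest |loading| characterize each topic.
top_term = terms[int(np.argmax(np.abs(topics[:, 0])))]
```

The dominant topic picks out the correlated tech-word cluster, while the unrelated term separates onto its own topic.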
Each hour’s LSI topics were scored using the AFINN database, resulting in a single number indicating semantic valence for each hour. The LSI score was smoothed using rolling means and assessed for periodicity.
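The scoring step can be sketched as follows. The valence dictionary below is a tiny hand-picked subset in the AFINN style (real AFINN values range from -5 to +5), and the loading-weighted scheme is an illustrative assumption, not necessarily our exact weighting:

```python
# Tiny illustrative subset of AFINN-style valences (range -5..+5).
VALENCE = {"love": 3, "great": 3, "good": 3, "bad": -3, "hate": -3, "crash": -2}

def score_topic(topic):
    """Score one LSI topic: word valences weighted by topic loading."""
    return sum(abs(weight) * VALENCE.get(word, 0) for word, weight in topic)

def score_hour(topics):
    """Collapse an hour's topics into a single semantic valence number."""
    return sum(score_topic(t) for t in topics) / len(topics)

topics = [
    [("love", 0.6), ("iphone", 0.5)],   # topic as (word, loading) pairs
    [("crash", 0.4), ("market", 0.3)],
]
hour_score = score_hour(topics)
```

Words absent from the valence dictionary contribute nothing, which is one of the AFINN drawbacks the LSI reduction helps mitigate.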
Semantic data were combined with the stock data and standardized (z-scored) for visualization and analysis. A vector autoregression (VAR) model was fit to assess the predictive power of the semantic data against the stock data.
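Standardization here is just a per-series z-score; a pure-Python sketch:

```python
from statistics import mean, stdev

def zscore(series):
    """Standardize a series to zero mean and unit standard deviation."""
    m, s = mean(series), stdev(series)
    return [(x - m) / s for x in series]

# Hypothetical hourly closing prices for one stock.
z = zscore([100.0, 102.0, 101.0, 99.0, 103.0])
```

Putting every series on the same scale makes the joint plots readable and keeps the VAR coefficients comparable across variables.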
As an additional visualization, word clouds were generated for each hour bin using Andreas Mueller’s word_cloud package.
All work was done using Python: Tweets were harvested using Tweepy, text preprocessing was performed using Gensim, statistical analysis was performed with Statsmodels, and plots were made using Matplotlib & Seaborn.
Within an hour bin, LSI provides a set of topics, as shown below.
Word clouds, while simple, are a nice way to explore how the topics changed per hour. I generated clouds using the great word_cloud package.
Word clouds generated from topics on 2014-10-30, 11:00-12:00 and 12:00 - 13:00 are shown below. The LSI model was able to capture information about Tim Cook (Apple’s CEO) coming out as gay on that day.
The following plot shows the rolling means of the standardized LSI score and stock prices.
VAR models describe a set of variables as a linear function of their previous values. A $p$-th order VAR model is denoted by

$$ y_t = c + A_1 y_{t-1} + A_2 y_{t-2} + \cdots + A_p y_{t-p} + e_t $$

where $c$ is a vector of constants, each $A_i$ is a coefficient matrix, and $e_t$ is an error term.
VAR models are often written using the lag operator, which maps an element of a time series to the previous element. We determined the lag order using information-criterion-based order selection, which led us to use a lag of 1 hour in the model.
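Fitting a VAR(1) amounts to a multivariate least-squares regression of each observation on the previous one. A minimal NumPy sketch on synthetic data (the project itself used Statsmodels’ VAR, which also performs the information-criterion lag selection; the intercept is omitted here for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a 2-variable VAR(1): y_t = A_true @ y_{t-1} + e_t.
A_true = np.array([[0.5, 0.2],
                   [0.0, 0.3]])
T = 2000
y = np.zeros((T, 2))
for t in range(1, T):
    y[t] = A_true @ y[t - 1] + rng.normal(scale=0.1, size=2)

# Fit by OLS: regress y_t on y_{t-1}.
Y, X = y[1:], y[:-1]
B, *_ = np.linalg.lstsq(X, Y, rcond=None)
A_hat = B.T  # lstsq solves X @ B = Y with row vectors, so A = B.T
```

With enough observations the estimated coefficient matrix recovers the true one; the same machinery extends to lag $p$ by stacking $p$ lagged copies of the series as regressors.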
Impulse response analysis and a Granger causality test were performed on the fitted VAR model. A causal effect of the LSI score was found at the 90% level for the following variables in the model:
- Amazon stock price
- Google stock price
- Tweets per hour
The impulse responses are plotted below:
It is difficult to infer exactly why the semantic scores predicted some of these stocks, though there does seem to be a heavy overrepresentation of tweets of the form “I liked a photo from Facebook” or “I liked a video on Youtube”, which are automated tweets that some users have enabled in their accounts. The volume of these tweets may be a proxy for these services’ usage rates, which could predict minor fluctuations in stock prices. It is worth noting that the fluctuations in prices are minor, 1-2% of the total price, but they may not be so minor to investors. We will revisit this dataset after collection of broader data is complete in several months, and we will investigate other drawbacks of using the AFINN database as well as more effective filtering of the tweets.