Sentiment Analysis

Importing Libraries

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import datetime
import time

import nltk.data
from nltk import tokenize
import re

Importing Data

In [10]:
columns = ['Year', 'Week Start', 'Week End', 'Section', 'Number', 'Headline', 'Body Text']
articles_df = pd.read_csv('https://s3.amazonaws.com/cs109data/articles_db.csv', names=columns)
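
A variant of the same load (an optional sketch, assuming the date columns in the CSV parse cleanly) converts the week boundaries to datetimes up front via pandas' parse_dates:

In [ ]:
# Optional variant: parse 'Week Start' and 'Week End' as datetimes at load time
articles_df = pd.read_csv('https://s3.amazonaws.com/cs109data/articles_db.csv',
                          names=columns, parse_dates=['Week Start', 'Week End'])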
In [11]:
articles_df.head(100)
Out[11]:
Year Week Start Week End Section Number Headline Body Text
0 2000 2000-01-03 2000-01-09 business 0 There's no time to waste Over the past few months, President Clinton ha...
0 2000 2000-01-03 2000-01-09 business 1 Ford staff threaten strike Leaders of salaried staff at Ford are threaten...
0 2000 2000-01-03 2000-01-09 business 2 There's no time to waste Over the past few months, President Clinton ha...
0 2000 2000-01-03 2000-01-09 business 3 Cybersquatters with an eye for domain chance What's in a domain name? Loadsamoney, apparent...
0 2000 2000-01-03 2000-01-09 business 4 Clicks and mortar leave property crumbling away The property market looks in pretty good healt...
0 2000 2000-01-03 2000-01-09 business 5 Labour isn't working hard enough Few people I know would dissent from the propo...
0 2000 2000-01-03 2000-01-09 business 6 Dunces excel in the knowledge economy While all the fashionable blather is of a know...
0 2000 2000-01-03 2000-01-09 business 7 Russia Y2K bill 'shows West overreacted' Russia spent just $200 million on preparing fo...
0 2000 2000-01-03 2000-01-09 business 8 Briefcase BUY... Domino's Pizza company, which last week...
0 2000 2000-01-03 2000-01-09 business 9 TransTec duo kept silent on £11m claim Two former executive directors of TransTec, th...
0 2000 2000-01-03 2000-01-09 business 10 US capital firm hires QXL founder heads Net st... Tim Jackson, the 34-year-old journalist and mi...
0 2000 2000-01-03 2000-01-09 business 11 Figures lift M&S gloom First signs that Marks & Spencer may have arre...
0 2000 2000-01-03 2000-01-09 business 12 Experts predict ¼-point interest rate rise The city is expecting interest rates to rise b...
0 2000 2000-01-03 2000-01-09 business 13 Vodafone 'must bid billions in cash' The gloves came off in the biggest hostile tak...
0 2000 2000-01-03 2000-01-09 business 14 Getting smart on subsidies New Labour seems to hate 's' words. Socialism,...
0 2000 2000-01-03 2000-01-09 business 15 Shy dealmaker who kept quiet about costly details He was the ultimate City high flyer. Plucked f...
0 2000 2000-01-03 2000-01-09 business 16 Stockwatch Index beaters It was a mammoth task, calcula...
0 2000 2000-01-03 2000-01-09 business 17 Golden boy lost his Midas touch Geoffrey Robinson, the colourful and controver...
0 2000 2000-01-03 2000-01-09 business 18 @large It is as we had suspected all along: Netheads ...
0 2000 2000-01-03 2000-01-09 business 19 Health check Your personal happiness is greatly affected by...
0 2000 2000-01-03 2000-01-09 business 20 A wider Net for eBusiness shares The arrival of Y2K may have failed to wreak ha...
0 2000 2000-01-03 2000-01-09 business 21 Media Diary Domeward Bound Much of the festivity at the D...
0 2000 2000-01-03 2000-01-09 business 22 Infamous 5's star ratings Question: Which is the only mainstream broadca...
0 2000 2000-01-03 2000-01-09 business 23 How to 1. Accept that most people (including you) h...
0 2000 2000-01-03 2000-01-09 business 24 Rage against the dying of light If you've ever thought that buildings aren't a...
0 2000 2000-01-03 2000-01-09 business 25 Taxes, certainly - but on all our houses The recent controversy about the European Unio...
0 2000 2000-01-03 2000-01-09 business 26 Crash? What crash? In any normal market, it would be seen as a ro...
0 2000 2000-01-03 2000-01-09 business 27 Granada forces pace on ITV's fate Granada yesterday stepped up pressure on the g...
0 2000 2000-01-03 2000-01-09 business 28 Byers ultimatum for WTO The trade and industry secretary, Stephen Byer...
0 2000 2000-01-03 2000-01-09 business 29 Underside • Not everyone was taken by surprise at NatWes...
... ... ... ... ... ... ... ...
0 2000 2000-01-03 2000-01-09 uk-news 10 Welcome back to the craic A is for Assembly After decades of resistan...
0 2000 2000-01-03 2000-01-09 uk-news 11 Nelson bomb suspect arrested in the US The chief suspect in the murder of the Norther...
0 2000 2000-01-03 2000-01-09 uk-news 12 Irving ready for court battle over Holocaust The most emotive libel trial to be heard in Br...
0 2000 2000-01-03 2000-01-09 uk-news 13 Women who flee violence 'lack shelter' More than 50,000 women and children flee their...
0 2000 2000-01-03 2000-01-09 uk-news 14 Dealing with the end of the world time after time No one has seen the end of the world come roun...
0 2000 2000-01-03 2000-01-09 uk-news 15 Tory idea for schools to branch out Leading independent schools should be encourag...
0 2000 2000-01-03 2000-01-09 uk-news 16 Hindley may face brain surgery The moors murderer Myra Hindley may undergo em...
0 2000 2000-01-03 2000-01-09 uk-news 17 Gun law on streets of Manchester Armed police are to patrol parts of Manchester...
0 2000 2000-01-03 2000-01-09 uk-news 18 Bridging the gap: Walkway reveals gorgeous gorge One of the last inaccessible places in England...
0 2000 2000-01-03 2000-01-09 uk-news 19 Villagers break away from UK Residents of a village in East Sussex have dec...
0 2000 2000-01-03 2000-01-09 uk-news 20 Warmed by the flame of dance When Monica Mason's ballet shoe snagged on the...
0 2000 2000-01-03 2000-01-09 uk-news 21 In brief Price of coffee rises 10p The price of coffee...
0 2000 2000-01-03 2000-01-09 uk-news 22 How art treasures are stolen to order Christopher Brown is in sombre mood. "I have a...
0 2000 2000-01-03 2000-01-09 uk-news 23 Courts may get 'enforcers' to make debtors pay up A new breed of court "enforcers", with the pow...
0 2000 2000-01-03 2000-01-09 uk-news 24 Limpets threaten coast When part of Beachy Head fell into the Channel...
0 2000 2000-01-03 2000-01-09 uk-news 25 Lloyds sues over lost Shelley letter The poet Shelley, contemplating the ruins of a...
0 2000 2000-01-03 2000-01-09 uk-news 26 Briton's 24-hour ordeal in shark sea A British tourist whose family had all but giv...
0 2000 2000-01-03 2000-01-09 uk-news 27 Parents call for schools to bring back the cane A majority of parents want corporal punishment...
0 2000 2000-01-03 2000-01-09 uk-news 28 Morning-after pill trial hailed as success A project in Manchester which allows women to ...
0 2000 2000-01-03 2000-01-09 uk-news 29 Judges may double injury payouts The court of appeal will hold an unprecedented...
0 2000 2000-01-10 2000-01-16 business 0 Crunch time for euro in Lisbon Tony Blair wants the EU Summit in Lisbon in Ma...
0 2000 2000-01-10 2000-01-16 business 1 Confidence in a tarnished age Confidence is at the heart of economic policy....
0 2000 2000-01-10 2000-01-16 business 2 Media diary Victorian values It would be wrong to let Vi...
0 2000 2000-01-10 2000-01-16 business 3 Land of the free and home of the brave class a... There are 2 million guns in civilian hands in ...
0 2000 2000-01-10 2000-01-16 business 4 Old hand for new job It was teatime when Sir George Bull stepped ou...
0 2000 2000-01-10 2000-01-16 business 5 BA cabin crews face job losses British Airways plans to shed at least 2,500 c...
0 2000 2000-01-10 2000-01-16 business 6 BOC £700m sale means total break-up BOC is to spin off its world-beating vacuum pu...
0 2000 2000-01-10 2000-01-16 business 7 BNFL threatened by loss of ISO quality guarantee Nuclear reprocessor and generator British Nucl...
0 2000 2000-01-10 2000-01-16 business 8 Utility fat cats face pay curb Government pressure on the bosses of privatise...
0 2000 2000-01-10 2000-01-16 business 9 Stockwatch Merger medicine Merger activity notwithstandi...

100 rows × 7 columns

In [12]:
articles_df.shape
Out[12]:
(88745, 7)

Initial Data Exploration

Filtering articles based on relevance to the exchange rate: an article is kept if it mentions more than n keywords from a hand-picked list (Keywords.txt)

In [40]:
# Array of relevant keywords loaded from Keywords.txt
relevant_words = np.genfromtxt('Keywords.txt', dtype='str')
In [14]:
def find_relevant(text, n):
    # Return True if the text contains more than n of the relevant keywords
    text = str(text)
    matched_words = [word for word in relevant_words if (' ' + word + ' ') in text]
    return len(matched_words) > n
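
As a quick sanity check (a minimal sketch, assuming relevant_words has been loaded from Keywords.txt as above), the filter can be applied to a single article to see which keywords it matched; sample_body and matched are ad-hoc names.

In [ ]:
# Which keywords does the first article contain, and does it pass the n=3 threshold?
sample_body = str(articles_df['Body Text'].values[0])
matched = [word for word in relevant_words if (' ' + word + ' ') in sample_body]
print(matched)
print(find_relevant(sample_body, 3))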

Number of relevant articles per week

In [9]:
plt.figure(figsize=(10, 5))
for n in [3, 5, 7]:
    # Keep articles that mention more than n keywords and count them per week
    relevant_articles = [find_relevant(text, n) for text in articles_df['Body Text'].values]
    relevant_df = articles_df[relevant_articles]
    weekly_articles = relevant_df.groupby('Week Start').size().reset_index()
    plt.plot(weekly_articles[0], label='n=' + str(n))

plt.legend(loc='best')
plt.ylabel('Number of relevant articles')
axes = plt.gca()
axes.set_ylim([0,50])
Out[9]:
(0, 50)

Number of relevant articles per section:

In [10]:
n = 3
relevant_articles = [find_relevant(text, n) for text in articles_df['Body Text'].values]
relevant_df = articles_df[relevant_articles]
articles_per_section = relevant_df.groupby(['Week Start', 'Section']).size().reset_index()

plt.figure(figsize=(15, 5))
for section in articles_per_section['Section'].unique():
    # Weekly counts of relevant articles for this section
    articles_count = articles_per_section[articles_per_section['Section'] == section]
    plt.plot(range(0, len(articles_count[0])), articles_count[0], label=section)
plt.xlabel('Week', fontsize=20)
plt.ylabel('Number of relevant articles', fontsize=20)
plt.rc('xtick', labelsize=20) 
plt.rc('ytick', labelsize=20)
plt.grid(True)
plt.legend()
axes = plt.gca()
axes.set_ylim([0,30])
Out[10]:
(0, 30)

Sentiment Analysis using SentiWordNet

In [41]:
# Source code for sentiwordnet: http://www.nltk.org/_modules/nltk/corpus/reader/sentiwordnet.html
import nltk
from nltk.corpus import sentiwordnet as swn
from nltk.corpus import stopwords
In [16]:
# A simple function to obtain the overall sentiment of a text chunk.
# Method: tokenise the text, classify each token by part of speech (adjective/noun/verb/adverb),
# take the first matching SentiWordNet synset, and average the positive and negative scores.
# Note: you may need to download the required NLTK data separately, e.g.
# nltk.download('sentiwordnet'), nltk.download('wordnet'), nltk.download('stopwords'), nltk.download('punkt')

def simple_sentiment(text_chunk):
    cumulative_pos_sentiment = 0
    cumulative_neg_sentiment = 0
    index = 0

    # Tokenizing the sample text
    tokens = nltk.word_tokenize(text_chunk)
    # Removing words of length 2 or less
    tokens = [i for i in tokens if len(i) >= 3]
    # Removing stop words
    tokens = [word for word in tokens if word not in stopwords.words('english')]

    # 'a'/'n'/'v'/'r' represent adjective/noun/verb/adverb respectively; they index the SentiWordNet dictionary.
    # For each token, use the first synset found, trying the parts of speech in that order.
    for i in tokens:
        for pos in ('a', 'n', 'v', 'r'):
            synsets = list(swn.senti_synsets(i, pos))
            if synsets:
                cumulative_pos_sentiment += synsets[0].pos_score()
                cumulative_neg_sentiment += synsets[0].neg_score()
                index += 1
                break

    avg_pos_sentiment = cumulative_pos_sentiment / float(1 if index == 0 else index)
    avg_neg_sentiment = cumulative_neg_sentiment / float(1 if index == 0 else index)

    return (avg_pos_sentiment, avg_neg_sentiment)
In [17]:
sample_text = ('There\'s no time to waste,"Over the past few months, President Clinton has lost few opportunities to sing the praises of his favourite book. In November, he told a conference attended by Tony Blair that it was no longer necessary to choose between growth and environment. He took as evidence Natural Capitalism, The Next Industrial Revolution (Paul Hawken and Amory and Hunter Lovins, Earthscan, pounds 18.99), which \'proves beyond argument that there are presently available technologies, and those just on the horizon, which will permit us to get richer by cleaning, not by spoiling, the environment. This is a huge deal,\' Clinton said.   It\'s a suitably millennial claim. The authors argue that \'capitalism, as practised, is a financially profitable, nonsustainable aberration in human development... [which] does not fully conform to its own accounting principles. It liquidates its capital and calls it income. It neglects to assign any value to the largest stocks of capital it employs, the natural resources and living systems, as well as the social and cultural systems that are the basis of human capital.\'   Companies, as has been well said, are brilliant externalising machines, pocketing the profits and shunting the costs of their enterprise on to the collectivity. Thus, the NHS pays for the profits of big tobacco, and the Government subsidises cars by building roads. Put it another way, business is a free rider on the environment and the services it provides, services which have been tentatively valued by Nature magazine at $36 trillion annually, roughly the same as world GDP.   The reason business is so profligate with the the environment (the \'natural capital\' of the book) is that its goods are assumed by economists to be free and infinitely substitutable. So they are uncosted. But in reality they are not free. They are produced by the earth\'s 3.8-billion-year store of natural capital which, as the authors rehearse with hair-raising thoroughness, is being eroded so fast that by the end of this century there will be little left. And there is no conceivable substitute, for example, for the biosphere\'s ability to produce oxygen.   The authors manage to recast this rush to disaster as a story with a (potentially) happier ending. Their grounds for optimism are partly familiar American technological optimism, if natural resources were treated as scarce and expensive, then nanotechnology and biotechnology could multiply four or even tenfold the outputs from today\'s inputs. Hence Clinton\'s enthusiasm.   But more crucial to the project is a complete mental flip of what an \'output\' consists of (as Edwin Land once said, a great idea is often \'not having a new thought but stopping having an old one\').   At present, it is entirely conceivable that one-quarter or even half of the GDP of advanced countries makes not value but waste. Most industrial processes, and the economy as a whole, are inefficient , at best achieving 10 per cent of their potential likewise their products. A car uses just 1 per cent of the energy it burns to propel the driver, the rest to warm the atmosphere, deafen pedestrians and shift ponderous steel boxes between traffic jams.   Moreover, waste is cumulative, so an increasing income has to be spent on alleviating growth\'s byproducts, pollution, traffic accidents and congestion, crime. '
               'Hence the phenomenon of uneconomic growth, where increases in nominal wealth produce no net gain in quality of life or standard of living: in real terms 80 per cent of Americans are no better off than they were in 1979.   However, the grossness of the waste is, say the authors, also a measure of the huge potential for improvement if the spiral changed to virtuous. The secret is taking a systems view in which it is always more expensive to get rid of waste than to design it out in the first place. Given the wastefulness of most current systems, improvements of 10 to 100 times in overall efficiency are possible even with existing technology.   Much of what the Lovins and Hawken propose is not new. Frances Cairncross wrote about costing the earth 10 years ago, and Richard Schonberger coined the term \'frugal manufacturing\' in the 1980s. What is new is the way these ideas are brought together in a systems approach to business and the environment, and the coopting of markets as the mechanism which can be used to turn things around.   There is some irony here, of course. The greatest obstacle to \'natural capitalism\' in practice will be the vested interests and special pleading of those most vociferous champions of capitalist orthodoxy, US companies, which emerge from this book the masters of the perverse, not to mention grotesque, hidden subsidy, whether of agriculture, cars, or their wealthy executives.   Persuading them to confront their own bad faith will be no easy matter. But, as someone once said, the economy is a wholly-owned subsidiary of the environment, and time is running out for the parent to bring it to heel.')
simple_sentiment(sample_text)
Out[17]:
(0.0845771144278607, 0.04695273631840796)
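
For intuition about what is being averaged, a single word can be scored directly; this is just an illustrative lookup using SentiWordNet's pos_score/neg_score/obj_score accessors (the word 'happy' is an arbitrary example).

In [ ]:
# Illustrative lookup: scores for the first adjective sense of 'happy'
happy = list(swn.senti_synsets('happy', 'a'))[0]
print(happy.pos_score(), happy.neg_score(), happy.obj_score())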
In [18]:
n = 3
relevant_articles = [find_relevant(text, n) for text in articles_df['Body Text'].values]
relevant_df = articles_df[relevant_articles]
weeks = relevant_df['Week Start'].unique()

avg_weekly_pos_score = np.zeros((len(weeks), 1))
avg_weekly_neg_score = np.zeros((len(weeks), 1))
avg_weekly_pos_minus_neg_score = np.zeros((len(weeks), 1))
In [ ]:
# Calculate weekly sentiment scores across the entire time period
weeks = relevant_df['Week Start'].unique()

for i, week in enumerate(weeks):
    articles = relevant_df[relevant_df['Week Start'] == week]['Body Text']
    num_articles = articles.shape[0]
    pos_score = 0
    neg_score = 0
    for article in articles:
        pos, neg = simple_sentiment(article)
        pos_score += pos
        neg_score += neg
    avg_weekly_pos_score[i] = (pos_score/float(num_articles))
    avg_weekly_neg_score[i] = (neg_score/float(num_articles))
    avg_weekly_pos_minus_neg_score[i] = avg_weekly_pos_score[i] - avg_weekly_neg_score[i]
    if (i%10 == 0):
        print('Week: ', week, 'Positive: ', avg_weekly_pos_score[i][0], 'Negative: ', avg_weekly_neg_score[i][0])

Saving the scores to a file so they do not need to be recalculated every time

In [32]:
# Save the weekly scores to a file so they do not need to be recalculated every time
# scores_df = pd.DataFrame()
# scores_df['weeks']=weeks
# scores_df['avg_weekly_pos_score']=avg_weekly_pos_score
# scores_df['avg_weekly_neg_score']=avg_weekly_neg_score

# scores_df.to_csv('scores_df.csv',index=False)
In [20]:
scores_df = pd.read_csv('scores_df.csv')
avg_weekly_pos_score = scores_df['avg_weekly_pos_score']
avg_weekly_neg_score = scores_df['avg_weekly_neg_score']

Plot of average weekly positive sentiments

In [19]:
plt.figure(figsize=(15, 10))
plt.plot(scores_df['avg_weekly_pos_score'])
plt.xlabel('Week', fontsize=20)
plt.ylabel('Average weekly positive sentiment score', fontsize=20)
Out[19]:
<matplotlib.text.Text at 0x1282fa8d0>

Plot of average weekly negative sentiments

In [20]:
plt.figure(figsize=(15, 10))
plt.plot(scores_df['avg_weekly_neg_score'], color='red')
plt.xlabel('Week', fontsize=20)
plt.ylabel('Average weekly negative sentiment score', fontsize=20)
Out[20]:
<matplotlib.text.Text at 0x12c608ef0>

Plot of average weekly net positive (positive minus negative) sentiments
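
A minimal sketch of this plot, assuming the net score is simply avg_weekly_pos_score minus avg_weekly_neg_score (matching avg_weekly_pos_minus_neg_score computed above); net_score is an ad-hoc name.

In [ ]:
# Sketch: net weekly sentiment reconstructed from the saved columns
net_score = scores_df['avg_weekly_pos_score'] - scores_df['avg_weekly_neg_score']

plt.figure(figsize=(15, 10))
plt.plot(net_score)
plt.xlabel('Week', fontsize=20)
plt.ylabel('Average weekly net sentiment score', fontsize=20)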

Including the exchange rate plots

In [21]:
daily_data = pd.read_csv('daily_rates.csv', skiprows=3, header=0)
monthly_data = pd.read_csv('monthly_rates.csv', skiprows=11, header=0)
In [22]:
daily_data['datetime'] = pd.to_datetime(daily_data['DATE'])
monthly_data['datetime'] = pd.to_datetime(monthly_data['DATE'])

# dayofweek: Monday=0 ... Friday=4; keep only Fridays so there is one rate observation per week
daily_data['dayofweek'] = daily_data['datetime'].apply(lambda row: row.dayofweek)
weekly_data = daily_data[daily_data['dayofweek'] == 4]
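
An equivalent weekly (Friday-anchored) series could also be obtained by resampling the daily rates; this is an alternative sketch rather than what is used below, and it assumes the XUDLERS/XUDLUSS columns present in daily_rates.csv.

In [ ]:
# Alternative: resample daily rates to weekly bins ending on Friday, taking the last observation in each bin
weekly_rates = (daily_data.set_index('datetime')[['XUDLERS', 'XUDLUSS']]
                          .resample('W-FRI').last())
weekly_rates.head()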
In [23]:
timestamp_weeks = [pd.to_datetime(week) for week in weeks]
In [24]:
fig, ax1 = plt.subplots( figsize=(20,15))

ax1.plot(weekly_data['datetime'], weekly_data['XUDLERS'], 'brown', linewidth=2, label='EUR/GBP')
ax1.plot(weekly_data['datetime'], weekly_data['XUDLUSS'], 'blue', linewidth=2, label='USD/GBP')
ax1.legend(loc='best', fontsize=20)
ax1.set_xlabel('Year', fontsize=20)
ax1.set_ylabel('Euro and US Dollar to Pound exchange rate', fontsize=20)
ax1.grid(True)
ax1.set_ylim([min(weekly_data['XUDLERS'].min(), weekly_data['XUDLUSS'].min()),
              max(weekly_data['XUDLERS'].max(), weekly_data['XUDLUSS'].max())])
# Vertical reference lines marking dates of interest in the exchange rate series
ax1.axvline(x=datetime.datetime(2016, 1, 8), color='grey', linewidth=2)
ax1.axvline(x=datetime.datetime(2007, 1, 5), color='orange', linewidth=2)
ax1.axvline(x=datetime.datetime(2009, 1, 12), color='orange', linewidth=2)

ax2 = ax1.twinx()
ax2.plot(timestamp_weeks, scores_df['avg_weekly_pos_score'], 'green',linewidth=0.5, label = 'Average weekly positive score')
ax2.set_ylabel('Positive Sentiment Score', color='green',fontsize=20)
ax2.set_ylim([0.0,0.1])
for tl in ax2.get_yticklabels():
    tl.set_color('green')

plt.show()