---
title: Using sentiment analysis for clickbait detection in RSS feeds
url: using-sentiment-analysis-for-clickbait-detection-in-rss-feeds.html
date: 2019-10-19T12:00:00+02:00
type: post
draft: false
---

## Initial thoughts

One thing that has interested me for a while is whether major, well-established
news sites use clickbait titles to drive additional traffic to their sites and
generate more impressions.

The goal is to see how the sentiment of article titles differs from the
sentiment of the actual article content, and whether that difference suggests
the titles are clickbait.

## Preparing and cleaning data

For this example I opted to use the RSS feed of a news website and went with
[The Guardian](https://www.theguardian.com) World news. This gives us limited
data: only about 40 articles, and the description (which stands in for the
actual content) is trimmed, so it does not really reflect the full article.

To get better content I could use the RSS feed as a list of links and scrape the
full text directly from the website, but for this simple example the feed will
suffice.

There are a few requirements we need to install before we continue:

- `pip3 install feedparser` (parses the RSS feed from a URL)
- `pip3 install vaderSentiment` (does sentiment polarity analysis)
- `pip3 install matplotlib` (plots a chart of the results)

First we need to fetch the RSS data and strip the HTML markup from each
description.

```python
import re
import feedparser

feed_url = "https://www.theguardian.com/world/rss"
feed = feedparser.parse(feed_url)

# Strip HTML tags from each description, leaving plain text
for item in feed.entries:
    item.description = re.sub(r'<[^<]+?>', '', item.description)
```
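
The regular expression above removes tags but leaves HTML entities such as
`&amp;` or `&#39;` in the text, which may affect word matching in the sentiment
step. Here is a minimal sketch of a slightly more thorough cleanup using only the
standard library. The helper name `strip_html` is an assumption for illustration
and is not part of the original notebook.

```python
import html
import re

TAG_RE = re.compile(r"<[^<]+?>")


def strip_html(raw):
    """Remove tags, decode HTML entities, and collapse whitespace."""
    # Replace tags with a space so text from adjacent elements does not fuse
    without_tags = TAG_RE.sub(" ", raw)
    # Decode entities only after the tags are gone, so escaped "<" stays text
    decoded = html.unescape(without_tags)
    # Collapse runs of whitespace left behind by the removed tags
    return " ".join(decoded.split())


print(strip_html("<p>Markets &amp; politics</p><p>What&#39;s next?</p>"))
```

In the loop above, `strip_html(item.description)` could stand in for the
`re.sub` call if entity decoding turns out to matter for the feed in question.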

## Perform sentiment analysis

Now that we have cleaned-up data in our `feed.entries` object, we can start
performing sentiment analysis.

There are many sentiment analysis libraries available, ranging from rule-based
analysis to approaches backed by machine learning. To keep things simple I
decided to use the rule-based library
[vaderSentiment](https://github.com/cjhutto/vaderSentiment) from
[C.J. Hutto](https://github.com/cjhutto). It is a really nice library and quite
easy to use.

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyser = SentimentIntensityAnalyzer()

# Collect [title_score, description_score] pairs using VADER's compound score
sentiment_results = []
for item in feed.entries:
    sentiment_title = analyser.polarity_scores(item.title)
    sentiment_description = analyser.polarity_scores(item.description)
    sentiment_results.append([sentiment_title['compound'], sentiment_description['compound']])
```
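
One way to turn these pairs into a rough clickbait signal is to look at how far
each title score sits from its description score. The sketch below is only an
assumption about how such a gap could be computed, not a validated clickbait
metric. The function `sentiment_gaps` and the sample scores are made up for
illustration.

```python
def sentiment_gaps(results):
    """Return (index, title, description, gap) tuples, largest gap first.

    `results` is a list of [title_compound, description_compound] pairs,
    the same shape as `sentiment_results`.
    """
    rows = [
        (index, title, description, abs(title - description))
        for index, (title, description) in enumerate(results)
    ]
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Hypothetical scores, not taken from the real feed
sample = [[0.0, -0.25], [-0.75, -0.25], [0.5, -0.5]]
for index, title, description, gap in sentiment_gaps(sample):
    print(index, title, description, gap)
```

Articles at the top of this list are the ones where the title and the
description disagree most, which makes them candidates for a manual look.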

Now that the data is in a shape that matplotlib can plot directly, we can chart
the results to see the difference between the title and description sentiment
of each article.

```python
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = (15, 3)
# Each column of the [title, description] pairs is drawn as its own step line
plt.plot(sentiment_results, drawstyle='steps')
plt.title('Sentiment analysis relationship between title and description (Guardian World News)')
plt.legend(['title', 'description'])
plt.show()
```

## Results and assets

1. Because of the small sample size, no firm conclusions can be drawn.
2. A rule-based approach may not be the best way of doing this. Deep learning
   might give better insights.
3. **Next step would be to** periodically fetch RSS items and store them over a
   longer period of time, then perform the analysis again and use either
   machine learning or deep learning on top of it.
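
The storage part of that next step could be sketched with SQLite from the
standard library. Everything here is an assumption for illustration: the table
name `items`, its columns, and the helper names `open_store` and
`store_entries` do not come from the original notebook. Using the link as the
primary key means that re-fetching the same feed does not create duplicates.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a SQLite database with one row per feed item."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS items (
            link TEXT PRIMARY KEY,
            title TEXT NOT NULL,
            description TEXT NOT NULL,
            fetched_at TEXT DEFAULT CURRENT_TIMESTAMP
        )
        """
    )
    return conn


def count_items(conn):
    """Return the number of stored feed items."""
    return conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]


def store_entries(conn, entries):
    """Insert entries, skipping links already stored. Returns rows added."""
    before = count_items(conn)
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO items (link, title, description) VALUES (?, ?, ?)",
            [(e["link"], e["title"], e["description"]) for e in entries],
        )
    return count_items(conn) - before
```

With a file path instead of `:memory:`, calling `store_entries(conn,
feed.entries)` from a scheduled job would slowly build up a larger sample.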

![Relationship between title and description](/assets/posts/sentiment-analysis/guardian-sa-title-desc-relationship.png)

The figure above displays the title and description sentiment for each RSS feed
item. A score of 1 is the most positive and -1 the most negative sentiment.
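
For reading individual scores, the vaderSentiment README suggests treating a
compound score of 0.05 or higher as positive, -0.05 or lower as negative, and
anything in between as neutral. A small sketch of that convention follows. The
function name `label_compound` is hypothetical and not part of the library.

```python
def label_compound(score, threshold=0.05):
    """Map a VADER compound score to a coarse sentiment label."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


for score in (0.6, 0.0, -0.3):
    print(score, label_compound(score))
```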

[» Download Jupyter Notebook](/assets/posts/sentiment-analysis/sentiment-analysis.ipynb)

## Going further

- [Twitter Sentiment Analysis by Bryan Schwierzke](https://github.com/bswiss/news_mood)
- [AFINN-based sentiment analysis for Node.js by Andrew Sliwinski](https://github.com/thisandagain/sentiment)
- [Sentiment Analysis with LSTMs in Tensorflow by Adit Deshpande](https://github.com/adeshpande3/LSTM-Sentiment-Analysis)
- [Sentiment analysis on tweets using Naive Bayes, SVM, CNN, LSTM, etc. by Abdul Fatir](https://github.com/abdulfatir/twitter-sentiment-analysis)
