Using sentiment analysis for clickbait detection in RSS feeds

post, Oct 19, 2019 on Mitja Felicijan's blog

Initial thoughts

One thing that has interested me for a while now is whether major, well-established news sites use clickbait titles to drive additional traffic to their sites and generate additional impressions.

The goal is to see how article titles and the actual content of the articles differ from each other, and whether the titles are clickbait.

Preparing and cleaning data

For this example I opted to just use the RSS feed from a news website and decided to go with The Guardian World news. This gives us limited data (~40 articles), and the description (standing in for the actual content) is trimmed, so it doesn't really reflect the full article contents.

To get better content I could use web scraping, treating the RSS feed as a list of links and fetching the contents directly from the website, but for this simple example the feed will suffice.

There are a couple of requirements we need to install before we continue:
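The scraping alternative mentioned above could look roughly like the sketch below. It is not part of the original notebook: the names `ParagraphExtractor`, `extract_paragraphs` and `fetch_article_text` are made up for illustration, and the idea that the article body lives in plain `<p>` tags is an assumption that would need checking against the real page markup.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class ParagraphExtractor(HTMLParser):
    """Collects the text found inside <p> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_paragraph = False
        self.current = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.current = []

    def handle_endtag(self, tag):
        if tag == "p" and self.in_paragraph:
            # collapse whitespace and keep only non-empty paragraphs
            text = " ".join("".join(self.current).split())
            if text:
                self.paragraphs.append(text)
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:
            self.current.append(data)


def extract_paragraphs(html_text):
    """Return a list with the text of every <p> element in html_text."""
    parser = ParagraphExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.paragraphs


def fetch_article_text(url, timeout=10):
    """Download a page and return its paragraph text (assumes <p>-based markup)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html_text = response.read().decode(charset, errors="replace")
    return "\n".join(extract_paragraphs(html_text))
```

With a parsed feed, something like `fetch_article_text(item.link)` for each item in `feed.entries` would then supply the full text instead of the trimmed description, provided the site allows that kind of access.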

  • pip3 install feedparser (parses RSS feed from url)
  • pip3 install vaderSentiment (does sentiment polarity analysis)
  • pip3 install matplotlib (plots chart of results)

So first we need to fetch the RSS data and strip the HTML from each description.

import re
import feedparser

feed_url = "https://www.theguardian.com/world/rss"
feed = feedparser.parse(feed_url)

# strip HTML tags from each description
for item in feed.entries:
    item.description = re.sub(r'<[^<]+?>', '', item.description)
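The regular expression above removes tags but leaves HTML entities such as `&amp;` in place. A slightly more thorough cleanup, as a sketch only, could decode entities and collapse whitespace as well. The function name `clean_description` is hypothetical and not part of the original notebook.

```python
import html
import re


def clean_description(raw):
    """Strip tags, decode HTML entities and collapse runs of whitespace."""
    text = re.sub(r'<[^<]+?>', '', raw)   # same tag-stripping pattern as above
    text = html.unescape(text)            # e.g. "&amp;" becomes "&"
    return ' '.join(text.split())


print(clean_description("<p>Tom &amp; Jerry</p>\n<p>again</p>"))
# prints: Tom & Jerry again
```

Whether the extra decoding changes the sentiment scores in practice is untested here; it mainly keeps stray entity text out of the analyser input.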

Perform sentiment analysis

Since we now have cleaned-up data in our feed.entries object, we can start performing sentiment analysis.

There are many sentiment analysis libraries available, ranging from rule-based sentiment analysis to machine-learning-supported analysis. To keep things simple I decided to use the rule-based library vaderSentiment from C.J. Hutto. It is a really nice library and quite easy to use.

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyser = SentimentIntensityAnalyzer()

# collect [title, description] compound scores for every feed item
sentiment_results = []
for item in feed.entries:
    sentiment_title = analyser.polarity_scores(item.title)
    sentiment_description = analyser.polarity_scores(item.description)
    sentiment_results.append([sentiment_title['compound'], sentiment_description['compound']])

Now that we have this data in a shape that is compatible with matplotlib, we can plot the results to see the difference between the title and description sentiment of each article.

import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = (15, 3)
plt.plot(sentiment_results, drawstyle='steps')
plt.title('Sentiment analysis relationship between title and description (Guardian World News)')
plt.legend(['title', 'description'])
plt.show()
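Besides eyeballing the chart, the title/description difference can be expressed as a number per item. The sketch below is an addition, not something from the original notebook: the helper names and the 0.5 threshold are arbitrary assumptions, and the example values are invented purely to show the shape of the data.

```python
def title_description_gaps(results):
    """Title compound score minus description compound score, per item."""
    return [title - description for title, description in results]


def flag_large_gaps(results, threshold=0.5):
    """Indices of items whose absolute gap exceeds the (arbitrary) threshold."""
    gaps = title_description_gaps(results)
    return [index for index, gap in enumerate(gaps) if abs(gap) > threshold]


# Illustrative values only, not real feed results.
example_results = [[0.6, 0.5], [-0.8, 0.1], [0.0, 0.0]]
print(flag_large_gaps(example_results))
# prints: [1]
```

Passing the real `sentiment_results` list to `flag_large_gaps` would give the positions in `feed.entries` where title and description disagree most. A large gap is only a hint that a title might be sensationalised, not proof of clickbait.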

Results and assets

  1. Because of the small sample size, no firm conclusions can be drawn.
  2. A rule-based approach may not be the best way of doing this. Deep learning might give better insights.
  3. The next step would be to periodically fetch RSS items, store them over a longer period of time, and then perform the analysis again with either machine learning or deep learning on top of it.
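The periodic collection described in the third point could be sketched with SQLite as below. This is one possible approach rather than anything from the original notebook. The table layout, the function names `open_store` and `save_items`, and the use of the item link as a unique key are all assumptions.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a small SQLite store for feed items."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items ("
        "link TEXT PRIMARY KEY, title TEXT, description TEXT, fetched_at TEXT)"
    )
    return conn


def count_items(conn):
    """Number of items currently stored."""
    return conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]


def save_items(conn, items, fetched_at):
    """Store items, skipping links already present; return how many were new."""
    before = count_items(conn)
    conn.executemany(
        "INSERT OR IGNORE INTO items (link, title, description, fetched_at) "
        "VALUES (?, ?, ?, ?)",
        [(i["link"], i["title"], i["description"], fetched_at) for i in items],
    )
    conn.commit()
    return count_items(conn) - before
```

Run on a schedule (cron, for example), each fetch could build a list such as `[{"link": e.link, "title": e.title, "description": e.description} for e in feed.entries]` and pass it to `save_items`, so repeated fetches only add articles that have not been seen before.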

[Figure: Relationship between title and description]

The figure above displays the difference between title and description sentiment for each RSS feed item. A score of 1 means positive and -1 means negative sentiment.

» Download Jupyter Notebook

Going further