diff --git a/_posts/2019-10-19-using-sentiment-analysis-for-clickbait-detection.md b/_posts/2019-10-19-using-sentiment-analysis-for-clickbait-detection.md
new file mode 100644
index 0000000..5aad23c
--- /dev/null
+++ b/_posts/2019-10-19-using-sentiment-analysis-for-clickbait-detection.md
@@ -0,0 +1,109 @@
---
title: Using sentiment analysis for clickbait detection in RSS feeds
permalink: /using-sentiment-analysis-for-clickbait-detection-in-rss-feeds.html
date: 2019-10-19T12:00:00+02:00
layout: post
type: post
draft: false
---

## Initial thoughts

For a while now I have been curious whether major, well-established news sites
use clickbait titles to drive additional traffic to their sites and generate
additional impressions.

The goal is to see how article titles and the actual content of the articles
differ from each other, and whether the titles are clickbait.

## Preparing and cleaning data

For this example I opted to just use the RSS feed from a news website and
decided to go with [The Guardian](https://www.theguardian.com) World news. This
gets us limited data (~40 articles), and the description (the actual content) is
trimmed, so it doesn't really reflect the full article contents.

To get better content I could use the RSS feed as a list of links and scrape the
article contents directly from the website, but for this simple example the feed
will suffice.

There are a couple of requirements we need to install before we continue:

- `pip3 install feedparser` (parses the RSS feed from a URL)
- `pip3 install vaderSentiment` (does sentiment polarity analysis)
- `pip3 install matplotlib` (plots a chart of the results)

First we need to fetch the RSS data and strip the HTML from each description.

```python
import re
import feedparser

feed_url = "https://www.theguardian.com/world/rss"
feed = feedparser.parse(feed_url)

# strip HTML tags from each item's description
for item in feed.entries:
    item.description = re.sub(r'<[^<]+?>', '', item.description)
```
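
The regular expression above removes tags but leaves HTML entities such as
`&amp;` untouched. A minimal sketch of an alternative, assuming only the Python
standard library, collects the text nodes with `html.parser.HTMLParser`. The
`strip_html` helper name is my own and is not part of feedparser.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True (the default) decodes entities like &amp;
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(fragment):
    """Hypothetical helper: return the fragment's text without tags."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()
    # collapse runs of whitespace left behind by removed tags
    return " ".join("".join(parser.parts).split())
```

With this in place, the loop body could become
`item.description = strip_html(item.description)`.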

## Perform sentiment analysis

Now that the data in our `feed.entries` object is cleaned up, we can start
performing sentiment analysis.

There are many sentiment analysis libraries available, ranging from rule-based
analysis up to machine-learning-supported analysis. To keep things simple I
decided to use the rule-based library
[vaderSentiment](https://github.com/cjhutto/vaderSentiment) from
[C.J. Hutto](https://github.com/cjhutto). It is a really nice library and quite
easy to use.

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyser = SentimentIntensityAnalyzer()

# collect [title, description] compound scores for every feed item
sentiment_results = []
for item in feed.entries:
    sentiment_title = analyser.polarity_scores(item.title)
    sentiment_description = analyser.polarity_scores(item.description)
    sentiment_results.append([sentiment_title['compound'], sentiment_description['compound']])
```
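
To turn the paired scores into something closer to a clickbait signal, one
option is to look at the gap between title and description sentiment. The sketch
below assumes `sentiment_results` holds `[title, description]` pairs as built
above. The `sentiment_gaps` helper and the 0.5 threshold are illustrative
assumptions, not values taken from vaderSentiment.

```python
def sentiment_gaps(results, threshold=0.5):
    """Hypothetical helper: return (index, gap) pairs where the title and
    description compound scores differ by at least `threshold`."""
    flagged = []
    for index, (title_score, description_score) in enumerate(results):
        gap = title_score - description_score
        if abs(gap) >= threshold:
            flagged.append((index, round(gap, 4)))
    return flagged


# made-up scores, purely for illustration
example_results = [[0.1, 0.0], [-0.8, 0.2], [0.6, 0.55]]
print(sentiment_gaps(example_results))  # [(1, -1.0)]
```

A negative gap means the title reads more negatively than the description, which
may or may not indicate clickbait; the threshold would need tuning on real data.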

Now that we have the data in a shape that is compatible with matplotlib, we can
plot the results to see the difference between title and description sentiment
for each article.

```python
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = (15, 3)
# each column of sentiment_results is drawn as its own line
plt.plot(sentiment_results, drawstyle='steps')
plt.title('Sentiment analysis relationship between title and description (Guardian World News)')
plt.legend(['title', 'description'])
plt.show()
```

## Results and assets

1. Because of the small sample size, no firm conclusions can be drawn.
2. A rule-based approach may not be the best way of doing this. By using deep
   learning we might be able to get better insights.
3. **The next step would be to** periodically fetch RSS items, store them over a
   longer period of time, and then perform the analysis again using either
   machine learning or deep learning on top of it.
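
The periodic collection step could be sketched with SQLite from the standard
library, using the item link as a key so that repeated fetches do not store
duplicates. The table layout and the `open_store` and `save_items` helper names
are assumptions made for illustration.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the SQLite database that accumulates feed items."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS items (
               link TEXT PRIMARY KEY,
               title TEXT NOT NULL,
               description TEXT NOT NULL
           )"""
    )
    return conn


def save_items(conn, items):
    """Insert (link, title, description) tuples, skipping known links.

    Returns the number of newly stored rows.
    """
    before = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
    conn.executemany(
        "INSERT OR IGNORE INTO items (link, title, description) VALUES (?, ?, ?)",
        items,
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
    return after - before
```

Run from a scheduled job, something like
`save_items(conn, [(e.link, e.title, e.description) for e in feed.entries])`
would grow the data set over time, assuming each entry exposes a `link`.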

![Relationship between title and description](/assets/posts/sentiment-analysis/guardian-sa-title-desc-relationship.png)

The figure above displays the difference between title and description
sentiment for each RSS feed item. A score of 1 means positive and -1 means
negative sentiment.
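
For reading the chart, it can help to bucket the compound scores into coarse
labels. The sketch below uses the ±0.05 cut-offs that the vaderSentiment README
suggests as typical thresholds; the `label_compound` helper itself is
hypothetical.

```python
def label_compound(score, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse label."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"
```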

[» Download Jupyter Notebook](/assets/posts/sentiment-analysis/sentiment-analysis.ipynb)

## Going further

- [Twitter Sentiment Analysis by Bryan Schwierzke](https://github.com/bswiss/news_mood)
- [AFINN-based sentiment analysis for Node.js by Andrew Sliwinski](https://github.com/thisandagain/sentiment)
- [Sentiment Analysis with LSTMs in Tensorflow by Adit Deshpande](https://github.com/adeshpande3/LSTM-Sentiment-Analysis)
- [Sentiment analysis on tweets using Naive Bayes, SVM, CNN, LSTM, etc. by Abdul Fatir](https://github.com/abdulfatir/twitter-sentiment-analysis)