Testing thoughts page

author: Mitja Felicijan <mitja.felicijan@gmail.com> 2024-02-23 10:35:22 +0100
committer: Mitja Felicijan <mitja.felicijan@gmail.com> 2024-02-23 10:35:22 +0100
commit: 4abcce013c9ee3053badf2abda77190233066676 (patch)
tree: 450de7e8fed3c3c7501a9d2e2eb60a676bdfa09e /_posts/posts/2019-10-19-using-sentiment-analysis-for-clickbait-detection.md
parent: cdf50cb2e3051200c6ea0628c318d66220b7d1a1 (diff)
download: mitjafelicijan.com-4abcce013c9ee3053badf2abda77190233066676.tar.gz
1 files changed, 109 insertions, 0 deletions
diff --git a/_posts/posts/2019-10-19-using-sentiment-analysis-for-clickbait-detection.md b/_posts/posts/2019-10-19-using-sentiment-analysis-for-clickbait-detection.md
new file mode 100644
index 0000000..a1b237b
--- /dev/null
+++ b/_posts/posts/2019-10-19-using-sentiment-analysis-for-clickbait-detection.md
@@ -0,0 +1,109 @@
+---
+title: Using sentiment analysis for clickbait detection in RSS feeds
+permalink: /using-sentiment-analysis-for-clickbait-detection-in-rss-feeds.html
+date: 2019-10-19T12:00:00+02:00
+layout: post
+type: post
+draft: false
+---
+## Initial thoughts
+For a while now I have been curious whether major, well-established news sites
+use clickbait titles to drive additional traffic to their sites and generate
+additional impressions.
+The goal is to see how article titles differ from the actual content of the
+articles, and whether the titles are clickbait.
+## Preparing and cleaning data
+For this example I opted to use just the RSS feed of a news website and decided
+to go with [The Guardian](https://www.theguardian.com) World news. This gives
+us limited data (~40 articles), and because the description (the actual
+content) is trimmed, it doesn't really reflect the full article contents.
+To get better content I could use the RSS feed as a list of links and scrape
+the contents directly from the website, but for this simple example the feed
+will suffice.
+There are a couple of requirements we need to install before we continue:
+- `pip3 install feedparser` (parses an RSS feed from a URL)
+- `pip3 install vaderSentiment` (does sentiment polarity analysis)
+- `pip3 install matplotlib` (plots chart of results)
+First we need to fetch the RSS data and strip the HTML markup from each description.
+```python
+import re
+import feedparser
+feed_url = "https://www.theguardian.com/world/rss"
+feed = feedparser.parse(feed_url)
+# strip HTML tags from each description, keeping only the text
+for item in feed.entries:
+    item.description = re.sub(r'<[^<]+?>', '', item.description)
+```
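The regular expression above removes tags but leaves any HTML entities (such as `&amp;`) in the text untouched. As a minimal sketch of an alternative, assuming only the Python standard library, the text could be extracted with `html.parser`. The names `strip_html` and `_TextExtractor` are hypothetical helpers and not part of the original post.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into plain text
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(fragment):
    """Return the text of an HTML fragment with runs of whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.parts).split())
```

Under this assumption, the loop body could call `strip_html(item.description)` in place of `re.sub(...)`. Whether this changes the sentiment scores for a given feed is not something the original post measures.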
+## Perform sentiment analysis
+Now that we have cleaned-up data in `feed.entries`, we can perform the
+sentiment analysis.
+There are many sentiment analysis libraries available, ranging from rule-based
+analysis to machine-learning-based analysis. To keep things simple I decided to
+use the rule-based library
+[vaderSentiment](https://github.com/cjhutto/vaderSentiment) by
+[C.J. Hutto](https://github.com/cjhutto). It is a really nice library and quite
+easy to use.
+```python
+from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
+analyser = SentimentIntensityAnalyzer()
+sentiment_results = []
+for item in feed.entries:
+    sentiment_title = analyser.polarity_scores(item.title)
+    sentiment_description = analyser.polarity_scores(item.description)
+    # keep only the compound score (normalized to the range -1 to 1)
+    sentiment_results.append([sentiment_title['compound'], sentiment_description['compound']])
+```
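Since the stated goal is to see how titles differ from content, one possible next step is to turn each `[title, description]` pair into a single gap value and flag the items where the gap is large. The following is only a sketch: the function names and the `0.5` threshold are assumptions chosen for illustration, and a large gap is a hint worth inspecting, not proof of clickbait.

```python
def sentiment_gaps(results):
    """Return title score minus description score for each [title, description] pair."""
    return [title - description for title, description in results]


def flag_divergent(results, threshold=0.5):
    """Return the indexes of items whose absolute gap is at least `threshold`.

    The default threshold is an arbitrary illustrative value, not a tuned one.
    """
    return [
        index
        for index, gap in enumerate(sentiment_gaps(results))
        if abs(gap) >= threshold
    ]


# Made-up pairs in the same shape as sentiment_results, for illustration only
example_results = [[0.8, 0.1], [-0.2, -0.3], [-0.6, 0.4]]
divergent = flag_divergent(example_results)
```

With the real data, `flag_divergent(sentiment_results)` would give the positions in `feed.entries` that are worth reading by hand.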
+Now that the data is in a shape that matplotlib can plot directly, we can chart
+the results to see the difference between the title and description sentiment
+of each article.
+```python
+import matplotlib.pyplot as plt
+plt.rcParams['figure.figsize'] = (15, 3)
+plt.plot(sentiment_results, drawstyle='steps')
+plt.title('Sentiment analysis relationship between title and description (Guardian World News)')
+plt.legend(['title', 'description'])
+plt.show()
+```
+## Results and assets
+1. Because of the small sample size, no firm conclusions can be drawn.
+2. A rule-based approach may not be the best way of doing this. Using deep
+   learning might give better insights.
+3. **The next step would be to** periodically fetch RSS items, store them over
+   a longer period of time, and then repeat the analysis using either machine
+   learning or deep learning.
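Collecting items over a longer period could be sketched with the standard library's `sqlite3` module, using the item link as a key so that repeated fetches do not store the same article twice. The table layout and the names `open_store`, `count_items`, and `store_items` are assumptions made for this sketch, not something the original post defines.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a SQLite database holding one row per feed item."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS items (
            link        TEXT PRIMARY KEY,
            title       TEXT NOT NULL,
            description TEXT NOT NULL,
            fetched_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        )
        """
    )
    return conn


def count_items(conn):
    """Return the number of stored items."""
    return conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]


def store_items(conn, items):
    """Insert (link, title, description) tuples, skipping links already stored.

    Returns the number of rows that were actually added.
    """
    before = count_items(conn)
    conn.executemany(
        "INSERT OR IGNORE INTO items (link, title, description) VALUES (?, ?, ?)",
        items,
    )
    conn.commit()
    return count_items(conn) - before
```

A scheduled job could then call `store_items(conn, [(item.link, item.title, item.description) for item in feed.entries])` after each fetch, with a file path passed to `open_store` instead of the in-memory default.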
+![Relationship between title and description](/assets/posts/sentiment-analysis/guardian-sa-title-desc-relationship.png){:loading="lazy"}
+The figure above displays the difference between title and description
+sentiment for each RSS feed item. 1 means positive and -1 means negative sentiment.
+[» Download Jupyter Notebook](/assets/posts/sentiment-analysis/sentiment-analysis.ipynb)
+## Going further
+- [Twitter Sentiment Analysis by Bryan Schwierzke](https://github.com/bswiss/news_mood)
+- [AFINN-based sentiment analysis for Node.js by Andrew Sliwinski](https://github.com/thisandagain/sentiment)
+- [Sentiment Analysis with LSTMs in Tensorflow by Adit Deshpande](https://github.com/adeshpande3/LSTM-Sentiment-Analysis)
+- [Sentiment analysis on tweets using Naive Bayes, SVM, CNN, LSTM, etc. by Abdul Fatir](https://github.com/abdulfatir/twitter-sentiment-analysis)