| author | Mitja Felicijan <mitja.felicijan@gmail.com> | 2024-03-10 14:59:14 +0100 |
|---|---|---|
| committer | Mitja Felicijan <mitja.felicijan@gmail.com> | 2024-03-10 14:59:14 +0100 |
| commit | 1100562e29f6476448b656dbddd4cf22505523f6 (patch) | |
| tree | 442eec492199104bd49dfd74474ce89ade8fcac9 /content/posts/2019-10-19-using-sentiment-analysis-for-clickbait-detection.md | |
| parent | a40d80be378e46a6c490e1b99b0d8f4acd968503 (diff) | |
Move back to JBMAFP
Diffstat (limited to 'content/posts/2019-10-19-using-sentiment-analysis-for-clickbait-detection.md')
| -rw-r--r-- | content/posts/2019-10-19-using-sentiment-analysis-for-clickbait-detection.md | 108 |
1 files changed, 108 insertions, 0 deletions
diff --git a/content/posts/2019-10-19-using-sentiment-analysis-for-clickbait-detection.md b/content/posts/2019-10-19-using-sentiment-analysis-for-clickbait-detection.md new file mode 100644 index 0000000..1e43554 --- /dev/null +++ b/content/posts/2019-10-19-using-sentiment-analysis-for-clickbait-detection.md | |||
| @@ -0,0 +1,108 @@ | |||
| 1 | --- | ||
| 2 | title: Using sentiment analysis for clickbait detection in RSS feeds | ||
| 3 | url: /using-sentiment-analysis-for-clickbait-detection-in-rss-feeds.html | ||
| 4 | date: 2019-10-19T12:00:00+02:00 | ||
| 5 | type: post | ||
| 6 | draft: false | ||
| 7 | --- | ||
| 8 | |||
| 9 | ## Initial thoughts | ||
| 10 | |||
| 11 | For a while now I have been wondering whether major, well-established news | ||
| 12 | sites use clickbait titles to drive extra traffic to their sites and generate | ||
| 13 | additional impressions. | ||
| 14 | |||
| 15 | The goal is to see how article titles differ from the actual article content | ||
| 16 | and whether the titles are clickbait. | ||
| 17 | |||
| 18 | ## Preparing and cleaning data | ||
| 19 | |||
| 20 | For this example I opted to use just the RSS feed of a news website and went | ||
| 21 | with [The Guardian](https://www.theguardian.com) World news. This gives us | ||
| 22 | limited data (~40 articles), and the description (the actual content) is | ||
| 23 | trimmed, so it doesn't really reflect the full article contents. | ||
| 24 | |||
| 25 | To get better content I could use the RSS feed as a list of links and scrape | ||
| 26 | the full text directly from the website, but for this simple example the feed | ||
| 27 | will suffice. | ||
| 28 | |||
| 29 | There are a couple of requirements we need to install before we continue: | ||
| 30 | |||
| 31 | - `pip3 install feedparser` (parses RSS feed from url) | ||
| 32 | - `pip3 install vaderSentiment` (does sentiment polarity analysis) | ||
| 33 | - `pip3 install matplotlib` (plots chart of results) | ||
| 34 | |||
| 35 | First we need to fetch the RSS data and strip the HTML from each description. | ||
| 36 | |||
| 37 | ```python | ||
| 38 | import re | ||
| 39 | import feedparser | ||
| 40 | |||
| 41 | feed_url = "https://www.theguardian.com/world/rss" | ||
| 42 | feed = feedparser.parse(feed_url) | ||
| 43 | |||
| 44 | # strip HTML tags from each description | ||
| 45 | for item in feed.entries: | ||
| 46 |     item.description = re.sub(r'<[^<]+?>', '', item.description) | ||
| 47 | ``` | ||
| 48 | |||
| 49 | ## Perform sentiment analysis | ||
| 50 | |||
| 51 | Now that the data in our `feed.entries` object is cleaned up, we can start | ||
| 52 | performing sentiment analysis. | ||
| 53 | |||
| 54 | There are many sentiment analysis libraries available, ranging from rule-based | ||
| 55 | analysis up to machine learning supported analysis. To keep things | ||
| 56 | simple I decided to use the rule-based analysis library | ||
| 57 | [vaderSentiment](https://github.com/cjhutto/vaderSentiment) from | ||
| 58 | [C.J. Hutto](https://github.com/cjhutto). It is a really nice library and quite | ||
| 59 | easy to use. | ||
| 60 | |||
| 61 | ```python | ||
| 62 | from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer | ||
| 63 | analyser = SentimentIntensityAnalyzer() | ||
| 64 | |||
| 65 | sentiment_results = []  # one [title, description] compound score pair per entry | ||
| 66 | for item in feed.entries: | ||
| 67 |     sentiment_title = analyser.polarity_scores(item.title) | ||
| 68 |     sentiment_description = analyser.polarity_scores(item.description) | ||
| 69 |     sentiment_results.append([sentiment_title['compound'], sentiment_description['compound']]) | ||
| 70 | ``` | ||
| 71 | |||
| 72 | Now that we have this data in a shape that is compatible with matplotlib, we can | ||
| 73 | plot the results to see the difference between the title and description | ||
| 74 | sentiment of each article. | ||
| 75 | |||
| 76 | ```python | ||
| 77 | import matplotlib.pyplot as plt | ||
| 78 | |||
| 79 | plt.rcParams['figure.figsize'] = (15, 3) | ||
| 80 | plt.plot(sentiment_results, drawstyle='steps') | ||
| 81 | plt.title('Sentiment analysis relationship between title and description (Guardian World News)') | ||
| 82 | plt.legend(['title', 'description']) | ||
| 83 | plt.show() | ||
| 84 | ``` | ||
| 85 | |||
| 86 | ## Results and assets | ||
| 87 | |||
| 88 | 1. The sample is too small (~40 articles) to draw any firm conclusions. | ||
| 89 | 2. A rule-based approach may not be the best way of doing this. A deep | ||
| 90 | learning model might give better insights. | ||
| 91 | 3. **The next step would be to** fetch RSS items periodically, store them over a | ||
| 92 | longer period of time, then run the analysis again using either machine | ||
| 93 | learning or deep learning on top of it. | ||
| 94 | |||
| 95 |  | ||
| 96 | |||
| 97 | The figure above displays the difference between title and description | ||
| 98 | sentiment for each RSS feed item. The compound score ranges from -1 (most negative) to 1 (most positive). | ||
| 99 | |||
| 100 | [» Download Jupyter Notebook](/assets/posts/sentiment-analysis/sentiment-analysis.ipynb) | ||
| 101 | |||
| 102 | ## Going further | ||
| 103 | |||
| 104 | - [Twitter Sentiment Analysis by Bryan Schwierzke](https://github.com/bswiss/news_mood) | ||
| 105 | - [AFINN-based sentiment analysis for Node.js by Andrew Sliwinski](https://github.com/thisandagain/sentiment) | ||
| 106 | - [Sentiment Analysis with LSTMs in Tensorflow by Adit Deshpande](https://github.com/adeshpande3/LSTM-Sentiment-Analysis) | ||
| 107 | - [Sentiment analysis on tweets using Naive Bayes, SVM, CNN, LSTM, etc. by Abdul Fatir](https://github.com/abdulfatir/twitter-sentiment-analysis) | ||
| 108 | |||
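
The post compares title and description sentiment visually. As a rough complement, the same `[title, description]` compound pairs could be reduced to a per-article gap and filtered. This is a minimal sketch and not part of the original post: the function name `sentiment_gaps`, the `threshold` of 0.5, and the example scores are all assumptions made for illustration. A large gap only hints at a mismatch between title and description; it does not prove that a title is clickbait.

```python
def sentiment_gaps(sentiment_results, threshold=0.5):
    """Return (index, title_score, description_score, gap) tuples for entries
    whose title and description compound scores differ by at least `threshold`.

    `sentiment_results` is a list of [title_compound, description_compound]
    pairs, the same shape the post builds. The threshold is a hypothetical
    cut-off chosen for illustration only.
    """
    flagged = []
    for index, (title_score, description_score) in enumerate(sentiment_results):
        gap = abs(title_score - description_score)
        if gap >= threshold:
            flagged.append((index, title_score, description_score, round(gap, 4)))
    # Largest gaps first: these are the strongest candidates for a closer look.
    return sorted(flagged, key=lambda row: row[3], reverse=True)


# Made-up scores, not real Guardian data.
example = [[0.6, 0.55], [-0.8, 0.1], [0.0, -0.7]]
flagged = sentiment_gaps(example)
for index, title_score, description_score, gap in flagged:
    print(index, title_score, description_score, gap)
```

With real data, `sentiment_gaps(sentiment_results)` would take the list built in the sentiment analysis step. The threshold would need tuning against articles that were manually judged to be clickbait.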
