| author | Mitja Felicijan <mitja.felicijan@gmail.com> | 2024-02-23 10:35:22 +0100 |
|---|---|---|
| committer | Mitja Felicijan <mitja.felicijan@gmail.com> | 2024-02-23 10:35:22 +0100 |
| commit | 4abcce013c9ee3053badf2abda77190233066676 (patch) | |
| tree | 450de7e8fed3c3c7501a9d2e2eb60a676bdfa09e /_posts/posts/2019-10-19-using-sentiment-analysis-for-clickbait-detection.md | |
| parent | cdf50cb2e3051200c6ea0628c318d66220b7d1a1 (diff) | |
| download | mitjafelicijan.com-4abcce013c9ee3053badf2abda77190233066676.tar.gz | |
Testing thoughts page
Diffstat (limited to '_posts/posts/2019-10-19-using-sentiment-analysis-for-clickbait-detection.md')
| -rw-r--r-- | _posts/posts/2019-10-19-using-sentiment-analysis-for-clickbait-detection.md | 109 |
1 files changed, 109 insertions, 0 deletions
diff --git a/_posts/posts/2019-10-19-using-sentiment-analysis-for-clickbait-detection.md b/_posts/posts/2019-10-19-using-sentiment-analysis-for-clickbait-detection.md new file mode 100644 index 0000000..a1b237b --- /dev/null +++ b/_posts/posts/2019-10-19-using-sentiment-analysis-for-clickbait-detection.md | |||
| @@ -0,0 +1,109 @@ | |||
| 1 | --- | ||
| 2 | title: Using sentiment analysis for clickbait detection in RSS feeds | ||
| 3 | permalink: /using-sentiment-analysis-for-clickbait-detection-in-rss-feeds.html | ||
| 4 | date: 2019-10-19T12:00:00+02:00 | ||
| 5 | layout: post | ||
| 6 | type: post | ||
| 7 | draft: false | ||
| 8 | --- | ||
| 9 | |||
| 10 | ## Initial thoughts | ||
| 11 | |||
| 12 | One thing that has interested me for a while is whether major, | ||
| 13 | well-established news sites use clickbait titles to drive additional traffic to | ||
| 14 | their sites and generate additional impressions. | ||
| 15 | |||
| 16 | The goal is to see how article titles and the actual content of the articles differ | ||
| 17 | from each other, and whether the titles are clickbait. | ||
| 18 | |||
| 19 | ## Preparing and cleaning data | ||
| 20 | |||
| 21 | For this example I opted to just use the RSS feed of a news website and decided to | ||
| 22 | go with [The Guardian](https://www.theguardian.com) World news. This gives us | ||
| 23 | limited data (~40 articles), and because the description (standing in for the actual content) is | ||
| 24 | trimmed, it doesn't really reflect the actual article contents. | ||
| 25 | |||
| 26 | To get better content I could use the RSS feed as a list of links and scrape the | ||
| 27 | contents directly from the website, but for this simple example the feed will | ||
| 28 | suffice. | ||
| 29 | |||
| 30 | There are a couple of requirements we need to install before we continue: | ||
| 31 | |||
| 32 | - `pip3 install feedparser` (parses RSS feed from url) | ||
| 33 | - `pip3 install vaderSentiment` (does sentiment polarity analysis) | ||
| 34 | - `pip3 install matplotlib` (plots chart of results) | ||
| 35 | |||
| 36 | First we need to fetch the RSS data and strip the HTML from each description. | ||
| 37 | |||
| 38 | ```python | ||
| 39 | import re | ||
| 40 | import feedparser | ||
| 41 | |||
| 42 | feed_url = "https://www.theguardian.com/world/rss" | ||
| 43 | feed = feedparser.parse(feed_url) | ||
| 44 | |||
| 45 | # sanitize html | ||
| 46 | for item in feed.entries: | ||
| 47 |     item.description = re.sub('<[^<]+?>', '', item.description) | ||
| 48 | ``` | ||
| 49 | |||
| 50 | ## Perform sentiment analysis | ||
| 51 | |||
| 52 | Now that the data in our `feed.entries` object is cleaned up, we can start | ||
| 53 | performing sentiment analysis. | ||
| 54 | |||
| 55 | There are many sentiment analysis libraries available, ranging from rule-based | ||
| 56 | analysis to machine-learning-supported analysis. To keep things | ||
| 57 | simple I decided to use the rule-based library | ||
| 58 | [vaderSentiment](https://github.com/cjhutto/vaderSentiment) by | ||
| 59 | [C.J. Hutto](https://github.com/cjhutto). It is a really nice library and quite easy to | ||
| 60 | use. | ||
| 61 | |||
| 62 | ```python | ||
| 63 | from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer | ||
| 64 | analyser = SentimentIntensityAnalyzer() | ||
| 65 | |||
| 66 | sentiment_results = [] | ||
| 67 | for item in feed.entries: | ||
| 68 |     sentiment_title = analyser.polarity_scores(item.title) | ||
| 69 |     sentiment_description = analyser.polarity_scores(item.description) | ||
| 70 |     sentiment_results.append([sentiment_title['compound'], sentiment_description['compound']]) | ||
| 71 | ``` | ||
| 72 | |||
| 73 | Now that we have this data in a shape that is compatible with matplotlib we can | ||
| 74 | plot results to see the difference between title and description sentiment of an | ||
| 75 | article. | ||
| 76 | |||
| 77 | ```python | ||
| 78 | import matplotlib.pyplot as plt | ||
| 79 | |||
| 80 | plt.rcParams['figure.figsize'] = (15, 3) | ||
| 81 | plt.plot(sentiment_results, drawstyle='steps') | ||
| 82 | plt.title('Sentiment analysis relationship between title and description (Guardian World News)') | ||
| 83 | plt.legend(['title', 'description']) | ||
| 84 | plt.show() | ||
| 85 | ``` | ||
| 86 | |||
| 87 | ## Results and assets | ||
| 88 | |||
| 89 | 1. Because of the small sample size, no firm conclusions can be drawn. | ||
| 90 | 2. A rule-based approach may not be the best way of doing this. Deep | ||
| 91 | learning might give better insights. | ||
| 92 | 3. **The next step would be to** periodically fetch RSS items and store them over a | ||
| 93 | longer period of time, then perform the analysis again and use either machine | ||
| 94 | learning or deep learning on top of it. | ||
| 95 | |||
| 96 | {:loading="lazy"} | ||
| 97 | |||
| 98 | The figure above shows the difference between title and description sentiment for | ||
| 99 | each RSS feed item. A score of 1 is the most positive and -1 the most negative sentiment. | ||
| 100 | |||
| 101 | [» Download Jupyter Notebook](/assets/posts/sentiment-analysis/sentiment-analysis.ipynb) | ||
| 102 | |||
| 103 | ## Going further | ||
| 104 | |||
| 105 | - [Twitter Sentiment Analysis by Bryan Schwierzke](https://github.com/bswiss/news_mood) | ||
| 106 | - [AFINN-based sentiment analysis for Node.js by Andrew Sliwinski](https://github.com/thisandagain/sentiment) | ||
| 107 | - [Sentiment Analysis with LSTMs in Tensorflow by Adit Deshpande](https://github.com/adeshpande3/LSTM-Sentiment-Analysis) | ||
| 108 | - [Sentiment analysis on tweets using Naive Bayes, SVM, CNN, LSTM, etc. by Abdul Fatir](https://github.com/abdulfatir/twitter-sentiment-analysis) | ||
| 109 | |||
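
The post plots title and description sentiment side by side, but it does not put a number on how far apart they are. Below is a minimal sketch of one way to do that, assuming input in the same `[title, description]` compound-score shape as `sentiment_results`. The helper names `sentiment_gaps` and `flag_divergent`, the `0.5` threshold, and the example scores are all hypothetical and are not part of the original notebook.

```python
def sentiment_gaps(results):
    """Return (index, title_score, description_score, gap) tuples, largest gap first.

    `results` is a list of [title_compound, description_compound] pairs,
    i.e. the same shape as `sentiment_results` in the post.
    """
    rows = []
    for index, (title_score, description_score) in enumerate(results):
        gap = abs(title_score - description_score)
        rows.append((index, title_score, description_score, gap))
    return sorted(rows, key=lambda row: row[3], reverse=True)


def flag_divergent(results, threshold=0.5):
    """Return indices of items whose gap is at least `threshold`.

    The default threshold is an arbitrary assumption, not a tuned value.
    """
    return [index for index, _, _, gap in sentiment_gaps(results) if gap >= threshold]


# Made-up compound scores, for illustration only (not real feed data).
example_results = [[0.0, 0.1], [-0.8, 0.2], [0.6, 0.5], [0.7, -0.1]]
print(flag_divergent(example_results))  # → [1, 3]
```

A large gap only means that the title and the trimmed description score differently under the rule-based analyser. It is a candidate for closer inspection, not proof of clickbait.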
