Diffstat (limited to 'public/using-sentiment-analysis-for-clickbait-detection-in-rss-feeds.html')
-rwxr-xr-x public/using-sentiment-analysis-for-clickbait-detection-in-rss-feeds.html 88
1 files changed, 0 insertions, 88 deletions
diff --git a/public/using-sentiment-analysis-for-clickbait-detection-in-rss-feeds.html b/public/using-sentiment-analysis-for-clickbait-detection-in-rss-feeds.html
deleted file mode 100755
index 7a70590..0000000
--- a/public/using-sentiment-analysis-for-clickbait-detection-in-rss-feeds.html
+++ /dev/null
@@ -1,88 +0,0 @@
1<!doctype html><html lang=en-us><meta charset=utf-8><meta name=viewport content="width=device-width,initial-scale=1"><meta name=generator content="JBMAFP - github.com/mitjafelicijan/jbmafp"><title>Using sentiment analysis for clickbait detection in RSS feeds</title><meta name=description content="Initial thoughts: One of the things that interested me for a while now is if major well-established
news sites use clickbait titles to drive additional traffic to their sites and generate additional impressions."><meta name=author content="Mitja Felicijan"><link rel=alternate type=application/rss+xml title="Mitja Felicijan's posts" href=https://mitjafelicijan.com/index.xml><link rel=alternate type=application/rss+xml title="Mitja Felicijan's notes" href=https://mitjafelicijan.com/notes.xml><style>:root{--border-color:gainsboro;--border-size:2px;--link-color:blue;--bg-color:#eee}*::selection{background:var(--link-color);color:#fff}*::-moz-selection{background:var(--link-color);color:#fff}*::-webkit-selection{background:var(--link-color);color:#fff}body{padding:2.5rem;max-width:1900px;background:#fff;font-family:sans-serif;line-height:1.35rem;font-size:16px}hr{border:0;border-bottom:var(--border-size)solid var(--border-color);margin-block-start:1.5rem}a{color:var(--link-color);text-decoration:none}h1,h2,h3{line-height:initial}h1{font-size:xx-large}footer{margin-block-start:2rem}cap{text-transform:capitalize}blockquote{font-style:italic}table{max-width:100%;border:var(--border-size)solid var(--border-color);border-collapse:separate;border-spacing:0}table thead tr th{border-bottom:var(--border-size)solid var(--border-color);text-align:left}table th,table td{padding:.5em .8em}ul.list li{padding:.2em 0}ul{line-height:1.35em}pre{text-wrap:nowrap;overflow-x:auto;padding:0 1em;border:var(--border-size)solid var(--border-color)}code{padding:0 3px;font-size:14px;border:0;background:var(--bg-color)}pre code{line-height:1.3em;background:#fff}pre,code,pre *,code *{font-family:monospace}figure{margin-inline-start:0;margin-inline-end:0}figcaption{width:800px;max-width:100%;text-align:center}figcaption p{margin:.3em 0 1.5em;font-style:italic}img,video,audio{width:800px;max-width:100%;border:var(--border-size)solid var(--border-color);padding:.5em}header nav{display:flex;gap:.9rem}article iframe{margin:0!important}audio::-webkit-media-controls-enclosure{border-radius:0}@media only 
screen and (max-width:600px){body{padding:.5em;word-wrap:break-word}header nav{gap:.7rem}header nav .hob{display:none}a{word-wrap:break-word}img,video,audio{padding:0}}</style><header><nav class=main itemscope itemtype=http://schema.org/SiteNavigationElement role=navigation aria-label="Main navigation"><a href=/>Home</a>
2<a href=/#posts>Posts</a>
3<a href=/#notes>Notes</a>
4<a href=/#sideprojects class=hob>Side Projects</a>
5<a href=/vault.html>Vault</a>
6<a href=https://github.com/mitjafelicijan target=_blank>Code</a>
7<a href=/mitjafelicijan.pgp.pub.txt target=_blank class=hob>PGP</a>
8<a href=/curriculum-vitae.html>CV</a>
9<a href=/index.xml target=_blank class=hob>RSS</a></nav></header><main role=main><article itemtype=http://schema.org/Article><h1 itemtype=headline>Using sentiment analysis for clickbait detection in RSS feeds</h1><p><cap>post</cap>, Oct 19, 2019 on <a href=https://mitjafelicijan.com>Mitja Felicijan's blog</a><div><h2 id=initial-thoughts>Initial thoughts</h2><p>One thing that has interested me for a while is whether major,
10well-established news sites use clickbait titles to drive additional traffic to
11their sites and generate additional impressions.<p>The goal is to see how article titles and the actual article content differ from each
12other and whether the titles are clickbait.<h2 id=preparing-and-cleaning-data>Preparing and cleaning data</h2><p>For this example I opted to use just the RSS feed from a news website and decided to
13go with <a href=https://www.theguardian.com>The Guardian</a> World news. This gets
14us only a limited amount of data (~40 articles), and the description (the actual content) is trimmed,
15so it doesn't really reflect the full article contents.<p>To get better content I could use the RSS feed as a list of links and
16scrape the contents directly from the website, but for this simple example the feed will
17suffice.<p>There are a couple of requirements we need to install before we continue:<ul><li><code>pip3 install feedparser</code> (parses the RSS feed from a URL)<li><code>pip3 install vaderSentiment</code> (does sentiment polarity analysis)<li><code>pip3 install matplotlib</code> (plots a chart of the results)</ul><p>First we need to fetch the RSS data and strip the HTML markup from each description.<pre tabindex=0 style=background-color:#fff><code><span style=display:flex><span><span style=color:#00f>import</span> re
18</span></span><span style=display:flex><span><span style=color:#00f>import</span> feedparser
19</span></span><span style=display:flex><span>
20</span></span><span style=display:flex><span>feed_url = <span style=color:#a31515>&#34;https://www.theguardian.com/world/rss&#34;</span>
21</span></span><span style=display:flex><span>feed = feedparser.parse(feed_url)
22</span></span><span style=display:flex><span>
23</span></span><span style=display:flex><span><span style=color:green># sanitize html</span>
24</span></span><span style=display:flex><span><span style=color:#00f>for</span> item <span style=color:#00f>in</span> feed.entries:
25</span></span><span style=display:flex><span> item.description = re.sub(<span style=color:#a31515>&#39;&lt;[^&lt;]+?&gt;&#39;</span>, <span style=color:#a31515>&#39;&#39;</span>, item.description)
26</span></span></code></pre><h2 id=perform-sentiment-analysis>Perform sentiment analysis</h2><p>Now that we have cleaned-up data in our <code>feed.entries</code> object, we can start
27performing sentiment analysis.<p>There are many sentiment analysis libraries available, ranging from rule-based
28analysis to machine-learning-based analysis. To keep things
29simple I decided to use the rule-based library
30<a href=https://github.com/cjhutto/vaderSentiment>vaderSentiment</a> from
31<a href=https://github.com/cjhutto>C.J. Hutto</a>. It is a really nice library and quite easy to
32use. Its <code>polarity_scores</code> method returns several scores; the code below keeps only <code>compound</code>, a single normalized value between -1 and 1.<pre tabindex=0 style=background-color:#fff><code><span style=display:flex><span><span style=color:#00f>from</span> vaderSentiment.vaderSentiment <span style=color:#00f>import</span> SentimentIntensityAnalyzer
33</span></span><span style=display:flex><span>analyser = SentimentIntensityAnalyzer()
34</span></span><span style=display:flex><span>
35</span></span><span style=display:flex><span>sentiment_results = []
36</span></span><span style=display:flex><span><span style=color:#00f>for</span> item <span style=color:#00f>in</span> feed.entries:
37</span></span><span style=display:flex><span> sentiment_title = analyser.polarity_scores(item.title)
38</span></span><span style=display:flex><span> sentiment_description = analyser.polarity_scores(item.description)
39</span></span><span style=display:flex><span> sentiment_results.append([sentiment_title[<span style=color:#a31515>&#39;compound&#39;</span>], sentiment_description[<span style=color:#a31515>&#39;compound&#39;</span>]])
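
# Hedged sketch, not part of the original notebook: one way to put a number on
# how far each title sits from its description. The helper name compound_gaps
# is hypothetical; it only assumes a list of [title, description] compound pairs.
def compound_gaps(pairs):
    # Positive gap: the title scores more positive than the description.
    return [round(title - description, 4) for title, description in pairs]

# Intended use with the list built above: compound_gaps(sentiment_results)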
40</span></span></code></pre><p>Now that the data is in a shape that matplotlib can consume, we can
41plot the results to see the difference between the title and description sentiment of each
42article.<pre tabindex=0 style=background-color:#fff><code><span style=display:flex><span><span style=color:#00f>import</span> matplotlib.pyplot <span style=color:#00f>as</span> plt
43</span></span><span style=display:flex><span>
44</span></span><span style=display:flex><span>plt.rcParams[<span style=color:#a31515>&#39;figure.figsize&#39;</span>] = (15, 3)
45</span></span><span style=display:flex><span>plt.plot(sentiment_results, drawstyle=<span style=color:#a31515>&#39;steps&#39;</span>)
46</span></span><span style=display:flex><span>plt.title(<span style=color:#a31515>&#39;Sentiment analysis relationship between title and description (Guardian World News)&#39;</span>)
47</span></span><span style=display:flex><span>plt.legend([<span style=color:#a31515>&#39;title&#39;</span>, <span style=color:#a31515>&#39;description&#39;</span>])
48</span></span><span style=display:flex><span>plt.show()
49</span></span></code></pre><h2 id=results-and-assets>Results and assets</h2><ol><li>Because of the small sample size (~40 items), no firm conclusions can be drawn.<li>A rule-based approach may not be the best way of doing this. Deep
50learning might give better insights.<li><strong>The next step would be to</strong> periodically fetch RSS items, store them over a
51longer period of time, and then run the analysis again with either machine
52learning or deep learning on top of it.</ol><figure><img src=/posts/sentiment-analysis/guardian-sa-title-desc-relationship.png alt="Relationship between title and description"></figure><p>The figure above displays the difference between title and description sentiment for each
53specific RSS feed item. 1 means positive and -1 means negative sentiment.<p><a href=/posts/sentiment-analysis/sentiment-analysis.ipynb>» Download Jupyter Notebook</a><h2 id=going-further>Going further</h2><ul><li><a href=https://github.com/bswiss/news_mood>Twitter Sentiment Analysis by Bryan Schwierzke</a><li><a href=https://github.com/thisandagain/sentiment>AFINN-based sentiment analysis for Node.js by Andrew Sliwinski</a><li><a href=https://github.com/adeshpande3/LSTM-Sentiment-Analysis>Sentiment Analysis with LSTMs in Tensorflow by Adit Deshpande</a><li><a href=https://github.com/abdulfatir/twitter-sentiment-analysis>Sentiment analysis on tweets using Naive Bayes, SVM, CNN, LSTM, etc. by Abdul Fatir</a></ul></div></article></main><footer><hr><p><big><strong>Want to comment or have something to add?</strong></big><p>You can write me an email
84at <a href=mailto:mitja.felicijan@gmail.com>mitja.felicijan@gmail.com</a> or
85catch up with me <a href=https://telegram.me/mitjafelicijan target=_blank>on Telegram</a>.<hr><p>This website does not track you. Content is made available under the <a href=https://creativecommons.org/licenses/by/4.0/ target=_blank rel=noreferrer>CC BY 4.0 license</a> unless
86specified otherwise. The blog is also available as an <a href=/index.xml target=_blank>RSS feed</a>.</footer><script>
87 window.va = window.va || function () { (window.vaq = window.vaq || []).push(arguments); };
88 </script><script defer src=/_vercel/insights/script.js></script> \ No newline at end of file