Diffstat (limited to 'public/using-sentiment-analysis-for-clickbait-detection-in-rss-feeds.html')
| -rwxr-xr-x | public/using-sentiment-analysis-for-clickbait-detection-in-rss-feeds.html | 55 |
1 files changed, 55 insertions, 0 deletions
diff --git a/public/using-sentiment-analysis-for-clickbait-detection-in-rss-feeds.html b/public/using-sentiment-analysis-for-clickbait-detection-in-rss-feeds.html new file mode 100755 index 0000000..05907b6 --- /dev/null +++ b/public/using-sentiment-analysis-for-clickbait-detection-in-rss-feeds.html | |||
| @@ -0,0 +1,55 @@ | |||
| 1 | <!doctype html><html lang=en-us><meta charset=utf-8><meta name=viewport content="width=device-width,initial-scale=1"><link href="data:image/x-icon;base64,AAABAAEAEBAAAAEAIABoBAAAFgAAACgAAAAQAAAAIAAAAAEAIAAAAAAAAAQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAL69vf8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAv76+/8LBwQkAAAAAAAAAAAAAAAC+vb3/AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAL+9vf/Bv78JAAAAAAAAAAAAAAAAu7q6/wAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC7ubr/vr29CAAAAAAAAAAAy8nJAZ6foP8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAnqGj/6GipAoAAAAAHLjU/xcXHf/BwsL/I8XY/yPK3v8XGiD/IbjL/yPF2f8XGiD/Fxkf/yLF2f8gnK3/Fxog/62ztv8fwNf/FRcd/x271v8mz93/GRsi/xkXHf8p097/GiIp/xobIv8p0t3/KdPe/xocIv8fYmr/KNPe/xoZH/8aHCL/J87c/xy81/8VFxz/IsPZ/8zS0/8XGiD/Ir/R/yPH2/8XGiD/Fxkf/yPH2/8dd4T/GBog/yPJ3f8jyNr/uru9/xcUGv8cudb/EhITDKi5vRKlvMP/RUpOERwcHRAdOj4QHTk8EBwdHRAdNTgQHTo/EBwcHRAcHB0QSGduEKW4vf+koqQfHzg+EBqz0ewSFRv7EyMr/xq51vsTERb7ExUb+xq41fsau9j7ExUb+xiPp/sZudb7ExUb+xMVG/sZuNX/GKvI/BIUGfMdvdn/IrfL/xcaIP8n1eb/J9Dh/xkcIf8ZGR7/J8/f/xxCSv8ZGyH/J9Dg/ybQ4P8ZHCL/FSQs/yPK3/8UExj/GE1b/ybS5P8ZGB7/Ghwj/ynW5P8p2Ob/Ghwi/yWrtv8p1eH/Ghwi/xocIv8p1uT/J8XT/xkcIv8m1un/Hb7d/xUYH/8hzOr/HtHu/xcaIf8XGB//I8vi/xgxOv8XGSD/I8rg/yPK4P8XGiD/GUFL/yPP6f8SERj/Fhkh/x3A4f8AAAAAJ2f9/ydr//8mZPH/AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAlYu38J2v//ydo/f8AAAAAAAAAAAd8/fkFqf//Iob8sAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAMY39awWr//8FfP3/AAAAAAAAAAAFm/7/SfD//wR+/f8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAOB/f9B7v//BaX+/wAAAAAAAAAAQ878SAyZ/v9n1v4KAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADu9v8DDJb+/z3N/XgAAAAA3/sAAN/7AADf+wAA3/sAAAAAAAAAAAAAAAAAAN/7AAAAAAAAAAAAAAAAAAAAAAAAj/EAAI/5AACP8QAA3/sAAA==" rel=icon type=image/x-icon><title>Using sentiment analysis for clickbait detection in RSS feeds</title><meta name=description content="Initial thoughtsOne of the things that interested me for a while now is if major wellestablished news sites use click bait titles to drive additional traffic totheir 
sites and generate additional impressions."><link rel=alternate type=application/rss+xml title="Mitja Felicijan's posts" href=https://mitjafelicijan.com/index.xml><link rel=alternate type=application/rss+xml title="Mitja Felicijan's notes" href=https://mitjafelicijan.com/notes.xml><style>body{padding:1rem;max-width:760px;background:#fff;font-family:times new roman,Times,serif;line-height:1.35rem}hr{margin-block-start:1.5rem}h1,h2,h3{line-height:initial}footer{margin-block-start:3rem}table{max-width:100%;border-collapse:separate;border-spacing:2px;border:1px solid #000;border-left:1px solid #999;border-top:1px solid #999}blockquote{font-style:italic}table thead{background:#eee}td,th{border:1px solid #000;padding:4px;border-right:1px solid #999;border-bottom:1px solid #999;text-align:left}pre{text-wrap:nowrap;overflow-x:auto;margin-block-start:1.5rem;margin-block-end:1.5rem;padding:.5rem 0;border-top:1px solid #000;border-bottom:1px solid #000}pre code{line-height:1.3em}pre,code,pre *,code *{font-family:monospace;font-size:initial!important}img,video,audio{max-width:100%}header{display:flex;flex-direction:row;gap:3rem}nav{display:flex;gap:.75rem}.pstatus-orange{background:gold}.pstatus-green{background:#9acd32}.pstatus-red{background:#cd5c5c}@media only screen and (max-width:600px){header{flex-direction:column;gap:1rem}a{word-wrap:break-word}}</style><header><nav class=main><a href=/>Home</a> | ||
| 2 | <a href=https://git.mitjafelicijan.com/ target=_blank>Git</a> | ||
| 3 | <a href=https://files.mitjafelicijan.com/ target=_blank>Files</a> | ||
| 4 | <a href=/mitjafelicijan.pgp.pub.txt target=_blank>PGP</a> | ||
| 5 | <a href=/curriculum-vitae.html>CV</a> | ||
| 6 | <a href=/index.xml target=_blank>RSS</a></nav></header><main><div><h1>Using sentiment analysis for clickbait detection in RSS feeds</h1><p>Oct 19, 2019<div><h2 id=initial-thoughts>Initial thoughts</h2><p>For a while now I have wondered whether major, | ||
| 7 | well-established news sites use clickbait titles to drive additional traffic to | ||
| 8 | their sites and generate additional impressions.<p>The goal is to see how article titles and the actual content of the articles differ from each | ||
| 9 | other, and whether the titles are clickbait.<h2 id=preparing-and-cleaning-data>Preparing and cleaning data</h2><p>For this example I opted to just use the RSS feed of a news website and decided to | ||
| 10 | go with <a href=https://www.theguardian.com>The Guardian</a> World news. This gets | ||
| 11 | us limited data (~40 articles), and because the description (the actual content) is trimmed, | ||
| 12 | it doesn't really reflect the actual article contents.<p>To get better content I could use the RSS feed as a list of links and | ||
| 13 | scrape the contents directly from the website, but for this simple example this will | ||
| 14 | suffice.<p>There are a couple of requirements we need to install before we continue:<ul><li><code>pip3 install feedparser</code> (parses the RSS feed from a URL)<li><code>pip3 install vaderSentiment</code> (does sentiment polarity analysis)<li><code>pip3 install matplotlib</code> (plots a chart of the results)</ul><p>First we need to fetch the RSS data and strip the HTML from each description.<pre tabindex=0 style=background-color:#fff><code><span style=display:flex><span><span style=color:#00f>import</span> re | ||
| 15 | </span></span><span style=display:flex><span><span style=color:#00f>import</span> feedparser | ||
| 16 | </span></span><span style=display:flex><span> | ||
| 17 | </span></span><span style=display:flex><span>feed_url = <span style=color:#a31515>"https://www.theguardian.com/world/rss"</span> | ||
| 18 | </span></span><span style=display:flex><span>feed = feedparser.parse(feed_url) | ||
| 19 | </span></span><span style=display:flex><span> | ||
| 20 | </span></span><span style=display:flex><span><span style=color:green># sanitize html</span> | ||
| 21 | </span></span><span style=display:flex><span><span style=color:#00f>for</span> item <span style=color:#00f>in</span> feed.entries: | ||
| 22 | </span></span><span style=display:flex><span>    item.description = re.sub(<span style=color:#a31515>'<[^<]+?>'</span>, <span style=color:#a31515>''</span>, item.description) | ||
| 23 | </span></span></code></pre><h2 id=perform-sentiment-analysis>Perform sentiment analysis</h2><p>Now that we have cleaned-up data in our <code>feed.entries</code> object, we can start | ||
| 24 | performing sentiment analysis.<p>There are many sentiment analysis libraries available, ranging from rule-based | ||
| 25 | analysis to analysis backed by machine learning. To keep things | ||
| 26 | simple I decided to use the rule-based library | ||
| 27 | <a href=https://github.com/cjhutto/vaderSentiment>vaderSentiment</a> from | ||
| 28 | <a href=https://github.com/cjhutto>C.J. Hutto</a>. It is a really nice library and quite easy to | ||
| 29 | use.<pre tabindex=0 style=background-color:#fff><code><span style=display:flex><span><span style=color:#00f>from</span> vaderSentiment.vaderSentiment <span style=color:#00f>import</span> SentimentIntensityAnalyzer | ||
| 30 | </span></span><span style=display:flex><span>analyser = SentimentIntensityAnalyzer() | ||
| 31 | </span></span><span style=display:flex><span> | ||
| 32 | </span></span><span style=display:flex><span>sentiment_results = [] | ||
| 33 | </span></span><span style=display:flex><span><span style=color:#00f>for</span> item <span style=color:#00f>in</span> feed.entries: | ||
| 34 | </span></span><span style=display:flex><span>    sentiment_title = analyser.polarity_scores(item.title) | ||
| 35 | </span></span><span style=display:flex><span>    sentiment_description = analyser.polarity_scores(item.description) | ||
| 36 | </span></span><span style=display:flex><span>    sentiment_results.append([sentiment_title[<span style=color:#a31515>'compound'</span>], sentiment_description[<span style=color:#a31515>'compound'</span>]]) | ||
| 37 | </span></span></code></pre><p>Now that we have this data in a shape that is compatible with matplotlib, we can | ||
| 38 | plot the results to see the difference between the title and description sentiment of each | ||
| 39 | article.<pre tabindex=0 style=background-color:#fff><code><span style=display:flex><span><span style=color:#00f>import</span> matplotlib.pyplot <span style=color:#00f>as</span> plt | ||
| 40 | </span></span><span style=display:flex><span> | ||
| 41 | </span></span><span style=display:flex><span>plt.rcParams[<span style=color:#a31515>'figure.figsize'</span>] = (15, 3) | ||
| 42 | </span></span><span style=display:flex><span>plt.plot(sentiment_results, drawstyle=<span style=color:#a31515>'steps'</span>) | ||
| 43 | </span></span><span style=display:flex><span>plt.title(<span style=color:#a31515>'Sentiment analysis relationship between title and description (Guardian World News)'</span>) | ||
| 44 | </span></span><span style=display:flex><span>plt.legend([<span style=color:#a31515>'title'</span>, <span style=color:#a31515>'description'</span>]) | ||
| 45 | </span></span><span style=display:flex><span>plt.show() | ||
| 46 | </span></span></code></pre><h2 id=results-and-assets>Results and assets</h2><ol><li>Because of the small sample size, no firm conclusions can be drawn.<li>A rule-based approach may not be the best way of doing this. Deep | ||
| 47 | learning might give better insights.<li><strong>Next step would be to</strong> periodically fetch RSS items and store them over a | ||
| 48 | longer period of time, then perform the analysis again and use either machine | ||
| 49 | learning or deep learning on top of it.</ol><p><img src=/assets/sentiment-analysis/guardian-sa-title-desc-relationship.png alt="Relationship between title and description"><p>The figure above displays the difference between title and description sentiment for | ||
| 50 | each RSS feed item. 1 means positive and -1 means negative sentiment.<p><a href=/assets/sentiment-analysis/sentiment-analysis.ipynb>» Download Jupyter Notebook</a><h2 id=going-further>Going further</h2><ul><li><a href=https://github.com/bswiss/news_mood>Twitter Sentiment Analysis by Bryan Schwierzke</a><li><a href=https://github.com/thisandagain/sentiment>AFINN-based sentiment analysis for Node.js by Andrew Sliwinski</a><li><a href=https://github.com/adeshpande3/LSTM-Sentiment-Analysis>Sentiment Analysis with LSTMs in Tensorflow by Adit Deshpande</a><li><a href=https://github.com/abdulfatir/twitter-sentiment-analysis>Sentiment analysis on tweets using Naive Bayes, SVM, CNN, LSTM, etc. by Abdul Fatir</a></ul></div></div></main><footer><hr><div><h3>Want to comment or have something to add?</h3>You can write me an email at | ||
| 51 | <a href=mailto:m@mitjafelicijan.com>m@mitjafelicijan.com</a> or catch up | ||
| 52 | with me | ||
| 53 | <a href=https://telegram.me/mitjafelicijan target=_blank>on Telegram</a>.</div><hr><p>This website does not track you. Content is made available under | ||
| 54 | the <a href=https://creativecommons.org/licenses/by/4.0/ target=_blank rel=noreferrer>CC BY 4.0 license</a> unless specified | ||
| 55 | otherwise. Blog feed is available as <a href=/index.xml target=_blank>RSS feed</a>.</footer><script src=https://cdn.usefathom.com/script.js data-site=XHQARKXP defer></script> \ No newline at end of file | ||
