Diffstat (limited to 'content/posts/2019-10-19-using-sentiment-analysis-for-clickbait-detection.md')
| -rw-r--r-- | content/posts/2019-10-19-using-sentiment-analysis-for-clickbait-detection.md | 50 |
1 files changed, 25 insertions, 25 deletions
diff --git a/content/posts/2019-10-19-using-sentiment-analysis-for-clickbait-detection.md b/content/posts/2019-10-19-using-sentiment-analysis-for-clickbait-detection.md
index 995da25..e7324bb 100644
--- a/content/posts/2019-10-19-using-sentiment-analysis-for-clickbait-detection.md
+++ b/content/posts/2019-10-19-using-sentiment-analysis-for-clickbait-detection.md
| @@ -7,22 +7,22 @@ draft: false | |||
| 7 | 7 | ||
| 8 | ## Initial thoughts | 8 | ## Initial thoughts |
| 9 | 9 | ||
| 10 | One of the things that interested me for a while now is if major well | 10 | One of the things that interested me for a while now is if major well |
| 11 | established news sites use click bait titles to drive additional traffic | 11 | established news sites use click bait titles to drive additional traffic to |
| 12 | to their sites and generate additional impressions. | 12 | their sites and generate additional impressions. |
| 13 | 13 | ||
| 14 | Goal is to see how article titles and actual content of article differ from | 14 | Goal is to see how article titles and actual content of article differ from each |
| 15 | each other and see if titles are clickbaited. | 15 | other and see if titles are clickbaited. |
| 16 | 16 | ||
| 17 | ## Preparing and cleaning data | 17 | ## Preparing and cleaning data |
| 18 | 18 | ||
| 19 | For this example I opted to just use RSS feed from a new website and decided | 19 | For this example I opted to just use RSS feed from a new website and decided to |
| 20 | to go with [The Guardian](https://www.theguardian.com) World news. While this | 20 | go with [The Guardian](https://www.theguardian.com) World news. While this gets |
| 21 | gets us limited data (~40) articles and also description (actual content) is | 21 | us limited data (~40) articles and also description (actual content) is trimmed |
| 22 | trimmed this really doesn't reflect the actual article contents. | 22 | this really doesn't reflect the actual article contents. |
| 23 | 23 | ||
| 24 | To get better content I could use web scraping and use RSS as link list and | 24 | To get better content I could use web scraping and use RSS as link list and |
| 25 | fetch contents directly from website, but for this simple example this will | 25 | fetch contents directly from website, but for this simple example this will |
| 26 | suffice. | 26 | suffice. |
| 27 | 27 | ||
| 28 | There are couple of requirements we need to install before we continue: | 28 | There are couple of requirements we need to install before we continue: |
| @@ -50,12 +50,12 @@ for item in feed.entries: | |||
| 50 | Since we now have cleaned up data in our `feed.entries` object we can start with | 50 | Since we now have cleaned up data in our `feed.entries` object we can start with |
| 51 | performing sentiment analysis. | 51 | performing sentiment analysis. |
| 52 | 52 | ||
| 53 | There are many sentiment analysis libraries available that range from rule-based | 53 | There are many sentiment analysis libraries available that range from rule-based |
| 54 | sentiment analysis up to machine learning supported analysis. To keep things | 54 | sentiment analysis up to machine learning supported analysis. To keep things |
| 55 | simple I decided to use rule-based analysis library | 55 | simple I decided to use rule-based analysis library |
| 56 | [vaderSentiment](https://github.com/cjhutto/vaderSentiment) from | 56 | [vaderSentiment](https://github.com/cjhutto/vaderSentiment) from |
| 57 | [C.J. Hutto](https://github.com/cjhutto). Really nice library and quite | 57 | [C.J. Hutto](https://github.com/cjhutto). Really nice library and quite easy to |
| 58 | easy to use. | 58 | use. |
| 59 | 59 | ||
| 60 | ```python | 60 | ```python |
| 61 | from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer | 61 | from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer |
| @@ -68,9 +68,9 @@ for item in feed.entries: | |||
| 68 | sentiment_results.append([sentiment_title['compound'], sentiment_description['compound']]) | 68 | sentiment_results.append([sentiment_title['compound'], sentiment_description['compound']]) |
| 69 | ``` | 69 | ``` |
| 70 | 70 | ||
| 71 | Now that we have this data in a shape that is compatible with matplotlib we can | 71 | Now that we have this data in a shape that is compatible with matplotlib we can |
| 72 | plot results to see the difference between title and description sentiment of | 72 | plot results to see the difference between title and description sentiment of an |
| 73 | an article. | 73 | article. |
| 74 | 74 | ||
| 75 | ```python | 75 | ```python |
| 76 | import matplotlib.pyplot as plt | 76 | import matplotlib.pyplot as plt |
| @@ -85,15 +85,15 @@ plt.show() | |||
| 85 | ## Results and assets | 85 | ## Results and assets |
| 86 | 86 | ||
| 87 | 1. Because of the small sample size further conclusions are impossible to make. | 87 | 1. Because of the small sample size further conclusions are impossible to make. |
| 88 | 2. Rule-based approach may not be the best way of doing this. By using deep | 88 | 2. Rule-based approach may not be the best way of doing this. By using deep |
| 89 | learning we would be able to get better insights. | 89 | learning we would be able to get better insights. |
| 90 | 3. **Next step would be to** periodically fetch RSS items and store them over | 90 | 3. **Next step would be to** periodically fetch RSS items and store them over a |
| 91 | a longer period of time and then perform analysis again and use either | 91 | longer period of time and then perform analysis again and use either machine |
| 92 | machine learning or deep learning on top of it. | 92 | learning or deep learning on top of it. |
| 93 | 93 | ||
| 94 |  | 94 |  |
| 95 | 95 | ||
| 96 | Figure above displays difference between title and description sentiment for | 96 | Figure above displays difference between title and description sentiment for |
| 97 | specific RSS feed item. 1 means positive and -1 means negative sentiment. | 97 | specific RSS feed item. 1 means positive and -1 means negative sentiment. |
| 98 | 98 | ||
| 99 | [» Download Jupyter Notebook](/assets/sentiment-analysis/sentiment-analysis.ipynb) | 99 | [» Download Jupyter Notebook](/assets/sentiment-analysis/sentiment-analysis.ipynb) |
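
The post's stated goal is to compare title sentiment with description sentiment per article. The following is a minimal sketch of that comparison, not the author's code. It assumes `sentiment_results` holds `[title_compound, description_compound]` pairs, as built in the post. The function name `sentiment_gaps`, the `0.5` threshold, and the example scores are hypothetical and chosen only for illustration.

```python
def sentiment_gaps(sentiment_results, threshold=0.5):
    """Return (index, gap) for items whose title and description
    compound scores differ by at least `threshold`.

    `threshold` is an assumed cut-off, not a value from the post.
    """
    flagged = []
    for index, (title_score, description_score) in enumerate(sentiment_results):
        # Positive gap: title reads more positive than the description.
        gap = title_score - description_score
        if abs(gap) >= threshold:
            flagged.append((index, round(gap, 4)))
    return flagged


# Made-up compound scores for illustration; VADER compound values lie in [-1, 1].
example_results = [[0.6, 0.5], [-0.8, 0.1], [0.0, 0.7]]
flagged_items = sentiment_gaps(example_results)
```

A large gap only means the two texts score differently under the chosen analyzer. With roughly 40 items and trimmed descriptions, as the post notes, it is not evidence of clickbait on its own.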
