
Effective Amazon Machine Learning by Alexis Perrier


Removing duplicate tweets

In any Twitter-based NLP analysis, you end up dealing with bots, even when collecting tweets about vegetables! Our dataset contained many versions of promotional tweets in which the text was identical across tweets but the links and users differed. We remove these duplicates by first stripping the URLs from the tweets and then applying the Pandas drop_duplicates method. Since all URLs in tweets start with https://t.co/, they are easy to remove. We will create a new tweet column without URLs in our dataframe. The following line, given a tweet, returns the tweet without URLs:

' '.join([tk for tk in tweet.split(' ') if 'https://t.co/' not in tk])
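The two steps described above (strip URLs, then drop duplicates) might be combined as in the following minimal sketch. The sample tweets, the column names user, text, and no_urls, and the helper name remove_urls are all hypothetical and chosen only for illustration; the dataset's actual schema may differ.

```python
import pandas as pd


def remove_urls(tweet):
    """Return the tweet with every token containing a t.co link removed."""
    return ' '.join(tk for tk in tweet.split(' ') if 'https://t.co/' not in tk)


# Hypothetical sample data: two promotional tweets that differ only by link and user.
df = pd.DataFrame({
    'user': ['a', 'b', 'c'],
    'text': [
        'Buy fresh kale now https://t.co/abc123',
        'Buy fresh kale now https://t.co/xyz789',
        'I love roasted carrots',
    ],
})

# New column holding the tweet text without URLs.
df['no_urls'] = df['text'].apply(remove_urls)

# Keep only the first row for each distinct URL-free text.
deduped = df.drop_duplicates(subset='no_urls').reset_index(drop=True)
```

Deduplicating on the URL-free column rather than on the raw text is what lets tweets with different links collapse into a single row; the original text column is kept alongside it in case the links are needed later.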

When working with pandas ...
