Date: 2022-05-23 16:53
Read: 7 min

Creating a Dataset of Tweets

How to use Twitter APIs to collect tweets and build your dataset for NLP applications.


Introduction

Twitter has been one of the most relevant social networks for quite some time, but did you know it can also be a rich data source for research in NLP?

People post there about many different topics, from entertainment such as games, movies, and lifestyle to more serious subjects like science, news, the stock market, and especially politics and all the discussion around it. With access to all this content, we can analyze opinions or sentiment about a brand, a person, or an event, among many other applications in data science.

In this article, we present methods to access this data through Twitter API v1.1, and we also discuss the restrictions and some good practices to prepare your environment for collecting the dataset you need for your experiments. Finally, we will show how to import your tweet dataset into Kili with clean formatting.

Create an application

First, we need to create an application on Twitter's developer portal: https://developer.twitter.com/en/portal/apps/new, as shown in the figure below.


App interface of Twitter's developer portal

Once you name your app (here we used 'covid_dataset_builder') and click the 'Next' button, a keys & tokens page appears. Make sure you save the keys shown on that page; you will need them to access the API.

With your app created and tokens in hand, we can move forward and play with the Twitter APIs. We will use the tweepy library, which can be installed with:

pip install tweepy

Search API

Suppose we want to create a dataset of tweets about COVID so we can conduct some research, such as sentiment analysis. If we could read Twitter's database and select tweets containing the term 'covid', it's fair to believe we would get what we need.

Before performing any search query, we must authenticate using our application keys (the ones we saved when we created our app). Let’s store them in the variables below to simplify:

CONSUMER_KEY = '<consumer key here>'
CONSUMER_SECRET = '<consumer secret here>'
 
OAUTH_TOKEN = '<token here>'
OAUTH_TOKEN_SECRET = '<token secret here>'

And the authentication is performed with:

import tweepy
 
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
api = tweepy.API(auth)

In summary, we authenticated using our app keys and then we got access to the API functions, which are stored in the ‘API’ object. Now we can use a Cursor object to query tweets containing the term ‘covid’ and iterate through the results.

for tweet in tweepy.Cursor(api.search_tweets,
                           q="covid",
                           lang="en").items():
    print(tweet.text)

Notice that our Cursor calls the search_tweets API: the parameter 'q' indicates that we are looking for the term 'covid', and the parameter 'lang' restricts the results to English ('en').

In this case, each tweet is an extensive JSON object that we can explore to retrieve many fields we may need. But for our purpose, we'll focus on the text, which is stored in the tweet.text field.
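To keep only the fields we care about, we can flatten each JSON payload into a small record. This is a sketch; tweet_summary is a hypothetical helper, but the keys it reads are standard Twitter API v1.1 tweet-object fields:

```python
def tweet_summary(tweet_json):
    """Pick a few commonly used fields out of a tweet's JSON payload.
    The keys below are standard Twitter API v1.1 tweet-object fields."""
    return {
        "id": tweet_json["id"],
        "text": tweet_json["text"],
        "created_at": tweet_json.get("created_at"),
        "user": tweet_json.get("user", {}).get("screen_name"),
        "retweet_count": tweet_json.get("retweet_count", 0),
    }

# With tweepy, the raw payload is available as tweet._json:
#     row = tweet_summary(tweet._json)
```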

Even though we can easily retrieve tweets using the Search API, there are still some restrictions:

  1. Only 450 requests per 15-minute window are allowed;

  2. We can only retrieve tweets posted in the last 7 days.

This means our covid dataset builder app can only get tweets related to covid posted in the last week. (For the request limit, tweepy can also be created with wait_on_rate_limit=True so that it sleeps automatically until the window resets.) An alternative for building a larger dataset is to use the Streaming API instead, as we'll see in the next section.

Streaming API

Instead of performing queries on Twitter, a client application using the Streaming API connects to the API and listens for specific terms. Whenever a tweet containing one of the tracked terms is posted, it is streamed into the application.

This time, we need to create a class extending tweepy's Stream and implement the on_status method, which will be called for each incoming tweet, as below:

class MyStream(tweepy.Stream):
 
    def on_status(self, status):
        print(status.text)

Notice that, so far, our on_status method simply prints the tweet. But in fact some tweets will appear truncated. To retrieve the full tweets, we have to access the JSON object within the status variable and obtain the full text. We change our code as follows:


    def on_status(self, status):
        if 'retweeted_status' in status._json:
            if 'extended_tweet' in status._json['retweeted_status']:
                print(status._json['retweeted_status']['extended_tweet']['full_text'])
            else:
                print(status.text)
        else:
            if 'extended_tweet' in status._json:
                print(status._json['extended_tweet']['full_text'])
            else:
                print(status.text)

Finally, as in the previous section, let us tell the API to track the term 'covid':

myStream = MyStream(CONSUMER_KEY, CONSUMER_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
 
myStream.filter(track=['covid'], threaded=True)

An application using the Streaming API must keep the connection alive. The advantage is that it can capture any tweet the moment it is posted, although there are limitations for users who try to open multiple connections.

Strategies to Collect a Dataset

With access to both APIs, we can now decide how we will collect our dataset. An easy choice is to use the Streaming API to create a script that keeps running indefinitely and collects all tweets we need.

The MyStream class can be modified to save any tracked tweet into a database or a file, as below. 

class MyStream(tweepy.Stream):
 
    def __init__(self, file, *args, **kwargs):
        # Forward the credentials to tweepy.Stream and keep
        # a handle to the output file.
        super().__init__(*args, **kwargs)
        self.file = file
 
    def on_status(self, status):
        self.file.write(status.text + '\n')

Notice that we used the constructor to receive a file handle, and we write a new line to it whenever a tracked tweet arrives. In this case, we assume the file is opened and closed outside this class.

An important issue for those who use the Streaming API is how to handle the long-lived connection. Twitter itself gives a series of recommendations, such as avoiding disconnections by not establishing many connections with the same credentials, using a backoff strategy if no data has been streamed in the last 90 seconds, and not reading tracked tweets too slowly. Let us focus a little on this last point.
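The backoff recommendation can be sketched with a small generator. This is illustrative only: Twitter's documentation prescribes different starting values and caps depending on the error type, so treat the defaults below as placeholders:

```python
def backoff_delays(initial=5.0, factor=2.0, maximum=320.0):
    """Yield reconnection delays that double up to a cap, e.g.
    5s, 10s, 20s, ..., 320s. Twitter recommends this kind of
    exponential backoff before reconnecting (exact values vary
    by error type, so these defaults are only illustrative)."""
    delay = initial
    while True:
        yield min(delay, maximum)
        delay *= factor

# Usage sketch: sleep the next delay after each failed reconnect,
# and create a fresh generator once a connection succeeds.
```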

We need to consider that there may be a large number of tweets in the input stream, especially when tracking a term as popular as 'covid'. In this case, our on_status method needs to do its job quickly. So if we plan to collect a tweet and run some NLP preprocessing steps, or even generate embeddings on the fly, we may run into problems. A better strategy is to simply store tweets in an intermediate/temporary file, to be consumed later by another process, or to hand them off to a separate worker thread for preprocessing, and move on to the next incoming tweet.

Since a Streaming API script is required to stay up and running constantly, it may cause side effects such as consuming more compute hours, which increases cost if it's running in a cloud environment, for example. If that is the case, we can still devise a strategy using the Search API instead.

Before we dive into a Search API-based solution, recall that any search query performed will be restricted to results from the last 7 days. How could we work around it?

In this case, we can write a simple script using the Search API and run it periodically, say once every 7 days. We cannot retrieve tweets older than 7 days, but this is still better than the Streaming API, which cannot retrieve tweets posted even a minute before we connect, and we stay up to date by running the script again 7 days later.
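A sketch of the deduplication part of such a weekly job: the Search API accepts a since_id parameter, so storing the highest id we have already collected lets each run pick up where the previous one stopped. The helper and the file name below are illustrative:

```python
def newest_id(records):
    """Return the highest tweet id among saved records, or None if
    nothing has been collected yet. Passed as since_id on the next
    run, it avoids re-collecting tweets we already have."""
    ids = [r["id"] for r in records]
    return max(ids) if ids else None

# Weekly job sketch (hypothetical file tweets.jsonl):
# 1. load previously saved tweets from tweets.jsonl
# 2. query api.search_tweets(q="covid", lang="en",
#                            since_id=newest_id(saved))
# 3. append the new results to tweets.jsonl
```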

If we're running the application on Linux or in the cloud, we can use crontab. First, run the command:

crontab -e

We then edit the file to schedule our script to run every Sunday at midnight by adding a line like this:

0 0 * * 0 python my_script.py > log.txt 2>&1

The string 0 0 * * 0 indicates the frequency at which the task will be executed. The first and second 0's indicate 00:00 (zero hours and zero minutes), and the last 0 is an index for the day of the week (0 to 6, starting with Sunday). It is followed by the command to be executed.

Here, we assume the script is named 'my_script.py' and its output is redirected to the log.txt file. The part '2>&1' redirects both standard output and standard error to the same file.


But if we're running the script in the cloud and don't want to keep an instance running all the time, an easy and affordable option is to schedule the script with Google App Engine. First, let us create a new project at https://console.cloud.google.com/projectcreate


After that, in the App Engine dashboard, go to the Cron jobs menu:


Now we can create a new job using the same frequency string 0 0 * * 0.


The upcoming screens ask you to fill in a URL for the job to call (instead of a simple command line as in the local crontab solution discussed earlier). So if you want to follow this approach, you need to host your script on some server, or even on Google App Engine itself. Another drawback is that it may require you to store collected tweets directly in storage services provided by Google Cloud.
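Since App Engine triggers the job through a URL, the script has to sit behind a minimal web handler. Here is a plain-WSGI sketch (a framework such as Flask would work just as well; run_weekly_collection is a hypothetical function wrapping the Search API job). App Engine sets the X-Appengine-Cron header only on genuine cron requests, which lets us reject outside calls:

```python
def app(environ, start_response):
    """Minimal WSGI handler for an App Engine cron-triggered job."""
    if environ.get("PATH_INFO") == "/collect":
        # App Engine adds "X-Appengine-Cron: true" to real cron requests
        # and strips it from external traffic, so this check blocks
        # anyone else from triggering the collection.
        if environ.get("HTTP_X_APPENGINE_CRON") != "true":
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"cron only"]
        # run_weekly_collection()  # hypothetical Search API job
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"collection started"]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]
```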


Importing your dataset in Kili

To visualize and annotate your data, we are now going to see how to import your dataset into Kili. Since we used an API to retrieve tweets, we will set everything up with the Kili API as well.

Let’s first create a TEXT project with a simple classification task:

from kili.client import Kili
kili = Kili()
json_interface = {
  "jobs": {
    "CLASSIFICATION_JOB": {
      "content": {
        "categories": {
          "YES": {
            "children": [],
            "name": "Yes",
            "id": "category123"
          },
          "NO": {
            "children": [],
            "name": "No",
            "id": "category124"
          }
        },
        "input": "radio"
      },
      "instruction": "Is this tweet an opinion about Covid-19?",
      "mlTask": "CLASSIFICATION",
      "required": 1,
      "isChild": False,
      "isNew": False
    }
  }
}
project = kili.create_project(
    input_type='TEXT',
    title='Tweet classification',
    description='From the blogpost creating a dataset of tweets',
    json_interface=json_interface
)
project_id = project['id']

Kili offers two possibilities for importing TEXT assets: 

  1. a simple one that displays the text without any formatting, fast and easy;

  2. a more complete one, called richText, that lets you personalize the text structure and format.

Here we will use the richText import, trying to reproduce the tweet format. For example, let's import this tweet: https://twitter.com/covid19nz/status/1547771218446544896?s=20&t=gQQB3p2NQMdCvygpoxq44g

Here is a json_content that will display the tweet with a nice, tweet-like formatting:

json_content = [{
        "display": "flex",
        "flexDirection": "column",
        "marginRight": "50px",
        "marginLeft": "50px",
        "children": [
            {
                "alignSelf": "flex-start",
                "display": "flex",
                "margin": "30px",
                "children": [
                    {
                        "padding": "20px",
                        "borderRadius": "10px",
                        "border": "1px solid #DCDCDC",
                        "background": "white",
                        "maxWidth": "600px",
                        "display": "block",
                        "children": [
                            {
                                "display": "block",
                                "children": [
                                    {
                                        "bold": True,
                                        "text": "united against COVID-19",
                                        "id": "name"
                                    }
                                ]
                            },
                            {
                                "display": "block",
                                "color": "grey",
                                "children": [
                                    {
                                        "italic": True,
                                        "id": "username",
                                        "text": "@covid19nz"
                                    }
                                ]
                            },
                            {
                                "display": "block",
                                "type": "h4",
                                "children": [
                                    {
                                        "id": "message",
                                        "marginTop": "5px",
                                        "text": "⏺️ Wider availability of antiviral treatments \n" \
                                                "\n" \
                                                "From Monday 18 July 2022, the access criteria for three antiviral " \
                                                "treatments for COVID-19 will be widened to include a wider group " \
                                                "of people at risk of severe illness from COVID-19 infection."
                                    }
                                ]
                            },
                            {
                                "display": "block",
                                "color": "grey",
                                "children": [
                                    {
                                        "id": "date",
                                        "text": "4:33 AM · 15 juil. 2022 · Sprinklr"
                                    }
                                ]
                            },
                        ]
                    }
                ]
            }
        ]
    }
]
kili.append_many_to_dataset(
    project_id=project_id,
    json_content_array=[json_content],
    external_id_array=['Covid Tweet']
)

Here is how it renders on the labeling interface:

Labeling interface on the tweet imported with Rich Text
This structure can be used to import all tweets. For more information on how to import assets into Kili with richText, have a look at the recipe on importing text assets.
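For bulk import, the layout above can be turned into a small template function that fills in the name, username, message, and date of each tweet (a hypothetical helper, not part of the Kili SDK):

```python
def tweet_to_json_content(name, username, message, date):
    """Build a Kili richText json_content for one tweet, following
    the card layout shown above."""
    return [{
        "display": "flex", "flexDirection": "column",
        "marginRight": "50px", "marginLeft": "50px",
        "children": [{
            "alignSelf": "flex-start", "display": "flex", "margin": "30px",
            "children": [{
                "padding": "20px", "borderRadius": "10px",
                "border": "1px solid #DCDCDC", "background": "white",
                "maxWidth": "600px", "display": "block",
                "children": [
                    {"display": "block",
                     "children": [{"bold": True, "id": "name", "text": name}]},
                    {"display": "block", "color": "grey",
                     "children": [{"italic": True, "id": "username", "text": username}]},
                    {"display": "block", "type": "h4",
                     "children": [{"id": "message", "text": message}]},
                    {"display": "block", "color": "grey",
                     "children": [{"id": "date", "text": date}]},
                ]
            }]
        }]
    }]

# One json_content per collected tweet can then be appended
# to json_content_array before calling append_many_to_dataset.
```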

Conclusion

The Twitter Search and Streaming APIs can both be used to retrieve tweets, but you must take API restrictions and computational performance into account when creating a script to build a dataset of tweets.


An article by Fernando Vieira da Silva
