
How to Create Datasets: strategies and examples

Are you tired of scouring the internet for the perfect dataset to train your ML models? Worry no more! This article will show you six tried-and-true methods for creating datasets that will make your models sing. These techniques may not be magic, but they'll fit the bill for most ML projects.


Who loves datasets?! At Kili Technology, we love datasets; no shocker there. But guess what none of us likes? Spending too much time creating datasets (or hunting for them). Although this step is essential to the machine learning process, let's admit it: the task gets daunting quickly. Don't worry, though: we've got you covered!

This article will walk through six common strategies for building a dataset.

Although these strategies may not suit every use case, they cover the most common approaches and should give you a hand in building your ML dataset. Without further ado, let's create your dataset!

Strategy #1 to Create your Dataset: Ask Your IT Department

When it comes to building and fine-tuning your machine learning models, one strategy should be at the top of your list: using your own data. Not only is this data naturally tailored to your specific needs, but it's also the best way to ensure your model is optimized for the kinds of data it will encounter in the real world. So if you want maximum performance and accuracy, prioritize your internal data first.

Here are additional techniques to gather more data from your users:

User in the loop


Are you looking to get more data from your users? One effective way is to design your product so that sharing data with you is effortless. Take inspiration from companies like Meta (formerly Facebook) and its fault-tolerant UX: users might not notice it, but the UX nudges them to correct machine errors and thereby improve the ML algorithms.

Side business

Let's focus on data gathered through the "freemium" model, which is particularly popular in the computer vision field. By offering a free-to-use app with valuable features, you can attract a large user base and gather valuable data in the process. A great example of this technique can be seen in popular photo-editing apps, which offer users powerful editing tools while collecting data (such as face images) for the company's core business. It's a win-win for everyone involved!


To make the most of your internal data, you should ensure it meets these three crucial criteria:

  1. Compliance: Ensure your data is fully compliant with all relevant legislation and regulations, such as the GDPR and CCPA.

  2. Security: Have the necessary credentials and safeguards to protect your data and ensure that only authorized personnel can access it.

  3. Timeliness: Keep your data fresh and up-to-date to ensure it's as valuable and relevant as possible.

Strategy #2 to Create your Dataset: Look for Research Dataset platforms

Several websites gather ready-to-use datasets for machine learning. Among the most famous are Kaggle, the UCI Machine Learning Repository, Hugging Face Datasets, and Google Dataset Search.

Strategy #3 to Create your Dataset: Look for GitHub Awesome pages


GitHub Awesome pages are curated lists that gather resources for a specific domain. Isn't that cool?! There are fantastic pages for many topics, and luckily for us, datasets are one of them.

Awesome pages can be on more or less specific topics:
- Some awesome pages gather resources with a broad scope, ranging from agriculture to economics and more: https://github.com/awesomedata/awesome-public-datasets or https://github.com/NajiElKotob/Awesome-Datasets
- Others cover narrower, more specific topics: for example, datasets for tiny object detection (https://github.com/kuanhungchen/awesome-tiny-object-detection) or few-shot learning (https://github.com/Bryce1010/Awesome-Few-shot).

Strategy #4 to Create your Dataset: Crawl and Scrape the Web


Crawling means browsing a large number of web pages that might interest you. Scraping means extracting data from given web pages.

Both tasks can be more or less complex. Crawling is easier if you restrict yourself to pages from a specific domain (for example, all Wikipedia pages).
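To make the domain-restricted idea concrete, here is a minimal, standard-library-only sketch of one building block of a crawler: extracting the links on a page and keeping only those that stay on the same domain. The `extract_same_domain_links` helper and the sample HTML are illustrative, not part of any particular crawling framework:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag encountered in the HTML."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_same_domain_links(page_url, html):
    """Return absolute links from `html` that stay on page_url's domain."""
    parser = LinkExtractor()
    parser.feed(html)
    domain = urlparse(page_url).netloc
    # Resolve relative hrefs against the page URL, then filter by domain.
    absolute = (urljoin(page_url, href) for href in parser.links)
    return [url for url in absolute if urlparse(url).netloc == domain]

sample = '<a href="/wiki/Dataset">in</a> <a href="https://other.org/x">out</a>'
print(extract_same_domain_links("https://en.wikipedia.org/wiki/Main_Page", sample))
# ['https://en.wikipedia.org/wiki/Dataset']
```

A real crawler would fetch each kept link in turn (politely, respecting robots.txt and rate limits) and feed the new pages back through the same filter.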

Both these techniques enable the collection of different types of datasets:

  • Raw text, which can be used to train large language models.

  • Task-specific text, such as product reviews paired with star ratings, used to train specialized models.

  • Text with metadata, which enables training classification models.

  • Multilingual text, which is used to train translation models.

  • Images with captions, which enable training image classification or image-to-text models.
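As a small illustration of the "text with metadata" case, the sketch below turns hypothetical scraped records (review text plus a star rating) into a labeled classification dataset and writes it out as CSV. The record fields and the 4-star threshold are assumptions for the example:

```python
import csv
import io

# Hypothetical scraped records: review text plus a star rating as metadata.
records = [
    {"text": "Great battery life", "stars": 5},
    {"text": "Stopped working after a week", "stars": 1},
]

def to_classification_rows(records, threshold=4):
    """Map the star-rating metadata to a binary sentiment label."""
    return [
        {"text": r["text"],
         "label": "positive" if r["stars"] >= threshold else "negative"}
        for r in records
    ]

rows = to_classification_rows(records)

# Write the labeled examples as CSV, a common interchange format for training data.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "label"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

The same pattern applies to any metadata that can serve as a label: categories, tags, timestamps, and so on.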

Pro tip: you can build your own crawler and scraper with Python packages such as Scrapy, Beautiful Soup, and Selenium.

You can also find more specific but ready-to-use repositories on GitHub, including:

Google Image scraper: https://github.com/jqueguiner/googleImagesWebScraping

News scraper: https://github.com/fhamborg/news-please

Strategy #5 to Create your Dataset: Use Product APIs

Some large service providers and media companies expose an API through which you can retrieve their openly available data. You can, for example, think of the Twitter API, the Reddit API, or the YouTube Data API.
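Most product APIs follow the same shape: a base URL, a resource endpoint, and query parameters for pagination. Here is a minimal, standard-library-only sketch of composing such a request URL; the base URL, endpoint, and parameter names are all hypothetical, and each real provider documents its own:

```python
from urllib.parse import urlencode

def build_api_url(base, endpoint, **params):
    """Compose a paginated API request URL from base, endpoint, and query params."""
    query = urlencode(sorted(params.items()))  # sorted for a stable, testable URL
    return f"{base}/{endpoint}?{query}"

# Fetching page 2 of a hypothetical "reviews" resource, 100 items per page:
url = build_api_url("https://api.example.com/v1", "reviews", page=2, per_page=100)
print(url)  # https://api.example.com/v1/reviews?page=2&per_page=100
```

In practice you would send this URL with an HTTP client such as `requests`, pass your API key in the headers the provider specifies, and respect its rate limits while iterating over pages.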

Strategy #6 to Create your Dataset: Look for datasets used in research papers


You may be scratching your head and wondering how on earth you'll find the right dataset to solve your problem. No need to pull your hair out over it!

Odds are that some researchers were already interested in your use case and faced the same problem as you. If so, you can find the datasets they used, and sometimes built themselves. If they published the dataset on an open platform, you can retrieve it directly. If not, you can contact them and ask whether they would share it; a polite request never hurts, does it?

How to Create a Dataset of Amazon Reviews with Python and BeautifulSoup: A Step-by-Step Guide

Now that we’ve shared all our strategies to find or to build your own datasets, let’s practice our dataset-building skills with a real-life example.

Here’s your step-by-step tutorial on extracting valuable insights from Amazon reviews using Python and BeautifulSoup.

By the end of it, you'll have a fully functional Python script that effectively scrapes Amazon reviews and compiles them into a clean, structured dataset ready for analysis.

Let's jump right in!

Step 1: Install Required Libraries

Before diving into the code, make sure you have Python and the following libraries installed: requests, beautifulsoup4, and pandas.

You can install them using pip:

pip install requests beautifulsoup4 pandas

Step 2: Import Libraries and Set Up the Base URL

Begin by importing the necessary libraries and establishing the base URL for Amazon's product page:

import requests
from bs4 import BeautifulSoup
import pandas as pd

base_url = 'https://www.amazon.com/product-reviews/{}?pageNumber={}'
product_id = 'YOUR_PRODUCT_ID_HERE'

This sets the groundwork for our script by importing the libraries and defining the base URL to access the Amazon product review pages.

Step 3: Define a Function to Scrape Reviews

Now, create a function to scrape reviews from a single page:

def scrape_reviews(product_id, page_number):
    # Build the review-page URL and send the request with a browser-like User-Agent
    url = base_url.format(product_id, page_number)
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    reviews = []
    # Extract title, body, and star rating from each review block on the page
    for review in soup.find_all('div', class_='review'):
        title = review.find('a', class_='review-title').text.strip()
        content = review.find('span', class_='review-text-content').text.strip()
        rating = float(review.find('i', class_='review-rating').text.strip().split()[0])
        reviews.append({'title': title, 'content': content, 'rating': rating})
    return reviews

This function takes a product ID and a page number as input, constructs the URL, and sends an HTTP request to fetch the review page. It then parses the HTML content using BeautifulSoup and extracts the review title, content, and rating for each review on the page.

Step 4: Scrape Multiple Pages and Save the Dataset

Finally, create a function to scrape reviews from multiple pages and save them to a CSV file:

def scrape_all_reviews(product_id, num_pages):
    all_reviews = []
    for i in range(1, num_pages + 1):
        print(f"Scraping page {i}...")
        reviews = scrape_reviews(product_id, i)
        all_reviews.extend(reviews)  # accumulate the reviews from each page
    return all_reviews

# Replace 'num_pages' with the number of pages you want to scrape
num_pages = 10
dataset = scrape_all_reviews(product_id, num_pages)
df = pd.DataFrame(dataset)
df.to_csv('amazon_reviews.csv', index=False)

This function, scrape_all_reviews, takes the product ID and the number of pages you want to scrape. It calls the scrape_reviews function for each page and collects the reviews in a list. After all the pages have been scraped, it converts the list of reviews into a pandas DataFrame and saves it as a CSV file.


Congratulations! You've successfully created a dataset of Amazon reviews using Python and BeautifulSoup. You can now utilize this dataset for your machine learning or data science projects. This tutorial has provided you with a foundation for web scraping techniques and the ability to collect valuable data. Feel free to modify the script to suit your specific needs or to target other websites. We hope you found this tutorial beneficial in your journey towards data science mastery. Enjoy!

Key Takeaways


So there you have it! With these six strategies and this comprehensive tutorial, you should be well on your way to building the dataset of your dreams.

But wait a minute: now that your dataset is likely ready, wouldn't it be time to annotate it? To keep up the momentum, feel free to try the Kili Technology platform by signing up for a free trial.

Get started! Build better data, now.