Date: 2023-01-30 13:57
Read: 7 min

How to Create Datasets - a Practical Guide

Are you tired of scouring the internet for the perfect dataset to train your ML models? Worry no more! This article will show you six tried-and-true methods for creating datasets that will make your models sing. These techniques may not be magic, but they'll fit the bill for most ML projects.

Introduction

Who loves datasets?! At Kili Technology, we do, which should come as no surprise. But guess what none of us like? Spending too much time creating (or searching for) datasets. Although this step is essential to the machine learning process, we must admit that it gets daunting quickly. Do not worry, though: we've got you covered!

This article goes through six common strategies to consider when building a dataset.

Although these strategies may not suit every use case, they are common approaches that should give you a hand in building your ML dataset. Without further ado, let's create your dataset!

Strategy #1 to Create your Dataset: Ask your IT team

When it comes to building and fine-tuning your machine learning models, one strategy should be at the top of your list: using your own data. Not only is this data naturally tailored to your specific needs, but it is also the best way to ensure that your model is optimized for the kind of data it will encounter in the real world. So if you want to achieve maximum performance and accuracy, prioritize your internal data first.

Here are additional techniques to gather more data from your users:

User in the loop

Are you looking to get more data from your users? One effective way to do so is to design your product so that it is easy for users to share their data with you. Take inspiration from companies like Meta (formerly Facebook) and its fault-tolerant UX. Users might not notice it, but the UX leads them to correct machine errors or improve ML algorithms.
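To make the idea concrete, here is a minimal Python sketch of how user corrections could be turned into training rows. The `FeedbackRecord` class, its field names, and the sample values are hypothetical and only illustrate the principle; this is not how any particular company implements it.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class FeedbackRecord:
    """One user interaction: the input, the model's guess, and the user's fix (if any)."""
    input_ref: str
    model_prediction: str
    user_correction: Optional[str] = None  # None means the user accepted the prediction

    @property
    def label(self) -> str:
        # A correction from the user wins; otherwise the accepted prediction is the label.
        if self.user_correction is not None:
            return self.user_correction
        return self.model_prediction


def to_training_rows(records):
    """Turn feedback records into (input, label) rows, flagging the corrected ones."""
    return [
        {"input": r.input_ref, "label": r.label, "corrected": r.user_correction is not None}
        for r in records
    ]


# Hypothetical sample data.
feedback = [
    FeedbackRecord("photo_001.jpg", "cat"),
    FeedbackRecord("photo_002.jpg", "cat", user_correction="dog"),
]
rows = to_training_rows(feedback)
print(json.dumps(rows, indent=2))
```

Keeping the `corrected` flag lets you later give more weight to the rows where the model was wrong, which are often the most informative ones.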

Side business

Let's focus on data gathered through the "freemium" model, which is particularly popular in the Computer Vision field. By offering a free-to-use app with valuable features, you can attract a large user base and gather valuable data in the process. A great example of this technique can be seen in popular photo-editing apps, which offer users powerful editing tools while collecting data (such as face images) for the company's core business. It's a win-win for everyone involved!

Caveats

To make the most of your internal data, you should ensure it meets these three crucial criteria:

  1. Compliance: Ensure your data is fully compliant with all relevant legislation and regulations, such as the GDPR and CCPA.

  2. Security: Have the necessary credentials and safeguards to protect your data and ensure that only authorized personnel can access it.

  3. Timeliness: Keep your data fresh and up-to-date to ensure it's as valuable and relevant as possible.
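As a rough illustration, the compliance and timeliness criteria can be partly enforced in code before a record ever reaches the training set. The sketch below is in Python and rests on assumptions: the field names (`consent`, `collected_at`, `email`, `full_name`) and the one-year freshness window are hypothetical, and a filter like this does not by itself make a dataset compliant. Security, meaning who is allowed to access the data, has to be handled at the storage level and is not shown here.

```python
from datetime import datetime, timedelta, timezone

PERSONAL_FIELDS = {"email", "full_name"}  # hypothetical fields the model does not need
MAX_AGE = timedelta(days=365)             # hypothetical freshness window


def prepare_record(record, now):
    """Return a cleaned copy of the record, or None if it must be excluded."""
    if not record.get("consent", False):
        return None  # compliance: no recorded consent, so no training use
    if now - record["collected_at"] > MAX_AGE:
        return None  # timeliness: the record is too old
    # Data minimisation: keep only the fields that are not personal.
    return {k: v for k, v in record.items() if k not in PERSONAL_FIELDS}


now = datetime(2023, 1, 30, tzinfo=timezone.utc)
raw = [
    {"email": "a@example.com", "text": "hello", "consent": True,
     "collected_at": datetime(2022, 12, 1, tzinfo=timezone.utc)},
    {"email": "b@example.com", "text": "old", "consent": True,
     "collected_at": datetime(2020, 1, 1, tzinfo=timezone.utc)},
    {"text": "no consent", "consent": False,
     "collected_at": datetime(2023, 1, 1, tzinfo=timezone.utc)},
]
cleaned = [r for r in (prepare_record(x, now) for x in raw) if r is not None]
print(cleaned)
```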

Strategy #2 to Create your Dataset: Look for Research Dataset platforms

You can find several web pages or websites that gather ready-to-use datasets for machine learning. Among the most famous:

Strategy #3 to Create your Dataset: Look for GitHub Awesome pages

GitHub Awesome pages are curated lists that gather resources for a specific domain. Isn't that cool?! There are Awesome pages for many topics and, lucky for us, for datasets as well.

Awesome pages can cover more or less specific topics:
- Some Awesome pages gather datasets with a broad scope, ranging from agriculture to the economy and more:
https://github.com/awesomedata/awesome-public-datasets or https://github.com/NajiElKotob/Awesome-Datasets
- Others focus on narrower topics, for example tiny object detection https://github.com/kuanhungchen/awesome-tiny-object-detection or few-shot learning https://github.com/Bryce1010/Awesome-Few-shot.

Strategy #4 to Create your Dataset: Crawl and Scrape the Web

Crawling is browsing a vast number of web pages that might interest you. Scraping is about gathering data from given web pages.

Both tasks can be more or less complex. Crawling will be easier if you narrow the pages to a specific domain (for example, all Wikipedia pages).
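To illustrate what narrowing a crawl to one domain means in practice, here is a minimal Python sketch that uses only the standard library. It extracts the links of a page and keeps those that stay on the allowed domain. The helper names and the sample HTML are made up for this example, and no page is actually downloaded; a real crawler would also need to fetch pages, respect robots.txt, and limit its request rate.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag found in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def same_domain_links(html, page_url, allowed_domain):
    """Return absolute links from `html` that stay on `allowed_domain`, without duplicates."""
    parser = LinkExtractor()
    parser.feed(html)
    seen, links = set(), []
    for href in parser.hrefs:
        # Resolve relative links against the page URL and drop #fragments.
        absolute = urljoin(page_url, href).split("#")[0]
        if urlparse(absolute).netloc == allowed_domain and absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


# Hypothetical page content, inlined so the sketch runs offline.
sample_html = """
<a href="/wiki/Machine_learning">ML</a>
<a href="https://en.wikipedia.org/wiki/Data_set#History">Data set</a>
<a href="https://example.com/elsewhere">External</a>
<a href="/wiki/Machine_learning">ML again</a>
"""
frontier = same_domain_links(
    sample_html, "https://en.wikipedia.org/wiki/Main_Page", "en.wikipedia.org"
)
print(frontier)
```

The returned list is the crawl "frontier": the pages to visit next. Filtering it by domain is what keeps the crawl from wandering across the whole web.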

Both techniques enable the collection of different types of datasets:

  • Raw text, which can be used to train large language models.

  • Task-specific text used to train specialized models, such as product reviews with their star ratings.

  • Text with metadata, which enables training classification models.

  • Multilingual text for training translation models.

  • Images with captions, which enable training image classification or image-to-text models.
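As an example of the second item in the list above, the following Python sketch scrapes product reviews and their star ratings into labeled rows. The markup (`<div class="review" data-stars="N">`) is hypothetical and assumed to contain no nested `<div>` inside a review; real pages will differ, so the parser would need to be adapted to each site.

```python
from html.parser import HTMLParser


class ReviewParser(HTMLParser):
    """Extract (text, stars) pairs from <div class="review" data-stars="N"> blocks."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._stars = None  # rating of the review currently being read, if any
        self._chunks = []   # text fragments of the review currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "review":
            self._stars = int(attrs.get("data-stars", 0))
            self._chunks = []

    def handle_data(self, data):
        # Only keep text that appears inside a review block.
        if self._stars is not None and data.strip():
            self._chunks.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "div" and self._stars is not None:
            self.rows.append({"text": " ".join(self._chunks), "stars": self._stars})
            self._stars = None


# Hypothetical page content, inlined so the sketch runs offline.
page = """
<div class="review" data-stars="5"><p>Works perfectly.</p></div>
<div class="ad">Buy now</div>
<div class="review" data-stars="2"><p>Stopped working after a week.</p></div>
"""
parser = ReviewParser()
parser.feed(page)
dataset = parser.rows
print(dataset)
```

Each row pairs a text with a label (the number of stars), which is the shape a sentiment or rating classifier expects.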


Pro tip: you can build your crawler and scraper with the following Python packages:

You can also find more specific but ready-to-use repositories on GitHub, including:

Google Image scraper: https://github.com/jqueguiner/googleImagesWebScraping

News scraper: https://github.com/fhamborg/news-please

Strategy #5 to Create your Dataset: Use product APIs

Some large service providers and media outlets expose an API for their product, which you can use to retrieve data when that data is openly available. You can, for example, think of:
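Whichever provider you choose, collecting data through an API often comes down to following its pagination until there is nothing left. Here is a minimal Python sketch of that loop. The response shape (`items` and `next_cursor`) is an assumption made for this example and does not describe any specific API, and the HTTP call is replaced by a stand-in function so that the code runs offline.

```python
from typing import Callable, Dict, List, Optional


def collect_items(fetch_page: Callable[[Optional[str]], Dict], max_pages: int = 10) -> List[dict]:
    """Follow a cursor-paginated API until it is exhausted or `max_pages` is reached."""
    items, cursor = [], None
    for _ in range(max_pages):
        page = fetch_page(cursor)
        items.extend(page["items"])
        cursor = page.get("next_cursor")
        if cursor is None:
            break  # no more pages
    return items


# Stand-in for a real HTTP call; the keys mimic a hypothetical paginated response.
FAKE_PAGES = {
    None: {"items": [{"id": 1}, {"id": 2}], "next_cursor": "p2"},
    "p2": {"items": [{"id": 3}], "next_cursor": None},
}


def fake_fetch(cursor):
    return FAKE_PAGES[cursor]


api_records = collect_items(fake_fetch)
print(api_records)
```

The `max_pages` cap is there to avoid exhausting a rate limit or looping forever if the API keeps returning a cursor.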

Strategy #6 to Create your Dataset: Look for datasets used in research papers

You may be scratching your head and wondering how on earth you'll find a suitable dataset to solve your problem. No need to pull your hair out over it!

Chances are that some researchers were already interested in your use case and faced the same problem as you. If so, you can look for the datasets they used and sometimes built themselves. If they published their dataset on an open platform, you can retrieve it. If not, you can contact them to ask whether they would be willing to share it. A polite request wouldn't hurt, would it?


Key Takeaways

So there you have it! With these six strategies, you should be well on your way to building the dataset of your dreams.

But wait a minute: since your dataset is likely to be ready by now, wouldn't it be time for you to annotate it? To keep up the momentum, feel free to try the Kili Technology platform by signing up for a free trial.


An article by Théo Dullin,
ML engineer @ Kili Technology
