Who loves datasets?! At Kili Technology, we do love datasets, no shocker there. But guess what none of us likes? Spending too much time creating datasets (or searching for them). Although this step is essential to the machine learning process, let's admit it: the task gets daunting quickly. Don't worry, though: we've got you covered!
This article walks through six common strategies to consider when building a dataset. They may not suit every use case, but they should give you a solid head start on your ML dataset. Without further ado, let's create your dataset!
Strategy #1 to Create your Dataset: Ask Your IT
When it comes to building and fine-tuning your machine learning models, one strategy should be at the top of your list: using your own data. Not only is this data naturally tailored to your specific needs, but it's also the best way to ensure your model is optimized for the types of data it will encounter in the real world. So if you want maximum performance and accuracy, prioritize your internal data first.
Here are additional techniques to gather more data from your users:
User in the loop
Are you looking to get more data from your users? One effective way is to design your product so that sharing data with you is effortless. Take inspiration from companies like Meta (formerly Facebook) and its fault-tolerant UX: users might not notice it, but the interface leads them to correct machine errors and thereby improve the ML algorithms.
Let's focus on data gathered through the "freemium" model, which is particularly popular in the computer vision field. By offering a free-to-use app with valuable features, you can attract a large user base and gather valuable data in the process. A great example of this technique can be seen in popular photo-editing apps, which offer users powerful editing tools while collecting data (such as face images) for the company's core business. It's a win-win for everyone involved!
To make the most of your internal data, you should ensure it meets these three crucial criteria:
Compliance: Ensure your data is fully compliant with all relevant legislation and regulations, such as the GDPR and CCPA.
Security: Have the necessary credentials and safeguards to protect your data and ensure that only authorized personnel can access it.
Timeliness: Keep your data fresh and up-to-date to ensure it's as valuable and relevant as possible.
Strategy #2 to Create your Dataset: Look for Research Dataset platforms
You can find several web pages or websites that gather ready-to-use datasets for machine learning. Among the most famous:
Kaggle dataset: https://www.kaggle.com/datasets
Hugging Face datasets: https://huggingface.co/docs/datasets/index
Amazon Datasets: https://registry.opendata.aws/
UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/index.php
Google's Datasets Search Engine: https://datasetsearch.research.google.com/
Paper with code datasets: https://paperswithcode.com/datasets
Subreddit datasets: r/datasets
Strategy #3 to Create your Dataset: Look for GitHub Awesome pages
GitHub Awesome pages are curated lists of resources for a specific domain. Isn't that cool?! There are fantastic pages for many topics, and lucky for us, datasets are among them.
Awesome pages can be on more or less specific topics:
- You can find datasets on awesome pages with a broad scope, ranging from agriculture to economics and more:
https://github.com/awesomedata/awesome-public-datasets or https://github.com/NajiElKotob/Awesome-Datasets
- But you can also find awesome pages on narrower, more specific topics. For example, datasets focusing on tiny object detection https://github.com/kuanhungchen/awesome-tiny-object-detection or few-shot learning https://github.com/Bryce1010/Awesome-Few-shot.
Strategy #4 to Create your Dataset: Crawl and Scrape the Web
Crawling is browsing a vast number of web pages that might interest you. Scraping is extracting data from given web pages.
Both tasks can be more or less complex. Crawling will be easier if you narrow the pages to a specific domain (for example, all Wikipedia pages).
Both these techniques enable the collection of different types of datasets:
Raw text, which can be used to train large language models.
Labeled text for task-specific models, such as product reviews paired with star ratings.
Text with metadata, which enables training classification models.
Parallel multilingual text for training translation models.
Captioned images for training image classification or image-to-text models.
Pro tip: you can build your own crawler and scraper with Python packages such as Scrapy, BeautifulSoup, or Selenium.
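To make the scraping idea concrete, here is a minimal sketch using only Python's standard library. The HTML snippet, the class name, and the choice of extracting image URLs with their alt text (as rough captions) are all illustrative assumptions; in a real pipeline you would fetch pages with a library like requests and likely parse them with BeautifulSoup instead.

```python
from html.parser import HTMLParser

class ImageCaptionScraper(HTMLParser):
    """Collects (image URL, alt text) pairs from an HTML page."""
    def __init__(self):
        super().__init__()
        self.samples = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attrs = dict(attrs)
            if "src" in attrs:
                # The alt text can serve as a rough caption label.
                self.samples.append((attrs["src"], attrs.get("alt", "")))

# Illustrative page; in practice you would download it from a crawled URL.
page = """
<html><body>
  <img src="/cats/1.jpg" alt="a cat sleeping on a sofa">
  <img src="/dogs/2.jpg" alt="a dog catching a frisbee">
</body></html>
"""

scraper = ImageCaptionScraper()
scraper.feed(page)
print(scraper.samples)
```

Running this collects two (URL, caption) pairs, which is exactly the kind of raw material an image-to-text dataset is built from.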
You can also find more specific, ready-to-use repositories on GitHub, including:
Google Images scraper: https://github.com/jqueguiner/googleImagesWebScraping
News scraper: https://github.com/fhamborg/news-please
Strategy #5 to Create your Dataset: Use Product APIs
Some big service providers or media outlets expose an API in their product that you can use to fetch open data. For example:
Sentinelhub API to fetch satellite data from sentinels or Landsat satellites https://www.sentinel-hub.com/develop/api/
Bloomberg API for business news https://www.bloomberg.com/professional/support/api-library/
Spotify API to get metadata about songs: https://developer.spotify.com/documentation/web-api/
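These APIs generally follow the same pattern: authenticate, then send HTTP requests to documented endpoints. As a minimal sketch using Spotify's Web API search endpoint and Python's standard library, the following builds (but does not send) a request; the "YOUR_TOKEN" value is a placeholder, not a working credential, and a real token would come from Spotify's OAuth flow:

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_spotify_search(token: str, query: str, kind: str = "track") -> Request:
    """Assemble (but do not send) a Spotify Web API search request."""
    params = urlencode({"q": query, "type": kind, "limit": 10})
    return Request(
        f"https://api.spotify.com/v1/search?{params}",
        # Spotify expects an OAuth bearer token in the Authorization header.
        headers={"Authorization": f"Bearer {token}"},
    )

req = build_spotify_search("YOUR_TOKEN", "daft punk")
print(req.full_url)
```

Passing the resulting request to urllib.request.urlopen (with a valid token) would return JSON metadata about matching tracks, ready to be accumulated into a dataset.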
Strategy #6 to Create your Dataset: Look for datasets used in research papers
You may be scratching your head, wondering how on earth you'll find the right dataset for your problem. No need to pull your hair out over it!
Odds are that some researchers were already interested in your use case and faced the same problem as you. If so, you can find the datasets they used, and sometimes built themselves. If they published the dataset on an open-source platform, you can retrieve it. If not, you can contact them to ask whether they would share it; a polite request never hurts, does it?
So there you have it! With these six strategies, you should be well on your way to building the dataset of your dreams.
But wait a minute: now that your dataset is likely ready, isn't it time to annotate it? To keep up the momentum, feel free to try the Kili Technology platform by signing up for a free trial.
An article by Théo Dullin,
ML engineer @ Kili Technology