Data Labeling Platform: Build vs. Buy
Data labeling is an essential part of many machine-learning projects. Because the quality of your labels directly affects your ML model's performance, deciding whether to buy or build your own data labeling solution can be difficult. In this article, we'll discuss the pros and cons of both options to help you decide which is best for your project.
We'll cover the cost and time investment involved, as well as the benefits and drawbacks of each approach. By the end, you should have the information you need to make the right decision for your project!
In a hurry? Here's a table that will give you the whole picture:
Build vs Buy - recap table
What is a Data Labeling Platform, again?
If you're reading this, odds are you're already familiar with data labeling platforms. Still, here's a friendly reminder of what a data labeling tool is and does.
A data labeling platform is a solution for annotating the datasets used to train ML models. It provides a centralized interface for creating, managing, and executing data labeling tasks. It enables users to quickly and accurately label large amounts of data, typically images, audio files, videos, and text documents. The set of features a data labeling tool offers depends on its maturity and on which capabilities its makers choose to provide, such as:
a user-friendly interface;
reporting features to monitor progress and quality.
Kili Technology Platform's Interface
Should I Build Or Should I Buy my Data Labeling Solution?
Deciding to use data labeling software to improve your machine learning models' quality and performance is a big deal. According to research firm Gartner, organizations believe poor data quality is responsible for an average of $15 million per year in losses. According to Cognilytica, companies spend more time on data labeling than on any other phase of the data science lifecycle. Crazy, right?
Spending on Different Phase of the Data Science Lifecycle
Choosing whether to build or buy your data labeling solution will have long-term consequences, so the choice deserves a lot of attention. Whichever route you take, it's critical to pick the solution that best supports your process, meets your data quality requirements, and ensures your training dataset is correctly annotated.
Let's go through each option to explore the advantages and disadvantages of both.
Build A Data Labeling Solution: You Are Never Better Served Than By Yourself
Building your data labeling solution from scratch
Let's be real: when asked to fix a problem, most tech teams would say: "we can build something." Honestly, who could blame them? Who doesn't like to build custom software?
Besides the fact that building software is cool, building it by yourself has other perks. Although we won't go for an endless list, here are the most salient ones:
You have the freedom to design a tool the way you want. Unlike an off-the-shelf data labeling tool, a solution you build doesn't have to fit the market's requirements. As its only requirements are those of your company, the tool you develop will likely be much leaner. Last but not least, a tool that you build internally can fit your organization's requirements exactly (look and feel, wording, set of features).
You're the master of your solution and of the developments to come. It can be integrated into your other systems from the ground up, and you're free to make it evolve as you see fit.
You're free from licensing costs.
But of course, as there's no free lunch, developing your own data labeling platform also has cons. Overall, building it from scratch is likely to come with (highly) unpredictable costs and timelines. By building your tool, you will learn what makes good data labeling software and how to do it properly. But in software, learning happens through try-fail-learn cycles. And as the well-known Hofstadter's Law states: "It always takes longer than you expect, even when you take into account Hofstadter's Law."
As costs are a big chunk of the cons, let's dive into the most noticeable ones:
Cost of delay and "time to use." Since building complex software takes time, can you wait for months? What is the impact on your business?
Maintenance costs, i.e., the costs of fixing bugs and upgrading components for security and performance.
Operational costs, ranging from infrastructure to monitoring, can add up quickly depending on the quality of service you require internally. You're also likely to face additional operational costs if you want to make your system scalable. Likewise, if you want your system to be robust, you'll probably have to pay for it. Between you and us: who doesn't want a stable platform?
Lastly, changes in business requirements often lead to new, unplanned, and significant investments. Let's say you were labeling only images, but now you also need to handle audio or video. Boom: you've got a whole new domain to develop, and little to no bandwidth to do it now.
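To make these cost categories easier to compare, here is a rough back-of-the-envelope sketch. It is only illustrative: every figure, as well as the names `BuildCosts`, `BuyCosts`, `cumulative_build_cost`, and `cumulative_buy_cost`, is a hypothetical assumption and not taken from a real project or vendor price list.

```python
from dataclasses import dataclass


@dataclass
class BuildCosts:
    """Hypothetical figures for an in-house tool (all values are assumptions)."""
    initial_development: float    # one-off cost to reach a first usable version
    yearly_maintenance: float     # bug fixing, security and performance upgrades
    yearly_operations: float      # infrastructure, monitoring, scaling
    months_to_first_use: int      # "time to use"
    monthly_cost_of_delay: float  # business value lost per month of waiting


@dataclass
class BuyCosts:
    """Hypothetical figures for a purchased tool (all values are assumptions)."""
    yearly_license: float
    yearly_operations: float      # 0 for a cloud offer, non-zero when run on-premise


def cumulative_build_cost(costs: BuildCosts, years: int) -> float:
    """One-off development + cost of delay + recurring costs over `years`."""
    delay = costs.months_to_first_use * costs.monthly_cost_of_delay
    recurring = (costs.yearly_maintenance + costs.yearly_operations) * years
    return costs.initial_development + delay + recurring


def cumulative_buy_cost(costs: BuyCosts, years: int) -> float:
    """Licensing plus any internal operations over `years`."""
    return (costs.yearly_license + costs.yearly_operations) * years


# Made-up numbers, purely to show how the comparison works.
build = BuildCosts(
    initial_development=300_000,
    yearly_maintenance=60_000,
    yearly_operations=40_000,
    months_to_first_use=6,
    monthly_cost_of_delay=20_000,
)
buy = BuyCosts(yearly_license=80_000, yearly_operations=0)

for years in (1, 3, 5):
    print(years, cumulative_build_cost(build, years), cumulative_buy_cost(buy, years))
```

The model deliberately leaves out the hardest item to quantify, namely unplanned investments triggered by changing business requirements. Treat its output as a conversation starter, not a forecast.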
Let's sum it up: before deciding to build your data annotation tool from scratch, you have two core questions to bear in mind:
Is it so crucial for your core business that you are ready to dedicate time and resources to maintain such a system?
Isn't there an existing solution on the market that fits your requirements?
Open Source Data Labeling Tool: Someone Has Already Done This For Us
Here's where open-source software usually comes to the rescue. It's the engineer's joker to avoid buying (we see you guys!). Open-source solutions are prevalent and at the core of the data-science software ecosystem. But when it comes to using open-source software, we must distinguish between two types:
Building blocks, like libraries, which serve a single purpose within a relatively small scope and are usually integrated into a larger system.
Complete business solutions offered by for-profit organizations, which come with a free and open-source "Community" or "Core" edition as well as more advanced (and proprietary) versions with the features businesses are willing to pay for.
The pros and cons differ completely depending on which of the two you are looking at. By using open building blocks, you'll shorten your learning curve and reduce your effort while still building custom software, with the same pros and cons we've seen above.
For example, many teams would pick CVAT, Bbox-visualizer, or Dataqa. CVAT was initially developed at Intel and is now supported by the OpenCV project. It's a good tool for building labeling interfaces and doing some labeling on your own. But this is not a packaged solution you can run online to have a team collaborating on your project. You have to build your own web app around it. And as you may know: a web app with authentication, permissions, project and team management, analytics, and reporting requires skills, time, and budget.
By looking for Enterprise-enabled versions, you are no longer building software; you're buying it. Let's quickly go over the pros and cons of building on an open-source data labeling solution.
Pros of building your data labeling platform with open-source
On the bright side, we love this option because many free and open building blocks are available to build around, which speeds up your project. It's also convenient that you can add custom features, and you get to rely on open-source code if that's a requirement for you.
Cons of building your data labeling solution with open-source
On the not-so-bright side, adding lots of custom code to open-source software often makes it impossible to update it or get community support. In addition, the licenses of the software you use must be carefully checked and monitored: open source does not mean you can do whatever you want, and misuse can lead to legal issues. And don't forget that you still have all the cons of building your own data annotation software.
Kili and open-source:
At Kili Technology, the core of our solution is closed-source. To build a long-standing business, we are incentivized to continuously improve it and offer easy-to-use, best-of-breed software. But we are not trying to lock you in:
Your data is always exportable in a simple JSON format;
Labels can also be imported and exported using open model formats (e.g., YOLO).
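To illustrate why open formats limit lock-in, here is a minimal sketch that turns a JSON export into YOLO label lines (`class_id x_center y_center width height`, all normalized by the image size). The JSON schema, the `CLASS_IDS` mapping, and the `to_yolo_lines` function are hypothetical and made up for this example; they are not the actual export format of Kili or any other platform.

```python
import json

# Hypothetical export record: each box is given in pixels as its top-left
# corner plus its size. This schema is an assumption for illustration only.
EXPORT = json.loads("""
{
  "image": {"width": 640, "height": 480},
  "annotations": [
    {"category": "car", "x": 160, "y": 120, "w": 320, "h": 240}
  ]
}
""")

# Hypothetical mapping from category names to YOLO class indices.
CLASS_IDS = {"car": 0, "person": 1}


def to_yolo_lines(record: dict, class_ids: dict[str, int]) -> list[str]:
    """Convert pixel-space boxes to normalized YOLO label lines."""
    img_w = record["image"]["width"]
    img_h = record["image"]["height"]
    lines = []
    for ann in record["annotations"]:
        # YOLO expects the box center and size, each divided by the image size.
        x_center = (ann["x"] + ann["w"] / 2) / img_w
        y_center = (ann["y"] + ann["h"] / 2) / img_h
        width = ann["w"] / img_w
        height = ann["h"] / img_h
        class_id = class_ids[ann["category"]]
        lines.append(
            f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"
        )
    return lines


for line in to_yolo_lines(EXPORT, CLASS_IDS):
    print(line)
```

Because the conversion is a few lines of arithmetic, labels stored in a simple, documented format can be moved to another tool or training pipeline without much effort.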
On the other hand, our SDK is open-source, allowing you to integrate the Kili Technology Platform into your workflow, customize it, and even contribute to improving it.
Kili Technology Platform
Buy A Data Labeling Solution: Get Ready Now
Buying software seems more straightforward than building it. But just as you risk underestimating the effort of building your labeling tools, mainly because of the learning curve involved, you may not be well equipped to buy them either. What should you buy? What matters when comparing solutions? This is where we've got you covered with our article on How to Choose your Data Labeling Platform.
Nonetheless, data labeling solutions from software vendors also have pros and cons. Here are the ones we consider most important.
Watch an on-demand replay of our webinar "Data labeling: what are my options" and get further insights on how to choose your data labeling platform.
Pros of buying your Data Labeling Tool
Let's start with the most obvious pro: time. Buying a data labeling solution shortens your time to value, since you're usually up and running within hours. Next pro: buying means benefiting from a platform shaped by the feedback the vendor receives from companies facing the same concerns as you. The software has continuously improved thanks to this feedback. Similarly, you'll benefit from your vendor's expertise to guide you. This point matters: chances are that companies similar to yours, competitors included, already use the same vendor, which shows the vendor can handle your use case.
Another pro of buying your data labeling solution has to do with data annotation itself. Data annotation is not only about producing annotations; it is also about reviewing them and measuring their quality to improve your data. This part of the process is often underestimated, and off-the-shelf software already handles it. The cost and energy involved in running the platform are also often underestimated. If you've picked a cloud offer, you won't have to worry about running the platform.
Last but not least: you are not married! You can switch to another vendor later, or commit to building after all once you know what you need and have found that nothing on the market does the job the way you want.
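To give a feel for what "measuring annotation quality" can involve, here is a minimal sketch of one common agreement metric, Cohen's kappa, computed between two annotators who labeled the same items. This is only one possible metric, chosen for illustration; it is not a description of how any particular platform computes quality, and the function name `cohen_kappa` and the sample labels are made up.

```python
from collections import Counter


def cohen_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Agreement between two annotators, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same, non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled at random according to
    # their own class frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical class for every item
    return (observed - expected) / (1 - expected)


# Made-up labels from two annotators on the same four images.
annotator_1 = ["cat", "cat", "dog", "dog"]
annotator_2 = ["cat", "dog", "dog", "dog"]
print(cohen_kappa(annotator_1, annotator_2))
```

A kappa of 1 means perfect agreement, while a value around 0 means the annotators agree no more often than chance would predict. Low values usually point to unclear labeling guidelines or items that need review.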
Cons of buying your Data Annotation Platform
As we tackle the cons of buying your data annotation tool, let's focus on the points that deserve attention: data ownership, security, and compliance. How do you make sure everything meets your constraints? Make sure to discuss this with your data labeling solution provider.
Important reminder: if you buy your data labeling software but run on-premise, you will still have the internal ops costs on top of licensing.
Lastly, you may influence the vendor roadmap, but the product's direction could differ from what you expect.
Choosing your data labeling platform: Let's recap
Here's your digest of the pros and cons of the build and buy options.
Build: pros
Tailor-made software. No fluff, only what you need the way you need it.
Full ownership of how it will evolve.
Easy to integrate and often built tightly around your existing stack.
Building is cool.
Build: cons
Relying on an external workforce means internal teams won't learn independently.
Time-consuming to get a project up and running, depending on the complexity of the data.
The professional-level approach can be overkill for simple projects.
Build: in a nutshell
If you know what you need, and no vendor can satisfy your requirements, you have to build.
Buy: pros
Versatile solution (configurable without code)
Built-in tools to integrate labeling services and/or a managed workforce
QA/QC tooling and label review workflows
Enterprise ready (secure, stable, able to handle collaboration, project management, and reporting)
SLA-backed customer support
Buy: cons
An off-the-shelf product is never 100% perfect for every use case.
If you have very different use cases, finding a tool that addresses them all may be challenging.
The buying process must examine how data is managed and whether vendors comply with regulations, since you may be exposed too.
It's difficult to estimate your consumption when buying for the first time.
Buy: in a nutshell
If you have no experience with data labeling, or no energy to put into developing a piece of software that is a utility, and existing solutions seem to support your requirements, you should definitely buy.
Download your free checklist on how to choose your data labeling platform. Save time with this Excel template and collaborate easily with your team to choose your data labeling platform wisely and swiftly.