Why is annotation much more complicated than it seems?
Data annotation for machine learning should not be underestimated.
Labeled data is the missing building block for AI
To build artificial intelligence you need three components: computing power, algorithms and data.
Computing power is now widely available, easily scalable and relatively inexpensive thanks to the cloud and GPUs, and it is growing exponentially. Your iPhone has more computing power than the entire Apollo program!
The state of the art in algorithms is largely available on GitHub thanks to publications from Google and Facebook. You can now build a translation stack that embeds Google's latest Transformer architectures. The development of open source, benefiting from the network effect of the internet, is also exponential. Deep learning algorithms, meanwhile, are increasingly hungry for training data.
And there is a lot of data. Since the digital revolution, pretty much all the information in the world has been recorded in digital format. But this data has to be annotated. That is what Facebook does today when it asks you to comment on a picture or identify a friend in a picture. Every day we upload more than a billion pictures to Facebook and annotate them. In doing so we produce a huge training database for Facebook, which has allowed it to develop models that can identify people from behind. But in most companies, the data is siloed and unstructured.
Teams are not equipped to scale annotation
That's what Kili Technology is all about :-)
At BNPP, to address this lack of training data, we set up annotation interfaces. In doing so, we ran into the following problems:
UX of annotation interfaces
How do you develop, at a reasonable cost, interfaces that are both comfortable and accessible for business people and data scientists?
For example, to annotate cancer cells, I need a pixel-precise annotation. If I don't have a trackpad available and I need to annotate with the mouse from a PC, which UX offers the best efficiency?
As someone in the business, I want a graphical interface that is polished, attractive, ergonomic and intuitive, and that can be integrated with my business tools. As a data scientist, I want to be able to drive my annotation from my Python IDE, through an API or, ideally, from a Python module. How can I reconcile the two?
How do you build versatile interfaces that cover all ML problems (text, image, voice, OCR), where each interface is customizable and can be combined with others for a project made up of several different ML tasks? Most projects are not a single, simple annotation task.
Control of the quality of the annotated data
For example, in Speech to Text, do I have to annotate with or without punctuation? Can I annotate with or without spelling mistakes?
How do I make sure that the annotators I put on my task have understood what I expect from it and that they implement it correctly? How do I check the quality of my annotators' work when I have no indication of the correct answer to expect from the annotation?
On complex, highly subjective projects, for example judging the sentiment of a video, you will want a high level of consensus, for example 100% of the assets annotated by 5 annotators. On the other hand, when it is just a matter of basic quality control, you will want something like 10% of the assets annotated by 2 people. In both cases, we want to be able to measure a consensus score between the annotators.
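To make the idea of a consensus score concrete, here is a minimal sketch. It assumes a simple classification task and defines consensus as the fraction of annotator pairs that agree on an asset; this is one possible definition among many, not necessarily the one a given tool uses. The function name and the sample labels are hypothetical.

```python
from itertools import combinations


def consensus_score(labels):
    """Fraction of annotator pairs that gave the same label to one asset.

    Assumption: one categorical label per annotator (pairwise agreement).
    """
    pairs = list(combinations(labels, 2))
    if not pairs:
        return 1.0  # a single annotator has nobody to disagree with
    agreeing = sum(1 for a, b in pairs if a == b)
    return agreeing / len(pairs)


# Hypothetical sentiment labels from 5 annotators on two videos.
video_1 = ["positive", "positive", "positive", "positive", "positive"]
video_2 = ["positive", "negative", "positive", "neutral", "positive"]

print(consensus_score(video_1))  # 1.0
print(consensus_score(video_2))  # 0.3 (3 agreeing pairs out of 10)
```

A low score on an asset can flag either a genuinely ambiguous example or instructions that the annotators have not understood in the same way.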
Number of annotators
For example, a retail customer is working on a project to provide a customer experience similar to Amazon Go. More than a million images need to be annotated, so you need to be able to onboard people and measure their performance and the quality of their output. To give you an order of magnitude, with a basic, not very powerful tool, annotating 10k images takes one person 3 months. A project like this therefore needs a lot of annotators, and they have to be coordinated. How do I manage data access rights?
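The order of magnitude above can be turned into a quick staffing estimate. This is only back-of-the-envelope arithmetic based on the figure quoted in the text (10k images in 3 months for 1 person); the 6-month deadline is a hypothetical value chosen for illustration, and the estimate ignores onboarding time, review and tooling gains.

```python
import math

# Figure quoted in the text: 10k images take 1 person 3 months.
REFERENCE_IMAGES = 10_000
REFERENCE_MONTHS = 3


def person_months(n_images):
    """Total effort, assuming throughput scales linearly with volume."""
    return n_images / REFERENCE_IMAGES * REFERENCE_MONTHS


def annotators_needed(n_images, deadline_months):
    """Annotators required to finish within the deadline (rounded up)."""
    return math.ceil(person_months(n_images) / deadline_months)


print(person_months(1_000_000))         # 300.0 person-months
print(annotators_needed(1_000_000, 6))  # 50 annotators for a 6-month deadline
```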
Annotation still takes a very long time!
Even if you are good at interfaces and annotator coordination, annotation still takes a long time. So it has to be accelerated intelligently, by getting the best out of what humans and machines can each do: the machine is very good at repetitive tasks where a human tires quickly, and the human is good at distinguishing and assimilating new nuances.
We typically want to do the following:
Start training a model while annotation is still in progress, so that it can pre-annotate the remaining data.
Decide which assets to annotate first. Typically, if you do deep learning, you want maximum diversity from the very beginning of your training. How do you manage an optimal prioritization queue, given that the alphabetical order of the files to annotate is unlikely to be the right one?
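As an illustration of prioritizing for diversity, the sketch below orders assets by greedy farthest-point sampling: each step picks the asset that is farthest from everything already selected. It assumes every asset already has a numeric feature vector (for example an embedding from a pretrained model); the file names and 2-D vectors are made up, and this is one simple heuristic rather than a description of any specific tool.

```python
import math


def prioritize_by_diversity(features):
    """Return asset names ordered by greedy farthest-point sampling.

    features: dict mapping asset name -> tuple of floats (hypothetical embeddings).
    """
    names = sorted(features)
    if not names:
        return []
    order = [names[0]]  # arbitrary but deterministic starting point
    remaining = names[1:]
    # Distance from each remaining asset to its closest selected asset.
    min_dist = {n: math.dist(features[n], features[order[0]]) for n in remaining}
    while remaining:
        nxt = max(remaining, key=lambda n: min_dist[n])
        order.append(nxt)
        remaining.remove(nxt)
        for n in remaining:
            min_dist[n] = min(min_dist[n], math.dist(features[n], features[nxt]))
    return order


features = {
    "img_001.jpg": (0.0, 0.0),
    "img_002.jpg": (0.1, 0.0),  # near-duplicate of img_001
    "img_003.jpg": (0.0, 0.2),
    "img_004.jpg": (5.0, 5.0),
    "img_005.jpg": (5.0, 0.0),
}

print(prioritize_by_diversity(features))
# ['img_001.jpg', 'img_004.jpg', 'img_005.jpg', 'img_003.jpg', 'img_002.jpg']
```

The near-duplicate image ends up last in the queue, which is the opposite of what alphabetical order would give.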
Weakly supervised learning
How can business rules be used to pre-annotate at scale? For example, what if I need to annotate product names in text and I can extract a dictionary of names from my product repository?
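The product-name example can be sketched with a plain dictionary lookup. The code below assumes the dictionary is just a list of strings and that a case-insensitive, whole-word match is good enough for a first pass; the product names, the `PRODUCT` label and the output format are all hypothetical. The resulting spans are suggestions for annotators to confirm or correct, not final labels.

```python
import re


def pre_annotate(text, product_names, label="PRODUCT"):
    """Return candidate entity spans for every dictionary hit in the text."""
    if not product_names:
        return []
    # Longest names first, so "steel coil XL" wins over "steel coil".
    names = sorted(product_names, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    return [
        {"start": m.start(), "end": m.end(), "text": m.group(0), "label": label}
        for m in pattern.finditer(text)
    ]


# Hypothetical product dictionary and customer email.
product_names = ["steel coil", "steel coil XL", "copper wire"]
text = "Please send 3 Steel Coil XL and 200 m of copper wire."

spans = pre_annotate(text, product_names)
for span in spans:
    print(span["label"], repr(span["text"]), span["start"], span["end"])
```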
Perfect data is not enough to reach 100% performance!
Even if we manage to produce an annotated dataset in sufficient quantity and quality to train a good model and have acceptable results, we never get 100% performance.
Indeed, all of this allows the initial training to be done. We will extract the data from your systems and annotate it to create the dataset.
This yields a model, but one that is still far from the 100% performance expected by the business.
And the performance will tend to deteriorate over time (model drift) with the arrival of new examples.
And people in the business never agree to take the risk of automating a task without a guarantee of 100% performance.
Imagine automatically entering customer orders read from emails directly into the CRM, and that entry launching the production line for a 12-ton metal bale.
On customer order emails, for example, we achieve well over 95% performance on classification and over 80% performance on named entity recognition, which is already very good. But to enter only 100% reliable data into the systems, and so guarantee the integrity of the orders, it is essential to orchestrate human supervision in production.
You have to be able to keep the human in the loop and capture their feedback, so that the models in production keep learning through annotation.
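One common way to orchestrate this supervision is to route low-confidence predictions to a person and keep their corrections as new training data. The sketch below assumes the model returns a confidence score with each prediction and that a fixed threshold decides what gets reviewed; both are assumptions for illustration, as are the function names, email IDs and labels. Setting the threshold above 1.0 sends every prediction to a human.

```python
def process_orders(predictions, threshold, ask_human):
    """Accept confident predictions, send the rest to a human reviewer.

    predictions: iterable of (email_id, label, confidence) tuples.
    ask_human:   callable (email_id, suggested_label) -> validated label.
    Returns (accepted, feedback); feedback holds the human-validated
    examples that can be added to the training set.
    """
    accepted, feedback = [], []
    for email_id, label, confidence in predictions:
        if confidence < threshold:
            label = ask_human(email_id, label)  # human validates or corrects
            feedback.append((email_id, label))
        accepted.append((email_id, label))
    return accepted, feedback


# Hypothetical model outputs and a stub standing in for the review interface.
predictions = [
    ("mail-1", "order", 0.99),
    ("mail-2", "order", 0.62),
    ("mail-3", "complaint", 0.97),
]
human_answers = {"mail-2": "quote_request"}


def ask_human(email_id, suggested_label):
    return human_answers.get(email_id, suggested_label)


accepted, feedback = process_orders(predictions, threshold=0.95, ask_human=ask_human)
print(accepted)  # [('mail-1', 'order'), ('mail-2', 'quote_request'), ('mail-3', 'complaint')]
print(feedback)  # [('mail-2', 'quote_request')]
```

The `feedback` list is what closes the loop: retraining on it periodically is one way to counter the drift described above.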
Keeping humans in the loop is key when you want to put AI into production.