Why Kili Technology?
You might not know this, but:
The MNIST dataset has a label error rate of 3.4% and is still cited by more than 38,000 papers.
The ImageNet dataset, with its crowdsourced labels, has an error rate of 6%. This dataset arguably underpins the most popular image recognition systems developed by Google and Facebook. Systemic errors in these datasets have real-world consequences. Models trained on error-containing data are forced to learn those errors, leading to false predictions or a need for retraining on ever-increasing amounts of data to “wash out” the errors.
Every industry has begun to understand the transformative potential of AI and has started investing in it. But the transformer revolution and the relentless focus on ML model optimization are reaching the point of diminishing returns. What else is there?
1. Data Quality is critical
Reducing or eliminating labeling errors, getting annotations right the first time, and focusing on the quality of the data that ultimately feeds the model are now paying huge dividends.
However, data quality can be the most difficult part of developing a reliable model, because it requires coordination between human intelligence, modeling expertise, project management, and the technology that binds them all together.
This is often a painful endeavor. The real differentiator between businesses that succeed at AI and those that don't is data quality: what data is used to train and test the algorithm, how it is gathered and labeled, and how it is governed. Our customers' experience, and our own, is that the move to Data-Centric AI (DCAI) is the most important shift businesses need to make today.
2. Data Quality is priceless
Human-labeled data is becoming the fuel and compass for AI-based software systems. But the increasing focus on the scale, speed, and cost of building and improving datasets has impacted the data's quality and thus the models' quality.
We have seen reasons for concern first-hand: fairness and bias issues in labeled datasets, quality issues in benchmark datasets, benchmark limitations, reproducibility issues in machine learning research, lack of documentation and data replication, and unrealistic performance measures.
3. Data Quality is complex
While the quality of datasets remains everyone's primary concern, the way it is measured in practice is poorly understood and sometimes just plain wrong.
Data quality is complex: it is not just a matter of software bugs or human errors. It is typically the result of how well the annotation is done, how well the dataset and its annotation ontology represent the actual task, and whether the available quality metrics are suitable for the job.
Data annotation is complex because there are multiple interpretations of the truth, because some annotation gestures are hard to perform precisely, and because collaboration induces complex communication and synchronization.
The development of tools to make repeatable and systematic adjustments to datasets has lagged.
At Kili Technology, we want to reverse this and find new and systematic ways to promote seamless interactions between humans and data.
4. Models have to be developed iteratively
When developing a model, labeling and model testing should proceed in parallel, removing the trial-and-error time otherwise wasted on improving a model trained on inconsistent data.
So, to be cost-effective, the model development infrastructure must be tightly integrated with a supervision layer, so that labeling, model training, and model diagnostics can run in parallel and directly influence the data used by the AI system.
The future of AI lies in getting the best out of humans and machines through a human-in-the-loop machine learning process, dramatically accelerating the setup of reliable AI applications.
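The loop described above can be sketched in a few lines. This is a toy illustration, not Kili's platform: the synthetic data, the nearest-class-mean "model", and the margin-based confidence score are all assumptions standing in for a real dataset, model, and diagnostics layer. The point is the shape of the cycle: train, diagnose, route the least-confident assets back to annotators, retrain.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic pool of "unlabeled" assets: 2-D points from two classes
X_pool = rng.normal(size=(500, 2))
y_true = (X_pool[:, 0] + X_pool[:, 1] > 0).astype(int)  # oracle = human labeler

def fit(X, y):
    """Train a nearest-class-mean classifier (stand-in for any model)."""
    return {c: X[y == c].mean(axis=0) for c in (0, 1)}

def predict(model, X):
    d0 = np.linalg.norm(X - model[0], axis=1)
    d1 = np.linalg.norm(X - model[1], axis=1)
    return (d1 < d0).astype(int)

def margins(model, X):
    """Confidence proxy: gap between distances to the two class means."""
    d0 = np.linalg.norm(X - model[0], axis=1)
    d1 = np.linalg.norm(X - model[1], axis=1)
    return np.abs(d0 - d1)

# Seed set: a few human-labeled assets from each class
labeled = list(np.where(y_true == 0)[0][:5]) + list(np.where(y_true == 1)[0][:5])
unlabeled = [i for i in range(500) if i not in set(labeled)]

for _ in range(5):  # labeling, training, and diagnostics alternate
    model = fit(X_pool[labeled], y_true[labeled])
    # Diagnostics: route the 20 least-confident assets back to annotators
    m = margins(model, X_pool[unlabeled])
    worst = [unlabeled[i] for i in np.argsort(m)[:20]]
    labeled += worst
    unlabeled = [i for i in unlabeled if i not in set(worst)]

accuracy = (predict(model, X_pool) == y_true).mean()
```

Each round spends the annotation budget where model diagnostics say it matters most, which is the practical payoff of coupling the labeling tool to training instead of labeling everything up front.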