Data-Centric AI Manifesto: The Future of AI is Now
The tremendous impact that Artificial Intelligence is having on our lives, and the limitations within its development.
There have always been mixed opinions and emotions surrounding Artificial Intelligence. On the one hand, we can see the tremendous impact it is having on our lives; on the other hand, we can also see the limitations within the development of AI and question whether its applications are worth it.
Major companies such as Google, Meta, Microsoft, and Amazon, which are known for incorporating AI, have placed huge bets on it and had huge successes, but have also experienced huge failures.
In March 2016, Tay, an artificial intelligence bot released by Microsoft Corporation on Twitter, was suspended after posting offensive tweets through its Twitter account.
In 2019, Amazon's facial recognition software falsely matched 27 professional athletes, including Super Bowl champions, to mugshots.
In April 2021, an AI system created by Google identified and classified men wearing monocles as armed with a gun.
In Sept 2021, an AI system from Meta asked Facebook users who had watched a video featuring Black men whether they wanted to keep seeing videos about primates.
In Nov 2021, Zillow, an online real estate marketplace, took a big hit to its business after its iBuying algorithms failed, leading to a loss of $500 million.
The list goes on and on. With the rise of technology and the spread of Artificial Intelligence in our day-to-day lives, you might have thought these errors would not occur. However, this is just the beginning.
Dealing with immature AI can cause a lot of damage to your brand, business, finances, and more. It fails to meet the business's needs and delivers incorrect results, damaging the business's reputation. Depending on the error, some failures can even lead to bankruptcy.
Elon Musk, CEO of SpaceX and Tesla and co-founder of OpenAI, has warned:
“The risk of something seriously dangerous happening is in the five-year time frame. 10 years at most.”
He has also said:
“If you’re not concerned about AI safety, you should be. Vastly more risk than North Korea.”
The origin of failures such as Tay or Zillow's business plan can be difficult to determine.
Tay was fed both benign and adversarial interactions from people, without any filtering. Zestimate, the Zillow algorithm used to predict house prices, was an ensemble of roughly 1,000 models.
If I had to draw one lesson from these failures, it would be that there was too much focus on the machine learning model and not enough on data handling.
Machine Learning has developed over the last few years thanks to a lot of time and money spent on research to improve the code of the model. Yet in many use cases, if the mindset had shifted from improving the code to improving the data, we would have seen more progress with AI.
The answer to this is Data-Centric AI.
Given that AI is growing at an exponential rate and that the level of risk associated with it is following suit, the questions that we need to answer are:
What is Data-Centric AI (DCAI)?
Why is DCAI so important for your business?
How can we now accelerate this already happening transformation?
What is DCAI?
As mentioned above, AI has historically focused primarily on code, with researchers investing and building more sophisticated models on fixed datasets.
Think of Generative Pre-trained Transformer 3 (GPT-3), an autoregressive language model that uses deep learning to produce human-like text. It has 175 billion parameters, and a single training run has been estimated to cost $12 million.
When these models are put into the real world, they often fail or do not perform as well. The reason is that model performance often depends on the quality of your data, and it is iterating on that data that brings you one step closer to a successful AI project.
Data-Centric AI is a system that focuses on data instead of code. It is the practice of systematically engineering the data used to build AI systems, combining the valuable elements of both code and data.
For a model to perform well, it needs data that is both clean and diverse. These two elements determine the quality of your data. As with any other system, garbage in means garbage out.
Quality Means Clean
Inconsistent data can lengthen training and reduce the overall performance of a machine learning model. It confuses the model, making it harder for the model to understand what needs to be learned, how the variables relate to one another, and how it can make its own decisions and predictions.
Let’s illustrate this with an example.
Suppose you want to build a Computer Vision system that can detect forklifts and helmets. After loading your data and defining the task, you allocate your team to label the data. However, if you ask three different members of your team to use different labeling tools, your data will become inconsistent.
If your data labeling is inconsistent, it can lead to a 10% reduction in label accuracy, which in turn leads to a 2-5% decrease in overall model accuracy. To compensate, you would need to annotate twice as much data to recover the model's performance.
Here are a few ways you can improve the quality of the annotated data:
Implement an inter-annotator agreement measure to identify disagreements between the labels proposed by different labelers on the same image. An example of such a metric is the consensus, which ranges from 0 to 100%. It equals 100% if the labelers produced exactly the same annotation on the same image and 0% if they all produced different annotations.
Use review to select the correct label from among the various labels offered. This can be achieved by filtering for the images with a low consensus, where labelers disagree, and choosing the correct label from among the different proposed labels.
Update the instructions with the identified special cases. The images on which the labelers disagree are often not trivial cases (otherwise they would agree) and are worth documenting in the instructions, a document shared among the labelers that gives them examples of how to label correctly.
Iterate. The process of iteration allows developers to keep designing and testing until the results are good enough. Approaches that produce sufficient outputs can continue to be iterated on, and those that fail can be quickly abandoned.
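To make the consensus measure and the review filter concrete, here is a minimal sketch. It assumes each labeler assigns a single class label per image, which is a simplification (bounding boxes would need an overlap measure instead), and it computes consensus as the share of labeler pairs that agree. The function names, the image ids, and the 50% review threshold are all hypothetical choices made for illustration.

```python
from itertools import combinations


def consensus(labels):
    """Return the share of labeler pairs that gave the same label (0.0 to 1.0)."""
    pairs = list(combinations(labels, 2))
    if not pairs:
        return 1.0  # a single labeler cannot disagree with anyone
    agreeing = sum(1 for a, b in pairs if a == b)
    return agreeing / len(pairs)


def low_consensus_images(annotations, threshold=0.5):
    """Return the ids of images below the consensus threshold, lowest first."""
    scores = {image_id: consensus(labels) for image_id, labels in annotations.items()}
    flagged = [image_id for image_id, score in scores.items() if score < threshold]
    return sorted(flagged, key=lambda image_id: scores[image_id])


# Hypothetical labels from three labelers for the forklift and helmet example.
annotations = {
    "img_001": ["forklift", "forklift", "forklift"],
    "img_002": ["helmet", "forklift", "helmet"],
    "img_003": ["helmet", "forklift", "no_object"],
}

for image_id, labels in annotations.items():
    print(image_id, f"{consensus(labels):.0%}")
print("To review:", low_consensus_images(annotations))
```

The images returned by low_consensus_images are the ones to send to review, and they are good candidates for new examples in the labeling instructions.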
Quality Means Diversity
For AI to make predictions, it needs to have learned from data points that cover all the scenarios it will encounter. If your model has never been exposed to a scenario, it will not be able to identify and classify it correctly. The use and implementation of AI is therefore only as good as the data it has been trained on.
Consider the forklift and helmet example above. If we want to classify the use of this equipment at different times of day, such as night or day, or during different seasons, such as winter or spring, we need to ensure that the model's training data contains all of these scenarios.
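One way to check this is to count how many training images fall into each combination of scenarios. The sketch below assumes that every image carries metadata for time of day and season; the field names and the sample records are hypothetical.

```python
from collections import Counter
from itertools import product

# Hypothetical metadata attached to each training image.
training_metadata = [
    {"time_of_day": "day", "season": "spring"},
    {"time_of_day": "day", "season": "spring"},
    {"time_of_day": "day", "season": "winter"},
    {"time_of_day": "night", "season": "spring"},
]


def scenario_coverage(metadata, time_values, season_values):
    """Count training images for every (time of day, season) combination."""
    counts = Counter((m["time_of_day"], m["season"]) for m in metadata)
    return {combo: counts.get(combo, 0) for combo in product(time_values, season_values)}


coverage = scenario_coverage(training_metadata, ["day", "night"], ["winter", "spring"])
missing = [combo for combo, count in coverage.items() if count == 0]
print("Scenarios with no training data:", missing)
```

In this toy dataset the model has never seen a winter night, so that is where new data should be collected first.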
Up to 40% of the data in training datasets can be considered redundant. Data redundancy occurs when the same piece of data exists in multiple places. An example of data redundancy is when the same person's name and home address are stored in several tables.
To reduce this percentage of redundant data, you should start by annotating the most important assets, selected through active learning. Active learning is the process in which the algorithm proactively selects the subset of examples to be labeled next from the pool of unlabeled data. This can reduce the amount of redundant data by 10%.
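As an illustration, here is a minimal sketch of one common active learning strategy, uncertainty sampling: the assets on which the current model is least certain are labeled first. Using the entropy of the predicted probabilities as the uncertainty score is an assumption made for this example, not the only option, and the asset ids and probabilities are hypothetical.

```python
import math


def entropy(probabilities):
    """Shannon entropy of a predicted class distribution (higher means less certain)."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def select_for_labeling(predictions, batch_size):
    """Return the ids of the unlabeled assets the model is least certain about."""
    ranked = sorted(
        predictions,
        key=lambda asset_id: entropy(predictions[asset_id]),
        reverse=True,
    )
    return ranked[:batch_size]


# Hypothetical model outputs on unlabeled images: [forklift, helmet, background].
predictions = {
    "img_101": [0.98, 0.01, 0.01],
    "img_102": [0.40, 0.35, 0.25],
    "img_103": [0.55, 0.44, 0.01],
    "img_104": [0.90, 0.05, 0.05],
}

print(select_for_labeling(predictions, batch_size=2))
```

Images such as img_101, which the model already classifies with confidence, are left for later because labeling them would teach the model very little.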
Overall, the steps to follow if you want to improve your machine learning model are:
Label some data.
Train a model.
Test the model on the data.
Identify the scenarios of data where it fails.
Collect more data on these scenarios.
Using DCAI to make your data clean and diverse has been shown to improve the performance of a machine learning model.
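The fourth step of the loop above, identifying the scenarios where the model fails, can be sketched as follows. The example assumes each test result is tagged with a scenario; the field names, the sample results, and the 80% accuracy threshold are hypothetical.

```python
from collections import defaultdict


def accuracy_by_scenario(results):
    """Compute the model's accuracy separately for each scenario tag."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for result in results:
        totals[result["scenario"]] += 1
        correct[result["scenario"]] += int(result["predicted"] == result["expected"])
    return {scenario: correct[scenario] / totals[scenario] for scenario in totals}


def scenarios_needing_data(results, min_accuracy=0.8):
    """Return the scenarios where accuracy falls below the chosen threshold."""
    accuracies = accuracy_by_scenario(results)
    return sorted(s for s, accuracy in accuracies.items() if accuracy < min_accuracy)


# Hypothetical test results for the forklift and helmet detector.
results = [
    {"scenario": "day", "expected": "helmet", "predicted": "helmet"},
    {"scenario": "day", "expected": "forklift", "predicted": "forklift"},
    {"scenario": "night", "expected": "helmet", "predicted": "forklift"},
    {"scenario": "night", "expected": "forklift", "predicted": "forklift"},
]

print(accuracy_by_scenario(results))
print("Collect more data for:", scenarios_needing_data(results))
```

The scenarios returned by the second function tell you where to collect and label more data before the next training round.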
Why is DCAI so Important for Your Industry?
Because it Means Better Performance
Our world is far from perfect, which is why there are so many different metrics for measuring accuracy. You can never be too sure about something.
Written texts contain plenty of different elements and variables, including evidence of all kinds of social biases relating to gender, race, social status, ability, age, and more.
If you ask a Machine Learning model: “This man works as a…”, it will answer with an occupation, such as “a doctor”, “a lawyer”, etc. If you ask the model the same question about a woman, its responses may be “a waitress”, “a nurse”, etc. Although data is our best friend when building effective AI, it is also full of human biases. Without correct selection, presentation, and management, the models will end up learning these biases.
Deep learning models struggle to learn rare phenomena, and the world we live in is full of them. Language follows Zipf's law. Zipf's law was originally formulated in quantitative linguistics and states that the frequency of any word is inversely proportional to its rank in the frequency table. In layman's terms, it means that most words are rare and thus harder for models to learn.
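To get a feel for what Zipf's law implies, here is a small sketch that assumes an idealized vocabulary in which the frequency of the word at rank r is exactly proportional to 1/r. The vocabulary size of 10,000 words is an arbitrary choice for illustration; real corpora only follow the law approximately.

```python
def zipf_frequencies(vocabulary_size):
    """Relative frequency of each rank when frequency is proportional to 1/rank."""
    weights = [1 / rank for rank in range(1, vocabulary_size + 1)]
    total = sum(weights)
    return [weight / total for weight in weights]


frequencies = zipf_frequencies(10_000)

# Share of all word occurrences covered by the 100 most frequent words.
top_100_share = sum(frequencies[:100])
print(f"Top 100 words: {top_100_share:.0%} of occurrences")
print(f"Word ranked 10,000: {frequencies[-1]:.6%} of occurrences")
```

Under this assumption, the 100 most frequent words account for roughly half of all occurrences, so the remaining 9,900 words share the other half and each of them is seen only rarely during training.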
If you do not filter your training data, up to 40% of it is of no value, providing no new knowledge or experience for the model.
Tesla, for example, pools many different rare events from its entire fleet of camera-equipped vehicles to teach its Autopilot system to react to them.
DCAI is therefore not optional when you want to improve the performance of your model; it is a requirement if you want to implement AI in real application cases.
Because it Reduces Development Time
One of the key elements in the success of the software industry over the past few years is Agile. Agile is an iterative approach to workflow, project management, and software development that has helped teams deliver value to their customers faster and with fewer issues.
In February 2001, a group of 17 software developers, including Martin Fowler, met at the Snowbird ski resort in Utah to discuss how they could speed up development and bring new software to the market faster.
The approach they agreed on lets development and testing activities run concurrently, giving more time for communication between developers, managers, testers, and customers. It has been reported to improve cost-effectiveness, productivity, quality, cycle time, and customer satisfaction by 30% to 100%.
Data-centric AI is the agile of AI.
Labeling, model training, and model diagnostics can work in parallel and directly influence the data used for the AI system. This removes the unnecessary trial-and-error time spent improving the model while inconsistent data goes unaddressed, and it can reduce development time by up to 10x.
Because it Promotes Collaboration
AI is a revolution that is still in the making. Like all other revolutions, it sounded ridiculous when people first heard about it. That was in 2008. Then, in 2014, people were calling it dangerous. A revolution is only complete when it becomes mainstream.
When computers started to become mainstream, there was a big shift in how people work, operate, document, and more. When the internet became mainstream, there was a tremendous shift, from new jobs to a change in people's livelihoods. When smartphones became mainstream, there was another big shift in how people communicate, get directions, and find things out.
When AI becomes just as mainstream, it will cause the biggest shift yet.
AI will generate billions, if not trillions of possibilities. DCAI is the path to making AI mainstream.
Using DCAI helps quality managers, subject experts, and developers work in conjunction with one another during the development process to label, train, and test an AI model. DCAI reduces the need for technical skills to build AI models, which enables companies to easily adopt machine learning models in their processes.
Starting a company today without machine learning is like starting a company ten years ago without software. Software 1.0 is when a human instructs the machine, line by line. AI is Software 2.0.
In Software 2.0, a neural network learns from data the characteristics, variables, instructions, or rules needed to produce the desired outcome. Software 2.0 is king when the algorithm itself is difficult to design explicitly, for example, object detection in images. Data is the code for Software 2.0.
If you recognize Software 2.0 as a new and emerging programming paradigm and DCAI as agile for Software 2.0, how can we now accelerate this already happening transformation?
How can We now Accelerate this Already Happening Transformation?
Set Up a Software 2.0 Stack
To deliver good software, developers around the world write their code in dedicated software called an IDE (Integrated Development Environment). An IDE is the Microsoft Word of code. A good IDE is designed to help you write good-quality code, not just a lot of code.
An IDE enables programmers to combine the different aspects of writing a computer program. It is designed to support an interactive and iterative process with a short time between development and testing. It aims to get the best out of the development team through collaboration at scale.
We’ve built up a vast amount of tooling that assists humans in writing 1.0 code, such as powerful IDEs with features like syntax highlighting, debuggers, profilers, go-to-definition, Git integration, etc.
In the Software 2.0 stack, programming is achieved by accumulating, massaging, and cleaning datasets. When the neural network fails in some hard or rare cases, we do not fix its predictions by writing more code; instead, we add more labeled examples of those cases.
To switch to Software 2.0, you will need a Software 2.0 IDE. The IDE will help with all the aspects of the workflow, for example, accumulating, visualizing, cleaning, labeling, and sourcing datasets.
It can bubble up images that the network suspects are mislabeled. It assists in the labeling phase by seeding labels with predictions. It also suggests useful examples to label based on the uncertainty of the network’s predictions. It shares knowledge and enforces the reuse of data across the organization.
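Two of these features can be sketched in a few lines. The example below is a simplified illustration, not how any particular tool implements them: it flags an image as possibly mislabeled when the model confidently predicts a class other than the human label, and it seeds a label with the model's most probable class. The function names, the data layout, and the 90% confidence threshold are hypothetical.

```python
def seed_label(probabilities):
    """Propose a pre-label: the class the model considers most probable."""
    return max(probabilities, key=probabilities.get)


def suspect_mislabels(assets, min_confidence=0.9):
    """Flag assets where a confident model prediction contradicts the human label."""
    suspects = []
    for asset in assets:
        predicted = seed_label(asset["probabilities"])
        confidence = asset["probabilities"][predicted]
        if predicted != asset["label"] and confidence >= min_confidence:
            suspects.append((asset["id"], asset["label"], predicted))
    return suspects


# Hypothetical labeled images with the model's predicted probabilities.
assets = [
    {"id": "img_201", "label": "helmet", "probabilities": {"helmet": 0.97, "forklift": 0.03}},
    {"id": "img_202", "label": "helmet", "probabilities": {"helmet": 0.04, "forklift": 0.96}},
    {"id": "img_203", "label": "forklift", "probabilities": {"helmet": 0.55, "forklift": 0.45}},
]

for asset_id, human_label, model_label in suspect_mislabels(assets):
    print(f"{asset_id}: labeled {human_label}, model suggests {model_label}")
```

A flagged image is only a suspicion, so it should go back to a human reviewer rather than being relabeled automatically.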
What is this Software 2.0 IDE that can achieve all of this? It is called Kili.
I know the questions that follow: “Is there space for a Software 2.0 GitHub? What is the Conda equivalent for neural networks?”
I agree. Software 2.0 will become increasingly prevalent in any domain where repeated evaluation is possible and cheap, and where the algorithm itself is difficult to design explicitly. There are many startup opportunities to consider when choosing how to adapt the entire software development ecosystem to this new programming paradigm.
In the long run, the future of this paradigm is bright because it is increasingly clear that when we develop artificial general intelligence (AGI), it will certainly be written in Software 2.0.
Set Up a Human in The Loop Culture
DCAI is part of the AI revolution and is a revolution in itself. Revolutionary change is neither linear nor constant; it brings chaos that disturbs the organization and leads to the reshaping of its culture.
DCAI means bringing human intelligence to machine learning models: leveraging human expertise to train good AI.
To achieve this, you need to put the human at the center of a human-in-the-loop machine learning process. Real-world systems need AI technology, tools, and domain experts collaborating closely with one another to inspire research, provide data, analyze results, and develop algorithms that can solve the problem at hand.
Admittedly, this is not trivial. It will change your processes and the structure of your organization, possibly your business model too, and it will call for a new vision and a long-term mission.
However, you won’t regret it. As a matter of fact, you don’t really have a choice. If you don’t do it, other companies will, and you might find yourself trapped. Remember what happened to Zillow.
If you commit to it, it will improve the consistency, accuracy, transparency, and safety of your models. If you want to succeed, you need to start small.
Executing DCAI pilot projects will help you to gain momentum. It will also allow you to build a multi-disciplinary in-house DCAI team made up of subject matter experts, Machine Learning Engineers, and data quality managers, all of whom will receive broad DCAI training and gain the skills to train others.
The Future is Already Here
If you want to see what the future entails, have a look here. There is a paradigm shift and it’s happening now.
The question is not whether DCAI will be important for our industry; we are well aware that it will be. Nor is the challenge today whether you will adapt to AI: more and more mature companies, such as Google, Meta, and Apple, are adapting, and so will the rest of us.
The challenge is whether you will adapt quickly enough to turn it into an opportunity.