AI experts and pundits sometimes refer to data as the new oil. Unlike oil, though, data is far from being a scarce resource: businesses worldwide find themselves swimming in it. Just because data is abundant, though, doesn’t mean you can simply take your data and use it as-is. There is a refinement process that you have to follow to ensure that your data provides accurate information that can later be used in decision-making. And this is where data cleaning comes into play.
Understanding Data Cleaning
Data Cleaning: what is it, again?
Data cleaning, also known as data cleansing and data scrubbing, is the process of correcting incorrect or corrupt data so that it can be used to generate reliable visualizations and models and drive business decisions. It is an important part of the data preparation stage in every data science or machine learning workflow. It is considered one of the most important quality-related steps when working with data, no matter the data type.
Data cleansing may mean removing data errors, wrong or duplicate data, fixing poor-quality data points, or even reformatting the whole dataset. And although it may seem like a tedious task, unfortunately, there are no shortcuts here. That’s because working with invalid data can lead to all sorts of incorrect outputs.
Data Cleaning: Why is it Important in Machine Learning?
A machine learning model needs to learn how the variables in the dataset it’s presented with relate to one another and how it can use this information to make its own decisions and predictions. Naturally, then, for a model to perform well, it needs both clean and diverse data.
If you neglect this cleaning stage and build a machine learning model on uncleaned data, the generalizations it makes will lead to false conclusions. This happens because inconsistent data confuses the model, making it harder for it to extract useful information, which in turn decreases the model's overall performance. For example, if you are working with two datasets, skipping the cleaning phase after merging them raises the risk of accidentally training your model on duplicated data and getting the wrong results. Therefore, to get decent results, data analysts, data scientists, or machine learning engineers have to clean the data during the data preparation stage.
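The merging scenario above can be sketched in a few lines of Pandas. This is a minimal illustration, not a complete deduplication strategy: the two tables, their column names, and the choice of the email column as the matching key are all assumptions made for the example.

```python
import pandas as pd

# Two hypothetical customer tables that partly overlap.
crm = pd.DataFrame({"email": ["ann@example.com", "bob@example.com"],
                    "name": ["Ann", "Bob"]})
newsletter = pd.DataFrame({"email": ["Bob@Example.com ", "cy@example.com"],
                           "name": ["Bob", "Cy"]})

merged = pd.concat([crm, newsletter], ignore_index=True)

# Normalize the key first, otherwise "Bob@Example.com " hides the duplicate.
merged["email"] = merged["email"].str.strip().str.lower()

# Keep the first occurrence of each email address.
deduplicated = (
    merged.drop_duplicates(subset="email", keep="first")
          .reset_index(drop=True)
)
```

Without the normalization step, the two spellings of Bob's address would be treated as different customers and both rows would reach the training set.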
What Causes Dirty Data?
The data we process refers to the real world around us. Since the real world is full of errors and inconsistencies, the data we get from it will be the same, especially if it has been collected and processed by error-prone humans.
Here are a few common reasons for poor-quality data and irrelevant data:
During the data entry stage, individuals may need to insert data into a database manually. Since naturally, as human beings, we are susceptible to spelling mistakes and typos, this may be an important source of errors in your dataset.
Data is duplicated when multiple data sources or points share the same information more than once. This occurs when the same information is inserted multiple times, sometimes in different formats. For example, if, on several occasions, a customer called several different people working for support, her name may have been entered into the CRM system several times, each time with a different spelling.
In a dataset, you will see incomplete data defined as NaN, NA or Null. This is because data has simply not been entered at all. This can happen for many reasons.
Let's say you're working with survey data. In this context, a user may have left a section blank, or a data entry specialist may not have added the response to the table.
Outliers in data
An outlier is an observation that deviates markedly from the rest of the data. In short, this means a value that lies far away from the typical values in a dataset, for example, a human height recorded as 28’.
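One common way to flag such values is the interquartile range (IQR) rule. The sketch below uses only the Python standard library; the sample heights and the conventional 1.5 multiplier are assumptions for illustration, and other cut-offs may suit your data better.

```python
import statistics

# Hypothetical heights in feet; 28.0 is the kind of entry described above.
heights_ft = [5.4, 5.9, 6.1, 5.7, 5.5, 28.0, 5.8, 6.0]

# Quartiles of the sample (first, second, third).
q1, _, q3 = statistics.quantiles(heights_ft, n=4)
iqr = q3 - q1

# Values outside these fences are treated as outlier candidates.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [h for h in heights_ft if h < lower or h > upper]
print(outliers)  # [28.0]
```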
How to Spot Dirty Data?
Since we already know what constitutes and causes bad-quality data, it’s time to name a few ways to spot it in your dataset.
Data profiling summarizes your dataset through descriptive statistics. The more statistical information you have on your dataset, the better you are at spotting errors. Descriptive statistics include data types, mean, median, and standard deviation.
The four most common and general data profiling techniques are:
Column Profiling: Scanning a dataset and counting the times each value shows up within each column.
Cross-column Profiling: Analyzing the relationships between values that appear across several columns.
Cross-table Profiling: Analyzing the relationships between values that appear across several tables.
Data Rule Validation: Analyzing the dataset to check if it conforms to predefined rules, like specific value ranges, patterns, and length.
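Two of the techniques above, column profiling and data rule validation, can be sketched with the standard library alone. The records, the field names, and the rules (a five-digit zip code, an age between 0 and 120) are hypothetical and only serve to show the idea.

```python
import re
from collections import Counter

# Hypothetical records; None marks a missing value.
records = [
    {"country": "FR", "zip": "75001", "age": 34},
    {"country": "fr", "zip": "7500", "age": 41},
    {"country": "DE", "zip": "10115", "age": None},
    {"country": "FR", "zip": "69002", "age": 230},
]

# Column profiling: how often does each value occur in a column?
country_counts = Counter(r["country"] for r in records)

# Missing values per column.
missing = {col: sum(r[col] is None for r in records) for col in records[0]}

def violations(record):
    """Data rule validation against the assumed rules."""
    problems = []
    if not re.fullmatch(r"\d{5}", record["zip"]):
        problems.append("zip")
    if record["age"] is not None and not 0 <= record["age"] <= 120:
        problems.append("age")
    return problems

# Map row index -> list of fields that break a rule.
invalid = {i: violations(r) for i, r in enumerate(records) if violations(r)}
```

Here the value counts reveal the inconsistent spelling "fr", the missing-value tally points at the age column, and the rule check flags rows 1 and 3.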
Exploratory Data Analysis
Data scientists and machine learning engineers perform exploratory data analysis (EDA) to better analyze and investigate the dataset and gain deeper insights into the data. EDA is done by creating data visualizations in programming languages such as Python and R, using libraries and packages such as Plotly, Seaborn, Pandas, or Matplotlib, or through Business Intelligence (BI) tools like Tableau or Splunk. BI tools have interactive dashboards, robust security, and advanced visualization features. This analytical process provides professionals with a comprehensive overview of their data and is, ultimately, likely to provide them with adequate material for getting reliable answers.
Data Cleaning Best Practices and Techniques
Raw data can be very difficult to work with. One of the most important reasons for this is that datasets vary from one another. So, naturally, the data-cleaning processes will have to vary, too. This means that, unfortunately, you won’t be able to use a one-size-fits-all solution to get high-quality data. However, this does not mean there aren’t any good guidelines you can follow to help you during the process. Here are a few of them:
Define and Understand Your Goals
As a professional working with data, you should always have a good understanding of the task at hand. This gives you context on what your next steps need to be when working with the data.
For example, you may be working on an insurance task where you are asked to analyze the insurance premiums of customers in a specific age group. This provides context on what is essential in your dataset (insurance premiums, age groups) and allows you to define your goals more specifically.
Plan and Organize
Planning always helps with your workflow. If your team has a strategic plan, clearly stated roles and responsibilities, with an organized timeline, you will have a smoother data-cleaning process.
Replace Missing Values and Outliers
As mentioned before, incomplete data, such as empty cells, rows and columns, is one of the major causes of bad data. So is the presence of outliers that skew the general data distribution. The best way to fix incorrect data, values, and outliers is through data imputation, which means predicting the data and/or replacing the missing values or outliers. There are a few data imputation techniques to choose from. Carefully consider the available options because they will affect the performance of your machine-learning model.
Replacing with zero
The fastest and easiest way to treat null values and outliers is to replace them with zeros. However, this may not be the best solution, as the missing data or the data that got converted to zeros could have provided valuable insights into a model.
Replacing with the mean value
A widely used method for replacing missing values and outliers is using the mean value. If your data contains outliers, though, the mean may not be the best option, as the mean itself is distorted by extreme values.
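The trade-off between the two options so far can be seen on a tiny, made-up example. The income figures below are hypothetical; the point is only how each replacement changes the summary statistics.

```python
import statistics

# Hypothetical monthly incomes with one missing entry (None).
incomes = [3000, 3200, None, 2900, 3100]
known = [v for v in incomes if v is not None]

# Option 1: replace the gap with zero.
zero_filled = [0 if v is None else v for v in incomes]

# Option 2: replace the gap with the mean of the known values.
mean_value = statistics.mean(known)
mean_filled = [mean_value if v is None else v for v in incomes]

print(statistics.mean(zero_filled))  # 2440, pulled down by the artificial zero
print(statistics.mean(mean_filled))  # 3050, unchanged by the imputed value

# A single extreme value shifts the mean far more than the median.
with_outlier = known + [90000]
print(statistics.mean(with_outlier))    # 20440
print(statistics.median(with_outlier))  # 3100
```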
Replacing with the median value
When your data is skewed, the median is usually a better replacement for missing values than the mean. However, if the median is computed on the whole dataset before it is split, this can lead to data leakage (using information in the model training process that is not expected to be available at prediction time) and other issues with data accuracy.
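One way to reduce the leakage risk is to learn the replacement value from the training split only and reuse it on the test split. The sketch below assumes the data has already been split; the values and the helper name fill are made up for the example.

```python
import statistics

# Hypothetical feature values, already split; None marks a missing entry.
train = [12.0, 15.0, None, 14.0, 90.0]
test = [13.0, None, 16.0]

# Learn the fill value from the training split only ...
train_median = statistics.median(v for v in train if v is not None)

def fill(values, fill_value):
    """Replace every None in values with fill_value."""
    return [fill_value if v is None else v for v in values]

# ... and reuse it unchanged on the test split.
train_filled = fill(train, train_median)
test_filled = fill(test, train_median)

print(train_median)  # 14.5
print(test_filled)   # [13.0, 14.5, 16.0]
```

Note that the extreme value 90.0 barely affects the median, which is why it is often preferred for skewed data.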
Interpolation is an estimation method used to fill in missing data. It replaces missing values or outliers, or adds new data points, based on a discrete set of known data points. In its simplest form, it takes the values from the rows above and below a specific value and uses their mean as the replacement.
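A minimal sketch of linear interpolation over an ordered series follows. The function name and the sample readings are assumptions for illustration; the sketch assumes evenly spaced rows and leaves leading or trailing gaps untouched, because they have no neighbor on one side.

```python
def interpolate_linear(values):
    """Fill None gaps that have a known neighbor on both sides."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        # Constant step between two consecutive known points.
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled

readings = [10.0, None, 14.0, None, None, 20.0, None]
print(interpolate_linear(readings))
# [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, None]
```

For a single missing row, the result is exactly the mean of the row above and the row below.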
You can use machine learning algorithms such as K-Nearest Neighbours, Naive Bayes, and Decision Trees to help predict or replace a missing value or an outlier. The main benefit is that the replacement value is backed by related records in the dataset rather than by a single summary statistic.
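To show the idea behind a K-Nearest Neighbours replacement, here is a deliberately small, pure-Python sketch. The records, the feature choice, and the function name knn_impute_income are hypothetical; in practice you would typically scale the features first and rely on a tested library implementation.

```python
import math

# Hypothetical complete records: (age, years_employed, income).
complete = [
    (25, 2, 2800.0),
    (27, 3, 3000.0),
    (45, 20, 5200.0),
    (47, 22, 5400.0),
    (30, 5, 3300.0),
]

def knn_impute_income(age, years_employed, k=2):
    """Average the income of the k records closest in (age, years_employed)."""
    by_distance = sorted(
        complete,
        key=lambda row: math.dist((age, years_employed), row[:2]),
    )
    neighbors = by_distance[:k]
    return sum(row[2] for row in neighbors) / k

# Estimate the missing income of a 26-year-old with 2 years of employment.
estimate = knn_impute_income(26, 2)
print(estimate)  # 2900.0
```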
Address Inconsistencies in Data
When you are going through your data, you will have to look out for inconsistencies that need to be addressed in order to get clean data. This may mean searching for structural errors in the naming conventions, typos, or incorrect capitalization, which lead to problems like mislabeled categories or classes. For example, you may find both “N/A” and “Not Applicable” used for the same category in the same dataset.
Inconsistent data typically occurs when an organization stores the same information in different places and fails to synchronize it, for example, when a company stores customer information in its CRM as well as in its newsletter marketing tool. But inconsistency can also mean having widely different measured values “y” for the same input variable “x”. For example, if we work with a weather station that measures a temperature of 30°C at t0 and -5°C at t0 + 1 sec, this is probably a sensor failure.
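A simple plausibility check for the weather-station case compares each reading with the last trusted one. The readings and the threshold of 2°C per second are assumptions chosen for illustration, and the sketch also assumes that the first reading is valid.

```python
# Hypothetical (timestamp in seconds, temperature in °C) readings.
readings = [(0, 30.0), (1, -5.0), (2, 29.8), (3, 29.9)]
MAX_CHANGE_PER_SECOND = 2.0  # assumed plausibility limit

suspicious = []
last_good = readings[0]  # assumption: the first reading is trustworthy
for t, temp in readings[1:]:
    # Rate of change relative to the last reading that passed the check.
    rate = abs(temp - last_good[1]) / (t - last_good[0])
    if rate > MAX_CHANGE_PER_SECOND:
        suspicious.append((t, temp))
    else:
        last_good = (t, temp)

print(suspicious)  # [(1, -5.0)]
```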
You can address inconsistencies in many ways: your choice of tools depends on your specific case and defined business rules. The actions that people perform most frequently are expanding abbreviations and acronyms, removing accents, fixing syntax errors, converting strings to lowercase, and removing stop words.
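Several of these actions can be combined into one small normalization function. The mapping of label variants below is an assumption; in a real project it would come from your own business rules.

```python
import unicodedata

# Assumed mapping of known variants to one canonical label.
CANONICAL = {
    "n/a": "not applicable",
    "na": "not applicable",
    "not applicable": "not applicable",
}

def strip_accents(text):
    """Remove accents by dropping Unicode combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize_label(raw):
    cleaned = strip_accents(raw).strip().lower()
    cleaned = " ".join(cleaned.split())  # collapse repeated whitespace
    return CANONICAL.get(cleaned, cleaned)

labels = ["N/A", "Not Applicable", " Café ", "cafe", "NA"]
normalized = [normalize_label(label) for label in labels]
print(normalized)
# ['not applicable', 'not applicable', 'cafe', 'cafe', 'not applicable']
```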
Standardize and Normalize Data
Standardization and normalization of data can mean different things, depending on whether you are at the cleaning or the preprocessing stage. To get clean data, the data needs to be standardized in a format that different team members can use and stored on platforms that are easily accessible to all users. This enables collaboration between different teams, advanced data analytics processes, reporting, and more. For example, as a data cleaning step, we can standardize dates from different time zones to UTC time or convert different measurement units to metric units.
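Both examples from the paragraph above, converting timestamps to UTC and converting units to metric, can be sketched with the standard library. The sample timestamps and the helper name to_kilograms are made up, and the sketch assumes each timestamp already carries an explicit UTC offset.

```python
from datetime import datetime, timezone

# ISO 8601 timestamps with explicit offsets (hypothetical sample values).
raw_timestamps = ["2023-05-01T09:30:00+02:00", "2023-05-01T03:30:00-04:00"]

# Parse each timestamp and express it in UTC.
utc_timestamps = [
    datetime.fromisoformat(ts).astimezone(timezone.utc).isoformat()
    for ts in raw_timestamps
]
print(utc_timestamps)
# ['2023-05-01T07:30:00+00:00', '2023-05-01T07:30:00+00:00']

POUNDS_TO_KG = 0.45359237  # exact by definition of the international pound

def to_kilograms(value, unit):
    """Convert a weight to kilograms; only 'kg' and 'lb' are handled here."""
    if unit == "kg":
        return value
    if unit == "lb":
        return value * POUNDS_TO_KG
    raise ValueError(f"unknown unit: {unit}")
```

After the conversion, the two timestamps turn out to describe the same instant, something that was not visible in the local representations.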
Normalizing the data is the process of handling and employing stored information to improve data use across an organization and to protect your model from duplicate entries or unwanted observations. It ensures the data looks the same across all records and fields within the organization. For example, if you are working with customers' personal data and you want their phone numbers to include dashes across all datasets, you will need to convert the original value stored across multiple fields, 2345678910, to 234-567-8910 throughout the organization.
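The phone number conversion could look like the sketch below. The function name format_phone is hypothetical, and the sketch assumes 10-digit numbers; anything else is returned as None so that it can be reviewed manually.

```python
import re

def format_phone(raw):
    """Return a 10-digit number as XXX-XXX-XXXX, or None if it does not fit."""
    digits = re.sub(r"\D", "", raw)  # keep digits only
    if len(digits) != 10:
        return None
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

print(format_phone("2345678910"))      # 234-567-8910
print(format_phone("(234) 567 8910"))  # 234-567-8910
print(format_phone("12345"))           # None
```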
Use Regular Expressions for Data Cleaning
A regular expression, or regex, is a sequence of characters that defines a search pattern, which you can use to automate searching for and replacing elements within text strings. It can be used to add, remove, isolate, and otherwise manipulate text, and to fix incorrectly formatted data.
For example, if you were looking for strings that begin with ‘The’, you could use the regex ‘^The’, which matches ‘The’ only at the start of a string.
You can get a step closer to clean data faster by using regex to fix problems like:
Dealing with special characters
Detecting and removing URLs
Detecting and removing HTML tags
Detecting and removing email IDs
Detecting and removing hashtags
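Several of the items above can be handled with Python's built-in re module. The patterns below are simplified assumptions that work on the sample string; real-world URLs, email addresses, and HTML are more varied, so treat this as a starting point rather than a complete solution.

```python
import re

# Simplified patterns, applied in this order.
HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"https?://\S+")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
HASHTAG = re.compile(r"#\w+")

def clean_text(text):
    """Remove tags, URLs, email addresses, and hashtags, then tidy spaces."""
    for pattern in (HTML_TAG, URL, EMAIL, HASHTAG):
        text = pattern.sub(" ", text)
    return " ".join(text.split())

raw = "<p>Great deal!</p> Visit https://example.com or mail sales@example.com #promo"
print(clean_text(raw))  # Great deal! Visit or mail
```

HTML tags are removed first so that a URL inside a tag attribute does not swallow the surrounding markup.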
Use Automation to Streamline the Process
Getting clean data may be a very time-consuming phase in a machine-learning workflow. Most data scientists and machine learning engineers tend to dislike data-cleaning. So why not automate your data cleaning process?
Unfortunately, automating data cleaning is easier said than done. As mentioned before, for different reasons, the datasets that you work with may vary. At the same time, automation is highly dependent on knowing the shape of the data and the exact domain-specific use case. As a result, you may only be able to automate specific parts of your whole data-cleaning procedure. Obvious candidates for steps that can be successfully automated are the actions performed repeatedly to clean your data across many different projects.
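One way to automate the repeated steps is to express each of them as a small function and chain them into a reusable pipeline. The step names and the sample rows below are hypothetical; the design simply makes it easy to add, remove, or reorder steps per project.

```python
def strip_whitespace(row):
    """Trim leading and trailing spaces from every string field."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}

def lowercase_email(row):
    """Lowercase the email field, if there is one."""
    email = row.get("email")
    return {**row, "email": email.lower()} if isinstance(email, str) else row

def empty_to_none(row):
    """Represent empty strings as explicit missing values."""
    return {k: (None if v == "" else v) for k, v in row.items()}

PIPELINE = [strip_whitespace, lowercase_email, empty_to_none]

def clean(rows, steps=PIPELINE):
    """Apply every step, in order, to every row."""
    cleaned = []
    for row in rows:
        for step in steps:
            row = step(row)
        cleaned.append(row)
    return cleaned

rows = [
    {"name": " Ann ", "email": "ANN@Example.com"},
    {"name": "", "email": " bob@example.com "},
]
result = clean(rows)
```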
Did you know that GPT can also be an excellent QA engineer? If you're curious to learn how it can be leveraged to help clean your dataset efficiently, make sure to check out our webinar on how to slash your data labeling time in half using ChatGPT.
Tools and Techniques for Data Cleaning
Tools and Software to Know
Here is a list of tools and software to help you prevent erroneous data.
Pandas is a popular Python library for data processing and is very useful for data professionals working with data cleaning and analysis. The Pandas library provides users fast and effective ways to load, manage, explore, manipulate, and transform their data.
OpenRefine was previously a Google product called Google Refine. It is open-source and has a variety of plugins and extensions available. Its user-friendly interface is straightforward, allowing users to explore and clean their data without using code. You can also run Python scripts to perform more complex data-cleansing tasks and tailor the process to your use case. OpenRefine is available in 15+ languages.
Trifacta is an interactive tool for data cleaning and transformation. Its Dataprep solution is an intelligent data service that allows users to explore, clean, and prepare their structured and/or unstructured data for analysis, reporting, and machine learning. It has useful built-in features, such as pattern highlighting, that help the user easily navigate any issues within the data.
Tibco Clarity is a data-cleansing tool that's perfectly suitable for people who don’t know how to code. It provides simple integration with different data sources and formats (e.g., tabular format), allowing you to merge and clean your data easily. It will give the output in a single format, allowing you to automate your process through data collection, cleaning, and formatting. The tool allows users to identify data patterns and create visualizations of their findings easily.
Talend provides users several tools, such as data cleaning, evaluation, and formatting. The Talend Trust Assessor tool quickly scans through your data before cleaning it. The tool ensures that the data is trustworthy and valuable to the task before it goes through the cleaning process.
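To make the first tool in this list more concrete, here is a minimal sketch of a Pandas cleaning pass. The CSV content is made up and read from a string so that the example is self-contained; with a real file you would pass its path to read_csv instead.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file.
csv_text = """name,age,city
Ann,34,Paris
bob,,lyon
Ann,34,Paris
"""

df = pd.read_csv(io.StringIO(csv_text))

df = df.drop_duplicates()                  # remove exact duplicate rows
df["name"] = df["name"].str.title()        # consistent capitalization
df["city"] = df["city"].str.title()
df["age"] = df["age"].fillna(df["age"].median())  # simple median imputation

print(df)
```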
Challenges and Limitations of Data Cleaning
Data cleaning is time-consuming. As Steve Lohr of The New York Times once said: "Data scientists, according to interviews and expert estimates, spend 50 percent to 80 percent of their time mired in the mundane labor of collecting and preparing unruly digital data before it can be explored for useful nuggets."
Issues with large datasets
With the rapid growth in data volume, one of the biggest challenges for the data-cleaning phase is scalability. As you can imagine, the bigger the dataset, the harder the cleaning process. It goes without saying that the same challenge applies to multiple datasets, or complex data structures.
Big data requires complex computer data analysis and effective maintenance. Generally, larger datasets will naturally require investing more time and money and creating more complex processes to reap the benefits of analytical insights. Finally, the more data collected, the higher the risk of errors during data wrangling. With large datasets, the need for secure and adequate storage is even more critical, as potentially losing it due to an accident or hackers' activity has more serious repercussions.
Data ethics deals with moral problems related to handling data. In a nutshell, companies must ensure that the data in their possession is used lawfully, fairly, and can be traced for valid lawful purposes, focusing on transparency, diversity, dignity, privacy, data integrity, and data quality culture and reliability. In your day-to-day work, you’ll most likely have to deal with bias and data privacy issues.
Bias in a dataset may mean an over- or under-representation of a certain group in the dataset, like white males being over-represented in an image dataset fed to a facial recognition system or older people being under-represented in a database used to train a credit-scoring model. As a Data Scientist or a Machine Learning Engineer, you must address these issues. Failing to do so will result in skewed results, irrelevant observations, and possibly even legal consequences.
As far as data privacy is concerned, the rules can vary across jurisdictions, but generally, you will need to ensure that any sensitive data in your dataset is properly handled. In this context, data cleaning may mean that you’ll have to anonymize or even completely remove sensitive data to meet regulatory requirements.
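As a rough sketch of what this can look like in code, the example below drops some fields entirely and replaces another with a keyed hash. The field names and the key are placeholders. Note that keyed hashing is pseudonymization rather than full anonymization, so whether it satisfies a given regulation is something to confirm with your legal or compliance team.

```python
import hashlib
import hmac

# Placeholder key; in practice it must be stored outside the dataset.
SECRET_KEY = b"replace-with-a-key-kept-outside-the-dataset"
DROP_FIELDS = {"name", "phone"}      # assumed direct identifiers to remove
PSEUDONYMIZE_FIELDS = {"email"}      # kept only as a stable token

def pseudonymize(value):
    """Replace a value with a keyed SHA-256 digest (same input, same token)."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

def scrub(record):
    """Drop direct identifiers and tokenize the fields listed above."""
    cleaned = {}
    for field, value in record.items():
        if field in DROP_FIELDS:
            continue
        cleaned[field] = pseudonymize(value) if field in PSEUDONYMIZE_FIELDS else value
    return cleaned

record = {"name": "Ann", "phone": "2345678910", "email": "ann@example.com", "plan": "pro"}
scrubbed = scrub(record)
```

Because the same email always maps to the same token, records can still be joined or deduplicated after scrubbing.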
Real-life example: a labeling platform
Labeling platforms enable companies to train their models on pre-labeled datasets. The general idea is that the data labeling phase adds meaningful information to provide context to raw data. Employees can either do this manually or use automated labeling. There are tons of helpful features designed to make this process as quick and seamless as possible, and every time something needs to be reworked or repeated, the overall performance of the process takes a hit.
Now let’s imagine labelers who need to annotate a dataset that’s full of typos, null values, or duplicate values. These errors require the labelers to ask questions or even skip an asset, which makes the process longer and more expensive.
The data labeling process can be highly affected if it works with unclean data; here are some reasons why:
The labeling team will have to (try to) label data that is difficult to annotate;
The annotation process becomes even more tedious, since choosing one annotation over another gets more complex;
The machine learning model is likely to be less accurate, because it will be trained on partially bad-quality data;
There is a high risk of data leakage when nearly duplicate images exist (one in the training dataset, the other in the test dataset).
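The leakage risk in the last point can be partly checked before labeling starts. The sketch below compares content hashes across the two splits; the file names and byte strings are made up. It only catches byte-for-byte identical files, so truly near-duplicate images (resized or re-encoded copies) would need a different technique, such as perceptual hashing or embedding similarity.

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Content hash of an asset; identical bytes give identical hashes."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical assets: file name -> raw bytes.
train_assets = {"cat_01.jpg": b"\x89imgA", "dog_07.jpg": b"\x89imgB"}
test_assets = {"cat_copy.jpg": b"\x89imgA", "bird_02.jpg": b"\x89imgC"}

# Index the training split by content hash.
train_hashes = {fingerprint(data): name for name, data in train_assets.items()}

# Pairs of (training file, test file) that share the same content.
leaks = [
    (train_hashes[fingerprint(data)], name)
    for name, data in test_assets.items()
    if fingerprint(data) in train_hashes
]
print(leaks)  # [('cat_01.jpg', 'cat_copy.jpg')]
```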
Generally, if your data labeling is inconsistent, it can lead to a significant reduction in label accuracy, which in turn leads to a sharp decrease in overall model accuracy. To rectify this problem, you may need to double the amount of annotated data to improve the model's performance.
When using Kili Technology, you can annotate all types of unstructured data rapidly and accurately with customizable annotation tasks & an interface optimized for productivity & quality. When labeling data, ensure that the process is efficient, accurate, and error-free.
Data labeling tools follow a Data-Centric process, in which there is a focus on data quality instead of code. It is the practice of systematically engineering the data used to build AI systems, combining the valuable elements of both code and data.
Data Cleaning: Final Thoughts
Data cleaning is a crucial phase in any machine-learning pipeline. Regardless of how sophisticated your machine learning model or analytical tools are, your processed data needs to be high-quality; otherwise, the result will be a low-performing model and inaccurate outputs. Clean, good-quality data is the ticket to ensuring that your AI-assisted application can perform accurately in decision-making processes and that the decisions are evidence-based. Mastering the tools and techniques that you can use to improve your data-cleaning process will ensure accurate data and reliable outputs from your models and increase overall productivity.