Cleaning your Dataset in Python: An Introduction
This process guide explains the data challenges that data scientists confront and how to remediate them using data cleaning techniques with Pandas and NumPy in Python.
Meaningful and accurate data is vital for any data science project. The planning and preparation, data sketching, and subsequent analysis activities will all be pointless unless the initial dataset is free from errors and ambiguity.
Data preparation accounts for the majority of the time that data scientists spend on their assignments. In some cases, it can be upwards of seventy to eighty percent of the total project time.
While this is a well-known issue in data science, datasets can be cleaned effectively using the Pandas and NumPy libraries in Python.
This process guide will review why it is imperative to generate a clean data baseline before any data manipulation occurs, the common dataset issues confronting data scientists, and how specific functions in Pandas and NumPy can achieve positive dataset outcomes.
Common Issues With Data And Datasets
Alongside the data accessibility and widespread reach that the Internet provides, and coupled with automated data scraping and collection tools, there are constant streams of information to review and analyze.
You can add to these repositories the vast array of freely distributed data that most government agencies host online, and the expensive cloud-based data lakes most medium to large organizations fund to store data.
Therefore, having access to a wide variety of information is not a challenge that data scientists face. So, if there is plenty of information readily available, why are data scientists allocating so much time to data preparation?
Poor data input quality is public enemy number one for any artificial intelligence or machine learning project. These data-sensitive information systems can only generate reputable results based on what they are given to process.
Following are the key challenges that data scientists face with input data:
Duplicate Data
It is not just data scientists who can be overwhelmed with an abundance of data; modern institutions face an onslaught internally and externally. Cloud data warehouses, streaming data repositories, and internal business databases will all have overlapping information, and with internal backup processes, multiple versions too.
Additionally, they will likely have business application and system silos. Once again, there is bound to be considerable duplication and overlapping among these sources.
Data duplicates typically occur with contact details, significantly affecting customer experience. Marketing campaigns resulting from data analysis projects will suffer if some prospects are missed out altogether or if others are repeatedly contacted. Duplicates increase the probability of skewed analytical results occurring.
Inaccurate Data
Data accuracy is a critical component for highly regulated industries such as healthcare. Inaccurate or false data cannot provide an accurate real-world view of the subject under analysis, and analyzing such data leads to incorrect decisions.
No matter how well personalized customer experiences are devised and implemented, they will only result in widespread disappointment if the underlying customer data is erroneous. Marketing campaigns will underperform, and valuable organizational funds will be squandered.
Inaccurate data can be traced to numerous factors, including human input error and data drift. Additionally, data quality will degrade over time and can lose its integrity as it is copied across various sub-systems. Automating data management can help, but dedicated data quality tools will deliver more acceptable outcomes.
Outliers
An outlier is a data object that differs remarkably from the remaining data objects in the dataset. Outliers are generally caused by execution, measurement, or manual entry errors. The process of detecting them is typically referred to as outlier mining.
There are numerous methods to detect data outliers, and the removal process within the data frame is relatively straightforward.
Outliers can be detected by using visualization or graphical techniques, by executing mathematical formulas directly on a dataset, or by using a specific statistical approach.
Each of these methods will be discussed as follows:
Boxplot
A boxplot captures a visual overview of the data using only a straightforward box-and-whisker graph. A boxplot summarizes the dataset's sample data using the 25th, 50th, and 75th percentiles.
By merely examining the boxplot, one can garner insights (such as quartiles, median, and outliers) within the dataset. In addition, outlier data objects can be readily discovered as they will be obvious in relation to the bulk of the dataset objects.
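The quartiles a boxplot is drawn from can also be computed directly. The following minimal sketch (using a small hypothetical sample of player weights, not the article's CSV) prints the three percentiles the box represents; the planted extreme value would appear as an isolated point beyond the whiskers. Drawing the plot itself requires matplotlib, so that line is left commented.

```python
import pandas as pd

# Hypothetical sample of player weights (kg); 250 is a planted outlier
weights = pd.Series([98, 104, 110, 101, 95, 118, 250])

# The three percentiles the box of a boxplot is drawn from
q1, median, q3 = weights.quantile([0.25, 0.50, 0.75])
print(q1, median, q3)

# To draw the actual plot (requires matplotlib):
# weights.plot.box()
```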
Scatter Plot
A scatter plot is generally used when a dataset contains paired numerical data, when the dependent variable has multiple values for each independent variable, or when determining the relationship between two variables.
As with a boxplot, a scatter plot can be used to detect outlier data objects.
Z-Score
A Z-score is also known as a standard score. The value, or score, helps establish exactly how far a data point lies from the mean. After setting a threshold value, the z-scores of the data points can be used to uncover any outliers.
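The idea above can be sketched in a few lines of Pandas. This is a minimal illustration on a hypothetical Series of player weights (not the article's CSV), and the threshold of 2 is a common but adjustable choice:

```python
import pandas as pd

# Hypothetical weights (kg); 250 is the planted outlier
weights = pd.Series([98, 104, 110, 101, 95, 118, 250])

# Standard score: how many standard deviations each point lies from the mean
z = (weights - weights.mean()) / weights.std()

# Flag values whose absolute z-score exceeds the chosen threshold
outliers = weights[z.abs() > 2]
print(outliers)
```

In this sample only the value 250 is flagged; every other point lies well under one standard deviation from the mean.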
IQR (or Inter Quartile Range)
The IQR is an approach to locating outlier data objects, and it is the most commonly used and trusted data cleansing approach within the research field.
The IQR is a benchmark of where the “middle fifty” is within a data set. Where a range is a measurement of where a beginning and an end are in a dataset, an IQR measures where the bulk of the values should reside.
For this reason, it is preferred over other measures of spread when reporting items such as educational performance scores. And it is highly efficient in detecting outlier values.
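The IQR approach can be sketched as follows, again using a hypothetical Series of player weights rather than the article's CSV. The fences of 1.5 × IQR beyond each quartile are the conventional (Tukey) choice:

```python
import pandas as pd

# Hypothetical weights (kg); 250 is the planted outlier
weights = pd.Series([98, 104, 110, 101, 95, 118, 250])

# The "middle fifty": the spread between the 25th and 75th percentiles
q1, q3 = weights.quantile([0.25, 0.75])
iqr = q3 - q1

# Conventional fences: 1.5 * IQR beyond each quartile
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = weights[(weights < lower) | (weights > upper)]
print(outliers)
```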
Ambiguous and Inexact Data
Within large and complex databases, data errors can infiltrate despite strict DBA supervision. For example, column headings can be misleading, and data formatting issues can arise automatically when back-end database rules, triggers, and views fire.
Spelling errors will occur undetected, especially if the aggregated data spans international borders (such as the spelling discrepancies that arise between English UK and English US). Ambiguous data such as these examples will introduce flaws with both reporting and analytics.
Hidden Or Missing Data
Organizations generally only use a portion of their data, while the remainder will be copied - and forgotten - within cloud backup data silos or dumped in low-access, low-speed data repositories.
For example, valuable and important customer data which is available only to the sales team may not be shared with the customer service team.
While the sales team guards its data and thereby ring-fences valuable customers, it fosters an environment in which the organization loses the opportunity to maintain complete customer profiles.
Hidden data bypasses valuable discovery opportunities to improve customer services, design ingenious products, and optimize business processes.
Inconsistent Data
When working across multiple data sources, it is quite likely that you will encounter mismatches in the same information across those sources.
Discrepancies arise from differing formats, units of measure, and spelling errors. In addition, inconsistent data commonly appears during data migration projects, such as server consolidations or upgrades.
Data inconsistencies build up over time and erode a dataset's value. Nevertheless, data-driven organizations watch closely for inconsistencies because they understand the need for trusted and accurate data powering their analytics endeavors.
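A common first step toward consistency is normalizing case and whitespace so that variants of the same value match. The following is a minimal sketch on a hypothetical Team column (the values and column name are illustrative, not from the article's CSV):

```python
import pandas as pd

# Hypothetical: the same team recorded differently across two source systems
df = pd.DataFrame({"Team": ["Dragons", "dragons ", "DRAGONS", "Scarlets"]})

# Normalize surrounding whitespace and letter case so the variants collapse
df["Team"] = df["Team"].str.strip().str.title()
print(df["Team"].unique())
```

After normalization, the three spellings of "Dragons" resolve to a single value.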
How To Perform Routine Dataset Checks
So, how does one combat the various challenges that data scientists face with their erroneous data? Python has two valuable functional libraries that any data scientist can readily use to manage, manipulate, and clean their datasets. They are Pandas and NumPy.
Key Python Tools to Clean Dataset
Pandas is an open-source Python library that is widely used by data scientists for complex data analysis and machine learning/AI projects.
Pandas is built on top of another library named NumPy. As a result, Pandas easily supports multi-dimensional data and is considered one of the most popular data-management packages available.
Pandas works well alongside other data science modules inside the Python ecosystem and remains compatible with new Python versions and upgrades.
The term NumPy is short for Numerical Python. It is a library consisting of multi-dimensional array objects and a collection of routines for processing those arrays, supporting a wide variety of mathematical and logical operations.
Importing Pandas And NumPy Using Python
This process guide assumes the reader has a basic understanding of the Pandas and NumPy libraries, including the following:
Series and DataFrame objects,
The common methods applied to these objects, and
Understanding of NumPy’s NaN.
To import both Pandas and NumPy libraries into Python, the following commands are used:
import pandas as pd
import numpy as np
The next step is to read in and provide access to the raw data that requires manipulation and cleaning. For example, create a DataFrame from a CSV file called 'DF-Rugby-Personnel-File-Historic.csv'.
In the data cleaning example below, we pass a path to pd.read_csv. The path assumes that all of the datasets are in one folder named DatasetLibrary, which is in our current working directory:
df = pd.read_csv('DatasetLibrary/DF-Rugby-Personnel-File-Historic.csv')
df.head()
Once the libraries are accessible and the raw data loaded into our working directory, we can commence the data cleaning process.
Unnecessary whitespace increases the size of your dataset in your database. It also makes finding duplicates a challenge, as most search algorithms factor the whitespace into their matching.
This activity will check the pre-loaded dataset for whitespace, as we see in the Name, Team, and Position columns below. We will then systematically remove the whitespace from the data.
The following Python code, once executed, removes the leading and trailing whitespace in the previously mentioned columns. The code below applies the str.strip() method to each column and assigns the result back so the change persists (str.replace(' ', '') would also remove the internal spaces within names, so str.strip() is the safer choice):
# remove whitespace from the Name column
df["Name"] = df["Name"].str.strip()
# remove whitespace from the Team column
df["Team"] = df["Team"].str.strip()
# remove whitespace from the Position column
df["Position"] = df["Position"].str.strip()
To finalize the data cleansing activity, check the dataset again to confirm that all whitespace has been removed.
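The same pattern can be verified end to end on a small self-contained example. The mini-frame below is hypothetical (it is not the article's CSV) and shows the before-and-after effect of stripping whitespace:

```python
import pandas as pd

# Hypothetical mini-frame with stray leading/trailing whitespace
df = pd.DataFrame({"Name": ["  Callum", "Rhys  "],
                   "Team": [" Dragons", "Scarlets "]})

# str.strip() removes leading and trailing whitespace; assign back to persist
for col in ["Name", "Team"]:
    df[col] = df[col].str.strip()

print(df)
```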
Dropping Columns In A DataFrame
Occasionally, not all data categories within a dataset will be useful to the task at hand. They will therefore need to be removed from your dataset before analytical processing occurs.
For our dataset example, the 'Star Sign' column is not useful, and retaining this data category will consume space and potentially slow runtime execution.
Pandas provides a useful way of removing unwanted rows or columns from a DataFrame: the drop() function.
We can drop this column as follows:
to_drop = ['Star Sign']
df.drop(to_drop, inplace=True, axis=1)
In the code above, a list has been defined containing the column names to drop (only one in our case). The drop() function is then executed with the inplace=True parameter and the axis parameter set to 1.
This Python command tells Pandas to make the changes directly within our object and to look for the values to be dropped among the columns rather than the rows.
Inspecting the DataFrame after this activity will show that the unwanted column has been removed:
Correcting Misspelled Words
Besides formatting errors, a dataset will often contain misspelled words, which makes the data problematic to analyze. Pre-built spell-checker libraries exist for Python, but a simple first step is to inspect the unique values in a column, as follows:
Run the following command to list all the unique words in the Position column using the unique() method.
# Turn output into a list of unique words from the Position column
unique_type = list(df["Position"].str.split(", ", expand=True).stack().unique())
# Print out the list
unique_type
As we can see, there are two misspelled words (Proppe and Flankker) that should be “Prop” and “Flanker.” So now one can proceed through the entire dataset, find which rows have these misspelled words, and remediate them.
['Prop', 'Hooker', 'No. 8', 'Proppe', 'Lock', 'Scrum-half', 'Fly-half', 'Flanker', 'Flankker', 'Centre', 'Winger', 'Fullback']
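One way to remediate the misspellings is to map each bad value to its correction with replace(). The sketch below uses a hypothetical single-valued Position column for clarity; for cells holding comma-separated lists of positions, a str.replace() substring substitution would be needed instead:

```python
import pandas as pd

# Hypothetical Position column containing the two misspellings found above
df = pd.DataFrame({"Position": ["Prop", "Proppe", "Flanker",
                                "Flankker", "Hooker"]})

# Map each misspelling to its correction; unmapped values pass through as-is
fixes = {"Proppe": "Prop", "Flankker": "Flanker"}
df["Position"] = df["Position"].replace(fixes)
print(sorted(df["Position"].unique()))
```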
Removing Duplicate Values
Large datasets typically contain duplicate entries that need to be identified and removed. The first step in detecting and removing such erroneous data is to look for duplicates by column name within the dataset.
Each entry in our sample dataset should have a unique value in the Name column. However, as we can see below, Callum has two entries: one at row 5 and another at row 7. Also, since the Weight column should only contain numbers, row 7 contains excess 'pounds' text in its Weight column. This row will need to be removed.
Run the following Python command to remove the duplicate using the drop_duplicates() function, keeping the last (keep="last") occurrence of each row.
df.drop_duplicates(subset=["Name"], keep="last", inplace=True)
df.head(6)
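For the stray 'pounds' text noted above, a common remediation (shown here on a hypothetical mini-frame, not the article's CSV) is to coerce the Weight column to numbers and drop whatever fails to parse:

```python
import pandas as pd

# Hypothetical Weight column where one entry carries stray text
df = pd.DataFrame({"Weight": ["104", "110", "118 pounds", "98"]})

# errors="coerce" turns unparseable entries into NaN so they can be found
df["Weight"] = pd.to_numeric(df["Weight"], errors="coerce")

# Drop the rows that failed to parse
df = df.dropna(subset=["Weight"])
print(df["Weight"].tolist())
```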
Detecting And Fixing Outliers
To remove an outlier, one follows the same process as removing any erroneous entry from a dataset: finding its exact position within the dataset. The bulk of the effort in dealing with outliers lies in detecting them. Whichever detection method is used, the resulting list of data items that satisfy the outlier definition will need to be deleted.
df.drop(row_index, inplace=True)
Fixing Formatting Errors
Another standard scenario, beyond duplicate data, is remediating formatting errors within a dataset. Inaccurate records containing formatting issues can be readily remediated using Python.
Perhaps a dataset entry has words separated by dashes, as in the example below, when they should be separated by commas and spaces. If so, the apply() and replace() methods can be used together.
Run the command below to see how the data looks in your DataFrame.
Run the following command below to replace dashes with commas.
df["Injuries"] = df["Injuries"].apply(lambda x: x.replace(" -", ","))
Rerun the earlier df.loc command to confirm there are now no dashes in the data.
As can be seen below, the output now shows commas separating the words.
Conclusion
Issues with data quality can be opportunities in disguise: addressing them prevents future data loss and prediction inaccuracies. With a common understanding of data quality techniques, you can leverage your data to improve customer experience, uncover new marketing opportunities, and drive business growth.
This process guide described the key data challenges that data scientists confront on a daily basis, and we have learned how to perform simple, yet powerful, data cleaning activities using Python.
We have also learned that Pandas and NumPy are popular and valuable Python library packages that save valuable time cleaning datasets.