Clean data is the oxygen that enables trained machine learning models to deliver Olympic-level performance.
When we started learning COBOL in high school, one of the first things the teacher introduced was the concept of GIGO, which stands for "garbage in, garbage out". If we feed jumbled, low-quality data into a program, it will either error out or produce inaccurate results. This fundamental principle has not changed in machine learning. If anything, it has become more relevant over time, considering the massive amount of data required to train a model for real-life artificial intelligence use cases.
Raw data is gathered from a variety of sources and modes nowadays. One long-popular mode was user-contributed input from surveys. However, because humans tend to make mistakes while recording data, datasets compiled this way generally require plenty of love and care in the data cleansing stage. More recently, with the rise of the internet of things (IoT), sensors and chips embedded in every possible place collect machine-generated data autonomously, without human intervention. Data collected by sensors is generally of much better quality, but it still needs cleansing due to issues such as loss of connectivity. Unfortunately, in specific scenarios we still need to depend on crowdsourced data because of the nature of the inputs required.
Common issues we see with the raw data are as follows.
- Missing values or incorrect format: Missing data can be broadly classified into two categories: data missing at random without any specific pattern, and data missing in a systematic pattern that depends on the values of other attributes in the record. Missing values are prevalent in datasets collected through human input. The usual suspects are date fields, where DD/MM/YY and MM/DD/YY formats get confused, and numbers written with the different decimal notations used across countries.
- Quantitative values in different units of measure: The answer to a question about the same physical distance between two points will differ between the UK and India because of the measurement systems used in the respective countries. In the UK, locals will reply in miles, while in India the answer will be in kilometers.
- Duplicate records: A dataset may contain duplicate records. Duplicates not only give a false impression of the number of records available for training and testing a machine learning model but also distort the evaluation of the model's predictions. The model can easily predict the result for a record in the test dataset that it has already seen as a duplicate during training, so the trained model looks great at prediction when, in reality, it has been evaluated on an unmasked test dataset. One way to filter out duplicates is to use an attribute such as a social security number, if available, as a unique identifier. Alternatively, generate a concatenated field such as full name followed by date of birth as a unique key to identify duplicate records, as shown in the sketch below.
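As a minimal sketch, assuming the records live in a pandas DataFrame with hypothetical full_name and date_of_birth columns, duplicate detection with a concatenated key might look like this:

```python
import pandas as pd

# Hypothetical employee records; column names and values are illustrative only.
df = pd.DataFrame({
    "full_name": ["Asha Rao", "Asha Rao", "John Smith"],
    "date_of_birth": ["1990-03-12", "1990-03-12", "1985-07-01"],
    "salary": [52000, 52000, 61000],
})

# Build a concatenated key (full name + date of birth) to act as a unique identifier.
df["dedup_key"] = df["full_name"].str.strip().str.lower() + "|" + df["date_of_birth"]

# Flag duplicates, then drop them while keeping the first occurrence of each key.
duplicates = df[df.duplicated(subset="dedup_key", keep="first")]
df_clean = df.drop_duplicates(subset="dedup_key", keep="first").drop(columns="dedup_key")

print(f"Dropped {len(duplicates)} duplicate record(s); {len(df_clean)} remain.")
```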
We must approach cleaning the raw source dataset with both quantitative and qualitative techniques.
Quantitative Data Cleaning
These techniques are more mechanical and rely heavily on statistics. We can identify missing values with a couple of lines of code. Several strategies are available to fill a missing value: for numeric attributes, replace it with the mean or median of the attribute, the most common value, or a constant. The vital skill is selecting the proper replacement strategy, which depends on the business use case we are trying to solve with machine learning and on the data source.
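As a minimal sketch, assuming the data sits in a pandas DataFrame with a hypothetical numeric height_cm column, identifying missing values and trying the replacement strategies mentioned above might look like this:

```python
import numpy as np
import pandas as pd

# Illustrative data only; NaN marks the missing entries.
df = pd.DataFrame({"height_cm": [172.0, np.nan, 168.0, 181.0, np.nan]})

# Identify how many values are missing per attribute.
print(df.isna().sum())

# Strategy 1: fill with the median of the attribute.
df["height_median"] = df["height_cm"].fillna(df["height_cm"].median())

# Strategy 2: fill with the most common value (mode).
df["height_mode"] = df["height_cm"].fillna(df["height_cm"].mode().iloc[0])

# Strategy 3: fill with a constant sentinel value.
df["height_const"] = df["height_cm"].fillna(-1)
```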
Let us consider another example. Suppose a few numeric data points appear to be outliers during initial exploratory data analysis. In that case, the first thing to check is whether those data points were recorded in the same unit of measure as the rest. For example, height may be recorded in centimeters for most of the records while a tiny proportion is in millimeters. We risk dropping valuable records as rogue data points in such cases, when the real fix is to convert the attribute to a single standard unit of measure.
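A minimal sketch of that fix, assuming a hypothetical height column where the millimeter entries are identifiable because they are an order of magnitude larger than any plausible centimeter value:

```python
import pandas as pd

# Illustrative data: most heights in centimeters, a few mistakenly in millimeters.
df = pd.DataFrame({"height": [172.0, 1680.0, 165.5, 1810.0, 158.0]})

# Heuristic assumption: values above 300 in this column were recorded in
# millimeters rather than centimeters, so convert them instead of dropping the rows.
mm_mask = df["height"] > 300
df.loc[mm_mask, "height"] = df.loc[mm_mask, "height"] / 10.0

print(df)  # all heights now expressed in centimeters
```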
Sometimes the business case is to detect fraud or novelty, and the source training dataset deliberately contains these outliers. In such a scenario, it would be fatal to drop them as rogue entries. The actual act of data cleansing is straightforward, needing only a few lines of code; the main challenge, as the examples above show, is making the appropriate decision while cleansing.
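One way to support that decision is to flag outliers for review rather than drop them. The sketch below uses the interquartile-range rule on a hypothetical transaction amount column; the flagged rows are kept because, in a fraud or novelty use case, they may be the signal itself:

```python
import pandas as pd

# Illustrative transaction amounts; the large value is a candidate outlier.
df = pd.DataFrame({"amount": [12.5, 9.9, 11.0, 10.4, 950.0, 8.7]})

# Interquartile-range rule: flag values far outside the middle 50% of the data.
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
df["is_outlier"] = (df["amount"] < q1 - 1.5 * iqr) | (df["amount"] > q3 + 1.5 * iqr)

# Keep the flagged rows for inspection instead of deleting them.
print(df[df["is_outlier"]])
```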
Qualitative Data Cleansing
These techniques are more complex and require domain knowledge and business rules to clean the data. For example, the sample training dataset below may look fine to someone outside the human resources (HR) domain. Employee records 56 and 76 are similar in educational qualification, experience, and designation, yet employee 76, who lives in the capital city and receives a higher city allowance as a salary component, still earns a lower salary than employee 56, who lives in a small town. The difference is not large enough to immediately strike a generalist data expert.
Someone from the HR domain who understands the company's compensation-package business rules, however, can immediately point out a potentially incorrect salary entry for record 76. Cleaning data from a qualitative perspective therefore takes more effort and time.
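As a minimal sketch, a domain business rule like this can be encoded as a validation check. The columns, values, and the rule itself (within the same grade, an employee receiving a capital-city allowance should not earn less than a peer in a small town) are hypothetical, not the company's actual policy:

```python
import pandas as pd

# Hypothetical HR records; the rule below is illustrative only.
df = pd.DataFrame({
    "employee_id": [56, 76],
    "grade": ["Senior Analyst", "Senior Analyst"],
    "location": ["town", "capital"],
    "city_allowance": [0, 5000],
    "salary": [68000, 66000],
})

# Assumed business rule: within a grade, a capital-city employee (who gets a city
# allowance) should earn at least as much as the highest-paid small-town peer.
town_benchmark = df[df["location"] == "town"].groupby("grade")["salary"].max()
df["suspect"] = (df["location"] == "capital") & (df["salary"] < df["grade"].map(town_benchmark))

print(df[df["suspect"]])  # record 76 is flagged for manual HR review
```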
Training a machine learning model with cleansed data is as important as choosing the appropriate algorithm. A Swiss Army knife carries different tools for different nuts and screws; in the same way, we need to apply quantitative and qualitative data cleaning techniques based on the problem we aim to solve with machine learning and on the source of the dataset.
A generalist data scientist can handle the quantitative data cleansing in a project. However, it is essential to take advice and guidance from domain experts, based on the dataset's source and target objective, to define the business rules for qualitative data cleansing.