
(Optional Primer)

Before we begin, it is good to understand the types of missing data and the various imputation techniques available. I have placed the primer in a separate article to keep this article brief. If you are already familiar with these concepts, feel free to skip this part.

Developed at Amazon Science, **DataWig** is a software package that applies missing value imputation to tables containing **heterogeneous** data types, i.e., numerical, categorical, and unstructured text.

The goal is to build a robust and scalable framework that allows users to impute missing values **without **extensive engineering efforts or a machine learning background.

DataWig runs three components to perform imputation for heterogeneous data: **Encode, Featurizer**, and **Imputer**.

We can see how DataWig works with an example involving **non-numerical** data. Let’s say we have a 3-row product catalog dataset where the ‘**color**’ column has a missing value in the third row.

Thus, the ‘color’ column is the to-be-imputed column (aka the **output** column), while the other columns are the **input** columns.

The aim is to use the first two rows (containing complete data) to train an imputation model and predict the missing ‘**color**’ value in the third row.

- The data types of the columns are first determined automatically using **heuristics**. For example, a column is classified as categorical instead of plain text if it has at least ten times as many rows as unique values.
- Features are converted to numerical representations using column encoders, e.g., one-hot encoding.
- The numerically encoded columns are transformed into feature vectors.
- The feature vectors are concatenated into a latent representation that is passed into the imputation model for training and prediction.
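To make the type-detection heuristic above concrete, here is a minimal sketch in pandas. This is an illustration of the rule described in the text, not DataWig's actual implementation; the column names are made up for the example.

```python
import pandas as pd

def infer_column_type(series: pd.Series) -> str:
    """Classify a column as 'numerical', 'categorical', or 'text' using
    the heuristic described above: a string column counts as categorical
    if it has at least ten times as many rows as unique values."""
    if pd.api.types.is_numeric_dtype(series):
        return "numerical"
    if len(series) >= 10 * series.nunique():
        return "categorical"
    return "text"

df = pd.DataFrame({
    "price": [9.99] * 20,                               # numeric dtype
    "color": ["red", "blue"] * 10,                      # 20 rows, 2 unique -> categorical
    "title": [f"product no. {i}" for i in range(20)],   # all unique -> text
})

types = {col: infer_column_type(df[col]) for col in df.columns}
# types -> {'price': 'numerical', 'color': 'categorical', 'title': 'text'}
```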

Let’s explore each of the three components:

## (i) Encoder

The **ColumnEncoder** class transforms raw data into numerical representations. There are different types of encoders for the different data types, such as:

- *SequentialEncoder*: sequences of **string** symbols (e.g., characters)
- *BowEncoder*: bag-of-words representation of **strings** as sparse vectors
- *CategoricalEncoder*: for categorical variables (**one-hot** encoding)
- *NumericalEncoder*: for numerical values (**normalization** of values)
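The encodings these classes produce can be sketched with standard tools. The snippet below illustrates the ideas behind three of the encoders using scikit-learn and NumPy analogues; it is not DataWig code, and the example values are invented.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

# CategoricalEncoder analogue: one-hot encode a categorical column
colors = [["red"], ["blue"], ["red"]]
one_hot = OneHotEncoder().fit_transform(colors).toarray()
# Each row becomes a binary indicator vector over {blue, red}

# BowEncoder analogue: bag-of-words sparse vectors for free text
titles = ["red cotton shirt", "blue denim jeans"]
bow = CountVectorizer().fit_transform(titles)   # scipy sparse matrix

# NumericalEncoder analogue: normalize numerical values to zero mean, unit variance
heights = np.array([150.0, 160.0, 170.0])
normalized = (heights - heights.mean()) / heights.std()
```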

## (ii) Featurizer

After encoding into numerical representations, the next step is transforming the data into feature vectors using featurizers.

The purpose is to feed the data as a vector representation into the imputation model’s computational graph for training and prediction.

There are also different types of featurizers to cater to the different data types:

- *LSTMFeaturizer*: maps input sequences into latent vectors using an LSTM
- *BowFeaturizer*: converts string data into sparse vectors
- *EmbeddingFeaturizer*: maps encoded categorical data into vector representations (i.e., embeddings)
- *NumericalFeaturizer*: extracts feature vectors using fully connected layers
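The core idea behind the *EmbeddingFeaturizer*, and the concatenation step that follows it, can be sketched in plain NumPy. This is a conceptual illustration only; the dimensions and values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Suppose a CategoricalEncoder mapped 'color' values to integer ids 0..3.
vocab_size, embed_dim = 4, 8
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))

color_ids = np.array([0, 2, 2, 1])           # encoded column for 4 rows
color_vectors = embedding_matrix[color_ids]  # one dense vector per row

# Feature vectors from all columns are then concatenated into one
# latent representation per row for the imputation model
other_features = rng.normal(size=(4, 3))     # e.g., from a NumericalFeaturizer
latent = np.concatenate([color_vectors, other_features], axis=1)
```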

## (iii) Imputer

The final part is to create the **imputation model**, execute training, and generate predictions to fill in the missing values.

DataWig adopts the MICE technique for imputation, and the model used within is a neural network trained in the backend with MXNet.

In a nutshell, columns containing helpful information are used by the deep learning model to impute missing values in the to-be-imputed column.

Given that there will be different data types, the appropriate loss functions (e.g., squared loss or cross-entropy loss) are also selected automatically.
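The automatic loss selection can be sketched as follows. This is an illustration of the principle, assuming squared loss for numerical outputs and cross-entropy for categorical ones; it is not DataWig's internal code.

```python
import numpy as np

def pick_loss(column_type: str):
    """Choose a loss function by output-column type, mirroring the
    automatic selection described above (a sketch, not DataWig code)."""
    if column_type == "numerical":
        return lambda y, y_hat: np.mean((y - y_hat) ** 2)           # squared loss
    # categorical: cross-entropy over predicted class probabilities
    return lambda y, p: -np.mean(np.log(p[np.arange(len(y)), y]))

squared = pick_loss("numerical")
xent = pick_loss("categorical")

mse = squared(np.array([1.0, 2.0]), np.array([1.0, 4.0]))  # -> 2.0
probs = np.array([[0.9, 0.1], [0.2, 0.8]])                 # predicted class probabilities
ce = xent(np.array([0, 1]), probs)                         # true classes 0 and 1
```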

The Amazon Science team evaluated DataWig by comparing it against five popular techniques for imputing **missing numerical** values.

These other imputation techniques include mean imputation, kNN, matrix factorization (MF), and iterative imputation (linear regression and random forest). The comparison was conducted across synthetic and real-world data with varying amounts of missing data and types of missingness.

Based on the normalized mean-squared error, DataWig compared favorably to other approaches, even in the difficult MNAR missingness type. The results are displayed in the plot above.

Further details of the evaluation (including on unstructured text) can be found in the research paper.

*Author’s thought: Given DataWig’s purported strengths in handling categorical and text features, I was surprised that the research paper’s evaluation focus was on missing numerical values.*

To show how DataWig works, we will use the **Heart Disease Dataset** since it contains both numerical and categorical data types.

*Note: You can find the GitHub repo for this project **here** and the complete Jupyter notebook demo **here**.*

In particular, we will perform two imputations as part of the demo:

- **Numerical imputation**: Fill in missing values in the numerical **MaxHR** column (maximum heart rate achieved by a person)
- **Categorical imputation**: Fill in missing values in the categorical **ChestPain** column

## Step 1 — Initial Setup

- Create and activate a new *conda* environment with Python version 3.7. The reason is that DataWig currently works with Python version 3.7 and below.

`conda create -n myenv python=3.7`

`conda activate myenv`

- Install DataWig:

`pip install datawig`

- If you would like the environment to appear in your Jupyter notebook, you can run the following:

`python -m ipykernel install --user --name myenv --display-name "myenv"`

*Note: Ensure pandas, NumPy, and scikit-learn libraries are updated to the latest versions.*

## Step 2 — Data Pre-processing

There are two preprocessing steps to do before imputation:

- Perform a random shuffled train-test split (80/20)
- Randomly hide an arbitrary proportion (e.g., 25%) of values in the **test** dataset to simulate missing data. The train set remains completely non-missing for the imputation model to train on.
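The two preprocessing steps can be sketched in pandas as follows. The toy dataframe below stands in for the Heart Disease data (the column names match the ones used in this demo, but the values are randomly generated for illustration).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy stand-in for the Heart Disease data
df = pd.DataFrame({
    "MaxHR": rng.integers(100, 200, size=100).astype(float),
    "ChestPain": rng.choice(["typical", "nontypical", "asymptomatic"], size=100),
})

# Step 1: random shuffled 80/20 train-test split
shuffled = df.sample(frac=1.0, random_state=0).reset_index(drop=True)
train, test = shuffled.iloc[:80].copy(), shuffled.iloc[80:].copy()

# Step 2: hide 25% of 'MaxHR' in the test set to simulate missingness;
# the hidden originals can later serve as ground truth for evaluation
hidden_idx = rng.choice(test.index, size=len(test) // 4, replace=False)
ground_truth = test.loc[hidden_idx, "MaxHR"].copy()
test.loc[hidden_idx, "MaxHR"] = np.nan
```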

## Step 3 — Setup Imputation Model

The easiest way to build and deploy an imputation model is to use the `SimpleImputer` class. It automatically detects the column data types and uses a set of default encoders and featurizers that yield good results on various datasets.

We first define a list of input columns deemed useful for predicting missing values in the to-be-imputed column. This list is based on the **user’s domain knowledge and critical judgment.**

We then create two instances of `SimpleImputer`, one for each of the two columns to be imputed (i.e., **MaxHR** and **ChestPain**).

## Step 4 — Fit Imputation Model

With our model instances ready, we can fit them on our train dataset. Beyond a simple model fit, we can leverage the hyperparameter optimization (HPO) function `fit_hpo` of `SimpleImputer` to find the best imputation model.

The HPO function performs a random search over a custom grid of hyperparameters (e.g., learning rate, batch size, number of hidden layers).

If HPO is not required, we can omit the hyperparameter search arguments (as shown in the categorical imputation example).

## Step 5 — Execute Imputation and Generate Predictions

The next step is to generate predictions by running the trained imputation models on the test set with missing values.

The output is the original dataframe **plus** a new column of the imputed data.

## Step 6 — Evaluation

Finally, let’s see how our imputation models fared with these evaluation metrics:

- Mean-squared error (MSE) for numerical imputation
- Matthews correlation coefficient (MCC) for categorical imputation

For this demonstration, the MSE is **342.4** and the MCC is **0.22**. These values form the benchmark for comparison with other imputation techniques.
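Computing these two metrics is straightforward with scikit-learn. The values below are invented for illustration (they are not the demo's results); in practice you would compare the imputed values against the ground truth hidden in Step 2.

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef, mean_squared_error

# Hypothetical hidden ground-truth values vs. the model's imputations
true_maxhr = np.array([150.0, 172.0, 140.0, 160.0])
imputed_maxhr = np.array([148.0, 180.0, 150.0, 155.0])
mse = mean_squared_error(true_maxhr, imputed_maxhr)       # numerical metric

true_pain = ["typical", "asymptomatic", "typical", "nonanginal"]
imputed_pain = ["typical", "typical", "typical", "nonanginal"]
mcc = matthews_corrcoef(true_pain, imputed_pain)          # categorical metric
```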

Beyond the basic implementation described earlier, we can leverage advanced DataWig features for our specific project needs.

## (i) Imputer

If we want more control over the types of models and preprocessing steps in the imputation models, we can use the `Imputer` class.

It provides greater flexibility for the custom specification of model parameters (such as particular encoders and featurizers) compared to the default settings in `SimpleImputer`.

Here is an example of how the encoder and featurizer for each column are explicitly defined in `Imputer`:

The `Imputer` instance can then be used to do a `.fit()` and `.predict()`.

*Author’s thought: The specific definition of column types can be helpful because automatic encoding and featurizing may not always work perfectly. For example, in this dataset, the **SimpleImputer** misidentified the categorical **Thal** column as a text column.*

## (ii) Label-shift Detection and Correction

The `SimpleImputer` class has a handy function, `check_for_label_shift`, that helps us detect issues of data drift (label shift in particular).

Label shift occurs when the marginal distribution of the labels differs between the training and real-world data. By understanding how the label distribution has changed, we can account for the shift in our imputation.

The `check_for_label_shift` function logs the severity of the shift and returns the weight factors for the labels. Here is a sample output of the weights:

To correct the shift, we then retrain the model with a weighted likelihood, passing the weights when we re-fit the imputation model.

We have covered how DataWig can be used to impute missing values in data tables effectively and efficiently.

One important caveat is that imputation tools such as DataWig are **not** magic bullets for handling missing data.

Dealing with missing data is a challenging process that requires proper investigation and a strong understanding of the data and context. A clear example is shown in this demo, where users need to decide which input features to feed into the model to impute the output column accurately.

The GitHub repo for this project can be found **here**.

I welcome you to **join me on a data science learning journey!** Follow this Medium page and check out my GitHub to stay in the loop of more exciting data science content. Meanwhile, have fun imputing missing values with DataWig!
