Machine Learning News Hubb

Validate Your pandas DataFrame with Pandera | by Nisar Ahmad | Sep, 2022

September 10, 2022
in Machine Learning


Make Sure Your Data Matches Your Expectation

In a data science project, it is important to test not only your functions but also your data, to make sure it matches your expectations.


Even though Great Expectations provides a lot of useful utilities, creating a validation suite with it can be complicated. For a small data science project, Great Expectations can be overkill.

That is why in this article we will learn about Pandera, a simple Python library for validating a pandas DataFrame.

To install Pandera, type:

pip install pandera

To learn how Pandera works, let’s start with creating a simple dataset:
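The original dataset was shown as an image that did not survive, so here is a minimal sketch of what it may have looked like. The values are assumptions, chosen to agree with the error messages quoted later in this article (a price of 4 at index 3, a repeated 'Walmart' at index 2).

```python
import pandas as pd

# Assumed sample data: three columns describing fruits, stores, and prices.
fruits = pd.DataFrame(
    {
        "name": ["apple", "banana", "apple", "orange"],
        "store": ["Aldi", "Walmart", "Walmart", "Aldi"],
        "price": [2, 1, 3, 4],
    }
)

print(fruits)
```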


Imagine this scenario. Your manager told you that there can only be certain fruits and stores in the dataset, and the price must be less than 4.

Checking these conditions manually can take too much time, especially when your data is big. Is there a way to automate this process?

That is when Pandera comes in handy. Specifically, we:

  • Create multiple tests for the entire dataset using DataFrameSchema
  • Create multiple tests for each column using Column
  • Specify the type of test using Check

SchemaError:  failed element-wise validator 0:

failure cases:
   index  failure_case
0      3             4

In the code above:

  • "name": Column(str, Check.isin(available_fruits)) checks if the column name is of type string and if all values of the column name are inside a specified list.
  • "price": Column(int, Check.less_than(4)) checks if all values in the column price are of type int and are less than 4.
  • Since not all values in the column price are less than 4, the test fails.

The Check class provides other built-in methods as well; they are listed in the Pandera documentation.

We can also create custom checks using lambda. In the code below, Check(lambda price: sum(price) < 20) checks if the sum of the column price is less than 20.

When our tests are complicated, a schema written as a class, much like a dataclass, can look much cleaner than a dictionary. Luckily, Pandera also allows us to define tests as a class instead of a dictionary.

Now that we know how to create tests for our data, how do we use it to test the input of our function? A straightforward approach is to add schema.validate(input) inside a function.

However, this approach makes it difficult for us to test our function. Since get_total_price takes both fruits and schema as arguments, we need to include both of them inside the test:

test_get_total_price tests both the data and the function. Because a unit test should only test one thing, including data validation inside a function is not ideal.

Pandera provides a solution for this with the check_input decorator. The schema passed to this decorator is used to validate the input of the function.

If the input is not valid, Pandera will raise an error before the input is processed by your function:

SchemaError: error in check_input decorator of function 'get_total_price': expected series 'price' to have type int64, got object

Validating data before processing is useful because invalid input is rejected early, before any time is spent processing it.

We can also use Pandera’s check_output decorator to check the output of a function:

Now you might wonder, is there a way to check both inputs and outputs? We can do that using the decorator check_io:

By default, Pandera will raise an error if there are null values in a column we are testing. If null values are acceptable, pass nullable=True to Column:

By default, duplicates are acceptable. To raise an error when there are duplicates, use allow_duplicates=False (newer Pandera releases spell this option unique=True):

SchemaError: series 'store' contains duplicate values: {2: 'Walmart'}

coerce=True changes the data type of a column. If coercion is not possible, Pandera raises an error.

In the code below, the data type of price is changed from integer to string.

name     object
store    object
price    object
dtype: object

What if we want to validate all columns that start with the word store?

Pandera allows us to apply the same checks on multiple columns that share a certain pattern by adding regex=True:

Using a YAML file is a neat way to show your tests to colleagues who don’t know Python. We can keep a record of all validations in a YAML file using schema.to_yaml():

The schema.yml should look like the below:

To load from a YAML file, simply use pa.io.from_yaml(yaml_schema):

Congratulations! You have just learned how to use Pandera to validate your dataset. Since data is an important aspect of a data science project, validating the inputs and outputs of your functions will reduce the errors down the pipeline.

Feel free to play and fork the source code of this article here:
