Make Sure Your Data Matches Your Expectations
In a data science project, it is important not only to test your functions but also to test your data to make sure it works as you expect.
Even though Great Expectations provides a lot of useful utilities, creating a validation suite with it can be complicated. For a small data science project, Great Expectations can be overkill.
That is why in this article we will learn about Pandera, a simple Python library for validating a pandas DataFrame.
To install Pandera, type:
```bash
pip install pandera
```
To learn how Pandera works, let’s start with creating a simple dataset:
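Here is a minimal sketch of such a dataset (the exact values are assumptions, but the columns `name`, `store`, and `price` match the checks used in the rest of the article):

```python
import pandas as pd

fruits = pd.DataFrame(
    {
        "name": ["apple", "banana", "apple", "orange"],
        "store": ["Aldi", "Walmart", "Walmart", "Aldi"],
        "price": [2, 1, 3, 4],
    }
)
```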

Imagine this scenario: your manager told you that the dataset may only contain certain fruits and stores, and that every price must be less than 4.
Checking these conditions manually can cost too much time, especially when the dataset is large. Is there a way to automate this process?
That is when Pandera comes in handy. Specifically, we:
- Create multiple tests for the entire dataset using `DataFrameSchema`
- Create multiple tests for each column using `Column`
- Specify the type of test using `Check`
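Putting these together, a schema for our dataset could look like the sketch below (the list of allowed fruits is an assumption):

```python
import pandera as pa
from pandera import Check, Column

available_fruits = ["apple", "banana", "orange"]

schema = pa.DataFrameSchema(
    {
        # "name" must be a string and one of the allowed fruits
        "name": Column(str, Check.isin(available_fruits)),
        # "store" must be a string
        "store": Column(str),
        # "price" must be an integer less than 4
        "price": Column(int, Check.less_than(4)),
    }
)

schema.validate(fruits)
```

Since the price in the last row is 4, which is not less than 4, the validation fails: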
```
SchemaError: failed element-wise validator 0:
failure cases:
   index  failure_case
0      3             4
```
In the code above:
- `"name": Column(str, Check.isin(available_fruits))` checks if the column `name` is of type string and if all values of the column `name` are inside a specified list.
- `"price": Column(int, Check.less_than(4))` checks if all values in the column `price` are of type `int` and are less than 4.
- Since not all values in the column `price` are less than 4, the test fails.
Find other built-in `Check` methods here.
We can also create custom checks using `lambda`. In the code below, `Check(lambda price: sum(price) < 20)` checks if the sum of the column `price` is less than 20.
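A sketch of such a schema, reusing the dataset from above:

```python
schema = pa.DataFrameSchema(
    {
        "name": Column(str, Check.isin(available_fruits)),
        "store": Column(str),
        # Custom check: the sum of all prices must be under 20
        "price": Column(int, Check(lambda price: sum(price) < 20)),
    }
)

schema.validate(fruits)  # passes: 2 + 1 + 3 + 4 = 10 < 20
```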
When our tests are complicated, using a dataclass can make them look much cleaner than using a dictionary. Luckily, Pandera also allows us to create tests using a dataclass instead of a dictionary.
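A minimal sketch of the same checks with the class-based API (named `SchemaModel` in older Pandera releases and `DataFrameModel` in newer ones; the class name `FruitSchema` is an assumption):

```python
from pandera.typing import Series

class FruitSchema(pa.SchemaModel):
    # Each attribute describes one column; pa.Field holds the checks
    name: Series[str] = pa.Field(isin=available_fruits)
    store: Series[str]
    price: Series[int] = pa.Field(lt=4)

FruitSchema.validate(fruits)
```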
Now that we know how to create tests for our data, how do we use them to test the input of a function? A straightforward approach is to call `schema.validate(input)` inside the function.
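For example (the body of `get_total_price` is an assumption; only its name comes from this article):

```python
def get_total_price(fruits: pd.DataFrame, schema: pa.DataFrameSchema) -> int:
    # Validate the input before doing any processing
    validated = schema.validate(fruits)
    return validated["price"].sum()
```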
However, this approach makes the function difficult to test. Since `get_total_price` takes both `fruits` and `schema` as arguments, we need to include both of them in the test:
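A sketch of what such a test could look like (the data and the expected total are assumptions):

```python
def test_get_total_price():
    fruits = pd.DataFrame(
        {"name": ["apple", "banana"], "store": ["Aldi", "Walmart"], "price": [1, 2]}
    )
    schema = pa.DataFrameSchema({"price": Column(int, Check.less_than(4))})
    assert get_total_price(fruits, schema) == 3
```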
`test_get_total_price` tests both the data and the function. Because a unit test should test only one thing, including data validation inside the function is not ideal.
Pandera provides a solution for this with the `check_input` decorator. The argument of this decorator is used to validate the input of the function.
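A sketch of how this might look (the schema here is a simplified assumption; the invalid input stores prices as strings on purpose):

```python
schema = pa.DataFrameSchema({"price": Column(int, Check.less_than(5))})

fruits_invalid = fruits.copy()
fruits_invalid["price"] = fruits_invalid["price"].astype(str)  # wrong dtype on purpose

@pa.check_input(schema)
def get_total_price(fruits: pd.DataFrame) -> int:
    return fruits["price"].sum()

get_total_price(fruits_invalid)
```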
If the input is not valid, Pandera will raise an error before the input is processed by your function:
```
SchemaError: error in check_input decorator of function 'get_total_price': expected series 'price' to have type int64, got object
```
Validating data before processing is very useful since it prevents us from wasting a significant amount of time processing invalid data.
We can also use Pandera’s `check_output` decorator to check the output of a function:
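For example (a minimal sketch; `double_price` and its output schema are assumptions):

```python
out_schema = pa.DataFrameSchema({"price": Column(int, Check.less_than(10))})

@pa.check_output(out_schema)
def double_price(fruits: pd.DataFrame) -> pd.DataFrame:
    new_fruits = fruits.copy()
    # The decorator validates the DataFrame returned by this function
    new_fruits["price"] = new_fruits["price"] * 2
    return new_fruits

double_price(fruits)
```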
Now you might wonder: is there a way to check both inputs and outputs? We can do that using the `check_io` decorator:
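A sketch under the same assumptions as above:

```python
in_schema = pa.DataFrameSchema({"price": Column(int, Check.less_than(5))})
out_schema = pa.DataFrameSchema({"price": Column(int, Check.less_than(10))})

# Keyword arguments map schemas to the function arguments of the same name;
# "out" validates the return value
@pa.check_io(fruits=in_schema, out=out_schema)
def double_price(fruits: pd.DataFrame) -> pd.DataFrame:
    new_fruits = fruits.copy()
    new_fruits["price"] = new_fruits["price"] * 2
    return new_fruits
```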
By default, Pandera will raise an error if there are null values in a column we are testing. If null values are acceptable, add `nullable=True` to our `Column` class:
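For example:

```python
schema = pa.DataFrameSchema(
    {
        # Null values in "name" no longer raise an error
        "name": Column(str, Check.isin(available_fruits), nullable=True),
        "store": Column(str),
        "price": Column(int),
    }
)
```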
By default, duplicates are acceptable. To raise an error when there are duplicates, use `allow_duplicates=False`:
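A sketch with a small DataFrame in which "Walmart" appears twice (the data is an assumption; note that newer Pandera releases use `unique=True` instead):

```python
stores = pd.DataFrame({"store": ["Aldi", "Walmart", "Walmart"]})

schema = pa.DataFrameSchema({"store": Column(str, allow_duplicates=False)})
schema.validate(stores)
```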
```
SchemaError: series 'store' contains duplicate values: {2: 'Walmart'}
```
`coerce=True` changes the data type of a column. If coercion is not possible, Pandera raises an error. In the code below, the data type of `price` is changed from integer to string:
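For example:

```python
schema = pa.DataFrameSchema({"price": Column(str, coerce=True)})

validated = schema.validate(fruits)
print(validated.dtypes)
```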
```
name     object
store    object
price    object
dtype: object
```
What if we want to check all columns that start with the word `store`? Pandera allows us to apply the same checks to multiple columns that share a certain pattern by adding `regex=True`:
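A sketch (the DataFrame with several `store` columns is an assumption):

```python
favorite_stores = pd.DataFrame(
    {
        "store_1": ["Aldi", "Walmart"],
        "store_2": ["Walmart", "Aldi"],
    }
)

schema = pa.DataFrameSchema(
    {
        # The key is treated as a regular expression, so every column
        # whose name starts with "store" gets the same checks
        "store.*": Column(str, regex=True),
    }
)

schema.validate(favorite_stores)
```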
Using a YAML file is a neat way to show your tests to colleagues who don’t know Python. We can keep a record of all validations in a YAML file using `schema.to_yaml()`:
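For example (writing the schema to a file named `schema.yml` is an assumption):

```python
yaml_schema = schema.to_yaml()

with open("schema.yml", "w") as file:
    file.write(yaml_schema)
```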
The `schema.yml` file should look like the below:
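An abbreviated sketch; the exact fields vary across Pandera versions:

```yaml
schema_type: dataframe
version: 0.6.5  # your installed Pandera version
columns:
  name:
    dtype: str
    nullable: false
    checks:
      isin:
      - apple
      - banana
      - orange
  store:
    dtype: str
    nullable: false
  price:
    dtype: int64
    nullable: false
    checks:
      less_than: 4
```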
To load from a YAML file, simply use `pa.io.from_yaml(yaml_schema)`:
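For example:

```python
with open("schema.yml") as file:
    yaml_schema = file.read()

schema = pa.io.from_yaml(yaml_schema)
```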
Congratulations! You have just learned how to use Pandera to validate your dataset. Since data is an important aspect of a data science project, validating the inputs and outputs of your functions will reduce errors further down the pipeline.
Feel free to play with and fork the source code of this article here:
Follow me for more