The best open-source and paid tools to add to your project
Now, I love writing data models, don’t get me wrong. It’s one of the most fun parts of being an analytics engineer. However, there are a lot of tedious parts to writing them. It’s not just the business logic you’re coding in SQL: you also need to clean the data before building your models, document it, validate it, and then implement data quality checks.
With one data model, this may not seem like a lot, but when you have tons of different data sources and keep building more data models, it ends up taking most of your time. That’s especially true if, like most data teams, you wait until everything is near-perfect to implement all this nitty-gritty work.
Luckily, dbt makes data modeling a lot easier. And with dbt comes a whole ecosystem of helpful tools that integrate directly with it to save you a few headaches. If you’re not familiar with dbt (or data build tool), it is a data transformation tool that allows you to write modular code and follow modeling best practices. You can read more about it here.
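To give a sense of what that looks like, a dbt model is just a SQL select statement, and referencing other models with ref() is what makes the code modular. Here is a minimal sketch with hypothetical model and column names:
-- models/marts/active_users.sql (hypothetical model)
-- {{ ref('stg_users') }} points at another dbt model; dbt resolves it to the
-- correct schema and table at compile time, which is what keeps the code modular.
select
    user_id,
    last_login_at
from {{ ref('stg_users') }}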
Here are some tools, both open-source (aka free) and paid, that I’ve found to save me a lot of time and energy. Not everything has to be built from scratch by you! There are a ton of smart people out there who have probably already done it for you. Take advantage of that.
re_data is one of my favorite free, open-source dbt packages. It is super easy to install and integrate into your existing dbt project. While it has a lot of awesome features such as anomaly detection and alerting, it also comes with pre-built dbt macros to help you filter and clean your data.
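Installing it is just a packages.yml entry followed by dbt deps. A minimal sketch, with a placeholder version range (check the dbt package hub for re_data’s current coordinates and latest release):
packages:
  - package: re-data/re_data
    # placeholder range; pin to the latest version listed on hub.getdbt.com
    version: [">=0.10.0", "<0.11.0"]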
Duplicates
re_data has two different filters for dealing with duplicate values: filter_remove_duplicates and filter_get_duplicates. You use these like any other dbt macro and simply select from them. Each takes three arguments: the model, the unique columns, and the column(s) to sort the records by. filter_remove_duplicates returns the model with duplicates removed, while filter_get_duplicates returns only the duplicate rows.
select
    user_id,
    name
from {{ re_data.filter_remove_duplicates(ref('users'), ['user_id'], ['created_at desc']) }} deduped_users
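The companion macro is called the same way. Here’s a quick sketch that returns only the offending rows from the same users model:
select
    user_id,
    name
from {{ re_data.filter_get_duplicates(ref('users'), ['user_id'], ['created_at desc']) }} duplicates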
Both save you from having to aggregate and group columns just to find or remove duplicates, and they make for cleaner code.
Formatting
re_data also contains three data formatting macros that you can use to clean your data: clean_additional_whitespaces, clean_blacklist, and clean_capitalize_words.
clean_additional_whitespaces takes the column name you wish to clean as an argument and removes any extra whitespace.
select
    user_id,
    {{ re_data.clean_additional_whitespaces('name') }} as name
from {{ ref('users') }}
clean_capitalize_words also takes in a column name and capitalizes the first letter of each word in the string. This is a great one to use on users’ names or even product names that are displayed on your website.
Lastly, clean_blacklist is used to mask personal information. This is a great replacement for Snowflake’s masking policies if you don’t have those available to you. Here, you specify the column you wish to mask, a list of words or regular expressions to look for, and then a string to replace those occurrences with.
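Here’s a sketch that puts the last two macros together. The email column, regex pattern, and replacement string are hypothetical, so double-check the exact clean_blacklist argument format against the re_data docs:
select
    user_id,
    -- capitalize the first letter of each word in the name
    {{ re_data.clean_capitalize_words('name') }} as name,
    -- mask anything that looks like an email address (crude hypothetical pattern)
    {{ re_data.clean_blacklist('email', ['.+@.+'], '*****') }} as email
from {{ ref('users') }}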
dbt-expectations is another excellent open-source dbt package to utilize in your dbt project. It includes macros that test the output of your data against your expectations, hence the name. It is essentially more advanced testing layered on top of the generic tests dbt already ships with.
These macros cover specific use cases that you would otherwise end up writing yourself as you come across them. Rather than spending hours writing these macros on your own, you can browse through the library to find one that answers your need. Here are a few of my favorites to use.
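Like re_data, it installs through packages.yml. A minimal sketch, again with a placeholder version range to pin against the latest release on the dbt package hub:
packages:
  - package: calogica/dbt_expectations
    # placeholder range; check hub.getdbt.com for the current release
    version: [">=0.8.0", "<0.9.0"]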
expect_column_values_to_be_of_type
This test allows you to check the data types of your columns. This comes in handy when you want to ensure all of your timestamp columns are of the same type, your dates are actually dates, and any columns you want to use in calculations are integers or floats.
I utilize this test the most on messy data sources such as Google Sheets, which are prone to a lot of human error. This way, I’m notified at the source rather than having a model fail downstream. Here’s what it looks like in my src.yml file:
columns:
  - name: campaign_launch_date
    description: "The date the campaign was launched"
    tests:
      - dbt_expectations.expect_column_values_to_be_of_type:
          column_type: date
expect_column_values_to_be_between
This test is great for id columns that refer to a mapping table. Oftentimes, there are too many values to include them all in an accepted_values test within dbt. This acts as a great shortcut that keeps your src.yml from becoming too overwhelming to read because of its extensive tests.
I use this test often on state_id columns. I don’t want to write out the numbers 1–50, so I instead specify a range like so:
columns:
  - name: state_id
    description: "The unique identifier of a state"
    tests:
      - dbt_expectations.expect_column_values_to_be_between:
          min_value: 1 # optional parameter
          max_value: 50 # optional parameter
          row_condition: "id is not null" # optional parameter
This test also has the option for a row condition in case you want to exclude null values or any other types of values.
Datafold is a powerful tool that shows you the results of your code changes before you actually make them. Using GitHub, it runs your previous code as well as the changes you are making in a pull request, and then compares the resultant datasets. This allows you to see how your data is going to change before you merge your code changes.
This is super helpful in validating that your code is correct and the data looks as expected before actually pushing to production. We all know how much of a nightmare it can be when something breaks production. You need to go back and find the root of the issue, revert to an older version of your code, and redeploy it. Why worry about doing this when you can use a tool like Datafold to help prevent it in the first place?
In order to implement Datafold, you need three things: a GitHub repo with two branches, a data warehouse of your choice, and some dbt models. These all easily integrate with the tool so you can reap the benefits. You can learn more about the specifics of the setup process here.
Within the tool, you can also see key metrics such as mismatched columns, distinct primary keys, null primary keys, and the total number of differing values: all metrics you would otherwise check manually by running queries!
Datafold saves analytics engineers a lot of time because it also tells you the exact rows in your datasets that don’t match up.
And it shows this information for every downstream table that would be affected by your code change. If the differing values throughout your tables are expected, you can merge your code without worrying. No manual validation or pile of aggregation queries required. Datafold displays key data quality metrics in a way that’s easy to see and dig into.
If you’re looking for an open-source option, Datafold also has a free command-line tool and Python library called data-diff. It compares rows of data in two different data sources using their primary keys, showing you which rows are present in one table but not the other and which row values don’t match. Although you can’t run it until after a data replication occurs, it has still saved me lots of time on validation in certain use cases.
Castor is an automated data catalog tool that brings transparency to all parts of your data stack through documentation and collaboration. It tells users in your organization where certain data sources are located, their quality status, and who to reach out to in case they have questions. It integrates with your dbt documentation, populating everything you’ve already documented directly into the tool.
One of my favorite features is the automated population of already-defined data columns. I can’t tell you how many hours I’ve spent copying and pasting the same column definitions from one src.yml file to another. Castor takes those definitions and fills them in for every instance of the same column name, removing the repetitive, manual work. Yes! You really don’t have to define user_id 50 different times across multiple models!
Castor also offers column-level lineage for your dbt data models across your entire modern data stack. This means you can see the data source each column is being pulled from all the way from your data warehouse to your dashboard. This allows you to better handle dependencies between data models and see the potential downstream impacts of small code changes. I’ve personally changed column names in my base models, thinking it wouldn’t impact anything, only to find out the next day that it broke an entire production model. With column-level lineage charts, you don’t have to worry about this being an issue.
Lastly, dbt has a built-in freshness check for your source data. Freshness checks are helpful in alerting you when a dataset hasn’t been updated as expected. Catching this at the source helps prevent downstream issues such as stale data models. Being proactive about these types of data quality checks will save you time in debugging and fixing broken pipelines.
To add a freshness check to your dbt project, you simply add a freshness block under your source (or an individual table) in the src.yml file, along with a loaded_at_field pointing at the timestamp column dbt should use to measure how recently data arrived. The freshness block takes two fields: warn_after and error_after. warn_after refers to the time period a source can go without new data before dbt gives you a warning; this won’t fail the freshness check. error_after refers to the time period a source can go without new data before dbt throws an error; this will fail the check.
sources:
  - name: users
    loaded_at_field: _loaded_at # hypothetical column recording when each row landed in the warehouse
    freshness:
      warn_after:
        count: 3
        period: day
      error_after:
        count: 5
        period: day
Keep in mind that you can choose minute, hour, or day for your time period. Personally, my data pipeline typically only runs once per day, so the day period makes the most sense for me. You can then run dbt source freshness to check whether your sources are within those thresholds.
dbt is a powerful tool that makes data transformation seamless. Better yet, it integrates with other powerful tools that can help save you time on tasks like data cleaning, validating code changes, documenting data sources, and debugging your data pipeline. Together, this suite of tools will help make you an efficient and thorough analytics engineer who writes quality code.
For more on analytics engineering and modern data stack tools, subscribe to my weekly newsletter.