
5 dbt Data Model Hacks to Save You Precious Time | by Madison Schott | Sep, 2022

by admin
September 8, 2022
in Artificial Intelligence


The best open-source and paid tools to add to your project

Photo by Artem Maltsev on Unsplash

Now, I love writing data models, don’t get me wrong. It’s one of the most fun parts of being an analytics engineer. However, there are a lot of tedious parts to writing them. It’s not just about the business logic that you are coding using SQL. You also need to clean the data before building your models, document it, validate it, and then implement data quality checks.

With one data model, this may not seem like much, but when you have tons of different data sources and keep building more models, it ends up eating most of your time, especially if, like most data teams, you wait until everything is near-perfect to do all this nitty-gritty work.

Luckily, dbt makes data modeling a lot easier. And with dbt come lots of helpful tools that integrate directly with it to save you a few headaches. If you’re not familiar with dbt (or data build tool), it is a data transformation tool that allows you to write modular code and follow modeling best practices. You can read more about it here.
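If you haven’t used dbt before, a model is just a SQL SELECT statement saved as a file; dbt handles materializing it and wiring up dependencies between models. Here’s a minimal sketch, with made-up model and column names purely for illustration:

-- models/users_cleaned.sql (hypothetical file name)
select
    user_id,
    email,
    created_at
from {{ ref('raw_users') }} -- ref() points at another model so dbt can build the dependency graph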

Here are some tools, both open-source (aka free) and paid, that I’ve found save me a lot of time and energy. Not everything has to be built from scratch by you! There are a ton of smart people out there who have probably already done it for you. Take advantage of that.

re_data

re_data is one of my favorite free, open-source dbt packages. It is super easy to install and integrate into your existing dbt project. While it has a lot of awesome features such as anomaly detection and alerting, it also has pre-built dbt macros to help you filter and clean your data.
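As a quick aside on installation: dbt packages like re_data (and dbt_expectations, covered below) are added through a packages.yml file in your project and pulled in with dbt deps. The package coordinates and versions below are what the dbt package hub lists, but double-check them there before copying; the versions in particular are placeholders:

# packages.yml
packages:
  - package: re-data/re_data
    version: 0.10.0 # placeholder; use the latest compatible release
  - package: calogica/dbt_expectations
    version: 0.8.0 # placeholder; use the latest compatible release

Then run dbt deps to download them into your project.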

Duplicates

They have two different filters for dealing with duplicate values, filter_remove_duplicates and filter_get_duplicates. You simply use these like any other dbt macro and select from them. They each take three arguments: the model name, the unique columns, and the column you wish to sort the records by. filter_remove_duplicates returns the model with no duplicates and filter_get_duplicates returns only the duplicates in the model.

select
    user_id,
    name
from {{ re_data.filter_remove_duplicates(ref('users'), ['user_id'], ['created_at desc']) }} duplicates
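And the flip side, a quick sketch of filter_get_duplicates with the same three arguments, which returns only the duplicated rows so you can inspect them (same illustrative model and columns as above):

select
    user_id,
    name
from {{ re_data.filter_get_duplicates(ref('users'), ['user_id'], ['created_at desc']) }} duplicates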

These both save you time from having to aggregate and group columns in order to find or remove duplicates. They also make for cleaner code.

Formatting

re_data also contains three different data formatting macros that you can use to clean your data: clean_additional_whitespaces, clean_blacklist, and clean_capitalize_words.

clean_additional_whitespaces takes the column name you wish to clean as an argument and removes any extra whitespace.

select
    user_id,
    {{ re_data.clean_additional_whitespaces('name') }} as name
from {{ ref('users') }}

clean_capitalize_words also takes in a column name and capitalizes the first letter of each word in the string. This is a great one to use on users’ names or even product names that are displayed on your website.

Lastly, clean_blacklist is used to mask personal information. This is a great replacement for Snowflake’s masking policies if you don’t have those available to you. Here, you specify the column you wish to mask, a list of words or a regex expression to look for, and then a string to replace those occurrences.
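Following the same pattern, here’s a sketch of how the other two macros might be applied in a model. The column names and regex are illustrative, and the argument order for clean_blacklist follows the description above (column, list of words or a regex, replacement string), so verify it against the re_data docs before relying on it:

select
    user_id,
    {{ re_data.clean_capitalize_words('name') }} as name,
    -- mask anything that looks like an email address (regex is illustrative)
    {{ re_data.clean_blacklist('contact_email', ['\S+@\S+'], '*****') }} as contact_email
from {{ ref('users') }}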

dbt_expectations

dbt_expectations is another excellent open-source dbt package to utilize in your dbt project. It includes macros that test the output of your data against your expectations, hence the name. It is essentially more advanced testing to layer on top of the generic tests that already come with dbt.

These macros often cover specific use cases that you might otherwise try to write tests for on your own as you come across them. Rather than spending hours writing these macros yourself, you can browse the library to find one that meets your need. Here are a few of my favorites to use.

expect_column_values_to_be_of_type

This test allows you to check the data types of your columns. This comes in handy when you want to ensure all of your timestamp columns are of the same type, your dates are actually dates, and any columns you want to use in calculations are integers or floats.

I utilize this test the most on messy data sources such as Google Sheets, which are prone to a lot of human error. This way, I’m notified at the source rather than having my model fail downstream. Here’s what it looks like in my src.yml file:

columns:
  - name: campaign_launch_date
    description: "The date the campaign was launched"
    tests:
      - dbt_expectations.expect_column_values_to_be_of_type:
          column_type: date

expect_column_values_to_be_between

This test is great for columns with ids that refer to a mapping table. Oftentimes, we have too many values to include them all in an accepted_values test within dbt. This acts as a great shortcut to prevent your src.yml from becoming too overwhelming to read with its extensive tests.

I use this test often in state_id columns. I don’t want to write out numbers 1–50, so I instead specify a range like so:

columns:
  - name: state_id
    description: "The unique identifier of a state"
    tests:
      - dbt_expectations.expect_column_values_to_be_between:
          min_value: 1 # optional parameter
          max_value: 50 # optional parameter
          row_condition: "id is not null" # optional parameter

This test also has the option for a row condition in case you want to exclude null values or any other types of values.

Datafold

Datafold is a powerful tool that shows you the results of your code changes before you actually make them. Using GitHub, it runs your previous code as well as the changes in your pull request, and then compares the resulting datasets. This lets you see how your data is going to change before you merge your code.

This is super helpful for validating that your code is correct and the data looks as expected before pushing to production. We all know how much of a nightmare it is when something breaks production. You need to track down the root of the issue, revert to an older version of your code, and redeploy it. Why worry about doing this when you can use a tool like Datafold to prevent it in the first place?

To implement Datafold, you need three things: a GitHub repo with two branches, a data warehouse of your choice, and some dbt models. These all integrate easily with the tool. You can learn more about the specifics of the set-up process here.

Image by author

Here you can also see key metrics such as mismatched columns, distinct primary keys, null primary keys, and the total number of differing values, all metrics we would otherwise have to check manually by running queries.

Datafold saves a lot of our time as analytics engineers because it also tells us the exact rows in our datasets that don’t match up.

Image by author

And it shows this information for every downstream table that would be affected by your code change. If the differing values throughout your tables are expected, you can merge your code without worrying. No manual validation or pile of aggregation queries required. Datafold displays key data quality metrics in a way that’s easy to see and dig into.

If you’re looking for an open-source option, they also have a free command-line tool and Python library called data-diff. It compares rows of data in two different data sources using their primary keys, showing you which rows are present in one table but not the other and which row values do not match. Although you can’t do this until after a data replication occurs, it is still a tool that has saved me lots of time with validation in certain use cases.
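As a rough sketch of how data-diff is invoked from the command line: you give it two connection strings and the table to compare on each side, keyed on the primary key. The connection strings, table names, and key-column flag below are placeholders and may differ between releases, so check the data-diff README for the exact syntax:

data-diff \
  "postgresql://user:password@prod-host/analytics" users \
  "snowflake://user:password@account/analytics/public?warehouse=wh" users \
  -k user_id   # primary key column used to match rows between the two tables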

Castor

Castor is an automated data catalog tool that brings transparency to all parts of your data stack through documentation and collaboration. It tells users in your organization where certain data sources are located, their quality status, and who to reach out to when they have questions. It integrates with your dbt documentation, populating everything you’ve already documented directly into the tool.

One of my favorite features is the automated population of already-defined columns. I can’t tell you how many hours I’ve spent copying and pasting the same column definitions from one src.yml file to another. Castor takes those definitions and fills them in for every instance of the same column name, removing the repetitive, manual work. Yes, you really don’t have to define user_id 50 different times across multiple models!

column lineage from dbt to Metabase (image by author)

Castor also offers column-level lineage for your dbt data models across your entire modern data stack. This means you can see the data source each column is being pulled from all the way from your data warehouse to your dashboard. This allows you to better handle dependencies between data models and see the potential downstream impacts of small code changes. I’ve personally changed column names in my base models, thinking it wouldn’t impact anything, only to find out the next day that it broke an entire production model. With column-level lineage charts, you don’t have to worry about this being an issue.

dbt freshness tests

Lastly, dbt has built-in freshness checks for your source data, which in turn protect the data models built on top of it. Freshness tests can be helpful in alerting you when a dataset hasn’t been updated as expected. Catching this at the source helps prevent downstream issues such as stale data models. Being proactive about these types of data quality checks will save you time in debugging and fixing broken pipelines.

To add a freshness test to your dbt project, you add a freshness block under the source (or one of its tables) in the src.yml file, along with a loaded_at_field telling dbt which timestamp column to check. You then include two fields: warn_after and error_after. warn_after is the period a source can go without new data before dbt gives you a warning; this won’t fail your tests. error_after is the period a source can go without new data before dbt raises an error; this will fail your tests.

sources:
  - name: users
    loaded_at_field: updated_at # whatever timestamp column tracks new data
    freshness:
      warn_after:
        count: 3
        period: day
      error_after:
        count: 5
        period: day

Keep in mind that you can choose minute, hour, or day for your time period. Personally, my data pipeline typically only runs once per day, so the day period makes the most sense for me.
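Once that block is in place, the checks are run with dbt’s built-in source freshness command, for example as a scheduled job before your models build:

dbt source freshness   # warns after 3 days without new data, errors after 5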

dbt is a powerful tool that makes data transformation seamless. Better yet, it integrates with other powerful tools that can help save you time on tasks like data cleaning, validating code changes, documenting data sources, and debugging your data pipeline. Together, this suite of tools will help make you an efficient and thorough analytics engineer who writes quality code.

For more on analytics engineering and modern data stack tools, subscribe to my weekly newsletter.


