After years of working with BigQuery, I have seen 5 mistakes that are commonly made, even by experienced data scientists.
Google BigQuery is popular for many reasons. It is incredibly fast, easy to work with, gives you access to the full GCP suite, takes care of your data, and helps you catch mistakes early on. On top of that, you can use standard SQL and some very nice built-in functions. In short, it is almost the full package!
However, as with other web services and programming languages, there are a few things you need to know when working with BigQuery to avoid falling into a trap. Over the years, I have made a lot of mistakes myself and realized that almost everyone I know has run into the same issues at some point. I want to call out a handful of them here, because I discovered them fairly late in my career and still see other very experienced data scientists encountering them.
Therefore, here is my top 5 list of mistakes that almost everyone makes in BigQuery at some point, often without even knowing it. Make sure to avoid them, because each one can have severe consequences, and keep in mind the right attitude when working with data: Always assume bugs and duplicates, always!
It happens so fast. You are in a hurry and want to quickly compare two tables to see whether an item in one of them also exists in the other. Why not go with a NOT IN statement, since it sounds so intuitive?
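The problem is that NOT IN doesn’t work as intended when you have NULL values in your table. If the list or subquery on the right-hand side contains even a single NULL, the condition never evaluates to TRUE, so the query returns no rows, without any error or warning. You will not get the results you desire!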
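To make the mechanism concrete, here is a minimal sketch in BigQuery standard SQL. The table names known_categories and candidate_categories and their values are hypothetical and chosen only for illustration. The first statement shows how a single NULL changes the result of NOT IN from TRUE to NULL. The second shows one common NULL-safe alternative, NOT EXISTS, which is a suggestion on my part and only one of several possible rewrites.

```sql
-- Mechanism: comparing against NULL yields NULL, so a NULL in the list
-- turns the "no match" answer from TRUE into NULL.
SELECT
  "c" NOT IN ("a", "b", "d") AS without_null,                    -- TRUE
  "c" NOT IN ("a", "b", CAST(NULL AS STRING), "d") AS with_null; -- NULL

-- One NULL-safe alternative: NOT EXISTS only asks whether a matching row
-- exists, so a NULL in known_categories cannot poison the result.
-- Both inline tables below are hypothetical illustration data.
WITH
  known_categories AS (
    SELECT category
    FROM UNNEST(["a", "b", CAST(NULL AS STRING), "d"]) AS category
  ),
  candidate_categories AS (
    SELECT category
    FROM UNNEST(["a", "b", "c"]) AS category
  )
SELECT
  c.category
FROM candidate_categories AS c
WHERE NOT EXISTS (
  SELECT 1
  FROM known_categories AS k
  WHERE k.category = c.category
);
-- Returns the single row "c".
```

Because a WHERE clause keeps only rows for which the condition is TRUE, the NULL produced by NOT IN is treated like FALSE, which is why the affected rows simply disappear.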
See for yourself in this code example, in which I simply try to find the categories from input_2 that are not in input_1:
WITH
  input_1 AS (
    SELECT
      category
    FROM (
      SELECT
        ["a", "b", CAST(NULL AS STRING), "d"] AS category),
      UNNEST(category) category ),
  input_2 AS (
    SELECT…