Let’s start with the question which was suggested by a Sports Analyst. From time to time, their analytics team needs to create a points table based on the total matches played between different teams.
And that’s why they ask this question in every data analyst job interview.
This is a classic scenario of converting a table from a long-form (number of rows > number of columns ) to a wide form (number of columns > number of rows). This is also called Data Pivoting, which is an important use-case of CASE WHEN in SQL.
In this scenario, you have a table containing the names of the teams who played the match and the winner. You need to create a points table where you get information about how many matches each team played, how many matches they won, lost and how many matches draw.
Let’s see how to tackle this kind of question —
Here is the input table where each row belongs to a match between two teams and the column winner indicates which team won. The value NULL in the winner column indicates that the match was a draw i.e. none of the team won the match.
Let’s divide this question into the following sub-tasks.
- Find the total number of matches each team won
- Find the total number of matches each team lost
- Find the total number of matches where none of the teams won
- Find the total number of matches each team played
To understand how many matches a team won, you need to understand for each match which team won the match. You can do this by comparing the team_1 and team_2 columns with the winner column.
So, for a specific row when the values in columns team_1 and winner are equal then team_1 is the winner.
You can translate exactly the same logic using CASE..WHEN..THEN
statement in SQL as shown below.
SELECT team_1
, team_2
, winner
, CASE WHEN team_1 = winner THEN 1 ELSE 0 END as win_flag
, CASE WHEN winner IS NULL THEN 1 ELSE 0 END as draw_flag
FROM analyticswithsuraj.teams
As shown in the above query, you’ll create an additional column win_flag. When a team is a winner, you’ll assign the value 1 to this column. Similarly, if the winner column is NULL, then you’ll assign the value 1 to the draw_flag column.
So the above query will create the following output for all the teams in the column team_1.
Similarly, when the values in the columns team_2 and winner are equal then team_2 is the winner. So you can write exactly the similar query for all the teams in team_2
SELECT team_1
, team_2
, winner
, CASE WHEN team_2 = winner THEN 1 ELSE 0 END as win_flag
, CASE WHEN winner IS NULL THEN 1 ELSE 0 END as draw_flag
FROM analyticswithsuraj.teams
where you’ll get the following output for the values in team_2
Well, the above two queries are just for your understanding. In reality, you can create a single CTE for the win_flag and draw_flag of each team in the columns team_1 and team_2 as shown below.
WITH win_draw_flag AS
(
SELECT team_1 as team
, CASE WHEN team_1 = winner THEN 1 ELSE 0 END as win_flag
, CASE WHEN winner IS NULL THEN 1 ELSE 0 END as draw_flag
FROM analyticswithsuraj.teams
UNION ALL
SELECT team_2 as team
, CASE WHEN team_2 = winner THEN 1 ELSE 0 END as win_flag
, CASE WHEN winner IS NULL THEN 1 ELSE 0 END as draw_flag
FROM analyticswithsuraj.teams
)
This will create a CTE like this — I’ve shown this only for your understanding.
Remember you are still in the long form of the table and you now have information on whether or not each team wins the match.
Next, you need to simply aggregate the columns to get the total number of matches each team played, won, and lost. You can do it as simply as the following query.
SELECT team
, COUNT(*) AS matches_played
, SUM(win_flag) AS matches_won
, COUNT(*) - SUM(win_flag) - SUM(draw_flag) AS matches_lost
, SUM(draw_flag) AS matches_draw
FROM win_draw_flag
GROUP BY team
ORDER BY team
Where COUNT(*)
gives you the total number of times each team occurred in the CTE win_draw_flag and subtracting matches won & draw from it will give you the total number of matches each team lost.
Without creating a CTE separately, you can also write the query like the following and pass the entire CASE..WHEN query as a sub-query.
SELECT team
, COUNT(*) AS matches_played
, SUM(win_flag) AS matches_won
, COUNT(*) - SUM(win_flag) - SUM(draw_flag) AS matches_lost
, SUM(draw_flag) AS matches_draw
FROM
(
SELECT team_1 as team
, CASE WHEN team_1 = winner THEN 1 ELSE 0 END as win_flag
, CASE WHEN winner IS NULL THEN 1 ELSE 0 END as draw_flag
FROM analyticswithsuraj.teams
UNION ALL
SELECT team_2 as team
, CASE WHEN team_2 = winner THEN 1 ELSE 0 END as win_flag
, CASE WHEN winner IS NULL THEN 1 ELSE 0 END as draw_flag
FROM analyticswithsuraj.teams
) AS win_draw_flag
GROUP BY team
ORDER BY team
It will also result in exactly the same output as mentioned above.
Well, there can be multiple ways to solve this question — I found this approach easier. If you find any other solution to this question, feel free to mention it in the comments.
This is one of the classic examples of comparing dates and then implementing If..Else
logic using CASE..WHEN
statement. You can encounter this type of problem in any company that deals with customers.
The scenario is — You have an eCommerce website where every day customers visit and purchase products. Your task is to identify on each day how many customers were new and how many customers were repeated.
Here is an input table — orders — where you can see the customers with customer_id ABC101, BCD201, and ABD101 visited the website on multiple days and purchased different products.
Let’s break down the question into the following sub-tasks —
- Find the first time i.e. the first date when the customer visited the website
- Compare the first date with the order date to decide whether the customer is a repeated or first-time visitor
You can easily solve the first sub-task by using GROUP BY to group all the records by customer_id and find the minimum of the order_date, as shown below.
SELECT customer_id
, MIN(order_date) as first_order_date
FROM analyticswithsuraj.orders
GROUP BY customer_id
This was simple!
Next, to compare the first_order_date with each order_date, first, you need to bring both columns in a single table.
You can easily do that using JOIN
on customer_id as shown below. Here you can create a CTE using the above query so that you will get a temporary table to join with the input table.
WITH first_orders AS
(
SELECT customer_id
, MIN(order_date) as first_order_date
FROM analyticswithsuraj.orders
GROUP BY customer_id
)SELECT t1.*
, t2.first_order_date
FROM analyticswithsuraj.orders AS t1
INNER JOIN first_orders AS t2 ON t1.customer_id = t2.customer_id
Now, as you got both the columns in a single table, you can compare order_date with the first_order_date and implement the following If..Else
logic.
- If first_order_date & order_date are the same then the customer is the new customer
- If first_order_date & order_date are different then the customer is repeated customer
So ideally, you need to create two columns to implement the above two If..Else
statements using CASE WHEN in SQL.
You don’t need to create any separate table, rather you can add two more columns in the above query where you joined two tables. Here is how it can be done.
WITH first_orders AS
(
SELECT customer_id
, MIN(order_date) as first_order_date
FROM analyticswithsuraj.orders
GROUP BY customer_id
)SELECT t1.*
, t2.first_order_date
, CASE WHEN t1.order_date=t2.first_order_date THEN 1 ELSE 0 END AS new_customer_flag
, CASE WHEN t1.order_date!=t2.first_order_date THEN 1 ELSE 0 END AS repeat_customer_flag
FROM analyticswithsuraj.orders AS t1
INNER JOIN first_orders AS t2 ON t1.customer_id = t2.customer_id
As a result, the column new_customer_flag will be 1 when the columns first_order_date and order_date are equal. Similarly, the column repeat_customer_flag will be 1 when the columns first_order_date and order_date are different.
Now the last step is only to group all the records by order date and sum up the columns new_customer_flag and repeat_customer_flag.
To do this, you’ll need the above table which you can achieve by creating another CTE as shown below.
WITH first_orders AS
(
SELECT customer_id
, MIN(order_date) as first_order_date
FROM analyticswithsuraj.orders
GROUP BY customer_id
),customers AS
(
SELECT t1.*
, t2.first_order_date
, CASE WHEN t1.order_date=t2.first_order_date THEN 1 ELSE 0 END AS new_customer_flag
, CASE WHEN t1.order_date!=t2.first_order_date THEN 1 ELSE 0 END AS repeat_customer_flag
FROM analyticswithsuraj.orders AS t1
INNER JOIN first_orders AS t2 ON t1.customer_id = t2.customer_id
)
SELECT order_date
, SUM(new_customer_flag) AS number_of_new_customers
, SUM(repeat_customer_flag) AS number_of_repeat_customers
FROM customers
GROUP BY order_date
ORDER BY order_date
This is how you’ll get the required output. You can cross-check the results by comparing them with the input table.
Again, you can have a different approach to solving this question — this is the simplest approach I found. Don’t forget to mention your approach in the comments below.