Cracking SQL interview questions with useful procedures
In this article, we will go over a SQL question from an Airbnb data science interview. I hope the procedure explained in this article would help you become more effective in writing SQL queries.
Host Popularity Rental PricesYou’re given a table of rental property searches by users. The table consists of search results and outputs host information for searchers. Find the minimum, average, maximum rental prices for each host’s popularity rating. The host’s popularity rating is defined as below:
0 reviews: New
1 to 5 reviews: Rising
6 to 15 reviews: Trending Up
16 to 40 reviews: Popular
more than 40 reviews: HotSource: stratascratch.com
Step 1: Before we start writing queries, we need to examine the raw data and raise questions to better understand the relevant information. This is not a college exam, we’re allowed to ask the interviewer or data producer any reasonable questions and state assumptions. For example,
- Does the dataset contain duplicate properties? How can we uniquely identify a property? Can we use the field, “ID” to uniquely identify a property? For this exercise, ID represents the search ID, which is not relevant to property ID, therefore, we can ignore this field for this exercise. Yes, there are duplicate properties in the dataset because users are likely looking at the same properties from different searches and all the search outputs including duplicates would be recorded in the raw data. We would need to bring up this issue of duplicates with the interviewer because including duplicates would produce a skewed result. Therefore, we need to find a way to identify a property by ourselves (which we will talk more about in the next step).
- The question seems to suggest the host’s popularity is defined based on the number of reviews. Does the raw data include both the historical and most recent number of reviews for a given property? If so, we need to clarify how the most recent number of reviews can be identified. For example, is “Date of Search” available? But for this exercise, the dataset only contains the most recent number of reviews for a property, therefore, we don’t need to be worried about finding the most recent number of reviews.
Step 2: Not all the fields in the raw data would be used. We need to identify relevant variables and any additional variables we would need to create by ourselves to solve the problem. For this exercise, we would need the following variables.
- Property ID: We need to find a way to uniquely identify a property. One common way to concatenate a few characteristics variables, such as, price, room_type, host_since, zipcode, and number_of_reviews. By combining different fields, we should be able to uniquely identify a property.
- Host’s popularity rating: As instructed, property ratings are classified as “New”, “Rising”, “Trending Up”, “Popular”, and “Hot”, based on the number of reviews.
- Price: This variable is readily available in the raw data. Based on the raw price information, we can compute its minimum, average, and maximum for a given popularity rating category.
Step 3: We need to prepare the data and get it ready for analysis. In this step, we would focus on keeping and constructing the variables we’ve identified in step 2.
We can use
CONCAT(price, room_type, host_since, zipcode, number_of_reviews) to create property ID.
Host’s popularity rating requires a little bit of work because it would be assigned to different values (i.e., New”, “Rising”, “Trending Up”, “Popular”, and “Hot”) based on the number of reviews.
CAEE-WHEN-END function is a perfect function to handle multiple conditional statements.
There are duplicates in the raw data, we would need to prepare the data by de-duplicating the observations, otherwise, we would get skewed results when computing the average, min, and max. Adding
DISTINCT would address this issue.
We can run the following query to prepare the data.
SELECT DISTINCT CONCAT(price, room_type, host_since, zipcode, number_of_reviews) AS id, price,CASEWHEN number_of_reviews = 0 THEN 'New'WHEN number_of_reviews >=1 AND number_of_reviews <=5 THEN 'Rising'WHEN number_of_reviews >=6 AND number_of_reviews <=15 THEN 'Trending Up'WHEN number_of_reviews >=16 AND number_of_reviews <=40 THEN 'Popular'WHEN number_of_reviews >40 THEN 'Hot'END AS popularity_ratingFROM airbnb_host_searches
The preliminary dataset we expect to have would be something like the following table. We have property ID that represents a unique property with its rental price and popularity rating.
Step 4: Find the most effective way to write the query. Both subquery and Common Table Expression (CTE) are useful tools to handle complex SQL queries. I found CTE to be more effective and have more advantages over subquery.
CTE is reusable by design. Instead of having to declare the same subquery in every place you need to use it, you can use CTE to define a temporary table once, then refer to it whenever you need it.
CTE is more readable than Subqueries. Since CTE can be reusable, you can write less code using CTE than using subqueries. Also, people tend to follow logic and ideas easier in sequence than in a nested fashion. When you write a query, it is easier to break down a complex query into smaller pieces using CTE.
For this exercise, we can prepare the data in step 3 using
WITH statement and store it in a temporary table called, cte. Then we would use SQL aggregate functions,
MAX() to compute the minimum, average, and maximum rental price for each Popularity Rating Category with
The final solution would look like this:
WITH cte AS(SELECT DISTINCT CONCAT(price, room_type, host_since, zipcode, number_of_reviews) AS id, price,CASEWHEN number_of_reviews = 0 THEN 'New'WHEN number_of_reviews >=1 AND number_of_reviews <=5 THEN 'Rising'WHEN number_of_reviews >=6 AND number_of_reviews <=15 THEN 'Trending Up'WHEN number_of_reviews >=16 AND number_of_reviews <=40 THEN 'Popular'WHEN number_of_reviews >40 THEN 'Hot'END AS popularity_ratingFROM airbnb_host_searches)SELECT popularity_rating,MIN(price) price_min,AVG(price) price_avg,MAX(price) price_maxFROM cteGROUP BY popularity_rating
The query would produce the solution:
Solving SQL questions is similar to detective work. You’re tested to put scattered puzzles together and also need to explain the steps logically and reasonably. The key here is to keep the interviewer or your colleagues engaged in the process so that they can understand your point of view and any assumptions you make.
If you enjoy this article and would like to Buy Me a Coffee, please click here.
You can sign up for a membership to unlock full access to my articles, and have unlimited access to everything on Medium. Please subscribe if you’d like to get an email notification whenever I post a new article.