Machine Learning News Hubb
Advertisement Banner
  • Home
  • Machine Learning
  • Artificial Intelligence
  • Big Data
  • Deep Learning
  • Edge AI
  • Neural Network
  • Contact Us
  • Home
  • Machine Learning
  • Artificial Intelligence
  • Big Data
  • Deep Learning
  • Edge AI
  • Neural Network
  • Contact Us
Machine Learning News Hubb
No Result
View All Result
Home Artificial Intelligence

A Practical SQL Question for Data Science Interviews | by Aaron Zhu | Sep, 2022

admin by admin
September 8, 2022
in Artificial Intelligence


Cracking SQL interview questions with useful procedures

Photo by Andrea Davis on Unsplash

In this article, we will go over a SQL question from an Airbnb data science interview. I hope the procedure explained in this article would help you become more effective in writing SQL queries.

SQL Question:

Host Popularity Rental PricesYou’re given a table of rental property searches by users. The table consists of search results and outputs host information for searchers. Find the minimum, average, maximum rental prices for each host’s popularity rating. The host’s popularity rating is defined as below: 
0 reviews: New
1 to 5 reviews: Rising
6 to 15 reviews: Trending Up
16 to 40 reviews: Popular
more than 40 reviews: Hot
Source: stratascratch.com

Table: airbnb_host_searches

Image by Author

Step 1: Before we start writing queries, we need to examine the raw data and raise questions to better understand the relevant information. This is not a college exam, we’re allowed to ask the interviewer or data producer any reasonable questions and state assumptions. For example,

  • Does the dataset contain duplicate properties? How can we uniquely identify a property? Can we use the field, “ID” to uniquely identify a property? For this exercise, ID represents the search ID, which is not relevant to property ID, therefore, we can ignore this field for this exercise. Yes, there are duplicate properties in the dataset because users are likely looking at the same properties from different searches and all the search outputs including duplicates would be recorded in the raw data. We would need to bring up this issue of duplicates with the interviewer because including duplicates would produce a skewed result. Therefore, we need to find a way to identify a property by ourselves (which we will talk more about in the next step).
  • The question seems to suggest the host’s popularity is defined based on the number of reviews. Does the raw data include both the historical and most recent number of reviews for a given property? If so, we need to clarify how the most recent number of reviews can be identified. For example, is “Date of Search” available? But for this exercise, the dataset only contains the most recent number of reviews for a property, therefore, we don’t need to be worried about finding the most recent number of reviews.

Step 2: Not all the fields in the raw data would be used. We need to identify relevant variables and any additional variables we would need to create by ourselves to solve the problem. For this exercise, we would need the following variables.

  • Property ID: We need to find a way to uniquely identify a property. One common way to concatenate a few characteristics variables, such as, price, room_type, host_since, zipcode, and number_of_reviews. By combining different fields, we should be able to uniquely identify a property.
  • Host’s popularity rating: As instructed, property ratings are classified as “New”, “Rising”, “Trending Up”, “Popular”, and “Hot”, based on the number of reviews.
  • Price: This variable is readily available in the raw data. Based on the raw price information, we can compute its minimum, average, and maximum for a given popularity rating category.

Step 3: We need to prepare the data and get it ready for analysis. In this step, we would focus on keeping and constructing the variables we’ve identified in step 2.

We can use CONCAT(price, room_type, host_since, zipcode, number_of_reviews) to create property ID.

Host’s popularity rating requires a little bit of work because it would be assigned to different values (i.e., New”, “Rising”, “Trending Up”, “Popular”, and “Hot”) based on the number of reviews. CAEE-WHEN-END function is a perfect function to handle multiple conditional statements.

There are duplicates in the raw data, we would need to prepare the data by de-duplicating the observations, otherwise, we would get skewed results when computing the average, min, and max. Adding DISTINCT would address this issue.

We can run the following query to prepare the data.

SELECT DISTINCT CONCAT(price, room_type, host_since, zipcode, number_of_reviews) AS id, price,CASEWHEN number_of_reviews = 0 THEN 'New'WHEN number_of_reviews >=1  AND number_of_reviews <=5  THEN 'Rising'WHEN number_of_reviews >=6  AND number_of_reviews <=15 THEN 'Trending Up'WHEN number_of_reviews >=16 AND number_of_reviews <=40 THEN 'Popular'WHEN number_of_reviews >40 THEN 'Hot'END AS popularity_ratingFROM airbnb_host_searches

The preliminary dataset we expect to have would be something like the following table. We have property ID that represents a unique property with its rental price and popularity rating.

Image by Author

Step 4: Find the most effective way to write the query. Both subquery and Common Table Expression (CTE) are useful tools to handle complex SQL queries. I found CTE to be more effective and have more advantages over subquery.

CTE is reusable by design. Instead of having to declare the same subquery in every place you need to use it, you can use CTE to define a temporary table once, then refer to it whenever you need it.

CTE is more readable than Subqueries. Since CTE can be reusable, you can write less code using CTE than using subqueries. Also, people tend to follow logic and ideas easier in sequence than in a nested fashion. When you write a query, it is easier to break down a complex query into smaller pieces using CTE.

For this exercise, we can prepare the data in step 3 using WITH statement and store it in a temporary table called, cte. Then we would use SQL aggregate functions, MIN() ,AVG() ,MAX() to compute the minimum, average, and maximum rental price for each Popularity Rating Category with GROUP BY.

The final solution would look like this:

WITH cte AS(SELECT DISTINCT CONCAT(price, room_type, host_since, zipcode, number_of_reviews) AS id, price,CASEWHEN number_of_reviews = 0 THEN 'New'WHEN number_of_reviews >=1  AND number_of_reviews <=5  THEN 'Rising'WHEN number_of_reviews >=6  AND number_of_reviews <=15 THEN 'Trending Up'WHEN number_of_reviews >=16 AND number_of_reviews <=40 THEN 'Popular'WHEN number_of_reviews >40 THEN 'Hot'END AS popularity_ratingFROM airbnb_host_searches)SELECT popularity_rating,MIN(price) price_min,AVG(price) price_avg,MAX(price) price_maxFROM cteGROUP BY popularity_rating

The query would produce the solution:

Image by Author

Solving SQL questions is similar to detective work. You’re tested to put scattered puzzles together and also need to explain the steps logically and reasonably. The key here is to keep the interviewer or your colleagues engaged in the process so that they can understand your point of view and any assumptions you make.

If you enjoy this article and would like to Buy Me a Coffee, please click here.

You can sign up for a membership to unlock full access to my articles, and have unlimited access to everything on Medium. Please subscribe if you’d like to get an email notification whenever I post a new article.



Source link

Previous Post

Learning Docker, the Easy Way. A tutorial which covers the basics of… | by Percy Bolmér | Sep, 2022

Next Post

A turn towards textual practice and GAN-informed design. | by Egmontas Geras | Sep, 2022

Next Post

A turn towards textual practice and GAN-informed design. | by Egmontas Geras | Sep, 2022

ML goes after chemistry and material sciences -highlights of a review of interest to everybody using and developing models | by LucianoSphere | Sep, 2022

Sigmoid and SoftMax Functions in 5 minutes | by Gabriel Furnieles | Sep, 2022

Related Post

Artificial Intelligence

Dates and Subqueries in SQL. Working with dates in SQL | by Michael Grogan | Jan, 2023

by admin
January 27, 2023
Machine Learning

ChatGPT Is Here To Stay For A Long Time | by Jack Martin | Jan, 2023

by admin
January 27, 2023
Machine Learning

5 steps to organize digital files effectively

by admin
January 27, 2023
Artificial Intelligence

Explain text classification model predictions using Amazon SageMaker Clarify

by admin
January 27, 2023
Artificial Intelligence

Human Resource Management Challenges and The Role of Artificial Intelligence in 2023 | by Ghulam Mustafa Shoaib | Jan, 2023

by admin
January 27, 2023
Deep Learning

Training Neural Nets: a Hacker’s Perspective

by admin
January 27, 2023

© 2023 Machine Learning News Hubb All rights reserved.

Use of these names, logos, and brands does not imply endorsement unless specified. By using this site, you agree to the Privacy Policy and Terms & Conditions.

Navigate Site

  • Home
  • Machine Learning
  • Artificial Intelligence
  • Big Data
  • Deep Learning
  • Edge AI
  • Neural Network
  • Contact Us

Newsletter Sign Up.

No Result
View All Result
  • Home
  • Machine Learning
  • Artificial Intelligence
  • Big Data
  • Deep Learning
  • Edge AI
  • Neural Network
  • Contact Us

© 2023 JNews - Premium WordPress news & magazine theme by Jegtheme.