How the use of parentheses when using DISTINCT keyword in SQL could cause confusion
One of the most common misinterpretations that many SQL users do — even the more experienced ones — is related to the way DISTINCT
is applied to the specified columns. It is quite common to see SQL queries attempting to apply a SELECT DISTINCT
clause over a subset of columns that need to be returned by the query.
And the misconception lies in the view that DISTINCT
is a function that essentially accepts s column name over which it is going to be applied, whilst the columns specified outside of the “function call” won’t be taken into account when computing the results.
But in reality, when trying to use parentheses in order to try and make DISTINCT
effective only over the columns enclosed in parentheses won’t work in the way you’d expect it to work.
Now let’s create an example table that we will be referencing throughout this article to demonstrate a few concepts and help us clarify precisely how DISTINCT
keyword works in SQL.
And now let’s add a few records in the newly created address
table.
Now let’s query the results to see the final example table:
The misconception about DISTINCT in SQL
Now let’s assume we want to get the distinct rows from our rental
table, using two different fields, namely customer_id
and store_id
. In other words, we would like to answer the following question:
What are the unique combinations of
customer_id
andstore_id
in our rental table?
To answer the above query we can simply query our table and fetch the DISTINCT
values for customer_id
and store_id
columns:
Now if we wanted to retrieve only a unique set of customers such that in the query result above we could only see one row per customer, then we would need to refine our query to do so.
And here’s exactly where the misconception about DISTINCT
lies in. A lot of users, have the (wrong!) impression that DISTINCT
is a function in which we can specify the columns we want to consider when applying it over our target table.
If you attempt to enclose customer_id
in parentheses when “calling” (not the right verb here since this is not a function) DISTINCT
you’ll notice that it has no effect at all:
We can still see “duplicate” customer IDs in our query result. This is because SELECT DISTINCT
clause, will always take into account all the column names specified, regardless of whether they were enclosed in parentheses.
In fact all of the expressions shown below are indeed equivalent:
SELECT DISTINCT customer_id, store_id FROM rental;
SELECT DISTINCT (customer_id), store_id FROM rental;
SELECT DISTINCT (customer_id), (store_id) FROM rental;
SELECT DISTINCT (store_id), customer_id FROM rental;
SELECT DISTINCT ((customer_id)), store_id FROM rental;
As a final note, I would highly recommend to avoid the use of parentheses when using SELECT
clause with the DISTINCT
qualifier as it might make other people (who may not be aware of what we’ve discussed today) mis-interpret the query and accidentally think that your intention is to apply DISTINCT
over a single column, despite the fact that this is not possible as we demonstrated already.
PostgreSQL and DISTINCT ON
If you are working with Postgres and would like to apply DISTINCT
only to a subset of the columns you want to retrieve in your results, then you can take advantage of DISTINCT ON
.
SELECT DISTINCT
eliminates duplicate rows from the result.
SELECT DISTINCT ON
eliminates rows that match on all the specified expressions.— Postgres Documentation
This is an extension built on top of DISTINCT
of standard SQL and it returns the first row for each set of rows that match the specified expression.
Note however that when using DISTINCT ON
it makes sense to also use the ORDER BY
clause. In this way, you will be able to specify the conditions for picking the desired result from conflicting rows. For instance, if two rows match your expression (in the above example, we have two records qualifying for customer with id 100
).
Now let’s assume that we want to fetch unique customer IDs with the corresponding store ID, but this time if multiple competing rows exist we would like to get the one with the smallest amount:
Note that the store_id
that corresponds to customer_id=100
has changed, since the rental row with the smallest amount is different now that we have ordered our results by amount in ascending order.
But in general, if you don’t really care about the order then you can omit it.
Final Thoughts
It is important to understand how the DISTINCT
keyword works with SELECT
statements in SQL since this is a source of confusion for many users — and I’d say even for experienced ones.
When writing queries with SELECT DISTINCT
clauses, many users tend to use DISTINCT
in the same way as they would with proper SQL functions. In other words, they enclose one column in the parentheses while at the same time they provide more column names after the clause — for example SELECT DISTINCT(user_id), first_name FROM ...
.
When reading such queries (and obviously the people writing them) you may end up believing that SELECT DISTINCT
is only applicable over the column specified (e.g. user_id
) and not over the remaining columns (e.g. first_name
). And as we’ve seen in today’s article, this is a misconception and quite dangerous assumption to make when writing queries.
Finally, we discussed a special case in PostgreSQL database that allows users to specify explicitly which column to be taken into account when applying DISTINCT
, using the special DISTINCT ON
clause.
Become a member and read every story on Medium. Your membership fee directly supports me and other writers you read. You’ll also get full access to every story on Medium.
Related articles you may also like