How the use of parentheses when using DISTINCT keyword in SQL could cause confusion
One of the most common misinterpretations that many SQL users do — even the more experienced ones — is related to the way
DISTINCT is applied to the specified columns. It is quite common to see SQL queries attempting to apply a
SELECT DISTINCT clause over a subset of columns that need to be returned by the query.
And the misconception lies in the view that
DISTINCT is a function that essentially accepts s column name over which it is going to be applied, whilst the columns specified outside of the “function call” won’t be taken into account when computing the results.
But in reality, when trying to use parentheses in order to try and make
DISTINCT effective only over the columns enclosed in parentheses won’t work in the way you’d expect it to work.
Now let’s create an example table that we will be referencing throughout this article to demonstrate a few concepts and help us clarify precisely how
DISTINCT keyword works in SQL.
And now let’s add a few records in the newly created
Now let’s query the results to see the final example table:
The misconception about DISTINCT in SQL
Now let’s assume we want to get the distinct rows from our
rental table, using two different fields, namely
store_id. In other words, we would like to answer the following question:
What are the unique combinations of
store_idin our rental table?
To answer the above query we can simply query our table and fetch the
DISTINCT values for
Now if we wanted to retrieve only a unique set of customers such that in the query result above we could only see one row per customer, then we would need to refine our query to do so.
And here’s exactly where the misconception about
DISTINCT lies in. A lot of users, have the (wrong!) impression that
DISTINCT is a function in which we can specify the columns we want to consider when applying it over our target table.
If you attempt to enclose
customer_id in parentheses when “calling” (not the right verb here since this is not a function)
DISTINCT you’ll notice that it has no effect at all:
We can still see “duplicate” customer IDs in our query result. This is because
SELECT DISTINCT clause, will always take into account all the column names specified, regardless of whether they were enclosed in parentheses.
In fact all of the expressions shown below are indeed equivalent:
SELECT DISTINCT customer_id, store_id FROM rental;
SELECT DISTINCT (customer_id), store_id FROM rental;
SELECT DISTINCT (customer_id), (store_id) FROM rental;
SELECT DISTINCT (store_id), customer_id FROM rental;
SELECT DISTINCT ((customer_id)), store_id FROM rental;
As a final note, I would highly recommend to avoid the use of parentheses when using
SELECT clause with the
DISTINCT qualifier as it might make other people (who may not be aware of what we’ve discussed today) mis-interpret the query and accidentally think that your intention is to apply
DISTINCT over a single column, despite the fact that this is not possible as we demonstrated already.
PostgreSQL and DISTINCT ON
If you are working with Postgres and would like to apply
DISTINCT only to a subset of the columns you want to retrieve in your results, then you can take advantage of
SELECT DISTINCTeliminates duplicate rows from the result.
SELECT DISTINCT ONeliminates rows that match on all the specified expressions.
— Postgres Documentation
This is an extension built on top of
DISTINCT of standard SQL and it returns the first row for each set of rows that match the specified expression.
Note however that when using
DISTINCT ON it makes sense to also use the
ORDER BY clause. In this way, you will be able to specify the conditions for picking the desired result from conflicting rows. For instance, if two rows match your expression (in the above example, we have two records qualifying for customer with id
Now let’s assume that we want to fetch unique customer IDs with the corresponding store ID, but this time if multiple competing rows exist we would like to get the one with the smallest amount:
Note that the
store_id that corresponds to
customer_id=100 has changed, since the rental row with the smallest amount is different now that we have ordered our results by amount in ascending order.
But in general, if you don’t really care about the order then you can omit it.
It is important to understand how the
DISTINCT keyword works with
SELECT statements in SQL since this is a source of confusion for many users — and I’d say even for experienced ones.
When writing queries with
SELECT DISTINCT clauses, many users tend to use
DISTINCT in the same way as they would with proper SQL functions. In other words, they enclose one column in the parentheses while at the same time they provide more column names after the clause — for example
SELECT DISTINCT(user_id), first_name FROM ....
When reading such queries (and obviously the people writing them) you may end up believing that
SELECT DISTINCT is only applicable over the column specified (e.g.
user_id) and not over the remaining columns (e.g.
first_name). And as we’ve seen in today’s article, this is a misconception and quite dangerous assumption to make when writing queries.
Finally, we discussed a special case in PostgreSQL database that allows users to specify explicitly which column to be taken into account when applying
DISTINCT, using the special
DISTINCT ON clause.
Become a member and read every story on Medium. Your membership fee directly supports me and other writers you read. You’ll also get full access to every story on Medium.
Related articles you may also like