[ad_1]

## Learn how to use rank, dense rank, row number, cumulative distribution, percentiles rank, quartiles, percentiles, and more

Numbering functions assign a number (or decimals) to each record in a table. They are mostly used for ranking or assigning a sequential number to the data for further processing (deduplication, filtering, grouping).

They are usually required to be ordered by a specific dimension (date, revenue, salary, ID, etc…).

They can be used to answer the following questions:

**What are the top revenue-generating countries?****How do I rank volleyball players per division and salary?****What are the top N-performing countries by product category?****What rows are duplicated based on an ingestion date?**

This article will be split into **two sections. **The first section will cover the mechanisms of `RANK()`

, `DENSE_RANK()`

, and `ROW_NUMBER()`

, as they have a very similar purpose but slightly different output and mechanism.

The second section will cover `PERCENT_RANK`

, `CUME_DIST`

, and `NTILE`

, which have different purposes, mechanisms, and outputs.

I also suggest reading the great Google documentation.

To better understand the distinctions between these functions, we will query the following dataset containing sales from the Google Merchandise Shop, for different countries and product categories.

## ROW_NUMBER

The function `ROW_NUMBER()`

will always return a unique number starting from 1 and incrementing (1,2,3,4…,8,9…) in sequence. It is not required to specify an order, and the output number will always be unique even if the rows or values are similar.

If you are not using an `ORDER BY`

clause, the results will be **non-deterministic**, meaning the outcomes will be different even with the same input data.

Let’s look at two examples:

Let’s say we want to break down the **row number per product category**. For that, we can use a `PARTITION BY`

clause, and order by descending revenue value.

This can be highly beneficial for ranking/assigning an order within different groups or categories you may have available in your dataset.

## RANK and DENSE_RANK

The functions `RANK()`

and `DENSE_RANK()`

will act identically as `ROW_NUMBER()`

with two exceptions: how they **sequence numbers** and how they **manage similar values**.

For `RANK()`

, similar rows will receive the same rank number, but the function will leave a gap after two or more identical rows.

For `DENSE_RANK()`

, similar rows will receive the same rank number, but the rank number is always incremented by 1 and there will be no gap in our number sequence.

Let’s illustrate the **three functions in one query**:

Here is the output for our different functions. You can focus on how the function works **on rows 5 and 6** for the country`Venezuela`

(which are duplicates) and after:

`ROW_ NUMBER()`

keep its incremental sequence (1,2,3,**4,5,6,7**)`RANK()`

gives the same output value (5,5) but then loses its incremental sequence (1,2,3,**4,5,5,7**)`DENSE_RANK()`

gives the same output value (5,5) but keeps its incremental sequence (1,2,3,**4,5,5,6**)

The **same mechanism happens for rows 10 and 11**, as we are ordering by revenue and they have the same revenue. This incremental sequence mechanism will work the same for any number of peer values.

**Why not use ROW_NUMBER instead of RANK or DENSE_RANK?**

You could use `ROW_NUMBER()`

as a ranking function, but it’s interesting sometimes to keep the same rank value for all similar/peer rows.

For certain use cases, keeping the number sequence always incremented by 1 with `DENSE_RANK()`

might be a good pick.

**CUME_DIST**

The function `CUME_DIST()`

computes the cumulative distribution of values within a dataset or a partition. It returns values from 0 to 1 (>0 and ≤1).

This function requires an `ORDER BY`

clause to sort out values.

According to Google’s documentation, it is computed using the formula: **NP/NR. **This is how we can try to explain it:

`NP`

is the number of rows that come before or are similar to the current row`NR`

is the total number of rows (of the entire dataset or a partition)

It will show you how the values in your dataset are distributed. As an example, the distribution of our dataset rows based on revenue:

We use the `ROUND()`

function and multiplication `*100`

to turn the data into a more readable percentage format.

Let’s make the calculation manually. For our first row, there is only 1 value and no revenue below **1323. **This gives us **1/12 = 8%** (we have 12 rows in our dataset).

Now, let’s look at row 4th, `Nigeria`

, there are 3 rows with a value under **3314 + **this current row. This gives us **4/12 = 33%**.

The interesting part is for **rows 2 and 3** (or rows 7 and 8). They have the same revenue value. So if we look at row 2, we could expect a calculation of **2/12 = 16%.** But, as row 3 is similar, both rows will result in **3/12 = 25%.**

## NTILE

The `NTILE()`

function allows you to split a set of ranked data points into evenly distributed buckets. You might know this as **quantiles**, which can be of different types:

**Quartiles**(4 quantiles)**Deciles**(10 quantiles)**Percentiles**(100 quantiles)

For example, **quartiles** divide your dataset into four buckets of equal size. This means that the first quartile (Q1) contains 25% of the data points.

Let’s apply quartiles to our table:

In this case, we divide our data points into four equal buckets based on the ordered revenue (each bucket has 3 rows, which is **3/12 = 25%**).

Keep in mind **this is not a percentage of the total. **If we** **sum the revenue for the 4th quartile countries (France, Japan, and the United States), it does represent 90% of the total revenue.

When using this function, you have to provide an input number, for example: `NTILE(4)`

. You can’t leave this parameter empty, give an input number that is 0 or a negative value without seeing an error.

You can use this method to **find outliers** in your data (data points above the 95% percentile), **categorize your customers** based on their purchase value (the top 25%), and so on.

## PERCENT_RANK

The function `PERCENT_RANK()`

calculates the percentile distribution of value within a set of values. It returns values from 0 to 1.

This function requires an `ORDER BY`

clause to sort out values.

Again, we use the `ROUND()`

function and multiplication `*100`

to turn the data into a more readable percentage format.

Looking at the output,`Germany`

has the lowest revenue (not greater than any other country), so the **percentile rank is zero.**

On the other side, `United States`

have the biggest revenue of all (greater than any other country), hence the **percentile rank is 1 (or 100%).**

For `France`

, the percentile rank is 82%. That means its revenue is higher than 82% of all other countries.

From a practical experience, the most commonly used functions are `ROW_NUMBER()`

, `RANK()`

, `DENSE_RANK()`

and `NTILE()`

.

As an example, one of my tasks required to use `DENSE_RANK()`

to identify **acquisition products**, basically ranking products overall customer’s orders to identify what were the first products purchased. This function allowed us to keep the sequence incremented by one, deal with multiple products inside a single order, and to still be able to count the right number of total orders for a customer.

In another mission, the `NTILE()`

helped to** classify customers** into recent and frequent purchaser buckets (like an **RFM model** (Recency, Frequency, Monetary)) that would be used to make segments in our email service provider system.

You can find these functions generally available in other database systems (Amazon Redshift, MySQL, Postgres, Snowflake, etc…) as they are popular SQL windows functions (at least for ranking and row numbers).

[ad_2]

Source link