Retrieve Top N Rows Per Group


The Challenge

Often there is the requirement to access the first or top n rows for every unique value of a given column: the cheapest product (= first row) within a product group (= unique values of the given column), the rows with the highest version number per entity within a historic table, the newest 10 log entries per user, ... . In the SQL world this is a three-step-job: a) group the table over the given column b) order the rows within the created groups according to a criteria and c) access the first or the top n rows within the created, ordered groups.

In complex cases like this one SQL offers not only a single solution. There is a bunch of possible formulations to get the expected result. At the logical level they are equivalent, but it is likely that their performance differs strongly from each other. And it's likely that the performance of the same formulation differs strongly on different database systems. The deviation in performance results from the fact, that SQL in general defines only WHAT the system shall do and not HOW it shall be done. It is the responsibility of the database system to find an optimal execution plan.

We offer some of the possible solutions - from primitive over complex to simple ones. They include subselects, joins, the FETCH FIRST clause, the use of a predicate, and finally window functions as the method of choice.

Example Table and Data

We use the example table product with a small number of data rows to discuss diverse strategies.

CREATE TABLE product (
  id             INTEGER      NOT NULL,
  name           VARCHAR(50)  NOT NULL,
  product_group  VARCHAR(20)  NOT NULL,
  prize          DECIMAL(5,2)         ,
  CONSTRAINT product_pk PRIMARY KEY (id)
);

INSERT INTO product VALUES ( 1, 'Big Fat One'    , 'Desktop',    545);
INSERT INTO product VALUES ( 2, 'SmartAndElegant', 'Laptop',     675);
INSERT INTO product VALUES ( 3, 'Angle',           'Laptop',     398);
INSERT INTO product VALUES ( 4, 'Wizzard 7',       'Smartphone', 380);
INSERT INTO product VALUES ( 5, 'Solid',           'Desktop',    565);
INSERT INTO product VALUES ( 6, 'AllRounder',      'Smartphone', 535);
INSERT INTO product VALUES ( 7, 'WhiteHorse',      'Laptop',     675);
INSERT INTO product VALUES ( 8, 'Workstation ONE', 'Desktop',    499);
INSERT INTO product VALUES ( 9, 'Air',             'Laptop',     450);
INSERT INTO product VALUES (10, 'Rusty',           'Laptop',     390);
INSERT INTO product VALUES (11, 'Tripple-A',       'Desktop',    580);
INSERT INTO product VALUES (12, 'Oxygen 8',        'Smartphone', 450);
INSERT INTO product VALUES (13, 'AllDay Basic',    'Smartphone',  75);
COMMIT;

With this structure and data we will try to access the rows with the highest prize per product group.

Insufficient Solutions

The first solution uses only the GROUP BY clause and reduces the problem in two way: a) it offers only the very first row per group (ignoring the second best, third best, etc. rows) by using the functions max or min and b) the solution has only access to the grouping criteria and the result of max / min. Due to the nature of the GROUP BY clause all remaining columns are not accessible - see here.

SELECT product_group, MAX(prize)
FROM   product
GROUP BY product_group;


product_group | max
--------------+-----
Smartphone    | 535
Desktop       | 580
Laptop        | 675

-- access to other columns is not possible
SELECT *
FROM   product
GROUP BY product_group;

We can extent this first solution to show more columns by combining it with a correlated or non-correlated subquery. This second solution offers access to all columns. Nevertheless the result is not what we expect as the number of accessed rows is 4. The MAX(prize) criteria is not necessarily unique. Thus we receive 4 rows for the 3 groups out of our small example table. And - as mentioned above - we don't have access to the row with the second highest prize.

-- SELECT with a non-correlated subquery. The subquery is executed only once.
SELECT *
FROM   product
WHERE  prize IN      -- prize is a very weak criterion
  (SELECT MAX(prize)
   FROM   product
   GROUP BY product_group
  )
;

id |      name       | product_group | prize
---+-----------------+---------------+-------
11 | Tripple-A       | Desktop       |   580
 2 | SmartAndElegant | Laptop        |   675
 7 | WhiteHorse      | Laptop        |   675
 6 | AllRound        | Smartphone    |   535


-- SELECT with a correlated subquery. Observe the performance! The subquery is executed
-- once per row of p1 !!!
SELECT *
FROM   product p1
WHERE  prize IN      -- prize is a very weak criterion
  (SELECT MAX(prize)
   FROM   product p2
   WHERE  p1.product_group = p2.product_group
  )
;

id |      name       | product_group | prize
---+-----------------+---------------+-------
11 | Tripple-A       | Desktop       |   580
 2 | SmartAndElegant | Laptop        |   675
 7 | WhiteHorse      | Laptop        |   675
 6 | AllRound        | Smartphone    |   535

Resume: If one uses nothing but the GROUP BY clause, he will miss columns and rows. If he puts the GROUP BY into a subquery he can access all columns. But the access to the rows keeps unsatisfactory.

The same holds true for the third solution. One can create a JOIN over the product_group and reduce the resulting rows to those with the highest prize within the group by using the HAVING clause. The result is the same as with the second solution.

SELECT p1.*
FROM   product p1
JOIN   product p2 ON (p1.product_group = p2.product_group)
GROUP BY p1.id, p1.name, p1.product_group, p1.prize
HAVING p1.prize = MAX(p2.prize)
;

id |      name       | product_group | prize
---+-----------------+---------------+-------
 7 | WhiteHorse      | Laptop        |   675
 2 | SmartAndElegant | Laptop        |   675
11 | Tripple-A       | Desktop       |   580
 6 | AllRound        | Smartphone    |   535

As the fourth solution we offer a last example how to express the same issue - with the same unperfect result. It uses a NOT EXISTS predicate to search those rows, to which there is no higher prize within their group.

SELECT *
FROM   product p1
WHERE NOT EXISTS
  (SELECT *
   FROM  product p2
   WHERE p1.product_group = p2.product_group
   AND   p1.prize < p2.prize
  )
;

Komplex Solutions

To overcome the above shortcomings we have to adjust two screws. First, the link between the two SELECTs (via join or subselect) must be changed to a column, which has only unique values like the ID. And second, we must use the FETCH FIRST clause in combination with an ORDER BY to count rows.

First, we show a modification of the upper second solution. The ORDER BY clause sorts the rows into the desired sequence and the FETCH FIRST clause restricts the number of resulting rows to any desired quantity per group. The result of the subquery is a list of ids. As the ids are an unique criterion within our example table the outer SELECT retrieves exactly the expected rows - with all their columns.

-- modification of the second solution (correlated subquery)
SELECT *
FROM   product p1
WHERE  id IN
  (SELECT id
   FROM   product p2
   WHERE  p1.product_group = p2.product_group
   ORDER BY prize DESC
   FETCH FIRST 1 ROW ONLY    -- replace "ONLY" with "WITH TIES" to include rows with identical prize at the cutting edge
  )
;

id |      name       | product_group | prize
---+-----------------+---------------+-------
11 | Tripple-A       | Desktop       |   580
 2 | SmartAndElegant | Laptop        |   675
 6 | AllRound        | Smartphone    |   535

Next, we use a JOIN LATERAL clause, which allows - similar to a correlated subquery - the use of previous named tables and their columns as a link to later named tables and their columns. In the example every row of p1 is joined with the first row (FETCH FIRST) of p2 within the same group (p1.product_group = p2.product_group). The resulting columns of p2 are propagated to the outside parts of the query with the name p3. Finally the join takes place over the id (ON p1.id = p3.id). As p2/p3 retrievs always the row with the hightest prize per group, only this one is part of the result.

SELECT p3.*
FROM   product p1
JOIN LATERAL (SELECT *
              FROM   product p2
              WHERE  p1.product_group = p2.product_group
              ORDER BY p2.prize DESC
              FETCH FIRST 1 ROW ONLY
             ) p3 ON p1.id = p3.id
;

Window Functions

Window functions offer a very flexible and rich set of features. They work on multiple rows of the (intermediate) result set by 'sliding' over them like a 'window' and produce their results from the rows actually seen in the window.

They are addressed by two parts: the name of the desired function plus a defintion of the 'sliding window', eg: SELECT row_number OVER as rownum .... In this case the name of the function is 'row_number' and the window definition 'OVER ' remains empty, which leads to a window where all rows are seen. As its name suggests the function counts the rows within the window.

In our case of 'n-rows-per-group' we must define windows which act on groups of rows (in the sense of the GROUP BY clause). To do so we expand the window definition to OVER (PARTITION BY product_group ... ) and get a counter per group:

SELECT product.*, row_number OVER (PARTITION BY product_group ORDER BY id)
FROM   product;

 id |      name       | product_group | prize  | row_number
----+-----------------+---------------+--------+------------
  1 | Big Fat One     | Desktop       | 545    |          1
  5 | Solid           | Desktop       | 565    |          2
  8 | Workstation ONE | Desktop       | 499    |          3
 11 | Tripple-A       | Desktop       | 580    |          4
  2 | SmartAndElegant | Laptop        | 675    |          1
  3 | Angle           | Laptop        | 398    |          2
  7 | WhiteHorse      | Laptop        | 675    |          3
  9 | Air             | Laptop        | 450    |          4
 10 | Rusty           | Laptop        | 390    |          5
  4 | Wizzard 7       | Smartphone    | 380    |          1
  6 | AllRounder      | Smartphone    | 535    |          2
 12 | Oxygen 8        | Smartphone    | 450    |          3
 13 | AllDay Basic    | Smartphone    |  75    |          4

Now the rownumber starts with the value '1' within each group respectively partition. We can take advantage of this behaviour by sorting the rows as desired and limit the resulting rows to any desired quantity by querying this rownumber in an outer WHERE clause.

As a window function cannot be used in the WHERE clause we must use it in a SELECT witch is nested in another SELECT. The outer SELECT reduces the number of lastly retrieved rows to one per group, which is the one with the hightest prize as the OVER clause contains a ORDER BY.

SELECT tmp.*
FROM
  (SELECT product.*, row_number OVER (PARTITION BY product_group ORDER BY prize DESC) AS rownumber_per_group
   FROM   product
  ) tmp
WHERE  rownumber_per_group < 2
;

 id |    name    | product_group | prize  | rownumber_per_group
----+------------+---------------+--------+---------------------
 11 | Tripple-A  | Desktop       | 580    |                   1
  7 | WhiteHorse | Laptop        | 675    |                   1
  6 | AllRounder | Smartphone    | 535    |                   1

You can easily modify this solution to enlarge the number of retrieved rows or to integrate additional window functions - eg. if you use rank instead of row_number, you get the additional row with id=2 and prize=675.

Lastely we show a more complex query which retrieves additional statistical values per group. For details please refere to the page Window functions.

SELECT *
FROM
  (SELECT product.*,
          row_number OVER (PARTITION BY product_group ORDER BY prize DESC) AS rownumber_per_group,
          min(prize)   OVER (PARTITION BY product_group ORDER BY prize DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS min,
          avg(prize)   OVER (PARTITION BY product_group ORDER BY prize DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS avg,
          max(prize)   OVER (PARTITION BY product_group ORDER BY prize DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS max
   FROM   product
  ) tmp
WHERE  rownumber_per_group < 2
;

 id |    name    | product_group | prize  | rownumber_per_group |  min   |  avg   | max  
----+------------+---------------+--------+---------------------+--------+--------+------
 11 | Tripple-A  | Desktop       | 580    |                   1 | 499    | 547.25 | 580
  7 | WhiteHorse | Laptop        | 675    |                   1 | 390    | 517.60 | 675
  6 | AllRounder | Smartphone    | 535    |                   1 |  75    | 360.00 | 535



  This article uses material from the Wikipedia page available here. It is released under the Creative Commons Attribution-Share-Alike License 3.0.

Structured_Query_Language/Retrieve_Top_N_Rows_per_Group
 



 

Connect with defaultLogic
What We've Done
Led Digital Marketing Efforts of Top 500 e-Retailers.
Worked with Top Brands at Leading Agencies.
Successfully Managed Over $50 million in Digital Ad Spend.
Developed Strategies and Processes that Enabled Brands to Grow During an Economic Downturn.
Taught Advanced Internet Marketing Strategies at the graduate level.


Manage research, learning and skills at defaultLogic. Create an account using LinkedIn or facebook to manage and organize your Digital Marketing and Technology knowledge. defaultLogic works like a shopping cart for information -- helping you to save, discuss and share.


  Contact Us