Switch to: V9V8V7V6V5

Selecting fields that are not in GROUP BY

Task Description

Assume we have some table where each record describes the person, person's phone and this phone's priority among the other phones belonged to this person.

CREATE TABLE Person ( name String, phone long, priority long );
 
INSERT INTO Person VALUES ( 'John Smith',  555111, 1 );
INSERT INTO Person VALUES ( 'Ivan Ivanov', 555222, 1 );
INSERT INTO Person VALUES ( 'John Smith',  555333, 2 );

And we need to get the phones with max priority for each person. Also, lets priority be in range 1 .. n.

Problem

The first approach could be following:

SELECT name, MAX(priority) FROM Person GROUP BY name;

It is ok until we don't want to see “phone” column in the result. But if we need to show such additional columns (i.e. “show me the top-priority phone of each person”), then we get problem.

We can NOT do it simply adding “phone” into the select clause, because according to SQL standard, SELECT clause may contain columns mentioned in the GROUP BY and expressions based on aggregative functions only.

Why? Because of ambiguity - “group by” produces a single record for each group of records. “Group By” columns contain the same value for all records in the group and any aggregative function produces the single value for a group, but “phone” can be 555111 or 555333 for 'John Smith' group - which one should we choose as a single value for 'John Smith' group?

!!! **WRONG** !!!
 
SELECT name, phone, MAX(priority) FROM Person GROUP BY name;

Also, we can NOT apply some aggregative function for “phone” (like we did for “priority”) because we may get values from different records:

!!! **WRONG** !!!
 
SELECT name, FIRST(phone), MAX(priority) FROM Person GROUP BY name;
--
'John Smith'   555111  2
'Ivan Ivanov'  555222  1
 
Instead of:
--
'John Smith'   555333  2
'Ivan Ivanov'  555222  1

Solutions

There are few tricks to solve such a problem:

Correlated subquery

SELECT * FROM Person t1 WHERE t1.priority = (SELECT max(t2.priority) FROM Person t2 WHERE t1.name = t2.name);
--
'John Smith'   555333  2
'Ivan Ivanov'  555222  1

This way is workable but can take a lot of time because of its correlation nature - we have to find max priority for particular person checking each record in the table.

ANY predicate

SELECT 
  * 
FROM 
  Person t1 
WHERE 
  (t1.name, t1.priority) = ANY(SELECT t2.name, max(t2.priority) FROM Person t2 GROUP BY t2.name);
--
'John Smith'   555333  2
'Ivan Ivanov'  555222  1

Here we will find all pairs (name, max(priority)) first (and we do it once!). Then we just check all candidates from t1 for being present in that set of pairs.

For particular test data (~5500 records in Person, ~1500 groups) this way 8 times faster then subqueries.

ORDER BY in aggregative functions First()/Last()

New for v.5.0

SELECT 
  * 
FROM 
  Person t1 
WHERE 
  t1.recid  IN (SELECT FIRST(t2.recid ORDER BY t2.priority) FROM Person t2 GROUP BY t2.name);
--
'John Smith'   555333  2
'Ivan Ivanov'  555222  1

Here we will find all “first 'recid' sorted out by 'priority'” inside of each group of records. Then we just check all candidates from t1 for being present in that set of pairs. Moreover, this way let us get rid of records with an equal priority inside each group.

For particular test data (~5500 records in Person, ~1500 groups) this way 400 times faster then correlated subqueries.

Note, that this solution is also more powerful, because after you have list of RecIDs, you can do SELECT *, i.e. select all/many fields of that records.