Grouper, joindre


      Cette page a été écrite grâce à des tests réalisés par mon collègue Julien.
      En plus du lancers de cailloux (table LANCERS), les géants (table GEANTS) réalisent des plantages de clous (table CLOUTAGES).
      Un jour, Oumpfor demande à Margiono de lui sortir les trois géants possédant les meilleures performances pour le lancer de cailloux en ne tenant compte que des performances de l'année en cours et de l'année précédente. Il veut également les meilleures performances pour le plantage de clous. Ces performances permettront de départager les géants en cas d'égalité au niveau du lancer de cailloux.
      Environnement de démonstration PostgreSQL 11 :

select version(); version -------------------------------------------------------------------------------------------------------- PostgreSQL 11.3 (Debian 11.3-1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-7) 8.3.0, 64-bit (1 ligne) create table geants(idge integer generated by default as identity primary key, idcl smallint, dtn timestamp, taille smallint, devise varchar(128), dicton varchar(128), berserk boolean); CREATE TABLE with recursive serie(i) as (select 1 UNION ALL select i + 1 from serie where i < 1000) insert into geants(idcl, dtn, taille, devise, dicton, berserk) select least(random()*10, random()*10), current_date - 3650 - trunc(random() * 10000 + 1)::int, 200 + (trunc(random() * 200 + 1)), upper(md5(random()::text)),upper(md5(random()::text)), case when random() < 0.001 then true else false end from serie; INSERT 0 1000 create table lancers (idge integer references geants(idge), dtl timestamp, perf integer ); CREATE TABLE insert into lancers(dtl, idge, perf) with recursive serie(i) as (select 1 UNION ALL select i + 1 from serie where i < 10000000) select clock_timestamp() - ((i || ' minutes')::interval)::interval, trunc(random() * 1000 + 1), trunc(random() * 100000 + 1) from serie; INSERT 0 10000000 create unique index on lancers(idge, dtl); CREATE INDEX create index on lancers using brin(dtl); CREATE INDEX create table cloutages (idge integer references geants(idge), dtl timestamp, perf integer ); CREATE TABLE insert into cloutages(dtl, idge, perf) with recursive serie(i) as (select 1 UNION ALL select i + 1 from serie where i < 150000) select clock_timestamp() - ((i || ' hours')::interval)::interval, trunc(random() * 1000 + 1), trunc(random() * 100 + 1) from serie; INSERT 0 150000 create unique index on cloutages(idge, dtl); CREATE INDEX create index on cloutages using brin(dtl); CREATE INDEX


      Margiono écrit spontanément la requête de manière directe :

explain select g.idge, g.idcl, max(l.perf) ml, max(c.perf) mc from geants g join lancers l on (g.idge = l.idge) join cloutages c on (g.idge = c.idge) where l.dtl >= date_trunc('year', current_date)- interval '1 year' and c.dtl >= date_trunc('year', current_date)- interval '1 year' group by g.idge, g.idcl order by ml, mc desc fetch first 3 rows only; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ Limit (cost=149888.97..149888.98 rows=3 width=14) -> Sort (cost=149888.97..149891.47 rows=1000 width=14) Sort Key: (max(l.perf)), (max(c.perf)) DESC -> Finalize GroupAggregate (cost=149617.69..149876.04 rows=1000 width=14) Group Key: g.idge -> Gather Merge (cost=149617.69..149851.04 rows=2000 width=14) Workers Planned: 2 -> Sort (cost=148617.67..148620.17 rows=1000 width=14) Sort Key: g.idge -> Partial HashAggregate (cost=148557.84..148567.84 rows=1000 width=14) Group Key: g.idge -> Hash Join (cost=1828.92..118572.17 rows=3998090 width=14) Hash Cond: (l.idge = g.idge) -> Parallel Bitmap Heap Scan on lancers l (cost=206.47..71054.99 rows=315406 width=8) Recheck Cond: (dtl >= (date_trunc('year'::text, (CURRENT_DATE)::timestamp with time zone) - '1 year'::interval)) -> Bitmap Index Scan on lancers_dtl_idx (cost=0.00..17.23 rows=763042 width=0) Index Cond: (dtl >= (date_trunc('year'::text, (CURRENT_DATE)::timestamp with time zone) - '1 year'::interval)) -> Hash (cost=1464.01..1464.01 rows=12676 width=14) -> Hash Join (cost=52.71..1464.01 rows=12676 width=14) Hash Cond: (c.idge = g.idge) -> Bitmap Heap Scan on cloutages c (cost=15.21..1393.09 rows=12676 width=8) Recheck Cond: (dtl >= (date_trunc('year'::text, (CURRENT_DATE)::timestamp with time zone) - '1 year'::interval)) -> Bitmap Index Scan on cloutages_dtl_idx (cost=0.00..12.04 rows=18750 width=0) Index Cond: (dtl >= (date_trunc('year'::text, (CURRENT_DATE)::timestamp with time zone) - '1 year'::interval)) -> Hash (cost=25.00..25.00 rows=1000 width=6) -> Seq Scan on geants g (cost=0.00..25.00 rows=1000 width=6) select g.idge, g.idcl, max(l.perf) ml, max(c.perf) mc from geants g join lancers l on (g.idge = l.idge) join cloutages c on (g.idge = c.idge) where l.dtl >= date_trunc('year', current_date)- interval '1 year' and c.dtl >= date_trunc('year', current_date)- interval '1 year' group by g.idge, g.idcl order by ml, mc desc fetch first 3 rows only; idge | idcl | ml | mc ------+------+-------+----- 988 | 3 | 98927 | 91 503 | 0 | 99036 | 100 196 | 1 | 99137 | 92 Durée : 2732,026 ms (00:02,732) Durée : 2715,877 ms (00:02,716) Durée : 2719,984 ms (00:02,720)


      Les jointures sont effectuées avant toute opération de regroupement. Margiono se dit qu'il serait judicieux de joindre des ensembles plus petits et tente une autre écriture. Il effectue d'abord explicitement les regroupements (GROUP BY) puis joint les ensembles obtenus (JOIN) :

explain with l as (select idge, max(perf) ml from lancers where dtl >= date_trunc('year', current_date)- interval '1 year' group by idge), c as (select idge, max(perf) mc from cloutages where dtl >= date_trunc('year', current_date)- interval '1 year' group by idge) select g.idge, g.idcl, ml, mc from geants g join l on (g.idge = l.idge) join c on (g.idge = c.idge) order by ml, mc desc fetch first 3 rows only; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=75531.00..75531.01 rows=3 width=14) CTE l -> Finalize GroupAggregate (cost=73691.87..73945.22 rows=1000 width=8) Group Key: lancers.idge -> Gather Merge (cost=73691.87..73925.22 rows=2000 width=8) Workers Planned: 2 -> Sort (cost=72691.85..72694.35 rows=1000 width=8) Sort Key: lancers.idge -> Partial HashAggregate (cost=72632.02..72642.02 rows=1000 width=8) Group Key: lancers.idge -> Parallel Bitmap Heap Scan on lancers (cost=206.47..71054.99 rows=315406 width=8) Recheck Cond: (dtl >= (date_trunc('year'::text, (CURRENT_DATE)::timestamp with time zone) - '1 year'::interval)) -> Bitmap Index Scan on lancers_dtl_idx (cost=0.00..17.23 rows=763042 width=0) Index Cond: (dtl >= (date_trunc('year'::text, (CURRENT_DATE)::timestamp with time zone) - '1 year'::interval)) CTE c -> HashAggregate (cost=1456.47..1466.47 rows=1000 width=8) Group Key: cloutages.idge -> Bitmap Heap Scan on cloutages (cost=15.21..1393.09 rows=12676 width=8) Recheck Cond: (dtl >= (date_trunc('year'::text, (CURRENT_DATE)::timestamp with time zone) - '1 year'::interval)) -> Bitmap Index Scan on cloutages_dtl_idx (cost=0.00..12.04 rows=18750 width=0) Index Cond: (dtl >= (date_trunc('year'::text, (CURRENT_DATE)::timestamp with time zone) - '1 year'::interval)) -> Sort (cost=119.31..121.81 rows=1000 width=14) Sort Key: l.ml, c.mc DESC -> Hash Join (cost=72.64..106.39 rows=1000 width=14) Hash Cond: (c.idge = g.idge) -> CTE Scan on c (cost=0.00..20.00 rows=1000 width=8) -> Hash (cost=60.14..60.14 rows=1000 width=14) -> Hash Join (cost=37.50..60.14 rows=1000 width=14) Hash Cond: (l.idge = g.idge) -> CTE Scan on l (cost=0.00..20.00 rows=1000 width=8) -> Hash (cost=25.00..25.00 rows=1000 width=6) -> Seq Scan on geants g (cost=0.00..25.00 rows=1000 width=6) with l as (select idge, avg(perf) ml from lancers where dtl >= current_date - 300 group by idge), c as (select idge, avg(perf) mc from cloutages where dtl >= current_date - 300 group by idge) select g.idge, g.idcl, ml, mc from geants g join l on (g.idge = l.idge) join c on (g.idge = c.idge) order by ml, mc desc fetch first 3 rows only; idge | idcl | ml | mc ------+------+-------+----- 988 | 3 | 98927 | 91 503 | 0 | 99036 | 100 196 | 1 | 99137 | 92 (3 lignes) Temps : 653,565 ms Temps : 647,962 ms Temps : 646,952 ms

      Bingo, la deuxième écriture se révèle ici 4 fois plus performante à l'exécution.
      Grouper puis joindre (GROUP BY first then JOIN) est une règle d'optimisation intéressante avec PostgreSQL.
      Le planner (optimiseur) PostgreSQL devrait à l'avenir être capable de réaliser automatiquement cette optimisation, même avec la première écriture. Des développements effectués par la communauté PostgreSQL sur le sujet apparaissent dès 2017 mais cette règle est pour l'instant toujours valable.

Mise à jour : 07/06/2019