Regrouper selon plusieurs critères

(sujet mis à jour avec les versions 9.6 et 10 devel)

Un besoin simple couvert par SQL-92

      Un jour M.Tadiot, un chef, demande à son développeur senior, M.Jacques, de lui fournir RAPIDEMENT sur l’année 2015 le montant des ventes par produit, pour chacun des magasins, par mois et le total.
      Dans cet exemple nous allons considérer une table VENTES_AGG contenant le montant total des ventes en provenance de diverses sources. Nous avons pour le produit ID_PROD, dans le magasin ID_MAG, réalisé le montant de vente MTV dans la minute suivant le temps stocké dans la colonne DTV.
      M.Jacques se lance dans l’opération avec une requête "à l’ancienne" :

select version(); version ------------------------------------------------------------- PostgreSQL 9.5.1, compiled by Visual C++ build 1800, 64-bit (1 ligne) create table ventes_agg(dtv timestamp, id_prod smallint, id_mag smallint, mtv integer); CREATE TABLE show shared_buffers; shared_buffers ---------------- 2GB (1 ligne) show work_mem; work_mem ---------- 256MB (1 ligne) do $$ begin for i in reverse 669600..1 loop insert into ventes_agg values(current_timestamp - (i || ' minutes')::interval, trunc(random() * 100 + 1), trunc(random() * 100 + 1), trunc(random() * 90000 + 1)); end loop; end$$; DO explain (select id_prod, null::smallint as id_mag, null::smallint as month_dtv, sum(mtv) from ventes_agg where dtv >= to_timestamp('01/01/2015','DD/MM/YYYY') and dtv < to_timestamp('01/01/2016','DD/MM/YYYY') group by id_prod order by 1 desc) union all (select null, id_mag as id_mag, null, sum(mtv) from ventes_agg where dtv >= to_timestamp('01/01/2015','DD/MM/YYYY') and dtv < to_timestamp('01/01/2016','DD/MM/YYYY') group by id_mag order by 2 desc) union all (select null, null, extract(month from dtv) as month_dtv, sum(mtv) from ventes_agg where dtv >= to_timestamp('01/01/2015','DD/MM/YYYY') and dtv < to_timestamp('01/01/2016','DD/MM/YYYY') group by extract(month from dtv) order by 3 desc ) union all select null, null, null, sum(mtv) from ventes_agg where dtv >= to_timestamp('01/01/2015','DD/MM/YYYY') and dtv < to_timestamp('01/01/2016','DD/MM/YYYY'); QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Append (cost=19657.70..142074.85 rows=528644 width=12) -> Result (cost=19657.70..39317.38 rows=158 width=6) -> Append (cost=19657.70..39317.38 rows=158 width=6) -> Subquery Scan on "*SELECT* 1" (cost=19657.70..19658.69 rows=79 width=6) -> Sort (cost=19657.70..19657.90 rows=79 width=6) Sort Key: ventes_agg.id_prod DESC -> HashAggregate (cost=19654.43..19655.22 rows=79 width=6) Group Key: ventes_agg.id_prod -> Seq Scan on ventes_agg (cost=0.00..17012.00 rows=528485 width=6) Filter: ((dtv >= to_timestamp('01/01/2015'::text, 'DD/MM/YYYY'::text)) AND (dtv < to_timestamp('01/01/2016'::text, 'DD/MM/YYYY'::text))) -> Subquery Scan on "*SELECT* 2" (cost=19657.70..19658.69 rows=79 width=6) -> Sort (cost=19657.70..19657.90 rows=79 width=6) Sort Key: ventes_agg_1.id_mag DESC -> HashAggregate (cost=19654.43..19655.22 rows=79 width=6) Group Key: ventes_agg_1.id_mag -> Seq Scan on ventes_agg ventes_agg_1 (cost=0.00..17012.00 rows=528485 width=6) Filter: ((dtv >= to_timestamp('01/01/2015'::text, 'DD/MM/YYYY'::text)) AND (dtv < to_timestamp('01/01/2016'::text, 'DD/MM/YYYY'::text))) -> GroupAggregate (cost=68569.68..79139.38 rows=528485 width=12) Group Key: (date_part('month'::text, ventes_agg_2.dtv)) -> Sort (cost=68569.68..69890.90 rows=528485 width=12) Sort Key: (date_part('month'::text, ventes_agg_2.dtv)) DESC -> Seq Scan on ventes_agg ventes_agg_2 (cost=0.00..18333.21 rows=528485 width=12) Filter: ((dtv >= to_timestamp('01/01/2015'::text, 'DD/MM/YYYY'::text)) AND (dtv < to_timestamp('01/01/2016'::text, 'DD/MM/YYYY'::text))) -> Aggregate (cost=18333.21..18333.22 rows=1 width=4) -> Seq Scan on ventes_agg ventes_agg_3 (cost=0.00..17012.00 rows=528485 width=4) Filter: ((dtv >= to_timestamp('01/01/2015'::text, 'DD/MM/YYYY'::text)) AND (dtv < to_timestamp('01/01/2016'::text, 'DD/MM/YYYY'::text))) (26 lignes) (select id_prod, null::smallint as id_mag, null::smallint as month_dtv, sum(mtv) from ventes_agg where dtv >= to_timestamp('01/01/2015','DD/MM/YYYY') and dtv < to_timestamp('01/01/2016','DD/MM/YYYY') group by id_prod order by 1 desc) union all (select null, id_mag as id_mag, null, sum(mtv) from ventes_agg where dtv >= to_timestamp('01/01/2015','DD/MM/YYYY') and dtv < to_timestamp('01/01/2016','DD/MM/YYYY') group by id_mag order by 2 desc) union all (select null, null, extract(month from dtv) as month_dtv, sum(mtv) from ventes_agg where dtv >= to_timestamp('01/01/2015','DD/MM/YYYY') and dtv < to_timestamp('01/01/2016','DD/MM/YYYY') group by extract(month from dtv) order by 3 desc ) union all select null, null, null, sum(mtv) from ventes_agg where dtv >= to_timestamp('01/01/2015','DD/MM/YYYY') and dtv < to_timestamp('01/01/2016','DD/MM/YYYY'); id_prod | id_mag | month_dtv | sum ---------+--------+-----------+------------- 100 | | | 234829908 .. 1 | | | 235149951 | 100 | | 237283838 .. | 1 | | 239420604 | | 12 | 2008147978 .. | | 1 | 2010666350 | | | 23663496163 (213 lignes) Temps : 4059,881 ms Temps : 4037,476 ms Temps : 4040,026 ms

Améliorer la performance ?

      M.Jacques a procédé de manière très classique avec des UNION ALL. M.Tadiot ne le remercie pas. Il avait dit RAPIDEMENT et il trouve le temps long lorsqu’il exécute la requête.
      M.Jacques tente d’améliorer les choses. Il décide d’éviter de balayer 4 fois la table VENTES_AGG avec une WITH query ou CTE (common table expression) :

explain with ventes_agg_agg as (select id_prod, id_mag, extract(month from dtv) month_dtv, sum(mtv) sum_mtv from ventes_agg where (dtv >= to_timestamp('01/01/2015','DD/MM/YYYY') and dtv < to_timestamp('01/01/2016','DD/MM/YYYY')) group by id_prod, id_mag, extract(month from dtv)) (select id_prod, null::smallint as id_mag, null::smallint as month_dtv, sum(sum_mtv) from ventes_agg_agg group by id_prod order by 1 desc) union all (select null, id_mag, null, sum(sum_mtv) from ventes_agg_agg group by id_mag order by 2 desc) union all (select null, null, month_dtv, sum(sum_mtv) from ventes_agg_agg group by month_dtv order by 3 desc ) union all select null, null, null, sum(sum_mtv) from ventes_agg_agg ; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------------------- Append (cost=43446.39..81789.37 rows=601 width=12) CTE ventes_agg_agg -> HashAggregate (cost=23618.06..30224.13 rows=528485 width=16) Group Key: ventes_agg.id_prod, ventes_agg.id_mag, date_part('month'::text, ventes_agg.dtv) -> Seq Scan on ventes_agg (cost=0.00..18333.21 rows=528485 width=16) Filter: ((dtv >= to_timestamp('01/01/2015'::text, 'DD/MM/YYYY'::text)) AND (dtv < to_timestamp('01/01/2016'::text, 'DD/MM/YYYY'::text))) -> Result (cost=13222.27..26449.54 rows=400 width=10) -> Append (cost=13222.27..26449.54 rows=400 width=10) -> Subquery Scan on "*SELECT* 1" (cost=13222.27..13224.77 rows=200 width=10) -> Sort (cost=13222.27..13222.77 rows=200 width=10) Sort Key: ventes_agg_agg.id_prod DESC -> HashAggregate (cost=13212.13..13214.63 rows=200 width=10) Group Key: ventes_agg_agg.id_prod -> CTE Scan on ventes_agg_agg (cost=0.00..10569.70 rows=528485 width=10) -> Subquery Scan on "*SELECT* 2" (cost=13222.27..13224.77 rows=200 width=10) -> Sort (cost=13222.27..13222.77 rows=200 width=10) Sort Key: ventes_agg_agg_1.id_mag DESC -> HashAggregate (cost=13212.13..13214.63 rows=200 width=10) Group Key: ventes_agg_agg_1.id_mag -> CTE Scan on ventes_agg_agg ventes_agg_agg_1 (cost=0.00..10569.70 rows=528485 width=10) -> Subquery Scan on "*SELECT* 3" (cost=13222.27..13224.77 rows=200 width=16) -> Sort (cost=13222.27..13222.77 rows=200 width=16) Sort Key: ventes_agg_agg_2.month_dtv DESC -> HashAggregate (cost=13212.13..13214.63 rows=200 width=16) Group Key: ventes_agg_agg_2.month_dtv -> CTE Scan on ventes_agg_agg ventes_agg_agg_2 (cost=0.00..10569.70 rows=528485 width=16) -> Aggregate (cost=11890.92..11890.93 rows=1 width=8) -> CTE Scan on ventes_agg_agg ventes_agg_agg_3 (cost=0.00..10569.70 rows=528485 width=8) (28 lignes) with ventes_agg_agg as (select id_prod, id_mag, extract(month from dtv) month_dtv, sum(mtv) sum_mtv from ventes_agg where (dtv >= to_timestamp('01/01/2015','DD/MM/YYYY') and dtv < to_timestamp('01/01/2016','DD/MM/YYYY')) group by id_prod, id_mag, extract(month from dtv)) (select id_prod, null::smallint as id_mag, null::smallint as month_dtv, sum(sum_mtv) from ventes_agg_agg group by id_prod order by 1 desc) union all (select null, id_mag, null, sum(sum_mtv) from ventes_agg_agg group by id_mag order by 2 desc) union all (select null, null, month_dtv, sum(sum_mtv) from ventes_agg_agg group by month_dtv order by 3 desc ) union all select null, null, null, sum(sum_mtv) from ventes_agg_agg ; id_prod | id_mag | month_dtv | sum ---------+--------+-----------+------------- 100 | | | 234829908 .. 1 | | | 235149951 | 100 | | 237283838 .. | 1 | | 239420604 | | 12 | 2008147978 .. | | 1 | 2010666350 | | | 23663496163 (213 lignes) Temps : 1405,630 ms Temps : 1400,498 ms Temps : 1411,651 ms


Et les extensions OLAP de SQL-99 ?

      Moins de 2 secondes pas mal ! Alex Mou, un stagiaire récemment arrivé dans l’entreprise, dit à M.Jacques qu’il aurait pu aussi utiliser GROUPING SETS pour regrouper selon différents critères au sein de la même clause GROUP BY.
      Cette fonctionnalité standard SQL-99 a été introduite avec PostgreSQL 9.5 :

explain (select id_prod, id_mag, extract(month from dtv) month_dtv, sum(mtv) from ventes_agg where dtv >= to_timestamp('01/01/2015','DD/MM/YYYY') and dtv < to_timestamp('01/01/2016','DD/MM/YYYY') group by grouping sets ((id_prod), (id_mag), (extract(month from dtv)), ()) order by 1 desc nulls last ,2 desc nulls last, 3 desc nulls last); QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------- Sort (cost=251010.42..252332.03 rows=528644 width=16) Sort Key: id_prod DESC NULLS LAST, id_mag DESC NULLS LAST, (date_part('month'::text, dtv)) DESC NULLS LAST -> GroupAggregate (cost=68569.68..200757.69 rows=528644 width=16) Group Key: id_prod Group Key: () Sort Key: (date_part('month'::text, dtv)) DESC NULLS LAST Group Key: (date_part('month'::text, dtv)) Sort Key: id_mag DESC NULLS LAST Group Key: id_mag -> Sort (cost=68569.68..69890.90 rows=528485 width=16) Sort Key: id_prod DESC NULLS LAST -> Seq Scan on ventes_agg (cost=0.00..18333.21 rows=528485 width=16) Filter: ((dtv >= to_timestamp('01/01/2015'::text, 'DD/MM/YYYY'::text)) AND (dtv < to_timestamp('01/01/2016'::text, 'DD/MM/YYYY'::text))) (13 lignes) (select id_prod, id_mag, extract(month from dtv) month_dtv, sum(mtv) from ventes_agg where dtv >= to_timestamp('01/01/2015','DD/MM/YYYY') and dtv < to_timestamp('01/01/2016','DD/MM/YYYY') group by grouping sets ((id_prod), (id_mag), (extract(month from dtv)), ()) order by 1 desc nulls last ,2 desc nulls last, 3 desc nulls last); id_prod | id_mag | month_dtv | sum ---------+--------+-----------+------------- 100 | | | 234829908 .. | 100 | | 237283838 .. | 1 | | 239420604 | | 12 | 2008147978 .. | | 1 | 2010666350 | | | 23663496163 (213 lignes) Temps : 1522,307 ms Temps : 1518,830 ms Temps : 1520,983 ms


      Malgré des estimations de coût très éloignées de la réalité la requête avec GROUPING SETS est efficace à l’exécution. Elle n’entraîne notamment aussi qu’un balayage de la table VENTES_AGG. Elle a ici pris un dixième de seconde de plus que la requête avec la CTE mais reste bien plus rapide que la requête originale et la syntaxe est plus directe. Cette fonctionnalitê SQL-1999 de regroupement selon divers critêres dans une même clause GROUP BY est donc bien utilisable avec PostgreSQL.

Mise à jour : 12/03/2016