Regrouper selon plusieurs critères

(sujet préalablement traité avec les versions 9.5 et 9.6)

Un besoin simple couvert par SQL-92

      Un jour M.Tadiot, un chef, demande à son développeur senior, M.Jacques, de lui fournir RAPIDEMENT sur l’année 2016 le montant des ventes par produit, pour chacun des magasins, par mois et le total.
      Dans cet exemple nous allons considérer une table VENTES_AGG contenant le montant total des ventes en provenance de diverses sources. Nous avons pour le produit ID_PROD, dans le magasin ID_MAG, réalisé le montant de vente MTV dans la minute suivant le temps stocké dans la colonne DTV.
      M.Jacques se lance dans l’opération avec une requête "à l’ancienne" :

SELECT version(); version ---------------------------------------------------------------------------------------------------- PostgreSQL 10devel on x86_64-pc-linux-gnu, compiled by gcc (Debian 6.3.0-8) 6.3.0 20170221, 64-bit (1 ligne) Temps : 0,191 ms show max_parallel_workers_per_gather; max_parallel_workers_per_gather --------------------------------- 3 (1 ligne) Temps : 0,142 ms create unlogged table ventes_agg(dtv timestamp, id_prod smallint, id_mag smallint, mtv integer); CREATE TABLE Temps : 100,616 ms WITH serie_1000000(i) AS (SELECT generate_series(1000000,1,-1)) INSERT INTO ventes_agg (SELECT current_timestamp - (i || ' minutes')::interval, trunc(random() * 100 + 1), trunc(random() * 100 + 1), trunc(random() * 90000 + 1) FROM serie_1000000); INSERT 0 1000000 Time: 1589,322 ms (00:01,589) \d+ Liste des relations Schéma | Nom | Type | Propriétaire | Taille | Description --------+------------+-------+--------------+--------+------------- public | ventes_agg | table | postgres | 42 MB | (1 ligne) ANALYZE ventes_agg; ANALYZE Temps : 200,600 ms explain (select id_prod, null::smallint as id_mag, null::smallint as month_dtv, sum(mtv) from ventes_agg where dtv >= to_timestamp('01/01/2016','DD/MM/YYYY') and dtv < to_timestamp('01/01/2017','DD/MM/YYYY') group by id_prod order by 1 desc) union all (select null, id_mag as id_mag, null, sum(mtv) from ventes_agg where dtv >= to_timestamp('01/01/2016','DD/MM/YYYY') and dtv < to_timestamp('01/01/2017','DD/MM/YYYY') group by id_mag order by 2 desc) union all (select null, null, extract(month from dtv) as month_dtv, sum(mtv) from ventes_agg where dtv >= to_timestamp('01/01/2016','DD/MM/YYYY') and dtv < to_timestamp('01/01/2017','DD/MM/YYYY') group by extract(month from dtv) order by 3 desc ) union all select null, null, null, sum(mtv) from ventes_agg where dtv >= to_timestamp('01/01/2016','DD/MM/YYYY') and dtv < to_timestamp('01/01/2017','DD/MM/YYYY'); QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Append (cost=15864.49..148548.59 rows=526527 width=20) -> Result (cost=15864.49..31738.47 rows=200 width=20) -> Append (cost=15864.49..31735.97 rows=200 width=14) -> Finalize GroupAggregate (cost=15864.49..15866.99 rows=100 width=14) Group Key: ventes_agg.id_prod -> Sort (cost=15864.49..15864.99 rows=200 width=10) Sort Key: ventes_agg.id_prod DESC -> Gather (cost=15835.84..15856.84 rows=200 width=10) Workers Planned: 2 -> Partial HashAggregate (cost=14835.84..14836.84 rows=100 width=10) Group Key: ventes_agg.id_prod -> Parallel Seq Scan on ventes_agg (cost=0.00..13739.33 rows=219302 width=6) Filter: ((dtv >= to_timestamp('01/01/2016'::text, 'DD/MM/YYYY'::text)) AND (dtv < to_timestamp('01/01/2017'::text, 'DD/MM/YYYY'::text))) -> Finalize GroupAggregate (cost=15864.49..15866.99 rows=100 width=14) Group Key: ventes_agg_1.id_mag -> Sort (cost=15864.49..15864.99 rows=200 width=10) Sort Key: ventes_agg_1.id_mag DESC -> Gather (cost=15835.84..15856.84 rows=200 width=10) Workers Planned: 2 -> Partial HashAggregate (cost=14835.84..14836.84 rows=100 width=10) Group Key: ventes_agg_1.id_mag -> Parallel Seq Scan on ventes_agg ventes_agg_1 (cost=0.00..13739.33 rows=219302 width=6) Filter: ((dtv >= to_timestamp('01/01/2016'::text, 'DD/MM/YYYY'::text)) AND (dtv < to_timestamp('01/01/2017'::text, 'DD/MM/YYYY'::text))) -> GroupAggregate (cost=85732.51..96259.03 rows=526326 width=20) Group Key: (date_part('month'::text, ventes_agg_2.dtv)) -> Sort (cost=85732.51..87048.33 rows=526326 width=12) Sort Key: (date_part('month'::text, ventes_agg_2.dtv)) DESC -> Seq Scan on ventes_agg ventes_agg_2 (cost=0.00..26721.81 rows=526326 width=12) Filter: ((dtv >= to_timestamp('01/01/2016'::text, 'DD/MM/YYYY'::text)) AND (dtv < to_timestamp('01/01/2017'::text, 'DD/MM/YYYY'::text))) -> Finalize Aggregate (cost=15287.80..15287.81 rows=1 width=20) -> Gather (cost=15287.59..15287.80 rows=2 width=8) Workers Planned: 2 -> Partial Aggregate (cost=14287.59..14287.60 rows=1 width=8) -> Parallel Seq Scan on ventes_agg ventes_agg_3 (cost=0.00..13739.33 rows=219302 width=4) Filter: ((dtv >= to_timestamp('01/01/2016'::text, 'DD/MM/YYYY'::text)) AND (dtv < to_timestamp('01/01/2017'::text, 'DD/MM/YYYY'::text))) (35 lignes) Temps : 0,814 ms (select id_prod, null::smallint as id_mag, null::smallint as month_dtv, sum(mtv) from ventes_agg where dtv >= to_timestamp('01/01/2016','DD/MM/YYYY') and dtv < to_timestamp('01/01/2017','DD/MM/YYYY') group by id_prod order by 1 desc) union all (select null, id_mag as id_mag, null, sum(mtv) from ventes_agg where dtv >= to_timestamp('01/01/2016','DD/MM/YYYY') and dtv < to_timestamp('01/01/2017','DD/MM/YYYY') group by id_mag order by 2 desc) union all (select null, null, extract(month from dtv) as month_dtv, sum(mtv) from ventes_agg where dtv >= to_timestamp('01/01/2016','DD/MM/YYYY') and dtv < to_timestamp('01/01/2017','DD/MM/YYYY') group by extract(month from dtv) order by 3 desc ) union all select null, null, null, sum(mtv) from ventes_agg where dtv >= to_timestamp('01/01/2016','DD/MM/YYYY') and dtv < to_timestamp('01/01/2017','DD/MM/YYYY'); id_prod | id_mag | month_dtv | sum ---------+--------+-----------+------------- 100 | | | 240839462 ... 1 | | | 236012275 | 100 | | 240659278 ... | 1 | | 237332255 | | 12 | 2008201606 ... | | 1 | 2005266721 | | | 23739001011 (213 lignes) 1ere exe : Time: 1829,908 ms (00:01,830) 2eme exe : Time: 1837,385 ms (00:01,837) 3eme exe : Time: 1831,847 ms (00:01,832)

Améliorer la performance ?

      M.Jacques a procédé de manière très classique avec des UNION ALL. M.Tadiot ne le remercie pas. Il avait dit RAPIDEMENT et il trouve le temps un peu long lorsqu’il exécute la requête. De plus l’exécution de la requête en parallèle consomme beaucoup de ressources.
      M.Jacques tente d’améliorer les choses. Il décide d’éviter de balayer 4 fois la table VENTES_AGG avec une WITH query ou CTE (common table expression) :

explain with ventes_agg_agg as (select id_prod, id_mag, extract(month from dtv) month_dtv, sum(mtv) sum_mtv from ventes_agg where (dtv >= to_timestamp('01/01/2016','DD/MM/YYYY') and dtv < to_timestamp('01/01/2017','DD/MM/YYYY')) group by id_prod, id_mag, extract(month from dtv)) (select id_prod, null::smallint as id_mag, null::smallint as month_dtv, sum(sum_mtv) from ventes_agg_agg group by id_prod order by 1 desc) union all (select null, id_mag, null, sum(sum_mtv) from ventes_agg_agg group by id_mag order by 2 desc) union all (select null, null, month_dtv, sum(sum_mtv) from ventes_agg_agg group by month_dtv order by 3 desc ) union all select null, null, null, sum(sum_mtv) from ventes_agg_agg ; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------------------------- Append (cost=112058.96..150250.40 rows=601 width=44) CTE ventes_agg_agg -> GroupAggregate (cost=85732.51..98890.66 rows=526326 width=20) Group Key: ventes_agg.id_prod, ventes_agg.id_mag, (date_part('month'::text, ventes_agg.dtv)) -> Sort (cost=85732.51..87048.33 rows=526326 width=16) Sort Key: ventes_agg.id_prod, ventes_agg.id_mag, (date_part('month'::text, ventes_agg.dtv)) -> Seq Scan on ventes_agg (cost=0.00..26721.81 rows=526326 width=16) Filter: ((dtv >= to_timestamp('01/01/2016'::text, 'DD/MM/YYYY'::text)) AND (dtv < to_timestamp('01/01/2017'::text, 'DD/MM/YYYY'::text))) -> Result (cost=13168.29..26346.59 rows=400 width=44) -> Append (cost=13168.29..26341.59 rows=400 width=38) -> Sort (cost=13168.29..13168.79 rows=200 width=38) Sort Key: ventes_agg_agg.id_prod DESC -> HashAggregate (cost=13158.15..13160.65 rows=200 width=38) Group Key: ventes_agg_agg.id_prod -> CTE Scan on ventes_agg_agg (cost=0.00..10526.52 rows=526326 width=10) -> Sort (cost=13168.29..13168.79 rows=200 width=38) Sort Key: ventes_agg_agg_1.id_mag DESC -> HashAggregate (cost=13158.15..13160.65 rows=200 width=38) Group Key: ventes_agg_agg_1.id_mag -> CTE Scan on ventes_agg_agg ventes_agg_agg_1 (cost=0.00..10526.52 rows=526326 width=10) -> Sort (cost=13168.29..13168.79 rows=200 width=44) Sort Key: ventes_agg_agg_2.month_dtv DESC -> HashAggregate (cost=13158.15..13160.65 rows=200 width=44) Group Key: ventes_agg_agg_2.month_dtv -> CTE Scan on ventes_agg_agg ventes_agg_agg_2 (cost=0.00..10526.52 rows=526326 width=16) -> Aggregate (cost=11842.34..11842.35 rows=1 width=44) -> CTE Scan on ventes_agg_agg ventes_agg_agg_3 (cost=0.00..10526.52 rows=526326 width=8) (27 lignes) Temps : 47,328 ms with ventes_agg_agg as (select id_prod, id_mag, extract(month from dtv) month_dtv, sum(mtv) sum_mtv from ventes_agg where (dtv >= to_timestamp('01/01/2016','DD/MM/YYYY') and dtv < to_timestamp('01/01/2017','DD/MM/YYYY')) group by id_prod, id_mag, extract(month from dtv)) (select id_prod, null::smallint as id_mag, null::smallint as month_dtv, sum(sum_mtv) from ventes_agg_agg group by id_prod order by 1 desc) union all (select null, id_mag, null, sum(sum_mtv) from ventes_agg_agg group by id_mag order by 2 desc) union all (select null, null, month_dtv, sum(sum_mtv) from ventes_agg_agg group by month_dtv order by 3 desc ) union all select null, null, null, sum(sum_mtv) from ventes_agg_agg ; id_prod | id_mag | month_dtv | sum ---------+--------+-----------+------------- 100 | | | 240839462 ... 1 | | | 236012275 | 100 | | 240659278 ... | 1 | | 237332255 | | 12 | 2008201606 ... | | 1 | 2005266721 | | | 23739001011 (213 lignes) 1ere exe : Time: 1402,041 ms (00:01,402) 2eme exe : Time: 1383,532 ms (00:01,384) 3eme exe : Time: 1405,408 ms (00:01,405)


Et les extensions OLAP de SQL-99 ?

      Moins de 1s5 avec un 1 process alors qu’à l’origine la requête durait 1s8 avec 4 process...pas mal ! Alex Mou, un stagiaire récemment arrivé dans l’entreprise, dit à M.Jacques que son astuce avec la CTE est maligne mais qu’elle ne fonctionnerait pas si des moyennes avaient été demandées et pas des sommes. Il suggère une solution plus générale basée sur GROUPING SETS pour regrouper selon différents critères au sein de la même clause GROUP BY. Cette fonctionnalité standard SQL-99 a été introduite avec PostgreSQL 9.5, démonstration avec PostgreSQL 10 devel :

explain (select id_prod, id_mag, extract(month from dtv) month_dtv, sum(mtv) from ventes_agg where dtv >= to_timestamp('01/01/2016','DD/MM/YYYY') and dtv < to_timestamp('01/01/2017','DD/MM/YYYY') group by grouping sets ((id_prod), (id_mag), (extract(month from dtv)), ()) order by 1 desc nulls last ,2 desc nulls last, 3 desc nulls last); QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------- Sort (cost=462759.65..466708.60 rows=1579581 width=20) Sort Key: id_prod DESC NULLS LAST, id_mag DESC NULLS LAST, (date_part('month'::text, dtv)) DESC NULLS LAST -> GroupAggregate (cost=85732.51..235341.01 rows=1579581 width=20) Group Key: id_prod Group Key: () Sort Key: (date_part('month'::text, dtv)) DESC NULLS LAST Group Key: (date_part('month'::text, dtv)) Sort Key: id_mag DESC NULLS LAST Group Key: id_mag -> Sort (cost=85732.51..87048.33 rows=526326 width=16) Sort Key: id_prod DESC NULLS LAST -> Seq Scan on ventes_agg (cost=0.00..26721.81 rows=526326 width=16) Filter: ((dtv >= to_timestamp('01/01/2016'::text, 'DD/MM/YYYY'::text)) AND (dtv < to_timestamp('01/01/2017'::text, 'DD/MM/YYYY'::text))) (13 lignes) Temps : 0,466 ms (select id_prod, id_mag, extract(month from dtv) month_dtv, sum(mtv) from ventes_agg where dtv >= to_timestamp('01/01/2016','DD/MM/YYYY') and dtv < to_timestamp('01/01/2017','DD/MM/YYYY') group by grouping sets ((id_prod), (id_mag), (extract(month from dtv)), ()) order by 1 desc nulls last ,2 desc nulls last, 3 desc nulls last); 1ere exe : Time: 1473,198 ms (00:01,473) 2eme exe : Time: 1480,931 ms (00:01,481) 3eme exe : Time: 1473,969 ms (00:01,474)

      Malgré des estimations de coût très éloignées de la réalité la requête avec GROUPING SETS est efficace à l’exécution. Elle n’entraîne notamment aussi qu’un balayage de la table VENTES_AGG. Elle est certes ici un tout petit peu moins performante que la requête avec la CTE mais son champ d’application est plus large et elle reste bien plus rapide que la requête originale. Si vous disposez d’une version récente de PostgreSQL 10, un "commit" du 27/03 améliore de toute façon encore les choses :

explain (select id_prod, id_mag, extract(month from dtv) month_dtv, sum(mtv) from ventes_agg where dtv >= to_timestamp('01/01/2016','DD/MM/YYYY') and dtv < to_timestamp('01/01/2017','DD/MM/YYYY') group by grouping sets ((id_prod), (id_mag), (extract(month from dtv)), ()) order by 1 desc nulls last ,2 desc nulls last, 3 desc nulls last); QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------- Sort (cost=223282.47..224602.52 rows=528019 width=20) Sort Key: id_prod DESC NULLS LAST, id_mag DESC NULLS LAST, (date_part('month'::text, dtv)) DESC NULLS LAST -> MixedAggregate (cost=85916.80..162264.66 rows=528019 width=20) Hash Key: id_mag Group Key: id_prod Group Key: () Sort Key: (date_part('month'::text, dtv)) DESC NULLS LAST Group Key: (date_part('month'::text, dtv)) -> Sort (cost=85916.80..87236.35 rows=527818 width=16) Sort Key: id_prod DESC NULLS LAST -> Seq Scan on ventes_agg (cost=0.00..26725.54 rows=527818 width=16) Filter: ((dtv >= to_timestamp('01/01/2016'::text, 'DD/MM/YYYY'::text)) AND (dtv < to_timestamp('01/01/2017'::text, 'DD/MM/YYYY'::text))) (12 lignes) (select id_prod, id_mag, extract(month from dtv) month_dtv, sum(mtv) from ventes_agg where dtv >= to_timestamp('01/01/2016','DD/MM/YYYY') and dtv < to_timestamp('01/01/2017','DD/MM/YYYY') group by grouping sets ((id_prod), (id_mag), (extract(month from dtv)), ()) order by 1 desc nulls last ,2 desc nulls last, 3 desc nulls last); 1ere exe : Time: 1275,820 ms (00:01,276) 2eme exe : Time: 1248,420 ms (00:01,248) 3eme exe : Time: 1265,999 ms (00:01,266)

      BINGO ! Notez l’évolution du coût estimé et le fait que cette fois le temps d’exécution est inférieur à celui obtenu avec la CTE. Cette fonctionnalitê SQL-99 de regroupement selon divers critêres dans une même clause GROUP BY est donc bien utilisable et de plus en plus efficace avec PostgreSQL.

Mise à jour : 28/03/2017