Grouping sets

Regrouper selon plusieurs critères

(sujet préalablement traité avec la version 9.5 puis mis à jour aveec la version 10 devel)

Un besoin simple couvert par SQL-92

      Un jour M.Tadiot, un chef, demande à son développeur senior, M.Jacques, de lui fournir RAPIDEMENT sur l’année 2015 le montant des ventes par produit, pour chacun des magasins, par mois et le total.
      Dans cet exemple nous allons considérer une table VENTES_AGG contenant le montant total des ventes en provenance de diverses sources. Nous avons pour le produit ID_PROD, dans le magasin ID_MAG, réalisé le montant de vente MTV dans la minute suivant le temps stocké dans la colonne DTV.
      M.Jacques se lance dans l’opération avec une requête "à l’ancienne" :

SELECT version(); version ------------------------------------------------------------------------------------------ PostgreSQL 9.6.1 on x86_64-pc-linux-gnu, compiled by gcc (Debian 4.9.2-10) 4.9.2, 64-bit (1 ligne) create unlogged table ventes_agg(dtv timestamp, id_prod smallint, id_mag smallint, mtv integer); CREATE TABLE show shared_buffers; shared_buffers ---------------- 2GB (1 ligne) show work_mem; work_mem ---------- 64MB (1 ligne) WITH serie_1000000(i) AS (SELECT generate_series(1000000,1,-1)) INSERT INTO ventes_agg (SELECT current_timestamp - (i || ' minutes')::interval, trunc(random() * 100 + 1), trunc(random() * 100 + 1), trunc(random() * 90000 + 1) FROM serie_1000000); INSERT 0 1000000 Temps : 5013,398 ms ANALYZE VENTES_AGG; ANALYZE Temps : 148,111 ms explain (select id_prod, null::smallint as id_mag, null::smallint as month_dtv, sum(mtv) from ventes_agg where dtv >= to_timestamp('01/01/2015','DD/MM/YYYY') and dtv < to_timestamp('01/01/2016','DD/MM/YYYY') group by id_prod order by 1 desc) union all (select null, id_mag as id_mag, null, sum(mtv) from ventes_agg where dtv >= to_timestamp('01/01/2015','DD/MM/YYYY') and dtv < to_timestamp('01/01/2016','DD/MM/YYYY') group by id_mag order by 2 desc) union all (select null, null, extract(month from dtv) as month_dtv, sum(mtv) from ventes_agg where dtv >= to_timestamp('01/01/2015','DD/MM/YYYY') and dtv < to_timestamp('01/01/2016','DD/MM/YYYY') group by extract(month from dtv) order by 3 desc ) union all select null, null, null, sum(mtv) from ventes_agg where dtv >= to_timestamp('01/01/2015','DD/MM/YYYY') and dtv < to_timestamp('01/01/2016','DD/MM/YYYY'); QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Append (cost=28011.80..174450.27 rows=520496 width=20) -> Result (cost=28011.80..56028.59 rows=200 width=20) -> Append (cost=28011.80..56026.09 rows=200 width=14) -> Subquery Scan on "*SELECT* 1" (cost=28011.80..28013.05 rows=100 width=14) -> Sort (cost=28011.80..28012.05 rows=100 width=14) Sort Key: ventes_agg.id_prod DESC -> HashAggregate (cost=28007.47..28008.47 rows=100 width=14) Group Key: ventes_agg.id_prod -> Seq Scan on ventes_agg (cost=0.00..25406.00 rows=520295 width=6) Filter: ((dtv >= to_timestamp('01/01/2015'::text, 'DD/MM/YYYY'::text)) AND (dtv < to_timestamp('01/01/2016'::text, 'DD/MM/YYYY'::text))) -> Subquery Scan on "*SELECT* 2" (cost=28011.80..28013.05 rows=100 width=14) -> Sort (cost=28011.80..28012.05 rows=100 width=14) Sort Key: ventes_agg_1.id_mag DESC -> HashAggregate (cost=28007.47..28008.47 rows=100 width=14) Group Key: ventes_agg_1.id_mag -> Seq Scan on ventes_agg ventes_agg_1 (cost=0.00..25406.00 rows=520295 width=6) Filter: ((dtv >= to_timestamp('01/01/2015'::text, 'DD/MM/YYYY'::text)) AND (dtv < to_timestamp('01/01/2016'::text, 'DD/MM/YYYY'::text))) -> GroupAggregate (cost=76106.07..86511.97 rows=520295 width=20) Group Key: (date_part('month'::text, ventes_agg_2.dtv)) -> Sort (cost=76106.07..77406.81 rows=520295 width=12) Sort Key: (date_part('month'::text, ventes_agg_2.dtv)) DESC -> Seq Scan on ventes_agg ventes_agg_2 (cost=0.00..26706.74 rows=520295 width=12) Filter: ((dtv >= to_timestamp('01/01/2015'::text, 'DD/MM/YYYY'::text)) AND (dtv < to_timestamp('01/01/2016'::text, 'DD/MM/YYYY'::text))) -> Aggregate (cost=26706.74..26706.75 rows=1 width=20) -> Seq Scan on ventes_agg ventes_agg_3 (cost=0.00..25406.00 rows=520295 width=4) Filter: ((dtv >= to_timestamp('01/01/2015'::text, 'DD/MM/YYYY'::text)) AND (dtv < to_timestamp('01/01/2016'::text, 'DD/MM/YYYY'::text))) (26 lignes) Temps : 7,631 ms (select id_prod, null::smallint as id_mag, null::smallint as month_dtv, sum(mtv) from ventes_agg where dtv >= to_timestamp('01/01/2015','DD/MM/YYYY') and dtv < to_timestamp('01/01/2016','DD/MM/YYYY') group by id_prod order by 1 desc) union all (select null, id_mag as id_mag, null, sum(mtv) from ventes_agg where dtv >= to_timestamp('01/01/2015','DD/MM/YYYY') and dtv < to_timestamp('01/01/2016','DD/MM/YYYY') group by id_mag order by 2 desc) union all (select null, null, extract(month from dtv) as month_dtv, sum(mtv) from ventes_agg where dtv >= to_timestamp('01/01/2015','DD/MM/YYYY') and dtv < to_timestamp('01/01/2016','DD/MM/YYYY') group by extract(month from dtv) order by 3 desc ) union all select null, null, null, sum(mtv) from ventes_agg where dtv >= to_timestamp('01/01/2015','DD/MM/YYYY') and dtv < to_timestamp('01/01/2016','DD/MM/YYYY'); id_prod | id_mag | month_dtv | sum ---------+--------+-----------+------------- 100 | | | 234388923 .. 1 | | | 232510807 | 100 | | 229662843 .. | 1 | | 235714229 | | 12 | 2003411842 .. | | 1 | 1733544479 | | | 23413583856 (213 lignes) Temps : 8517,258 ms Temps : 8539,567 ms Temps : 8555,327 ms

Améliorer la performance ?

M.Jacques a procédé de manière très classique avec des UNION ALL. M.Tadiot ne le remercie pas. Il avait dit RAPIDEMENT et il trouve le temps long lorsqu’il exécute la requête.
M.Jacques tente d’améliorer les choses. Il décide d’éviter de balayer 4 fois la table VENTES_AGG avec une WITH query ou CTE (common table expression) :

explain with ventes_agg_agg as (select id_prod, id_mag, extract(month from dtv) month_dtv, sum(mtv) sum_mtv from ventes_agg where (dtv >= to_timestamp('01/01/2015','DD/MM/YYYY') and dtv < to_timestamp('01/01/2016','DD/MM/YYYY')) group by id_prod, id_mag, extract(month from dtv)) (select id_prod, null::smallint as id_mag, null::smallint as month_dtv, sum(sum_mtv) from ventes_agg_agg group by id_prod order by 1 desc) union all (select null, id_mag, null, sum(sum_mtv) from ventes_agg_agg group by id_mag order by 2 desc) union all (select null, null, month_dtv, sum(sum_mtv) from ventes_agg_agg group by month_dtv order by 3 desc ) union all select null, null, null, sum(sum_mtv) from ventes_agg_agg ; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------------------- Append (cost=51430.89..89185.09 rows=601 width=44) CTE ventes_agg_agg -> HashAggregate (cost=31909.69..38413.38 rows=520295 width=20) Group Key: ventes_agg.id_prod, ventes_agg.id_mag, date_part('month'::text, ventes_agg.dtv) -> Seq Scan on ventes_agg (cost=0.00..26706.74 rows=520295 width=16) Filter: ((dtv >= to_timestamp('01/01/2015'::text, 'DD/MM/YYYY'::text)) AND (dtv < to_timestamp('01/01/2016'::text, 'DD/MM/YYYY'::text))) -> Result (cost=13017.52..26045.04 rows=400 width=44) -> Append (cost=13017.52..26040.04 rows=400 width=38) -> Subquery Scan on "*SELECT* 1" (cost=13017.52..13020.02 rows=200 width=38) -> Sort (cost=13017.52..13018.02 rows=200 width=38) Sort Key: ventes_agg_agg.id_prod DESC -> HashAggregate (cost=13007.37..13009.87 rows=200 width=38) Group Key: ventes_agg_agg.id_prod -> CTE Scan on ventes_agg_agg (cost=0.00..10405.90 rows=520295 width=10) -> Subquery Scan on "*SELECT* 2" (cost=13017.52..13020.02 rows=200 width=38) -> Sort (cost=13017.52..13018.02 rows=200 width=38) Sort Key: ventes_agg_agg_1.id_mag DESC -> HashAggregate (cost=13007.37..13009.87 rows=200 width=38) Group Key: ventes_agg_agg_1.id_mag -> CTE Scan on ventes_agg_agg ventes_agg_agg_1 (cost=0.00..10405.90 rows=520295 width=10) -> Subquery Scan on "*SELECT* 3" (cost=13017.52..13020.02 rows=200 width=44) -> Sort (cost=13017.52..13018.02 rows=200 width=44) Sort Key: ventes_agg_agg_2.month_dtv DESC -> HashAggregate (cost=13007.37..13009.87 rows=200 width=44) Group Key: ventes_agg_agg_2.month_dtv -> CTE Scan on ventes_agg_agg ventes_agg_agg_2 (cost=0.00..10405.90 rows=520295 width=16) -> Aggregate (cost=11706.64..11706.65 rows=1 width=44) -> CTE Scan on ventes_agg_agg ventes_agg_agg_3 (cost=0.00..10405.90 rows=520295 width=8) (28 lignes) Temps : 3,543 ms (select id_prod, id_mag, extract(month from dtv) month_dtv, sum(mtv) sum_mtv from ventes_agg where (dtv >= to_timestamp('01/11/2015','DD/MM/YYYY') and dtv < to_timestamp('01/01/2016','DD/MM/YYYY')) group by id_prod, id_mag, extract(month from dtv)) (select id_prod, null::smallint as id_mag, null::smallint as month_dtv, sum(sum_mtv) from ventes_agg_agg group by id_prod order by 1 desc) union all (select null, id_mag, null, sum(sum_mtv) from ventes_agg_agg group by id_mag order by 2 desc) union all (select null, null, month_dtv, sum(sum_mtv) from ventes_agg_agg group by month_dtv order by 3 desc ) union all select null, null, null, sum(sum_mtv) from ventes_agg_agg ; id_prod | id_mag | month_dtv | sum ---------+--------+-----------+------------- 100 | | | 234388923 .. 1 | | | 232510807 | 100 | | 229662843 .. | 1 | | 235714229 | | 12 | 2003411842 .. | | 1 | 1733544479 | | | 23413583856 (213 lignes) Temps : 2663,985 ms Temps : 2662,502 ms Temps : 2672,437 ms

Et les extensions OLAP de SQL-99 ?

Moins de 3 secondes pas mal ! Alex Mou, un stagiaire récemment arrivé dans l’entreprise, dit à M.Jacques que son astuce avec la CTE est maligne mais qu’elle ne fonctionnerait pas si des moyennes avaient été demandées et pas des sommes. Il suggère une solution plus générale basée sur GROUPING SETS pour regrouper selon différents critères au sein de la même clause GROUP BY. Cette fonctionnalité standard SQL-99 a été introduite avec PostgreSQL 9.5, démonstration avec PostgreSQL 9.6 :

explain (select id_prod, id_mag, extract(month from dtv) month_dtv, sum(mtv) from ventes_agg where dtv >= to_timestamp('01/01/2015','DD/MM/YYYY') and dtv < to_timestamp('01/01/2016','DD/MM/YYYY') group by grouping sets ((id_prod), (id_mag), (extract(month from dtv)), ()) order by 1 desc nulls last ,2 desc nulls last, 3 desc nulls last); QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------- Sort (cost=398789.07..402692.79 rows=1561488 width=20) Sort Key: id_prod DESC NULLS LAST, id_mag DESC NULLS LAST, (date_part('month'::text, dtv)) DESC NULLS LAST -> GroupAggregate (cost=76106.07..206129.97 rows=1561488 width=20) Group Key: id_prod Group Key: () Sort Key: (date_part('month'::text, dtv)) DESC NULLS LAST Group Key: (date_part('month'::text, dtv)) Sort Key: id_mag DESC NULLS LAST Group Key: id_mag -> Sort (cost=76106.07..77406.81 rows=520295 width=16) Sort Key: id_prod DESC NULLS LAST -> Seq Scan on ventes_agg (cost=0.00..26706.74 rows=520295 width=16) Filter: ((dtv >= to_timestamp('01/01/2015'::text, 'DD/MM/YYYY'::text)) AND (dtv < to_timestamp('01/01/2016'::text, 'DD/MM/YYYY'::text))) (13 lignes) Temps : 1,107 ms (select id_prod, id_mag, extract(month from dtv) month_dtv, sum(mtv) from ventes_agg where dtv >= to_timestamp('01/01/2015','DD/MM/YYYY') and dtv < to_timestamp('01/01/2016','DD/MM/YYYY') group by grouping sets ((id_prod), (id_mag), (extract(month from dtv)), ()) order by 1 desc nulls last ,2 desc nulls last, 3 desc nulls last); id_prod | id_mag | month_dtv | sum ---------+--------+-----------+------------- 100 | | | 234388923 .. 1 | | | 232510807 | 100 | | 229662843 .. | 1 | | 235714229 | | 12 | 2003411842 .. | | 1 | 1733544479 | | | 23413583856 (213 lignes) Temps : 3313,446 ms Temps : 3296,236 ms Temps : 3281,949 ms

Malgré des estimations de coût très éloignées de la réalité la requête avec GROUPING SETS est efficace à l’exécution. Elle n’entraîne notamment aussi qu’un balayage de la table VENTES_AGG. Elle est certes un peu moins performante que la requête avec la CTE mais son champ d’application est plus large et elle reste bien plus rapide que la requête originale. Cette fonctionnalitê SQL-99 de regroupement selon divers critêres dans une même clause GROUP BY est donc bien utilisable avec PostgreSQL.

Regrouper selon plusieurs critères

(sujet préalablement traité avec la version 9.5 puis mis à jour aveec la version 10 devel)

Mise à jour : 04/12/2016