Regrouper selon plusieurs critères
(sujet préalablement traité avec les versions 9.5 et 9.6)
Un besoin simple couvert par SQL-92
Un jour M.Tadiot, un chef, demande à son développeur senior, M.Jacques, de lui fournir RAPIDEMENT sur lannée 2016 le montant des ventes par produit, pour chacun des magasins, par mois et le total.
Dans cet exemple nous allons considérer une table VENTES_AGG contenant le montant total des ventes en provenance de diverses sources. Nous avons pour le produit ID_PROD, dans le magasin ID_MAG, réalisé le montant de vente MTV dans la minute suivant le temps stocké dans la colonne DTV.
M.Jacques se lance dans lopération avec une requête "à lancienne" :
SELECT version();
version
----------------------------------------------------------------------------------------------------
PostgreSQL 10devel on x86_64-pc-linux-gnu, compiled by gcc (Debian 6.3.0-8) 6.3.0 20170221, 64-bit
(1 ligne)
Temps : 0,191 ms
show max_parallel_workers_per_gather;
max_parallel_workers_per_gather
---------------------------------
3
(1 ligne)
Temps : 0,142 ms
create unlogged table ventes_agg(dtv timestamp, id_prod smallint, id_mag smallint, mtv integer);
CREATE TABLE
Temps : 100,616 ms
WITH serie_1000000(i) AS (SELECT generate_series(1000000,1,-1)) INSERT INTO ventes_agg (SELECT current_timestamp - (i || ' minutes')::interval, trunc(random() * 100 + 1), trunc(random() * 100 + 1), trunc(random() * 90000 + 1) FROM serie_1000000);
INSERT 0 1000000
Time: 1589,322 ms (00:01,589)
\d+
Liste des relations
Schéma | Nom | Type | Propriétaire | Taille | Description
--------+------------+-------+--------------+--------+-------------
public | ventes_agg | table | postgres | 42 MB |
(1 ligne)
ANALYZE ventes_agg;
ANALYZE
Temps : 200,600 ms
explain
(select id_prod, null::smallint as id_mag, null::smallint as month_dtv, sum(mtv)
from ventes_agg
where dtv >= to_timestamp('01/01/2016','DD/MM/YYYY') and dtv < to_timestamp('01/01/2017','DD/MM/YYYY')
group by id_prod order by 1 desc)
union all
(select null, id_mag as id_mag, null, sum(mtv)
from ventes_agg
where dtv >= to_timestamp('01/01/2016','DD/MM/YYYY') and dtv < to_timestamp('01/01/2017','DD/MM/YYYY')
group by id_mag order by 2 desc)
union all
(select null, null, extract(month from dtv) as month_dtv, sum(mtv)
from ventes_agg
where dtv >= to_timestamp('01/01/2016','DD/MM/YYYY') and dtv < to_timestamp('01/01/2017','DD/MM/YYYY')
group by extract(month from dtv) order by 3 desc )
union all
select null, null, null, sum(mtv)
from ventes_agg
where dtv >= to_timestamp('01/01/2016','DD/MM/YYYY') and dtv < to_timestamp('01/01/2017','DD/MM/YYYY');
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Append (cost=15864.49..148548.59 rows=526527 width=20)
-> Result (cost=15864.49..31738.47 rows=200 width=20)
-> Append (cost=15864.49..31735.97 rows=200 width=14)
-> Finalize GroupAggregate (cost=15864.49..15866.99 rows=100 width=14)
Group Key: ventes_agg.id_prod
-> Sort (cost=15864.49..15864.99 rows=200 width=10)
Sort Key: ventes_agg.id_prod DESC
-> Gather (cost=15835.84..15856.84 rows=200 width=10)
Workers Planned: 2
-> Partial HashAggregate (cost=14835.84..14836.84 rows=100 width=10)
Group Key: ventes_agg.id_prod
-> Parallel Seq Scan on ventes_agg (cost=0.00..13739.33 rows=219302 width=6)
Filter: ((dtv >= to_timestamp('01/01/2016'::text, 'DD/MM/YYYY'::text)) AND (dtv < to_timestamp('01/01/2017'::text, 'DD/MM/YYYY'::text)))
-> Finalize GroupAggregate (cost=15864.49..15866.99 rows=100 width=14)
Group Key: ventes_agg_1.id_mag
-> Sort (cost=15864.49..15864.99 rows=200 width=10)
Sort Key: ventes_agg_1.id_mag DESC
-> Gather (cost=15835.84..15856.84 rows=200 width=10)
Workers Planned: 2
-> Partial HashAggregate (cost=14835.84..14836.84 rows=100 width=10)
Group Key: ventes_agg_1.id_mag
-> Parallel Seq Scan on ventes_agg ventes_agg_1 (cost=0.00..13739.33 rows=219302 width=6)
Filter: ((dtv >= to_timestamp('01/01/2016'::text, 'DD/MM/YYYY'::text)) AND (dtv < to_timestamp('01/01/2017'::text, 'DD/MM/YYYY'::text)))
-> GroupAggregate (cost=85732.51..96259.03 rows=526326 width=20)
Group Key: (date_part('month'::text, ventes_agg_2.dtv))
-> Sort (cost=85732.51..87048.33 rows=526326 width=12)
Sort Key: (date_part('month'::text, ventes_agg_2.dtv)) DESC
-> Seq Scan on ventes_agg ventes_agg_2 (cost=0.00..26721.81 rows=526326 width=12)
Filter: ((dtv >= to_timestamp('01/01/2016'::text, 'DD/MM/YYYY'::text)) AND (dtv < to_timestamp('01/01/2017'::text, 'DD/MM/YYYY'::text)))
-> Finalize Aggregate (cost=15287.80..15287.81 rows=1 width=20)
-> Gather (cost=15287.59..15287.80 rows=2 width=8)
Workers Planned: 2
-> Partial Aggregate (cost=14287.59..14287.60 rows=1 width=8)
-> Parallel Seq Scan on ventes_agg ventes_agg_3 (cost=0.00..13739.33 rows=219302 width=4)
Filter: ((dtv >= to_timestamp('01/01/2016'::text, 'DD/MM/YYYY'::text)) AND (dtv < to_timestamp('01/01/2017'::text, 'DD/MM/YYYY'::text)))
(35 lignes)
Temps : 0,814 ms
(select id_prod, null::smallint as id_mag, null::smallint as month_dtv, sum(mtv)
from ventes_agg
where dtv >= to_timestamp('01/01/2016','DD/MM/YYYY') and dtv < to_timestamp('01/01/2017','DD/MM/YYYY')
group by id_prod
order by 1 desc)
union all
(select null, id_mag as id_mag, null, sum(mtv)
from ventes_agg
where dtv >= to_timestamp('01/01/2016','DD/MM/YYYY') and dtv < to_timestamp('01/01/2017','DD/MM/YYYY')
group by id_mag
order by 2 desc)
union all
(select null, null, extract(month from dtv) as month_dtv, sum(mtv)
from ventes_agg
where dtv >= to_timestamp('01/01/2016','DD/MM/YYYY') and dtv < to_timestamp('01/01/2017','DD/MM/YYYY')
group by extract(month from dtv)
order by 3 desc )
union all
select null, null, null, sum(mtv)
from ventes_agg
where dtv >= to_timestamp('01/01/2016','DD/MM/YYYY') and dtv < to_timestamp('01/01/2017','DD/MM/YYYY');
id_prod | id_mag | month_dtv | sum
---------+--------+-----------+-------------
100 | | | 240839462
...
1 | | | 236012275
| 100 | | 240659278
...
| 1 | | 237332255
| | 12 | 2008201606
...
| | 1 | 2005266721
| | | 23739001011
(213 lignes)
1ere exe : Time: 1829,908 ms (00:01,830)
2eme exe : Time: 1837,385 ms (00:01,837)
3eme exe : Time: 1831,847 ms (00:01,832)
Améliorer la performance ?
M.Jacques a procédé de manière très classique avec des UNION ALL. M.Tadiot ne le remercie pas. Il avait dit RAPIDEMENT et il trouve le temps un peu long lorsquil exécute la requête. De plus lexécution de la requête en parallèle consomme beaucoup de ressources.
M.Jacques tente daméliorer les choses. Il décide déviter de balayer 4 fois la table VENTES_AGG avec une WITH query ou CTE (common table expression) :
explain
with ventes_agg_agg as
(select id_prod, id_mag, extract(month from dtv) month_dtv, sum(mtv) sum_mtv
from ventes_agg
where (dtv >= to_timestamp('01/01/2016','DD/MM/YYYY') and dtv < to_timestamp('01/01/2017','DD/MM/YYYY'))
group by id_prod, id_mag, extract(month from dtv))
(select id_prod, null::smallint as id_mag, null::smallint as month_dtv, sum(sum_mtv)
from ventes_agg_agg
group by id_prod
order by 1 desc)
union all
(select null, id_mag, null, sum(sum_mtv)
from ventes_agg_agg
group by id_mag
order by 2 desc)
union all
(select null, null, month_dtv, sum(sum_mtv)
from ventes_agg_agg
group by month_dtv
order by 3 desc )
union all
select null, null, null, sum(sum_mtv) from ventes_agg_agg ;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Append (cost=112058.96..150250.40 rows=601 width=44)
CTE ventes_agg_agg
-> GroupAggregate (cost=85732.51..98890.66 rows=526326 width=20)
Group Key: ventes_agg.id_prod, ventes_agg.id_mag, (date_part('month'::text, ventes_agg.dtv))
-> Sort (cost=85732.51..87048.33 rows=526326 width=16)
Sort Key: ventes_agg.id_prod, ventes_agg.id_mag, (date_part('month'::text, ventes_agg.dtv))
-> Seq Scan on ventes_agg (cost=0.00..26721.81 rows=526326 width=16)
Filter: ((dtv >= to_timestamp('01/01/2016'::text, 'DD/MM/YYYY'::text)) AND (dtv < to_timestamp('01/01/2017'::text, 'DD/MM/YYYY'::text)))
-> Result (cost=13168.29..26346.59 rows=400 width=44)
-> Append (cost=13168.29..26341.59 rows=400 width=38)
-> Sort (cost=13168.29..13168.79 rows=200 width=38)
Sort Key: ventes_agg_agg.id_prod DESC
-> HashAggregate (cost=13158.15..13160.65 rows=200 width=38)
Group Key: ventes_agg_agg.id_prod
-> CTE Scan on ventes_agg_agg (cost=0.00..10526.52 rows=526326 width=10)
-> Sort (cost=13168.29..13168.79 rows=200 width=38)
Sort Key: ventes_agg_agg_1.id_mag DESC
-> HashAggregate (cost=13158.15..13160.65 rows=200 width=38)
Group Key: ventes_agg_agg_1.id_mag
-> CTE Scan on ventes_agg_agg ventes_agg_agg_1 (cost=0.00..10526.52 rows=526326 width=10)
-> Sort (cost=13168.29..13168.79 rows=200 width=44)
Sort Key: ventes_agg_agg_2.month_dtv DESC
-> HashAggregate (cost=13158.15..13160.65 rows=200 width=44)
Group Key: ventes_agg_agg_2.month_dtv
-> CTE Scan on ventes_agg_agg ventes_agg_agg_2 (cost=0.00..10526.52 rows=526326 width=16)
-> Aggregate (cost=11842.34..11842.35 rows=1 width=44)
-> CTE Scan on ventes_agg_agg ventes_agg_agg_3 (cost=0.00..10526.52 rows=526326 width=8)
(27 lignes)
Temps : 47,328 ms
with ventes_agg_agg as
(select id_prod, id_mag, extract(month from dtv) month_dtv, sum(mtv) sum_mtv
from ventes_agg
where (dtv >= to_timestamp('01/01/2016','DD/MM/YYYY') and dtv < to_timestamp('01/01/2017','DD/MM/YYYY'))
group by id_prod, id_mag, extract(month from dtv))
(select id_prod, null::smallint as id_mag, null::smallint as month_dtv, sum(sum_mtv)
from ventes_agg_agg
group by id_prod
order by 1 desc)
union all
(select null, id_mag, null, sum(sum_mtv)
from ventes_agg_agg
group by id_mag
order by 2 desc)
union all
(select null, null, month_dtv, sum(sum_mtv)
from ventes_agg_agg
group by month_dtv
order by 3 desc )
union all
select null, null, null, sum(sum_mtv) from ventes_agg_agg ;
id_prod | id_mag | month_dtv | sum
---------+--------+-----------+-------------
100 | | | 240839462
...
1 | | | 236012275
| 100 | | 240659278
...
| 1 | | 237332255
| | 12 | 2008201606
...
| | 1 | 2005266721
| | | 23739001011
(213 lignes)
1ere exe : Time: 1402,041 ms (00:01,402)
2eme exe : Time: 1383,532 ms (00:01,384)
3eme exe : Time: 1405,408 ms (00:01,405)
Et les extensions OLAP de SQL-99 ?
Moins de 1s5 avec un 1 process alors quà lorigine la requête durait 1s8 avec 4 process...pas mal ! Alex Mou, un stagiaire récemment arrivé dans lentreprise, dit à M.Jacques que son astuce avec la CTE est maligne mais quelle ne fonctionnerait pas si des moyennes avaient été demandées et pas des sommes. Il suggère une solution plus générale basée sur GROUPING SETS pour regrouper selon différents critères au sein de la même clause GROUP BY. Cette fonctionnalité standard SQL-99 a été introduite avec PostgreSQL 9.5, démonstration avec PostgreSQL 10 devel :
explain
(select id_prod, id_mag, extract(month from dtv) month_dtv, sum(mtv)
from ventes_agg
where dtv >= to_timestamp('01/01/2016','DD/MM/YYYY') and dtv < to_timestamp('01/01/2017','DD/MM/YYYY')
group by grouping sets ((id_prod), (id_mag), (extract(month from dtv)), ())
order by 1 desc nulls last ,2 desc nulls last, 3 desc nulls last);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=462759.65..466708.60 rows=1579581 width=20)
Sort Key: id_prod DESC NULLS LAST, id_mag DESC NULLS LAST, (date_part('month'::text, dtv)) DESC NULLS LAST
-> GroupAggregate (cost=85732.51..235341.01 rows=1579581 width=20)
Group Key: id_prod
Group Key: ()
Sort Key: (date_part('month'::text, dtv)) DESC NULLS LAST
Group Key: (date_part('month'::text, dtv))
Sort Key: id_mag DESC NULLS LAST
Group Key: id_mag
-> Sort (cost=85732.51..87048.33 rows=526326 width=16)
Sort Key: id_prod DESC NULLS LAST
-> Seq Scan on ventes_agg (cost=0.00..26721.81 rows=526326 width=16)
Filter: ((dtv >= to_timestamp('01/01/2016'::text, 'DD/MM/YYYY'::text)) AND (dtv < to_timestamp('01/01/2017'::text, 'DD/MM/YYYY'::text)))
(13 lignes)
Temps : 0,466 ms
(select id_prod, id_mag, extract(month from dtv) month_dtv, sum(mtv)
from ventes_agg
where dtv >= to_timestamp('01/01/2016','DD/MM/YYYY') and dtv < to_timestamp('01/01/2017','DD/MM/YYYY')
group by grouping sets ((id_prod), (id_mag), (extract(month from dtv)), ())
order by 1 desc nulls last ,2 desc nulls last, 3 desc nulls last);
1ere exe : Time: 1473,198 ms (00:01,473)
2eme exe : Time: 1480,931 ms (00:01,481)
3eme exe : Time: 1473,969 ms (00:01,474)
Malgré des estimations de coût très éloignées de la réalité la requête avec GROUPING SETS est efficace à lexécution. Elle nentraîne notamment aussi quun balayage de la table VENTES_AGG. Elle est certes ici un tout petit peu moins performante que la requête avec la CTE mais son champ dapplication est plus large et elle reste bien plus rapide que la requête originale. Si vous disposez dune version récente de PostgreSQL 10, un "commit" du 27/03 améliore de toute façon encore les choses :
explain (select id_prod, id_mag, extract(month from dtv) month_dtv, sum(mtv) from ventes_agg where dtv >= to_timestamp('01/01/2016','DD/MM/YYYY') and dtv < to_timestamp('01/01/2017','DD/MM/YYYY') group by grouping sets ((id_prod), (id_mag), (extract(month from dtv)), ()) order by 1 desc nulls last ,2 desc nulls last, 3 desc nulls last);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=223282.47..224602.52 rows=528019 width=20)
Sort Key: id_prod DESC NULLS LAST, id_mag DESC NULLS LAST, (date_part('month'::text, dtv)) DESC NULLS LAST
-> MixedAggregate (cost=85916.80..162264.66 rows=528019 width=20)
Hash Key: id_mag
Group Key: id_prod
Group Key: ()
Sort Key: (date_part('month'::text, dtv)) DESC NULLS LAST
Group Key: (date_part('month'::text, dtv))
-> Sort (cost=85916.80..87236.35 rows=527818 width=16)
Sort Key: id_prod DESC NULLS LAST
-> Seq Scan on ventes_agg (cost=0.00..26725.54 rows=527818 width=16)
Filter: ((dtv >= to_timestamp('01/01/2016'::text, 'DD/MM/YYYY'::text)) AND (dtv < to_timestamp('01/01/2017'::text, 'DD/MM/YYYY'::text)))
(12 lignes)
(select id_prod, id_mag, extract(month from dtv) month_dtv, sum(mtv)
from ventes_agg
where dtv >= to_timestamp('01/01/2016','DD/MM/YYYY') and dtv < to_timestamp('01/01/2017','DD/MM/YYYY')
group by grouping sets ((id_prod), (id_mag), (extract(month from dtv)), ())
order by 1 desc nulls last ,2 desc nulls last, 3 desc nulls last);
1ere exe : Time: 1275,820 ms (00:01,276)
2eme exe : Time: 1248,420 ms (00:01,248)
3eme exe : Time: 1265,999 ms (00:01,266)
BINGO ! Notez lévolution du coût estimé et le fait que cette fois le temps dexécution est inférieur à celui obtenu avec la CTE. Cette fonctionnalitê SQL-99 de regroupement selon divers critêres dans une même clause GROUP BY est donc bien utilisable et de plus en plus efficace avec PostgreSQL.
Mise à jour : 28/03/2017