Parallélisme

(sujet mis à jour avec les versions 9.6 - 10 devel - 10 beta)

Diviser pour mieux régner

      Dans cette page présentant les index BRIN et le partitionnement nous avons vu qu’il était souvent possible d’optimiser une requête, souvent en réduisant les pages visitées à l’exécution. Vous avez une valeur ajoutée puisque vous savez quelles données peuvent être archivées etc.
      Cette fois, nous allons prendre pour exemple une table VENTES_AGG contenant les ventes par seconde de chacun de nos 200 produits sur un an. Pour le produit ID_PROD, le montant de vente MTV a été réalisé dans la seconde suivant le temps stocké dans la colonne DTV.
      M.Gnok vous demande de donner les 3 plus gros MTV cumulés pour une sélection de produits, vous pouvez lui donner les ID_PROD associés mais il s’en fiche donc inutile de gérer les ex æquo.
      Inutile de partitionner sur ID_PROD, la documentation PostgreSQL parle d’une centaine de partitions et nous avons déjà 200 références produit. Un index sur ID_PROD pourrait être utile ? La colonne ID_PROD n’est pas candidate à un index BRIN et même un index B-Tree sera probablement inefficace puisque la requête concerne 20% de nos produits.
      Il ne reste plus que la force brute. Paralléliser n’est PAS optimiser mais parfois c’est la seule solution. Nous allons simplement utiliser plus de ressources matérielles pour traiter la requête. PostgreSQL-XL permettait déjà de paralléliser mais c’est à présent aussi possible avec la version classique de PostgreSQL depuis la 9.6.
      PostgreSQL 9.6 est encore en développement au 5 février 2016 donc il ne s’agit que d’un aperçu, n’utilisez pas une version "devel" en production. Même quand la version sera en production, consultez l’état du système avant d’utiliser cette possibilité. Inutile de se lancer si votre serveur est déjè saturé au niveau CPU, en train de swapper etc. Dans l’exemple qui suit je fixe le paramètre MAX_PARALLEL_DEGREE à 3. Cela permet au process exécutant la requête d’appeler à l’aide 3 travailleurs supplémentaires. Avec les 4 coeurs Xeon dédiés à la machine virtuelle sur laquelle j’exécute le test inutile d’aller plus loin.
      En pratique :

select version(); version --------------------------------------------------------------------------------------------- PostgreSQL 9.6devel on x86_64-pc-linux-gnu, compiled by gcc (Debian 4.9.2-10) 4.9.2, 64-bit (1 row) create table ventes_agg(dtv timestamp, id_prod integer, mtv integer); CREATE TABLE Time: 12,797 ms do $$ declare x real; begin for i in reverse 31536000..1 loop x := random(); if x < 0.1 then insert into ventes_agg values(current_timestamp - (i || ' seconds')::interval, trunc(random() * 200 + 1), trunc(random() * 100 + 1)); end if; if x < 0.2 then insert into ventes_agg values(current_timestamp - (i || ' seconds')::interval, trunc(random() * 200 + 1), trunc(random() * 100 + 1)); end if; if x < 0.3 then insert into ventes_agg values(current_timestamp - (i || ' seconds')::interval, trunc(random() * 200 + 1), trunc(random() * 100 + 1)); end if; if x < 0.4 then insert into ventes_agg values(current_timestamp - (i || ' seconds')::interval, trunc(random() * 200 + 1), trunc(random() * 100 + 1)); end if; if x < 0.5 then insert into ventes_agg values(current_timestamp - (i || ' seconds')::interval, trunc(random() * 200 + 1), trunc(random() * 100 + 1)); end if; if x < 0.6 then insert into ventes_agg values(current_timestamp - (i || ' seconds')::interval, trunc(random() * 200 + 1), trunc(random() * 100 + 1)); end if; if x < 0.7 then insert into ventes_agg values(current_timestamp - (i || ' seconds')::interval, trunc(random() * 200 + 1), trunc(random() * 100 + 1)); end if; if x < 0.8 then insert into ventes_agg values(current_timestamp - (i || ' seconds')::interval, trunc(random() * 200 + 1), trunc(random() * 100 + 1)); end if; if x < 0.9 then insert into ventes_agg values(current_timestamp - (i || ' seconds')::interval, trunc(random() * 200 + 1), trunc(random() * 100 + 1)); end if; end loop; end$$; DO Time: 3266847,593 ms \d+ List of relations Schema | Name | Type | Owner | Size | Description --------+------------+-------+----------+---------+------------- public | ventes_agg | table | postgres | 6000 MB | (1 row) explain select id_prod, sum(mtv) from ventes_agg where id_prod in(5,11,14,21,24,31,36,42,43,51,57,61,68,72,78,81,87,92,94,101,108,110,113,120,122,130,135,141,149,150,157,161,168,170,171,182,183,191,197,199) group by id_prod order by 2 desc fetch first 3 rows only; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=9431439.43..9431439.44 rows=3 width=8) -> Sort (cost=9431439.43..9431439.53 rows=40 width=8) Sort Key: (sum(mtv)) DESC -> HashAggregate (cost=9431438.52..9431438.92 rows=40 width=8) Group Key: id_prod -> Seq Scan on ventes_agg (cost=0.00..9289787.16 rows=28330271 width=8) Filter: (id_prod = ANY ('{5,11,14,21,24,31,36,42,43,51,57,61,68,72,78,81,87,92,94,101,108,110,113,120,122,130,135,141,149,150,157,161,168,170,171,182,183,191,197,199}'::integer[])) select id_prod, sum(mtv) from ventes_agg where id_prod in(5,11,14,21,24,31,36,42,43,51,57,61,68,72,78,81,87,92,94,101,108,110,113,120,122,130,135,141,149,150,157,161,168,170,171,182,183,191,197,199) group by id_prod order by 2 desc fetch first 3 rows only; id_prod | sum ---------+---------- 135 | 35956197 130 | 35926872 108 | 35918015 (3 rows) Time: 98477,908 ms select id_prod, sum(mtv) from ventes_agg where id_prod in(5,11,14,21,24,31,36,42,43,51,57,61,68,72,78,81,87,92,94,101,108,110,113,120,122,130,135,141,149,150,157,161,168,170,171,182,183,191,197,199) group by id_prod order by 2 desc fetch first 3 rows only; id_prod | sum ---------+---------- 135 | 35956197 130 | 35926872 108 | 35918015 (3 rows) Time: 98488,546 ms select id_prod, sum(mtv) from ventes_agg where id_prod in(5,11,14,21,24,31,36,42,43,51,57,61,68,72,78,81,87,92,94,101,108,110,113,120,122,130,135,141,149,150,157,161,168,170,171,182,183,191,197,199) group by id_prod order by 2 desc fetch first 3 rows only; id_prod | sum ---------+---------- 135 | 35956197 130 | 35926872 108 | 35918015 (3 rows) Time: 98487,537 ms create index ventes_agg_i1 on ventes_agg(id_prod); CREATE INDEX Time: 495564,175 ms explain select id_prod, sum(mtv) from ventes_agg where id_prod in(5,11,14,21,24,31,36,42,43,51,57,61,68,72,78,81,87,92,94,101,108,110,113,120,122,130,135,141,149,150,157,161,168,170,171,182,183,191,197,199) group by id_prod order by 2 desc fetch first 3 rows only; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ---- Limit (cost=3111343.61..3111343.62 rows=3 width=8) -> Sort (cost=3111343.61..3111343.71 rows=40 width=8) Sort Key: (sum(mtv)) DESC -> HashAggregate (cost=3111342.69..3111343.09 rows=40 width=8) Group Key: id_prod -> Bitmap Heap Scan on ventes_agg (cost=502126.16..2969691.50 rows=28330239 width=8) Recheck Cond: (id_prod = ANY ('{5,11,14,21,24,31,36,42,43,51,57,61,68,72,78,81,87,92,94,101,108,110,113,120,122,130,135,141,149,150,157,161,168,170,171,182,183,191,197,199}'::integer[])) -> Bitmap Index Scan on ventes_agg_i1 (cost=0.00..495043.60 rows=28330239 width=0) Index Cond: (id_prod = ANY ('{5,11,14,21,24,31,36,42,43,51,57,61,68,72,78,81,87,92,94,101,108,110,113,120,122,130,135,141,149,150,157,161,168,170,171,182,183,191,197,199}'::integer[ ])) (9 rows) Time: 85,901 ms select id_prod, sum(mtv) from ventes_agg where id_prod in(5,11,14,21,24,31,36,42,43,51,57,61,68,72,78,81,87,92,94,101,108,110,113,120,122,130,135,141,149,150,157,161,168,170,171,182,183,191,197,199) group by id_prod order by 2 desc fetch first 3 rows only; id_prod | sum ---------+---------- 135 | 35956197 130 | 35926872 108 | 35918015 (3 rows) Time: 100904,446 ms select id_prod, sum(mtv) from ventes_agg where id_prod in(5,11,14,21,24,31,36,42,43,51,57,61,68,72,78,81,87,92,94,101,108,110,113,120,122,130,135,141,149,150,157,161,168,170,171,182,183,191,197,199) group by id_prod order by 2 desc fetch first 3 rows only; id_prod | sum ---------+---------- 135 | 35956197 130 | 35926872 108 | 35918015 (3 rows) Time: 100920,616 ms select id_prod, sum(mtv) from ventes_agg where id_prod in(5,11,14,21,24,31,36,42,43,51,57,61,68,72,78,81,87,92,94,101,108,110,113,120,122,130,135,141,149,150,157,161,168,170,171,182,183,191,197,199) group by id_prod order by 2 desc fetch first 3 rows only; id_prod | sum ---------+---------- 135 | 35956197 130 | 35926872 108 | 35918015 (3 rows) Time: 100893,226 ms drop index ventes_agg_i1; DROP INDEX Time: 625,719 ms set max_parallel_degree = 3; SET Time: 0,247 ms explain select id_prod, sum(mtv) from ventes_agg where id_prod in(5,11,14,21,24,31,36,42,43,51,57,61,68,72,78,81,87,92,94,101,108,110,113,120,122,130,135,141,149,150,157,161,168,170,171,182,183,191,197,199) group by id_prod order by 2 desc fetch first 3 rows only; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=6492474.29..6492474.30 rows=3 width=8) -> Sort (cost=6492474.29..6492474.39 rows=40 width=8) Sort Key: (sum(mtv)) DESC -> HashAggregate (cost=6492473.38..6492473.78 rows=40 width=8) Group Key: id_prod -> Gather (cost=1000.00..6350822.02 rows=28330271 width=8) Number of Workers: 3 -> Parallel Seq Scan on ventes_agg (cost=0.00..3516794.92 rows=9138797 width=8) Filter: (id_prod = ANY ('{5,11,14,21,24,31,36,42,43,51,57,61,68,72,78,81,87,92,94,101,108,110,113,120,122,130,135,141,149,150,157,161,168,170,171,182,183,191,197,199}'::integer[])) select id_prod, sum(mtv) from ventes_agg where id_prod in(5,11,14,21,24,31,36,42,43,51,57,61,68,72,78,81,87,92,94,101,108,110,113,120,122,130,135,141,149,150,157,161,168,170,171,182,183,191,197,199) group by id_prod order by 2 desc fetch first 3 rows only; id_prod | sum ---------+---------- 135 | 35956197 130 | 35926872 108 | 35918015 (3 rows) Time: 32050,718 ms select id_prod, sum(mtv) from ventes_agg where id_prod in(5,11,14,21,24,31,36,42,43,51,57,61,68,72,78,81,87,92,94,101,108,110,113,120,122,130,135,141,149,150,157,161,168,170,171,182,183,191,197,199) group by id_prod order by 2 desc fetch first 3 rows only; id_prod | sum ---------+---------- 135 | 35956197 130 | 35926872 108 | 35918015 (3 rows) Time: 32075,380 ms select id_prod, sum(mtv) from ventes_agg where id_prod in(5,11,14,21,24,31,36,42,43,51,57,61,68,72,78,81,87,92,94,101,108,110,113,120,122,130,135,141,149,150,157,161,168,170,171,182,183,191,197,199) group by id_prod order by 2 desc fetch first 3 rows only; id_prod | sum ---------+---------- 135 | 35956197 130 | 35926872 108 | 35918015 (3 rows) Time: 31969,106 ms

      La requête prenait 1min38 au départ. L’index B-Tree était inefficace puisque le temps obtenu est passé à 1min41 alors que l’optimiseur le pensait très utile. En le supprimant et en utilisant toutes les ressources de la machine virtuelle grâce au parallélisme le temps a pu être abaissé à 32 secondes. Le parallélisme est à manier avec précaution. Il est coûteux et doit être envisagé en dernier ressort. Mais parfois c’est la seule option pour gagner du temps sur une requête particulière.

Mise à jour : 05/02/2016