[Do not Merge] Investigating an MPI deadlock when compiling expect_and_grad

Vicentini Filippo requested to merge investigation into master

Created by: PhilipVinc

This happens when two different nodes get different shapes for `\sigma_prime` (the connected elements) and THEN the shape changes on only one node, causing jax to recompile the function on that node; it deadlocks while recompiling. A minimal sketch of the mechanism is shown below.
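For illustration, here is a hypothetical minimal reproducer of that mechanism (not code from this MR, and assuming the token-returning `mpi4jax.allreduce` API of the time): a jitted function containing a collective is fed a `sigma_prime` batch whose shape diverges on one rank.

```python
# Hypothetical reproducer sketch -- all names and shapes are illustrative.
import jax
import jax.numpy as jnp
import mpi4jax
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

@jax.jit
def local_estimate(sigma_prime):
    # Per-rank partial sum, followed by a global reduction across all ranks.
    partial = jnp.sum(sigma_prime)
    total, _token = mpi4jax.allreduce(partial, op=MPI.SUM, comm=comm)
    return total

# Step 0: every rank sees the same shape -> one compilation, collectives match.
x = jnp.ones((16, 29, 12))
local_estimate(x)

# Step 1: the number of connected elements changes on rank 0 only. Rank 0
# stalls inside jit recompilation for the new shape while rank 1 runs the
# cached executable and blocks in MPI_Allreduce -- the hang described above.
if rank == 0:
    x = jnp.ones((16, 31, 12))
local_estimate(x)
```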

If anyone wants to help...

```
export MPI4JAX_DEBUG=1
filippovicentini in netket at filippo-workstation on investigation [$!?] via netket_env via 🐍 system took 7s
 mpirun -np 2 python Examples/Ising1d/ising1d.py
r0 | MPI_Bcast -> 0 with 2 items
r1 | MPI_Bcast -> 0 with 2 items
r0 | MPI_Bcast -> 0 with 2 items
r1 | MPI_Bcast -> 0 with 2 items
r0 | MPI_Bcast -> 0 with 2 items
r1 | MPI_Bcast -> 0 with 2 items
r0 | MPI_Bcast -> 0 with 4 items
r1 | MPI_Bcast -> 0 with 4 items
  0%|          | 0/300 [00:00<?, ?it/s] r0 - 0 - reset
 r0 - 0 - reset FLUSHED
 r1 - 0 - reset
 r1 - 0 - reset FLUSHED
 r0| FLUSH4    gradient for (1, 16, 12)->(16, 29, 12)
 r0| computing gradient for (1, 16, 12)->(16, 29, 12)
 r0| compiling grad_expect_hermitian for (1, 16, 12)->(16, 29, 12)
 r0| compiling 1
 r0| compiling 2
 r1| FLUSH4    gradient for (1, 16, 12)->(16, 33, 12)
 r1| computing gradient for (1, 16, 12)->(16, 33, 12)
 r1| compiling grad_expect_hermitian for (1, 16, 12)->(16, 33, 12)
 r1| compiling 1
 r1| compiling 2
 r1 | mpi_sum of (), float64
 r1 | mpi_sum of (), float64
 r1 | mpi_sum of (), float64
 r1 | mpi_sum of (), float64
 r1 | mpi_sum of (), float64
 r1 | mpi_sum of (), float64
 r1 | mpi_sum of (), float64
 r1| compiling 3
 r0 | mpi_sum of (), float64
 r0 | mpi_sum of (), float64
 r0 | mpi_sum of (), float64
 r0 | mpi_sum of (), float64
 r0 | mpi_sum of (), float64
 r0 | mpi_sum of (), float64
 r0 | mpi_sum of (), float64
 r0| compiling 3
 r0| DONE compiling grad_expect_hermitian: sum_inplace for (16, 12)->(16, 29, 12)
 r0 | mpi_sum of (12,), float64
 r0 | mpi_sum of (12, 12), float64
 r0 | mpi_sum of (12,), float64
 r0| DONE compiling grad_expect_hermitian for (16, 12)->(16, 29, 12)
 r1| DONE compiling grad_expect_hermitian: sum_inplace for (16, 12)->(16, 33, 12)
 r1 | mpi_sum of (12,), float64
 r1 | mpi_sum of (12, 12), float64
 r1 | mpi_sum of (12,), float64
 r1| DONE compiling grad_expect_hermitian for (16, 12)->(16, 33, 12)
 r0| DONE               for (1, 16, 12)->(16, 29, 12)
r0 | MPI_Allreduce START with 1 items. 94358810775488->94358810780544
 r1| DONE               for (1, 16, 12)->(16, 33, 12)
r1 | MPI_Allreduce START with 1 items. 94628771067904->94628778597888
r0 | MPI_Allreduce DONE  with 1 items. 94358810775488->94358810780544
r1 | MPI_Allreduce DONE  with 1 items. 94628771067904->94628778597888
r1 | MPI_Allreduce START with 1 items. 94628780221376->94628779520064
r0 | MPI_Allreduce START with 1 items. 94358798081920->94358809175008
r0 | MPI_Allreduce DONE  with 1 items. 94358798081920->94358809175008
r1 | MPI_Allreduce DONE  with 1 items. 94628780221376->94628779520064
r1 | MPI_Allreduce START with 1 items. 94628771067904->94628778597888
r0 | MPI_Allreduce START with 1 items. 94358810775488->94358810780544
r0 | MPI_Allreduce DONE  with 1 items. 94358810775488->94358810780544
r1 | MPI_Allreduce DONE  with 1 items. 94628771067904->94628778597888
r0 | MPI_Allreduce START with 1 items. 94358798081920->94358810780544
r1 | MPI_Allreduce START with 1 items. 94628780221376->94628778597888
r0 | MPI_Allreduce DONE  with 1 items. 94358798081920->94358810780544
r0 | MPI_Allreduce START with 1 items. 94358810775488->94358810780544
r1 | MPI_Allreduce DONE  with 1 items. 94628780221376->94628778597888
r1 | MPI_Allreduce START with 1 items. 94628771067904->94628778597888
r0 | MPI_Allreduce DONE  with 1 items. 94358810775488->94358810780544
r1 | MPI_Allreduce DONE  with 1 items. 94628771067904->94628778597888
r1 | MPI_Allreduce START with 1 items. 94628780221376->94628778597888
r0 | MPI_Allreduce START with 1 items. 94358798081920->94358810780544
r0 | MPI_Allreduce DONE  with 1 items. 94358798081920->94358810780544
r1 | MPI_Allreduce DONE  with 1 items. 94628780221376->94628778597888
r1 | MPI_Allreduce START with 1 items. 94628771067904->94628733114880
r0 | MPI_Allreduce START with 1 items. 94358810775488->94358807196864
r0 | MPI_Allreduce DONE  with 1 items. 94358810775488->94358807196864
r1 | MPI_Allreduce DONE  with 1 items. 94628771067904->94628733114880
r1 | MPI_Allreduce START with 144 items. 94628779518784->94628760373568
r0 | MPI_Allreduce START with 144 items. 94358809126528->94358789644608
r0 | MPI_Allreduce DONE  with 144 items. 94358809126528->94358789644608
r0 | MPI_Allreduce START with 12 items. 94358804493248->94358805442304
r1 | MPI_Allreduce DONE  with 144 items. 94628779518784->94628760373568
r1 | MPI_Allreduce START with 12 items. 94628774010240->94628766257792
r0 | MPI_Allreduce DONE  with 12 items. 94358804493248->94358805442304
r0 | MPI_Allreduce START with 12 items. 94358809174912->94358804493248
r1 | MPI_Allreduce DONE  with 12 items. 94628774010240->94628766257792
r1 | MPI_Allreduce START with 12 items. 94628779518784->94628774010240
r0 | MPI_Allreduce DONE  with 12 items. 94358809174912->94358804493248
r1 | MPI_Allreduce DONE  with 12 items. 94628779518784->94628774010240
 r0| DONE FLUSH           for (1, 16, 12)->(16, 29, 12)
 r0 - 0 - loss done
 r1| DONE FLUSH           for (1, 16, 12)->(16, 33, 12)
 r1 - 0 - loss done
 r0 - 0 - loss done FLUSHED
 r1 - 0 - loss done FLUSHED
  0%|          | 0/300 [00:02<?, ?it/s, Energy=5.3298 ± 0.0073 [σ²=0.0017, R̂=1.0000]] r1 - 1 - reset
 r1 - 1 - reset FLUSHED
 r1| FLUSH4    gradient for (1, 16, 12)->(16, 29, 12)
 r1| computing gradient for (1, 16, 12)->(16, 29, 12)
 r0 - 1 - reset
 r1| compiling grad_expect_hermitian for (1, 16, 12)->(16, 29, 12)
 r1| compiling 1
 r0 - 1 - reset FLUSHED
 r0| FLUSH4    gradient for (1, 16, 12)->(16, 33, 12)
 r0| computing gradient for (1, 16, 12)->(16, 33, 12)
 r0| compiling grad_expect_hermitian for (1, 16, 12)->(16, 33, 12)
 r0| compiling 1
 r1| compiling 2
 r0| compiling 2
 r1 | mpi_sum of (), float64
 r1 | mpi_sum of (), float64
 r1 | mpi_sum of (), float64
 r1 | mpi_sum of (), float64
 r1 | mpi_sum of (), float64
 r1 | mpi_sum of (), float64
 r1 | mpi_sum of (), float64
 r1| compiling 3
 r0 | mpi_sum of (), float64
 r0 | mpi_sum of (), float64
 r0 | mpi_sum of (), float64
 r0 | mpi_sum of (), float64
 r0 | mpi_sum of (), float64
 r0 | mpi_sum of (), float64
 r0 | mpi_sum of (), float64
 r0| compiling 3
 r0| DONE compiling grad_expect_hermitian: sum_inplace for (16, 12)->(16, 33, 12)
 r0 | mpi_sum of (12,), float64
 r0 | mpi_sum of (12, 12), float64
 r0 | mpi_sum of (12,), float64
 r0| DONE compiling grad_expect_hermitian for (16, 12)->(16, 33, 12)
 r1| DONE compiling grad_expect_hermitian: sum_inplace for (16, 12)->(16, 29, 12)
 r1 | mpi_sum of (12,), float64
 r1 | mpi_sum of (12, 12), float64
 r1 | mpi_sum of (12,), float64
 r1| DONE compiling grad_expect_hermitian for (16, 12)->(16, 29, 12)
 r1| DONE               for (1, 16, 12)->(16, 29, 12)
r1 | MPI_Allreduce START with 1 items. 94628760626432->94628760635840
 r0| DONE               for (1, 16, 12)->(16, 33, 12)
r0 | MPI_Allreduce START with 1 items. 94358812321280->94358811503040
r1 | MPI_Allreduce DONE  with 1 items. 94628760626432->94628760635840
r1 | MPI_Allreduce START with 1 items. 94628775151936->94628782379680
r0 | MPI_Allreduce DONE  with 1 items. 94358812321280->94358811503040
r0 | MPI_Allreduce START with 1 items. 94358808760768->94358815236928
r0 | MPI_Allreduce DONE  with 1 items. 94358808760768->94358815236928
r1 | MPI_Allreduce DONE  with 1 items. 94628775151936->94628782379680
r1 | MPI_Allreduce START with 1 items. 94628760626432->94628760635840
r0 | MPI_Allreduce START with 1 items. 94358812321280->94358811503040
r0 | MPI_Allreduce DONE  with 1 items. 94358812321280->94358811503040
r1 | MPI_Allreduce DONE  with 1 items. 94628760626432->94628760635840
r1 | MPI_Allreduce START with 1 items. 94628775151936->94628760635840
r0 | MPI_Allreduce START with 1 items. 94358808760768->94358811503040
r0 | MPI_Allreduce DONE  with 1 items. 94358808760768->94358811503040
r1 | MPI_Allreduce DONE  with 1 items. 94628775151936->94628760635840
r1 | MPI_Allreduce START with 1 items. 94628760626432->94628760635840
r0 | MPI_Allreduce START with 1 items. 94358812321280->94358811503040
r0 | MPI_Allreduce DONE  with 1 items. 94358812321280->94358811503040
r1 | MPI_Allreduce DONE  with 1 items. 94628760626432->94628760635840
r1 | MPI_Allreduce START with 1 items. 94628775151936->94628760635840
r0 | MPI_Allreduce START with 1 items. 94358808760768->94358811503040
r0 | MPI_Allreduce DONE  with 1 items. 94358808760768->94358811503040
r0 | MPI_Allreduce START with 1 items. 94358812321280->94358808517888
r1 | MPI_Allreduce DONE  with 1 items. 94628775151936->94628760635840
r1 | MPI_Allreduce START with 1 items. 94628760626432->94628779230336
r0 | MPI_Allreduce DONE  with 1 items. 94358812321280->94358808517888
r1 | MPI_Allreduce DONE  with 1 items. 94628760626432->94628779230336
r0 | MPI_Allreduce START with 144 items. 94358815235648->94358808128960
r1 | MPI_Allreduce START with 144 items. 94628782331200->94628779696512
r0 | MPI_Allreduce DONE  with 144 items. 94358815235648->94358808128960
r0 | MPI_Allreduce START with 12 items. 94358807119616->94358811164992
r1 | MPI_Allreduce DONE  with 144 items. 94628782331200->94628779696512
r1 | MPI_Allreduce START with 12 items. 94628774317952->94628770700032
r0 | MPI_Allreduce DONE  with 12 items. 94358807119616->94358811164992
r0 | MPI_Allreduce START with 12 items. 94358815235648->94358807119616
r0 | MPI_Allreduce DONE  with 12 items. 94358815235648->94358807119616
r1 | MPI_Allreduce DONE  with 12 items. 94628774317952->94628770700032
r1 | MPI_Allreduce START with 12 items. 94628782379584->94628774317952
r1 | MPI_Allreduce DONE  with 12 items. 94628782379584->94628774317952
 r0| DONE FLUSH           for (1, 16, 12)->(16, 33, 12)
 r1| DONE FLUSH           for (1, 16, 12)->(16, 29, 12)
 r1 - 1 - loss done
 r0 - 1 - loss done
 r0 - 1 - loss done FLUSHED
 r1 - 1 - loss done FLUSHED
 r1 - 2 - reset
  0%|          | 1/300 [00:00<01:29,  3.33it/s, Energy=5.3402 ± 0.0090 [σ²=0.0026, R̂=1.0000]] r1 - 2 - reset FLUSHED
  1%|          | 3/300 [00:00<00:29,  9.98it/s, Energy=5.3402 ± 0.0090 [σ²=0.0026, R̂=1.0000]] r0 - 2 - reset
 r1| FLUSH4    gradient for (1, 16, 12)->(16, 29, 12)
 r0 - 2 - reset FLUSHED
 r1| computing gradient for (1, 16, 12)->(16, 29, 12)
 r1| DONE               for (1, 16, 12)->(16, 29, 12)
r1 | MPI_Allreduce START with 1 items. 94628774766528->94628761944256
 r0| FLUSH4    gradient for (1, 16, 12)->(16, 29, 12)
 r0| computing gradient for (1, 16, 12)->(16, 29, 12)
 r0| DONE               for (1, 16, 12)->(16, 29, 12)
r0 | MPI_Allreduce START with 1 items. 94358819200896->94358797335936
r0 | MPI_Allreduce DONE  with 1 items. 94358819200896->94358797335936
r0 | MPI_Allreduce START with 1 items. 94358793905536->94358820191648
r0 | MPI_Allreduce DONE  with 1 items. 94358793905536->94358820191648
r1 | MPI_Allreduce DONE  with 1 items. 94628774766528->94628761944256
r1 | MPI_Allreduce START with 1 items. 94628775169408->94628780505056
r1 | MPI_Allreduce DONE  with 1 items. 94628775169408->94628780505056
r0 | MPI_Allreduce START with 1 items. 94358819200896->94358797335936
r1 | MPI_Allreduce START with 1 items. 94628774766528->94628761944256
r1 | MPI_Allreduce DONE  with 1 items. 94628774766528->94628761944256
r0 | MPI_Allreduce DONE  with 1 items. 94358819200896->94358797335936
r0 | MPI_Allreduce START with 1 items. 94358793905536->94358797335936
r1 | MPI_Allreduce START with 1 items. 94628775169408->94628761944256
r0 | MPI_Allreduce DONE  with 1 items. 94358793905536->94358797335936
r1 | MPI_Allreduce DONE  with 1 items. 94628775169408->94628761944256
r0 | MPI_Allreduce START with 1 items. 94358819200896->94358797335936
r1 | MPI_Allreduce START with 1 items. 94628774766528->94628761944256
r1 | MPI_Allreduce DONE  with 1 items. 94628774766528->94628761944256
r0 | MPI_Allreduce DONE  with 1 items. 94358819200896->94358797335936
r0 | MPI_Allreduce START with 1 items. 94358793905536->94358797335936
r1 | MPI_Allreduce START with 1 items. 94628775169408->94628761944256
r0 | MPI_Allreduce DONE  with 1 items. 94358793905536->94358797335936
r1 | MPI_Allreduce DONE  with 1 items. 94628775169408->94628761944256
r1 | MPI_Allreduce START with 1 items. 94628774766528->94628780876480
r0 | MPI_Allreduce START with 1 items. 94358819200896->94358816045888
r0 | MPI_Allreduce DONE  with 1 items. 94358819200896->94358816045888
r1 | MPI_Allreduce DONE  with 1 items. 94628774766528->94628780876480
r1 | MPI_Allreduce START with 144 items. 94628780456576->94628772995968
r0 | MPI_Allreduce START with 144 items. 94358820143168->94358810951872
r1 | MPI_Allreduce DONE  with 144 items. 94628780456576->94628772995968
r0 | MPI_Allreduce DONE  with 144 items. 94358820143168->94358810951872
r0 | MPI_Allreduce START with 12 items. 94358811544832->94358807578048
r1 | MPI_Allreduce START with 12 items. 94628778946944->94628778687552
r0 | MPI_Allreduce DONE  with 12 items. 94358811544832->94358807578048
r0 | MPI_Allreduce START with 12 items. 94358820191552->94358811544832
r1 | MPI_Allreduce DONE  with 12 items. 94628778946944->94628778687552
r1 | MPI_Allreduce START with 12 items. 94628780504960->94628778946944
r0 | MPI_Allreduce DONE  with 12 items. 94358820191552->94358811544832
r1 | MPI_Allreduce DONE  with 12 items. 94628780504960->94628778946944
 r0| DONE FLUSH           for (1, 16, 12)->(16, 29, 12)
 r0 - 2 - loss done
 r1| DONE FLUSH           for (1, 16, 12)->(16, 29, 12)
 r1 - 2 - loss done
 r0 - 2 - loss done FLUSHED
 r1 - 2 - loss done FLUSHED
 r1 - 3 - reset
  1%|          | 3/300 [00:00<00:29,  9.98it/s, Energy=5.342 ± 0.012 [σ²=0.004, R̂=1.0000]]    r0 - 3 - reset
 r1 - 3 - reset FLUSHED
 r0 - 3 - reset FLUSHED
 r1| FLUSH4    gradient for (1, 16, 12)->(16, 29, 12)
 r0| FLUSH4    gradient for (1, 16, 12)->(16, 31, 12)
 r1| computing gradient for (1, 16, 12)->(16, 29, 12)
 r1| DONE               for (1, 16, 12)->(16, 29, 12)
 r0| computing gradient for (1, 16, 12)->(16, 31, 12)
r1 | MPI_Allreduce START with 1 items. 94628733141312->94628771393344
 r0| compiling grad_expect_hermitian for (1, 16, 12)->(16, 31, 12)
 r0| compiling 1
 r0| compiling 2
```
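Note the tail of the trace: r1 reuses the cached kernel for `(16, 29, 12)` and immediately posts its `MPI_Allreduce`, while r0 enters a fresh compilation of `grad_expect_hermitian` for `(16, 31, 12)`, and the run hangs there. One possible mitigation (a sketch under my own assumptions, not a fix adopted in this MR) would be to pad the per-rank `\sigma_prime` batch to a globally agreed size before entering the jitted code, so every rank always compiles for the same static shape:

```python
# Sketch of a shape-padding workaround (hypothetical helper, not NetKet API).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def pad_to_global_max(sigma_prime):
    """Pad axis 1 so that all ranks agree on a single static shape."""
    n_local = sigma_prime.shape[1]
    # Agree on the largest number of connected elements across all ranks.
    n_max = comm.allreduce(n_local, op=MPI.MAX)
    if n_max > n_local:
        # Repeat the last connected element as filler; the padded rows must be
        # masked out of any downstream sums (an assumption of this sketch).
        filler = np.repeat(sigma_prime[:, -1:, :], n_max - n_local, axis=1)
        sigma_prime = np.concatenate([sigma_prime, filler], axis=1)
    return sigma_prime, n_local
```

With padding in place the gradient kernel is compiled once for the common shape on every rank, so the collectives inside it always line up.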
