I'm able to run the examples in the Tutorial using Stochastic Reconfiguration with 'UseIterative' set to False. But if I set it to True, I get the message "Segmentation fault (core dumped)". For example, below is the output I get for the J1-J2 model; the .log file is left empty.
# Graph created
# Number of nodes = 20
# RBM Initizialized with nvisible = 20 and nhidden = 20
# Using visible bias = 1
# Using hidden bias = 1
# Machine initialized with random parameters
# Hamiltonian Metropolis sampler with parallel tempering is ready
# 16 replicas are being used
# Learning running on 1 processes
# Using the Stochastic reconfiguration method
# With iterative solver
Segmentation fault (core dumped)
I'm using Ubuntu 18.04 with gcc 7.3.0, python 3.6.5 and mpich 3.3a2. The other libraries are the ones that come with NetKet.
I cannot reproduce the crash on my machine and therefore did not investigate this thoroughly. However, I have a suspicion:
The crash could be related to the class MatrixReplacement and the fact that it stores a member of type Eigen::MatrixXcd. According to the Eigen documentation,
"fixed-size vectorizable Eigen objects must absolutely be created at 16-byte-aligned locations, otherwise SIMD instructions addressing them will crash."
This alignment is not guaranteed for class member variables. The problem can be solved by specifying the EIGEN_MAKE_ALIGNED_OPERATOR_NEW macro within the class body.
I cannot test this on my machine right now (since I cannot reproduce the crash), but could you check whether making this change fixes the crash for you?
Thank you @wuyukai for the bug report, and thanks a lot @femtobit for looking into this!
I think this is the likely source of the issue. Indeed, I wasn't able to reproduce the bug on my machines, and it is most likely compiler- or CPU-related.
@wuyukai would you be able to test out the patch? That would be really helpful. I can't test it myself right now, will look into this as soon as possible.
This really should not be the issue, as Eigen's docs talk about fixed-size objects. The way these objects are usually implemented is that you avoid costly dynamic memory allocation by using arrays as member variables. However, if you want to use SIMD instructions, the data you're operating on had better be aligned. And seeing as alignas was only introduced in C++11, Eigen probably can't (or doesn't want to) do that automatically for you with fixed-size objects. Eigen::MatrixXcd is not a fixed-size matrix, so I don't see how it could be the issue here. @wuyukai could you perhaps ask your favourite debugger for a stack trace? That would be really helpful for locating the bug.
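To make the fixed-size versus dynamic-size distinction concrete, here is a minimal sketch in plain C++ (no Eigen). FixedVec4f and DynamicMat are hypothetical stand-ins, not Eigen's actual types: the first keeps its coefficients inside the object and asks for 16-byte alignment, the second only holds a pointer to heap storage plus its dimensions. The point is that the alignment requirement propagates to an enclosing class only in the fixed-size case.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical fixed-size vector: the coefficients are part of the object
// itself, so the object must sit at a 16-byte-aligned address.
struct alignas(16) FixedVec4f {
  float coeffs[4];
};

// Hypothetical dynamic-size matrix: the object stores only a pointer and
// the dimensions. The coefficients live on the heap, and aligning them is
// the allocating code's job, not the enclosing class's.
struct DynamicMat {
  std::unique_ptr<double[]> data;
  std::ptrdiff_t rows;
  std::ptrdiff_t cols;
};

// Classes that hold one of the above as a member.
struct HolderFixed {
  char tag;
  FixedVec4f v;  // forces alignof(HolderFixed) up to 16
};

struct HolderDynamic {
  char tag;
  DynamicMat m;  // only needs ordinary (fundamental) alignment
};

// Returns true if p is a multiple of `alignment` bytes.
inline bool IsAligned(const void* p, std::size_t alignment) {
  return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Allocates zero-initialised storage for a rows x cols dynamic matrix.
inline DynamicMat MakeDynamic(std::ptrdiff_t rows, std::ptrdiff_t cols) {
  DynamicMat m;
  m.data.reset(new double[static_cast<std::size_t>(rows * cols)]());
  m.rows = rows;
  m.cols = cols;
  return m;
}
```

Under these assumptions, alignof(HolderFixed) is 16 while alignof(HolderDynamic) stays within the fundamental alignment, which is the argument for why a dynamic-size member should not need the macro.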
I am able to reproduce this error (same OS & versions as reported). The proposed fix by @femtobit does not remove this issue unfortunately. Sorry for the ignorance in advance, but if you could let me know what extra info you need from a debugger I'd be happy to provide more than just the below:
Starting program: /usr/local/bin/netket j1j2.json
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff4d4c700 (LWP 19874)]
[New Thread 0x7fffef9f0700 (LWP 19875)]
############################################
# NetKet version 1.0.2 #
# Website: https://www.netket.org #
# Licensed under Apache-2.0 - see LICENSE #
############################################
# Graph created
# Number of nodes = 20
# RBM Initizialized with nvisible = 20 and nhidden = 20
# Using visible bias = 1
# Using hidden bias = 1
# Machine initialized with random parameters
# Hamiltonian Metropolis sampler with parallel tempering is ready
# 16 replicas are being used
# Learning running on 1 processes
# Using the Stochastic reconfiguration method
# With iterative solver
Thread 1 "netket" received signal SIGSEGV, Segmentation fault.
0x00005555555b0878 in Eigen::internal::general_matrix_vector_product, Eigen::internal::const_blas_data_mapper, long, 0>, 0, false, std::complex, Eigen::internal::const_blas_data_mapper, long, 1>, false, 0>::run(long, long, Eigen::internal::const_blas_data_mapper, long, 0> const&, Eigen::internal::const_blas_data_mapper, long, 1> const&, std::complex*, long, std::complex) ()
(gdb) bt
#0 0x00005555555b0878 in Eigen::internal::general_matrix_vector_product, Eigen::internal::const_blas_data_mapper, long, 0>, 0, false, std::complex, Eigen::internal::const_blas_data_mapper, long, 1>, false, 0>::run(long, long, Eigen::internal::const_blas_data_mapper, long, 0> const&, Eigen::internal::const_blas_data_mapper, long, 1> const&, std::complex*, long, std::complex) ()
#1 0x00005555555b12af in void Eigen::internal::conjugate_gradient, -1, 1, 0, -1, 1> const, -1, 1, true>, Eigen::Block, -1, 1, 0, -1, 1>, -1, 1, true>, Eigen::IdentityPreconditioner>(netket::MatrixReplacement const&, Eigen::Block, -1, 1, 0, -1, 1> const, -1, 1, true> const&, Eigen::Block, -1, 1, 0, -1, 1>, -1, 1, true>&, Eigen::IdentityPreconditioner const&, long&, Eigen::Block, -1, 1, 0, -1, 1>, -1, 1, true>::RealScalar&) ()
#2 0x00005555555b39c9 in netket::GroundState::UpdateParameters() ()
#3 0x00005555555bc956 in netket::GroundState::GroundState(netket::Hamiltonian&, netket::Sampler > >&, netket::Stepper&, nlohmann::basic_json, std::allocator >, bool, long, unsigned long, double, std::allocator, nlohmann::adl_serializer> const&) ()
#4 0x00005555555bfc3d in netket::Learning::Learning(nlohmann::basic_json, std::allocator >, bool, long, unsigned long, double, std::allocator, nlohmann::adl_serializer> const&) ()
#5 0x0000555555575cb9 in main ()
matrix_replacement is a class that applies a custom matrix to a vector without explicitly forming the matrix. This is needed to solve the linear system S*deltaP=b with a CG method, avoiding the computational cost of explicitly forming the matrix S,
which is a square matrix of size npar_ * npar_. Thus rows() and cols() return the same value.
mp_mat_ is the O_k matrix, i.e. the variational derivatives on all the sampled configurations. It is rectangular, which is why you get the error you mentioned.
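The idea of applying S without forming it can be sketched in plain C++ as follows. This is not NetKet's implementation: MatrixFreeS, Dot and ConjugateGradient are hypothetical names, the matrix O is assumed to be already centered and stored row-major as nsamples x npar, and S is assumed to be O^dagger O / nsamples plus a diagonal shift.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using Complex = std::complex<double>;
using Vector = std::vector<Complex>;

// Hypothetical operator: owns a copy of O (nsamples x npar, row-major) and
// applies S = O^dagger O / nsamples + shift * I without ever building S.
class MatrixFreeS {
 public:
  MatrixFreeS(Vector o, std::size_t nsamples, std::size_t npar, double shift)
      : o_(std::move(o)), nsamples_(nsamples), npar_(npar), shift_(shift) {
    assert(o_.size() == nsamples_ * npar_);
  }

  // S is square (npar x npar) even though O is rectangular.
  std::size_t rows() const { return npar_; }
  std::size_t cols() const { return npar_; }

  Vector Apply(const Vector& v) const {
    assert(v.size() == npar_);
    Vector result(npar_, Complex(0.0, 0.0));
    for (std::size_t s = 0; s < nsamples_; ++s) {
      Complex t(0.0, 0.0);  // t = (O v)_s
      for (std::size_t k = 0; k < npar_; ++k) t += o_[s * npar_ + k] * v[k];
      for (std::size_t k = 0; k < npar_; ++k)  // result += conj(O_s) * t
        result[k] += std::conj(o_[s * npar_ + k]) * t;
    }
    for (std::size_t k = 0; k < npar_; ++k)
      result[k] = result[k] / static_cast<double>(nsamples_) + shift_ * v[k];
    return result;
  }

 private:
  Vector o_;
  std::size_t nsamples_;
  std::size_t npar_;
  double shift_;
};

// Hermitian inner product: sum_i conj(a_i) * b_i.
inline Complex Dot(const Vector& a, const Vector& b) {
  assert(a.size() == b.size());
  Complex sum(0.0, 0.0);
  for (std::size_t i = 0; i < a.size(); ++i) sum += std::conj(a[i]) * b[i];
  return sum;
}

// Textbook conjugate gradient for a Hermitian positive-definite operator,
// starting from x = 0 and stopping when the residual norm drops below tol.
inline Vector ConjugateGradient(const MatrixFreeS& op, const Vector& b,
                                int max_iter, double tol) {
  assert(b.size() == op.rows());
  Vector x(b.size(), Complex(0.0, 0.0));
  Vector r = b;
  Vector p = r;
  double rs_old = Dot(r, r).real();
  for (int it = 0; it < max_iter && rs_old > tol * tol; ++it) {
    const Vector sp = op.Apply(p);
    const double alpha = rs_old / Dot(p, sp).real();
    for (std::size_t i = 0; i < x.size(); ++i) {
      x[i] += alpha * p[i];
      r[i] -= alpha * sp[i];
    }
    const double rs_new = Dot(r, r).real();
    const double beta = rs_new / rs_old;
    for (std::size_t i = 0; i < p.size(); ++i) p[i] = r[i] + beta * p[i];
    rs_old = rs_new;
  }
  return x;
}
```

Each Apply call costs two passes over O (order nsamples * npar operations) instead of the npar * npar storage and work needed to build S explicitly. Note that the operator's rows() and cols() describe S, not O.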
Still, I don't understand where this bug comes from....
Thanks @everthemore! I was able to reproduce the error on my local machine. The problem lies in the lifetime management (as usual in C++) ;) I'll create a PR.
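The thread does not spell out which object outlived its data, so the following is only a generic sketch of this class of bug, not the actual NetKet code. BorrowingOp, OwningOp, MakeData and MakeOwningOp are hypothetical names. A non-owning operator is valid only while the data it points to is alive; an owning one carries its data with it.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Non-owning operator: keeps only a pointer, so it is valid only as long
// as the vector it was constructed from is still alive.
class BorrowingOp {
 public:
  explicit BorrowingOp(const std::vector<double>& data) : data_(&data) {}
  double Sum() const {
    double total = 0.0;
    for (std::size_t i = 0; i < data_->size(); ++i) total += (*data_)[i];
    return total;
  }

 private:
  const std::vector<double>* data_;
};

// Owning operator: takes the data by value and keeps it, so its lifetime
// no longer depends on the caller.
class OwningOp {
 public:
  explicit OwningOp(std::vector<double> data) : data_(std::move(data)) {}
  double Sum() const {
    double total = 0.0;
    for (std::size_t i = 0; i < data_.size(); ++i) total += data_[i];
    return total;
  }

 private:
  std::vector<double> data_;
};

inline std::vector<double> MakeData() { return {1.0, 2.0, 3.0}; }

// Safe: the returned operator owns its data.
inline OwningOp MakeOwningOp() { return OwningOp(MakeData()); }

// The dangerous pattern compiles without any diagnostic by default:
//   BorrowingOp op(MakeData());  // temporary is destroyed at the ';'
//   op.Sum();                    // reads freed memory: undefined behaviour
// Depending on compiler, optimisation level and allocator, this may appear
// to work on one machine and segfault on another.
```

That last point would be consistent with the crash reproducing only on some setups, although the thread itself does not confirm the exact mechanism.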
I can't check this fix on a Linux machine now, @everthemore and/or @wuyukai could you please try the proposed fix and tell us? Thanks a lot.
Next week I am going to set up unit tests on this part of the code, which unfortunately is not covered yet (and I guess there is a correlation with the fact that we found this bug here...)