
MCMC step stuck using the ESPEI 0.8.3 version #189

Open
HaihuiZhang opened this issue May 10, 2021 · 10 comments
@HaihuiZhang

Dear Brandon,

I ran into some problems when using the 0.8.3 version. The MCMC step takes a long time and gets stuck at the beginning, and the log stays empty apart from a few warnings. Would you help me with this problem?
condalist.txt

/lustre/home/acct-msezbb/msezbb/.conda/envs/espei2021/lib/python3.9/site-packages/ipopt/__init__.py:13: FutureWarning: The module has been renamed to 'cyipopt' from 'ipopt'. Please import using 'import cyipopt' and remove all uses of 'import ipopt' in your code as this will be deprecated in a future release.
warnings.warn(msg, FutureWarning)
/lustre/home/acct-msezbb/msezbb/.conda/envs/espei2021/lib/python3.9/site-packages/cyipopt/utils.py:43: FutureWarning: The function named 'setLoggingLevel' will soon be deprecated in CyIpopt. Please replace all uses and use 'set_logging_level' going forward.
warnings.warn(msg, FutureWarning)

@bocklund
Member

Those two warnings are safe to ignore and will go away when pycalphad 0.8.5 is released.

In 0.8 and later, the initial MCMC startup time will likely be a little longer, but overall each iteration should be the same or slightly faster. Can you provide some comparisons of the time to call the likelihood function that is printed when the verbosity is set to 2, in both 0.8.3 and the latest 0.7.X release that you had working?
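(For reference, the verbosity is controlled in the output section of the ESPEI YAML input file. This is only a minimal sketch of that fragment; the other keys and file names are placeholders, not taken from this issue.)

output:
  verbosity: 2          # prints the detailed output, including the likelihood timing mentioned above
  output_db: out.tdb
mcmc:
  iterations: 1000
  input_db: my-dft.tdb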

@HaihuiZhang
Author

HaihuiZhang commented May 10, 2021 via email

@HaihuiZhang
Author

It takes about 3 hours to finish running with 0.7.9+3.gd4625e7=dev_0.
I have set the verbosity to 2, but the log shows nothing after almost 3 days of running:

INFO:espei.espei_script - espei version 0.8.2
INFO:espei.espei_script - If you use ESPEI for work presented in a publication, we ask that you cite the following paper:
B. Bocklund, R. Otis, A. Egorov, A. Obaied, I. Roslyakova, Z.-K. Liu, ESPEI for efficient thermodynamic database development, modification, and uncertainty quantification: application to Cu-Mg, MRS Commun. (2019) 1-10. doi:10.1557/mrc.2019.59.
TRACE:espei.espei_script - Loading and checking datasets.
TRACE:espei.espei_script - Finished checking datasets

@bocklund
Member

TRACE:espei.espei_script - Loading and checking datasets.
TRACE:espei.espei_script - Finished checking datasets

After these steps the dask server usually starts. Maybe your dask server is not starting correctly. Can you make progress by setting scheduler: null?
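(For context, the scheduler is set in the mcmc section of the ESPEI YAML input. A minimal sketch of that fragment is below; the iteration count and database name are placeholders, not taken from this issue.)

mcmc:
  iterations: 1000
  input_db: my-dft.tdb
  scheduler: null    # run without starting a dask cluster (serial execution)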

@HaihuiZhang
Author

I run on my school's high-performance computing center, so I don't know how to set this up. Could you show me how to set it up? All previous versions ran on this platform before.

@bocklund
Member

Can you first check that turning off the scheduler works? I want to make sure everything else is working correctly. See https://espei.org/en/latest/writing_input.html#scheduler

@HaihuiZhang
Author

According to the solution you provided, MCMC has started to run normally. Thank you for your help. May I ask what caused this problem?
log2.txt

@bocklund
Member

bocklund commented May 11, 2021

According to the solution you provided, MCMC has started to run normally.

Great, so it looks like starting dask for parallelization was indeed the issue.

May I ask what caused this problem?

I'm not sure yet, but I think we can figure it out 🙂. ESPEI is intended to work on HPCs and works well when using one compute node without any special configuration.

  1. Are you trying to use scheduler: dask on your cluster or a scheduler file with MPI?
  2. Have you tried again with dask as the scheduler to verify that it's still not working?
  3. Are you trying to run on one node or multiple nodes? Any other relevant details from your HPC setup or batch submission file would be helpful.

@HaihuiZhang
Author

  1. Since my computer keeps reporting errors after installing Conda, I have been using the school's HPC with scheduler: dask. The stuck MCMC run was also on the school cluster.
  2. Today I tested scheduler: dask again for 12 hours, but it is still stuck, with no calculation and no log output.
  3. I used 40 cores on one node when running on the HPC.
    This is the distributed.yaml file on the cluster:

distributed.txt

@bocklund
Member

ESPEI basically starts a dask cluster this way:

import multiprocessing
from dask.distributed import LocalCluster, Client

# One single-threaded worker process per CPU core, with no memory limit
cores = multiprocessing.cpu_count()
cluster = LocalCluster(n_workers=cores, threads_per_worker=1, processes=True, memory_limit=0)
client = Client(cluster)
print(client.scheduler_info())

Can you run a Python script containing this and see it successfully start? The dask documentation may be helpful for you to review. This may require help from your HPC administrator.
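(As an additional sanity check, not part of the snippet above: continuing from that script, you could wait for the workers to connect and run a trivial task on each of them. If this hangs on the cluster, the problem is in the dask/HPC environment rather than in ESPEI.)

# Block until all worker processes have connected to the scheduler,
# then run a trivial function on every worker and print the results.
client.wait_for_workers(n_workers=cores, timeout=120)
print(client.run(lambda: "worker ok"))

# Shut the cluster down cleanly when finished.
client.close()
cluster.close()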
