
"Clustering using HDBSCAN running" step does not complete #5

Open
crastr opened this issue Dec 10, 2021 · 7 comments
crastr commented Dec 10, 2021

Hi @anuradhawick!

We managed to launch LRBinner in Docker, but the "Clustering using HDBSCAN running" step failed with the following error.

docker run --rm -it --gpus '"device=3"' -v $(pwd):$(pwd) -u $(id -u):$(id -g) anuradhawick/lrbinner contigs -r $PWD/c1.fq -c $PWD/c1.fasta --k-size 4 --cuda --output $PWD/result

Output:
2021-12-10 17:35:54,303 - INFO - Command /usr/LRBinner/LRBinner contigs -r /mnt/40_tb_10/work/alex/other_labs/andronov/andornov_metag_2021/complete_polished/c1/c1.fq -c /mnt/40_tb_10/work/alex/other_labs/andronov/andornov_metag_2021/complete_polished/c1/c1.fasta --k-size 4 -t 40 --cuda --output /mnt/40_tb_10/work/alex/other_labs/andronov/andornov_metag_2021/complete_polished/c1/result --resume
2021-12-10 17:35:57,360 - INFO - CUDA found in system
2021-12-10 17:35:57,362 - INFO - Resuming the program from previous checkpoints
2021-12-10 17:35:57,363 - INFO - Loading contig lengths
2021-12-10 17:35:57,485 - INFO - Loading marker genes from previous computations
2021-12-10 17:38:00,783 - INFO - Contigs already split
2021-12-10 17:38:00,783 - INFO - 15-mer counting already performed
2021-12-10 17:38:00,783 - INFO - K-mer vectors already computed
2021-12-10 17:38:00,783 - INFO - Coverage vectors already computed
2021-12-10 17:38:01,196 - INFO - Numpy arrays already computed
2021-12-10 17:38:01,196 - INFO - VAE already trained
2021-12-10 17:38:01,248 - INFO - Clustering using HDBSCAN running
joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
File "/opt/conda/lib/python3.9/site-packages/joblib/externals/loky/process_executor.py", line 407, in _process_worker
call_item = call_queue.get(block=True, timeout=timeout)
File "/opt/conda/lib/python3.9/multiprocessing/queues.py", line 122, in get
return _ForkingPickler.loads(res)
File "sklearn/neighbors/_binary_tree.pxi", line 1057, in sklearn.neighbors._kd_tree.BinaryTree.__setstate__
File "sklearn/neighbors/_binary_tree.pxi", line 999, in sklearn.neighbors._kd_tree.BinaryTree._update_memviews
File "stringsource", line 658, in View.MemoryView.memoryview_cwrapper
File "stringsource", line 349, in View.MemoryView.memoryview.__cinit__
ValueError: buffer source array is read-only
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/LRBinner/LRBinner", line 197, in <module>
main()
File "/usr/LRBinner/LRBinner", line 179, in main
pipelines.run_contig_binning(args)
File "/usr/LRBinner/mbcclr_utils/pipelines.py", line 242, in run_contig_binning
cluster_utils.perform_contig_binning_HDBSCAN(
File "/usr/LRBinner/mbcclr_utils/cluster_utils.py", line 494, in perform_contig_binning_HDBSCAN
labels = HDBSCAN(min_cluster_size=250).fit_predict(latent)
File "/opt/conda/lib/python3.9/site-packages/hdbscan/hdbscan_.py", line 941, in fit_predict
self.fit(X)
File "/opt/conda/lib/python3.9/site-packages/hdbscan/hdbscan_.py", line 919, in fit
self._min_spanning_tree) = hdbscan(X, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/hdbscan/hdbscan_.py", line 610, in hdbscan
(single_linkage_tree, result_min_span_tree) = memory.cache(
File "/opt/conda/lib/python3.9/site-packages/joblib/memory.py", line 349, in __call__
return self.func(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/hdbscan/hdbscan_.py", line 275, in _hdbscan_boruvka_kdtree
alg = KDTreeBoruvkaAlgorithm(tree, min_samples, metric=metric,
File "hdbscan/_hdbscan_boruvka.pyx", line 392, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__
File "hdbscan/_hdbscan_boruvka.pyx", line 426, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds
File "/opt/conda/lib/python3.9/site-packages/joblib/parallel.py", line 1056, in __call__
self.retrieve()
File "/opt/conda/lib/python3.9/site-packages/joblib/parallel.py", line 935, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/opt/conda/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
return future.result(timeout=timeout)
File "/opt/conda/lib/python3.9/concurrent/futures/_base.py", line 445, in result
return self.__get_result()
File "/opt/conda/lib/python3.9/concurrent/futures/_base.py", line 390, in __get_result
raise self._exception
joblib.externals.loky.process_executor.BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.

The same command completes successfully on a subsample of 10% of the contigs and 10% of the reads.

A quick search suggests the problem may be related to the number of rows in the input (see https://githubmemory.com/repo/scikit-learn/scikit-learn/issues/21228).
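For what it's worth, the error message is consistent with joblib exposing a large array to worker processes as a read-only buffer, which older hdbscan/scikit-learn builds fail to unpickle in the KD-tree. Below is a minimal sketch (not LRBinner's actual code; `latent` is a hypothetical stand-in for the latent embedding, and numpy is assumed to be available) of one possible workaround, passing a writable copy to clustering:

```python
import numpy as np

# Hypothetical stand-in for the latent embedding LRBinner hands to HDBSCAN
# (the real array is much larger).
latent = np.random.rand(1000, 8)

# Simulate what joblib can do with large inputs: expose them as a
# read-only buffer, which is what triggers
# "ValueError: buffer source array is read-only" in affected builds.
latent.setflags(write=False)

# Possible workaround: cluster a writable copy instead of the original.
latent = np.array(latent, copy=True)
assert latent.flags.writeable
```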

Thanks in advance!
Alexey

@anuradhawick (Owner)

Thanks for the issue. I will look into this. I may update the Dockerfile accordingly and reply to you in a few days' time.


4njul1 commented Feb 10, 2022

Hi @anuradhawick!

We are having the same problem at the HDBSCAN clustering step, but we are using conda instead of Docker. Have you had time to look into the issue yet?

Thank you very much in advance!
Anjuli

2022-02-10 14:29:40,021 - INFO - Command /home/woo/tools/LRBinner/LRBinner contigs --reads-path reads.fasta --bin-count 10 --bin-size 32 --output microbiome_bins --k-size 3 --ae-dims 4 --ae-epochs 200 --threads 20 --contigs scaffolds.1Kb.fa --resume
2022-02-10 14:29:40,035 - INFO - Resuming the program from previous checkpoints
2022-02-10 14:29:40,044 - INFO - Loading contig lengths
2022-02-10 14:29:40,199 - INFO - Loading marker genes from previous computations
2022-02-10 14:31:11,071 - INFO - Contigs already split
2022-02-10 14:31:11,071 - INFO - 15-mer counting already performed
2022-02-10 14:31:11,072 - INFO - K-mer vectors already computed
2022-02-10 14:31:11,072 - INFO - Coverage vectors already computed
2022-02-10 14:31:13,264 - INFO - Numpy arrays already computed
2022-02-10 14:31:13,264 - INFO - VAE already trained
2022-02-10 14:31:14,670 - INFO - Clustering using HDBSCAN running
joblib.externals.loky.process_executor._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/woo/miniconda3/envs/lrbinner/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 407, in _process_worker
    call_item = call_queue.get(block=True, timeout=timeout)
  File "/home/woo/miniconda3/envs/lrbinner/lib/python3.7/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "sklearn/neighbors/_binary_tree.pxi", line 1057, in sklearn.neighbors._kd_tree.BinaryTree.__setstate__
  File "sklearn/neighbors/_binary_tree.pxi", line 999, in sklearn.neighbors._kd_tree.BinaryTree._update_memviews
  File "stringsource", line 658, in View.MemoryView.memoryview_cwrapper
  File "stringsource", line 349, in View.MemoryView.memoryview.__cinit__
ValueError: buffer source array is read-only
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/woo/tools/LRBinner/LRBinner", line 197, in <module>
    main()
  File "/home/woo/tools/LRBinner/LRBinner", line 179, in main
    pipelines.run_contig_binning(args)
  File "/home/woo/tools/LRBinner/mbcclr_utils/pipelines.py", line 243, in run_contig_binning
    output, fragment_parent, separate, contigs, threads)
  File "/home/woo/tools/LRBinner/mbcclr_utils/cluster_utils.py", line 494, in perform_contig_binning_HDBSCAN
    labels = HDBSCAN(min_cluster_size=250).fit_predict(latent)
  File "/home/woo/miniconda3/envs/lrbinner/lib/python3.7/site-packages/hdbscan/hdbscan_.py", line 941, in fit_predict
    self.fit(X)
  File "/home/woo/miniconda3/envs/lrbinner/lib/python3.7/site-packages/hdbscan/hdbscan_.py", line 919, in fit
    self._min_spanning_tree) = hdbscan(X, **kwargs)
  File "/home/woo/miniconda3/envs/lrbinner/lib/python3.7/site-packages/hdbscan/hdbscan_.py", line 615, in hdbscan
    core_dist_n_jobs, **kwargs)
  File "/home/woo/miniconda3/envs/lrbinner/lib/python3.7/site-packages/joblib/memory.py", line 349, in __call__
    return self.func(*args, **kwargs)
  File "/home/woo/miniconda3/envs/lrbinner/lib/python3.7/site-packages/hdbscan/hdbscan_.py", line 278, in _hdbscan_boruvka_kdtree
    n_jobs=core_dist_n_jobs, **kwargs)
  File "hdbscan/_hdbscan_boruvka.pyx", line 392, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__
  File "hdbscan/_hdbscan_boruvka.pyx", line 426, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds
  File "/home/woo/miniconda3/envs/lrbinner/lib/python3.7/site-packages/joblib/parallel.py", line 1056, in __call__
    self.retrieve()
  File "/home/woo/miniconda3/envs/lrbinner/lib/python3.7/site-packages/joblib/parallel.py", line 935, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/home/woo/miniconda3/envs/lrbinner/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/home/woo/miniconda3/envs/lrbinner/lib/python3.7/concurrent/futures/_base.py", line 435, in result
    return self.__get_result()
  File "/home/woo/miniconda3/envs/lrbinner/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
joblib.externals.loky.process_executor.BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.

@anuradhawick (Owner)

Hi Anjuli,

thanks for the issue.

Can you tell me how you installed the packages using conda? I need to know the command you used to install hdbscan.

Thanks.


4njul1 commented Feb 12, 2022

Hi @anuradhawick!

Thank you for your reply! We used the following commands, as suggested in the README.md:

conda create -n lrbinner -y python=3.7 numpy scipy seaborn h5py tabulate pytorch hdbscan gcc openmp tqdm biopython

conda activate lrbinner

git clone https://github.com/anuradhawick/LRBinner.git
cd LRBinner/
python setup.py build

Thank you very much for looking into it. We are eager to use your tool on our data!

Cheers,
Anjuli

@anuradhawick (Owner)

Hi @4njul1 and @crastr,

Could you please try to install HDBSCAN using this command:

pip install --upgrade git+https://github.com/scikit-learn-contrib/hdbscan.git#egg=hdbscan

There are known issues in the conda package, and it is not the latest version.
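After reinstalling, a quick way to confirm which hdbscan build the environment actually resolves is to query the package metadata. This is a standard-library-only sketch; `installed_version` is a hypothetical helper name, not part of LRBinner or hdbscan:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(pkg: str) -> str:
    """Return the installed version of pkg, or 'not installed'."""
    try:
        return version(pkg)
    except PackageNotFoundError:
        return "not installed"


# Prints the version string of whichever hdbscan build is active.
print("hdbscan:", installed_version("hdbscan"))
```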

Let me know if this helps,

~Anuradha


4njul1 commented Feb 15, 2022

Hi @anuradhawick,

Thanks a lot for looking into this. I upgraded HDBSCAN with the command you posted, and it works!

Thank you very much for your help!

Best wishes,
Anjuli

@anuradhawick (Owner)

@4njul1 fantastic.

Please let me know how the tool performs, along with any artefacts and feedback, when you have time.

Thanks
Anuradha.
