
Enable pickling of tmap.VectorUint #8

Open
rajarshi opened this issue Apr 18, 2020 · 2 comments

@rajarshi

Hi, I'm trying to compute MHFP fingerprints and index them, both in parallel. The code looks like this:

import multiprocessing
import pickle

import tmap
from joblib import Parallel, delayed
from rdkit.Chem import AllChem

# enc (the MHFP encoder) and molcsv (the (smiles, molid) pairs) are set up elsewhere in the script
def fp_function(pair):
    smi, molid = pair
    mol = AllChem.MolFromSmiles(smi)
    fp = tmap.VectorUint(enc.encode_mol(mol, min_radius=0))
    return molid, fp

num_cores = multiprocessing.cpu_count()
fps = Parallel(n_jobs=num_cores)(delayed(fp_function)(input_pair) for input_pair in molcsv)
pickle.dump(fps, open("libcomp_fps.pkl", 'wb'))

However, on running this I get:

Traceback (most recent call last):
  File "/Users/guha/src/tmap/lsh_query.py", line 27, in <module>
    fps = Parallel(n_jobs=num_cores)(delayed(fp_function)(input_pair) for input_pair in molcsv)
  File "/anaconda3/envs/rdkit/lib/python3.6/site-packages/joblib/parallel.py", line 996, in __call__
    self.retrieve()
  File "/anaconda3/envs/rdkit/lib/python3.6/site-packages/joblib/parallel.py", line 899, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/anaconda3/envs/rdkit/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 517, in wrap_future_result
    return future.result(timeout=timeout)
  File "/anaconda3/envs/rdkit/lib/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/anaconda3/envs/rdkit/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
TypeError: can't pickle tmap.VectorUint objects

Are there plans to make VectorUint pickleable? Or is there an alternative approach to parallelizing this type of computation?
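
In the meantime, one possible workaround (a sketch only, reusing enc, molcsv and num_cores from the snippet above) is to have the workers return plain Python lists, which pickle fine, and rebuild the tmap.VectorUint objects in the parent process only when they are needed for indexing:

def fp_function(pair):
    smi, molid = pair
    mol = AllChem.MolFromSmiles(smi)
    # return a plain list of ints so that joblib/pickle can handle the result
    return molid, list(enc.encode_mol(mol, min_radius=0))

results = Parallel(n_jobs=num_cores)(delayed(fp_function)(p) for p in molcsv)
pickle.dump(results, open("libcomp_fps.pkl", 'wb'))

# rebuild the C++ vectors only where e.g. an LSH forest needs them
fps = [(molid, tmap.VectorUint(hashes)) for molid, hashes in results]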

@aparente-nurix

I had asked @daenuprobst this question a while back (not on GitHub, since I can't seem to find it on here) and the answer was to write them to a text file. However, I've found that writing the fingerprints to and reading them from text files can be quite slow. Would JSON be a better option? I have not tried it.
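
If someone does try the JSON route, a rough, untested sketch (with made-up helper names) would be to convert each VectorUint to a plain list of ints before dumping, since json can't serialise VectorUint directly either, and rebuild the vectors on load:

import json
import tmap

# hypothetical helpers, not part of tmap
def save_fps(fps, path):
    # fps: list of (molid, VectorUint) pairs; plain lists of ints are JSON-serialisable
    with open(path, 'w') as f:
        json.dump([[molid, [int(x) for x in fp]] for molid, fp in fps], f)

def load_fps(path):
    with open(path) as f:
        return [(molid, tmap.VectorUint(hashes)) for molid, hashes in json.load(f)]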

@rajarshi
Author

One possibility is to look at supporting the chemfp binary format, which allows for impressively high-speed I/O.
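
Until something like that exists, a simpler binary route already helps. A sketch using NumPy (not chemfp, with made-up helper names, and assuming every fingerprint has the same length, as MinHash vectors with a fixed number of permutations do):

import numpy as np
import tmap

# hypothetical helpers, not part of tmap or chemfp
def save_fps_npz(fps, path):
    # fps: list of (molid, VectorUint) pairs with equal-length fingerprints
    ids = np.array([molid for molid, _ in fps])
    fp_matrix = np.array([list(fp) for _, fp in fps], dtype=np.uint32)
    np.savez_compressed(path, ids=ids, fps=fp_matrix)

def load_fps_npz(path):
    data = np.load(path)
    return [(molid, tmap.VectorUint(row.tolist()))
            for molid, row in zip(data['ids'], data['fps'])]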
