Mapping to Pfam IDs #75

christophfeinauer · 2020-07-10T19:46:07Z

Hi,

first thanks for creating this repo, it's really useful.

One question: It's not clear to me how I can go back to the original Pfam ID for a sequence from the LMDB databases. The reason I want to do this is because I need to use species annotation in a task.

Also, I did not find any information on how the data was created (which part of Pfam, whether there is preprocessing, etc.). Is this documented somewhere and I just didn't see it?

thomas-a-neil · 2020-07-10T20:39:42Z

Hello,

Thanks for your interest in our repo!

In order to get the original Pfam ID, you'll unfortunately have to compare the residue sequences directly. If it helps, the mapping from Pfam index to Pfam family is on S3 at s3://proteindata/data/pfam/pfam_fams_public.pkl, which would allow you to restrict your search.
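
A minimal sketch of how the family mapping could narrow the comparison. It assumes the pickle loads to something like a dict from integer family index to Pfam accession, and that each dataset record carries a `family` index and a `primary` sequence string. Neither is confirmed in this thread, and every name and record below is made up for illustration.

```python
from collections import defaultdict

# Hypothetical stand-in for the contents of pfam_fams_public.pkl
# (assumed shape: integer family index -> Pfam family accession).
index_to_family = {0: "PF00001", 1: "PF00002"}

# Hypothetical Pfam entries: (sequence id, family accession, sequence).
pfam_entries = [
    ("SEQ_A/1-5", "PF00001", "MKVLA"),
    ("SEQ_B/3-8", "PF00001", "GAVLIK"),
    ("SEQ_C/2-6", "PF00002", "MKVLA"),
]


def build_family_buckets(entries):
    """Group Pfam entries as family -> sequence -> list of sequence ids."""
    buckets = defaultdict(dict)
    for seq_id, family, seq in entries:
        buckets[family].setdefault(seq, []).append(seq_id)
    return buckets


def lookup(record, buckets, index_to_family):
    """Return the Pfam ids whose sequence matches the record within its family."""
    family = index_to_family[record["family"]]
    return buckets.get(family, {}).get(record["primary"], [])


buckets = build_family_buckets(pfam_entries)
record = {"primary": "MKVLA", "family": 1}  # assumed record layout
matches = lookup(record, buckets, index_to_family)
print(matches)
```

Searching only within the record's family keeps the comparison small and also separates identical strings that occur in more than one family.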

The process for creating our dataset is as follows: we downloaded Pfam-A.fasta from the Pfam 31 release (ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam31.0/), shuffled it, and then split it into train/validation/test as described in our paper. So the id field in the LMDB doesn't correspond to any index in Pfam. We probably should have kept the Pfam ID around for the type of annotation you suggest, but since we didn't use it for model training, we dropped it. The original Pfam serialization script can be found in the deprecated TensorFlow repo: https://github.com/songlab-cal/tape-neurips2019/blob/master/tape/data_utils/pfam_protein_serializer.py

christophfeinauer · 2020-07-11T16:36:49Z

Thanks!

Are you sure that you used Pfam 31? There are a lot of sequences in the dataset that are not in Pfam 31, but all appear in Pfam 32.

Also, if you are interested, I can send you the mapping if other people might need it.

thomas-a-neil · 2020-07-11T17:19:17Z

Ah yes, thank you for the correction. It should be most similar to Pfam 32. We downloaded Pfam-A.fasta from the "current release" FTP link (ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release) in March 2019. Pfam 32 had already been released in August 2018, and Pfam 33 was last modified in March 2020. If there are sequences that don't appear in Pfam 32, I would check Pfam 33.

And thanks for offering to send the mapping, that would be helpful to share with others!

psturmfels · 2020-07-17T17:12:05Z

+1 for the mapping to original Pfam IDs - I would be very interested in them!

christophfeinauer · 2020-07-17T20:26:12Z

Here it is

psturmfels · 2020-07-17T20:47:27Z

This is awesome! Thank you! Out of curiosity, how did you link back to the pfam_ids? Did you actually just compare every literal sequence string between the tape dataset and the Pfam release?

christophfeinauer · 2020-07-17T21:16:06Z

Yes. I just parsed Pfam-A.fasta and mapped the sequences back to the LMDB files. With Pfam 32 there were no missing sequences. I also checked a random subset of the mapping manually and it looks good.

The script also creates a version of the lmdb databases that contains all the information about pfam mappings, species etc. I can share them if someone is interested (however, they are trivial to make with the mappings).
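
For anyone who wants to reproduce the idea, here is a minimal sketch of exact-string matching; it is not the actual script. It assumes Pfam-A.fasta headers of the form `>NAME/START-END ACCESSION PFxxxxx.v;short_name;` and dataset records that are dicts with `id` and `primary` fields. The sample data and function names are hypothetical.

```python
import io


def parse_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def build_sequence_index(handle):
    """Map each sequence string to every FASTA header that carries it."""
    index = {}
    for header, seq in parse_fasta(handle):
        index.setdefault(seq, []).append(header)
    return index


def map_records(records, index):
    """Map each record id to the headers whose sequence matches exactly."""
    return {rec["id"]: index.get(rec["primary"], []) for rec in records}


# Made-up stand-in for Pfam-A.fasta; the second sequence spans two lines.
fasta_text = """>P1_HUMAN/10-15 P1.1 PF00001.21;7tm_1;
MKVLAG
>P2_MOUSE/4-12 P2.2 PF00002.10;7tm_2;
GAVLI
KTTW
"""
index = build_sequence_index(io.StringIO(fasta_text))

# Made-up stand-ins for records read from the LMDB files.
records = [{"id": "0", "primary": "GAVLIKTTW"}, {"id": "1", "primary": "AAAA"}]
mapping = map_records(records, index)
unmatched = [rec_id for rec_id, hits in mapping.items() if not hits]
print(mapping, unmatched)
```

Keeping a list of headers per sequence means duplicate sequences stay visible instead of silently overwriting each other, and the `unmatched` list is a quick check for whether the chosen Pfam release covers the dataset.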

rmrao added the documentation and quick-fix labels · Dec 17, 2020

thomas-a-neil mentioned this issue Apr 14, 2021

Pfam dataset version and preprocess #104

Closed
