Mapping to Pfam IDs #75

Open
christophfeinauer opened this issue Jul 10, 2020 · 7 comments
Labels: documentation (Improvements or additions to documentation), quick-fix (This is an easy fix and should be done soon)

Comments

@christophfeinauer

Hi,

First, thanks for creating this repo; it's really useful.

One question: it's not clear to me how I can get back to the original Pfam ID for a sequence in the LMDB databases. I want to do this because I need to use species annotation in a task.

Also, I did not find information on how the data was created (which part of Pfam was used, whether there was preprocessing, etc.). Is this documented somewhere and I just didn't see it?

@thomas-a-neil
Member

Hello,

Thanks for your interest in our repo!

To get the original Pfam ID, you'll unfortunately have to compare the sequence of residues directly. If it is helpful, the mapping from Pfam index to Pfam family is available on S3 at s3://proteindata/data/pfam/pfam_fams_public.pkl, which would allow you to restrict your search.

The process for creating our dataset is as follows: we downloaded Pfam-A.fasta from the Pfam 31 release (ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam31.0/), shuffled it, and then split it into train/validation/test as described in our paper. So the id field in the LMDB doesn't correspond to any index in Pfam. We probably should have kept the Pfam ID around for the type of annotation you suggest, but since we didn't use it for model training, we dropped it. The original Pfam serialization script can be found in the deprecated TensorFlow repo: https://github.com/songlab-cal/tape-neurips2019/blob/master/tape/data_utils/pfam_protein_serializer.py

@christophfeinauer
Author

Thanks!

Are you sure that you used Pfam 31? There are a lot of sequences in the dataset that are not in Pfam 31, but all appear in Pfam 32.

Also, if you are interested, I can send you the mapping in case other people need it.

@thomas-a-neil
Member

Ah yes, thank you for the correction. It should be most similar to Pfam 32. We downloaded Pfam-A.fasta from the "current release" FTP link (ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release) in March 2019. Pfam 32 had already been released in August 2018, and the last modification to Pfam 33 was in March 2020. If there are sequences that don't appear in Pfam 32, I would check Pfam 33.

And thanks for offering to send the mapping, that would be helpful to share with others!

@psturmfels

+1 for the mapping to original Pfam IDs - I would be very interested in them!

@christophfeinauer
Author

Here it is

The columns are id | species | uniprot_id | pfam_id | start | end. The id is just the id in the lmdb files.
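
For readers who want to rebuild these columns themselves, here is a minimal sketch of pulling them out of a Pfam-A.fasta header line. It is not the script used for the attached mapping. It assumes the header layout `>ENTRY_SPECIES/start-end ACCESSION.version PFxxxxx.version;short_name;`, which is not stated in this thread and should be checked against the release you download. The names `HEADER_RE`, `PfamRecord` and `parse_pfam_header` are hypothetical, and the example header in the usage note is made up.

```python
import re
from typing import NamedTuple, Optional

# ASSUMPTION: header layout is
#   >ENTRY_SPECIES/start-end ACCESSION.version PFxxxxx.version;short_name;
# Verify this against the Pfam-A.fasta file you actually downloaded.
HEADER_RE = re.compile(
    r"^>?(?P<entry>[^/\s]+)/(?P<start>\d+)-(?P<end>\d+)\s+"
    r"(?P<uniprot>\S+)\s+(?P<pfam>PF\d+)(?:\.\d+)?;"
)


class PfamRecord(NamedTuple):
    species: str      # text after the last "_" in the entry name
    uniprot_id: str   # accession without its version suffix
    pfam_id: str      # family accession without its version suffix
    start: int
    end: int


def parse_pfam_header(header: str) -> Optional[PfamRecord]:
    """Return the parsed fields, or None if the header does not match."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        return None
    entry = match.group("entry")
    species = entry.rsplit("_", 1)[-1] if "_" in entry else ""
    uniprot_id = match.group("uniprot").split(".", 1)[0]
    return PfamRecord(
        species=species,
        uniprot_id=uniprot_id,
        pfam_id=match.group("pfam"),
        start=int(match.group("start")),
        end=int(match.group("end")),
    )
```

For example, `parse_pfam_header(">P12345_HUMAN/10-50 P12345.2 PF00001.21;7tm_1;")` yields a record with species `HUMAN`, uniprot_id `P12345`, pfam_id `PF00001`, start `10` and end `50`. Dropping the version suffixes is a choice made for this sketch; the attached mapping may keep them.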

@psturmfels

This is awesome! Thank you! Out of curiosity, how did you link back to the pfam_ids? Did you actually just compare every literal sequence string between the tape dataset and the Pfam release?

@christophfeinauer
Author

Yes. I just parsed Pfam-A.fasta and mapped the sequences back to the lmdb files. With Pfam 32 there were no missing sequences. I also checked a random subset of the mapping manually and it looks good.

The script also creates a version of the lmdb databases that contains all the information about Pfam mappings, species, etc. I can share them if someone is interested (however, they are trivial to make with the mappings).
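
The exact-match approach described above can be sketched roughly as follows. This is not the script mentioned in this comment, only an illustration under stated assumptions: the FASTA file is read line by line, sequences are compared as uppercase strings, and the dataset is supplied as `(id, sequence)` pairs (reading those pairs out of the lmdb files is left out). The function names `read_fasta`, `index_by_sequence` and `map_back` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; the header has no leading '>'."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def index_by_sequence(fasta_lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each sequence string to every header that carries it."""
    index: Dict[str, List[str]] = defaultdict(list)
    for header, sequence in read_fasta(fasta_lines):
        index[sequence.upper()].append(header)
    return index


def map_back(
    dataset: Iterable[Tuple[int, str]], index: Dict[str, List[str]]
) -> Dict[int, List[str]]:
    """Look up each dataset sequence; an empty list means no exact match."""
    return {rec_id: index.get(seq.upper(), []) for rec_id, seq in dataset}
```

With a real file this would be called as `index_by_sequence(open("Pfam-A.fasta"))`. A sequence string may occur under more than one header, so the sketch returns a list per id rather than a single match; how to resolve such ties is left open here.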

@rmrao added the documentation and quick-fix labels on Dec 17, 2020