Missing a3m files from the filtered uniclust30 database in OpenProteinSet #397

Open
damiano-sg opened this issue Jan 17, 2024 · 0 comments

damiano-sg commented Jan 17, 2024

Hello, I downloaded the entire filtered uniclust30 database from AWS and noticed that some clusters contain only the pdb folder and are missing the a3m folder with the MSA. Is there a reason for that?
I counted 677 clusters that have this problem. Here are some of them: A0A023B4W7, A0A023SCZ3, A0A044VF87, A0A059F1C3.
Here is a file with all the clusters missing the MSA: missing_msas.txt
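
For anyone who wants to reproduce the count, here is a minimal sketch of such a check. It assumes the download is laid out as one directory per cluster, each expected to contain an `a3m` and a `pdb` subfolder. The function name `find_clusters_missing_msa` and the example root path are hypothetical, not part of OpenProteinSet.

```python
from pathlib import Path


def find_clusters_missing_msa(root, msa_dirname="a3m"):
    """Return the sorted names of cluster directories that lack an MSA subfolder.

    Assumed layout (not confirmed by the maintainers):
        <root>/<cluster_id>/a3m/...
        <root>/<cluster_id>/pdb/...
    """
    missing = []
    for cluster_dir in Path(root).iterdir():
        # Skip stray files sitting next to the cluster directories.
        if not cluster_dir.is_dir():
            continue
        if not (cluster_dir / msa_dirname).is_dir():
            missing.append(cluster_dir.name)
    return sorted(missing)


if __name__ == "__main__":
    import sys

    # Usage (hypothetical path): python check_msas.py /data/uniclust30_filtered
    if len(sys.argv) > 1:
        for name in find_clusters_missing_msa(sys.argv[1]):
            print(name)
```

Redirecting the printed names to a file gives a list in the same form as missing_msas.txt.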

I also have another question: how do I get the representative sequence for each cluster? Is it the first sequence in the a3m file? In some cases the first sequence is called consensus, for instance for A0A009FAV8. Does that mean that the first sequence is not always the representative?
I also tried to look up the representative sequences in the list of clusters on the Uniclust30-2018_08 website, but the cluster names there do not seem to match the ones in OpenProteinSet.
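
To inspect which clusters start with a consensus entry, a small sketch like the following can help. It assumes the a3m files are FASTA-like (header lines starting with `>`, optional `#` comment lines) and that a consensus entry can be recognised by the word "consensus" in its header. Both helper names are hypothetical, and the sketch does not settle whether the first record is the representative.

```python
def first_a3m_record(lines):
    """Return (header, sequence) for the first record in an a3m/FASTA-like file.

    `lines` is any iterable of strings, for example an open file handle.
    """
    header = None
    seq_parts = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and '#' comment lines that may precede the records.
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            if header is not None:
                break  # reached the second record, so stop reading
            header = line[1:]
        elif header is not None:
            seq_parts.append(line)
    if header is None:
        raise ValueError("no record found")
    return header, "".join(seq_parts)


def looks_like_consensus(header):
    """Heuristic (an assumption): the header mentions 'consensus'."""
    return "consensus" in header.lower()
```

With an open file handle `fh` for one cluster's a3m file, `header, seq = first_a3m_record(fh)` followed by `looks_like_consensus(header)` flags the clusters where the first entry is not a plain sequence record.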
