ffindex_order giving many fewer entries in output .ff{data,index} files than in input files #343

Open
hrp1000 opened this issue Mar 24, 2023 · 4 comments

hrp1000 commented Mar 24, 2023

I'm trying to create a custom database, but after running ffindex_order I get far fewer entries in _a3m.ff{data,index} and _hhm.ffindex than are present in the input files.

I don't have the mpi versions installed.

My normal DB will have ~70K entries, but this behaviour can be reproduced with just 100 entries in the initial _???.ff{data,index} files.

Running hh-suite/3.3.0 on Intel hardware, CentOS Linux release 7.9.2009 (Core):

cstranslate -f -x 0.3 -c 4 -I a3m -i full_a3m -o full_cs219
sort -k3 -n full_cs219.ffindex | cut -f1 > sorting.dat
sed 's/\.a3m$/.hhm/' sorting.dat > sorting.hhm # needed because ffindex_order does not work on the .hhm files with the .a3m extensions from the original sorting.dat
ffindex_order sorting.hhm full_hhm.ff{data,index} full_hhm_ordered.ff{data,index}
ffindex_order sorting.dat full_a3m.ff{data,index} full_a3m_ordered.ff{data,index}

wc -l full*index
100 full_a3m.ffindex
6 full_a3m_ordered.ffindex
100 full_cs219.ffindex
100 full_hhm.ffindex
6 full_hhm_ordered.ffindex
312 total

My assumption is that I've screwed up somewhere - an indication of where would be most useful!
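
One way to narrow this down, offered as a sketch rather than a confirmed diagnosis: check whether every name in the sorting file actually has an entry in the index, and whether the index itself is sorted by name. This assumes each .ffindex line is tab-separated as name, offset, length (consistent with the sort -k3 step above), and that lookups by name need the index in plain byte order. The function names are hypothetical and that second assumption is not taken from the hh-suite documentation.

```shell
#!/bin/sh
# Diagnostic sketch using standard POSIX tools only.
# Function names are hypothetical, not part of hh-suite.
# Assumption: each .ffindex line is "name<TAB>offset<TAB>length".

# Print every name in the sorting file that has no entry in the index.
missing_names() {
    sorting_file=$1
    index_file=$2
    awk -F'\t' -v idx="$index_file" '
        BEGIN {
            # Load column 1 (entry names) of the index into a lookup table.
            while ((getline line < idx) > 0) {
                split(line, field, "\t")
                seen[field[1]] = 1
            }
            close(idx)
        }
        !($1 in seen)   # print sorting-file names with no index entry
    ' "$sorting_file"
}

# Succeed only if the index names are in plain byte (C locale) order.
# Assumption: lookups by name depend on this ordering.
index_is_sorted() {
    cut -f1 "$1" | LC_ALL=C sort -c 2>/dev/null
}
```

With the files above this could be run as `missing_names sorting.dat full_a3m.ffindex` and `index_is_sorted full_a3m.ffindex || echo "not in byte order"`. If the first command prints names, the extensions or names do not match; if the second fails, the ordering of the input index is worth a closer look.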

idrinnen commented Jun 5, 2023

I have the same issue sorting the a3m and hhm databases of an entire proteome. Please let me know if you found a solution. Thanks!

hrp1000 commented Jun 5, 2023

Hi Idrinnen

I'm still trying to find a robust solution to this; what I can say with some confidence is that sometimes it appears to work and sometimes it doesn't, and the solutions in the GitHub issues pages aren't very reliable. I have to put more effort in this week (by sheer coincidence!) to get it working.
I don't expect anyone from the Soeding team to respond because my understanding (which could well be in error - I hope it is!!) is that they have lost funding for this project - which would be a huge shame.

hrp1000 commented Aug 21, 2023

Hi Idrinnen

Sorry about the long delay - I've been distracted by other things for a couple of months, but I can now come back to this.

What I have found so far is that adding entries to an existing DB mostly works but barfs on some entries, so I end up with an incomplete (and inconsistent) DB. So far, the only way I've managed to build full DBs from my particular selection is to rebuild the DB from scratch whenever I have to add new entries (I am now working on this again). I don't think this is the best solution, but it works for me and at least doesn't fail; unfortunately, I have to update my DB once a week to include new entries from the PDB.

My guess is that I'm doing something wrong somewhere...

hrp1000 commented Oct 6, 2023

Finally got round to looking at this.

My solution (which works okay for me) is to rebuild the DB from the a3m files every time - I put this into a cron job to run once a week when the PDB is updated. Uses more processing time than just adding entries, but it does seem to be robust. Until hh-suite has more funding so that problems like this can be answered by the hh-suite group, this is what I will be going with.
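
For a weekly cron rebuild like this, a small sanity check after the build can catch the incomplete-DB case before the new files replace the old ones. The following is only a sketch: `same_entry_count` is a hypothetical helper, and it assumes one line per entry in each .ffindex file, as the `wc -l` output earlier in this thread suggests.

```shell
#!/bin/sh
# Hypothetical post-rebuild check for a cron job; not part of hh-suite.
# Succeeds only if every listed .ffindex file has the same number of
# lines (assumed to be one line per entry).
same_entry_count() {
    expected=""
    for index_file in "$@"; do
        # A missing or unreadable index is treated as a hard failure.
        [ -r "$index_file" ] || return 2
        # tr strips the padding some wc implementations add.
        count=$(wc -l < "$index_file" | tr -d ' ')
        if [ -z "$expected" ]; then
            expected=$count
        elif [ "$count" -ne "$expected" ]; then
            echo "entry count mismatch: $index_file has $count, expected $expected" >&2
            return 1
        fi
    done
    return 0
}
```

For example, `same_entry_count full_a3m.ffindex full_a3m_ordered.ffindex full_hhm_ordered.ffindex` would have failed on the 100 versus 6 case reported at the top of this issue, so the cron job could stop and keep the previous DB.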
