ffindex_order giving many fewer entries in output .ff{data,index} files than in input files #343

hrp1000 · 2023-03-24T16:36:17Z

I'm trying to create a custom database, but find that I am getting many fewer entries in _a3m.ff{data,index} and _hhm.ffindex after running ffindex_order than are present in the input files.

I don't have the mpi versions installed.

My normal DB will have ~70K entries, but this behaviour can be seen with 100 entries in the initial _???.ff{data,index} files -

Running hh-suite/3.3.0 on Intel hardware, CentOS Linux release 7.9.2009 (Core):

cstranslate -f -x 0.3 -c 4 -I a3m -i full_a3m -o full_cs219
sort -k3 -n full_cs219.ffindex | cut -f1 > sorting.dat
cat sorting.dat | sed 's/.a3m$/.hhm/' > sorting.hhm # need this step because ffindex_order does not run for .hhm files with the extensions in the original sorting.dat file
ffindex_order sorting.hhm full_hhm.ff{data,index} full_hhm_ordered.ff{data,index}
ffindex_order sorting.dat full_a3m.ff{data,index} full_a3m_ordered.ff{data,index}

wc -l full*index
100 full_a3m.ffindex
6 full_a3m_ordered.ffindex
100 full_cs219.ffindex
100 full_hhm.ffindex
6 full_hhm_ordered.ffindex
312 total

My assumption is that I've screwed up somewhere - an indication of where would be most useful!
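
One way to narrow this down is to check whether every name in the sorting file actually appears in the index being reordered. The sketch below is not part of hh-suite. It assumes only that an `.ffindex` file is tab-separated text with the entry name in the first column, and that the sorting file has one name per line. The function names `read_index_names` and `diagnose` are made up for illustration. It also reports whether the index is in byte order, on the unverified assumption that the ffindex tools look entries up by binary search over a name-sorted index.

```python
def read_index_names(path):
    """Return the first tab-separated column of each non-empty line.

    Works for an .ffindex file (name, offset, length) and for a
    sorting file with one entry name per line.
    """
    names = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:
                names.append(line.split("\t")[0])
    return names


def diagnose(sort_names, index_names):
    """Compare the names requested for ordering with the names in the index."""
    index_set = set(index_names)
    # Names that the sorting file asks for but the index does not contain.
    missing = [name for name in sort_names if name not in index_set]
    # ASSUMPTION (not verified against the ffindex source): lookups by name
    # rely on the index being sorted byte-wise, as `LC_ALL=C sort` would do.
    encoded = [name.encode("utf-8") for name in index_names]
    return {
        "requested": len(sort_names),
        "found": len(sort_names) - len(missing),
        "missing": missing,
        "index_byte_sorted": encoded == sorted(encoded),
    }
```

For the files above this would be called as `diagnose(read_index_names("sorting.dat"), read_index_names("full_a3m.ffindex"))`. A non-empty `missing` list points to a name or extension mismatch. If `found` is 100 but `ffindex_order` still writes only 6 entries, the problem lies elsewhere, and an index that is not byte-sorted may be worth checking.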

idrinnen · 2023-06-05T07:10:03Z

I have the same issue sorting the a3m and hhm databases of an entire proteome. Please let me know if you found a solution. Thanks!

hrp1000 · 2023-06-05T08:54:04Z

Hi Idrinnen

I'm still trying to find a robust solution to this; what I can say with some confidence is that sometimes it appears to work and sometimes it doesn't - the solutions in the GitHub issues pages aren't very reliable. I have to put more effort into this this week (by sheer coincidence!) to get it working.
I don't expect anyone from the Soeding team to respond because my understanding (which could well be in error - I hope it is!!) is that they have lost funding for this project - which would be a huge shame.

hrp1000 · 2023-08-21T11:20:36Z

Hi Idrinnen

Sorry about the long delay - I've been distracted by other things for a couple of months, but I can now come back to this.

What I have found so far is that if I try to add entries to an existing DB then mostly it works but barfs on some entries, so I end up with an incomplete (and inconsistent) DB. The only way I've managed to build full DBs from my particular selection (so far, I am now working on this again) is to rebuild the DB from scratch whenever I have to add new entries. I don't think this is the best solution but it works for me (unfortunately, I have to update my DB once a week to include new entries from the PDB) and at least doesn't fail.
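
A quick way to see which entries went missing after an incremental update is to compare the entry names across the a3m, hhm and cs219 index files. This is only a sketch under the same assumption as before, namely that each `.ffindex` line starts with the entry name followed by a tab. The helper names are hypothetical. It strips the final extension so that `X.a3m` and `X.hhm` count as the same entry, which will misfire if entry names themselves contain dots.

```python
import os


def entry_stems(index_lines):
    """Collect entry names from ffindex lines, with the last extension removed."""
    stems = set()
    for line in index_lines:
        name = line.split("\t")[0].strip()
        if name:
            # ASSUMPTION: the text after the last dot is a file extension
            # such as .a3m or .hhm, not part of the entry identifier.
            stems.add(os.path.splitext(name)[0])
    return stems


def compare_indexes(named_lines):
    """Map each index label to the sorted entries it lacks.

    `named_lines` maps a label (e.g. "a3m") to the lines of that index.
    An entry is reported for a label if it appears in any other index
    but not in that one.
    """
    stems = {label: entry_stems(lines) for label, lines in named_lines.items()}
    union = set().union(*stems.values())
    return {label: sorted(union - present) for label, present in stems.items()}
```

With real files, each value in `named_lines` can be produced by `open(path).read().splitlines()`. Every list comes back empty when the three databases hold the same set of entries.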

My guess is that I'm doing something wrong somewhere...

hrp1000 · 2023-10-06T08:45:55Z

Finally got round to looking at this.

My solution (which works okay for me) is to rebuild the DB from the a3m files every time - I put this into a cron job to run once a week when the PDB is updated. Uses more processing time than just adding entries, but it does seem to be robust. Until hh-suite has more funding so that problems like this can be answered by the hh-suite group, this is what I will be going with.
