You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
As written, it assumes that the FASTA file is written with headers and sequences alternating every other line. Often, however, FASTA files are written with a header line (starting with ">") followed by multiple lines of sequence information, typically no more than 80 characters per line. Indeed, the BioPython implementation of writing sequences in FASTA format uses this multi-lined approach for long sequences.
For any FASTA file written with multi-line sequences, the current code will incorrectly treat some sequence lines as headers and some header lines as sequences.
Fortunately, this bug will only have an effect when setting fasta_chunk_size, as the default chunk size of len(seqs) results in an integer much larger the actual number of sequences in the FASTA file. Thus, the subsequent lines
will just recreate the original FASTA file. If a chunksize was set, though, then some sequences at the beginning and end of the chunk may end up truncated; some sequence fragments at the beginning will also be missing header information and some headers at the end may also be missing sequence information.
I'd recommend just using BioPython to load in the FASTA files as it can handle both single-line and multi-line FASTA file formats. This would also simplify the chunking procedure. For instance, the sequences can be loaded with:
seqs=list(SeqIO.parse(args.input_fasta, "fasta"))
Chunking and saving can then be performed as below:
If I run the code stepwise as is on this file, the last line will be cut off. Running through everything stepwise:
names = lines[::2] >> [>seq1, AAAAAA, BBBBBBBBBBBB, BBBB]
seqs = lines[1::2] >> [AAAAAAAAAAA, >seq2, BBBBBBBBBBBB]
chunk_size = len(seqs) >> 3
s = 0
while (s < 3): # Explicitly writing "3" for "len(seqs)"
e = s + 3 >> 3
chunk_fasta = [el for tup in zip(names[0:3], seqs[0:3]) for el in tup] >> [>seq1, AAAAAAAAAAA, AAAAAA, >seq2, BBBBBBBBBBBB, BBBBBBBBBBBB]
s = e ## Breaks loop as s == 3
With an even number of lines, we don't lose the last line and there will be no error as, fortunately, the chunking process just puts everthing back together. With an odd number of lines, however, the query sequence in the alignments for the last sequence in the FASTA file will be truncated--there will thus be a mismatch between the sequence input to the structure prediction script via the FASTA file and query sequence in the alignments. I think that this may be related to issue #186 , as when I run prediction with these mismatches, I get a singularity-like structure that fails minimization with the error "Minimization failed after 100 attempts" and repeated printouts of "article coordinate is nan". The coordinates are likely "nan" because they do not exist in the alignment.
The way the FASTA file is loaded in
precompute_alignments_mmseqs.py
is very fragile.openfold/scripts/precompute_alignments_mmseqs.py
Lines 36 to 40 in ef0c9fa
As written, it assumes that the FASTA file is written with headers and sequences alternating every other line. Often, however, FASTA files are written with a header line (starting with ">") followed by multiple lines of sequence information, typically no more than 80 characters per line. Indeed, the
BioPython
implementation of writing sequences in FASTA format uses this multi-lined approach for long sequences.For any FASTA file written with multi-line sequences, the current code will incorrectly treat some sequence lines as headers and some header lines as sequences.
Fortunately, this bug will only have an effect when setting
fasta_chunk_size
, as the default chunk size oflen(seqs)
results in an integer much larger the actual number of sequences in the FASTA file. Thus, the subsequent linesopenfold/scripts/precompute_alignments_mmseqs.py
Lines 51 to 55 in ef0c9fa
will just recreate the original FASTA file. If a chunksize was set, though, then some sequences at the beginning and end of the chunk may end up truncated; some sequence fragments at the beginning will also be missing header information and some headers at the end may also be missing sequence information.
I'd recommend just using
BioPython
to load in the FASTA files as it can handle both single-line and multi-line FASTA file formats. This would also simplify the chunking procedure. For instance, the sequences can be loaded with:Chunking and saving can then be performed as below:
The text was updated successfully, but these errors were encountered: