Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in Reading FASTAs for precompute_alignments_mmseqs.py #411

Open
brucejwittmann opened this issue Feb 26, 2024 · 1 comment
Open

Bug in Reading FASTAs for precompute_alignments_mmseqs.py #411

brucejwittmann opened this issue Feb 26, 2024 · 1 comment

Comments

@brucejwittmann
Copy link

The way the FASTA file is loaded in precompute_alignments_mmseqs.py is very fragile.

with open(args.input_fasta, "r") as f:
lines = [l.strip() for l in f.readlines()]
names = lines[::2]
seqs = lines[1::2]

As written, it assumes that the FASTA file is written with headers and sequences alternating every other line. Often, however, FASTA files are written with a header line (starting with ">") followed by multiple lines of sequence information, typically no more than 80 characters per line. Indeed, the BioPython implementation of writing sequences in FASTA format uses this multi-lined approach for long sequences.

For any FASTA file written with multi-line sequences, the current code will incorrectly treat some sequence lines as headers and some header lines as sequences.

Fortunately, this bug will only have an effect when setting fasta_chunk_size, as the default chunk size of len(seqs) results in an integer much larger the actual number of sequences in the FASTA file. Thus, the subsequent lines

s = 0
while(s < len(seqs)):
e = s + chunk_size
chunk_fasta = [el for tup in zip(names[s:e], seqs[s:e]) for el in tup]
s = e

will just recreate the original FASTA file. If a chunksize was set, though, then some sequences at the beginning and end of the chunk may end up truncated; some sequence fragments at the beginning will also be missing header information and some headers at the end may also be missing sequence information.

I'd recommend just using BioPython to load in the FASTA files as it can handle both single-line and multi-line FASTA file formats. This would also simplify the chunking procedure. For instance, the sequences can be loaded with:

seqs = list(SeqIO.parse(args.input_fasta, "fasta"))

Chunking and saving can then be performed as below:

s = 0
while(s < len(seqs)):
    e = s + chunk_size
    chunk_fasta = seqs[s: e]
    s = e
    
    chunk_fasta_path = os.path.join(args.output_dir, "tmp.fasta")
    with open(chunk_fasta_path, "w") as f:
        SeqIO.write(chunk_fasta, f, "fasta"))
@brucejwittmann
Copy link
Author

Bumping this as it actually causes a problem even when the chunksize is not set. Imagine a FASTA file with an odd number of lines in it:

>seq1
AAAAAAAAAAA
AAAAAA
>seq2
BBBBBBBBBBBB
BBBBBBBBBBBB
BBBB

If I run the code stepwise as is on this file, the last line will be cut off. Running through everything stepwise:

names = lines[::2]  >> [>seq1, AAAAAA, BBBBBBBBBBBB, BBBB]
seqs = lines[1::2] >> [AAAAAAAAAAA, >seq2, BBBBBBBBBBBB]

chunk_size = len(seqs) >> 3

s = 0
while (s < 3): # Explicitly writing "3" for "len(seqs)"
    e = s + 3 >> 3
    chunk_fasta = [el for tup in zip(names[0:3], seqs[0:3]) for el in tup] >> [>seq1, AAAAAAAAAAA, AAAAAA, >seq2, BBBBBBBBBBBB, BBBBBBBBBBBB] 
    s = e ## Breaks loop as s == 3

With an even number of lines, we don't lose the last line and there will be no error as, fortunately, the chunking process just puts everthing back together. With an odd number of lines, however, the query sequence in the alignments for the last sequence in the FASTA file will be truncated--there will thus be a mismatch between the sequence input to the structure prediction script via the FASTA file and query sequence in the alignments. I think that this may be related to issue #186 , as when I run prediction with these mismatches, I get a singularity-like structure that fails minimization with the error "Minimization failed after 100 attempts" and repeated printouts of "article coordinate is nan". The coordinates are likely "nan" because they do not exist in the alignment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant