Bug in Reading FASTAs for `precompute_alignments_mmseqs.py` #411

brucejwittmann · 2024-02-26T19:16:23Z

The way the FASTA file is loaded in precompute_alignments_mmseqs.py is very fragile.

openfold/scripts/precompute_alignments_mmseqs.py

Lines 36 to 40 in ef0c9fa

    
           with open(args.input_fasta, "r") as f: 
        
               lines = [l.strip() for l in f.readlines()] 
        
           names = lines[::2] 
        
           seqs =  lines[1::2]

As written, it assumes that the FASTA file is written with headers and sequences alternating every other line. Often, however, FASTA files are written with a header line (starting with ">") followed by multiple lines of sequence information, typically no more than 80 characters per line. Indeed, the BioPython implementation of writing sequences in FASTA format uses this multi-lined approach for long sequences.

For any FASTA file written with multi-line sequences, the current code will incorrectly treat some sequence lines as headers and some header lines as sequences.

Fortunately, this bug will only have an effect when setting fasta_chunk_size, as the default chunk size of len(seqs) results in an integer much larger the actual number of sequences in the FASTA file. Thus, the subsequent lines

openfold/scripts/precompute_alignments_mmseqs.py

Lines 51 to 55 in ef0c9fa

    
           s = 0 
        
           while(s < len(seqs)): 
        
               e = s + chunk_size 
        
               chunk_fasta = [el for tup in zip(names[s:e], seqs[s:e]) for el in tup]  
        
               s = e

will just recreate the original FASTA file. If a chunksize was set, though, then some sequences at the beginning and end of the chunk may end up truncated; some sequence fragments at the beginning will also be missing header information and some headers at the end may also be missing sequence information.

I'd recommend just using BioPython to load in the FASTA files as it can handle both single-line and multi-line FASTA file formats. This would also simplify the chunking procedure. For instance, the sequences can be loaded with:

seqs = list(SeqIO.parse(args.input_fasta, "fasta"))

Chunking and saving can then be performed as below:

s = 0
while(s < len(seqs)):
    e = s + chunk_size
    chunk_fasta = seqs[s: e]
    s = e
    
    chunk_fasta_path = os.path.join(args.output_dir, "tmp.fasta")
    with open(chunk_fasta_path, "w") as f:
        SeqIO.write(chunk_fasta, f, "fasta"))

The text was updated successfully, but these errors were encountered:

brucejwittmann · 2024-04-04T17:14:19Z

Bumping this as it actually causes a problem even when the chunksize is not set. Imagine a FASTA file with an odd number of lines in it:

>seq1
AAAAAAAAAAA
AAAAAA
>seq2
BBBBBBBBBBBB
BBBBBBBBBBBB
BBBB

If I run the code stepwise as is on this file, the last line will be cut off. Running through everything stepwise:

names = lines[::2]  >> [>seq1, AAAAAA, BBBBBBBBBBBB, BBBB]
seqs = lines[1::2] >> [AAAAAAAAAAA, >seq2, BBBBBBBBBBBB]

chunk_size = len(seqs) >> 3

s = 0
while (s < 3): # Explicitly writing "3" for "len(seqs)"
    e = s + 3 >> 3
    chunk_fasta = [el for tup in zip(names[0:3], seqs[0:3]) for el in tup] >> [>seq1, AAAAAAAAAAA, AAAAAA, >seq2, BBBBBBBBBBBB, BBBBBBBBBBBB] 
    s = e ## Breaks loop as s == 3

With an even number of lines, we don't lose the last line and there will be no error as, fortunately, the chunking process just puts everthing back together. With an odd number of lines, however, the query sequence in the alignments for the last sequence in the FASTA file will be truncated--there will thus be a mismatch between the sequence input to the structure prediction script via the FASTA file and query sequence in the alignments. I think that this may be related to issue #186 , as when I run prediction with these mismatches, I get a singularity-like structure that fails minimization with the error "Minimization failed after 100 attempts" and repeated printouts of "article coordinate is nan". The coordinates are likely "nan" because they do not exist in the alignment.

brucejwittmann mentioned this issue Apr 4, 2024

ValueError: Minimization failed after 100 attempts. #186

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug in Reading FASTAs for `precompute_alignments_mmseqs.py` #411

Bug in Reading FASTAs for `precompute_alignments_mmseqs.py` #411

brucejwittmann commented Feb 26, 2024

brucejwittmann commented Apr 4, 2024

Bug in Reading FASTAs for precompute_alignments_mmseqs.py #411

Bug in Reading FASTAs for precompute_alignments_mmseqs.py #411

Comments

brucejwittmann commented Feb 26, 2024

brucejwittmann commented Apr 4, 2024

Bug in Reading FASTAs for `precompute_alignments_mmseqs.py` #411

Bug in Reading FASTAs for `precompute_alignments_mmseqs.py` #411