-
Notifications
You must be signed in to change notification settings - Fork 102
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FATAL ERROR in extractor #668
Comments
The FATAL error seems justified, as apparently you are getting two different read IDs on consecutive lines:
Just to recap: this is genome-wide alignments without any subsequent filtering for specific regions or the like? In theory, the deduplication in paired-end mode should also fail in a similar way (I am currently unsure if it also checks if you specify
Given that you have a substantial amount of discrepancy between the OT and OB strands however, my guess would be that you ran out of memory during a later stage of the alignment process somehow. Maybe you could test whether only valid pairs are present in the original BAM file? You could also attempt to run the extractor on the un-deduplicated file to see it that succeeds (my guess is that it will also fail). If it does then it might have to do with the deduplication, did you get a complete deduplication report? I would personally probably restart the alignments first thing (making sure you've got enough resources!), and go on the error hunt once that's under way. |
I have submitted another run of alignment (a week ago) but it will take a long time to wait in line... Here is the deduplication report
The overall pipeline I follow for the WGBS is: where it suggests a flag '-p' for pair-end. |
Thanks for the extra information. If you leave out the I think you might be able to come up with a quick one liner (in |
Hi, |
Have you tried re-running the alignments? You might try running only a small amount of your files (e.g. If you are 'just a biologist' you could really find chatGPT very useful for such a task. I haven't tested it but in my experience it will either work out of the box, or require little refinement. I can of course understand if you would rather be able to do this yourself, or not at all. Just fyi, I asked the following:
import pysam
import sys
def read_bam_file(file_path):
try:
bamfile = pysam.AlignmentFile(file_path, 'rb')
prev_read = None
line_number = 0
for read in bamfile.fetch(until_eof=True):
line_number += 1
if line_number % 2 == 0: # Process two lines at a time
if prev_read.query_name != read.query_name:
print(f'Error: Read IDs do not match at lines {line_number-1} and {line_number}')
print(f'Line {line_number-1}: {prev_read}')
print(f'Line {line_number}: {read}')
sys.exit('Terminating due to non-matching read IDs.')
prev_read = read
except FileNotFoundError:
print('File not found. Please check the file path and try again.')
except Exception as e:
print(f'An error occurred: {e}')
if __name__ == '__main__':
if len(sys.argv) < 2:
print('Usage: python script.py <BAM file path>')
else:
read_bam_file(sys.argv[1]) |
I have read many closed issues with a similar problem: #211 #219 #360
I used the module already set up in the supercomputer (ver. 0.24.1).
Here is one of the FATAL errors:
[FATAL ERROR:] The IDs of Read 1 (A00977:664:H535JDSX7:3:1219:10402:31563_1:N:0:AAGACCGT+CAATCGAC) and Read 2 (A00977:664:H535JDSX7:3:1219:10963:31563_1:N:0:AAGACCGT+CAATCGAC) are not the same. This might be the result of sorting the BAM files by chromosomal position or merging several files with Samtools sort, and this is not compatible with correct methylation extraction. Please use an unsorted file instead or sort the file by name using the command 'samtools sort -n'. Paired-end files may be merged properly (without risking this error) using either 'samtools merge -n' or 'samtools cat'.
I did "grep" the two reads:
A00977:664:H535JDSX7:3:1219:10402:31563_1:N:0:AAGACCGT+CAATCGAC 99 chr5H 568314201 42 136M = 568314229 164 TTATTTTAATTTGTAATGATTTATTTATGGTTGTTTAATAAAAAAAATGTGAGTAATATAGAATTTATGGATATGAAAAGGTAAATTTGTTTTTGAGATGATTTTTATAGTGTTGTTGTTTGTTTTTATTTATTTA FFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFF:FFFFFFFFF,:FF:FFF:FFFFFFFFFFFFFFFFFFF NM:i:24 MD:Z:1C4C2C10C0C3C7C19C2C7C0C20C2C2C10C0C0C12C3C2C0C1C4C0C1 XM:Z:.h....h..h..........hh...h.......h...................h..h.......hh....................x..h..x..........hhh............h...h..hh.h....hh. XR:Z:CT XG:Z:CT
A00977:664:H535JDSX7:3:1219:10963:31563_1:N:0:AAGACCGT+CAATCGAC 147 chr3H 306704954 42 136M = 306704693 -397 TTGTTTGTATTAAACGTTATTTCGTAACTGGTTGATTATAAAGGTGTTTTACAGGTATTTCTGAGGGTGTTTTTTGGGTTGGTATGGATCGAGATTGGGATGTTTTTCTGTATGACAGAGAGGTATTTTTGGGTTT F,FF,FFFF:FF:::FFF:F:FFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFF,FF:FF:FFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFF NM:i:19 MD:Z:10C6C1C16C9C1C9C11C1C9C11C8C0C0C20C1C4C0C0C0 XM:Z:..........h...Z..h.h..Z....X........h.........h.h..X......h.X.........h.h.........h......Z....x........hhh.X.......X..........h.x....hhh XR:Z:GA XG:Z:CT
Thus, I guess the problem is similar to #211. However, I am doing at the whole genome scale so I can't crop a region to do so.
This only happened to one of my four files. I went back to check the file after the aliment, I found this number discrepancy, which is much more than the others:
Number of sequence pairs with unique best (first) alignment came from the bowtie output:
CT/GA/CT: 127321563 ((converted) top strand)
GA/CT/CT: 0 (complementary to (converted) top strand)
GA/CT/GA: 0 (complementary to (converted) bottom strand)
CT/GA/GA: 159832084 ((converted) bottom strand)
I am not sure if this is the true cause or not so I need some suggestions to solve or bypass this...
command used:
bismark --genome GP.fa -o TNPS2 -1 TNPS2_R1P.fastq -2 TNPS2_R2P.fastq --parallel 3
deduplicate_bismark -p -o dedup/TNPS2_dedup.bam TNPS2/TNPS2_R1P_bismark_bt2_pe.bam
bismark_methylation_extractor --ignore_r2 2 --parallel 16 -o /Ext/TNPS2 --samtools_path dedup/TNPS2_dedup.bam
The text was updated successfully, but these errors were encountered: