Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Number of bisulfite transformed reads are not equal between Reads 1 and 2 #657

Open
itsjoeka opened this issue Feb 27, 2024 · 6 comments
Open

Comments

@itsjoeka
Copy link

Dear Bismark Author,
I was running my data with the bismark tool and i got the error message below. the version of bismark is: version v0.24.2 and the code i used to run is:
/home/jkalami/.conda/envs/bismark/bin/./bismark --non_directional --parallel 8 -f /media/biodataD/omics_project/bisulphite-dataset/ref -1 SRR10493735_1.fa -2 SRR10493735_2.fa report.txt

ERROR MESSAGE:
Writing a C -> T converted version of the input file SRR10493735_2.fa.temp.2 to SRR10493735_2.fa.temp.2_C_to_T.fa
Writing a G -> A converted version of the input file SRR10493735_2.fa.temp.2 to SRR10493735_2.fa.temp.2_G_to_A.fa

Created C -> T as well as G -> A converted versions of the FastA file SRR10493735_2.fa.temp.6 (49610851 sequences in total)

[FATAL ERROR]: Number of bisulfite transformed reads are not equal between Read 1 (#49602128) and Read 2 (#49610851).
Possible causes: file truncation, or as a result of specifying read pairs that do not belong to each other?! Please re-specify file names! Exiting...

I hope to hear from you soon.
Kind regards
Jonathan.

@FelixKrueger
Copy link
Owner

FelixKrueger commented Feb 27, 2024

Hi Jonathan,

I would say this is the first time in well over a decade that I see anyone using FastA files for alignment. I am wondering how you got hold of FastA files? The typical procedure would involve downloading the data in FastQ format:

Accession	Title	Platform	Total bases	Create date	SRA URL	SRA filename	SRA nice filename	FastQ URL	FastQ Aspera URL	FastQ filename	FastQ nice filename
SRR10493735	GSM4176376: C1; Homo sapiens; Bisulfite-Seq	Illumina HiSeq 2000	1190884	1574899200000	ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR104/SRR10493735/SRR10493735.sra	SRR10493735.sra	SRR10493735_GSM4176376_C1_Homo_sapiens_Bisulfite-Seq.sra	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR104/035/SRR10493735/SRR10493735_1.fastq.gz	era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR104/035/SRR10493735/SRR10493735_1.fastq.gz	SRR10493735_1.fastq.gz	SRR10493735_GSM4176376_C1_Homo_sapiens_Bisulfite-Seq_1.fastq.gz
SRR10493735	GSM4176376: C1; Homo sapiens; Bisulfite-Seq	Illumina HiSeq 2000	1190884	1574899200000	ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR104/SRR10493735/SRR10493735.sra	SRR10493735.sra	SRR10493735_GSM4176376_C1_Homo_sapiens_Bisulfite-Seq.sra	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR104/035/SRR10493735/SRR10493735_2.fastq.gz	era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR104/035/SRR10493735/SRR10493735_2.fastq.gz	SRR10493735_2.fastq.gz	SRR10493735_GSM4176376_C1_Homo_sapiens_Bisulfite-Seq_2.fastq.gz

You would then determine which type of library it is, and run the data through Trim Galore (FastQ required). Once that is done you would proceed to the mapping step.

And just to make sure that the FastA files you have both have the same number of reads could you run the following to count the number of lines?

for i in *SRR10493735*.fa; do echo $i; gunzip -c $i | wc -l; done

I should probably stress that I would recommend the FastQ steps described above. Also as a word of warning: --parallel 8 will probably consume north of 120GB of RAM for the alignment, you might want to tune this down a little....

@itsjoeka
Copy link
Author

Hi Mr. Krueger,
Thanks for such a quick reply.
First, the workflow I am using for this project directed me to use the fasta files. I got hold of the fastA files by downloading the fastQ files from the SRA explorer like you described above, trimmed the adapters with trimgalore and converted them into the fastA format using the "awk" command.
Thank you for the advice, I will try with the fastQ step and let you know how it goes.
kinds regards
jonathan.

@itsjoeka
Copy link
Author

itsjoeka commented Mar 9, 2024

Hi Felix,
I hope this message finds you well. I resorted to using the fastq files for the aligment process. I am using hg38.fa as the reference genome for the analysis. Before the alignment step, I prepared the reference genome using the command; bismark_genome_preparation --bowtie2 --verbose /bisulphite-dataset/ref/, which to successful completion. Afterwards I moves to the alignment step by running the command; bismark --non_directional -f /bisulphite-dataset/ref -1 SRR10493736_1_val_1.fq.gz -2 SRR10493736_2_val_2.fq.gz report.txt.
Unforturnately, the process terminated and the screen read;

chr chrUn_KI270448v1 (7992 bp)
chr chrUn_KI270521v1 (7642 bp)
chr chrUn_GL000195v1 (182896 bp)
chr chrUn_GL000219v1 (179198 bp)
chr chrUn_GL000220v1 (161802 bp)
chr chrUn_GL000224v1 (179693 bp)
chr chrUn_KI270741v1 (157432 bp)
chr chrUn_GL000226v1 (15008 bp)
chr chrUn_GL000213v1 (164239 bp)
chr chrUn_KI270743v1 (210658 bp)
chr chrUn_KI270744v1 (168472 bp)
chr chrUn_KI270745v1 (41891 bp)
chr chrUn_KI270746v1 (66486 bp)
chr chrUn_KI270747v1 (198735 bp)
chr chrUn_KI270748v1 (93321 bp)
chr chrUn_KI270749v1 (158759 bp)
chr chrUn_KI270750v1 (148850 bp)
chr chrUn_KI270751v1 (150742 bp)
chr chrUn_KI270752v1 (27745 bp)
chr chrUn_KI270753v1 (62944 bp)
chr chrUn_KI270754v1 (40191 bp)
chr chrUn_KI270755v1 (36723 bp)
chr chrUn_KI270756v1 (79590 bp)
chr chrUn_KI270757v1 (71251 bp)
chr chrUn_GL000214v1 (137718 bp)
chr chrUn_KI270742v1 (186739 bp)
chr chrUn_GL000216v2 (176608 bp)
chr chrUn_GL000218v1 (161147 bp)
chr chrX (156040895 bp)
chr chrY (57227415 bp)
chr chrY_KI270740v1_random (37240 bp)

Single-core mode: setting pid to 1

Paired-end alignments will be performed

The provided filenames for paired-end alignments are /hicanu/canu-scripts/trimmed/SRR10493736_1_val_1.fq.gz and /hicanu/canu-scripts/trimmed/SRR10493736_2_val_2.fq.gz
Input files are in FastA format
Writing a C -> T converted version of the input file SRR10493736_1_val_1.fq.gz to SRR10493736_1_val_1.fq.gz_C_to_T.fa
Writing a G -> A converted version of the input file SRR10493736_1_val_1.fq.gz to SRR10493736_1_val_1.fq.gz_G_to_A.fa
Input file doesn't seem to be in FastA format at sequence 1: Inappropriate ioctl for device.

The error says my input file is not in the fastA format. Kindly help me know where i went wrong and what i can do to resolve the issue.
Kind regards
Jonathan.

@FelixKrueger
Copy link
Owner

Your command contains the flag -f which instructs Bismark to expect FastA files. Just drop it, and all will be fine.

@itsjoeka
Copy link
Author

Hello Mr. Felix,
Thanks for the wonderful feedbacks, they have helped me start the alignment process which is currently running. I am using the hg38.fa as the reference genome for the alignment. A little concern though, on the screen where I am running the alignment, I see these texts which I have pasted below, and it has been like this since after the c -> T and G -> A conversion step. Should I be concerned even though the process is still running and files being created. the command line I issued was;

bismark --non_directional /home/jupyter/epilepsyvar/dataset/ref_genome/ -1 SRR10493735_1_val_1.fq.gz -2 SRR10493735_2_val_2.fq.gz report>

Chromosomal sequence could not be extracted for SRR10493735.33103437_33103437_length=150 chrM 1
Chromosomal sequence could not be extracted for SRR10493735.33743215_33743215_length=150 chrM 2
Chromosomal sequence could not be extracted for SRR10493735.33744222_33744222_length=150 chrM 2
Processed 34000000 sequence pairs so far
Processed 35000000 sequence pairs so far
Chromosomal sequence could not be extracted for SRR10493735.35930829_35930829_length=150 chrM 2
Processed 36000000 sequence pairs so far
Chromosomal sequence could not be extracted for SRR10493735.36648650_36648650_length=150 chr1_KI270709v1_random 1
Chromosomal sequence could not be extracted for SRR10493735.36653440_36653440_length=150 chr1_KI270709v1_random 1
Processed 37000000 sequence pairs so far
Chromosomal sequence could not be extracted for SRR10493735.37699908_37699908_length=150 chrM 1
Processed 38000000 sequence pairs so far
Chromosomal sequence could not be extracted for SRR10493735.38295370_38295370_length=150 chrM 2
Chromosomal sequence could not be extracted for SRR10493735.38296453_38296453_length=150 chrM 2
Chromosomal sequence could not be extracted for SRR10493735.38411063_38411063_length=150 chrM 1
Processed 39000000 sequence pairs so far
Processed 40000000 sequence pairs so far
Chromosomal sequence could not be extracted for SRR10493735.40291152_40291152_length=150 chrM 1
Chromosomal sequence could not be extracted for SRR10493735.40847076_40847076_length=150 chrUn_KI270590v1 4535
Processed 41000000 sequence pairs so far
Chromosomal sequence could not be extracted for SRR10493735.41399548_41399548_length=150 chrUn_KI270529v1 2
Chromosomal sequence could not be extracted for SRR10493735.41534559_41534559_length=150 chr1_KI270710v1_random 40027
Processed 42000000 sequence pairs so far
Processed 43000000 sequence pairs so far
Chromosomal sequence could not be extracted for SRR10493735.43082757_43082757_length=150 chrM 16425
Chromosomal sequence could not be extracted for SRR10493735.43751028_43751028_length=150 chrM 2
Chromosomal sequence could not be extracted for SRR10493735.43762911_43762911_length=150 chrM 2
Processed 44000000 sequence pairs so far
Processed 45000000 sequence pairs so far
Processed 46000000 sequence pairs so far
Processed 47000000 sequence pairs so far
Chromosomal sequence could not be extracted for SRR10493735.47831134_47831134_length=150 chrUn_KI270529v1 1
Processed 48000000 sequence pairs so far
Chromosomal sequence could not be extracted for SRR10493735.48678792_48678792_length=150 chrM 1
Chromosomal sequence could not be extracted for SRR10493735.48679160_48679160_length=150 chrM 1
Chromosomal sequence could not be extracted for SRR10493735.48975409_48975409_length=150 chrM 2
Processed 49000000 sequence pairs so far
Processed 50000000 sequence pairs so far
Chromosomal sequence could not be extracted for SRR10493735.50032426_50032426_length=150 chrM 1
Chromosomal sequence could not be extracted for SRR10493735.50075308_50075308_length=150 chrM 2
Processed 51000000 sequence pairs so far

@FelixKrueger
Copy link
Owner

That's all fine, you just have a few reads that align to the very end of the mitochondrium. Just keep going!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants