Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

methylpy call-methylation-state #87

Open
MagdalenaWinklhofer opened this issue Sep 21, 2023 · 9 comments
Open

methylpy call-methylation-state #87

MagdalenaWinklhofer opened this issue Sep 21, 2023 · 9 comments

Comments

@MagdalenaWinklhofer
Copy link

Hi,

I am working with WGBS data on a non-model organism. I used Bismark (with Bowtie) to align my reads and would like to use Methylpy to identify DMRs. I installed methylpy in a conda environment on our HPC. Since we have a lot of modules available there I did not install other dependencies so far. I wanted to use the "methylpy call-methylation-state" command to get allc files from the alignment .bam files (from Birmark). I used the alignment files and not the deduplicated files (is that correct?). Since I got the error that my genome was not indexed, I indexed it with the following command:

methylpy build-reference
--input-files .../working.genome.masked.262contigs.fasta
--output-prefix genome_index \

This script finished and gave me the following files:

genome_index_f.1.bt2
genome_index_f.3.bt2
genome_index_f.rev.1.bt2
genome_index_r.1.bt2
genome_index_r.3.bt2
genome_index_r.rev.1.bt2
genome_index_f.2.bt2
genome_index_f.4.bt2
genome_index_f.rev.2.bt2
genome_index_r.2.bt2
genome_index_r.4.bt2
genome_index_r.rev.2.bt2

After that I tried it again:

methylpy call-methylation-state
--input-file A1_val_1_bismark_bt2_pe.bam
--paired-end True
--sample A1
--ref-fasta .../working.genome.masked.262contigs.fasta
--num-procs 2
--path-to-output .../11_DML/
--path-to-samtools /cluster/software/SAMtools/1.16.1-GCC-11.3.0 \

But I get multiple errors:

Begin flipping the strand of read 2
Thu Sep 21 16:13:23 2023

Input not indexed. Indexing...
Thu Sep 21 16:14:06 2023

[E::hts_idx_push] Unsorted positions on sequence #49: 1104419 followed by 1104238
samtools index: failed to create index for "A1-D17-1_R1_val_1_bismark_bt2_pe.bam.read2flipped.bam"
Traceback (most recent call last):
File ".../conda_environments/methylpy_env/lib/python3.7/site-packages/methylpy/call_mc_se.py", line 1483, in call_methylated_sites
open(inputf+".bai",'r')
FileNotFoundError: [Errno 2] No such file or directory: 'A1-D17-1_R1_val_1_bismark_bt2_pe.bam.read2flipped.bam.bai'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File ".../methylpy_env/bin/methylpy", line 5, in
parse_args()
File ".../conda_environments/methylpy_env/lib/python3.7/site-packages/methylpy/parser.py", line 221, in parse_args
keep_temp_files=args.keep_temp_files)
File ".../conda_environments/methylpy_env/lib/python3.7/site-packages/methylpy/call_mc_pe.py", line 1373, in call_methylated_sites_pe
keep_temp_files = keep_temp_files)
File ".../conda_environments/methylpy_env/lib/python3.7/site-packages/methylpy/call_mc_se.py", line 1486, in call_methylated_sites
subprocess.check_call(shlex.split(path_to_samtools+"samtools index "+inputf))
File ".../conda_environments/methylpy_env/lib/python3.7/subprocess.py", line 363, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['samtools', 'index', 'A1-D17-1_R1_val_1_bismark_bt2_pe.bam.read2flipped.bam']' returned non-zero exit status 1.

I am sure that the .bam file exists, and they are even in the same directory. I don't know what is meant by the "'['samtools', 'index'," output, since I state the path to SAMtools (is a module on our HPC and I made sure that it was loaded) in the command and I indexed the genome (.fasta file is also stated in the command).

I removed sensitive information in the code above!

Thank you very much for your help!!!

@yupenghe
Copy link
Owner

Hey, sorry that the error messages are not super clear.

  1. For genome indexing, it is about indexing the fasta file. Please checkout this link: http://www.htslib.org/doc/samtools-faidx.html
  2. For the error related to bam file, my guess is that the error is caused by the input bam file being unsorted. Could you try to sort the bam file with samtools sort (link: http://www.htslib.org/doc/samtools-sort.html) and then run methylpy?

In general I would suggest trying the above two fixes with a subset of the input files to know whether they work sooner.

@MagdalenaWinklhofer
Copy link
Author

Hi,
thank you for the very helpful reply. I got it working! :)
I had the genome already indexed, but my bam files were unsorted as you suspected. I ran "samtools sort" and "samtools index," after that, the DMRfind worked as usual.

I performed a test run so far and was surprised that you were always talking about differentially methylated regions (DMR), but in the "result" (output) file, I get positions that are 1-based (column: pos). Wouldn't that be a differentially methylated location (DML). Is there a way to get both outputs, one for DML and one for DMR? I am also struggling a bit to understand all the columns in the output file. I particularly looked at the "results.tsv" file. Do you write anywhere about what all those columns mean? I am particularly interested in understanding the abbreviations "h" and"uc".

Thank you for your help!

PS: Sorry for my late reply. I had a course the last weeks and no time to work on my project.

@yupenghe
Copy link
Owner

Hey, what is your command to run the DMR caller? DMR caller identifies DML or DMS (S = site) first and then merges them to DMRs. The output should be in the *.DMR.bed file. For the results.tsv file, m, uc and h means the # methylated reads, # unmethylated reads and the total # of reads.

@MagdalenaWinklhofer
Copy link
Author

Hi,
thank you for the information regarding the abbreviations, that really helped.

So far, I have used the "sorted.bam" files for each sample to extract the methylation state with the "methylpy call-methylation-state" command. The outputs I received were "allc.tsv_" files. After that, I merged all 4 biological replicates (a1,a2,a3,a4) files into one condition file (a = anoxia). I used that merged "allc_anoxia.tsv.gz" file in combination with the "allc_normoxia.tsv.gz" file as input for the "DMR find" script.

methylpy DMRfind
--allc-files allc_normoxia.tsv.gz allc_anoxia.tsv.gz
--samples normoxia anoxia
--mc-type "CG"
--num-procs 2
--output-prefix DMR_normxia_vs_anoxia \

For that script, I get four files:

  • DMR_normxia_vs_anoxia_rms_results_collapsed.tsv
  • DMR_normxia_vs_anoxia_rms_results_collapsed.tsv.DMS.bed
  • DMR_normxia_vs_anoxia_rms_results_collapsed.tsv.DMR.bed
  • DMR_normxia_vs_anoxia_rms_results.tsv.gz

I read in your documentation that you can also use a space-separated list for different input samples, but if I tried the script like this:

methylpy DMRfind
--allc-files allc_N1.tsv.gz,allc_N2.tsv.gz,allc_N3.tsv.gz,allc_N7.tsv.gz allc_A1.tsv.gz,allc_A2.tsv.gz,allc_A4.tsv.gz,allc_A7.tsv.gz
--samples N1,N2,N3,N4 A1,A2,A4,A7
--mc-type "CG"
--num-procs 2
--output-prefix DMR_normxia_vs_anoxia_test \

I got the following error:
Traceback (most recent call last):
File ".../conda_environments/methylpy_env/bin/methylpy", line 5, in
parse_args()
File ".../conda_environments/methylpy_env/lib/python3.7/site-packages/methylpy/parser.py", line 71, in parse_args
seed=args.seed)
File ".../conda_environments/methylpy_env/lib/python3.7/site-packages/methylpy/DMRfind.py", line 166, in DMRfind
chrom_pointer[sample] = read_allc_index(allc_file)
File ".../conda_environments/methylpy_env/lib/python3.7/site-packages/methylpy/utilities.py", line 724, in read_allc_index
index_allc_file(allc_file,reindex=False)
File ".../conda_environments/methylpy_env/lib/python3.7/site-packages/methylpy/utilities.py", line 686, in index_allc_file
f = open_allc_file(allc_file)
File ".../conda_environments/methylpy_env/lib/python3.7/site-packages/methylpy/utilities.py", line 1133, in open_allc_file
f = open(allc_file,'r')
FileNotFoundError: [Errno 2] No such file or directory: 'DMR_normxia_vs_anoxia_test_filtered_allc_N1,N2,N3,N4.tsv'

Is there another possibility to keep the biological samples unmerged and list them in the DMR find script? I would like to compare four biological replicates in the normoxia condition against the four biological replicates in the anoxia condition and receive both DML and DMR between both conditions.

Thank you for your help!!!

@yupenghe
Copy link
Owner

The error is due to the separator being comma --samples N1,N2,N3,N4 A1,A2,A4,A7. Switching to space separator should work.

Is there another possibility to keep the biological samples unmerged and list them in the DMR find script? I would like to compare four biological replicates in the normoxia condition against the four biological replicates in the anoxia condition and receive both DML and DMR between both conditions.

Can you explain more about "keep the biological samples unmerged"? DMRfind by default won't merge biological samples. But if you would like the replicates to be grouped during DMR finding process, you can use the sample category option.

@MagdalenaWinklhofer
Copy link
Author

Hi,

thank you for that quick reply. I have implemented the space-separated list and the sample category. That also solved the "merging" issue I had before.

Now I get four output files:

  • DMR_normoxia_vs_anoxia_rms_results_collapsed.tsv
  • DMR_normoxia_vs_anoxia_rms_results_collapsed.tsv.DMS.bed
  • DMR_normoxia_vs_anoxia_rms_results_collapsed.tsv.DMR.bed
  • DMR_normoxia_vs_anoxia_rms_results.tsv.gz

I understand that the DMS.bed file contains information about the differentially methylated sites, and the DMR.bed file contains information about the differentially methylated regions. However, I could not find any information about the output format of both files.

As far as I understood, the DMS file contains the following columns:

  1. chromosome
  2. start position
  3. stop position
  4. sequence context
  5. strand
    6)??? (always the value 0.0003)

The DMR file contains only the following columns:

  1. chromosome
  2. start position
  3. stop position
    4) ??? (numerical values between 1 and 3)

Could you verify that or tell me where I can find such information?

Why are the regions in the DMR file only one base pair long? Shouldn't it be longer when we are talking about regions?

@MagdalenaWinklhofer
Copy link
Author

Could you please answer my inquiry or refer me to a webpage where I could read up on the topic? That would really help my project.

@yupenghe
Copy link
Owner

yupenghe commented Nov 2, 2023

Sorry for the late reply. For the DMS file, the 6th column is the p-value from the differential methylation test. You can basically ignore it. For the DMR file, the 4th column is the number of DMSs within the DMR. If there is only one DMS in DMR, the DMR will be one base pair long. I would recommend applying a filter on the number of DMSs. For example, it is proper to only consider DMRs with at least 4 DMSs.

@MagdalenaWinklhofer
Copy link
Author

Thank you very much for the helpful information!! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants