
pore-C data #5

Open
socialhang opened this issue Dec 18, 2023 · 25 comments
Labels
enhancement New feature or request

Comments

@socialhang

Hello, @zengxiaofei

Can HapHiC take Pore-C data as input? If not, what adjustments do I need to make?

Thank you in advance for your response.

@zengxiaofei
Owner

Hi @socialhang,

I have a plan to support Pore-C data, but it is not a high priority at the moment.
I haven't conducted any tests on Pore-C data yet, so I am unable to provide any useful information about it at this time.

Best regards,
Xiaofei

@zengxiaofei added the enhancement label Dec 18, 2023
@zengxiaofei
Owner

zengxiaofei commented Dec 18, 2023

Hi @socialhang,

I have conducted an experiment on the test data of https://github.com/epi2me-labs/wf-pore-c. You can try the following steps with your own FASTA and FASTQ files:

(1) Execute the wf-pore-c pipeline to align Pore-C data and create a mock paired-end BAM file:

source ~/software/miniconda3/bin/activate nextflow

# modify --fastq, --ref, --cutter, and --chunk_size
nextflow run epi2me-labs/wf-pore-c \
             --fastq /work/software/genome/wf-pore-c-265a1a4/test_data/porec_test.concatemers.fastq --chunk_size 100 \
             --ref /work/software/genome/wf-pore-c-265a1a4/test_data/porec_test.fasta \
             --cutter NlaIII \
             --paired_end_minimum_distance 100 --paired_end_maximum_distance 200 --hi_c --mcool --paired_end

conda deactivate

Then, you will obtain a paired-end BAM file named null.ns.bam in the output/paired_end directory.

(2) Rename the Hi-C reads in null.ns.bam using a custom script rename_reads_for_bam.py and filter Hi-C reads:

python3 rename_reads_for_bam.py null.ns.bam | samtools view - -@ 14 -S -h -b -F 3340 -o HiC.bam
/path/to/HapHiC/utils/filter_bam.py HiC.bam 1 --threads 14 | samtools view -b -@ 14 -o HiC.filtered.bam

Finally, you can use HiC.filtered.bam for the HapHiC scaffolding pipeline.

Here is the custom script:
rename_reads_for_bam.py.gz
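The attached rename_reads_for_bam.py is the authoritative implementation, which I have not inspected. Purely as a hypothetical sketch of the kind of renaming involved — assuming (a) the mock paired-end records carry a numeric per-monomer suffix on the read name, e.g. "read1:0" / "read1:1", and (b) downstream filtering expects both mates of a pair to share one QNAME — it might look like:

```python
# Hypothetical sketch only; the real logic lives in rename_reads_for_bam.py.
# Operates on SAM-format text lines (e.g. from `samtools view -h`).

def rename_qname(qname):
    """Strip an assumed ':<index>' suffix so mates share a read name."""
    base, sep, suffix = qname.rpartition(':')
    return base if sep and suffix.isdigit() else qname

def rename_sam(lines):
    """Rewrite the QNAME (first column) in a stream of SAM text lines."""
    for line in lines:
        if line.startswith('@'):          # header lines pass through
            yield line
        else:
            fields = line.rstrip('\n').split('\t')
            fields[0] = rename_qname(fields[0])
            yield '\t'.join(fields) + '\n'

# Usage idea: samtools view -h null.ns.bam | <this script> | samtools view -b ...
```

The suffix convention here is an assumption for illustration; the attached script should be used in practice.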

Please be aware that this method has not been fully tested. I welcome your feedback on any attempts you make. This will be of great help. Once this method is validated, I will integrate it into HapHiC.

Best regards,
Xiaofei

@socialhang
Author

Thanks a lot!

I will try today!

@socialhang
Author

hello! @zengxiaofei

I'm back! I'm sorry, but the wf-pore-c pipeline could not run on my server. Could I use falign instead of wf-pore-c to produce the BAM file?

@zengxiaofei
Owner

Hi @socialhang,

You could try it. It seems that falign can produce pairwise SAM files compatible with SALSA2 via the sam2salsa2 command in u4falign.

Best,
Xiaofei

@socialhang
Author

OK!

@socialhang
Author

socialhang commented Dec 24, 2023

Hello! @zengxiaofei

I produced a pairwise BAM file using falign, but some errors happened.

Traceback (most recent call last):
  File "/root/software/HapHiC_1.0.2/scripts/HapHiC_pipeline.py", line 483, in <module>
    main()
  File "/root/software/HapHiC_1.0.2/scripts/HapHiC_pipeline.py", line 464, in main
    inflation = haphic_cluster(args)
  File "/root/software/HapHiC_1.0.2/scripts/HapHiC_pipeline.py", line 323, in haphic_cluster
    HapHiC_cluster.run(args, log_file=LOG_FILE)
  File "/root/software/HapHiC_1.0.2/scripts/HapHiC_cluster.py", line 2361, in run
    flank_link_matrix, frag_index_dict = dict_to_matrix(
  File "/root/software/HapHiC_1.0.2/scripts/HapHiC_cluster.py", line 283, in dict_to_matrix
    shape = len(frag_set)
TypeError: object of type 'NoneType' has no len()
Traceback (most recent call last):
  File "/root/software/HapHiC_1.0.2/haphic", line 96, in <module>
    subprocess.run(commands, check=True)
  File "/root/miniconda3/envs/haphic/lib/python3.10/subprocess.py", line 524, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/root/software/HapHiC_1.0.2/scripts/HapHiC_pipeline.py', 'hifiasm_hic_ul_p_utg.fa', 'small.1.bam', '32', '--remove_allelic_links', '4', '--Nx', '100', '--gfa', 'hifiasm.asm.hic.p_utg.gfa', '--min_RE_sites', '1', '--min_links', '1', '--min_link_density', '0', '--threads', '40', '--max_ctg_len', '400000']' returned non-zero exit status 1.

I think it's because of my BAM format.
small.1.bam.gz

Can you help me point out the errors in my bam file?

Thanks a lot!

@zengxiaofei
Owner

Hi @socialhang,

Could you please paste or upload 01.cluster/HapHiC_cluster.log?

Best,
Xiaofei

@socialhang
Author

OK, just a minute.

@socialhang
Author

@zengxiaofei
Owner

zengxiaofei commented Dec 25, 2023

Is the small.1.bam that you uploaded the same one you used for HapHiC scaffolding, or just a portion of it?

@socialhang
Author

Is the small.1.bam that you uploaded the same one you used for HapHiC scaffolding, or just a portion of it?

Since my BAM file is 400 GB in size, small.1.bam is a small subset I extracted from it. I have also run HapHiC on this subset, and the error it reports is exactly the same as with the full dataset.

@zengxiaofei
Owner

Please use this script (modify_bam_for_falign.py.gz) to modify the BAM file:

python3 modify_bam_for_falign.py small.1.bam | samtools view -b -o new.bam

@socialhang
Author

Please use this script (modify_bam_for_falign.py.gz) to modify the BAM file:

python3 modify_bam_for_falign.py small.1.bam | samtools view -b -o new.bam

OK! Thanks! I will try!

@socialhang
Author

socialhang commented Dec 25, 2023

Some errors happened:

2023-12-25 09:02:22 <HapHiC_pipeline.py> [main] Pipeline started, HapHiC version: 1.0.2 (update: 2023.12.16)
2023-12-25 09:02:22 <HapHiC_pipeline.py> [main] Python version: 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:36:39) [GCC 10.4.0]
2023-12-25 09:02:22 <HapHiC_pipeline.py> [haphic_cluster] Step1: Execute preprocessing and Markov clustering for contigs...
2023-12-25 09:02:22 <HapHiC_cluster.py> [run] Program started, HapHiC version: 1.0.2 (update: 2023.12.16)
2023-12-25 09:02:22 <HapHiC_cluster.py> [run] Python version: 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:36:39) [GCC 10.4.0]
2023-12-25 09:02:22 <HapHiC_cluster.py> [parse_fasta] Parsing input FASTA file...
2023-12-25 09:02:34 <HapHiC_cluster.py> [parse_gfa] Parsing input gfa file(s)...
2023-12-25 09:02:38 <HapHiC_cluster.py> [stat_fragments] Making some statistics of fragments (contigs / bins)
2023-12-25 09:02:38 <HapHiC_cluster.py> [stat_fragments] bin_size is calculated to be 2000000 bp
2023-12-25 09:02:41 <HapHiC_cluster.py> [parse_bam] Parsing input BAM file...
Traceback (most recent call last):
  File "/root/software/HapHiC_1.0.2/scripts/HapHiC_pipeline.py", line 483, in <module>
    main()
  File "/root/software/HapHiC_1.0.2/scripts/HapHiC_pipeline.py", line 464, in main
    inflation = haphic_cluster(args)
  File "/root/software/HapHiC_1.0.2/scripts/HapHiC_pipeline.py", line 323, in haphic_cluster
    HapHiC_cluster.run(args, log_file=LOG_FILE)
  File "/root/software/HapHiC_1.0.2/scripts/HapHiC_cluster.py", line 2304, in run
    full_link_dict, flank_link_dict, HT_link_dict, clm_dict, frag_link_dict, ctg_coord_dict, ctg_pair_to_frag = parse_bam(
  File "/root/software/HapHiC_1.0.2/scripts/HapHiC_cluster.py", line 1481, in parse_bam
    frag_len_i, frag_len_j = frag_len_dict[frag_i], frag_len_dict[frag_j]
KeyError: 'utg000028l_bin4'
Traceback (most recent call last):
  File "/root/software/HapHiC_1.0.2/haphic", line 96, in <module>
    subprocess.run(commands, check=True)
  File "/root/miniconda3/envs/haphic/lib/python3.10/subprocess.py", line 524, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/root/software/HapHiC_1.0.2/scripts/HapHiC_pipeline.py', 'hifiasm_hic_ul_p_utg.fa', 'ul9.HiC.filter.correct0.bam', '32', '--remove_allelic_links', '4', '--Nx', '100', '--gfa', 'hifiasm.asm.hic.p_utg.gfa', '--min_RE_sites', '1', '--min_links', '1', '--min_link_density', '0', '--threads', '40', '--max_ctg_len', '400000']' returned non-zero exit status 1.

@zengxiaofei

@socialhang
Author

@zengxiaofei
Owner

zengxiaofei commented Dec 25, 2023

It seems that a position in the BAM file (the fourth column, pos, or the eighth column, mpos) exceeds the length of the corresponding contig.

The error message shows that the bin_size was calculated to be 2 Mb, and the error occurred when the mapping position of a read in the BAM file fell within the range [6,000,001, 8,000,000] (bin ID utg000028l_bin4). However, based on the BAM file you uploaded, the contig utg000028l is only 4,798,928 bp long.
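The arithmetic behind this diagnosis can be sketched as follows. Assuming bins are numbered from 1 (consistent with the range quoted above, where bin4 covers positions 6,000,001 to 8,000,000 at a 2 Mb bin size), a 4,798,928 bp contig only has bins 1 to 3, so an out-of-range mapping position produces a bin ID that is missing from frag_len_dict and triggers the KeyError:

```python
# Sketch of the bin bookkeeping described above (1-based bin numbering
# is an assumption inferred from the quoted range for utg000028l_bin4).
BIN_SIZE = 2_000_000  # from the log: "bin_size is calculated to be 2000000 bp"

def bin_id(ctg, pos, bin_size=BIN_SIZE):
    """Map a 1-based mapping position to its bin ID."""
    return f"{ctg}_bin{(pos - 1) // bin_size + 1}"

def n_bins(ctg_len, bin_size=BIN_SIZE):
    """Number of bins a contig actually has (ceiling division)."""
    return -(-ctg_len // bin_size)

ctg_len = 4_798_928                      # length of utg000028l
print(n_bins(ctg_len))                   # 3 -> only bin1..bin3 exist
print(bin_id("utg000028l", 6_500_000))   # utg000028l_bin4 (no such key)
```

Any record whose pos or mpos exceeds the contig length therefore needs to be fixed or dropped before HapHiC parses the BAM.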

@socialhang
Author

Hi @zengxiaofei,

I've found the problem and updated the script.
correct_sam.py.zip
Now the process runs normally, but homologous chromosomes are clustered together in the results; I think the Hi-C signal between the different haplotypes has not been filtered out cleanly. How can I adjust the parameters?

[image: Hi-C contact heatmap of the clustering result]

@zengxiaofei
Owner

Could you please provide a full view of the Hi-C contact heatmap that includes scaffold boundaries (called superscaffold in Juicebox, and surrounded by blue outlines)?

@socialhang
Author

Could you please provide a full view of the Hi-C contact heatmap that includes scaffold boundaries (called superscaffold in Juicebox, and surrounded by blue outlines)?

Could you give me an email address? I will send you the Hi-C heatmap privately. @zengxiaofei

@zengxiaofei
Owner

You could send email to xiaofei_zeng#whu.edu.cn (replace # with @), or contact me on WeChat: zeng_xiaofei.

@LikeLucasLab

LikeLucasLab commented Apr 25, 2024

@socialhang or @zengxiaofei, have you managed to adjust the parameters to improve your contact map and distinguish between the different haplotypes?

@zengxiaofei
Owner

zengxiaofei commented May 7, 2024

Hi @LikeLucasLab,

The primary challenge when working with Pore-C data lies in achieving accurate mapping. We conducted some tests using falign and adapted HapHiC to be compatible with this aligner through a Python script. Although Pore-C data performed comparably to NGS-based Hi-C data when scaffolding haplotype-collapsed assemblies, it exhibited much lower mapping accuracy when distinguishing haplotypes, resulting in much stronger Hi-C signals between haplotigs. We tried different MAPQ filtering criteria, but the results did not improve significantly.

There may be some adjustments to HapHiC for Pore-C data in future updates. When using the current version of HapHiC to scaffold haplotype-phased assemblies, please always add the --remove_allelic_links parameter to remove Hi-C links between allelic contigs. A lower --nwindows or --concordance_ratio_cutoff value could also potentially improve the scaffolding results.

@csxie-666

Hi @zengxiaofei,
Is it possible to improve mapping accuracy by correcting Pore-C reads with herro (https://github.com/lbcb-sci/herro)?

@zengxiaofei
Owner

zengxiaofei commented May 28, 2024

Hi @csxie-666,

As an update to this issue, I currently recommend using minimap2 rather than falign for Pore-C read mapping. HapHiC now supports the BAM files output by minimap2 via a Python script.

Hi @zengxiaofei Is it possible to improve mapping accuracy by correcting Pore-C reads with herro (https://github.com/lbcb-sci/herro)?

According to the author's description, herro was trained exclusively on human data (lbcb-sci/herro#23), so I'm not sure whether it generalizes to other species and ploidy levels. I'm trying out another haplotype-aware ONT read correction tool called DeChat on ONT ultra-long reads; it has been validated on simulated autopolyploids. However, I encountered difficulties running the program (see LuoGroup2023/DeChat#3).

It is worth noting that Pore-C data may significantly differ from ONT whole-genome sequencing data in the context of read correction. This is because Pore-C reads consist of sequences originating from different loci, resulting in lower coverage between overlapping Pore-C reads compared to ONT whole-genome sequencing reads.

Best,
Xiaofei
