Duplicated rows in the TAD boundary file given by fanc boundaries #163

Open
ziyin96 opened this issue Jul 23, 2023 · 4 comments

ziyin96 commented Jul 23, 2023

Hi,

I'm trying to call TAD boundaries using fanc insulation followed by fanc boundaries. The results look fine, but I found that several lines are duplicated in the output TAD boundary BED file, like this:

chr12   133430001       133440000       .       0.498882532119751       +
chr12   133430001       133440000       .       0.498882532119751       +

Details:
I used FAN-C 0.9.25 and started with a published .hic file downloaded from GSE116862.

I first calculated insulation scores at 10-kb resolution, trying several window sizes.

fanc insulation data/hESC_D05_Rep1.hic@10kb \
                tmp/hESC_D05_Rep1.insulation \
                -w 100000 200000 500000 1000000 2000000     

After visually checking the insulation scores against the contact frequency map, I decided to identify the TAD boundaries using a window size of 500 kb.

fanc boundaries tmp/hESC_D05_Rep1.insulation \
                 results/hESC_D05_Rep1.TAD_boundaries.bed \
                -w 500kb 

The boundaries in the BED file fit the contact frequency heatmap well, but 80 lines are duplicated, as shown above.

boundary_file=results/hESC_D05_Rep1.TAD_boundaries.bed
wc -l ${boundary_file}    # 8901
uniq ${boundary_file}  | wc -l    # 8821
cut -f 1-3 ${boundary_file} | uniq | wc -l    # 8821
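
Since `uniq` only collapses adjacent identical lines, a sorted check would also catch duplicates that are not next to each other. A minimal sketch, assuming standard `sort`, `uniq`, and `grep`; the demo file and its `chrT` rows are made up for illustration and are not taken from the real output:

```shell
# Hypothetical demo file: tab-separated BED rows with one row repeated,
# mimicking the duplication described above (not real fanc output).
demo_bed=$(mktemp)
printf 'chrT\t1\t10000\t.\t0.25\t+\n' > "$demo_bed"
printf 'chrT\t10001\t20000\t.\t0.5\t+\n' >> "$demo_bed"
printf 'chrT\t10001\t20000\t.\t0.5\t+\n' >> "$demo_bed"

# Sort first so identical rows become adjacent, then let uniq -d print
# one copy of every row that occurs more than once.
dup_rows=$(sort "$demo_bed" | uniq -d)

# Count the distinct duplicated rows (non-empty lines only).
dup_count=$(printf '%s\n' "$dup_rows" | grep -c .)

echo "$dup_rows"
echo "distinct duplicated rows: $dup_count"
```

Running the same `sort | uniq -d` pipeline on the real boundary file would list each duplicated boundary once.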

hESC_D05_Rep1.TAD_boundaries.bed.zip

By the way, I also checked the corresponding 500 kb insulation score file and all the rows in this file are unique.

wc -l results/hESC_D05_Rep1.insulation_500kb.bed    # 309581 
uniq results/hESC_D05_Rep1.insulation_500kb.bed | wc -l    # 309581
cut -f 1-3 results/hESC_D05_Rep1.insulation_500kb.bed | uniq | wc -l    # 309581

I'm wondering how this happened. Did I use FAN-C correctly?

Thanks,

Ziyin

@kaukrise
Collaborator

Hi, thanks for reporting this! It looks like a bug.
A large portion of the boundary-calling code was not written by me, and I am currently on holiday, so I'm afraid it will take me a while to reproduce and fix this.

@kaukrise
Collaborator

Hey, can you quickly confirm that it is this file you have been using? https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM3262960

kaukrise added a commit that referenced this issue Aug 17, 2023
@kaukrise
Collaborator

Okay, I think I have a fix. The duplication seems to be related to "shallow" insulation signal, and I'm pretty sure I found the piece of code that caused it. Can you try the fixed version here?

fanc-0.9.26.tar.gz

@ziyin96
Author

ziyin96 commented Aug 20, 2023

fanc-0.9.26 indeed resolved my issue! Thank you very much.
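
For anyone who cannot upgrade from 0.9.25 right away, the duplicated rows could probably be dropped in post-processing. This is only a sketch of a generic workaround, not the fix that went into 0.9.26; it assumes a standard `awk`, and the demo file and its `chrT` rows are made up for illustration:

```shell
# Hypothetical demo file: tab-separated BED rows with one row repeated
# (stand-in for a boundary file written by the affected version).
demo_bed=$(mktemp)
printf 'chrT\t1\t10000\t.\t0.25\t+\n' > "$demo_bed"
printf 'chrT\t10001\t20000\t.\t0.5\t+\n' >> "$demo_bed"
printf 'chrT\t10001\t20000\t.\t0.5\t+\n' >> "$demo_bed"
printf 'chrT\t20001\t30000\t.\t0.75\t+\n' >> "$demo_bed"

# Keep only the first occurrence of each full line, preserving row order.
# seen[$0]++ is 0 (false) the first time a line appears, so !seen[$0]++
# is true exactly once per distinct line.
dedup_bed=$(mktemp)
awk '!seen[$0]++' "$demo_bed" > "$dedup_bed"

# Row counts before and after deduplication.
before=$(awk 'END { print NR }' "$demo_bed")
after=$(awk 'END { print NR }' "$dedup_bed")
echo "rows before: $before, after: $after"
```

Unlike `sort -u`, this keeps the rows in their original genomic order, which matters if downstream tools expect the BED file as written.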
