Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue with low proportion_repeat #102

Open
mhguo1 opened this issue Dec 16, 2021 · 7 comments
Open

Issue with low proportion_repeat #102

mhguo1 opened this issue Dec 16, 2021 · 7 comments

Comments

@mhguo1
Copy link

mhguo1 commented Dec 16, 2021

Hello, I'm getting odd errors when using lower --proportion-repeat in strling extract. For example, when running this command:

strling extract -f GRCh38_full_analysis_set_plus_decoy_hla.fa -p 0.4 NA06989.final.cram test.bin

I get this error:
fatal.nim(49) sysFatal
Error: unhandled exception: genome_strs.nim(39, 12) w.start < w.stop repeat ACCTTT not found in expected region for (chrom: "chr2", start: 71388200, stop: 71388500, repeat: "ACCTTT"), TGAAGGCAGCTAAATTCTCTTACCCTGAGGCTAAGGGCAAGTAGTAGGTAACAAAGGAGTGTAAAGGAATTTATCTAGATAAGTTTATTTACTTTTGCCGACCTTTGATCATCCGACCTTTGATCATCCGACCTTTGATCATCTGACCTTTGATCATCTGACCTTTGATCATCCGACCTTTGATCATCTGACCTTTGATCATCCGCGTGCAGGACTGCTCCCTACAGGCGGGGGCAACAACTACCCACAGATTGTGTTGGCTCCAGGCCTTTGTCATTAAATCTGTACTAAATAAATACA, (chrom: "chr2", start: 71388500, stop: 71388500, repeat: "ACCTTT") [AssertionDefect]

strling version: 0.5.1
fatal.nim(49) sysFatal
Error: unhandled exception: unpack.nim(59, 12) fs != nil [strling] got nil fileStream in unpack_file. check given file-path [AssertionDefect]

I am using a 1000 Genomes sample downloaded from ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239459/NA06989.final.cram

Thanks!

brentp added a commit that referenced this issue Dec 16, 2021
this will have some performance consequences so we should evaluate.
@brentp
Copy link
Member

brentp commented Dec 16, 2021

Thanks for reporting. Using the lower --proportion-repeat does indeed uncover some problems. I can recreate and pushed a fix. We'll need to evaluate the change in performance. I also have another idea to try to fix which I'll post shortly. Note that this is an interesting read with this repeat structure (split on ACCTTT):

TGAAGGCAGCTAAATTCTCTTACCCTGAGGCTAAGGGCAAGTAGTAGGTAACAAAGGAGTGTAAAGGAATTTATCTAGATAAGTTTATTTACTTTTGCCG
ACCTTT
GATCATCCG
ACCTTT
GATCATCCG
ACCTTT
GATCATCTG
ACCTTT
GATCATCTG
ACCTTT
GATCATCCG
ACCTTT
GATCATCTG
ACCTTT
GATCATCCGCGTGCAGGACTGCTCCCTACAGGCGGGGGCAACAACTACCCACAGATTGTGTTGGCTCCAGGCCTTTGTCATTAAATCTGTACTAAATAAATACA

@brentp
Copy link
Member

brentp commented Dec 16, 2021

BTW, I'm interested to hear what @hdashnow thinks, but my experience is that lowering proportion_repeat is usually not fruitful.

@hdashnow
Copy link
Collaborator

Thanks @brentp

@mhguo1 I'm curious why you'd like to use proportion repeat = 0.4? This is extremely low, which is why we never tested this. If you're having trouble recovering a specific locus maybe we can address that in a different way?

@mhguo1
Copy link
Author

mhguo1 commented Dec 16, 2021

Hi all, thank you for looking into this! Perhaps I'm not fully understanding this parameter well. I'm interested in looking at milder expansions and felt that the default value of -p 0.8 was too high so was playing around with it. If I understand this correctly, with a 150 bp PE read, a value of 0.8 for a trinucleotide repeat expansion would only consider expansions past 40 repeats (150*0.8/3).

@hdashnow
Copy link
Collaborator

@mhguo1 You are correct, STRling is tuned to discover larger expansions. Are you looking to discover new loci, or call known ones? If you are looking at a specific locus you can force STRling to genotype it even if it is small.

@mhguo1
Copy link
Author

mhguo1 commented Dec 16, 2021

I'm mostly hoping to call known ones, so I guess I will just add a custom bed file with known sites to -l to the call command. Thanks!

@hdashnow
Copy link
Collaborator

Great! Let us know if you have additional questions, or see any other errors. It was useful to get this one fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants