Conbase: a software for unsupervised discovery of clonal somatic mutations in single cells through read phasing

Conbase is software for unsupervised discovery of clonal somatic mutations in DNA sequencing data from whole-genome-amplified single cells. Conbase leverages phased read data from multiple samples in a dataset to increase confidence in somatic variant calls and genotype predictions. https://doi.org/10.1186/s13059-019-1673-8

Algorithm developed by Ezeddin Al Hakim, Marie Kindblom & Joanna Hård.

Optional Preprocessing

Misc.py

python3 Misc.py --duplicate_regions <prefiltered germline SNV file> <bulk bam> <reference> <number of threads> <output>

In Params.py, adjust:

"bulk_ref_limit" : 0.9,  #Maximum fraction of bulk reads supporting the reference base at a variant site; the remaining fraction supports a non-reference base
"snp_nr_limit" : 10, #The maximum number of variant sites per genomic window
"snp_dist_limit" : 1000, #Genomic window size (bp)

Input format: prefiltered germline SNV file

CHROM	POS	REF	ALT
1	23142496	C	A
.
.
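The two window parameters ("snp_nr_limit" and "snp_dist_limit") can be read as a density filter: a genomic window of snp_dist_limit bp may hold at most snp_nr_limit variant sites. The sketch below illustrates that reading for the positions of one chromosome. The function name and the sliding-window logic are assumptions made for illustration; they are not taken from Misc.py, which may define windows differently.

```python
from bisect import bisect_left


def dense_window_positions(positions, snp_nr_limit=10, snp_dist_limit=1000):
    """Return positions lying in at least one window of snp_dist_limit bp
    that contains more than snp_nr_limit variant sites.

    Hypothetical helper: the window is anchored at each variant site and
    spans [start, start + snp_dist_limit).
    """
    positions = sorted(positions)
    flagged = set()
    for i, start in enumerate(positions):
        # Index one past the last position inside the window.
        end = bisect_left(positions, start + snp_dist_limit)
        if end - i > snp_nr_limit:
            flagged.update(positions[i:end])
    return sorted(flagged)
```

With snp_nr_limit=2, three sites within 1000 bp of each other are all flagged, while an isolated site further away is not.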

How to run Conbase

Stats.py:

python3 bin/Main.py stats <germline SNV file> <bam path file> <reference> <number of threads> <output>

Analyze.py:

python3 bin/Main.py analyze <output.json> <output>
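The two steps can be chained from a small wrapper. The sketch below only assembles the two command lines shown above; the function names are hypothetical, and because this README does not state where the stats step writes its JSON file, that path is passed in explicitly rather than guessed.

```python
import subprocess


def build_commands(snv_file, bam_path_file, reference, threads, output, stats_json):
    """Build the argument lists for the stats and analyze steps.

    stats_json is the JSON file produced by the stats step; its location
    is an input here because it is not specified in this README.
    """
    stats_cmd = [
        "python3", "bin/Main.py", "stats",
        snv_file, bam_path_file, reference, str(threads), output,
    ]
    analyze_cmd = ["python3", "bin/Main.py", "analyze", stats_json, output]
    return stats_cmd, analyze_cmd


def run_conbase(*args, **kwargs):
    """Run stats, then analyze, stopping at the first failing step."""
    for cmd in build_commands(*args, **kwargs):
        subprocess.run(cmd, check=True)
```

Passing argument lists instead of a shell string avoids quoting problems with paths that contain spaces.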

Input format: bam path file

NAME	PATH
BULK	data/bulk.bam
cell_1	data/cell_1.bam
cell_2	data/cell_2.bam
.
.
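Before launching a long run it can help to check the bam path file. The sketch below reads the tab-separated NAME/PATH table into a dictionary and separates the bulk sample from the cells. The helper name is hypothetical, and treating a row named BULK as mandatory is an assumption based on the example above, not a documented requirement. The "." lines in the example stand for further rows and are not part of the file.

```python
import csv


def read_bam_paths(path, bulk_name="BULK"):
    """Return (bulk_bam, {cell_name: bam_path}) from a NAME/PATH table."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        samples = {row["NAME"]: row["PATH"] for row in reader}
    # Assumption: the unamplified bulk sample is listed under bulk_name.
    if bulk_name not in samples:
        raise ValueError(f"no {bulk_name!r} row found in {path}")
    cells = {name: bam for name, bam in samples.items() if name != bulk_name}
    return samples[bulk_name], cells
```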

Input format: germline SNV file

CHROM	POS	REF	ALT
1	23142496	C	A
.
.
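The germline SNV file is a four-column table, so it can be inspected with a few lines of Python. The parser below is a minimal sketch: the Snv type and parse_snvs function are hypothetical names, and splitting on any whitespace is a choice made here so that both tab- and space-separated files are accepted.

```python
from typing import Iterable, Iterator, NamedTuple


class Snv(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def parse_snvs(lines: Iterable[str]) -> Iterator[Snv]:
    """Yield one Snv per data line, skipping blank lines and the header."""
    for line in lines:
        fields = line.split()
        if not fields or fields[0] == "CHROM":
            continue
        chrom, pos, ref, alt = fields[:4]
        yield Snv(chrom, int(pos), ref, alt)
```

Usage: `with open("germline_snvs.tsv") as handle: snvs = list(parse_snvs(handle))`, where the file name is a placeholder.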

In Params.py, optionally adjust:

################# Stats #################
stats_params = {

"fragment_length" : 650, #Insert size of sequencing library

"mapping_quality" : 20, #Required mapping quality of analyzed reads

"base_quality" : 20, #Required base quality of analyzed bases

"dp_limit" : 5, #Minimum read depth

"alt_ratio_limit" : 0.2, #Minimum variant allele fraction

"sample_vote_limit" : 2, #Minimum number of samples required to support a variant

"vote_ratio_limit" : 0.9, #Minimum fraction of samples required to support the same base

"snp_read_limit" : 1, #Minimum number of reads covering an SNP for a given sample to regard its reads (and voting) for the SNP

"indel_ratio" : 0.05, #The fraction of reads in the entire dataset allowed to contain an indel  

"bulk_ref_limit" : 1,  #The fraction of reads required to support the reference base in the unamplified bulk sample  

"acceptable_bases" : ['A', 'C', 'G', 'T'],
}  
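To make the interplay of "dp_limit", "alt_ratio_limit", "sample_vote_limit" and "vote_ratio_limit" concrete, the sketch below lets each sufficiently covered sample vote for the non-reference bases it supports and accepts a site only if enough samples agree. This is one plausible reading of the parameter comments above, not the logic of Stats.py; the function name and the input layout (per-sample base counts) are assumptions.

```python
from collections import Counter

# Subset of the defaults listed above.
STATS_PARAMS = {
    "dp_limit": 5,
    "alt_ratio_limit": 0.2,
    "sample_vote_limit": 2,
    "vote_ratio_limit": 0.9,
}


def candidate_alt_base(samples, ref_base, params=STATS_PARAMS):
    """Return the alternative base supported by the samples, or None.

    samples maps a sample name to base counts, e.g. {"C": 6, "A": 4}.
    """
    votes = Counter()
    for counts in samples.values():
        depth = sum(counts.values())
        if depth < params["dp_limit"]:
            continue  # too shallow to vote
        for base, n in counts.items():
            if base != ref_base and n / depth >= params["alt_ratio_limit"]:
                votes[base] += 1
    if not votes:
        return None
    base, n_votes = votes.most_common(1)[0]
    enough_samples = n_votes >= params["sample_vote_limit"]
    agreement = n_votes / sum(votes.values()) >= params["vote_ratio_limit"]
    return base if enough_samples and agreement else None
```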




############### Analyze #################  
analyze_params = {

# Genotyping is based on phased read concordance determined by analyzing tuples containing base observations in somatic variant sites and germline SNV sites present in the same read or read pair

# Please note the following annotations: a heterozygous site can be of type {RR, AA} or {RA, AR}, and a homozygous site is always {RR, RA}, where R is the reference base and A the alternative base. Each of these three combinations consists of two tuples and is referred to as a "tuple pair" below.

"dp_tuple_limit" : 5, #Minimum number of phased reads  

"snp_total_vote" : 0.9, #Fraction of phased germline SNVs required to contribute to genotype prediction  

"snp_vote_ratio" : 0.9, #Fraction of phased germline SNVs required to support the same genotype prediction  

"tuples_ratio" : 0.9, #The minimal ratio required of the total that is made out of a given tuple pair  

"tuples_internal_ratio" : 0.1, #The minimal internal ratio required for a tuple pair  

"tuples_c2_external_error_ratio" : 0.1, # 1-tuples_ratio  

# Tuple
"tuple_group_ratio" : 0.01, #The maximum "error" allowed between two groups of tuple pairs for a sample to vote for a winning tuple pair

"win_internal_group_ratio" : 0.1, #Minimum internal ratio for a winning tuple pair in a voting sample

"sample_tuple_vote_limit" : 2, #The minimum number of samples that need to vote for a winning tuple pair

"vote_tuple_ratio_limit" : 0.9, #The minimum fraction of votes that the winning tuple pair needs to acquire


# Genotype prediction in samples with low read depth
"a1_lower_limit" : 2, #Minimum depth of alternative alleles (site only passes if conflicting_upper_limit and c3_conflicting_upper_limit also hold)

"c3_homo_limit" : 2, #Minimum depth of tuple type "RA" required when the winning tuple pair is {AA, RR} for the site to be regarded as unmutated

"c3_a1_limit" : 2, #Minimum depth for an alternative base in the case of no support of a mutated site


# Filtering variant sites
"conflicting_upper_limit" : 0, #The number of samples with predicted genotypes allowed to display support for discordant alleles  

"c3_conflicting_upper_limit" : 0, #The number of samples with inferred genotypes allowed to display support for discordant alleles  

"homo_error_allowed" : 0, #The number of reads allowed to support a non-reference base in a sample predicted to be unmutated  

"bulk_dp_interval" : (15,65), #The minimal and maximal read depth required in the unamplified bulk sample in a predicted somatic variant site 


}
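The tuple-pair annotation above can be illustrated with a small classifier. The sketch below applies "dp_tuple_limit", "tuples_ratio" and "tuples_internal_ratio" to the tuple counts of one sample at one site and reports whether the counts fit a heterozygous or a homozygous tuple pair. It is a simplified reading of the parameter comments, not the genotyping code in Analyze.py: the function name, the "HET"/"HOM" labels and the rule of returning None when zero or several pairs fit are assumptions.

```python
# Subset of the defaults listed above.
ANALYZE_PARAMS = {
    "dp_tuple_limit": 5,
    "tuples_ratio": 0.9,
    "tuples_internal_ratio": 0.1,
}

# Tuple pairs as annotated in the comments of analyze_params.
TUPLE_PAIRS = {
    ("RR", "AA"): "HET",
    ("RA", "AR"): "HET",
    ("RR", "RA"): "HOM",
}


def classify_tuples(counts, params=ANALYZE_PARAMS):
    """Return "HET", "HOM" or None for tuple counts such as {"RR": 10, "AA": 8}."""
    total = sum(counts.values())
    if total < params["dp_tuple_limit"]:
        return None  # not enough phased reads
    fitting = []
    for (first, second), label in TUPLE_PAIRS.items():
        a, b = counts.get(first, 0), counts.get(second, 0)
        pair_total = a + b
        if pair_total == 0:
            continue
        # The pair must dominate all tuples and both of its tuples must be seen.
        dominates = pair_total / total >= params["tuples_ratio"]
        balanced = min(a, b) / pair_total >= params["tuples_internal_ratio"]
        if dominates and balanced:
            fitting.append(label)
    # Assumption: an ambiguous fit is treated as no call.
    return fitting[0] if len(fitting) == 1 else None
```

A sample showing only RR tuples fits no pair under this reading, because neither pair reaches the internal ratio.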

################ Misc ##################
misc_params = {

"mut_nr_limit" : 1, #Maximum number of somatic variants per genomic window
"mut_dist_limit" : 1000, #Genomic window size (bp)



# Defining output to display
# MIN_HET : the minimum number of samples required to be predicted mutated
# MIN_HOM : the minimum number of samples required to be predicted unmutated
# MAX_HET : the maximum number of samples allowed to be predicted mutated
# MAX_HOM : the maximum number of samples allowed to be predicted unmutated

}
trees = {
    'tree1':{
        'samples':['cell_1','cell_2','cell_3','cell_4','cell_5','cell_6','cell_7','cell_8','cell_9','cell_10'],
        'params': { 'MIN_HET': 2, 'MIN_HOM': 1, 'MAX_HET': 9, 'MAX_HOM': 8}
    },
    'all':{
        'samples':['cell_1','cell_2','cell_3','cell_4','cell_5','cell_6','cell_7','cell_8','cell_9','cell_10'],
        'params': { 'MIN_HET': 2, 'MIN_HOM': 1, 'MAX_HET': 9, 'MAX_HOM': 8}
    }
}



# Conbase can also be run in a supervised mode, to parse variant calls enriched in a defined population.
# In the example below, tree2 allows no sample to be predicted mutated (MAX_HET is 0).

trees = {
    'tree1':{
        'samples':['cell_1','cell_2','cell_3','cell_4','cell_5'],
        'params': { 'MIN_HET': 2, 'MIN_HOM': 1, 'MAX_HET': 4, 'MAX_HOM': 3}
    },
    'tree2':{
        'samples':['cell_6','cell_7','cell_8','cell_9','cell_10'],
        'params': { 'MIN_HET': 0, 'MIN_HOM': 0, 'MAX_HET': 0, 'MAX_HOM': 5}
    },
    'all':{
        'samples':['cell_1','cell_2','cell_3','cell_4','cell_5','cell_6','cell_7','cell_8','cell_9','cell_10'],
        'params': { 'MIN_HET': 2, 'MIN_HOM': 1, 'MAX_HET': 4, 'MAX_HOM': 8}
    }
}
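The MIN/MAX settings can be checked against a set of predicted genotypes as follows. This is a minimal sketch of how one tree's limits might be evaluated; the function name and the "HET"/"HOM" genotype labels are assumptions, and how Conbase combines the results of several trees is not covered here.

```python
def site_passes_tree(genotypes, tree):
    """Check one tree's limits against predicted genotypes.

    genotypes maps sample name to "HET" (predicted mutated),
    "HOM" (predicted unmutated) or None (no prediction).
    """
    calls = [genotypes.get(sample) for sample in tree["samples"]]
    n_het = calls.count("HET")
    n_hom = calls.count("HOM")
    limits = tree["params"]
    return (limits["MIN_HET"] <= n_het <= limits["MAX_HET"]
            and limits["MIN_HOM"] <= n_hom <= limits["MAX_HOM"])
```

With the supervised example above, a site with two mutated and one unmutated cell in tree1 passes tree1, and passes tree2 only if none of cell_6 to cell_10 is predicted mutated.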

Implementation by

Marie Kindblom (https://github.com/mkindblom)
Ezeddin Al Hakim (https://github.com/ezeddin)

License

This project is licensed under the MIT License; see the LICENSE.md file for details.
