Skip to content

hillerlab/GeneLossPipe

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

14 Commits
 
 
 
 
 
 

Repository files navigation

We recommend running TOGA which incorporates an improved version of the gene loss detection pipeline and offers many additional features, including gene annotation and orthology inference.

Computational pipeline to detect gene loss events

In order to achieve high specificity, this pipeline implements a number of steps that overcome assembly and alignment issues, and address evolutionary exon-intron structure changes in genes that are conserved.

The starting point of the pipeline is a multiple genome alignment. The output is a list of lost genes.

1. Quality masking of genome alignments

Genome assemblies can contain regions of poor sequence quality, where the probability of sequencing errors is substantially elevated. These sequencing errors can mimic gene-inactivating mutations. To avoid falsely calling these errors as inactivating mutations, we employ a quality masking procedure. The script mafQualityMasking.pl masks every base of low sequencing quality by replacing it with an “N” character.

mafQualityMasking.pl -input Input_maf_file -output Output_maf_file 

2. Filtering genome alignments for assembly gaps and unaligning sequence

Genome assembly gaps indicate regions where parts of the real genome are missing in the given assembly. Missing sequence can comprise exons or even entire genes, which mimics exon or gene deletions. To avoid falsely calling such regions as deletions, mafFilterElines.pl uses the assembly gap annotation to remove all deletions or regions of unaligning sequence from the alignment that overlap assembly gaps.

mafFilterElines.pl -input Input_maf_file -output Output_maf_file -reference ReferenceSpecies 
 -alignment alignment

3. Filtering genome alignments for paralogs and processed pseudogenes

Genome alignments can incorrectly align a gene to a processed pseudogene or paralog in the query species in case the ortholog is missing in the query assembly, which is especially problematic for lower-quality assemblies. Therefore, we filter the genome alignment by removing aligning blocks that likely come from a paralog or processed pseudogene alignment. This is achieved by first identifying chain blocks that are likely orthologous (we call them whiteListed chains) versus chain blocks that likely represent paralogs or processed pseudogene chains (we call them blackListed chains). This cleaning is achieved by a two step process: First, we convert the coordinates of these chains to bed format:

filterMafsForWhiteListedChainsPPS.pl ReferenceSpecies QuerySpecies geneChainFilteringDir 
 outputDirectoryWhite outputDirectoryBlack

where geneChainFilteringDir is a directory containing chain blocks that have been classified as whiteListed or blackListed, outputDirectoryWhite is a directory containing bed files of whiteListed chains and outputDirectoryBlack is a directory containing bed files of blackListed chains.

Subsequently, we overlap the coordinates of these chains with the coordinates of aligning sequence in the maf blocks and filter out sequences that come from paralogs/processed pseudogenes

filterMafsForWhiteListedChains.pl -in Input_maf_file -out Output_maf_file -ref ReferenceSpecies 
 -whiteListedChainsDir WhiteListedChainsDirectory -blackListedChainsDir BlackListedChainsDirectory 

where WhiteListedChainsDirectory BlackListedChainsDirectory is produced by filterMafsForWhiteListedChainsPPS.pl above.

4. Realignment with CESAR

Ambiguous alignments, where several suboptimal solutions exist, can result in spurious frameshifting indels and splice site mutations. Additionally, splice site positions can shift in evolution. To avoid these sequence alignment issues and to detect shifted splice sites, we use CESAR (Coding Exon-Structure Aware Realigner) to realign the query sequence that is aligned to an exon of the reference, taking reading frame and splice site information into account. This step involves determining intron length for every exon in the genome alignment to identify cases of intron deletion. Subsequently, exons where the intervening introns are deleted are aligned together as a single unit. All other exons are aligned individually.

cesarRealign.pl speciesList allTranscriptsList ReferenceSpecies bigBedIndex clade [human|mouse] jobsPrefix

where speciesList is a file listing query species (one species per line), allTranscriptsList is a file listing all the genes/transcripts in modified genePred format (full path needs to be specified for this file), bigBedIndex is the indexed alignment (.bb) file, and jobsPrefix specifies a prefix of the generated job files.

cesarRealign.pl will produce three jobs file: $jobsPrefix_short, $jobsPrefix_medium, $jobsPrefix_long. Each job in the $jobsPrefix_short typically finish within an hour, $jobsPrefix_medium need something between 4 to 8 hours, and $jobsPrefix_long run longer.

These jobs need to be executed on a compute cluster. Afterwards, merging the alignment files produced by CESAR to produce one file that contains alignments for all exons:

MergeBDB.perl realignedExons/realigned. allExons_CESAR.BDB

5. Creating tabular files:

The tabular format is a data structure that is used by geneLossPipeline to output mutations in the pairwise alignments (for reference-query species pair for every gene). createTabFiles.pl does this

createTabFiles.pl -input geneList -ref RefSpecies -out OutputDirectory -log logFile 
 -index allExons_CESAR.BDB  

where geneList is a file listing the genes in modified genePred format.

Parameters:

-v|verbose
-species [Optional: instead of all species in the genome alignment, restrict to one 
 or few species: a comma separated string] 
-pMammals [run only for placental mammals in the input alignment. This information 
 is contained in GeneLoss.config]
-alignment [Optional but mandatory if you specify pMammals] 
-sameSS [Only process genes where the entire gene in the query is on one 
 scaffold/chromosome and a single strand] 
-excludeGenes [Optional: a list of genes that should be excluded, one gene per line]  

6. Detecting mutations in pairwise alignments:

This step involves detection of mutations in pairwise alignments (for reference-query species pair for every gene)

geneLossPipeline.pl -input gene/geneListFile -out Output_Directory -alignment Alignment 
 -ref ReferenceSpecies -tabDir DirectoryWithTabularFiles

Parameters:

-fmt [Optional, if set to 'lst', an input file needs to be specified where each line 
 is one gene/transcript] 
-species [Optional: a comma separated file listing the species (the default behaviour is 
 to look for every species for which there is a tabular file)]
-pMammals [run only for placental mammals in the input alignment. This information 
 is contained in GeneLoss.config]
-EDE [Exclude Dubious Events - basically frameshifts that are flanked by a base of 
 low quality or frameshifting insertions where inserts have a base of low quality, 
 default is report all events]
-excludeFPI [exclude FramePreserving events]
-all [Run 3 modules - Indel, InframeStopCodon and SpliceSite, otherwise you could specify 
  -indel for frameshift, -ifsc for inflame stop codons and -ssm for splice site mutations]

Afterwards merge the mutation profile for every gene to produce one file:

MergeBDB.perl Output_Directory/geneLossPipe. geneLossPipeAllGenes.BDB

7. Calling gene losses from the mutation profile for every gene:

geneLossCaller.pl genesList GeneLossPipeBDB annotationFile species 
  ReadingFrameThreshold exonGroupsDir u12IntronList outFile

where genesList is a file listing one gene per line (the first field is the gene name, the second is a string of principal isoforms which are separated by comma), GeneLossPipeBDB is produced in Step 6, annotationFile is a the file with the modified genePred format, ReadingFrameThreshold is the fraction of the remaining intact reading frame (0.6 was used for our study), exonGroupsDir is produced by cesarRealign.pl, and u12IntronList is a file listing coordinates of U12 introns in bed format.

Optional parameters:

-compFS [Count frameshifts as inactivating mutations even if they are compensated] 
-verbose 
-longIndel [Treat every indel longer than 50 bp as an inactivating mutation] 

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published