-
Notifications
You must be signed in to change notification settings - Fork 82
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
2x Expected Gene Count #1010
Comments
What seems to happen for larger eukaryotes is that codingquarry and snap don't work very well. Can you check the logs and see what the number of genes predicted for each took prior to EVM is? Might need to turn off a few of the ab initio models. Also, is your assembly haploid? |
Yes, it's a haploid assembly. I took extra care also by running funnotate clean and funnotate sort prior to the pipeline if that's helpful I think the portion of the log you are asking for is here (so many logs!)
|
So it looks like snap and glimmerhmm aren't performing very well. So you can turn those off by passing If you are more "adventurous" and would perhaps tolerate a few bugs, I've been slowly working on |
OK, I'll give those settings a shot and see what happens. But yes, I'm feeling adventurous (adventure is my middle name! ha-- it's actually Raymond). Send me some info, and I'll be happy to be a test pilot. I totally understand the Rube Goldberg nature of things, and appreciate your efforts to simplify. If I can be of any assistance, let me know! |
Okay, so for # install external dependencies
mamba create -n funannotate2 "python<=3.10" biopython "evidencemodeler>=2" minimap2 miniprot snap "augustus==3.5.0" glimmerhmm diamond trnascan-se table2asn
# activate the environment
conda activate funannotate2
# install latest from git
python -m pip install git+https://github.com/nextgenusfs/funannotate2.git
# need to also set the env variable pointing to f1 FUNANNOTATE_DB
export FUNANNOTATE_DB=/path/to/existing/funannotate_db/folder You could probably also just install it in the same environment as your existing funannotate installation, and then would just need to add The help menus should all look and feel familiar $ funannotate2
usage: funannotate2 [-h] [--version] {clean,train,predict} ...
Funannotate: eukaryotic genome annotation pipeline
Commands:
clean Find and remove duplicated contigs, sort by size, rename headers.
train Train ab intio gene prediction algorithms.
predict Predict primary gene models in a eukaryotic genome.
Help:
-h, --help Show this help message and exit
--version show program's version number and exit One difference is that $ funannotate2 train
usage: funannotate2 train -f -s -o [-t] [--strain] [--cpus] [--optimize-augustus] [--header-len] [-h] [--version]
Gene model training from automated training of ab initio gene prediction algorithms:
Required arguments:
-f , --fasta genome in FASTA format
-s , --species Species name, use quotes for binomial, e.g. "Aspergillus fumigatus"
-o , --out Output folder name
Optional arguments:
-t , --training-set Training set to use in GFF3 format
--strain Strain/isolate name
--cpus Number of CPUs to use (default: 2)
--optimize-augustus Run Augustus mediated optimized training (not recommended) (default: False)
--header-len Max length for fasta headers (default: 100)
Other arguments:
-h, --help show this help message and exit
--version show program's version number and exit Example of a training parameters JSON output --> written to {
"name": "ophidiomyces_ophidiicola",
"species": "Ophidiomyces ophidiicola",
"taxonomy": {
"superkingdom": "Eukaryota",
"kingdom": "Fungi",
"phylum": "Ascomycota",
"class": "Eurotiomycetes",
"order": "Onygenales",
"family": "Onygenaceae",
"genus": "Ophidiomyces",
"species": "Ophidiomyces ophidiicola"
},
"busco-lineage": "/usr/local/share/funannotate/onygenales_odb10",
"assembly": {
"n_contigs": 115,
"size": 21919615,
"n50": 506472,
"n90": 122144,
"l50": 12,
"l90": 42,
"avg_length": 190605
},
"abinitio": {
"augustus": {
"location": "/Users/jon/software/funannotate2/tests/fun2_train/train_misc/augustus",
"species": "7975c7f5-2f76-445e-8880-7c021e5cb032",
"train_results": {
"tool": "augustus",
"model": "7975c7f5-2f76-445e-8880-7c021e5cb032",
"n_test_genes": 200,
"ref_genes_found": 198,
"ref_genes_missed": 2,
"extra_query_genes": 120,
"average_aed": 0.11175562640178095,
"nucleotide_sensitivity": 0.9075853572557586,
"nucleotide_precision": 0.9036721113763718,
"exon_sensitivity": 0.6288659793814433,
"exon_precision": 0.6606332098600142,
"gene_sensitivity": 0.9753086419753086,
"gene_precision": 0.3969849246231156
},
"training_set": "/Users/jon/software/funannotate2/tests/fun2_train/train_misc/training-models.train.gff3"
},
"glimmerhmm": {
"location": "/Users/jon/software/funannotate2/tests/fun2_train/train_misc/glimmerhmm/train",
"train_results": {
"tool": "glimmerhmm",
"model": "train",
"n_test_genes": 200,
"ref_genes_found": 199,
"ref_genes_missed": 1,
"extra_query_genes": 178,
"average_aed": 0.14522900680209533,
"nucleotide_sensitivity": 0.8494452668122087,
"nucleotide_precision": 0.9028987277614996,
"exon_sensitivity": 0.5104712041884817,
"exon_precision": 0.5557903266018449,
"gene_sensitivity": 0.9821428571428571,
"gene_precision": 0.23605150214592274
},
"training_set": "/Users/jon/software/funannotate2/tests/fun2_train/train_misc/training-models.train.gff3"
},
"snap": {
"location": "/Users/jon/software/funannotate2/tests/fun2_train/train_misc/snap/snap-trained.hmm",
"train_results": {
"tool": "snap",
"model": "snap-trained.hmm",
"n_test_genes": 200,
"ref_genes_found": 195,
"ref_genes_missed": 5,
"extra_query_genes": 127,
"average_aed": 0.18329479111453992,
"nucleotide_sensitivity": 0.8492347150671552,
"nucleotide_precision": 0.8563979566847235,
"exon_sensitivity": 0.31283422459893045,
"exon_precision": 0.2723601561836857,
"gene_sensitivity": 0.16666666666666666,
"gene_precision": 0.0078125
},
"training_set": "/Users/jon/software/funannotate2/tests/fun2_train/train_misc/training-models.train.gff3"
},
"genemark": {
"location": "/Users/jon/software/funannotate2/tests/fun2_train/train_misc/genemark/gmhmm.mod",
"train_results": {
"tool": "genemark",
"model": "gmhmm.mod",
"n_test_genes": 200,
"ref_genes_found": 200,
"ref_genes_missed": 0,
"extra_query_genes": 196,
"average_aed": 0.09756846438881216,
"nucleotide_sensitivity": 0.898724540750394,
"nucleotide_precision": 0.9344481427857675,
"exon_sensitivity": 0.6243654822335025,
"exon_precision": 0.6643433977190323,
"gene_sensitivity": 1.0,
"gene_precision": 0.2924187725631769
},
"training_set": "self training"
}
},
"date": "2023-10-26 12:03:48.138710",
"user": "jon"
} So with your data and since you have already generated RNA-seq mediated Trinity/PASA, I would pass the PASA GFF3 file to $ funannotate2 predict
usage: funannotate2 predict -f -o -p -s [-st] [-w] [-m] [-ps] [-ts] [-c] [-mi] [-hl] [-l] [-n] [-pt] [-t] [-h] [--version]
Gene model prediction from automated training of ab initio gene prediction algorithms:
Required arguments:
-f , --fasta genome in FASTA format (softmasked repeats)
-o , --out Output folder name
-p , --params , --pretrained Params.json
-s , --species Species name, use quotes for binomial, e.g. "Aspergillus fumigatus"
Optional arguments:
-st , --strain Strain/isolate name (e.g. Af293)
-w , --weights Gene predictors and weights
-m , --consensus-method Consensus model generation method [evm,gfftk] (default: gfftk)
-ps , --proteins Proteins to use for evidence
-ts , --transcripts Transcripts to use for evidence
-c , --cpus Number of cpus/threads to use (default: 2)
-mi , --max-intron Maximum intron length (default: 3000)
-hl , --header-len Max length for fasta headers (default: 100)
-l , --locus-tag Locus tag for genes, perhaps assigned by NCBI, eg. VC83 (default: FUN_)
-n , --numbering Specify start of gene numbering (default: 1)
-t , --tmpdir volume to write tmp files (default: /tmp)
Other arguments:
-h, --help show this help message and exit
--version show program's version number and exit So you will also (for now) need to pass transcripts and proteins manually for evidence, so for your dataset something like this: funannotate2 predict -p /path/to/genus_species.params.json -s "Genus species" -o out_folder \
-f sortedmasked.fa -ps $FUNANNOTATE_DB/uniprot_sprot.fasta \
-ts Trinity_GG.fasta --cpus N So I'm curious to know if the new consensus caller Of course let me know if something isn't clear. |
Hi Jon, Thanks for the detailed guidance on this. One quick question is that there are several PASA gff files output in my previous training run, which one should I use as input for the funnotate2 train -t parameter? In my output/training folder there is: within the output/training/pasa directory there are several more : |
A second note, I tried rerunning funannotate 1 with the parameters you suggested before (and increasing the max intron length). I ended up slightly more genes:
Source Weight Count I'm noticing there is quite a few incoming genes from GeneMark, whereas the others are all more on the reasonable side. I'm thinking I might also weight GeneMark 0 as well and run again? That leaves just PASA and Augustus |
funannotate_train.pasa.gff3 should be the same as [species]_pasa.assemblies.fasta.transdecoder.genome.gff3 I think. Basically you want the transdecoder filtered set that has genomic coordinates. Huh, wondering if EVM was actually passed the proper data, is that shown in the log file for EVM? |
So based on the numbers of ab-initio gene calls all tools are predicting many more genes than 25k.
So the PASA models are typically somewhere around 1/2 of what is normally in a genome (at least when we are talking about fungi - perhaps this is different in animals) -- effectively they are expressed genes, which I'd only expect to be max of maybe 60% of the genes in the genome -- unless you have the worlds best RNA-seq data from multiple cells/timepoints/etc. |
I'm still interested to see what the |
Hi @nextgenusfs --
Thanks for writing this incredible software! I've been experimenting with funannotate for the past week or so, and have it running on some new electric fish genomes. I've been meaning to try it out and compare results to MAKER, which has been our go-to pipeline for a while.
I've been following along with the suggestions at (https://funannotate.readthedocs.io/en/latest/tutorials.html) for Non-fungal genomes running the following:
funannotate train
-i species_genome_cleaned.sortedmasked.fa
-o species_train
-l ${r1_files}
-r ${r2_files}
--pasa_db mysql
--trinity Trinity-GG.fasta
--max_intronlen 10000
--species "Myfish species"
--cpus 16
Followed by:
funannotate predict
-i species_genome_cleaned.sortedmasked.fa
-o apteronotus_train -s "Myfish species"
--repeats2evm
--busco_seed_species zebrafish
--busco_db actinopterygii
--organism other
--max_intronlen 10000
--species "Myfish species"
--cpus 16
If I'm understanding the outputs correctly, the "Myfish_species.stats.json" file lists a whopping 58,036 genes-- we are expecting about half that (25k), in fact, what MAKER gives us on the identical genome, and what gene counts are listed for other closely related species.
I'm not sure where things went awry-- I am reusing the Genome Guided transcriptome that I pre-assembled for MAKER annotation rather than running it via Funannotate. I am using the same max_intronlen that I used for MAKER, however I'm not sure how strictly enforced that is in MAKER. I did noticed that from a related species that is annotated in NCBI, the maximum intron length is 500,000bp, which is obviously quite different. Finally, I did note a complaint in the output log at the Augustus Stage of training:
Augustus initial training results:
Feature Specificity Sensitivity
nucleotides 91.4% 84.3%
exons 71.9% 68.8%
genes 11.3% 9.6%
[Mar 05 06:09 PM]: Accuracy seems low, you can try to improve by passing the --optimize_augustus option.
I'm contemplating rerunning using this setting to see if this dramatically reduces gene count, but curious if there have been reports of this previously, or if there are parameters I might tweak to get a more realistic gene number?
I am noticing in the "stats.json" file that a pretty small proportion of transcripts and proteins overlap with evidence ,and gene lengths seem to be much smaller than predicted by MAKER:
"annotation": {
"genes": 58036,
"common_name": 0,
"mRNA": 50218,
"tRNA": 7818,
"ncRNA": 0,
"rRNA": 0,
"avg_gene_length": 3326.31,
"transcript-level": {
"CDS_transcripts": 50218,
"CDS_five_utr": 0,
"CDS_three_utr": 0,
"CDS_no_utr": 50218,
"CDS_five_three_utr": 0,
"CDS_complete": 48843,
"CDS_no-start": 687,
"CDS_no-stop": 502,
"CDS_no-start_no-stop": 186,
"total_exons": 272309,
"total_cds_exons": 272309,
"multiple_exon_transcript": 44082,
"single_exon_transcript": 6136,
"avg_exon_length": 168.64,
"avg_protein_length": 318.78,
"functional": {
"go_terms": 0,
"interproscan": 0,
"eggnog": 0,
"pfam": 0,
"cazyme": 0,
"merops": 0,
"busco": 0,
"secretion": 0
},
"pct_exon_overlap_transcript_evidence": 33.02,
"pct_exon_overlap_protein_evidence": 2.51
}
Thanks for writing this software, and for any thoughts you might have on improving our results!
The text was updated successfully, but these errors were encountered: