2x Expected Gene Count #1010

Open
jasongallant opened this issue Mar 8, 2024 · 10 comments

@jasongallant

Hi @nextgenusfs --

Thanks for writing this incredible software! I've been experimenting with funannotate for the past week or so, and have it running on some new electric fish genomes. I've been meaning to try it out and compare results to MAKER, which has been our go-to pipeline for a while.

I've been following the suggestions for non-fungal genomes at https://funannotate.readthedocs.io/en/latest/tutorials.html, running the following:

funannotate train \
    -i species_genome_cleaned.sortedmasked.fa \
    -o species_train \
    -l ${r1_files} \
    -r ${r2_files} \
    --pasa_db mysql \
    --trinity Trinity-GG.fasta \
    --max_intronlen 10000 \
    --species "Myfish species" \
    --cpus 16

Followed by:

funannotate predict \
    -i species_genome_cleaned.sortedmasked.fa \
    -o apteronotus_train -s "Myfish species" \
    --repeats2evm \
    --busco_seed_species zebrafish \
    --busco_db actinopterygii \
    --organism other \
    --max_intronlen 10000 \
    --species "Myfish species" \
    --cpus 16

If I'm understanding the outputs correctly, the "Myfish_species.stats.json" file lists a whopping 58,036 genes. We are expecting about half that (~25k), which is what MAKER gives us on the identical genome and what is reported for other closely related species.

I'm not sure where things went awry. I am reusing the genome-guided transcriptome that I pre-assembled for the MAKER annotation rather than assembling it via funannotate. I am also using the same max_intronlen that I used for MAKER, though I'm not sure how strictly that is enforced in MAKER. I did notice that for a related species annotated in NCBI, the maximum intron length is 500,000 bp, which is obviously quite different. Finally, I noted a complaint in the output log at the Augustus stage of training:

Augustus initial training results:
  Feature       Specificity   Sensitivity
  nucleotides   91.4%         84.3%
  exons         71.9%         68.8%
  genes         11.3%         9.6%
[Mar 05 06:09 PM]: Accuracy seems low, you can try to improve by passing the --optimize_augustus option.

I'm contemplating rerunning with this setting to see if it dramatically reduces the gene count, but I'm curious whether this has been reported before, or whether there are parameters I might tweak to get a more realistic gene number.
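If I go that route, I assume the rerun is just the same predict call with the extra flag appended, something like this (a sketch; I haven't confirmed whether anything else needs to change):

# rerun prediction, letting funannotate refine the Augustus training (long runtime)
funannotate predict \
    -i species_genome_cleaned.sortedmasked.fa \
    -o apteronotus_train -s "Myfish species" \
    --repeats2evm \
    --busco_seed_species zebrafish \
    --busco_db actinopterygii \
    --organism other \
    --max_intronlen 10000 \
    --optimize_augustus \
    --cpus 16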

I am also noticing in the "stats.json" file that a pretty small proportion of transcripts and proteins overlap with evidence, and gene lengths seem to be much smaller than predicted by MAKER:

"annotation": {
"genes": 58036,
"common_name": 0,
"mRNA": 50218,
"tRNA": 7818,
"ncRNA": 0,
"rRNA": 0,
"avg_gene_length": 3326.31,
"transcript-level": {
"CDS_transcripts": 50218,
"CDS_five_utr": 0,
"CDS_three_utr": 0,
"CDS_no_utr": 50218,
"CDS_five_three_utr": 0,
"CDS_complete": 48843,
"CDS_no-start": 687,
"CDS_no-stop": 502,
"CDS_no-start_no-stop": 186,
"total_exons": 272309,
"total_cds_exons": 272309,
"multiple_exon_transcript": 44082,
"single_exon_transcript": 6136,
"avg_exon_length": 168.64,
"avg_protein_length": 318.78,
"functional": {
"go_terms": 0,
"interproscan": 0,
"eggnog": 0,
"pfam": 0,
"cazyme": 0,
"merops": 0,
"busco": 0,
"secretion": 0
},
"pct_exon_overlap_transcript_evidence": 33.02,
"pct_exon_overlap_protein_evidence": 2.51
}

Thanks for writing this software, and for any thoughts you might have on improving our results!

@nextgenusfs
Owner

What seems to happen with larger eukaryotes is that codingquarry and snap don't work very well. Can you check the logs and see what the number of genes predicted by each tool prior to EVM is? You might need to turn off a few of the ab initio models. Also, is your assembly haploid?

@jasongallant
Author

jasongallant commented Mar 8, 2024

Yes, it's a haploid assembly. I also took extra care by running funannotate clean and funannotate sort prior to the pipeline, if that's helpful.

I think the portion of the log you are asking for is here (so many logs!):

[Mar 05 06:09 PM]: Running Augustus gene prediction using apteronotus_albifrons parameters
[Mar 05 07:15 PM]: 51,732 predictions from Augustus
     Progress: 0 complete, 0 failed, 1471 remaining
[Mar 05 07:15 PM]: Pulling out high quality Augustus predictions
[Mar 05 07:15 PM]: Found 13,306 high quality predictions from Augustus (>90% exon evidence)
[Mar 05 07:15 PM]: Running SNAP gene prediction, using training data: apteronotus_train/predict_misc/final_training_models.gff3
[Mar 05 07:58 PM]: 104,593 predictions from SNAP
[Mar 05 07:58 PM]: Running GlimmerHMM gene prediction, using training data: apteronotus_train/predict_misc/final_training_models.gff3
[Mar 05 09:05 PM]: 106,875 predictions from GlimmerHMM
[Mar 05 09:05 PM]: Summary of gene models passed to EVM (weights):
  Source         Weight   Count
  Augustus       1        38426
  Augustus HiQ   2        13306
  GeneMark       1        78608
  GlimmerHMM     1        106875
  pasa           6        31714
  snap           1        104593
  Total          -        373522
[Mar 05 09:05 PM]: EVM: partitioning input to ~ 35 genes per partition using min 1500 bp interval
[Mar 06 12:41 AM]: Converting to GFF3 and collecting all EVM results
[Mar 06 12:41 AM]: 51,654 total gene models from EVM
[Mar 06 12:41 AM]: Generating protein fasta files from 51,654 EVM models
[Mar 06 12:42 AM]: now filtering out bad gene models (< 50 aa in length, transposable elements, etc).
[Mar 06 12:42 AM]: Found 1,420 gene models to remove: 33 too short; 0 span gaps; 1,387 transposable elements
[Mar 06 12:42 AM]: 50,234 gene models remaining
[Mar 06 12:42 AM]: Predicting tRNAs
[Mar 06 01:16 AM]: 7,818 tRNAscan models are valid (non-overlapping)

@nextgenusfs
Owner

nextgenusfs commented Mar 8, 2024

It looks like snap and glimmerhmm aren't performing very well, so you can turn those off by passing --weights snap:0 glimmerhmm:0 to your funannotate predict command. It should re-use the existing data and re-run EVM with the new weights, effectively ignoring the snap and glimmerhmm models.
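Concretely, the rerun could look something like this (a sketch based on your original command, re-using the same output folder so the existing intermediate results get picked up):

# re-run EVM with snap and glimmerhmm down-weighted to zero
funannotate predict \
    -i species_genome_cleaned.sortedmasked.fa \
    -o apteronotus_train -s "Myfish species" \
    --repeats2evm \
    --busco_seed_species zebrafish \
    --busco_db actinopterygii \
    --organism other \
    --max_intronlen 10000 \
    --weights snap:0 glimmerhmm:0 \
    --cpus 16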

If you are more "adventurous" and can tolerate a few bugs, I've been slowly working on funannotate2: https://github.com/nextgenusfs/funannotate2. The training and predict modules are mostly complete. I've re-written mostly everything to hopefully work better and have also written a new consensus gene calling method. Largely I'm trying to reduce dependencies and complexity; funannotate tries to do too many things and installation becomes difficult. I'd be interested to know whether it works better or not. Let me know if you'd like to give it a try and I can give you some instructions.

@jasongallant
Author

OK, I'll give those settings a shot and see what happens.

But yes, I'm feeling adventurous (adventure is my middle name! ha-- it's actually Raymond). Send me some info, and I'll be happy to be a test pilot. I totally understand the Rube Goldberg nature of things, and appreciate your efforts to simplify. If I can be of any assistance, let me know!

@nextgenusfs
Owner

Okay, so for f2 you can do this to install with mamba/conda/pip:

# install external dependencies
mamba create -n funannotate2 "python<=3.10" biopython "evidencemodeler>=2" minimap2 miniprot snap "augustus==3.5.0" glimmerhmm diamond trnascan-se table2asn

# activate the environment
conda activate funannotate2

# install latest from git
python -m pip install git+https://github.com/nextgenusfs/funannotate2.git

# need to also set the env variable pointing to f1 FUNANNOTATE_DB
export FUNANNOTATE_DB=/path/to/existing/funannotate_db/folder

You could probably also just install it in the same environment as your existing funannotate installation; then you would only need to add miniprot and table2asn via conda/mamba.
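For example (assuming your existing environment is named funannotate -- adjust the name to whatever you actually use):

# add the two extra dependencies to an existing funannotate environment
mamba install -n funannotate miniprot table2asn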

The help menus should all look and feel familiar:

$ funannotate2
usage: funannotate2 [-h] [--version] {clean,train,predict} ...

Funannotate: eukaryotic genome annotation pipeline

Commands:
  clean       Find and remove duplicated contigs, sort by size, rename headers.
  train       Train ab initio gene prediction algorithms.
  predict     Predict primary gene models in a eukaryotic genome.

Help:
  -h, --help  Show this help message and exit
  --version   show program's version number and exit

One difference is that funannotate2 train will run all of the training steps and takes an optional -t, --training-set in GFF3 format. If a training set is not passed, it will run buscolite to generate one (the tool will auto-detect the appropriate BUSCO models and download them based on the taxonomy you provide via the -s, --species parameter, so don't put a dummy value in there). Those gene models are then used to train all of the ab initio predictors. The output of train is a JSON file that contains all of the training parameters, which are then used by funannotate2 predict.

$ funannotate2 train
usage: funannotate2 train -f  -s  -o  [-t] [--strain] [--cpus] [--optimize-augustus] [--header-len] [-h] [--version]

Gene model training from automated training of ab initio gene prediction algorithms:

Required arguments:
  -f , --fasta          genome in FASTA format
  -s , --species        Species name, use quotes for binomial, e.g. "Aspergillus fumigatus"
  -o , --out            Output folder name

Optional arguments:
  -t , --training-set   Training set to use in GFF3 format
  --strain              Strain/isolate name
  --cpus                Number of CPUs to use (default: 2)
  --optimize-augustus   Run Augustus mediated optimized training (not recommended) (default: False)
  --header-len          Max length for fasta headers (default: 100)

Other arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit

Example of a training parameters JSON output --> written to output/train_results/<species>.params.json:

{
  "name": "ophidiomyces_ophidiicola",
  "species": "Ophidiomyces ophidiicola",
  "taxonomy": {
    "superkingdom": "Eukaryota",
    "kingdom": "Fungi",
    "phylum": "Ascomycota",
    "class": "Eurotiomycetes",
    "order": "Onygenales",
    "family": "Onygenaceae",
    "genus": "Ophidiomyces",
    "species": "Ophidiomyces ophidiicola"
  },
  "busco-lineage": "/usr/local/share/funannotate/onygenales_odb10",
  "assembly": {
    "n_contigs": 115,
    "size": 21919615,
    "n50": 506472,
    "n90": 122144,
    "l50": 12,
    "l90": 42,
    "avg_length": 190605
  },
  "abinitio": {
    "augustus": {
      "location": "/Users/jon/software/funannotate2/tests/fun2_train/train_misc/augustus",
      "species": "7975c7f5-2f76-445e-8880-7c021e5cb032",
      "train_results": {
        "tool": "augustus",
        "model": "7975c7f5-2f76-445e-8880-7c021e5cb032",
        "n_test_genes": 200,
        "ref_genes_found": 198,
        "ref_genes_missed": 2,
        "extra_query_genes": 120,
        "average_aed": 0.11175562640178095,
        "nucleotide_sensitivity": 0.9075853572557586,
        "nucleotide_precision": 0.9036721113763718,
        "exon_sensitivity": 0.6288659793814433,
        "exon_precision": 0.6606332098600142,
        "gene_sensitivity": 0.9753086419753086,
        "gene_precision": 0.3969849246231156
      },
      "training_set": "/Users/jon/software/funannotate2/tests/fun2_train/train_misc/training-models.train.gff3"
    },
    "glimmerhmm": {
      "location": "/Users/jon/software/funannotate2/tests/fun2_train/train_misc/glimmerhmm/train",
      "train_results": {
        "tool": "glimmerhmm",
        "model": "train",
        "n_test_genes": 200,
        "ref_genes_found": 199,
        "ref_genes_missed": 1,
        "extra_query_genes": 178,
        "average_aed": 0.14522900680209533,
        "nucleotide_sensitivity": 0.8494452668122087,
        "nucleotide_precision": 0.9028987277614996,
        "exon_sensitivity": 0.5104712041884817,
        "exon_precision": 0.5557903266018449,
        "gene_sensitivity": 0.9821428571428571,
        "gene_precision": 0.23605150214592274
      },
      "training_set": "/Users/jon/software/funannotate2/tests/fun2_train/train_misc/training-models.train.gff3"
    },
    "snap": {
      "location": "/Users/jon/software/funannotate2/tests/fun2_train/train_misc/snap/snap-trained.hmm",
      "train_results": {
        "tool": "snap",
        "model": "snap-trained.hmm",
        "n_test_genes": 200,
        "ref_genes_found": 195,
        "ref_genes_missed": 5,
        "extra_query_genes": 127,
        "average_aed": 0.18329479111453992,
        "nucleotide_sensitivity": 0.8492347150671552,
        "nucleotide_precision": 0.8563979566847235,
        "exon_sensitivity": 0.31283422459893045,
        "exon_precision": 0.2723601561836857,
        "gene_sensitivity": 0.16666666666666666,
        "gene_precision": 0.0078125
      },
      "training_set": "/Users/jon/software/funannotate2/tests/fun2_train/train_misc/training-models.train.gff3"
    },
    "genemark": {
      "location": "/Users/jon/software/funannotate2/tests/fun2_train/train_misc/genemark/gmhmm.mod",
      "train_results": {
        "tool": "genemark",
        "model": "gmhmm.mod",
        "n_test_genes": 200,
        "ref_genes_found": 200,
        "ref_genes_missed": 0,
        "extra_query_genes": 196,
        "average_aed": 0.09756846438881216,
        "nucleotide_sensitivity": 0.898724540750394,
        "nucleotide_precision": 0.9344481427857675,
        "exon_sensitivity": 0.6243654822335025,
        "exon_precision": 0.6643433977190323,
        "gene_sensitivity": 1.0,
        "gene_precision": 0.2924187725631769
      },
      "training_set": "self training"
    }
  },
  "date": "2023-10-26 12:03:48.138710",
  "user": "jon"
}

So with your data, and since you have already generated the RNA-seq mediated Trinity/PASA results, I would pass the PASA GFF3 file to -t, --training-set. After training is complete you can then run funannotate2 predict, which again should look mostly familiar -- currently you will need to pass the full path to the training JSON params file (eventually I'll make this "easy/automatic" as before). For comparison's sake, the latest funannotate1 can also take this params JSON file as input to predict. What is new in funannotate2 predict is a more efficient evidence alignment as well as a new consensus gene calling program.
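With your data the training step could look something like this (a sketch -- the GFF3 file name and output folder are placeholders; point -t at your transdecoder genome-coordinate PASA models):

# train the ab initio predictors from the existing PASA gene models
funannotate2 train -f sortedmasked.fa -s "Genus species" -o fun2_train \
    -t pasa.transdecoder.genome.gff3 --cpus N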

$ funannotate2 predict
usage: funannotate2 predict -f  -o  -p  -s  [-st] [-w] [-m] [-ps] [-ts] [-c] [-mi] [-hl] [-l] [-n] [-pt] [-t] [-h] [--version]

Gene model prediction from automated training of ab initio gene prediction algorithms:

Required arguments:
  -f , --fasta                   genome in FASTA format (softmasked repeats)
  -o , --out                     Output folder name
  -p , --params , --pretrained   Params.json 
  -s , --species                 Species name, use quotes for binomial, e.g. "Aspergillus fumigatus"

Optional arguments:
  -st , --strain                 Strain/isolate name (e.g. Af293)
  -w , --weights                 Gene predictors and weights
  -m , --consensus-method        Consensus model generation method [evm,gfftk] (default: gfftk)
  -ps , --proteins               Proteins to use for evidence
  -ts , --transcripts            Transcripts to use for evidence
  -c , --cpus                    Number of cpus/threads to use (default: 2)
  -mi , --max-intron             Maximum intron length (default: 3000)
  -hl , --header-len             Max length for fasta headers (default: 100)
  -l , --locus-tag               Locus tag for genes, perhaps assigned by NCBI, eg. VC83 (default: FUN_)
  -n , --numbering               Specify start of gene numbering (default: 1)
  -t , --tmpdir                  volume to write tmp files (default: /tmp)

Other arguments:
  -h, --help                     show this help message and exit
  --version                      show program's version number and exit

So you will also (for now) need to pass transcripts and proteins manually as evidence; for your dataset it would be something like this:

funannotate2 predict -p /path/to/genus_species.params.json -s "Genus species" -o out_folder \
    -f sortedmasked.fa -ps $FUNANNOTATE_DB/uniprot_sprot.fasta \
    -ts Trinity_GG.fasta --cpus N

So I'm curious to know if the new consensus caller gfftk consensus results in a more appropriate number of gene models. It is a lot faster than EVM and should generate automatic weightings of the input sources as well.
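If you want a direct comparison, you could presumably run predict a second time into a separate output folder with -m evm to get the EVM-based consensus alongside the default gfftk one (a sketch; same inputs as above):

# same inputs, but force the EVM consensus method for comparison
funannotate2 predict -p /path/to/genus_species.params.json -s "Genus species" -o out_folder_evm \
    -f sortedmasked.fa -ps $FUNANNOTATE_DB/uniprot_sprot.fasta \
    -ts Trinity_GG.fasta -m evm --cpus N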

Of course let me know if something isn't clear.

@jasongallant
Author

Hi Jon,

Thanks for the detailed guidance on this. One quick question: there are several PASA GFF files output by my previous training run; which one should I use as input for the funannotate2 train -t parameter?

In my output/training folder there are:
pasa.step1.gff
funannotate_train.pasa.gff3 (which points to transcript.alignments.gff)

Within the output/training/pasa directory there are several more:
[species]_pasa.assemblies.fasta.transdecoder.genome.gff3
[species]_pasa.assemblies.fasta.transdecoder.gff3
[species]_pasa.failed_blat_alignments.gff3
[species]_pasa.failed_custom_alignments.gff3
[species]_pasa.pasa_assemblies.gff3
[species]_pasa.valid_blat_alignments.gff3
[species]_pasa.valid_custom_alignments.gff3
blat.spliced_alignments.gff3
trinity.fasta.clean.transdecoder.gff3

@jasongallant
Author

A second note: I tried rerunning funannotate 1 with the parameters you suggested (and an increased max intron length). I ended up with slightly more genes:

 Progress: 2939 complete, 0 failed, 0 remaining

  Source         Weight   Count
  Augustus       1        38426
  Augustus HiQ   2        13306
  GeneMark       1        78608
  GlimmerHMM     0        106875
  pasa           6        31714
  snap           0        104593
  Total          -        373522
[Mar 08 01:49 AM]: 60,699 total gene models from EVM
[Mar 08 01:49 AM]: Generating protein fasta files from 60,699 EVM models
[Mar 08 01:50 AM]: now filtering out bad gene models (< 50 aa in length, transposable elements, etc).
[Mar 08 01:50 AM]: Found 1,324 gene models to remove: 17 too short; 1 span gaps; 1,306 transposable elements
[Mar 08 01:50 AM]: 59,375 gene models remaining
[Mar 08 01:50 AM]: Predicting tRNAs
[Mar 08 01:50 AM]: 7,773 tRNAscan models are valid (non-overlapping)
[Mar 08 01:50 AM]: Generating GenBank tbl annotation file
[Mar 08 01:53 AM]: Collecting final annotation files for 67,148 total gene models

I'm noticing that quite a few genes are coming in from GeneMark, whereas the remaining sources are all more on the reasonable side. I'm thinking I might also weight GeneMark 0 and run again? That would leave just PASA and Augustus.
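If so, the rerun would presumably just add genemark to the zero weights, something like this (a sketch; the max_intronlen value here is a placeholder for whatever I used in the last run):

# zero-weight genemark in addition to snap and glimmerhmm
funannotate predict \
    -i species_genome_cleaned.sortedmasked.fa \
    -o apteronotus_train -s "Myfish species" \
    --repeats2evm \
    --busco_seed_species zebrafish \
    --busco_db actinopterygii \
    --organism other \
    --max_intronlen 500000 \
    --weights genemark:0 snap:0 glimmerhmm:0 \
    --cpus 16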

@nextgenusfs
Owner

funannotate_train.pasa.gff3 should be the same as [species]_pasa.assemblies.fasta.transdecoder.genome.gff3, I think. Basically you want the transdecoder-filtered set that has genomic coordinates.

Huh, I'm wondering if EVM was actually passed the proper data; is that shown in the log file for EVM?
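A quick way to check might be to grep the predict log for the weights/sources that were actually handed to EVM, something like this (the log path is from memory, so adjust it if yours differs):

# look for the EVM weights recorded in the predict log
grep -i "weight" apteronotus_train/logfiles/funannotate-predict.log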

@nextgenusfs
Owner

nextgenusfs commented Mar 8, 2024

So based on the numbers of ab initio gene calls, all of the tools are predicting many more genes than 25k:

augustus --> 38426 + 13306 = 51732
genemark --> 78608
pasa -->  31714    

The PASA models are typically somewhere around half of what is normally in a genome (at least when we are talking about fungi; perhaps this is different in animals). Effectively they are the expressed genes, which I'd only expect to be at most maybe 60% of the genes in the genome, unless you have the world's best RNA-seq data from multiple cells/timepoints/etc.

@nextgenusfs
Copy link
Owner

I'm still interested to see what the funannotate2 train module will do. You could also try a second training run where you don't use the PASA gene models and instead use the default BUSCO-mediated training. It's possible that your PASA gene models are incomplete and shorter than they really are, which could then be resulting in poor training, I guess.
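That second run would presumably just be the same funannotate2 train command without -t (a sketch; the output folder name is a placeholder):

# BUSCO-mediated training: buscolite builds the training set automatically
funannotate2 train -f sortedmasked.fa -s "Genus species" -o fun2_train_busco --cpus N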
