Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Best practice to use Mikado to score assembled lncRNA #435

Open
cc-prolix opened this issue Oct 13, 2022 · 3 comments
Open

Best practice to use Mikado to score assembled lncRNA #435

cc-prolix opened this issue Oct 13, 2022 · 3 comments

Comments

@cc-prolix
Copy link

Hello,

I want to use Mikado to score a subset of StringTie models that we have classified as potential lncRNA. I want to compare my assembled transcripts of interest with a trusted reference annotation and two further trusted annotations containing known lncRNAs. My main goal is to filter out assembled transcripts that are already contained in the reference or in the additional trusted lncRNA annotations - if there is already a trusted lncRNA in a locus it should be picked over the newly assembled one.

At the moment I am just using the Mikado pipeline and no optional inputs from Portcullis, Prodigal or Blast information.
I wanted to ask if you could comment and look over my intended approach?
I have included my config file, commands and scoring file.

Thank you very much, your input would be of great help!

config_tab.txt
mikado_cmds.txt
scoring.txt

@swarbred
Copy link
Collaborator

Hi @cc-prolix

So the metric mikado calculates are geared towards selecting from coding transcripts but we have used it similar to how you propose to bring together a high quality protein annotation with a set of lncRNA models and make a selection based on some basic priorities between the alternative inputs.

The --list file you have with the source scores as given will lead to you selecting models from reference.gtf over every other model that forms part of the same loci (as defined by mikado), where you have no reference.gtf the highest scoring model will come from even_more_trusted_lncRNAs.gtf, where you have neither of these you will select from trusted_lncRNAs.gtf and only if none of these models are included in the loci will you pull in models from novel_lncRNAs.gtf.

That seems to be the behaviour you want so looks ok to me.

you may want to change the alt splice overlap parameters in the mikado config (doesn't have to be as below) as for example the scoring requirement could mean that you dont bring together models from your different lncRNA inputs
pick:
alternative_splicing:
min_cdna_overlap: 0.2
min_cds_overlap: 0.2
min_score_perc: 0.1

note there are pick options that may be useful

setting reference_update: true, means that for the scoring when define the primary transcript only models marked as reference will be considered so you would only select from reference.gtf. This should have the same affect as giving the reference.gtf models a much higher source score but there may be some subtle difference. In your case only_reference_update: false makes sense as otherwise you will exclude all models not overlapping a reference annotation.

# only_reference_update: Boolean flag. If set, Mikado will run in reference-update mode, see
# documentation. Additionally, Mikado will ignore any locus where there
# is not at least one reference transcript.
only_reference_update: false
# reference_update: Boolean flag. If set, Mikado will run in reference-update mode, see
# documentation.
reference_update: false

Also in the mikado config consider the strip_cds: false (this is the default) option, as "true" you would strip out the CDS if there was some issue with the reference ORF and this model could then be present as a ncRNA in the final output. With this set to false the entire model would be removed.

The scoring is down to you but you might want to leave in more of the CDS UTR related metrics as then you would get a more meaningfull scoring within thin the protein coding models.

I've also never ran by commenting out the entire as_requirements: , #not_fragmentary: sections that may or may not cause an error is so you can just simplify the expression to not exclude anything.

take a look in the mikado config under

pick:
alternative_splicing:

as you may want to change options there i.e. likely you want to change only_confirmed_introns: false

It should do something useful as like i said we have used mikado in a similar way.

@cc-prolix
Copy link
Author

cc-prolix commented Oct 18, 2022

Hey @swarbred

Thank you so much for the detailed Info. I have adjusted my cofig file, scoring file and pipeline according to your recommendations.

The --list file you have with the source scores as given will lead to you selecting models from reference.gtf over every other model that forms part of the same loci (as defined by mikado), where you have no reference.gtf the highest scoring model will come from even_more_trusted_lncRNAs.gtf, where you have neither of these you will select from trusted_lncRNAs.gtf and only if none of these models are included in the loci will you pull in models from novel_lncRNAs.gtf.

That is exaclty the behaviour we are looking for! I have attached the to your recommendations adapted files. If you could take a quick look it would be of great help!

Could you also please clarify some remaining points?

you may want to change the alt splice overlap parameters in the mikado config (doesn't have to be as below) as for example the scoring requirement could mean that you dont bring together models from your different lncRNA inputs
pick:
alternative_splicing:
min_cdna_overlap: 0.2
min_cds_overlap: 0.2
min_score_perc: 0.1

If I do not provide any ORF or AS information will the CDS parameter be useful?
Is there a reason for the lower alternate splicing cut-offs? My first instinct was to keep the default values for assembled lncRNAs. Could this cause transcripts of different genes to merge to one loci?

Also in the mikado config consider the strip_cds: false (this is the default) option, as "true" you would strip out the CDS if there was some issue with the reference ORF and this model could then be present as a ncRNA in the final output. With this set to false the entire model would be removed.

Do we understand this correctly: This would strip problematic CDS from the reference and replace it with a lncRNA loci? We have a very good reference and only lncRNAs in our other annotations. Ideally we wanted to keep the reference annotation intact as it is in the output. Basically, only add novel transcripts to it if they pass Mikado. In our case we would leave it as false, correct?

The scoring is down to you but you might want to leave in more of the CDS UTR related metrics as then you would get a more meaningfull scoring within thin the protein coding models.

As we mentioned above, we have a very good reference and only lncRNAs in our other annotations. Do you still recommend leaving in more of the CDS/UTR related metrics and could you explain why?

as you may want to change options there i.e. likely you want to change only_confirmed_introns: false

Thank you, we set this now to false. Are there any other parameters you would recommend we change?

I've also never ran by commenting out the entire as_requirements: , #not_fragmentary: sections that may or may not cause an error is so you can just simplify the expression to not exclude anything.

I included the other sections again. Could you please take a quick look at our scoring file? I would highly appreciate it!

Thank you so much for your input, it has been of great help!

Two final questions:

  1. Can you also set two annotations as reference in the config file?

  2. Would it be possible for you to provide the config/scoring files you used in your experiment? The application seems very similar to what we want to accomplish!

Thank you!

scoring.txt
config_tab_new.txt
mikado_cmds.txt
configuration.txt

@swarbred
Copy link
Collaborator

swarbred commented Oct 21, 2022

Hi @cc-prolix

That is exaclty the behaviour we are looking for! I have attached the to your recommendations adapted files. If you could take a quick look it would be of great help!

The only change in config_tab_new looks to be that you have set exclude_redundant = True, so the default is actually False (and the -h indicating the default is True is wrong and should be corrected).

In mikado prepare the default exclude_redundant = False would lead to only removing exact duplicates i.e. truly identical models, if exclude_redundant = True, this changes the behaviour to exclude models having identical intron chains, i.e. this will exclude more models. The option is mainly there for cases where you have a large number of transcripts that have not already been clustered i.e. huge redundancy such as when aligning long reads with no prior assembly. In those case it can significantly reduce the number of transcripts output from prepare (which might save time downstream).

With what you plan to do at the pick stage with the source scoring you are using i.e. exclude models that form part of the same loci as your reference annotation then using this exclude_redundant = True option should be fine you will just exclude some of these non reference models at the prepare stage. I only mention as the description could be clearer in the help.

If I do not provide any ORF or AS information will the CDS parameter be useful?

I assume your reference.gtf models have CDS features ? if so then the CDS metrics would be calculated and could be used to score between reference.gtf models. The alternative_splicing: related option shown above help determine what is brought back as alt splice variants. If your reference.gtf models have splice variants mikado will regard the highest scoring model at the loci as the primary model, alt splice variants are then assessed relative to this so if the scoring selects a different primary model it may change what other alt splice models would be returned. It sounds like you want to bring back ALL reference models so you would want to make these less restrictive. Note there may be some reference models that will always be excluded i.e. models that link together what mikado defines as separate loci (alt splice variants are never returned that would join (i.e. splice between) two separate subloci).

Is there a reason for the lower alternate splicing cut-offs?

See above comment but also that min_score_perc: is there to ensure alt splice variants are not returned that are below this percentage of the (highest scoring) primary model. As you are setting big source scores then having this at the default may mean that you dont bring back models as splice variants across your different inputs i.e. at a locus with a trusted_lncRNAs.gtf model having min_score_perc: 0.5 would mean you never bring back models at this loci from ovel_lncRNAs.gtf if that is what you want to happen then this is fine.

My first instinct was to keep the default values for assembled lncRNAs. Could this cause transcripts of different genes to merge to one loci?

You could bring back models from your different inputs at one loci your source scores and the min_score_perc:

parameter would affect this, you can set the source scores in a way that would mean this never happens, as you have it potentially trusted_lncRNAs.gtf and even_more_trusted_lncRNAs.gtf could meet a 0.5 min_score_perc

Do we understand this correctly: This would strip problematic CDS from the reference and replace it with a lncRNA loci? We have a very good reference and only lncRNAs in our other annotations. Ideally we wanted to keep the reference annotation intact as it is in the output. Basically, only add novel transcripts to it if they pass Mikado. In our case we would leave it as false, correct?

The option I should have described is "strip_faulty_cds" not "strip_cds" (strip_cds does as it indicates and removes all CDS features at the prepare stage, you dont want this)

strip_faulty_cds is as default false

# strip_faulty_cds: Boolean flag. If set to false, any transcript - *including reference
 # transcripts* found to have an incorrect CDS will be discarded. If set
 # to to true, these transcripts will be retained but their CDS will be
 # stripped.

as default if a model has an in frame stop codon the model is removed in prepare, setting this to true would mean the model is retained but the CDS features are lost. So if you dont want to entirely lose reference models that may have in frame stops (i.e. you might have pseudogene loci that have CDS features) the you would set strip_faulty_cds: true .

There is no option to retain CDS features that have in-frame stops so strip_faulty_cds: true would be the best way of retaining the model (though without the CDS features). Any model with no CDS features is marked as ncRNA in the mikado output (so any models where CDS has been stripped would be marked as ncRNA).

As we mentioned above, we have a very good reference and only lncRNAs in our other annotations. Do you still recommend leaving in more of the CDS/UTR related metrics and could you explain why?

it really depends, but be aware that what ever metrics you use for the scoring will determine the primary model, the choice of primary model will have some knock on effect i.e. on alt splicing and filtering of fragments etc. I would keep them in as then your selection of the primary model for the reference.gtf models will likely be more sensible. I would play around with the different choices and see what works best for you.

I included the other sections again. Could you please take a quick look at our scoring file? I would highly appreciate it!

I would just go ahead and run this and then see if it works out how you want I.e. load the inputs and output to a browser IGV etc and review a few regions, plus also check the list of input models against the output list (i.e. alias tag) and see if you lose any reference.gtf models and if you are happy to lose those models.

Thank you, we set this now to false. Are there any other parameters you would recommend we change?

run mikado configure with --full to get the full configuration file

you might want

# check_references: boolean flag. If set to true, transcripts marked as reference will # still be checked for compliance with the requirements established in # the scoring file. check_references: true

This means the requirements section are also applied to "reference" models if you dont want to apply these check to the reference models then leave as default false.

Can you also set two annotations as reference in the config file?

Yes, as you are running assigning a model as reference prevents some checks in prepare being applied (in frame stops are still checked)

Would it be possible for you to provide the config/scoring files you used in your experiment? The application seems very similar to what we want to accomplish!

I will take a look and attach if useful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants