Specifying specific fasta tag in indexing #250

Open
pinin4fjords opened this issue May 28, 2021 · 4 comments

Comments

@pinin4fjords

Hi!

Thanks for refgenie. I'm just getting my head around it and have a question.

Using tags I seem to be able to create a reference genome and attach a cDNA like so:

refgenie build <assembly>/fasta --files fasta=<genome fasta file>
refgenie build <assembly>/fasta:cdna --files fasta=<cdna fasta file>

How do I then create a salmon index on the tagged cdna? My naive approach was:

refgenie build <assembly>:cdna/salmon_index

... but that doesn't work.

Does the cDNA have to be a top-level 'genome'? That was what I originally thought, but it would be a shame: it would be nice to be able to associate multiple cDNA files (e.g. from multiple Ensembl releases, or with different biotype compositions) with the same genome build.

@stolarczyk
Contributor

stolarczyk commented Jun 1, 2021

I think you should be able to do that like this:

  1. Build a new fasta asset based on a cDNA sequence and tag it as cdna:
refgenie build mm10/fasta:cdna --files fasta=mm10_cdna.fa.gz
  2. Build the derived asset, but override the parent asset with the --assets option:
refgenie build mm10/salmon_index --assets fasta=mm10/fasta:cdna

That accomplishes the goal. However, the actual asset namespace identifiers (genomes) are derived from the FASTA file content, e.g. 0f10d83b1050c08dd53189986f60970b92a315aa7a16a6f1 instead of mm10, so the namespace identifier depends on the order in which the fasta assets are built. If you built mm10/fasta:cdna before mm10/fasta:default, the mm10 namespace identifier would reflect the contents of the cDNA sequences, not the genomic DNA. That's not ideal: we want the genome namespaces to be deterministic, and that's why we keep a separate mm10_cdna namespace rather than placing the cDNA under mm10.
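To make the order dependence concrete, here is a toy model in bash. It is only a sketch under loud assumptions: `content_digest`, `register_fasta`, and `lookup` are hypothetical helpers, the flat-file registry is invented for illustration, and `cksum` merely stands in for refgenie's real digest algorithm, which is not reproduced here.

```shell
#!/usr/bin/env bash
# Toy model only: NOT refgenie's real digest algorithm or storage layout.
set -euo pipefail

# Stand-in for a content-derived identifier: the CRC of the sequence text.
content_digest() {
  printf '%s' "$1" | cksum | cut -d ' ' -f 1
}

# Hypothetical registry: the first FASTA built under an alias fixes its digest;
# later builds under the same alias leave the identifier unchanged.
register_fasta() {
  local registry_file="$1" alias="$2" sequence="$3"
  if ! grep -q "^${alias} " "$registry_file"; then
    printf '%s %s\n' "$alias" "$(content_digest "$sequence")" >> "$registry_file"
  fi
}

# Print the digest currently bound to an alias.
lookup() {
  awk -v a="$2" '$1 == a { print $2 }' "$1"
}

genome_seq="ACGTACGTTTGA"   # made-up sequences, for illustration only
cdna_seq="ATGGCC"

genome_first="$(mktemp)"
cdna_first="$(mktemp)"

# Same two builds, opposite order.
register_fasta "$genome_first" mm10 "$genome_seq"
register_fasta "$genome_first" mm10 "$cdna_seq"
register_fasta "$cdna_first" mm10 "$cdna_seq"
register_fasta "$cdna_first" mm10 "$genome_seq"

echo "genome built first: mm10 -> $(lookup "$genome_first" mm10)"
echo "cDNA built first:   mm10 -> $(lookup "$cdna_first" mm10)"
```

The two printed identifiers differ even though the same pair of FASTA contents was registered both times, which is the non-determinism described above.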

That being said, as far as I remember, the asset recipe system revamp will introduce the concept of asset classes, which can help solve the scenario of multiple fasta assets per genome.

@nsheff
Contributor

nsheff commented Jun 1, 2021

Wouldn't it be something like --assets fasta=mm10/fasta:cdna?

@pinin4fjords
Author

Ahh, --assets is what I was missing. Thanks both.

I can make sure the genome fasta gets built first.

@pinin4fjords
Author

For anyone finding this and wondering how to attach cDNA FASTAs to a genome: my query above might mislead you. To correct the record, you should actually do something like this:

refgenie build <assembly>/fasta --files fasta=<genome fasta file>
refgenie build <assembly>/fasta_txome:cdna --files fasta=<cdna fasta file>
refgenie build <assembly>/salmon_index --assets fasta=<assembly>/fasta_txome:cdna

Note 'fasta_txome', and see the issue linked above for why.
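If it helps, the three commands can be wrapped in a small bash script. This is a minimal sketch, not an official refgenie recipe: `run` and `build_cdna_salmon_index` are hypothetical helper names, `DRY_RUN` is a made-up switch that defaults to printing the commands rather than executing them, and `mm10.fa.gz` is an assumed genome file name (only `mm10_cdna.fa.gz` appears earlier in this thread).

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print the command when DRY_RUN=1 (the default here); otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical helper. Arguments: assembly name, genome FASTA, cDNA FASTA.
build_cdna_salmon_index() {
  local assembly="$1" genome_fa="$2" cdna_fa="$3"
  # 1. Genome FASTA first, so the assembly namespace reflects the genome.
  run refgenie build "${assembly}/fasta" --files "fasta=${genome_fa}"
  # 2. cDNA as a fasta_txome asset tagged 'cdna'.
  run refgenie build "${assembly}/fasta_txome:cdna" --files "fasta=${cdna_fa}"
  # 3. salmon index built from the tagged cDNA asset via --assets.
  run refgenie build "${assembly}/salmon_index" \
    --assets "fasta=${assembly}/fasta_txome:cdna"
}

build_cdna_salmon_index mm10 mm10.fa.gz mm10_cdna.fa.gz
```

Run as is, the script only echoes the three commands so they can be checked; setting `DRY_RUN=0` in the environment would execute them, which assumes refgenie is installed and configured.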
