Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arabidopsis annotation missing transcipt_name #109

Open
vuzun opened this issue Jul 9, 2020 · 2 comments
Open

Arabidopsis annotation missing transcipt_name #109

vuzun opened this issue Jul 9, 2020 · 2 comments

Comments

@vuzun
Copy link

vuzun commented Jul 9, 2020

Hi,

I encountered an error after the ReduceGTF step in the pipeline - Invalid GTF line and missing transcript_name.

I am running it on Arabidopsis data, so I changed the rule that fetches the annotation to point to ftp://ftp.ensemblgenomes.org/pub/plants/release-47/gtf/arabidopsis_thaliana, but it looks like the Arabidopsis annotation doesn't have the transcript_name field like the human does.

Do you know if there a way to use the id instead of the name somehow? Or anything else that would work?

Thanks for the tool and all the help around here!

@seb-mueller
Copy link
Collaborator

Hi vuzu,

I came across the same issue which seems to come down to a flawed ensemble annotation and dropseq-tools requiring transcript_name to generate the refFlat. Apparently they didn't have non human and mouse samples in mind.

I cobbled together a workaround, which is I still have to test
The idea is to copy the the gene_id as transcript_name when it is absent.
I'm happy to share the full workflow later (have to run now), but find at least the convert snippet below as a start:

cat annotation.gtf |
awk '{  if (!/gene_name/)
        print $0, "gene_name " $10;
        else print $0
        }' |
awk '{  if (!/transcript_id/)
        print $0, "transcript_id " $10;
        else print $0
        }' |
awk '{  if (!/transcript_name/)
        print $0, "transcript_name " $10;
        else print $0
        }' > tmp.gtf
# this is still run by the pipeline?
sed -i 's/[^"];/_/'  tmp.gtf
cp tmp.gtf annotation.gtf

@seb-mueller
Copy link
Collaborator

seb-mueller commented Aug 10, 2020

As an update to the above. I've found a lot of genes being excluded because they have the same name. Drop-Seq-tools excludes them by default. It turns out, transcript_name needs to be as unique as possible. In the snippet above, field $10 was used which corresponds to gene_id, however this removes all the isoforms since they share the same gene_id. Using transcript_id in field $12 turns out to prevent that from happening. So ultimately, the following should be used instead:

cat annotation.gtf |
awk '{  if (!/gene_name/)
        print $0, "gene_name " $10;
        else print $0
        }' |
awk '{  if (!/transcript_id/)
        print $0, "transcript_id " $10;
        else print $0
        }' |
awk '{  if (!/transcript_name/)
        print $0, "transcript_name " $12;
        else print $0
        }' > tmp.gtf
# replace names containing semicolons with underscores since they dropseqtools doesn't handle them well:
sed -i 's/[^"];/_/'  tmp.gtf
cp tmp.gtf annotation.gtf

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants