Arabidopsis annotation missing transcipt_name #109

vuzun · 2020-07-09T12:49:38Z

Hi,

I encountered an error after the ReduceGTF step in the pipeline - Invalid GTF line and missing transcript_name.

I am running it on Arabidopsis data, so I changed the rule that fetches the annotation to point to ftp://ftp.ensemblgenomes.org/pub/plants/release-47/gtf/arabidopsis_thaliana, but it looks like the Arabidopsis annotation doesn't have the transcript_name field like the human does.

Do you know if there a way to use the id instead of the name somehow? Or anything else that would work?

Thanks for the tool and all the help around here!

seb-mueller · 2020-07-09T13:21:35Z

Hi vuzu,

I came across the same issue which seems to come down to a flawed ensemble annotation and dropseq-tools requiring transcript_name to generate the refFlat. Apparently they didn't have non human and mouse samples in mind.

I cobbled together a workaround, which is I still have to test
The idea is to copy the the gene_id as transcript_name when it is absent.
I'm happy to share the full workflow later (have to run now), but find at least the convert snippet below as a start:

cat annotation.gtf |
awk '{  if (!/gene_name/)
        print $0, "gene_name " $10;
        else print $0
        }' |
awk '{  if (!/transcript_id/)
        print $0, "transcript_id " $10;
        else print $0
        }' |
awk '{  if (!/transcript_name/)
        print $0, "transcript_name " $10;
        else print $0
        }' > tmp.gtf
# this is still run by the pipeline?
sed -i 's/[^"];/_/'  tmp.gtf
cp tmp.gtf annotation.gtf

seb-mueller · 2020-08-10T13:03:51Z

As an update to the above. I've found a lot of genes being excluded because they have the same name. Drop-Seq-tools excludes them by default. It turns out, transcript_name needs to be as unique as possible. In the snippet above, field $10 was used which corresponds to gene_id, however this removes all the isoforms since they share the same gene_id. Using transcript_id in field $12 turns out to prevent that from happening. So ultimately, the following should be used instead:

cat annotation.gtf |
awk '{  if (!/gene_name/)
        print $0, "gene_name " $10;
        else print $0
        }' |
awk '{  if (!/transcript_id/)
        print $0, "transcript_id " $10;
        else print $0
        }' |
awk '{  if (!/transcript_name/)
        print $0, "transcript_name " $12;
        else print $0
        }' > tmp.gtf
# replace names containing semicolons with underscores since they dropseqtools doesn't handle them well:
sed -i 's/[^"];/_/'  tmp.gtf
cp tmp.gtf annotation.gtf

seb-mueller mentioned this issue Oct 13, 2020

ConvertToRefFlat fails while running Drop-seq_tools-2.4.0 #113

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Arabidopsis annotation missing transcipt_name #109

Arabidopsis annotation missing transcipt_name #109

vuzun commented Jul 9, 2020

seb-mueller commented Jul 9, 2020

seb-mueller commented Aug 10, 2020 •

edited

Arabidopsis annotation missing transcipt_name #109

Arabidopsis annotation missing transcipt_name #109

Comments

vuzun commented Jul 9, 2020

seb-mueller commented Jul 9, 2020

seb-mueller commented Aug 10, 2020 • edited

seb-mueller commented Aug 10, 2020 •

edited