"Duplicate" kgID when doing annotation. #115

Open
sxwcasd opened this issue Jun 23, 2022 · 2 comments

sxwcasd commented Jun 23, 2022

I am trying to run the annotation script (svaba-annotate.R) against the GENCODE db.
However, the UCSC db records pulled by the following two lines contain duplicates.

svaba/R/svaba-annotate.R

Lines 59 to 60 in 0f60e36

genes <- suppressWarnings(data.table::as.data.table(query(paste0("SELECT name, chrom, txStart, txEnd, strand, exonStarts, exonEnds, exonCount FROM ", assembly, ".knownGene"))))
codes <- suppressWarnings(data.table::as.data.table(query(paste0("SELECT kgID, mRNA, geneSymbol, spID, refSeq FROM ", assembly, ".kgXref"))))

For example, after joining the two tables, the same kgID appears twice:

                   kgID         mRNA      geneSymbol   spID       refSeq chrom   txStart     txEnd strand
  1: ENST00000244174.11    NM_002186            IL9R Q01113    NM_002186  chrX 155997695 156010817      +
  2: ENST00000244174.11    NM_002186            IL9R Q01113    NM_002186  chrY  57184215  57197337      +

This makes sense to me: the sex chromosomes have different positions but share the same mRNA. However, it triggers an error at the following check:
Error: !any(duplicated(genes$kgID)) is not TRUE

Maybe we could have a better validation check here?
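
One possible shape for such a check, as a minimal sketch and not the maintainers' fix: treat the pair (kgID, chrom) as the key instead of kgID alone, so that a transcript present on both chrX and chrY passes while a true duplicate still fails. The toy `genes` data frame below is built by hand from the two example rows above; that the real joined table carries exactly these column names is an assumption.

```r
# Toy stand-in for the joined knownGene/kgXref table
# (values copied from the example rows above; not pulled from UCSC).
genes <- data.frame(
  kgID    = c("ENST00000244174.11", "ENST00000244174.11"),
  chrom   = c("chrX", "chrY"),
  txStart = c(155997695, 57184215),
  txEnd   = c(156010817, 57197337),
  stringsAsFactors = FALSE
)

# Original check: fails because the same kgID appears on chrX and chrY.
kgid_only_ok <- !any(duplicated(genes$kgID))

# Relaxed check (assumption): a kgID only has to be unique per chromosome.
kgid_chrom_ok <- !any(duplicated(genes[, c("kgID", "chrom")]))

print(kgid_only_ok)
print(kgid_chrom_ok)
stopifnot(kgid_chrom_ok)
```

Any downstream step that indexes by kgID alone would also need to use the same composite key.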


walaj commented Oct 11, 2022 via email

sxwcasd commented Oct 14, 2022

Because we need both GENCODE and exon-frame information, we loaded the wgEncodeGencodeCompV36 table instead of knownGene or kgXref, and validated by start and end sites. I'm not sure whether that makes sense for your original design (because we don't need to consider the RefSeq ID mapping), but it seems to accomplish our goals and gets around this issue.
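
A rough sketch of what validating by start and end sites could look like, written against a hand-made table, not a real UCSC query. The column names (`name`, `chrom`, `txStart`, `txEnd`) are assumed to follow the genePred-style layout, and `find_duplicate_transcripts` is a hypothetical helper, not a function in svaba.

```r
# Hypothetical rows shaped like a genePred-style transcript table.
# The column names are assumptions; the values reuse the example above.
tx <- data.frame(
  name    = c("ENST00000244174.11", "ENST00000244174.11"),
  chrom   = c("chrX", "chrY"),
  txStart = c(155997695, 57184215),
  txEnd   = c(156010817, 57197337),
  stringsAsFactors = FALSE
)

# Return rows that repeat an earlier (name, chrom, txStart, txEnd) combination.
find_duplicate_transcripts <- function(tx) {
  key <- paste(tx$name, tx$chrom, tx$txStart, tx$txEnd, sep = "|")
  tx[duplicated(key), , drop = FALSE]
}

# The chrX/chrY pair differs in position, so nothing is flagged.
dups <- find_duplicate_transcripts(tx)
stopifnot(nrow(dups) == 0)

# An exact repeat of the first row is still caught.
dups_with_repeat <- find_duplicate_transcripts(rbind(tx, tx[1, ]))
print(nrow(dups_with_repeat))
```

Keying on the coordinates as well as the name keeps the check strict for genuinely repeated records while letting the same transcript ID exist at two loci.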
