UCSC databases (TxDB) #11

Open
ivirshup opened this issue Apr 26, 2023 · 2 comments
Labels
enhancement New feature or request P2🏝 Low priority

Comments

@ivirshup
Member

Description of feature

Support getting UCSC annotation data from the Bioconductor TxDB sources.

ivirshup added the enhancement and P2🏝 Low priority labels on Apr 26, 2023
@ivirshup
Member Author

ivirshup commented Apr 2, 2024

It would be useful to write down what exactly the differences between EnsDB and TxDB are.

To play around with this:

access via ibis

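import genomic_features as gf
import ibis

# Notebook shell escape: download the UCSC hg38 knownGene TxDB SQLite file
!wget https://bioconductorhubs.blob.core.windows.net/annotationhub/ucsc/standard/3.15/TxDb.Hsapiens.UCSC.hg38.knownGene.sqlite

# EnsDB for Ensembl release 108, exposed as an ibis connection
ensdb = gf.ensembl.annotation(species="Hsapiens", version="108").db
# Open the downloaded TxDB SQLite file with ibis
ucscdb = ibis.connect("TxDb.Hsapiens.UCSC.hg38.knownGene.sqlite")

# Print the schema of every table in the EnsDB
for tbl_name in ensdb.list_tables():
    print(tbl_name, ensdb.table(tbl_name).schema())

EnsDB schema: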
chromosome ibis.Schema {
  seq_name     string
  seq_length   int32
  is_circular  int32
}
entrezgene ibis.Schema {
  gene_id   string
  entrezid  int32
}
exon ibis.Schema {
  exon_id         string
  exon_seq_start  int32
  exon_seq_end    int32
}
gene ibis.Schema {
  gene_id               string
  gene_name             string
  gene_biotype          string
  gene_seq_start        int32
  gene_seq_end          int32
  seq_name              string
  seq_strand            int32
  seq_coord_system      string
  description           string
  gene_id_version       string
  canonical_transcript  string
}
metadata ibis.Schema {
  name   string
  value  string
}
protein ibis.Schema {
  tx_id             string
  protein_id        string
  protein_sequence  string
}
protein_domain ibis.Schema {
  protein_id             string
  protein_domain_id      string
  protein_domain_source  string
  interpro_accession     string
  prot_dom_start         int32
  prot_dom_end           int32
}
tx ibis.Schema {
  tx_id             string
  tx_biotype        string
  tx_seq_start      int32
  tx_seq_end        int32
  tx_cds_seq_start  int32
  tx_cds_seq_end    int32
  gene_id           string
  tx_support_level  int32
  tx_id_version     string
  gc_content        float64
  tx_external_name  string
  tx_is_canonical   int32
}
tx2exon ibis.Schema {
  tx_id     string
  exon_id   string
  exon_idx  int32
}
uniprot ibis.Schema {
  protein_id            string
  uniprot_id            string
  uniprot_db            string
  uniprot_mapping_type  string
}

for tbl_name in ucscdb.list_tables():
    print(tbl_name, ucscdb.table(tbl_name).schema())

TxDB schema:
cds ibis.Schema {
  _cds_id     int32
  cds_name    string
  cds_chrom   !string
  cds_strand  !string
  cds_start   !int32
  cds_end     !int32
}
chrominfo ibis.Schema {
  _chrom_id    int32
  chrom        !string
  length       int32
  is_circular  int32
}
exon ibis.Schema {
  _exon_id     int32
  exon_name    string
  exon_chrom   !string
  exon_strand  !string
  exon_start   !int32
  exon_end     !int32
}
gene ibis.Schema {
  gene_id  !string
  _tx_id   !int32
}
metadata ibis.Schema {
  name   string
  value  string
}
splicing ibis.Schema {
  _tx_id     !int32
  exon_rank  !int32
  _exon_id   !int32
  _cds_id    int32
  cds_phase  int32
}
transcript ibis.Schema {
  _tx_id     int32
  tx_name    string
  tx_type    string
  tx_chrom   !string
  tx_strand  !string
  tx_start   !int32
  tx_end     !int32
}

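It does look like the UCSC SQLite files carry less information. Going by the schemas above, the TxDB file has genomic coordinates for transcripts, exons, and CDS plus a gene-to-transcript mapping, but no gene names, biotypes, or descriptions, and no protein, protein domain, UniProt, or Entrez tables.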
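To write the differences down systematically, something like the following could work. This is only a sketch: it uses the standard-library `sqlite3` module rather than ibis so that it has no dependencies, the helper names `table_columns` and `compare_schemas` are made up for illustration, and the in-memory tables are toy stand-ins that copy a few column names from the schemas above.

```python
import sqlite3


def table_columns(conn: sqlite3.Connection) -> dict[str, set[str]]:
    """Map each table name in a SQLite database to its set of column names."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    return {
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        name: {col[1] for col in conn.execute(f'PRAGMA table_info("{name}")')}
        for name in tables
    }


def compare_schemas(a: dict[str, set[str]], b: dict[str, set[str]]) -> dict:
    """Report tables unique to each side and column differences in shared tables."""
    return {
        "only_in_a": sorted(a.keys() - b.keys()),
        "only_in_b": sorted(b.keys() - a.keys()),
        "shared": {
            t: {"only_in_a": sorted(a[t] - b[t]), "only_in_b": sorted(b[t] - a[t])}
            for t in sorted(a.keys() & b.keys())
        },
    }


# Toy stand-ins for the two databases (a subset of the columns listed above).
ens = sqlite3.connect(":memory:")
ens.execute("CREATE TABLE gene (gene_id TEXT, gene_name TEXT, gene_biotype TEXT)")
ens.execute("CREATE TABLE tx (tx_id TEXT, gene_id TEXT)")

ucsc = sqlite3.connect(":memory:")
ucsc.execute("CREATE TABLE gene (gene_id TEXT, _tx_id INTEGER)")
ucsc.execute("CREATE TABLE transcript (_tx_id INTEGER, tx_name TEXT)")

diff = compare_schemas(table_columns(ens), table_columns(ucsc))
print(diff)
```

For the real files, the same two functions should work on `sqlite3.connect(...)` connections to the downloaded TxDB file and to wherever the EnsDB file is cached locally.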

Docs/ links

It's probably worth looking into how the Bioconductor packages handle the two different schemas. For example, do they subclass a common base, and are the annotation filters aware of which schema they are querying?
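One option on our side, independent of what Bioconductor does, would be a thin adapter that presents TxDB tables under EnsDB-style column names so that the same filters can run against either. The sketch below is an assumption-heavy illustration, not a proposal for the actual API: the view name `ensdb_like_tx` is hypothetical, the column correspondence (`tx_name` to `tx_id`, `tx_start` to `tx_seq_start`, and so on) is my guess from the schemas above, and it assumes TxDB stores strand as `'+'`/`'-'` while EnsDB stores it as `1`/`-1`. It uses `sqlite3` with toy rows so it runs without the real file.

```python
import sqlite3

# Toy TxDB-shaped tables; column names follow the TxDB schema listed above.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE transcript (
        _tx_id INTEGER, tx_name TEXT, tx_type TEXT,
        tx_chrom TEXT, tx_strand TEXT, tx_start INTEGER, tx_end INTEGER
    );
    CREATE TABLE gene (gene_id TEXT, _tx_id INTEGER);
    INSERT INTO transcript VALUES (1, 'tx_A', NULL, 'chr1', '+', 100, 200);
    INSERT INTO transcript VALUES (2, 'tx_B', NULL, 'chr1', '-', 300, 400);
    INSERT INTO gene VALUES ('gene_1', 1);
    """
)

# Hypothetical adapter: expose TxDB transcripts under EnsDB-style column names.
# The LEFT JOIN keeps transcripts that have no entry in the gene table.
conn.execute(
    """
    CREATE TEMP VIEW ensdb_like_tx AS
    SELECT
        t.tx_name  AS tx_id,
        t.tx_start AS tx_seq_start,
        t.tx_end   AS tx_seq_end,
        g.gene_id  AS gene_id,
        t.tx_chrom AS seq_name,
        -- Assumed strand encodings: '+'/'-' in TxDB, 1/-1 in EnsDB.
        CASE t.tx_strand WHEN '+' THEN 1 WHEN '-' THEN -1 END AS seq_strand
    FROM transcript AS t
    LEFT JOIN gene AS g ON g._tx_id = t._tx_id
    """
)

rows = conn.execute("SELECT * FROM ensdb_like_tx ORDER BY tx_id").fetchall()
for row in rows:
    print(row)
```

A view like this only covers the columns both schemas can supply. EnsDB-only fields such as `tx_biotype` or `tx_support_level` would still be missing, so filters on those would need to fail clearly for TxDB sources.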

cc: @nvictus

@ivirshup
Member Author

ivirshup commented Apr 3, 2024

Re: the discussion about nonstandard chromosome names, @nvictus: see jorainer/ensembldb#88
