Provide mapping of non standard chromosome names from ensembl to UCSC #88

Open
ivanek opened this issue Dec 6, 2018 · 6 comments

@ivanek

ivanek commented Dec 6, 2018

Is there a chance to implement conversion of non-standard chromosome
names from Ensembl format to UCSC (NCBI)?

The
[GenomeInfoDb](http://bioconductor.org/packages/release/bioc/html/GenomeInfoDb.html)
package provides the function fetchExtendedChromInfoFromUCSC to fetch
additional chromosome info, but the Ensembl names are not part of
it. Ideally, this function would also
consider the additional tables present in the goldenPath database directory.

For human (hg38):

url <- "http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/chromAlias.txt.gz"
# gzcon() expects a binary-mode connection, so open the URL with "rb"
con <- gzcon(url(url, open="rb"))
chromAlias <- read.table(textConnection(readLines(con)))
close(con)
head(chromAlias)
|V1           |V2    |V3      |
|:------------|:-----|:-------|
|NC_000001.11 |chr1  |refseq  |
|CM000663.2   |chr1  |genbank |
|1            |chr1  |ensembl |
|NC_000010.11 |chr10 |refseq  |
|CM000672.2   |chr10 |genbank |
|10           |chr10 |ensembl |

For mouse (mm10):

url <- "http://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/ucscToEnsembl.txt.gz"
# gzcon() expects a binary-mode connection, so open the URL with "rb"
con <- gzcon(url(url, open="rb"))
ucscToEnsembl <- read.table(textConnection(readLines(con)))
close(con)
head(ucscToEnsembl)
|V1             |V2         |
|:--------------|:----------|
|chr15          |15         |
|chrUn_JH584304 |JH584304.1 |
|chr12          |12         |
|chr13          |13         |
|chr11          |11         |
|chr9           |9          |

Unfortunately, these table names and formats are not identical across genome versions, but
the fetchExtendedChromInfoFromUCSC function seems to handle this
inconsistency anyway for the information it already provides.
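
The two layouts shown above could be brought into one shape before use. Below is a minimal sketch in base R, assuming the column order seen in the two excerpts (alias, UCSC name, alias source for chromAlias; UCSC name, Ensembl name for ucscToEnsembl) and assuming the source column contains exactly the string "ensembl". `normalize_ucsc_ensembl()` is a hypothetical helper, not part of GenomeInfoDb, and the small data frames stand in for the downloaded tables.

```r
# Hypothetical helper: return a data.frame with columns `ucsc` and `ensembl`
# from either of the two table layouts shown above.
normalize_ucsc_ensembl <- function(tab) {
  if (ncol(tab) == 3L) {
    # chromAlias layout: V1 = alias, V2 = UCSC name, V3 = alias source
    # (assumption: the source is exactly "ensembl")
    keep <- as.character(tab[[3]]) == "ensembl"
    out <- data.frame(ucsc = as.character(tab[[2]])[keep],
                      ensembl = as.character(tab[[1]])[keep],
                      stringsAsFactors = FALSE)
  } else if (ncol(tab) == 2L) {
    # ucscToEnsembl layout: V1 = UCSC name, V2 = Ensembl name
    out <- data.frame(ucsc = as.character(tab[[1]]),
                      ensembl = as.character(tab[[2]]),
                      stringsAsFactors = FALSE)
  } else {
    stop("unexpected number of columns: ", ncol(tab))
  }
  out
}

# Stand-ins for the downloaded tables, using rows from the excerpts above
chromAlias <- data.frame(
  V1 = c("NC_000001.11", "CM000663.2", "1"),
  V2 = c("chr1", "chr1", "chr1"),
  V3 = c("refseq", "genbank", "ensembl"),
  stringsAsFactors = FALSE
)
ucscToEnsembl <- data.frame(
  V1 = c("chr15", "chrUn_JH584304"),
  V2 = c("15", "JH584304.1"),
  stringsAsFactors = FALSE
)

hg38_map <- normalize_ucsc_ensembl(chromAlias)
mm10_map <- normalize_ucsc_ensembl(ucscToEnsembl)
```

Dispatching on the number of columns is only a heuristic; genomes that use yet another table layout would need their own branch.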

@jorainer
Owner

jorainer commented Dec 6, 2018

Thanks @ivanek for the input! As you point out, we have to include the GenomeInfoDb developers here too. ensembldb uses genomeStyles from GenomeInfoDb to do the mapping. It would be nice if this function also returned mappings for patches etc.

@jorainer
Owner

Commit 5d9ae11 adds the possibility to provide custom seqlevels mapping using a data.frame. A description can be found here - does that fix your issue @ivanek ?

@hpages
Contributor

hpages commented Aug 11, 2022

Hello ensembldb people,

FWIW a while ago I replaced GenomeInfoDb::fetchExtendedChromInfoFromUCSC() with GenomeInfoDb::getChromInfoFromUCSC(). The latter has the capability to map UCSC chromosome names to Ensembl names as long as UCSC provides this mapping:

library(GenomeInfoDb)
hg38_chrominfo <- getChromInfoFromUCSC("hg38", add.ensembl.col=TRUE)

head(hg38_chrominfo)
#   chrom      size assembled circular ensembl
# 1  chr1 248956422      TRUE    FALSE       1
# 2  chr2 242193529      TRUE    FALSE       2
# 3  chr3 198295559      TRUE    FALSE       3
# 4  chr4 190214555      TRUE    FALSE       4
# 5  chr5 181538259      TRUE    FALSE       5
# 6  chr6 170805979      TRUE    FALSE       6

tail(hg38_chrominfo)
#                   chrom   size assembled circular            ensembl
# 635 chr22_KN196485v1_alt 156562     FALSE    FALSE CHR_HSCHR22_4_CTG1
# 636 chr22_KN196486v1_alt 153027     FALSE    FALSE CHR_HSCHR22_5_CTG1
# 637 chr22_KQ458387v1_alt 155930     FALSE    FALSE CHR_HSCHR22_6_CTG1
# 638 chr22_KQ458388v1_alt 174749     FALSE    FALSE CHR_HSCHR22_7_CTG1
# 639 chr22_KQ759761v1_alt 145162     FALSE    FALSE CHR_HSCHR22_8_CTG1
# 640  chrX_KV766199v1_alt 188004     FALSE    FALSE  CHR_HSCHRX_3_CTG7

Note that this only works for "registered genomes" (unfortunately genome registration in GenomeInfoDb is a manual process 😞). registered_UCSC_genomes() lists the UCSC genomes that are currently registered:

genomes <- registered_UCSC_genomes()
dim(genomes)
# [1] 83  6

head(genomes)[ , c(1:3, 5)]
#          organism  genome NCBI_assembly with_Ensembl
# 1  Apis mellifera apiMel1          <NA>        FALSE
# 2  Apis mellifera apiMel2      Amel_2.0        FALSE
# 3 Betacoronavirus wuhCor1   ASM985889v3        FALSE
# 4      Bos taurus bosTau1          <NA>        FALSE
# 5      Bos taurus bosTau2          <NA>        FALSE
# 6      Bos taurus bosTau3          <NA>        FALSE

tail(genomes)[ , c(1:3, 5)]
#               organism   genome             NCBI_assembly with_Ensembl
# 78          Sus scrofa  susScr2                Sscrofa9.2        FALSE
# 79          Sus scrofa  susScr3               Sscrofa10.2         TRUE
# 80          Sus scrofa susScr11               Sscrofa11.1         TRUE
# 81 Taeniopygia guttata  taeGut1                      <NA>         TRUE
# 82 Taeniopygia guttata  taeGut2 Taeniopygia_guttata-3.2.4        FALSE
# 83    Zaire ebolavirus  eboVir3                      <NA>        FALSE

The with_Ensembl column indicates whether UCSC provides a mapping to Ensembl chromosome names or not.

See ?getChromInfoFromUCSC for the details.

H.
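
Once such a table is available, translating Ensembl seqnames to UCSC names comes down to a named-vector lookup. Below is a minimal sketch in base R; `ensembl_to_ucsc()` is a hypothetical helper (not part of GenomeInfoDb or ensembldb), and the small data frame stands in for a few rows of the `getChromInfoFromUCSC("hg38", add.ensembl.col=TRUE)` output shown above, assuming the `chrom` and `ensembl` columns hold character values.

```r
# Stand-in for a few rows of the chrominfo table shown above
chrominfo <- data.frame(
  chrom   = c("chr1", "chr2", "chrX_KV766199v1_alt"),
  ensembl = c("1", "2", "CHR_HSCHRX_3_CTG7"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: translate Ensembl names to UCSC names. Names without
# a mapping come back as NA so that the caller can decide how to handle them.
ensembl_to_ucsc <- function(seqnames, chrominfo) {
  has_map <- !is.na(chrominfo$ensembl)
  lookup <- setNames(as.character(chrominfo$chrom[has_map]),
                     as.character(chrominfo$ensembl[has_map]))
  unname(lookup[as.character(seqnames)])
}

ucsc_names <- ensembl_to_ucsc(c("1", "CHR_HSCHRX_3_CTG7", "not_a_contig"),
                              chrominfo)
```

Returning NA for unmapped names, rather than dropping or guessing them, keeps the result aligned with the input vector.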

@jorainer
Owner

Thanks Herve @hpages, I'll look into how I can integrate that into ensembldb.

@jorainer
Owner

jorainer commented Sep 5, 2022

Thanks for the info @hpages - is there a way to also translate the genome version from Ensembl to UCSC? Your getChromInfoFromUCSC function would be really helpful, but on my side I have e.g. the (Ensembl) genome version name GRCh38, which UCSC does not understand. I would somehow need to automatically map the genome version names first.

@hpages
Contributor

hpages commented Sep 6, 2022

I'm not aware of an easy/reliable way to translate an Ensembl genome version to a UCSC genome. Could the data.frame returned by registered_UCSC_genomes() be used for that? E.g.:

library(GenomeInfoDb)
UCSC_genomes <- registered_UCSC_genomes()
subset(UCSC_genomes, grepl("GRCh38", UCSC_genomes$NCBI_assembly))
#        organism genome NCBI_assembly assembly_accession with_Ensembl circ_seqs
# 47 Homo sapiens   hg38    GRCh38.p13   GCF_000001405.39         TRUE      chrM

This is assuming that:

  1. Ensembl uses NCBI assembly names.
  2. There's actually a UCSC genome based on that NCBI assembly.
  3. This UCSC genome is registered in GenomeInfoDb.

That's a lot of assumptions, but maybe they are satisfied for the small set of organisms you want to support. Also, for 3. we can always register new UCSC genomes in GenomeInfoDb.

Note, however, that using a loose assembly name like GRCh38 sounds risky. In the latest Ensembl release (107), they use GRCh38.p13 for Homo sapiens, but this could change at any time, e.g. they could switch to GRCh38.p14 in the next release, in which case some Ensembl chromosome names will no longer be mapped to a UCSC name. FWIW GenomeInfoDb:::fetch_species_index_from_Ensembl_FTP() can be used to find the exact assembly name used by Ensembl for a given organism:

species_index <- GenomeInfoDb:::fetch_species_index_from_Ensembl_FTP()
subset(species_index[1:5], grepl("homo_sapiens", species_index$species))
#      name      species           division taxonomy_id   assembly
# 125 Human homo_sapiens EnsemblVertebrates        9606 GRCh38.p13

We could export and document it if that would help.

H.
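
The lookup discussed above, with the three assumptions checked explicitly, might be wrapped as follows. This is a minimal sketch in base R: `ucsc_genome_for_assembly()` is a hypothetical helper (not part of GenomeInfoDb), and the data frame stands in for a few columns and rows of `registered_UCSC_genomes()` taken from the outputs shown in this thread. It matches the full assembly name (e.g. GRCh38.p13) rather than the loose GRCh38, and returns NA unless exactly one registered genome with an Ensembl mapping matches.

```r
# Stand-in for a few columns and rows of registered_UCSC_genomes()
genomes <- data.frame(
  organism      = c("Homo sapiens", "Sus scrofa", "Sus scrofa", "Sus scrofa"),
  genome        = c("hg38", "susScr2", "susScr3", "susScr11"),
  NCBI_assembly = c("GRCh38.p13", "Sscrofa9.2", "Sscrofa10.2", "Sscrofa11.1"),
  with_Ensembl  = c(TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Hypothetical helper: exact match on the NCBI assembly name, restricted to
# genomes for which UCSC provides a mapping to Ensembl names.
ucsc_genome_for_assembly <- function(assembly, genomes) {
  hit <- !is.na(genomes$NCBI_assembly) &
    as.character(genomes$NCBI_assembly) == assembly &
    genomes$with_Ensembl
  if (sum(hit) == 1L) as.character(genomes$genome[hit]) else NA_character_
}

human   <- ucsc_genome_for_assembly("GRCh38.p13", genomes)  # registered
patched <- ucsc_genome_for_assembly("GRCh38.p14", genomes)  # not in the table
pig_old <- ucsc_genome_for_assembly("Sscrofa9.2", genomes)  # no Ensembl mapping
```

Failing with NA on an unknown patch level makes a change of assembly on the Ensembl side visible instead of silently mapping to a slightly different genome.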
