Unable to retrieve sequence lengths from Ensembl #134

Open
tomthun opened this issue May 3, 2022 · 1 comment

tomthun commented May 3, 2022

Hello, I have a problem creating an annotation file for Seurat with GetGRangesFromEnsDb().
While curating the single-cell experiments, a specific GTF was used for genomic annotations.
When I try to load this file with ensDbFromGtf(), the step Processing chromosomes ... Fetch seqlengths from ensembl ... FAILs.
Because of this, GetGRangesFromEnsDb() throws an error, as it cannot create a GRanges object from a Seqinfo object with NA seqlengths. I also tried the makeTxDbFromGFF() method from the GenomicFeatures library and ran GetGRangesFromEnsDb() with the newly created DB, but hit the same issue.

The following comprises all of the above and shows the exact phrasing of the error:

DB <- ensDbFromGtf(gtf= 'C:/Users/heinz/Downloads/Mus_musculus.GRCm38.97.gtf.gz')

Importing GTF file ... OK
Processing genes ... 
 Attribute availability:
  o gene_id ... OK
  o gene_name ... OK
  o entrezid ... Nope
  o gene_biotype ... OK
OK
Processing transcripts ... 
 Attribute availability:
  o transcript_id ... OK
  o gene_id ... OK
  o transcript_biotype ... OK
  o transcript_name ... OK
OK
Processing exons ... OK
Processing chromosomes ... Fetch seqlengths from ensembl ... FAIL
OK
Processing metadata ... OK
Generating index ... OK
  -------------
Verifying validity of the information in the database:
Checking transcripts ... OK
Checking exons ... OK
Warning messages:
1: In ensDbFromGRanges(GTF, outfile = outfile, path = path, organism = organism,  :
   I'm missing column(s): 'entrezid'. The corresponding database column(s) will be empty!
2: In tryGetSeqinfoFromEnsembl(organism, version, seqnames = chroms$seq_name) :
  Unable to retrieve sequence lengths from Ensembl.
3: In if (!missing(from) && !is.finite(if (is.character(from)) from <- as.numeric(from) else from)) stop("'from' must be a finite number") :
  closing unused connection 3 (ftp://ftp.ensembl.org/pub/release-97/mysql/)

EDB <- EnsDb(DB)
annotations <- GetGRangesFromEnsDb(ensdb = EDB)

Error in asMethod(object) : 
  cannot create a GRanges object from a Seqinfo object with NA seqlengths

Weirdly, the code snippet worked previously. However, I had to update my R version and libraries and now it doesn't.

EDIT: here are my attached libraries from sessionInfo():

attached base packages:
[1] grid      stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] AnnotationHub_3.4.0                BiocFileCache_2.4.0                dbplyr_2.1.1                      
 [4] EnsDb.Mmusculus.v79_2.99.0         BiocManager_1.30.17                cicero_1.3.6                      
 [7] Gviz_1.40.0                        BSgenome.Mmusculus.UCSC.mm10_1.4.3 BSgenome_1.64.0                   
[10] rtracklayer_1.56.0                 Biostrings_2.64.0                  XVector_0.36.0                    
[13] JASPAR2020_0.99.10                 TFBSTools_1.34.0                   patchwork_1.1.0.9000              
[16] cowplot_1.1.1                      openxlsx_4.2.5                     gprofiler2_0.2.1                  
[19] data.table_1.14.2                  ensembldb_2.20.1                   AnnotationFilter_1.20.0           
[22] GenomicFeatures_1.48.0             AnnotationDbi_1.58.0               RSQLite_2.2.13                    
[25] ggplot2_3.3.6                      future_1.25.0                      monocle3_1.0.0                    
[28] SingleCellExperiment_1.18.0        SummarizedExperiment_1.26.1        GenomicRanges_1.48.0              
[31] GenomeInfoDb_1.32.1                IRanges_2.30.0                     S4Vectors_0.34.0                  
[34] MatrixGenerics_1.8.0               matrixStats_0.62.0                 Biobase_2.56.0                    
[37] BiocGenerics_0.42.0                SeuratWrappers_0.3.0               zellkonverter_1.7.0               
[40] SeuratObject_4.0.4                 Seurat_4.1.0                       Signac_1.6.0    

jorainer (Owner) commented May 4, 2022

From the code snippet above it seems there is a problem retrieving the sequence (chromosome) lengths from Ensembl (Processing chromosomes ... Fetch seqlengths from ensembl ... FAIL). This could be a temporary problem or some re-organization of files on the Ensembl ftp servers.
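
The error reported above can be reproduced in isolation, which may help confirm that the missing chromosome lengths are the cause rather than anything else in the GTF. A minimal sketch, assuming GenomeInfoDb and GenomicRanges are installed; the sequence names and lengths below are made-up toy values, not real GRCm38 lengths:

```r
library(GenomeInfoDb)
library(GenomicRanges)

# A Seqinfo built without lengths has NA seqlengths, which is the state
# an EnsDb ends up in when the length lookup from Ensembl fails.
si <- Seqinfo(seqnames = c("1", "2"))
has_na <- anyNA(seqlengths(si))

# Coercing such a Seqinfo to GRanges is expected to fail; capture the message.
res <- tryCatch(as(si, "GRanges"), error = function(e) conditionMessage(e))

# Once lengths are supplied (toy values here), the coercion goes through.
seqlengths(si) <- c("1" = 1000L, "2" = 2000L)
gr <- as(si, "GRanges")
```

If `res` holds an error message of the same kind as the one in the report, the NA seqlengths alone are enough to trigger the failure.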

Before digging into this, would it be possible to use the official EnsDb instead of creating one from a GTF? Creating these databases from GFF and GTF files is always problematic, because the format can change (and has done so several times) and the information provided in these files is also somewhat limited.

If you need Mus musculus annotations for Ensembl 97:

> library(AnnotationHub)
> ah <- AnnotationHub()
snapshotDate(): 2022-04-21
> query(ah, "EnsDb.Mmusculus.v97")
AnnotationHub with 1 record
# snapshotDate(): 2022-04-21
# names(): AH73905
# $dataprovider: Ensembl
# $species: Mus musculus
# $rdataclass: EnsDb
# $rdatadateadded: 2019-05-02
# $title: Ensembl 97 EnsDb for Mus musculus
# $description: Gene and protein annotations for Mus musculus based on Ensem...
# $taxonomyid: 10090
# $genome: GRCm38
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcesize: NA
# $tags: c("97", "AHEnsDbs", "Annotation", "EnsDb", "Ensembl", "Gene",
#   "Protein", "Transcript") 
# retrieve record with 'object[["AH73905"]]' 
> edb <- ah[["AH73905"]]
loading from cache
> edb
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.3.4
|Creation time: Sun Jul  7 08:07:59 2019
|ensembl_version: 97
|ensembl_host: localhost
|Organism: Mus musculus
|taxonomy_id: 10090
|genome_build: GRCm38
|DBSCHEMAVERSION: 2.1
| No. of genes: 56393.
| No. of transcripts: 144404.
|Protein data available.

This database will be cached locally, which also ensures reproducibility, because this database will never change and will always be available in AnnotationHub. You'll also have full annotations available, including protein annotations.
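
For the original use case, the cached EnsDb could then be passed to GetGRangesFromEnsDb() in place of the GTF-derived database. This is only a sketch, assuming AnnotationHub, ensembldb and Signac are installed and the hub is reachable; whether the resulting annotations match the GTF used for the experiments has not been checked here:

```r
library(AnnotationHub)
library(ensembldb)
library(Signac)

# Fetch the Ensembl 97 EnsDb for Mus musculus shown in the query above.
ah <- AnnotationHub()
edb <- ah[["AH73905"]]

# Inspect whether any chromosome lengths are missing; this should be FALSE
# if the database ships with seqlengths (assumption, worth verifying).
anyNA(seqlengths(seqinfo(edb)))

# Build the gene annotation ranges from the official database.
annotations <- GetGRangesFromEnsDb(ensdb = edb)
```

If the rest of the analysis uses UCSC-style chromosome names (e.g. with BSgenome.Mmusculus.UCSC.mm10), the sequence names of `annotations` may need to be converted accordingly.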
