Unable to retrieve sequence lengths from Ensembl #134

tomthun · 2022-05-03T23:03:33Z

Hello, I have a problem creating an annotation file for Seurat with GetGRangesFromEnsDb().
While curating the single-cell experiments, a certain GTF was used for genomic annotations.
When I try to load this file with ensDbFromGtf(), the step Processing chromosomes ... Fetch seqlengths from ensembl ... FAILs.
Because of this, GetGRangesFromEnsDb() throws an error, as it cannot create a GRanges object from a Seqinfo object with NA seqlengths. I also tried the makeTxDbFromGFF() method from the GenomicFeatures library and ran GetGRangesFromEnsDb() with the newly created DB, but ran into the same issue.

The following reproduces the steps above and shows the exact error message:

DB <- ensDbFromGtf(gtf= 'C:/Users/heinz/Downloads/Mus_musculus.GRCm38.97.gtf.gz')

Importing GTF file ... OK
Processing genes ... 
 Attribute availability:
  o gene_id ... OK
  o gene_name ... OK
  o entrezid ... Nope
  o gene_biotype ... OK
OK
Processing transcripts ... 
 Attribute availability:
  o transcript_id ... OK
  o gene_id ... OK
  o transcript_biotype ... OK
  o transcript_name ... OK
OK
Processing exons ... OK
Processing chromosomes ... Fetch seqlengths from ensembl ... FAIL
OK
Processing metadata ... OK
Generating index ... OK
  -------------
Verifying validity of the information in the database:
Checking transcripts ... OK
Checking exons ... OK
Warning messages:
1: In ensDbFromGRanges(GTF, outfile = outfile, path = path, organism = organism,  :
   I'm missing column(s): 'entrezid'. The corresponding database column(s) will be empty!
2: In tryGetSeqinfoFromEnsembl(organism, version, seqnames = chroms$seq_name) :
  Unable to retrieve sequence lengths from Ensembl.
3: In if (!missing(from) && !is.finite(if (is.character(from)) from <- as.numeric(from) else from)) stop("'from' must be a finite number") :
  closing unused connection 3 (ftp://ftp.ensembl.org/pub/release-97/mysql/)

EDB <- EnsDb(DB)
annotations <- GetGRangesFromEnsDb(ensdb = EDB)

Error in asMethod(object) : 
  cannot create a GRanges object from a Seqinfo object with NA seqlengths

Weirdly, the code snippet worked previously. However, I had to update my R version and libraries and now it doesn't.

EDIT: here are my attached libraries from sessionInfo():

attached base packages:
[1] grid      stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] AnnotationHub_3.4.0                BiocFileCache_2.4.0                dbplyr_2.1.1                      
 [4] EnsDb.Mmusculus.v79_2.99.0         BiocManager_1.30.17                cicero_1.3.6                      
 [7] Gviz_1.40.0                        BSgenome.Mmusculus.UCSC.mm10_1.4.3 BSgenome_1.64.0                   
[10] rtracklayer_1.56.0                 Biostrings_2.64.0                  XVector_0.36.0                    
[13] JASPAR2020_0.99.10                 TFBSTools_1.34.0                   patchwork_1.1.0.9000              
[16] cowplot_1.1.1                      openxlsx_4.2.5                     gprofiler2_0.2.1                  
[19] data.table_1.14.2                  ensembldb_2.20.1                   AnnotationFilter_1.20.0           
[22] GenomicFeatures_1.48.0             AnnotationDbi_1.58.0               RSQLite_2.2.13                    
[25] ggplot2_3.3.6                      future_1.25.0                      monocle3_1.0.0                    
[28] SingleCellExperiment_1.18.0        SummarizedExperiment_1.26.1        GenomicRanges_1.48.0              
[31] GenomeInfoDb_1.32.1                IRanges_2.30.0                     S4Vectors_0.34.0                  
[34] MatrixGenerics_1.8.0               matrixStats_0.62.0                 Biobase_2.56.0                    
[37] BiocGenerics_0.42.0                SeuratWrappers_0.3.0               zellkonverter_1.7.0               
[40] SeuratObject_4.0.4                 Seurat_4.1.0                       Signac_1.6.0


jorainer · 2022-05-04T11:30:32Z

From the code snippet above it seems there is a problem retrieving the sequence (chromosome) lengths from Ensembl (Processing chromosomes ... Fetch seqlengths from ensembl ... FAIL). This could be a temporary problem or the result of some reorganization of files on the Ensembl FTP servers.
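
To confirm that missing lengths are what breaks the GRanges conversion, one could inspect the Seqinfo of the database. The following is only a minimal sketch: `missing_seqlengths()` is a hypothetical helper (not part of ensembldb or Signac), and the toy Seqinfo with made-up lengths stands in for the one stored in the GTF-derived database.

```r
library(GenomeInfoDb)

# Hypothetical helper: return the names of all sequences whose length is NA
# in a Seqinfo object.
missing_seqlengths <- function(si) {
  lens <- seqlengths(si)      # named integer vector, one entry per sequence
  names(lens)[is.na(lens)]
}

# Toy Seqinfo with made-up lengths; sequence "2" has no length, mimicking
# what happens when the lengths cannot be fetched from Ensembl.
si <- Seqinfo(seqnames = c("1", "2"), seqlengths = c(1000L, NA))
missing_seqlengths(si)
```

With ensembldb loaded, `missing_seqlengths(seqinfo(EDB))` on the database from the report should then list the chromosomes affected by the failed download.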

Before digging into this, would it be possible to use the official EnsDb instead of creating one from a GTF? Creating these databases from GFF and GTF files is always problematic, because the format can change (and has changed several times) and the information provided in these files is also somewhat limited.

If you need Mus musculus annotations for Ensembl 97:

> library(AnnotationHub)
> ah <- AnnotationHub()
snapshotDate(): 2022-04-21
> query(ah, "EnsDb.Mmusculus.v97")
AnnotationHub with 1 record
# snapshotDate(): 2022-04-21
# names(): AH73905
# $dataprovider: Ensembl
# $species: Mus musculus
# $rdataclass: EnsDb
# $rdatadateadded: 2019-05-02
# $title: Ensembl 97 EnsDb for Mus musculus
# $description: Gene and protein annotations for Mus musculus based on Ensem...
# $taxonomyid: 10090
# $genome: GRCm38
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcesize: NA
# $tags: c("97", "AHEnsDbs", "Annotation", "EnsDb", "Ensembl", "Gene",
#   "Protein", "Transcript") 
# retrieve record with 'object[["AH73905"]]' 
> edb <- ah[["AH73905"]]
loading from cache
> edb
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.3.4
|Creation time: Sun Jul  7 08:07:59 2019
|ensembl_version: 97
|ensembl_host: localhost
|Organism: Mus musculus
|taxonomy_id: 10090
|genome_build: GRCm38
|DBSCHEMAVERSION: 2.1
| No. of genes: 56393.
| No. of transcripts: 144404.
|Protein data available.

This database will be cached locally, which also ensures reproducibility, because it will never change and will always be available in AnnotationHub. In addition, you'll have the full annotations available, including protein annotations.
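
Putting this together with the call from the original report, the AnnotationHub database could be passed to Signac directly. This is an untested sketch: it assumes the record ID AH73905 shown above and that GetGRangesFromEnsDb() accepts this EnsDb in the same way as the GTF-derived one.

```r
library(AnnotationHub)
library(Signac)

# Load the Ensembl 97 EnsDb for Mus musculus (GRCm38) from AnnotationHub.
# The first call downloads the database, later calls use the local cache.
ah <- AnnotationHub()
edb <- ah[["AH73905"]]

# Extract the gene annotations as a GRanges object, as in the original report
annotations <- GetGRangesFromEnsDb(ensdb = edb)
annotations
```

Whether the sequence names then need converting (for example from "1" to "chr1") depends on the naming used in the rest of the analysis.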

tomthun mentioned this issue May 4, 2022

Unable to retrieve sequence lengths from Ensembl stuart-lab/signac#1110

Closed
