`read_tsv()` gives problems on gzipped file, not when uncompressed #1523

gavinband · 2023-11-23T18:12:58Z

Thanks for making readr and tidyverse!

I am using read_tsv() (readr 2.1.4) to parse this fairly large file from a public repository:

http://ftp.ensembl.org/pub/release-110/gff3/mus_musculus/Mus_musculus.GRCm39.110.chr.gff3.gz

My code is:

system( 'curl -O http://ftp.ensembl.org/pub/release-110/gff3/mus_musculus/Mus_musculus.GRCm39.110.chr.gff3.gz' )
filename = "Mus_musculus.GRCm39.110.chr.gff3.gz"
X = readr::read_tsv(
        filename,
        comment = '#',
        na = ".",
        col_names = c( 'seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes' ),
        col_types = readr::cols(
            readr::col_character(),
            readr::col_character(),
            readr::col_character(),
            readr::col_integer(),
            readr::col_double(),
            readr::col_double(),
            readr::col_character(),
            readr::col_integer(),
            readr::col_character()
        )
)

However, this reports:

One or more parsing issues, call `problems()` on your data frame for details

Sure enough there are problems:

> readr::problems(X)
# A tibble: 3,853,670 × 5
      row   col expected   actual            file 
    <int> <int> <chr>      <chr>             <chr>
 1 433594     4 an integer ensembl           ""   
 2 433594     5 a double   ncRNA_gene        ""   
 3 433595     4 an integer ensembl           ""   
 4 433595     5 a double   miRNA             ""   
 5 433596     4 an integer ensembl           ""   
 6 433596     5 a double   exon              ""   
 7 433597     4 an integer cpg               ""   
 8 433597     5 a double   biological_region ""   
 9 433598     4 an integer Eponine           ""   
10 433598     5 a double   biological_region ""   
# ℹ 3,853,660 more rows
# ℹ Use `print(n = ...)` to see more rows
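
With several million rows in the problems table, a summary may be easier to read than the printed rows. The following is a minimal sketch that assumes only the `row` and `col` columns shown above. `summarise_problems` is a hypothetical helper, not a readr function.

```r
# Hypothetical helper (not part of readr): summarise a problems() table.
# `probs` is assumed to have `row` and `col` columns, as in the output above.
summarise_problems <- function(probs) {
  list(
    first_row = min(probs$row),          # earliest row with any problem
    n_rows = length(unique(probs$row)),  # number of distinct affected rows
    by_col = table(probs$col)            # problem count per column index
  )
}

# Usage (assumes `X` is the tibble returned by read_tsv() above):
# summarise_problems(readr::problems(X))
```

If `first_row` comes out as 433594, that would suggest parsing is fine up to that row and goes wrong from there on.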

The parsed line 433594 looks like this:

> X[433594,]
# A tibble: 1 × 9
  seqid source type  start   end    score strand   phase attributes
  <chr> <chr>  <chr> <int> <dbl>    <dbl> <chr>    <int> <chr>     
1 #     "#\n"  3        NA    NA 60677110 60677223    NA -

However, if I unzip the file first, the problem goes away:

system( 'gunzip Mus_musculus.GRCm39.110.chr.gff3.gz' )
filename = 'Mus_musculus.GRCm39.110.chr.gff3'
X = readr::read_tsv(
        filename,
        comment = '#',
        na = ".",
        col_names = c( 'seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes' ),
        col_types = readr::cols(
            readr::col_character(),
            readr::col_character(),
            readr::col_character(),
            readr::col_integer(),
            readr::col_double(),
            readr::col_double(),
            readr::col_character(),
            readr::col_integer(),
            readr::col_character()
        )
)

With correct results on that line:

> X[433594,]
# A tibble: 1 × 9
  seqid source  type          start      end score strand phase attributes      
  <chr> <chr>   <chr>         <int>    <dbl> <dbl> <chr>  <int> <chr>           
1 13    ensembl ncRNA_gene 60677110 60677223    NA -         NA ID=gene:ENSMUSG…

(I can re-gzip the file to restore the problem)
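
Until the cause is known, one possible workaround is to decompress to a temporary file from within R rather than calling `gunzip`, which leaves the original `.gz` in place. This is a sketch using only base R connections. `gunzip_to_temp` is a hypothetical helper name, and the approach assumes, based on the observation above, that the uncompressed file parses correctly.

```r
# Hypothetical helper (not part of readr): decompress a .gz file into a
# temporary plain file using base R connections, and return the new path.
gunzip_to_temp <- function(path, chunk_size = 1000000L) {
  out_path <- tempfile(fileext = ".gff3")
  con_in <- gzfile(path, open = "rb")
  on.exit(close(con_in), add = TRUE)
  con_out <- file(out_path, open = "wb")
  on.exit(close(con_out), add = TRUE)
  repeat {
    # Copy raw bytes chunk by chunk so the whole file is never held in memory.
    chunk <- readBin(con_in, what = raw(), n = chunk_size)
    if (length(chunk) == 0L) break
    writeBin(chunk, con_out)
  }
  out_path
}

# Usage (same arguments as the read_tsv() call above):
# X <- readr::read_tsv(gunzip_to_temp(filename), comment = "#", na = ".", ...)
```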

One thing that may be relevant is that the file is sprinkled with comment lines (each is '###\n'), including one near this problem line, though there are many others earlier in the file as well:

% gunzip -c Mus_musculus.GRCm39.110.chr.gff3.gz| head -n 433594 | tail | cut -f1-5 
13  havana  ncRNA_gene  44304071    44305429
13  havana  lnc_RNA 44304071    44305429
13  havana  exon    44304071    44304143
13  havana  exon    44305240    44305429
###
13  havana  ncRNA_gene  44339236    44369963
13  havana  lnc_RNA 44339236    44369910
13  havana  exon    44339236    44343939
13  havana  exon    44348532    44348783
13  havana  exon    44369847    44369910
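
To quantify how many comment lines precede a given point, the gzipped file can be read line by line with base R, independently of readr. This is a sketch only, and `count_comment_lines` is a hypothetical helper. Note that row numbers in the parsed tibble and line numbers in the raw file need not coincide if comment lines are skipped during parsing, which may be why the coordinates in the lines above differ from those in parsed row 433594.

```r
# Hypothetical helper: read the first `n` lines of a gzipped text file and
# count how many of them start with the comment character '#'.
count_comment_lines <- function(path, n = 433594L) {
  con <- gzfile(path, open = "rt")
  on.exit(close(con), add = TRUE)
  lines <- readLines(con, n = n)
  c(total = length(lines), comments = sum(startsWith(lines, "#")))
}

# Usage:
# count_comment_lines("Mus_musculus.GRCm39.110.chr.gff3.gz")
```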

Session info:

> packageVersion( 'readr' )
[1] ‘2.1.4’
> packageVersion( 'vroom' )
[1] ‘1.6.0’
> packageVersion( 'tidyverse' )
[1] ‘2.0.0’

> sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_4.2.1

Many thanks for any help with this issue.
