read_tsv() gives problems on gzipped file, not when uncompressed #1523

Open
gavinband opened this issue Nov 23, 2023 · 0 comments

Thanks for making readr and tidyverse!

I am using read_tsv() (readr 2.1.4) to parse this fairly large file from a public repository:

http://ftp.ensembl.org/pub/release-110/gff3/mus_musculus/Mus_musculus.GRCm39.110.chr.gff3.gz

My code is:

system( 'curl -O http://ftp.ensembl.org/pub/release-110/gff3/mus_musculus/Mus_musculus.GRCm39.110.chr.gff3.gz' )
filename = "Mus_musculus.GRCm39.110.chr.gff3.gz"
X = readr::read_tsv(
        filename,
        comment = '#',
        na = ".",
        col_names = c( 'seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes' ),
        col_types = readr::cols(
            readr::col_character(),
            readr::col_character(),
            readr::col_character(),
            readr::col_integer(),
            readr::col_double(),
            readr::col_double(),
            readr::col_character(),
            readr::col_integer(),
            readr::col_character()
        )
)

However, this reports:

One or more parsing issues, call `problems()` on your data frame for details

Sure enough there are problems:

> readr::problems(X)
# A tibble: 3,853,670 × 5
      row   col expected   actual            file 
    <int> <int> <chr>      <chr>             <chr>
 1 433594     4 an integer ensembl           ""   
 2 433594     5 a double   ncRNA_gene        ""   
 3 433595     4 an integer ensembl           ""   
 4 433595     5 a double   miRNA             ""   
 5 433596     4 an integer ensembl           ""   
 6 433596     5 a double   exon              ""   
 7 433597     4 an integer cpg               ""   
 8 433597     5 a double   biological_region ""   
 9 433598     4 an integer Eponine           ""   
10 433598     5 a double   biological_region ""   
# ℹ 3,853,660 more rows
# ℹ Use `print(n = ...)` to see more rows

Row 433594 of the parsed data frame looks like this:

> X[433594,]
# A tibble: 1 × 9
  seqid source type  start   end    score strand   phase attributes
  <chr> <chr>  <chr> <int> <dbl>    <dbl> <chr>    <int> <chr>     
1 #     "#\n"  3        NA    NA 60677110 60677223    NA -         

However, if I unzip the file first, the problem goes away:

system( 'gunzip Mus_musculus.GRCm39.110.chr.gff3.gz' )
filename = 'Mus_musculus.GRCm39.110.chr.gff3'
X = readr::read_tsv(
        filename,
        comment = '#',
        na = ".",
        col_names = c( 'seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes' ),
        col_types = readr::cols(
            readr::col_character(),
            readr::col_character(),
            readr::col_character(),
            readr::col_integer(),
            readr::col_double(),
            readr::col_double(),
            readr::col_character(),
            readr::col_integer(),
            readr::col_character()
        )
)

With correct results on that line:

> X[433594,]
# A tibble: 1 × 9
  seqid source  type          start      end score strand phase attributes      
  <chr> <chr>   <chr>         <int>    <dbl> <dbl> <chr>  <int> <chr>           
1 13    ensembl ncRNA_gene 60677110 60677223    NA -         NA ID=gene:ENSMUSG…

(Re-gzipping the file brings the problem back.)

One thing that may be relevant is that the file is sprinkled with comment lines (each is '###\n'), including one close to this problem line, though there are many others earlier in the file as well:

% gunzip -c Mus_musculus.GRCm39.110.chr.gff3.gz| head -n 433594 | tail | cut -f1-5 
13  havana  ncRNA_gene  44304071    44305429
13  havana  lnc_RNA 44304071    44305429
13  havana  exon    44304071    44304143
13  havana  exon    44305240    44305429
###
13  havana  ncRNA_gene  44339236    44369963
13  havana  lnc_RNA 44339236    44369910
13  havana  exon    44339236    44343939
13  havana  exon    44348532    44348783
13  havana  exon    44369847    44369910
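
Because read_tsv() drops comment lines, row 433594 of the parsed data is presumably not line 433594 of the raw file, which would explain why the coordinates above differ from those in the parsed row. A rough way to look up the raw line behind a given parsed row is sketched below. It assumes that only lines starting with '#' are skipped (readr may also skip empty lines, or treat a '#' later in a line as the start of a comment), and nth_data_line is a helper name made up here for illustration:

```shell
# Hypothetical helper: print the N-th line of a gzipped file that does not
# start with '#'. Assumes such lines are the only ones the parser skips.
nth_data_line() {
    gunzip -c "$1" | grep -v '^#' | sed -n "${2}p"
}

# Example call (requires the downloaded file to be present):
#   nth_data_line Mus_musculus.GRCm39.110.chr.gff3.gz 433594 | cut -f1-5
```

If the assumption holds, the example call should show the same fields as the correctly parsed row from the uncompressed file.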

Session info:

> packageVersion( 'readr' )
[1] ‘2.1.4’
> packageVersion( 'vroom' )
[1] ‘1.6.0’
> packageVersion( 'tidyverse' )
[1] ‘2.0.0’
> sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_4.2.1

Many thanks for any help with this issue.
