exceed 2^31-1 bytes #64

Open
partizanos opened this issue Apr 16, 2024 · 8 comments


partizanos commented Apr 16, 2024

Hello, I am trying to use ricu with the sic dataset, but I run into the issue below. Any ideas?

sic$laboratory
Data for `sic` is missing
Setup now (Y/n)? Y
The requested tables have already been downloaded
── Importing 8 tables for `sic` ───────────────────────────────────────────────────
Error in paste(do.call("c", msg), collapse = "\n") : 
  result would exceed 2^31-1 bytes
In addition: There were 50 or more warnings (use warnings() to see the first 50)

mcr1213 commented Apr 21, 2024

I also have this issue, with no solution yet. It seems to be specific to running import_src on the 2.15 GB data_float_h.csv.gz file from SICdb; all other datasets worked fine.

Some things I've tried:

  • Upgraded R to 4.3.3 and ricu to 0.6.0.
  • Upgraded all packages and did a fresh install of R.
  • Tried different hardware: I first tried on an M2 Mac, but a Linux system gives the same problem.
  • Checked the sha256 sums of the downloaded files, since file corruption was the cause in some other threads.

Full traceback is included:
Screenshot 2024-04-21 at 18 40 16

Any other suggestions to try would be much appreciated.

@manuelburger

The configuration for the SICdb database (tag sic) in inst/extdata/config/data-sources.json does not correctly reflect the most recent version, which is the one downloaded from PhysioNet. A mostly correct configuration can be found in a previous PR that integrates the database (#30), but it seems not to have been merged entirely into the current main branch.

The posted error message stems from the fact that ricu, or more specifically the read_csv_chunked function, raises a warning for every single erroneous line when importing the csv. The most problematic part is the configuration of the data_float_h table: in the current main branch, the rawdata column is specified to be of type col_double.

The database documentation (https://www.sicdb.com/Documentation/Signal_Data) clearly states that this column is a binary data column that compresses up to 60 floats into a single cell of the csv table. This keeps the row count of the table manageable while still providing up to minute-level resolution for some variables. 60 compressed floats naturally do not cast well to a col_double, so one gets a full error message for every single line of the entire data_float_h table. These error messages are all concatenated by ricu in the report_problems function, and concatenating this many messages exceeds R's string size limit of 2^31-1 bytes, which explains the error.
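To illustrate why a packed cell cannot be parsed as a single double, here is a minimal Python sketch. It assumes, purely for illustration, that a rawdata cell holds a hex string of consecutive little-endian 32-bit floats; the actual encoding should be checked against the SICdb documentation linked above. The helper names pack_floats and unpack_floats are hypothetical.

```python
import struct

def pack_floats(values):
    """Encode floats as a hex string of little-endian float32 values (assumed layout)."""
    return struct.pack(f"<{len(values)}f", *values).hex()

def unpack_floats(cell):
    """Decode a hex string back into a list of floats, 4 bytes per value."""
    raw = bytes.fromhex(cell)
    count = len(raw) // 4
    return list(struct.unpack(f"<{count}f", raw[:count * 4]))

cell = pack_floats([1.5, 2.5, -0.25])

# Reading the whole cell as one number fails, analogous to what a
# col_double column specification attempts for every row.
try:
    float(cell)
    parsed_as_double = True
except ValueError:
    parsed_as_double = False

# Reading the cell as text first and unpacking afterwards recovers every value.
values = unpack_floats(cell)
```

Under this assumption, importing the column as text and unpacking it in a later step is what keeps the per-row parse errors from occurring in the first place.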

Interestingly, there is a second report_problems function just above this one, which would handle the problem by reporting only the first 10 issues and ignoring the rest. However, since it is listed first in the source code, the later definition overrides it and is the one ultimately used, so all messages are propagated at the moment.
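The effect of a duplicate definition can be shown with a small analogue. The sketch below is Python rather than R, and the two function bodies are hypothetical stand-ins, but the rule is the same in both languages: when a name is defined twice at top level, the later definition replaces the earlier one.

```python
def report_problems(problems):
    """First definition: report only the first 10 problems."""
    return problems[:10]

def report_problems(problems):  # same name: this rebinding replaces the one above
    """Second definition: report every problem."""
    return list(problems)

reported = report_problems([f"row {i}" for i in range(1000)])
print(len(reported))  # 1000, because only the second definition is still bound
```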

Potential fix is:

  • Make sure the correct report_problems function is called, i.e. the one that ignores all but the first 10 problems.
  • The above does not tackle the source of the problem, which is the wrong configuration. The rawdata column should be imported with type col_character; the PR referenced above (Enable SICdb in ricu #30) contains some code to unfold the 60 compressed floats, so that SICdb can be used in its full high resolution.
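The first fix can be sketched as follows. This is an illustrative Python version, not ricu's code; summarize_problems and its max_shown parameter are hypothetical names. The point is that the message size stays bounded no matter how many rows fail to parse.

```python
def summarize_problems(problems, max_shown=10):
    """Join at most max_shown problem messages and summarize the remainder."""
    shown = problems[:max_shown]
    hidden = len(problems) - len(shown)
    lines = list(shown)
    if hidden > 0:
        lines.append(f"... and {hidden} more problems")
    return "\n".join(lines)

# One message per failing row, as when every row of a column has the wrong type.
problems = [f"row {i}: expected a double" for i in range(100_000)]
message = summarize_problems(problems)
```

Capping the report only hides the symptom; the wrong column type in the configuration still has to be corrected.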

Hope this helps


mcr1213 commented Apr 27, 2024

@manuelburger Thank you so much for the clear explanation. I've removed the redundant 'report_problems' function and changed the rawdata column from col_double to col_character in the config file. However, after 31% another error occurs:

Screenshot 2024-04-27 at 19 39 08

This probably has to do with the changes you mentioned in #30, which are not merged into the main branch. Is there any particular reason that these changes are not available? Or am I the only one for whom SICdb 1.0.6 is not working in ricu?


mcr1213 commented Apr 29, 2024

A short update: I've taken the branch mentioned in #30, created by @prockenschaub, recompiled the ricu package (the older 0.5.5 version, that is) and tried to add SICdb with it. The previous error no longer occurs, but after importing 86% a new one does:

Screenshot 2024-04-29 at 19 21 41

I've tried tracing back the code to see if there was an obvious explanation, but could not find one. It is not clear to me what function res should be.

Is there anyone with a working SICdb environment? And could they tell me which codebase they used?

Collaborator

prockenschaub commented Apr 29, 2024

@mcr1213 I originally meant to work with SICdb when it was released but this has been pushed back repeatedly, so I haven't touched the code in a while. I originally thought that SICdb was fully integrated in ricu 0.6.0 and there was no need for my code, but apparently not.

Since there appears to be increased interest in SICdb, maybe now is a good time to look at it again. I will try to find some time in the coming days to look at your error and see what's wrong / how we can bring the code into the latest version of ricu and SICdb.

Edit: I had a quick look. res should be the function sic_data_float_h as defined in data-sources.json:

"callback": "sic_data_float_h"

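A configuration entry like this names a function as a string, which the importer has to resolve at run time. The Python sketch below illustrates that lookup pattern under stated assumptions: the CALLBACKS registry, resolve_callback, and the stand-in body of sic_data_float_h are hypothetical and do not reproduce ricu's internals. If the named function is not defined, the lookup fails, which may be one way an import breaks partway through.

```python
import json

def sic_data_float_h(rows):
    """Stand-in callback; the real one would post-process the data_float_h table."""
    return rows

# Registry of the callbacks that are actually defined.
CALLBACKS = {"sic_data_float_h": sic_data_float_h}

def resolve_callback(table_config):
    """Return the function named by the "callback" field, or None if there is none."""
    name = table_config.get("callback")
    if name is None:
        return None
    if name not in CALLBACKS:
        raise LookupError(f"callback {name!r} is not defined")
    return CALLBACKS[name]

config = json.loads('{"callback": "sic_data_float_h"}')
callback = resolve_callback(config)
```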

mcr1213 commented May 14, 2024

@prockenschaub Thanks for your suggestion. Unfortunately, I'm no expert in debugging R packages and it does not work for me yet. At the moment my hypothesis is that the mentioned 'sic_data_float_h' cannot be found. When running ls("package:ricu"), this function does not show up among the available functions. I do know that it is defined in a file that is new compared to the original release, "./R/callback-tb-R". Searches on Google/ChatGPT suggested listing the file in the main DESCRIPTION file, but the other files are not referenced there.

I've also tried to re-'roxygenize' the package to recreate NAMESPACE, but no luck.

Can you tell me if I'm on the right track? Does the sicdb work for you?

Member

dplecko commented May 24, 2024

I will resolve this issue in the next version (i.e., in June). In the meantime, if this is an urgent matter for anyone, my suggestion is to simply perform manual conversion to fst. I am attaching below some (pretty raw) code that I used for converting the sic tables when I first accessed the data. This code could perhaps be helpful for anyone looking for a quick fix, until I resolve the issue properly.

First, I split the data_float_h table into chunks (since it is huge):

import csv, os

def split_csv_file(input_file, output_prefix, num_files):
    # Open the input CSV file
    with open(input_file, 'r', newline='') as file:
        # Create a CSV reader
        reader = csv.reader(file)
        
        # Read the header row
        header = next(reader)
        
        # Calculate the number of rows per file (excluding the header row)
        rows_per_file = max(1, (sum(1 for _ in reader) + num_files - 1) // num_files)
        
        # Reset the file pointer to the beginning and skip the header row,
        # so that only data rows are distributed over the chunks
        file.seek(0)
        reader = csv.reader(file)
        next(reader)
        
        # Split the CSV into smaller chunks, streaming one row at a time
        chunk_index = 0
        output = None
        for i, row in enumerate(reader):
            if (i % rows_per_file) == 0:
                # Close the previous chunk (if any) and open a new output file
                if output is not None:
                    output.close()
                    print(f"Saved {output_file}")
                chunk_index += 1
                output_file = f"{output_prefix}_{chunk_index}.csv"
                output = open(output_file, 'w', newline='')
                writer = csv.writer(output)
                writer.writerow(header)  # Write the header row
            
            # Write the current row to the current chunk
            writer.writerow(row)
        
        # Close the last chunk
        if output is not None:
            output.close()
            print(f"Saved {output_file}")

input_path = os.path.expanduser("sic-data/data_float_h.csv")
split_csv_file(input_path, "output", 30)
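Because a split that silently drops rows would be hard to notice later, it may be worth checking the chunks before converting them. A minimal sketch, with hypothetical helper names count_data_rows and chunks_cover_input:

```python
import csv

def count_data_rows(path):
    """Count the rows of a CSV file, excluding the header row."""
    with open(path, "r", newline="") as file:
        reader = csv.reader(file)
        next(reader, None)  # skip the header if present
        return sum(1 for _ in reader)

def chunks_cover_input(input_file, chunk_files):
    """Return True if the chunks together hold as many data rows as the input."""
    total = sum(count_data_rows(path) for path in chunk_files)
    return total == count_data_rows(input_file)
```

With the call above, the chunk list could be collected with sorted(glob.glob("output_*.csv")) and passed together with the path of the original data_float_h.csv.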

And then all tables can be converted to fst:


root <- rprojroot::find_root(".gitignore")
r_dir <- file.path(root, "r")
invisible(lapply(list.files(r_dir, full.names = TRUE), source))

library(fst)
library(ricu)

if (!dir.exists(file.path(data_dir(), "sic"))) 
  dir.create(file.path(data_dir(), "sic"))

convert_names <- c(
  "cases", "d_references", "data_range", "data_ref", "laboratory",
  "medication", "unitlog",
  "data_float_h"
)

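data_path <- file.path("~", "Desktop", "sic-data")
# Note: this branch only runs when "data_float_hfull" is listed in
# convert_names (it is not in the vector above); it swaps in the names of the
# chunked files found in the data_float_h subfolder.
if (is.element("data_float_hfull", convert_names)) {
  
  convert_names <- paste0(
    "data_float_h/",
    sub("\\.csv$", "", list.files(file.path(data_path, "data_float_h")))
  )
}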

for (tab_name in convert_names) {
  
  if (file.exists(file.path(data_path, paste0(tab_name, ".csv")))) {
    
    tbl <- read.csv(file.path(data_path, paste0(tab_name, ".csv")))
    # file.remove(paste0(tab_name, ".parquet"))
    
    if (grepl("data_float_h_", tab_name)) 
      tab_name <- gsub("data_float_h_", "", tab_name)
    
    if (tab_name == "microbiology") {
      
      off_col <- which(names(tbl) == "offset")
      names(tbl)[off_col] <- "Offset"
    }
    
    if (tab_name == "gcs") {
      
      tbl$Offset <- 0
    }
    
    write_fst(tbl, path = file.path(data_dir(), "sic", paste0(tab_name, ".fst")))
    
  }
  
  print(tab_name)
}

fix_rawdata <- which(
  vapply(
    1:30,
    function(i) {
      class(
        read.fst(file.path(data_dir(), "sic", "data_float_h", 
                           paste0(i, ".fst")))$rawdata
      )
    }, character(1L) 
  ) == "logical"
)

for (i in fix_rawdata) {
  
  lgl_out <- read.fst(file.path(data_dir(), "sic", "data_float_h", 
                                paste0(i, ".fst")))
  lgl_out$rawdata <- as.numeric(lgl_out$rawdata)
  
  write.fst(lgl_out, file.path(data_dir(), "sic", "data_float_h", 
                               paste0(i, ".fst")))
}

Once the fst files are properly named and located in a folder called sic within the directory given by ricu::data_dir(), there should be no further issues.
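A quick way to check the final layout is to list which expected files are absent. The sketch below is in Python and makes several assumptions: sic_dir stands for the sic folder inside the directory reported by ricu::data_dir(), the table names are taken from convert_names above, and data_float_h is assumed to be stored as numbered chunks (1.fst to 30.fst) in its own subfolder, as the conversion script suggests. The function name missing_fst_files is hypothetical.

```python
from pathlib import Path

# Table names copied from convert_names, without the chunked data_float_h table.
TABLES = ["cases", "d_references", "data_range", "data_ref", "laboratory",
          "medication", "unitlog"]

def missing_fst_files(sic_dir, tables=TABLES, n_chunks=30):
    """Return the expected .fst paths that do not exist under sic_dir."""
    sic_dir = Path(sic_dir)
    expected = [sic_dir / f"{name}.fst" for name in tables]
    expected += [sic_dir / "data_float_h" / f"{i}.fst"
                 for i in range(1, n_chunks + 1)]
    return [path for path in expected if not path.exists()]
```

An empty result means every file assumed above is in place; anything returned points at a table that still needs converting or renaming.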

@mcr1213
Copy link

mcr1213 commented May 29, 2024

Thanks for the help everyone! The tables can now be successfully imported.
