R crashes on large get_acoustic_detections query #278

Open
PietrH opened this issue Mar 28, 2023 · 29 comments
Labels
bug Something isn't working

Comments


PietrH commented Mar 28, 2023

I seem to have run into a bug where etn crashes the whole R session when calling get_acoustic_detections() on an animal_project_code that results in many detections.

This has been replicated by @peterdesmet. From what I understand, get_acoustic_detections() used to work on this animal_project_code.

library(etn)
albert <- get_acoustic_detections(connect_to_etn(), animal_project_code = "2013_ALBERTKANAAL")

Running this code results in:

[screenshot of the R session crash dialog]

I welcome any ideas or speculation on the cause of this bug.

=============================

My session info:

Session info ────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.2.1 (2022-06-23)
 os       Ubuntu 22.10
 system   x86_64, linux-gnu
 ui       RStudio
 language en_US
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/Brussels
 date     2023-03-28
 rstudio  2022.12.0+353 Elsbeth Geranium (desktop)
 pandoc   NA

Packages ────────────────────────────────────────────────────────────────────
 package     * version date (UTC) lib source
 assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.2.1)
 cli           3.6.1   2023-03-23 [1] CRAN (R 4.2.1)
 DBI           1.1.3   2022-06-18 [1] CRAN (R 4.2.1)
 dplyr         1.1.1   2023-03-22 [1] CRAN (R 4.2.1)
 etn           2.0.1   2023-02-10 [1] local
 fansi         1.0.4   2023-01-22 [1] CRAN (R 4.2.1)
 generics      0.1.3   2022-07-05 [1] CRAN (R 4.2.1)
 glue          1.6.2   2022-02-24 [1] CRAN (R 4.2.1)
 hms           1.1.3   2023-03-21 [1] CRAN (R 4.2.1)
 httr          1.4.5   2023-02-24 [1] CRAN (R 4.2.1)
 jsonlite      1.8.4   2022-12-06 [1] CRAN (R 4.2.1)
 lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.2.1)
 magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.2.1)
 pillar        1.9.0   2023-03-22 [1] CRAN (R 4.2.1)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.2.1)
 R6            2.5.1   2021-08-19 [1] CRAN (R 4.2.1)
 readr         2.1.4   2023-02-10 [1] CRAN (R 4.2.1)
 rlang         1.1.0   2023-03-14 [1] CRAN (R 4.2.1)
 rstudioapi    0.14    2022-08-22 [1] CRAN (R 4.2.1)
 sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.2.1)
 tibble        3.2.1   2023-03-20 [1] CRAN (R 4.2.1)
 tidyselect    1.2.0   2022-10-10 [1] CRAN (R 4.2.1)
 tzdb          0.3.0   2022-03-28 [1] CRAN (R 4.2.1)
 utf8          1.2.3   2023-01-31 [1] CRAN (R 4.2.1)
 vctrs         0.6.1   2023-03-22 [1] CRAN (R 4.2.1)
PietrH added the bug label on Mar 28, 2023

PietrH commented Jun 19, 2023

Todo:

  • I've requested a few more examples of user queries that fail or are very slow
  • I will forward the debug view of the failing queries (in SQL) to VLIZ to be investigated

We'll have a look at possible solutions to speed up the detections view: indexing, perhaps implementing pagination.


PieterjanVerhelst commented Jun 19, 2023

@PietrH, I'm doing a meta-analysis on silver eel migration by combining the data of various studies. Downloading all the data at once not only takes a lot of time, it has also given me a 'time out' error in the past, so I had to download the data in 2 or 3 subsets. I am running it right now just to test (it has been running since 10:47 this morning). If you want to try yourself, here is the code:

# Create vector with relevant animal_project_codes
animal_projects <- c("2015_phd_verhelst_eel", 
                     "2012_leopoldkanaal",
                     "PTN-Silver-eel-Mondego",
                     "SEMP",
                     "DAK",
                     "EMMN",
                     "Noordzeekanaal",
                     "2011_Warnow",
                     "2017_Fremur",
                     "2011_Loire",
                     "2019_Grotenete",
                     "ESGL",
                     "2013_albertkanaal",
                     "2004_Gudena",
                     "2014_Frome",
                     "2014_Nene",
                     "life4fish"
)


# Download detection data ####
data <- get_acoustic_detections(my_con, scientific_name = "Anguilla anguilla",
                                animal_project_code = animal_projects,
                                limit = FALSE)

Update: download just crashed, giving the same error as above.

@CLAUMEMO

Hi,
I ran a similar query to PJ's and got this error:

memory allocation error???   ut of memory while reading tuples.;
memory allocation error??? 
<SQL> 'SELECT
  det.id_pk AS detection_id,
  det.datetime AS date_time,
  tag_serial_number AS tag_serial_number, -- exclusive to detections_limited
  det.transmitter AS acoustic_tag_id,
  animal_project_code AS animal_project_code, -- exclusive to detections_limited
  animal_id_pk AS animal_id, -- exclusive to detections_limited
  animal_scientific_name AS scientific_name, -- exclusive to detections_limited
  network_project_code AS acoustic_project_code, -- exclusive to detections_limited
  det.receiver AS receiver_id,
  deployment_station_name AS station_name, -- exclusive to detections_limited
  deployment_latitude AS deploy_latitude, -- exclusive to detections_limited
  deployment_longitude AS deploy_longitude, -- exclusive to detections_limited
  det.sensor_value AS sensor_value,
  det.sensor_unit AS sensor_unit


PietrH commented Jul 17, 2023

Hello Claudia, thank you for including the error message, very helpful! :-)

What was the exact R command you were running?

Currently the package seems to struggle with queries that result in very large R objects; I'm not sure yet whether this is a database or a package problem.

In the meantime you might be able to work around the issue by splitting up your query (see the sketch below), but I'll know more once you let me know exactly what you were doing.
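
To illustrate what I mean by splitting up: a minimal sketch (untested here; the project codes are only an example) that calls get_acoustic_detections() once per animal_project_code and binds the pieces:

# Sketch of the split-query workaround: one call per animal_project_code,
# then bind the pieces together (project codes below are only an example).
library(etn)

con <- connect_to_etn()
projects <- c("2015_phd_verhelst_eel", "2012_leopoldkanaal")

detections <- projects |>
  purrr::map(
    ~ get_acoustic_detections(con,
                              animal_project_code = .x,
                              scientific_name = "Anguilla anguilla")
  ) |>
  dplyr::bind_rows()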


PietrH commented Jul 17, 2023

@bart-v Could the following error be due to the server/ODBC drivers running out of memory?


memory allocation error???   ut of memory while reading tuples.;
memory allocation error??? 
<SQL> 'SELECT
  det.id_pk AS detection_id,
  det.datetime AS date_time,
  tag_serial_number AS tag_serial_number, -- exclusive to detections_limited
  det.transmitter AS acoustic_tag_id,
  animal_project_code AS animal_project_code, -- exclusive to detections_limited
  animal_id_pk AS animal_id, -- exclusive to detections_limited
  animal_scientific_name AS scientific_name, -- exclusive to detections_limited
  network_project_code AS acoustic_project_code, -- exclusive to detections_limited
  det.receiver AS receiver_id,
  deployment_station_name AS station_name, -- exclusive to detections_limited
  deployment_latitude AS deploy_latitude, -- exclusive to detections_limited
  deployment_longitude AS deploy_longitude, -- exclusive to detections_limited
  det.sensor_value AS sensor_value,
  det.sensor_unit AS sensor_unit


bart-v commented Jul 17, 2023

The query is not complete, so who knows...


PietrH commented Jul 17, 2023

@bart-v

This is the query that get_acoustic_detections() is trying to execute in the example PieterjanVerhelst gave:

SELECT
  det.id_pk AS detection_id,
  det.datetime AS date_time,
  tag_serial_number AS tag_serial_number, -- exclusive to detections_limited
  det.transmitter AS acoustic_tag_id,
  animal_project_code AS animal_project_code, -- exclusive to detections_limited
  animal_id_pk AS animal_id, -- exclusive to detections_limited
  animal_scientific_name AS scientific_name, -- exclusive to detections_limited
  network_project_code AS acoustic_project_code, -- exclusive to detections_limited
  det.receiver AS receiver_id,
  deployment_station_name AS station_name, -- exclusive to detections_limited
  deployment_latitude AS deploy_latitude, -- exclusive to detections_limited
  deployment_longitude AS deploy_longitude, -- exclusive to detections_limited
  det.sensor_value AS sensor_value,
  det.sensor_unit AS sensor_unit,
  det.sensor2_value AS sensor2_value,
  det.sensor2_unit AS sensor2_unit,
  det.signal_to_noise_ratio AS signal_to_noise_ratio,
  det.file AS source_file,
  det.qc_flag AS qc_flag,
  det.deployment_fk AS deployment_id
  -- det.transmitter_name
  -- det.transmitter_serial: via tag_device instead
  -- det.station_name: deployment.station_name instead
  -- det.latitude: deployment.deploy_lat instead
  -- det.longitude: deployment.deploy_long instead
  -- det.detection_file_id
  -- det.receiver_serial
  -- det.gain
  -- external_id
FROM acoustic.detections_limited AS det
WHERE
  'True'
  AND 'True'
  AND 'True'
  AND LOWER(animal_project_code) IN ('2015_phd_verhelst_eel', '2012_leopoldkanaal', 'ptn-silver-eel-mondego', 'semp', 'dak', 'emmn', 'noordzeekanaal', '2011_warnow', '2017_fremur', '2011_loire', '2019_grotenete', 'esgl', '2013_albertkanaal', '2004_gudena', '2014_frome', '2014_nene', 'life4fish')
  AND animal_scientific_name IN ('Anguilla anguilla')
  AND 'True'
  AND 'True'
  AND 'True'
  LIMIT ALL}
  

I can replicate Claudia's issue with the following SQL query:

  SELECT
  det.id_pk AS detection_id,
  det.datetime AS date_time,
  tag_serial_number AS tag_serial_number, -- exclusive to detections_limited
  det.transmitter AS acoustic_tag_id,
  animal_project_code AS animal_project_code, -- exclusive to detections_limited
  animal_id_pk AS animal_id, -- exclusive to detections_limited
  animal_scientific_name AS scientific_name, -- exclusive to detections_limited
  network_project_code AS acoustic_project_code, -- exclusive to detections_limited
  det.receiver AS receiver_id,
  deployment_station_name AS station_name, -- exclusive to detections_limited
  deployment_latitude AS deploy_latitude, -- exclusive to detections_limited
  deployment_longitude AS deploy_longitude, -- exclusive to detections_limited
  det.sensor_value AS sensor_value,
  det.sensor_unit AS sensor_unit,
  det.sensor2_value AS sensor2_value,
  det.sensor2_unit AS sensor2_unit,
  det.signal_to_noise_ratio AS signal_to_noise_ratio,
  det.file AS source_file,
  det.qc_flag AS qc_flag,
  det.deployment_fk AS deployment_id
  -- det.transmitter_name
  -- det.transmitter_serial: via tag_device instead
  -- det.station_name: deployment.station_name instead
  -- det.latitude: deployment.deploy_lat instead
  -- det.longitude: deployment.deploy_long instead
  -- det.detection_file_id
  -- det.receiver_serial
  -- det.gain
  -- external_id
FROM acoustic.detections_limited AS det
WHERE
  'True'
  AND 'True'
  AND 'True'
  AND LOWER(animal_project_code) IN ('2013_albertkanaal')
  AND 'True'
  AND 'True'
  AND 'True'
  AND 'True'
  LIMIT ALL}
  

Executed directly via DBI::dbGetQuery()


bart-v commented Jul 17, 2023

16M records.
This is something you should NOT do in one query; it will blow up your database, the result set, and the download.
Please page through the records or use more specific queries, e.g. by defining one project instead of 20.

Member Author

PietrH commented Jul 17, 2023

Wouldn't pagination require sorting the query result first, for every page retrieved? In the past, decisions were made to keep all sorting on the client side. Is there a better way to implement pagination?

I'm still wondering if anything has changed, since the query also fails on single larger datasets such as '2013_albertkanaal'. From what I was told, etn was used to publish this dataset to Zenodo, so I understand something like get_acoustic_detections(connect_to_etn(), animal_project_code = "2013_ALBERTKANAAL") used to be possible.

  SELECT
    det.id_pk AS detection_id,
    det.datetime AS date_time,
    tag_serial_number AS tag_serial_number, -- exclusive to detections_limited
    det.transmitter AS acoustic_tag_id,
    animal_project_code AS animal_project_code, -- exclusive to detections_limited
    animal_id_pk AS animal_id, -- exclusive to detections_limited
    animal_scientific_name AS scientific_name, -- exclusive to detections_limited
    network_project_code AS acoustic_project_code, -- exclusive to detections_limited
    det.receiver AS receiver_id,
    deployment_station_name AS station_name, -- exclusive to detections_limited
    deployment_latitude AS deploy_latitude, -- exclusive to detections_limited
    deployment_longitude AS deploy_longitude, -- exclusive to detections_limited
    det.sensor_value AS sensor_value,
    det.sensor_unit AS sensor_unit,
    det.sensor2_value AS sensor2_value,
    det.sensor2_unit AS sensor2_unit,
    det.signal_to_noise_ratio AS signal_to_noise_ratio,
    det.file AS source_file,
    det.qc_flag AS qc_flag,
    det.deployment_fk AS deployment_id
    -- det.transmitter_name
    -- det.transmitter_serial: via tag_device instead
    -- det.station_name: deployment.station_name instead
    -- det.latitude: deployment.deploy_lat instead
    -- det.longitude: deployment.deploy_long instead
    -- det.detection_file_id
    -- det.receiver_serial
    -- det.gain
    -- external_id
  FROM acoustic.detections_limited AS det
  WHERE
    LOWER(animal_project_code) IN ('2013_albertkanaal')

I believe this query should result in a 6.4M row table.

Member Author

PietrH commented Aug 30, 2023

@bart-v

I believe pagination is the right answer.

  • Do you think a LIMIT OFFSET strategy is the right choice?

I believe there are significant trade-offs: it gets more inefficient as the query result grows, and it's inconsistent due to the random order in which query results are currently returned, so we might get duplication and omission in the results (a rough sketch of LIMIT/OFFSET paging follows after this list).

  • Is it possible for the server to always return the results sorted by an index/key? Currently the order is inconsistent when limiting.

I'm thinking of using cursors or keyset pagination instead of OFFSET LIMIT to get around the inconsistency problem and to reduce server load.

  • Is there anything else server side that needs to change for us to experiment with pagination?
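
For concreteness, a rough sketch of what a LIMIT/OFFSET paging loop could look like from R (an illustration only, not the etn implementation; only two columns are selected, the page size is arbitrary, and the ORDER BY is what would make the pages consistent):

# Rough sketch of LIMIT/OFFSET paging (illustration only, not the etn code).
# Without an ORDER BY, PostgreSQL gives no ordering guarantee, so pages could
# overlap or miss rows; ordering on a key makes the paging deterministic.
page_size <- 100000L
offset <- 0L
pages <- list()
repeat {
  page <- DBI::dbGetQuery(con, glue::glue_sql("
    SELECT det.id_pk AS detection_id, det.datetime AS date_time
    FROM acoustic.detections_limited AS det
    WHERE LOWER(animal_project_code) IN ('2013_albertkanaal')
    ORDER BY det.id_pk
    LIMIT {page_size} OFFSET {offset}", .con = con))
  if (nrow(page) == 0) break
  pages[[length(pages) + 1]] <- page
  offset <- offset + page_size
}
detections <- dplyr::bind_rows(pages)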


bart-v commented Aug 30, 2023

  • Yes, LIMIT OFFSET is perfect
  • No, you need to sort yourself, in the query
  • Server side: indexes might be needed on the sort field or other fields used in the filter

Member Author

PietrH commented Aug 31, 2023

@bart-v

  • We will need an index on acoustic.detections_limited at least on id_pk, but preferably also on datetime, animal_project_code and network_project_code, and maybe also on other commonly used filter fields.
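
Purely as an illustration of the kind of indexes meant here (the detections_limited view itself can't carry an index, so these would go on whatever base table backs it; acoustic.detections is an assumed name, and this would be run server-side by VLIZ, not through etn):

# Illustration only: index names and the base table 'acoustic.detections'
# are assumptions; these statements would be run by the database admins.
DBI::dbExecute(con, "CREATE INDEX IF NOT EXISTS detections_id_pk_idx
                       ON acoustic.detections (id_pk)")
DBI::dbExecute(con, "CREATE INDEX IF NOT EXISTS detections_animal_project_idx
                       ON acoustic.detections (LOWER(animal_project_code))")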

I've implemented paging on the following query:

SELECT
  det.id_pk AS detection_id,
  det.datetime AS date_time,
  tag_serial_number AS tag_serial_number, -- exclusive to detections_limited
  det.transmitter AS acoustic_tag_id,
  animal_project_code AS animal_project_code, -- exclusive to detections_limited
  animal_id_pk AS animal_id, -- exclusive to detections_limited
  animal_scientific_name AS scientific_name, -- exclusive to detections_limited
  network_project_code AS acoustic_project_code, -- exclusive to detections_limited
  det.receiver AS receiver_id,
  deployment_station_name AS station_name, -- exclusive to detections_limited
  deployment_latitude AS deploy_latitude, -- exclusive to detections_limited
  deployment_longitude AS deploy_longitude, -- exclusive to detections_limited
  det.sensor_value AS sensor_value,
  det.sensor_unit AS sensor_unit,
  det.sensor2_value AS sensor2_value,
  det.sensor2_unit AS sensor2_unit,
  det.signal_to_noise_ratio AS signal_to_noise_ratio,
  det.file AS source_file,
  det.qc_flag AS qc_flag,
  det.deployment_fk AS deployment_id
  -- det.transmitter_name
  -- det.transmitter_serial: via tag_device instead
  -- det.station_name: deployment.station_name instead
  -- det.latitude: deployment.deploy_lat instead
  -- det.longitude: deployment.deploy_long instead
  -- det.detection_file_id
  -- det.receiver_serial
  -- det.gain
  -- external_id
FROM acoustic.detections_limited AS det
WHERE
LOWER(network_project_code) IN ('demer', 'dijle')

This is lifted from one of the old unit tests, so I know it used to work. Currently this query fails without paging and takes 29.8 minutes with paging. I used keyset pagination on id_pk, assuming it would be indexed, but given how slow it is, I suspect it isn't.

Half an hour is unacceptably slow for a query like get_acoustic_detections(acoustic_project_code = c("demer", "dijle")), which not only used to work, but did so quickly enough to be used in unit tests.
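
For reference, a sketch of what keyset pagination on id_pk looks like in this context (not the actual branch code; only two columns are selected, the page size is arbitrary, and it assumes id_pk is a unique, positive, increasing key):

# Sketch of keyset pagination on id_pk (not the actual branch code).
fetch_keyset_pages <- function(con, page_size = 100000L) {
  pages <- list()
  last_id <- 0L                      # assumes id_pk values are positive
  repeat {
    page <- DBI::dbGetQuery(con, glue::glue_sql("
      SELECT det.id_pk AS detection_id, det.datetime AS date_time
      FROM acoustic.detections_limited AS det
      WHERE LOWER(network_project_code) IN ('demer', 'dijle')
        AND det.id_pk > {last_id}
      ORDER BY det.id_pk
      LIMIT {page_size}", .con = con))
    if (nrow(page) == 0) break
    pages[[length(pages) + 1]] <- page
    last_id <- max(page$detection_id)
  }
  dplyr::bind_rows(pages)
}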


bart-v commented Aug 31, 2023

OK, we'll need some time to investigate and implement

@Stijn-VLIZ

We made some changes to the underlying view; can you verify whether there is a major performance improvement?

Member Author

PietrH commented Nov 30, 2023

@Stijn-VLIZ

I'm getting this error:

Error: nanodbc/nanodbc.cpp:1769: 00000: Out of memory while reading tuples.;
memory allocation error???   ut of memory while reading tuples.;
memory allocation error??? 
<SQL> '  SELECT
    det.id_pk AS detection_id,
    det.datetime AS date_time,
    tag_serial_number AS tag_serial_number, -- exclusive to detections_limited
    det.transmitter AS acoustic_tag_id,
    animal_project_code AS animal_project_code, -- exclusive to detections_limited
    animal_id_pk AS animal_id, -- exclusive to detections_limited
    animal_scientific_name AS scientific_name, -- exclusive to detections_limited
    network_project_code AS acoustic_project_code, -- exclusive to detections_limited
    det.receiver AS receiver_id,
    deployment_station_name AS station_name, -- exclusive to detections_limited
    deployment_latitude AS deploy_latitude, -- exclusive to detections_limited
    deployment_longitude AS deploy_longitude, -- exclusive to detections_limited
    det.sensor_value AS sensor_value,

On this query, for a single animal project code:

  SELECT
    det.id_pk AS detection_id,
    det.datetime AS date_time,
    tag_serial_number AS tag_serial_number, -- exclusive to detections_limited
    det.transmitter AS acoustic_tag_id,
    animal_project_code AS animal_project_code, -- exclusive to detections_limited
    animal_id_pk AS animal_id, -- exclusive to detections_limited
    animal_scientific_name AS scientific_name, -- exclusive to detections_limited
    network_project_code AS acoustic_project_code, -- exclusive to detections_limited
    det.receiver AS receiver_id,
    deployment_station_name AS station_name, -- exclusive to detections_limited
    deployment_latitude AS deploy_latitude, -- exclusive to detections_limited
    deployment_longitude AS deploy_longitude, -- exclusive to detections_limited
    det.sensor_value AS sensor_value,
    det.sensor_unit AS sensor_unit,
    det.sensor2_value AS sensor2_value,
    det.sensor2_unit AS sensor2_unit,
    det.signal_to_noise_ratio AS signal_to_noise_ratio,
    det.file AS source_file,
    det.qc_flag AS qc_flag,
    det.deployment_fk AS deployment_id
    -- det.transmitter_name
    -- det.transmitter_serial: via tag_device instead
    -- det.station_name: deployment.station_name instead
    -- det.latitude: deployment.deploy_lat instead
    -- det.longitude: deployment.deploy_long instead
    -- det.detection_file_id
    -- det.receiver_serial
    -- det.gain
    -- external_id
  FROM acoustic.detections_limited AS det
  WHERE
    LOWER(animal_project_code) IN ('2013_albertkanaal')

To reproduce in R:

library(etn)
my_con <- connect_to_etn(username = "pieter.huybrechts@inbo.be",
               password = askpass::askpass())
               
query <- 
  glue::glue_sql(.con = my_con,
                 "
  SELECT
    det.id_pk AS detection_id,
    det.datetime AS date_time,
    tag_serial_number AS tag_serial_number, -- exclusive to detections_limited
    det.transmitter AS acoustic_tag_id,
    animal_project_code AS animal_project_code, -- exclusive to detections_limited
    animal_id_pk AS animal_id, -- exclusive to detections_limited
    animal_scientific_name AS scientific_name, -- exclusive to detections_limited
    network_project_code AS acoustic_project_code, -- exclusive to detections_limited
    det.receiver AS receiver_id,
    deployment_station_name AS station_name, -- exclusive to detections_limited
    deployment_latitude AS deploy_latitude, -- exclusive to detections_limited
    deployment_longitude AS deploy_longitude, -- exclusive to detections_limited
    det.sensor_value AS sensor_value,
    det.sensor_unit AS sensor_unit,
    det.sensor2_value AS sensor2_value,
    det.sensor2_unit AS sensor2_unit,
    det.signal_to_noise_ratio AS signal_to_noise_ratio,
    det.file AS source_file,
    det.qc_flag AS qc_flag,
    det.deployment_fk AS deployment_id
    -- det.transmitter_name
    -- det.transmitter_serial: via tag_device instead
    -- det.station_name: deployment.station_name instead
    -- det.latitude: deployment.deploy_lat instead
    -- det.longitude: deployment.deploy_long instead
    -- det.detection_file_id
    -- det.receiver_serial
    -- det.gain
    -- external_id
  FROM acoustic.detections_limited AS det
  WHERE
    LOWER(animal_project_code) IN ('2013_albertkanaal')
                 ")

# Option 1: fetch the full result in one go
detections <- DBI::dbGetQuery(my_con, query)

# Option 2: send the query, then fetch and clear the result in separate steps
result <- DBI::dbSendQuery(my_con, query, immediate = FALSE)
DBI::dbFetch(result)
DBI::dbClearResult(result)

I will continue testing

@PieterjanVerhelst

Just FYI: I am running RStudio with the code I posted above. It has been running since 8:45...

Member Author

PietrH commented Nov 30, 2023

I'm still crashing RStudio Server via get_acoustic_detections(connect_to_etn(), animal_project_code = "2013_ALBERTKANAAL")

@Stijn-VLIZ

As far as I can see now, the query itself is done quite fast, but it is the readout that takes a long time and uses a lot of resources.
Previously I was using an RStudio installation on a server which has more than enough resources.
But when I installed RStudio locally and connect through an ODBC connection, I get a slow readout.
The memory used builds up to 7 GiB, after which it drops back to 1.3 GiB.
If you are already using a large portion of your memory, this will result in either waiting a very long time before consuming it all, or an out-of-memory error.

Proposed solutions: indeed, paging will prevent blowing up your memory, and that of the database.
We should also discover why the memory usage increases so much before going down again. I assume the index building on the resulting data.frame is the culprit here. Maybe there is a better type in which a result set can first be stored, so that capturing it is faster.

@PieterjanVerhelst

Just FYI: I am running RStudio with the code I posted above. It has been running since 8:45...

I eventually ran into an error...


PietrH commented Dec 1, 2023

As far as I can see now, the query itself is done quite fast, but it is the readout that takes a long time and uses a lot of resources. Previously I was using an RStudio installation on a server which has more than enough resources. But when I installed RStudio locally and connect through an ODBC connection, I get a slow readout. The memory used builds up to 7 GiB, after which it drops back to 1.3 GiB. If you are already using a large portion of your memory, this will result in either waiting a very long time before consuming it all, or an out-of-memory error.

Proposed solutions: indeed, paging will prevent blowing up your memory, and that of the database. We should also discover why the memory usage increases so much before going down again. I assume the index building on the resulting data.frame is the culprit here. Maybe there is a better type in which a result set can first be stored, so that capturing it is faster.

I'm able to get '2013_albertkanaal' to work via the built-in paging in DBI. It's still slow, and I can't keep the object in memory without RStudio crashing, but we can get around that by writing to a temporary file and reading it again when done.

However, this still used a lot of memory (several GB), so I'm not sure how well it'll scale to even bigger queries. I'll experiment a bit further and publish a branch so my work isn't lost.
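
Roughly, the temp-file approach looks like this (a sketch only, not the branch code; my_con and query as in the reproduction example above, and the chunk size is arbitrary):

# Sketch of chunked fetching with DBI, spooling each chunk to a temp CSV.
res <- DBI::dbSendQuery(my_con, query)
tmp <- tempfile(fileext = ".csv")

first_chunk <- TRUE
while (!DBI::dbHasCompleted(res)) {
  chunk <- DBI::dbFetch(res, n = 10000)                 # 10,000 rows at a time
  readr::write_csv(chunk, tmp, append = !first_chunk)   # header only once
  first_chunk <- FALSE
}
DBI::dbClearResult(res)

detections <- readr::read_csv(tmp, show_col_types = FALSE)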


PietrH commented Dec 1, 2023

I've split up sending the query to the database from fetching the result, and the result is also fetched in pages. This speeds up a query like 2013_albertkanaal to under 1 minute (a great improvement!); however, a query like the one from PieterjanVerhelst still crashes before it even starts fetching the result.

I suppose we could split up the animal_project_codes and query the database once per animal_project_code, then combine the results, but I'm not sure this is such a great solution.


PietrH commented Dec 1, 2023

I haven't got a good solution for you yet, @PieterjanVerhelst, but this is getting closer:

# using the parsing branch of etn
animal_projects <- c(
  "2015_phd_verhelst_eel",
  "2012_leopoldkanaal",
  "PTN-Silver-eel-Mondego",
  "SEMP",
  "DAK",
  "EMMN",
  "Noordzeekanaal",
  "2011_Warnow",
  "2017_Fremur",
  "2011_Loire",
  "2019_Grotenete",
  "ESGL",
  "2013_albertkanaal",
  "2004_Gudena",
  "2014_Frome",
  "2014_Nene",
  "life4fish"
)

my_con <- connect_to_etn()

huge_result <-
  purrr::map(animal_projects,
               ~ get_acoustic_detections(animal_project_code = .x,
                                         con = my_con,
                                         scientific_name = "Anguilla anguilla",
                                         progress = TRUE), # progress for individual queries
               .progress = FALSE) # progress for the whole thing

temp_file <- tempfile()

# write the result to a single csv, and read it in again for col parsing
purrr::walk(huge_result, ~readr::write_csv(.x, temp_file, append = TRUE))
huge_result_parsed <- readr::read_csv(temp_file) # probably need to add col_names, handle parsing issues

This is however rather rough, and will not give you a nice sorted file with the right column names like get_acoustic_detections() is supposed to do.

With the experimental code in the parsing branch, we can get results for every individual animal_project_code, but the current setup can't handle the query in a single go. Were you able to do this in one go in the past?

Instead of what I did above, it's probably better to write each project out to its own file, including a header, and then to combine those files using purrr::list_rbind() or a fancy multi-file readr function that might exist. If you want, I could see if I can get that to work for you. Just let me know.
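
Something along these lines (an untested sketch, reusing huge_result and animal_projects from the chunk above; column types are still guessed per file):

# Untested sketch: one CSV per project (each with a header), then combine.
files <- file.path(tempdir(), paste0(animal_projects, ".csv"))
purrr::walk2(huge_result, files, readr::write_csv)
huge_result_parsed <- purrr::map(files, readr::read_csv, show_col_types = FALSE) |>
  purrr::list_rbind()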

@PieterjanVerhelst

@PietrH thanks for the info! Indeed, in the past I was also not able to extract the dataset in one go; I did it in three parts, with each part calling data from a few animal_projects. While I understand that it is currently not possible to download such a large dataset at once, and that a workaround is to download it in pieces, know that it is just a matter of time before this 'dataset size issue' becomes a more frequent problem. We need something that can handle big and very big data.


PietrH commented Dec 4, 2023

@PieterjanVerhelst I completely agree that we need to look ahead to systems that can handle much more data than we currently have.

With the weekend behind me, I had a second look and have gotten your query down to just under 5 minutes. This is how I do it:

# open connection
my_con <- connect_to_etn()

# list projects to query
animal_projects <- c(
  "2015_phd_verhelst_eel",
  "2012_leopoldkanaal",
  "PTN-Silver-eel-Mondego",
  "SEMP",
  "DAK",
  "EMMN",
  "Noordzeekanaal",
  "2011_Warnow",
  "2017_Fremur",
  "2011_Loire",
  "2019_Grotenete",
  "ESGL",
  "2013_albertkanaal",
  "2004_Gudena",
  "2014_Frome",
  "2014_Nene",
  "life4fish"
)

# create temp folder to write to
dir_to_write_to <- tempdir()

# query the projects one by one, save them as csv files into temp folder
for(current_animal_project in animal_projects){
  message(paste("working on", current_animal_project))
  data.table::fwrite(
    get_acoustic_detections(animal_project_code = current_animal_project,
                          con = my_con,
                          scientific_name = "Anguilla anguilla",
                          progress = TRUE),
    file.path(dir_to_write_to, paste0(current_animal_project, ".csv"))
  )
}

# read the csv files from the temp folder into a single data.frame
anguilla_detections <-
  vroom::vroom(fs::dir_ls(path = dir_to_write_to, glob="*csv"),
               show_col_types = FALSE)

# inspect the results
dplyr::glimpse(anguilla_detections)

This is what this looks like in the console (video is just under 5 minutes):

etn-paging.mp4

@Stijn-VLIZ

My first thought here is: what are the next data processing steps you are doing?
If you get down from 16,000,000 rows to a couple of hundred thousand in those next steps, it might be interesting to do that immediately when requesting the data; this will always be faster.
It depends, but I think minimizing the returned dataset is always better.
We can do all sorts of grouping and associated aggregations, but then we should understand what it is you need.
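
As one illustration of pushing work to the database instead of downloading raw rows (an editorial example only; the grouping and filter are arbitrary, and my_con is a connection as elsewhere in this thread):

# Illustration only: aggregate on the database side instead of downloading
# every raw detection row; grouping chosen arbitrarily for the example.
daily_counts <- DBI::dbGetQuery(my_con, "
  SELECT
    deployment_station_name AS station_name,
    date_trunc('day', det.datetime) AS day,
    count(*) AS n_detections
  FROM acoustic.detections_limited AS det
  WHERE LOWER(animal_project_code) IN ('2013_albertkanaal')
  GROUP BY 1, 2
  ORDER BY 1, 2
")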

@PieterjanVerhelst

When I personally check the data, I always start from the raw data and perform various aggregations, creating numerous summaries to understand what is in the data. It is only after that step that one can identify which aggregation she or he needs. Therefore, I still believe we need access to the raw data before performing aggregations.


PietrH commented Dec 12, 2023

@Stijn-VLIZ I reran a test on 2013_albertkanaal today; it worked one out of three times, running out of memory on the other two attempts. It still uses around 10 GB of memory while doing so. The server was much busier today than it was on the Friday I developed the paging on, and only around 11 GB of the 32 GB of RAM on the server was available.

In the paging branch we are not reading the whole result at once but 1000 rows at a time, writing those out to a temp file, and then, as a last step, reading that temp file into memory. I've never seen any failures on reading this temp file; the failures always occur because the database itself runs out of memory.

EDIT: Let's discuss this; I've sent you an email to find a timeslot to meet.


PietrH commented Apr 26, 2024

@PieterjanVerhelst More RAM was added to the server (months ago); does your query now work?


PieterjanVerhelst commented Apr 28, 2024

@PietrH data from 2013_albertkanaal was retrieved in just a few minutes 👍 !

Update: also the following code returns data 👏 !

animal_projects <- c("2015_phd_verhelst_eel", 
                     "2012_leopoldkanaal",
                     "PTN-Silver-eel-Mondego",
                     "SEMP",
                     "DAK",
                     "EMMN",
                     "Noordzeekanaal",
                     "2011_Warnow",
                     "2017_Fremur",
                     "2011_Loire",
                     "2019_Grotenete",
                     "ESGL",
                     "2013_albertkanaal",
                     "2004_Gudena",
                     "2014_Frome",
                     "2014_Nene",
                     "life4fish"
)

# Download detection data ####
data <- get_acoustic_detections(my_con, scientific_name = "Anguilla anguilla",
                                animal_project_code = animal_projects,
                                limit = FALSE)
