R crashes on large get_acoustic_detections query #278

Open
PietrH opened this issue Mar 28, 2023 · 29 comments
Labels
bug Something isn't working

Comments


PietrH commented Mar 28, 2023

I seem to have run into a bug where etn crashes the whole R session when calling get_acoustic_detections() on an animal_project_code that results in many detections.

This has been replicated by @peterdesmet. From what I understand, get_acoustic_detections() used to work on this animal_project_code.

library(etn)
albert <- get_acoustic_detections(connect_to_etn(), animal_project_code = "2013_ALBERTKANAAL")

Running this code results in:

[screenshot of the R session crash dialog]

I welcome any ideas or speculation on the cause of this bug.

=============================

My session info:

Session info ────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.2.1 (2022-06-23)
 os       Ubuntu 22.10
 system   x86_64, linux-gnu
 ui       RStudio
 language en_US
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/Brussels
 date     2023-03-28
 rstudio  2022.12.0+353 Elsbeth Geranium (desktop)
 pandoc   NA

Packages ────────────────────────────────────────────────────────────────────
 package     * version date (UTC) lib source
 assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.2.1)
 cli           3.6.1   2023-03-23 [1] CRAN (R 4.2.1)
 DBI           1.1.3   2022-06-18 [1] CRAN (R 4.2.1)
 dplyr         1.1.1   2023-03-22 [1] CRAN (R 4.2.1)
 etn           2.0.1   2023-02-10 [1] local
 fansi         1.0.4   2023-01-22 [1] CRAN (R 4.2.1)
 generics      0.1.3   2022-07-05 [1] CRAN (R 4.2.1)
 glue          1.6.2   2022-02-24 [1] CRAN (R 4.2.1)
 hms           1.1.3   2023-03-21 [1] CRAN (R 4.2.1)
 httr          1.4.5   2023-02-24 [1] CRAN (R 4.2.1)
 jsonlite      1.8.4   2022-12-06 [1] CRAN (R 4.2.1)
 lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.2.1)
 magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.2.1)
 pillar        1.9.0   2023-03-22 [1] CRAN (R 4.2.1)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.2.1)
 R6            2.5.1   2021-08-19 [1] CRAN (R 4.2.1)
 readr         2.1.4   2023-02-10 [1] CRAN (R 4.2.1)
 rlang         1.1.0   2023-03-14 [1] CRAN (R 4.2.1)
 rstudioapi    0.14    2022-08-22 [1] CRAN (R 4.2.1)
 sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.2.1)
 tibble        3.2.1   2023-03-20 [1] CRAN (R 4.2.1)
 tidyselect    1.2.0   2022-10-10 [1] CRAN (R 4.2.1)
 tzdb          0.3.0   2022-03-28 [1] CRAN (R 4.2.1)
 utf8          1.2.3   2023-01-31 [1] CRAN (R 4.2.1)
 vctrs         0.6.1   2023-03-22 [1] CRAN (R 4.2.1)
PietrH added the bug label on Mar 28, 2023

PietrH commented Jun 19, 2023

Todo:

  • I've requested a few more examples of user queries that fail or are very slow
  • I will forward the debug view of the failing queries (in SQL) to VLIZ to be investigated

We'll have a look at possible solutions to speed up the detections view: indexing, perhaps implementing pagination.


PieterjanVerhelst commented Jun 19, 2023

@PietrH, I'm doing a meta-analysis on silver eel migration by combining the data of various studies. Downloading all the data at once not only takes a lot of time, it has also given me a 'time out' error in the past, so I had to download the data in 2 or 3 subsets. I am running it right now just to test (it has been running since 10:47 this morning). If you want to try yourself, here is the code:

# Create vector with relevant animal_project_codes
animal_projects <- c("2015_phd_verhelst_eel", 
                     "2012_leopoldkanaal",
                     "PTN-Silver-eel-Mondego",
                     "SEMP",
                     "DAK",
                     "EMMN",
                     "Noordzeekanaal",
                     "2011_Warnow",
                     "2017_Fremur",
                     "2011_Loire",
                     "2019_Grotenete",
                     "ESGL",
                     "2013_albertkanaal",
                     "2004_Gudena",
                     "2014_Frome",
                     "2014_Nene",
                     "life4fish"
)


# Download detection data ####
data <- get_acoustic_detections(my_con, scientific_name = "Anguilla anguilla",
                                animal_project_code = animal_projects,
                                limit = FALSE)

Update: download just crashed, giving the same error as above.

@CLAUMEMO

Hi,
I ran a similar query to PJ's and got this error:

memory allocation error???   ut of memory while reading tuples.;
memory allocation error??? 
<SQL> 'SELECT
  det.id_pk AS detection_id,
  det.datetime AS date_time,
  tag_serial_number AS tag_serial_number, -- exclusive to detections_limited
  det.transmitter AS acoustic_tag_id,
  animal_project_code AS animal_project_code, -- exclusive to detections_limited
  animal_id_pk AS animal_id, -- exclusive to detections_limited
  animal_scientific_name AS scientific_name, -- exclusive to detections_limited
  network_project_code AS acoustic_project_code, -- exclusive to detections_limited
  det.receiver AS receiver_id,
  deployment_station_name AS station_name, -- exclusive to detections_limited
  deployment_latitude AS deploy_latitude, -- exclusive to detections_limited
  deployment_longitude AS deploy_longitude, -- exclusive to detections_limited
  det.sensor_value AS sensor_value,
  det.sensor_unit AS sensor_unit


PietrH commented Jul 17, 2023

Hello Claudia, thank you for including the error message, very helpful! :-)

What was the exact R command you were running?

Currently the package seems to struggle with queries that result in very large R objects; I'm not sure yet whether this is a database or a package problem.

In the meantime you might be able to work around the issue by splitting up your query (see the sketch below), but I'll know more once you let me know exactly what you were doing.
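
To illustrate what I mean by splitting up: a minimal sketch (untested here; the project codes are only an example) that calls get_acoustic_detections() once per animal_project_code and binds the pieces:

# Sketch of the split-query workaround: one call per animal_project_code,
# then bind the pieces together (project codes below are only an example).
library(etn)

con <- connect_to_etn()
projects <- c("2015_phd_verhelst_eel", "2012_leopoldkanaal")

detections <- projects |>
  purrr::map(
    ~ get_acoustic_detections(con,
                              animal_project_code = .x,
                              scientific_name = "Anguilla anguilla")
  ) |>
  dplyr::bind_rows()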


PietrH commented Jul 17, 2023

@bart-v Could the following error be due to the server/ODBC drivers running out of memory?


memory allocation error???   ut of memory while reading tuples.;
memory allocation error??? 
<SQL> 'SELECT
  det.id_pk AS detection_id,
  det.datetime AS date_time,
  tag_serial_number AS tag_serial_number, -- exclusive to detections_limited
  det.transmitter AS acoustic_tag_id,
  animal_project_code AS animal_project_code, -- exclusive to detections_limited
  animal_id_pk AS animal_id, -- exclusive to detections_limited
  animal_scientific_name AS scientific_name, -- exclusive to detections_limited
  network_project_code AS acoustic_project_code, -- exclusive to detections_limited
  det.receiver AS receiver_id,
  deployment_station_name AS station_name, -- exclusive to detections_limited
  deployment_latitude AS deploy_latitude, -- exclusive to detections_limited
  deployment_longitude AS deploy_longitude, -- exclusive to detections_limited
  det.sensor_value AS sensor_value,
  det.sensor_unit AS sensor_unit


bart-v commented Jul 17, 2023

The query is not complete, so who knows...


PietrH commented Jul 17, 2023

@bart-v

This is the query that get_acoustic_detections() is trying to execute in the example PieterjanVerhelst gave:

SELECT
  det.id_pk AS detection_id,
  det.datetime AS date_time,
  tag_serial_number AS tag_serial_number, -- exclusive to detections_limited
  det.transmitter AS acoustic_tag_id,
  animal_project_code AS animal_project_code, -- exclusive to detections_limited
  animal_id_pk AS animal_id, -- exclusive to detections_limited
  animal_scientific_name AS scientific_name, -- exclusive to detections_limited
  network_project_code AS acoustic_project_code, -- exclusive to detections_limited
  det.receiver AS receiver_id,
  deployment_station_name AS station_name, -- exclusive to detections_limited
  deployment_latitude AS deploy_latitude, -- exclusive to detections_limited
  deployment_longitude AS deploy_longitude, -- exclusive to detections_limited
  det.sensor_value AS sensor_value,
  det.sensor_unit AS sensor_unit,
  det.sensor2_value AS sensor2_value,
  det.sensor2_unit AS sensor2_unit,
  det.signal_to_noise_ratio AS signal_to_noise_ratio,
  det.file AS source_file,
  det.qc_flag AS qc_flag,
  det.deployment_fk AS deployment_id
  -- det.transmitter_name
  -- det.transmitter_serial: via tag_device instead
  -- det.station_name: deployment.station_name instead
  -- det.latitude: deployment.deploy_lat instead
  -- det.longitude: deployment.deploy_long instead
  -- det.detection_file_id
  -- det.receiver_serial
  -- det.gain
  -- external_id
FROM acoustic.detections_limited AS det
WHERE
  'True'
  AND 'True'
  AND 'True'
  AND LOWER(animal_project_code) IN ('2015_phd_verhelst_eel', '2012_leopoldkanaal', 'ptn-silver-eel-mondego', 'semp', 'dak', 'emmn', 'noordzeekanaal', '2011_warnow', '2017_fremur', '2011_loire', '2019_grotenete', 'esgl', '2013_albertkanaal', '2004_gudena', '2014_frome', '2014_nene', 'life4fish')
  AND animal_scientific_name IN ('Anguilla anguilla')
  AND 'True'
  AND 'True'
  AND 'True'
  LIMIT ALL}
  

I can replicate Claudia's issue with the following SQL query:

  SELECT
  det.id_pk AS detection_id,
  det.datetime AS date_time,
  tag_serial_number AS tag_serial_number, -- exclusive to detections_limited
  det.transmitter AS acoustic_tag_id,
  animal_project_code AS animal_project_code, -- exclusive to detections_limited
  animal_id_pk AS animal_id, -- exclusive to detections_limited
  animal_scientific_name AS scientific_name, -- exclusive to detections_limited
  network_project_code AS acoustic_project_code, -- exclusive to detections_limited
  det.receiver AS receiver_id,
  deployment_station_name AS station_name, -- exclusive to detections_limited
  deployment_latitude AS deploy_latitude, -- exclusive to detections_limited
  deployment_longitude AS deploy_longitude, -- exclusive to detections_limited
  det.sensor_value AS sensor_value,
  det.sensor_unit AS sensor_unit,
  det.sensor2_value AS sensor2_value,
  det.sensor2_unit AS sensor2_unit,
  det.signal_to_noise_ratio AS signal_to_noise_ratio,
  det.file AS source_file,
  det.qc_flag AS qc_flag,
  det.deployment_fk AS deployment_id
  -- det.transmitter_name
  -- det.transmitter_serial: via tag_device instead
  -- det.station_name: deployment.station_name instead
  -- det.latitude: deployment.deploy_lat instead
  -- det.longitude: deployment.deploy_long instead
  -- det.detection_file_id
  -- det.receiver_serial
  -- det.gain
  -- external_id
FROM acoustic.detections_limited AS det
WHERE
  'True'
  AND 'True'
  AND 'True'
  AND LOWER(animal_project_code) IN ('2013_albertkanaal')
  AND 'True'
  AND 'True'
  AND 'True'
  AND 'True'
  LIMIT ALL}
  

Executed directly via DBI::dbGetQuery()


bart-v commented Jul 17, 2023

16M records.
This is something you should NOT do in one query; it will blow up your database, the result set, and the download.
Please page through the records or use more specific queries, e.g. by defining one project instead of 20.

Member Author

PietrH commented Jul 17, 2023

Wouldn't pagination require sorting the query result first, for every page retrieved? In the past, decisions were made to keep all sorting on the client side. Is there a better way to implement pagination?

I'm still wondering if anything has changed, since the query also fails on single larger datasets such as '2013_albertkanaal'. From what I was told, etn was used to publish this dataset to Zenodo, so I understand something like get_acoustic_detections(connect_to_etn(), animal_project_code = "2013_ALBERTKANAAL") used to be possible.

  SELECT
    det.id_pk AS detection_id,
    det.datetime AS date_time,
    tag_serial_number AS tag_serial_number, -- exclusive to detections_limited
    det.transmitter AS acoustic_tag_id,
    animal_project_code AS animal_project_code, -- exclusive to detections_limited
    animal_id_pk AS animal_id, -- exclusive to detections_limited
    animal_scientific_name AS scientific_name, -- exclusive to detections_limited
    network_project_code AS acoustic_project_code, -- exclusive to detections_limited
    det.receiver AS receiver_id,
    deployment_station_name AS station_name, -- exclusive to detections_limited
    deployment_latitude AS deploy_latitude, -- exclusive to detections_limited
    deployment_longitude AS deploy_longitude, -- exclusive to detections_limited
    det.sensor_value AS sensor_value,
    det.sensor_unit AS sensor_unit,
    det.sensor2_value AS sensor2_value,
    det.sensor2_unit AS sensor2_unit,
    det.signal_to_noise_ratio AS signal_to_noise_ratio,
    det.file AS source_file,
    det.qc_flag AS qc_flag,
    det.deployment_fk AS deployment_id
    -- det.transmitter_name
    -- det.transmitter_serial: via tag_device instead
    -- det.station_name: deployment.station_name instead
    -- det.latitude: deployment.deploy_lat instead
    -- det.longitude: deployment.deploy_long instead
    -- det.detection_file_id
    -- det.receiver_serial
    -- det.gain
    -- external_id
  FROM acoustic.detections_limited AS det
  WHERE
    LOWER(animal_project_code) IN ('2013_albertkanaal')

I believe this query should result in a 6.4M row table.

Member Author

PietrH commented Aug 30, 2023

@bart-v

I believe pagination is the right answer.

  • Do you think a LIMIT OFFSET strategy is the right choice?

I believe there are significant trade-offs: it gets more inefficient as the query result grows, and it's inconsistent due to the random order in which query results are currently returned, so we might get duplication and omission in the results (a rough sketch of LIMIT/OFFSET paging follows after this list).

  • Is it possible for the server to always return the results sorted by an index/key? Currently the order is inconsistent when limiting.

I'm thinking of using cursors or keyset pagination instead of OFFSET LIMIT to get around the inconsistency problem and to reduce server load.

  • Is there anything else server side that needs to change for us to experiment with pagination?
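
For concreteness, a rough sketch of what a LIMIT/OFFSET paging loop could look like from R (an illustration only, not the etn implementation; only two columns are selected, the page size is arbitrary, and the ORDER BY is what would make the pages consistent):

# Rough sketch of LIMIT/OFFSET paging (illustration only, not the etn code).
# Without an ORDER BY, PostgreSQL gives no ordering guarantee, so pages could
# overlap or miss rows; ordering on a key makes the paging deterministic.
page_size <- 100000L
offset <- 0L
pages <- list()
repeat {
  page <- DBI::dbGetQuery(con, glue::glue_sql("
    SELECT det.id_pk AS detection_id, det.datetime AS date_time
    FROM acoustic.detections_limited AS det
    WHERE LOWER(animal_project_code) IN ('2013_albertkanaal')
    ORDER BY det.id_pk
    LIMIT {page_size} OFFSET {offset}", .con = con))
  if (nrow(page) == 0) break
  pages[[length(pages) + 1]] <- page
  offset <- offset + page_size
}
detections <- dplyr::bind_rows(pages)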


bart-v commented Aug 30, 2023

  • Yes, LIMIT OFFSET is perfect
  • No, you need to sort yourself, in the query
  • Server side: indexes might be needed on the sort field or other fields used in the filter

Member Author

PietrH commented Aug 31, 2023

@bart-v

  • We will need an index on acoustic.detections_limited at least on id_pk, but preferably also on datetime, animal_project_code and network_project_code, and maybe also on other commonly used filter fields.
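
Purely as an illustration of the kind of indexes meant here (the detections_limited view itself can't carry an index, so these would go on whatever base table backs it; acoustic.detections is an assumed name, and this would be run server-side by VLIZ, not through etn):

# Illustration only: index names and the base table 'acoustic.detections'
# are assumptions; these statements would be run by the database admins.
DBI::dbExecute(con, "CREATE INDEX IF NOT EXISTS detections_id_pk_idx
                       ON acoustic.detections (id_pk)")
DBI::dbExecute(con, "CREATE INDEX IF NOT EXISTS detections_animal_project_idx
                       ON acoustic.detections (LOWER(animal_project_code))")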

I've implemented paging on the following query:

SELECT
  det.id_pk AS detection_id,
  det.datetime AS date_time,
  tag_serial_number AS tag_serial_number, -- exclusive to detections_limited
  det.transmitter AS acoustic_tag_id,
  animal_project_code AS animal_project_code, -- exclusive to detections_limited
  animal_id_pk AS animal_id, -- exclusive to detections_limited
  animal_scientific_name AS scientific_name, -- exclusive to detections_limited
  network_project_code AS acoustic_project_code, -- exclusive to detections_limited
  det.receiver AS receiver_id,
  deployment_station_name AS station_name, -- exclusive to detections_limited
  deployment_latitude AS deploy_latitude, -- exclusive to detections_limited
  deployment_longitude AS deploy_longitude, -- exclusive to detections_limited
  det.sensor_value AS sensor_value,
  det.sensor_unit AS sensor_unit,
  det.sensor2_value AS sensor2_value,
  det.sensor2_unit AS sensor2_unit,
  det.signal_to_noise_ratio AS signal_to_noise_ratio,
  det.file AS source_file,
  det.qc_flag AS qc_flag,
  det.deployment_fk AS deployment_id
  -- det.transmitter_name
  -- det.transmitter_serial: via tag_device instead
  -- det.station_name: deployment.station_name instead
  -- det.latitude: deployment.deploy_lat instead
  -- det.longitude: deployment.deploy_long instead
  -- det.detection_file_id
  -- det.receiver_serial
  -- det.gain
  -- external_id
FROM acoustic.detections_limited AS det
WHERE
LOWER(network_project_code) IN ('demer', 'dijle')

This is lifted from one of the old unit tests, so I know it used to work. Currently this query fails without paging and takes 29.8 minutes with paging. I used keyset pagination on id_pk, assuming it would be indexed, but given how slow it is, I suspect it isn't.

Half an hour is unacceptably slow for a query like get_acoustic_detections(acoustic_project_code = c("demer", "dijle")), which not only used to work, but did so quickly enough to be used in unit tests.
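
For reference, a sketch of what keyset pagination on id_pk looks like in this context (not the actual branch code; only two columns are selected, the page size is arbitrary, and it assumes id_pk is a unique, positive, increasing key):

# Sketch of keyset pagination on id_pk (not the actual branch code).
fetch_keyset_pages <- function(con, page_size = 100000L) {
  pages <- list()
  last_id <- 0L                      # assumes id_pk values are positive
  repeat {
    page <- DBI::dbGetQuery(con, glue::glue_sql("
      SELECT det.id_pk AS detection_id, det.datetime AS date_time
      FROM acoustic.detections_limited AS det
      WHERE LOWER(network_project_code) IN ('demer', 'dijle')
        AND det.id_pk > {last_id}
      ORDER BY det.id_pk
      LIMIT {page_size}", .con = con))
    if (nrow(page) == 0) break
    pages[[length(pages) + 1]] <- page
    last_id <- max(page$detection_id)
  }
  dplyr::bind_rows(pages)
}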


bart-v commented Aug 31, 2023

OK, we'll need some time to investigate and implement

@Stijn-VLIZ

We made some changes to the underlying view; can you verify whether there is a major performance improvement?

Member Author

PietrH commented Nov 30, 2023

@Stijn-VLIZ

I'm getting this error:

Error: nanodbc/nanodbc.cpp:1769: 00000: Out of memory while reading tuples.;
memory allocation error???   ut of memory while reading tuples.;
memory allocation error??? 
<SQL> '  SELECT
    det.id_pk AS detection_id,
    det.datetime AS date_time,
    tag_serial_number AS tag_serial_number, -- exclusive to detections_limited
    det.transmitter AS acoustic_tag_id,
    animal_project_code AS animal_project_code, -- exclusive to detections_limited
    animal_id_pk AS animal_id, -- exclusive to detections_limited
    animal_scientific_name AS scientific_name, -- exclusive to detections_limited
    network_project_code AS acoustic_project_code, -- exclusive to detections_limited
    det.receiver AS receiver_id,
    deployment_station_name AS station_name, -- exclusive to detections_limited
    deployment_latitude AS deploy_latitude, -- exclusive to detections_limited
    deployment_longitude AS deploy_longitude, -- exclusive to detections_limited
    det.sensor_value AS sensor_value,

On this query, for a single animal project code:

  SELECT
    det.id_pk AS detection_id,
    det.datetime AS date_time,
    tag_serial_number AS tag_serial_number, -- exclusive to detections_limited
    det.transmitter AS acoustic_tag_id,
    animal_project_code AS animal_project_code, -- exclusive to detections_limited
    animal_id_pk AS animal_id, -- exclusive to detections_limited
    animal_scientific_name AS scientific_name, -- exclusive to detections_limited
    network_project_code AS acoustic_project_code, -- exclusive to detections_limited
    det.receiver AS receiver_id,
    deployment_station_name AS station_name, -- exclusive to detections_limited
    deployment_latitude AS deploy_latitude, -- exclusive to detections_limited
    deployment_longitude AS deploy_longitude, -- exclusive to detections_limited
    det.sensor_value AS sensor_value,
    det.sensor_unit AS sensor_unit,
    det.sensor2_value AS sensor2_value,
    det.sensor2_unit AS sensor2_unit,
    det.signal_to_noise_ratio AS signal_to_noise_ratio,
    det.file AS source_file,
    det.qc_flag AS qc_flag,
    det.deployment_fk AS deployment_id
    -- det.transmitter_name
    -- det.transmitter_serial: via tag_device instead
    -- det.station_name: deployment.station_name instead
    -- det.latitude: deployment.deploy_lat instead
    -- det.longitude: deployment.deploy_long instead
    -- det.detection_file_id
    -- det.receiver_serial
    -- det.gain
    -- external_id
  FROM acoustic.detections_limited AS det
  WHERE
    LOWER(animal_project_code) IN ('2013_albertkanaal')

To reproduce in R:

library(etn)
my_con <- connect_to_etn(username = "pieter.huybrechts@inbo.be",
               password = askpass::askpass())
               
query <- 
  glue::glue_sql(.con = my_con,
                 "
  SELECT
    det.id_pk AS detection_id,
    det.datetime AS date_time,
    tag_serial_number AS tag_serial_number, -- exclusive to detections_limited
    det.transmitter AS acoustic_tag_id,
    animal_project_code AS animal_project_code, -- exclusive to detections_limited
    animal_id_pk AS animal_id, -- exclusive to detections_limited
    animal_scientific_name AS scientific_name, -- exclusive to detections_limited
    network_project_code AS acoustic_project_code, -- exclusive to detections_limited
    det.receiver AS receiver_id,
    deployment_station_name AS station_name, -- exclusive to detections_limited
    deployment_latitude AS deploy_latitude, -- exclusive to detections_limited
    deployment_longitude AS deploy_longitude, -- exclusive to detections_limited
    det.sensor_value AS sensor_value,
    det.sensor_unit AS sensor_unit,
    det.sensor2_value AS sensor2_value,
    det.sensor2_unit AS sensor2_unit,
    det.signal_to_noise_ratio AS signal_to_noise_ratio,
    det.file AS source_file,
    det.qc_flag AS qc_flag,
    det.deployment_fk AS deployment_id
    -- det.transmitter_name
    -- det.transmitter_serial: via tag_device instead
    -- det.station_name: deployment.station_name instead
    -- det.latitude: deployment.deploy_lat instead
    -- det.longitude: deployment.deploy_long instead
    -- det.detection_file_id
    -- det.receiver_serial
    -- det.gain
    -- external_id
  FROM acoustic.detections_limited AS det
  WHERE
    LOWER(animal_project_code) IN ('2013_albertkanaal')
                 ")

# Option 1: fetch the full result in one go
detections <- DBI::dbGetQuery(my_con, query)

# Option 2: send the query, then fetch and clear the result in separate steps
result <- DBI::dbSendQuery(my_con, query, immediate = FALSE)
DBI::dbFetch(result)
DBI::dbClearResult(result)

I will continue testing

@PieterjanVerhelst

Just FYI: I am running RStudio with the code I posted above. It has been running since 8:45...

Member Author

PietrH commented Nov 30, 2023

I'm still crashing RStudio Server via get_acoustic_detections(connect_to_etn(), animal_project_code = "2013_ALBERTKANAAL")

@Stijn-VLIZ

As far as I can see now, the query itself is done quite fast, but it is the readout that takes a long time and uses a lot of resources.
Previously I was using an RStudio installation on a server which has more than enough resources.
But when I installed RStudio locally and connect through an ODBC connection, I get a slow readout.
The memory used builds up to 7 GiB, after which it drops back to 1.3 GiB.
If you are already using a large portion of your memory, this will result in either waiting a very long time before consuming it all, or an out-of-memory error.

Proposed solutions: indeed, paging will prevent blowing up your memory, and that of the database.
We should also discover why the memory usage increases so much before going down again. I assume the index building on the resulting data.frame is the culprit here. Maybe there is a better type in which a result set can first be stored, so that capturing it is faster.

@PieterjanVerhelst

Just FYI: I am running RStudio with the code I posted above. It has been running since 8:45...

I eventually ran into an error...


PietrH commented Dec 1, 2023

As far as I can see now, the query itself is done quite fast, but it is the readout that takes a long time and uses a lot of resources. Previously I was using an RStudio installation on a server which has more than enough resources. But when I installed RStudio locally and connect through an ODBC connection, I get a slow readout. The memory used builds up to 7 GiB, after which it drops back to 1.3 GiB. If you are already using a large portion of your memory, this will result in either waiting a very long time before consuming it all, or an out-of-memory error.

Proposed solutions: indeed, paging will prevent blowing up your memory, and that of the database. We should also discover why the memory usage increases so much before going down again. I assume the index building on the resulting data.frame is the culprit here. Maybe there is a better type in which a result set can first be stored, so that capturing it is faster.

I'm able to get '2013_albertkanaal' to work via the built-in paging in DBI. It's still slow, and I can't keep the object in memory without RStudio crashing, but we can get around that by writing to a temporary file and reading it again when done.

However, this still used a lot of memory (several GB), so I'm not sure how well it'll scale to even bigger queries. I'll experiment a bit further and publish a branch so my work isn't lost.
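
Roughly, the temp-file approach looks like this (a sketch only, not the branch code; my_con and query as in the reproduction example above, and the chunk size is arbitrary):

# Sketch of chunked fetching with DBI, spooling each chunk to a temp CSV.
res <- DBI::dbSendQuery(my_con, query)
tmp <- tempfile(fileext = ".csv")

first_chunk <- TRUE
while (!DBI::dbHasCompleted(res)) {
  chunk <- DBI::dbFetch(res, n = 10000)                 # 10,000 rows at a time
  readr::write_csv(chunk, tmp, append = !first_chunk)   # header only once
  first_chunk <- FALSE
}
DBI::dbClearResult(res)

detections <- readr::read_csv(tmp, show_col_types = FALSE)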


PietrH commented Dec 1, 2023

I've split up sending the query to the database from fetching the result, and the result is also fetched in pages. This speeds up a query like 2013_albertkanaal to under 1 minute (a great improvement!); however, a query like the one from PieterjanVerhelst still crashes before it even starts fetching the result.

I suppose we could split up the animal_project_codes and query the database once per animal_project_code, then combine the results, but I'm not sure this is such a great solution.


PietrH commented Dec 1, 2023

I haven't got a good solution for you yet, @PieterjanVerhelst, but this is getting closer:

# using the parsing branch of etn
animal_projects <- c(
  "2015_phd_verhelst_eel",
  "2012_leopoldkanaal",
  "PTN-Silver-eel-Mondego",
  "SEMP",
  "DAK",
  "EMMN",
  "Noordzeekanaal",
  "2011_Warnow",
  "2017_Fremur",
  "2011_Loire",
  "2019_Grotenete",
  "ESGL",
  "2013_albertkanaal",
  "2004_Gudena",
  "2014_Frome",
  "2014_Nene",
  "life4fish"
)

my_con <- connect_to_etn()

huge_result <-
  purrr::map(animal_projects,
               ~ get_acoustic_detections(animal_project_code = .x,
                                         con = my_con,
                                         scientific_name = "Anguilla anguilla",
                                         progress = TRUE), # progress for individual queries
               .progress = FALSE) # progress for the whole thing

temp_file <- tempfile()

# write the result to a single csv, and read it in again for col parsing
purrr::walk(huge_result, ~readr::write_csv(.x, temp_file, append = TRUE))
huge_result_parsed <- readr::read_csv(temp_file) # probably need to add col_names, handle parsing issues

This is however rather rough, and will not give you a nice sorted file with the right column names like get_acoustic_detections() is supposed to do.

With the experimental code in the parsing branch, we can get results for every individual animal_project_code, but the current setup can't handle the query in a single go. Were you able to do this in one go in the past?

Instead of what I did above, it's probably better to write each project out to its own file, including a header, and then to combine those files using purrr::list_rbind() or a fancy multi-file readr function that might exist. If you want, I could see if I can get that to work for you. Just let me know.
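
Something along these lines (an untested sketch, reusing huge_result and animal_projects from the chunk above; column types are still guessed per file):

# Untested sketch: one CSV per project (each with a header), then combine.
files <- file.path(tempdir(), paste0(animal_projects, ".csv"))
purrr::walk2(huge_result, files, readr::write_csv)
huge_result_parsed <- purrr::map(files, readr::read_csv, show_col_types = FALSE) |>
  purrr::list_rbind()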

@PieterjanVerhelst

@PietrH thanks for the info! Indeed, in the past I was also not able to extract the dataset in one go; I did it in three parts, with each part calling data from a few animal_projects. While I understand that it is currently not possible to download such a large dataset at once, and that a workaround is to download it in pieces, know that it is just a matter of time before this 'dataset size issue' becomes a more frequent problem. We need something that can handle big and very big data.


PietrH commented Dec 4, 2023

@PieterjanVerhelst I completely agree that we need to look ahead to systems that can handle much more data than we currently have.

With the weekend behind me, I had a second look and have gotten your query down to just under 5 minutes. This is how I do it:

# open connection
my_con <- connect_to_etn()

# list projects to query
animal_projects <- c(
  "2015_phd_verhelst_eel",
  "2012_leopoldkanaal",
  "PTN-Silver-eel-Mondego",
  "SEMP",
  "DAK",
  "EMMN",
  "Noordzeekanaal",
  "2011_Warnow",
  "2017_Fremur",
  "2011_Loire",
  "2019_Grotenete",
  "ESGL",
  "2013_albertkanaal",
  "2004_Gudena",
  "2014_Frome",
  "2014_Nene",
  "life4fish"
)

# create temp folder to write to
dir_to_write_to <- tempdir()

# query the projects one by one, save them as csv files into temp folder
for(current_animal_project in animal_projects){
  message(paste("working on", current_animal_project))
  data.table::fwrite(
    get_acoustic_detections(animal_project_code = current_animal_project,
                          con = my_con,
                          scientific_name = "Anguilla anguilla",
                          progress = TRUE),
    file.path(dir_to_write_to, paste0(current_animal_project, ".csv"))
  )
}

# read the csv files from the temp folder into a single data.frame
anguilla_detections <-
  vroom::vroom(fs::dir_ls(path = dir_to_write_to, glob="*csv"),
               show_col_types = FALSE)

# inspect the results
dplyr::glimpse(anguilla_detections)

This is what this looks like in the console (video is just under 5 minutes):

etn-paging.mp4

@Stijn-VLIZ

My first thought here is: what are the next data processing steps you are doing?
If you get down from 16,000,000 rows to a couple of hundred thousand in those next steps, it might be interesting to do that immediately when requesting the data; this will always be faster.
It depends, but I think minimizing the returned dataset is always better.
We can do all sorts of grouping and associated aggregations, but then we should understand what it is you need.
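
As one illustration of pushing work to the database instead of downloading raw rows (an editorial example only; the grouping and filter are arbitrary, and my_con is a connection as elsewhere in this thread):

# Illustration only: aggregate on the database side instead of downloading
# every raw detection row; grouping chosen arbitrarily for the example.
daily_counts <- DBI::dbGetQuery(my_con, "
  SELECT
    deployment_station_name AS station_name,
    date_trunc('day', det.datetime) AS day,
    count(*) AS n_detections
  FROM acoustic.detections_limited AS det
  WHERE LOWER(animal_project_code) IN ('2013_albertkanaal')
  GROUP BY 1, 2
  ORDER BY 1, 2
")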

@PieterjanVerhelst

When I personally check the data, I always start from the raw data and perform various aggregations, creating numerous summaries to understand what is in the data. It is only after that step that one can identify which aggregation she or he needs. Therefore, I still believe we need access to the raw data before performing aggregations.


PietrH commented Dec 12, 2023

@Stijn-VLIZ I reran a test on 2013_albertkanaal today; it worked one out of three times, running out of memory on the other two attempts. It still uses around 10 GB of memory while doing so. The server was much busier today than it was on the Friday I developed the paging on, and only around 11 GB of the 32 GB of RAM on the server was available.

In the paging branch we are not reading the whole result at once but 1000 rows at a time, writing those out to a temp file, and then, as a last step, reading that temp file into memory. I've never seen any failures on reading this temp file; the failures always occur because the database itself runs out of memory.

EDIT: Let's discuss this; I've sent you an email to find a timeslot to meet.


PietrH commented Apr 26, 2024

@PieterjanVerhelst More RAM was added to the server (months ago); does your query now work?


PieterjanVerhelst commented Apr 28, 2024

@PietrH data from 2013_albertkanaal was retrieved in just a few minutes 👍 !

Update: also the following code returns data 👏 !

animal_projects <- c("2015_phd_verhelst_eel", 
                     "2012_leopoldkanaal",
                     "PTN-Silver-eel-Mondego",
                     "SEMP",
                     "DAK",
                     "EMMN",
                     "Noordzeekanaal",
                     "2011_Warnow",
                     "2017_Fremur",
                     "2011_Loire",
                     "2019_Grotenete",
                     "ESGL",
                     "2013_albertkanaal",
                     "2004_Gudena",
                     "2014_Frome",
                     "2014_Nene",
                     "life4fish"
)

# Download detection data ####
data <- get_acoustic_detections(my_con, scientific_name = "Anguilla anguilla",
                                animal_project_code = animal_projects,
                                limit = FALSE)
