OpenCPU approach #242

Open
2 of 4 tasks
peterdesmet opened this issue Mar 8, 2022 · 27 comments

@peterdesmet
Member

peterdesmet commented Mar 8, 2022

Summary of March 8 meeting:

Create two "flavours" of all functions in the etn package: local and remote. Function names remain the same; the difference lies in the con argument. For the local flavour, con is a database connection; for the remote flavour, con holds the credentials to be passed via OpenCPU. The intent is to keep a single R package.

  • Users on rstudio.lifewatch.be use the local flavour (and connect directly to database)
  • Package on Docker will use the local flavour to connect to database. All functions will be exposed as OpenCPU endpoints (unless the access is closed via Apache).
  • Anyone can use the remote flavour and connect via OpenCPU.

For remote access, bandwidth and file size might become an issue. Potential solutions:

  • Transfer data as binary feather files, read with arrow::read_feather(). Feather files are not compressed, so there is no gain here over csv. The most compressed format is very likely rda, so we plan to use that.
  • Use paging.
  • Only offer download_acoustic_detections() (rather than get_acoustic_detections()) so users only have to transfer huge amounts of data once.
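
To get a feel for the file-size question, here is a minimal sketch in base R that writes the same table as csv, rds and rda and compares the sizes. The table and its column names are made up for illustration, and the actual gain will depend on the real detections data.

```r
# Simulated detections table; the columns are hypothetical, not the etn schema.
set.seed(1)
n <- 10000
detections <- data.frame(
  detection_id = seq_len(n),
  receiver_id = sample(c("VR2W-0001", "VR2W-0002", "VR2W-0003"), n, replace = TRUE),
  date_time = as.POSIXct("2022-03-08", tz = "UTC") + sort(sample.int(1e6, n)),
  stringsAsFactors = FALSE
)

csv_file <- tempfile(fileext = ".csv")
rds_file <- tempfile(fileext = ".rds")
rda_file <- tempfile(fileext = ".rda")

# Write the same data in three formats, gzip-compressed where supported.
write.csv(detections, csv_file, row.names = FALSE)
saveRDS(detections, rds_file, compress = "gzip")
save(detections, file = rda_file, compress = "gzip")

# File sizes in bytes.
sizes <- c(
  csv = file.size(csv_file),
  rds = file.size(rds_file),
  rda = file.size(rda_file)
)
print(sizes)

# The rds round trip returns the original object, column classes included.
roundtrip <- readRDS(rds_file)
```

Both rds and rda use R's serialization, so they also keep column classes, which csv does not.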

Next steps:

  • Get etn package working on docker, including DB connection. by @salvafern
  • Make some of the OpenCPU endpoints public. by @salvafern
  • Test endpoints to design remote flavour of functions. by @peterdesmet
  • Update package to make it work fully remotely. by @peterdesmet
@peterdesmet peterdesmet self-assigned this Mar 8, 2022
salvafern pushed a commit that referenced this issue Mar 9, 2022
@salvafern
Collaborator

In 34254e2, I slightly changed one function so that it connects to the ETN database directly when you provide your username and password.

Now it works fine with OpenCPU! So it seems that we have to go for this solution instead of using the connection object. I hope this is ok for you, @peterdesmet?

@peterdesmet
Member Author

Great! For backwards compatibility, I suggest we keep using the con variable as a single parameter, e.g. as a list with:

con <- list(
  username = "x",
  password = "y"
)

That avoids shifting the position of parameters in functions:

get_animals(my_con, 305)
# still calls:
get_animals(con = my_con, animal_id = 305)
# rather than:
get_animals(username = my_con, password = 305)

salvafern pushed a commit that referenced this issue Mar 10, 2022
@salvafern
Collaborator

I gave it a try in 8f93f5a. It works, but you have to be careful with the encoding of the = (equals) symbol (see opencpu/opencpu#110).

This doesn't work:

curl -L http://localhost:8004/ocpu/library/etn/R/list_animal_ids/ -X POST -d "con=list(username='salvador.fernandez@vliz.be', password='mypassword')"
# Unparsable argument: list(username

This does:

curl -L http://localhost:8004/ocpu/library/etn/R/list_animal_ids/ -X POST -d "con=list(username%3D'salvador.fernandez@vliz.be', password%3D'mypassword')"
# /ocpu/tmp/x0829c6393d9cea/R/.val
# /ocpu/tmp/x0829c6393d9cea/R/list_animal_ids
# /ocpu/tmp/x0829c6393d9cea/stdout
# /ocpu/tmp/x0829c6393d9cea/source
# /ocpu/tmp/x0829c6393d9cea/console
# /ocpu/tmp/x0829c6393d9cea/info
# /ocpu/tmp/x0829c6393d9cea/files/DESCRIPTION

See also the opencpu documentation about passing arguments: https://www.opencpu.org/api.html#api-arguments

I haven't tested this in R, but I think passing the arguments through utils::URLencode() will work.
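
A minimal sketch of that encoding step in R, with placeholder credentials. One thing to watch: with its default `reserved = FALSE`, `utils::URLencode()` leaves `=` untouched, so `reserved = TRUE` is needed to get the `%3D` form used in the working curl call above.

```r
# Placeholder credentials, for illustration only.
con_arg <- "list(username='user@example.org', password='mypassword')"

# Default behaviour: reserved characters such as '=' are NOT encoded.
default_encoded <- utils::URLencode(con_arg)

# reserved = TRUE percent-encodes every character outside A-Z, a-z, 0-9 and ._~-
encoded <- utils::URLencode(con_arg, reserved = TRUE)

# Only the value is encoded; the '=' separating name and value stays literal.
post_body <- paste0("con=", encoded)
print(post_body)
```

Note that `URLencode()` returns its input unchanged when it already contains a `%XX` sequence, unless `repeated = TRUE` is set.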

@peterdesmet
Member Author

Ok great! Within the function(s) we can URLencode all parameters before calling the OpenCPU endpoint.

We can also extend con now to contain a remote property:

con <- list(
  user = "x",
  password = "y",
  remote = TRUE
)

# isTRUE() also handles a missing (NULL) remote element, treating it as local
if (isTRUE(con$remote)) {
  # use OpenCPU (with URL-encoded parameters)
} else {
  # use local DB connection
}

@peterdesmet
Member Author

@salvafern I would like to implement the OpenCPU functionality over the summer. Are all the ETN package endpoints available in OpenCPU now?

@peterdesmet peterdesmet added this to the Summer 2022 milestone Jun 16, 2022
@salvafern
Collaborator

Hi @peterdesmet, we are working on it and they will be ready as soon as possible. I will let you know.

@salvafern
Collaborator

The etn package is available at: https://opencpu.lifewatch.be/

@peterdesmet
Member Author

I'm getting a 403 error for https://opencpu.lifewatch.be/

@salvafern
Collaborator

The access is forbidden for internet browsers. Try with curl or from R.

@damianooldoni
Member

Thanks @salvafern. Indeed, connection can be established via R (package curl):

curl::curl(url = "https://opencpu.lifewatch.be/")
A connection with                                           
description "https://opencpu.lifewatch.be/"
class       "curl"                         
mode        "r"                            
text        "text"                         
opened      "closed"                       
can read    "yes"                          
can write   "no" 

@PietrH
Member

PietrH commented Dec 9, 2022

I gave it a try in 8f93f5a. It works, but you have to be careful with the encoding of the = (equals) symbol (see opencpu/opencpu#110).

This doesn't work:

curl -L http://localhost:8004/ocpu/library/etn/R/list_animal_ids/ -X POST -d "con=list(username='salvador.fernandez@vliz.be', password='mypassword')"
# Unparsable argument: list(username

This does:

curl -L http://localhost:8004/ocpu/library/etn/R/list_animal_ids/ -X POST -d "con=list(username%3D'salvador.fernandez@vliz.be', password%3D'mypassword')"
# /ocpu/tmp/x0829c6393d9cea/R/.val
# /ocpu/tmp/x0829c6393d9cea/R/list_animal_ids
# /ocpu/tmp/x0829c6393d9cea/stdout
# /ocpu/tmp/x0829c6393d9cea/source
# /ocpu/tmp/x0829c6393d9cea/console
# /ocpu/tmp/x0829c6393d9cea/info
# /ocpu/tmp/x0829c6393d9cea/files/DESCRIPTION

See also the opencpu documentation about passing arguments: https://www.opencpu.org/api.html#api-arguments

I haven't tested in R but I think it will be fine with passing the arguments through utils::URLencode()

Does this method expose the credentials to anyone on the network? Or are they already encrypted somehow this way?

@bart-v

bart-v commented Dec 9, 2022

Since opencpu.lifewatch.be uses HTTPS by default, the credentials are encrypted in transit.

@PietrH
Member

PietrH commented Dec 9, 2022

Excellent.

Is /ocpu/tmp exposed? I'm getting a 403 on both the https://opencpu.lifewatch.be/tmp and https://opencpu.lifewatch.be/ocpu/tmp paths. I can POST function calls just fine, but not retrieve the results.

The same code works on https://cloud.opencpu.org/ocpu so it might be something in the server setup? Or I might just have the address slightly wrong too. I'm trying to get to https://opencpu.lifewatch.be/tmp/x0715fee402d82f/stdout

@bart-v

bart-v commented Dec 9, 2022

No, /ocpu/tmp is not exposed.
From the tests by @salvafern, this did not seem to be needed.
Why has this changed?

@PietrH
Member

PietrH commented Dec 9, 2022

As I understand section 4.3 of the manual, a user POSTs the function call with its arguments, and the response includes a tmp path from which the user then GETs the response objects. You can also request the function output immediately as a JSON object by adding the /json suffix to the call.

That second option is less attractive to me as some functions return rather large tabular outputs where I'd like a bit more control in the format that they are retrieved, probably rda using gzip compression to reduce server io.

Maybe I'm missing something?

@bart-v

bart-v commented Dec 9, 2022

Yes, we have been using /json all the time.
Can you please start with that?

@PietrH PietrH self-assigned this Dec 9, 2022
@PietrH
Member

PietrH commented Dec 12, 2022

I've adapted list_animal_ids() to list_animal_ids_api(), and it seems to work for me: a61a4bb

Next I'll adapt a more complicated function to work by directly providing username and password as arguments. I was thinking about get_acoustic_detections(), so we can test retrieving tabular data via the API.

@PietrH PietrH mentioned this issue Dec 20, 2022
34 tasks
@PietrH
Member

PietrH commented Feb 10, 2023

After further testing I'd like to argue in favor of exposing /ocpu/tmp:

  • I'd like to preserve the data types of objects, which I can't do using JSON as an intermediary. For example, I can transfer a data.frame, but not its column specifications. The client has no way of knowing what the classes of the columns are supposed to be.
  • Using /ocpu/tmp allows for compression during transfer, speeding things up for both the server and the user for requests like "detections", which can result in big, but easy to compress, response objects.
  • The /json route can still be used in parallel.
  • Some functions output files as side effects, such as etn::write_dwc() and etn::download_acoustic_dataset(), which can't be parsed into JSON and back without losing data.
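
The first point can be illustrated with a small sketch, assuming the jsonlite package is installed (OpenCPU itself relies on it for JSON). The data frame and its column names are invented for the example.

```r
library(jsonlite)

# Hypothetical table with an integer, a factor and a Date column.
animals <- data.frame(
  animal_id = c(305L, 306L),
  sex = factor(c("female", "male")),
  release_date = as.Date(c("2022-03-08", "2022-03-09"))
)

# Round trip through JSON: factor and Date columns come back as character.
via_json <- fromJSON(toJSON(animals))

# Round trip through rds: the object is restored as it was.
rds_file <- tempfile(fileext = ".rds")
saveRDS(animals, rds_file)
via_rds <- readRDS(rds_file)

# Compare column classes after each round trip.
print(sapply(via_json, class))
print(sapply(via_rds, class))
```

A client could re-cast the JSON columns by hand, but it would need to know the intended classes for every function's output.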

@PietrH
Member

PietrH commented Mar 13, 2023

Any opinions @bart-v @salvafern ?

@bart-v

bart-v commented Mar 13, 2023

There are obviously some security issues involved, i.e. people could just "steal" the output of a more privileged user by guessing the session ID.
It seems this is handled by a cleaning cron job.
opencpu/opencpu#194

Can you confirm that the output is written to a random, temporary folder, i.e. /ocpu/tmp/<random>/, and not directly to /ocpu/tmp?

@PietrH
Member

PietrH commented Mar 14, 2023

I can confirm every function call should create a new dir under /ocpu/tmp, for example /ocpu/tmp/x04e894ea2366bd. I've created a gist running on Google Colab to demonstrate:

https://gist.github.com/PietrH/14cdb3cb581a3b835221d8b641e74b51

This demo makes use of the opencpu test api (calling rnorm).

We could clean up these paths at a regular interval. I also believe brute-forcing the keys would be quite a challenge: you'd need to try a lot of keys, with no guarantee about the type of result even if you manage to find a path that's in use. This risk is further mitigated by any protections that might already be in place against denial-of-service attacks.

@bart-v

bart-v commented Mar 14, 2023

OK paths like /tmp/x04e894ea2366bd/ are now exposed

@PietrH
Member

PietrH commented Mar 14, 2023

I'm still getting a 403 on

https://opencpu.lifewatch.be/ocpu/tmp/x010f9753592ec8/R/.val/print

Are subdirectories also exposed? Is there a mistake in my domain?

@bart-v

bart-v commented Mar 14, 2023

The base path is without "ocpu"
So https://opencpu.lifewatch.be/tmp (...)

I thought we only downloaded files and not special paths like .val, etc...

@PietrH
Member

PietrH commented Mar 14, 2023

My apologies for the confusion. After a POST request, the client sends a GET request to one of the paths provided in the POST response body. The most common case will be /tmp/{key}/R/.val, with the requested data type as a suffix, rds in our case. It's my understanding that we'll also be able to use this workflow to get other formats, such as those needed for write_dwc() and download_acoustic_dataset().

For example, you'd GET https://opencpu.lifewatch.be/ocpu/tmp/x010f9753592ec8/R/.val/rds to get an rds stream (compressed), or GET https://opencpu.lifewatch.be/ocpu/tmp/x010f9753592ec8/R/.val/csv or https://opencpu.lifewatch.be/ocpu/tmp/x010f9753592ec8/R/.val/feather (this key might have been voided by the time you read this).

I'm following the opencpu manual, section 4.3: https://opencpu.github.io/server-manual/opencpu-server.pdf

Performing a HTTP POST on a function results in a function call where the HTTP request arguments are mapped to the function call. In OpenCPU, a successful POST requests usually returns a HTTP 201 status, and the response body contains the locations of the output data

The output can then be retrieved using HTTP GET. When calling an R function, the output object is always called .val. However, calling scripts might result in other R objects.

@bart-v

bart-v commented Mar 14, 2023

OK, https://opencpu.lifewatch.be/tmp/x010f9753592ec8/R/.val/rds works now
Remember to drop the /ocpu/
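
A minimal sketch of the second half of that workflow: turning the body of a POST response into the URL to GET. The helper name `result_url()` is hypothetical, the response lines are copied from the curl example earlier in this thread, and the `/ocpu` prefix is dropped as described above for this server.

```r
# Lines from a POST response body (from the curl example earlier in the thread).
post_body <- c(
  "/ocpu/tmp/x0829c6393d9cea/R/.val",
  "/ocpu/tmp/x0829c6393d9cea/stdout",
  "/ocpu/tmp/x0829c6393d9cea/info"
)

# Hypothetical helper: build the URL of the function result in a given format.
result_url <- function(post_body, format = "rds",
                       base = "https://opencpu.lifewatch.be") {
  # The return value of an R function call is always stored as .val
  val_path <- grep("/R/\\.val$", post_body, value = TRUE)[1]
  if (is.na(val_path)) stop("No .val path found in POST response")
  # On opencpu.lifewatch.be results are exposed under /tmp, not /ocpu/tmp
  val_path <- sub("^/ocpu", "", val_path)
  paste0(base, val_path, "/", format)
}

url_rds <- result_url(post_body)
url_csv <- result_url(post_body, format = "csv")
print(url_rds)

# Reading the result would then look something like this (untested assumption,
# requires a live session key):
# result <- readRDS(gzcon(url(url_rds)))
```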

@PietrH
Member

PietrH commented Mar 14, 2023

It's working now! Thanks for all the help. I'll keep you updated with my progress.

@PietrH PietrH modified the milestones: Dev 2022, v3.0.0 Jun 7, 2023