difuture-lmu/dsBinVal

R-CMD-check License: LGPL v3 codecov DOI

ROC-GLM and Calibration for DataSHIELD

The package provides functionality to conduct and visualize ROC analysis and calibration on decentralized data. It builds on the DataSHIELD infrastructure for distributed computing and calculates the ROC-GLM with AUC confidence intervals as well as calibration curves and the Brier score. Calculating the ROC-GLM or assessing calibration requires pushing a model to the servers and computing its predictions there; the package provides functions for both steps.

Note that DataSHIELD uses privacy filters from DataSHIELD v5 onwards, and these filters are also used in this package. Additionally, the package uses the old option datashield.privacyLevel (the minimal number of values required to allow sharing an aggregation) as a fallback. Instead of setting the option, the fallback privacy level is retrieved directly from the DESCRIPTION file each time a function needs it. The fallback is set to 5 by default. The methodology of the package is explained in detail here.
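To illustrate what a privacy level of 5 means in practice, the following is a minimal local sketch in base R. The helper `share_mean()` and the variable `privacy_level` are hypothetical names chosen for this illustration and are not part of dsBinVal or DataSHIELD; the package's actual filters may work differently.

```r
# Assumed fallback privacy level (the README states a default of 5).
privacy_level <- 5L

# Hypothetical helper: return an aggregate only if it is based on
# at least `min_n` non-missing values, otherwise return NA.
share_mean <- function(x, min_n = privacy_level) {
  n <- sum(!is.na(x))
  if (n < min_n) {
    return(NA_real_)  # too few values: the aggregate is not shared
  }
  mean(x, na.rm = TRUE)
}

share_mean(c(0.1, 0.4, 0.2))            # fewer than 5 values
share_mean(c(0.1, 0.4, 0.2, 0.3, 0.5))  # exactly 5 values
```

The first call is blocked because only three values are available, while the second call returns the mean of its five values.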

Installation

At the moment, there is no CRAN version available. Install the development version from GitHub:

remotes::install_github("difuture-lmu/dsBinVal")

Register methods

It is necessary to register the assign and aggregate methods in the OPAL administration. These methods are registered automatically when publishing the package on OPAL (see DESCRIPTION).

Note that the package needs to be installed at both locations: the server and the analyst's machine.

Installation on DataSHIELD

There are two options. The first is to install the package through Opal:

  • Log into Opal and switch to the Administration/DataSHIELD/ tab
  • Click the Add DataSHIELD package button
  • Select GitHub as source, and use difuture-lmu as user, dsBinVal as name, and main as Git reference.

The second option is to use the opalr package to install dsBinVal directly from R:

### User credentials (here from the opal test server):
surl     = "https://opal-demo.obiba.org/"
username = "administrator"
password = "password"

### Install package and publish methods:
opal = opalr::opal.login(username = username, password = password, url = surl)

opalr::dsadmin.install_github_package(opal = opal, pkg = "dsBinVal", username = "difuture-lmu", ref = "main")
opalr::dsadmin.publish_package(opal = opal, pkg = "dsBinVal")

opalr::opal.logout(opal)

Usage

A more sophisticated example is available here.

library(dsBinVal)

Log into DataSHIELD server

builder = newDSLoginBuilder()

surl     = "https://opal-demo.obiba.org/"
username = "administrator"
password = "password"

builder$append(
  server   = "ds1",
  url      = surl,
  user     = username,
  password = password,
  table    = "CNSIM.CNSIM1"
)
builder$append(
  server   = "ds2",
  url      = surl,
  user     = username,
  password = password,
  table    = "CNSIM.CNSIM2"
)
builder$append(
  server   = "ds3",
  url      = surl,
  user     = username,
  password = password,
  table    = "CNSIM.CNSIM3"
)

connections = datashield.login(logins = builder$build(), assign = TRUE)
#> 
#> Logging into the collaborating servers
#> 
#>   No variables have been specified. 
#>   All the variables in the table 
#>   (the whole dataset) will be assigned to R!
#> 
#> Assigning table data...

Load test model, push to DataSHIELD, and calculate predictions

# Load the model fitted locally on CNSIM:
load(here::here("Readme_files/mod.rda"))
# Model was calculated by:
#> glm(DIS_DIAB ~ ., data = CNSIM, family = binomial())

# Push the model to the DataSHIELD servers:
pushObject(connections, mod)
#> [2024-04-22 13:13:48.063726] Your object is bigger than 1 MB (5.75186157226562 MB). Uploading larger objects may take some time.

# Create a clean data set without NAs:
ds.completeCases("D", newobj = "D_complete")
#> $is.object.created
#> [1] "A data object <D_complete> has been created in all specified data sources"
#> 
#> $validity.check
#> [1] "<D_complete> appears valid in all sources"

# Calculate scores and save at the servers:
pfun =  "predict(mod, newdata = D, type = 'response')"
predictModel(connections, mod, "pred", "D_complete", predict_fun = pfun)

datashield.symbols(connections)
#> $ds1
#> [1] "D"          "D_complete" "mod"        "pred"      
#> 
#> $ds2
#> [1] "D"          "D_complete" "mod"        "pred"      
#> 
#> $ds3
#> [1] "D"          "D_complete" "mod"        "pred"

Calculate l2-sensitivity

# In order to securely calculate the ROC-GLM, we have to assess the
# l2-sensitivity to set the privacy parameters of differential
# privacy adequately:
l2s = dsL2Sens(connections, "D_complete", "pred")
l2s
#> [1] 0.001475989

# Based on the results presented in https://arxiv.org/abs/2203.10828,
# we set the privacy parameters to
# - epsilon = 0.2, delta = 0.1 if        l2s <= 0.01
# - epsilon = 0.3, delta = 0.4 if 0.01 < l2s <= 0.03
# - epsilon = 0.5, delta = 0.3 if 0.03 < l2s <= 0.05
# - epsilon = 0.5, delta = 0.5 if 0.05 < l2s <= 0.07
# - epsilon = 0.5, delta = 0.5 if 0.07 < l2s, BUT the results may be poor!
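The lookup described in the comments above can be sketched as a small helper. The function `choose_privacy_params()` is a hypothetical name introduced here for illustration and is not exported by dsBinVal; it only encodes the thresholds listed above.

```r
# Hypothetical helper: map an l2-sensitivity value to the
# (epsilon, delta) pair listed in the comments above.
choose_privacy_params <- function(l2s) {
  stopifnot(is.numeric(l2s), length(l2s) == 1, !is.na(l2s), l2s >= 0)
  if (l2s <= 0.01) {
    list(epsilon = 0.2, delta = 0.1)
  } else if (l2s <= 0.03) {
    list(epsilon = 0.3, delta = 0.4)
  } else if (l2s <= 0.05) {
    list(epsilon = 0.5, delta = 0.3)
  } else {
    if (l2s > 0.07) {
      warning("l2-sensitivity above 0.07: results may be poor")
    }
    list(epsilon = 0.5, delta = 0.5)
  }
}

# The l2-sensitivity computed above falls into the first range:
choose_privacy_params(0.001475989)
```

For the value computed above, this yields epsilon = 0.2 and delta = 0.1.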

Calculate ROC-GLM

# The response must be encoded as integer/numeric vector:
ds.asInteger("D_complete$DIS_DIAB", "truth")
#> $is.object.created
#> [1] "A data object <truth> has been created in all specified data sources"
#> 
#> $validity.check
#> [1] "<truth> appears valid in all sources"
roc_glm = dsROCGLM(connections, truth_name = "truth", pred_name = "pred",
  dat_name = "D_complete", seed_object = "pred")
#> 
#> [2024-04-22 13:16:22.865938] L2 sensitivity is: 0.0015
#> 
#> [2024-04-22 13:16:24.798812] Setting: epsilon = 0.2 and delta = 0.1
#> 
#> [2024-04-22 13:16:24.7992] Initializing ROC-GLM
#> 
#> [2024-04-22 13:16:24.799205] Host: Received scores of negative response
#> [2024-04-22 13:16:24.799542] Receiving negative scores
#> [2024-04-22 13:16:26.746705] Host: Pushing pooled scores
#> [2024-04-22 13:16:30.121529] Server: Calculating placement values and parts for ROC-GLM
#> [2024-04-22 13:16:32.062631] Server: Calculating probit regression to obtain ROC-GLM
#> [2024-04-22 13:16:34.208119] Deviance of iter1=63.7694
#> [2024-04-22 13:16:36.187686] Deviance of iter2=98.4921
#> [2024-04-22 13:16:38.1557] Deviance of iter3=107.2788
#> [2024-04-22 13:16:40.124237] Deviance of iter4=107.4237
#> [2024-04-22 13:16:42.093282] Deviance of iter5=107.4237
#> [2024-04-22 13:16:44.063297] Deviance of iter6=107.4237
#> [2024-04-22 13:16:44.063721] Host: Finished calculating ROC-GLM
#> [2024-04-22 13:16:44.063974] Host: Cleaning data on server
#> [2024-04-22 13:16:46.610451] Host: Calculating AUC and CI
#> [2024-04-22 13:17:04.261138] Finished!
roc_glm
#> 
#> ROC-GLM after Pepe:
#> 
#>  Binormal form: pnorm(0.67 + 0.55*qnorm(t))
#> 
#>  AUC and 0.95 CI: [0.66----0.72----0.78]

plot(roc_glm)
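As a plausibility check, the printed binormal form can be evaluated locally in base R. For a binormal ROC curve pnorm(a + b*qnorm(t)), the area under the curve has the closed form pnorm(a / sqrt(1 + b^2)). The sketch below uses the rounded parameters from the output above (a = 0.67, b = 0.55), so it can only approximate the value the package reports, and it does not reproduce the confidence interval.

```r
# Rounded binormal parameters taken from the printed ROC-GLM above.
a <- 0.67
b <- 0.55

# Binormal ROC curve: true positive rate as a function of the
# false positive rate t.
roc <- function(t) pnorm(a + b * qnorm(t))

# Closed-form AUC of a binormal ROC curve.
auc_closed <- pnorm(a / sqrt(1 + b^2))

# Numerical integration of the ROC curve as a cross-check.
auc_numeric <- integrate(roc, lower = 0, upper = 1)$value

round(auc_closed, 2)
round(auc_numeric, 2)
```

Both values round to 0.72, which agrees with the AUC point estimate shown above.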

Assess calibration

dsBrierScore(connections, "truth", "pred")
#> [1] 0.01222748

### Calculate and plot calibration curve:
cc = dsCalibrationCurve(connections, "truth", "pred")
cc
#> 
#> Calibration curve:
#> 
#>  Number of shared values:
#>            (0,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6] (0.6,0.7]
#> n             7270        52        20        11         6         3         7
#> not_shared       0         0         0         2         3         3         2
#>            (0.7,0.8] (0.8,0.9] (0.9,1]
#> n                  1         1       0
#> not_shared         1         1     NaN
#> 
#> Values of the calibration curve:
#>               (0,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6]
#> truth     0.009766162 0.2307692 0.3500000 0.1818182 0.3333333         0
#> predicted 0.010694587 0.1381766 0.2487212 0.2793309 0.2112020         0
#>           (0.6,0.7] (0.7,0.8] (0.8,0.9] (0.9,1]
#> truth     0.4285714         0         0     NaN
#> predicted 0.4537049         0         0     NaN
#> 
#> 
#> Missing values are indicated by the privacy level of 5.

plot(cc)
#> Warning: Removed 17 rows containing missing values or values outside the scale range
#> (`geom_point()`).
#> Warning: Removed 17 rows containing missing values or values outside the scale range
#> (`geom_line()`).
#> Warning: Removed 1 row containing missing values or values outside the scale range
#> (`geom_point()`).
#> Warning: Removed 1 row containing missing values or values outside the scale range
#> (`geom_line()`).
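For readers unfamiliar with the two calibration measures, the following is a minimal non-distributed sketch in base R on simulated data. It is not the package's implementation: the data are artificial, the variable names are chosen for this illustration, and the suppression rule (hide bins with fewer than 5 values) is a simplified stand-in for the privacy handling described above.

```r
set.seed(1)

# Simulated scores and outcomes (artificial data, well calibrated
# by construction).
n     <- 1000
pred  <- runif(n)
truth <- rbinom(n, size = 1, prob = pred)

# Brier score: mean squared difference between outcome and score.
brier <- mean((truth - pred)^2)

# Calibration curve: average outcome and average score per bin.
bins <- cut(pred, breaks = seq(0, 1, by = 0.1))
cal  <- data.frame(
  bin       = levels(bins),
  n         = as.vector(table(bins)),
  truth     = as.vector(tapply(truth, bins, mean)),
  predicted = as.vector(tapply(pred, bins, mean))
)

# Simplified privacy rule (assumption): hide bins with too few values.
min_n <- 5L
cal[cal$n < min_n, c("truth", "predicted")] <- NA

brier
cal
```

For a well-calibrated model, the truth and predicted columns are close to each other in every bin.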

Deploy information:

Built by root (Darwin) on 2024-04-22 13:17:11.283748.

This readme is built automatically after each push to the repository and weekly on Monday. The automatic build installs the package on the DataSHIELD test server and therefore also tests whether the package works on DataSHIELD servers. Additionally, the functionality is tested using GitHub Actions with tests/testthat/test_on_active_server.R. The system information of the local and remote machines is:

  • Local machine:
    • R version: R version 4.3.3 (2024-02-29)
    • Version of DataSHIELD client packages:

| Package      | Version |
|--------------|---------|
| DSI          | 1.5.0   |
| DSOpal       | 1.4.0   |
| dsBaseClient | 6.3.0   |
| dsBinVal     | 1.0.2   |

  • Remote DataSHIELD machines:
    • OPAL version of the test instance: 4.7.2
    • R version of ds1: R version 4.3.3 (2024-02-29)
    • R version of ds2: R version 4.3.3 (2024-02-29)
    • Version of server packages:

| Package   | ds1: Version | ds2: Version | ds3: Version |
|-----------|--------------|--------------|--------------|
| dsBase    | 6.3.0        | 6.3.0        | 6.3.0        |
| resourcer | 1.4.0        | 1.4.0        | 1.4.0        |
| dsBinVal  | 1.0.2        | 1.0.2        | 1.0.2        |