Introduction

Individual-level data from recent Alzheimer’s disease (AD) trials are difficult to obtain. Synthetic (simulated) data could be used for preparatory, training or exploratory research with a low risk of privacy breach.

We aimed to generate a synthetic version of an original real-world observational dataset and to make our method available as open source.

Method

Here we explain the steps in the R code.

Original data

1. Obtain original real-world data: demographic (age, sex, education), clinical (cognition: MMSE and ADAS; function: FAQ; composite cognition/function: CDR, ADCOMS) and biological (genetics: APOE4; cerebrospinal fluid: ABeta, Tau; imaging: PET-SUVR-centiloid) outcomes at baseline and at 6-, 12- and/or 18-month follow-up (35 variables), with missing data multiply imputed to obtain 10 sets of 537 individuals.
2. Estimate the (theoretical) minimum and maximum (all continuous variables) and proportions (all categorical variables).
3. Rescale to the 0-1 range (continuous).
4. Estimate beta distribution shape parameters (method of moments; continuous).
5. Transform to the cumulative distribution function (using the shape parameters; continuous) and to cumulative probability (categorical).
6. Convert to a normal distribution.
7. Estimate the variance-covariance matrix.
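
The fitting steps above can be sketched for continuous variables as follows. This is a minimal illustration on made-up toy data, not the repository's code: the variable names, the 1% padding used to approximate a "theoretical" minimum and maximum, and the toy distributions are all assumptions. Categorical variables are not covered here.

```r
# Sketch only: toy data and all object names are assumptions.
set.seed(1)
x <- cbind(age  = rnorm(200, mean = 72, sd = 7),
           mmse = 30 - rexp(200, rate = 1 / 3))   # left-skewed toy score

# Minimum and maximum per variable, padded by 1% of the observed range
# (assumption) so that rescaled values stay strictly inside 0-1
pad <- 0.01 * apply(x, 2, function(v) diff(range(v)))
lo  <- apply(x, 2, min) - pad
hi  <- apply(x, 2, max) + pad

# Rescale each column to the 0-1 range
x01 <- sweep(sweep(x, 2, lo, "-"), 2, hi - lo, "/")

# Beta shape parameters by the method of moments
m      <- colMeans(x01)
v      <- apply(x01, 2, var)
k      <- m * (1 - m) / v - 1
shape1 <- m * k
shape2 <- (1 - m) * k

# Cumulative probability under the fitted beta distribution
u <- sapply(colnames(x01), function(j) pbeta(x01[, j], shape1[j], shape2[j]))

# Convert to normal scores and estimate the variance-covariance matrix
z     <- qnorm(u)
sigma <- cov(z)
```

The objects `shape1`, `shape2`, `lo`, `hi` and `sigma` are what a generation step would need as input.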

Synthetic data

1. Generate random correlated normal data using the Cholesky decomposition of the variance-covariance matrix.
2. Transform to the cumulative distribution function of the normal distribution.
3. Transform with the inverse cumulative distribution function of the beta distribution (using the beta distribution shape parameters; continuous).
4. Rescale to the original range (using the minimum, maximum and proportions estimated from the original data in step 2 above).
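
The generation steps above can be sketched as follows. The input values are illustrative assumptions (not estimates from ADNI), the object names are hypothetical, and the handling of the binary variable (cutting the cumulative probability at the observed proportion) is one plausible reading of the method rather than a confirmed description of the repository's code.

```r
# Sketch only: all inputs below are assumed, illustrative values.
set.seed(2)
n      <- 537
sigma  <- matrix(c(1.0, 0.5, 0.3,
                   0.5, 1.0, 0.2,
                   0.3, 0.2, 1.0), nrow = 3)  # covariance on the normal scale
shape1 <- c(4, 3)      # beta shape parameters (continuous variables)
shape2 <- c(4, 0.8)
lo     <- c(50, 10)    # minimum per continuous variable
hi     <- c(95, 30)    # maximum per continuous variable
p_apoe <- 0.45         # proportion of carriers (binary variable)

# Correlated normal data: independent normals times the Cholesky factor.
# chol() returns the upper-triangular R with t(R) %*% R == sigma.
z <- matrix(rnorm(n * 3), nrow = n) %*% chol(sigma)

# Normal cumulative distribution function, after standardising each column
u <- pnorm(sweep(z, 2, sqrt(diag(sigma)), "/"))

# Inverse beta cumulative distribution function (continuous variables)
x01 <- sapply(1:2, function(j) qbeta(u[, j], shape1[j], shape2[j]))

# Rescale to the original range; cut the binary variable at its proportion
synthetic <- data.frame(
  age   = lo[1] + x01[, 1] * (hi[1] - lo[1]),
  mmse  = lo[2] + x01[, 2] * (hi[2] - lo[2]),
  apoe4 = as.integer(u[, 3] > 1 - p_apoe)
)
```

Comparing `cor(synthetic)` with the correlations in the original data is a quick way to check how well the dependence structure has been carried over.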

More details

See the file poster synthetic data ISPOR 2023.pdf for details and supporting figures on an application, accepted at the ISPOR conference (www.ispor.org) in Copenhagen, 2023.

Limitations

We believe our method has the following limitations:

For non-normally distributed data, the variance-covariance matrix is fitted after transformation to a normal distribution and therefore does not represent the variance-covariance of the original data on its original non-normal scale (for example, high values in right-skewed data have more impact on the covariance than they do after transformation to a normal distribution). For categorical data, the covariance of the simulated data is based on a continuous normal distribution; categorization loses information, leading to a lower covariance in the categorized data.
Our method relies on complete cases. Any conditional missing data in the original data should be handled before creating a synthetic version of the original data.
Our method relies on the correlation between all variables. It may be limited in representing specific non-linear patterns/associations and interactions.
We have simulated 2 outcomes based on the same underlying scale (CDR global and CDR sum of boxes, both variants of CDR). Some combinations of the two outcomes might be implausible, as we did not apply any restriction to their combinations. In addition, CDR-global scores can take the values 0, 0.5, 1, 2 or 3. We have rounded them to the nearest 0.5, but values other than 0.5 should have been rounded to whole numbers. As this is a specific condition of this scale, we have not implemented it, in order to keep our code generic and leave any scale-specific conditions to the user of our code.
We have not compared our method to alternatives (e.g., R package synthpop).
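
As an illustration of the CDR-global limitation above, a user could apply a scale-specific post-processing step along the following lines. The helper `snap_cdr_global` is hypothetical and not part of the repository; it assumes there are no missing values and resolves ties towards the lower score.

```r
# Hypothetical helper: snap each simulated value to the nearest valid
# CDR-global score (0, 0.5, 1, 2 or 3). Ties go to the lower score.
snap_cdr_global <- function(x, valid = c(0, 0.5, 1, 2, 3)) {
  idx <- vapply(x, function(v) which.min(abs(valid - v)), integer(1))
  valid[idx]
}

snapped <- snap_cdr_global(c(0.2, 0.6, 1.4, 2.6))
```

This does not address implausible combinations of CDR global and CDR sum of boxes, which would need a separate restriction.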

Acknowledgment

Developers:

Ron Handels (Maastricht University, Netherlands)
Linus Jonsson (Karolinska Institutet, Sweden)
Lars Lau Raket (Lund University, Sweden)

Data

For our original purpose we used a real-world dataset named ADNI. For reasons of data protection we created an artificial dataset outside this code, loosely based on ADNI data. Random changes have been made to each individual and to mean outcomes, so that the original data made available within our code are disconnected from, and not representative of, any original ADNI data. We acknowledge the use of the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu) during development of our method.
