transmart-copy

Data uploader tool for TranSMART, specifically for PostgreSQL databases. For loading observations data, it copies data from table-specific data files into the database tables, only substituting the index placeholders used for subjects, trial visits and studies with database-generated identifiers.

Download

The latest version can be downloaded here: transmart-copy-17.2.15.jar.

# Download transmart-copy
curl -f -L https://repo.thehyve.nl/service/local/repositories/releases/content/org/transmartproject/transmart-copy/17.2.15/transmart-copy-17.2.15.jar -o transmart-copy.jar

Usage

Usage:

java -jar transmart-copy.jar [-h|--help] [--delete <STUDY_ID>]

Parameters:

  • -d, --directory: Specifies a data directory.
  • -m, --mode <study|pedigree>: Load mode, specifies what type(s) of data to load (default: study).
  • -I, --incremental: Enable incremental loading of patient data for a study (supported only for study mode).
  • -D <STUDY_ID>, --delete <STUDY_ID>: Deletes the study with id <STUDY_ID> and related data.
  • -U, --update-concept-paths: Workaround. Updates concept paths, names and tree nodes when there is concept code collision.
  • -n, --base-on-max-instance-num: Adds to each instance_num a base to avoid primary key collisions in observation_fact. The base is autodetected as max(observation_fact.instance_num).
  • -i, --drop-indexes: Drop indexes when loading.
  • -r, --restore-indexes: Restore indexes.
  • -u, --unlogged: Set observations table to unlogged when loading.
  • -v, --vacuum-analyze: Vacuum analyze the observation_fact table.
  • -b, --batch-size: Number of observations to insert in a batch (default: 500).
  • -f, --flush-size: Number of batches to flush to the database (default: 1000).
  • -w <file>, --write <file>: Write observations to TSV file <file>.
  • -p, --partition: Partition observation_fact table based on trial_visit_num (Experimental).
  • -h, --help: Shows the available parameters and exits.
  • -V, --version: Prints the application version and exits.
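
As a concrete illustration (the directory path here is only a placeholder), a plain study load from a prepared data directory looks like this:

# Load a study from the data directory ./study_data (placeholder path)
java -jar transmart-copy.jar -d ./study_data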

The program reads table data from the current working directory and inserts new rows into the database if a row with the same identifier does not yet exist. The input directory should have the same structure as the database: two directories i2b2metadata and i2b2demodata representing the schemas, containing .tsv files for each of the tables. See the example directory for an example of the directory structure.
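
For orientation, a data directory might look roughly like the sketch below. This is an illustrative subset of files only; which .tsv files are present depends on the data being loaded, and the example directory in the repository is the authoritative reference:

study_data/
├── i2b2demodata/
│   ├── study.tsv
│   ├── patient_dimension.tsv
│   ├── patient_mapping.tsv
│   ├── concept_dimension.tsv
│   ├── trial_visit_dimension.tsv
│   └── observation_fact.tsv
└── i2b2metadata/
    ├── dimension_description.tsv
    ├── study_dimension_descriptions.tsv
    └── i2b2_secure.tsv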

For shared data (used across studies), the identifiers of existing records are fetched first. If a record already exists, the data is not updated. The tool only adds new records.

For patients, visits, studies, trial visits, dimension descriptions and relation types, identifiers are generated by the database. In the .tsv files, an index should be used in these identifier columns: e.g., the first data row has the number 0 in the identifier column instead of an actual identifier. The first row is always assumed to contain the column names, exactly matching the columns that exist in the database.
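
For example, a patient_mapping.tsv could start as sketched below (trimmed for readability; a real file is tab-separated and must list every column of the table, and the subject identifiers are placeholders). The patient_num column holds the index (0, 1, ...) that is replaced by a database-generated identifier during loading:

patient_ide    patient_ide_source    patient_num
SUBJ_1         SUBJ_ID               0
SUBJ_2         SUBJ_ID               1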

If a study in the input data already exists in the database, the program aborts, unless incremental data loading is enabled.

Identifying columns for shared data

For shared data, the following columns are used to identify if a record already exists:

Table               Column(s)
patient_dimension   the patient_ide and patient_ide_source columns of patient_mapping
visit_dimension     the encounter_ide and encounter_ide_source columns of encounter_mapping
concept_dimension   concept_cd
study               study_id
i2b2_secure         path
i2b2_tags           (path, tag_type, tags_idx)
relation_type       label

Currently, only one identifier is allowed per patient or visit, i.e., if the mapping contains multiple identifiers (from different sources) for the same patient, loading fails. The patients and visits in the mapping files are expected to be numbered consecutively, starting from 0. The patient_ide_source is expected to be SUBJ_ID.

Observations are inserted without checking, because it is assumed that no data for the study is present in the database yet. For incremental data loading, pass the --incremental or -I option. Then, prior to loading, all observations for the patients in the input data are deleted for the studies being uploaded. This makes it possible to update the data for a subset of patients of an existing study.
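
An incremental load could then look like this (the directory path is a placeholder):

# Reload data for the patients in ./updated_study_data, deleting their existing observations first
java -jar transmart-copy.jar --incremental -d ./updated_study_data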

For relations data, the relation table is first truncated, and then the data from relation.tsv is loaded.

Deleting a study

With the --delete parameter, a study and associated data (trial visits, observations) can be deleted from the database.
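
For example (STUDY_ID is a placeholder for the actual study id):

# Delete the study STUDY_ID together with its trial visits and observations
java -jar transmart-copy.jar --delete STUDY_ID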

Database settings

The database settings are read from the environment variables:

  • PGHOST: the hostname of the database server (default: localhost)
  • PGPORT: the database server port (default: 5432)
  • PGDATABASE: the database name (default: transmart)
  • PGUSER: the database admin user (required)
  • PGPASSWORD: the password of the admin user (required)
  • MAXPOOLSIZE: the maximum database connection pool size, used to avoid exceeding the database server's client connection limit (default: 8)
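
A complete invocation could then look like this (the user name, password and data directory are placeholders):

export PGHOST=localhost
export PGPORT=5432
export PGDATABASE=transmart
export PGUSER=postgres      # placeholder admin user
export PGPASSWORD=secret    # placeholder password
java -jar transmart-copy.jar -d ./study_data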

Build and run from source

gradle shadowJar
./transmart-copy.sh [-h|--help] [--delete <STUDY_ID>]