This repo includes the pipeline used to link and curate CVD PREVENT audit data to HES and death registration data into two tables. These are subsequently sent to OHID for analysis and publication.

NHSDigital/cvd-prevent-tool
DS_234: README

CVD Prevent Tool curated data pipeline

Repository owner: NHS England Analytical Services

Email: datascience@nhs.net

To contact us, raise an issue on GitHub or email us and we will respond promptly.

Warning - this repository is a snapshot of a repository internal to NHS England. This means that some links may not work for external readers.

This repository contains a suite of Spark notebooks that form a new data pipeline. The pipeline builds a new data asset that links and curates CVD PREVENT audit data to existing administrative data tables.

This codebase can only be run on NHS England's Data Access Environment (Apache Spark v3.2.1). It is shared for transparency and for feedback on the algorithms used.

No sensitive data is stored within this repository.

Key features

  • The pipeline is structured using an object-oriented approach.
  • The pipeline can be configured via params notebooks, without altering the codebase.
  • Outputs can be restricted to particular cohort populations or to a subset of data sources.
  • Bespoke outcomes and patient characteristics can be included in the output tables.

What does the pipeline do?

The pipeline takes information from a range of data sources and summarises it in a set of standardised tables. It produces the following outputs:

  • Events table (one row per event from each data source used)
  • Patient table (one row per patient – a patient is only recorded if they satisfy the inclusion criteria for either Cohort 1 or Cohort 2)
  • Report table (output of results checks and error catching)

Quick Start Guide

The Prevent Tool Pipeline is run from the main notebook.

This notebook can run the codebase in multiple modes (selected from the Run Mode widget):

Pre-Merge: Running of the unit test and integration test suite, prior to GitLab merging

Post-Merge: Running of the unit test suite, integration test suite, and full pipeline run (post GitLab-to-Databricks merge)

Pipeline Only: Running of the full pipeline only (General Running)

There are also additional modes from that widget that relate to controlling the notebook:

Selection Mode: Default value when first starting the notebook. Used as a placeholder to indicate the notebook needs configuring before running one of the main run modes.

Reset Mode: When selected (and run), returns all configurable values for main back to the default values.
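As a rough illustration of how the main notebook might dispatch on the selected run mode (function and mode-handling names here are hypothetical; the actual notebook reads these values from Databricks widgets), a minimal sketch could look like:

```python
# Hypothetical sketch of the main notebook's run-mode dispatch.
# Mode names mirror the README; the step functions are stand-ins
# for the real test-suite and pipeline entry points.

def run_unit_tests():
    return "unit tests passed"

def run_integration_tests():
    return "integration tests passed"

def run_full_pipeline():
    return "pipeline complete"

def dispatch(run_mode: str) -> list:
    """Run the steps associated with the selected Run Mode widget value."""
    steps = {
        "Selection Mode": [],  # placeholder: notebook not yet configured
        "Pre-Merge": [run_unit_tests, run_integration_tests],
        "Post-Merge": [run_unit_tests, run_integration_tests, run_full_pipeline],
        "Pipeline Only": [run_full_pipeline],
    }
    if run_mode not in steps:
        raise ValueError(f"Unknown run mode: {run_mode}")
    return [step() for step in steps[run_mode]]
```

Note that Selection Mode deliberately runs nothing, matching its role as a placeholder before the notebook is configured.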

A full pipeline run, undertaken as part of the Post-Merge or Pipeline Only run modes, requires additional widget inputs:

cohort_table: Specifies the use of a previously created eligible_cohort_table; if supplied, the CreateCohortTableStage stage of the pipeline is skipped. Default is blank (run the full pipeline).

git_version: Git commit hash from the current master branch in GitLab. Can also be set to dev_XX, where XX are the initials of the user running the pipeline; used when testing pipeline code.

params_path: Path to the parameters notebook that controls the pipeline. Default is default. A custom path should only be used when using a non-standard parameters file.

prepare_pseudo_assets: Specifies whether the pseudonymisation-ready assets should be created (True) or not (False). Defaults to False.
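The git_version format described above (a commit hash, or dev_ plus the user's initials) lends itself to a simple validity check. The helper below is a hypothetical sketch, not part of the pipeline:

```python
import re

# Hypothetical validator for the git_version widget input: either a
# Git commit hash (7-40 lowercase hex characters) or dev_XX, where XX
# are the initials of the user running the pipeline.
GIT_VERSION_PATTERN = re.compile(r"^(?:[0-9a-f]{7,40}|dev_[A-Za-z]{2,3})$")

def is_valid_git_version(value: str) -> bool:
    """Return True if value looks like a commit hash or a dev_XX tag."""
    return bool(GIT_VERSION_PATTERN.match(value))
```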

The pipeline run function run_pipeline() outputs a verbose progress log of the pipeline's running stages and their timings.
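A stage-timing log of this kind can be sketched as follows (an illustrative pattern only; run_pipeline()'s real implementation and stage names are not shown here):

```python
import time

# Hypothetical sketch of a verbose stage-timing log, in the style of
# the progress output run_pipeline() produces.
def run_stages(stages):
    """Run (name, callable) pairs in order, returning a timing log line per stage."""
    log = []
    for name, stage_fn in stages:
        start = time.perf_counter()
        stage_fn()
        elapsed = time.perf_counter() - start
        log.append(f"{name}: completed in {elapsed:.2f}s")
    return log
```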

Once completed, assets will be available in the prevent_tool_collab database.

Configuration

The pipeline functionality and running can be controlled using the pipeline parameters (found in the params folder). Below is a brief summary of the different parameter notebooks and their purpose.

params

The main notebook for creating the params object. This notebook checks for the parameters path (default is default) and loads the specified params_util notebook.

params_util

This notebook contains the main parameter definitions and the creation of the params dataclass. Input and output data fields (columns) are specified here, alongside any intermediate fields used as part of the pipeline processing. This notebook loads the params_diagnostic_codes notebook to pull in the relevant SNOMED and ICD-10 codes that form part of the inclusion criteria.
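To make the dataclass idea concrete, a heavily simplified, hypothetical sketch is shown below; the field names are illustrative and the real params_util notebook defines many more:

```python
from dataclasses import dataclass

# Hypothetical, simplified sketch of a params dataclass in the style
# described above. Field names and defaults are illustrative only.
@dataclass(frozen=True)
class Params:
    # Input/output data field (column) names
    pid_field: str = "person_id"
    event_date_field: str = "event_date"
    # Intermediate field used during pipeline processing
    record_id_field: str = "record_id"
    # Database where curated assets are written
    database: str = "prevent_tool_collab"

params = Params()
```

Freezing the dataclass means configuration changes happen by constructing a new Params (or a new params_util notebook), not by mutating values mid-run, which matches the "configure without altering the codebase" design goal.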

params_diagnostic_codes

This notebook is used to specify any clinical coding variables (ICD-10, SNOMED) that are used to create the pipeline parameters.
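As an abbreviated, hypothetical illustration of declaring such coding variables (the code lists below are examples, not the pipeline's actual inclusion criteria):

```python
# Hypothetical, abbreviated sketch of clinical coding variables.
# The lists are illustrative; they do not reflect the pipeline's
# actual inclusion criteria.
STROKE_ICD10_CODES = ["I61", "I63", "I64"]   # ICD-10 stroke codes (illustrative)
AF_SNOMED_CODES = ["49436004"]               # SNOMED CT: atrial fibrillation

DIAGNOSTIC_CODES = {
    "stroke_icd10": STROKE_ICD10_CODES,
    "atrial_fibrillation_snomed": AF_SNOMED_CODES,
}
```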

Documentation

The homepage of the pipeline's documentation is here.

Further documentation

Configuring the pipeline

Output data specification

Licence

Unless stated otherwise, the codebase is released under the MIT Licence. This covers both the codebase and any sample code in the documentation.

Documentation is © Crown copyright and available under the terms of the Open Government Licence v3.0.
