
Open Targets Platform Pipeline

This project is intended to validate, normalize and score evidence for the Open Targets (OT) Platform, and is structured as a counterpart to genetics-pipe. The pipeline here replaces much of what was originally maintained in data_pipeline, moving it to a more concise and efficient Scala/Spark framework. It also makes scoring more configurable for clients that wish to tune scoring coefficients for their own purposes.

There are two primary steps in this pipeline:

  1. Evidence Preparation
  • See EvidencePreparationPipeline.scala for implementation details.
  • This phase of the pipeline will validate and normalize the json evidence strings associated with OT data sources. These files are generated in large part by platform-input-support and are primarily stored in Google Storage (GS). Links to files for each source are maintained in a pipeline-configuration file that is updated with each new release.
  • At a high level, this step requires Elasticsearch index dumps (for gene/disease metadata) as well as GS files and produces a single parquet dataset (schema).
  • Notable operations performed in this phase include:
    • Validation of evidence strings against the OT Evidence Schema
    • Normalization of UniProt and non-reference (i.e. genes defined against non-reference assemblies, typically in highly polymorphic regions) targets
    • Evidence code aggregation and static scoring; i.e. some scores are defined purely based on evidence codes and need to be overridden in this phase (see here for details)
    • Target and disease validation based on Ensembl and EFO accessions, respectively
    • Aggregation of all filtering and nearly all mutation operations into ancillary datasets that can be used to trace why records were lost or altered; see this Validation Error Report for an example summary.
  2. Evidence Scoring
  • See ScoringPreparationPipeline.scala and ScoringCalculationPipeline.scala for implementation details.
  • This phase of the pipeline will score evidence created by the preparation step. While this will likely expand in the future to include more of the parameters used in scoring, the data source weights, at least, are configurable as shown here in application.conf
  • Most of the trickier details related to per-source handling of evidence can be found in Scoring.scala

Validation

Both evidence preparation and scoring can be validated against output from the original data_pipeline implementation in evidence-prep-validation.ipynb and scoring-validation.ipynb, respectively. These notebooks contain checks for target/disease presence and field equality across all data sources. Several issues were encountered and are documented in the notebooks, but outside of the issues raised on GitHub all data was found to be equivalent.

There are also tests like this one intended to preserve a subset of these checks as part of the CI build.

Configuration

The configuration for the pipeline is determined entirely by application.conf (modify as necessary for your use case).

A key configuration property to keep in mind is pipeline.decorators.dataset-summary.enabled. When this is true, provenance around evidence record mutation and filtering is preserved at the expense of making the evidence prep pipeline take over twice as long (~47 min vs. ~22 min). This is disabled by default.
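A minimal override sketch in HOCON, assuming only the documented property path pipeline.decorators.dataset-summary.enabled (the surrounding block structure follows directly from that path):

```
pipeline {
  decorators {
    dataset-summary {
      # Enable mutation/filtering provenance (expect the slower ~47 min runs)
      enabled = true
    }
  }
}
```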

The input-dir, output-dir, and data-resources.local-dir properties also need to be changed if you are NOT using the provided ot-client docker container.

Prerequisites

Note: All of the below are specific to the 19.11 OT release

This project expects two primary sources of input information. While the process outlined below is somewhat cumbersome for now, we expect to improve it once the scope of the project is solidified:

  1. Metadata index extracts from Elasticsearch
  • The gene, eco, and efo indexes are currently required
  • These can be created one of the following two ways:
    1. By setting up and running data_pipeline yourself
    2. By downloading and decompressing the files at https://storage.googleapis.com/platform-pipe/extract/{gene,eco,efo}.json.gz
    • An example script to do this is:
      mkdir -p $DATA_DIR/extract && cd $DATA_DIR/extract
      for index in gene eco efo; do
        mkdir ${index}.json
        wget -P ${index}.json https://storage.googleapis.com/platform-pipe/extract/${index}.json.gz
        gzip -d ${index}.json/${index}.json.gz
      done
  2. Evidence files
    • See download_evidence_files.sh for a script that will download this information
    • These files will collectively occupy about 23G of space (17G of which is from a single source, europepmc, so developers may find it convenient to remove or subset this file for testing)
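One way to subset the europepmc source, assuming the evidence files are newline-delimited JSON (one evidence string per line); the file name below is illustrative, so match it to the actual file produced by download_evidence_files.sh:

```shell
# Trim the large europepmc evidence file to its first 10k records so local
# test runs stay fast. "europepmc.json" is a placeholder file name.
EVIDENCE_DIR="$DATA_DIR/extract/evidence_raw.json"
head -n 10000 "$EVIDENCE_DIR/europepmc.json" > "$EVIDENCE_DIR/europepmc.json.tmp"
mv "$EVIDENCE_DIR/europepmc.json.tmp" "$EVIDENCE_DIR/europepmc.json"
```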

The expected directory structure is shown below (once the pipeline has been run):

# Inputs
$DATA_DIR/extract/
$DATA_DIR/extract/gene.json
$DATA_DIR/extract/eco.json
$DATA_DIR/extract/efo.json
$DATA_DIR/extract/evidence_raw.json/{atlas-*.json, gwas-*.json, etc.}
# Outputs
$DATA_DIR/results/score_source.parquet
$DATA_DIR/results/score_association.parquet
$DATA_DIR/results/evidence_raw.parquet
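A small illustrative shell check, using only the input paths above, can confirm the expected inputs are in place before a run:

```shell
# Report any missing inputs before launching the pipeline; prints nothing
# when everything is present. Paths mirror the structure shown above.
for path in extract/gene.json extract/eco.json extract/efo.json extract/evidence_raw.json; do
  [ -e "$DATA_DIR/$path" ] || echo "missing input: $DATA_DIR/$path"
done
```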

Performance

Some expected times for pipeline runs are shown below for various configurations (in local mode on Ubuntu 18.04 8xCPU 128G RAM):

  • Evidence Preparation
    • ~12 minutes with raw evidence pre-serialized as parquet (no mutation/filtering provenance)
    • ~22 minutes with evidence files read as uncompressed json (no mutation/filtering provenance)
    • ~47 minutes with mutation/filtering provenance and json evidence file sources
  • Score Calculation
    • Both score preparation and calculation take around 2 minutes each

Execution

To build the project, run:

sbt clean assembly
# or for no tests: sbt "set test in assembly := {}" clean assembly

This will produce target/scala-2.12/platform-pipe.jar which can be deployed or run locally.
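As a quick local check (illustrative; it assumes nothing beyond the jar path stated above), you can confirm the artifact exists before deploying:

```shell
# Confirm the assembly step produced the expected jar.
JAR=target/scala-2.12/platform-pipe.jar
[ -f "$JAR" ] && echo "built $JAR"
```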

To execute all pipeline steps, run the following using the provided ot-client docker container (see docker/README.md) or your own cluster:

APP=$REPOS/platform-pipe/target/scala-2.12/platform-pipe.jar

for cmd in prepare-evidence prepare-scores calculate-scores; do
echo "Running command: $cmd"
# Note: high driver-memory below is not necessary on a cluster -- this is for runs in local mode
/usr/spark-2.4.4/bin/spark-submit \
--driver-memory 64g \
--class com.relatedsciences.opentargets.etl.Main $APP $cmd  \
--config $HOME/repos/platform-pipe/src/main/resources/application.conf
done

Developer Notes

Ancillary Script Execution

Evidence test data generation script:

# Run script on ot-client container
/usr/spark-2.4.4/bin/spark-shell --driver-memory 12g \
--jars $HOME/data/ot/apps/platform-pipe.jar \
-i $HOME/data/ot/apps/scripts/create_evidence_test_datasets.sc \
--conf spark.ui.enabled=false --conf spark.sql.shuffle.partitions=1 \
--conf spark.driver.args="\
extractDir=$HOME/data/ot/extract,\
testInputDir=$HOME/repos/platform-pipe/src/test/resources/pipeline_test/input,\
testExpectedDir=$HOME/repos/platform-pipe/src/test/resources/pipeline_test/expected"

Scalafmt Installation

A pre-commit hook to run scalafmt is recommended for this repo, though installation of scalafmt is left to developers. The Installation Guide has simple instructions; the process used for Ubuntu 18.04 was:

cd /tmp/  
curl -Lo coursier https://git.io/coursier-cli &&
    chmod +x coursier &&
    ./coursier --help
sudo ./coursier bootstrap org.scalameta:scalafmt-cli_2.12:2.2.1 \
  -r sonatype:snapshots \
  -o /usr/local/bin/scalafmt --standalone --main org.scalafmt.cli.Cli
scalafmt --version # "scalafmt 2.2.1" at time of writing

The pre-commit hook can then be installed using:

cd $REPOS/platform-pipe
chmod +x hooks/pre-commit.scalafmt 
ln -s $PWD/hooks/pre-commit.scalafmt .git/hooks/pre-commit

After this, every commit will trigger scalafmt to run; --no-verify can be used to skip that step if absolutely necessary.
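For example (the commit message below is a placeholder):

```shell
# Bypass the scalafmt pre-commit hook for a single commit (use sparingly).
git commit --no-verify -m "wip: work in progress"
```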
