This repository has been archived by the owner on Jan 13, 2023. It is now read-only.
Honghan Wu edited this page Jul 3, 2018 · 21 revisions

Welcome to the SemEHR wiki!

Run SemEHR pipeline

A typical SemEHR process contains the following steps:

  1. query a database or read from an Elasticsearch instance to get the documents for processing
  2. NLP processing (currently using bio-yodie to annotate UMLS concepts)
  3. index contextualised concepts into an Elasticsearch instance
  4. do patient centric indexing to integrate all patient docs and annotations

The easiest way to run the process is to:

  1. (only do this ONCE) initialise the SemEHR index using the mapping file (for ES version < 6.0) or mappings (for ES version > 6.0): patient index mapping and contextualised concept mapping.
  2. set up the database view from which SemEHR will pull documents.
  3. edit the process configuration file using this template.
  4. run the script: python semehr_processor.py PATH_TO_YOUR_CONFIGURATION
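
Step 4 can also be wrapped in a small launcher that checks the configuration file before starting a run. The following is only a sketch: build_command and launch are hypothetical helper names, not part of SemEHR, and it assumes semehr_processor.py sits in the current working directory and takes the configuration path as its only argument, as in step 4.

```python
import json
import subprocess
import sys
from pathlib import Path


def build_command(config_path, script="semehr_processor.py"):
    """Return the command line for step 4 after checking the config file.

    Hypothetical helper: it only verifies that the file exists and parses
    as a JSON object; it does not validate individual settings.
    """
    path = Path(config_path)
    if not path.is_file():
        raise FileNotFoundError(f"configuration file not found: {path}")
    with path.open(encoding="utf-8") as f:
        settings = json.load(f)  # raises a ValueError subclass on malformed JSON
    if not isinstance(settings, dict):
        raise ValueError("configuration must be a JSON object")
    # Use the same Python interpreter that runs this launcher.
    return [sys.executable, script, str(path)]


def launch(config_path):
    """Run the processor and return its exit code."""
    return subprocess.run(build_command(config_path)).returncode
```

Calling launch("semehr_process_settings.json") would then start the run only if the file is present and well-formed.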

semehr_process_settings.json explained

  • env - system variables for running SemEHR

    • java_home - path to JRE
    • gcp_home - path to GCP (Gate Cloud Processing toolkit)
    • gate_home - path to Gate
    • yodie_path - path to bio-yodie
    • ukb_home - path to UKB (used by bio-yodie to do PageRank computation for disambiguation)
  • yodie - settings for running bio-yodie NLP pipeline on documents

    • "os" - the type of Operating System; possible values: win, linux
    • "gcp_run_path" - bio-yodie working folder
    • "input_doc_file_path" - (optional) path to a folder containing a text document that lists all document ids to be processed
    • "thread_num" - number of concurrent threads to run bio-yodie
    • "memory" - max memory to run bio-yodie, e.g., 30g or 600m
    • "config_xml_path" - the full path where the bio-yodie configuration file will be stored (the file is generated automatically)
    • "output_file_path" - (optional) path to the folder where JSON dumps of bio-yodie will be saved
    • "output_destination" - output type of bio-yodie; possible values: 'sql', 'json'. sql - annotations are saved to a database server; json - annotations are saved as dumps of annotation files in JSON format.
    • "output_dbconn_setting_file" - path to a JSON database configuration file used for saving annotations; check this example.
    • "output_table" - the table name to save annotations to if using sql output, e.g., [kconnect_annotations]
    • "output_concept_filter_file" - (optional) path to a text document containing the concept IDs that should be saved; all other concepts will be discarded. Format: one UMLS CUI per line.
    • "input_source" - where to read documents from; possible values: "sql", "elasticsearch". The system uses a different input handler for running bio-yodie depending on this value. sql - read from a database; elasticsearch - read from the Elasticsearch server specified in the semehr section of this configuration
    • "input_dbconn_setting_file" - (optional) input document database configuration, only needed when input_source is sql; check this example.
  • semehr

    • "es_doc_url" - Elasticsearch host URL for full text documents
    • "full_text_doc_id" - doc id field name in the full text document
    • "full_text_doc_date" - doc date field name in the full text document
    • "full_text_index" - name of the index holding the full text documents
    • "full_text_doc_type" - doc type name of the full text documents
    • "full_text_patient_field" - patient id field name in the full text document
    • "full_text_text_field" - full text field name in the full text document
    • "es_host" - Elasticsearch host URL for SemEHR
    • "index" - index name for SemEHR patients
    • "concept_index" - index name for SemEHR contextualised concepts (remove this if you would like to have everything in the same index for ES < 6.0)
    • "concept_doc_type" - document type for contextualised concept
    • "entity_doc_type" - document type for patient
  • new_docs - where to read new document IDs from, only needed if document IDs are read from database

    • "sql_query" - the SQL query to read document IDs, e.g., "select docid from ...". The query can be a template with two placeholders, "{start_time_point}" and "{end_time_point}", which are filled in at run time: {start_time_point} with the time of the last successful job (taken from SemEHR's progress log) and {end_time_point} with the current time.
    • "dbconn_setting_file" - database connection settings for reading document IDs, e.g. "dbconn.json", check this example.
  • job - to-do list for a SemEHR process. The task items (copy_docs, yodie, semehr-concept, semehr-patients) each accept yes or no; yes means the respective task is run.

    • "copy_docs" - copy Elasticsearch documents from one index to another. Designed for KCH use cases where full text documents have already been indexed in CogStack.
    • "yodie" - run bio-yodie pipeline on documents; meta-map will be supported soon.
    • "semehr-concept" - do SemEHR concept indexing
    • "semehr-patients" - do SemEHR patient centric indexing
    • "job_id" - the unique job name
    • "job_status_file_path" - path to the folder where the job progress log file is to be stored
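
Putting the sections above together, a settings file might be assembled and sanity-checked as in the sketch below. Every value is a placeholder rather than a default shipped with SemEHR, and the value types (for example a number for thread_num) are guesses. The table and column names in sql_query are made up, the helpers missing_sections and fill_time_window are hypothetical, and the plain string replacement only illustrates how the two documented placeholders could be filled, not how SemEHR does it internally. The timestamp format in the usage note is likewise an assumption.

```python
import json

# Placeholder values only; replace them with paths and names from your setup.
settings = {
    "env": {
        "java_home": "/path/to/jre",
        "gcp_home": "/path/to/gcp",
        "gate_home": "/path/to/gate",
        "yodie_path": "/path/to/bio-yodie",
        "ukb_home": "/path/to/ukb",
    },
    "yodie": {
        "os": "linux",
        "gcp_run_path": "/path/to/yodie_work",
        "thread_num": 4,
        "memory": "30g",
        "config_xml_path": "/path/to/yodie_work/config.xml",
        "output_destination": "json",
        "output_file_path": "/path/to/yodie_output",
        "input_source": "sql",
        "input_dbconn_setting_file": "/path/to/input_dbconn.json",
    },
    "semehr": {
        "es_doc_url": "http://localhost:9200",
        "full_text_doc_id": "docid",
        "full_text_doc_date": "doc_date",
        "full_text_index": "full_text_docs",
        "full_text_doc_type": "doc",
        "full_text_patient_field": "patient_id",
        "full_text_text_field": "fulltext",
        "es_host": "http://localhost:9200",
        "index": "semehr_patients",
        "concept_index": "semehr_concepts",
        "concept_doc_type": "ctx_concept",
        "entity_doc_type": "patient",
    },
    "new_docs": {
        # Hypothetical table and column names.
        "sql_query": "select docid from docs where updated > '{start_time_point}'"
                     " and updated <= '{end_time_point}'",
        "dbconn_setting_file": "dbconn.json",
    },
    "job": {
        "copy_docs": "no",
        "yodie": "yes",
        "semehr-concept": "yes",
        "semehr-patients": "yes",
        "job_id": "example_job",
        "job_status_file_path": "/path/to/job_status",
    },
}


def missing_sections(cfg, required=("env", "yodie", "semehr", "job")):
    """List the top-level sections that are absent from cfg."""
    return [name for name in required if name not in cfg]


def fill_time_window(sql_template, start, end):
    """Substitute the two documented placeholders by plain string replacement."""
    return (sql_template
            .replace("{start_time_point}", start)
            .replace("{end_time_point}", end))


def write_settings(cfg, path):
    """Save the settings as a JSON file that can be passed to the processor."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cfg, f, indent=2)
```

For example, fill_time_window(settings["new_docs"]["sql_query"], "2018-07-01 00:00:00", "2018-07-03 00:00:00") yields a query restricted to documents updated within that window.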

Useful Links

troubleshooting

  • if no concepts are indexed for patients, double-check the index mappings to make sure they match the mappings defined in the script.
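
One way to inspect the mappings currently in place is to fetch them from Elasticsearch and compare them with the mapping files. The sketch below uses Elasticsearch's GET /<index>/_mapping endpoint; mapping_url and fetch_mapping are hypothetical helper names, and the host and index in the usage note are placeholders for the es_host and index (or concept_index) values in your configuration.

```python
import json
import urllib.request


def mapping_url(es_host, index):
    """Build the URL of the mapping endpoint for one index."""
    return f"{es_host.rstrip('/')}/{index}/_mapping"


def fetch_mapping(es_host, index, timeout=10):
    """Return the index mapping as a dict (requires a reachable ES server)."""
    with urllib.request.urlopen(mapping_url(es_host, index), timeout=timeout) as resp:
        return json.load(resp)
```

For example, print(json.dumps(fetch_mapping("http://localhost:9200", "semehr_patients"), indent=2)) prints the mapping so it can be compared with the expected one.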