Dutch Medical Concepts

This repository contains instructions and code to create concept tables of Dutch medical names, such as primary names, synonyms, abbreviations, and common misspellings. By basing a concept table on UMLS, which includes MeSH, MedDRA, ICD-10 and ICPC, and adding SNOMED CT and a few manually added names, a comprehensive set of words commonly used in Dutch medical language is generated. Workflows for creating SNOMED and HPO based concept tables are also in this repository.

The resulting concept tables can be used in named entity recognition and linking methods, such as MedCAT, to identify entities in Dutch medical text.

Ontology	Number of concepts	Number of names	Primary source
Dutch UMLS	254835	574475	UMLS 2022AB
Dutch UMLS with English drug names	367913	754326	UMLS 2022AB
Dutch SNOMED	230277	521118	SNOMED CT Netherlands Edition September 2022 v1.0
Dutch HPO	13360	29164	Dutch HPO translations

Data and licenses should be acquired from UMLS Terminology Services and SNOMED MLDS.

Download pre-made Dutch MedCAT models

To download a pre-made Dutch MedCAT model based on the Dutch UMLS concept table generated in this repository, navigate to the MedCAT source repository (https://github.com/CogStack/MedCAT) and click on the link for the Dutch model. This requires authentication via UMLS Terminology Services to verify the user's UMLS license.

Folder structure

The methods of this repository require a number of input files, which should be downloaded by the user, and intermediate and output files are generated. This repository contains a folder structure that can be used for storing these files. The contents of these folders, except for 05_CustomChanges, are added not tracked by Git, which makes it easier to replace input files or recreate output files.

dutch-medical-concepts
└───01_Download
└───02_ExtractSubset
└───03_SqlDB
└───04_ConceptDB
└───05_CustomChanges

Output format

cui	name	ontologies	name_status	type_ids
C0000001	kanker	ONTOLOGY1\|ONTOLOGY2	P	T001
C0000001	neoplasma maligne	ONTOLOGY1	A	T001
C0000002	longkanker	ONTOLOGY1	P	T001
C0000003	kleincellige longkanker	ONTOLOGY1	P	T001
C0000004	chirurgie	ONTOLOGY1	P	T002
C0000004	chirurgische verrichting	ONTOLOGY1	A	T002
C0000005	chirurgie	ONTOLOGY1	P	T003
C0000005	specialisme chirurgie	ONTOLOGY1	A	T003

See https://github.com/CogStack/MedCAT/tree/master/examples for a detailed explanation of the columns.

I'm not sure whether the UMLS license allows for publishing snippets of UMLS data for demonstration purposes, so this repository uses mock data in the examples.

Data-flow

Generate UMLS concept table

1. Obtain license and download complete UMLS

To download UMLS, visit the NIH National Library of Medicine website. You'll have to apply for a license before you can download the files on https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html. In the following description I downloaded Full Release (umls-2022AB-full.zip). The advantage over UMLS Metathesaurus Full Subset is that the Full Release includes MetamorphoSys which makes it possible to create a subset of UMLS prior loading the data in a SQL database. This significantly decreases the required disk space and processing time.

2. Decompress and install MetamorphoSys

After decompressing the *-full.zip file, go into the folder (e.g. 01_Download/2022AB-full in this example) and decompress mmsys.zip. Afterwards, move the files in the new mmsys folder one level up, so they are in 2022AB-full. Next, run MetamorphoSys (./run_mac.sh on macOS)

3. Select UMLS concepts for Dutch medical language using MetamorphoSys

MetamorphoSys is used to install a subset of UMLS. During the installation process it is possible to select multiple sources, and thereby to craft a specific subset for your use case. In our case, our primary goal is to select the Dutch terms and we add some English sources for concept categories for common used English names in Dutch (such as drug names).

Select Install UMLS.
Select destination directory, for example ./02_ExtractSubset/2022AB/
Keep Metathesaurus checked, and uncheck Semantic Network and SPECIALIST Lexicon & Lexical Tools. Select OK.
Select New Configuration..., click Accept and click Ok. The Default Subset does not matter because we are making our own subset in the next step.
In the Output Options tab, select MySQL 5.6 under Select database.
In the Source List tab, Select Select sources to INCLUDE in subset. Sort the sources on the language column and at least select the 7 Dutch sources. To select multiple sources, hold the CMD key on macOS. In the popup window that will ask if you also want to include related sources, click Cancel. The Dutch sources I included were:

Source name	Source ID
ICD10, Dutch Translation, 200403	ICD10DUT_200403
ICPC2-ICD10 Thesaurus, Dutch Translation	ICPC2ICD10DUT_200412
ICPC2E Dutch	ICPC2EDUT_200203
ICPC, Dutch Translation, 1993	ICPCDUT_1993
LOINC Linguistic Variant - Dutch, Netherlands	LNC-NL-NL_273
MedDRA Dutch	MDRDUT25_0
MeSH Dutch	MSHDUT2005

More information on the sources can be found at: https://www.nlm.nih.gov/research/umls/sourcereleasedocs/index.html

For drug names some commonly used synonyms in Dutch langague are missing in these vocabularies. Therefore I also selected the following English sources:
- ATC_2022_22_09_06
- DRUGBANK5.0_2022_08_01
- RXNORM_20AA_220906F.
Also, it's useful to include other categories useful for remapping source ontologies (such as Dutch SNOMED -> English SNOMED -> UMLS). The selection of Dutch and English drug names is done in a Jupyter Notebook in a later step, so it's okay to include some more non-Dutch sources in this step. At UMCU we included the following sources:
- SNOMEDCT_US_2022_09_01 (required for adding Dutch SNOMED terms in a later step)
- HPO2020_10_12
- MTH
In the Suppressibility tab, make sure the obsolete terms are suppressed (LO for LOINC, OL for MedDRA; see https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/abbreviations.html). In my configuration, I unsuppressed the "Abbreviation in any source vocabulary" (AB) concepts from MedDRA.
On macOS, in the top bar, select Advanced Suppressibility Options and check all checkboxes. This makes sure the suppressed terms are excluded from the subset.
On macOS, in the top bar, select Done and Begin Subset. This process takes 5-10 minutes.
Save the SubSet log for future reference.

4. Load all terms in a SQL database

To select only the columns required for the target list of terms, first put all the resulting subset in a SQL DB.

# Create local .env file
cp .env-example .env

# Set local file paths & MySQL root password in .env

# Set MySQL loading config settings 
vim ./02_ExtractSubset/2022AB/META/populate_mysql_db.sh

MYSQL_HOME=/usr
user=root
password=${MYSQL_ROOT_PASSWORD}
db_name=${MYSQL_DATABASE}

# Start MySQL container in Docker
docker-compose up -d

# Enter docker container
docker exec -it umls bash

# Execute mysql loading script
cd /src_files/2022AB/META/
bash populate_mysql_db.sh

The official documentation for loading UMLS in a MySQL DB can be found at here.

5. Create concept table

Preferred method: Fine-tuned and filtered

Some source vocabularies contain concept types that are not useful for entity linking. Also, Dutch vocabularies in UMLS do not contain many drug names. Use dutch-umls_to_concept-table.ipynb to:

Download concepts, names and types from the MySQL database
Remove irrelevant concepts based on term type in source vocabulary
Remove irrelevant concepts based on UMLS type
Remove problematic names for Dutch language
Add many concepts and synonyms from Dutch SNOMED
Add custom names
Save the concept table in a format compatible with MedCAT

Note: This requires a Dutch SNOMED concept table, which can be created in dutch-snomed_to_concept-table.ipynb.

To setup the Python environment, install the required Python Packages with PIP:

pip install -r requirements.txt

Alternate method: rough & fast

Target file should follow the specifications defined in the MedCAT repository.

In short:

Column	Description	Example values
cui	concept id	C0000001
name	term name	Longkanker
name_status	Term type in source	PN (Primary name), SY (Synonym), for others see https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/abbreviations.html#TTY
type_ids	Semantic type identifier	T047 (Based on UMLS)
ontologies	Source Ontology	SNOMEDCT_US, MSHDUT, MDRDUT, ICPC2EDUT, ICPCDUT, or something custom

Select the relevant columns using your preferred way of interacting with SQL databases and save the results to a CSV-file. Also include the header. A quick way to do this, is:

SELECT distinct MRCONSO.cui, str as name, sab as ontologies, tty as name_status, tui as type_ids
FROM MRCONSO
LEFT JOIN MRSTY ON MRSTY.cui = MRCONSO.cui
ORDER BY MRCONSO.cui ASC;

The output should look like this:

% head -5 umls-dutch.csv 
cui,name,ontologies,name_status,type_ids
C0000001,A-zenuwvezels,ONTOLOGY1,PN,T001
C0000002,Abattoir,ONTOLOGY1,PN,T002
C0000002,Abattoirs,ONTOLOGY1,SY,T002
C0000003,Abbreviated Injury Scale,ONTOLOGY1,PN,T003

Generate SNOMED concept table

Method in: dutch-snomed_to_concept-table.ipynb

License and source files can be acquired from SNOMED MLDS.

Generate HPO concept table

Method in: dutch-hpo_to_concept-table.ipynb

Most HPO names were translated to Dutch by Radboudumc's Clinical Genetics team via CrowdIn. This is the primary source in creating a concept table of Dutch HPO concepts. In an effort to make a more complete set of synonyms, names from UMLS and SNOMED, and a few manually created names, are added.

Generate MedCAT models

To generate MedCAT models from the concept table, see the instructions in the MedCAT repository. For Dutch language, use these parameters in the configuration:

# Set Dutch spaCy model. The small or medium models should also be fine.
cat.general.spacy_model = 'nl_core_news_lg'

# This handles diacritics, which are common in Dutch language but not in English language.
cat.general.diacritics = True

# This distinguishes "ALS" (the disease) from "als" (voegwoord en voorzetsel).
cat.ner.check_upper_case_names = True

MedCAT contains functionality for unsupervised training to solve ambiguity of medical concepts. A relatively easy way to do this is by providing Dutch medical wikipedia articles. Although this is not representable for clinical language used in EHRs, the resulting model performs surprisingly well. See dutch-medical-wikipedia for instructions on how to generate this dataset.

Name		Name	Last commit message	Last commit date
Latest commit History 69 Commits
01_Download		01_Download
02_ExtractSubset		02_ExtractSubset
04_ConceptDB		04_ConceptDB
05_CustomConcepts		05_CustomConcepts
.env-example		.env-example
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
data-flow.jpg		data-flow.jpg
docker-compose.yml		docker-compose.yml
dutch-hpo_to_concept-table.ipynb		dutch-hpo_to_concept-table.ipynb
dutch-snomed_to_concept-table.ipynb		dutch-snomed_to_concept-table.ipynb
dutch-umls_to_concept-table.ipynb		dutch-umls_to_concept-table.ipynb
requirements.txt		requirements.txt
utils.py		utils.py

License

umcu/dutch-medical-concepts

Folders and files

Latest commit

History

Repository files navigation

Dutch Medical Concepts

Table of Contents

Download pre-made Dutch MedCAT models

Folder structure

Output format

Data-flow

Generate UMLS concept table

1. Obtain license and download complete UMLS

2. Decompress and install MetamorphoSys

3. Select UMLS concepts for Dutch medical language using MetamorphoSys

4. Load all terms in a SQL database

5. Create concept table

Preferred method: Fine-tuned and filtered

Alternate method: rough & fast

Generate SNOMED concept table

Generate HPO concept table

Generate MedCAT models

About

Topics

Resources

License

Stars

Watchers

Forks

Languages