
PHICON

1. Introduction

This repository contains the source code for the paper "PHICON: Improving Generalization of Clinical Text De-identification Models via Data Augmentation" (accepted to the 3rd Clinical Natural Language Processing Workshop at EMNLP 2020). PHICON is a simple yet effective data augmentation method for alleviating the generalization issue in de-identification. It consists of PHI augmentation and context augmentation (as shown in Figure 1): the former creates augmented training corpora by replacing PHI entities with named entities sampled from external sources, while the latter alters the background context via synonym replacement or random word insertion.

Figure 1: Toy examples of our PHICON data augmentation. SR: synonym replacement. RI: random insertion.

2. Usage

Setup

  • Download the Stanford Parser and update the corresponding path in the rule_modules.py file
  • Install the spaCy package (a quick sanity check is sketched below)
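
A quick way to confirm the setup is to load an English spaCy model and run it on a sample sentence. The model name below (en_core_web_sm) is only an assumption; use whichever English model you have downloaded.

```python
# Sanity check: load an English spaCy model and run NER on a sample sentence.
# The model name "en_core_web_sm" is an assumption; substitute the model you installed.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The patient was seen at General Hospital by Dr. Smith.")
for ent in doc.ents:
    print(ent.text, ent.label_)
```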

PHI Augmentation

The i2b2 2006 and i2b2 2014 de-identification datasets can be accessed from: https://portal.dbmi.hms.harvard.edu.

The data processing mainly follows the guidance from:
https://github.com/juand-r/entity-recognition-datasets/tree/master/data/i2b2_2006
https://github.com/juand-r/entity-recognition-datasets/tree/master/data/i2b2_2014

We also show detailed steps for data processing and PHI augmentation in the following two notebooks:
PHI augmentation-i2b2-2006 dataset.ipynb
PHI augmentation-i2b2-2014 dataset.ipynb

If you already have a de-identification dataset in BIO format, you can perform PHI augmentation directly by following the guidance in this notebook:
PHI augmentation-your-own-dataset.ipynb
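
For orientation, the following is a minimal, self-contained sketch of the PHI-augmentation idea on BIO-tagged data: each PHI span is swapped for an entity of the same type sampled from an external pool. The pools and PHI type names here are illustrative placeholders, not the resources used in the notebooks above.

```python
# Minimal sketch of PHI augmentation on BIO-tagged data (illustrative only;
# see the notebooks above for the actual pipeline and entity sources).
import random

# Hypothetical surrogate pools keyed by PHI type.
ENTITY_POOLS = {
    "NAME": [["John", "Smith"], ["Mary", "Jones"]],
    "HOSPITAL": [["General", "Hospital"], ["City", "Clinic"]],
}

def phi_augment(tokens, tags):
    """Replace each PHI span (B-*/I-* tags) with a sampled surrogate entity,
    keeping the BIO labels aligned with the new tokens."""
    out_tokens, out_tags = [], []
    i = 0
    while i < len(tokens):
        tag = tags[i]
        if tag.startswith("B-") and tag[2:] in ENTITY_POOLS:
            etype = tag[2:]
            i += 1
            while i < len(tokens) and tags[i] == f"I-{etype}":
                i += 1                      # skip the original PHI span
            surrogate = random.choice(ENTITY_POOLS[etype])
            out_tokens.extend(surrogate)
            out_tags.extend([f"B-{etype}"] + [f"I-{etype}"] * (len(surrogate) - 1))
        else:
            out_tokens.append(tokens[i])
            out_tags.append(tag)
            i += 1
    return out_tokens, out_tags

# Example: "Mrs Brown visited Mercy Hospital ."
tokens = ["Mrs", "Brown", "visited", "Mercy", "Hospital", "."]
tags = ["O", "B-NAME", "O", "B-HOSPITAL", "I-HOSPITAL", "O"]
print(phi_augment(tokens, tags))
```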

Context Augmentation

python Context_Aug.py
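
Context_Aug.py is the script that performs context augmentation. As a rough, independent sketch of the two operations it combines (synonym replacement and random insertion on non-PHI tokens), the snippet below uses NLTK's WordNet as the synonym source; that choice, and the function names, are assumptions rather than the script's actual implementation.

```python
# Rough sketch of context augmentation (synonym replacement + random insertion)
# on non-PHI tokens; not the implementation in Context_Aug.py.
# Requires the WordNet data: nltk.download("wordnet").
import random
from nltk.corpus import wordnet

def synonyms(word):
    """Collect WordNet synonyms of a word, excluding the word itself."""
    syns = {l.name().replace("_", " ")
            for s in wordnet.synsets(word) for l in s.lemmas()}
    syns.discard(word)
    return sorted(syns)

def synonym_replacement(tokens, tags, n=1):
    """Replace up to n non-PHI (tag 'O') tokens with a WordNet synonym."""
    tokens = list(tokens)
    candidates = [i for i, t in enumerate(tags) if t == "O" and synonyms(tokens[i])]
    for i in random.sample(candidates, min(n, len(candidates))):
        tokens[i] = random.choice(synonyms(tokens[i]))
    return tokens

def random_insertion(tokens, tags, n=1):
    """Insert a synonym of a random non-PHI token at a random position."""
    tokens, tags = list(tokens), list(tags)
    for _ in range(n):
        candidates = [i for i, t in enumerate(tags) if t == "O" and synonyms(tokens[i])]
        if not candidates:
            break
        word = random.choice(synonyms(tokens[random.choice(candidates)]))
        pos = random.randint(0, len(tokens))
        tokens.insert(pos, word)
        tags.insert(pos, "O")
    return tokens, tags
```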

3. Citation

Please cite the paper if you use the code or any resources in this repo:

@inproceedings{yue2020phicon,
  title={PHICON: Improving Generalization of Clinical Text De-identification Models via Data Augmentation},
  author={Xiang Yue and Shuang Zhou},
  booktitle={Proceedings of the 3rd Clinical Natural Language Processing Workshop},
  year={2020}
}
