Commit 0c48dde: move training description to separate file
alistairewj committed Mar 18, 2021 (1 parent: 3863bc4)
Showing 2 changed files with 100 additions and 101 deletions.
102 changes: 1 addition & 101 deletions README.md
99 changes: 99 additions & 0 deletions TRAINING.md
# Training and evaluating a transformer model

First, you'll need a suitable dataset. Right now this can be one of: i2b2_2014, i2b2_2006, PhysioNet, or Dernoncourt-Lee.
A dataset is considered suitable if it is saved in the following format:

* a root folder dedicated to the dataset
* train/test subfolders
* each train/test subfolder has ann/txt subfolders
* the txt subfolder has files with the `.txt` extension containing the text to be deidentified
* the ann subfolder has files with the `.gs` extension containing a CSV of gold standard de-id annotations

Here's an example:

```
i2b2_2014
├── train
│ ├── ann
│ │ ├── 100-01.gs
│ │ ├── 100-02.gs
│ │ └── 100-03.gs
│ └── txt
│ ├── 100-01.txt
│ ├── 100-02.txt
│ └── 100-03.txt
└── test
├── ann
│ ├── 110-01.gs
│ ├── 110-02.gs
│ └── 110-03.gs
└── txt
├── 110-01.txt
├── 110-02.txt
└── 110-03.txt
```
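The layout above can be verified before training. Here is a minimal sketch (not part of the repository's scripts) that checks the train/test and ann/txt structure and that every `.gs` file has a matching `.txt` file:

```python
from pathlib import Path

def check_deid_layout(root):
    """Verify the train/test + ann/txt structure and matching file stems."""
    root = Path(root)
    problems = []
    for split in ("train", "test"):
        ann = root / split / "ann"
        txt = root / split / "txt"
        if not (ann.is_dir() and txt.is_dir()):
            problems.append(f"missing ann/txt under {split}/")
            continue
        # symmetric difference: stems present on one side but not the other
        ann_stems = {p.stem for p in ann.glob("*.gs")}
        txt_stems = {p.stem for p in txt.glob("*.txt")}
        for stem in sorted(ann_stems ^ txt_stems):
            problems.append(f"{split}/{stem}: unpaired ann/txt file")
    return problems
```

An empty return value means the dataset folder matches the expected layout.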

With the dataset available, create the environment:

`conda env create -f environment.yml`

Activate the environment:

`conda activate deid`

Train a model (e.g. BERT):

```sh
python scripts/train_ner.py \
  --data_dir /data/deid-gs/i2b2_2014 \
  --data_type i2b2_2014 \
  --model_type bert \
  --model_name_or_path bert-base-uncased \
  --do_lower_case \
  --output_dir /data/models/bert-model-i2b2-2014 \
  --do_train \
  --overwrite_output_dir
```

Note that training only uses data from the `train` subfolder of the `--data_dir` argument. Once the model is trained, it can be used for de-identification as described in the README.

The `binary_evaluation.py` script can be used to assess performance on a test set. First, we'll need to generate the predictions:

```sh
export TEST_SET_PATH='/enc_data/deid-gs/i2b2_2014/test'
export MODEL_PATH='/enc_data/models/bert-i2b2-2014'
export PRED_PATH='out/'

python scripts/output_preds.py --data_dir ${TEST_SET_PATH} --model_dir ${MODEL_PATH} --output_folder ${PRED_PATH}
```

This outputs the predictions to the `out` folder. If we look at one of the files, we can see each prediction is a CSV of stand-off annotations. Here are the top few lines from the `110-01.pred` file:

```
document_id,annotation_id,start,stop,entity,entity_type,comment
110-01,4,16,20,2069,DATE,
110-01,5,20,21,-,DATE,
110-01,6,21,23,04,DATE,
110-01,7,23,24,-,DATE,
110-01,8,24,26,07,DATE,
```
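Each row is a stand-off annotation: `start` and `stop` are character offsets into the corresponding `.txt` file, and `entity` is the covered text. A minimal sketch of consuming such a file (illustrative helpers, not repository code):

```python
import csv

def read_standoff(fp):
    """Parse a stand-off annotation CSV into (start, stop, entity, entity_type) tuples."""
    return [
        (int(row["start"]), int(row["stop"]), row["entity"], row["entity_type"])
        for row in csv.DictReader(fp)
    ]

def redact(text, annotations, token="___"):
    """Replace annotated spans with a placeholder, right-to-left so offsets stay valid."""
    for start, stop, _, _ in sorted(annotations, reverse=True):
        text = text[:start] + token + text[stop:]
    return text
```

For example, applying the five DATE annotations above to the source text would mask the `2069-04-07` date piece by piece.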

We can now evaluate the predictions using the ground truth:

```sh
python scripts/binary_evaluation.py --pred_path ${PRED_PATH} --text_path ${TEST_SET_PATH}/txt --ref_path ${TEST_SET_PATH}/ann
```
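One plausible way such a binary evaluation can be computed — a sketch of the idea, not necessarily what `binary_evaluation.py` does internally — is to reduce each document to a per-character PHI/not-PHI mask and count agreements:

```python
def char_mask(spans, length):
    """Mark each character position covered by any (start, stop) span."""
    mask = [False] * length
    for start, stop in spans:
        for i in range(start, min(stop, length)):
            mask[i] = True
    return mask

def char_counts(pred_spans, gold_spans, length):
    """Character-level (tp, fp, fn) between predicted and gold spans."""
    pred = char_mask(pred_spans, length)
    gold = char_mask(gold_spans, length)
    tp = sum(p and g for p, g in zip(pred, gold))
    fp = sum(p and not g for p, g in zip(pred, gold))
    fn = sum(g and not p for p, g in zip(pred, gold))
    return tp, fp, fn
```

Sensitivity and positive predictive value then follow directly from these counts.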

For our trained model, this returned:

* Macro Se: 0.9818
* Macro P+: 0.9885
* Macro F1: 0.9840
* Micro Se: 0.9816
* Micro P+: 0.9892
* Micro F1: 0.9854
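Macro scores average the per-document sensitivity (Se) and positive predictive value (P+), while micro scores pool the counts across all documents before dividing. A minimal sketch of the distinction (the counts are illustrative, not the script's internals):

```python
def prf(tp, fp, fn):
    """Sensitivity (recall), positive predictive value (precision), and F1."""
    se = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * se * ppv / (se + ppv) if se + ppv else 0.0
    return se, ppv, f1

def macro_micro(per_doc):
    """per_doc: list of (tp, fp, fn) counts, one tuple per document."""
    scores = [prf(*counts) for counts in per_doc]
    # macro: average the per-document scores
    macro = tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))
    # micro: pool counts first, then compute one score
    micro = prf(sum(c[0] for c in per_doc),
                sum(c[1] for c in per_doc),
                sum(c[2] for c in per_doc))
    return macro, micro
```

Micro scores weight large documents more heavily, which is why the two sets of numbers above differ slightly.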

We can also look at individual predictions for a given file:

```sh
export FN=110-02
python scripts/print_annotation.py -p ${PRED_PATH}/${FN}.pred -t ${TEST_SET_PATH}/txt/${FN}.txt -r ${TEST_SET_PATH}/ann/${FN}.gs
```

If we would like a multi-class evaluation, we need to know about any label transformations done by the model, so we call a different script:

```sh
python scripts/eval.py --model_dir ${MODEL_PATH} --pred_path ${PRED_PATH} --text_path ${TEST_SET_PATH}/txt --ref_path ${TEST_SET_PATH}/ann
```
