Commit 0c48dde: move training description to separate file
alistairewj committed Mar 18, 2021 (1 parent: 3863bc4)
Showing 2 changed files with 100 additions and 101 deletions.
102 changes: 1 addition & 101 deletions README.md
99 changes: 99 additions & 0 deletions TRAINING.md
# Training and evaluating a transformer model

First, you'll need a suitable dataset. Right now this can be one of: i2b2_2014, i2b2_2006, PhysioNet, or Dernoncourt-Lee.
A dataset is considered suitable if it is saved in the following format:

* a root folder dedicated to the dataset
* train/test subfolders
* each train/test subfolder has ann/txt subfolders
* the txt subfolder has files with the `.txt` extension containing the text to be deidentified
* the ann subfolder has files with the `.gs` extension containing a CSV of gold standard de-id annotations

Here's an example:

```
i2b2_2014
├── train
│ ├── ann
│ │ ├── 100-01.gs
│ │ ├── 100-02.gs
│ │ └── 100-03.gs
│ └── txt
│ ├── 100-01.txt
│ ├── 100-02.txt
│ └── 100-03.txt
└── test
├── ann
│ ├── 110-01.gs
│ ├── 110-02.gs
│ └── 110-03.gs
└── txt
├── 110-01.txt
├── 110-02.txt
└── 110-03.txt
```
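The layout above can be verified before training. Here is a minimal sketch (not part of the repository's scripts) that checks the train/test and ann/txt structure and that every `.gs` file has a matching `.txt` file:

```python
from pathlib import Path

def check_deid_layout(root):
    """Verify the train/test + ann/txt structure and matching file stems."""
    root = Path(root)
    problems = []
    for split in ("train", "test"):
        ann = root / split / "ann"
        txt = root / split / "txt"
        if not (ann.is_dir() and txt.is_dir()):
            problems.append(f"missing ann/txt under {split}/")
            continue
        # symmetric difference: stems present on one side but not the other
        ann_stems = {p.stem for p in ann.glob("*.gs")}
        txt_stems = {p.stem for p in txt.glob("*.txt")}
        for stem in sorted(ann_stems ^ txt_stems):
            problems.append(f"{split}/{stem}: unpaired ann/txt file")
    return problems
```

An empty return value means the dataset folder matches the expected layout.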

With the dataset available, create the environment:

`conda env create -f environment.yml`

Activate the environment:

`conda activate deid`

Train a model (e.g. BERT):

```sh
python scripts/train_ner.py \
  --data_dir /data/deid-gs/i2b2_2014 \
  --data_type i2b2_2014 \
  --model_type bert \
  --model_name_or_path bert-base-uncased \
  --do_lower_case \
  --output_dir /data/models/bert-model-i2b2-2014 \
  --do_train \
  --overwrite_output_dir
```

Note that training only uses data from the `train` subfolder of the `--data_dir` argument. Once the model is trained, it can be used for de-identification as described in the README.

The `binary_evaluation.py` script can be used to assess performance on a test set. First, we'll need to generate the predictions:

```sh
export TEST_SET_PATH='/enc_data/deid-gs/i2b2_2014/test'
export MODEL_PATH='/enc_data/models/bert-i2b2-2014'
export PRED_PATH='out/'

python scripts/output_preds.py --data_dir ${TEST_SET_PATH} --model_dir ${MODEL_PATH} --output_folder ${PRED_PATH}
```

This outputs the predictions to the `out` folder. If we look at one of the files, we can see each prediction is a CSV of stand-off annotations. Here are the top few lines from the `110-01.pred` file:

```
document_id,annotation_id,start,stop,entity,entity_type,comment
110-01,4,16,20,2069,DATE,
110-01,5,20,21,-,DATE,
110-01,6,21,23,04,DATE,
110-01,7,23,24,-,DATE,
110-01,8,24,26,07,DATE,
```
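Each row is a stand-off annotation: `start` and `stop` are character offsets into the corresponding `.txt` file, and `entity` is the covered text. A minimal sketch of consuming such a file (illustrative helpers, not repository code):

```python
import csv

def read_standoff(fp):
    """Parse a stand-off annotation CSV into (start, stop, entity, entity_type) tuples."""
    return [
        (int(row["start"]), int(row["stop"]), row["entity"], row["entity_type"])
        for row in csv.DictReader(fp)
    ]

def redact(text, annotations, token="___"):
    """Replace annotated spans with a placeholder, right-to-left so offsets stay valid."""
    for start, stop, _, _ in sorted(annotations, reverse=True):
        text = text[:start] + token + text[stop:]
    return text
```

For example, applying the five DATE annotations above to the source text would mask the `2069-04-07` date piece by piece.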

We can now evaluate the predictions using the ground truth:

```sh
python scripts/binary_evaluation.py --pred_path ${PRED_PATH} --text_path ${TEST_SET_PATH}/txt --ref_path ${TEST_SET_PATH}/ann
```
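One plausible way such a binary evaluation can be computed — a sketch of the idea, not necessarily what `binary_evaluation.py` does internally — is to reduce each document to a per-character PHI/not-PHI mask and count agreements:

```python
def char_mask(spans, length):
    """Mark each character position covered by any (start, stop) span."""
    mask = [False] * length
    for start, stop in spans:
        for i in range(start, min(stop, length)):
            mask[i] = True
    return mask

def char_counts(pred_spans, gold_spans, length):
    """Character-level (tp, fp, fn) between predicted and gold spans."""
    pred = char_mask(pred_spans, length)
    gold = char_mask(gold_spans, length)
    tp = sum(p and g for p, g in zip(pred, gold))
    fp = sum(p and not g for p, g in zip(pred, gold))
    fn = sum(g and not p for p, g in zip(pred, gold))
    return tp, fp, fn
```

Sensitivity and positive predictive value then follow directly from these counts.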

For our trained model, this returned:

* Macro Se: 0.9818
* Macro P+: 0.9885
* Macro F1: 0.9840
* Micro Se: 0.9816
* Micro P+: 0.9892
* Micro F1: 0.9854
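Macro scores average the per-document sensitivity (Se) and positive predictive value (P+), while micro scores pool the counts across all documents before dividing. A minimal sketch of the distinction (the counts are illustrative, not the script's internals):

```python
def prf(tp, fp, fn):
    """Sensitivity (recall), positive predictive value (precision), and F1."""
    se = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * se * ppv / (se + ppv) if se + ppv else 0.0
    return se, ppv, f1

def macro_micro(per_doc):
    """per_doc: list of (tp, fp, fn) counts, one tuple per document."""
    scores = [prf(*counts) for counts in per_doc]
    # macro: average the per-document scores
    macro = tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))
    # micro: pool counts first, then compute one score
    micro = prf(sum(c[0] for c in per_doc),
                sum(c[1] for c in per_doc),
                sum(c[2] for c in per_doc))
    return macro, micro
```

Micro scores weight large documents more heavily, which is why the two sets of numbers above differ slightly.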

We can also look at individual predictions for a given file:

```sh
export FN=110-02
python scripts/print_annotation.py -p ${PRED_PATH}/${FN}.pred -t ${TEST_SET_PATH}/txt/${FN}.txt -r ${TEST_SET_PATH}/ann/${FN}.gs
```

If we would like a multi-class evaluation, we need to know about any label transformations done by the model, so we call a different script:

```sh
python scripts/eval.py --model_dir ${MODEL_PATH} --pred_path ${PRED_PATH} --text_path ${TEST_SET_PATH}/txt --ref_path ${TEST_SET_PATH}/ann
```
