Ktrain BioBert_NER

This repository contains data and a BioBERT-based NER model, monologg/biobert_v1.1_pubmed (a community-uploaded Hugging Face model), for detecting entities such as chemicals and diseases.

Setting up an environment

  1. Follow the installation instructions for Conda.

  2. Create a Conda environment called "Ktrain_NER" with Python 3.7.0:

    conda create -n Ktrain_NER python=3.7.0
  3. Activate the Conda environment:

    conda activate Ktrain_NER

Installation

Install the required packages:

$ pip install tensorflow==2.1.0
$ pip install torch==1.4.0
$ pip install ktrain==0.12.0

If you want to convert your IOB-tagged data to the BILOU scheme using iobToBilou.py in the utilities folder, install spaCy with the command below; a sketch of the conversion follows the command.

$ conda install -c conda-forge spacy
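
For reference, spaCy ships an iob_to_biluo helper that performs the per-sentence tag conversion. The sketch below is our own illustration of the idea, not the contents of iobToBilou.py; note that spaCy spells the scheme "BILUO".

# IOB -> BILOU conversion sketch using spaCy's built-in helper.
try:
    from spacy.training import iob_to_biluo  # spaCy 3.x
except ImportError:
    from spacy.gold import iob_to_biluo      # spaCy 2.x

iob_tags = ["O", "B-Chemical", "I-Chemical", "B-Disease", "O"]
print(iob_to_biluo(iob_tags))
# ['O', 'B-Chemical', 'L-Chemical', 'U-Disease', 'O']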

Dataset

Download the dataset provided in the data folder (BC5CDR-IOB), place it in any directory you like, and point TRAIN_DATA and VALIDATION_DATA in parameters.py at it. Use train-dev.tsv for training and test.tsv for validation.

ktrain can use both the training and validation data, or the training data alone.
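
For orientation, loading the files and building the model in ktrain might look like the sketch below. This is our illustration rather than a copy of run_ner.py: it assumes ktrain's entities_from_conll2003 loader fits the tab-separated BC5CDR files and that the 'bilstm-bert' sequence tagger is used with the BioBERT embeddings.

import ktrain
from ktrain import text

# Paths normally supplied via TRAIN_DATA and VALIDATION_DATA in parameters.py
TRAIN_DATA = 'data/BC5CDR-IOB/train-dev.tsv'
VALIDATION_DATA = 'data/BC5CDR-IOB/test.tsv'

# Parse the CoNLL-style token/tag files into ktrain datasets
(trn, val, preproc) = text.entities_from_conll2003(TRAIN_DATA,
                                                   val_filepath=VALIDATION_DATA)

# BiLSTM tagger with BioBERT embeddings from the community Hugging Face model
model = text.sequence_tagger('bilstm-bert', preproc,
                             bert_model='monologg/biobert_v1.1_pubmed')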

Learning rate hyper-parameter

lr_find() records the loss over a range of learning rates.

def lr_find(self, start_lr=1e-7, lr_mult=1.01, max_epochs=None,
            stop_factor=4, show_plot=False, verbose=1):
    """
    Args:
        start_lr (float): smallest lr with which to start the simulation
        lr_mult (float): multiplication factor used to increase the lr.
                         Ignored if max_epochs is supplied.
        max_epochs (int): maximum number of epochs to simulate.
                          lr_mult is ignored if max_epochs is supplied.
                          Default is None. Set max_epochs to an integer
                          (e.g., 5) if lr_find is taking too long
                          and running for more epochs than desired.
        stop_factor (int): factor used to determine the threshold that the
                           loss must exceed to stop the training simulation.
                           Increase this if the loss is erratic and lr_find
                           exits too early.
        show_plot (bool): if True, automatically invoke lr_plot
        verbose (bool): specifies how much output to print
    Returns:
        float: numerical estimate of the best lr.
               The lr_plot method should be invoked to
               identify the maximal lr at which the loss is still falling.
    """

To use lr_find() we need a learner object, which we can construct with the ktrain.get_learner() function by passing it the model and the data.

learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=128, eval_batch_size=64)
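
With the learner constructed, the simulation is two standard ktrain calls. After lr_find() finishes, inspect the plot and pick a rate on the falling part of the loss curve:

learner.lr_find()   # simulate training while gradually increasing the lr
learner.lr_plot()   # plot loss vs. lr and pick a value where loss still falls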

After trying several learning rates (1e-5, 1e-4, 5e-3, 8e-4), we found that in our case the optimal lr is approximately 1e-3.

[lr_find plot: loss vs. learning rate]

Train and validate model

Use python run_ner.py to train and validate the model.

Result

We got the best result using the SGDR learning rate scheduler on BC5CDR-IOB with lr=1e-3, n_cycles=3, cycle_len=1, and cycle_mult=2. The weights are available in the weights folder.

learner.fit(1e-3, 3, cycle_len=1, cycle_mult=2, checkpoint_folder='/checkpoints/SGDR', early_stopping=3)

[SGDR plot: learning rate schedule across cycles]

              precision    recall  f1-score   support

    Chemical       0.91      0.91      0.91      5385
     Disease       0.75      0.81      0.78      4424

   micro avg       0.83      0.87      0.85      9809
   macro avg       0.84      0.87      0.85      9809
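
To tag new text with the trained model, ktrain's standard predictor API can be used; the sentence below is our own example:

predictor = ktrain.get_predictor(learner.model, preproc)
predictor.predict('Cisplatin induced nephrotoxicity in rats.')
# returns (token, tag) pairs, e.g. ('Cisplatin', 'B-Chemical')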

Result using fastText

We used crawl-300d-2M-subword from the fastText pre-trained word vectors instead of randomly initialized word embeddings, with the same parameters and data as before.
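
In ktrain, pre-trained vectors are supplied when building the tagger through the wv_path_or_url parameter of text.sequence_tagger. A sketch, assuming the standard fastText download URL; everything else matches the earlier model-building snippet:

WV_URL = ('https://dl.fbaipublicfiles.com/fasttext/vectors-english/'
          'crawl-300d-2M-subword.vec.zip')

model = text.sequence_tagger('bilstm-bert', preproc,
                             bert_model='monologg/biobert_v1.1_pubmed',
                             wv_path_or_url=WV_URL)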

              precision    recall  f1-score   support

     Disease       0.76      0.79      0.77      4424
    Chemical       0.91      0.89      0.90      5385

   micro avg       0.84      0.85      0.84      9809
   macro avg       0.84      0.85      0.85      9809

Result using fastText and BILOU schemed data

In this experiment we used BC5CDR-BILOU, the BILOU-tagged version of the dataset, instead of IOB, with crawl-300d-2M-subword (fastText word vectors) and the same parameters as before.

              precision    recall  f1-score   support

    Chemical       0.91      0.74      0.82      5374
     Disease       0.74      0.72      0.73      4397

   micro avg       0.83      0.73      0.78      9771
   macro avg       0.83      0.73      0.78      9771

References

  1. Tuning Learning Rates

  2. English NER example notebook

  3. Text Sequence Tagging for Named Entity Recognition

  4. A Newbie’s Guide to Stochastic Gradient Descent With Restarts

  5. Exploring Stochastic Gradient Descent with Restarts (SGDR)
