Changes the encoding of CoNLL-03 NER datasets from BIO to BIOLU

taasmoe/BIO-to-BIOLU

BIO-to-BIOLU

The CoNLL 2003 NER dataset is annotated using the BIO labelling scheme. Each token is labelled according to its position relative to a named entity (NE), using the following three markers:

  • B- for the first token of an NE,
  • I- for subsequent tokens inside an NE,
  • O for tokens outside any NE.

The BIOLU scheme has been shown to outperform BIO [Ratinov and Roth, 2009]. It adds two markers:

  • L- for the last token of a multi-token NE,
  • U- for single-token (unit-length) NEs.

This Python script converts a BIO-encoded file to BIOLU.
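
The mapping between the two schemes can be illustrated with a minimal sketch. This is not the code in biolu_encode.py. The function name bio_to_biolu is hypothetical, and the sketch assumes well-formed BIO input in which every I- tag follows a B- or I- tag of the same entity type. A B- tag becomes U- and an I- tag becomes L- whenever the next tag does not continue the same entity.

```python
from typing import List


def bio_to_biolu(tags: List[str]) -> List[str]:
    """Convert one sentence's BIO tags to BIOLU (hypothetical helper)."""
    converted = []
    for i, tag in enumerate(tags):
        if tag == "O":
            converted.append(tag)
            continue
        prefix, _, entity_type = tag.partition("-")
        # Treat the end of the sentence like an O tag.
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same type.
        continues = next_tag == f"I-{entity_type}"
        if continues:
            converted.append(tag)  # B- and I- stay as they are
        elif prefix == "B":
            converted.append(f"U-{entity_type}")  # single-token entity
        else:
            converted.append(f"L-{entity_type}")  # last token of an entity
    return converted


example = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-ORG", "I-ORG", "I-ORG"]
print(bio_to_biolu(example))
# ['B-PER', 'L-PER', 'O', 'U-LOC', 'O', 'B-ORG', 'I-ORG', 'L-ORG']
```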

Usage

Run the following from the command line, giving the path of the original BIO-encoded file and the path where the converted file should be written.

python biolu_encode.py bio_path biolu_path

Tested for Python 3.6.
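
For readers who want to see what such a conversion involves at the file level, here is a rough sketch under stated assumptions; it does not reproduce biolu_encode.py. It assumes the usual CoNLL-03 layout: one token per line with whitespace-separated columns, the NER tag in the last column, and blank lines between sentences. The function name convert_lines and the sample lines are illustrative only.

```python
from typing import Iterable, List


def convert_lines(lines: Iterable[str]) -> List[str]:
    """Rewrite the last column of CoNLL-style lines from BIO to BIOLU."""
    rows = [line.rstrip("\n") for line in lines]
    out = []
    for i, row in enumerate(rows):
        columns = row.split()
        # Blank lines (sentence breaks) and O-tagged tokens pass through.
        if not columns or columns[-1] == "O":
            out.append(row)
            continue
        prefix, _, entity_type = columns[-1].partition("-")
        # Look ahead one line; a blank line or end of input ends the entity.
        next_columns = rows[i + 1].split() if i + 1 < len(rows) else []
        next_tag = next_columns[-1] if next_columns else "O"
        if next_tag != f"I-{entity_type}":
            columns[-1] = ("U-" if prefix == "B" else "L-") + entity_type
        # Note: joining with single spaces normalises the column spacing.
        out.append(" ".join(columns))
    return out


# Illustrative four-column lines (token, POS, chunk, NER tag).
sample = [
    "Anna NNP B-NP B-PER",
    "Berg NNP I-NP I-PER",
    "visited VBD B-VP O",
    "Oslo NNP B-NP B-LOC",
    "",
    "Hello UH B-INTJ O",
]
for line in convert_lines(sample):
    print(line)
```

Only the last column is rewritten, so chunk tags such as B-NP in the third column are left untouched. Reading from bio_path and writing the result to biolu_path would wrap this function in ordinary file I/O.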

Examples

eng-biolu.toy is the result of converting eng.toy.
