IAMsystem

A Java implementation of IAMsystem algorithm, a fast dictionary-based approach for semantic annotation, a.k.a entity linking.

Installation

Add the dependency to your pom.xml:

<dependency>
 	<groupId>fr.erias</groupId>
	<artifactId>IAMsystem</artifactId>
	<version>2.2.0</version>
</dependency>

Usage

You provide a list of keywords you want to detect in a document, you can add and combine abbreviations, normalization methods (lemmatization, stemming) and approximate string matching algorithms, IAMsystem algorithm performs the semantic annotation.

See the documentation for the configuration details. Although the documentation is for the python implementation, the Java implementation offers the same functionalities and uses the same variable names.

Quick example

Matcher matcher = new MatcherBuilder()
		.keywords("North America", "South America")
		.stopwords("and")
		.abbreviations("amer", "America")
		.levenshtein(5, 1, Algorithm.TRANSPOSITION)
		.w(2)
		.build();
List<IAnnotation> annots = matcher.annot("Northh and south Amer.");
for (IAnnotation annot : annots) {
	System.out.println(annot);
}
// Northh Amer	0 6;17 21	North America
// south Amer	11 21	South America

Algorithm

The algorithm was developed in the context of a PhD thesis. It proposes a solution to quickly annotate documents using a large dictionary (> 300K keywords) and fuzzy matching algorithms. No string distance algorithm is implemented in this package, it imports and leverages external libraries. Its algorithmic complexity is O(n(log(m))) with n the number of tokens in a document and m the size of the dictionary. The formalization of the algorithm is available in this paper.

It has participated in several semantic annotation competitions in the medical field where it has obtained satisfactory results, for example by obtaining the best results in the Codiesp shared task. A dictionary-based model can achieve close performance to a transformer-based model when the task is simple or when the training set is small. Its main advantage is its speed, which allows a baseline to be generated quickly.

How it works

Like FlashText and Spacy's phrasematcher algorithms it stores a terminology in a tree data structure (called a trie) for low memory storage and fast lookup.

IAMsystem handles fuzzy string matching at the token level of a n-gram term. In the example below, it detects the term "insuffisance cardiaque aigue" in a document containing "ins cardiaqu aigue":

Approximate string matching

By default, IAMsystem performs exact match only. The following string matching algorithms are available in IAMsystem:

Apache common StringEncoders (Metaphone, Soundex, Caverphone...)
Levenshtein distance (by https://github.com/universal-automata/liblevenshtein-java)
Abbreviations (a dictionary must be provided)
Truncation
ClosestSubString
Regex

You can also add your own fuzzy matching algorithm. Examples of token matching with different algorithms:

token in document	token(s) in terminology	Approximate string matching algorithm
amocssicilllline	amoxicilline	Soundex
amoicilline	amoxicilline	Levenshtein (edit distance of 1)
amoxicil.	amoxicilline	Truncation
amoxicillinesssss	amoxicilline	ClosestSubString
bp	blood pressure	Abbreviations

References

Performance (recall, precision, F-measure) of IAMsystem were evaluated on two information extraction tasks of the CLEF eHealth initiative. IAMsystem's papers:

Cossin S, Jouhet V, Mougin F, Diallo G, Thiessard F. IAM at CLEF eHealth 2018: Concept Annotation and Coding in French Death Certificates. https://arxiv.org/abs/1807.03674
Cossin S and Jouhet V. IAM at CLEF eHealth 2020: Concept Annotation in Spanish Electronic Health Records. http://ceur-ws.org/Vol-2696/paper_198.pdf
Cossin S, Diallo G, Jouhet V. IAM at IberLEF 2022: NER of Species Mentions. CEUR workshop proceedings [Internet]. sept 2022. https://ceur-ws.org/Vol-3202/livingner-paper11.pdf

Organizers' papers:

Névéol A, Robert A, Grippo F, Morgand C, Orsi C, Pelikan L, et al. CLEF eHealth 2018 Multilingual Information Extraction Task Overview: ICD10 Coding of Death Certificates in French, Hungarian and Italian. In: CLEF. 2018. http://ceur-ws.org/Vol-2125/invited_paper_18.pdf
Miranda-Escalada A, Gonzalez-Agirre A, Armengol-Estapé J, Krallinger M. Overview of automatic clinical coding: annotations, guidelines, and solutions for non-English clinical cases at CodiEsp track of eHealth CLEF 2020. CEUR-WS. 2020; http://ceur-ws.org/Vol-2696/paper_263.pdf
A. Miranda-Escalada, E. Farré-Maduell, S. Lima-López, D. Estrada, L. Gascó, M. Krallinger, Mention detection, normalization & classification of species, pathogens, humans and food in clinical documents: Overview of livingner shared task and resources, Procesamiento del Lenguaje Natural (2022)

Release note:

Version
0.0.1	First publication of the algorithm (November 2018)
1.0.0	First major modification. Change the output object of the detector (December 2020), add TermDetector
1.2.0	Re-implement the trie and add a cache mechanism to improve performance
1.3.0	Add support to the Apache common StringEncoder library, add Troncation and ClosestSubString algorithms
2.1.0	Complete re-write of the library to be in sync with the Python implementation and its documentation. The core algorithm, aka the matching strategy, changed to the Window Strategy which allows detection of discontinuous sequences of tokens in a document. The strategy used in previous versions (<2.1.0) is called the NoOverlap strategy.
2.2.0	Fix issue 18: create multiple annotations when a keyword is repeated in the same window.

Acknowledgement

This annotation tool is part of the Drugs Systematized Assessment in real-liFe Environment (DRUGS-SAFE) research platform that is funded by the French Medicines Agency (Agence Nationale de Sécurité du Médicament et des Produits de Santé, ANSM). This platform aims at providing an integrated system allowing the concomitant monitoring of drug use and safety in France.

Citation

@article{cossin_iam_2018,
	title = {{IAM} at {CLEF} {eHealth} 2018: {Concept} {Annotation} and {Coding} in {French} {Death} {Certificates}},
	shorttitle = {{IAM} at {CLEF} {eHealth} 2018},
	url = {http://arxiv.org/abs/1807.03674},
	urldate = {2018-07-11},
	journal = {arXiv:1807.03674 [cs]},
	author = {Cossin, Sébastien and Jouhet, Vianney and Mougin, Fleur and Diallo, Gayo and Thiessard, Frantz},
	month = jul,
	year = {2018},
	note = {arXiv: 1807.03674},
	keywords = {Computer Science - Computation and Language},
}

Name		Name	Last commit message	Last commit date
Latest commit History 172 Commits
IAMsystem		IAMsystem
IAMsystemFR		IAMsystemFR
example		example
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
search_algorithm.png		search_algorithm.png
trie_datastructure.png		trie_datastructure.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

IAMsystem

IAMsystem

IAMsystemFR

IAMsystemFR

example

example

.gitignore

.gitignore

LICENSE

LICENSE

README.md

README.md

search_algorithm.png

search_algorithm.png

trie_datastructure.png

trie_datastructure.png

Repository files navigation

IAMsystem

Installation

Usage

Quick example

Algorithm

How it works

Approximate string matching

References

Release note:

Acknowledgement

Citation

About

Releases

Packages

Languages

License

scossin/IAMsystem

Folders and files

Latest commit

History

Repository files navigation

IAMsystem

Installation

Usage

Quick example

Algorithm

How it works

Approximate string matching

References

Release note:

Acknowledgement

Citation

About

Topics

Resources

License

Stars

Watchers

Forks

Languages