Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

(unreleased)

Removed

the config_file keyword, now replaced by config which accepts both filenames and dicts
old lookup list names, e.g. prefixes now replaced by prefix
annotator types custom, regexp, token_pattern, dd_token_pattern and annotation_context, all replaced by setting class directly as annotator_type
everything in deduce.pattern, patient patterns now replaced by PatientNameAnnotator

3.0.2 (2024-02-15)

Changed

recognize 4+ spaces as a token, blocking annotations

3.0.1 (2023-12-20)

Fixed

a bug with packaging base_config.json

3.0.0 (2023-12-20)

Added

speed optimizations, ~250%
pseudo-annotating eponymous diseases (e.g. Creutzfeldt-Jakob)
PatientNameAnnotator, which replaces deduce.pattern
a structured way for loading and building lookup structures (lists and tries), including caching
pre_match_words for some regexp annotators, speeding up the annotating
option to present a user config as dict (using config keyword)

Changed

speedup for TokenPatternAnnotator
some internals of ContextPatternAnnotator
initials now detected by lookup list, rather than pattern
redactor open and close chars from < > to [ ], as previous chars caused issues in html (so deidentified text now shows [PATIENT], [LOCATIE], etc.)
names of lookup structures to singular (prefix, rather than prefixes)
INSTELLING tag to ZIEKENHUIS and ZORGINSTELLING
refactored and simplified annotator loading; specifically, the annotator_type config keyword now accepts references to classes (e.g. deduce.annotator.TokenPatternAnnotator)
renamed interfix_with_capital annotator to interfix_with_name
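
As a rough illustration of how a dotted class reference given as annotator_type could be resolved to a class, the sketch below uses importlib. The helper name resolve_class is hypothetical and is not part of deduce; it is demonstrated on a standard-library class so that it runs without deduce installed.

```python
import importlib


def resolve_class(reference: str) -> type:
    """Resolve a dotted reference such as 'package.module.ClassName' to a class.

    Hypothetical helper for illustration only; not deduce's actual loader.
    """
    # Split "package.module.ClassName" into module path and class name.
    module_path, _, class_name = reference.rpartition(".")
    if not module_path:
        raise ValueError(f"expected a dotted path, got {reference!r}")

    module = importlib.import_module(module_path)
    obj = getattr(module, class_name)

    if not isinstance(obj, type):
        raise TypeError(f"{reference!r} does not refer to a class")
    return obj


# Demonstrated with a standard-library class instead of a deduce annotator.
ordered_dict_cls = resolve_class("collections.OrderedDict")
```

Under the same assumption, a reference such as deduce.annotator.TokenPatternAnnotator would be resolved the same way once deduce is importable.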

Deprecated

the config_file keyword, now replaced by config which accepts both filenames and dicts
old lookup list names, e.g. prefixes now replaced by prefix
annotator types custom, regexp, token_pattern, dd_token_pattern and annotation_context, all replaced by setting class directly as annotator_type
everything in deduce.pattern, patient patterns now replaced by PatientNameAnnotator

Removed

automated coverage reporting on coveralls.io
options lowercase_lookup, lowercase_neg_lookup for token patterns
utils.any_in_text

Fixed

some small additions/removals for specific lookup lists
smaller bugs related to overlapping matches

2.5.0 (2023-11-28)

Added

the RegexpPseudoAnnotator component for filtering regexp matches based on preceding/following words
a prefix_with_interfix pattern for names, detecting e.g. Dr. van Loon

Changed

the age detection component, with improved logic and pseudo patterns
annotations are no longer counted adjacent when separated by a comma
streets are prioritized over names when merging overlapping annotations
removed some false positives for postal codes ending in gr or ie
extended the postbus pattern for xx.xxx format (old notation)
some smaller optimizations and exceptions for institution, hospital, placename, residence, medical term, first name, and last name lookup lists

Fixed

a bug with BsnAnnotator with non-digit characters in regexp

2.4.4 (2023-11-22)

Changed

multi-token lookup for first and last names, so multi-token names are now detected
some small lookup list additions

2.4.3 (2023-11-22)

Changed

extended list of medical terms

2.4.2 (2023-11-21)

Changed

name lookup list contents, extending names and adding more exceptions

2.4.1 (2023-11-15)

Added

detection of initials Ch., Chr., Ph. and Th.

2.4.0 (2023-11-15)

Added

logic for detecting hospitals, with added whitelist and separate annotator

Changed

logic for detecting (non-hospital) institutions, with extended lookup list

Removed

the separate Altrecht annotator, now included in the lookup list

2.3.1 (2023-11-01)

Fixed

include data files recursively in package

2.3.0 (2023-10-25)

Added

lookup lists (and logic) for Dutch provinces, regions, municipalities and streets

Changed

name of residences annotator to placenames, now includes provinces, regions and municipalities
lookup lists (and logic) for residences
logic for streets, housenumber and housenumber letters

2.2.0 (2023-09-28)

Changed

tokenizer logic:
- a token is now a sequence of alphanumeric characters, a single newline, or a single special character.
- whitespaces are no longer considered tokens
moved token pattern logic to config, using a new TokenPatternAnnotator
moved context pattern logic to config, using a new ContextAnnotator
many updates to name detection logic
- lookup list optimizations
- added, removed and simplified patterns
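
The 2.2.0 tokenizer rule above (a token is a run of alphanumeric characters, a single newline, or a single special character, with other whitespace skipped) can be approximated with one regular expression. This is a minimal sketch under that reading of the rule, not deduce's tokenizer; the names TOKEN_PATTERN and tokenize are hypothetical.

```python
import re

# Three alternatives, tried in order:
#   [^\W_]+    a run of letters and digits (Unicode-aware, underscore excluded)
#   \n         a single newline
#   [^\w\s]|_  a single special (non-alphanumeric, non-whitespace) character
TOKEN_PATTERN = re.compile(r"[^\W_]+|\n|[^\w\s]|_")


def tokenize(text: str) -> list[str]:
    """Split text into tokens; spaces and tabs are skipped, not emitted."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("Jan de Vries,\n0612345678")
# tokens == ["Jan", "de", "Vries", ",", "\n", "0612345678"]
```

The sketch does not cover later tokenizer changes, such as treating a run of four or more spaces as a token.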

2.1.0 (2023-08-07)

Added

a component for deidentifying BSN-nummers

Changed

updated dependencies
by default, deduce now recognizes and tags bsn nummers
by default, deduce now recognizes all other 7+ digit numbers as identifiers
improved regular expressions for e-mail address and url matching, with separate tags
logic for detecting phone numbers (improvements for hyphens, whitespaces, false positive identifiers)
improved regular expression for age matching
date detection logic:
- now only recognizes combinations of day, month and year (day/month combinations caused many false positives)
- detects year-month-day format in addition to day-month-year
loading a custom config now only replaces the config options that are explicitly set, using defaults for those not included in the custom config
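
The config behaviour described above, where a custom config overrides only the options it sets and defaults fill in the rest, can be sketched as a recursive dictionary merge. This is an assumption about the semantics, not deduce's implementation: the function merge_config and all config keys shown are hypothetical, and whether the real merge is deep or shallow is not stated here.

```python
from copy import deepcopy


def merge_config(defaults: dict, custom: dict) -> dict:
    """Return defaults overridden by the keys explicitly set in custom."""
    merged = deepcopy(defaults)
    for key, value in custom.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested sections are merged key by key rather than replaced.
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical keys, for illustration only.
defaults = {
    "redactor_open_char": "[",
    "redactor_close_char": "]",
    "annotators": {"bsn": {"enabled": True}, "date": {"enabled": True}},
}
custom = {"annotators": {"date": {"enabled": False}}}

merged = merge_config(defaults, custom)
```

In this sketch only the date annotator setting changes; every option missing from the custom config keeps its default value, and the defaults dictionary itself is left unmodified.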

Deprecated

backwards compatibility, which was temporarily added to transition from v1 to v2

Removed

a separate patient identifier tag, now superseded by a generic tag
detection of day/month combinations for dates, as this caused many false positives (e.g. lab values, numeric scores)
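
To illustrate why dropping day/month combinations reduces false positives, the sketch below matches only full numeric dates, in either day-month-year or year-month-day order. The patterns are deliberately simplified and hypothetical; deduce's own date logic is more extensive (this sketch ignores written month names, for example).

```python
import re

DAY = r"(?:0?[1-9]|[12][0-9]|3[01])"
MONTH = r"(?:0?[1-9]|1[0-2])"
YEAR = r"(?:19|20)\d{2}"
SEP = r"[-/.]"

# A date needs all three parts; a bare day/month pair such as "7/8" is ignored.
DATE_PATTERN = re.compile(
    rf"(?<!\d)(?:{DAY}{SEP}{MONTH}{SEP}{YEAR}|{YEAR}{SEP}{MONTH}{SEP}{DAY})(?!\d)"
)


def find_dates(text: str) -> list[str]:
    """Return all full dates (day-month-year or year-month-day) found in text."""
    return [match.group(0) for match in DATE_PATTERN.finditer(text)]


dates = find_dates("Opname 12-03-2021, ontslag 2021-03-15, Hb 7/8")
# dates == ["12-03-2021", "2021-03-15"]
```

The lab value "7/8" in the example is left alone because it has no year component.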

Fixed

annotations can no longer be counted as adjacent when separated by newline or tab (and will thus not be merged)
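
A minimal sketch of the adjacency rule after this fix: two annotations with the same tag are merged only if the text between them is empty or a single space, so a newline or tab between them blocks merging. The Span class, the are_adjacent function and the exact set of allowed gaps are assumptions made for illustration, not deduce's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical annotation: character offsets plus a tag."""

    start: int
    end: int
    tag: str


def are_adjacent(text: str, left: Span, right: Span) -> bool:
    """Return True if right follows left with at most a single space between."""
    if left.tag != right.tag or left.end > right.start:
        return False
    gap = text[left.end:right.start]
    # Newlines, tabs and punctuation in the gap all prevent merging.
    return gap in ("", " ")


first = Span(0, 3, "PERSOON")
second = Span(4, 10, "PERSOON")

same_line = are_adjacent("Jan Jansen", first, second)       # True
across_lines = are_adjacent("Jan\nJansen", first, second)   # False
```

With this rule, a name at the end of one line and a name at the start of the next stay two separate annotations.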

2.0.3 (2023-04-06)

Fixed

removed 'decibutus' from list of institutions as it caused many false positives

2.0.2 (2023-03-28)

Changed

upgraded dependencies, including markdown-it-py which had a vulnerability

2.0.1 (2022-12-09)

Changed

upgraded dependencies

2.0.0 (2022-12-05)

Added

introduced new interface for deidentification, using Deduce() class
a separate documentation page, with tutorial and migration guide
support for python 3.10 and 3.11

Changed

major refactor that touches pretty much every line of code
use docdeid package for logic
speedups: now 973% faster
use lookup sets instead of lookup lists
refactor tokenizer
refactor annotators into separate classes, using structured annotations
guidelines for contributing

Removed

the annotate_text and deidentify_annotations functions
all in-text annotation (under the hood) and associated functions
support for given names; a given name can be added as another first name in the Person class
support for python 3.7 and 3.8

Fixed

< and > are no longer replaced by ( and ) respectively
deduce does not strip text (whitespaces, tabs at beginning/end of text) anymore

1.0.8 (2021-11-29)

Added

warn if there are any structured annotations whose annotated text does not match the original text in the span denoted by the structured annotation

Fixed

various modifications related to adding or subtracting spaces in annotated texts
remove the lowercasing of institutions' names
therefore, all structured annotations have texts matching the original text in the same span

1.0.7 (2021-11-03)

Changed

Internal code formatting improvements

Added

Contributing guidelines

1.0.6 (2021-10-06)

Fixed

Bug with multiple 4-digit mg dosages in one text

1.0.5 (2021-10-05)

Fixed

Minor bug where tag flattening had no effect

1.0.4 (2021-10-05)

Added

Changelog
Additional unit tests for whitespace/punctuation

Fixed

Various whitespace/punctuation bugs
Bug with nested tags not related to person names
Bug with adjacent tags not being merged

1.0.3 (2021-07-07)

Added

Structured annotations
Some unit tests

Fixed

Error with outdated unicode package
Bug with context

1.0.2

Release to PyPI

1.0.1

Small bugfix for None as input

1.0.0

Initial version