Undesired whitespace normalization of Korean text #13401

Open
3Diller opened this issue Mar 28, 2024 · 0 comments
Labels: feat / tokenizer (Feature: Tokenizer), lang / ko (Korean language data and models)

Comments

3Diller commented Mar 28, 2024

When tokenizing Korean text, the tokenizer loses information about complex whitespace by collapsing it into a single space. Consequently, text is not equal to nlp(text).text. This discrepancy causes a problem: entities described on the original text with (start, end) character offsets no longer align with the tokenized document, and likewise the offsets of predicted entities do not match the original text.

For example, the issue invalidates code such as:

entities = [(e.start, e.end, e.type) for e in entities]
tags = spacy.training.offsets_to_biluo_tags(doc, entities)
doc.ents = spacy.training.biluo_tags_to_spans(doc, tags)

How to reproduce the behaviour

>>> import spacy; nlp = spacy.blank('ko'); nlp('Hello\n\nworld!')
Hello world!  # <- the complex '\n\n' whitespace got replaced with a single ' '.
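
For illustration, the dropped newlines shift every character offset that follows them, which is exactly what breaks the (start, end) entity indices (a minimal sketch using the same blank pipeline; the indices are just those of this example text):

>>> text = 'Hello\n\nworld!'
>>> doc = nlp(text)
>>> text == doc.text
False
>>> text.index('world'), doc.text.index('world')
(7, 6)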

Suggested solution

I presume this might not be the most correct/idiomatic/optimized solution, but at least it works for my case (named entity recognition).

This is how I solved the issue for my project:
from typing import Generator

from spacy.lang.ko import POS, TAG_MAP, KoreanTokenizer, X, check_spaces
from spacy.tokens import Doc


class _CustomKoreanTokenizer(KoreanTokenizer):
    """Custom Korean tokenizer that preserves whitespaces.

    Required to make `text` equal to `doc.text` so that recognized entity
    offsets are correctly mapped to the original text.

    See parent class for more details.
    """

    @staticmethod
    def _add_whitespace_tokens(
        text: str, tokens: list[str]
    ) -> Generator[tuple[str, bool], None, None]:
        """Insert whitespace tokens into the list of `mecab-ko` tokens."""
        prev_end = 0
        for token in tokens:
            start = text.find(token, prev_end)
            if start == -1:
                raise ValueError(f'Token "{token}" not found in "{text}"')
            if prev_end < start:
                ws_token = text[prev_end:start]
                # Create whitespace tokens only if there is something more than
                # just a single whitespace or if it's the first token.
                if ws_token != ' ' or prev_end == 0:
                    if ws_token.startswith(' ') and prev_end > 0:
                        # Leading space goes to the `prev_token.whitespace_`
                        # if it's not the first token.
                        ws_token = ws_token[1:]
                    yield ws_token, True
            yield token, False
            prev_end = start + len(token)
        if prev_end < len(text):
            # Yield what is left as a whitespace token. We yield even just a
            # single whitespace as a token since otherwise `check_spaces` won't
            # catch it at the end of the text.
            yield text[prev_end:], True

    def __call__(self, text: str) -> Doc:
        """Tokenize `text` and create a spaCy Doc object.

        This is a slightly modified version of `spacy.lang.ko.KoreanTokenizer.__call__`.
        It preserves whitespaces, which allows entity offsets to be maintained.
        """
        dtokens = list(self.detailed_tokens(text))
        tokens = []
        is_spaces = []
        for token, is_space in self._add_whitespace_tokens(
            text, [dt['surface'] for dt in dtokens]
        ):
            tokens.append(token)
            is_spaces.append(is_space)
        doc = Doc(
            self.vocab, words=tokens, spaces=list(check_spaces(text, tokens))
        )
        for token, dtoken in zip(
            (
                token
                for token, is_space in zip(doc, is_spaces, strict=True)
                if not is_space
            ),
            dtokens,
            strict=True,
        ):
            first_tag, sep, eomi_tags = dtoken['tag'].partition('+')
            token.tag_ = first_tag  # stem(어간) or pre-final(선어말 어미)
            if token.tag_ in TAG_MAP:
                token.pos = TAG_MAP[token.tag_][POS]
            else:
                token.pos = X
            token.lemma_ = dtoken['lemma']
        doc.user_data['full_tags'] = [dt['tag'] for dt in dtokens]
        return doc
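
For completeness, swapping the custom tokenizer into a pipeline looks roughly like this (a minimal sketch; the constructor takes only a vocab, mirroring how the tests below instantiate it):

import spacy

nlp = spacy.blank('ko')
nlp.tokenizer = _CustomKoreanTokenizer(nlp.vocab)  # replace the default Korean tokenizer

text = 'Hello\n\nworld!'
doc = nlp(text)
assert doc.text == text  # whitespace is preserved, so entity offsets stay aligned
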
Tests for the solution:
import pytest
import spacy
from unittest.mock import Mock, patch


class _FakeMecabNode:
    def __init__(self, surface, feature, is_eos=False):
        self.surface = surface
        self.feature = feature
        self.is_eos = lambda: is_eos


@pytest.mark.parametrize(
    ('text', 'tokens', 'whitespaces', 'is_spaces'),
    [
        ('', [], [], []),
        (' ', [' '], [''], [True]),
        ('A', ['A'], [''], [False]),
        ('A ', ['A', ' '], ['', ''], [False, True]),
        (' A', [' ', 'A'], ['', ''], [True, False]),
        (
            ' A B ',
            [' ', 'A', 'B', ' '],
            ['', ' ', '', ''],
            [True, False, False, True],
        ),
        (
            ' \n Hello  \n\n  \n great\n\nworld !\n\n',
            [
                ' \n ',
                'Hello',
                ' \n\n  \n ',
                'great',
                '\n\n',
                'world',
                '!',
                '\n\n',
            ],
            ['', ' ', '', '', '', ' ', '', ''],
            [True, False, True, False, True, False, False, True],
        ),
    ],
)
@patch(
    'spacy.lang.ko.try_mecab_import',
    Mock(
        return_value=lambda _: Mock(
            parse=lambda text, as_nodes: [
                _FakeMecabNode(x, 'A,B,C/D/E') for x in text.split()
            ]
            + [_FakeMecabNode('', '', True)]
        )
    ),
)
def test_custom_korean_tokenizer(text, tokens, whitespaces, is_spaces):
    doc = _CustomKoreanTokenizer(
        spacy.vocab.create_vocab('ko', spacy.language.BaseDefaults)
    )(text)
    assert text == doc.text
    assert tokens == [x.text for x in doc]
    assert whitespaces == [x.whitespace_ for x in doc]
    assert is_spaces == [x.is_space for x in doc]

Your Environment

  • spaCy version: 3.7.2
  • Platform: Linux-6.1.75+-x86_64-with-glibc2.31
  • Python version: 3.9.2
svlandeg added the lang / ko (Korean language data and models) and feat / tokenizer (Feature: Tokenizer) labels on Apr 4, 2024