Undesired whitespace normalization of Korean text #13401

Open
3Diller opened this issue Mar 28, 2024 · 0 comments
Labels: feat / tokenizer (Feature: Tokenizer), lang / ko (Korean language data and models)

Comments

3Diller commented Mar 28, 2024

When tokenizing Korean text, the tokenizer loses information about complex whitespace by collapsing it into a single space. Consequently, text is not equal to nlp(text).text. This discrepancy causes a problem: entities described on the original text with (start, end) character offsets no longer align with the tokenized document, and likewise the offsets of predicted entities do not match the original text.

For example, the issue invalidates code such as:

entities = [(e.start, e.end, e.type) for e in entities]
tags = spacy.training.offsets_to_biluo_tags(doc, entities)
doc.ents = spacy.training.biluo_tags_to_spans(doc, tags)

How to reproduce the behaviour

>>> import spacy; nlp = spacy.blank('ko'); nlp('Hello\n\nworld!')
Hello world!  # <- the complex '\n\n' whitespace got replaced with a single ' '.
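
For illustration, the dropped newlines shift every character offset that follows them, which is exactly what breaks the (start, end) entity indices (a minimal sketch using the same blank pipeline; the indices are just those of this example text):

>>> text = 'Hello\n\nworld!'
>>> doc = nlp(text)
>>> text == doc.text
False
>>> text.index('world'), doc.text.index('world')
(7, 6)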

Suggested solution

I presume this might not be the most correct/idiomatic/optimized solution, but at least it works for my case (named entity recognition).

This is how I solved the issue for my project:
from typing import Generator

from spacy.lang.ko import POS, TAG_MAP, KoreanTokenizer, X, check_spaces
from spacy.tokens import Doc


class _CustomKoreanTokenizer(KoreanTokenizer):
    """Custom Korean tokenizer that preserves whitespaces.

    Required to make `text` equal to `doc.text` so that recognized entity
    offsets are correctly mapped to the original text.

    See parent class for more details.
    """

    @staticmethod
    def _add_whitespace_tokens(
        text: str, tokens: list[str]
    ) -> Generator[tuple[str, bool], None, None]:
        """Insert whitespace tokens into the list of `mecab-ko` tokens."""
        prev_end = 0
        for token in tokens:
            start = text.find(token, prev_end)
            if start == -1:
                raise ValueError(f'Token "{token}" not found in "{text}"')
            if prev_end < start:
                ws_token = text[prev_end:start]
                # Create whitespace tokens only if there is something more than
                # just a single whitespace or if it's the first token.
                if ws_token != ' ' or prev_end == 0:
                    if ws_token.startswith(' ') and prev_end > 0:
                        # Leading space goes to the `prev_token.whitespace_`
                        # if it's not the first token.
                        ws_token = ws_token[1:]
                    yield ws_token, True
            yield token, False
            prev_end = start + len(token)
        if prev_end < len(text):
            # Yield what is left as a whitespace token. We yield even just a
            # single whitespace as a token since otherwise `check_spaces` won't
            # catch it at the end of the text.
            yield text[prev_end:], True

    def __call__(self, text: str) -> Doc:
        """Tokenize `text` and create a spaCy Doc object.

        This is a slightly modified version of `spacy.lang.ko.KoreanTokenizer.__call__`.
        It preserves whitespaces, which allows entity offsets to be maintained.
        """
        dtokens = list(self.detailed_tokens(text))
        tokens = []
        is_spaces = []
        for token, is_space in self._add_whitespace_tokens(
            text, [dt['surface'] for dt in dtokens]
        ):
            tokens.append(token)
            is_spaces.append(is_space)
        doc = Doc(
            self.vocab, words=tokens, spaces=list(check_spaces(text, tokens))
        )
        for token, dtoken in zip(
            (
                token
                for token, is_space in zip(doc, is_spaces, strict=True)
                if not is_space
            ),
            dtokens,
            strict=True,
        ):
            first_tag, sep, eomi_tags = dtoken['tag'].partition('+')
            token.tag_ = first_tag  # stem(어간) or pre-final(선어말 어미)
            if token.tag_ in TAG_MAP:
                token.pos = TAG_MAP[token.tag_][POS]
            else:
                token.pos = X
            token.lemma_ = dtoken['lemma']
        doc.user_data['full_tags'] = [dt['tag'] for dt in dtokens]
        return doc
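
For completeness, swapping the custom tokenizer into a pipeline looks roughly like this (a minimal sketch; the constructor takes only a vocab, mirroring how the tests below instantiate it):

import spacy

nlp = spacy.blank('ko')
nlp.tokenizer = _CustomKoreanTokenizer(nlp.vocab)  # replace the default Korean tokenizer

text = 'Hello\n\nworld!'
doc = nlp(text)
assert doc.text == text  # whitespace is preserved, so entity offsets stay aligned
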
Tests for the solution:
import pytest
import spacy
from unittest.mock import Mock, patch


class _FakeMecabNode:
    def __init__(self, surface, feature, is_eos=False):
        self.surface = surface
        self.feature = feature
        self.is_eos = lambda: is_eos


@pytest.mark.parametrize(
    ('text', 'tokens', 'whitespaces', 'is_spaces'),
    [
        ('', [], [], []),
        (' ', [' '], [''], [True]),
        ('A', ['A'], [''], [False]),
        ('A ', ['A', ' '], ['', ''], [False, True]),
        (' A', [' ', 'A'], ['', ''], [True, False]),
        (
            ' A B ',
            [' ', 'A', 'B', ' '],
            ['', ' ', '', ''],
            [True, False, False, True],
        ),
        (
            ' \n Hello  \n\n  \n great\n\nworld !\n\n',
            [
                ' \n ',
                'Hello',
                ' \n\n  \n ',
                'great',
                '\n\n',
                'world',
                '!',
                '\n\n',
            ],
            ['', ' ', '', '', '', ' ', '', ''],
            [True, False, True, False, True, False, False, True],
        ),
    ],
)
@patch(
    'spacy.lang.ko.try_mecab_import',
    Mock(
        return_value=lambda _: Mock(
            parse=lambda text, as_nodes: [
                _FakeMecabNode(x, 'A,B,C/D/E') for x in text.split()
            ]
            + [_FakeMecabNode('', '', True)]
        )
    ),
)
def test_custom_korean_tokenizer(text, tokens, whitespaces, is_spaces):
    doc = _CustomKoreanTokenizer(
        spacy.vocab.create_vocab('ko', spacy.language.BaseDefaults)
    )(text)
    assert text == doc.text
    assert tokens == [x.text for x in doc]
    assert whitespaces == [x.whitespace_ for x in doc]
    assert is_spaces == [x.is_space for x in doc]

Your Environment

  • spaCy version: 3.7.2
  • Platform: Linux-6.1.75+-x86_64-with-glibc2.31
  • Python version: 3.9.2
svlandeg added the lang / ko (Korean language data and models) and feat / tokenizer (Feature: Tokenizer) labels on Apr 4, 2024