Tokenization of large(r) digital numbers #10

Open
Freschler opened this issue Feb 2, 2024 · 1 comment

Freschler commented Feb 2, 2024

Preliminary Remark

The observations presented here are also relevant for the polmineR repository.

Some Background

The Bundestag Protokolle often use spaces to group the digits of large numerical values for readability (e.g. 100 000 rather than 100000). While this convention is internationally standardized, it can lead to problems in corpus analysis, notably during tokenization.

For illustration, consider a speech given by then-Chancellor Angela Merkel during the final session of the 17th legislative period (reference: BT_17_253). In this speech, five instances of large numbers grouped with spaces can be identified:

  1. Bereits über 100 000 Menschen haben ihr Leben verloren; ("Already more than 100 000 people have lost their lives;")
  2. Wir haben als erster EU-Mitgliedstaat 5 000 syrischen Flüchtlingen Aufnahme angeboten. ("We were the first EU member state to offer to take in 5 000 Syrian refugees.")
  3. 700 000 mehr Menschen im Alter von 60 bis 65 sind noch in Arbeit. ("700 000 more people aged 60 to 65 are still in work.")
  4. 650 000 Menschen erhalten mehr Leistungen. ("650 000 people receive higher benefits.")
  5. Wir haben seit 2007 in Deutschland 820 000 neue Betreuungsplätze für Kinder unter drei Jahren geschaffen. ("Since 2007, we have created 820 000 new childcare places for children under three in Germany.")

The Issue

Corpus tools like polmineR (and, similarly, #LancsBox X) fail to recognize these space-grouped numerical values as single tokens. Consider the following R code snippet:

library(polmineR)

# Angela Merkel's speech in the final session of the 17th Bundestag (BT_17_253)
merkel_speech <- corpus("GERMAPARL2") |>
  subset(protocol_date == "2013-09-03") |>
  subset(speaker_name == "Angela Merkel") |>
  subset(p_type == "speech")

# "000" should never occur as a standalone token
count(merkel_speech, query = "000")

In fact, polmineR counts each space-separated segment of these numbers as a token in its own right, yielding:

   query match count        freq
1:   000   000     5 0.001048218

The Implications

The implications of this issue are twofold:

  1. It inflates the total token count.
  2. It skews statistical measures, such as collocation (also known as co-occurrence) analysis.

To gauge the extent of the impact: the regular expression \b(\d{1,3})(\s)(\d{3})\b returns 134,609 hits across the corpus (though not at a 100% precision rate!)
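
For reference, such a count can be reproduced roughly as follows with stringr, assuming the raw protocol text is available in a character vector texts (a hypothetical placeholder; how to obtain it depends on the corpus setup):

library(stringr)

# texts: hypothetical character vector holding the decoded protocol text
sum(str_count(texts, "\\b(\\d{1,3})(\\s)(\\d{3})\\b"))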


Freschler commented Mar 28, 2024

Some further thoughts

In my initial post, the regular expression \b(\d{1,3})(\s)(\d{3})\b was designed to match numbers in the thousand range. However, this pattern falls short for even larger numbers: while it does partially match them (for 1 500 000, for instance, it captures only the first two digit groups), its coverage is obviously incomplete.

Updated Regular Expression(s)

To address the issue described above, I've developed three new regular expressions, one for each numerical range (i.e. billion-range, million-range, and thousand-range). Further, I made use of grouping to facilitate replacement, if that is desired. (I could not think of a better solution than a 'three-step clean-up', applied from the largest range downwards; see the sketch after the patterns below.)

I. Billion-Range
RegEx (Grouped): \b(\d{1,3})\s(\d{3})\s(\d{3})\s(\d{3})\b
Replacement: \1\2\3\4

II. Million-Range
RegEx (Grouped): \b(\d{1,3})\s(\d{3})\s(\d{3})\b
Replacement: \1\2\3

III. Thousand-Range
RegEx (Grouped): \b(\d{1,3})\s(\d{3})\b
Replacement: \1\2
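
A minimal sketch of this three-step clean-up in base R (the function name normalize_numbers is mine, not part of polmineR; gsub with perl = TRUE handles \b, \s and the backreferences):

# Apply the patterns from the largest range downwards, so that
# e.g. "1 500 000" is collapsed in one pass rather than mangled.
normalize_numbers <- function(x) {
  x <- gsub("\\b(\\d{1,3})\\s(\\d{3})\\s(\\d{3})\\s(\\d{3})\\b", "\\1\\2\\3\\4", x, perl = TRUE)
  x <- gsub("\\b(\\d{1,3})\\s(\\d{3})\\s(\\d{3})\\b", "\\1\\2\\3", x, perl = TRUE)
  x <- gsub("\\b(\\d{1,3})\\s(\\d{3})\\b", "\\1\\2", x, perl = TRUE)
  x
}

normalize_numbers("Wir haben 820 000 neue Betreuungsplätze geschaffen")
# [1] "Wir haben 820000 neue Betreuungsplätze geschaffen"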

A few words of caution

As hinted at in my initial post, there is a danger of false positives. Consider the following example from the corpus:

Bis zum Jahresende 2010 wurden statt 90 000 180 000 Studienplätze geschaffen (BT_17_126) ("By the end of 2010, instead of 90 000, 180 000 study places had been created")

While such cases are rare, they do exist!
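
Here the billion-range pattern from above would wrongly merge the two adjacent numbers into one:

gsub("\\b(\\d{1,3})\\s(\\d{3})\\s(\\d{3})\\s(\\d{3})\\b", "\\1\\2\\3\\4",
     "statt 90 000 180 000 Studienplätze", perl = TRUE)
# [1] "statt 90000180000 Studienplätze"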
