
Simple English Wiktionary #144

Draft · wants to merge 11 commits into master
Conversation

garfieldnate (Contributor)

This is a POC and cannot be merged as-is: it depends on a PR branch of wikitextprocessor, uses a hardcoded boolean to indicate that we are processing Simple EN text, has no tests, prints lots of warnings while running, and probably has other issues. However, Simple English resources are really important for ESL users, so I'm publishing this just so others know this is possible.

@garfieldnate (Contributor, Author)

Here's the log from running on simplewiktionary-20220720-pages-articles.xml.bz2:

log.txt

The warning "linkage recurse URL" is especially common, although there are also a few unimplemented templates.

* pages do not contain separate headers for different languages
* every page contains definitions for English words
This fixes ~1,000 warnings caused by the use of the sense tags
"transitive & intransitive" and "countable & uncountable".
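The fix above can be sketched roughly as follows; the helper name and exact normalization are assumptions for illustration, not the PR's actual code:

```python
# Illustrative sketch: split an "&"-joined sense tag such as
# "transitive & intransitive" into its component tags so each
# part can be matched against the known tag vocabulary.
def split_combined_tags(tag: str) -> list[str]:
    """Split a sense tag on "&" and strip whitespace from each part."""
    return [part.strip() for part in tag.split("&")]
```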
Add a new field "lists" giving the names of vocab lists that a word
belongs to; currently we have the British National Corpus top 1,000,
Charles Kay Ogden's Basic English 850 word list, and an academic word
list.
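A hypothetical example of what an entry carrying the new "lists" field might look like; the word and the exact list-name strings emitted by the extractor are assumptions, not the PR's actual output:

```python
import json

# Hypothetical Wiktextract entry with the new "lists" field naming
# the vocab lists this word belongs to (illustrative values only).
entry = {
    "word": "water",
    "pos": "noun",
    "lists": ["BNC top 1000", "Basic English 850", "Academic Word List"],
}
print(json.dumps(entry))
```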
Some templates are redirects (wik -> wikipedia, etc.); a proper solution
would detect these redirects, but for now we just handle them manually
to reduce the warning count.
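The manual handling described above amounts to a small lookup table; the mapping and helper name here are assumptions for illustration, not the PR's actual code:

```python
# Known template redirects on the Simple EN wiki, mapped to their
# canonical names. A proper fix would detect redirects automatically.
TEMPLATE_REDIRECTS = {
    "wik": "wikipedia",  # {{wik}} is a redirect to {{wikipedia}}
}


def resolve_template_name(name: str) -> str:
    """Follow a known template redirect to its canonical name."""
    return TEMPLATE_REDIRECTS.get(name, name)
```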
The previously added logic for parsing Simple EN pages was incomplete;
because Simple EN pages only contain data for one language, we cannot
read the first header and then treat only the rest as the entry.
We must use the whole page as the entry instead. Redoing this logic
keeps us from skipping many sections and also fixes an issue where
pages with only one section were rendered completely empty. The
following was one example:

    TITLE: almonds
    ==Noun==
    {{noun|almond}}
    #{{plural of|almond}}

Making this change also loads many more templates, which uncovered
further issues that needed to be fixed in wikitextprocessor. We
therefore have to update the branch being installed via pip.

The output file grew from 20K entries (10 MB) to 37K entries (23 MB),
matching the count advertised by Wiktionary's stats page.
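The whole-page-as-entry approach described above can be sketched as follows; the function name and return shape are illustrative guesses, not the PR's actual code:

```python
# Rough sketch of edition-aware sectioning: Simple English pages
# have no language header, so the entire page is one English entry.
def split_entry_sections(page_text: str, edition: str):
    """Return (language, entry_text) for a wiki page."""
    if edition == "simple":
        # Every Simple English Wiktionary page describes English
        # words, so the whole page is the entry.
        return "English", page_text
    # English edition: the first level-2 header names the language.
    for line in page_text.splitlines():
        if line.startswith("==") and not line.startswith("==="):
            return line.strip("= "), page_text
    return None, page_text
```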
The templates used on the Simple EN wiki start with "The ".
Create a new parameter "edition" for specifying the language code of the
edition of Wiktionary being input. For now, only allow "en" and "simple"
editions. Move all Simple Wiktionary-specific logic behind a check for
an edition value of "simple".
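One way the "edition" parameter could gate edition-specific logic is sketched below; the names (`VALID_EDITIONS`, `make_config`, the config keys) are assumptions, not the PR's actual API:

```python
# Minimal sketch of an edition-aware configuration gate.
VALID_EDITIONS = {"en", "simple"}


def make_config(edition: str = "en") -> dict:
    """Build a small configuration dict for a Wiktionary edition."""
    if edition not in VALID_EDITIONS:
        raise ValueError(f"unsupported Wiktionary edition: {edition!r}")
    return {
        "edition": edition,
        # Simple English pages carry no per-language headers.
        "has_language_headers": edition != "simple",
    }
```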
Copy the tests from `test_page.py` but with the language header removed.
Additionally, copy the real page data from the entry for "freeze" just
to verify current output. There are several outstanding TODO's.
@garfieldnate (Contributor, Author)

I've added simple functionality for specifying the Wiktionary edition, currently only allowing "en" and "simple". This is extensible for use with further editions, so please let me know if this approach is acceptable. I would probably prefer an enum or at least static constants for the edition names, but I'm not sure where to put them yet.

I've also added a simple test! It's unclear to me what level of pre-processing/template expansion is expected for the input to parse_page, so it's probably not quite right, but it's a start.

@tatuylonen (Owner)

I merged major changes from @xxyzz yesterday that relate to support for other Wiktionary editions. He has implemented support for configuring various tables, including namespaces and a lot else, in configuration files (on both the Wiktextract and Wikitextprocessor sides). Would there be any chance of rewriting these changes using the same approach?

@tatuylonen (Owner)

There is now at least some support for parsing the Chinese Wiktionary (I've not had a chance to fully test it yet myself though). I would think supporting the Simple English Wiktionary should be much easier. I also have some personal interest in the Simple English Wiktionary for my other research.

@garfieldnate (Contributor, Author)

That's great news! Let me investigate the changes from xxyzz and rebase. Indeed, the Simple English Wiktionary is quite similar to the English one; the major difference is that each page contains only one language, so the header structure is different (e.g. no language header).

BTW I also investigated the parsed files provided by xxyzz. I did see some unexpected structures and issues with translations, forms, examples, etc. but the glosses, the most important data for us, seemed to be good enough to use. I'm not sure if it'll be useful for you, but here's the script I used for checking it. It contains a basic model of the data output by Wiktextract and uses pydantic to validate it. (I have the script because I've taken to formatting other dictionaries the same way.)

# Models and validation code for Wiktextract output structure
# For documentation on what the fields mean, see:
# https://github.com/tatuylonen/wiktextract
# CLI USAGE: python3 models.py <file.jsonl> [<file2.jsonl> ...]
import json
import sys
from typing import List, Literal, Mapping, Optional


from pydantic import BaseModel, Extra, Field, root_validator, ValidationError


class WordLink(BaseModel, extra=Extra.forbid):
    alt: Optional[str]
    english: Optional[str]
    roman: Optional[str]
    sense: Optional[str]
    tags: Optional[List[str]]
    taxonomic: Optional[str]
    topics: Optional[List[str]]
    word: str
    # not specified in docs, but is still output by Wiktextract
    extra: Optional[str]


class Example(BaseModel, extra=Extra.forbid):
    text: str
    ref: Optional[str]
    english: Optional[str]
    type: Optional[str]
    roman: Optional[str]
    note: Optional[str]


class Translation(BaseModel, extra=Extra.forbid):
    alt: Optional[str]
    code: str
    english: str
    lang: str
    note: Optional[str]
    roman: Optional[str]
    sense: Optional[str]
    tags: Optional[List[str]]
    taxonomic: Optional[str]
    word: Optional[str]

    @root_validator
    def check_word(cls, values):
        # In pydantic v1, unset Optional fields appear in `values` with
        # value None, so a membership check like `"word" not in values`
        # never fires; test the value instead.
        if values.get("word") is None:
            assert values.get("note"), "word can only be missing if a note is present"
        return values


class WithWordLinks(BaseModel, extra=Extra.forbid):
    alt_of: Optional[List[WordLink]]
    form_of: Optional[List[WordLink]]
    synonyms: Optional[List[WordLink]]
    antonyms: Optional[List[WordLink]]
    hypernyms: Optional[List[WordLink]]
    holonyms: Optional[List[WordLink]]
    meronyms: Optional[List[WordLink]]
    hyponyms: Optional[List[WordLink]]
    coordinate_terms: Optional[List[WordLink]]
    derived: Optional[List[WordLink]]
    related: Optional[List[WordLink]]
    # not used by Wiktextract, but appear on other edition Wiktionaries
    cooccurs_with: Optional[List[WordLink]]
    similar: Optional[List[WordLink]]


class Sense(WithWordLinks):
    glosses: Optional[List[str]]
    raw_glosses: Optional[List[str]]
    tags: Optional[List[str]]
    categories: Optional[List[str]]
    topics: Optional[List[str]]
    translations: Optional[List[Translation]]

    senseid: Optional[str]
    wikidata: Optional[List[str]]
    wikipedia: Optional[List[str]]
    examples: Optional[List[Example]]
    english: Optional[str]

    @root_validator
    def validate_glosses(cls, values):
        # In pydantic v1, unset Optional fields appear in `values` as
        # None, so test the values rather than key membership.
        if values.get("glosses") or values.get("raw_glosses"):
            pass
            # output from Wiktextract doesn't conform to this
            # assert len(values["raw_glosses"]) == len(values["glosses"]), "raw_glosses and glosses must be the same length"
        else:
            assert "no-gloss" in (
                values.get("tags") or []
            ), "no-gloss tag must be present if no glosses are present"
        return values


class Pronunciation(BaseModel, extra=Extra.forbid):
    ipa: Optional[str]
    enpr: Optional[str]
    audio: Optional[str]
    ogg_url: Optional[str]
    mp3_url: Optional[str]
    audio_ipa: Optional[str] = Field(alias="audio-ipa")
    homophone: Optional[str]
    hyphenation: Optional[List[str]]
    tags: Optional[List[str]]
    text: Optional[str]
    # these are not specified in docs, but are still output by Wiktextract
    other: Optional[str]
    note: Optional[str]
    topics: Optional[List[str]]

    @root_validator
    def any_field(cls, values):
        if not any(values.values()):
            raise ValueError("At least one field must be set")
        return values


class Form(BaseModel, extra=Extra.forbid):
    form: str
    tags: Optional[List[str]]


class Template(BaseModel, extra=Extra.forbid):
    name: str
    args: Mapping[str, str]
    expansion: str


POS = Literal[
    "abbrev",
    "adj_noun",
    "adj_verb",
    "adj",
    "adv_phrase",
    "adv",
    "affix",
    "ambiposition",
    "article",
    "character",
    "circumfix",
    "circumpos",
    "classifier",
    "clause",
    "combining_form",
    "conj",
    "converb",
    "counter",
    "det",
    "infix",
    "interfix",
    "intj",
    "name",
    "noun",
    "num",
    "particle",
    "phrase",
    "postp",
    "prefix",
    "prep",
    "preverb",
    "pron",
    "proverb",
    "punct",
    "romanization",
    "root",
    "suffix",
    "syllable",
    "symbol",
    "verb",
    # not used by Wiktextract, but appear on other edition Wiktionaries
    "prep_phrase",
    "noun_phrase",
    "adj_phrase",
    "verb_phrase"
]


class Entry(WithWordLinks):
    word: str
    pos: POS
    # code for the language the word belongs to
    lang_code: str
    # Name of the language corresponding to lang_code
    # (as it appears on Wiktionary, e.g. may or may not be an English word)
    lang: str
    senses: List[Sense]
    forms: Optional[List[Form]]
    sounds: Optional[List[Pronunciation]]
    categories: Optional[List[str]]
    topics: Optional[List[str]]
    translations: Optional[List[Translation]]
    etymology_text: Optional[str]
    etymology_templates: Optional[List[Template]]

    wikidata: Optional[List[str]]
    wiktionary: Optional[str]
    head_templates: Optional[List[Template]]
    inflection_templates: Optional[List[Template]]
    # not specified in docs, but is still output by Wiktextract
    wikipedia: Optional[List[str]]

def validate(data):
    if isinstance(data, str):
        data = json.loads(data)
    # Wiktextract output contains JSON lines for templates and other non-entries
    if "word" not in data and "title" in data:
        title = data["title"]
        print(f"Skipping non-word page: {title}", file=sys.stderr)
        return
    return Entry.parse_obj(data)


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print(
            "USAGE: python3 models.py <file.jsonl> [file2.jsonl ...]", file=sys.stderr
        )
        sys.exit(1)
    for filename in sys.argv[1:]:
        entry_count = 0
        with open(filename) as f:
            line_num = 0
            for line in f:
                line_num += 1
                try:
                    entry = validate(line)
                    if entry:
                        entry_count += 1
                except ValidationError as e:
                    print(f"Error on line {line_num}: {line}\n{e}", file=sys.stderr)
                    sys.exit(1)
                if line_num % 1000 == 0:
                    print(".", end="", flush=True)
        print(f"\n✨ Validation of {filename} completed successfully! ✨")
        print(f"📊 Total entries: {entry_count} 📊")
