
Simple English Wiktionary #144

Draft · wants to merge 11 commits into master
Conversation

garfieldnate (Contributor)

This is a POC and cannot be merged as-is: it depends on a PR branch of wikitextprocessor, uses a hardcoded boolean to indicate that we are processing Simple EN text, has no tests, prints lots of warnings while running, and probably has other issues. However, Simple English resources are really important for ESL users, so I'm publishing this just so others know this is possible.

@garfieldnate (Contributor, Author)

Here's the log from running on simplewiktionary-20220720-pages-articles.xml.bz2:

log.txt

The warning "linkage recurse URL" is especially common, although there are also a few unimplemented templates.

* pages do not contain separate headers for different languages
* every page contains definitions for English words
This fixes ~1,000 warnings caused by the use of the sense tags
"transitive & intransitive" and "countable & uncountable".
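The fix above can be sketched roughly as follows; the helper name and exact normalization are assumptions for illustration, not the PR's actual code:

```python
# Illustrative sketch: split an "&"-joined sense tag such as
# "transitive & intransitive" into its component tags so each
# part can be matched against the known tag vocabulary.
def split_combined_tags(tag: str) -> list[str]:
    """Split a sense tag on "&" and strip whitespace from each part."""
    return [part.strip() for part in tag.split("&")]
```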
Add a new field "lists" giving the names of vocab lists that a word
belongs to; currently we have the British National Corpus top 1,000,
Charles Kay Ogden's Basic English 850 word list, and an academic word
list.
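A hypothetical example of what an entry carrying the new "lists" field might look like; the word and the exact list-name strings emitted by the extractor are assumptions, not the PR's actual output:

```python
import json

# Hypothetical Wiktextract entry with the new "lists" field naming
# the vocab lists this word belongs to (illustrative values only).
entry = {
    "word": "water",
    "pos": "noun",
    "lists": ["BNC top 1000", "Basic English 850", "Academic Word List"],
}
print(json.dumps(entry))
```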
Some templates are redirects (wik -> wikipedia, etc.); a proper solution
would detect these redirects, but for now we just handle them manually
to reduce the warning count.
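The manual handling described above amounts to a small lookup table; the mapping and helper name here are assumptions for illustration, not the PR's actual code:

```python
# Known template redirects on the Simple EN wiki, mapped to their
# canonical names. A proper fix would detect redirects automatically.
TEMPLATE_REDIRECTS = {
    "wik": "wikipedia",  # {{wik}} is a redirect to {{wikipedia}}
}


def resolve_template_name(name: str) -> str:
    """Follow a known template redirect to its canonical name."""
    return TEMPLATE_REDIRECTS.get(name, name)
```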
The previously added logic for parsing Simple EN pages was incomplete;
because Simple EN pages only contain data for one language, we cannot
read the first header and then treat only the rest as the entry.
We must use the whole page as the entry instead. Redoing this logic
keeps us from skipping many sections and also fixes an issue where
pages with only one section were rendered completely empty. The
following was one example:

    TITLE: almonds
    ==Noun==
    {{noun|almond}}
    #{{plural of|almond}}

Making this change also loads many more templates, which uncovered
further issues that needed to be fixed in wikitextprocessor. We
therefore have to update the branch being installed via pip.

The output file grew from 20K entries (10 MB) to 37K entries (23 MB),
matching the count advertised by Wiktionary's stats page.
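The whole-page-as-entry approach described above can be sketched as follows; the function name and return shape are illustrative guesses, not the PR's actual code:

```python
# Rough sketch of edition-aware sectioning: Simple English pages
# have no language header, so the entire page is one English entry.
def split_entry_sections(page_text: str, edition: str):
    """Return (language, entry_text) for a wiki page."""
    if edition == "simple":
        # Every Simple English Wiktionary page describes English
        # words, so the whole page is the entry.
        return "English", page_text
    # English edition: the first level-2 header names the language.
    for line in page_text.splitlines():
        if line.startswith("==") and not line.startswith("==="):
            return line.strip("= "), page_text
    return None, page_text
```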
The templates used on the Simple EN wiki start with "The ".
Create a new parameter "edition" for specifying the language code of the
edition of Wiktionary being input. For now, only allow "en" and "simple"
editions. Move all Simple Wiktionary-specific logic behind a check for
an edition value of "simple".
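One way the "edition" parameter could gate edition-specific logic is sketched below; the names (`VALID_EDITIONS`, `make_config`, the config keys) are assumptions, not the PR's actual API:

```python
# Minimal sketch of an edition-aware configuration gate.
VALID_EDITIONS = {"en", "simple"}


def make_config(edition: str = "en") -> dict:
    """Build a small configuration dict for a Wiktionary edition."""
    if edition not in VALID_EDITIONS:
        raise ValueError(f"unsupported Wiktionary edition: {edition!r}")
    return {
        "edition": edition,
        # Simple English pages carry no per-language headers.
        "has_language_headers": edition != "simple",
    }
```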
Copy the tests from `test_page.py` but with the language header removed.
Additionally, copy the real page data from the entry for "freeze" just
to verify current output. There are several outstanding TODO's.
@garfieldnate (Contributor, Author)

I've added simple functionality for specifying the Wiktionary edition, currently only allowing "en" and "simple". This is extensible for use with further editions, so please let me know if this approach is acceptable. I would probably prefer an enum or at least static constants for the edition names, but I'm not sure where to put them yet.

I've also added a simple test! It's unclear to me what level of pre-processing/template expansion is expected for the input to parse_page, so it's probably not quite right, but it's a start.

@tatuylonen (Owner)

I merged major changes from @xxyzz yesterday that relate to support for other Wiktionary editions. He has implemented support for configuring various tables, including namespaces and a lot else, in configuration files (on both the Wiktextract and Wikitextprocessor sides). Would there be any chance of rewriting these changes using the same approach?

@tatuylonen (Owner)

There is now at least some support for parsing the Chinese Wiktionary (I've not had a chance to fully test it yet myself though). I would think supporting the Simple English Wiktionary should be much easier. I also have some personal interest in the Simple English Wiktionary for my other research.

@garfieldnate (Contributor, Author)

That's great news! Let me investigate the changes from xxyzz and rebase. Indeed, the Simple English Wiktionary is quite similar to the English one; the major difference is that each page contains only one language, so the header structure is different (e.g. no language header).

BTW I also investigated the parsed files provided by xxyzz. I did see some unexpected structures and issues with translations, forms, examples, etc. but the glosses, the most important data for us, seemed to be good enough to use. I'm not sure if it'll be useful for you, but here's the script I used for checking it. It contains a basic model of the data output by Wiktextract and uses pydantic to validate it. (I have the script because I've taken to formatting other dictionaries the same way.)

# Models and validation code for Wiktextract output structure
# For documentation on what the fields mean, see:
# https://github.com/tatuylonen/wiktextract
# CLI USAGE: python3 models.py <file.jsonl> [<file2.jsonl> ...]
import json
import sys
from typing import List, Literal, Mapping, Optional


from pydantic import BaseModel, Extra, Field, root_validator, ValidationError


class WordLink(BaseModel, extra=Extra.forbid):
    alt: Optional[str]
    english: Optional[str]
    roman: Optional[str]
    sense: Optional[str]
    tags: Optional[List[str]]
    taxonomic: Optional[str]
    topics: Optional[List[str]]
    word: str
    # not specified in docs, but is still output by Wiktextract
    extra: Optional[str]


class Example(BaseModel, extra=Extra.forbid):
    text: str
    ref: Optional[str]
    english: Optional[str]
    type: Optional[str]
    roman: Optional[str]
    note: Optional[str]


class Translation(BaseModel, extra=Extra.forbid):
    alt: Optional[str]
    code: str
    english: str
    lang: str
    note: Optional[str]
    roman: Optional[str]
    sense: Optional[str]
    tags: Optional[List[str]]
    taxonomic: Optional[str]
    word: Optional[str]

    @root_validator
    def check_word(cls, values):
        # In pydantic v1, unset Optional fields appear in `values` with
        # value None, so a membership check like `"word" not in values`
        # never fires; test the value instead.
        if values.get("word") is None:
            assert values.get("note"), "word can only be missing if a note is present"
        return values


class WithWordLinks(BaseModel, extra=Extra.forbid):
    alt_of: Optional[List[WordLink]]
    form_of: Optional[List[WordLink]]
    synonyms: Optional[List[WordLink]]
    antonyms: Optional[List[WordLink]]
    hypernyms: Optional[List[WordLink]]
    holonyms: Optional[List[WordLink]]
    meronyms: Optional[List[WordLink]]
    hyponyms: Optional[List[WordLink]]
    coordinate_terms: Optional[List[WordLink]]
    derived: Optional[List[WordLink]]
    related: Optional[List[WordLink]]
    # not used by Wiktextract, but appear on other edition Wiktionaries
    cooccurs_with: Optional[List[WordLink]]
    similar: Optional[List[WordLink]]


class Sense(WithWordLinks):
    glosses: Optional[List[str]]
    raw_glosses: Optional[List[str]]
    tags: Optional[List[str]]
    categories: Optional[List[str]]
    topics: Optional[List[str]]
    translations: Optional[List[Translation]]

    senseid: Optional[str]
    wikidata: Optional[List[str]]
    wikipedia: Optional[List[str]]
    examples: Optional[List[Example]]
    english: Optional[str]

    @root_validator
    def validate_glosses(cls, values):
        # In pydantic v1, unset Optional fields appear in `values` as
        # None, so test the values rather than key membership.
        if values.get("glosses") or values.get("raw_glosses"):
            pass
            # output from Wiktextract doesn't conform to this
            # assert len(values["raw_glosses"]) == len(values["glosses"]), "raw_glosses and glosses must be the same length"
        else:
            assert "no-gloss" in (
                values.get("tags") or []
            ), "no-gloss tag must be present if no glosses are present"
        return values


class Pronunciation(BaseModel, extra=Extra.forbid):
    ipa: Optional[str]
    enpr: Optional[str]
    audio: Optional[str]
    ogg_url: Optional[str]
    mp3_url: Optional[str]
    audio_ipa: Optional[str] = Field(alias="audio-ipa")
    homophone: Optional[str]
    hyphenation: Optional[List[str]]
    tags: Optional[List[str]]
    text: Optional[str]
    # these are not specified in docs, but are still output by Wiktextract
    other: Optional[str]
    note: Optional[str]
    topics: Optional[List[str]]

    @root_validator
    def any_field(cls, values):
        if not any(values.values()):
            raise ValueError("At least one field must be set")
        return values


class Form(BaseModel, extra=Extra.forbid):
    form: str
    tags: Optional[List[str]]


class Template(BaseModel, extra=Extra.forbid):
    name: str
    args: Mapping[str, str]
    expansion: str


POS = Literal[
    "abbrev",
    "adj_noun",
    "adj_verb",
    "adj",
    "adv_phrase",
    "adv",
    "affix",
    "ambiposition",
    "article",
    "character",
    "circumfix",
    "circumpos",
    "classifier",
    "clause",
    "combining_form",
    "conj",
    "converb",
    "counter",
    "det",
    "infix",
    "interfix",
    "intj",
    "name",
    "noun",
    "num",
    "particle",
    "phrase",
    "postp",
    "prefix",
    "prep",
    "preverb",
    "pron",
    "proverb",
    "punct",
    "romanization",
    "root",
    "suffix",
    "syllable",
    "symbol",
    "verb",
    # not used by Wiktextract, but appear on other edition Wiktionaries
    "prep_phrase",
    "noun_phrase",
    "adj_phrase",
    "verb_phrase"
]


class Entry(WithWordLinks):
    word: str
    pos: POS
    # code for the language the word belongs to
    lang_code: str
    # Name of the language corresponding to lang_code
    # (as it appears on Wiktionary, e.g. may or may not be an English word)
    lang: str
    senses: List[Sense]
    forms: Optional[List[Form]]
    sounds: Optional[List[Pronunciation]]
    categories: Optional[List[str]]
    topics: Optional[List[str]]
    translations: Optional[List[Translation]]
    etymology_text: Optional[str]
    etymology_templates: Optional[List[Template]]

    wikidata: Optional[List[str]]
    wiktionary: Optional[str]
    head_templates: Optional[List[Template]]
    inflection_templates: Optional[List[Template]]
    # not specified in docs, but is still output by Wiktextract
    wikipedia: Optional[List[str]]

def validate(data):
    if isinstance(data, str):
        data = json.loads(data)
    # Wiktextract output contains JSON lines for templates and other non-entries
    if "word" not in data and "title" in data:
        title = data["title"]
        print(f"Skipping non-word page: {title}", file=sys.stderr)
        return
    return Entry.parse_obj(data)


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print(
            "USAGE: python3 models.py <file.jsonl> [file2.jsonl ...]", file=sys.stderr
        )
        sys.exit(1)
    for filename in sys.argv[1:]:
        entry_count = 0
        with open(filename) as f:
            line_num = 0
            for line in f:
                line_num += 1
                try:
                    entry = validate(line)
                    if entry:
                        entry_count += 1
                except ValidationError as e:
                    print(f"Error on line {line_num}: {line}\n{e}", file=sys.stderr)
                    sys.exit(1)
                if line_num % 1000 == 0:
                    print(".", end="", flush=True)
        print(f"\n✨ Validation of {filename} completed successfully! ✨")
        print(f"📊 Total entries: {entry_count} 📊")
