
Spanish data from Spanish Wiktionary #360

Open
soliviantar opened this issue Oct 12, 2023 · 14 comments

Comments

@soliviantar

Hi, there. This project seems fantastic, congratulations on the great work!

I have been reading around and I've seen that right now only the en, fr and zh Wiktionaries are supported. I read somewhere that you were working on a setup where only a config file would be needed to parse other editions. I am guessing these are the files in wiktextract/extractor.

I would love to help with the Spanish Wiktionary extractor. Unfortunately, I am not a programmer, so even if I try to read some code, I get lost very, very fast.

I saw @xxyzz was working on an "es" branch at some point, but I couldn't find any more info about it. Is there something I could do to help advance this?

My main interest is extracting Spanish-language data (POS, etymologies, pronunciations, definitions, linkages, inflection, etc.) from es.wiktionary, since for other languages I think en.wiktionary is already enough in most cases. For this case, do you think I could use wiktextract as-is, or could I modify it myself? Or do you recommend using something else (like Dbnary, which seems harder to understand)?

Thanks in any case and kudos again!

@xxyzz
Collaborator

xxyzz commented Oct 12, 2023

I'd recommend finding the common page structure of the Spanish Wiktionary; maybe they have a help page (e.g. on how the sections are arranged). Then you might get a general idea of how a computer could parse this structure. The new Spanish extractor will also need some language and subtitle JSON data; you can find examples in the "data" directory, and the language JSON files are created by the code in the "languages" folder.
Finally, you could try to read the French extractor's main function: parse_page().

@soliviantar
Author

Thanks for the quick reply!

Yes, I have found a page that explains the structure of entries in es.wiktionary, and it seems all entries are supposed to be built from a POS-specific template. So they have a template for:

Seems pretty standard, but I am not sure how difficult it would be to implement this. I'll try to take a look.

I'm guessing these are what you were referring to as the language and subtitle data, right? Since my main interest is the Spanish dictionary data about Spanish itself, is it necessary to do the languages one? Or is it used for etymologies, etc. as well?

I tried to read parse_page() from the fr extractor, but I don't understand much. I guess I could use ChatGPT to explain it to me.

In any case, it looks a bit daunting for me at this point, to be honest. I might need to learn some Python before I dare to give it a try.

I am leaving all these links here mostly for easy retrieval, in case I can come back to this at some point, or in case someone else comes around who has the skills and the will to tackle it in the meantime. If that is OK with you guys. :) Thanks again!

@kristian-clausal
Collaborator

kristian-clausal commented Oct 12, 2023

Leaving links to useful data is perfectly fine. It would be possible to modify wiktextract to only parse Spanish entries, but it would be about as much effort as implementing the extractor's config data fully. Good luck with your efforts; Python is a great starting point if you're new to programming, and easy to pick up if you're not.

@999pingGG
Contributor

Hi, I'm also interested in contributing to the Spanish data extraction, and also Russian btw, for a pet project of mine (a multilingual word coach). I'm pretty proficient at programming and I've scraped web data before, using Scrapy. I'm going to look at the code and elsewhere in this repo or on the internet and see how this can be done, as long as my motivation is up and I have the time.

I'd like to join forces with anyone already working on this. There's an unmerged pull request for Spanish under discussion. Can you accept this PR under a new "Spanish development/WIP" branch or something? I want to work based on that PR but I'm new to working with other people in git/GitHub using pull requests and I'm not sure about how I should go about this.

@empiriker
Contributor

empiriker commented Nov 30, 2023

@999pingGG, that's great to hear! More help is always welcome. 🤗

I am the author of PR #392, and the reason it is taking some time to get merged is that I am also proposing the introduction of pydantic classes to handle data validation and schema creation, which means there are some things that need to be properly discussed.

In fact, I have already done more work on the Spanish extractor than the current PR covers. You can see what else I did here: https://github.com/empiriker/wiktextract/tree/es-ahead The plan is to submit these additions once the current PR is accepted.

You can contribute to an extractor for a particular Wiktionary edition in two ways: a) Going deep, i.e. improving the extraction quality and covering more special cases, or b) going wide, i.e. finding sections in Wiktionary that our current scripts don't extract yet.

In the case of my Spanish extractor, the etymology and morphology sections aren't parsed yet, and I don't plan to implement them either. The benefit of tackling a separate section is that your code would be more or less independent.

Otherwise, a good place to start is the # XXX: comments in the code or the debug statements. (I usually downsample the SQLite database to a couple thousand pages for quick iteration and to see what the most common types of debug messages (= not ideally processed nodes) are.)


I don't know the best way to contribute to the WIP Spanish extractor either. You could work based on my es-ahead branch, but I am pretty sure my code will have changed at least a bit by the time it gets merged.


FYI, I am planning to add the same scope of support to the Russian Wiktionary as I did for the German and Spanish ones, i.e. extracting glosses, examples, translations, pronunciation and linkages. But I haven't gotten very far yet.

I look forward to your first PR! 🔥

@999pingGG
Contributor

@empiriker I see that your PR has now been merged, congrats!

I'd like to get my feet wet in the codebase, based on your Spanish branch, by fixing (the underlying cause of) some of those debug messages and by implementing extraction for new sections, etymology and morphology, since you said that code would be more independent. But I need some help to get started. How can I set up a development environment? Yesterday I cloned the master branch here and tried to modify the code and experiment, but I'm a bit lost, since this project has a setup I've never seen in my (small) Python experience. When you install following the instructions, an executable is generated for the wiktwords command, and whatever changes I make to the code aren't applied. How can I set up the classic modify-code, run, debug, repeat cycle using a small subset of the Spanish wiki? Thanks for your help and friendliness! Let's do this!

@empiriker
Contributor

You will want to install wiktextract with this option: python -m pip install --use-pep517 -e ".[dev]", which is documented here. The -e option installs wiktextract as an editable package, i.e. your code gets executed instead of a copy of it. The ".[dev]" part installs the dev dependencies you'll need for testing.

You can run all tests with make test and one test file with nose2 tests.test_es_etymology.

The first time, you will want to run wiktwords with the --db option while also pointing it at a dump file. The second time, you can keep just the --db option.

What I do then is make a copy of the db and delete all but N pages with namespace_id = 0 using some SQL. (You will want to keep all the other namespaces!)
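
For reference, a rough sketch of that downsampling step in Python. The pages table and namespace_id column names are assumptions about the SQLite cache's schema, so check it (e.g. with .schema in the sqlite3 shell) before running anything:

import shutil
import sqlite3

# Work on a copy so the full cache db stays intact.
shutil.copy("eswiktionary.db", "eswiktionary-sample.db")

N = 3000  # number of main-namespace (id=0) pages to keep
conn = sqlite3.connect("eswiktionary-sample.db")
# Delete all but N pages in namespace 0; every other namespace is kept.
# Table and column names are assumptions -- adjust them to the real schema.
conn.execute(
    """
    DELETE FROM pages
    WHERE namespace_id = 0
      AND rowid NOT IN (
          SELECT rowid FROM pages WHERE namespace_id = 0 LIMIT ?
      )
    """,
    (N,),
)
conn.commit()
conn.execute("VACUUM")  # shrink the file after the deletes
conn.close()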

The Spanish etymology section should be reached here:

def parse_section(
    wxr: WiktextractContext,
    page_data: List[WordEntry],
    base_data: WordEntry,
    level_node: WikiNode,
) -> None:
    # Page Structure: https://es.wiktionary.org/wiki/Wikcionario:Estructura
    subtitle = clean_node(wxr, base_data, level_node.largs)
    wxr.wtp.start_subsection(subtitle)

    pos_template_name = None
    for level_node_template in level_node.find_content(NodeKind.TEMPLATE):
        pos_template_name = level_node_template.template_name

    if subtitle in wxr.config.OTHER_SUBTITLES["ignored_sections"]:
        pass
    elif pos_template_name and pos_template_name in wxr.config.POS_SUBTITLES:
        process_pos_block(
            wxr, page_data, base_data, level_node, pos_template_name, subtitle
        )
    else:
        wxr.wtp.debug(
            f"Unprocessed section: {subtitle}",
            sortid="extractor/es/page/parse_section/48",
        )

You can add another elif and start processing the section.
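
Here is a minimal sketch of what such a branch could look like. The "etymology" key in OTHER_SUBTITLES, the process_etymology_block() helper and the etymology_text field are hypothetical names for illustration, not existing wiktextract code, and the actual subtitle strings would have to match the headings es.wiktionary uses (presumably "Etimología"):

elif subtitle in wxr.config.OTHER_SUBTITLES.get("etymology", []):
    process_etymology_block(wxr, page_data, level_node)

def process_etymology_block(
    wxr: WiktextractContext,
    page_data: List[WordEntry],
    level_node: WikiNode,
) -> None:
    # Hypothetical sketch: flatten the section's child nodes to plain text
    # and store it on the latest word entry.
    etymology_text = clean_node(wxr, None, level_node.children)
    if etymology_text:
        page_data[-1].etymology_text = etymology_text

The elif would go right before the else branch in parse_section above.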

Good luck!

@kristian-clausal
Collaborator

make test has some really slow coverage tests that were added later; there's make quick (or make quicktest, I don't remember which is currently committed, though the latter is better and should be the one) to skip coverage.

@999pingGG
Contributor

Thanks for your help. I have read most but not all of the README, and I didn't imagine that iterating on development would be like running tests. The argument is --db-path instead of --db, though. I was able to set up my development environment with a sample of 3000 words using some SQL, just like you said.

I apologize if this is getting too off-topic, but I wanted to say that having all the data in a SQLite database is very neat and handy: I can quickly see, for any word, what is being parsed. Now let's write some code!

@kristian-clausal
Collaborator

I apologize if this is getting too off-topic, but I wanted to say that having all the data in a SQLite database is very neat and handy: I can quickly see, for any word, what is being parsed. Now let's write some code!

This is thanks to @xxyzz, who implemented everything about it. Before that, we just had a big cache file (and smaller page files used for debugging and reading the source wikitext files).

@empiriker
Contributor

@999pingGG

Iterating on development for me covers a couple of steps:

  • Reading the Wiktionary help pages to understand how a page is supposed to be structured (which doesn't necessarily mean that every page follows the guide to a T)
  • Reading docs on Wiktionary about how a certain template works
  • Comparing compiled Wiktionary pages, wikicode and the output of our extractor to figure out what's going wrong/could be improved
  • Looking at the debug statements
  • Sometimes just adding a temporary log to see what kind of nodes are reached in a particular place in our code
  • Identifying default cases/edge cases that I want to work on and writing a test for them
  • Writing code to satisfy that test

It goes without saying that I don't necessarily do all of these steps every time, or in that particular order.


Sorry about --db-path. Since argparse supports prefix matching, I have been using just --db for a while now.


When sampling down, keep in mind that in some editions the page namespace (id=0) has a large percentage of redirect pages (ca. 40% for the Russian edition). So with 3000 words you would effectively be developing on only about 1800 pages. Depending on your personal trade-off between speed and variety during development, that might be enough for you.
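
If you want to check that ratio for your own sample, a quick query along these lines should do it (again, the pages table and the namespace_id / redirect_to column names are assumptions; adjust them to the actual schema):

import sqlite3

conn = sqlite3.connect("eswiktionary-sample.db")
# Count how many namespace-0 pages are redirects vs. real content pages.
total, redirects = conn.execute(
    """
    SELECT COUNT(*),
           SUM(CASE WHEN redirect_to IS NOT NULL THEN 1 ELSE 0 END)
    FROM pages
    WHERE namespace_id = 0
    """
).fetchone()
print(f"{redirects} of {total} namespace-0 pages are redirects")
conn.close()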

@SrGnis

SrGnis commented Dec 13, 2023

Hello everybody, sorry for the self-advertising, but maybe some of you will find this useful.

First, a bit of context. Almost a year ago I came across this project in my search for a way to get data from the Spanish Wiktionary in a usable format. But after being too lazy to understand the code and learn wikitext, I decided it would be faster to implement my own solution that extracts data from the rendered HTML. Spoiler: it wasn't.

But in the end I created Wikjote, a Python package that parses the Spanish Wiktionary HTML into JSON.

It doesn't work directly with the rendered HTML; instead it uses the ZIM files created by Kiwix.

It is still in early development and only parses the HTML into a simplified JSON structure, but it is enough to be functional.

Don't get me wrong, I think wiktextract is superior because it works directly with the dumps, and my package will probably become obsolete once wiktextract parses the Spanish Wiktionary correctly, but in the meantime some of you may find it useful.

You can find the entire Wiktionary as JSON here.

@xxyzz
Collaborator

xxyzz commented Dec 14, 2023

The HTML in the ZIM files is simplified from Wiktionary? Some data might be lost in the ZIM files. You could get better HTML files, created by Parsoid, that include template arguments: https://dumps.wikimedia.org/other/enterprise_html/runs

Vuizur has a repo that parses this format for the Russian Wiktionary: https://github.com/Vuizur/ruwiktionary-htmldump-parser

But these new HTML dump files are unstable; they often lack many pages or contain many duplicated pages (the file size varies by several GB from run to run). If they fix this issue, parsing HTML should be faster because the templates are already expanded.

And we currently have a Spanish extractor thanks to empiriker's contributions; you can find the code here: https://github.com/tatuylonen/wiktextract/tree/master/src/wiktextract/extractor/es

@SrGnis

SrGnis commented Dec 15, 2023

Thank you @xxyzz, I was not aware of those dumps; it is worth taking a look at them. Having more data sources is a good idea.

And that is very good news, I need to try it out.
