Proposition of changes to WiktionaryConfig class #278

Open
xxyzz opened this issue Jul 7, 2023 · 5 comments

xxyzz commented Jul 7, 2023

Currently, a new WiktionaryConfig object is created for every page in wiktionary.page_handler() in order to save and return WiktionaryConfig.language_counts, WiktionaryConfig.pos_counts, WiktionaryConfig.section_counts, WiktionaryConfig.errors, WiktionaryConfig.warnings, and WiktionaryConfig.debugs.

But WiktionaryConfig.__init__() loads many JSON files and converts the language data, so rerunning this code for every page is very inefficient, especially when a JSON file is as large as languages.json. The --statistics feature also doesn't currently work, because only WiktionaryConfig.section_counts is filled with data, and even if the other counters were populated, the feature would still be unusable because it prints very long output and uses lots of memory. WiktionaryConfig.merge_return() also saves error messages from the Wtp object into an array, which can likewise grow huge.
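A minimal sketch of the pattern described above, assuming a simplified class body (the real __init__() loads and converts several data files; the path and attributes here are illustrative):

```python
import json

class WiktionaryConfig:
    def __init__(self):
        # Re-reads the large language table on every instantiation;
        # with one instance per page, this repeated I/O and parsing
        # dominates the run time.
        with open("wiktextract/data/languages.json", encoding="utf-8") as f:
            self.LANGUAGES_BY_CODE = json.load(f)
        self.section_counts = {}
        self.errors, self.warnings, self.debugs = [], [], []

def page_handler(page):
    config = WiktionaryConfig()  # constructed once per page
    # ... parse the page, filling config's counters and message lists ...
    return page, config
```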

Here are my propositions:

  1. Remove the --statistics option and delete WiktionaryConfig.language_counts, WiktionaryConfig.pos_counts, and WiktionaryConfig.section_counts, so a new WiktionaryConfig object won't have to be created for each page. We could write separate code that processes the final JSON file and creates some statistics charts in a scheduled GitHub Actions job.
  2. Write error, warning, and debug message data to separate JSON Lines files (see the sketch after this list).
  3. Extract the language data from the dump file using the Lua code in the languages folder after all pages are added to the database, and save that data to the database as well, so we'll always use up-to-date language data. But since this data would not be available from the start, code that requires the language data before the dump file is parsed will need to be changed.
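For proposition 2, a minimal sketch of the JSON Lines approach (the file names and message shape are assumptions for illustration, not existing wiktextract API):

```python
import json

def append_messages(path, messages):
    # One JSON object per line (JSON Lines); the file can later be
    # filtered or counted without loading it all into memory.
    with open(path, "a", encoding="utf-8") as f:
        for msg in messages:
            f.write(json.dumps(msg, ensure_ascii=False) + "\n")

# e.g. after processing a page:
# append_messages("errors.jsonl", config.errors)
# append_messages("warnings.jsonl", config.warnings)
# append_messages("debugs.jsonl", config.debugs)
```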

I commented out the code in page_handler() that creates the new WiktionaryConfig object, and the processing time for the Chinese Wiktionary dump file decreased from over 20 minutes to 13 minutes.


kristian-clausal commented Jul 7, 2023

It sounds like we should just move most of the stuff from __init__() in WiktionaryConfig one level higher into WiktextractContext, which was my intention when creating WiktextractContext, but I haven't gotten around to it... At that point, we could also rename WiktionaryConfig to something more suitable as a container for multiprocessing return data.

I think the reason the data is returned from page_handler in this way is multiprocessing. Writing into the same .json file from several processes sounds dangerous. Writing into separate files that are then merged doesn't sound optimal either, so keeping the stats and errors in memory, returning them up from the multiprocessing pool, and then processing them (keeping them in memory, or maybe writing them to a file at that point) still seems sensible.
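A sketch of that pattern, assuming an illustrative page_handler body: workers return their per-page stats, and only the parent merges them.

```python
from collections import Counter
from multiprocessing import Pool

def page_handler(page):
    # Each worker returns its stats instead of touching shared state;
    # nothing is written to disk from inside the pool.
    stats = Counter({"pages": 1, "sections": page.count("=")})
    return page, stats

if __name__ == "__main__":
    pages = ["==English==", "==French==", "==Chinese=="]
    totals = Counter()
    with Pool() as pool:
        for _data, stats in pool.imap_unordered(page_handler, pages):
            totals.update(stats)  # merged serially in the parent
    print(totals)
```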

I will ask Tatu to comment on this, so we can hear what the reasoning behind the structure is.


xxyzz commented Jul 7, 2023

The returned message data (errors, warnings, debugs) can be written to separate files line by line in the final for loop in the parent process, so there won't be any races. On the contrary, I think the current code is modifying the same wxr.wtp object from multiple processes.
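A sketch of that single-writer loop, assuming page_handler returns (page_data, messages) tuples (the names and bodies here are illustrative):

```python
import json
from multiprocessing import Pool

def page_handler(page):
    # Illustrative worker: returns the parsed data plus any messages
    # collected while parsing, instead of writing them itself.
    return {"title": page}, [{"msg": f"parsed {page}"}]

if __name__ == "__main__":
    pages = ["a", "b", "c"]
    with Pool() as pool, open("errors.jsonl", "w", encoding="utf-8") as err:
        for page_data, messages in pool.imap_unordered(page_handler, pages):
            # Only the parent process writes, one line at a time,
            # so there is no race between workers.
            for msg in messages:
                err.write(json.dumps(msg, ensure_ascii=False) + "\n")
```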

And if WiktionaryConfig.__init__ is moved, then the WiktionaryConfig class could be deleted entirely, because it would only contain error messages from Wtp, and there is no need for another class to wrap the returned data.


kristian-clausal commented Jul 7, 2023

The wxr and wxr.wtp objects are forked by the processes, so each process keeps its own version in memory; they're duplicated because of the limitations of Python multiprocessing.

As far as I understand it:

  1. Using Python threads is not actually parallelized over multiple cores; threading is just an abstraction layer that still runs sequentially (that is, Python can physically process only one thread at a time).
  2. Multiple processes can be forked, but they duplicate memory and can't communicate with each other or with their parent process except by returning values (see the demonstration below).
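A small demonstration of point 2, assuming the fork start method (not available on Windows): each worker mutates its own copy of the parent's memory, and the parent only sees what the workers return.

```python
import multiprocessing as mp

state = {"counter": 0}

def worker(_):
    # Each forked worker got a copy-on-write duplicate of `state`;
    # this increment is invisible to the parent and to other workers.
    state["counter"] += 1
    return state["counter"]

if __name__ == "__main__":
    with mp.get_context("fork").Pool(2) as pool:
        print(pool.map(worker, range(4)))  # workers' local counts
    print(state["counter"])  # still 0 in the parent
```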


xxyzz commented Jul 7, 2023

These callback functions make things more complicated, and it looks like the wxr object is the same object from the parent process... What I'm trying to say is that the result of wxr.wtp.to_return() can be returned directly and consumed in the main loop so it won't be raced (inside wiktionary.reprocess_wiktionary(); this should work the same way as how each word's data is written to the final JSON file).


xxyzz commented Jul 27, 2023

#296 removes the code that creates a new WiktionaryConfig object in wiktionary.page_handler().
