Proposition of changes to WiktionaryConfig class #278

Open
xxyzz opened this issue Jul 7, 2023 · 5 comments

xxyzz commented Jul 7, 2023

Currently, a new WiktionaryConfig object is created for every page in wiktionary.page_handler() in order to save and return WiktionaryConfig.language_counts, WiktionaryConfig.pos_counts, WiktionaryConfig.section_counts, WiktionaryConfig.errors, WiktionaryConfig.warnings, and WiktionaryConfig.debugs.

But WiktionaryConfig.__init__() loads many JSON files and converts the language data, so rerunning this code for every page is very inefficient, especially when a JSON file is as large as languages.json. The --statistics feature also doesn't currently work, because only WiktionaryConfig.section_counts is filled with data, and even if the other counters were populated, the feature would still be unusable because it prints very long output and uses lots of memory. WiktionaryConfig.merge_return() also saves error messages from the Wtp object into an array, which can likewise grow huge.
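A minimal sketch of the pattern described above, assuming a simplified class body (the real __init__() loads and converts several data files; the path and attributes here are illustrative):

```python
import json

class WiktionaryConfig:
    def __init__(self):
        # Re-reads the large language table on every instantiation;
        # with one instance per page, this repeated I/O and parsing
        # dominates the run time.
        with open("wiktextract/data/languages.json", encoding="utf-8") as f:
            self.LANGUAGES_BY_CODE = json.load(f)
        self.section_counts = {}
        self.errors, self.warnings, self.debugs = [], [], []

def page_handler(page):
    config = WiktionaryConfig()  # constructed once per page
    # ... parse the page, filling config's counters and message lists ...
    return page, config
```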

Here are my propositions:

  1. Remove the --statistics option and delete WiktionaryConfig.language_counts, WiktionaryConfig.pos_counts, and WiktionaryConfig.section_counts, so a new WiktionaryConfig object won't have to be created for each page. We could write separate code that processes the final JSON file and creates some statistics charts in a scheduled GitHub Actions job.
  2. Write error, warning, and debug message data to separate JSON Lines files (see the sketch after this list).
  3. Extract the language data from the dump file using the Lua code in the languages folder after all pages are added to the database, and save that data to the database as well, so we'll always use up-to-date language data. But since this data would not be available from the start, code that requires the language data before the dump file is parsed will need to be changed.
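For proposition 2, a minimal sketch of the JSON Lines approach (the file names and message shape are assumptions for illustration, not existing wiktextract API):

```python
import json

def append_messages(path, messages):
    # One JSON object per line (JSON Lines); the file can later be
    # filtered or counted without loading it all into memory.
    with open(path, "a", encoding="utf-8") as f:
        for msg in messages:
            f.write(json.dumps(msg, ensure_ascii=False) + "\n")

# e.g. after processing a page:
# append_messages("errors.jsonl", config.errors)
# append_messages("warnings.jsonl", config.warnings)
# append_messages("debugs.jsonl", config.debugs)
```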

I commented out the code in page_handler() that creates the new WiktionaryConfig object, and the processing time for the Chinese Wiktionary dump file decreased from over 20 minutes to 13 minutes.


kristian-clausal commented Jul 7, 2023

It sounds like we should just move most of the stuff from __init__() in WiktionaryConfig one level higher into WiktextractContext, which was my intention when creating WiktextractContext, but I haven't gotten around to it... At that point, we could also rename WiktionaryConfig to something more suitable as a container for multiprocessing return data.

I think the reason the data is returned from page_handler in this way is multiprocessing. Writing into the same .json file from several processes sounds dangerous. Writing into separate files that are then merged doesn't sound optimal either, so keeping the stats and errors in memory, returning them up from the multiprocessing pool, and then processing them (keeping them in memory, or maybe writing them to a file at that point) still seems sensible.
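A sketch of that pattern, assuming an illustrative page_handler body: workers return their per-page stats, and only the parent merges them.

```python
from collections import Counter
from multiprocessing import Pool

def page_handler(page):
    # Each worker returns its stats instead of touching shared state;
    # nothing is written to disk from inside the pool.
    stats = Counter({"pages": 1, "sections": page.count("=")})
    return page, stats

if __name__ == "__main__":
    pages = ["==English==", "==French==", "==Chinese=="]
    totals = Counter()
    with Pool() as pool:
        for _data, stats in pool.imap_unordered(page_handler, pages):
            totals.update(stats)  # merged serially in the parent
    print(totals)
```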

I will ask Tatu to comment on this, so we can hear what the reasoning behind the structure is.


xxyzz commented Jul 7, 2023

The returned message data (errors, warnings, debugs) can be written to separate files line by line in the final for loop in the parent process, so there won't be any races. On the contrary, I think the current code is modifying the same wxr.wtp object from multiple processes.
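A sketch of that single-writer loop, assuming page_handler returns (page_data, messages) tuples (the names and bodies here are illustrative):

```python
import json
from multiprocessing import Pool

def page_handler(page):
    # Illustrative worker: returns the parsed data plus any messages
    # collected while parsing, instead of writing them itself.
    return {"title": page}, [{"msg": f"parsed {page}"}]

if __name__ == "__main__":
    pages = ["a", "b", "c"]
    with Pool() as pool, open("errors.jsonl", "w", encoding="utf-8") as err:
        for page_data, messages in pool.imap_unordered(page_handler, pages):
            # Only the parent process writes, one line at a time,
            # so there is no race between workers.
            for msg in messages:
                err.write(json.dumps(msg, ensure_ascii=False) + "\n")
```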

And if WiktionaryConfig.__init__ is moved, then the WiktionaryConfig class could be deleted entirely, because it would only contain error messages from Wtp, and there is no need for another class to wrap the returned data.


kristian-clausal commented Jul 7, 2023

The wxr and wxr.wtp objects are forked by the processes, so each process keeps its own version in memory; they're duplicated because of the limitations of Python multiprocessing.

As far as I understand it:

  1. Using Python threads is not actually parallelized over multiple cores; threading is just an abstraction layer that still runs sequentially (that is, Python can physically process only one thread at a time).
  2. Multiple processes can be forked, but they duplicate memory and can't communicate with each other or with their parent process except by returning values (see the demonstration below).
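A small demonstration of point 2, assuming the fork start method (not available on Windows): each worker mutates its own copy of the parent's memory, and the parent only sees what the workers return.

```python
import multiprocessing as mp

state = {"counter": 0}

def worker(_):
    # Each forked worker got a copy-on-write duplicate of `state`;
    # this increment is invisible to the parent and to other workers.
    state["counter"] += 1
    return state["counter"]

if __name__ == "__main__":
    with mp.get_context("fork").Pool(2) as pool:
        print(pool.map(worker, range(4)))  # workers' local counts
    print(state["counter"])  # still 0 in the parent
```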


xxyzz commented Jul 7, 2023

These callback functions make things more complicated, and it looks like the wxr object is the same object from the parent process... What I'm trying to say is that the result of wxr.wtp.to_return() can be returned directly and consumed in the main loop so it won't be raced (inside wiktionary.reprocess_wiktionary(); this should work the same way as how each word's data is written to the final JSON file).


xxyzz commented Jul 27, 2023

#296 removes the code that creates a new WiktionaryConfig object in wiktionary.page_handler().
