"neural_modifications" has lost some meaning as name of this branch #246

Draft · wants to merge 56 commits into base: master
Conversation

@cgr71ii (Collaborator) commented Nov 19, 2022

This got out of hand: too many small changes. Sorry in advance, I know I should have split all these changes into different branches :/

Real "neural"-related changes:

  • Automatic management of GPU allocation in the neural tools (i.e. the NDA, vecalign and bicleaner rules). By configuring parallelJobs or the CUDA_VISIBLE_DEVICES envvar, GPU management becomes automatic. Since multiple jobs may run in parallel and Snakemake uses processes, not threads, per job, we need inter-process communication, and since we want to allocate GPUs, we need to implement a mutex mechanism through files due to the Snakemake design: https://snakemake.readthedocs.io/en/stable/project_info/faq.html#i-want-to-pass-variables-between-rules-is-that-possible
    • I tried other methods provided in the standard Python library (e.g. semaphores, mutexes), but they didn't work because they are intended for communication between threads, not between different processes. I didn't find any other solution which fulfilled the needs of the feature, so I ended up using what best fit the situation given the Snakemake FAQ.
    • mutex.py: helper functions currently used for GPU allocation among processes. They should work even across all Snakemake instances (this is good news for allocating resources, since we can share resources even among executions! If the implemented methods are ever wanted for purposes which need communication only among jobs of a single execution, an ID of the execution may be added to the data structure), and they should work for any other purpose which needs communication between jobs. They use PersistentDict from pytools, which guarantees exclusive access in a concurrent environment, in order to let the jobs communicate. The library stores files in ~/.cache/pytools/, so if multiple instances are running (e.g. run-tests-min.sh), concurrent access is still guaranteed and there will be no problems, but a warning has been configured to alert when the file exists and it shouldn't (when the execution finishes, the stored data structures are deleted in the tear-down method, but if the execution was abruptly interrupted or another problem happened, the file will remain, and this might cause problems). If the execution is abruptly interrupted, resources configured to be allocated might not be allocatable afterwards (by default the allocation is disabled), and if all resources were allocated in the previous, interrupted execution, this might cause a deadlock (in these cases, the solution is to check the files in ~/.cache/pytools/ and remove those associated with the GPU allocation).
    • allocate_cuda_gpus.py: it uses mutex.py in order to allocate GPUs automatically. Each job that demands a GPU provides a "token", which has to be the path of a file which doesn't exist. While this file doesn't exist, the GPU is allocated to that job; when the job finishes, it is its responsibility to touch the token it initially provided in order to release the GPU. Whenever any job tries to allocate a GPU, all previously provided tokens are checked, and if a token file exists, the corresponding GPU is freed and can be assigned to another job. A method is provided to be invoked in the teardown of the whole execution, which, basically, releases all the GPUs which hadn't been released yet (this might happen if a job "skipped" its responsibility or, for any reason, failed and couldn't touch the provided token). This teardown method guarantees that the next execution won't find any allocated resource (in the case of non-concurrent executions). A minimal sketch of this mechanism is shown right after this list.
  • The configuration option embeddingsBatchSize has been generalized and replaced by neuralToolsBatchSize. neuralToolsBatchSize is a dictionary-like option which allows specifying the batch size of the different neural tools: NDA, vecalign and bicleaner AI.
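
A minimal sketch of the token-file GPU allocation idea on top of pytools' PersistentDict (this is not the actual mutex.py / allocate_cuda_gpus.py code; every name and detail below is an illustrative assumption):

```python
import os

from pytools.persistent_dict import PersistentDict, NoSuchEntryError

# PersistentDict stores its data under ~/.cache/pytools/ and serializes
# concurrent access to each operation, so separate Snakemake processes
# (and even separate executions) can share this structure.
STORAGE = PersistentDict("gpu_allocation_sketch")  # hypothetical identifier
KEY = "allocated_gpus"  # maps GPU id -> token file path, or None if free


def _visible_gpus():
    # GPUs visible to this execution, taken from CUDA_VISIBLE_DEVICES.
    devices = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return [d for d in devices.split(",") if d]


def allocate_gpu(token_path):
    """Try to allocate a GPU for the job identified by 'token_path'.

    A GPU is considered free if it has no token or if its token file
    already exists (i.e. the owning job touched it when it finished).
    NOTE: a real implementation would also guard this read-modify-write
    with a mutex (as the PR's mutex.py does); the sketch omits that."""
    try:
        allocation = STORAGE.fetch(KEY)
    except NoSuchEntryError:
        allocation = {gpu: None for gpu in _visible_gpus()}

    for gpu, token in allocation.items():
        if token is None or os.path.exists(token):
            allocation[gpu] = token_path
            STORAGE.store(KEY, allocation)
            return gpu  # the job would run with CUDA_VISIBLE_DEVICES=<gpu>

    return None  # no free GPU right now


def release_gpu(token_path):
    # Touching the token file marks the GPU as releasable for other jobs.
    open(token_path, "a").close()
```

A job would export CUDA_VISIBLE_DEVICES to the returned id before running the neural tool and touch its token file when it finishes; the execution-wide teardown described above would then touch any tokens that are still pending.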

Rest of changes:

  • Changes described in Some changes from neural_modifications branch: Metadata refactorization #245
    • Replacement of '\' with '\\' in all places where it applies (extended explanation in the previously mentioned PR).
    • Use of the logging library in all places where it applies (extended explanation in the previously mentioned PR).
  • GHA scripts updated (there were deprecation warnings).
  • New option in text2prevertical.py: --random-date. This option allows making the tests deterministic (if --seed is used) for those cases which use the script.
  • Tests which generate a prevertical file using text2prevertical.py now enable additionalMetadata as well in order to extend the coverage of those test cases.
  • Snakemake handlers used: onstart, onsuccess and onerror (see the sketch after this list).
  • Utilities get_snakemake_execution_mark and is_first_snakemake_execution have been removed since they are no longer needed now that we are using the onstart handler.
  • Minor fix: when until was not configured and the selected docalign was not Bleualign, the file added to the output documents was not the correct one, which led, if I remember correctly, to extra files being generated but not used.
  • Some TODOs resolved and others added.
  • Rule pre_filter_sort_flags removed and pre_filter renamed to pre_filter_sort_flags.
    • Rule pre_filter_sort_flags was forcing the execution to wait until all files from rule pre_filter were ready instead of continuing with the rule filter. Since the rule pre_filter_sort_flags was only checking that all the files from rule pre_filter were exactly the same, it is not strictly necessary (this task has been moved to the sents rule), and the blocking was slowing down the global execution.
  • Rules filter and sents have been simplified. Now, the output files of filter don't include the header; instead, the header is printed to separate files. This makes the processing in the sents rule easier, and the header is re-added in that last rule. Since the output files of the filter rule are temp files, they are not expected to be kept, so the missing header fields in these files are not very important, I think, and this improves the performance and readability of the filter and sents rules, at the cost of losing the header fields of the filter rule output files.
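
Regarding the Snakemake handlers: a short illustrative sketch of how onstart, onsuccess and onerror look in a Snakefile (the messages are made up; this is not the actual Bitextor code):

```python
# Snakefile (illustrative)

onstart:
    # Runs once before the first job starts; a natural place to set up
    # shared state such as the GPU allocation structures.
    print("Execution starting: initializing shared resources")

onsuccess:
    # Runs after the whole workflow finishes successfully.
    print("Execution finished: releasing shared resources")

onerror:
    # Runs if the workflow fails; useful for the same cleanup so that the
    # next execution doesn't find stale allocations.
    print("Execution failed: releasing shared resources anyway")
```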

Automatic GPU allocation is achieved through mutex mechanisms.
Once a job acquires the mutex, it allocates a device. Affected
jobs are NDA, vecalign and bicleaner.

Since the mutex mechanisms are implemented through PersistentDict
from pytools (recommended in the Snakemake FAQ:
https://snakemake.readthedocs.io/en/stable/project_info/faq.html#i-want-to-pass-variables-between-rules-is-that-possible),
the GPU allocation should work even if multiple Snakemake
instances are running. This may be useful since it allows handling
the GPU allocation automatically for different configurations of
Bitextor.

Other changes:
 - Typo in docs
 - Use of snakemake handlers (e.g. onstart)
Instead of having the paragraph id in the same file as the
sentences, a new file is intended to be created and processed by
the segalign.
This is what is being done right now, but with all the content in
only one file, which leads to extra processing and makes many
rules dependent on this feature.
Other minor fixes and changes
A single backslash is replaced by a literal new line when processed
by Snakemake. We usually want a double backslash because we want to
introduce a literal backslash, which in bash means continuing the
current command on the next line.
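
For illustration, this is how plain Python string literals (as used in Snakemake shell commands) treat the two variants; the command is made up, not one of the actual Bitextor rules:

```python
# A single backslash at the end of a line inside a string literal is a
# Python line-continuation escape: both the backslash and the newline are
# removed, so bash never sees them.
single = "cat input.txt \
 | gzip -c > output.gz"

# "\\" leaves one literal backslash in the string, and bash treats
# backslash + newline as "continue the current command on the next line".
double = "cat input.txt \\\n | gzip -c > output.gz"

print(repr(single))  # 'cat input.txt  | gzip -c > output.gz'
print(repr(double))  # 'cat input.txt \\\n | gzip -c > output.gz'
```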
Tests which were using a generated prevertical file were not
deterministic because of the date in the prevertical.
@lpla (Member) commented Dec 4, 2022

Let's see what the Intensive tests say and I will do the review.

@lpla (Member) commented Dec 7, 2022

Current intensive tests from run-tests.sh are working. But I think that we should add new tests to be sure that all these changes regarding GPU management are correctly working before doing a review of this code. At least to test it on offline machines with several GPUs.

@lpla force-pushed the master branch 5 times, most recently from f177983 to eb5156c on February 7, 2023