"neural_modifications" has lost some meaning as name of this branch #246

Draft · wants to merge 56 commits into base: master
Conversation

@cgr71ii (Collaborator) commented Nov 19, 2022

This got out of hand: too many small changes. Sorry in advance, I know I should have split all these changes into different branches :/

Real "neural"-related changes:

  • Automatic management of GPU allocation in the neural tools (i.e. the NDA, vecalign and bicleaner rules). By configuring parallelJobs or the CUDA_VISIBLE_DEVICES envvar, GPU management becomes automatic. Since multiple jobs may run in parallel and Snakemake uses processes, not threads, per job, we need inter-process communication, and since we want to allocate GPUs, we need to implement a mutex mechanism through files due to the Snakemake design: https://snakemake.readthedocs.io/en/stable/project_info/faq.html#i-want-to-pass-variables-between-rules-is-that-possible
    • I tried other methods provided in the standard Python library (e.g. semaphores, mutexes), but they didn't work because they are intended for communication between threads, not between different processes. I didn't find any other solution which fulfilled the needs of the feature, so I ended up using what best fit the situation given the Snakemake FAQ.
    • mutex.py: helper functions currently used for GPU allocation among processes. They should work even across all Snakemake instances (this is good news for allocating resources, since we can share resources even among executions! If the implemented methods are ever wanted for purposes which need communication only among jobs of a single execution, an ID of the execution may be added to the data structure), and they should work for any other purpose which needs communication between jobs. They use PersistentDict from pytools, which guarantees exclusive access in a concurrent environment, in order to let the jobs communicate. The library stores files in ~/.cache/pytools/, so if multiple instances are running (e.g. run-tests-min.sh), concurrent access is still guaranteed and there will be no problems, but a warning has been configured to alert when the file exists and it shouldn't (when the execution finishes, the stored data structures are deleted in the tear-down method, but if the execution was abruptly interrupted or another problem happened, the file will remain, and this might cause problems). If the execution is abruptly interrupted, resources configured to be allocated might not be allocatable afterwards (by default the allocation is disabled), and if all resources were allocated in the previous, interrupted execution, this might cause a deadlock (in these cases, the solution is to check the files in ~/.cache/pytools/ and remove those associated with the GPU allocation).
    • allocate_cuda_gpus.py: it uses mutex.py in order to allocate GPUs automatically. Each job that demands a GPU provides a "token", which has to be the path of a file which doesn't exist. While this file doesn't exist, the GPU is allocated to that job; when the job finishes, it is its responsibility to touch the token it initially provided in order to release the GPU. Whenever any job tries to allocate a GPU, all previously provided tokens are checked, and if a token file exists, the corresponding GPU is freed and can be assigned to another job. A method is provided to be invoked in the teardown of the whole execution, which, basically, releases all the GPUs which hadn't been released yet (this might happen if a job "skipped" its responsibility or, for any reason, failed and couldn't touch the provided token). This teardown method guarantees that the next execution won't find any allocated resource (in the case of non-concurrent executions). A minimal sketch of this mechanism is shown right after this list.
  • The configuration option embeddingsBatchSize has been generalized and replaced by neuralToolsBatchSize. neuralToolsBatchSize is a dictionary-like option which allows specifying the batch size of the different neural tools: NDA, vecalign and bicleaner AI.
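
A minimal sketch of the token-file GPU allocation idea on top of pytools' PersistentDict (this is not the actual mutex.py / allocate_cuda_gpus.py code; every name and detail below is an illustrative assumption):

```python
import os

from pytools.persistent_dict import PersistentDict, NoSuchEntryError

# PersistentDict stores its data under ~/.cache/pytools/ and serializes
# concurrent access to each operation, so separate Snakemake processes
# (and even separate executions) can share this structure.
STORAGE = PersistentDict("gpu_allocation_sketch")  # hypothetical identifier
KEY = "allocated_gpus"  # maps GPU id -> token file path, or None if free


def _visible_gpus():
    # GPUs visible to this execution, taken from CUDA_VISIBLE_DEVICES.
    devices = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return [d for d in devices.split(",") if d]


def allocate_gpu(token_path):
    """Try to allocate a GPU for the job identified by 'token_path'.

    A GPU is considered free if it has no token or if its token file
    already exists (i.e. the owning job touched it when it finished).
    NOTE: a real implementation would also guard this read-modify-write
    with a mutex (as the PR's mutex.py does); the sketch omits that."""
    try:
        allocation = STORAGE.fetch(KEY)
    except NoSuchEntryError:
        allocation = {gpu: None for gpu in _visible_gpus()}

    for gpu, token in allocation.items():
        if token is None or os.path.exists(token):
            allocation[gpu] = token_path
            STORAGE.store(KEY, allocation)
            return gpu  # the job would run with CUDA_VISIBLE_DEVICES=<gpu>

    return None  # no free GPU right now


def release_gpu(token_path):
    # Touching the token file marks the GPU as releasable for other jobs.
    open(token_path, "a").close()
```

A job would export CUDA_VISIBLE_DEVICES to the returned id before running the neural tool and touch its token file when it finishes; the execution-wide teardown described above would then touch any tokens that are still pending.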

Rest of changes:

  • Changes described in Some changes from neural_modifications branch: Metadata refactorization #245
    • Replacement of '\' with '\\' in all places where it applies (extended explanation in the previously mentioned PR).
    • Use of the logging library in all places where it applies (extended explanation in the previously mentioned PR).
  • GHA scripts updated (there were deprecation warnings).
  • New option in text2prevertical.py: --random-date. This option allows making the tests deterministic (if --seed is used) for those cases which use the script.
  • Tests which generate a prevertical file using text2prevertical.py now enable additionalMetadata as well in order to extend the coverage of those test cases.
  • Snakemake handlers used: onstart, onsuccess and onerror (see the sketch after this list).
  • Utilities get_snakemake_execution_mark and is_first_snakemake_execution have been removed since they are no longer needed now that we are using the onstart handler.
  • Minor fix: when until was not configured and the selected docalign was not Bleualign, the file added to the output documents was not the correct one, which led, if I remember correctly, to extra files being generated but not used.
  • Some TODOs resolved and others added.
  • Rule pre_filter_sort_flags removed and pre_filter renamed to pre_filter_sort_flags.
    • Rule pre_filter_sort_flags was forcing the execution to wait until all files from rule pre_filter were ready instead of continuing with the rule filter. Since the rule pre_filter_sort_flags was only checking that all the files from rule pre_filter were exactly the same, it is not strictly necessary (this task has been moved to the sents rule), and the blocking was slowing down the global execution.
  • Rules filter and sents have been simplified. Now, the output files of filter don't include the header; instead, the header is printed to separate files. This makes the processing in the sents rule easier, and the header is re-added in that last rule. Since the output files of the filter rule are temp files, they are not expected to be kept, so the missing header fields in these files are not very important, I think, and this improves the performance and readability of the filter and sents rules, at the cost of losing the header fields of the filter rule output files.
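
Regarding the Snakemake handlers: a short illustrative sketch of how onstart, onsuccess and onerror look in a Snakefile (the messages are made up; this is not the actual Bitextor code):

```python
# Snakefile (illustrative)

onstart:
    # Runs once before the first job starts; a natural place to set up
    # shared state such as the GPU allocation structures.
    print("Execution starting: initializing shared resources")

onsuccess:
    # Runs after the whole workflow finishes successfully.
    print("Execution finished: releasing shared resources")

onerror:
    # Runs if the workflow fails; useful for the same cleanup so that the
    # next execution doesn't find stale allocations.
    print("Execution failed: releasing shared resources anyway")
```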

Automatic GPU allocation is achieved through mutex mechanisms.
Once a job acquires the mutex, it allocates a device. Affected
jobs are NDA, vecalign and bicleaner.

Since the mutex mechanisms are implemented through PersistentDict
from pytools (recommended in the Snakemake FAQ:
https://snakemake.readthedocs.io/en/stable/project_info/faq.html#i-want-to-pass-variables-between-rules-is-that-possible),
the GPU allocation should work even if multiple Snakemake
instances are running. This may be useful since it allows handling
the GPU allocation automatically for different configurations of
Bitextor.

Other changes:
 - Typo in docs
 - Use of snakemake handlers (e.g. onstart)
Instead of having the paragraph id in the same file as the
sentences, a new file is intended to be created and processed by
the segalign.
This is what is being done right now, but with all the content in
only one file, which leads to extra processing and makes many
rules dependent on this feature.
Other minor fixes and changes
A single backslash is replaced by a literal new line when processed
by Snakemake. We usually want a double backslash because we want to
introduce a literal backslash, which in bash means continuing the
current command on the next line.
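
For illustration, this is how plain Python string literals (as used in Snakemake shell commands) treat the two variants; the command is made up, not one of the actual Bitextor rules:

```python
# A single backslash at the end of a line inside a string literal is a
# Python line-continuation escape: both the backslash and the newline are
# removed, so bash never sees them.
single = "cat input.txt \
 | gzip -c > output.gz"

# "\\" leaves one literal backslash in the string, and bash treats
# backslash + newline as "continue the current command on the next line".
double = "cat input.txt \\\n | gzip -c > output.gz"

print(repr(single))  # 'cat input.txt  | gzip -c > output.gz'
print(repr(double))  # 'cat input.txt \\\n | gzip -c > output.gz'
```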
Tests which were using a generated prevertical file were not
deterministic because of the date in the prevertical.
@lpla (Member) commented Dec 4, 2022

Let's see what the Intensive tests say and I will do the review.

@lpla (Member) commented Dec 7, 2022

Current intensive tests from run-tests.sh are working. But I think that we should add new tests to be sure that all these changes regarding GPU management are correctly working before doing a review of this code. At least to test it on offline machines with several GPUs.

@lpla force-pushed the master branch 5 times, most recently from f177983 to eb5156c on February 7, 2023