"neural_modifications" has lost some meaning as name of this branch #246
Draft · cgr71ii wants to merge 56 commits into master from neural_modifications
Conversation
Automatic GPU allocation is achieved through mutex mechanisms. Once a job acquires the mutex, it allocates a device. The affected jobs are NDA, vecalign and bicleaner. Since the mutex mechanisms are implemented through PersistentDict from pytools (recommended in the Snakemake FAQ: https://snakemake.readthedocs.io/en/stable/project_info/faq.html#i-want-to-pass-variables-between-rules-is-that-possible), the GPU allocation should work even if multiple Snakemake instances are running. This may be useful since it allows handling GPU allocation automatically for different configurations of Bitextor. Other changes: typo in docs; use of Snakemake handlers (e.g. onstart).
Instead of having the paragraph ID in the same file as the sentences, a new file is intended to be created and processed by segalign. This is what is being done right now, but all the content is moved into a single file, which leads to extra processing and makes many rules dependent on this feature.
Other minor fixes and changes
A single backslash is replaced by a literal newline when processed by Snakemake. We usually want a double backslash, which introduces a literal backslash, because in bash that means continuing the current command on the next line (a minimal example follows).
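A minimal illustration of the double-backslash form described above, assuming a standard Snakemake `shell` directive; the rule itself is made up:

```python
rule backslash_demo:
    shell:
        # "\\" in the Python string becomes a single "\" for bash, and
        # "\n" becomes a real newline, so bash sees a line continuation:
        #   echo hello \
        #   world
        "echo hello \\\n"
        "world"
```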
Tests which were using a generated prevertical file were not deterministic because of the date in the prevertical (a sketch of the fix follows).
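A hedged sketch of how a seeded random date keeps such tests deterministic; the function and date range here are illustrative, not the actual `text2prevertical.py` code:

```python
import datetime
import random

def random_date(seed=None):
    # A fixed seed makes the generated date, and thus the prevertical
    # content that embeds it, reproducible across test runs.
    rng = random.Random(seed)
    start = datetime.date(2000, 1, 1).toordinal()
    end = datetime.date(2023, 1, 1).toordinal()
    return datetime.date.fromordinal(rng.randint(start, end))

print(random_date(seed=42))  # prints the same date on every run
```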
Let's see what the Intensive tests say and I will do the review.
Current intensive tests from …
…textor into neural_modifications
lpla force-pushed the master branch 5 times, most recently from f177983 to eb5156c (February 7, 2023 15:51)
This got out of hand; too many small changes. Sorry in advance, I know I should have split all these changes into different branches :/
Real "neural"-related changes:
- Whatever the value of `parallelJobs` or the `CUDA_VISIBLE_DEVICES` envvar, GPU management is automatic. This is achieved by configuring the `CUDA_VISIBLE_DEVICES` envvar per job. Since multiple jobs may run in parallel and Snakemake uses processes, not threads, per job, we need inter-process communication; and since we want to allocate GPUs, we need to implement a mutex mechanism through files, due to Snakemake's design: https://snakemake.readthedocs.io/en/stable/project_info/faq.html#i-want-to-pass-variables-between-rules-is-that-possible
- `mutex.py`: helper functions which are currently used for the GPU allocation among processes, and which should work even across all Snakemake instances (this is good news for allocating resources, since we can share resources even among executions! If the implemented methods are wanted for other purposes which demand communication only among jobs, but not among executions, an ID of the execution may be used in the data structure). It should also work for any other purpose which needs communication between jobs. It uses `PersistentDict` from `pytools`, which guarantees exclusive access in a concurrent environment, in order to allow the jobs to communicate. The library stores files in `~/.cache/pytools/`, so in case multiple instances are running (e.g. `run-tests-min.sh`), concurrent access is guaranteed and there should be no problems; still, a warning has been configured to alert when the file already exists and shouldn't (when the execution finishes, the stored data structures are deleted in the teardown method, but if the execution was abruptly interrupted or another problem happened, the file will still exist, and this might cause problems). If the execution is abruptly interrupted, resources might not be allocatable in the next run if allocation was configured (by default it is disabled), and if all resources were allocated in the previous, interrupted execution, this might cause a deadlock (in these cases, the solution is to check the files in `~/.cache/pytools/` and remove those associated with the GPU allocation).
- `allocate_cuda_gpus.py`: uses `mutex.py` in order to automatically allocate GPUs. Each job that demands a GPU provides a "token", which has to be the path of a file that doesn't exist yet. While this file doesn't exist, the resource stays allocated to that job, and when the job finishes, it is its responsibility to `touch` the token it initially provided in order to release the GPU. Whenever any job tries to allocate a GPU, all previously provided tokens are checked, and if a token file exists, the corresponding GPU is freed and can be assigned to another job. A method is provided to be invoked in the teardown of the whole execution which, basically, releases all the GPUs that had still not been released (this might happen if a job "skipped" its responsibility or, for any reason, failed and couldn't touch the provided token). This teardown method guarantees that the next execution won't find any allocated resource (in the case of non-concurrent executions). A sketch of this allocation protocol follows the list.
- `embeddingsBatchSize` has been generalized and substituted by `neuralToolsBatchSize`. Now, `neuralToolsBatchSize` is a dictionary-like option which allows specifying the batch size of the different neural tools: NDA, vecalign and bicleaner AI.
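The following is a minimal sketch of the allocation protocol described above (a shared table in `PersistentDict`, token files to release GPUs). It is not the actual `mutex.py`/`allocate_cuda_gpus.py`: the lock file path, key names and function names are assumptions, and the explicit file lock stands in for the PR's mutex mechanism.

```python
import os
import time

from pytools.persistent_dict import NoSuchEntryError, PersistentDict

LOCK_PATH = "/tmp/bitextor_gpu.lock"        # assumed lock location
storage = PersistentDict("gpu_allocation")  # files live in ~/.cache/pytools/

def _acquire_lock():
    # Spin until the lock file can be created exclusively (the mutex).
    while True:
        try:
            os.close(os.open(LOCK_PATH, os.O_CREAT | os.O_EXCL | os.O_WRONLY))
            return
        except FileExistsError:
            time.sleep(0.1)

def _release_lock():
    os.remove(LOCK_PATH)

def allocate_gpu(token_path, num_gpus=1):
    """Allocate a free GPU to the job owning `token_path`.

    `token_path` must not exist yet; the GPU stays allocated until the
    job touches (creates) that file."""
    _acquire_lock()
    try:
        try:
            table = storage.fetch("gpus")  # {gpu_id: token_path or None}
        except NoSuchEntryError:
            table = {gpu: None for gpu in range(num_gpus)}
        for gpu, token in table.items():
            # A GPU is free if it was never taken or its token now exists.
            if token is None or os.path.exists(token):
                table[gpu] = token_path
                storage.store("gpus", table)
                return gpu
        return None  # every GPU is busy; the caller may wait and retry
    finally:
        _release_lock()

def release_gpu(token_path):
    # Touching the token marks the GPU as free for the next allocation.
    open(token_path, "w").close()
```

A job would then export the returned ID through `CUDA_VISIBLE_DEVICES` before launching the neural tool, and `touch` its token when done.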
Rest of changes:
- `neural_modifications` branch: Metadata refactorization #245.
- Use of the `logging` library in all places where it applies (extended explanation in the previously mentioned PR).
- `text2prevertical.py`: new option `--random-date`. This option allows making the tests deterministic (if `--seed` is used) for the cases which use the script.
- `text2prevertical.py` now enables `additionalMetadata` as well, in order to extend the coverage of those test cases.
- Snakemake handlers: `onstart`, `onsuccess` and `onerror` (a sketch of their use follows the list).
- `get_snakemake_execution_mark` and `is_first_snakemake_execution` have been removed since they are no longer needed now that we use the `onstart` handler.
- When `until` was not configured and the selected docalign was not Bleualign, the file added to the output documents was not the correct one, which led, if I remember correctly, to extra files being generated but not used.
- Some `TODO`s resolved and others added.
- `pre_filter_sort_flags` removed and `pre_filter` renamed to `pre_filter_sort_flags`. `pre_filter_sort_flags` was forcing a wait until all files from the rule `pre_filter` were ready instead of continuing the execution with the rule `filter`. Since the rule `pre_filter_sort_flags` was only checking that all the files from the rule `pre_filter` were exactly the same, it is not strictly necessary (this task has been moved to the `sents` rule), and the blocking was slowing down the global execution.
- The rules `filter` and `sents` have been simplified. Now, the `filter` rule doesn't add the header to its output files but instead prints the header to separate files. This makes the processing in the `sents` rule easier, and the header is re-added in this last rule. Since the output files from the `filter` rule are `temp` files, they are not expected to be released, so the fact that the header fields are not in these files is not very important, I think, and it improves the performance and readability of the `filter` and `sents` rules, at the cost of losing the header fields of the `filter` rule output files.
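Finally, a hedged Snakefile fragment showing how the `onstart`/`onsuccess`/`onerror` handlers mentioned above can drive the teardown of the GPU allocation; the cleanup helper and the file pattern are hypothetical, not the actual code from this PR:

```python
from pathlib import Path

def teardown_gpu_allocation():
    # Hypothetical cleanup: drop the pytools files used for the GPU
    # allocation so the next execution starts from a clean state.
    for f in Path.home().glob(".cache/pytools/*gpu_allocation*"):
        f.unlink(missing_ok=True)

onstart:
    print("Execution starting: GPU allocation state will be (re)created")

onsuccess:
    teardown_gpu_allocation()

onerror:
    # Clean up after failures too, so an interrupted run cannot leave
    # GPUs marked as allocated and deadlock the next execution.
    teardown_gpu_allocation()
```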