-
Notifications
You must be signed in to change notification settings - Fork 302
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
assumptions around modifying input files, or making new files in their directories #495
Comments
I strongly support, at the least, a statement that tasks "SHOULD" treat their input files as read-only. But it may be challenging to require engines to enforce this -- that assumes fine-grained control over how the container scheduler assembles the filesystem mounts & permissions. Regarding "making new files in their directories" -- IIRC (correction welcomed) Cromwell drops input files into the task working directory, so I think practically this has to be allowed. There's also the simple OTOH, there are |
Cromwell creates a separate As for the samtools index case: BioWDL makes a hardlink this works in Cromwell because Anyway, sorry for the digression. I am in favor of enforcing read-only inputs. In terms of reproducibility, if I run task A, and then run task A again on the same input, that should have the same result. That can not be guaranteed if task A changes the input. |
For scenarios where the program/script being wrapped directly modifies inputs, it seems that enforcing read-only inputs would require duplicating/hardlinking inputs instead of using softlinks, yes? It is a workaround I am already using in some of my WDLs, but I don't know if requiring people to double the disk size requirements is ideal. Regarding reproducibility, the inputs are copied into Cromwell's inputs folder, not moved, even on local runs. So if running on the same set of inputs, wouldn't it still be considered reproducible? Running I am not entirely against read-only inputs, to be clear, and I do find situations like "samtools puts the index file into the inputs folder" to be confusing. But I wonder if just renaming the |
@aofarrel I agree with your points when we assume the engine already has to copy/localize the input file(s) before starting the task. However, both miniwdl and dxCompiler have the ability to avoid that in many cases, which can be a speed/resource advantage.
|
FWIW, dxCompiler explicitly states that inputs are read-only, and it is suggested to link an input to the working directory if you need to create an adjacent file (e.g. the Another reason dxCompiler enforces read-only inputs is because it is not straight-forward to deal with modifiable input directories. Let say you have a version development
task foo {
input {
Directory d
}
command <<<
echo 'hello' > ~{d}/foo
}
output {
Directory dout = d
}
} There are lots of edge cases that just make it easier to treat inputs as read-only. I don't really have a strong opinion as to whether the spec says SHOULD or MUST, but I think there should be a strong recommendation to the user to copy or link input files if they need to modify them or have them in a writable directory. |
MUST is the only option in my opinion. Mostly because WDL strives to be readable. Having to keep in mind that a task manipulates the inputs adds very significant cognitive overhead. Even seemingly simple workflows might be not as simple as they seem, so each individual task needs to be inspected to ensure that indeed, only the outputs are affected by the tasks, not the inputs. I think that is a very undesirable state for WDL to be in. Reading a workflow from the top-level should be enough to infer what is happening. |
In the 3/22/23 governance call we decided that WDL v1.2 will include the SHOULD language, with a deprecation warning, and 2.0 will include the MUST language. |
Thank you @vsmalladi and @jdidion ! |
It is suggested the the OpenWDL specification clarify the write-status of input files, and the directories they are in.
I suggest that in a future version of the OpenWDL spec, it is declared that all inputs files must be make read-only to have consistent behavior.
This also helps with converting to CWL, as it has the same restriction, unless
InitialWorkDirRequirement
is used to mark some inputs aswritable: true
For miniwdl, they have an IO-expensive workaround
https://miniwdl.readthedocs.io/en/latest/runner_advanced.html#read-only-input-files
Draft implementation: https://github.com/openwdl/wdl/tree/495-files-read-only
The text was updated successfully, but these errors were encountered: