Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

assumptions around modifying input files, or making new files in their directories #495

Closed
mr-c opened this issue Feb 20, 2022 · 8 comments · Fixed by #642
Closed

assumptions around modifying input files, or making new files in their directories #495

mr-c opened this issue Feb 20, 2022 · 8 comments · Fixed by #642
Assignees
Milestone

Comments

@mr-c
Copy link

mr-c commented Feb 20, 2022

It is suggested the the OpenWDL specification clarify the write-status of input files, and the directories they are in.

I suggest that in a future version of the OpenWDL spec, it is declared that all inputs files must be make read-only to have consistent behavior.

This also helps with converting to CWL, as it has the same restriction, unless InitialWorkDirRequirement is used to mark some inputs as writable: true

For miniwdl, they have an IO-expensive workaround
https://miniwdl.readthedocs.io/en/latest/runner_advanced.html#read-only-input-files

Draft implementation: https://github.com/openwdl/wdl/tree/495-files-read-only

@mlin
Copy link
Member

mlin commented Feb 21, 2022

I strongly support, at the least, a statement that tasks "SHOULD" treat their input files as read-only. But it may be challenging to require engines to enforce this -- that assumes fine-grained control over how the container scheduler assembles the filesystem mounts & permissions.

Regarding "making new files in their directories" -- IIRC (correction welcomed) Cromwell drops input files into the task working directory, so I think practically this has to be allowed. There's also the simple samtools index case in our domain that we want to keep straightforward.

OTOH, there are Directory inputs -- miniwdl defaults to read-only bind mounts for them, for the same reason to avoid copying, but indeed this prevents the task from adding new files to them. Definitely I'm in a tradeoff between performance and consistency there.

@rhpvorderman
Copy link
Contributor

Regarding "making new files in their directories" -- IIRC (correction welcomed) Cromwell drops input files into the task working directory, so I think practically this has to be allowed.

Cromwell creates a separate inputs directory that is in the parent of the execution (working) directory. All files are solved to absolute paths in cromwell.

As for the samtools index case: BioWDL makes a hardlink this works in Cromwell because inputs and execution are always on the same filesystem. It is an extremely ugly hack, and precisely for that reason we don't use the samtools index task. Usually when handling BAM files, the utility that does so has an index command. So we just perform the indexing directly in the command that produces a new BAM file. That solves a lot of hassle. It is really only a problem for tools that have no flag for specifying the index if it is not in the same dir. Maybe we should fix this issue there?

Anyway, sorry for the digression. I am in favor of enforcing read-only inputs. In terms of reproducibility, if I run task A, and then run task A again on the same input, that should have the same result. That can not be guaranteed if task A changes the input.

@aofarrel
Copy link

aofarrel commented Mar 1, 2022

For scenarios where the program/script being wrapped directly modifies inputs, it seems that enforcing read-only inputs would require duplicating/hardlinking inputs instead of using softlinks, yes? It is a workaround I am already using in some of my WDLs, but I don't know if requiring people to double the disk size requirements is ideal.

Regarding reproducibility, the inputs are copied into Cromwell's inputs folder, not moved, even on local runs. So if running on the same set of inputs, wouldn't it still be considered reproducible? Running java -jar cromwell-76.jar run foo.wdl -inputs foo.json twice is reproducible, even if the first run's inputs directory gets changed, since the first run doesn't change anything about the files in foo.json.

I am not entirely against read-only inputs, to be clear, and I do find situations like "samtools puts the index file into the inputs folder" to be confusing. But I wonder if just renaming the inputs directory to something like inputs-copy or localized-inputs might be a better option -- it would clarify they are copies and might be subject to change.

@mlin
Copy link
Member

mlin commented Mar 2, 2022

@aofarrel I agree with your points when we assume the engine already has to copy/localize the input file(s) before starting the task. However, both miniwdl and dxCompiler have the ability to avoid that in many cases, which can be a speed/resource advantage.

  • miniwdl uses docker bind mounts to mount each input file at its exists on the host/shared file system
  • dxCompiler can leverage a FUSE mechanism to "stream" file data from cloud storage without an upfront download step

@jdidion
Copy link
Collaborator

jdidion commented Mar 2, 2022

FWIW, dxCompiler explicitly states that inputs are read-only, and it is suggested to link an input to the working directory if you need to create an adjacent file (e.g. the samtools index example).

Another reason dxCompiler enforces read-only inputs is because it is not straight-forward to deal with modifiable input directories. Let say you have a Directory input and you create a new file in the localized directory. Furthermore, you return that directory as an output. (see example below) Should the output directory include the new file, even though the spec says that Directorys should be treated as snapshots?

version development
task foo {
  input {
    Directory d
  }
  command <<<
    echo 'hello' > ~{d}/foo
  }
  output {
    Directory dout = d
  }
}

There are lots of edge cases that just make it easier to treat inputs as read-only. I don't really have a strong opinion as to whether the spec says SHOULD or MUST, but I think there should be a strong recommendation to the user to copy or link input files if they need to modify them or have them in a writable directory.

@rhpvorderman
Copy link
Contributor

MUST is the only option in my opinion. Mostly because WDL strives to be readable. Having to keep in mind that a task manipulates the inputs adds very significant cognitive overhead. Even seemingly simple workflows might be not as simple as they seem, so each individual task needs to be inspected to ensure that indeed, only the outputs are affected by the tasks, not the inputs.

I think that is a very undesirable state for WDL to be in. Reading a workflow from the top-level should be enough to infer what is happening.

@jdidion
Copy link
Collaborator

jdidion commented Mar 23, 2023

In the 3/22/23 governance call we decided that WDL v1.2 will include the SHOULD language, with a deprecation warning, and 2.0 will include the MUST language.

@mr-c
Copy link
Author

mr-c commented May 15, 2024

Thank you @vsmalladi and @jdidion !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

6 participants