assumptions around modifying input files, or making new files in their directories #495

mr-c · 2022-02-20T13:49:14Z

It is suggested that the OpenWDL specification clarify the write-status of input files and of the directories they are in.

I suggest that a future version of the OpenWDL spec declare that all input files must be made read-only, to ensure consistent behavior.

This also helps with converting to CWL, as it has the same restriction, unless InitialWorkDirRequirement is used to mark some inputs as writable: true

For miniwdl, they have an IO-expensive workaround
https://miniwdl.readthedocs.io/en/latest/runner_advanced.html#read-only-input-files

Draft implementation: https://github.com/openwdl/wdl/tree/495-files-read-only


mlin · 2022-02-21T21:11:46Z

I strongly support, at the least, a statement that tasks "SHOULD" treat their input files as read-only. But it may be challenging to require engines to enforce this -- that assumes fine-grained control over how the container scheduler assembles the filesystem mounts & permissions.

Regarding "making new files in their directories" -- IIRC (correction welcomed) Cromwell drops input files into the task working directory, so I think practically this has to be allowed. There's also the simple samtools index case in our domain that we want to keep straightforward.

OTOH, there are Directory inputs -- miniwdl defaults to read-only bind mounts for them, for the same reason (to avoid copying), but this does prevent the task from adding new files to them. I'm definitely facing a tradeoff between performance and consistency there.

rhpvorderman · 2022-02-22T07:40:40Z

> Regarding "making new files in their directories" -- IIRC (correction welcomed) Cromwell drops input files into the task working directory, so I think practically this has to be allowed.

Cromwell creates a separate inputs directory in the parent of the execution (working) directory. All file paths are resolved to absolute paths in Cromwell.

As for the samtools index case: BioWDL makes a hardlink; this works in Cromwell because inputs and execution are always on the same filesystem. It is an extremely ugly hack, and precisely for that reason we don't use the samtools index task. Usually, a utility that handles BAM files has its own index command, so we just perform the indexing directly in the command that produces a new BAM file. That saves a lot of hassle. It is really only a problem for tools that have no flag for specifying the index when it is not in the same directory. Maybe we should fix this issue there?

Anyway, sorry for the digression. I am in favor of enforcing read-only inputs. In terms of reproducibility, if I run task A, and then run task A again on the same input, that should give the same result. That cannot be guaranteed if task A changes the input.

aofarrel · 2022-03-01T00:48:21Z

For scenarios where the program/script being wrapped directly modifies inputs, it seems that enforcing read-only inputs would require duplicating/hardlinking inputs instead of using softlinks, yes? It is a workaround I am already using in some of my WDLs, but I don't know if requiring people to double the disk size requirements is ideal.

Regarding reproducibility, the inputs are copied into Cromwell's inputs folder, not moved, even on local runs. So if running on the same set of inputs, wouldn't it still be considered reproducible? Running java -jar cromwell-76.jar run foo.wdl --inputs foo.json twice is reproducible, even if the first run's inputs directory gets changed, since the first run doesn't change anything about the files listed in foo.json.

I am not entirely against read-only inputs, to be clear, and I do find situations like "samtools puts the index file into the inputs folder" to be confusing. But I wonder if just renaming the inputs directory to something like inputs-copy or localized-inputs might be a better option -- it would clarify they are copies and might be subject to change.

mlin · 2022-03-02T04:56:47Z

@aofarrel I agree with your points when we assume the engine already has to copy/localize the input file(s) before starting the task. However, both miniwdl and dxCompiler have the ability to avoid that in many cases, which can be a speed/resource advantage.

miniwdl uses Docker bind mounts to mount each input file as it exists on the host/shared file system.
dxCompiler can leverage a FUSE mechanism to "stream" file data from cloud storage without an upfront download step.

jdidion · 2022-03-02T05:30:46Z

FWIW, dxCompiler explicitly states that inputs are read-only, and it is suggested to link an input to the working directory if you need to create an adjacent file (e.g. the samtools index example).

Another reason dxCompiler enforces read-only inputs is that it is not straightforward to deal with modifiable input directories. Let's say you have a Directory input and you create a new file in the localized directory. Furthermore, you return that directory as an output (see example below). Should the output directory include the new file, even though the spec says that Directory values should be treated as snapshots?

version development
task foo {
  input {
    Directory d
  }
  command <<<
    echo 'hello' > ~{d}/foo
  >>>
  output {
    Directory dout = d
  }
}

There are lots of edge cases that just make it easier to treat inputs as read-only. I don't really have a strong opinion as to whether the spec says SHOULD or MUST, but I think there should be a strong recommendation to the user to copy or link input files if they need to modify them or have them in a writable directory.
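The link-or-copy recommendation above could look roughly like the following. This is only a sketch under stated assumptions: the task names (index_bam, strip_comments), the file names, and the choice of sed as the in-place tool are hypothetical illustrations, and samtools and GNU sed are assumed to be available in the execution environment. It is not taken from dxCompiler or any other engine's documentation.

```wdl
version 1.1

# Hypothetical task: create an index next to a BAM without writing
# into the (read-only) directory that holds the input.
task index_bam {
  input {
    File bam
  }

  command <<<
    set -euo pipefail
    # Symlink the input into the writable working directory so that
    # samtools writes the index next to the link, not next to the input.
    ln -s '~{bam}' input.bam
    samtools index input.bam
  >>>

  output {
    File bai = "input.bam.bai"
  }
}

# Hypothetical task: a tool that edits its file in place works on a
# private copy, so the original input is never touched.
task strip_comments {
  input {
    File config
  }

  command <<<
    set -euo pipefail
    # Copy rather than link, then make the copy writable (cp carries
    # over the read-only permission bits of the source).
    cp '~{config}' edited.txt
    chmod u+w edited.txt
    # Assumes GNU sed; deletes lines starting with '#'.
    sed -i '/^#/d' edited.txt
  >>>

  output {
    File edited = "edited.txt"
  }
}
```

The link keeps the cheap case cheap (no extra disk for the BAM), while the copy is needed only when the wrapped tool actually rewrites its input.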

rhpvorderman · 2022-03-02T07:06:23Z

MUST is the only option in my opinion, mostly because WDL strives to be readable. Having to keep in mind that a task may manipulate its inputs adds very significant cognitive overhead. Even seemingly simple workflows might not be as simple as they seem, so each individual task would need to be inspected to ensure that only the outputs are affected by the task, not the inputs.

I think that is a very undesirable state for WDL to be in. Reading a workflow from the top-level should be enough to infer what is happening.

jdidion · 2023-03-23T16:00:06Z

In the 3/22/23 governance call we decided that WDL v1.2 will include the SHOULD language, with a deprecation warning, and 2.0 will include the MUST language.

mr-c · 2024-05-15T20:41:09Z

Thank you @vsmalladi and @jdidion !

jdidion added this to the 1.2 milestone Mar 23, 2023

jdidion added Spec Change clarification labels Mar 23, 2023

jdidion self-assigned this Mar 30, 2023

jdidion removed their assignment Mar 28, 2024

adamnovak mentioned this issue Apr 25, 2024

Allow symlinks to inputs as WDL outputs DataBiosphere/toil#4883

Merged

19 tasks

vsmalladi added a commit that referenced this issue May 10, 2024

Update language around files and direcotories being read only. Closes #…

1730dc3

…495.

vsmalladi mentioned this issue May 10, 2024

Update language around assumptions around files and directories #642

Merged

2 tasks

vsmalladi self-assigned this May 10, 2024

vsmalladi linked a pull request May 10, 2024 that will close this issue

Update language around assumptions around files and directories #642

Merged

2 tasks

jdidion closed this as completed May 15, 2024
