Support Defining Environment Variables Within a Task #504

Merged
2 commits merged into openwdl:wdl-1.2 on May 22, 2024

Conversation

patmagee
Member

This PR is the result of several discussions that have been ongoing for a while (#458 and #443) about different ways to improve the security and usability of WDL.

Declare Environment Variables within a Task

This PR adds the ability to declare environment variables within the task scope, as well as to pass values for those variables within a call block. Most WDL engines rely on templating values into the command section prior to command execution. While this is convenient and allows us to define our own expression language within a command block, it does mean that any WDL command is subject to injection attacks or poor formatting during execution.

Using environment variables circumvents this issue by not relying on templating at all. Instead, the execution engine safely escapes strings prior to setting them within the container (or other execution) environment at runtime. Environment variables are not usable within other WDL expressions, and once defined they are accessed using the normal shell semantics associated with the given interpreter.

Environment variables are all implicitly strings, and can therefore be set to any expression that either evaluates to a string or is coercible to one.

task foo {
  input {
   String my_inp
  }
  env {
   MY_ENV_WITH_DEFAULT = my_inp
   MY_ENV
  }

  command <<<
    echo $MY_ENV $MY_ENV_WITH_DEFAULT
  >>>
}

workflow bar {
  
  call foo {
    input : my_inp = "foo"
    env: MY_ENV = "bar"
  }
}

Checklist

  • Pull request details were added to CHANGELOG.md

@mlin
Member

mlin commented May 14, 2022

Thanks @patmagee for moving this along! As you know, I strongly support the general idea of enabling environment variables as an alternative (often a superior one) to ~{} text interpolations. As to the fine details of how to express it:

I'd prefer not to need the separate env: section of the calls. I think of the env entries as simply additional string inputs, which the engine will know to set in the container environment. If we make input & env more-separate concepts as proposed, then WDL implementations need to be extended to model them separately (for example, WDL.Task.available_inputs and WDL.Task.required_inputs need to undergo mitosis into inputs & envs, and then all code consuming them needs to consider both). It'll be easier if we think of envs as just more inputs that the runtime does something a bit different with.

So the call would just look like

  call foo {
    input : my_inp = "foo",
            MY_ENV = "bar"
  }

(Aside- I think we should get rid of the useless input: keyword in calls anyway, but that'll be another thread!)

Another idea worth considering is to use type specializations instead of a separate section, keeping the current declaration syntax, e.g.,

task foo {
  input {
   SomeStruct my_inp
   EnvString MY_ENV
  }

  EnvString STRUCT_JSON_FILE = write_json(my_inp)

  command <<<
    echo $MY_ENV $STRUCT_JSON_FILE
  >>>
}

workflow bar {
  
  call foo {
    input : my_inp = SomeStruct { ... },
            MY_ENV = "bar"
  }
}

This allows the task author to control which environment variables the caller can and can't override.

My feeling is that either of the above approaches does the job well and is somewhat simpler for both WDL authors and engine implementers.

@mlin
Member

mlin commented May 14, 2022

Another angle on the prior comment: the container environment variables are an implementation detail that should be encapsulated inside the task. The caller doesn't really need to know whether a given input will turn into a container environment variable or not. The task author might want to change how they consume an input from interpolation to environment variable, and such a change needn't trigger required changes in calls too.
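
A minimal sketch of that encapsulation argument, using the EnvString idea from the previous comment (illustrative syntax only, nothing here is in the spec): the task author switches an input from interpolation to an environment variable, and the call site does not change at all.

# Before: the input is interpolated into the command
task greet {
  input {
    String name
  }
  command <<<
    echo "Hello ~{name}"
  >>>
}

# After: the same input is consumed from the environment instead;
# callers still write `call greet { input: name = "world" }` unchanged
task greet {
  input {
    EnvString name
  }
  command <<<
    echo "Hello $name"
  >>>
}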

@patmagee
Member Author

patmagee commented May 14, 2022

@mlin that is a very valid point. The one thing I was trying to avoid was automatically setting environment variables in the task based on input strings. Without a special prefix, that would likely be error prone and could potentially break many different tasks.

I must admit, I was initially torn between a type and an env block. Both felt natural; the block felt like it encapsulated the concept of environment variables well. But having them bubble up to the level of the workflow does feel odd, to be honest. I guess that leaves us with a few approaches:

Option 1: Keep the block in a task but require values to be set

This option would be a good way of being very declarative. We can keep the env block within a task as I originally proposed. The difference, though, is that every item must be assigned, either a literal value or a value from an input / private declaration.

The benefit of this approach is that it provides fine-grained control over what gets exposed in the environment, and it groups everything together.

task foo {
  input {
    String myInp
    Int maxNum
  }
  
  env {
   MYINP = myInp
   OTHER = "someValue"
   MAXNUM = maxNum
  }
}

The drawback of this approach is that it potentially adds more boilerplate to a task. The declared environment variables would still only be usable within the command block, and the workflow has no concept of environment variables.

Option 2: Use a declared type

This would follow your suggestion and use a type like EnvString. It makes sense and I rather like it, especially if that string is treated like a normal string in any other WDL expression.

task foo {
  input {
   EnvString FOO
  }
}

  • The drawback of this approach is that it adds a new type that is handled the same as a string but carries special meaning.

Option 3: add some sort of qualifier to the type

We could also go the route of adding an additional qualifier to the type or the declaration; some ideas would be:

task foo {
  input {
    # explicitly state where to expose the string. this would be my favourite probably
    String FOO in env
    # use a type qualifier preceding the type. This one could also be my favorite
    env String FOO
    # Use some sort of qualifier coming after the type declaration
    String@env FOO
    String:env FOO
    String{env} FOO
  }
}

Type qualifiers have the drawback (or benefit) of being applicable to any type. Therefore any type could potentially be set in the env, and it's the engine's role to figure that out, i.e. serialization of a struct or other complex type.


I am definitely with you that we do not want to expand the concept of what constitutes an input, at least at this point.

@mlin
Member

mlin commented May 15, 2022

# use a type qualifier preceding the type. This one could also be my favorite
env String FOO

I like this too!

One further question would be whether to allow env to prefix any type (Int, SomeStruct, etc.) and let the engine string-ify it (perhaps as JSON for compound types).

I think I'd say yes, but could be convinced otherwise

@patmagee
Member Author

@mlin I do not see a reason to limit this to only strings. I think allowing any type would gain a lot of support from the community. You could put it in a struct and then easily use something like jq to manipulate the contents of the env variable.

env Map[String,Array[String]] COMPLEX_TYPE
env MyStruct FOO

One question, then, is whether the env qualifier can be used outside of the task block, and whether it changes the type signature at all. For example, would String and env String be equivalent? I think the answer is yes in the context of WDL expressions. If that is the case, the env keyword is less a qualifier and more a directive, telling the engine how to handle that variable.

@patmagee
Member Author

Adding something like directives as a concept can generalize as well. For example, in a task you could define readonly File myFile or remote File foo to indicate that a file should not be localized.
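
For concreteness, a sketch of how such directives might read inside a task (purely hypothetical syntax; neither readonly nor remote exists in the spec):

task index_reference {
  input {
    # hypothetical: mount the file read-only in the execution environment
    readonly File reference
    # hypothetical: access the file remotely instead of localizing it
    remote File annotations
    # the env directive discussed above
    env String SAMPLE_ID
  }

  command <<<
    echo "indexing reference for sample $SAMPLE_ID"
  >>>
}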

@rhpvorderman
Contributor

"Using environment variables circumvents this issue ..."

"Environment variables are not usable within other WDL expressions"

Environment variables not being usable in WDL expressions limits their power quite a bit. For some input variables, being able to derive other variables from them is quite essential for the good functioning of a task. Therefore environment variables are never a complete solution. As such, when this becomes part of the spec there will not be one way to define inputs, but two equally valid ways.

and once defined are accessed using the normal shell semantics associated with the given interpreter.

I find this a bit of shaky ground, as it means that the engine and the shell will be more tightly coupled. I know the current spec defines bash as the standard shell, but some containers do not have bash. As such, in biowdl/tasks we try to avoid bashisms to make sure any task can run on other shells as well.

My main problem with adding environment variables is that instead of one, we create two ways of defining inputs. This is quite confusing, as it is quite arbitrary which method is to be preferred. If I may quote the Zen of Python: "There should be one-- and preferably only one --obvious way to do it." This change violates that principle and therefore it makes WDL less usable.

Given that the actual problem we are trying to solve here is "injection attacks or poor formatting during execution", I think the solution should be more tailored towards that end. Another solution is to quote all WDL variables by default while templating and to provide a function that disables this (conveniently called unsafe(...)). That is a much smaller and more targeted solution to the problem at hand.
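
A sketch of what that alternative could look like (hypothetical semantics: quote-by-default templating plus an unsafe() escape hatch, as proposed above; not part of any spec):

task run_query {
  input {
    String sample_name
    String extra_args
  }

  command <<<
    # hypothetical: the engine shell-quotes ~{sample_name} automatically
    grep ~{sample_name} samples.tsv
    # hypothetical: unsafe() opts out of automatic quoting for trusted fragments
    my_tool ~{unsafe(extra_args)}
  >>>
}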

@mlin
Member

mlin commented May 16, 2022

Environment variables are a universal mechanism that every shell/language interpreter will have an idiomatic way to consume, and will also be more familiar to authors learning WDL who do have other scripting experience.

I think shakier ground is the suggestion that engines could escape each value in a way that would be so universally safe and compatible. That suggestion works better if, on the contrary, we double-down on specifying that bash is the interpreter -- but even then, we'd face the common pattern of the bash script running an embedded heredoc in another language.

Re "Environment variables are not usable within other WDL expressions" -- I doubt that restriction is really needed. I view them as additional input declarations, so of course, they should be usable in other WDL expressions.

Re creating two ways of defining inputs -- it's an important point, but my personal opinion is that the interpolation of strings directly into the command script was a minor design flaw from the beginning, and we should gradually deprecate it in favor of environment variables; but I don't expect everyone to share that opinion exactly...

@rhpvorderman
Contributor

@mlin. Agreed. Redesigning WDL to only use environment variables and deprecate the templating is the better option.

Re "Environment variables are not usable within other WDL expressions" -- I doubt that restriction is really needed. I view them as additional input declarations, so of course, they should be usable in other WDL expressions.

If I add up things correctly I come to the following conclusions:

  • There is no need for an additional syntax. Input variables can be declared in their current way.
  • Input variables can be used in expressions in any part of the WDL, except the command block.
  • Inside the command block, only variable names can be used, not expressions (i.e. ~{my_var}, not ~{my_var * 5}). These variables are passed as environment variables.

This solves the templating/injection issue. While restricting the language instead of expanding it. That seems preferable to me.

@cjllanwarne
Contributor

Redesigning WDL to only use environment variables and deprecate the templating is the better option.

That's a really interesting concept. It would let us remove the (command) interpolation syntax entirely. And allow things like arrays to be supplied natively. Are there any use cases which would get worse if this were the case?

@rhpvorderman
Contributor

Are there any use cases which would get worse if this were the case?

Supplying optional variables comes to mind, where you can do ~{"--optional-flag " + optional_value} now. That is a very common use case.

I guess that could still be made possible for optional values without templating? So that it either evaluates to "" or --optional-flag $OPTIONAL_VALUE.

In retrospect my statement "Input variables can be used in expressions in any part of the WDL. Except the command block." was a bit harsh. I guess you can still allow any sort of expression, as long as it evaluates to a singular variable that can be saved as an environment variable and does not depend on being templated to text.
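
For reference, a small sketch of the two spellings being discussed. The first line is the current, standard placeholder idiom (concatenation with an undefined optional collapses to the empty string); the commented line is a hypothetical env-based equivalent, assuming an unset optional maps to an unset environment variable. The tool name is just a placeholder.

task align {
  input {
    File reads
    String? optional_value
  }

  command <<<
    # current idiom: the placeholder evaluates to "" when optional_value is undefined
    aligner --reads ~{reads} ~{"--optional-flag " + optional_value}

    # hypothetical env-based equivalent:
    # aligner --reads "$READS" ${OPTIONAL_VALUE:+--optional-flag "$OPTIONAL_VALUE"}
  >>>
}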

@rhpvorderman
Contributor

And another thing that is hairy: how to work with tools that use the same flag multiple times to specify output files, for instance -o output1.fastq.gz -o output2.fastq.gz -o output3.fastq.gz.
Since you can have either R1, R1,R2, or R1,R2,R3 it makes sense to have the tool use a flexible output flag like that. How to use that when disallowing templating? (I think templating is not too bad in this case, since the developer specifies the sep, the env vars can still be created as ENV_ONE, ENV_TWO etc.)
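
For the record, the current templating idiom for the repeated-flag case, using the WDL 1.1+ standard library functions prefix and sep (the tool name is a placeholder):

task demultiplex {
  input {
    Array[String] output_names  # e.g. ["output1.fastq.gz", "output2.fastq.gz"]
  }

  command <<<
    # expands to: -o output1.fastq.gz -o output2.fastq.gz ...
    demux_tool ~{sep(" ", prefix("-o ", output_names))}
  >>>
}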

@illusional
Contributor

CWL has a pretty nice mechanism for constructing command lines, but I think there's value in being separate from that. For that reason I do like the command block in WDL, and generally making information available via env variables.

@patmagee
Member Author

So, how about this for a concrete proposal:

Instead of the env { .... } block introduced in this PR, we add a new concept to WDL called Directives or Decorators (let's fight over which name seems better here). These are symbols that immediately precede the type declaration and provide additional handling information for a specific WDL declaration.

In relation to environment variables, we can use the env directive/decorator/annotation to signal to a task that an environment variable should be generated from the given WDL declaration.

task foo {
  input {
    env String HELLO
  }

  # Also works
  String HELLO_FROM_BOB = HELLO + " from bob"

  command <<<
    echo ${HELLO}
    # Still works
    echo ~{HELLO}
  >>> 

}

@mlin
Member

mlin commented Jun 17, 2022

@patmagee while I like the concrete proposal env String etc., as well as some of the other ideas mentioned, I'm a little wary of calling it an instance of a more-general feature (Directive/Decorator) without thinking through what else we might have that feature do, which might delay us.

As an example of the can of worms I think that could open, a long time ago @orodeh and I sketched out adding a metadata section for each input declaration, e.g.

input {
  File reads { help: "BAM file", patterns: ["*.bam"] }
}

To be clear: I'm NOT trying to advance that now, just mentioning the kind of thing that might bog us down if we get into use cases for a general concept like Directive/Decorator.

@patmagee
Member Author

@mlin good point, I am happy to keep this restricted to just the env keyword for now and see where we naturally extend it to.

@patmagee
Member Author

I modified the PR to remove the explicit env {...} section to allow ANY task input or declaration to be set as an environment variable using the env modifier

@patmagee
Member Author

@mlin @illusional @rhpvorderman @cjllanwarne do you have any issues with the current approach?

@patmagee patmagee requested review from jdidion, mlin, illusional and rhpvorderman and removed request for jdidion September 21, 2022 16:34
Contributor

@illusional illusional left a comment

Generally happy with this approach, except I don't think you should be able to mark an environment variable from a workflow call

I'm still not sure how you would achieve something like this with the new approach.

--param1 value1 --param2 value2

@@ -2791,7 +2868,7 @@ workflow main_workflow {

Any required workflow inputs (i.e. those that are not initialized with a default value or expression) must have their values provided when invoking the workflow. Inputs may be specified for a workflow invocation using any mechanism supported by the execution engine, including the [standard JSON format](#json-input-format).

By default, all calls to subworkflows and tasks must have values provided for all required inputs by the caller. However, the execution engine may allow the workflow to leave some subworkflow/task inputs undefined - to be specified by the user at runtime - by setting the `allowNestedInputs` flag to `true` in the `meta` section of the top-level workflow. For example:
By default, all calls to subworkflows and tasks must have values provided for all required inputs and environment variables by the caller. However, the execution engine may allow the workflow to leave some subworkflow/task inputs/environment variables undefined - to be specified by the user at runtime - by setting the `allowNestedInputs` flag to `true` in the `meta` section of the top-level workflow. For example:
Contributor

Is this still relevant? I think it's more clear to say "any input marked with the env directive is required, or not", too.

@@ -2576,14 +2652,15 @@ workflow wf {

# Calls my_task with one required input - it is okay to not
# specify a value for my_task.opt_string since it is optional.
call my_task { input: num = i }
call my_task { input: num = i env: NUM = i }
Contributor

I don't super love this. I think all tool interactions should be handled by the tool wrapper and not by the caller.

Contributor

I second this.

Member Author

That should definitely be removed, I think it's outdated from the previous iteration. I will update the PR.

GOVERNANCE.md Outdated
| Patrick Magee | DNAstack | [patmagee](https://github.com/patmagee) |
| Brian O'Connor | University of California, Santa Cruz | [briandoconnor](https://github.com/briandoconnor) |
| Geraldine Van der Auwera | Broad Institute | [vdauwera](https://github.com/vdauwera) |
| Christopher Llanwarne | Broad Institute | [cjllanwarne](https://github.com/cjllanwarne) |
| John Didion | DNAnexus | [jdidion](https://github.com/jdidion) |
| Michael Franklin | Centre for Population Genomics | [illusional](https://github.com/illusional) |
| Amy Paguirigan | Fred Hutch | [vortexing](https://github.com/vortexing) |
| Ruben Vorderman | Leiden University Medical Center | [rhpvorderman](https://github.com/rhpvorderman) |
Contributor

There is some PR cross-contamination here.

Member Author

oops, will fix!

@rhpvorderman
Contributor

rhpvorderman commented Sep 23, 2022

@patmagee, my primary comment is that I think this is an inverted approach, requiring the maintainer to put env in front of everything to make sure no shell code can be injected. Forget the env keyword one time? That is a security hole.
If security is a concern, environment variables should be the default, and in cases where that leads to problems a template keyword should be used to distinguish variables that rely on the old behavior. EDIT: (This would of course be a change for 2.0 then, as it is not fully backwards-compatible.)
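
A sketch of that inverted default (entirely hypothetical 2.0-style syntax, as described above; the template keyword does not exist in the spec):

task annotate {
  input {
    # hypothetical default: exposed only as an environment variable, never interpolated
    String sample_id
    # hypothetical opt-in to the old interpolation behaviour
    template String awk_program
  }

  command <<<
    echo "annotating sample $sample_id"
    awk '~{awk_program}' input.txt
  >>>
}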

Speaking from a BioWDL perspective: we don't have any pressing security concerns since all our tasks are containerized, and only user inputs are mounted in the containers. The containers are run as the UID exclusively and not as root. Furthermore our pipelines are not run as a service, and run by individual users on our compute cluster. They can only break things they have the right to break in the first place so it makes no sense adding lots of guards against this. In practice security is a non-issue for us. I say this not to invalidate the PR, but to give the proper weight to my own opinion: very low. We have no stake in this.

@rhpvorderman
Contributor

I have been thinking somewhat more about this and I start to wonder: what does this feature protect you from that containerization does not already? If users can submit their own workflows to the system, then this addition is pointless. So this protects specifically those instances where an entity provides WDL as a service, but did not containerize the backend properly? (Instead opting to provide all bioinformatics tools under the sun in a single environment without dependency conflicts). That seems a very unlikely use case.
There is a clear trade-off at work here. The language will be more complex, but also more secure. But how much more secure? Without a detailed assessment of the security implications it is impossible to make a proper judgement. On the other hand, with the changes I proposed above (environment variables are default) the complexity cost is going to be so low that it is worth it, in my opinion.

@patmagee
Member Author

@rhpvorderman you make some good points. To start, containerization gets you pretty far down the road of security, assuming that the system is using a secure container technology. Unfortunately, even in the most restricted environments where users can only pick from a list of pre-approved workflows, there is the opportunity for abuse. For example, if you happen to know the execution environment, you could construct a nefarious string that would run a specific attack within the container (not so bad), but more importantly within the network the container is running in, potentially under any authority assigned to that container/VM. Google tools are great examples of this, where there is a metadata API allowing you to run gsutil within a container and dynamically retrieve a service account simply by running gsutil on the VM.

So all that is to say, I think in some circumstances the security concerns are warranted, but I agree that the overall impact on the user should be as minimal as possible. To be honest, the original motivator was to not override any existing variable within the container by automagically converting inputs into env variables. That may not actually be a big concern, especially not if this is a well-documented feature. If all values were accessible both through the WDL semantics (~{foo}) AND their shell semantics (${foo}), it would probably be the most flexible and have the lowest impact on the user.

@mlin
Member

mlin commented Nov 25, 2022

Included this feature in miniwdl v1.8.0 development; definitely invite anyone interested to hammer on it, it can still change.

@patmagee
Member Author

@mlin I tried out this feature, and I have to say I really love it as defined. Honestly I am 100% happy with it :D

@patmagee
Member Author

I think this is ready for final review, and if everything looks good we can get it voted on

Collaborator

@jdidion jdidion left a comment

LGTM - just a few small suggestions/questions

### Environment Variables

Any input to a task may be converted into an environment variable that will be accessible from within the task's
shell environment during command execution. Unlike inputs, environment variables are not templated into the command when
Collaborator

I might say "interpolated" or "replaced in" rather than "templated into"

Any input to a task may be converted into an environment variable that will be accessible from within the task's
shell environment during command execution. Unlike inputs, environment variables are not templated into the command when
preparing the execution script, but are actually available at runtime directly from the environment. They may be accessed using
normal shell variable access semantics (ie `$FOO` or `${FOO}`).
Collaborator

i.e.

environment, it does not change its access to normal WDL expressions within a command. That means you can refer to a declaration
annotated with `env` either through the shell semantics (`${FOO}`) or through normal WDL semantics (`~{FOO}`).

When an input is annotated with `env` it is the engines responsibility to serialize the value appropriately into a string.
Collaborator

engine's

annotated with `env` either through the shell semantics (`${FOO}`) or through normal WDL semantics (`~{FOO}`).

When an input is annotated with `env` it is the engines responsibility to serialize the value appropriately into a string.
Serializing complex values (`Map`, `Pair`, `Struct`, `Array`) should be equivalent to calling `write_json` on the given
Collaborator

Add a link [WDL Value Serialization and Deserialization](#appendix-a-wdl-value-serialization-and-deserialization)

Collaborator

Also, write_json writes the value to a file. Is that the desired behavior - that the environment variable will have a path to the file with the serialized value?

Member

Great point to discuss:

On the one hand, if the developer definitely wants the JSON in a file then they can easily add a private declaration env File x_json = write_json(x). Given that's easy, maybe the environment variable for a compound value should contain the JSON directly.

On the other hand, JSON directly in the environment variable isn't very natural to work with in shell; often you'd end up just dumping it into a file to be parsed anyway. Also, developers should be discouraged from cramming large compound values directly into environment variables, because there will usually be some system-imposed limit on environment size.

TBH, env File x_json = write_json(x) makes the intent much clearer to somebody reading the code, so whichever way we go on this question, I think we should actually discourage using env with compound values at all.
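
For illustration, a sketch of that pattern under the proposed env modifier (assuming x is a compound input; the environment variable then holds a path to the JSON file rather than the JSON itself):

task summarize {
  input {
    Map[String, Int] x
  }

  # write_json produces a file; the env var x_json carries that file's path
  env File x_json = write_json(x)

  command <<<
    # jq shown purely as an example consumer of the serialized value
    jq 'to_entries | length' "$x_json"
  >>>
}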

Member Author

@mlin JSON directly in environment variables is actually not as cumbersome as you may think. We regularly use it in our own CI/CD infrastructure and pass values to and from tools like jq, which are great at parsing JSON. We rarely ever need to use temporary files and just pipe input from one command to the other, storing intermediate entries in an environment variable.

I think pointing to the example of write_json is actually not an accurate representation of what is happening. We probably want to just say that complex types will be serialized to JSON prior to being set as an environment variable.

There is of course the limitation on the size of environment variables. So maybe we can provide guidance here: "Engines may impose limits on the total length a single environment variable is allowed to occupy as well as the number of environment variables that are allowed to be passed into a single task. If such limitations exist, engines should clearly document them for the end user."
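
For example, a sketch of the jq-style usage described above, assuming the engine places the JSON text directly in the environment variable (one of the two behaviours being debated here):

task pick_reference {
  input {
    env Map[String, String] REFERENCES
  }

  command <<<
    # assuming REFERENCES holds JSON such as {"hg38": "/refs/hg38.fa", ...}
    echo "$REFERENCES" | jq -r '.["hg38"]'
  >>>
}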

@patmagee
Member Author

patmagee commented Jan 9, 2023

@mlin @jdidion I added a new sentence on limitations that may be imposed by engines and clarified the serialization. Ready for another look!

@patmagee patmagee added this to the 1.2 milestone Feb 13, 2023
@patmagee
Member Author

I think with the recent updates, this PR is ready for voting and inclusion in the 1.2 spec. If no one is opposed, I will happily mark this as open for voting.

@mlin
Member

mlin commented Mar 26, 2023

@patmagee For your consideration...I asked GPT-4 to rewrite a few of the prose paragraphs more concisely. (Turning into a standard practice!)

Task inputs can be converted into environment variables accessible during command execution. Environment variables are accessed using shell variable syntax, like $FOO or ${FOO}. To access an input value as an environment variable, add the env modifier before a declaration within the task, including input sections and private declarations.

Despite the env modifier making declarations available in the environment, it does not alter access to WDL expressions within a command. Declarations annotated with env can be referred to using shell syntax (${FOO}) or WDL syntax (~{FOO}).

It is the engine's responsibility to serialize input values annotated with env into strings (see Serialization of WDL values) and set them as environment variables. Engines may impose limits on environment variable length and quantity, and should provide clear documentation on any limitations.

(129 words, down from 245)

@mlin
Member

mlin commented Mar 26, 2023

I have some doubts about this:

When a value is declared with the env modifier, it becomes
the execution engine's responsibility to escape the string thus preventing any sort of interpolation. Functionally,
using an environment variable is the same as first escaping the string of any special characters and then wrapping it single quotes.

single_quote(escape(${variable}))

I think it's quite unusual to bind the environment variable to an escaped value. It means our environment variables have a peculiar "format", which seems to counteract the goal of using less-WDL-specific practices. I think the environment variables should be bound to the exact (stringified) value, so that task commands would inherit shell scripting's best practices and pitfalls for handling potentially-malicious environment variables securely. Chiefly, they should be used inside double-quotes, or if necessary explicitly escaped using printf "%q". Implicitly escaping the environment variable values would be a more-secure default, but the goal of the env feature IMHO should be to "let shell scripts be shell scripts."
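
A small sketch of the shell-side practice being described, with the environment variable bound to the exact value and no implicit escaping by the engine:

task grep_sample {
  input {
    env String SAMPLE_NAME
  }

  command <<<
    # safe: the value is expanded inside double quotes and never re-parsed by the shell
    grep -F -e "$SAMPLE_NAME" samples.tsv

    # if the value must be embedded into a generated command string, escape it explicitly
    escaped=$(printf '%q' "$SAMPLE_NAME")
    bash -c "echo $escaped"
  >>>
}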

Automatically escaping the stringified value feels to me like something I wish we'd done when we introduced the WDL-specific ~{} interpolation syntax. Maybe for 2.0?

@patmagee
Member Author

@mlin I think you are right, and the stanza you highlighted is really a bit of a misguided attempt at enforcing how env variables are set.

Fundamentally, what I was trying to achieve with that statement is the ability to set environment variables without breaking scripts. This is probably more prescriptive than it needs to be, as all I am really trying to state is that the engine should set the literal value, similar to how Docker .env files work.

@jdidion
Collaborator

jdidion commented Dec 17, 2023

@patmagee please change target to wdl-1.2

@patmagee patmagee changed the base branch from main to wdl-1.2 December 21, 2023 18:09
@patmagee
Member Author

@jdidion I have updated the spec against 1.2.

@markjschreiber

That being said, it is possible that behaviour could also become desirable. For example if you were using gcloud in a command, you could simply set an input called File GOOGLE_APPLICATION_CREDENTIALS and your gcloud would automatically authenticate with the right credentials.

This is probably not a use case that we would want to advertise, at least for containers on remote machines. Putting credentials into environment variables for containers is not a great idea. It is better to have the underlying host machine use an appropriate IAM role. For local execution it is probably OK (as long as you know the workflow is not going to echo them to logs or someone else's machines).

@markjschreiber

When a task is run outside of a container, then presumably the environment is applied to the shell executing that task's command. Do we need to specify where the environment variable is applied for a containerized task? Would the environment variable be applied to the container (for example as part of some docker command) or would the variable be applied to the command shell that will run in the container?

@patmagee
Member Author

@markjschreiber is the distinction that the environment variable may change depending on whether it is injected into the task shell script vs into the docker image?

@markjschreiber

@markjschreiber is the distinction that the environment variable may change depending on whether it is injected into the task shell script vs into the docker image?

Pretty much. I think it's fine to clarify with something like: if the task is run in a container, the env var is "injected" into the container and not applied to the shell on the host that runs the container.

@patmagee
Member Author

patmagee commented May 22, 2024

@markjschreiber would saying something like the following work

The environment variable should be evaluated by the engine prior to injecting it into the execution environment. If the task is run in a container, the env var is "injected" into the container and not applied to the shell on the host that runs the container.

@markjschreiber

@markjschreiber would saying something like the following work

The environment variable should be evaluated by the engine prior to injecting it into the execution environment. If the task is run in a container, the env var is "injected" into the container and not applied to the shell on the host that runs the container.

Perfect!

@patmagee patmagee merged commit 32cb760 into openwdl:wdl-1.2 May 22, 2024