
Add Config to Pass Around Injected Args #652

Open
wants to merge 18 commits into base: main

Conversation

@ranchodeluxe (Contributor) commented Nov 16, 2023

Problem

If someone wants to write a custom beam.PTransform or beam.DoFn in their recipe and would like to leverage target_root or cache, the only option (and it is a heavy lift) is to turn their recipe into a package with an entrypoint. There should be a simple way to accommodate this. From this conversation on DevSeed's pangeo-forge-external channel:

Screen Shot 2023-11-16 at 9 11 49 AM

@sbquinlan workaround: https://github.com/pangeo-forge/staged-recipes/pull/259/files#diff-c2ef5f081c41ff256ca9344512c7d7d8f369623a8b9483c6c1921cdafc2825deR220-R226
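Roughly, the situation looks like this (a hypothetical sketch; WriteToTarget is not a real transform): a recipe-local DoFn wants the runner-injected target_root, but has no supported way to receive it.

```python
import apache_beam as beam


class WriteToTarget(beam.DoFn):
    """Hypothetical custom DoFn that needs to know where target storage lives."""

    def __init__(self, target_root):
        # The runner injects `target_root` into pangeo-forge-recipes' own
        # transforms, but a DoFn defined inside recipe.py cannot receive it
        # without packaging the recipe and registering an entrypoint.
        self.target_root = target_root

    def process(self, element):
        yield (str(self.target_root), element)
```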

Goal

See the docstring in the diff for an example, or this example recipe for local testing: #651

I wanted to just inject the whole config dict into a simple variable at import time or during AST traversal using visit_Assign, but I see we'd still have to modify all the places where it's used, so this seems to be the best option 🦐 🦃
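A minimal sketch of the intended recipe-side usage, assuming the field names in the diff at this point (target_root, cache); the values themselves are still populated by the runner's injection machinery:

```python
import apache_beam as beam
from pangeo_forge_recipes.config import Config  # module added by this PR

config = Config()  # carries the runner-injected values at parse time


class WriteToTarget(beam.DoFn):
    def __init__(self, target_root):
        self.target_root = target_root

    def process(self, element):
        yield (str(self.target_root), element)


# Later in the recipe, the injected value is passed along explicitly, e.g.:
# ... | beam.ParDo(WriteToTarget(target_root=config.target_root))
```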

Runner Companion PR

pangeo-forge/pangeo-forge-runner#146

@yuvipanda (Contributor)

This will only work on pangeo_forge_recipes, and not on other libraries that may be built on it (like https://github.com/yuvipanda/pangeo-forge-cmr).

I also think we should either have a config object or do dynamic injection; doing both is confusing.

My preferred solution is still to just support defining injections in the recipe.py file :)

Also I can't access the slack link :)

@ranchodeluxe (Contributor, Author)

My preferred solution is still to just support defining injections in the recipe.py file :)

I know this 😆 IMHO the less injection code the better

This will only work on pangeo_forge_recipes, and not on other libraries that may be built on it (like https://github.com/yuvipanda/pangeo-forge-cmr).

Good point 🥳 I guess I don't understand what's going on here. This is a plugin of sorts? Why wouldn't this just be another recipe since get_cmr_granule_links etc is just generating things for beam.Create?

@cisaacstern (Member)

I also cannot see the slack link, can you post the example linked there publicly somewhere?

Can comment more directly on the PR once I see the problem it is trying to solve.

@ranchodeluxe (Contributor, Author)

@yuvipanda and @cisaacstern: I updated the description above with a problem statement, please refer there

@cisaacstern (Member) commented Nov 16, 2023

@ranchodeluxe thanks for the clarification. I understand the use case now, and think I am in favor. Curious to get Yuvi's take as he is more familiar with the design considerations here.

As a bit of a tangent, in the linked example, I don't think I understand what the WriteReferences transform does that our built-in WriteCombinedReference transform does not?

@ranchodeluxe (Contributor, Author) commented Nov 16, 2023

@ranchodeluxe thanks for the clarification. I understand the use case now, and think I am in favor. Curious to get Yuvi's take as he is more familiar with the design considerations here.

yeah, I'm fine doing what folks think is best for this pattern

As a bit of a tangent, in the linked example, I don't think I understand what the WriteReferences transform does that our built in WriteCombinedReference transform does not?

Yeah, the place I linked isn't clear. He uses that reference to pull out the target_store to pass to his custom transform: https://github.com/pangeo-forge/staged-recipes/pull/259/files#diff-c2ef5f081c41ff256ca9344512c7d7d8f369623a8b9483c6c1921cdafc2825deR242-R246

@cisaacstern (Member)

He uses that reference to pull out the target_store to pass to his custom transform:

Right, got that. My question is how is that custom transform any different from our built-in:

@dataclass
class WriteCombinedReference(beam.PTransform, ZarrWriterMixin):
"""Store a singleton PCollection consisting of a ``kerchunk.combine.MultiZarrToZarr`` object.

And I think the answer is just that it passes through the written dataset? Which if true, means that the custom transform is just our built-in plus #636?

If so, then can this use case be addressed by just finishing #636 and using the built-in?

@ranchodeluxe (Contributor, Author)

If so, then can this use case be addressed by just finishing #636 and using the built-in?

yeah, I think I'm following and agree

@cisaacstern (Member) commented Nov 16, 2023

yeah, I think I'm following and agree

Cool. 😎

So actually this discussion has me leaning slightly against merging this PR now? Because I think we want to discourage custom, un-released transforms which are sufficiently generalizable as to require configurable injections?

That is, if a transform is so foundational to the pipeline that it requires knowing the cache or target location, then maybe it's a code smell if it's not released somewhere? (Because such a significant piece of logic is presumably re-usable by other community members?)

Not to discourage innovation/creativity, but I tend to think of custom transforms as best for data munging/preprocessing/etc., and caching/writing to be generalizable operations which we should be collaborating on via some released package? Thoughts?

@sbquinlan

My question is how is that custom transform any different from our built-in:

@cisaacstern We can talk about this in slack, but I wanted to consolidate the metadata in the references before writing them out. WriteCombinedReferences is the composition of MZZ + writing them out so I needed to pull that apart. Even if I didn't need to get the target_root, I feel like it's perfectly reasonable for recipe developers to want to access the config.

@cisaacstern (Member) left a comment

I think it makes sense to try to upstream @sbquinlan's custom kerchunk writer here eventually, but regardless, I take the point that it's reasonable for recipe developers to access the injected config! (Including for cases where that's useful for prototyping things that will eventually be upstreamed! 😄 )

In terms of the PR itself, I think I prefer the simpler "Config" (instead of "RecipesConfig"). Although this package is "pangeo_forge_recipes", the word "recipes" is not generally used in the codebase itself currently, so the more generic "Config" makes sense to me.

Resolved (outdated) review comments on pangeo_forge_recipes/config.py and pangeo_forge_recipes/injections.py
target_root: Union[str, FSSpecTarget, RequiredAtRuntimeDefault] = field(
default_factory=RequiredAtRuntimeDefault
)
cache: Optional[Union[str, CacheFSSpecTarget]] = ""


while we're talking about naming, can you name this input_cache and maybe add the metadata_cache field here as well?

@ranchodeluxe (Contributor, Author)

metadata_cache

will add

input_cache

as long as folks are comfortable with this, but it's not backward compatible (or I haven't thought about how to achieve that yet)

@ranchodeluxe (Contributor, Author)

I'm being silly. metadata_cache isn't a thing at all and I think cache is covering both from what I see.

Also, MetadataCacheStore in the runner doesn't have a matching storage class in the recipes lib that I can see, so this isn't being used 🤔

Let me look at this more later today

Member

MetadataCacheStorage in runner exists for backward compatibility with recipes < 0.10.0. You are correct that it does not exist in recipes >= 0.10.0, and fwiw this is not because InputCacheStorage is playing two roles, but rather because metadata caching is not part of the >= 0.10.0 paradigm.

Member

Re: naming, there are (unfortunately) two naming conventions for (roughly) the same things here:

  • names in the context of injections
  • names in the context of kwargs on PTransforms

I suggest we choose one of the two and just adhere to it exactly, rather than introducing a third convention.

Seems like the injections names are more generic-sounding, and verbose, so that is probably the right one to use, meaning the fields on Config would be:

  • Config.target_storage
  • Config.input_cache_storage
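For concreteness, with that convention the dataclass would look roughly like this (a sketch with assumed import paths, not the final diff):

```python
from dataclasses import dataclass, field
from typing import Optional, Union

# Assumed import locations: the same types used by the existing transforms.
from pangeo_forge_recipes.storage import CacheFSSpecTarget, FSSpecTarget
from pangeo_forge_recipes.transforms import RequiredAtRuntimeDefault


@dataclass
class Config:
    # Same types as the fields quoted above, renamed to match the injection spec.
    target_storage: Union[str, FSSpecTarget, RequiredAtRuntimeDefault] = field(
        default_factory=RequiredAtRuntimeDefault
    )
    input_cache_storage: Optional[Union[str, CacheFSSpecTarget]] = ""
```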

@ranchodeluxe (Contributor, Author) commented Nov 16, 2023

In terms of the PR itself, I think I prefer the simpler "Config" (instead of "RecipesConfig"). Although this package is "pangeo_forge_recipes", the word "recipes" is not generally used in the codebase itself currently, so the more generic "Config" makes sense to me.

I don't enjoy the way the word "recipe" rolls off my tongue so this is a good change. I prefer referring to these workflows as "KitchenSequence"(s) and "BakeChain"(s) 😏

ranchodeluxe and others added 3 commits November 16, 2023 14:27 (each co-authored by Charles Stern <62192187+cisaacstern@users.noreply.github.com>)
@ranchodeluxe ranchodeluxe changed the title WIP: Add RecipeConfig to Pass Around Injected Args Add Config to Pass Around Injected Args Nov 20, 2023
@ranchodeluxe ranchodeluxe added the test-integration Apply this label to run integration tests on a PR. label Nov 20, 2023
@cisaacstern (Member) left a comment

Getting close, @ranchodeluxe, thanks for pushing on this.

A few small notes on docs/docstring, and also a bigger-picture question re: if we can leverage this internally as well.

Looking forward to your thoughts!

Comment on lines 7 to 9
"StoreToZarr": {
"target_root": "TARGET_STORAGE",
},
Member

Suggested change
"StoreToZarr": {
"target_root": "TARGET_STORAGE",
},

Should we take this a step further, and actually use this config object internally?

Meaning delete the StoreToZarr, WriteCombinedReference, and OpenURLWithFSSpec injection specs here, and instead do something like:

# transforms.py

from .config import Config

c = Config()
...

class StoreToZarr(PTransform):
    ...
-   target_root: Union[str, FSSpecTarget, RequiredAtRuntimeDefault] = field(
-       default_factory=RequiredAtRuntimeDefault
+   target_root: Union[str, FSSpecTarget, RequiredAtRuntimeDefault] = c.target_storage

Without fiddling with it a bit I'm not sure exactly how this would work, but conceptually, what do we think about making Config the interface to all injections, including for internal transforms?

@ranchodeluxe (Contributor, Author)

I do enjoy where you're going with this ✨ Given that dependency injection targets the recipe, this won't work AFAICT without more complicated acrobatics in the AST rewriter, which I think we should avoid. So my feedback here is that we just do what you are proposing in the recipes explicitly:

benefits

  • recipes < 0.10.x (some version with this change) can easily be backward compatible and still leverage the existing dep injection

  • recipes >= 0.10.x (some version with this change) are now explicit, and folks need less explanation about how this all works (except the explanation about dep injection). For example, with this change you probably don't even need to write documentation about how to write custom transforms, b/c the StoreToZarr, WriteCombinedReference and OpenURLWithFSSpec recipe examples already show how it works

  • it's only a couple/few call sites per recipe where we'd be passing conf.target_storage and conf.input_cache_storage (see the sketch after this list)

drawbacks

  • if the number of transforms we write requiring conf.target_storage and conf.input_cache_storage dramatically increases then, yes, it'd seem we might be doing an unnecessary amount of input passing
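A rough sketch of what that explicit style could look like in a recipe (not a working diff; it assumes a FilePattern named pattern defined earlier in the recipe):

```python
import apache_beam as beam
from pangeo_forge_recipes.config import Config
from pangeo_forge_recipes.transforms import OpenURLWithFSSpec, OpenWithXarray, StoreToZarr

conf = Config()  # the single injection target, populated by the runner

# `pattern` is a FilePattern assumed to be defined earlier in the recipe.
recipe = (
    beam.Create(pattern.items())
    | OpenURLWithFSSpec(cache=conf.input_cache_storage)  # explicit, no hidden injection
    | OpenWithXarray(file_type=pattern.file_type)
    | StoreToZarr(
        target_root=conf.target_storage,  # explicit, no hidden injection
        store_name="output.zarr",
        combine_dims=pattern.combine_dim_keys,
    )
)
```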

Comment on lines 10 to 13
"""a simple class that contains all possible injection spec values

folks can import it into their recipes to pass injection spec values
around to custom beam routines as the example below shows
Member

Conventionally, our docstrings tend to be a bit less colloquial, could we formalize the language a bit here? Thanks!

Comment on lines +8 to +9
@dataclass
class Config:
Member

Let's add this to the API Reference section of the docs.

And possibly link to it in either/or:

  • The discussion of "Deploy-time configurable keyword arguments" under "Recipe composition > Transforms"
  • a new Advanced Topics section, e.g. "Custom Transforms"

@ranchodeluxe (Contributor, Author)

thanks 💯 I'll get around to this after hearing back from Yuvi and other folks

@yuvipanda (Contributor)

I take the point that it's reasonable for recipe developers to access the injected config!

I totally agree too!

I have many strong opinions, particularly around #652 (comment):

Good point 🥳 I guess I don't understand what's going on here.

I feel like there was a lot of context around this for why we settled on an injection based workflow, vs various other options that were considered at that time. I know this happened during the meetings, but do you know if there is a solid written record of it somewhere, @cisaacstern?

How urgent is this PR? I would like to provide a more solid explanation here, but I just spent far too long on airplanes, and don't want to write a half-assed response here before my brain recovers.

@cisaacstern (Member)

From my perspective it is worth waiting to get your review, @yuvipanda.

Hopefully the DevSeed folks can run anything they need from this branch until then!

@ranchodeluxe (Contributor, Author) commented Jan 17, 2024

@cisaacstern and @norlandrhagen :

Rebasing this. Hoping we can tidy this up quick and merge this to address the limitations in #668 even though the discussion stalled about where/how much we should use this pattern

@ranchodeluxe (Contributor, Author)

@cisaacstern and @norlandrhagen :

Rebasing this. Hoping we can tidy this up quick and merge this to address the limitations in #668 even though the discussion stalled about where/how much we should use this pattern

Ignore this. Won't work given how the AST rewriter works 😞. Back to the drawing board. Sorry about the noise

@moradology (Contributor) commented Feb 7, 2024

I've been staring at this issue (I'm convinced that a backdoor to arbitrary runtime config is probably useful) and trying to wrap my head around the rationale for the complex injection strategy defined in plugin.py in the runner for longer than is healthy. I'm increasingly feeling like either I'm missing something, or like perhaps there's a bit too much caution/safety in the adopted strategy, both because it makes me feel like a caveman and because I'm concerned that it locks the runner and any provided recipes into a very tight coupling that may end up being significant overhead.

Can someone talk me down? Convince me I'm just stupid, please! I will absolutely attempt to explain/document the reasoning for others similarly impaired

@yuvipanda (Contributor) commented Feb 7, 2024

@moradology let me try! These are all the design considerations I had when building out the dynamic injections. It's been a few months, and unfortunately many of these conversations happened in meetings without being written down in detail, so be gentle if I miss a few words here and there (or get things wrong!). You (and others!) probably already know all this information (and maybe more!), so think of this more of a historical explanation than an advocacy for status quo.

Separation of config from the recipe

The same recipe / feedstock should be runnable on my local machine, on GCP and AWS. This was the core reason I started the runner project - prior to that, recipes were run by a heroku runner purely on dataflow. So this means that some config, particularly about where the data should go, must be defined by the bakery operator, via the traitlets config file, rather than by the recipe writer. But this information needed to be passed into the PTransforms somehow, so they actually know where to write the output to. This was the core of the problem that needed solving - how do we allow bakery operators to specify this information, rather than recipe writers?

Parsing vs Running

Code written in a recipe is distinctly executed in two 'phases' (there may be beam terminology for this I'm not aware of):

  1. A 'parse' phase, that happens wherever the runner is being invoked. This is akin to a compilation phase. The output of this phase is simply a Beam DAG, represented as a Pipeline object. Python code that is directly responsible for generating the pipeline is actually executed here - but transforms, custom callables, etc are not executed, but just pickled and put in the pipeline.
  2. The actual running phase, that is happening on the remote beam runner. The pickled DAG is executed here, but most of the python code written in the recipe is actually not executed during this phase. This is why you can't really use the global variables declared in recipe file in any transforms (without hacks like save_main_session).

This separation is hard to see sometimes - if you look at a function you write, is it going to get called where the runner is being called, at 'parse' phase? Or in flink remotely somewhere, 2 hours from now, at runtime? Sometimes the answer is both! There's also PortableRunner involved, and IIRC that's written in Java, so it's all a bit of a mess.

But fundamentally, the goal is for the generated DAG to be as pure as possible, not making any references to variables or values that are only available at parse time, but not runtime. Here's an example of a problem:

import apache_beam as beam
import json

with open('/etc/target-config.json') as f:
  config = json.load(f)

def output_path(item):
  with open(config["outputPath"], "a") as out:
     out.write(item + "\n")

with beam.Pipeline() as pipeline:
   (
       pipeline
       | "Create elements" >> beam.Create(["Hello", "World!"])
       | "Output elements" >> beam.Map(output_path)
   )

What happens when you run this locally? It works all fine, not a problem! But what happens if you run this on dataflow or flink? output_path function is only executed in flink or dataflow runners, and it's just that specific function that is executed, not the whole file (or any captured variables). So it'll actually error out because config isn't a variable it knows anything about. You can't read the file in the function either - that file itself is not present at runtime, only parse time. If you want it to be present there at run time, each runner implementation (dataflowrunner, direct, flink) would have to figure a way out to copy that file into each runner instance.

So a useful way to think about this is that Transforms really need to be pure functions, operating only on the parameters passed on to them. Even trying to use json inside output_path will fail, as that's not been imported in there!

The injections work here by treating all transforms as pure functions, and passing in config to them the only way it's possible to pass information into a pure function - by adding it as a parameter. An elaborate partial that is done in such a way that when pickled, all the information needed for a transform to execute is already in that transform! This also means you can't inject things that can't be pickled, but we mostly inject strings and dicts so it's ok.
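For contrast, here is a minimal sketch (not pangeo-forge code) of the same toy example rewritten in that "pure function plus parameter" style, with the value bound in at parse time as an explicit argument:

```python
import apache_beam as beam
import json


def output_item(item, output_path):
    # Everything the function needs arrives as a parameter, so it travels
    # inside the pickled DAG to the remote runner.
    with open(output_path, "a") as out:
        out.write(item + "\n")


# Parse time: read the operator-supplied config and bind the value in.
with open("/etc/target-config.json") as f:
    config = json.load(f)

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create elements" >> beam.Create(["Hello", "World!"])
        # beam.Map forwards extra kwargs to the callable, partial-style.
        | "Output elements" >> beam.Map(output_item, output_path=config["outputPath"])
    )
```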

The alternative we considered was environment variables, but PTransforms looking for os.environ["SOMETHING"] in the middle of their code felt a lot more ugly, and we would have to make sure that if we're using dataflow / flink these environment variables are passed on to all the runner instances. Which environment variables do we pass on? All of them? That's definitely going to leak someone's local credentials on some remote runner log pretty soon :D By doing injections this way, we know exactly what's going to be passed on.

I also didn't spend too many innovation points here, as this is mostly just the same system pytest uses.

Separation of runner from pangeo-forge-recipes

Before the injections were implemented, the version of runner was very tightly coupled to the version of pangeo-forge-recipes. Runner would literally import and look for specific ptransforms from recipes, and inject stuff into that. This also meant it was impossible to use any ptransforms that were coming from packages outside of recipes. https://github.com/yuvipanda/pangeo-forge-earthdatalogin is an illustrative example - it has a PTransform that inherits from OpenWithFSSpec, but runner had no way to actually know to give it the cache target information. With yuvipanda/pangeo-forge-earthdatalogin#2, the runner could give it cache target information. So pangeo-forge-earthdatalogin and pangeo-forge-recipes were treated equally by runner - pangeo-forge-recipes is no longer special cased. We constantly used to run into problems with not being able to make releases of either because of the coupling, and this injection setup allowed us to fully decouple these two projects. This was the primary reason for setting up injections, and I think it's been a real good success!

Current problem

The current problem here is that you need a package to tell pangeo-forge-recipes what to inject, and I agree that's a serious limitation. The current 'right' way to solve the problem is to put your recipe in a different github repository, and just refer to it in requirements.txt with git+https://repo-url@sha-or-branch. This is what I did for yuvipanda/pangeo-forge-earthdatalogin#2, and I also personally think this is better software development practice, as it does keep these appropriately loosely coupled. Putting the transforms in the recipe itself is making them more tightly coupled. But I'm also not a recipe writer, so I understand the need to put PTransforms in the recipe itself. Much faster iteration times are 😍

Future direction?

The broader solution for this and another issue that folks will run into (you can only have a single recipe.py file, you can not split that up into multiple files!) was something I started working on (pangeo-forge/pangeo-forge-runner#92) and is currently available in pangeo-forge/pangeo-forge-runner#110. This would also allow you to add the appropriate entrypoints in the generated setup.py file, either automatically or by looking for a decorator.

However, I never really had direct funding to work on pangeo-forge - only broad funding to work on things that moved the whole ecosystem forward. My funding situation changed, and while we attempted to get more direct funding for me to work on this problem, that got stuck in various bits of paperwork and never materialized. I should have documented all this in writing when we initially set that up (instead of only convincing the others working on this directly at that time), and I apologize for not doing that.

I'm happy to chat (synchronously if needed?) with folks about specific ideas here, but I'm mostly otherwise fully occupied these days (helping run 2i2c.org!) and won't be able to actually write code here. So if the community here wants to move in a different direction, I'm totally happy for all this to be ripped out and another solution tried! I wrote this large comment mostly to try explain why things are the way they are, what problems they tried to solve, and that the complexity here is not fully accidental, but essential (due to the parse / distributed-run dichotomy).

@moradology (Contributor) commented Feb 7, 2024

@yuvipanda This was an excellent writeup and incredibly valuable historical motivation for pieces that would have taken hours of code archaeology to uncover. Thank you for taking the time to write it all out and expose the constraints you were working around. Plenty to mull over here. And regardless of how things shake out, it will provide some top notch material to expose in comments and documentation

@yuvipanda (Contributor)

Thanks, @moradology :) Glad to be of use.

Personally, I would say that documenting this 'parse time' vs 'runtime' situation very clearly (in more detail than I have presented here) would be incredibly valuable.

@yuvipanda (Contributor)

As an example, in this PR, import xarray is itself inside the function (link), which is weird Python!

And then the config object is constructed here at parse time, and then a particular value from it is passed into the function at runtime. I am not sure, but I think this works because the value being passed is a string that's just going to get pickled and passed through. But now the distinction here is more blurred - if you try to access the config variable itself inside your transform, it'll fail when running in a distributed manner (for the same reason that you needed to import xarray inside the transform). This also puts a lot of other constraints in your PTransform - you can't split it into multiple files, you'll constantly forget and start using variables from the outer scope (because that's how regular python works!), etc.

As described in this comment in this PR, I think doing this via --setup_file is the right thing to do. I think custom ptransforms can then be in the same feedstock, but just another file that you import. --setup_file the way I have tried to implement it then turns the whole thing into a package that's shipped to beam (at least in dataflow - needs to be figured out how this works with flink?), and then you can stop having to write this weird subset of python where your ptransform has to be 'pure' but you have to manually manage it.
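For what it's worth, the shape of that approach would be something like the following (a hypothetical sketch; the actual mechanics are being worked out in pangeo-forge/pangeo-forge-runner#110): the feedstock gains a setup.py so Beam's --setup_file option can ship recipe.py and its sibling modules to the workers.

```python
# Hypothetical feedstock/setup.py sketch for the --setup_file direction.
from setuptools import setup

setup(
    name="my-feedstock",
    version="0.1.0",
    # Ship recipe.py plus sibling modules (e.g. transforms.py holding custom
    # PTransforms) to the workers, so `from transforms import MyTransform`
    # resolves at runtime and not just at parse time.
    py_modules=["recipe", "transforms"],
)
```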

@moradology (Contributor) commented Feb 20, 2024

At the risk of revealing ignorance, I'm wondering if a transition to lazy pipeline construction on the runner (#168), i.e. expecting recipe to be a callable, couldn't be used in place of the current injection strategy. In addition to passing in a pipeline object (to be mutated), we could expect a second argument for configuration. If that'd work, it might simplify things a bit and better enable developers to fully take advantage of the inversion of control that beam's execution model requires of us

So, recipe would need to be something like Callable[[beam.Pipeline, SomeConfigSuperClass], None] (None signifying, here, the use of side effects)
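Purely as illustration of that signature (SomeConfigSuperClass and its fields are hypothetical, not existing runner types), a recipe module would expose something like:

```python
from dataclasses import dataclass

import apache_beam as beam


@dataclass
class SomeConfigSuperClass:
    # Hypothetical stand-in for whatever config the runner would hand over.
    target_root: str = ""
    input_cache: str = ""


def recipe(pipeline: beam.Pipeline, config: SomeConfigSuperClass) -> None:
    # The runner builds the pipeline and the config, then calls this;
    # the recipe mutates the pipeline as a side effect and returns None.
    (
        pipeline
        | "Create" >> beam.Create(["Hello", "World!"])
        | "Prefix" >> beam.Map(lambda item, root: f"{root}/{item}", root=config.target_root)
    )
```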

What do you think, @yuvipanda?

@yuvipanda (Contributor)

Unfortunately @moradology I do not have the capacity now to help figure it out, although I'd have loved to :( So this will have to go on without me. Am still happy to provide any historical context, but I think y'all have a good handle on how to take this one forward!

@cisaacstern (Member)

💯 points for @yuvipanda for setting boundaries on uncompensated effort, a very difficult thing to do in open source!

@moradology (Contributor)

@yuvipanda Understood! I appreciate the context you've laid out for us here and think we should be in a position to keep things rolling. Obviously your wisdom and experience here is very welcome but as @cisaacstern points out, it is yours to give and not ours to demand. Thank you again

@yuvipanda (Contributor)

You're welcome :) I'm very happy to see all the effort being poured into pangeo-forge - I think it's truly transformational technology, and I'm extremely excited to see where it goes. I'm very happy to still answer questions about specific historical choices made, but I also trust y'all to take the project forward :)

@cisaacstern (Member)

So, recipe would need to be something like Callable[[beam.Pipeline, SomeConfigSuperClass], None] (None signifying, here, the use of side effects)

I do see how this could be one way of making the config available to arbitrary transforms, but am not convinced (nor are you suggesting) that this alone is a strong enough justification for merging the linked runner PR (this is a simple, but not the only, way to get the config where we need it).

I have some other thoughts on the linked runner PR which I'll share over there now.
