suggestions/improvements to the split-apply-combine pipeline #167

golobor commented Jun 10, 2019

A thread to discuss the new experimental split-apply-combine pipeline.

A couple of minor suggestions:

  1. why use the name "pipe" when there is the more standard name "apply"?
    Also, "pipe" implies that multiple functions are supposed to be applied to the data. Yet, it's quite possible that a large share of "pipelines" will contain a single function, since that reduces the amount of time spent copying data when using multiprocessing.

  2. I find the usage pattern of the data argument to be... a bit raw/poorly defined/restrictive?
    I understand that 'data' is trying to solve the pattern where different steps of the pipeline must create and pass extra information besides the chunks themselves. But there are several issues with the current implementation:
    2.1) Having an optional "data" argument is a major obstacle to creating reusable components, as they now have to come in two varieties - one taking a chunk as an argument, another taking (chunk, data) (see the sketch after this list).
    2.2) More importantly, this extra "data" argument does not really solve the issue that, in complicated pipelines, different functions must be custom "fitted" to each other. There is no single "data" that functions can expect and pass downstream, and designing a library that anticipates what kind of extra data is passed between functions is futile.
    2.3) Finally, the only place where data is currently used is during balancing, where it stores filtered pixel counts. Correct me if I'm wrong, but in this case it's actually fine to modify chunks, since the downstream functions do not use the original weights! I'd say modifying chunks is great, because it enables combinatorial composition of filtering and computing functions w/o custom interfaces.
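To make 2.1 concrete, here is a rough sketch of the kind of duplication I mean (the function names, and the assumption that chunks are pandas DataFrames of pixels with bin1_id/bin2_id columns, are just for illustration):

```python
import pandas as pd

# A chunk-only transformer: composes with anything that maps chunk -> chunk.
def drop_diagonal(chunk: pd.DataFrame) -> pd.DataFrame:
    return chunk[chunk["bin1_id"] != chunk["bin2_id"]]

# The same logic again, needed only so it can slot into pipelines that
# pass `data` around; the signature is the only difference.
def drop_diagonal_with_data(chunk: pd.DataFrame, data):
    return chunk[chunk["bin1_id"] != chunk["bin2_id"]], data
```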

My proposal:

  1. drop prepare; if needed, developers themselves can design custom functions that take a chunk and output (chunk, extra_data).
  2. it's okay to modify chunks, unless I am missing something big here.
  3. use the docs to teach developers that the functions in their pipelines can generate extra data and pass it downstream (see the sketch after this list).
  4. rename pipe -> apply
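For illustration, a minimal sketch of what 1-3 could look like in user code (all function names, the pixel columns, and the multiprocessing setup are hypothetical, assuming chunks are pandas DataFrames of pixels):

```python
import functools
import multiprocessing as mp

import pandas as pd

def mask_filtered_bins(chunk, bad_bins):
    # A plain chunk -> chunk step: modify the chunk instead of passing `data`.
    keep = ~chunk["bin1_id"].isin(bad_bins) & ~chunk["bin2_id"].isin(bad_bins)
    return chunk[keep]

def count_pixels(chunk):
    # A step that needs to emit side information simply returns
    # (chunk, extra_data) itself - no library-level `data` channel required.
    return chunk, len(chunk)

def job(chunk, bad_bins):
    chunk = mask_filtered_bins(chunk, bad_bins)
    return count_pixels(chunk)

if __name__ == "__main__":
    chunks = [
        pd.DataFrame({"bin1_id": [0, 0, 1], "bin2_id": [0, 1, 1], "count": [3, 1, 2]}),
        pd.DataFrame({"bin1_id": [1, 2], "bin2_id": [2, 2], "count": [5, 4]}),
    ]
    with mp.Pool(2) as pool:
        results = pool.map(functools.partial(job, bad_bins={2}), chunks)

    filtered = pd.concat([chunk for chunk, _ in results])
    total_kept = sum(n for _, n in results)
```

The point being: filtering steps stay plain chunk -> chunk functions and compose freely, while any step that needs extra data just returns it alongside the chunk.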