Addressing individual task results as opposed to invocations #6

Open
Tracked by #220 ...
Gozala opened this issue Dec 16, 2022 · 12 comments · May be fixed by #21

Comments

@Gozala
Contributor

Gozala commented Dec 16, 2022

Current draft requires that you address individual tasks via pointer inside the invocation. I find this unfortunate, because it implies that I can not address a specific task without revealing the batch it was part of.

In our design, tasks are explicitly self-sufficient, so you could take a batch apart and hand some tasks to another actor without revealing everything the user intended to do. Unfortunately I don't have a concrete user story to justify the need for this, and our design choice has been mostly informed by intuition and some experiences in IPFS, where addressing by paths inside DAGs as opposed to by CID had proven to be inadequate (I may want to share a subgraph with you without revealing the outer graph or the fact that it exists).

I realize this approach allows reducing the number of signatures, yet I wonder if we could optimize that without having to sacrifice task independence. E.g. perhaps we could attach a bloom filter to each task and sign that instead; that way the signature could be used with any task in the bloom filter without revealing what's in it (I realize there are false positives, but there are ways around it). There might be other ways to address signatures; my point is that perhaps the number of signatures should be considered separately.
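A minimal sketch of the bloom filter idea, under loud assumptions: `task_digest` is a SHA-256 stand-in for a real CID, the `BloomFilter` class and its parameters are hypothetical, and the actual signing step is left out. It only shows that one signed payload can cover several tasks, each checkable on its own.

```python
import hashlib
import json


def task_digest(task: dict) -> bytes:
    """Stand-in for a CID: SHA-256 over a canonical JSON encoding."""
    return hashlib.sha256(json.dumps(task, sort_keys=True).encode()).digest()


class BloomFilter:
    """Hypothetical fixed-size bloom filter (not part of any spec)."""

    def __init__(self, size_bits: int = 256, hashes: int = 3):
        self.size_bits = size_bits
        self.hashes = hashes
        self.bits = 0  # an integer used as a bit array

    def _positions(self, item: bytes):
        # Derive several bit positions by salting the hash with an index.
        for i in range(self.hashes):
            h = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(h[:8], "big") % self.size_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: bytes) -> bool:
        # True can be a false positive; False is always definitive.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def to_bytes(self) -> bytes:
        return self.bits.to_bytes(self.size_bits // 8, "big")


# Illustrative tasks; field values are made up for the example.
batch = [
    {"with": "https://example.com/posts", "do": "crud/create", "nonce": "a1"},
    {"with": "https://example.com/posts", "do": "crud/delete", "nonce": "b2"},
]

bloom = BloomFilter()
for task in batch:
    bloom.add(task_digest(task))

# The issuer would sign these bytes once; signing is out of scope here.
signed_payload = bloom.to_bytes()
```

A recipient holding one task and the signed filter can check `bloom.might_contain(task_digest(task))` without learning the other members, subject to the false-positive caveat mentioned above.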

@Gozala
Contributor Author

Gozala commented Dec 16, 2022

Additionally, that would imply that I can pass a task reference that the aud can invoke or get the result of, which is different from passing a promise to the result: it is the ability to run the task.

@expede
Member

expede commented Dec 16, 2022

addressing by paths inside DAGs as opposed to by CID had proven to be inadequate

I would love to know more about this! I had assumed that the Merkle proof made these interchangeable relative to a stable root CID.

Current draft requires that you address individual tasks via pointer inside the invocation. I find this unfortunate, because it implies that I can not address a specific task without revealing the batch it was part of.

Because actions can be run multiple times with different results (because they occur over time), they merely need to be made unique. Updating a counter multiple times needs to be counted as independent actions, for example.

One possible solution is to move the nonce to each task rather than on the wrapper. This gives you a stable identifier for that specific call.

Another is that you could send many single invocations wrapped in an array, and use this for routing multiple invocations. This has a clean separation of concerns, but is an extra level of wrapping.

Perhaps to explain how we're thinking about this stuff in IPVM: a coordinator (possibly the initial job creator, maybe external) takes the job, splits it up into new subtasks that are semantically equivalent, and creates subinvocations. These can reference each other's results as promises, so there's no need to return the value to the coordinator: they can communicate directly, or even gossip about receipts on a public pubsub channel.
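A rough sketch of subinvocations that reference each other's results as promises. The task names, the `image/*` and `store/add` actions, and the `{"await": name}` promise shape are all hypothetical, chosen only for illustration; the point is that the dependency order falls out of the promises themselves, with no value routed back through the coordinator.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical subtasks; {"await": name} marks a promise for another
# subtask's result.
subtasks = {
    "resize": {"do": "image/resize", "inputs": {"width": 100}},
    "encode": {"do": "image/encode", "inputs": {"source": {"await": "resize"}}},
    "store": {"do": "store/add", "inputs": {"blob": {"await": "encode"}}},
}


def dependencies(task: dict) -> set:
    """Names of the subtasks whose results this task is waiting on."""
    return {
        value["await"]
        for value in task["inputs"].values()
        if isinstance(value, dict) and "await" in value
    }


def run_order(tasks: dict) -> list:
    """Order in which the subtasks can run so every promise is resolved."""
    graph = {name: dependencies(task) for name, task in tasks.items()}
    return list(TopologicalSorter(graph).static_order())


order = run_order(subtasks)
```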

@Gozala
Contributor Author

Gozala commented Dec 16, 2022

One possible solution is to move the nonce to each task rather than on the wrapper. This gives you a stable identifier for that specific call.

That is our current design pretty much, because our tasks are basically delegations that you nominate as tasks in the unsigned envelope.

@Gozala
Contributor Author

Gozala commented Dec 16, 2022

Perhaps it helps to describe our current design, because I think something in between the two would provide the best compromise.

Our notion of a task is basically a UCAN delegation which MUST delegate only one capability.

Our notion of an invocation is [&UCAN, ...[&UCAN]], in other words just a tuple of the delegations, which serves the following purpose:

Signals to the aud that the capabilities passed need to be treated as tasks to run (which run concurrently). We do not currently have promises (so no pipelining) nor a way to name individual tasks inside the batch, but all of those were desired and planned for the future, and this spec is great in allowing us to implement those features.

It is also clear that a tuple is a poor signaling mechanism, so we would love to adopt ideas from this spec to address current limitations and improve on this.

That said, having a way to link to a specific task just by CID has been very useful and we use it a lot in our infrastructure. Additionally, it is nice that a batch can be taken apart and forwarded simply by repacking it in a different tuple.

@expede
Member

expede commented Dec 18, 2022

I talked through this with @QuinnWilton a bit earlier today. We both have this intuition that {String : Task} is friendlier and more easily adoptable than a list of CIDs, despite it adding some complexity by having local and remote promises. This is somewhat odd, because we're both more likely to advocate for using CIDs directly for their DAG properties.

One way that we may be able to have both the friendly names and be able to refer to tasks directly by CID would be to put nonces on every Task instead of only on the outer signature envelope. This gives each Task a unique CID, and thus can be addressed directly:

type Closure struct {
  with URI
  do String
  inputs Any
}

type Task struct {
  with URI
  do String
  inputs Any
  meta Any
  nonce String -- NEW
}

I guess the downside is that you'd need to track the associated wrapper if you wanted to check the signature or other metadata about the invocation. I guess that could be managed via a reverse lookup table?
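A small sketch of both points: a per-task nonce gives otherwise identical tasks distinct identifiers, and a reverse lookup table maps each task back to the wrapper it arrived in. Everything here is an assumption for illustration: `task_id` uses SHA-256 over canonical JSON as a stand-in for a real CID, and `index_invocation`, the `counter/inc` action, and the field values are hypothetical.

```python
import hashlib
import json


def task_id(node: dict) -> str:
    """Stand-in for a CID: hex SHA-256 of a canonical JSON encoding."""
    encoded = json.dumps(node, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(encoded).hexdigest()


def index_invocation(invocation: dict, table: dict) -> None:
    """Record which wrapper each task arrived in, keyed by task id."""
    wrapper_id = task_id(invocation)
    for task in invocation["run"].values():
        table[task_id(task)] = wrapper_id


# Two calls that differ only by nonce, e.g. incrementing a counter twice.
increment = {"with": "https://example.com/counter", "do": "counter/inc", "inputs": {}}
first = {**increment, "nonce": "n-1"}
second = {**increment, "nonce": "n-2"}

invocation = {"sig": "somebytes", "run": {"a": first, "b": second}}

wrapper_of = {}
index_invocation(invocation, wrapper_of)
```

With this, `wrapper_of[task_id(first)]` recovers the envelope needed to check the signature, at the cost of keeping the table around as extra state.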

In the use case from DAG House as I understand it, you may not need this. A CAR file can contain multiple disjoint graphs. You can use this to pack several Invocations that each contain a single task, and thus have a single identifier in their array: index 0. This is isomorphic to having each task signed, it's just packaged in a CAR file to put several together. This also helps solve for the problem about multiple auds.

{
  sig: "somebytes",
  nnc: "abcdef",
  meta: {"dev/comments": /* ... */},
  run: [
    {
      with: "https://example.com/posts",
      do: "crud/create",
      inputs: {/* ... */},
    }
  ]
}

...is a special case that's fully isomorphic to...

{
  sig: "somebytes",
  nnc: "abcdef",
  meta: {"dev/comments": /* ... */},
  with: "https://example.com/posts",
  do: "crud/create",
  inputs: {/* ... */}
}
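The claimed isomorphism between the two shapes can be sketched as a pair of inverse functions. This is only an illustration: `flatten`, `wrap`, and `TASK_FIELDS` are hypothetical names, and the `inputs` and `meta` values are placeholders because the originals are elided.

```python
# Fields that belong to the task rather than to the envelope.
TASK_FIELDS = ("with", "do", "inputs")


def flatten(invocation: dict) -> dict:
    """Turn a single-task invocation into the flat form."""
    run = invocation["run"]
    if len(run) != 1:
        raise ValueError("only a single-task invocation has a flat form")
    envelope = {k: v for k, v in invocation.items() if k != "run"}
    return {**envelope, **run[0]}


def wrap(flat: dict) -> dict:
    """Turn the flat form back into an invocation with one task at index 0."""
    envelope = {k: v for k, v in flat.items() if k not in TASK_FIELDS}
    task = {k: flat[k] for k in TASK_FIELDS}
    return {**envelope, "run": [task]}


# Placeholder values; the real inputs and meta are not given in the thread.
wrapped = {
    "sig": "somebytes",
    "nnc": "abcdef",
    "meta": {"dev/comments": "placeholder"},
    "run": [
        {
            "with": "https://example.com/posts",
            "do": "crud/create",
            "inputs": {"title": "placeholder"},
        }
    ],
}

flat = flatten(wrapped)
```

The round trip only holds when `run` has exactly one entry, which is why `flatten` refuses anything else.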

That said, having a way to link to a specific task just by CID has been very useful and we use it a lot in our infrastructure.

You and I talked a bit about this on our call, but I'd be interested to hear more about how addressing by CID has been useful. You also alluded to how paths inside CIDs have been problematic, and it would be really useful to know more! Is it that the paths can be to something not under that CID?

@Gozala
Contributor Author

Gozala commented Dec 20, 2022

I talked through this with @QuinnWilton a bit earlier today. We both have this intuition that {String : Task} is friendlier and more easily adoptable than a list of CIDs, despite it adding some complexity by having local and remote promises. This is somewhat odd, because we're both more likely to advocate for using CIDs directly for their DAG properties.

I think there are two things bundled here, so let me address them individually:

  1. {String: &Task} is better than [&Task]. I totally agree; we just cut a corner with a tuple, and mostly considered it to be a {String: &Task} with ugly keys, with a plan to address this in the future.
  2. Inline {String: Task} is more adoptable than the linked approach {String: &Task}. I'd buy that if all of this was not so entrenched in IPLD already. If you are already in IPLD, then links are just references and an opportunity to avoid the block size limit.

I am not arguing for {String: &Task} and have no problem with {String: Task|&Task}; I am just not convinced of the tradeoffs, as you increase the surface with a second addressing scheme in a system where you're almost guaranteed to have an IPLD stack.

@Gozala
Contributor Author

Gozala commented Dec 20, 2022

One way that we may be able to have both the friendly names and be able to refer to tasks directly by CID would be to put nonces on every Task instead of only on the outer signature envelope. This gives each Task a unique CID, and thus can be addressed directly

That's cool, but that still does not make them self-contained, as in I have to carry that envelope around as additional authorization context. This also applies to receipts: linked tasks there don't include proof of authorization to invoke the task (although that may be desired there).

@Gozala
Contributor Author

Gozala commented Dec 20, 2022

I guess the downside is that you'd need to track the associated wrapper if you wanted to check the signature or other metadata about the invocation. I guess that could be managed via a reverse lookup table?

It is less about how we manage it and more about self-containment as a design principle. E.g. we should be able to anchor things in a blockchain simply via the CID of the task; however, in order to be verifiable we also need to publish extra context or wrap the thing in an envelope to provide it, all of which introduces complications that are difficult to justify given that the only justification (unless I've missed some) is a reduced number of signatures (which could also be addressed differently).

If I've missed a rationale for why the current structure is better, please let me know.

@Gozala
Contributor Author

Gozala commented Dec 20, 2022

In the use case from DAG House as I understand it, you may not need this. A CAR file can contain multiple disjoint graphs. You can use this to pack several Invocations that each contain a single task, and thus have a single identifier in their array: index 0. This is isomorphic to having each task signed, it's just packaged in a CAR file to put several together.

We could do that. However, I'm not yet convinced we'd be making the right tradeoffs there. We can receive invocations that do not adhere to the single-task-per-invocation constraint, forcing us to either refuse them or cope with "additional context on the side". If we do the former, then we're not really implementing the spec and will likely run into problems with toolchains (built around the spec), etc. If we go with the latter, we're back to square one.

Perhaps implementing a subset of the spec is better than not implementing it at all, yet I wish we could arrive at a composable set of specs so we could implement some but not others.

@Gozala
Contributor Author

Gozala commented Dec 20, 2022

The following is not unreasonable:

{
  sig: "somebytes",
  nnc: "abcdef",
  meta: {"dev/comments": /* ... */},
  run: {
    task: {
      with: "https://example.com/posts",
      do: "crud/create",
      inputs: {/* ... */},
    }
  },
  prf: [/* ... */]
}

Except now promises aren't simply CIDs; instead they need to carry an additional "task" sub-selector everywhere. This is not terrible, just really uncomfortable, in that simpler cases are more verbose than more complex ones.

@Gozala
Contributor Author

Gozala commented Dec 20, 2022

You and I talked a bit about this on our call, but I'd be interested to hear more about how addressing by CID has been useful.

We use CIDs to namespace things within storage buckets. We are also considering utilizing CIDs for sharding across various buckets. In a way, a CID is a serialized reference, so it tends to be useful anywhere you want to associate or look up information with a specific reference.

However, references are not nearly as useful if what they refer to isn't self-contained. E.g. if we need extra context (like the invocation here), they are no longer verifiable, as in I can't publish such a CID into a blockchain or even a gossip network.

An imperfect analogy I'll try to make here is languages with first-class functions vs languages like Java, which did not have them. The latter ended up forcing people to develop silly classes with a single method so it could be passed around; closures were also a bit trickier, as now you had another silly, less universal class to capture context.

Tasks as they stand in the spec feel like the lack of first-class functions in Java: they're like methods that you can't pass around, so you're forced to wrap them in invocations. In fact even the proposed solution is equivalent: you just create single-task invocations (single-method classes) and pass those around instead. That works, but falls short of first-class functions nevertheless.

You also alluded to how paths inside CIDs have been problematic, and it would be really useful to know more! Is it that the paths can be to something not under that CID?

Most problems we've faced with CID + path are from IPFS and some may not necessarily apply here, yet it taught us to be cautious around them. Here are a few things I can think of:

  1. CID + path reveals outer layers, which may not be desired. I may not want you to know there is an outer layer, as you may discover things I don't want you to.
  2. Unverifiable without getting all the outer layers. If I ask for a path within a CID and a random peer on the net gives me some data, there is no way to verify I got what I asked for without also getting everything along the path.
  3. A path within a node encoded with an uncommon codec is non-traversable.
  4. Bitswap is CID based, so getting content under the path can prove challenging and impose significant overhead.
  5. The IPFS Blockstore is multihash based, so path-based lookups aren't ideal here either (this could be addressed by an indexing system).
  6. Unlike Link, it has no first-class representation in IPLD, so you'll end up with fragmented domain-specific versions that other layers of the stack aren't aware of, usually resulting in a poor experience.
  7. Paths may or may not cross block boundaries, which makes their access profile very unpredictable: it can take anywhere from 1 up to N lookups, where N is the number of fragments in the path.
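Points 2 and 7 can be illustrated with a toy content-addressed store. This is a sketch under stated assumptions, not how IPFS is implemented: `BlockStore` and `resolve` are hypothetical, SHA-256 hex strings stand in for CIDs, and links are written as `{"/": key}`, loosely following the DAG-JSON link form.

```python
import hashlib
import json


class BlockStore:
    """Toy hash-addressed store that counts lookups (hypothetical)."""

    def __init__(self):
        self.blocks = {}
        self.lookups = 0

    def put(self, node: dict) -> str:
        data = json.dumps(node, sort_keys=True).encode()
        key = hashlib.sha256(data).hexdigest()
        self.blocks[key] = data
        return key

    def get(self, key: str) -> dict:
        self.lookups += 1
        data = self.blocks[key]
        # Verification is per block: the bytes must hash to the key asked for.
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError("block does not match its address")
        return json.loads(data)


def resolve(store: BlockStore, root: str, path: list) -> dict:
    """Walk a path from a root, following links across block boundaries."""
    node = store.get(root)
    for segment in path:
        value = node[segment]
        if isinstance(value, dict) and set(value) == {"/"}:
            value = store.get(value["/"])  # crossing a block boundary
        node = value
    return node


store = BlockStore()
leaf = store.put({"do": "crud/create"})
batch = store.put({"tasks": {"first": {"/": leaf}}})
root = store.put({"batch": {"/": batch}})

via_path = resolve(store, root, ["batch", "tasks", "first"])
path_lookups = store.lookups  # every block along the path was fetched

store.lookups = 0
direct = store.get(leaf)
direct_lookups = store.lookups  # a single verifiable fetch
```

Resolving by path fetches and verifies each outer block in turn, which also reveals those outer layers; addressing the leaf directly needs only the one block.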

I'm sure people who have been at PL longer will have longer lists.

The overall sentiment is: while a bunch of things on the list could be addressed in various subsystems, the added complexity seems to have prevented that from happening. Paths tend to introduce problems in various places in the stack, and avoiding them has often led to simpler and more robust designs, so if you see a path 🚨

@mikeal

mikeal commented Jan 5, 2023

This is the composition I’d like to have; I don’t care that much what things are named:

  • Individual Task
  • Batch (Implemented to match “Individual Task” interface) is either:
    • Ordered Array of “Individual Tasks” for sequential batch execution.
    • Unordered Set of “Individual Tasks” for concurrent batch execution.

Everything is a “Task” with a CID; batches are collections of these addresses and, preferably, nothing else, so they maintain hash consistency.

This can be recursively composed, and executions can be compared and deduplicated across batches and executions by comparing the CIDs in the Arrays and Sets.

Receipts (task results) should compose identically to this but in reverse.
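A sketch of this composition, with assumptions flagged: `cid` is a SHA-256 stand-in for a real CID, and `sequential`, `concurrent`, and the `batch`/`tasks` field names are hypothetical. Batches hold only task addresses, the unordered set is given one canonical encoding by sorting, and a batch can itself be addressed and nested like a task.

```python
import hashlib
import json


def cid(node: dict) -> str:
    """Stand-in for a CID: hex SHA-256 of a canonical JSON encoding."""
    return hashlib.sha256(json.dumps(node, sort_keys=True).encode()).hexdigest()


def sequential(task_cids: list) -> dict:
    """Ordered array of task addresses: order is part of the identity."""
    return {"batch": "sequential", "tasks": list(task_cids)}


def concurrent(task_cids: list) -> dict:
    """Unordered set of task addresses: sorting gives one canonical form."""
    return {"batch": "concurrent", "tasks": sorted(set(task_cids))}


# Illustrative tasks; the values are made up for the example.
a = cid({"do": "crud/create", "nonce": "1"})
b = cid({"do": "crud/update", "nonce": "2"})
c = cid({"do": "crud/delete", "nonce": "3"})

left = concurrent([a, b])
right = concurrent([b, a, c])

# Recursive composition: a batch address is used like a task address.
pipeline = sequential([cid(left), c])

# Deduplication across batches is a plain comparison of addresses.
shared = set(left["tasks"]) & set(right["tasks"])
```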

@Gozala Gozala mentioned this issue Jan 31, 2023
@expede expede linked a pull request Oct 3, 2023 that will close this issue
7 tasks