Public partitions as a cartesian product of dimensions #277

Open
dvadym opened this issue May 17, 2022 · 2 comments
Labels
Good first issue 🎓 Perfect for beginners, welcome to OpenMined! Type: New Feature ➕ Introduction of a completely new addition to the codebase


@dvadym
Collaborator

dvadym commented May 17, 2022

Context

Note: here is more about terminology.

Definitions (from terminology page)

A partition is a subset of the data corresponding to a given value of the aggregation criterion. Usually we want to aggregate each partition separately. For example, if we count visits to restaurants, the visits for one particular restaurant are a single partition, and the count of visits to that restaurant would be the aggregate for that partition.

Public partitions are partition keys that are publicly known and hence don’t leak any user information. An example of public partitions could be week days.

DPEngine.aggregate is the API function that performs DP aggregation. public_partitions is an argument of DPEngine.aggregate(). It can be a Python iterable (when it is small enough to fit in memory and to be distributed efficiently among workers) or a distributed collection (PCollection for Beam, RDD for Spark).

In short, public partition selection consists of 2 stages:

  1. Filtering out all partition keys that are not in public_partitions (code which does this).
  2. Adding "zero" partitions for all elements of public_partitions that are not in the input data (code which does this).
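
To make the two stages concrete, here is a minimal in-memory sketch in plain Python. The helper names (`filter_to_public`, `add_empty_partitions`) and the `(partition_key, value)` row layout are assumptions made for illustration only; this is not PipelineDP's actual code, which operates on distributed collections.

```python
from typing import Hashable, Iterable, List, Tuple

Row = Tuple[Hashable, float]  # (partition_key, value); layout assumed for this sketch


def filter_to_public(rows: Iterable[Row],
                     public_partitions: Iterable[Hashable]) -> List[Row]:
    """Stage 1: drop every row whose partition key is not public."""
    public = set(public_partitions)
    return [(key, value) for key, value in rows if key in public]


def add_empty_partitions(rows: Iterable[Row],
                         public_partitions: Iterable[Hashable],
                         zero: float = 0.0) -> List[Row]:
    """Stage 2: append a "zero" row for each public partition absent from rows."""
    rows = list(rows)
    present = {key for key, _ in rows}
    missing = [(key, zero) for key in public_partitions if key not in present]
    return rows + missing


rows = [("Mon", 3.0), ("Tue", 1.0), ("Holiday", 5.0)]
public_partitions = ["Mon", "Tue", "Wed"]

filtered = filter_to_public(rows, public_partitions)          # "Holiday" is dropped
completed = add_empty_partitions(filtered, public_partitions)  # "Wed" is added with 0.0
```

After both stages every public partition is present exactly as a key, so the set of output keys does not depend on the input data.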

Downsides of the current state.

Let’s consider the case when partitions are Cartesian products of multiple dimensions, for example (country, date).
The user has to generate the cross-join themselves: this is an additional step for the user, which leaves more room for bugs, and the cross-join might be very large (and therefore hurt performance).

What can be done better?

The user can specify the values of each dimension, and PipelineDP can do the join internally: this would be easier for users, and the join can likely be done much more efficiently inside PipelineDP.
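
A rough illustration of why this can be cheaper, assuming partition keys are tuples such as (country, date): the cross-join the user builds today grows with the product of the dimension sizes, while a per-dimension membership check only needs storage proportional to their sum. The function `is_public` is a hypothetical name used for this sketch, not an existing PipelineDP function.

```python
import itertools

countries = ["US", "DE", "FR"]
dates = ["2022-05-16", "2022-05-17"]

# Current state: the user materializes the full cross-join themselves.
cross_join = list(itertools.product(countries, dates))

# Possible alternative: keep one set per dimension and check each
# component of the key against its own dimension.
dimension_sets = [set(countries), set(dates)]


def is_public(key, dimension_sets):
    """True if every component of the tuple key is in its dimension."""
    return (len(key) == len(dimension_sets) and
            all(value in values for value, values in zip(key, dimension_sets)))


stored_with_cross_join = len(cross_join)                     # 3 * 2 keys
stored_per_dimension = sum(len(s) for s in dimension_sets)   # 3 + 2 values
```

Both approaches accept exactly the same keys; only the amount of data that has to be built and shipped to the workers differs.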

Goals

Allow specifying public_partitions as a product of different dimensions. Steps to implement (they might be split into several PRs):

1. Devise a nice API for specifying public_partitions in the arguments of DPEngine.aggregate (a separate argument, or maybe a class object which specifies the product). To begin with, we can assume that the dimension values are Python iterables.
2. Implement steps 1 & 2 of the public_partitions algorithm (see the section above).
3. Propagate these public partitions to all places where they are used (e.g. in the Beam API and in the Spark API).
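
One possible shape for the "class object" option in step 1 is sketched below. The class name `PublicPartitionsProduct` and its methods are purely hypothetical and do not exist in PipelineDP; the sketch assumes the dimension values are in-memory Python iterables and that partition keys are tuples with one component per dimension.

```python
import itertools
import math
from typing import Hashable, Iterable, Iterator, Tuple


class PublicPartitionsProduct:
    """Hypothetical container: public partitions as a product of dimensions."""

    def __init__(self, *dimensions: Iterable[Hashable]):
        # Deduplicate while keeping order, so iteration is deterministic.
        self._values = [list(dict.fromkeys(d)) for d in dimensions]
        self._sets = [set(v) for v in self._values]

    def __contains__(self, key) -> bool:
        # Filtering stage: check each component against its own dimension.
        return (isinstance(key, tuple) and len(key) == len(self._sets) and
                all(v in s for v, s in zip(key, self._sets)))

    def __iter__(self) -> Iterator[Tuple[Hashable, ...]]:
        # "Zero" partition stage: enumerate keys lazily, without a stored list.
        return itertools.product(*self._values)

    def __len__(self) -> int:
        return math.prod(len(v) for v in self._values)


partitions = PublicPartitionsProduct(["US", "DE"], ["2022-05-16", "2022-05-17"])
rows = [(("US", "2022-05-16"), 1), (("BR", "2022-05-16"), 1)]

kept = [row for row in rows if row[0] in partitions]         # stage 1
present = {key for key, _ in kept}
missing = [key for key in partitions if key not in present]  # stage 2
```

Whether such an object is passed through the existing public_partitions argument or through a separate one is exactly the design question raised in step 1; the sketch only shows that both stages can be served from the per-dimension values.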

@replomancer
Member

I'd like to work on that.

@dvadym
Collaborator Author

dvadym commented May 24, 2022

Thank you!
