Contribution bounding with Group By privacy unit #488

dvadym · 2023-09-13T11:33:20Z

Context

Prerequisites: PipelineDP terminology, especially privacy unit and partition key.

One part of the anonymization pipeline is contribution bounding, namely limiting the contributions from one privacy unit. One common way to specify the bounds is with max_partitions_contributed and max_contributions_per_partition. Atm it's done with 2 samplings:

Sample max_contributions_per_partition contributions per (privacy_id, partition_key) (code).
Sample max_partitions_contributed partitions per privacy_id (code).
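
For illustration, the two samplings above can be sketched in plain Python on an in-memory list. This is not PipelineDP code: the function name `two_stage_bound`, the row layout `(privacy_id, partition_key, value)`, and the `seed` parameter are assumptions made for the example. In a distributed backend each of the two group-bys would be a separate shuffle.

```python
import random
from collections import defaultdict


def two_stage_bound(rows, max_partitions_contributed,
                    max_contributions_per_partition, seed=0):
    """Hypothetical sketch. rows: iterable of (privacy_id, partition_key, value).

    Returns a dict {(privacy_id, partition_key): [values]} after bounding.
    """
    rng = random.Random(seed)

    # Stage 1 (first group-by): group by (privacy_id, partition_key) and keep
    # at most max_contributions_per_partition values per pair.
    per_pair = defaultdict(list)
    for privacy_id, partition_key, value in rows:
        per_pair[(privacy_id, partition_key)].append(value)
    sampled_pairs = {
        key: (rng.sample(values, max_contributions_per_partition)
              if len(values) > max_contributions_per_partition else values)
        for key, values in per_pair.items()
    }

    # Stage 2 (second group-by): group by privacy_id and keep at most
    # max_partitions_contributed partitions per privacy unit.
    per_privacy_id = defaultdict(list)
    for (privacy_id, partition_key), values in sampled_pairs.items():
        per_privacy_id[privacy_id].append((partition_key, values))

    result = {}
    for privacy_id, partitions in per_privacy_id.items():
        if len(partitions) > max_partitions_contributed:
            partitions = rng.sample(partitions, max_partitions_contributed)
        for partition_key, values in partitions:
            result[(privacy_id, partition_key)] = values
    return result
```

The point of the sketch is only that the data is regrouped twice on different keys, which is what makes the current approach expensive at scale.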

It's scalable, but it requires 2 shuffling stages (each sampling requires a shuffle), which is expensive. Another way to do the sampling is
to group by privacy_key once and do both samplings in memory.

Goal

Implement contribution bounding with a single group by privacy_key, doing the sampling in memory.

Note: Since one privacy unit can contain too many data points to fit in memory, we can limit the number per privacy unit with some large constant, for example 10**7.
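
A minimal sketch of the proposed approach in plain Python, assuming rows of the form `(privacy_id, partition_key, value)`. The names `group_by_privacy_id_bound`, `bound_one_privacy_unit`, and `MAX_ROWS_PER_PRIVACY_ID` are hypothetical and not part of PipelineDP. There is a single group-by on the privacy unit, and both bounds are then applied in memory. Here the cap is applied after a local group-by; in a distributed backend it would presumably need to be enforced as part of the shuffle itself.

```python
import random
from collections import defaultdict

# Hypothetical cap on data points kept per privacy unit (value from the note).
MAX_ROWS_PER_PRIVACY_ID = 10**7


def bound_one_privacy_unit(rows, max_partitions_contributed,
                           max_contributions_per_partition, rng):
    """rows: list of (partition_key, value) belonging to one privacy unit."""
    by_partition = defaultdict(list)
    for partition_key, value in rows:
        by_partition[partition_key].append(value)

    # Cross-partition bound: keep at most max_partitions_contributed partitions.
    partition_keys = list(by_partition)
    if len(partition_keys) > max_partitions_contributed:
        partition_keys = rng.sample(partition_keys, max_partitions_contributed)

    # Per-partition bound: keep at most max_contributions_per_partition values.
    bounded = {}
    for partition_key in partition_keys:
        values = by_partition[partition_key]
        if len(values) > max_contributions_per_partition:
            values = rng.sample(values, max_contributions_per_partition)
        bounded[partition_key] = values
    return bounded


def group_by_privacy_id_bound(rows, max_partitions_contributed,
                              max_contributions_per_partition,
                              max_rows_per_privacy_id=MAX_ROWS_PER_PRIVACY_ID,
                              seed=0):
    """Returns {(privacy_id, partition_key): [values]} using one group-by."""
    rng = random.Random(seed)

    # The only group-by: collect all rows of each privacy unit together.
    grouped = defaultdict(list)
    for privacy_id, partition_key, value in rows:
        grouped[privacy_id].append((partition_key, value))

    result = {}
    for privacy_id, unit_rows in grouped.items():
        # Limit the data held in memory for one privacy unit.
        if len(unit_rows) > max_rows_per_privacy_id:
            unit_rows = rng.sample(unit_rows, max_rows_per_privacy_id)
        bounded = bound_one_privacy_unit(unit_rows, max_partitions_contributed,
                                         max_contributions_per_partition, rng)
        for partition_key, values in bounded.items():
            result[(privacy_id, partition_key)] = values
    return result
```

Note that this sketch chooses partitions first and then samples within them, which avoids sampling values for partitions that are later dropped; whether the real implementation should use that order is a design choice left open here.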

Code pointers

ContributionBounder is the abstract base class for ContributionBounders.
SamplingCrossAndPerPartitionContributionBounder is the class which does the current 2-stage sampling.
SamplingPerPrivacyIdContributionBounder is a class which samples a fixed number of contributions per privacy unit (it's meant more as an example).
Tests for contribution bounders
Contribution bounder creation

dvadym added the Type: New Feature ➕ Introduction of a completely new addition to the codebase label Sep 13, 2023
