Establish a data control plan #8

Open
jbenet opened this issue May 1, 2017 · 11 comments

Comments

@jbenet
Member

jbenet commented May 1, 2017

we should outline the control plan (hand over to wikipedia itself, etc)

@flyingzumwalt changed the title from "data control plan" to "Establish a data control plan" on May 12, 2017
@flyingzumwalt
Contributor

Though we hope that Wikipedia will take this under its wing at some point, we should not assume that it will. Based on that, we're setting up a community-based model for managing the generation of snapshots from Kiwix dumps. This is one of the first tests of the model that evolved out of the Data Rescue hackathons in early 2017 -- where communities of hackers, content specialists and do-gooders work together to pull data off of centralized servers and redistribute it.

To apply this model we're partnering with @b5 from http://www.qri.io/, who did a lot of the technical work behind the Data Rescue hackathons. Many other people, like @dcwalk @titaniumbones @mayaad @trinberg @abergman, contributed to the evolution of this model.

The Process

Key elements of this process:

  • Embrace community contributions with an open model of community governance. In short, use GitHub and PRs to manage everything. Actively embrace contributions by community members, give them a voice in governance of the code, and provide a clear definition of the requirements to become a committer.
  • Use code to automate repeatable tasks: rather than having lots of people write one-off scripts and run them once, put that energy into building and maintaining reusable scripts.
  • Be careful about provenance and chain of custody: It's important to be clear exactly where the snapshots came from and exactly what was done to them. To enforce this, we have to be careful about who runs the scripts and how they run the scripts.
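
To make the provenance point concrete, here is a minimal sketch of what a per-snapshot chain-of-custody record could capture. The `ProvenanceRecord` type and every field name in it are hypothetical, invented for illustration; this is not a format the project has agreed on.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ProvenanceRecord:
    """Hypothetical chain-of-custody record for one snapshot (illustrative only)."""

    source_url: str      # where the dump was downloaded from
    source_sha256: str   # digest of the dump exactly as downloaded
    script_commit: str   # git commit of the scripts that were run
    image_digest: str    # digest of the container image used to run them
    operator: str        # committer who ran the scripts
    steps: List[str] = field(default_factory=list)  # operations applied, in order

    def to_json(self) -> str:
        # Sorted keys keep the serialized record byte-for-byte reproducible.
        return json.dumps(asdict(self), sort_keys=True, indent=2)


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string, e.g. the downloaded dump."""
    return hashlib.sha256(data).hexdigest()
```

A record like this answers both questions from the bullet above: where the snapshot came from (`source_url`, `source_sha256`) and what was done to it, by whom (`script_commit`, `image_digest`, `steps`, `operator`).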

Balancing Open Community with Careful Chain of Custody

It may seem like the open community model is at odds with maintaining a clear chain of custody when processing the snapshots. Here's how we will balance the two:

Open community contributions (via GitHub pull requests, etc.) wherever possible:

  • maintaining the scripts that pull dumps from Kiwix
  • maintaining any scripts that modify snapshots and write them to IPFS
  • nominating new language variants to be added as snapshots
  • deciding when to run new snapshots
  • maintaining the Docker container that is used to run these scripts

All of this happens under an open governance model around who can become a committer on the repo, etc.

Meanwhile a smaller group of committers will handle:

  • running the scripts, using the community-managed Docker image, to generate new snapshots
  • publishing updates to the IPNS entries

Eventually we might incorporate cryptographic techniques (e.g. SNARKs) to prove that the intended operations (and only the intended operations) were run on the snapshots, which would allow anyone to build the snapshots without corrupting the chain of custody. This will require some research. For now, it's overkill.

@flyingzumwalt
Contributor

Note: one cool thing about using IPFS with this structure: if you want to validate that someone actually ran the scripts they claim, you can just re-run the scripts from the same sources and compare the hashes of the results...
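
As a rough sketch of that check: rebuild the snapshot from the same sources, then compare a digest of the rebuilt output with a digest of the published one. The code below uses a plain SHA-256 over a directory tree as a stand-in; with IPFS you would compare the root hashes that IPFS itself reports, and the comparison only holds if the scripts are deterministic. The function names `tree_digest` and `same_snapshot` are made up for this example.

```python
import hashlib
import os
import tempfile


def tree_digest(root: str) -> str:
    """SHA-256 over the relative path and content digest of every file under root."""
    digest = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # fix the traversal order so the result is deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            file_hash = hashlib.sha256()
            with open(path, "rb") as fh:
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    file_hash.update(chunk)
            digest.update(rel.encode("utf-8") + b"\0" + file_hash.digest())
    return digest.hexdigest()


def same_snapshot(claimed_dir: str, rebuilt_dir: str) -> bool:
    """True if both directory trees have identical paths and file contents."""
    return tree_digest(claimed_dir) == tree_digest(rebuilt_dir)


if __name__ == "__main__":
    # Tiny demonstration with two throwaway directories holding the same file.
    with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
        for d in (a, b):
            with open(os.path.join(d, "index.html"), "w") as fh:
                fh.write("<h1>hello</h1>")
        print(same_snapshot(a, b))
```

Any difference in file names or bytes changes the digest, so a mismatch means either the sources differed or the claimed steps were not the ones actually run.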

@dcwalk

dcwalk commented May 13, 2017

pinging @patcon and @meyerscr (edit, didn't need to ;)) to watch here

@b5

b5 commented May 17, 2017

Ok we've started to make progress on this. Currently this is just defaulting to sending emails while we figure out how to connect the requests to a queue, but it's a start.

Live url here: https://task-mgmt.archivers.space
Repo here: https://github.com/archivers-space/task-mgmt

Note, you'll need write access to ipfs/distributed-wikipedia-mirror in order to access the page.

I've outlined some next steps in the repo readme. @flyingzumwalt, it might make sense to touch base on them sometime soon, specifically around the question of where the actual task execution is going to happen. If we need to build that, that's ok. In the meantime I still have lots to chew on.

@Kubuxu
Member

Kubuxu commented May 19, 2017

The archivers app requesting full private repo access is a no-go for me, unfortunately.

Many platforms ask for public access only, with a separate upgrade to private repo access when the need arises.

@flyingzumwalt
Contributor

Is archivers requesting access? I thought it was just using the GitHub OAuth response to know if the user has write access to this repo -- so you need write permission in the GitHub repo in order to manage stuff in archivers. That lets us set it up so that anyone who can modify this repo can also manage things in archivers, like kicking off building a new snapshot. The actual submission of new content from archivers or from the workers it runs will be done via PRs, which does not require write access to this repo.
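
A minimal sketch of that gate, assuming the app fetches the repository object from GitHub's REST API with the user's token and that the response carries a `permissions` object with `admin`, `push` and `pull` booleans. Whether task-mgmt actually does it this way is an assumption, and `has_write_access` is a hypothetical helper name.

```python
from typing import Any, Mapping


def has_write_access(repo_response: Mapping[str, Any]) -> bool:
    """True if the parsed repository JSON reports push (write) or admin permission.

    `repo_response` is assumed to be the decoded JSON body of a repository
    lookup made with the logged-in user's token.
    """
    permissions = repo_response.get("permissions") or {}
    return bool(permissions.get("push") or permissions.get("admin"))


# Example with hand-written responses (not real API output):
committer = {"full_name": "ipfs/distributed-wikipedia-mirror",
             "permissions": {"admin": False, "push": True, "pull": True}}
visitor = {"full_name": "ipfs/distributed-wikipedia-mirror",
           "permissions": {"admin": False, "push": False, "pull": True}}
```

With these inputs, `has_write_access(committer)` is true and `has_write_access(visitor)` is false; a response with no `permissions` key is treated as no write access.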

@Kubuxu
Member

Kubuxu commented May 20, 2017

The management page does: https://task-mgmt.archivers.space if you try to log in with GitHub.

@flyingzumwalt
Contributor

aha. yeah we have to change that.

@b5

b5 commented May 20, 2017

Oh yes completely agreed. I'll drop the permissions ask, will report back once the change is up.

@b5

b5 commented May 30, 2017

Ok, change is now live. App shouldn't request access to private repos.
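
For readers wondering what a change like this typically looks like: on GitHub, the `repo` OAuth scope covers private repositories, while `public_repo` (or no scope at all) does not. The sketch below builds an authorization URL and lets you audit which scopes it asks for. It is only an illustration under the assumption that the fix was a narrower scope request; the helper names and the placeholder client values are hypothetical, not taken from the task-mgmt code.

```python
from typing import Iterable, List
from urllib.parse import parse_qs, urlencode, urlparse

AUTHORIZE_URL = "https://github.com/login/oauth/authorize"


def authorize_url(client_id: str, redirect_uri: str, state: str,
                  scopes: Iterable[str] = ()) -> str:
    """Build a GitHub OAuth authorization URL requesting only the given scopes."""
    params = {"client_id": client_id, "redirect_uri": redirect_uri, "state": state}
    scope_list = list(scopes)
    if scope_list:
        # GitHub expects multiple scopes as a space-delimited list.
        params["scope"] = " ".join(scope_list)
    return AUTHORIZE_URL + "?" + urlencode(params)


def requested_scopes(url: str) -> List[str]:
    """Return the scopes an authorization URL asks for (empty list if none)."""
    query = parse_qs(urlparse(url).query)
    return query.get("scope", [""])[0].split()


def asks_for_private_repos(url: str) -> bool:
    """True if the URL requests the broad `repo` scope."""
    return "repo" in requested_scopes(url)
```

Calling `authorize_url` with no scopes produces a login link that identifies the user without asking for any repository access beyond public information.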

@flyingzumwalt
Contributor

Update: @b5 is making amazing progress building a robust and reusable solution for our data-control needs: datatogether/task_mgmt#4
