
Proposal: Only connect to writers #36

max-mapper opened this issue Jul 18, 2018 · 48 comments

@max-mapper

Following up from some twitter discussion a couple weeks back. I propose changing the default discovery algorithm in three ways in order to improve default privacy:

    1. be able to verify known trusted hosts
    2. default to download only from trusted hosts (similar to HTTPS privacy)
    3. default to download only (opt in to seed)

In other words, turn off p2p mode by default, except for a set of 'trusted hosts'. To simplify things, I propose we define the 'trusted hosts' as any writer. This is a simple default that can be overridden by settings (e.g. to specify a set of trusted non-writer hosts).

The way I envision discovery working in this new scheme is something like:

  • Discover initial peers based on first key (same behavior as now)
  • Additionally, subscribe to discovery channel for every key in writers
  • Instead of connecting to any IP:PORT peers that are discovered, only connect to IP:PORT peers signed by writer keyholders

This changes the privacy expectation to match HTTPS: Users need only trust the 'owner' of the content they are requesting to keep their server logs secure. The key difference being instead of one DNS record being considered the owner, the entire set of Dat writers (and their corresponding IP:PORT pairs) are to be considered trustworthy.

Again, this is just a proposed default for any Dat client. An option to run in 'unrestricted p2p mode' is easily added.

This would probably be a breaking change, since there could be existing dat schemes out there that rely on non-writers re-seeding content.

DEP wise, there would need to be a mechanism added to sign DHT payloads.
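For a concrete picture of the connect-time check this implies, here is a minimal, purely illustrative sketch. It assumes discovery returns a payload with the peer's host:port plus a signature, and that the client already knows the writer public keys; the payload layout, helper names, and the use of a detached ed25519 signature are all assumptions, since the exact signing mechanism is what the DEP change would have to define.

```js
// Illustrative sketch only - not an existing Dat API. Assumes each discovery
// payload carries { host, port, signature } and that `writerKeys` is the set
// of ed25519 public keys belonging to the Dat's writers.
const sodium = require('sodium-native')

function isTrustedAnnounce (announce, writerKeys) {
  const message = Buffer.from(announce.host + ':' + announce.port)
  // Accept the peer if any writer key verifies the signature over host:port
  return writerKeys.some(function (publicKey) {
    return sodium.crypto_sign_verify_detached(announce.signature, message, publicKey)
  })
}

// In the swarm's connect logic, unsigned or unverifiable peers are skipped:
// discovered.filter(p => isTrustedAnnounce(p, writerKeys)).forEach(connect)
```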

@e-e-e

e-e-e commented Jul 18, 2018

Forgive me if I am wrong, but would this not seriously hinder any attempt at establishing a supporting network of peers that help host data?

Given that multi-writing relies on one hypercore feed per writer, and hyperdb's internal implementation iterates over the writer feeds in many places (e.g. getting the heads), increasing the number of secure sources would mean eventually hitting performance bottlenecks.

I am not too aware of what the issues around privacy are, though - is it primarily IP sniffing? Or is it also related to bad actors poisoning the network?

@max-mapper
Author

Forgive me if I am wrong, but would this not seriously hinder any attempt at establishing a supporting network of peers that help host data?

The burden would be on the owner(s) to ensure availability. If someone wants to help, the owner adds them as a writer, and then the new writer seeds.

I am not too aware of what the issues around privacy are, though - is it primarily IP sniffing? Or is it also related to bad actors poisoning the network?

The issue is that currently by using Dat you are exposing your IP:PORT to the network, and it is trivial for anyone else with the Dat key to see what bytes you have and/or are downloading. It's not an issue for 'private dats' but is an issue for e.g. hosting public websites on Dat. More background: https://blog.datproject.org/2016/12/12/reader-privacy-on-the-p2p-web/

Re: hypercore, I was under the impression that there was a concept of a top level set of writer keys that can be queried in a performant way. E.g. something like hypercore.getWriters(function (err, writers){}) where writers is an array of public keys.
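A sketch of how a client might consume such an API (to be clear, getWriters is hypothetical - it is not an existing hypercore method - and `feed`/`swarm` stand in for a hyperdb instance and a discovery-swarm instance):

```js
// Hypothetical: feed.getWriters() does not exist under that name; this only
// illustrates turning the writer set into per-key discovery channels.
const crypto = require('hypercore-crypto')

function joinTrustedChannels (feed, swarm, cb) {
  feed.getWriters(function (err, writers) { // writers: array of public key buffers
    if (err) return cb(err)
    // Every writer is treated as a trusted host: join one channel per key
    writers.forEach(key => swarm.join(crypto.discoveryKey(key)))
    cb(null, new Set(writers.map(key => key.toString('hex'))))
  })
}
```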

@RangerMauve
Contributor

I totally get the privacy implications, but this will likely hurt the overall strength of the network and will make it harder for non-tech-savvy people to make use of the features.

One of the main selling points, for me, was that the more peers were accessing the data, the more resilient the network would be and there would be less of a load on the initial sources of the data.

With this, that functionality goes away by default and there's more mental burden on casual users to keep their data online.

I think that adding support for some sort of mixnet into the protocol would be a better way forward for preserving IP address privacy in that it will be easier to make things "just work".

@max-mapper
Author

max-mapper commented Jul 19, 2018 via email

@RangerMauve
Contributor

Would it be possible to make this opt-in per dat rather than the default for all dats? Like, allowing users to choose whether they want more privacy or more resiliency.

You mentioned "opt in to seed", but you can't opt into seeding if you're not a trusted host since nobody would attempt to connect to you. Does that mean that only trusted hosts can opt into seeding?

One of the use-cases I have is a social media platform where you seed the data for all of the people you're following, kinda like SSB. That way if you have a decent sized community, you're more likely to be online at the same time as somebody to share updates to the content. Could that still work?

Also, is dat still peer to peer if you have centralized hosts for replication? If those trusted hosts are blocked by a network or overloaded, there's now no way to get access to that data.

How does this interact with sharing content over MDNS? If I'm using an offline-first chat, and somebody sends me a link, I now can't get access to that data from the person that sent it to me unless they're a "trusted" peer, right? That would make dat a lot less useful for collaboration without internet.

@max-mapper
Author

Would it be possible to make this opt-in per dat rather than the default for all dats? Like, allowing users to choose whether they want more privacy or more resiliency. You mentioned "opt in to seed", but you can't opt into seeding if you're not a trusted host since nobody would attempt to connect to you. Does that mean that only trusted hosts can opt into seeding?

The opt in could work for the seeder or the downloader. e.g. If you did dat config allow-unauthorized-hosts or something it would opt in your client to seed and connect to anyone.

One of the use-cases I have is a social media platform where you seed the data for all of the people you're following, kinda like SSB. That way if you have a decent sized community, you're more likely to be online at the same time as somebody to share updates to the content. Could that still work?

It depends on the privacy guarantees you want to provide to your users... but SSB pubs are a good model to think about: if the pub is a writer, then everyone can just connect to the pub. If the pub isn't available for some reason, the app can ask the user if it's OK to try to connect to potentially untrustworthy sources in order to access content.

Also, is dat still peer to peer if you have centralized hosts for replication? If those trusted hosts are blocked by a network or overloaded, there's now no way to get access to that data.

Dat would act the same way HTTPS acts today, except there would be a set of writers that would be trusted, rather than 1 host. So it would be more resilient than HTTPS but by no means would I describe dat as a censorship resistance tool (which is also a difficult problem related to anonymity).

How does this interact with sharing content over MDNS? If I'm using an offline-first chat, and somebody sends me a link, I now can't get access to that data from the person that sent me unless they're a "trusted" peer, right? That would make dat a lot less useful for collaboration without internet.

The IP:PORT would just be a local one, so it could get signed by the writer key and would work the same as internet discovery.

@e-e-e

e-e-e commented Jul 19, 2018

More background: https://blog.datproject.org/2016/12/12/reader-privacy-on-the-p2p-web/

Thanks for the clarification @maxogden. I remember reading this a while ago and was not sure if there were other concerns that this proposal was addressing too.

Re: hypercore, I was under the impression that there was a concept of a top level set of writer keys that can be queried in a performant way. E.g. something like hypercore.getWriters(function (err, writers){}) where writers is an array of public keys.

In terms of performance it's not the checking of the authorised keys that I think would become a problem, but potentially the use of HyperDB. @mafintosh definitely has more of an understanding of this, but my understanding is that navigating the trie structures of hyperdb will become less performant as writers increase. For example, getHeads iterates over every writer - https://github.com/mafintosh/hyperdb/blob/master/index.js#L243-L245 - and this is not the only function to do so. If you have to add a writer in order to add an authorised seeder, it becomes a limit. It would work fine for a small number of writers, but would not scale well.

@max-mapper
Author

If you have to add a writer in order to add an authorised seeder, it becomes a limit. It would work fine for a small number of writers, but would not scale well.

Ahh I see what you mean. I'm not sure the number of writers/seeders is something that needs to be scaled up very high, probably supporting <100 would be fine for most use cases. I can't imagine a use case where you would need that many trusted hosts, you might as well just run in the opt-in mode at that point. Other limits would probably be hit first, such as the memory overhead of creating that many discovery-swarm instances.

Implementation wise I would imagine one could cache the set of seeds as separate HyperDB key/values, making sure to sync the hyperdb writer keys into your seeds list when they are added or removed, but also allowing for manual management of non-writer seeds. I imagine this would avoid the potential perf issues you mention.
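A rough sketch of that caching idea, assuming hyperdb's put/list/authorize API - the /seeders/ prefix and the habit of mirroring writers into it are conventions invented here, not anything hyperdb defines:

```js
// Sketch only: the '/seeders/' prefix is an invented convention.
const hyperdb = require('hyperdb')
const db = hyperdb('./my.db', { valueEncoding: 'json' })

function addSeeder (db, publicKey, cb) {
  db.put('/seeders/' + publicKey.toString('hex'), { added: Date.now() }, cb)
}

// When authorizing a new writer, mirror it into the seeders list as well
function addWriter (db, publicKey, cb) {
  db.authorize(publicKey, function (err) {
    if (err) return cb(err)
    addSeeder(db, publicKey, cb)
  })
}

// Readers only have to list one prefix to learn the trusted hosts
function getSeeders (db, cb) {
  db.list('/seeders/', function (err, nodes) {
    if (err) return cb(err)
    cb(null, nodes.map(function (node) {
      if (Array.isArray(node)) node = node[0] // ignore conflicting values for brevity
      return node.key.split('/').pop()
    }))
  })
}
```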

@RangerMauve
Contributor

Dat would act the same way HTTPS acts today, except there would be a set of writers that would be trusted, rather than 1 host. So it would be more resilient than HTTPS but by no means would I describe dat as a censorship resistance tool (which is also a difficult problem related to anonymity).

That's really surprising to hear given the homepage mentions that Dat is geared to be peer-to-peer and you yourself retweeted something where Dat was being used to circumvent censorship by other platforms.

I've been trying to sell people in my area on the p2p aspect of Dat. Are there no alternatives for improving privacy that don't limit p2p connections?

@max-mapper
Author

That's really surprising to hear given the homepage mentions that Dat is geared to be peer-to-peer and you yourself retweeted something where Dat was being used to circumvent censorship by other platforms.

I should probably add a 'RTs aren't endorsements' disclaimer to my personal account then. I've definitely never described Dat as a censorship resistance, anonymity or piracy tool, or advocated for its use as such. Being p2p does not imply any of the above features. In my opinion we must strive for better privacy than the existing p2p ecosystems out there that disregard it. Part of the principles of the modern web is building privacy into protocols in the post-mass-surveillance era.

I've been trying to sell people in my area on the p2p aspect of Dat. Are there no alternatives for improving privacy that don't limit p2p connections?

I'm not saying Dat isn't p2p or that we should get rid of p2p, just that if you ignore the privacy problems with unrestricted p2p connections you are throwing user privacy out the window. Two alternatives I have looked into are instructing users to use VPNs, or running Dat over Tor.

@RangerMauve
Contributor

I'm 100% in agreement about the privacy focus, but I believe that limiting the p2p aspect and limiting who can seed will be worse for the network in the long run. Right now there might not be that many people making use of Dat, and we're not seeing that much load. But if it's going to replace HTTP on the web, having it scale with the number of peers will make the adoption more smooth.

I don't think that trusting writers is enough to guarantee privacy. That's the model we have now on the web, and it's really not good enough. If an adversary is monitoring who is accessing a given dat, they can analyze your traffic to see if you're connecting to one of the writers for it. It also makes it easier to DoS the writers for a given Dat to take the data down. Lastly, it means that all peers are going to be connecting to the writers rather than being mixed up in a network, so the writers now have more ability to analyze all the peers in the network. This change prevents random peers in the network from analyzing what you're reading, but it sacrifices network resilience and scalability and doesn't prevent malicious actors that know the writers' IPs (or have control of the writers) from analyzing your traffic.

Even though I agree with the concerns you have about anonymity being misused by bad actors, mixnets seem to be the safest way to prevent malicious actors from analyzing your traffic and tracing users back to an IP. i2p would be a good way to go because unlike Tor, they force all nodes to participate in the routing and have resulted in better speeds for long-running nodes.

What threat model are you concerned with when you talk about user privacy? Personally, I think that nobody should be able to trace peers accessing a dat, not even the writers. Anything less will expose vulnerable users.

@max-mapper
Author

max-mapper commented Jul 20, 2018

What threat model are you concerned with when you talk about user privacy?

I'm specifically concerned about reader privacy. Consider if Wikipedia switches to Dat. On Dat, if a user is reading a Wikipedia article, they have no guarantee that what they are reading is private. Today with HTTPS they have to trust Wikimedia to keep their privacy safe. If someone is monitoring that user's traffic, they can only see that the user is on Wikipedia, they can't tell which article they are reading. With Dat today, the entire world can watch whatever bytes of whatever files you are reading or sharing, which we should all agree is a bad default for privacy. I'm just advocating for a safer default in Dat that works like the web works today but doesn't change what Dat is. You are advocating for a different safe default, and I respect and understand your position.

mixnets seem to be the safest way to prevent malicious actors from analyzing your traffic and tracing users back to an IP.

HTTPS accomplishes this as well, as long as a malicious actor is not in control of the server. For cases like the Snowden mass surveillance revelations (ISP level MITMs etc), HTTPS security protects many many people compared to HTTP. For cases where an attacker specifically targets you, or you get subpoenaed to relinquish your server logs, you can't trust the server any more and HTTPS doesn't protect you from them.

i2p would be a good way to go because unlike Tor, they force all nodes to participate in the routing and have resulted in better speeds for long-running nodes

I have not personally benchmarked i2p implementations in node and also have not seen anyone else do so, but the only numbers I can find on this are that i2p maxes out at about 200 KB/s per socket. Much faster than the average Tor socket, yes. But this would still mean making Dat about 150 times slower per socket than the UTP/TCP sockets we use today (30 MB/s).

Personally, I think that nobody should be able to trace peers accessing a dat

This means Dat would become an anonymity and censorship resistance tool in addition to a peer to peer filesystem. I am not opposed to all of these things being supported by Dat. The question up for debate seems to be what functionality gets turned on by default, as they all have significant tradeoffs.

I believe that limiting the p2p aspect and limiting who can seed will be worse for the network in the long run... But if it's going to replace HTTP on the web, having it scale with the number of peers will make the adoption more smooth

On an engineering level, I don't think it would be possible to bolt an anonymity transport layer onto Dat and make it anywhere close to fast without lots of effort. For example Tor relies on chained TCP sockets that hop through their routed network to an exit node. To have p2p connections work for >50% of users you need UDP for hole punching (TCP simultaneous open has about 1/3rd the success rate of UDP hole punching last time I checked). So we can't even attempt to use the Tor protocol over our own hybrid p2p network to increase speed, because it has no message-based network API.

So IMO running Dat over Tor isn't really peer to peer, because all of your connections between potentially fast hosts have to get routed through Tor TCP chains to exit nodes, so nearly all of the bandwidth advantages you get from direct p2p connections go away. I'd argue this would be more of a degradation in user experience for most people than restricting downloads to writers only. And having users opt in to being exit nodes puts them at even more risk than running Dat or BitTorrent on a 'naked' connection today. I believe i2p is similarly high level to Tor, making it impossible to do hole punching.

For me, the goal of Dat has always been to take the web we have today and 1) put users in control of their data, allowing them to sync their data offline or to other places on the web, which directly combats the vendor lock in we have today with fb/twitter/google etc, 2) make bandwidth cheaper by allowing for more distributed network topologies for distributing content than what we have today with everyone paying by the MB to CDNs and google/amazon cloud to host their data to millions of users from a handful of data centers, when many of those members of the network might have fiber upload, and 3) use content addressability and signing on web content to improve the state of web archiving and permanence.

If we can achieve those goals without degrading the privacy of today's HTTPS web I think that would be a huge upgrade to what we have now. But I guess I am dubious that tunneling all Dat connections over an onion routing network will be able to achieve no. 2 above, as it seems to inherently throw away the huge peer to peer bandwidth advantages. There is also the separate issue of the 'silver platter' above, which weighs heavily on my conscience in light of the recent political climate.

As an aside, I really appreciate the discussion so far from all the participants in this thread.

@pfrazee
Contributor

pfrazee commented Jul 20, 2018

My 2 cents --

  • I think an optional whitelist/blacklist is a good idea. It does degrade connectivity, but I think that's a choice the user can make for themselves.
  • For privacy, right now hypercore-proxy (a proxy on the hypercore protocol) gets the best results. It sacrifices very little performance or connectivity (and can even improve connectivity, since our NAT punching is a work in progress).
  • I consider the overlay networks a pretty significant effort which we should be researching and discussing, but not prioritizing over other options that we know we can implement. I think it'd be a mistake not to implement the other two approaches (white/blacklists & proxies) because we're busy trying to make an overlay work.

I think there's general consensus that reader privacy is important. The debate is about the mechanisms and their tradeoffs.

@max-mapper
Author

Related, I found this really great paper from TU Delft (maybe @mafintosh knows the researchers...) that discusses an approach to reimplementing Tor over UDP for higher bandwidth connections. The takeaway for me is that to make a Tor/Dat hybrid fast we would need to not only reimplement Tor from the ground up to be message-based like they did (and then get it audited etc), but we'd need some sort of reputation system like the BarterCast thing they mention to ensure attackers can't sybil with slow nodes to kill throughput.

"BarterCast is an epidemic protocol in which peers broadcast real-time upload and down- load statistics to others they know."

We have talked about these things before during Dat's development and it's a very, very hard problem to balance the advantages of gossiping metadata for better peer routing against the latency and bandwidth costs of the gossip protocol itself.

https://repository.tudelft.nl/islandora/object/uuid:997890d1-4141-4597-92eb-3dbaa4dc44a1/datastream/OBJ/download

@max-mapper
Author

For privacy, right now hypercore-proxy gets the best results

@pfrazee Just to clarify the use case here, would it be accurate to liken this approach to a VPN? E.g. you have a publicly accessible server somewhere that you tunnel your traffic through, thereby obscuring your IP and making your connections always work? If so, with this approach, in order to protect their reader privacy, a user would have to acquire a proxy and configure their Dat client to use it, so I'd classify it as an "opt-in" mechanism.

@RangerMauve
Contributor

RangerMauve commented Jul 20, 2018

What do you think about allowing dats to opt-into whitelists but keep them open by default?

A lot of users' data isn't Wikipedia, and casual users are likely going to have dats for stuff like their fritter profile which they won't have backed up on a cloud provider.

Also, I don't think that "write access" is the best way to signify that a peer can be trusted to replicate data since I wouldn't want to give something like hashbase write access despite trusting it to seed my data.

It's been discussed before that there should be a "web of trust" for dats.

Maybe on top of writers, a dat could have a public key used for "signing" ids of peers that are allowed to seed.

That way, you could opt-into only allowing your dat to be seeded by trusted parties by adding the public key and making sure that the trusted parties have some sort of token signed by the key. It doesn't mean they can write, necessarily, but it means they can be trusted for hosting the data.

I also like the idea of using hypercore-proxy as a sort of VPN for users to hide their origins from sources.
Kinda tangential, but I was thinking that proxies could advertise themselves on the DHT and a peer could look them up and potentially form routes through them kinda like an onion router. Proxies could also be used for supporting browsers that can't support extensions needed for talking to dat normally. Kinda what I've been working on with discovery-swarm-stream

@max-mapper
Author

max-mapper commented Jul 20, 2018

It's been discussed before that there should be a "web of trust" for dats. Maybe on top of writers, a dat could have a public key used for "signing" ids of peers that are allowed to seed.

This is what I meant by the "manual management of non-writer seeds" discussion above. IMO "web of trust" is a great way to describe this functionality. You as the original author of the Dat are constructing a set of trusted nodes. If users trust you, they trust who you trust. When you mark a set of keys as "preferred seeders", clients are taking your word that that set of preferred seeders will respect their privacy.

A client can choose to disregard your preferred seeders and venture beyond to anyone else who has a copy. But to get good privacy in the network, IMO, there needs to be 1) general adoption of this "preferred seeders" whitelist option by dat creators making it easy to use and understand and 2) a default in the clients that opts-in the user to using it.

Maybe a middle ground we could start with would be: If a Dat author enables "privacy" mode, then clients respect it by only connecting to the seed list that the author specifies. But if a Dat author does nothing, it continues to work like it does today, with no public privacy. This is similar to the "HTTPS everywhere" debate that's been happening. Beaker could even mirror HTTPS privacy policies such as requiring "privacy" mode to allow JS access to the webcam or other privacy sensitive APIs in web apps.

I also like the idea of using hypercore-proxy as a sort of VPN for users to hide their origins from sources.

I also want to voice my support for this feature, but I also recognize that despite what VPN companies say about security and privacy, it is an incredibly shady industry, and it also costs $ and time to set up (meaning only savvy users will use it).

Edit: sorry, meant to quote this too:

What do you think about allowing dats to opt-into whitelists but keep them open by default?

I think that could work to start, given that if the author opts in, it opts the user in as well. The user can still explicitly opt out if they want.

@marcelklehr

Re @maxogden

For me, the goal of Dat has always been to take the web we have today and 1) put users in control of their data, allowing them to sync their data offline or to other places on the web, which directly combats the vendor lock in we have today with fb/twitter/google etc, 2) make bandwidth cheaper by allowing for more distributed network topologies for distributing content than what we have today with everyone paying by the MB to CDNs and google/amazon cloud to host their data to millions of users from a handful of data centers, when many of those members of the network might have fiber upload, and 3) use content addressability and signing on web content to improve the state of web archiving and permanence.

To me (as a lurker on this repo up to now -- hi everyone 👋), restricting sources to only the writers of a Dat kind of sounds like a large trade-off at the expense of point 2). How will bandwidth get cheaper for a publisher if their peers are going to be the main data sources for the majority of users? Will this not drive publishers to solutions similar to the situation today, with large data centers that are necessary to handle the load? (Inadvertently, at the same time this could compromise integrity as @RangerMauve noted, since the publisher then would need to give write access to the hoster, as I understand it.)

PS: I also find this particularly interesting. It's one of the major problems.

@pfrazee
Contributor

pfrazee commented Jul 20, 2018

@maxogden that's accurate. Hypercore proxies are somewhat similar to a peer whitelist, except that when your proxy doesn't already have a dat, you can command it to go get it.

About the proposal: I think what you're suggesting is that we reduce the seeder-set by having the owner authorize their seeders with a signed whitelist. (Based on my quick skim of the proposal) the owner could authorize multiple seeders without giving away control of the site. It's not so much that "writers" are seeding; rather, it's peers that are appointed by the writer.

My take:

  • It doesn't eliminate the privacy concerns but (if it works as intended) it does reduce the surface area, so there is some benefit.
  • It pushes a configuration task onto the publisher, which might not be too onerous, but does add to their workload.
  • It can make archival harder because if the writer loses their private key, they won't be able to authorize new seeders.
  • You lose any potential for the network to horizontally scale by adding seeders automatically.

(Reading now your most recent response --)

I don't see any harm in adding the ability for the DHT to have "preferred seeders" and then a client could choose to limit their connections to those peers. Any time I'm unsure if a solution is right, I prefer to use an approach that's easy to discard if it fails.

@RangerMauve
Contributor

RangerMauve commented Jul 20, 2018

So what about the following knobs:

For creators

  • If you're publishing a private dat or something you expect to be seeded by unprivileged peers (your social circle), don't have a whitelist
  • If you're publishing a public dat or want to limit who can see who's reading your data, enable whitelisting and authorize peers to seed (similar to authorizing writes, but doesn't start tracking their hypercores) with an API to authorize seeding along side the write authorization

For consumers

  • Look at the dat's metadata to see whether it's using a whitelist for peers before replicating. You'll be downloading all the metadata anyways, so it's probably safe to get it from any peer.
  • Enable higher privacy to only read content from writers/authorized parties. Metadata will probably still need to be downloaded from untrusted peers for the initialization.
  • Maybe settle on a standard for proxying discovery-swarm or hypercore-protocol as part of a DEP to make it easy for adding a proxy?

@pfrazee
Contributor

pfrazee commented Jul 20, 2018

@RangerMauve I think for this proposal to work, it has to be constrained to the discovery/DHT. So, the whitelist would be exposed by merit of having signed peers in the dht.

@max-mapper
Author

max-mapper commented Jul 20, 2018

It pushes a configuration task onto the publisher, which might not be too onerous, but does add to their workload.

The idea of using 'writers' is to automate this task. Rather than requiring separate management of seeders and writers, I figure we can adopt the default of simply combining them, but allowing separate management if desired. If it's a default, it's more likely to be used (e.g. it's the security mantra of: if it's not on by default, nobody will use it).

It can make archival harder because if the writer loses their private key, they won't be able to authorize new seeders.

This is a general problem we need to solve anyway (like how keybase does multi device management)

You lose any potential for the network to horizontally scale by adding seeders automatically.
How will bandwidth get cheaper for a publisher if their peers are going to be the main data sources for the majority of users?

This proposal is designed to sit between two extremes. One extreme is HTTPS today where you have 1 authority, usually one server (or more if the SSL cert holder load balances to other hosts, but that's hard to set up and usually it's one owner providing a service, and a different delegated trust model). On the other extreme is unrestricted P2P, where anyone can upload, but which suffers from the privacy issues above.

This proposal means rather than 1 host, you can curate a distributed web of trust to share the load. It's a kind of distributed load balancer. So it still offers significant horizontal scalability and bandwidth commodification over 1 host hosting. And it can still "fall back" to unrestricted p2p if users opt out of privacy.

Metadata will probably still need to be downloaded from untrusted peers for the initialization.

I don't think we should leak any metadata either, because if you are just downloading 1 file, you only need to get the metadata corresponding to that file, so that request leaks your reader privacy as well.

If you're publishing a public dat or want to limit who can see who's seeding your data, enable whitelisting and authorize peers to seed (similar to authorizing writes, but doesn't start tracking their hypercores) with an API to authorize seeding along side the write authorization

In addition to this, would it work for everyone to automatically copy any writers to this list as well? Or are there specific objections to that?

@RangerMauve
Contributor

@pfrazee So, instead of using the usual announce on the mainline DHT, discovery-channel would need to start using BEP 44 to put arbitrary payloads?

How would a DHT-based approach work with MDNS / DNS-discovery in general? IMO, having it be part of hyperdrive / hyperdb would make it easier to understand for users since it'd be like "authorizing" for a write, but with fewer privileges.

@max-mapper
Author

Clarification: The only metadata I think you should be able to get from untrusted sources is the list of writer keys.

Also, I would be OK with an opt-in option for dat creators that they have to turn on for the repository to run in 'public privacy' mode. But once on, it automatically adds new writers to the preferred seeds list.

@RangerMauve
Contributor

I don't think we should leak any metadata either, because if you are just downloading 1 file, you only need to get the metadata corresponding to that file, so that request leaks your reader privacy as well.

I was under the impression that peers always downloaded the full metadata hypercore in order to get the latest changes (regardless of sparse mode) and that hyperdb authorization was part of its metadata.

In addition to this, would it work for everyone to automatically copy any writers to this list as well? Or are there specific objections to that?

I'm 100% behind that. I don't see why someone would have write access but without the ability to seed.

@pfrazee
Contributor

pfrazee commented Jul 20, 2018

The idea of using 'writers' is to automate this task. Rather than requiring separate management of seeders and writers, I figure we can adopt the default of simply combining them, but allowing separate management if desired. If it's a default, it's more likely to be used (e.g. it's the security mantra of: if it's not on by default, nobody will use it).

Yeah that's sensible, I just wanted to surface that fact.

It can make archival harder because if the writer loses their private key, they won't be able to authorize new seeders.

This is a general problem we need to solve anyway (like how keybase does multi device management)

True, but in this case you're exacerbating the problem because you lose the option for a lost-key archive to persist in a readonly form.

This proposal means rather than 1 host, you can curate a distributed web of trust to share the load. It's a kind of distributed load balancer. So it still offers significant horizontal scalability and bandwidth commodification over 1 host hosting. And it can still "fall back" to unrestricted p2p if users opt out of privacy.

Fair enough!

@pfrazee So, instead of using the usual announce on the mainline DHT, discovery-channel would need to start using BEP 44 to put arbitrary payloads?

@RangerMauve I'm 99% sure we're going to move away from the mainline DHT permanently and create our own so that we can add features and fix issues (like the key length truncation). That said, I'm not well-versed in the details of those implementations.

I was under the impression that peers always downloaded the full metadata hypercore in order to get the latest changes (regardless of sparse mode) and that hyperdb authorization was part of its metadata.

It's actually possible to download the metadata in sparse mode and use pointers within the metadata to download only what's needed.

But either way, with reader privacy, you wouldn't want to download any metadata at all prior to choosing the peers you wish to communicate with.

@max-mapper
Author

I'm 99% sure we're going to move away from the mainline DHT permanently and create our own so that we can add features and fix issues (like the key length truncation). That said, I'm not well-versed in the details of those implementations.

@mafintosh and I agreed to this like a year or two ago but have not gotten around to it. There are indeed a number of issues with the mainline DHT that deserve more discussion in another thread. Fun fact, 3 years ago now we both flew down to California to visit Juan from IPFS specifically to try to use their DHT, and could never find a way to integrate it, so we added bittorrent-dht instead. We really didn't ever want to use Mainline DHT but it was the only thing we could get working, and it has been in there ever since.

It's actually possible to download the metadata in sparse mode and use pointers within the metadata to download only what's needed.

Yes I think this will become the default in the future, especially for name resolution/DNS-like use cases like NPM on Dat.

True, but in this case you're exacerbating the problem because you lose the option for a lost-key archive to persist in a readonly form.

If i'm understanding correctly, couldn't you just "peg" your client version to an older trusted version if you disagree with the writes one of the other keyholders has made?

@max-mapper
Author

discovery-channel would need to start using BEP 44 to put arbitrary payloads?

Yes we could switch the discovery-channel API pretty easily to only allow announcing buffers, and switch the underlying mechanism to BEP44 for bittorrent-dht. MDNS and dns-discovery already support buffers.
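For reference, bittorrent-dht already exposes BEP44 mutable puts, so a signed announce could look roughly like the sketch below. The value layout (a plain host:port string) is an assumption, and a real implementation would have to respect BEP44's ~1000-byte value limit and its sequence-number rules.

```js
// Rough sketch of a BEP44 mutable put carrying a signed IP:PORT announce.
const DHT = require('bittorrent-dht')
const sodium = require('sodium-native')

// bittorrent-dht needs a verify function for validating mutable records
const dht = new DHT({ verify: sodium.crypto_sign_verify_detached })

function announce (publicKey, secretKey, host, port, cb) {
  dht.put({
    k: publicKey,                       // 32-byte ed25519 public key of a writer
    seq: Math.floor(Date.now() / 1000), // simplistic sequence number
    v: Buffer.from(host + ':' + port),  // illustrative payload layout
    sign: function (buf) {              // sign the encoded record
      const sig = Buffer.alloc(sodium.crypto_sign_BYTES)
      sodium.crypto_sign_detached(sig, buf, secretKey)
      return sig
    }
  }, cb)
}
```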

@RangerMauve
Contributor

How do you know which metadata to download in order to know which peers are authorized? At the moment, adding a writer to hyperdb appends a block at the end. Though you could find all the other feed IDs from that.

Alternately, what data are you publishing on the DHT that can be trusted to have been created by a writer of the archive? Something like <id, the id signed by key holder, ip , port>?

It would be impossible to detect whether a whitelist should be used or not if there's no additional metadata being downloaded somewhere.

@RangerMauve
Contributor

Maybe the DHT could hold <id, useWhitelist? (optional, signed), amSeeder (optional signature of id), ip, port>?

@max-mapper
Author

max-mapper commented Jul 20, 2018

How do you know which metadata to download in order to know which peers are authorized? At the moment, adding a writer to hyperdb appends a block at the end. Though you could find all the other feed IDs from that.

Above I describe a possible implementation where you store preferred peers as a separate set of hyperdb keys, making it easy to just grab those keys without doing complicated traversals.

It would be impossible to detect whether a whitelist should be used or not if there's no additional metadata being downloaded somewhere.

In the original post I suggest having a separate swarm for every key. So the original swarm for writer 1 gets created as usual. Then you request the set of preferred seeds and create a swarm for each one. If you get none back you would not use the "privacy" mode.

So I think the discovery payload just needs to be <signed IP:PORT>. Edit: specifically, a libsodium attached signature.
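A minimal sketch of what that attached-signature payload could look like with sodium-native; the host:port string encoding is just for illustration:

```js
// Attached signature over an IP:PORT string, as suggested above.
const sodium = require('sodium-native')

function signPeer (host, port, secretKey) {
  const message = Buffer.from(host + ':' + port)
  const signed = Buffer.alloc(sodium.crypto_sign_BYTES + message.length)
  sodium.crypto_sign(signed, message, secretKey) // attached: signature + message in one buffer
  return signed
}

function openPeer (signed, writerKeys) {
  const message = Buffer.alloc(signed.length - sodium.crypto_sign_BYTES)
  // Accept the payload only if some writer key verifies it
  const ok = writerKeys.some(pk => sodium.crypto_sign_open(message, signed, pk))
  if (!ok) return null
  const parts = message.toString().split(':')
  return { host: parts[0], port: Number(parts[1]) }
}
```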

@RangerMauve
Contributor

RangerMauve commented Jul 23, 2018

Sorry for not replying! Went into weekend mode shortly after you posted that. 😅

Above I describe a possible implementation where you store preferred peers as a separate set of hyperdb keys, making it easy to just grab those keys without doing complicated traversals.

I think pfrazee was really into the idea of keeping this information at the discovery-swarm level in order to avoid even connecting to untrusted peers in the first place. Having it be part of the protocol will avoid revealing metadata about which data you're looking at, but it will reveal that you are looking at the dat.

I think it will also complicate the protocol somewhat if there's a set of keys that are separate from the rest of the hyperdb.

In the original post I suggest having a separate swarm for every key.

I'm not sure what the benefit is in having separate swarms here. Anybody could start announcing on a given swarm key and you wouldn't know if they're legitimate or not until you connected to them. Plus, this would add a lot of overhead per dat.

I think that the DHT approach wouldn't require too many changes and would also scale for different data formats without changing the replication protocol.

The flow I'm thinking is (a rough validation sketch follows the list):

  • User creates a dat, with the goal of making it higher privacy.
  • Upon creation they set some sort of flag somewhere to keep track of this setting.
  • Publishing on the DHT requires including this flag along with their id/port/ip
  • The flag will contain the signed preference (true / false). This should be replicated with the data
  • User announces on the DHT for the discovery key with <their id, their ip / port, their ip/port signed by their ID, their id signed by the dat key, the signed privacy preference>
  • A peer which will soon become a seeder searches the DHT, sees the item, and takes note that the dat is private. They know this because the preference is guaranteed to have been created by the owner. They also see that the ID was signed by the owner
  • They replicate the data
  • The user then authorizes the peer as a seeder or writer
  • The peer sees this because their id is now part of the seeders list in hyperdb, signed by the creator
  • They start announcing on the DHT with <their id, their ip/port, their ip/port signed by their ID, their id signed by the creator, the signed privacy preference (signed by the creator)>
  • Other peers come and filter out any DHT entries where the privacy flag wasn't signed by the creator and thus avoid connecting to malicious parties
  • Other peers know which announced peers are valid because their ids are signed by the creator
  • Other peers never connect to an announced IP that isn't trusted
  • Seeders reuse the value of the signed flag because they cannot create the signature for the key, and no other ids can be found until replication begins.
  • If a dat doesn't need to be private, they will have that flag saved to be as such and peers won't need to validate the IDs of whoever announced to the DHT.
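A rough validation sketch for a single announced entry, following the flow above - the field names and the use of detached ed25519 signatures are assumptions made purely for illustration:

```js
// Illustrative only: field names and signature layout are assumptions.
const sodium = require('sodium-native')

function verify (sig, message, publicKey) {
  return sodium.crypto_sign_verify_detached(sig, message, publicKey)
}

// entry: { id, ip, port, ipPortSig, idSig, privacyFlag, privacyFlagSig }
// datKey: the dat creator's public key
function validateEntry (entry, datKey) {
  // 1. The privacy preference must have been signed by the creator
  const flag = Buffer.from([entry.privacyFlag ? 1 : 0])
  if (!verify(entry.privacyFlagSig, flag, datKey)) return false
  if (!entry.privacyFlag) return true // open dat: connect to anyone
  // 2. The announcing peer's id must be signed by the creator (authorized seeder)
  if (!verify(entry.idSig, entry.id, datKey)) return false
  // 3. The ip/port must be signed by that peer's own id
  return verify(entry.ipPortSig, Buffer.from(entry.ip + ':' + entry.port), entry.id)
}
```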

Some problems with this approach that I can think of right away:

  • What happens when the original creator's key is lost and they can't sign new seeders?
  • Are all these signatures going to take too much data to fit on the DHT?
  • Should the creator be allowed to change their mind about the privacy? (probably no?)
  • What happens when there are no more seeders for a private dat? Does the data just disappear?

Edit: Also, I was talking to @mafintosh about having a fully encrypted DHT where peer IDs used "hash cash" on top of their public keys in order to make it expensive to generate IDs to participate in the DHT so that sybil attacks would take a lot more energy, and making sure communication is encrypted. Twitter thread about it

@max-mapper
Author

@RangerMauve @pfrazee so just to summarize, there are two approaches proposed so far:

    1. Use a hypercore protocol extension to let a client send an e.g. "WANT-PEERS" message asking for a set of "preferred keys"; they either receive a list of keys (that they will then limit connections to) or an empty list (that signals that they are allowed to connect to any peer -- the current behavior). This introduces no new signature schemes, as it reuses the existing verification mechanisms built into hypercore. For discovery, peers would subscribe to each key in the authorized peer list. The discovery format for the DHT backend changes to BEP44 with an attached <signed IP:PORT>. All peers discovered in these channels can be added to 1 swarm.
    2. Change the discovery value format in the layer above hypercore to introduce a new signature scheme to allow clients to cross reference a "seeders list in hyperdb" that they receive (over, I'm assuming, the hypercore protocol somehow) with the payloads they get from discovery that contain signed peer info, filtering out non-signed peers. For discovery, nothing changes on what they listen to (still subscribe to 1 key), but they filter out non-authorized responses.

Did I get that right? If so, seems like putting it all into the DHT protocol is more risky, because there is no "optional extension" mechanism there, it's just a buffer, so we'd have to support backwards compat on that protobuf if we ever change the scheme. Using a minimal signed payload in the DHT and then using hypercore protocol extension for the rest of the scheme seems like a better way to have something we can discard if it fails.
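To illustrate the client side of option 1, a rough sketch of how a client could react to the hypothetical "WANT-PEERS" response (the extension wiring itself is omitted, since the message name and transport are still just a proposal):

```js
// Hypothetical handler for the proposed "WANT-PEERS" response.
// `preferredKeys` is whatever list of keys the remote writer returns,
// `swarm` a discovery-swarm instance, `discoveryKeyFor` a key->channel helper.
function handlePeersResponse (preferredKeys, swarm, discoveryKeyFor) {
  if (!preferredKeys || preferredKeys.length === 0) {
    // Empty list: current behavior, connect to any peer in the main swarm
    return { restricted: false }
  }
  // Non-empty list: subscribe to one channel per authorized key and only
  // keep connections to peers whose announces trace back to one of them
  preferredKeys.forEach(key => swarm.join(discoveryKeyFor(key)))
  return {
    restricted: true,
    allowed: new Set(preferredKeys.map(key => key.toString('hex')))
  }
}
```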

@RangerMauve
Contributor

RangerMauve commented Jul 23, 2018

@mafintosh Totally agree that modifying the DHT will be a lot of effort. The reason I'm more into it, though, is that you can hide IP addresses more easily since you'd need to perform sybil attacks on the DHT near the discovery key to find the IPs rather than just announcing that you have the data.

Some questions:

  • How will you verify that the peers returned by whatever peer you connect to are valid? i.e. how does this work if there's a malicious node in the network that wants you to use the existing behavior?
  • Do you think the overhead of querying for more channels in the DHT and DNS will be negligible as the number of dats you're tracking grows? Say, to hundreds of channels?
  • Is the leaking of IP addresses for users looking for a given Dat OK within the "wikipedia on Dat" threat model?

@max-mapper
Author

max-mapper commented Jul 23, 2018

How will you verify that the peers returned by whatever peer you connect to are valid? i.e. how does this work if there's a malicious node in the network that wants you to use the existing behavior?

I'm assuming the trust model is that you trust all of the preferred peers (same trust model as trusting all writers, which is why writers should automatically become preferred peers). So as long as an existing writer signed the data you receive, you can trust it.

Is the leaking of IP addresses for users looking for a given Dat OK within the "wikipedia on Dat" threat model?

It's OK to know an IP is accessing a Dat swarm, it's just not OK to know what parts of it they are uploading or downloading. Edit: I don't really see how either of these scenarios is worse than the other about exposing the IPs of users looking to download data - in both, that IP is only ever exposed to the signed preferred peers. I also don't consider data in the DHT private, because sybils and bad 'closest' routing logic in random peers result in lots of ways to discover keys. If you look at inbound discovery-channel debug output on any Dat you occasionally see random non-dat peers that discovered you on the DHT, even if your dat key has never been shared with anyone - I think because when you announce, other peers re-share the key you queried.

Do you think the overhead of querying for more channels in the DHT and DNS will be negligible as the number of dats you're tracking grows? Say, to hundreds of channels?

Querying lots of channels should be OK; the only real overhead, since it's all stateless, is in the JS object memory footprint area, which is very optimizable. I misspoke last week when I mentioned multiple swarms, I meant to say multiple channels.

@RangerMauve
Contributor

I'm assuming the trust model is that you trust all of the preferred peers

Cool, so then the empty peer list message should be signed by the creator and peers should make sure to save it for when they handshake?

I guess each new writer/seeder will need to have a signed message of a previous writer and the entire chain will need to be transferred.

I get what you mean about the IP security. Sybil attacks are pretty annoying.

@max-mapper
Author

Cool, so then the empty peer list message should be signed by the creator and peers should make sure to save it for when they handshake?

Yea I was thinking if it's stored in hyperdb then that ensures it's saved, and that would also provide a mechanism to replicate those keys and sign them (and verify the signatures).

I guess each new writer/seeder will need to have a signed message of a previous writer and the entire chain will need to be transferred.

A good @mafintosh question, I believe he told me once an existing writer can add a new writer and then a new random peer would have a mechanism to check if what they received from someone was signed by one of the writers. Maybe we could piggy back on that (not sure about implementation there).

I get what you mean about the IP security. Sybil attacks are pretty annoying.

Agreed, sounds like you have lots of cool ideas for DHT improvements, maybe a new thread to discuss improving the DHT and anonymizing IPs etc would be worthwhile.

@RangerMauve
Contributor

that would also provide a mechanism to replicate those keys and sign them (and verify the signatures)

Will it be a separate hyperdb from the main one? Or will it be under a prefix?

an existing writer can add a new writer and then a new random peer would have a mechanism to check if what they received from someone was signed by one of the writers

The way adding new writers works, AFAIK, is that an existing writer adds another writer's key to the feeds property of their latest messages, as described in the DEP.

This means that getting the list of writers requires getting the latest block from each writer's feed.

I suppose the flow could look like:

  • As a writer: when adding a seeder, add their key to your seeders array in your hyperdb feed
  • As a user trying to find data for a dat:
  • Find peers for the dat
  • Connect to one of them to get the latest block that they have
  • Find the list of writers from them, try to get more of the latest blocks
  • From that find out what the set of latest writers is and whether the dat requires trusted peers

Originally I thought this would be problematic if the remote peer attempted to hide blocks from you that contained information about writers and seeders, but if they do that they will be setting the upper limit on which blocks the user will bother fetching.

Regardless, I think the flag for setting the dat as "requiring trusted seeders" should be set in the first few blocks of the feed. Preferably in the header.

The question now is, should we limit only hyperdb-based data structures to have the ability for additional privacy, or would it also be useful to have something for raw hypercores?

@max-mapper
Author

Will it be a separate hyperdb from the main one? Or will it be under a prefix?

Was thinking the easiest way would just be a prefixed set of keys.

Regardless, I think the flag for setting the dat as "requiring trusted seeders" should be set in the first few blocks of the feed. Preferably in the header.

I was thinking rather than a flag you'd use a new protocol message like WANT-SEEDS and based on the response know if you're supposed to limit peer connections or not. Not sure the flag is necessary.

As a writer: when adding a seeder, add their key to your seeders array in your hyperdb feed

First you'd need to add the key as a writer, then copy it to your seeders array. I think above we agreed writers get copied in, but the seeders list should be separately managed so that you can have a non-writer seeder etc.

From that find out what the set of latest writers is and whether the dat requires trusted peers

This is where you'd send a 'WANT-PEERS' message, and based on the response you'd know the latest seeders and that implies the answer to whether the dat requires trusted peers.

I imagine in the future some of those steps can be removed if writers gets cached. I also imagine @mafintosh would come up with a better mechanism to store the list of seeders and ensure their integrity is checked. But it sounds like we're in general agreement about the API. Maybe we need another thread with a more concrete proposal now.

@alancwoo

alancwoo commented Sep 3, 2019

In echo of this conversation (specifically with regards to GDPR), I was hoping to use DAT for a project to distribute academic research/materials (also intended to be offline accessible/searchable with a local sqlite db) for an institution based in Germany.

We ran into some legal issues where, with regards to GDPR, if we were to utilize DAT, we would be unable to explicitly provide identity or assign responsibility to other peers in the swarm who would have access to users' IP addresses.

Having read the post from 2016 on https://blog.datproject.org/2016/12/12/reader-privacy-on-the-p2p-web/ it appears that there needs to be a GDPR-approved registry of trusted peers to allow for something like this, but am I correct in understanding that a lot of this is not yet possible and would still require a number of pieces to fall into place, such as client support or an update to the protocol?

It's a shame because, clearly as is frequently mentioned, P2P trades off to some extent with privacy, but GDPR has made it almost impossible to utilize DAT for a project it is so well suited for, and we are likely forced back to a centralized server or group of servers to host this data.

@pfrazee
Contributor

pfrazee commented Sep 3, 2019

@alancwoo We're working on (optional) authenticated connections in dat which would make it possible to whitelist who is allowed to connect and replicate the data. I'd expect it to land sometime during 2020.

@okdistribute

okdistribute commented Sep 3, 2019

You can do this today.

If you are building a custom tool you could already pass in a whitelisted group of peers through discovery-swarm's options (whitelist here: https://github.com/mafintosh/discovery-swarm/#var-sw--swarmopts)
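For example, something along these lines (addresses and channel name are placeholders; see the linked README for the exact option semantics):

```js
// Sketch of the discovery-swarm whitelist option mentioned above.
const swarm = require('discovery-swarm')

const sw = swarm({
  whitelist: ['203.0.113.5:3282', '203.0.113.6'] // only connect to these peers
})

sw.listen(3282)
sw.join('my-dat-discovery-key') // normally the dat's discovery key

sw.on('connection', function (connection, info) {
  console.log('connected to whitelisted peer', info.host, info.port)
})
```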

If you want to use the commandline tool, there is some low hanging fruit here to allow it to accept some options for whitelisting peers, a related issue is here: dat-ecosystem/dat#1082... PRs totally welcome

@martinheidegger
Contributor

martinheidegger commented Sep 4, 2019

@alancwoo

We ran into some legal issues where, with regards to GDPR, if we were to utilize DAT, we would be unable to explicitly provide identity or assign responsibility to other peers in the swarm who would have access to users' IP addresses.

Can you point to the section of the GDPR you are referring to? Generally speaking, in DAT there is no private data being communicated. Technically speaking: the clients join a DHT network which is - much like a router - a public network. In this public network the clients send a public request for data, and entries in the network that have this data forward it. There is - however - no private request or data processing happening. I would be really curious - in detail - what you are referring to.

@okdistribute

@martinheidegger the person clearly stated that IP addresses are the issue

@martinheidegger
Contributor

I would be grateful for more references. We have to deal with GDPR as well and we have not found any issues regarding the IP addresses. Maybe we are overlooking something but I can't find the context in which the use of IP addresses in DAT would violate the GDPR.

@okdistribute

Within GDPR, the EU includes IP addresses as “Personal Identifiable Information” potentially subject to privacy laws. There are a number of articles about this.
Although I've heard that many organizations are currently flying under the radar as regulators are focusing on the big fish with millions of daily users.

@martinheidegger
Contributor

@alancwoo mentions:

explicitly provide identity or assign responsibility to other peers

That seems to mean that private, personally identifiable data is leaked or distributed to unknown third parties. The GDPR does cover the collection of IP addresses as private data, but I don't think this applies here. In any router software, IPs and their packets need to be stored and passed on to other routers as part of the protocol, without the ability to look inside the packet content (given it's HTTPS). That is pretty much how DAT works, if I am not mistaken. If data were exclusively shared between two parties, DAT would just need to encrypt it, and any router in between would not see its content - which is what ciphercore does. Another angle I could see is that IPs are used to track which person has interest in which resource, which - if stored - would give a means of tracking insights. But as @alancwoo mentions this is covered by law (you may not do that). I am very interested in this subject and I would really like to have a reference to the actual issue.

@alancwoo

@martinheidegger our legal advisor mentioned:

Art. 13-14 GDPR require certain transparency information to be provided by data controllers. Since peers process the IP-Address they may have to obey these requirements, (and thus also the requirement to reveal their legal identity).

The DAT-Protocol itself does not (at least to my knowledge) provide a way to make this information available through the protocol. All it offers is to show the IP-Address of peers via the “swarm debugger”.

So I think the issue is that GDPR regulates IPs as private data, and thus any entity who is able to capture this information needs to be made legally identifiable to the user. As this is, from what I can imagine, just the nature of a peer network, the legal advisor mentioned the possibility of whitelisting identified/trusted peers, and perhaps running our own discovery server - but then I feel the issue would still remain that any peers on the network at the same time could expose IPs to one another and thus be in violation of GDPR.

I think the advisor's final solution is to protect the read key behind a registration form that forces the users to agree to a Privacy Policy/ToS and provide identification, so that anyone who technically has the read key has fulfilled this identification requirement - but I still wonder, because people can still pass the key around and so on.
