Federation #19

Open
10 tasks
victorb opened this issue Apr 25, 2019 · 7 comments
Labels
enhancement New feature or request

Comments

@victorb
Member

victorb commented Apr 25, 2019

I opened a preliminary PR (#10) for Federation, but it's probably best to go via an issue first, to better enable discussions around it. Here is what I've been thinking so far.

Old proposal: https://gist.github.com/victorb/82ace9e6fe7adf578833527b8b94f914

New proposal:

Open-Registry Federation

Summary

Open-Registry as a crowdfunded registry won't be able to reach the same scale
as the npm, Inc. registry without raising a significant amount of funds. What we
can do, however, is set up a federation of registries, which would significantly
lower our operating costs and also give users the benefit of faster performance
and local resource sharing.

The model of federation proposed here will decentralize the storage and
transfer of tarballs first, as it poses an easier way of getting started with
federation for Open-Registry.

Once implemented and used, we can start focusing on research about federated
publishing as well.

Motivation

  • Lower bandwidth/storage/hosting expenses
  • Faster performance for participants
  • Resilience
  • User Control

Constraints

  • Needs to handle npm namespace to be npm compatible (global + scoped packages)
  • Handles propagation of package updates
  • Anti-spam measures if needed
  • Cheap to run (Federated version needs to be lightweight)
  • Downloads metadata + tarballs on-demand
  • Space aware (never cause "out of space" state by itself)
  • Users should be able to benefit from federation by just changing the registry
    url (DNS/HTTP federation)
  • Users can benefit further by running federation software locally
  • Runs offline

Use Cases

  • Individuals can find closer mirrors
  • Teams can share the same mirror
  • Companies can deploy on-prem mirrors
  • Organizations depending on Open Source packages can help host packages
  • Registry will continue to work even though main mirror is down

Security

  • Malicious participants might join the federation and behave honestly,
    until they no longer do
    • Content-addressing helps address this specific issue
  • Tarballs are verified via content-addressing when downloaded, and popular
    clients (npm + yarn) check the checksums before extracting, so mutating
    served tarballs is hard without the client detecting it
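
To make the checksum point above concrete, here is a minimal sketch of the kind of check a client does before extracting. It assumes the expected value is an SRI-style string ("sha512-<base64>") like the integrity field in npm lockfiles. The function names are made up for illustration and are not part of any existing client.

```python
import base64
import hashlib

SUPPORTED = ("sha1", "sha256", "sha384", "sha512")


def make_integrity(tarball: bytes, algorithm: str = "sha512") -> str:
    """Build an SRI-style string ('<algorithm>-<base64 digest>') for some bytes."""
    digest = hashlib.new(algorithm, tarball).digest()
    return f"{algorithm}-{base64.b64encode(digest).decode('ascii')}"


def verify_integrity(tarball: bytes, integrity: str) -> bool:
    """Return True only if the bytes hash to the digest in the SRI string."""
    algorithm, _, expected_b64 = integrity.partition("-")
    if algorithm not in SUPPORTED:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    digest = hashlib.new(algorithm, tarball).digest()
    return base64.b64encode(digest).decode("ascii") == expected_b64
```

A tarball served by an untrusted peer would be passed through verify_integrity and rejected if even a single byte differs from what the metadata promised.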

Practical steps

Ok, so the working plan is the following:

  • Write a lightweight proxy to run locally
    • Should connect to set of bootstrap nodes run by Open-Registry, and can
      find other nodes via those bootstrap nodes
    • Runs a local HTTP proxy that fetches the right package when needed
      • Support metadata route
      • Support tarball route
      • Support index route
    • Otherwise runs in the background, potentially helping others find
      packages

This is the small, MVP version to ensure the idea is viable in the wild.

First step towards federation is having the metadata index centralized with
Open-Registry while tarballs can be served from anywhere and anyone.

The plan is to use ipfs-lite by @hsanjuan to start an embedded libp2p node that
will expose the traditional registry interface as HTTP endpoints.

The software will connect to the central registry to find out the latest root
hash and also listen for any changes, automatically updating its local pointer
when Open-Registry's pointer changes.

The root hash can be found in multiple different ways, depending on the
environment of the software.

The software will basically be a resolver for (packageName, packageVersion) =>
IPFS hash via its local proxy.
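
A rough sketch of that resolver idea, assuming the snapshot behind a root hash can be flattened into an in-memory mapping. The Resolver class, its methods and the placeholder hash strings are hypothetical. The real index layout is not decided in this proposal.

```python
from typing import Dict, Optional, Tuple

# Hypothetical flattened snapshot: (packageName, packageVersion) -> content hash.
Index = Dict[Tuple[str, str], str]


class Resolver:
    """Resolves a package version to a content hash under the current root hash."""

    def __init__(self, root_hash: str, index: Index) -> None:
        self.root_hash = root_hash
        self._index = dict(index)

    def resolve(self, name: str, version: str) -> Optional[str]:
        """Return the content hash, or None if this root does not know the package."""
        return self._index.get((name, version))

    def update_root(self, root_hash: str, index: Index) -> bool:
        """Swap in a newer snapshot; return False if the pointer did not change."""
        if root_hash == self.root_hash:
            return False
        self.root_hash = root_hash
        self._index = dict(index)
        return True
```

The local HTTP proxy would call resolve for every metadata or tarball request, and update_root whenever the central pointer changes.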

CLI interface

$ open-registry --federate
                --share
                --update-type=<http|dns|ipns|pubsub>
                --offline

--federate <multiaddr>   - Connect to an already running instance and use its
                           root hash.
                           Default: /dns4/npm.open-registry.dev/tcp/6736

--share                  - Enable other peers to connect to you and download
                           public packages.
                           Default: true

--update-type            - How to get the latest root hash from Open-Registry.
                           Default: http

--offline                - Don't do any connections, use last known root hash.
                           Default: false
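
For illustration, the flags above could be parsed roughly like this. This is only a sketch in Python using argparse. The actual tool does not exist yet and may be written in another language. Note that BooleanOptionalAction (Python 3.9+) also creates a --no-share flag, which is an assumption on top of the proposal.

```python
import argparse

# Default taken from the flag description above.
DEFAULT_FEDERATE = "/dns4/npm.open-registry.dev/tcp/6736"


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the proposed open-registry flags."""
    parser = argparse.ArgumentParser(prog="open-registry")
    parser.add_argument(
        "--federate",
        metavar="MULTIADDR",
        default=DEFAULT_FEDERATE,
        help="connect to an already running instance and use its root hash",
    )
    parser.add_argument(
        "--share",
        action=argparse.BooleanOptionalAction,  # also adds --no-share (assumption)
        default=True,
        help="let other peers connect and download public packages",
    )
    parser.add_argument(
        "--update-type",
        choices=["http", "dns", "ipns", "pubsub"],
        default="http",
        help="how to get the latest root hash from Open-Registry",
    )
    parser.add_argument(
        "--offline",
        action="store_true",
        help="make no connections, use the last known root hash",
    )
    return parser
```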

Example usage:

$ open-registry
Connecting to npm.open-registry.dev
Getting latest hash via HTTP over TLS
Started sharing downloaded public packages with others
Started HTTP server on http://localhost:6736 # mnemonic: "open" in T9
...
Currently connected to 3 peers
Upload/Download [current/total]: 32kbps/0kbps [3mb/7.3mb]

Pointing your package manager to http://localhost:6736 should now allow
you to download and install packages on-demand, while caching them and serving
them to other users who are trying to download them too.

Federation Protocol

When the federation software gets started on the user's device, it connects to
the main registry.

Once the connection has been established, it asks for the latest version of the
registry (just a pointer), and saves it for future use.

Concurrently, it starts an HTTP server locally.

Now the user can point their client to the local HTTP server.

Requests will be proxied via the latest root hash the federation software knows
about, and fetched data will be cached.

When the root hash of the main registry changes, the registry publishes it in
the following ways:

  • As a TXT record on npm.open-registry.dev in the format "registry-hash="
  • Under the property hash in the response to a GET request to
    npm.open-registry.dev
  • Via the topic npm.open-registry.dev on the libp2p network in use
  • (maybe) By updating the IPNS name that the main registry uses
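
As a sketch of the receiving side for the first two channels: the helpers below pull a root hash out of a list of TXT record strings and out of an HTTP response body. It assumes the hash follows directly after the "registry-hash=" prefix and that the HTTP response is JSON. Neither is settled in this proposal, and the function names are hypothetical.

```python
import json
from typing import Iterable, Optional

TXT_PREFIX = "registry-hash="


def hash_from_txt_records(records: Iterable[str]) -> Optional[str]:
    """Return the first non-empty value that follows the registry-hash= prefix."""
    for record in records:
        value = record.strip().strip('"')  # resolvers often return quoted strings
        if value.startswith(TXT_PREFIX):
            candidate = value[len(TXT_PREFIX):]
            if candidate:
                return candidate
    return None


def hash_from_http_body(body: str) -> Optional[str]:
    """Return the 'hash' property of a JSON object body, or None if absent."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return None
    if isinstance(payload, dict) and isinstance(payload.get("hash"), str):
        return payload["hash"]
    return None
```

Actually fetching the TXT record or the HTTP response is left out on purpose, so the parsing can be tested without network access.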

If the local client makes a request for a package that doesn't exist in the
local root hash, the client needs to make a request to the central registry to
download the package. After this is done, the package will be included in the
new root hash, and can therefore be downloaded by the local client without any
requests to the central registry.
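
The fallback described above could look something like this. It is a simplified sketch that ignores root hashes entirely and only models "serve locally if known, otherwise ask the central registry once". LocalNode and fetch_from_central are hypothetical names.

```python
from typing import Callable, Dict


class LocalNode:
    """Serves packages from a local cache, falling back to the central registry."""

    def __init__(self, fetch_from_central: Callable[[str], bytes]) -> None:
        self._cache: Dict[str, bytes] = {}
        self._fetch_from_central = fetch_from_central
        self.central_requests = 0  # counts how often the fallback was needed

    def get(self, package: str) -> bytes:
        cached = self._cache.get(package)
        if cached is not None:
            return cached
        # Not known locally: ask the central registry, then keep the result.
        self.central_requests += 1
        data = self._fetch_from_central(package)
        self._cache[package] = data
        return data
```

Requesting the same package twice should only hit the central registry once. The second request is answered from the local cache.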

Simulator

First step of the federation setup is creating a suitable testing environment
where we can run tests about how well the federation is working.

Simulator should start with running the following scenarios:

  • Starting one node connected to the main registry, downloading packages
    for one project. Run two times and ensure second is faster than first
  • Start two nodes. Make sure wanted packages are cached in the first one.
    Download packages without being connected to the Internet in the second one.
    Ensure second node is faster than first node as connection should now be
    local.
  • Start five nodes connected to internet and download packages for one
    project. Compare to starting five nodes where only one is connected to the
    internet. Second phase should be less bandwidth intense as packages are
    only downloaded from the Internet once instead of for each node.

More elaborate schemes can be created in the future.
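
A back-of-the-envelope model of the third scenario, as a sketch only. It counts how many bytes come from the Internet when every node installs the same project, with and without local sharing. The package sizes used below are made-up numbers, and a real simulator would run actual nodes instead.

```python
from typing import Dict


def internet_bytes(package_sizes: Dict[str, int], node_count: int,
                   peers_share: bool) -> int:
    """Bytes fetched from the Internet when every node installs the same project."""
    fetched_by_someone = set()
    total = 0
    for _node in range(node_count):
        for name, size in package_sizes.items():
            if peers_share and name in fetched_by_someone:
                continue  # served by a local peer instead of the Internet
            total += size
            fetched_by_someone.add(name)
    return total


# Made-up project: three packages with sizes in bytes.
PROJECT = {"pkg-a": 1_000, "pkg-b": 2_500, "pkg-c": 500}
```

With five nodes and no sharing the project is downloaded five times. With sharing it is downloaded once, which is the difference the simulator should be able to measure.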

Bootstrap nodes

Open-Registry will run a couple of bootstrap nodes. These are responsible for
being accessible to the federation nodes and providing the data for metadata and
tarballs if the federation nodes don't have it locally.

Metrics

Both the bootstrap nodes and the main registry index should publish metrics in
the Prometheus format to be collected by the metrics gatherer. These metrics
will eventually be made accessible via a public dashboard.

For the federation nodes, we can offer opt-in metrics in the future, so we can
see the health of the federation.
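
For reference, the Prometheus text format is simple enough to render by hand. The sketch below emits gauges only. The metric name in the usage example is hypothetical, and a real implementation would more likely use an official Prometheus client library.

```python
from typing import Mapping, Tuple, Union

Number = Union[int, float]


def render_gauges(gauges: Mapping[str, Tuple[str, Number]]) -> str:
    """Render {name: (help text, value)} in the Prometheus text exposition format."""
    lines = []
    for name, (help_text, value) in sorted(gauges.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"
```

Serving the returned string at a /metrics route would let the metrics gatherer scrape it, for example render_gauges({"open_registry_connected_peers": ("Peers currently connected.", 3)}).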

Existing infrastructure migration

The current Open-Registry is just one instance which is the main Open-Registry
index. With federation, the architecture would change to add another component
which would be the federated instances. We have more flexibility on where to
place these but are in no rush to add them currently.

Potential Issues

  • Lockfiles contain direct location-based URLs
    • hard for projects to migrate without having to rewrite their lockfiles
  • Efficient and fast lookup in the IPFS network
    • private networks solve this but bring their own problems

Drawbacks

  • Requiring software to be installed and run in the background for people
    wanting to take advantage of it
    • ^ could possibly be solved with HTTP/DNS routing, but initial routing will
      be centralized in that case and require internet connectivity

Alternatives

  • Continue to run a centralized service
  • Skip federation and start researching an architecture for a fully
    decentralized registry for both tarballs and metadata
    • Probably too huge an undertaking right now

Unresolved Problems

  • Using an IPFS private vs. public network
    • private network will be faster at bootstrapping + finding content
    • public network gives us a bigger reach and ability to download content from
      other nodes
    • Should benchmark and see which one is faster (although private network is
      pretty much sure to be faster, would be interesting to see how much)

Future

  • After implementation of the tarball federation, further research should be
    done on how metadata can be federated as well
  • Research URL scheme currently used to define packages
    • Right now, the entire ecosystem is in one namespace (let's call it the npm
      namespace)
    • Things are referred to as class-is directly in the package.json and
      lockfiles
    • We'd like to support multiple registries by doing something similar to
      /registry.npmjs.org/class-is instead. More verbose, but more accurate and
      flexible
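
One way the registry-qualified form could be parsed, as a sketch. It assumes a leading slash marks a qualified reference and that bare names keep meaning the npm namespace. PackageRef and parse_package_ref are hypothetical names, and the exact scheme is still open for research.

```python
from typing import NamedTuple

# Assumption: bare names keep resolving against the npm namespace.
DEFAULT_REGISTRY = "registry.npmjs.org"


class PackageRef(NamedTuple):
    registry: str
    name: str


def parse_package_ref(ref: str) -> PackageRef:
    """Split '/registry/name' into its parts; bare names get the default registry."""
    if ref.startswith("/"):
        registry, _, name = ref[1:].partition("/")
        if not registry or not name:
            raise ValueError(f"malformed package reference: {ref!r}")
        return PackageRef(registry, name)
    return PackageRef(DEFAULT_REGISTRY, ref)
```

Scoped packages still work because only the first path segment is treated as the registry, so /registry.npmjs.org/@scope/pkg keeps @scope/pkg as the name.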
@victorb victorb added the enhancement New feature or request label Apr 25, 2019
@victorb victorb pinned this issue Apr 25, 2019
@max-mapper

I don't have time to work on this right now, but here's an old thread from a similar initiative I worked on depjs/dep#8

@victorb
Member Author

victorb commented Apr 25, 2019

Thanks a lot @maxogden, will check that out.

@retrohacker

Running a global network of the scale of the npm registry will be impossible to do with just being funded by the community as the costs will be too high.

Wonder how far we can get with cloudflare + cloud storage.

Will be experimenting with this in the coming weeks and will report back :-)

@victorb
Member Author

victorb commented Apr 26, 2019

@retrohacker thanks, appreciate it, ping here once you have some results to share :)

That said, I do think that even if we find the fastest CDN, we can make it faster for people by having a federated model. But CDN in front of the metadata registry would still be a good idea.

@victorb
Member Author

victorb commented Apr 29, 2019

I've updated the initial issue here with an updated version of the proposed federation, will also bump it on the roadmap.

Old proposal can be found here: https://gist.github.com/victorb/82ace9e6fe7adf578833527b8b94f914

@marcusnewton

Build it using Holochain, it's exactly what you need for distributed (fully sharded) storage and cryptographic security

@retrohacker

retrohacker commented Jun 24, 2019

@victorb as promised, circling back to report on cost.

Self hosting on cloud providers turned out to be reasonable. Our GCP mirror ended up costing ~$300 to do the initial mirroring (pulling 5TB of data through cloud functions and into cloud storage).

Once the files are sitting in storage (multi-zone within the US), the cost is ~$6 a day. That includes the instance that is sitting there watching the CouchDB stream from the npm registry to keep the mirror fresh. The breakdown is $3.46 per day for storage and $2.28 for the compute instance.

Cloudflare functions (where we are doing our load balancing) costs $0.50 per million invocations.
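
To put the figures above together, here is the arithmetic as a small sketch. The 30-day month and the request volume in the example are assumptions for illustration, not numbers from the mirror.

```python
STORAGE_PER_DAY = 3.46   # USD, from the breakdown above
COMPUTE_PER_DAY = 2.28   # USD, from the breakdown above
PER_MILLION_INVOCATIONS = 0.50  # USD, from the figure above


def daily_cost() -> float:
    """Storage plus compute per day, rounded to cents."""
    return round(STORAGE_PER_DAY + COMPUTE_PER_DAY, 2)


def monthly_cost(days: int = 30) -> float:
    """Daily cost scaled to a month (30 days is an assumption)."""
    return round(daily_cost() * days, 2)


def invocation_cost(invocations: int) -> float:
    """Cost of a given number of function invocations."""
    return invocations / 1_000_000 * PER_MILLION_INVOCATIONS
```

That gives $5.74 per day, which matches the ~$6 quoted, and roughly $172 for a 30-day month before any request costs.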

BTW the service is up and running if you want to give it a try: https://freajs.com

4 participants