Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Bloom Filter as a fast path to check for public dataset existence #1899

Open
b5 opened this issue Aug 30, 2021 · 0 comments
Open

Use Bloom Filter as a fast path to check for public dataset existence #1899

b5 opened this issue Aug 30, 2021 · 0 comments
Labels
perf A code change that improves performance resolvers dataset reference resolving & loading

Comments

@b5
Copy link
Member

b5 commented Aug 30, 2021

I'd like to propose an extension interface aimed at improving the performance of time-to-not-found on a dataset alias that doesn't exist. Right now if I run qri get some_user/random_gibberish_dataset_name, all qri resolvers have to scan their entire list of datasets to confirm such a name doesn't exist. Keeping in mind that qri get is by default hooked up to a network-backed resolver, that means an HTTP call to a database scan per bad name request.

We can speed this up with a bloom filter hydrated with all known public names, or with specific SQL queries on sql-backed name stores. In either case, I'd like to define an interface resolver implementations can use to confirm or deny the existence of a name:

package dsref

// PublicAliasChecker is a fast-check for the existence of a public username / name combination
type PublicAliasChecker interface {
	PublicAliasExists(ctx context.Context, username, name string) (bool, error)
}

Using these techniques we can dramatically drop the time required to rule out a dataset existing.

I think we should define this as an interface in the dsref package to assist in a long term migration toward a single resolver definition backed by different name-store implementations. Currently every resolver needs to re-implement what is effectively the same pattern, just specialized to different stores. But our codebase is trending toward defining uniform store interfaces and business logic that operates on those stores. If we apply that pattern to the ResolveRef interface we should see a Namestore interface emerge at some point that allows us to unify ResolveRef into single function that operates on any given namestore.

We need to define this on public names only, because which unlisted / private datasets a user can see will be dependent on who they are, which will be a different issue. But that does mean a name resolver can only return dsref.ErrNameNotFound when all public names and the user's accessible list of private names are checked.

@b5 b5 added perf A code change that improves performance resolvers dataset reference resolving & loading labels Aug 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
perf A code change that improves performance resolvers dataset reference resolving & loading
Projects
None yet
Development

No branches or pull requests

1 participant