Use Bloom Filter as a fast path to check for public dataset existence #1899

b5 · 2021-08-30T15:53:11Z

I'd like to propose an extension interface aimed at improving the performance of time-to-not-found on a dataset alias that doesn't exist. Right now if I run qri get some_user/random_gibberish_dataset_name, all qri resolvers have to scan their entire list of datasets to confirm such a name doesn't exist. Keeping in mind that qri get is by default hooked up to a network-backed resolver, that means an HTTP call to a database scan per bad name request.

We can speed this up with a bloom filter hydrated with all known public names, or with specific SQL queries on sql-backed name stores. In either case, I'd like to define an interface resolver implementations can use to confirm or deny the existence of a name:

package dsref

// PublicAliasChecker is a fast-check for the existence of a public username / name combination
type PublicAliasChecker interface {
	PublicAliasExists(ctx context.Context, username, name string) (bool, error)
}

Using these techniques we can dramatically drop the time required to rule out a dataset existing.

I think we should define this as an interface in the dsref package to assist in a long term migration toward a single resolver definition backed by different name-store implementations. Currently every resolver needs to re-implement what is effectively the same pattern, just specialized to different stores. But our codebase is trending toward defining uniform store interfaces and business logic that operates on those stores. If we apply that pattern to the ResolveRef interface we should see a Namestore interface emerge at some point that allows us to unify ResolveRef into single function that operates on any given namestore.

We need to define this on public names only, because which unlisted / private datasets a user can see will be dependent on who they are, which will be a different issue. But that does mean a name resolver can only return dsref.ErrNameNotFound when all public names and the user's accessible list of private names are checked.

The text was updated successfully, but these errors were encountered:

b5 added perf A code change that improves performance resolvers dataset reference resolving & loading labels Aug 30, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use Bloom Filter as a fast path to check for public dataset existence #1899

Use Bloom Filter as a fast path to check for public dataset existence #1899

b5 commented Aug 30, 2021

Use Bloom Filter as a fast path to check for public dataset existence #1899

Use Bloom Filter as a fast path to check for public dataset existence #1899

Comments

b5 commented Aug 30, 2021