MEGA EPIC | Knowledge Corpus Enhancements #34

Open
maxyu1115 opened this issue Aug 28, 2023 · 0 comments
Labels
epic Big goals that track many stories/smaller tasks high priority High priority tasks/issues
Milestone

Comments

@maxyu1115
Collaborator

Knowledge corpuses currently need a number of improvements to be usable:

  1. Insertion speed is too slow
  2. The infrastructure needed for shared corpuses is not implemented
  3. Size issues

Here are some of my initial thoughts on these directions.

Speed

We need to find ways to increase insertion speed so that the full Wikipedia can be inserted in hours, not days. This will likely be a combination of:

  • Batch API
  • Parallel, more resource-intensive solutions, like MapReduce-style scripts and scaling up dependencies. These are not ideal, but they are still a reasonable option
  • Preprocessed corpuses or other ways we can import full corpuses
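As a rough sketch of how the first two options could combine on the client side: the snippet below chunks documents into batches and sends the batches in parallel. `insert_batch` is a hypothetical callable standing in for whatever batch API we end up with (it does not exist yet), and the batch size and worker count are placeholder values, not measured ones.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def insert_corpus(
    documents: Iterable[str],
    insert_batch: Callable[[List[str]], int],  # hypothetical batch API call
    batch_size: int = 256,
    workers: int = 4,
) -> int:
    """Send batches to insert_batch from a thread pool; return the total inserted.

    insert_batch is assumed to return the number of documents it stored.
    Note that Executor.map consumes the batch generator up front, so every
    batch is held in memory while the pool works through them.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(insert_batch, batched(documents, batch_size)))
```

Threads only help here if the time is spent waiting on the server; if embedding is the bottleneck, the parallelism has to happen on the service side instead.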

It'd be great if we can figure out a way to package and import an entire (preprocessed) corpus. Two major concerns with this are:

  1. How much improvement can we get? I imagine it mostly cuts down on embedding time plus the time spent in memas processing logic (which isn't much?).
  2. How can we make it not exploitable? Importing prebuilt data may become a huge security risk.
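One small piece of the security question can be sketched already: checking an imported package against a checksum manifest. The package layout here (a mapping of file name to bytes plus a manifest of SHA-256 digests) is purely an assumption for illustration, not a proposed format.

```python
import hashlib
from typing import Dict, List


def build_manifest(files: Dict[str, bytes]) -> Dict[str, str]:
    """Map each file name in the package to the SHA-256 hex digest of its bytes."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}


def verify_package(files: Dict[str, bytes], manifest: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the files match the manifest."""
    problems = []
    for name in sorted(set(files) | set(manifest)):
        if name not in manifest:
            problems.append(f"unexpected file: {name}")
        elif name not in files:
            problems.append(f"missing file: {name}")
        elif hashlib.sha256(files[name]).hexdigest() != manifest[name]:
            problems.append(f"checksum mismatch: {name}")
    return problems
```

This only detects corruption or tampering relative to a manifest we already trust. It says nothing about whether the publisher or the content itself is trustworthy, which is the harder part of the problem.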

Shared Corpus Infrastructure

We want people to be able to freely and easily share knowledge corpuses, so that common datasets can be reused much like training datasets are. Sharing corpuses is also essential to keep memas deployment size under control; we'd like to avoid extremes like needing hundreds of GB for every single user.

Community shared corpuses will need a number of improvements in place, such as:

  • Speed and preprocessed corpuses
  • Multi-corpus search. In order to fully utilize shared corpuses, we need to be able to recall from multiple corpuses at once.
  • A community hub. How much can we use existing platforms like Hugging Face?
  • Community safety and trust mechanisms, maybe like upvoting useful knowledge sets, reporting malicious/problematic ones, etc.
  • Potentially an export_corpus CP API.
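For multi-corpus search, the simplest version is to query every corpus and merge the results into one ranked list. A minimal sketch follows, assuming each corpus exposes a search callable returning `(score, text)` pairs and that scores are comparable across corpuses (for example, all produced by the same embedding model). Both are assumptions, and the names are hypothetical.

```python
import heapq
from typing import Callable, Dict, List, Tuple

Hit = Tuple[float, str]  # (score, document text); higher score means more relevant
SearchFn = Callable[[str, int], List[Hit]]  # hypothetical per-corpus search


def multi_corpus_search(
    query: str,
    corpora: Dict[str, SearchFn],
    top_k: int = 5,
) -> List[Tuple[float, str, str]]:
    """Query every corpus and return the top_k hits overall.

    Each result is (score, corpus_name, text), sorted by descending score.
    """
    merged = []
    for corpus_name, search in corpora.items():
        # Ask each corpus for top_k hits, since any one corpus could
        # supply all of the final results.
        for score, text in search(query, top_k):
            merged.append((score, corpus_name, text))
    return heapq.nlargest(top_k, merged, key=lambda hit: hit[0])
```

If scores are not comparable across corpuses, the merge step would need normalization or a reranker instead of a plain sort.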

Size issues

We need to investigate and alleviate most size issues, such as the Milvus index file size (see https://milvus.io/docs/v1.1.0/storage_concept.md).

We may also need to look into further sharding/splitting of corpus storage. Currently, across services like ES/Milvus/Cassandra, we fit all corpuses into a single index/collection/table, because most of these services do not scale well with the number of these top-level storage units. For example, Cassandra has a practical limit of about 1000 tables. However, a single shared unit creates noisy-neighbor problems as we get more corpuses and bigger corpuses. So we may need to work around these table-count limits, for example by sharding corpuses across a bounded number of tables.
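One possible reading of "sharding corpuses across tables" is to map each corpus to one of a fixed number of tables, so the table count stays bounded while neighbors are spread out. A minimal sketch, where the shard count and table prefix are made-up placeholder values:

```python
import hashlib


def shard_table_for(
    corpus_id: str,
    num_shards: int = 16,          # placeholder; must stay well under table limits
    prefix: str = "corpus_shard",  # hypothetical table name prefix
) -> str:
    """Return the table name that stores the given corpus.

    Uses SHA-256 rather than the built-in hash(), which is randomized per
    process for strings and would map a corpus to different tables across
    restarts.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    digest = hashlib.sha256(corpus_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:8], "big") % num_shards
    return f"{prefix}_{shard:03d}"
```

Plain modulo hashing means changing `num_shards` later moves most corpuses to a different table, so either the count is chosen up front or we would need something like consistent hashing or an explicit corpus-to-table mapping.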

TRACKS: #14, #15, #18, #22, #8

@maxyu1115 maxyu1115 added epic Big goals that track many stories/smaller tasks high priority High priority tasks/issues labels Aug 28, 2023
@maxyu1115 maxyu1115 added this to the v1.0 milestone Aug 28, 2023
@maxyu1115 maxyu1115 changed the title EPIC | Knowledge Corpus Enhancements MEGA EPIC | Knowledge Corpus Enhancements Oct 4, 2023