MEGA EPIC | Knowledge Corpus Enhancements #34

maxyu1115 · 2023-08-28T08:04:08Z

Knowledge Corpuses right now need a number of improvements to be usable:

Insertion speed is too slow
Infrastructure needed for shared corpuses has to be implemented
Size issues

Here are some of my initial thoughts on these directions.

Speed

We need to find ways to increase insertion speed so that the full Wikipedia can be inserted in hours, not days. This will likely be a combination of:

Batch API
Parallel and more resource-intensive solutions, like MapReduce scripts and scaling up dependencies. These are not ideal, but they are still reasonable.
Preprocessed corpuses or other ways we can import full corpuses
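As a rough illustration of the first two options, here is a minimal sketch of client-side batching combined with a thread pool. `insert_batch` is a hypothetical stand-in for whatever batch insertion endpoint we end up with; the names, batch size, and worker count are assumptions, not an existing memas API.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def insert_in_batches(
    documents: Iterable[str],
    insert_batch: Callable[[List[str]], int],
    batch_size: int = 256,
    workers: int = 4,
) -> int:
    """Send batches to insert_batch from a thread pool; return the total stored.

    insert_batch is hypothetical: it takes a list of documents and
    returns how many of them it stored.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(insert_batch, batched(documents, batch_size)))
```

Note that `Executor.map` submits every batch up front, so a corpus the size of Wikipedia would need bounded submission (or a streaming reader) instead of this simple form.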

It'd be great if we can figure out a way to package and import an entire (preprocessed) corpus. Two major concerns with this are:

How much improvement can we get? I imagine it mostly cuts down on embedding time + memas processing logic time (which isn't much?).
How can we make it not exploitable? This may become a huge security risk.
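One possible direction for the exploitability concern, sketched under assumptions: package the corpus in a data-only format (JSON here, rather than something like pickle that can execute code on load), and validate every record on import. The record layout (`text`, `embedding`) and the function names are hypothetical. The checksum only detects corruption in transit; since it travels inside the package, it says nothing about whether the publisher is trustworthy.

```python
import hashlib
import json
from typing import Dict, List

Record = Dict[str, object]


def package_corpus(records: List[Record]) -> bytes:
    """Serialize records as JSON together with a SHA-256 digest of the payload."""
    payload = json.dumps(records, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return json.dumps({"sha256": digest, "payload": payload}).encode("utf-8")


def load_corpus(blob: bytes, dim: int) -> List[Record]:
    """Verify the digest, then validate every record before accepting it."""
    package = json.loads(blob.decode("utf-8"))
    payload = package["payload"]
    if hashlib.sha256(payload.encode("utf-8")).hexdigest() != package["sha256"]:
        raise ValueError("checksum mismatch")
    records = json.loads(payload)
    for record in records:
        text, embedding = record["text"], record["embedding"]
        if not isinstance(text, str):
            raise ValueError("text must be a string")
        if not isinstance(embedding, list) or len(embedding) != dim:
            raise ValueError("embedding has the wrong shape")
        if not all(isinstance(x, float) for x in embedding):
            raise ValueError("embedding values must be floats")
    return records
```

This skips the embedding step on import, but the importing side would still need to trust that the embeddings match the text, which is part of the open question above.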

Shared Corpus Infrastructure

We want people to be able to freely and easily share knowledge corpuses, so that common datasets can be reused the way training datasets are. Sharing corpuses is also essential to keep memas deployment size under control; we'd like to avoid extremes like needing hundreds of GBs for every single user.

Community shared corpuses will need a number of improvements in place, such as:

Speed and preprocessed corpuses
Multi-corpus search. In order to fully utilize shared corpuses, we need to be able to recall from multiple corpuses at once.
A community hub. How much can we use existing platforms like Hugging Face?
Community safety and trust mechanisms, maybe like upvoting useful knowledge sets, reporting malicious/problematic ones, etc.
Potentially an export_corpus CP API.
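For the multi-corpus search item, a minimal sketch of the merge step: query each corpus for its own top hits, then keep the global best. The per-corpus search callables are hypothetical placeholders, and the sketch assumes scores are comparable across corpuses, which may not hold without normalization.

```python
import heapq
from typing import Callable, Dict, List, Tuple

Hit = Tuple[float, str]  # (score, document); a higher score is better
SearchFn = Callable[[str, int], List[Hit]]


def multi_corpus_recall(
    query: str, corpora: Dict[str, SearchFn], top_k: int = 5
) -> List[Tuple[float, str, str]]:
    """Ask each corpus for its top_k hits and return the global top_k.

    Each result is (score, corpus_name, document), sorted best first.
    """
    candidates = []
    for name, search in corpora.items():
        for score, document in search(query, top_k):
            candidates.append((score, name, document))
    return heapq.nlargest(top_k, candidates)
```

Requesting `top_k` from every corpus guarantees the merged result contains the true global top `top_k`, at the cost of fetching more candidates than are returned.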

Size issues

We need to investigate and alleviate most size issues, such as the size of Milvus index files: https://milvus.io/docs/v1.1.0/storage_concept.md

We may also need to look into further sharding/splitting of corpus storage. Currently, across services like ES/Milvus/Cassandra, we fit all corpuses into a single index/collection/table, because most of these services do not scale well in the number of these top-level storage units. E.g. Cassandra has a practical limit of about 1000 tables. The single-unit layout, however, creates noisy neighbor problems as we get more corpuses and bigger corpuses. So we may need to work within these table count limits, for example by sharding corpuses across a bounded set of tables.
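A minimal sketch of what sharding corpuses across a fixed number of tables could look like, assuming each corpus has a string ID. The table name prefix and the shard count are hypothetical; the point is only that a stable hash keeps the number of tables bounded while spreading corpuses out.

```python
import hashlib


def shard_table_for(corpus_id: str, num_shards: int, prefix: str = "corpus_shard") -> str:
    """Map a corpus ID to one of num_shards table names, stably across runs."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    # Use a cryptographic hash rather than built-in hash(), which is
    # randomized per process for strings.
    digest = hashlib.sha256(corpus_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:8], "big") % num_shards
    return f"{prefix}_{shard}"
```

With plain modulo assignment, changing `num_shards` later would move most corpuses to a different table, so the shard count would need to be chosen up front or replaced with something like consistent hashing.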

TRACKS: #14, #15, #18, #22, #8

maxyu1115 added the epic (Big goals that track many stories/smaller tasks) and high priority (High priority tasks/issues) labels Aug 28, 2023

maxyu1115 added this to the v1.0 milestone Aug 28, 2023

maxyu1115 changed the title ~~EPIC | Knowledge Corpus Enhancements~~ MEGA EPIC | Knowledge Corpus Enhancements Oct 4, 2023
