Knowledge corpuses currently need a number of improvements to be usable:
Insertion speed is too slow
Shared corpus infrastructure needs to be implemented
Size issues need to be addressed
Here are some of my initial thoughts on these directions.
Speed
We need to find ways to increase insertion speed so the full Wikipedia can be inserted in hours, not days. This will likely be a combination of:
Batch API
Parallel and more resource-intensive solutions, like map-reduce scripts and scaling up dependencies. These are not ideal, but still a reasonable fallback.
Preprocessed corpuses or other ways we can import full corpuses
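As a rough sketch of the batch + parallel direction (the client and its insert_batch endpoint here are hypothetical stand-ins, not the real memas API): group documents into batches to amortize per-request overhead, then insert the batches from a worker pool.

```python
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 256

class FakeCorpusClient:
    """Stand-in for a real corpus client; insert_batch is a hypothetical endpoint."""
    def insert_batch(self, docs):
        # A real client would issue one bulk-insert request here.
        return len(docs)

def chunked(docs, size):
    # Split the document list into fixed-size batches.
    for i in range(0, len(docs), size):
        yield docs[i:i + size]

def parallel_insert(client, docs, workers=8):
    # Insert batches concurrently; returns the total number of documents inserted.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(client.insert_batch, chunked(docs, BATCH_SIZE)))

docs = [{"text": f"doc {i}"} for i in range(1000)]
print(parallel_insert(FakeCorpusClient(), docs))  # 1000
```

Thread-based parallelism mostly hides network/embedding latency; truly CPU-bound preprocessing would want processes or separate worker machines instead.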
It'd be great if we could figure out a way to package and import an entire (preprocessed) corpus. Two major concerns with this are:
How much improvement can we get? I imagine it mostly cuts down on embedding time plus memas processing logic time (which isn't much?).
How do we make it not exploitable? Importing preprocessed data from untrusted sources could become a huge security risk.
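One partial mitigation for the exploitability concern, sketched here with an invented manifest format: refuse an imported corpus package unless every file matches a published checksum. This doesn't make imports safe by itself, but it at least pins the package to bytes the corpus author actually published.

```python
import hashlib

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def verify_package(files: dict, manifest: dict) -> bool:
    """files: name -> raw bytes; manifest: name -> expected sha256 (hypothetical format)."""
    if set(files) != set(manifest):
        return False  # missing or unexpected files
    return all(sha256(data) == manifest[name] for name, data in files.items())

payload = b"preprocessed corpus shard"
manifest = {"shard-0": sha256(payload)}
print(verify_package({"shard-0": payload}, manifest))      # True
print(verify_package({"shard-0": b"tampered"}, manifest))  # False
```

Checksums only verify integrity; trusting the manifest itself would still need signing or a trusted distribution channel.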
Shared Corpus Infrastructure
We want people to be able to freely and easily share knowledge corpuses, so that people can utilize common datasets, similar to how training datasets are shared. Sharing corpuses is also essential to keep memas deployment size in control; we'd like to avoid extremes like needing hundreds of GBs per user.
Community shared corpuses will need a number of improvements in place, such as:
Speed and preprocessed corpuses
Multicorpus search. To fully utilize shared corpuses, we need to be able to recall from multiple corpuses at once.
An export_corpus CP API.
A community hub. How much can we use existing platforms like Hugging Face?
Community safety and trust mechanisms, such as upvoting useful knowledge sets and reporting malicious/problematic ones.
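A minimal sketch of what multicorpus recall could look like: fan the query out to each corpus and merge the hits by score. The corpus/search interfaces here are invented for illustration, not the real memas ones.

```python
def multi_corpus_search(corpuses, query, k=3):
    # Fan the query out to every corpus, then merge hits by descending score.
    hits = []
    for name, search_fn in corpuses.items():
        for doc, score in search_fn(query):
            hits.append((score, name, doc))
    hits.sort(reverse=True)
    return [(name, doc, score) for score, name, doc in hits[:k]]

# Toy corpuses: each "search" returns (doc, score) pairs.
corpuses = {
    "wikipedia": lambda q: [("article about " + q, 0.9), ("related article", 0.4)],
    "shared-notes": lambda q: [("note on " + q, 0.7)],
}
print(multi_corpus_search(corpuses, "cassandra", k=2))
# [('wikipedia', 'article about cassandra', 0.9), ('shared-notes', 'note on cassandra', 0.7)]
```

One open question this glosses over is score calibration: scores from differently-embedded corpuses aren't directly comparable, so a real merge may need per-corpus normalization or rank-based fusion.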
Size issues
We need to investigate and alleviate the major size issues, such as the Milvus file/index size: https://milvus.io/docs/v1.1.0/storage_concept.md
We may also need to look into further sharding/splitting corpus storage. Currently, across services like ES/Milvus/Cassandra, we fit all corpuses into a single index/collection/table, because most of these services do not scale well with the number of these top-level storage units (e.g. Cassandra has a hard limit of about 1000 tables). This, however, creates noisy-neighbor problems as we get more and bigger corpuses. So we may need to look at pushing past these table-count limits by sharding corpuses across tables, etc.
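One common middle ground between "one table for everything" and "one table per corpus" is to hash each corpus id onto a fixed pool of tables, so the corpus count can grow without the table count growing. A toy sketch with invented names:

```python
import hashlib

NUM_TABLES = 64  # fixed pool size, kept well under e.g. Cassandra's table limit

def table_for_corpus(corpus_id: str) -> str:
    # Stable hash so a given corpus always maps to the same table.
    # Especially large or noisy corpuses could instead be pinned to dedicated tables.
    h = int(hashlib.md5(corpus_id.encode()).hexdigest(), 16)
    return f"corpus_shard_{h % NUM_TABLES}"

print(table_for_corpus("wikipedia") == table_for_corpus("wikipedia"))  # True
```

This reduces, but doesn't eliminate, noisy neighbors: corpuses sharing a shard still contend, so shard count and any pinning policy would need tuning against real corpus size distributions.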
TRACKS: #14, #15, #18, #22, #8