RCF 4.0 #352

Open
sudiptoguha opened this issue Oct 26, 2022 · 0 comments

This issue initiates RCF 4.0. The two main thrusts are further optimizations of memory and of distributed access, beyond the existing methods.

  • Streaming algorithms and data structures provide a unified treatment of sequential analysis (as in Sequential Design of Experiments) and small-space algorithms. Sequential analysis should be standard for time series, although in some special cases it can be bypassed. Such a bypass, however, will not give the same answers on a different input type, and it can hinder explainability or understanding of the results of a sequential algorithm. This is particularly true for timed sketch data structures such as RCF, especially when the shingle size is greater than 1. This library therefore aims to provide simple APIs for sequential analysis that can be applied to generic sequences. Two advantages ensue. First, if the data is already stored (in arrays or contiguous segments), RCF can be constructed with significantly less state (memory). Second, the sequential analysis can also perform calibration in a streaming manner and provide an efficient explanation and error measurement for any inference. Information-theoretically, self-calibration has more information (and thus less entropy) than any post hoc calibration or explanation.

  • Streaming algorithms can also unify distributed data, provided sequentiality is not required or is limited. This library aims to provide similarly simple APIs that clarify the use of RCF sketches in such a distributed setting. This is most relevant when the shingle size equals 1. Once again, if the data is already stored (as is commonly assumed in the distributed setting), this can save memory and communication when building models for inference on distributed data. Likewise, distributed calibration and explanation (perhaps with somewhat more error) can be performed more efficiently in terms of communication.
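To make the first point concrete, here is a minimal sketch of why already-stored data allows a smaller state. It is not the library's API: `ShingleView` and `reservoir_sample_starts` are hypothetical names, and plain uniform reservoir sampling (Algorithm R) stands in for RCF's actual sampler, which is an assumption made only for illustration. The idea is that a sampled shingle can be kept as a single start index into the stored array instead of a copy of `shingle_size` values.

```python
import random
from typing import List, Sequence


class ShingleView:
    """Read-only view of shingles over data that is already stored.

    Hypothetical helper: a shingle is identified by its start index,
    so no shingle is copied until it is actually read.
    """

    def __init__(self, data: Sequence[float], shingle_size: int) -> None:
        if shingle_size < 1 or shingle_size > len(data):
            raise ValueError("shingle_size must be in [1, len(data)]")
        self.data = data
        self.shingle_size = shingle_size

    def __len__(self) -> int:
        # Number of complete shingles in the stored sequence.
        return len(self.data) - self.shingle_size + 1

    def shingle(self, start: int) -> Sequence[float]:
        # Materialize one shingle on demand from the stored data.
        if not 0 <= start < len(self):
            raise IndexError("shingle start out of range")
        return self.data[start:start + self.shingle_size]


def reservoir_sample_starts(num_shingles: int, sample_size: int,
                            rng: random.Random) -> List[int]:
    """One sequential pass of Algorithm R over shingle start indices."""
    reservoir: List[int] = []
    for i in range(num_shingles):
        if i < sample_size:
            reservoir.append(i)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < sample_size:
                reservoir[j] = i
    return reservoir


# Usage: the sampler's state is one integer per sample, whereas copying
# each sampled shingle would hold sample_size * shingle_size values.
stored = [float(t) for t in range(1000)]
view = ShingleView(stored, shingle_size=8)
starts = reservoir_sample_starts(len(view), sample_size=32,
                                 rng=random.Random(0))
values_if_copied = len(starts) * view.shingle_size
values_as_indices = len(starts)
```

Under these assumptions the saving grows linearly with the shingle size, which is consistent with the issue singling out shingle sizes greater than 1 for the sequential setting.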

In effect, the improvements would separate, clarify, and provide reference points for the uses of RCF in the two disparate settings described above.
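As a possible reference point for the distributed setting (shingle size 1, no ordering required), the following sketch shows one standard way to make samples mergeable. It is an assumption for illustration and not the library's API: `local_sketch` and `merge_sketches` are hypothetical names, and uniform bottom-k sampling by random priority stands in for RCF's sampler. Each worker tags its points with a random priority and keeps the smallest ones; merging keeps the smallest priorities across workers, so only `sample_size` items per worker need to be communicated.

```python
import heapq
import random
from typing import Iterable, List, Tuple

# (random priority, point value); hypothetical representation of a sample.
Keyed = Tuple[float, float]


def local_sketch(points: Iterable[float], sample_size: int,
                 rng: random.Random) -> List[Keyed]:
    """Bottom-k sample of one worker's points, keyed by random priority."""
    keyed = [(rng.random(), p) for p in points]
    return heapq.nsmallest(sample_size, keyed)


def merge_sketches(sketches: Iterable[List[Keyed]],
                   sample_size: int) -> List[Keyed]:
    """Merge worker samples by keeping the globally smallest priorities."""
    combined = [item for sketch in sketches for item in sketch]
    return heapq.nsmallest(sample_size, combined)


# Usage: three workers sample locally, then a coordinator merges.
rng = random.Random(42)
worker_a = local_sketch([0.1, 0.4, 0.3, 0.9], sample_size=3, rng=rng)
worker_b = local_sketch([5.0, 5.2, 4.8], sample_size=3, rng=rng)
worker_c = local_sketch([9.7, 9.9], sample_size=3, rng=rng)
merged = merge_sketches([worker_a, worker_b, worker_c], sample_size=3)
```

Because the merge only keeps the smallest priorities, merging in any grouping or order gives the same result, and the merged sample is distributed like a uniform sample drawn from all the workers' points at once. Any trees would still have to be built from the merged sample; that step is not shown here.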
