Generalize GaussianCDFEncoding to arbitrary CDF encoding #218

Open
Datseris opened this issue Dec 23, 2022 · 2 comments
Labels
encodings Related to the Encodings API enhancement New feature or request that is non-breaking good first issue Good for newcomers

Comments

@Datseris
Member

Datseris commented Dec 23, 2022

This is a generality improvement for the current GaussianCDFEncoding and Dispersion. In general, any CDF could be used in the source code of the encoding; one could store the CDF function in the encoding struct. E.g., given some timeseries x, generate the function:

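using Statistics: mean, std
m, s = mean(x), std(x)
f = xᵢ -> gaussian(xᵢ, m, s) # closure capturing m and s; xᵢ is a single value, not the timeseries x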

Any other univariate function could be generated in place of f. The function is then stored as a field in a new struct CDFEncoding, which uses the exact source code of GaussianCDFEncoding but with f instead of the existing gaussian function.
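A minimal sketch of what such a struct could look like, assuming the stored function is already a CDF returning values in [0, 1]. The `Encoding` stand-in, the field names `f` and `c`, the `encode` method, and the logistic CDF (picked only because it has a closed form) are all assumptions for illustration, not the package's actual API:

```julia
using Statistics: mean, std

# Hypothetical stand-in for the package's abstract `Encoding` type.
abstract type Encoding end

# Sketch of the proposed struct: `f` is any univariate CDF mapping a real
# number to [0, 1]; `c` is the number of symbols (categories).
struct CDFEncoding{F} <: Encoding
    f::F
    c::Int
end

# Map a value to an integer symbol in 1:c by binning its CDF value.
function encode(e::CDFEncoding, xᵢ::Real)
    y = e.f(xᵢ)
    return clamp(ceil(Int, y * e.c), 1, e.c)
end

# Example: a logistic CDF built from the mean and std of a timeseries.
x = [0.1, 0.4, 0.35, 0.8, 1.2, 0.9, 0.5]
m, s = mean(x), std(x)
logistic_cdf = xᵢ -> 1 / (1 + exp(-(xᵢ - m) / s))

enc = CDFEncoding(logistic_cdf, 4)
symbols = [encode(enc, xᵢ) for xᵢ in x]
```

Swapping in a different closure for `logistic_cdf` changes the distribution without touching `encode`, which is the point of storing the function as a field.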

Then, this is easily propagated into Dispersion: that type should initialize a CDFEncoding and store the encoding directly as its field. If given only a timeseries, it defaults to computing the mean and std and initializing the Gaussian encoding.

@Datseris Datseris added enhancement New feature or request that is non-breaking good first issue Good for newcomers encodings Related to the Encodings API labels Dec 23, 2022
@kahaaga
Member

kahaaga commented Dec 23, 2022

The encoding could in principle be generalizable to multidimensional CDF functions as well, although I'm not sure something like that exists in the literature yet in the context of these "entropy-like" quantities. The function f is just the input to quadgk (which only handles univariate functions at the moment).

A completely generic version of CDFEncoding could be something like

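using QuadGK: quadgk
Base.@kwdef struct CDFEncoding{T<:Real} <: Encoding
    precomputed_stuff::NamedTuple # e.g. (μ = mean(x), σ = std(x))
    # unnormalized Gaussian density by default; supply another function for another CDF
    f::Function = xᵢ -> exp(-(xᵢ - precomputed_stuff.μ)^2 / (2 * precomputed_stuff.σ^2))
    lb::T # lower integration bound
    ub::T # upper integration bound
    integrator::Function = quadgk
end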

Or something along those lines, depending on the call signature of quadgk or whatever other integrator one would use for multidimensional input.

EDIT:

Alternatively, one could drop the integrator field from CDFEncoding and instead have CDFEncoding{D} <: Encoding, where D is the dimension of the data. Then one could dispatch separately for 1D data (using quadgk for integration) and for >=2D data (using some other integrator).
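The dispatch idea could be sketched roughly as follows. Everything here is an assumption for illustration: the `Encoding` stand-in, the field names, and the `encode` signature. The 1D method takes a ready-made CDF instead of calling quadgk so that the sketch runs without extra packages, and the >=2D method is only a placeholder:

```julia
# Hypothetical stand-in for the package's abstract `Encoding` type.
abstract type Encoding end

# `D` is the dimension of the data, `f` the stored function, `c` the number of symbols.
struct CDFEncoding{D, F} <: Encoding
    f::F
    c::Int
end
CDFEncoding{D}(f, c::Int) where {D} = CDFEncoding{D, typeof(f)}(f, c)

# 1D case: `f` is assumed to be a univariate CDF returning a value in [0, 1].
function encode(e::CDFEncoding{1}, xᵢ::Real)
    return clamp(ceil(Int, e.f(xᵢ) * e.c), 1, e.c)
end

# Vector-valued input: placeholder method; a multidimensional integrator would go here.
function encode(e::CDFEncoding{D}, xᵢ::AbstractVector) where {D}
    throw(ArgumentError("encode is not implemented for $(D)-dimensional data in this sketch"))
end

uniform_cdf = xᵢ -> clamp(xᵢ, 0.0, 1.0)  # CDF of a Uniform(0, 1) distribution
enc1 = CDFEncoding{1}(uniform_cdf, 5)
enc2 = CDFEncoding{2}(identity, 5)
```

With this layout, `encode(enc1, 0.55)` takes the 1D path, while `encode(enc2, [0.1, 0.2])` reaches the placeholder and throws.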

@Datseris
Member Author

You don't need to have precomputed_stuff. Simply make the function f = xᵢ -> exp((-(xᵢ - μ)^2)/(2σ^2)) by calculating or using μ, σ. The closure already stores the numbers. But also, I'm not sure what the use of hyper-generalizing is here: higher dimensions and different integrator functions don't really fit the need of the struct. Univariate cumulative distribution functions still make sense in context, though. There is also no need for the integration bounds: integrating from -Inf to x makes sense because that is by definition what gives you the probability from a CDF.
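A small sketch of the closure argument, under stated assumptions: the data are made up, `integrate` is a hand-rolled trapezoidal rule standing in for an adaptive integrator such as quadgk, and the lower bound -Inf is approximated by μ - 10σ, where the Gaussian tail is negligible. The normalization constant is specific to the Gaussian case:

```julia
using Statistics: mean, std

x = [0.1, 0.4, 0.35, 0.8, 1.2, 0.9, 0.5]
μ, σ = mean(x), std(x)

# The closure captures μ and σ, so no separate `precomputed_stuff` field is needed.
f = xᵢ -> exp(-(xᵢ - μ)^2 / (2σ^2))

# Crude trapezoidal rule on [a, b] with n subintervals (stand-in for quadgk).
function integrate(f, a, b; n = 10_000)
    h = (b - a) / n
    return h * (f(a) / 2 + sum(f(a + i * h) for i in 1:n-1) + f(b) / 2)
end

# CDF value at xᵢ: normalized integral of the density from (approximately) -Inf to xᵢ.
gaussian_cdf(xᵢ) = integrate(f, μ - 10σ, xᵢ) / (σ * sqrt(2π))
```

Only the upper limit varies between calls, which is why the struct would not need to store integration bounds.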
