Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API simplification #66

Open
rofinn opened this issue Sep 25, 2020 · 2 comments
Open

API simplification #66

rofinn opened this issue Sep 25, 2020 · 2 comments
Milestone

Comments

@rofinn
Copy link
Member

rofinn commented Sep 25, 2020

Overview

The current implementation has some nice features for handling iterative data and provides early exit conditions. Unfortunately, these features are harder to maintain as we need to handle more use cases and different data structures. A couple of examples of this include:

  • AbstractContext: while this concept is nice for handling early exits and iterative threshold checks. It's also a bit cumbersome and complicates adding new imputation methods. Also, implementing a new AbstractContext type isn't entirely intuitive.
  • in-place: This kind of imputation doesn't really make sense for things like iterators over tables. Perhaps it'd be best to simply remove that option altogether... or limit it to arrays?
  • missing: It's a pretty ubiquitous in Julia now, so maybe we should just expect people to replace it before imputing?
  • dims: We should explicitly use this to apply an imputation method along a dimension. Maybe we can special-case symbols :rows/:cols for processing a columntable or a rowtable? I suppose we could expect folks to explicitly pass in a columntable or rowtable, but that seems a little unfriendly from a usability standpoint.

Proposed Changes

  1. Drop AbstractContext and maybe replace it with some or all of the below:
    - Impute.replace!: which will handle the allowmissing call and could support replacing values in multiple columns at once.
    - Impute.assert?: if you want to throw an error if some missing data threshold is reached [trivial in most cases]
    - Impute.mask?: will just give you a binary mask over your input data [trivial in most cases]
  2. Add an Impute.filter option which will filter observations base on some threshold. Along a dims would probably be more general. This is also probably more general than dropobs and dropvars?
  3. Drop public API support for in-place imputation rather than having undefined behaviour
  4. Add a Tables.jl like interface for describing whether imputation methods support vectors, matrices, tables or arbitrary iterators of some eltype. This would help us produce better error messages when someone uses an invalid imputation methods rather than giving a long traceback to a rather opaque method error.
  5. Add a MCAR test check
  6. Add a data generation module which would be helpful for tests, but could also be used for method comparisons.

Out of scope

  • Handling various missing values
  • Handling non-standard data types like intervals of zoned datetimes. We're gonna limit ourselves to standard datatype like ints, floats, bools, chars and strings. Anything else should have reasonable fallbacks, but if you want to have a timeseries specific algorithm that cares about zoned datetimes then that should be done in a different package.
  • Most of the standard use-cases we've been dealing with don't include streaming data, so if you want an iteration based approach that might be best left to a separate package with different/simpler imputation methods.

Success Conditions

  1. Easier to write new imputation methods
  2. No undefined behaviour
  3. More composable API
  4. Shouldn't be slower than existing tooling (e.g., replace, filter)

Failure Conditions

  1. It isn't much easier to write new imputation methods (e.g., similar lines of code, about as readable) and we've also gained the trade-offs below
  2. Trade-offs are severe in even the simplest of cases. Existing test cases should be used as benchmarks for checking this.
  3. Most operations can be replace with an existing method
  4. Error conditions are harder to debug

Trade-offs

  • Not handling different missing values in a context will involve an extra step (pass)
  • Not handling missing thresholds in a context will involve yet another extra step (pass)
  • Our move away from arbitrary iterators may limit usage and makes it easier to write multi-pass algorithms that may be necesary, but slower
  • Dropping attempts at in-place methods may impose an added allocation penalty for some users

Related Issues & PRs

This was referenced Sep 25, 2020
@rofinn rofinn added this to the 0.6 milestone Oct 10, 2020
@rofinn
Copy link
Member Author

rofinn commented Oct 21, 2020

I think proposed changes 3-6 shouldn't be breaking, so I'll bump the remainder of this issue to the 1.0 release.

@bencottier
Copy link
Contributor

bencottier commented Apr 16, 2021

Following invenia/AxisSets.jl#44 (comment):

Ideally we would standardise how to handle dims and AxisSets.Patterns across Impute.jl, FeatureTransforms.jl, and AxisSets.jl, in which the former two are both supported.

I think the main difference in handling is that FeatureTransforms.jl supports dims=:, which means apply a transform element-wise over an array.

We should also use a consistent convention for what e.g. dims=1 means (outdated but relevant issue: invenia/FeatureTransforms.jl#18)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants