API simplification #66

rofinn · 2020-09-25T02:30:44Z

Overview

The current implementation has some nice features for handling iterative data and provides early exit conditions. Unfortunately, these features become harder to maintain as we handle more use cases and different data structures. Some examples:

AbstractContext: While this concept is nice for handling early exits and iterative threshold checks, it's also a bit cumbersome and complicates adding new imputation methods. Also, implementing a new AbstractContext type isn't entirely intuitive.
in-place: This kind of imputation doesn't really make sense for things like iterators over tables. Perhaps it'd be best to simply remove that option altogether... or limit it to arrays?
missing: It's pretty ubiquitous in Julia now, so maybe we should just expect people to convert any other missing-value representation to it before imputing?
dims: We should explicitly use this to apply an imputation method along a dimension. Maybe we can special-case symbols :rows/:cols for processing a columntable or a rowtable? I suppose we could expect folks to explicitly pass in a columntable or rowtable, but that seems a little unfriendly from a usability standpoint.
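To make the dims idea concrete, here is a minimal sketch using only Base Julia. `fill_mean` and `impute_along` are hypothetical names made up for illustration, not part of Impute.jl, and the convention shown (dims=1 hands each column to the method) is simply what Base's `mapslices` does, not a decision for the package.

```julia
# Hypothetical helper (not Impute.jl API): fill missing entries of a vector
# with the mean of the observed entries.
function fill_mean(v::AbstractVector)
    observed = collect(skipmissing(v))
    isempty(observed) && throw(ArgumentError("no observed values to impute from"))
    m = sum(observed) / length(observed)
    return [ismissing(x) ? m : x for x in v]
end

# Hypothetical wrapper: apply a vector-wise method along `dims` of a matrix.
# With Base's mapslices, dims=1 passes each column and dims=2 passes each row.
impute_along(f, A::AbstractMatrix; dims) = mapslices(f, A; dims=dims)

A = [1.0 missing; 3.0 4.0; missing 8.0]

by_col = impute_along(fill_mean, A; dims=1)  # column means fill the gaps
by_row = impute_along(fill_mean, A; dims=2)  # row means fill the gaps
```

A :rows/:cols special case for tables could presumably map onto the same two code paths, but that part is not sketched here.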

Proposed Changes

~~Drop AbstractContext and maybe replace it with some or all of the below:~~
~~- Impute.replace!: which will handle the allowmissing call and could support replacing values in multiple columns at once.~~
~~- Impute.assert?: if you want to throw an error if some missing data threshold is reached [trivial in most cases]~~
~~- Impute.mask?: will just give you a binary mask over your input data [trivial in most cases]~~
~~Add an Impute.filter option which will filter observations based on some threshold. Filtering along a dims would probably be more general. This is also probably more general than dropobs and dropvars?~~
Drop public API support for in-place imputation rather than having undefined behaviour
Add a Tables.jl-like interface for describing whether imputation methods support vectors, matrices, tables or arbitrary iterators of some eltype. This would help us produce better error messages when someone uses an invalid imputation method rather than giving a long traceback to a rather opaque method error.
Add an MCAR (missing completely at random) test check
Add a data generation module which would be helpful for tests, but could also be used for method comparisons.
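As a rough illustration of the support-description interface proposed above, the sketch below uses plain dispatch to declare which input types a method accepts and to fail early with a readable message. Every name in it (`SketchImputor`, `VectorOnly`, `AnyArray`, `supports`, `check_support`) is hypothetical and chosen for this example only; the actual interface could look quite different.

```julia
# Hypothetical stand-ins for imputation method types (not Impute.jl types).
abstract type SketchImputor end
struct VectorOnly <: SketchImputor end   # pretend this method only handles vectors
struct AnyArray <: SketchImputor end     # pretend this method handles any array

# Trait-style declaration: nothing is supported unless a method opts in.
supports(::SketchImputor, ::Type) = false
supports(::VectorOnly, ::Type{<:AbstractVector}) = true
supports(::AnyArray, ::Type{<:AbstractArray}) = true

# Check up front so the user sees a short ArgumentError instead of a
# MethodError from deep inside the imputation code.
function check_support(imp::SketchImputor, data)
    supports(imp, typeof(data)) && return nothing
    throw(ArgumentError("$(typeof(imp)) does not support input of type $(typeof(data))"))
end

check_support(VectorOnly(), [1.0, missing, 3.0])   # passes silently
check_support(AnyArray(), [1.0 2.0; 3.0 4.0])      # passes silently
```

Calling `check_support(VectorOnly(), [1.0 2.0; 3.0 4.0])` would throw the ArgumentError, which is the kind of message the proposal is after.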

Out of scope

Handling various missing values
Handling non-standard data types like intervals of zoned datetimes. We're gonna limit ourselves to standard data types like ints, floats, bools, chars and strings. Anything else should have reasonable fallbacks, but if you want to have a timeseries specific algorithm that cares about zoned datetimes then that should be done in a different package.
Most of the standard use-cases we've been dealing with don't include streaming data, so if you want an iteration based approach that might be best left to a separate package with different/simpler imputation methods.

Success Conditions

Easier to write new imputation methods
No undefined behaviour
More composable API
Shouldn't be slower than existing tooling (e.g., replace, filter)

Failure Conditions

It isn't much easier to write new imputation methods (e.g., similar lines of code, about as readable) and we've also gained the trade-offs below
Trade-offs are severe in even the simplest of cases. Existing test cases should be used as benchmarks for checking this.
Most operations can be replaced with an existing method
Error conditions are harder to debug

Trade-offs

Not handling different missing values in a context will involve an extra step (pass)
Not handling missing thresholds in a context will involve yet another extra step (pass)
Our move away from arbitrary iterators may limit usage and make it easier to write multi-pass algorithms that may be necessary, but slower
Dropping attempts at in-place methods may impose an added allocation penalty for some users
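For a sense of what the two extra passes might look like in user code, here is a sketch with Base Julia only. The sentinel value -999.0, the 0.5 threshold and the constant fill are all assumptions picked for the example; the final step is just a stand-in for whichever imputation method would actually be used.

```julia
raw = [1.0, -999.0, 3.0, -999.0, 5.0]   # -999.0 is an assumed sentinel value

# Extra pass 1: convert the sentinel to `missing` before imputing.
data = replace(raw, -999.0 => missing)

# Extra pass 2: check the missing-data threshold explicitly.
ratio = count(ismissing, data) / length(data)
ratio <= 0.5 || throw(ArgumentError("too much missing data: $ratio"))

# The imputation itself (a constant fill as a placeholder method).
filled = coalesce.(data, 0.0)
```

Each step allocates a new array, which is the kind of cost the last trade-off above refers to.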

Related Issues & PRs

rofinn · 2020-10-21T21:46:24Z

I think proposed changes 3-6 shouldn't be breaking, so I'll bump the remainder of this issue to the 1.0 release.

bencottier · 2021-04-16T15:30:33Z

Following invenia/AxisSets.jl#44 (comment):

Ideally we would standardise how to handle dims and AxisSets.Patterns across Impute.jl, FeatureTransforms.jl, and AxisSets.jl, the last of which supports both of the former.

I think the main difference in handling is that FeatureTransforms.jl supports dims=:, which means apply a transform element-wise over an array.

We should also use a consistent convention for what e.g. dims=1 means (outdated but relevant issue: invenia/FeatureTransforms.jl#18)

This was referenced Sep 25, 2020

Cleanup of old/noisy tests #67

Merged

Simplify Imputor API #69

Merged

rofinn added the enhancement label Oct 6, 2020

rofinn added this to the 0.6 milestone Oct 10, 2020

rofinn modified the milestones: 0.6, 1.0 Oct 21, 2020

rofinn mentioned this issue Apr 15, 2021

WIP: support FeatureTransforms.jl invenia/AxisSets.jl#44

Closed
