
PipeOpFilterRows #410

Open · wants to merge 18 commits into base: master
Conversation

@sumny (Sponsor Member) commented Apr 21, 2020

PipeOpFilterRows allows filtering the rows of a Task's data according to a filter parameter and adds an easy way to remove rows with missing values; related to #301.
PipeOpPredictionUnion allows combining predictions of the same task_type and predict_type into a new Prediction. This may need some discussion, as combining predictions is, in my opinion, not that straightforward. (Dropped this from the PR.)


@mb706 mb706 added the "Status: Needs Discussion (We still need to think about what the solution should look like)" label Jun 21, 2020
@mb706 (Collaborator) commented Jun 21, 2020

Also, the need to filter rows during training (e.g. based on NAs or some other property that a learner cannot handle) may be very different from this splitting-up-predictions-between-learners use case?

@sumny sumny added the "Status: Completed" and "Status: Review Needed" labels and removed the "Status: Needs Discussion (We still need to think about what the solution should look like)" and "Status: Revision Needed" labels Oct 10, 2020
@sumny sumny requested review from mb706 and pfistfl October 10, 2020 15:07
@sumny (Sponsor Member, Author) commented Oct 10, 2020

Cleaned this up. Currently has two parameters:

  • filter_formula (a formula with only a rhs, evaluating to TRUE or FALSE within the data of the Task, indexing the rows to keep)
  • na_selector (a Selector function selecting the features that should be checked for missing values; rows having missing values w.r.t. these features are dropped)

I agree with @mb706 that this is somewhat redundant, because:

po1 = po("filterrows", param_vals = list(filter_formula = ~ !is.na(insulin)))
po2 = po("filterrows", param_vals = list(na_selector = selector_name("insulin")))

results in virtually the same operation being done.
However, what if I have a huge number of features and want to, e.g., remove missing values w.r.t. factor features? The approach via filter_formula = ~ !is.na(feature1) & !is.na(feature2) & ... seems quite cumbersome compared to na_selector = selector_type("factor"). I guess you could somewhat auto-generate the formula in this case, but I am not sure that this is really nicer.
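
The formula auto-generation mentioned above can be sketched quickly; make_na_formula is a hypothetical helper for illustration, not part of this PR:

```r
# Hypothetical sketch: auto-generate a filter formula from feature names,
# as an alternative to a dedicated na_selector parameter.
make_na_formula <- function(features) {
  rhs <- paste(sprintf("!is.na(%s)", features), collapse = " & ")
  stats::as.formula(paste("~", rhs))
}

f <- make_na_formula(c("insulin", "glucose"))
# f is now ~ !is.na(insulin) & !is.na(glucose)
eval(f[[2L]], list(insulin = c(90, NA), glucose = c(100, 120)))
# TRUE FALSE: only the first row would be kept
```

Whether this is really nicer than a separate selector parameter is exactly the open question.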

Maybe we should just split this up into PipeOpFilterRows and PipeOpRemoveMissings so as not to confuse users, and state in PipeOpFilterRows that PipeOpRemoveMissings provides a cleaner interface for missing-value removal / complete-case analysis.

@sumny sumny changed the title pipeops FilterRows and PredictionUnion PipeOpFilterRows Oct 10, 2020
@sumny sumny mentioned this pull request Oct 10, 2020
@mb706 (Collaborator) commented Oct 12, 2020

That argument for "nacols" would also apply to other conditions to filter by, e.g. trying to filter if any of a number of features is 0 or negative. We could go the data.table route and

  1. Introduce the .SD variable, which is just a data.table of the dataset,
  2. Introduce an SDcols hyperparameter that selects the features that should be present in .SD.

So then dropping rows where any feature is NA would come down to

SDcols = selector_grep("^nonNA\\.")
filter_formula = ~ !apply(is.na(.SD), 1, any)
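
In plain data.table terms (with hypothetical column names, independent of the PipeOp itself), this combination would behave like:

```r
library(data.table)

# Hypothetical data: two "nonNA."-prefixed features and one other column.
dt <- data.table(
  nonNA.x = c(1, NA, 3),
  nonNA.y = c(4, 5, NA),
  other   = c(NA, 2, 3)
)

# What SDcols = selector_grep("^nonNA\\.") would put into .SD:
sd_cols <- grep("^nonNA\\.", names(dt), value = TRUE)
sd <- dt[, sd_cols, with = FALSE]

# What filter_formula = ~ !apply(is.na(.SD), 1, any) would keep:
dt[!apply(is.na(sd), 1, any)]
# only the first row survives; the NA in "other" is ignored
```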

@mb706 (Collaborator) commented Oct 12, 2020

So this is supposed to filter out rows during predict time, right? Is there a way to keep track of which sample is being predicted once a learner sees the incomplete prediction dataset? If my input task has two rows and one is dropped, my Graph returns one prediction; which sample does that prediction belong to?

Supposing that is not a problem, what is the current state of mlr3's handling of missing predictions? Does it ignore them, use a fallback learner, or just crash? In the latter case we should modify GraphLearner to do something sensible with missing predictions.

Also, maybe this PipeOp should have the option to select in which phase samples are filtered, e.g. "train", "predict", "always".

@sumny (Sponsor Member, Author) commented Oct 20, 2020

> That argument for "nacols" would also apply to other conditions to filter by, e.g. trying to filter if any of a number of features is 0 or negative. We could go the data.table route and
>
> 1. Introduce the `.SD` variable, which is just a `data.table` of the dataset,
> 2. Introduce an `SDcols` hyperparameter that selects the features that should be present in `.SD`.
>
> So then dropping rows where any feature is NA would come down to
>
> SDcols = selector_grep("^nonNA\\.")
> filter_formula = ~ !apply(is.na(.SD), 1, any)

I like the idea of using SDcols and have included it. Filtering is now specified only via the filter_formula, which can rely on the .SD variable because it is evaluated within the frame of the data.table backend of the Task.
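
As a rough sketch of that evaluation mechanism (the helper apply_filter and the data are hypothetical, not the PR's actual implementation):

```r
library(data.table)

# Hypothetical helper: evaluate the rhs of a one-sided filter formula with
# the data's columns in scope and `.SD` bound to the selected columns.
apply_filter <- function(dt, filter_formula, sd_cols = names(dt)) {
  env <- list2env(as.list(dt), parent = environment(filter_formula))
  assign(".SD", dt[, sd_cols, with = FALSE], envir = env)
  keep <- eval(filter_formula[[2L]], envir = env)  # [[2L]] is the rhs
  dt[keep]
}

dt <- data.table(age = c(25, 40, 35), insulin = c(90, NA, 110))
apply_filter(dt, ~ !apply(is.na(.SD), 1, any))        # drops the row with NA
apply_filter(dt, ~ age < 36, sd_cols = character(0))  # keeps ages 25 and 35
```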

@sumny (Sponsor Member, Author) commented Oct 20, 2020

> So this is supposed to filter out rows during predict time, right? Is there a way to keep track of which sample is being predicted once a learner sees the incomplete prediction dataset? If my input task has two rows and one is dropped, my Graph returns one prediction; which sample does that prediction belong to?
>
> Supposing that is not a problem, what is the current state of mlr3's handling of missing predictions? Does it ignore them, use a fallback learner, or just crash? In the latter case we should modify GraphLearner to do something sensible with missing predictions.
>
> Also, maybe this PipeOp should have the option to select in which phase samples are filtered, e.g. "train", "predict", "always".

I think we are probably mixing up different use cases. I believe what @pfistfl wanted to be possible in #301 is to simply split your task based on some logic, both during training and prediction (with the row ids of the splits being disjoint and their union being equal to the row ids of the complete task). In this case, using PipeOpFilterRows would look somewhat like this:

task = tsk("pima")
gr1 = po("filterrows", param_vals = list(filter_formula = ~ age < 31)) %>>% lrn("classif.rpart")
gr2 = po("filterrows", param_vals = list(filter_formula = ~ age >= 31)) %>>% lrn("classif.rpart")
gr1$train(task)
gr2$train(task)
p1 = gr1$predict(task)[[1L]]  # correct row_ids are retained
p2 = gr2$predict(task)[[1L]]
setequal(union(p1$row_ids, p2$row_ids), task$row_ids)

Then we would only need another PipeOp that does nothing during training and combines the predictions during prediction, i.e., simply wraps c(), so we could have something like copy %>>% list(filterrows1, filterrows2) %>>% unite, and the prediction output would be:

c(p1, p2)
<PredictionClassif> for 768 observations:
    row_id truth response
         4   neg      neg
         6   neg      neg
         7   pos      neg
---                      
       763   neg      neg
       764   neg      neg
       767   pos      pos

@sumny sumny added the "Status: Needs Discussion (We still need to think about what the solution should look like)" label and removed the "Status: Revision Needed" label Oct 20, 2020
@codecov-io commented

Codecov Report

Merging #410 (f7798ad) into master (fc1e85a) will decrease coverage by 0.92%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #410      +/-   ##
==========================================
- Coverage   94.46%   93.54%   -0.93%     
==========================================
  Files          77       72       -5     
  Lines        2528     2245     -283     
==========================================
- Hits         2388     2100     -288     
- Misses        140      145       +5     
Impacted Files Coverage Δ
R/PipeOpFilterRows.R 100.00% <100.00%> (ø)
R/PipeOpPredictionUnion.R 100.00% <100.00%> (ø)
R/greplicate.R 0.00% <0.00%> (-100.00%) ⬇️
R/PipeOpImputeSample.R 60.00% <0.00%> (-30.00%) ⬇️
R/PipeOpImpute.R 76.19% <0.00%> (-18.75%) ⬇️
R/PipeOpImputeMean.R 90.00% <0.00%> (-10.00%) ⬇️
R/PipeOpImputeMedian.R 90.00% <0.00%> (-10.00%) ⬇️
R/mlr_graphs_elements.R 82.79% <0.00%> (-9.29%) ⬇️
R/GraphLearner.R 94.00% <0.00%> (-2.39%) ⬇️
R/Graph.R 87.40% <0.00%> (-1.28%) ⬇️
... and 40 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 36ca770...84d7341.

@py9mrg commented Feb 22, 2021

Hello,

I thought this new feature might help with my issue #566. I gave it a try using this commit, adding the line po("filterrows", param_vals = list(filter_formula = ~ !apply(is.na(.SD), 1, any))) before my svm learner, but I still got the error about mismatching truth and response lengths, so either I'm doing something wrong or this won't solve #566. Please let me know if I'm being really thick!

@sumny (Sponsor Member, Author) commented Feb 23, 2021

Hi @py9mrg,

Your approach above does not work out of the box because the SDcols hyperparameter of PipeOpFilterRows is set to selector_all() by default, and selector_all() called on a Task will only return the feature names of the task. As you are dealing with missing values in your target variable, you can provide your own selector, e.g.:

selector_all_inc_target = function() {
  make_selector(function(task) {
    c(task$feature_names, task$target_names)
  }, description = "selector_all_inc_target")
}

Then use this in your PipeOp:

po = po("filterrows",
  param_vals = list(
    filter_formula = ~ !apply(is.na(.SD), 1, any),
    SDcols = selector_all_inc_target()
  )
)

This would result in the 3rd row also being dropped:

po$train(list(task_w_missing))[[1]]$data()
    target variable1 variable2
 1:     32         4         4
 2:     50         5         5
 3:     72         6         6
 4:     98         7         7
...

All in all, this should work for you.
However, there is now some inconsistency in that the state of the po would only list your features "variable1" and "variable2" as being affected (although your target is affected as well). We may have to discuss what to do with PipeOps affecting targets in general. Also, I will try to finish this PR up soon and maybe refactor it. Hope this helps.

@py9mrg commented Feb 24, 2021

Hello @sumny ,

Thanks so much for coming back to me. That works! Just one thing in case anyone is reading with a similar use case: because make_selector is not exported, I had to use mlr3pipelines:::make_selector. It took me longer than I care to admit to work that out.

Thanks again for your help.

@py9mrg commented Apr 16, 2021

Hello @sumny, do you know when this PR might be merged? I'm using it all the time, and it would be nice, when not using renv, not to have to keep rolling mlr3pipelines back to this commit every time I forget to untick the update box in RStudio for this particular package!

Edit: oh, I forgot to say: it's probably me, but I can only get the 84d7341 commit to recognise the new filterrows po. I can see the HTML documentation being built when installing the other commits (mlr_pipeops_filterrows appears), but po("filterrows", etc) only works with the older commit.

@sumny (Sponsor Member, Author) commented Apr 22, 2021

@py9mrg, @mb706 and I have to go over it and check whether the complete functionality is still within the scope of mlr3pipelines. Also, selecting targets via selectors is still an open discussion (#493). I will ping you once it is merged; sorry for keeping you waiting so long!

@py9mrg commented Apr 22, 2021

@sumny thanks and don't worry - I'm sure I am completely underestimating how significant the change is.

Labels: Status: Needs Discussion (We still need to think about what the solution should look like)
Projects: None yet
6 participants