
PipeOpFilterRows #410

Open · wants to merge 18 commits into base: master
Conversation

@sumny (Sponsor Member) commented Apr 21, 2020

PipeOpFilterRows allows filtering the rows of a Task's data according to a filter parameter and adds an easy way to remove rows with missing values; related to #301.
PipeOpPredictionUnion allows combining predictions of the same task_type and predict_type into a new Prediction. This may need some discussion, as combining predictions is, in my opinion, not that straightforward. (Dropped this from the PR.)


@mb706 mb706 added the "Status: Needs Discussion (We still need to think about what the solution should look like)" label Jun 21, 2020
@mb706 (Collaborator) commented Jun 21, 2020

Also, the need to filter rows during training (e.g. based on NAs or some other property that a learner cannot handle) may be very different from this splitting-up-predictions-between-learners use case?

@sumny sumny added the "Status: Completed" and "Status: Review Needed" labels and removed the "Status: Needs Discussion (We still need to think about what the solution should look like)" and "Status: Revision Needed" labels Oct 10, 2020
@sumny sumny requested review from mb706 and pfistfl October 10, 2020 15:07
@sumny (Sponsor Member, Author) commented Oct 10, 2020

Cleaned this up. Currently has two parameters:

  • filter_formula (a formula with only a rhs, evaluating to TRUE or FALSE within the data of the Task, indexing the rows to keep)
  • na_selector (a Selector function selecting the features that should be checked for missing values; rows having missing values w.r.t. these features are dropped)

I agree with @mb706 that this is somewhat redundant, because:

po1 = po("filterrows", param_vals = list(filter_formula = ~ !is.na(insulin)))
po2 = po("filterrows", param_vals = list(na_selector = selector_name("insulin")))

results in virtually the same operation being done.
However, what if I have a huge number of features and want to, e.g., remove missing values w.r.t. factor features? The approach via filter_formula = ~ !is.na(feature1) & !is.na(feature2) & ... seems quite cumbersome compared to na_selector = selector_type("factor"). I guess you could somewhat auto-generate the formula in this case, but I am not sure that this is really nicer.
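
The formula auto-generation mentioned above can be sketched quickly; make_na_formula is a hypothetical helper for illustration, not part of this PR:

```r
# Hypothetical sketch: auto-generate a filter formula from feature names,
# as an alternative to a dedicated na_selector parameter.
make_na_formula <- function(features) {
  rhs <- paste(sprintf("!is.na(%s)", features), collapse = " & ")
  stats::as.formula(paste("~", rhs))
}

f <- make_na_formula(c("insulin", "glucose"))
# f is now ~ !is.na(insulin) & !is.na(glucose)
eval(f[[2L]], list(insulin = c(90, NA), glucose = c(100, 120)))
# TRUE FALSE: only the first row would be kept
```

Whether this is really nicer than a separate selector parameter is exactly the open question.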

Maybe we should just split this up into PipeOpFilterRows and PipeOpRemoveMissings so as not to confuse users, and state in PipeOpFilterRows that PipeOpRemoveMissings provides a cleaner interface for missing-value removal / complete-case analysis.

@sumny sumny changed the title pipeops FilterRows and PredictionUnion PipeOpFilterRows Oct 10, 2020
@sumny sumny mentioned this pull request Oct 10, 2020
@mb706 (Collaborator) commented Oct 12, 2020

That argument for "nacols" would also apply to other conditions to filter by, e.g. trying to filter if any of a number of features is 0 or negative. We could go the data.table route and

  1. Introduce the .SD variable, which is just a data.table of the dataset,
  2. Introduce an SDcols hyperparameter that selects the features that should be present in .SD.

So then dropping rows where any feature is NA would come down to

SDcols = selector_grep("^nonNA\\.")
filter_formula = ~ !apply(is.na(.SD), 1, any)
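
In plain data.table terms (with hypothetical column names, independent of the PipeOp itself), this combination would behave like:

```r
library(data.table)

# Hypothetical data: two "nonNA."-prefixed features and one other column.
dt <- data.table(
  nonNA.x = c(1, NA, 3),
  nonNA.y = c(4, 5, NA),
  other   = c(NA, 2, 3)
)

# What SDcols = selector_grep("^nonNA\\.") would put into .SD:
sd_cols <- grep("^nonNA\\.", names(dt), value = TRUE)
sd <- dt[, sd_cols, with = FALSE]

# What filter_formula = ~ !apply(is.na(.SD), 1, any) would keep:
dt[!apply(is.na(sd), 1, any)]
# only the first row survives; the NA in "other" is ignored
```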

@mb706 (Collaborator) commented Oct 12, 2020

So this is supposed to filter out rows during predict time, right? Is there a way to keep track of which sample is being predicted once a learner sees the incomplete prediction dataset? If my input task has two rows and one is dropped, my Graph returns one prediction; which sample does that prediction belong to?

Supposing that is not a problem, what is the current state of mlr3's handling of missing predictions? Does it ignore them, use a fallback learner, or just crash? In the latter case we should modify GraphLearner to do something sensible with missing predictions.

Also, maybe this PipeOp should have the option to select in which phase samples are filtered, e.g. "train", "predict", "always".

@sumny (Sponsor Member, Author) commented Oct 20, 2020

> That argument for "nacols" would also apply to other conditions to filter by, e.g. trying to filter if any of a number of features is 0 or negative. We could go the data.table route and
>
> 1. Introduce the `.SD` variable, which is just a `data.table` of the dataset,
> 2. Introduce an `SDcols` hyperparameter that selects the features that should be present in `.SD`.
>
> So then dropping rows where any feature is NA would come down to
>
> SDcols = selector_grep("^nonNA\\.")
> filter_formula = ~ !apply(is.na(.SD), 1, any)

I like the idea of using SDcols and have included it. Filtering is now specified only via the filter_formula, which can rely on the .SD variable because it is evaluated within the frame of the data.table backend of the Task.
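
As a rough sketch of that evaluation mechanism (the helper apply_filter and the data are hypothetical, not the PR's actual implementation):

```r
library(data.table)

# Hypothetical helper: evaluate the rhs of a one-sided filter formula with
# the data's columns in scope and `.SD` bound to the selected columns.
apply_filter <- function(dt, filter_formula, sd_cols = names(dt)) {
  env <- list2env(as.list(dt), parent = environment(filter_formula))
  assign(".SD", dt[, sd_cols, with = FALSE], envir = env)
  keep <- eval(filter_formula[[2L]], envir = env)  # [[2L]] is the rhs
  dt[keep]
}

dt <- data.table(age = c(25, 40, 35), insulin = c(90, NA, 110))
apply_filter(dt, ~ !apply(is.na(.SD), 1, any))        # drops the row with NA
apply_filter(dt, ~ age < 36, sd_cols = character(0))  # keeps ages 25 and 35
```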

@sumny (Sponsor Member, Author) commented Oct 20, 2020

> So this is supposed to filter out rows during predict time, right? Is there a way to keep track of which sample is being predicted once a learner sees the incomplete prediction dataset? If my input task has two rows and one is dropped, my Graph returns one prediction; which sample does that prediction belong to?
>
> Supposing that is not a problem, what is the current state of mlr3's handling of missing predictions? Does it ignore them, use a fallback learner, or just crash? In the latter case we should modify GraphLearner to do something sensible with missing predictions.
>
> Also, maybe this PipeOp should have the option to select in which phase samples are filtered, e.g. "train", "predict", "always".

I think we are probably mixing up different use cases. I believe what @pfistfl wanted to be possible in #301 is to simply split your task based on some logic, both during training and prediction (with the row ids of the splits being disjoint and their union being equal to the row ids of the complete task). In this case, using PipeOpFilterRows would look somewhat like this:

task = tsk("pima")
gr1 = po("filterrows", param_vals = list(filter_formula = ~ age < 31)) %>>% lrn("classif.rpart")
gr2 = po("filterrows", param_vals = list(filter_formula = ~ age >= 31)) %>>% lrn("classif.rpart")
gr1$train(task)
gr2$train(task)
p1 = gr1$predict(task)[[1L]]  # correct row_ids are retained
p2 = gr2$predict(task)[[1L]]
setequal(union(p1$row_ids, p2$row_ids), task$row_ids)

Then we would only need another PipeOp that does nothing during training and combines the predictions during prediction, i.e., simply wraps c(), so we could have something like copy %>>% list(filterrows1, filterrows2) %>>% unite, and the prediction output would be:

c(p1, p2)
<PredictionClassif> for 768 observations:
    row_id truth response
         4   neg      neg
         6   neg      neg
         7   pos      neg
---                      
       763   neg      neg
       764   neg      neg
       767   pos      pos

@sumny sumny added the "Status: Needs Discussion (We still need to think about what the solution should look like)" label and removed the "Status: Revision Needed" label Oct 20, 2020
@codecov-io commented

Codecov Report

Merging #410 (f7798ad) into master (fc1e85a) will decrease coverage by 0.92%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #410      +/-   ##
==========================================
- Coverage   94.46%   93.54%   -0.93%     
==========================================
  Files          77       72       -5     
  Lines        2528     2245     -283     
==========================================
- Hits         2388     2100     -288     
- Misses        140      145       +5     
Impacted Files Coverage Δ
R/PipeOpFilterRows.R 100.00% <100.00%> (ø)
R/PipeOpPredictionUnion.R 100.00% <100.00%> (ø)
R/greplicate.R 0.00% <0.00%> (-100.00%) ⬇️
R/PipeOpImputeSample.R 60.00% <0.00%> (-30.00%) ⬇️
R/PipeOpImpute.R 76.19% <0.00%> (-18.75%) ⬇️
R/PipeOpImputeMean.R 90.00% <0.00%> (-10.00%) ⬇️
R/PipeOpImputeMedian.R 90.00% <0.00%> (-10.00%) ⬇️
R/mlr_graphs_elements.R 82.79% <0.00%> (-9.29%) ⬇️
R/GraphLearner.R 94.00% <0.00%> (-2.39%) ⬇️
R/Graph.R 87.40% <0.00%> (-1.28%) ⬇️
... and 40 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 36ca770...84d7341.

@py9mrg commented Feb 22, 2021

Hello,

I thought this new feature might help with my issue #566. I gave it a try using this commit, adding the line po("filterrows", param_vals = list(filter_formula = ~ !apply(is.na(.SD), 1, any))) before my svm learner, but I still got the error about mismatching truth and response lengths, so either I'm doing something wrong or this won't solve #566. Please let me know if I'm being really thick!

@sumny (Sponsor Member, Author) commented Feb 23, 2021

Hi @py9mrg,

Your approach above does not work out of the box because the SDcols hyperparameter of PipeOpFilterRows is set to selector_all() by default, and selector_all() called on a Task will only return the feature names of the task. As you are dealing with missing values in your target variable, you can provide your own selector, e.g.:

selector_all_inc_target = function() {
  make_selector(function(task) {
    c(task$feature_names, task$target_names)
  }, description = "selector_all_inc_target")
}

Then use this in your PipeOp:

po = po("filterrows",
  param_vals = list(
    filter_formula = ~ !apply(is.na(.SD), 1, any),
    SDcols = selector_all_inc_target()
  )
)

This would result in the 3rd row also being dropped:

po$train(list(task_w_missing))[[1]]$data()
    target variable1 variable2
 1:     32         4         4
 2:     50         5         5
 3:     72         6         6
 4:     98         7         7
...

All in all, this should work for you.
However, there is now some inconsistency in that the state of the po would only list your features "variable1" and "variable2" as being affected (although your target is affected as well). We may have to discuss what to do with PipeOps affecting targets in general. Also, I will try to finish this PR up soon and maybe refactor it. Hope this helps.

@py9mrg commented Feb 24, 2021

Hello @sumny ,

Thanks so much for coming back to me. That works! Just one thing in case anyone is reading with a similar use case: because make_selector is not exported, I had to use mlr3pipelines:::make_selector. It took me longer than I care to admit to work that out.

Thanks again for your help.

@py9mrg commented Apr 16, 2021

Hello @sumny, do you know when this PR might be merged? I'm using it all the time, and it would be nice, when not using renv, not to have to keep rolling mlr3pipelines back to this commit every time I forget to untick the update box in RStudio for this particular package!

Edit: oh, I forgot to say: it's probably me, but I can only get the 84d7341 commit to recognise the new filterrows po. I can see the HTML documentation being built when installing the other commits (mlr_pipeops_filterrows appears), but po("filterrows", etc) only works with the older commit.

@sumny (Sponsor Member, Author) commented Apr 22, 2021

@py9mrg, @mb706 and I have to go over it and check whether the complete functionality is still within the scope of mlr3pipelines. Also, selecting targets via selectors is still an open discussion (#493). I will ping you once it is merged; sorry for keeping you waiting so long!

@py9mrg commented Apr 22, 2021

@sumny thanks and don't worry - I'm sure I am completely underestimating how significant the change is.

Labels: Status: Needs Discussion (We still need to think about what the solution should look like)
Projects: None yet
6 participants