
open up PipeOpLearnerCV to all resampling methods #513

Open
wants to merge 12 commits into base: master

Conversation


@sumny sumny commented Oct 1, 2020

Allow all resamplings currently listed in mlr_resamplings (and more, e.g. all that inherit from Resampling). Closes #500

If a resampling returns multiple predictions for a row id, they are aggregated using the mean; if task_type = "classif", the response is aggregated using the mode (maybe we should instead base this on the argmax of the mean-aggregated probs, if available). Maybe we should also open up to custom aggregation functions that can be passed as a hyperparameter, but then we have to make sure that the boundaries of probs etc. are respected, e.g., all aggregated probs must be within [0, 1] and must still sum to 1 for each row id (or we enforce this later).
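The mean/argmax aggregation described above can be sketched with data.table (the column names `row_id`, `prob.a`, `prob.b` are made up for illustration; this is not code from the PR):

```r
library(data.table)

# three resampled predictions, two of them for the same original row id
prds = data.table(
  row_id = c(1L, 1L, 2L),
  prob.a = c(0.6, 0.8, 0.3),
  prob.b = c(0.4, 0.2, 0.7)
)

# mean-aggregate the prob columns per original row id; the mean of rows that
# each sum to 1 again sums to 1, so the probability constraints hold for free
agg = prds[, lapply(.SD, mean), by = row_id, .SDcols = patterns("^prob\\.")]

# response as the argmax of the mean-aggregated probs
agg[, response := c("a", "b")[max.col(as.matrix(.SD))], .SDcols = patterns("^prob\\.")]
```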

If a resampling fails to return predictions for a row id present in the input task, this row id is added with missing values.
All in all, this results in the row ids of the input task matching the row ids of the output features based on the resampled prediction.

Probably should be renamed from PipeOpLearnerCV to PipeOpLearnerResampling or something.

For custom resampling, train_sets and test_sets are currently passed as ParamUtys "resampling.custom.train_sets" and "resampling.custom.test_sets"; not sure if this is the best way here.
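For reference, this is how plain mlr3 passes custom train/test sets today (a sketch; the PR instead exposes them as the ParamUty hyperparameters named above):

```r
library(mlr3)

task = tsk("iris")
custom = rsmp("custom")
# train/test sets are given directly at instantiation time in plain mlr3
custom$instantiate(task,
  train_sets = list(1:100),
  test_sets  = list(101:150)
)
rr = resample(task, lrn("classif.rpart"), custom)
```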

R/PipeOpLearnerCV.R: three outdated review threads, all resolved

sumny commented Oct 4, 2020

I guess if we want to be really flexible with respect to which Resamplings to include, there is no other way than also wrapping the Resampling as provided by the user. The current status of this PR is exactly this, i.e., we wrap not only the Learner but also the Resampling (initialized to rsmp("cv", folds = 3) as before).

Tests currently fail because:

  • There are issues with deep cloning param_sets of Resamplings
  • The parameters of Resamplings have neither a train nor a predict tag; this was fixed manually during construction

@sumny sumny requested a review from pfistfl October 4, 2020 18:30
Two comments by @sumny have been minimized.


sumny commented Mar 11, 2021

I just did another iteration, which I summarize here:

  • PipeOpLearnerCV now wraps any Resampling and simply returns the predictions. Because some Resampling methods may return multiple predictions per original row id, it also appends a new column of col_role row_reference to the data
  • Based on such a row_reference column one can then use PipeOpAggregate to aggregate these multiple predictions per original row id
  • There is still the "problem" that some Resampling methods do not necessarily return a prediction for each original row id, i.e., in subsampling the returned predictions (w.r.t. row ids) are only a subset of the original row ids; in this case stacking via featureunion is not possible, but I guess this is fine (PipeOpFilterRows #410 may fix this in the future)
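The subsampling situation in the third point can be illustrated with plain mlr3 (a sketch; learner and parameters chosen arbitrarily):

```r
library(mlr3)

task = tsk("iris")
# ratio = 0.3: 30% of rows go into the train set, so predictions only cover
# the remaining holdout rows
rr = resample(task, lrn("classif.rpart"), rsmp("subsampling", repeats = 1, ratio = 0.3))

pred_ids = rr$prediction()$row_ids
all(pred_ids %in% task$row_ids)  # predictions are a subset of the task's rows
length(pred_ids) < task$nrow     # strictly fewer rows than the input task
```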

#'
#' @usage NULL
#' @name mlr_pipeops_learner_cv
#' @format [`R6Class`] object inheriting from [`PipeOpTaskPreproc`]/[`PipeOp`].
#'
#' @description
#' Wraps an [`mlr3::Learner`] into a [`PipeOp`].
#' Wraps a [`mlr3::Learner`] and [`mlr3::Resampling`] into a [`PipeOp`].
Collaborator:

just say learner

#' Inherits the `$param_set` (and therefore `$param_set$values`) from the [`Learner`][mlr3::Learner] it is constructed from.
#' In the case of the resampling method returning multiple predictions per row id, the predictions
#' are returned unaltered. The output [`Task`][mlr3::Task] always gains a `row_reference` column
#' named `pre.<ID>` indicating the original row id prior to the resampling process. [`PipeOpAggregate`] should then
Collaborator:

rowid.

prds = as.data.table(private$.learner$predict(task))
}
# compute resampled predictions
rr = resample(task, private$.learner, private$.resampling)
Collaborator:

is there a way to check if the resampling fits a model on all the data, which could then be used for prediction without needing to fit twice?
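One observable special case: rsmp("insample") trains on all rows, so its single model could be reused for prediction without a second fit (a sketch; whether such a check can be done generically across Resamplings is the open question here):

```r
library(mlr3)

task = tsk("iris")
r = rsmp("insample")
r$instantiate(task)
# the (only) training set covers every row of the task
setequal(r$train_set(1), task$row_ids)  # TRUE
```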

# dependency). We will opt for the least annoying behaviour here and just not use dependencies
# in PipeOp ParamSets.
# private$.crossval_param_set$add_dep("folds", "method", CondEqual$new("cv")) # don't do this.
private$.additional_param_set$values = list(keep_response = FALSE)
Collaborator:

name it resampling.keep_response (and hope no resampling method has keep_response as a parameter)


# get task_type from mlr_reflections and call constructor
constructor = get(mlr_reflections$task_types[["task"]][chmatch(task$task_type, table = mlr_reflections$task_types[["type"]], nomatch = 0L)][[1L]])
newtask = invoke(constructor$new, id = task$id, backend = backend, target = task$target_names, .args = task$extra_args)
Collaborator:

Note to @mb706, this needs to be brought in accord with PipeOpTaskPreproc's affect_columns.

Collaborator:

But it should be the PipeOpTaskPreproc's responsibility to keep all col-roles (that are not disabled by affect_columns), so in particular should respect weights etc.

Maybe use this for inspiration.

Collaborator:

Also the previous-id-column should really be a different col-role to avoid accidentally training on the ID.

renaming = setdiff(colnames(prds), c("row_ids", "truth"))
setnames(prds, old = renaming, new = sprintf("%s.%s", self$id, renaming))
setnames(prds, old = "truth", new = task$target_names)
row_reference = paste0("pre.", self$id)
Collaborator:

change

setnames(prds, old = "truth", new = task$target_names)
row_reference = paste0("pre.", self$id)
while (row_reference %in% task$col_info$id) {
row_reference = paste0(row_reference, ".")
Collaborator:

instead throw an error when IDs collide; the user then has to change the PipeOp's ID. In any case, IDs are unique within a graph, so this usually shouldn't be a problem.
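The suggested fail-fast alternative might look like this (a sketch using mlr3misc::stopf in place of the while loop):

```r
# fail fast on a name collision instead of silently appending "." characters
row_reference = paste0("pre.", self$id)
if (row_reference %in% task$col_info$id) {
  stopf("Column '%s' already exists in Task '%s'; change the PipeOp's id.",
    row_reference, task$id)
}
```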

# the following is needed to retain correct row ids in the case of e.g. cv
# here we do not necessarily apply PipeOpAggregate later
backend = if (identical(sort(prds[[row_reference]]), sort(task$row_ids))) {
set(prds, j = task$backend$primary_key, value = prds[[row_reference]])
Collaborator:

check primary_key is not in names(prds)

Collaborator:

(assert)

@@ -143,7 +143,12 @@ PipeOpTuneThreshold = R6Class("PipeOpTuneThreshold",
},
.task_to_prediction = function(input) {
prob = as.matrix(input$data(cols = input$feature_names))
colnames(prob) = unlist(input$levels())
# setting the column names the following way is safer
nms = map_chr(strsplit(colnames(prob), "\\."), function(x) x[length(x)])
Collaborator:

maybe breaks when factor level has a period?

Collaborator:

Better way: use input$levels(input$target_names), generate putative colnames from that, then compare to the given colnames. The assert is good though.
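A sketch of this suggestion; the `"<prefix>.prob.<level>"` naming scheme and the `prefix` variable (id of the upstream PipeOp) are assumptions for illustration:

```r
# levels of the target, not of arbitrary factor columns
lvls = unlist(input$levels(input$target_names), use.names = FALSE)

# build the expected column names and assert they match the task's features
putative = sprintf("%s.prob.%s", prefix, lvls)  # prefix: assumed upstream PipeOp id
checkmate::assert_set_equal(putative, input$feature_names)

# fetch columns in putative order so colnames line up with the levels
prob = as.matrix(input$data(cols = putative))
colnames(prob) = lvls
```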

@@ -143,7 +143,12 @@ PipeOpTuneThreshold = R6Class("PipeOpTuneThreshold",
},
.task_to_prediction = function(input) {
prob = as.matrix(input$data(cols = input$feature_names))
colnames(prob) = unlist(input$levels())
Collaborator:

this worked before because there was only one col in the task with factors for classification, but it is very brittle and should be avoided.
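The brittleness can be demonstrated on any task with several factor columns: unlist() collects the levels of all of them, not just the levels matching the probability columns (a sketch):

```r
library(mlr3)

task = tsk("german_credit")  # classification task with many factor features
lvls = unlist(task$levels(task$feature_names), use.names = FALSE)
# far more levels than the two target classes, so positional matching breaks
length(lvls) > length(task$class_names)
```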

@@ -15,6 +15,9 @@ register_mlr3 = function() {
c("abstract", "meta", "missings", "feature selection", "imbalanced data",
"data transform", "target transform", "ensemble", "robustify", "learner", "encode",
"multiplicity")))
if (!all(grepl("row_reference", x$task_col_roles))) {
x$task_col_roles = map(x$task_col_roles, function(col_roles) c(col_roles, "row_reference"))
Collaborator:

this doesn't work if other tasks get added after mlr3pipelines, so we should ask @mllg (1) is there a way to do things like this well?, (2) can we just have the col role like that

#' @format [`R6Class`] object inheriting from [`PipeOpTaskPreprocSimple`]/[`PipeOpTaskPreproc`]/[`PipeOp`].
#'
#' @description
#' Aggregates features row-wise based on multiple observations indicated via a column of role `row_reference` according to expressions given as formulas.
Collaborator:

maybe we don't need to restrict to that colrole?

#' The parameters are the parameters inherited from [`PipeOpTaskPreproc`], as well as:
#' * `aggregation` :: named `list` of `formula`\cr
#' Expressions for how features should be aggregated, in the form of `formula`.
#' Each element of the list is a `formula` with the name of the element naming the feature to aggregate and the formula expression determining the result.
Collaborator:

maybe this shouldn't be a named list of formulae, but just a single formula naming a data.table expression, such as
lapply(.SD, mean) or .(Sepal.Length = first(Sepal.Length), Sepal.Width = last(Sepal.Width)). We would, however, need to teach the user data.table.
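Evaluating such a single j-expression per group of the row_reference column could look like this (a sketch with made-up data; the `aggregation` hyperparameter is assumed to be a one-sided formula):

```r
library(data.table)

dt = data.table(
  ref          = c(1L, 1L, 2L),
  Sepal.Length = c(5.1, 4.9, 4.7),
  Sepal.Width  = c(3.5, 3.0, 3.2)
)

# hypothetical hyperparameter: one formula holding a data.table j-expression
aggregation = ~ lapply(.SD, mean)

# [[2L]] extracts the rhs call of the one-sided formula, evaluated per group
dt[, eval(aggregation[[2L]]), by = ref]
```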

Collaborator:

alternatively: aggregation.all (single argument function), aggregation.specific: named list of formulae, similar to PipeOpMutate

Collaborator:

aggregation.all does not apply to (1) things named in aggregation.specific or (2) by columns

#' Initialized to `list()`, i.e., no aggregation is performed.
#' * `by` :: `character(1)` | `NULL`\cr
#' Column indicating the `row_reference` column of the [`Task`][mlr3::Task] that should be the row-wise basis for the aggregation.
#' Initialized to `NULL`, i.e., no aggregation is performed.
Collaborator:

Maybe also .SDcols
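What an .SDcols-style parameter would buy (a sketch with made-up columns): restrict the aggregation to selected columns and leave the rest to separate rules.

```r
library(data.table)

dt = data.table(
  ref      = c(1L, 1L, 2L),
  prob.a   = c(0.6, 0.8, 0.3),
  response = c("a", "a", "b")
)

# aggregate only the prob columns; response is untouched by this expression
dt[, lapply(.SD, mean), by = ref, .SDcols = patterns("^prob\\.")]
```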

# checks that `aggregation` is
# * a named list of `formula`
# * that each element has only a rhs
check_aggregation_formulae = function(x) {
Collaborator:

by now we can use mlr3misc::crate()
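A sketch of what the crate()-based check could look like (the exact conditions are taken from the comment above the function; the body is illustrative, not the PR's code):

```r
library(mlr3misc)

# crate() gives the function a sealed enclosing environment, so the check
# cannot accidentally capture objects from the PipeOp's closure
check_aggregation_formulae = crate(function(x) {
  is.list(x) && !is.null(names(x)) && all(names(x) != "") &&
    all(vapply(x, function(f) {
      # a one-sided formula (rhs only) has length 2: `~` and the rhs
      inherits(f, "formula") && length(f) == 2L
    }, logical(1)))
})

check_aggregation_formulae(list(y = ~ mean(x)))  # TRUE
```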

private = list(
.transform = function(task) {

if (length(self$param_set$values$aggregation) == 0L || is.null(self$param_set$values$by)) {
Collaborator:

empty aggregation should not be allowed

Collaborator:

empty by should still not early-exit.

Development

Successfully merging this pull request may close these issues.

Extend PipeOpLearnerCV for other resamplings
3 participants