predict on test set rows #706

mb706 · 2022-10-19T11:51:58Z

go through each pipeop and let it predict directly, since some pipeops need the predicted set

mb706 · 2024-01-26T17:03:51Z

PipeOpTaskPreproc checks if there are test set rows, if so calls .predict_task
tests should verify that all pipeops that have "Task" inputs and outputs handle this correctly: $predict() and $train()[test set rows, ] are the same. The autotests have some flag that indicates if $predict() is random? in expect_datapreproc_pipeop_class
some pipeops that do task preprocessing but dont inherit from pipeoptaskpreproc are worrying here: pipeopimpute. It should probably just inherit from pipeoptaskpreproc.
PipeOpTaskPreproc always handles test rows if they are present. GraphLearner removes test rows in most cases, unless a PipeOp needs them. Determines this by checking each PipeOp: Does it contain a $uses_test_rows active binding / field? (choose not to just look for attached $learner, if present. Problem here: e.g. ImputeLearner should not forward its 'needs-test-set' since it can not be used)
GraphLearner can set its own relevant property

sebffischer · 2024-01-29T11:27:50Z

There is a memory inefficiency with the suggested approach:

Let's say we have a task with data of the form

row_id	y	x1
1
...
n_use
n_use + 1
...
n_use + n_test

Lets say we apply two preprocessing operations to x1 in two subsequent PipeOps.

In the approach we discussed, we first task$cbind() the preprocessed train rows (pipeop$.train_task()and then task$rbind() the preprocessed test rows, resulting from pipeop$.predict_task().
The problem with this is now, that the task$rbind() of the test rows, happens in both preprocessing PipeOps.
This means in both PipeOps, a table of the form below is rbinded.

row_id	y	x1
1
...
n_test

The wasteful thing here is that we rbind column y, as it contains information that is simply already present in the tasks's backend and is now being duplicated. This problem comes worse when there are many columns that are left untouched by a preprocessing pipeop.

The virtual backend that would result of applying two preprocessing operations on x1 would internally store n_use + n_test + n_test * n_pipeop rows, (containing all un-preprocessed columns + newly created columns) which is not so great.
Fortunately there is a more memory efficient way to achieve the same results, which works (at least) for the standard scenario of having distinct and unique use rows and test rows (which is usually the case, except for bootstrapping and insample resampling). That approach requires the following:

we modify mlr3's task$cbind() function, to support $cbind()-ing test rows (straightforward)
Preprocessing PipeOps must (I think) call $.train_dt() and$.predict_dt(), then rbind their result and cbind them to the task. (not so straightforward, because we are changing the structure of PipeOpTaskPreproc)

sebffischer · 2024-01-29T13:10:48Z

In addition to that, retrieving the use roles will become more and more expensive, because of the many DataBackendRbinds. This is, because calling databackendrbind$data(), first goes through the rbinded backend (b2) to find the rows and then through b1.

sebffischer · 2024-01-29T13:14:10Z

Some possible solutions:

Accept this inefficiency (I am not a huge fan)
Hack our way around it (removing the internal cbinded backend containing the preprocesses use rows after $.train_task(), then rbind its data with the preprocessed test rows and $cbind() it again to the task) (I am also not a huge fan)
Change the $.train_task() method everywhere (maybe (?))
Somehow not $rbind() the test and use tasks, to avoid the problem (I think this should be considered)

mb706 mentioned this issue Jan 26, 2024

start working on validation support #753

Closed

10 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

predict on test set rows #706

predict on test set rows #706

mb706 commented Oct 19, 2022

mb706 commented Jan 26, 2024

sebffischer commented Jan 29, 2024 •

edited

sebffischer commented Jan 29, 2024

sebffischer commented Jan 29, 2024 •

edited

predict on test set rows #706

predict on test set rows #706

Comments

mb706 commented Oct 19, 2022

mb706 commented Jan 26, 2024

sebffischer commented Jan 29, 2024 • edited

sebffischer commented Jan 29, 2024

sebffischer commented Jan 29, 2024 • edited

sebffischer commented Jan 29, 2024 •

edited

sebffischer commented Jan 29, 2024 •

edited