
ResamplingCustom with observation weights, class weights and class costs #239

Open · wants to merge 15 commits into main
Conversation

@camsique commented Jun 6, 2019

Currently mlr supports observation weights for a task. In situations where the observation weights change across resampling training instances (e.g., time series), it would be beneficial to extend the ResamplingCustom class to accommodate this. An extension for custom observation weights, class weights, and class costs is general and gives the user the ability to use the mlr3 framework in more ways.

There is a forecasting extension for mlr3, but this proposal does not conflict with it.

This is my first pull request ever, so please excuse any errors and deviations from good practice. I will of course correct any mistakes.

mlr used wrappers to extend learners and tasks. It would be necessary to create similar wrappers to be applied after resampling with custom weights. What is the plan for mlr3 in this regard? If this proposal is accepted, I could try to write such a wrapper.

Changes in this PR (a hypothetical usage sketch follows below):

- Additional arguments obs_weights_train_sets, class_weights_train_sets, class_costs_train_sets
- Added a private method to ResamplingCustom
- Public methods get_obs_weights_train, get_class_weights_train, get_class_costs_train
- Updated the roxygen documentation
- Changed the instance assignment (%??%)
- Corrected assert_list for the new methods (checks the list length)
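For illustration, a hypothetical call to the proposed interface. train_sets/test_sets are the existing ResamplingCustom arguments; obs_weights_train_sets is the argument proposed here, so the exact signature is a sketch rather than released mlr3:

```r
library(mlr3)

task = tsk("german_credit")
custom = rsmp("custom")

# Two overlapping 600-row training windows; each gets its own per-observation
# weight vector (exponential decay, newest observation weighted highest).
custom$instantiate(task,
  train_sets = list(1:600, 101:700),
  test_sets  = list(601:700, 701:800),
  obs_weights_train_sets = list(0.99^(599:0), 0.99^(599:0))  # proposed argument
)
```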
@codecov (bot) commented Jun 6, 2019

Codecov Report

Merging #239 into master will decrease coverage by 2.2%.
The diff coverage is 29.41%.


@@            Coverage Diff             @@
##           master     #239      +/-   ##
==========================================
- Coverage   92.58%   90.38%   -2.21%     
==========================================
  Files          76       75       -1     
  Lines        1997     1956      -41     
==========================================
- Hits         1849     1768      -81     
- Misses        148      188      +40
| Impacted Files | Coverage Δ |
| --- | --- |
| R/Resampling.R | 88.09% <0%> (-6.78%) ⬇️ |
| R/ResamplingCustom.R | 65.71% <35.71%> (-26.6%) ⬇️ |
| R/Measure.R | 68.18% <0%> (-27.28%) ⬇️ |
| R/TaskSupervised.R | 80% <0%> (-20%) ⬇️ |
| R/BenchmarkResult.R | 78.43% <0%> (-19.61%) ⬇️ |
| R/Log.R | 78.26% <0%> (-16.48%) ⬇️ |
| R/Prediction.R | 77.77% <0%> (-9.73%) ⬇️ |
| R/Experiment.R | 88.65% <0%> (-6.02%) ⬇️ |
| R/assertions.R | 69.11% <0%> (-2.95%) ⬇️ |
| ... and 22 more | |

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 95da197...57e722c.

@mllg (Member) commented Jun 6, 2019

I'm not sure I understand the purpose of observation weights that change during resampling, as I have only little experience with modeling time series. Can you give an example of how the weights are used for forecasting tasks, so that I can see the bigger picture?

@berndbischl Can you also have a look?

@camsique (Author) commented Jun 6, 2019

Sure, let's imagine the following situation. We have data for a couple of years. We want to train on a sliding window, but also want to give older observations a smaller weight, as they can be less informative. One observation can be present in several training sets, either because of using GrowingCV or depending on the time length of the training set. In one training set a given observation could be the newest observation, and in another training set the same observation could be the oldest one. Attaching the same weight to that observation in different sets might not be desirable.
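A tiny numeric sketch of that point (hypothetical decay rate 0.99; an observation's weight depends on its position within each window):

```r
decay = 0.99
window1 = 51:100    # observation 100 is the newest here
window2 = 100:149   # observation 100 is the oldest here

w1 = decay^(max(window1) - window1)   # obs 100 gets weight 1
w2 = decay^(max(window2) - window2)   # obs 100 gets weight 0.99^49, about 0.61
```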

@mllg (Member) commented Jun 6, 2019

Thanks for your example, I think I got it.

I believe that what you describe could be solved with a pipeline operator in a much more generic fashion. Such an operator would get the training task as input and then incorporate weights into the task before passing it down to the learner. We could allow arbitrary weighting functions, but I assume that weights usually decay exponentially and that we usually need tasks with a column which defines the order of observations (column role "order" in mlr3).
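A rough sketch of what such an operator could look like (the class PipeOpExpWeights is hypothetical; the private method names and the "weight" column role follow one mlr3pipelines/mlr3 version and have changed over time):

```r
library(R6)
library(mlr3)
library(mlr3pipelines)

# Hypothetical pipeline operator: attaches exponentially decaying observation
# weights based on the task's "order" column before the task reaches the
# learner. Sketch only; not part of mlr3pipelines.
PipeOpExpWeights = R6Class("PipeOpExpWeights",
  inherit = mlr3pipelines::PipeOpTaskPreproc,
  public = list(
    initialize = function(id = "expweights") {
      super$initialize(id = id)
    }
  ),
  private = list(
    .train_task = function(task) {
      ord = task$data(cols = task$col_roles$order)[[1L]]
      # newest observation gets weight 1; older ones decay with rate 0.99
      w = 0.99^(max(rank(ord)) - rank(ord))
      task$cbind(data.table::data.table(row_weight = w))
      task$set_col_roles("row_weight", roles = "weight")
      task
    },
    .predict_task = function(task) task  # weights only affect training
  )
)
```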

I can write a prototype for that after my vacation (in approx. 2 weeks). Maybe @mb706 finds time to look at the problem first, or you can familiarize yourself with mlr3pipelines.

@camsique (Author) commented Jun 6, 2019

Many thanks for bringing mlr3pipelines to my attention. I had seen it as an extension, but was not aware of its usage. I will have a look and try it.
You are correct to assume that the weighting is usually exponential, but the more general the framework, the better.
As for the need to define the order of observations: I have had good experience using "keyed" data.tables for time series. That key might be of use here as well.
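For reference, a keyed data.table keeps its rows sorted by the key columns, so the key could naturally back the "order" column role (minimal example):

```r
library(data.table)

dt = data.table(
  date = as.IDate(c("2019-01-03", "2019-01-01", "2019-01-02")),
  y    = c(1.2, 0.7, 0.9)
)
setkey(dt, date)  # rows are now physically sorted by date
dt                # the key column defines the observation order
```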

@berndbischl (Member) commented
Hi, I also think I understand what is wanted here, and in general I think this is a useful thing to support. And it might belong to pipelines.

What mlr3 then needs to support is dynamically settable, mutable weights at the observation level. We do have this, right?

@camsique (Author) commented Jun 7, 2019

After looking into mlr3pipelines, I am not sure how a PipeOp (with custom CV) would be used iteratively to train a learner. I think it would be beneficial to have as much freedom as possible, and the current ResamplingCustom approach gives just that. However, this opinion is based on a lack of broader understanding of the mlr3 framework. Where to implement this is a design question that is not for me to decide.
If you make a decision and point me in the right direction, I will try to fit it into the mlr3 framework.

@topepo commented Jan 22, 2021

As we make similar changes in tidymodels, I'm not sure how to handle situations where the weights are pattern counts. If I had two categorical predictors and a lot of data, it would be efficient to encode the data as the unique combinations of the categorical columns (outcome included), with the case weight being the count n per pattern. For example:

   y       x_1   x_2   weight
   <chr>   <chr> <chr>  <dbl>
   class_1 a     A         29
   class_1 a     B         22
   class_1 a     C         70
   class_1 b     A         52
   class_1 b     B         66
   class_1 b     C         92
   class_2 a     A         28
   class_2 a     B         76
   class_2 a     C         80
   class_2 b     A         25
   class_2 b     B         60
   class_2 b     C         37

So, when splitting, should the pattern counts be split so that patterns go into both training and test, or should each pattern stay in one of those data sets at random? I see bias in the performance metric either way, but would probably distribute the patterns across both data sets. Any thoughts?
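For context, collapsing raw rows into such weighted patterns is straightforward in data.table (simulated data, so the counts differ from the table above):

```r
library(data.table)
set.seed(42)

# Simulated raw data with two categorical predictors and a class outcome
raw = data.table(
  y   = sample(c("class_1", "class_2"), 1000, replace = TRUE),
  x_1 = sample(c("a", "b"),             1000, replace = TRUE),
  x_2 = sample(c("A", "B", "C"),        1000, replace = TRUE)
)

# One row per unique pattern, with the case weight as the pattern count
patterns = raw[, .(weight = .N), by = .(y, x_1, x_2)]
```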

Base automatically changed from master to main January 25, 2021 19:37
@pat-s (Member) commented Dec 8, 2021

@mllg We should decide how to move forward with this proposal.

@topepo commented Dec 8, 2021

In case it helps: for resampling, our current plan in tidymodels is to have an optional (case) weights argument for frequency weights. We'll make the splits by sampling the weights column appropriately.

For example, if you have a weight of 50 for a row and do 10-fold CV, the modeling partition will get a weight of 45 and the performance holdout will get 5. I feel that this is fairly dangerous in some cases (the same data pattern ends up in both partitions), but users will have to opt in to this type of splitting.
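One way to implement that kind of split, sketched in R (illustrative only, not tidymodels' actual code): distribute each pattern's count across the k folds by multinomial sampling, so each fold's holdout gets its share and the modeling set keeps the rest.

```r
set.seed(1)
k = 10
weights = c(29, 22, 70)  # pattern counts, e.g. from the table above

# fold_counts[j, i] = how much of pattern i's weight lands in fold j
fold_counts = sapply(weights, function(w) rmultinom(1, w, rep(1 / k, k)))

colSums(fold_counts)        # each pattern's shares sum back to its weight
weights - fold_counts[1, ]  # modeling weights when fold 1 is the holdout
```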

We are also weaving case weights through the preprocessing, modeling, and performance estimation parts. Very tedious, but it needs to happen. We'll eventually write up a document along the lines of "so you want to use case weights...", since there is some nuance that people may not have thought about much.
