
Adding capability of taking auxiliary data #436

Open
wants to merge 10 commits into base: master
Conversation

@yinweisu (Contributor) commented Dec 7, 2021

The idea of this PR is to enable the benchmark yaml to take in auxiliary data. Auxiliary data is useful when a task requires more than the train and test datasets; for example, multimodal tasks need images. There are more usage possibilities as well, so we think it's worth having auxiliary data as a general feature.

Example yaml file:

- name: benchmark
  folds: 1
  dataset:
    target: label
    train: train.csv
    test: test.csv
  auxiliary_data:
    train: train_aux.zip
    test: test_aux.zip  # test aux not required

The main differences between auxiliary data and the regular dataset are:

  1. We don't require a 1-to-1 correspondence between train and test aux data. Imagine a use case where pseudo-labeled data is used during training but is not needed during testing.
  2. AMLB should let users handle the aux data themselves because its use cases can vary a lot. AMLB only prepares the data and hands users the path (see the sketch after this list).
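To make point 2 concrete, a framework integration might consume the auxiliary data roughly as in the sketch below (the run signature, the accessor name train_auxiliary_data and the print placeholder are illustrative assumptions, not the final API):

def run(dataset, config):
    # Regular splits keep the existing loading behaviour.
    train_df = dataset.train.data

    # Auxiliary data is only handed over as a path; the framework decides
    # how to unpack and use it (e.g. a zip of images for a multimodal task).
    aux_path = getattr(dataset, 'train_auxiliary_data', None)
    if aux_path:
        print(f"training with auxiliary data from {aux_path}")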

@yinweisu (Contributor Author) commented Dec 7, 2021

There is some duplicate code for extracting auxiliary paths because I found it confusing to put the logic for extracting auxiliary and regular train/test paths in a single function. I'm sure there are better designs, so feel free to propose them :).

@yinweisu (Contributor Author) commented Dec 7, 2021

@sebhrusen @PGijsbers

amlb/datasets/file.py (outdated review thread, resolved)
@PGijsbers (Collaborator)

Thank you for the contribution. I just wanted to let you know that it might be a little while before I myself can look at this. Though if Seb finds the time then I'm okay with whatever he says :)

In general I think it would be a useful addition to allow auxiliary data to be present in a task. These types of tasks have already been the focus of research and AutoML competitions. Given its free form, the only way I see this working is for the benchmark framework not to process the auxiliary data in any meaningful way.

@sebhrusen (Collaborator) left a comment

Looks more like an ad-hoc feature for AG right now; we would need to think about how it can be implemented to be useful for more frameworks.

paths = dict(train=train_path, test=test_path)
return paths

def _extract_auxiliary_paths(self, auxiliary_data, fold=None):
Collaborator

looks like a copy/paste of _extract_train_test_paths with minor changes:

  • different regex pattern
  • slightly different messages
  • as it is a recursive function, the internal call goes to _extract_auxiliary_paths instead of _extract_train_test_paths.

Given the complexity of this logic, we can't afford duplicating it.
Let's make a higher level _extract_paths function that can hold the logic for both, sth like:

def _extract_train_test_paths(self, dataset, fold=None):
    return self._extract_paths(self._extract_train_test_paths, dataset, fold)

def _extract_auxiliary_paths(self, dataset, fold=None):
    return self._extract_paths(self._extract_auxiliary_paths, dataset, fold,
                          train_suffix="train_auxiliary", test_suffix="test_auxiliary")    
    
def _extract_paths(self, extract_paths_fn, data, fold=None, train_suffix='train', test_suffix='test'):
    train_search_pat = re.compile(rf"(?:(.*)[_-]){train_suffix}(?:[_-](\d+))?\.\w+")
    ...
    return extract_paths_fn(ns(train=…)
    ...

@@ -55,6 +55,84 @@ def load(self, dataset, fold=0):
else:
raise ValueError(f"Unsupported file type: {ext}")

@profile(logger=log)
def load_auxiliary_data(self, auxiliary_data, fold=0):
Collaborator

please add a test for this under tests/unit/amlb/datasets/file
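For illustration only, such a test might look roughly like the following sketch (the file_loader fixture, the dummy file names, and the assumption that load_auxiliary_data returns a dict of resolved paths are all illustrative, not the test that was actually added):

import os

def test_load_auxiliary_data_resolves_paths(file_loader, tmp_path):
    # Dummy auxiliary archives for the loader to pick up.
    train_aux = tmp_path / "data_train_auxiliary.zip"
    test_aux = tmp_path / "data_test_auxiliary.zip"
    train_aux.touch()
    test_aux.touch()

    aux = file_loader.load_auxiliary_data(dict(train=str(train_aux), test=str(test_aux)), fold=0)

    # Both entries should point to existing files.
    assert os.path.exists(aux['train'])
    assert os.path.exists(aux['test'])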

Contributor Author

Added

Comment on lines 65 to 66
paths = dict(train=train_path, test=test_path)
return paths
Collaborator

I don't like the fact that it only returns paths. For all other data, we try to provide an object that allows consistent loading of the data while still providing the possibility to access the path directly. I don't see why it can't be done here as well.
Even if access to the data as a pandas DF is not provided immediately, we should at least have some interface.

Contributor Author

The problem here is that the auxiliary data could come in all different forms. For example, it could be a zip file containing a bunch of images. We don't know how we should handle it.

Collaborator

I understand, see my suggestion for an AuxData interface below: we can easily imagine extending it with some format property later.
Right now, I'm just concerned that providing access to those aux data this way is likely to be useful only for AG; that's why I want a better interface.

Contributor Author

So in the AuxData interface, there's a data property that's supposed to return a DF, which is not doable for many kinds of aux data. My point of view is that each framework should handle its own usage of the aux data because of its free form. I don't think providing only the path is useful only for AG.

Collaborator

My point of view is that each framework should handle their own usage of the aux data because of its free form.

you're making a design decision for them here.
Most of the frameworks currently supported deal only with tabular data, and that is the current focus of amlb. If you want to autodetect images to support your own needs, fine, you can use the path property in AuxData, but you can't generalize your approach to all frameworks.

Collaborator

So in the AuxData interface, there's a data property that's supposed to return a DF, which is not doable for many aux data

hence the idea of adding a type/format attribute to AuxData.
Again, you don't have to implement the data property now; I just want to have this AuxData interface in place now, so that we can implement more specific loading of those data later without having to rewrite everything.

Contributor Author

This makes sense. Thanks for the discussion!

self._test_auxiliary_data = auxiliary_data_path.get('test', None)

@property
def train_auxiliary_data(self) -> str:
Collaborator

if it's a path, please name it train_auxiliary_path, not ..._data: by convention in amlb, _data is used for loaded data, and some serialization logic relies on this naming convention.
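For example, the renamed accessor might look roughly like this (the wrapping class is an illustrative sketch, not the PR's actual code):

class DatasetWithAuxPathsSketch:
    def __init__(self, auxiliary_data_path: dict):
        self._train_auxiliary_path = auxiliary_data_path.get('train', None)
        self._test_auxiliary_path = auxiliary_data_path.get('test', None)

    @property
    def train_auxiliary_path(self) -> str:
        # '_path', not '_data': in amlb, '_data' is reserved for loaded data
        # and some serialization logic relies on that naming convention.
        return self._train_auxiliary_path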

@@ -167,6 +245,59 @@ def _get_metadata(self, prop):
return meta[prop]


class DatasetWithAuxiliaryData:
Collaborator

must extend Dataset if you want to expose this.

Actually, I think the right interface would be sth like:

# in data.py

class Datasplit(ABC):
    ...

    @property
    def has_auxiliary_data(self) -> bool:
        pass
       
    @property
    def auxiliary_data(self) -> List[AuxData]:
        pass
        
class AuxData(ABC):
    # does it need a name property? a type/category/labels?

    @property
    def path(self) -> str:
        pass

    @property
    @abstractmethod
    def data(self) -> DF:
        """
        :return: the auxiliary data as a pandas DataFrame.
        """
        pass

    @profile(logger=log)
    def release(self, properties=None):
        clear_cache(self, properties)

then you can implement the logic for files to fulfill the contract of this interface.
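For illustration, a file-backed implementation of that contract could look roughly like the sketch below (the class name and the csv-only data loading are assumptions, not part of the PR):

import pandas as pd

class FileAuxData(AuxData):

    def __init__(self, path: str):
        self._path = path

    @property
    def path(self) -> str:
        return self._path

    @property
    def data(self) -> pd.DataFrame:
        # Only meaningful for tabular auxiliary files; free-form payloads
        # (e.g. a zip of images) remain accessible through `path` until a
        # type/format property is added.
        if self._path.endswith('.csv'):
            return pd.read_csv(self._path)
        raise NotImplementedError("non-tabular auxiliary data: use `path` instead")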

Contributor Author

Having auxiliary data in Datasplit seems unnecessary to me, and it requires methods for adding auxiliary data to an existing Dataset because the dataset and the aux data are loaded separately. This is effectively identical to DatasetWithAuxiliaryData, which is a wrapper containing both the dataset and the aux data. I'll make it extend Dataset.

Collaborator

Having auxiliary data in Datasplit seems unnecessary

why? you were planning to have dataset.train_auxiliary_path and dataset.test_auxiliary_path.
I'm suggesting that for you it would then be accessed as dataset.train.auxiliary_data[0].path and dataset.test.auxiliary_data[0].path: it's much more consistent with the current interfaces.

Everything that is exposed to frameworks must be defined in one of the interfaces defined in data.py.

This is identical to DatasetWithAuxiliaryData

no, this is not identical: you're creating a class that doesn't extend anything, and you're adding features that are not visible in the contract defined by the interfaces, which are the only references for all data providers.

To sum up, a dataset currently has 2 splits: train and test.
You can have aux data for each of those, therefore aux data should be a property of Datasplit: if you add the aux data directly to the Dataset instance instead of Datasplit, you create an inconsistency.
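In code, the two placements would be accessed roughly as follows (illustrative usage only, assuming the interface sketched above):

# aux data attached to the Dataset instance (the PR's current approach):
train_aux_path = dataset.train_auxiliary_path

# aux data attached to each Datasplit (the suggested approach):
if dataset.train.has_auxiliary_data:
    train_aux_path = dataset.train.auxiliary_data[0].path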

@yinweisu (Contributor Author)

@sebhrusen Hey Seb, happy holidays! Can you review the code when you have time?

@yinweisu (Contributor Author)

@sebhrusen Hi, can you review this PR when you have time? Thanks!

@sebhrusen (Collaborator)

Hey @yinweisu sorry for making you wait on this: don't have much time for amlb right now, but I'll try to look at your PR during the week.
Also, I'll try to work soon with @PGijsbers on restructuring some parts of the code and setting up a contributor workflow, to make it easier for you to contribute and for us to review without being afraid of breaking existing logic.
Thanks for your understanding.

@Innixma (Collaborator) commented May 10, 2022

We have an intern joining AutoGluon whom we want to work on multi-modal optimization (i.e. datasets with an image feature). Ideally we would like him to be able to use AutoMLBenchmark as the benchmarking tool, but this is tricky without this functionality being merged. We can make do by hacking it into a forked repo, but I just wanted to mention this.
