
Adding capability of taking auxiliary data #436

Open
wants to merge 10 commits into base: master
Conversation

@yinweisu (Contributor) commented Dec 7, 2021

The idea of this PR is to enable the benchmark yaml to take in auxiliary data. Auxiliary data is useful when a task requires more than the train and test datasets; for example, multimodal tasks need images. There are more usage possibilities as well, so we think it's worth having auxiliary data as a general feature.

Example yaml file:

- name: benchmark
  folds: 1
  dataset:
    target: label
    train: train.csv
    test: test.csv
  auxiliary_data:
    train: train_aux.zip
    test: test_aux.zip  # test aux not required

The main differences between auxiliary data and the regular dataset are:

  1. We don't require a 1-to-1 correspondence between train and test aux data. Imagine a use case where pseudo-labeled data is used during training but is not needed during testing.
  2. AMLB should let users handle the aux data themselves because its use cases can vary a lot. AMLB only prepares the data and hands users the path (see the sketch after this list).
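To make point 2 concrete, a framework integration might consume the auxiliary data roughly as in the sketch below (the run signature, the accessor name train_auxiliary_data and the print placeholder are illustrative assumptions, not the final API):

def run(dataset, config):
    # Regular splits keep the existing loading behaviour.
    train_df = dataset.train.data

    # Auxiliary data is only handed over as a path; the framework decides
    # how to unpack and use it (e.g. a zip of images for a multimodal task).
    aux_path = getattr(dataset, 'train_auxiliary_data', None)
    if aux_path:
        print(f"training with auxiliary data from {aux_path}")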

@yinweisu (Contributor Author) commented Dec 7, 2021

There is some duplicate code for extracting auxiliary paths because I found it confusing to put the logic for extracting auxiliary and regular train/test paths in a single function. I'm sure there are better designs, so feel free to propose them :).

@yinweisu (Contributor Author) commented Dec 7, 2021

@sebhrusen @PGijsbers

amlb/datasets/file.py (outdated review thread, resolved)
@PGijsbers (Collaborator)

Thank you for the contribution. I just wanted to let you know that it might be a little while before I myself can look at this. Though if Seb finds the time then I'm okay with whatever he says :)

In general I think it would be a useful addition to allow auxiliary data to be present in a task. These types of tasks have already been the focus of research and AutoML competitions. Given its free form, the only way I see this working is for the benchmark framework not to process the auxiliary data in any meaningful way.

@sebhrusen (Collaborator) left a comment

Looks more like an ad-hoc feature for AG right now; we would need to think about how it can be implemented to be useful for more frameworks.

paths = dict(train=train_path, test=test_path)
return paths

def _extract_auxiliary_paths(self, auxiliary_data, fold=None):
Collaborator

looks like a copy/paste of _extract_train_test_paths with minor changes:

  • different regex pattern
  • slightly different messages
  • as it is a recursive function, the internal call goes to _extract_auxiliary_paths instead of _extract_train_test_paths.

Given the complexity of this logic, we can't afford duplicating it.
Let's make a higher level _extract_paths function that can hold the logic for both, sth like:

def _extract_train_test_paths(self, dataset, fold=None):
    return self._extract_paths(self._extract_train_test_paths, dataset, fold)

def _extract_auxiliary_paths(self, dataset, fold=None):
    return self._extract_paths(self._extract_auxiliary_paths, dataset, fold,
                          train_suffix="train_auxiliary", test_suffix="test_auxiliary")    
    
def _extract_paths(self, extract_paths_fn, data, fold=None, train_suffix='train', test_suffix='test'):
    train_search_pat = re.compile(rf"(?:(.*)[_-]){train_suffix}(?:[_-](\d+))?\.\w+")
    ...
    return extract_paths_fn(ns(train=…)
    ...

@@ -55,6 +55,84 @@ def load(self, dataset, fold=0):
else:
raise ValueError(f"Unsupported file type: {ext}")

@profile(logger=log)
def load_auxiliary_data(self, auxiliary_data, fold=0):
Collaborator

please add a test for this under tests/unit/amlb/datasets/file
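For illustration only, such a test might look roughly like the following sketch (the file_loader fixture, the dummy file names, and the assumption that load_auxiliary_data returns a dict of resolved paths are all illustrative, not the test that was actually added):

import os

def test_load_auxiliary_data_resolves_paths(file_loader, tmp_path):
    # Dummy auxiliary archives for the loader to pick up.
    train_aux = tmp_path / "data_train_auxiliary.zip"
    test_aux = tmp_path / "data_test_auxiliary.zip"
    train_aux.touch()
    test_aux.touch()

    aux = file_loader.load_auxiliary_data(dict(train=str(train_aux), test=str(test_aux)), fold=0)

    # Both entries should point to existing files.
    assert os.path.exists(aux['train'])
    assert os.path.exists(aux['test'])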

Contributor Author

Added

Comment on lines 65 to 66
paths = dict(train=train_path, test=test_path)
return paths
Collaborator

I don't like the fact that it only returns paths. For all other data, we try to provide an object that allows consistent loading of the data while still providing the possibility to access the path directly. I don't see why it can't be done here as well.
Even if access to the data as a pandas DF is not provided immediately, we should at least have some interface.

Contributor Author

The problem here is that the auxiliary data could come in all different forms. For example, it could be a zip file containing a bunch of images. We don't know how we should handle it.

Collaborator

I understand, see my suggestion for an AuxData interface below: we can easily imagine extending it with some format property later.
Right now, I'm just concerned that providing access to those aux data this way is likely to be useful only for AG; that's why I want a better interface.

Contributor Author

So in the AuxData interface, there's a data property that's supposed to return a DF, which is not doable for many kinds of aux data. My point of view is that each framework should handle its own usage of the aux data because of its free form. I don't think providing only the path is useful only for AG.

Collaborator

My point of view is that each framework should handle their own usage of the aux data because of its free form.

you're making a design decision for them here.
Most of the frameworks currently supported deal only with tabular data, and that is the current focus of amlb. If you want to autodetect images to support your own needs, fine, you can use the path property in AuxData, but you can't generalize your approach to all frameworks.

Collaborator

So in the AuxData interface, there's a data property that's supposed to return a DF, which is not doable for many aux data

hence the idea of adding a type/format attribute to AuxData.
Again, you don't have to implement the data property now; I just want to have this AuxData interface in place now, so that we can implement more specific loading of those data later without having to rewrite everything.

Contributor Author

This makes sense. Thanks for the discussion!

self._test_auxiliary_data = auxiliary_data_path.get('test', None)

@property
def train_auxiliary_data(self) -> str:
Collaborator

if it's a path, please name it train_auxiliary_path, not ..._data: by convention in amlb, _data is used for loaded data, and some serialization logic relies on this naming convention.
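For example, the renamed accessor might look roughly like this (the wrapping class is an illustrative sketch, not the PR's actual code):

class DatasetWithAuxPathsSketch:
    def __init__(self, auxiliary_data_path: dict):
        self._train_auxiliary_path = auxiliary_data_path.get('train', None)
        self._test_auxiliary_path = auxiliary_data_path.get('test', None)

    @property
    def train_auxiliary_path(self) -> str:
        # '_path', not '_data': in amlb, '_data' is reserved for loaded data
        # and some serialization logic relies on that naming convention.
        return self._train_auxiliary_path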

@@ -167,6 +245,59 @@ def _get_metadata(self, prop):
return meta[prop]


class DatasetWithAuxiliaryData:
Collaborator

must extend Dataset if you want to expose this.

Actually, I think the right interface would be sth like:

# in data.py

class Datasplit(ABC):
    ...

    @property
    def has_auxiliary_data(self) -> bool:
        pass
       
    @property
    def auxiliary_data(self) -> List[AuxData]:
        pass
        
class AuxData(ABC):
    # does it need a name property? a type/category/labels?

    @property
    def path(self) -> str:
        pass

    @property
    @abstractmethod
    def data(self) -> DF:
        """
        :return: the auxiliary data as a pandas DataFrame.
        """
        pass

    @profile(logger=log)
    def release(self, properties=None):
        clear_cache(self, properties)

then you can implement the logic for files to fulfill the contract of this interface.
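For illustration, a file-backed implementation of that contract could look roughly like the sketch below (the class name and the csv-only data loading are assumptions, not part of the PR):

import pandas as pd

class FileAuxData(AuxData):

    def __init__(self, path: str):
        self._path = path

    @property
    def path(self) -> str:
        return self._path

    @property
    def data(self) -> pd.DataFrame:
        # Only meaningful for tabular auxiliary files; free-form payloads
        # (e.g. a zip of images) remain accessible through `path` until a
        # type/format property is added.
        if self._path.endswith('.csv'):
            return pd.read_csv(self._path)
        raise NotImplementedError("non-tabular auxiliary data: use `path` instead")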

Contributor Author

Having auxiliary data in Datasplit seems unnecessary to me, and it requires methods for adding auxiliary data to an existing Dataset because the dataset and the aux data are loaded separately. This is effectively identical to DatasetWithAuxiliaryData, which is a wrapper containing both the dataset and the aux data. I'll make it extend Dataset.

Collaborator

Having auxiliary data in Datasplit seems unnecessary

why? you were planning to have dataset.train_auxiliary_path and dataset.test_auxiliary_path.
I'm suggesting that for you it would then be accessed as dataset.train.auxiliary_data[0].path and dataset.test.auxiliary_data[0].path: it's much more consistent with the current interfaces.

Everything that is exposed to frameworks must be defined in one of the interfaces defined in data.py.

This is identical to DatasetWithAuxiliaryData

no, this is not identical: you're creating a class that doesn't extend anything, and you're adding features that are not visible in the contract defined by the interfaces, which are the only references for all data providers.

To sum up, a dataset currently has 2 splits: train and test.
You can have aux data for each of those, therefore aux data should be a property of Datasplit: if you add the aux data directly to the Dataset instance instead of Datasplit, you create an inconsistency.
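In code, the two placements would be accessed roughly as follows (illustrative usage only, assuming the interface sketched above):

# aux data attached to the Dataset instance (the PR's current approach):
train_aux_path = dataset.train_auxiliary_path

# aux data attached to each Datasplit (the suggested approach):
if dataset.train.has_auxiliary_data:
    train_aux_path = dataset.train.auxiliary_data[0].path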

@yinweisu (Contributor Author)

@sebhrusen Hey Seb, happy holidays! Can you review the code when you have time?

@yinweisu (Contributor Author)

@sebhrusen Hi, can you review this PR when you have time? Thanks!

@sebhrusen (Collaborator)

Hey @yinweisu sorry for making you wait on this: don't have much time for amlb right now, but I'll try to look at your PR during the week.
Also, I'll try to work soon with @PGijsbers on restructuring some parts of the code and setting up a contributor workflow, to make it easier for you to contribute and for us to review without being afraid of breaking existing logic.
Thanks for your understanding.

@Innixma (Collaborator) commented May 10, 2022

We have an intern joining AutoGluon whom we want to work on multi-modal optimization (i.e. datasets with an image feature). Ideally we would like him to be able to use AutoMLBenchmark as the benchmarking tool, but this is tricky without this functionality being merged. We can make do by hacking it into a forked repo, but I just wanted to mention this.
