[Feature request]: Flexibility in label transformations #160

Open · 1 of 8 tasks
laserkelvin opened this issue Mar 19, 2024 · 0 comments

laserkelvin commented Mar 19, 2024

Feature/behavior summary

Given that properties from different datasets can span large dynamic ranges and/or be strongly non-Gaussian, we should design a framework for modifying and transforming labels, ideally just before loss calculation. As part of this, it may be advantageous to calculate dataset-wide statistics on the fly, with caching.

Request attributes

  • Would this be a refactor of existing code?
  • Does this proposal require new package dependencies?
  • Would this change break backwards compatibility?
  • Does this proposal include a new model?
  • Does this proposal include a new dataset?
  • Does this proposal include a new task/workflow?

Related issues

#75 pertains to an issue with normalization not being applied; this solution would supersede it.

Solution description

One solution would be to implement this as a subclass of AbstractTransform, which mutates data in place:

class AbstractLabelTransform(AbstractTransform):
    def apply(self, *args, **kwargs):
        # mutate the label(s) of a data sample in place
        ...

    def cache_statistic(self, key, value):
        # stash a dataset-wide statistic (e.g. a mean) for reuse
        ...

    def save(self, path):
        # persist cached statistics to disk
        ...

On-the-fly statistics could be calculated with a moving average or similar, then cached to disk keyed on the dataset class and the dataset path. The main wrinkle is synchronization: in DDP scenarios, we'd want to make sure the statistics are the same across every data loader worker, which could probably be handled with a reduction call.
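A minimal sketch of what that accumulation plus reduction could look like (RunningLabelStats and its methods are hypothetical, not existing matsciml code):

import torch
import torch.distributed as dist

class RunningLabelStats:
    """Accumulate raw moments of a label so mean/std can be derived on the fly."""

    def __init__(self) -> None:
        self.total = 0.0     # running sum of label values
        self.total_sq = 0.0  # running sum of squared label values
        self.count = 0

    def update(self, values: torch.Tensor) -> None:
        # fold one batch worth of label values into the running moments
        self.total += values.float().sum().item()
        self.total_sq += values.float().pow(2).sum().item()
        self.count += values.numel()

    def sync(self) -> None:
        # all-reduce the raw moments so every DDP rank derives identical
        # statistics; a no-op outside of distributed runs
        if dist.is_available() and dist.is_initialized():
            packed = torch.tensor([self.total, self.total_sq, float(self.count)])
            dist.all_reduce(packed, op=dist.ReduceOp.SUM)
            self.total, self.total_sq, count = packed.tolist()
            self.count = int(count)

    @property
    def mean(self) -> float:
        return self.total / self.count

    @property
    def std(self) -> float:
        return max(self.total_sq / self.count - self.mean ** 2, 0.0) ** 0.5

Reducing raw sums rather than per-rank means keeps the combined statistics exact even when ranks see different numbers of samples.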

We can then implement concrete versions of the transforms:

class NormalTransform(AbstractLabelTransform):
    # rescales based on mean/std
    ...

class MinMaxTransform(AbstractLabelTransform):
    # rescales to [min, max] of a specified value, or of the dataset
    ...

class LambdaTransform(AbstractLabelTransform):
    # a bit dicey, but applies an arbitrary function to a key
    ...

class ExponentialTransform(AbstractLabelTransform):
    # many properties have long-tailed distributions
    ...

The idea would be that you could freely compose these such that different labels can be transformed in different ways.
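For example, composition could be as simple as applying a list of transforms in order (the label keys, statistics, and constructor arguments here are purely illustrative, following the NormalTransform sketch above):

transforms = [
    NormalTransform(key="formation_energy", mean=-1.2, std=0.8),
    ExponentialTransform(key="band_gap"),
    LambdaTransform(key="forces", func=lambda x: 0.1 * x),
]

def transform_labels(data: dict) -> dict:
    for t in transforms:
        data = t.apply(data)  # each transform only touches its own key
    return data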

Alternatively:

  • As a pl.Callback, since it has access to the discrete before/after-step hooks, which could be helpful for getting at batch data (see the sketch after this list).
  • We could keep the existing normalization steps used in _compute_losses; however, caching and the like wouldn't be as flexible there.
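A sketch of the callback route, assuming the transform API above (hook signature per recent PyTorch Lightning releases):

import pytorch_lightning as pl

class LabelTransformCallback(pl.Callback):
    """Apply label transforms to each batch right before the training step."""

    def __init__(self, transforms: list) -> None:
        self.transforms = transforms

    def on_train_batch_start(self, trainer, pl_module, batch, batch_idx) -> None:
        # callback hook return values are ignored, so the batch must be
        # mutated in place
        for t in self.transforms:
            t.apply(batch)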

Additional notes

A task list based on the transform-based solution (convert to issues/PRs for tracking):

laserkelvin added the enhancement label Mar 19, 2024
laserkelvin self-assigned this Mar 20, 2024