Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] ExpandingCutoffSplitter - splitter by moving cutoff #6360

Merged
merged 33 commits into from
May 20, 2024

Conversation

ninedigits
Copy link
Contributor

Reference Issues/PRs

#6327
#6359

What does this implement/fix? Explain your changes.

This PR introduces a new splitter, ExpandingCutoffSplitter, to the time series forecasting module of sktime. This splitter starts with an initial training window based on a specified cutoff point and expands incrementally in each split. This feature addresses the need for a dynamic adjustment of the training window in time series forecasting based on a cutoff date, especially in cases where many different time series will be tested on common cutoff dates but otherwise have different starting points.

Does your contribution introduce a new dependency? If yes, which one?

No new dependencies are introduced with this PR.

What should a reviewer concentrate their feedback on?

  • Implementation of the ExpandingCutoffSplitter logic.
  • Integration with the existing splitter architecture.
  • Applicability and ease of use in various forecasting scenarios.

Did you add any tests for the change?

Yes, tests have been added to ensure the ExpandingCutoffSplitter works as intended and interacts correctly with other splitters and forecasting tools in sktime.

Any other comments?

PR checklist

For all contributions
  • [ x] I've added myself to the list of contributors with any new badges I've earned :-)
    How to: add yourself to the all-contributors file in the sktime root directory (not the CONTRIBUTORS.md). Common badges: code - fixing a bug, or adding code logic. doc - writing or improving documentation or docstrings. bug - reporting or diagnosing a bug (get this plus code if you also fixed the bug in the PR).maintenance - CI, test framework, release.
    See here for full badge reference
  • Optionally, for added estimators: I've added myself and possibly to the maintainers tag - do this if you want to become the owner or maintainer of an estimator you added.
    See here for further details on the algorithm maintainer role.
  • [ x] The PR title starts with either [ENH], [MNT], [DOC], or [BUG]. [BUG] - bugfix, [MNT] - CI, test framework, [ENH] - adding or improving code, [DOC] - writing or improving documentation or docstrings.
For new estimators
  • I've added the estimator to the API reference - in docs/source/api_reference/taskname.rst, follow the pattern.
  • I've added one or more illustrative usage examples to the docstring, in a pydocstyle compliant Examples section.
  • If the estimator relies on a soft dependency, I've set the python_dependencies tag and ensured
    dependency isolation, see the estimator dependencies guide.

@ninedigits
Copy link
Contributor Author

ninedigits commented Apr 28, 2024

@fkiraly I think there are some issues with other commits getting included here, but I think the core logic of ExpandingCutoffSplitter is still worth reviewing.

@fkiraly
Copy link
Collaborator

fkiraly commented Apr 29, 2024

@fkiraly I think there are some issues with other commits getting included here, but I think the core logic of ExpandingCutoffSplitter is still worth reviewing.

I don't see any other commits? All looks fine.

Further, could you kindly ensure your code passes formatting, then the automated tests will run.
Guide: https://www.sktime.net/en/stable/developer_guide/coding_standards.html

@fkiraly fkiraly added enhancement Adding new functionality module:splitters&resamplers data splitters and resamplers labels Apr 29, 2024
@fkiraly
Copy link
Collaborator

fkiraly commented May 1, 2024

restarted tests

Comment on lines 42 to 43
|---------------------|←---fh---→|------|
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fkiraly Any issues with including characters like ↓←→ as seen in line 42-43? Not sure if this is causing one of the unit tests to fail.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure - I cannot download or access the logs. So, perhaps? I will try to run the tests locally.

@yarnabrina or @benHeid - can you see the test logs?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just ran it locally - the failure is due to fh being modified in __init__. For any sklearn-like object, the self.param should be identical to what was passed in __init__. If you want to check_fh, you should write it to self._fh or similar.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yarnabrina or @benHeid - can you see the test logs?

Yes, I can see CI logs. Sharing two sample failures:

FAILED sktime/tests/test_all_estimators.py::TestAllObjects::test_set_params_sklearn[ExpandingCutoffSplitter] - AssertionError: get_params result of ExpandingCutoffSplitter (x) does not match what was passed to set_params (y). Reason for discrepancy: [fh].type, x.type = <class 'sktime.forecasting.base._fh.ForecastingHorizon'> != y.type = <class 'list'>
assert False
FAILED sktime/split/tests/test_expandingcutoff.py::test_expandingcutoff_fh_list_007 - AttributeError: attribute 'freq' of 'pandas._libs.tslibs.timestamps._Timestamp' objects is not writable

Copy link
Contributor Author

@ninedigits ninedigits May 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @yarnabrina and @fkiraly. I believe I fixed those test cases, I think a few more errors might pop up for other issues.

I think L119-L127

        input_args = [y[0], self.cutoff]
        if not _inputs_are_supported(input_args) and not _is_all_args_periodlike(
            input_args
        ):
            msg = (
                "y indicies and cutoff must have the same datatypes, but instead "
                f"found type(y) = {y.dtype} and type(cutoff) = {type(self.cutoff)}"
            )
            raise TypeError(msg)

Might conflict with:

    def get_test_params(cls, parameter_set="default"):
        """..."""
        return [{"fh": [2, 4], "cutoff": 3, "step_length": 1}]

I have the current class designed such that you can only use cutoff as an int if the index of y is also int, and conversely, when y has a datelike index, then cutoff must be datelike.

Here's one of the errors:

FAILED sktime/split/tests/test_all_splitters.py::TestAllSplitters::test_split_loc[ExpandingCutoffSplitter-False] - TypeError: y indicies and cutoff must have the same datatypes, but instead found type(y) = datetime64[ns] and type(cutoff) = <class 'int'>

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that's probably related to one of the special characters - it is always a bit of a hassle to decode "position 1234" to the actual location.

Copy link
Collaborator

@fkiraly fkiraly May 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this make sense? I might be overanalyzing it.

I think it makes sense, and it might be confusing to the user if not explained clearly how cutoff and index types combine.

Still, I can think of iloc indexing in case of date-like index making sense: In the picture above:

index: 123456789
s1:    --*******
s2:    *********
s3:    -********

this alignment happens if you use negative iloc indices, e.g., -1 and -2 will index the last two indices.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this alignment happens if you use negative iloc indices, e.g., -1 and -2 will index the last two indices.

I’d considered this as well. I'd be open to modifying the code to work with datetime indices when the cutoff value is an int. Should we design it such that only negative cutoff values are allowed when working with datetime-like indices, or should we allow all integers? The former might still be a headache for the get_test_params method.

We need to clarify some assumptions. For hierarchical data, it’s not guaranteed that subgroups will end with the same index. If they don't end with the same index (represented by a datetime-like value in this example), then negative indices would reference different datetime values. However, if they do end with the same index, which I suspect is the case for most scenarios, then negative indices should work appropriately.

What do you think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we design it such that only negative cutoff values are allowed when working with datetime-like indices, or should we allow all integers?

This is really a difficult question, sorry for the delay in replying.

I would allow all integers, even in date-like case. That would be consistent with negative integers, so needs less case handling.

If they don't end with the same index (represented by a datetime-like value in this example), then negative indices would reference different datetime values. However, if they do end with the same index, which I suspect is the case for most scenarios, then negative indices should work appropriately.

I think it is fine if negative iloc index refers to different loc index - this could be a reasonable scenario, similar to the greedy splitter. In any case, as long as we explain this clearly to the user, it will be ok imo, all solutions seem to have an unpleasant aspect, but this is possibly the least confusing one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fkiraly I actually implemented this in the last commit. The logic goes like this:

  • When an index with dates is paired with an integer cutoff, the cutoff serves as an index locator (iloc).
  • When the index is an integer, it directly compares the values, except when the integer is negative, in which case it functions as an iloc.
  • example: index = 5 6 7 8; cutoff = 6 would be a direct comparison, whereas cutoff -2 would be the iloc equivalent. If, for example, someone used a cutoff of 10, then it would raise an error, since it's not present in the index.

Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Minor points:

  • I think the Parameters docstring section has one indent too much
  • do you want to say a few words how the different index/offset combinations are interpreted, in the docstring preamble?
  • could you add at least one more test parameter set?

@ninedigits ninedigits changed the title [ENH] Implement ExpandingCutoffFilter [ENH] Implement ExpandingCutoffSplitter May 12, 2024
@ninedigits
Copy link
Contributor Author

Looks good!

Minor points:

  • I think the Parameters docstring section has one indent too much
  • do you want to say a few words how the different index/offset combinations are interpreted, in the docstring preamble?
  • could you add at least one more test parameter set?

Done.

@fkiraly
Copy link
Collaborator

fkiraly commented May 12, 2024

seems like the docstring example is failing - looks like a minor thing

@ninedigits
Copy link
Contributor Author

ninedigits commented May 12, 2024

seems like the docstring example is failing - looks like a minor thing

Ah, I think I fixed it now.

Edit: apparently I did not fix it. Not sure what the issue is.

@fkiraly
Copy link
Collaborator

fkiraly commented May 13, 2024

The failure is unrelated, it is this sporadic problem: #6406

@ninedigits
Copy link
Contributor Author

The failure is unrelated, it is this sporadic problem: #6406

Oh, got it. Anything left to do on my end?

fkiraly
fkiraly previously approved these changes May 14, 2024
Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, all fine!

Forgot to approve, apologies for the oversight.

@fkiraly fkiraly changed the title [ENH] Implement ExpandingCutoffSplitter [ENH] ExpandingCutoffSplitter - splitter by moving cutoff May 20, 2024
@fkiraly fkiraly merged commit 26ad073 into sktime:main May 20, 2024
1 of 2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Adding new functionality module:splitters&resamplers data splitters and resamplers
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants