TaskLoader makes copies of data, leading to duplication in memory #82

Open
tom-andersson opened this issue Oct 18, 2023 · 0 comments
In many deepsensor modelling scenarios, the user will pass the same dataset (xarray or pandas) as both the context and the target of the TaskLoader. In these cases the TaskLoader should clearly use a single object in memory. However, part of the TaskLoader's processing returns a copy of each data object, so the context and target entries end up referencing different objects and the data is duplicated in memory. See the code example below.

import deepsensor.torch  # selects the PyTorch backend
from deepsensor.data import DataProcessor, TaskLoader

import xarray as xr

# Load raw data
ds_raw = xr.tutorial.open_dataset("air_temperature")

# Normalise data
data_processor = DataProcessor(x1_name="lat", x2_name="lon")
ds = data_processor(ds_raw)

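# Pass the same object as both context and target
task_loader = TaskLoader(context=ds, target=ds)

# The two entries are distinct objects, so the data is held in memory twice
print(task_loader.context[0] is task_loader.target[0])  # prints: False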

One solution is to use a hashmap/dict that is shared between the context and target data, so that each distinct dataset is stored only once. Some thought would be needed on what the keys of the hashmap should be, and on how the context and target lists should link to its entries.
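To make the hashmap idea concrete, here is a minimal sketch in plain Python. It is not deepsensor code: `SharedDataRegistry`, `build_lists`, and the `process` callable are hypothetical names, and `copy.deepcopy` merely stands in for whatever copying step the TaskLoader performs. The sketch assumes in-memory objects are keyed by `id()` and file paths by their string form, which is one possible answer to the question of what the keys should be.

```python
import copy
from pathlib import Path
from typing import Any, Callable, Dict, Hashable, List, Tuple


class SharedDataRegistry:
    """Hypothetical helper: processes each distinct entry once and shares the result."""

    def __init__(self, process: Callable[[Any], Any]):
        self._process = process  # stand-in for the loader's copy/load step
        # key -> (original entry, processed entry). Keeping the original alive
        # guarantees its id() cannot be reused by another object.
        self._store: Dict[Hashable, Tuple[Any, Any]] = {}

    @staticmethod
    def _key(entry: Any) -> Hashable:
        # Assumption: file paths are keyed by their string form, objects by identity.
        if isinstance(entry, (str, Path)):
            return ("path", str(Path(entry)))
        return ("object", id(entry))

    def get(self, entry: Any) -> Any:
        key = self._key(entry)
        if key not in self._store:
            self._store[key] = (entry, self._process(entry))
        return self._store[key][1]


def build_lists(
    context: List[Any], target: List[Any], process: Callable[[Any], Any]
) -> Tuple[List[Any], List[Any]]:
    """Build context and target lists that link to shared registry entries."""
    registry = SharedDataRegistry(process)
    return [registry.get(e) for e in context], [registry.get(e) for e in target]


# The same object on both sides is now processed (copied) only once
data = {"air": [1.0, 2.0, 3.0]}
context, target = build_lists([data], [data], copy.deepcopy)
print(context[0] is target[0])  # prints: True
```

Keying by `id()` only deduplicates entries that are literally the same Python object; two separately loaded but equal datasets would still be stored twice, and catching those would need a content-based key instead.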

We will need to test this for both the xarray and pandas cases, and also for the case where the context/target entries are file paths rather than xarray/pandas objects.
