Question about custom data generator #75

Open
YanaHontarenko opened this issue Jun 9, 2020 · 0 comments
Labels
question Further information is requested

Comments

@YanaHontarenko

I see that you previously answered that "for big amount of data you can fit model several times" (#8).
However, I have not worked with PyTorch before and don't know how this is supposed to work: how is the information about losses and gradients carried over between different parts of the dataset?
That is why I want to ask whether your library can fit with a custom data generator (like fit_generator in Keras). Or maybe you can point me to an example of such a case.
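
On the question of how losses and gradients carry over between parts of a dataset: in plain PyTorch nothing has to be passed along explicitly, because the model's parameters and the optimizer's state persist between training calls, while gradients are recomputed for each batch. The following is a minimal sketch of that idea and not this library's API. The toy `nn.Linear` model, the synthetic `chunks`, and the `fit_chunk` helper are all hypothetical stand-ins for the real model and data.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy model; a stand-in for the real network.
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

# Hypothetical synthetic data split into five "parts" of the dataset.
true_w = torch.tensor([[1.0], [-2.0], [0.5], [3.0]])
chunks = []
for _ in range(5):
    x = torch.randn(64, 4)
    chunks.append((x, x @ true_w))


def fit_chunk(x, y, steps=20):
    """Run a few optimisation steps on one chunk and return the last loss."""
    for _ in range(steps):
        optimizer.zero_grad()        # discard gradients from the previous step
        loss = loss_fn(model(x), y)
        loss.backward()              # gradients for this chunk only
        optimizer.step()             # update the shared weights in place
    return loss.item()


# Calling fit_chunk repeatedly continues training from the current weights.
losses = [fit_chunk(x, y) for x, y in chunks]
print(losses)
```

Each call starts from the weights left by the previous one, so the only state shared between chunks is the model and the optimizer themselves.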

This is what my data class looks like (I previously saved the different parts of the data in "data.npz"):

import numpy as np
from torch.utils.data import Dataset, DataLoader
class Data(Dataset):
    def __init__(self, set, batch_size=32, shuffle=True):
        self.data = np.load("data.npz", allow_pickle=True)
        self.set = set
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.on_epoch_end()

    def __len__(self):
        return int(np.floor(len(self.indexes) / self.batch_size))

    def __getitem__(self, index):
        temp_indexes = self.indexes[index * self.batch_size:(index + 1) * self.batch_size]

        sequences_batch, clusters_batch = self.__data_generation(temp_indexes)

        return sequences_batch, clusters_batch

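    # Note: DataLoader does not call this hook (it is a Keras convention);
    # it only runs once, from __init__, unless called manually.
    def on_epoch_end(self):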
        self.indexes = np.arange(self.data[f'{self.set}_sequence'].shape[0])
        if self.shuffle:
            np.random.shuffle(self.indexes)

    def __data_generation(self, temp_indexes):
        sequences = self.data[f'{self.set}_sequence'][temp_indexes]
        clusters = self.data[f'{self.set}_cluster_id'][temp_indexes]
        sequences = [seq.astype(float) + 0.00001 for seq in sequences]
        clusters = [np.array(cid).astype(str) for cid in clusters]

        return sequences, clusters

And this is how I create the generator:

train_set = Data("train", 32, True)
# Data already returns whole batches, so automatic batching is disabled.
train_generator = DataLoader(train_set, batch_size=None)
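
Because the dataset above builds a full batch inside `__getitem__`, one way to combine it with `DataLoader` is to turn off automatic batching with `batch_size=None`. The sketch below illustrates this with in-memory synthetic data instead of "data.npz". The `PreBatched` class, the random sequences, and the identity `collate_fn` are assumptions made for illustration only.

```python
import numpy as np
from torch.utils.data import Dataset, DataLoader


class PreBatched(Dataset):
    """Hypothetical dataset whose __getitem__ returns one whole batch."""

    def __init__(self, sequences, cluster_ids, batch_size=32):
        self.sequences = sequences
        self.cluster_ids = cluster_ids
        self.batch_size = batch_size

    def __len__(self):
        # Number of complete batches; a trailing partial batch is dropped.
        return len(self.sequences) // self.batch_size

    def __getitem__(self, index):
        start = index * self.batch_size
        stop = start + self.batch_size
        return self.sequences[start:stop], self.cluster_ids[start:stop]


# Synthetic variable-length sequences with string cluster ids.
rng = np.random.default_rng(0)
sequences = [rng.normal(size=(int(rng.integers(5, 10)), 3)) for _ in range(100)]
cluster_ids = [np.array(["0"] * len(s)) for s in sequences]

loader = DataLoader(
    PreBatched(sequences, cluster_ids, batch_size=32),
    batch_size=None,                 # disable automatic batching
    shuffle=True,                    # visit the batches in random order
    collate_fn=lambda batch: batch,  # keep the NumPy arrays untouched
)

batches = list(loader)
for seqs, ids in batches:
    print(len(seqs), len(ids))
```

With `shuffle=True` here, only the order of the pre-built batches changes between epochs. The samples inside each batch stay the same unless the dataset reshuffles its own indexes.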

P.S.: I'll be happy to receive any help, because I'm not even sure that I'm going in the right direction.

@YanaHontarenko YanaHontarenko added the question Further information is requested label Jun 9, 2020