DataLoader does not work with list of arrays of different sizes #506

Open
douglas125 opened this issue Jun 16, 2018 · 2 comments

Comments

@douglas125

I would like to use a DataLoader to iterate through the objects in a dataset.

However, the DataLoader seems to get stuck when the arrays have different sizes. I assume the DataLoader tries to return a NumPy array of shape (batch_size, my_size), but the kernel freezes without any error message.

If the DataLoader could pad the arrays with 0 (or a value the user can choose), it would solve my problem. Note that the code works if the NumPy arrays all have the same length.

You can reproduce the issue like this:

import mxnet as mx
import numpy as np

batch_test = [np.array([1,2,3,1,2,6])]
batch_test.append(np.array([3,2,1,1,5]))
batch_test.append(np.array([3,1,2,6]))
batch_test.append(np.array([1,2,2,6]))
batch_test.append(np.array([1,4]))

dataset = mx.gluon.data.dataset.ArrayDataset(batch_test, batch_test)

from multiprocessing import cpu_count
CPU_COUNT = cpu_count()

ctx = mx.cpu()  # run on CPU; use mx.gpu(0) instead if a GPU is available

data_loader = mx.gluon.data.DataLoader(dataset, batch_size=3, num_workers=1) #CPU_COUNT)
print('We do get here')
for X_batch, y_batch in data_loader:
    print("X_batch has shape {}, and y_batch has shape {}".format(X_batch.as_in_context(ctx).shape, y_batch.shape))
    
print('But we never get here')
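
The assumption above, that batching needs a rectangular array of shape (batch_size, my_size), can be illustrated with NumPy alone. The sketch below is not MXNet's batching code and does not reproduce the freeze; it only shows that arrays of unequal length cannot be stacked into one array, while arrays of equal length stack cleanly. The helper name `try_stack` is made up for this illustration.

```python
import numpy as np

# First batch of three samples from the reproduction above (lengths 6, 5, 4).
ragged_batch = [
    np.array([1, 2, 3, 1, 2, 6]),
    np.array([3, 2, 1, 1, 5]),
    np.array([3, 1, 2, 6]),
]


def try_stack(arrays):
    """Return the stacked shape, or None if the arrays cannot be stacked."""
    try:
        return np.stack(arrays).shape
    except ValueError:
        # np.stack requires every input array to have the same shape.
        return None


ragged_shape = try_stack(ragged_batch)
# Truncate every sample to length 4 purely to show that equal lengths stack.
equal_shape = try_stack([a[:4] for a in ragged_batch])

print(ragged_shape)
print(equal_shape)
```
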
@aakashpatel25

It has been 3 months since this issue was opened. Is there any solution? I am facing a similar issue: I have to pad my data myself to get it all into the same shape!

@ThomasDelteil
Contributor

ThomasDelteil commented Nov 15, 2018

@douglas125 @aakashpatel25 you can use a custom transform function on your dataset that pads every sample to the length of the longest array:

max_len = max([len(array) for array in batch_test])

def transform(x1, x2):
    x1_ = np.zeros(max_len)
    x2_ = np.zeros(max_len)
    x1_[:len(x1)] = x1
    x2_[:len(x2)] = x2
    return x1_, x2_

data_loader = mx.gluon.data.DataLoader(dataset.transform(transform), batch_size=3, num_workers=0)
for X_batch, y_batch in data_loader:
    print("X_batch has shape {}, and y_batch has shape {}".format(X_batch.shape, y_batch.shape))

Output:

X_batch has shape (3, 6), and y_batch has shape (3, 6)
X_batch has shape (2, 6), and y_batch has shape (2, 6)
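
The original request also asked for a pad value the user can choose. A minimal sketch of that variation, using NumPy only, might look like the following. The factory name `make_pad_transform` and its `pad_value` parameter are hypothetical and not part of MXNet. It uses `np.full` with the input's dtype so that integer arrays stay integers, whereas `np.zeros` produces floats by default.

```python
import numpy as np


def make_pad_transform(max_len, pad_value=0):
    """Build a transform that right-pads both arrays of a sample to max_len."""

    def pad(x):
        # np.full keeps the input dtype; np.zeros would default to float64.
        out = np.full(max_len, pad_value, dtype=x.dtype)
        out[:len(x)] = x
        return out

    def transform(x1, x2):
        return pad(x1), pad(x2)

    return transform


batch_test = [
    np.array([1, 2, 3, 1, 2, 6]),
    np.array([3, 2, 1, 1, 5]),
    np.array([3, 1, 2, 6]),
    np.array([1, 2, 2, 6]),
    np.array([1, 4]),
]
max_len = max(len(a) for a in batch_test)

# Pad with -1 instead of 0 so padding is distinguishable from real values.
transform = make_pad_transform(max_len, pad_value=-1)
padded_x, padded_y = transform(batch_test[4], batch_test[4])
print(padded_x)
```

The returned function takes the same `(x1, x2)` arguments as the transform above, so it should be usable with `dataset.transform(...)` in the same way, though that combination was not tested here.
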

Or use the batchify functions from the gluon-cv package, which pad each batch only to the length of its own longest sample:

import gluoncv
from gluoncv.data.batchify import Pad, Tuple
import mxnet as mx
import numpy as np

batch_test = [np.array([1,2,3,1,2,6])]
batch_test.append(np.array([3,2,1,1,5]))
batch_test.append(np.array([3,1,2,6]))
batch_test.append(np.array([1,2,2,6]))
batch_test.append(np.array([1,4]))

batchify = Tuple(Pad(), Pad())
dataset = mx.gluon.data.dataset.ArrayDataset(batch_test, batch_test)

data_loader = mx.gluon.data.DataLoader(dataset, batch_size=3, num_workers=0, batchify_fn = batchify)
for X_batch, y_batch in data_loader:
    print("X_batch has shape {}, and y_batch has shape {}".format(X_batch.shape, y_batch.shape))

Output:

X_batch has shape (3, 6), and y_batch has shape (3, 6)
X_batch has shape (2, 4), and y_batch has shape (2, 4)
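
For readers who want to see what per-batch padding does without installing gluon-cv, here is a rough NumPy-only sketch. It is not the gluon-cv implementation; `pad_batch` and `pad_batchify` are hypothetical names, and the loop simply slices the samples into groups of three to mimic `batch_size=3` without shuffling. In a real pipeline the result would probably need converting to an MXNet NDArray before being returned from a `batchify_fn`; that step is left out so the sketch runs with NumPy alone.

```python
import numpy as np


def pad_batch(arrays, pad_value=0):
    """Right-pad 1-D arrays to the longest length in this batch and stack them."""
    batch_max = max(len(a) for a in arrays)
    out = np.full((len(arrays), batch_max), pad_value, dtype=arrays[0].dtype)
    for row, a in enumerate(arrays):
        out[row, :len(a)] = a
    return out


def pad_batchify(samples):
    """Hypothetical batchify function: samples is a list of (x, y) pairs."""
    xs, ys = zip(*samples)
    return pad_batch(xs), pad_batch(ys)


batch_test = [
    np.array([1, 2, 3, 1, 2, 6]),
    np.array([3, 2, 1, 1, 5]),
    np.array([3, 1, 2, 6]),
    np.array([1, 2, 2, 6]),
    np.array([1, 4]),
]
samples = list(zip(batch_test, batch_test))

# Batches of three, mirroring batch_size=3 with no shuffling.
shapes = []
for start in range(0, len(samples), 3):
    X_batch, y_batch = pad_batchify(samples[start:start + 3])
    shapes.append((X_batch.shape, y_batch.shape))
print(shapes)
```

The second batch is padded only to length 4, matching the shapes printed above, because padding is computed per batch rather than once for the whole dataset.
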

3 participants