This repository has been archived by the owner on Nov 1, 2021. It is now read-only.

Got stuck in getBatch with larger batch size #21

Open

joeyhng opened this issue Mar 27, 2016 · 7 comments

joeyhng commented Mar 27, 2016

The following code reproduces the error:

local Dataset = require 'dataset.Dataset'
local paths = require 'paths'   -- for tmpname
local lapp = require 'pl.lapp'  -- command-line option parsing

local opt = lapp[[
Got stuck in torch-dataset with batchSize == 128

(options)
   --batchSize     (default 128)    how many images in a mini-batch?
]]

-- create a temporary CSV file containing many rows
local tmpcsv = paths.tmpname() .. '.csv'
local f = assert(io.open(tmpcsv, 'w'))
f:write('filename\n')
for i=1,300 do
  f:write(paths.tmpname() .. '\n')
end
f:close()

dataset = Dataset(tmpcsv)

getBatch, numBatches, reset = dataset.sampledBatcher({
  batchSize = opt.batchSize,
  inputDims = {10, 256},
  verbose = true,
  poolSize = 4,
  get = function(x)
    return torch.FloatTensor(10,256)
  end,
  processor = function(res, processorOpt, input) 
    return true
  end,
})

print('before getBatch')
local batch = getBatch()
print('finish getBatch')

Strangely, the program works when batchSize is 64, but gets stuck in getBatch() when batchSize is 128.

I have run into this in several different projects that use a custom get function and load data with a non-default method such as image.load: batchSize 64 works, but 128 does not.

Any ideas are appreciated. Thanks!

zakattacktwitter (Contributor) commented

Hi,

I am not sure what you are trying to accomplish with this sample code. Can you provide a high-level explanation of what you want to use Dataset for?

Thanks,
Zak

joeyhng (Author) commented Mar 29, 2016

In my actual application, I'm usually trying to do something like this:

getBatch, numBatches, reset = dataset.sampledBatcher({
  batchSize = opt.batchSize,
  inputDims = {10, 256},
  verbose = true,
  poolSize = 4,
  get = function(x)
    return torch.load(x) -- or some other loading function like image.load / npy4th.load
  end,
  processor = function(res, processorOpt, input) 
    local x = augment(res) -- some data augmentation function
    input:copy(x)
    return true
  end,
})

That is, I use a custom get function to load the data and do some data augmentation in processor.

This issue comes up in several similar scenarios where a larger batch size gets stuck. Thanks for your help.

zakattacktwitter (Contributor) commented

Try not setting the poolSize option; that's a tricky one to set.
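For reference, that just means dropping the poolSize key from the options table and letting Dataset pick its own default. A sketch based on joeyhng's snippet earlier in the thread (augment is the user's own function):

getBatch, numBatches, reset = dataset.sampledBatcher({
  batchSize = opt.batchSize,
  inputDims = {10, 256},
  verbose = true,
  -- poolSize intentionally omitted so Dataset uses its internal default
  get = function(x)
    return torch.load(x)
  end,
  processor = function(res, processorOpt, input)
    input:copy(augment(res))
    return true
  end,
})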

joeyhng (Author) commented Mar 29, 2016

Yes, I find that not setting poolSize removes this error, but sometimes after running for a longer time the process gets killed (it just prints "Killed" to stderr), and I haven't figured out why yet. I suspect it is creating too many threads.

Should poolSize be limited by the number of cores on the machine? Are there any guidelines for how to set it?

zakattacktwitter (Contributor) commented

It's not really meant for users to set. I should probably remove it.

The threads are created once at startup and no more are created after that, so it doesn't make sense that the crash was due to too many threads.

The way you are using Dataset, putting torch.load in a custom get function, will create a lot of garbage and definitely won't be fast.

How is your data laid out? Is it a whole bunch of little files on disk? If you describe your data, I can help you use Dataset to sample it efficiently.

joeyhng (Author) commented Mar 29, 2016

I'm processing video data, which is saved on a hard drive mounted on the system. I usually save it in one of two formats:

  1. Extracted frame-level features, usually in npy or t7 format. Each file contains the extracted features of a specific video as a T x D tensor.
  2. Video frames as images. Each video has its own directory containing a number of .jpg files, one per frame. I usually sample and load a few consecutive frames from the directory in the get or processor function.
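For the second layout, one way to avoid loading inside get is to index every frame file into a CSV up front, so Dataset samples individual frames directly. A hypothetical sketch (the root /data/videos and the one-.jpg-per-frame layout are assumptions based on the description above):

local paths = require 'paths'

-- Build a CSV with one row per frame image, assuming the layout
-- /data/videos/<video_id>/<frame>.jpg described above.
local f = assert(io.open('frames.csv', 'w'))
f:write('filename\n')
for video in paths.iterdirs('/data/videos') do
  local vdir = paths.concat('/data/videos', video)
  for frame in paths.iterfiles(vdir) do
    if frame:match('%.jpg$') then
      f:write(paths.concat(vdir, frame) .. '\n')
    end
  end
end
f:close()

local Dataset = require 'dataset.Dataset'
local dataset = Dataset('frames.csv')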

Thanks a lot for your help!

zakattacktwitter (Contributor) commented

Hi,

You can now adjust poolSize as much as you want.

The deadlock has been fixed in the IPC ( https://github.com/twitter/torch-ipc ) package. Just get the latest version of it and you should be good to go.

Thanks,
Zak
