
shuffle buffer issue? #7

Closed

jfb54 opened this issue Mar 15, 2021 · 8 comments

Comments

@jfb54

jfb54 commented Mar 15, 2021

I suspect that your reader code is affected by the meta-dataset shuffle buffer issue (#54). I did a full run with your reader and the results were mostly consistent with what I get using the official meta-dataset reader, except for traffic signs (and a couple of other datasets), where the results were more optimistic, as they would be if the data were not shuffled. From a quick look through your code, it seems that the shuffle buffer mechanism is not used.

@mboudiaf
Owner

mboudiaf commented Mar 17, 2021

Hmm, that's interesting, thanks a lot for bringing this up! Let me look into it and come back to you with more answers! :)

Update: I may have found the cause of the problem, which comes from the TFRecordDataset reading the stream linearly, without shuffling, by default. By passing the argument shuffle_queue_size, the reader always keeps some samples in a buffer and yields them in random order, which should solve the problem (it may, however, require a lot of memory if shuffle_queue_size is set too high). Please let me know if you're now able to get results consistent with the official implementation :) Thanks!
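For context, a minimal sketch of what this looks like with the standalone tfrecord reader this repo builds on (the data path and record description below are illustrative, not the repo's actual values):

```python
from tfrecord.torch.dataset import TFRecordDataset

# shuffle_queue_size keeps a reservoir of serialized examples and yields them
# in random order; a larger queue shuffles better but uses more memory.
dataset = TFRecordDataset(
    data_path="records/aircraft_train.tfrecords",   # hypothetical file
    index_path=None,
    description={"image": "byte", "label": "int"},  # assumed record layout
    shuffle_queue_size=10,
)
sample = next(iter(dataset))
```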

mboudiaf added a commit that referenced this issue Apr 1, 2021
…licate data when num_workers > 0 + attempt to solve the problem of generator pickling #5
@jfb54
Author

jfb54 commented Apr 5, 2021

I did a run using your reader with the latest changes, and I believe the shuffle buffer issue is still present. I haven't looked carefully at how you implemented the shuffling, but I do know that the recommended shuffle buffer size for examples drawn from a class is 1000.

@mboudiaf
Owner

mboudiaf commented Apr 5, 2021

Hey,
Thanks for testing this again. The only change I made for this problem is at this line:

shuffle_queue_size=self.shuffle_queue_size)

By default I've set it to 10 so as not to cause memory problems, but you can easily hard-code it to 1000 by modifying that line. Please let me know if that changes anything, and thanks again!

@jfb54
Author

jfb54 commented Apr 10, 2021

Unfortunately, setting this to 1000 does run into memory issues. To do proper Meta-Dataset training and evaluation, you need 19 iterators (1 for training, 8 for the validation datasets, and 10 for the test datasets). When I ran this on one GPU of an 8-GPU cluster, it used so many resources that the jobs were automatically killed by the system. I'm not sure how to work around this.

@mboudiaf
Owner

Hi,

I've solved the problem of shuffling by completely getting rid of the idea of keeping a buffer :). The idea is simply to pre-create an index file for each .tfrecords file that records the (start_byte, end_byte) of each sample in that file. Then, once the iterator is queried, it generates a random ordering of the samples and only loads 1 sample into memory at a time by seeking to the right bytes on disk. This adds zero memory overhead, is fast, and should scale to an arbitrary number of datasets (a sketch of the idea follows the list below). Concretely:

  1. To create the .index files for all 10 datasets, I've written a script that you can simply execute:
bash make_index_files.sh [PATH_TO_CONVERTED_DATA]
  2. There is no more "shuffle_queue_size"; there is only the binary "shuffle" option. If you activate it, each class dataset will be read in a random order. Once all samples have been processed, a new random permutation is generated, and so on.
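For intuition, here is a minimal sketch of this index-based random access (the index format and names are illustrative, assuming each index line stores a record's byte offset and length; the repo's actual code may differ):

```python
import numpy as np

def load_index(index_path):
    # Assumed format: one "offset length" pair per line, one line per record.
    offsets, lengths = [], []
    with open(index_path) as f:
        for line in f:
            off, length = line.split()
            offsets.append(int(off))
            lengths.append(int(length))
    return offsets, lengths

def iterate_shuffled(tfrecord_path, index_path):
    offsets, lengths = load_index(index_path)
    order = np.random.permutation(len(offsets))  # fresh permutation each pass
    with open(tfrecord_path, "rb") as f:
        for i in order:
            f.seek(offsets[i])
            # Only one serialized record is ever held in memory at a time.
            yield f.read(lengths[i])
```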

Please let me know if you're able to make it work with this modification !
Best

@jfb54
Author

jfb54 commented Apr 23, 2021

I'm testing this now. Looking good so far. Will report back soon.

@jfb54
Author

jfb54 commented May 1, 2021

I have done extensive testing between the "official" dataset reader and yours.

The good news: The training curves are almost identical and the shuffle buffer issue has been solved. Thanks!

The bad news: Accuracies on ilsvrc_2012 and mscoco are lower by a few percent on both test and validation. All other datasets give accuracies that are very consistent with the official reader.

Datapoint: I took the model trained with your reader and tested it using the official reader. The ilsvrc_2012 accuracy jumped up to what I would expect.

I suspect that ilsvrc_2012 tasks are somehow being sampled differently from the official code (and maybe only in the test or validation splits?). I have noticed that the performance on ilsvrc_2012 and mscoco is correlated, as they have similar content. This makes me think the differences are due entirely to how ilsvrc_2012 is handled by your code (maybe in the hierarchical sampling?).

@mboudiaf
Owner

mboudiaf commented May 6, 2021

Hi @jfb54,

Thanks a lot for the update! That is strange; the sampling part is really the one I took care to "copy/paste", as I did not want to interfere with that code. But I can double-check it again.

On my end, I actually observed the exact opposite: when training with my loader and testing with the official one, the performance decreased. I traced this to the difference between the TensorFlow resizing function (inside decoder.py) and the PyTorch resize that I use: they have different default behaviors, first in how the preservation of aspect ratio is handled, and second in the anti-aliasing.

Anti-aliasing is activated by default in PyTorch and deactivated in TensorFlow, which causes a significant feature shift. Below is an example (left: images from the original loader; right: images from my implementation).
[image: side-by-side comparison of resized samples from the original loader (left) and this implementation (right)]
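To make the difference concrete, here is a hedged sketch of the two resize behaviors being compared (the 84×84 target size and dummy input are illustrative; the exact decoder code in either repo may differ):

```python
import numpy as np
import tensorflow as tf
from PIL import Image

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)  # dummy image

# TensorFlow: anti-aliasing is OFF by default in tf.image.resize.
tf_resized = tf.image.resize(img, [84, 84], method="bilinear", antialias=False)

# PIL (what the PyTorch loader goes through): bilinear resampling of a PIL
# image applies anti-aliasing, hence the feature shift between the two.
pil_resized = Image.fromarray(img).resize((84, 84), resample=Image.BILINEAR)

# Setting antialias=True on the TF side brings the two much closer together.
tf_resized_aa = tf.image.resize(img, [84, 84], method="bilinear", antialias=True)
```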

I have benchmarked the SimpleShot method with a ResNet-18 using my loader. To give you an idea, when testing on ILSVRC_2012 I get the following results: original: 52.7, original + anti-aliasing: 59.7, mine: 60.0.

Does the method you're working on require episodic training?
