PyTorch DataLoader with DALI and CSV

This repo is a demo of how to use DALI (v0.16.0) to read images and labels from a CSV config file. The ./images folder provides five images as a small dataset.

Allow me to complain first.

./doc_demo.py comes from the DALI documentation.

You can run this demo like this, and it will put the results in ./res/:

python doc_demo.py

To be honest, DALIGenericIterator does not implement the features described in the documentation, especially these two parameters: fill_last_batch and last_batch_padded.

With the dataset [1, 2, 3, 4, 5, 6, 7] and a batch size of 2:

| fill_last_batch | last_batch_padded | last batch | next iteration | realized |
| --------------- | ----------------- | ---------- | -------------- | -------- |
| False           | True              | [7]        | [1, 2]         | ×        |
| False           | False             | [7]        | [2, 3]         |          |
| True            | True              | [7, 7]     | [1, 2]         | ×        |
| True            | False             | [7, 1]     | [2, 3]         |          |

I also looked at the source code on GitHub, and these two parameters do not achieve the claimed behavior.
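
For reference, here is a minimal sketch of how such an experiment can be set up with the PyTorch plugin. It assumes pipe is an already-defined DALI pipeline that yields the seven elements above through a single output named "data"; the pipeline and the names are illustrative, not code from this repo.

```python
from nvidia.dali.plugin.pytorch import DALIGenericIterator

# `pipe` is assumed to be a DALI pipeline over [1, 2, 3, 4, 5, 6, 7]
# with one output called "data" (e.g. built with ops.ExternalSource).
loader = DALIGenericIterator(
    pipelines=[pipe],
    output_map=["data"],
    size=7,                   # total number of samples in the dataset
    fill_last_batch=False,    # the two flags under test
    last_batch_padded=True,
)

for epoch in range(2):
    for batch in loader:
        print(batch[0]["data"])   # watch the last batch and the start of the next epoch
    loader.reset()                # must be called before iterating again
```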

ExternalInputIterator also confuses me.

def __next__(self):
        ...
        if self.i >= self.n:
            raise StopIteration

        for _ in range(self.batch_size):
            ...
            self.i = (self.i + 1) % self.n
        ...

It never raises StopIteration, because self.i = (self.i + 1) % self.n keeps self.i below self.n forever. This also makes it impossible to cooperate properly with DALIGenericIterator above: the next epoch never starts at the beginning of the dataset. That does not seem to be a problem for the training set, but it feels wrong for the test set, because you do not know where it started, even though it may not affect the final result.

Maybe it has to make some compromises for better compatibility with Python 2.x, but I hope DALI provides a better design for this in the future.
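
To make the wrap-around concrete, here is a pure-Python toy (no DALI involved) that mimics the snippet above with the dataset [1..7] and batch size 2; the class name is mine, not from the repo or from DALI.

```python
# Mimics the doc-style index update self.i = (self.i + 1) % self.n.
class WrapAroundIterator:
    def __init__(self, data, batch_size):
        self.data, self.batch_size = data, batch_size
        self.n, self.i = len(data), 0

    def __next__(self):
        batch = []
        for _ in range(self.batch_size):
            batch.append(self.data[self.i])
            self.i = (self.i + 1) % self.n   # never stops, silently wraps around
        return batch

it = WrapAroundIterator([1, 2, 3, 4, 5, 6, 7], 2)
print([next(it) for _ in range(4)])  # [[1, 2], [3, 4], [5, 6], [7, 1]]
print(next(it))                      # [2, 3] -- the next "epoch" does not restart at 1
```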

Next is how to use DALI to read images and labels from a CSV config file.

I use a different philosophy from PyTorch. I wrote about it in my blog.

You can run dali_csv.py like this, and it will also put the results in ./res/:

python dali_csv.py

or provide some parameters:

CUDA_VISIBLE_DEVICES=3 python dali_csv.py -batch_size 2 -epochs 2

It has the following advantages:

  • Using a CSV file, you can easily separate the training set from the test set
  • Supports shuffling
  • Can return multiple labels
  • Reads the complete dataset in every epoch
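
To illustrate the approach, here is a minimal sketch of a CSV-driven input iterator and an ExternalSource pipeline written against DALI v0.16.0's class-based API. The CSV format (image_path,label1,label2,...), class names, and parameters are assumptions for the example; the actual dali_csv.py may differ in details.

```python
import csv
from random import shuffle

import numpy as np
import nvidia.dali.ops as ops
import nvidia.dali.types as types
from nvidia.dali.pipeline import Pipeline


class CSVInputIterator:
    """Reads rows of the form 'image_path,label1,label2,...' and yields one
    batch of (encoded image bytes, label vectors) per call."""

    def __init__(self, csv_file, batch_size, shuffle_rows=True):
        with open(csv_file) as f:
            self.rows = [row for row in csv.reader(f) if row]
        self.batch_size = batch_size
        self.shuffle_rows = shuffle_rows
        self.n = len(self.rows)
        self.i = 0
        if shuffle_rows:
            shuffle(self.rows)

    def __iter__(self):
        return self

    def __next__(self):
        images, labels = [], []
        for _ in range(self.batch_size):
            # Repeat the final row to pad the last batch, so every epoch spans
            # a whole number of batches and the next epoch starts at row 0.
            # The padding is dropped later by DALIGenericIterator (see below).
            path, *label = self.rows[min(self.i, self.n - 1)]
            with open(path, "rb") as f:
                images.append(np.frombuffer(f.read(), dtype=np.uint8))
            labels.append(np.array(label, dtype=np.float32))  # multi-label support
            self.i += 1
        if self.i >= self.n:
            self.i = 0                      # wrap exactly at the epoch boundary
            if self.shuffle_rows:
                shuffle(self.rows)          # reshuffle for the next epoch
        return images, labels


class CSVPipeline(Pipeline):
    """Decodes the JPEG bytes fed by CSVInputIterator and passes the labels through."""

    def __init__(self, source, batch_size, num_threads, device_id):
        super(CSVPipeline, self).__init__(batch_size, num_threads, device_id)
        self.source = iter(source)
        self.input_images = ops.ExternalSource()
        self.input_labels = ops.ExternalSource()
        self.decode = ops.ImageDecoder(device="mixed", output_type=types.RGB)

    def define_graph(self):
        self.jpegs = self.input_images()
        self.labels = self.input_labels()
        images = self.decode(self.jpegs)
        return images, self.labels

    def iter_setup(self):
        images, labels = next(self.source)   # pull one batch from the CSV iterator
        self.feed_input(self.jpegs, images)
        self.feed_input(self.labels, labels)
```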

I highly recommend DALIGenericIterator(..., last_batch_padded=True/False, fill_last_batch=False). It will always read the complete dataset in each epoch. fill_last_batch=True will make the last batch contain a lot of duplicate data or cause other mistakes.
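
Wiring the sketch above into this recommended configuration could look roughly like this (again an illustration under the same assumptions, not the exact code of dali_csv.py):

```python
from nvidia.dali.plugin.pytorch import DALIGenericIterator

source = CSVInputIterator("train.csv", batch_size=2)          # hypothetical CSV file
pipe = CSVPipeline(source, batch_size=2, num_threads=4, device_id=0)

loader = DALIGenericIterator(
    pipelines=[pipe],
    output_map=["images", "labels"],
    size=source.n,              # real dataset size, so each epoch covers it exactly once
    fill_last_batch=False,      # do not spill padded samples into the next epoch
    last_batch_padded=True,     # the CSV iterator already pads the last batch
)

for batch in loader:
    images, labels = batch[0]["images"], batch[0]["labels"]
    # ... training / evaluation step ...
loader.reset()                  # call before starting the next epoch
```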

Because I made a few changes to the original structure, it most likely does not support Python 2.x. You can also easily merge the two files and use the original structure.

With an Intel(R) Xeon(R) E5-2650 v4 CPU and one TITAN Xp GPU, I compared the speed of these three setups using the KonIQ-10K dataset, which has 10,073 images.

|                     | 4 threads               | 8 threads               | 16 threads              |
| ------------------- | ----------------------- | ----------------------- | ----------------------- |
| PyTorch DataLoader  | 165.55 s (62.66 imgs/s) | 96.07 s (107.97 imgs/s) | 53.75 s (192.99 imgs/s) |
| DALI ops.FileReader | 45.92 s (225.89 imgs/s) | 24.76 s (418.98 imgs/s) | 15.39 s (673.96 imgs/s) |
| DALI CSV loader     | 44.71 s (225.30 imgs/s) | 24.77 s (406.62 imgs/s) | 14.82 s (679.72 imgs/s) |

Although the server I used is always busy and all the data is stored on disk, DALI still shows very promising speed.

This repo was inspired by tanglang96.
