Data is not shuffled prior to validation split #12

Sanqui · 2020-10-24T18:41:07Z

Hello,

I ran into an issue when trying to use ImageDataAugmentor with a single directory, asking it to split into a training and validation set.

I have figured out two ways to accomplish this, both with faults:

datagen = ImageDataAugmentor(
        augment = AUGMENTATIONS,
        validation_split=0.2)

train_generator = datagen.flow_from_directory(
        'data',
        subset="training",
        target_size=image_size,
        class_mode='binary',
        seed=123)
        
validation_generator = datagen.flow_from_directory(
        'data',
        subset="validation",
        target_size=image_size,
        class_mode='binary',
        seed=123)

This approach runs into one big issue, namely, the validation dataset now has augmentations applied, which goes against best practice. My second approach fares better in this department:

train_datagen = ImageDataAugmentor(
        augment = AUGMENTATIONS,
        validation_split=0.2)
        
test_datagen = ImageDataAugmentor(
        validation_split=0.2)

train_generator = train_datagen.flow_from_directory(
        'data',
        subset="training",
        target_size=image_size,
        class_mode='binary',
        seed=123)
        
validation_generator = test_datagen.flow_from_directory(
        'data',
        subset="validation",
        target_size=image_size,
        batch_size=1,
        class_mode='binary',
        seed=123)

However, I have discovered a second, larger issue: although the flow_from_directory method can shuffle the data it yields, the list of filenames is not shuffled before the split. As a result, the validation dataset receives the first 20% of the files in alphabetical order, which can lead to huge biases. This can be verified by printing validation_generator.filenames.

Please advise me on this issue. I think shuffling the dataset prior to applying the validation split would be the solution here.
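
To make the effect concrete, here is a minimal sketch in plain Python. It assumes, as described above, that the validation subset is simply the leading fraction of an alphabetically sorted file list. The file names and the `split_first_fraction` / `split_shuffled` helpers are made up for illustration and are not part of ImageDataAugmentor or Keras.

```python
import random


def split_first_fraction(filenames, validation_split):
    """Take the leading fraction of the sorted list as validation (no shuffle)."""
    ordered = sorted(filenames)
    n_val = int(len(ordered) * validation_split)
    return ordered[n_val:], ordered[:n_val]


def split_shuffled(filenames, validation_split, seed):
    """Shuffle with a fixed seed first, then take the leading fraction."""
    shuffled = sorted(filenames)  # fixed starting order keeps the split reproducible
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_split)
    return shuffled[n_val:], shuffled[:n_val]


# Hypothetical file names: two recording sessions, "a" sorts before "b".
files = [f"session_a_{i:03d}.png" for i in range(20)] + \
        [f"session_b_{i:03d}.png" for i in range(80)]

_, val_plain = split_first_fraction(files, 0.2)
train_shuffled, val_shuffled = split_shuffled(files, 0.2, seed=123)

# Without shuffling, every validation file comes from session "a".
print(sum(name.startswith("session_a") for name in val_plain), "of", len(val_plain))
# With shuffling, the validation set is drawn from both sessions.
print(sum(name.startswith("session_a") for name in val_shuffled), "of", len(val_shuffled))
```

Using the same seed for both the training and the validation generator matters here: the two subsets only stay disjoint if both sides see the same shuffled order.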


mjkvaak · 2020-10-24T20:02:15Z

Thank you for identifying the bug. I think the same issue would persist with the tf.keras ImageDataGenerator, because there, too, you select the data augmentations when initializing the generator and only afterwards link the data to it, e.g. with flow_from_directory. You may therefore want to report this bug there as well.

I will think about how the problem could best be fixed. In the meantime, you can avoid the problem by resolving the filenames, shuffling them in a dataframe and then using flow_from_dataframe instead. Below is a recipe that should be quite well tailored to your use case:

import pandas as pd
from pathlib import Path
...
image_filenames = list(Path('data').rglob('*.png')) #or *.jpg etc. depending on your dataset file format
data_df = pd.DataFrame({'filename':image_filenames})
validation_df = data_df.sample(frac=0.2) #to enforce the fixed split, you can fix the seed in `random_state` parameter
train_df = data_df[~data_df.filename.isin(validation_df.filename)]

train_datagen = ImageDataAugmentor(
        augment = AUGMENTATIONS)       
val_datagen = ImageDataAugmentor()

train_generator = train_datagen.flow_from_dataframe(
        train_df,
        target_size=image_size,
        class_mode='binary',
        seed=123)
        
validation_generator = val_datagen.flow_from_dataframe(
        validation_df,
        target_size=image_size,
        batch_size=1,
        class_mode='binary',
        seed=123)
...

Sanqui · 2020-10-25T10:57:17Z

Thanks for the advice! I hacked together a solution in a fork yesterday (Sanqui@e5d0dc0), but I think your way is more straightforward. I had to turn the paths into strings and also gather the classes, but otherwise this worked:

filepaths = list(Path(DATASET_DIRECTORY).rglob('*.png'))
filenames = [str(path) for path in filepaths]
classes = [str(path.parts[1]) for path in filepaths]

data_df = pd.DataFrame({'filename': filenames, 'class': classes})
validation_df = data_df.sample(frac=VALIDATION_SPLIT,
                               random_state=SEED)
train_df = data_df[~data_df.filename.isin(validation_df.filename)]
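
One thing to keep in mind with this recipe: `sample(frac=...)` draws from the whole dataframe, so the class proportions in the validation set can drift from those of the full dataset. If that matters, the split can be done per class. Below is a rough sketch using only the standard library; `stratified_split` and the example rows are hypothetical and not part of ImageDataAugmentor or pandas.

```python
import random
from collections import defaultdict


def stratified_split(rows, validation_split, seed):
    """Split (filename, class) pairs so each class contributes the same fraction.

    Hypothetical helper for illustration only.
    """
    by_class = defaultdict(list)
    for filename, label in rows:
        by_class[label].append(filename)

    rng = random.Random(seed)
    train, validation = [], []
    for label in sorted(by_class):
        names = sorted(by_class[label])  # fixed starting order for reproducibility
        rng.shuffle(names)
        n_val = int(len(names) * validation_split)
        validation += [(name, label) for name in names[:n_val]]
        train += [(name, label) for name in names[n_val:]]
    return train, validation


# Made-up example data: 50 "cat" images and 10 "dog" images.
rows = [(f"data/cat/{i}.png", "cat") for i in range(50)] + \
       [(f"data/dog/{i}.png", "dog") for i in range(10)]

train_rows, validation_rows = stratified_split(rows, validation_split=0.2, seed=123)
print(len(train_rows), len(validation_rows))
```

The resulting lists can be turned into dataframes, e.g. with `pd.DataFrame(train_rows, columns=['filename', 'class'])`, and used in the same way as train_df and validation_df.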

Cheers.

NullXeronier · 2021-06-29T14:29:55Z

The bug still exists with flow_from_dataframe.

The tf.keras image validation generator is not working: the validation generator cannot iterate to the next batch of data (it raises Attribute Error : NoneType has shape).

mjkvaak · 2021-06-29T15:00:34Z

@ikarus-999 : it seems your comment is not related to the current issue. As a friendly suggestion for the future, please open a new issue whenever your problem does not match an existing one.

I am fairly certain that one of the paths in your dataframe is incorrect and cv2.imread(path) will return None if path does not correspond to an actual file. Please let me know if this is not the case.
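
A quick way to check this before building the generator is to look for dataframe paths that do not point to a file. Here is a minimal sketch, assuming the paths are available as a list of strings (for example from `data_df['filename'].tolist()`); `find_missing_files` is a hypothetical helper name, and the temporary files exist only to make the demo self-contained.

```python
import os
import tempfile


def find_missing_files(paths):
    """Return the paths that do not correspond to an existing file."""
    return [path for path in paths if not os.path.isfile(path)]


# Self-contained demo with one existing and one non-existent file.
with tempfile.TemporaryDirectory() as tmp_dir:
    existing = os.path.join(tmp_dir, "image_0.png")
    with open(existing, "wb") as handle:
        handle.write(b"")  # empty placeholder, not a real image
    missing = os.path.join(tmp_dir, "image_1.png")

    result = find_missing_files([existing, missing])
    print(result)  # contains only the image_1.png path
```

Note that this only catches missing files. A file that exists but cannot be decoded as an image would also make cv2.imread return None, so an empty result does not rule that case out.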

mjkvaak added the "bug" label (Something isn't working) on Oct 25, 2020
