Data is not shuffled prior to validation split #12

Open
Sanqui opened this issue Oct 24, 2020 · 4 comments
Labels
bug Something isn't working

Comments

Sanqui commented Oct 24, 2020

Hello,

I ran into an issue when trying to use ImageDataAugmentor with a single directory, asking it to split into a training and validation set.

I have figured out two ways to accomplish this, both with faults:

datagen = ImageDataAugmentor(
        augment = AUGMENTATIONS,
        validation_split=0.2)

train_generator = datagen.flow_from_directory(
        'data',
        subset="training",
        target_size=image_size,
        class_mode='binary',
        seed=123)
        
validation_generator = datagen.flow_from_directory(
        'data',
        subset="validation",
        target_size=image_size,
        class_mode='binary',
        seed=123)

This approach has one big issue: the validation dataset now has augmentations applied, which goes against best practice. My second approach fares better in this respect:

train_datagen = ImageDataAugmentor(
        augment = AUGMENTATIONS,
        validation_split=0.2)
        
test_datagen = ImageDataAugmentor(
        validation_split=0.2)

train_generator = train_datagen.flow_from_directory(
        'data',
        subset="training",
        target_size=image_size,
        class_mode='binary',
        seed=123)
        
validation_generator = test_datagen.flow_from_directory(
        'data',
        subset="validation",
        target_size=image_size,
        batch_size=1,
        class_mode='binary',
        seed=123)

However, I have discovered a second, larger issue. Although the flow_from_directory method can shuffle data, the list of filenames is not shuffled before the split, so the validation dataset receives the first 20% of the files in alphabetical order. This can lead to huge biases. It can be verified by printing validation_generator.filenames.

Please advise me on this issue. I think shuffling the dataset prior to applying the validation split would be the solution here.
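
The effect described above can be reproduced without any image library. The following is only a minimal sketch with made-up filenames; it illustrates slicing a sorted list versus a shuffled one and is not a copy of how ImageDataAugmentor or Keras implement the split internally.

```python
import random

# Hypothetical filenames, sorted the way a directory listing typically is.
filenames = sorted(f"img_{i:03d}.png" for i in range(10))
validation_split = 0.2
n_val = int(len(filenames) * validation_split)

# Unshuffled split: validation always gets the alphabetically first files.
val_unshuffled = filenames[:n_val]

# Shuffled split: shuffle a copy with a fixed seed, then slice.
rng = random.Random(123)
shuffled = filenames[:]
rng.shuffle(shuffled)
val_shuffled = shuffled[:n_val]
train_shuffled = shuffled[n_val:]

print(val_unshuffled)  # ['img_000.png', 'img_001.png']
print(sorted(val_shuffled))
```

With the fixed seed, the shuffled split is reproducible between runs while no longer depending on how the files happen to be named.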

Owner

mjkvaak commented Oct 24, 2020

Thank you for identifying the bug. I think the same issue would persist with tf.keras ImageDataGenerator, because there, too, you select the data augmentations when initializing the generator and only afterwards link the data to it, e.g. with flow_from_directory. That said, you may want to report this bug there as well.

I will think about how best to fix the problem. In the meantime, you can avoid it by resolving the filenames, shuffling them in a dataframe, and then using flow_from_dataframe instead. Below is a recipe that should be quite well tailored to your use case:

import pandas as pd
from pathlib import Path
...
image_filenames = list(Path('data').rglob('*.png')) #or *.jpg etc. depending on your dataset file format
data_df = pd.DataFrame({'filename':image_filenames})
validation_df = data_df.sample(frac=0.2) #to enforce the fixed split, you can fix the seed in `random_state` parameter
train_df = data_df[~data_df.filename.isin(validation_df.filename)]

train_datagen = ImageDataAugmentor(
        augment = AUGMENTATIONS)       
val_datagen = ImageDataAugmentor()

train_generator = train_datagen.flow_from_dataframe(
        train_df,
        target_size=image_size,
        class_mode='binary',
        seed=123)
        
validation_generator = val_datagen.flow_from_dataframe(
        validation_df,
        target_size=image_size,
        batch_size=1,
        class_mode='binary',
        seed=123)
...

Author

Sanqui commented Oct 25, 2020

Thanks for the advice! I botched a solution in a fork yesterday: Sanqui@e5d0dc0. However, I think your way is more straightforward. I had to turn the paths into strings and also gather the classes, but otherwise this worked:

filepaths = list(Path(DATASET_DIRECTORY).rglob('*.png'))
filenames = [str(path) for path in filepaths]
classes = [str(path.parts[1]) for path in filepaths]

data_df = pd.DataFrame({'filename': filenames, 'class': classes})
validation_df = data_df.sample(frac=VALIDATION_SPLIT,
                               random_state=SEED)
train_df = data_df[~data_df.filename.isin(validation_df.filename)]

Cheers.
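
A possible variant of the split above samples within each class, so that every class contributes the same fraction to the validation set. This is only a sketch under assumptions: it relies on `groupby(...).sample`, which needs pandas 1.1 or later, and the helper name `stratified_split` and the toy file names are illustrative and not part of ImageDataAugmentor.

```python
import pandas as pd


def stratified_split(data_df, frac, seed):
    """Sample `frac` of the rows of each class for validation; the rest is training."""
    validation_df = data_df.groupby("class").sample(frac=frac, random_state=seed)
    # Sampled rows keep their original index, so dropping by index yields the remainder.
    train_df = data_df.drop(validation_df.index)
    return train_df, validation_df


# Toy dataframe with two hypothetical classes of ten files each.
labels = ("cat", "dog")
data_df = pd.DataFrame({
    "filename": [f"data/{c}/{i}.png" for c in labels for i in range(10)],
    "class": [c for c in labels for _ in range(10)],
})

train_df, validation_df = stratified_split(data_df, frac=0.2, seed=123)
print(validation_df["class"].value_counts().to_dict())
```

Note that `path.parts[1]` in the snippet above picks the second path component, which is the class folder only when the dataset directory is a single relative component such as `data`. If the images sit directly inside their class folders, `path.parent.name` may be a more robust way to derive the class.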

mjkvaak added the bug (Something isn't working) label Oct 25, 2020

NullXeronier commented Jun 29, 2021

The bug still exists with flow_from_dataframe.

The tf.keras image validation generator is not working: it cannot iterate to the next batch of data and raises an AttributeError (NoneType has shape).

Owner

mjkvaak commented Jun 29, 2021

@ikarus-999: it seems your comment is not related to the current issue. As a friendly suggestion for the future, please open a new issue whenever your problem does not match an existing one.

I am fairly certain that one of the paths in your dataframe is incorrect: cv2.imread(path) returns None if path does not correspond to an actual file. Please let me know if this is not the case.
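
One way to check this before building a generator is to look for dataframe entries that do not point to a file on disk. The sketch below is an assumption about how such a check could look; `find_missing_files` is a hypothetical helper, not part of ImageDataAugmentor or OpenCV, and the temporary directory only stands in for a real dataset.

```python
import tempfile
from pathlib import Path


def find_missing_files(filenames):
    """Return the entries of `filenames` that do not point to an existing file."""
    return [name for name in filenames if not Path(name).is_file()]


# Demonstration with one existing file and one path that was never created.
with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / "ok.png"
    existing.write_bytes(b"")
    missing = Path(tmp) / "typo.png"
    result = find_missing_files([str(existing), str(missing)])

print(result)
```

For a dataframe like the ones above, the call would be `find_missing_files(data_df["filename"])`; an empty result rules out missing files, although a file that exists but cannot be decoded as an image would still need a separate check.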

3 participants