Utility for syncing training, validation, and evaluation data. #188

Open
markmester opened this issue Oct 16, 2019 · 1 comment


@markmester

On most of the datasets I'm putting together, masks and tiles do not always match one-to-one. At the very least, it should be made clear that the trainer needs a directory where all files are in sync. Even better would be a simple pre-processing script for syncing the masks/tiles, or an option in rs_trainer to ignore or remove un-synced masks/tiles.

Currently I just use a simple Python script to sync the two directories; it deletes every file that has no counterpart in the other directory:

import argparse
import os


def dir_dict(root: str) -> dict:
    """Map each file's last three path components to its full path."""
    dd = {}

    for subdir, _dirs, files in os.walk(root):
        for file in files:
            path = os.path.join(subdir, file)
            # Key on the trailing three path components so that files
            # stored under different roots can be compared.
            key = "/".join(path.split(os.sep)[-3:])
            dd[key] = path

    return dd


def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("dir1", type=str)
    parser.add_argument("dir2", type=str)
    args = parser.parse_args()

    dir1_dict = dir_dict(args.dir1)
    dir2_dict = dir_dict(args.dir2)

    # A file is un-synced if its key is missing from the other directory.
    removed = [v for k, v in dir1_dict.items() if k not in dir2_dict]
    removed += [v for k, v in dir2_dict.items() if k not in dir1_dict]

    # Destructive step: un-synced files are deleted from disk.
    for file in removed:
        os.remove(file)

    return len(removed)


if __name__ == "__main__":
    print(f"removed {main()} un-synced files")
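
Since the script above deletes files, a non-destructive variant may be useful for inspecting a dataset first. The following is only a sketch: the helper names `relative_files` and `find_unpaired` are made up for illustration, and it assumes files are paired by their path relative to each directory root (rather than by the last three path components).

```python
import os


def relative_files(root: str) -> set:
    """Return the set of file paths under root, relative to root."""
    found = set()
    for subdir, _dirs, files in os.walk(root):
        for name in files:
            rel = os.path.relpath(os.path.join(subdir, name), root)
            # Normalise separators so keys compare equal across platforms.
            found.add(rel.replace(os.sep, "/"))
    return found


def find_unpaired(dir1: str, dir2: str):
    """Return (only_in_dir1, only_in_dir2) as sorted lists; nothing is deleted."""
    files1 = relative_files(dir1)
    files2 = relative_files(dir2)
    return sorted(files1 - files2), sorted(files2 - files1)
```

Printing the two returned lists gives a dry run of what a sync would remove.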
@daniel-j-h
Collaborator

See #93 and #93 (comment)

We should keep the user responsible for preparing the dataset and making sure it's in sync. What we could do in the context of #91 is to go through our assertions and make them easier to understand (and show ways to solve the problem) for our users.

rs train's pre-conditions are a dataset directory with pairs of images and labels.
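
To illustrate what an easier-to-understand pre-condition check might look like, here is a rough sketch. It is not the project's actual assertion code: `pair_keys` and `check_in_sync` are hypothetical names, and the pairing rule (same relative path, file extension ignored) is an assumption, since the thread does not spell out how rs train matches images to labels.

```python
from pathlib import Path


def pair_keys(root: str) -> set:
    """Relative file paths under root with the last extension dropped."""
    base = Path(root)
    return {
        p.relative_to(base).with_suffix("").as_posix()
        for p in base.rglob("*")
        if p.is_file()
    }


def check_in_sync(images_dir: str, labels_dir: str) -> None:
    """Raise ValueError with a hint if images and labels are not paired."""
    images = pair_keys(images_dir)
    labels = pair_keys(labels_dir)
    missing_labels = sorted(images - labels)
    missing_images = sorted(labels - images)
    if missing_labels or missing_images:
        raise ValueError(
            f"dataset not in sync: {len(missing_labels)} image(s) without a label "
            f"(e.g. {missing_labels[:3]}), {len(missing_images)} label(s) without "
            f"an image (e.g. {missing_images[:3]}); "
            "remove or add the unpaired files before training"
        )
```

The error message names a few offending files and suggests a fix, which is the kind of guidance the comment above has in mind.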

I agree that we could make this clearer in the readme, though.

Would you be so kind as to open a pull request explaining this? Thanks!
