harveyslash/ms-celeb-extractor

ms-celeb-extractor

Extraction tool to parse MS Celeb dataset

The MS Celeb Dataset is a database of faces with 6,464,018 images.

Due to some error, the original dataset is gone. However, a torrent of it is available. It contains a TSV file with the images encoded as base64 strings.

This extraction tool reads through the TSV and places images of the same person in their own folder. As it reads, it deletes the entries it has already processed, so saving the extracted images requires no extra disk space.
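As a rough illustration of the decoding step, here is a minimal sketch that turns one TSV row into a JPEG file inside a per-person folder. The three-column layout (person ID, image name, base64 data) and the function name `save_row` are assumptions made for this example; the real MS Celeb TSV has a different set of columns, and this is not the code used by extractor.py.

```python
import base64
from pathlib import Path


def save_row(line: str, root: Path) -> Path:
    """Decode one TSV row and write the image under root/<person_id>/.

    Assumed (hypothetical) column layout: person_id, image_name, base64_jpeg.
    Adjust the column indices to match the actual TSV.
    """
    person_id, image_name, encoded = line.rstrip("\n").split("\t")[:3]
    person_dir = root / person_id
    # Create the per-person folder on first use.
    person_dir.mkdir(parents=True, exist_ok=True)
    out_path = person_dir / f"{image_name}.jpg"
    # The base64 payload is the raw JPEG file, so no re-encoding is needed.
    out_path.write_bytes(base64.b64decode(encoded))
    return out_path
```

Writing the decoded bytes directly, rather than re-encoding through an image library, keeps the original JPEG data unchanged.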

The reasoning for extracting into one folder per person is:

  1. Most libraries, including PyTorch and Keras/TensorFlow, have built-in helper functions to load such a directory structure
  2. Many modern file systems index their directory entries (for example with hashes or trees), so opening a file by a known path is fast, close to constant time
  3. Storing the images as the original JPEG files reduces the size from 95 GB to 57 GB
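To make the first point concrete, the sketch below builds the kind of index that folder-based loaders typically construct from a `root/<person>/<image>.jpg` tree: a list of (path, label index) pairs plus a name-to-index mapping. It uses only the standard library; the function name `index_image_folder` is hypothetical and not part of this tool or of any framework.

```python
from pathlib import Path


def index_image_folder(root):
    """Return (samples, class_to_idx) for a root/<person>/<image>.jpg tree."""
    root = Path(root)
    # Each sub-directory name is treated as one class (one person).
    classes = sorted(p.name for p in root.iterdir() if p.is_dir())
    class_to_idx = {name: i for i, name in enumerate(classes)}
    samples = [
        (path, class_to_idx[name])
        for name in classes
        for path in sorted((root / name).glob("*.jpg"))
    ]
    return samples, class_to_idx
```

In practice a framework helper such as torchvision's `ImageFolder` would likely be used instead; the sketch only shows why the layout is convenient.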

Installing

pip install -r requirements.txt

Usage

Usage: extractor.py [OPTIONS] COMMAND [ARGS]...

  Utility to help extract MS Celeb data into manageable fils.

Options:
  --help  Show this message and exit.

Commands:
  combine  Combine clean_list_128Vec_WT051_P010.txt and...
  process  Read lines from the MS Celeb TSV file and save into a directory...

First, use the combine command to merge the two text files that accompany the dataset. The reason for merging them is explained in the section "How to use C-MS-Celeb" at https://github.com/EB-Dodo/C-MS-Celeb. Note that the two text files are not included in the torrent; they are in https://github.com/EB-Dodo/C-MS-Celeb/blob/master/clean_list.7z

Usage: extractor.py combine [OPTIONS]

  Combine clean_list_128Vec_WT051_P010.txt and relabel_list_128Vec_T058.txt
  together.

  The output of this file is used by the process command.

Options:
  --clean_list_128_path FILENAME  Path of clean_list_128Vec_WT051_P010.txt
                                  [required]

  --relabel_list_128_path FILENAME
                                  Path of relabel_list_128Vec_T058  [required]
  --output_path FILE              Path of output file  [required]
  --help                          Show this message and exit.
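For intuition only, merging two label lists can be sketched as below. This assumes each line has the form `<label> <image_id>` separated by whitespace, and that a relabeled entry should override a clean-list entry for the same image. Both are assumptions for the example; neither the file formats nor the output format of the combine command are described here, so check the C-MS-Celeb documentation before relying on this. `combine_lists` is a hypothetical name.

```python
def combine_lists(clean_lines, relabel_lines):
    """Merge two iterables of '<label> <image_id>' lines into {image_id: label}.

    Assumption: entries from relabel_lines win when both lists mention
    the same image_id.
    """
    labels = {}
    # Process the clean list first so the relabel list can override it.
    for source in (clean_lines, relabel_lines):
        for line in source:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            label, image_id = parts
            labels[image_id] = label
    return labels
```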

Then pass the generated combined text file to the process command to start extracting lines from the TSV and saving them as JPEG files.

Usage: extractor.py process [OPTIONS]

  Read lines from the MS Celeb TSV file and save into a directory structure.
  The files will be put in this format:

      root/person_x/xxx.jpg     root/person_x/xxy.jpg
      root/person_x/xxz.jpg

      root/person_y/123.jpg     root/person_y/817.jpg
      root/person_y/some.jpg

  !NOTE!: As this command reads the TSV, it will delete the lines already
  read.

Options:
  --tsv_location FILENAME    Location of the entire MS Celeb tsv file
                             [required]

  --output_dir PATH          Output directory for images  [required]
  --combined_file_path FILE  Location of the file generated by combine command
                             [required]

  --chunk_size INTEGER       Number of bytes to read from the tsv at once
  --num_threads INTEGER
  --help                     Show this message and exit.
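One way to consume a large file without needing extra disk space is to read a chunk from the end and then truncate the file at the point where that chunk began. The sketch below illustrates this technique. It is an assumption about how such a tool could work, not a description of how extractor.py is implemented. The name `pop_tail_lines` is hypothetical, and its `chunk_size` parameter only mirrors the idea behind `--chunk_size`.

```python
import os


def pop_tail_lines(path, chunk_size=1 << 20):
    """Remove and return the complete lines found at the end of `path`.

    Reads about the last `chunk_size` bytes, keeps only whole lines, and
    truncates the file so those lines no longer occupy disk space.
    Returns an empty list once the file is empty.
    """
    with open(path, "r+b") as f:
        size = f.seek(0, os.SEEK_END)
        if size == 0:
            return []
        start = max(0, size - chunk_size)
        f.seek(start)
        chunk = f.read(size - start)
        if start > 0:
            # The chunk may begin mid-line: leave that first (possibly
            # partial) line in the file for the next call.
            newline = chunk.find(b"\n", 0, len(chunk) - 1)
            if newline == -1:
                raise ValueError("chunk_size must exceed the longest line")
            start += newline + 1
            chunk = chunk[newline + 1:]
        # Drop everything that is about to be returned.
        f.truncate(start)
    return chunk.splitlines()
```

Calling this in a loop until it returns an empty list walks the file from the last line to the first. Truncating only from the end avoids rewriting the remaining data, which is why the sketch reads backwards.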

Example:

python ms-celeb-extractor/extractor.py process --tsv_location=head.tsv --output_dir out --combined_file_path combined.txt
89it [00:03, 23.58it/s]

Contributing

Feel free to open issues or pull requests.
