Managing Data

James Bergstra edited this page Mar 14, 2013 · 4 revisions

Some of the data sets interfaced by skdata occupy hundreds of megabytes, or even several gigabytes, on disk. Before you can use one, you must either download it or point skdata at an existing local copy. This page describes how that happens and where those files go.

Downloading data

Data sets that require downloading external data (i.e. most of them) use the mechanism implemented in skdata.data_home. This module exports a function called get_data_home that identifies a root directory, within which data set modules create subdirectories to store downloaded files. The root directory defaults to "~/.skdata" on Unix-like machines, but it can be overridden by setting the "$SKDATA_ROOT" environment variable. (See the docstring in data_home.py for details.)
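As a rough illustration of the lookup described above, the following sketch reimplements the rule in plain Python. It is not skdata's actual code: resolve_data_home and dataset_dir are hypothetical names, and the exact handling of "~" and empty values is an assumption. The docstring in data_home.py remains the authority.

```python
import os


def resolve_data_home(environ=None):
    """Hypothetical stand-in for skdata.data_home.get_data_home.

    Assumed rule: use $SKDATA_ROOT when it is set and non-empty,
    otherwise fall back to ~/.skdata.
    """
    environ = os.environ if environ is None else environ
    root = environ.get("SKDATA_ROOT")
    if not root:
        root = os.path.join("~", ".skdata")
    # Expand "~" and normalise to an absolute path.
    return os.path.abspath(os.path.expanduser(root))


def dataset_dir(name, environ=None):
    """Subdirectory that a data set module named `name` would use (assumed layout)."""
    return os.path.join(resolve_data_home(environ), name)
```

With this sketch, calling dataset_dir("imagenet") with SKDATA_ROOT unset yields the "imagenet" folder under "~/.skdata", and setting SKDATA_ROOT moves every data set folder under the new root at once.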

If you want to split data sets across different filesystems, computers, etc. then you should address that at the filesystem level. skdata's file-locating logic does not currently support such arrangements, so your main mechanism for these layouts is symlinks. If some data set (e.g. imagenet or hollywood2) is too large to fit on your "/home" filesystem, or you want to share a copy with other users via a networked filesystem, then consider either (a) replacing your own "~/.skdata/imagenet" folder with a symlink or else (b) configuring skdata to look for a data root directory at a different location than your "~/.skdata".
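Option (a) can be done by hand with "ln -s", or scripted. The sketch below is one possible way to do it on a Unix-like machine. link_dataset is a hypothetical helper, not part of skdata, and it deliberately refuses to touch anything that already exists at the link location.

```python
import os


def link_dataset(data_home, name, target):
    """Create data_home/name as a symlink pointing at `target`.

    Hypothetical helper: raises FileExistsError instead of replacing an
    existing folder or link, so nothing already downloaded is lost.
    """
    link_path = os.path.join(data_home, name)
    if os.path.lexists(link_path):
        raise FileExistsError(
            "%s already exists; move or remove it first" % link_path)
    # Make sure the data root itself exists before linking into it.
    os.makedirs(data_home, exist_ok=True)
    os.symlink(target, link_path)
    return link_path
```

If a data set folder already exists under the data root, move its contents to the new location first and only then create the link.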

Deleting data

Generally, the way to get rid of all files created by a data set module is to delete the directory with the same name as the module from the "~/.skdata" directory (or wherever you configured it to be with "$SKDATA_ROOT").
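A minimal sketch of that cleanup follows. remove_dataset is a hypothetical helper, not an skdata function. It assumes the folder layout described on this page and treats a symlinked data set specially: only the link is removed, so a shared copy elsewhere is left alone.

```python
import os
import shutil


def remove_dataset(data_home, name):
    """Delete the folder for data set `name` under `data_home`.

    Returns True if something was removed, False if nothing was there.
    """
    path = os.path.join(data_home, name)
    if os.path.islink(path):
        # Remove only the link; the (possibly shared) target is untouched.
        os.unlink(path)
        return True
    if os.path.isdir(path):
        shutil.rmtree(path)
        return True
    return False
```

Deleting the folder this way means the data set will be downloaded again the next time it is used.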

Some data sets may offer scripts in their "main.py" files for deleting temporary files, which frees up space without erasing the files that were downloaded. Check the data set in question if you want to reclaim space but avoid a future re-download.