Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mscoco url link invalid? BucketNotFoundException: 404 gs://images.cocodataset.org bucket does not exist. #108

Open
brando90 opened this issue Jan 6, 2023 · 1 comment

Comments

@brando90
Copy link

brando90 commented Jan 6, 2023

I tried running but got error:

(mds_env_gpu) brando9~/data/mds/mscoco $ gsutil -m rsync gs://images.cocodataset.org/train2017 train2017

BucketNotFoundException: 404 gs://images.cocodataset.org bucket does not exist.

what to do?

full attempt:

# 1. Download the 2017 train images and annotations from http://cocodataset.org/:
#You can use gsutil to download them to mscoco/:
mkdir -p $MDS_DATA_PATH/mscoco/
cd $MDS_DATA_PATH/mscoco/
mkdir -p train2017
# seems to directly download all files, no zip file needed
gsutil -m rsync gs://images.cocodataset.org/train2017 train2017
# todo should have 118287? number of .jpg files (note no unziping needed)
ls $MDS_DATA_PATH/mscoco/train2017 | grep -c .jpg
# download & extract annotations_trainval2017.zip
gsutil -m cp gs://images.cocodataset.org/annotations/annotations_trainval2017.zip
unzip $MDS_DATA_PATH/mscoco/annotations_trainval2017.zip -d $MDS_DATA_PATH/mscoco
# todo says: 6?
ls $MDS_DATA_PATH/mscoco/annotations | grep -c .json

## Download Otherwise, you can download train2017.zip and annotations_trainval2017.zip and extract them into mscoco/. eta ~36m.
#mkdir -p $MDS_DATA_PATH/mscoco
#wget http://images.cocodataset.org/zips/train2017.zip -O $MDS_DATA_PATH/mscoco/train2017.zip
#wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip -O $MDS_DATA_PATH/mscoco/annotations_trainval2017.zip
## both zips should be there, note: downloading zip takes some time
#ls $MDS_DATA_PATH/mscoco/
## Extract them into mscoco/ (interpreting that as extracting both there, also due to how th gsutil command above looks like is doing)
## takes some time, but good progress display
#unzip $MDS_DATA_PATH/mscoco/train2017.zip -d $MDS_DATA_PATH/mscoco
#unzip $MDS_DATA_PATH/mscoco/annotations_trainval2017.zip -d $MDS_DATA_PATH/mscoco
## two folders should be there, annotations and train2017 stuff
#ls $MDS_DATA_PATH/mscoco/
## check jpg imgs are there
#ls $MDS_DATA_PATH/mscoco/train2017
#ls $MDS_DATA_PATH/mscoco/train2017 | grep -c .jpg
## says: 118287 for a 2nd time
#ls $MDS_DATA_PATH/mscoco/annotations
#ls $MDS_DATA_PATH/mscoco/annotations | grep -c .json
## says: 6 for a 2nd time
## move them since it says so in the google NL instructions ref: for moving large num files https://stackoverflow.com/a/75034830/1601580 thanks chatgpt!
#ls $MDS_DATA_PATH/mscoco/train2017 | grep -c .jpg
#find $MDS_DATA_PATH/mscoco/train2017 -type f -print0 | xargs -0 mv -t $MDS_DATA_PATH/mscoco
#ls $MDS_DATA_PATH/mscoco | grep -c .jpg
## says: 118287 for both
#ls $MDS_DATA_PATH/mscoco/annotations/ | grep -c .json
#mv $MDS_DATA_PATH/mscoco/annotations/* $MDS_DATA_PATH/mscoco/
#ls $MDS_DATA_PATH/mscoco/ | grep -c .json
## says: 6 for both

# 2. Launch the conversion script:
python -m meta_dataset.dataset_conversion.convert_datasets_to_records \
  --dataset=mscoco \
  --mscoco_data_root=$MDS_DATA_PATH/mscoco \
  --splits_root=$SPLITS \
  --records_root=$RECORDS

# 3. Expect the conversion to take about 4 hours.

# 4. Find the following outputs in $RECORDS/mscoco/:
#80 tfrecords files named [0-79].tfrecords
ls $RECORDS/mscoco/ | grep -c .tfrecords
#dataset_spec.json (see note 1)
ls $RECORDS/mscoco/dataset_spec.json

related: brando90/pytorch-meta-dataset#20

@brando90 brando90 changed the title mscoco url link invalid? mscoco url link invalid? BucketNotFoundException: 404 gs://images.cocodataset.org bucket does not exist. Jan 6, 2023
@lamblin
Copy link
Collaborator

lamblin commented Jan 20, 2023

I can confirm that the gs://images.cocodataset.org bucket does not seem to be accessible any longer, but we're not aware of an alternative source, and the original instructions at http://go/mscoco#download still mention that address.

I'd suggest you reach out to the COCO maintainers, and if there is an updated way to get that data, please let us know so we can update the instructions and scripts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

2 participants