Could not get the file at http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz. [RequestException] None. #96

Open
tiassap opened this issue Jun 14, 2022 · 8 comments


tiassap commented Jun 14, 2022

I ran the code on Google Colab.

When building the German vocabulary here:

if is_interactive_notebook():
    # global variables used later in the script
    spacy_de, spacy_en = show_example(load_tokenizers)
    vocab_src, vocab_tgt = show_example(load_vocab, args=[spacy_de, spacy_en])

This error showed up:

Could not get the file at http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz. [RequestException] None.

Is this a problem with torchtext?
I found that the error occurs when this line is called:

vocab_src = build_vocab_from_iterator(
    yield_tokens(train + val + test, tokenize_de, index=0),
    min_freq=2,
    specials=["<s>", "</s>", "<blank>", "<unk>"],
)

Thank you in advance.


aambrioso1 commented Jun 15, 2022

I am having the same problem. It seems that the site:

http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz

is no longer available.

The maintainer of this repository:

https://github.com/PetrochukM/PyTorch-NLP/blob/master/torchnlp/datasets/multi30k.py

writes:

"Host www.quest.dcs.shef.ac.uk forgot to update their SSL certificate; therefore, this dataset does not download securely."

Hope this offers some insight into the problem.


tiassap commented Jun 21, 2022

Thank you for the info @aambrioso1

@youbinaa

@tiassap I ran into the same problem you described. Did you find a workaround to access those files?

@aambrioso1

I was able to get the code to work by using another data file. The basic idea is that the training, validation, and test sets are all lists of tuples, and each tuple holds a sentence pair, one sentence per language. This is convenient because it makes it easy to create any language pairing you like. Here is my implementation in Colab, along with lots of notes:

https://colab.research.google.com/drive/131hohvAKRqzHg4K3_68UGL4oi4SGOB45?usp=sharing
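
As a rough illustration of the list-of-tuples format described above, here is a minimal sketch in plain Python. The helper names `load_pairs` and `split_pairs`, the file names, and the 80/10/10 split are assumptions made for illustration; they are not taken from the linked notebook.

```python
import random
from pathlib import Path


def load_pairs(src_path, tgt_path):
    """Read two line-aligned text files into a list of (source, target) tuples."""
    src_lines = Path(src_path).read_text(encoding="utf-8").splitlines()
    tgt_lines = Path(tgt_path).read_text(encoding="utf-8").splitlines()
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target files must have the same number of lines")
    # Keep only pairs where both sides are non-empty after stripping whitespace.
    return [
        (s.strip(), t.strip())
        for s, t in zip(src_lines, tgt_lines)
        if s.strip() and t.strip()
    ]


def split_pairs(pairs, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle a copy of the pairs and split it into train, val, and test lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test
```

With hypothetical files `corpus.de` and `corpus.en`, `train, val, test = split_pairs(load_pairs("corpus.de", "corpus.en"))` yields three plain lists. Because they are lists, an expression such as `train + val + test` concatenates them, so they should be usable wherever the tutorial code only iterates over sentence pairs.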


tiassap commented Jun 24, 2022

Thank you @aambrioso1. It is very helpful.

So we can also use other datasets, as long as the data has the format [(de_1, en_1), ..., (de_n, en_n)]. You are using the German/English dataset from the European Parliament Proceedings Parallel Corpus 1996-2011 (https://www.statmt.org/europarl/), and the training, val, and test datasets are declared as global variables.

Just for information, @youbinaa: it seems Multi30k can also be downloaded from this repo: https://github.com/multi30k/dataset.

The problem is that the source URL used by torchtext.datasets.Multi30k() is not accessible. Let's hope it will be fixed soon.
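
If the sentences are obtained from another source as gzip-compressed, line-aligned text files (this packaging is an assumption, and the file names in the usage note are hypothetical), they can be turned into the same list-of-tuples format with the standard library alone. A minimal sketch:

```python
import gzip


def read_gz_lines(path):
    """Return the lines of a gzip-compressed UTF-8 text file, without trailing newlines."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def load_gz_pairs(de_path, en_path):
    """Pair up line-aligned German and English files as [(de_1, en_1), ..., (de_n, en_n)]."""
    de_lines = read_gz_lines(de_path)
    en_lines = read_gz_lines(en_path)
    if len(de_lines) != len(en_lines):
        raise ValueError("German and English files must have the same number of lines")
    return list(zip(de_lines, en_lines))
```

For example, `train = load_gz_pairs("train.de.gz", "train.en.gz")` would give a list of (German, English) tuples, assuming both files exist locally and are aligned line by line.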


EsmaeilChitgar commented Oct 27, 2023


How can I download it in Colab? I mean, what change do I need to make in the code to download it?

train, val, test = datasets.Multi30k('data', language_pair=("de", "en"))


g-i-o-r-g-i-o commented Nov 27, 2023


from torchtext.datasets import multi30k

# Run this before calling datasets.Multi30k(...).
# Replace the unreachable default URLs with copies of the archives hosted on GitHub.
multi30k.URL["train"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz"
multi30k.URL["valid"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz"
multi30k.URL["test"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/mmt16_task1_test.tar.gz"

# Expected checksums for the three archives. Despite the dictionary name,
# these are 64 hex digits long, which is the length of a SHA-256 digest.
multi30k.MD5["train"] = "20140d013d05dd9a72dfde46478663ba05737ce983f478f960c1123c6671be5e"
multi30k.MD5["valid"] = "a7aa20e9ebd5ba5adce7909498b94410996040857154dab029851af3a866da8c"
multi30k.MD5["test"] = "6d1ca1dba99e2c5dd54cae1226ff11c2551e6ce63527ebb072a1f70f72a5cd36"

https://discuss.pytorch.org/t/build-vocab-from-iterator-does-not-work-in-notebook/153575/16
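
The checksum values above are 64 hexadecimal characters long, which matches a SHA-256 digest rather than a 32-character MD5 digest. Assuming they are SHA-256 digests of the archives, a manually downloaded file could be checked against them with a sketch like the following. The helper names `sha256_of_file` and `matches_expected` are hypothetical and not part of torchtext.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Return True if the file's SHA-256 digest equals the expected hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For instance, `matches_expected("training.tar.gz", "20140d01...")` with the full expected value would indicate whether a locally saved copy is intact, again under the assumption that the listed values are SHA-256 digests.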

@minsuk-sung


Thanks! It works!
