Mismatch between pretrained weights and imdb data? #1

Open
sshleifer opened this issue Mar 11, 2019 · 1 comment


First, I ran `./download.sh` and `wget http://sato-motoki.com/research/vat/imdb_pretrained_lm_ijcai.model`.

Then I ran the iVat train command from README.md. I've attached the output below. It looks like `vocab_inv` is larger than the `max_vocab` that was in effect when the pretrained model was created.
What is the best way to fix this?
Thanks!

train_set:71246
avg word number:244.2789911012548
vocab:87318
avg word number (train_x): 243.84721829991528
avg word number (dev_x):241.3660095897709
avg word number (test_x):236.99672
lm_words_num:17397769
train_vocab_size: 67054
vocab_inv: 87318
Traceback (most recent call last):
  File "train.py", line 427, in <module>
    main()
  File "train.py", line 181, in main
    serializers.load_npz(args.pretrained_model, pretrain_model)
  File "/data/anaconda/envs/tf17py3/lib/python3.6/site-packages/chainer/serializers/npz.py", line 190, in load_npz
    d.load(obj)
  File "/data/anaconda/envs/tf17py3/lib/python3.6/site-packages/chainer/serializer.py", line 83, in load
    obj.serialize(self)
  File "/data/anaconda/envs/tf17py3/lib/python3.6/site-packages/chainer/link.py", line 1001, in serialize
    d[name].serialize(serializer[name])
  File "/data/anaconda/envs/tf17py3/lib/python3.6/site-packages/chainer/link.py", line 651, in serialize

    data = serializer(name, param.data)
  File "/data/anaconda/envs/tf17py3/lib/python3.6/site-packages/chainer/serializers/npz.py", line 150, in __call__
    numpy.copyto(value, dataset)
ValueError: could not broadcast input array from shape (86935,256) into shape (87318,256)
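
A quick way to see which side is off is to inspect the parameter shapes stored in the pretrained `.npz` snapshot directly and compare them with the vocabulary the local preprocessing produced. This is only a diagnostic sketch; the embedding key name depends on how the model was serialized and is not taken from the repository.

```python
import numpy as np

# List every parameter stored in the pretrained snapshot together with its
# shape, so the saved vocabulary size can be compared against the local one.
with np.load("imdb_pretrained_lm_ijcai.model") as snapshot:
    for key in sorted(snapshot.files):
        print(key, snapshot[key].shape)
```

The first dimension of the embedding parameter (86935 in the error above) is the vocabulary size the snapshot was trained with, while the local preprocessing produced 87318 entries, which is why `numpy.copyto` fails.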


dcetin commented Jun 20, 2019

Were you able to fix the problem? I run into the same issue with the pretrained weights. At first I got an encoding error during preprocessing; after switching to utf-8 encoding, preprocessing ran fine, but then I hit the same error you describe while loading the pretrained weights. It isn't specified anywhere in the code, but was the data used to build the pretrained weights preprocessed with a different encoding than utf-8? Thanks!

Prepare for IMDB
Prepare script is running...
Traceback (most recent call last):
  File "preprocess.py", line 79, in <module>
    prepare_imdb()
  File "preprocess.py", line 55, in prepare_imdb
    imdb_validation_pos_start_id)
  File "preprocess.py", line 24, in load_file
    words = read_text(filename.strip())
  File "preprocess.py", line 11, in read_text
    for line in f:
  File "/cluster/scratch/dcetin/.pyenv/versions/anaconda3-5.1.0/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 399: ordinal not in range(128)
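
The UnicodeDecodeError above comes from `open()` falling back to the locale's default ASCII codec. A minimal sketch of the utf-8 switch described here, assuming a `read_text` helper like the one in `preprocess.py` (the body below is illustrative, not the repository's actual code):

```python
def read_text(filename):
    # Open with an explicit utf-8 encoding instead of the locale default
    # (ASCII on this cluster); undecodable bytes are replaced rather than
    # raising UnicodeDecodeError.
    words = []
    with open(filename, encoding="utf-8", errors="replace") as f:
        for line in f:
            words.extend(line.strip().split())
    return words
```

The run below is the output after that change: preprocessing succeeds, but loading the pretrained weights still fails with the shape mismatch.
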
train_set:71246
avg word number:242.8615501221121
vocab:87008
avg word number (train_x): 242.43914148545608
avg word number (dev_x):239.861747469366
avg word number (test_x):235.59372
lm_words_num:17297560
train_vocab_size: 66825
vocab_inv: 87008
Traceback (most recent call last):
  File "train.py", line 427, in <module>
    main()
  File "train.py", line 181, in main
    serializers.load_npz(args.pretrained_model, pretrain_model)
  File "/cluster/scratch/dcetin/.pyenv/versions/anaconda3-5.1.0/lib/python3.6/site-packages/chainer/serializers/npz.py", line 242, in load_npz
    d.load(obj)
  File "/cluster/scratch/dcetin/.pyenv/versions/anaconda3-5.1.0/lib/python3.6/site-packages/chainer/serializer.py", line 83, in load
    obj.serialize(self)
  File "/cluster/scratch/dcetin/.pyenv/versions/anaconda3-5.1.0/lib/python3.6/site-packages/chainer/link.py", line 1036, in serialize
    d[name].serialize(serializer[name])
  File "/cluster/scratch/dcetin/.pyenv/versions/anaconda3-5.1.0/lib/python3.6/site-packages/chainer/link.py", line 1033, in serialize
    super(Chain, self).serialize(serializer)
  File "/cluster/scratch/dcetin/.pyenv/versions/anaconda3-5.1.0/lib/python3.6/site-packages/chainer/link.py", line 655, in serialize
    data = serializer(name, param.data)  # type: types.NdArray
  File "/cluster/scratch/dcetin/.pyenv/versions/anaconda3-5.1.0/lib/python3.6/site-packages/chainer/serializers/npz.py", line 184, in __call__
    numpy.copyto(value, dataset)
ValueError: could not broadcast input array from shape (76935,64) into shape (77008,64)
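
One possible workaround (not confirmed in this thread): copy only the overlapping rows of the pretrained embedding into the freshly initialized, larger embedding and leave the extra rows at their random initialization. This only makes sense if the first N vocabulary entries line up between the two preprocessing runs, which is an assumption; `"embed/W"` is likewise a hypothetical key name for the embedding parameter, and the shapes follow the first report's error message.

```python
import numpy as np

VOCAB_SIZE = 87318  # vocabulary size from your own preprocessing run

# Load the pretrained embedding matrix from the snapshot; "embed/W" is a
# hypothetical key name, check the actual keys via snapshot.files.
with np.load("imdb_pretrained_lm_ijcai.model") as snapshot:
    pretrained = snapshot["embed/W"]              # e.g. shape (86935, 256)

# Freshly initialized embedding sized for the local vocabulary.
new_embed = np.random.normal(
    scale=0.01, size=(VOCAB_SIZE, pretrained.shape[1])
).astype(np.float32)

# Copy the rows both matrices share; the extra rows stay randomly initialized.
rows = min(pretrained.shape[0], new_embed.shape[0])
new_embed[:rows] = pretrained[:rows]
```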
