Mismatch between pretrained weights and imdb data? #1

Open
sshleifer opened this issue Mar 11, 2019 · 1 comment


First, I ran `./download.sh` and `wget http://sato-motoki.com/research/vat/imdb_pretrained_lm_ijcai.model`.

Then I ran the iVat train command from README.md. I've attached the output below. It looks like `vocab_inv` is larger than the `max_vocab` that was in effect when the pretrained model was created.
What is the best way to fix this?
Thanks!

train_set:71246
avg word number:244.2789911012548
vocab:87318
avg word number (train_x): 243.84721829991528
avg word number (dev_x):241.3660095897709
avg word number (test_x):236.99672
lm_words_num:17397769
train_vocab_size: 67054
vocab_inv: 87318
Traceback (most recent call last):
  File "train.py", line 427, in <module>
    main()
  File "train.py", line 181, in main
    serializers.load_npz(args.pretrained_model, pretrain_model)
  File "/data/anaconda/envs/tf17py3/lib/python3.6/site-packages/chainer/serializers/npz.py", line 190, in load_npz
    d.load(obj)
  File "/data/anaconda/envs/tf17py3/lib/python3.6/site-packages/chainer/serializer.py", line 83, in load
    obj.serialize(self)
  File "/data/anaconda/envs/tf17py3/lib/python3.6/site-packages/chainer/link.py", line 1001, in serialize
    d[name].serialize(serializer[name])
  File "/data/anaconda/envs/tf17py3/lib/python3.6/site-packages/chainer/link.py", line 651, in serialize

    data = serializer(name, param.data)
  File "/data/anaconda/envs/tf17py3/lib/python3.6/site-packages/chainer/serializers/npz.py", line 150, in __call__
    numpy.copyto(value, dataset)
ValueError: could not broadcast input array from shape (86935,256) into shape (87318,256)
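
A quick way to see which side is off is to inspect the parameter shapes stored in the pretrained `.npz` snapshot directly and compare them with the vocabulary the local preprocessing produced. This is only a diagnostic sketch; the embedding key name depends on how the model was serialized and is not taken from the repository.

```python
import numpy as np

# List every parameter stored in the pretrained snapshot together with its
# shape, so the saved vocabulary size can be compared against the local one.
with np.load("imdb_pretrained_lm_ijcai.model") as snapshot:
    for key in sorted(snapshot.files):
        print(key, snapshot[key].shape)
```

The first dimension of the embedding parameter (86935 in the error above) is the vocabulary size the snapshot was trained with, while the local preprocessing produced 87318 entries, which is why `numpy.copyto` fails.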


dcetin commented Jun 20, 2019

Were you able to fix the problem? I run into the same issue with the pretrained weights. At first I got an encoding error during preprocessing; after switching to utf-8 encoding, preprocessing ran fine, but then I hit the same error you describe while loading the pretrained weights. It isn't specified anywhere in the code, but was the data used to build the pretrained weights preprocessed with a different encoding than utf-8? Thanks!

Prepare for IMDB
Prepare script is running...
Traceback (most recent call last):
  File "preprocess.py", line 79, in <module>
    prepare_imdb()
  File "preprocess.py", line 55, in prepare_imdb
    imdb_validation_pos_start_id)
  File "preprocess.py", line 24, in load_file
    words = read_text(filename.strip())
  File "preprocess.py", line 11, in read_text
    for line in f:
  File "/cluster/scratch/dcetin/.pyenv/versions/anaconda3-5.1.0/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 399: ordinal not in range(128)
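
The UnicodeDecodeError above comes from `open()` falling back to the locale's default ASCII codec. A minimal sketch of the utf-8 switch described here, assuming a `read_text` helper like the one in `preprocess.py` (the body below is illustrative, not the repository's actual code):

```python
def read_text(filename):
    # Open with an explicit utf-8 encoding instead of the locale default
    # (ASCII on this cluster); undecodable bytes are replaced rather than
    # raising UnicodeDecodeError.
    words = []
    with open(filename, encoding="utf-8", errors="replace") as f:
        for line in f:
            words.extend(line.strip().split())
    return words
```

The run below is the output after that change: preprocessing succeeds, but loading the pretrained weights still fails with the shape mismatch.
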
train_set:71246
avg word number:242.8615501221121
vocab:87008
avg word number (train_x): 242.43914148545608
avg word number (dev_x):239.861747469366
avg word number (test_x):235.59372
lm_words_num:17297560
train_vocab_size: 66825
vocab_inv: 87008
Traceback (most recent call last):
  File "train.py", line 427, in <module>
    main()
  File "train.py", line 181, in main
    serializers.load_npz(args.pretrained_model, pretrain_model)
  File "/cluster/scratch/dcetin/.pyenv/versions/anaconda3-5.1.0/lib/python3.6/site-packages/chainer/serializers/npz.py", line 242, in load_npz
    d.load(obj)
  File "/cluster/scratch/dcetin/.pyenv/versions/anaconda3-5.1.0/lib/python3.6/site-packages/chainer/serializer.py", line 83, in load
    obj.serialize(self)
  File "/cluster/scratch/dcetin/.pyenv/versions/anaconda3-5.1.0/lib/python3.6/site-packages/chainer/link.py", line 1036, in serialize
    d[name].serialize(serializer[name])
  File "/cluster/scratch/dcetin/.pyenv/versions/anaconda3-5.1.0/lib/python3.6/site-packages/chainer/link.py", line 1033, in serialize
    super(Chain, self).serialize(serializer)
  File "/cluster/scratch/dcetin/.pyenv/versions/anaconda3-5.1.0/lib/python3.6/site-packages/chainer/link.py", line 655, in serialize
    data = serializer(name, param.data)  # type: types.NdArray
  File "/cluster/scratch/dcetin/.pyenv/versions/anaconda3-5.1.0/lib/python3.6/site-packages/chainer/serializers/npz.py", line 184, in __call__
    numpy.copyto(value, dataset)
ValueError: could not broadcast input array from shape (76935,64) into shape (77008,64)
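
One possible workaround (not confirmed in this thread): copy only the overlapping rows of the pretrained embedding into the freshly initialized, larger embedding and leave the extra rows at their random initialization. This only makes sense if the first N vocabulary entries line up between the two preprocessing runs, which is an assumption; `"embed/W"` is likewise a hypothetical key name for the embedding parameter, and the shapes follow the first report's error message.

```python
import numpy as np

VOCAB_SIZE = 87318  # vocabulary size from your own preprocessing run

# Load the pretrained embedding matrix from the snapshot; "embed/W" is a
# hypothetical key name, check the actual keys via snapshot.files.
with np.load("imdb_pretrained_lm_ijcai.model") as snapshot:
    pretrained = snapshot["embed/W"]              # e.g. shape (86935, 256)

# Freshly initialized embedding sized for the local vocabulary.
new_embed = np.random.normal(
    scale=0.01, size=(VOCAB_SIZE, pretrained.shape[1])
).astype(np.float32)

# Copy the rows both matrices share; the extra rows stay randomly initialized.
rows = min(pretrained.shape[0], new_embed.shape[0])
new_embed[:rows] = pretrained[:rows]
```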
