This repository has been archived by the owner on Jan 1, 2021. It is now read-only.

Fix one-hot encoding #80

Open

wants to merge 2 commits into master
Conversation

@maxim5 commented Dec 16, 2017

TL;DR: the length calculation is wrong, and padded zeros are never ignored.

Note that `vocab_encode` encodes each char as an index in `1`..`vocab_len`: that's what is stored in `seq` before it goes through one-hot encoding. The expectation is that `tf.one_hot` will encode only valid indices and return zeros for the padding (which is `0`), but that's not what it does. Instead, it encodes every index in `0`..`vocab_len-1` and ignores `vocab_len`. This means the `}` char will always end the sequence, while padded zeros are processed as normal chars.
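
A minimal standalone sketch of that behavior (TF 2.x eager mode with a hypothetical 3-char vocab, for illustration only; the repo itself targets TF 1.x):

```python
import tensorflow as tf

vocab_len = 3  # hypothetical tiny vocab for illustration

# seq holds the 1-based indices produced by vocab_encode; 0 is padding.
seq = tf.constant([[1, 2, 3, 0, 0]])

# tf.one_hot only encodes indices in 0..depth-1; anything outside that
# range (here: index 3 == vocab_len) becomes an all-zero vector.
print(tf.one_hot(seq, vocab_len))
# -> index 0 (the padding) gets a valid one-hot vector [1, 0, 0],
#    index 3 (the last vocab char, e.g. '}') gets [0, 0, 0].
```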

Doing `seq - 1` fixes both the padding `0` (which should be invalid) and the `vocab_len` (which should be valid) indices.

By the way, the length calculation can also be simplified to `tf.reduce_sum(tf.reduce_max(seq, 2), 1)`.
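
Continuing the sketch above (this mirrors the PR's description, not the repo's exact code; here `seq` still holds the 1-based indices, and the length reduction runs on the one-hot tensor):

```python
# Shift before encoding: padding 0 maps to -1 (out of range, so it
# becomes an all-zero vector), and vocab_len maps to the valid
# index vocab_len - 1.
one_hot = tf.one_hot(seq - 1, vocab_len)

# Simplified length calculation: a valid timestep has a max of 1 along
# the one-hot axis, a padded timestep is all zeros, so summing the
# per-step maxima over time gives the true sequence length.
length = tf.reduce_sum(tf.reduce_max(one_hot, 2), 1)  # -> [3.]
```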
Since one-hot encoding now shifts each index down by 1, the generator must account for that; otherwise the sampled sequence will collapse to zeros.
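
A hedged sketch of the matching generator-side change, assuming the generator samples an index from the model's output and decodes it through the 1-based vocab (`logits` and `vocab_decode` are illustrative names, not the repo's exact code):

```python
# `logits` stands in for the model's per-step output over the vocab
# (shape [batch, vocab_len]); illustrative, not repo code.
logits = tf.random.normal([1, vocab_len])

# After the seq - 1 shift, output index i corresponds to the character
# that vocab_encode mapped to i + 1, so add 1 back before decoding;
# otherwise the sampled index is fed back off by one and the generated
# sequence collapses to zeros.
sample = tf.argmax(logits, axis=-1) + 1
# generated_char = vocab_decode(sample)  # hypothetical decode helper
```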