This repository has been archived by the owner on Jan 1, 2021. It is now read-only.

Fix one-hot encoding #80

Open

wants to merge 2 commits into master
Conversation

@maxim5 commented Dec 16, 2017

TL;DR: the length calculation is wrong, and padded zeros are never ignored.

Note that `vocab_encode` encodes each char as an index in `1`..`vocab_len`: that's what is stored in `seq` before it goes through one-hot encoding. The expectation is that `tf.one_hot` will encode only valid indices and return zeros for the padding (which is `0`), but that's not what it does. Instead, it encodes every index in `0`..`vocab_len-1` and ignores `vocab_len`. This means the `}` char will always end the sequence, while padded zeros are processed as normal chars.
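
A minimal standalone sketch of that behavior (TF 2.x eager mode with a hypothetical 3-char vocab, for illustration only; the repo itself targets TF 1.x):

```python
import tensorflow as tf

vocab_len = 3  # hypothetical tiny vocab for illustration

# seq holds the 1-based indices produced by vocab_encode; 0 is padding.
seq = tf.constant([[1, 2, 3, 0, 0]])

# tf.one_hot only encodes indices in 0..depth-1; anything outside that
# range (here: index 3 == vocab_len) becomes an all-zero vector.
print(tf.one_hot(seq, vocab_len))
# -> index 0 (the padding) gets a valid one-hot vector [1, 0, 0],
#    index 3 (the last vocab char, e.g. '}') gets [0, 0, 0].
```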

Doing `seq - 1` fixes both the padding `0` (which should be invalid) and the `vocab_len` (which should be valid) indices.

By the way, the length calculation can also be simplified to `tf.reduce_sum(tf.reduce_max(seq, 2), 1)`.
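
Continuing the sketch above (this mirrors the PR's description, not the repo's exact code; here `seq` still holds the 1-based indices, and the length reduction runs on the one-hot tensor):

```python
# Shift before encoding: padding 0 maps to -1 (out of range, so it
# becomes an all-zero vector), and vocab_len maps to the valid
# index vocab_len - 1.
one_hot = tf.one_hot(seq - 1, vocab_len)

# Simplified length calculation: a valid timestep has a max of 1 along
# the one-hot axis, a padded timestep is all zeros, so summing the
# per-step maxima over time gives the true sequence length.
length = tf.reduce_sum(tf.reduce_max(one_hot, 2), 1)  # -> [3.]
```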
Since one-hot encoding now shifts each index down by 1, the generator must account for that; otherwise the sampled sequence will collapse to zeros.
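
A hedged sketch of the matching generator-side change, assuming the generator samples an index from the model's output and decodes it through the 1-based vocab (`logits` and `vocab_decode` are illustrative names, not the repo's exact code):

```python
# `logits` stands in for the model's per-step output over the vocab
# (shape [batch, vocab_len]); illustrative, not repo code.
logits = tf.random.normal([1, vocab_len])

# After the seq - 1 shift, output index i corresponds to the character
# that vocab_encode mapped to i + 1, so add 1 back before decoding;
# otherwise the sampled index is fed back off by one and the generated
# sequence collapses to zeros.
sample = tf.argmax(logits, axis=-1) + 1
# generated_char = vocab_decode(sample)  # hypothetical decode helper
```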