
<PAD> #108

Open
duguiming111 opened this issue Jan 24, 2019 · 2 comments

Comments

@duguiming111

Isn't <PAD> used anywhere in the data-processing part? The training-set ids are sparse: only words that exist in the vocabulary get an id.
Also, regarding

embedding = tf.get_variable('embedding', [self.config.vocab_size, self.config.embedding_dim])
embedding_inputs = tf.nn.embedding_lookup(embedding, self.input_x)

this embedding matrix is randomly initialized; does it actually get updated during training?

@gaussic
Owner

gaussic commented Jan 24, 2019

<PAD> has id 0 and acts as a placeholder. It is deliberately kept in the vocabulary for the later padding step: sequences shorter than the fixed length are padded with 0 at the front, which is standard sequence preprocessing.

Also, embedding is a TensorFlow variable, so it is trained automatically as part of the training process.
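A minimal numpy sketch of that front-padding step (the helper name pad_front is made up for illustration; Keras' pad_sequences with padding='pre' produces the same layout):

```python
import numpy as np

def pad_front(sequences, max_len, pad_id=0):
    """Left-pad id sequences with pad_id (the <PAD> id, 0) up to max_len."""
    out = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
    for i, seq in enumerate(sequences):
        tail = seq[-max_len:]                  # truncate from the front if too long
        out[i, max_len - len(tail):] = tail
    return out

batch = pad_front([[3, 7, 2], [5]], max_len=5)
print(batch.tolist())  # [[0, 0, 3, 7, 2], [0, 0, 0, 0, 5]]
```

Because every row has the same length afterwards, the batch can be fed directly to tf.nn.embedding_lookup as input_x.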

@shm007g

shm007g commented Aug 1, 2019

I trained with the pre-trained word vectors from https://github.com/Embedding/Chinese-Word-Vectors and saw roughly a 2-point improvement.
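A hedged sketch of how pre-trained vectors in plain word2vec text format (one "count dim" header line, then "word v1 v2 ..." lines) could seed the embedding matrix; the helper name load_pretrained and the zeroing of the <PAD> row are choices made here for illustration, not the repo's actual code:

```python
import numpy as np

def load_pretrained(vec_path, vocab, dim):
    """Build an init matrix from a word2vec-format text file.
    Words missing from the file keep a small random init;
    row 0 (<PAD>) is set to all zeros."""
    emb = np.random.uniform(-0.05, 0.05, (len(vocab), dim)).astype(np.float32)
    emb[0] = 0.0  # <PAD>
    word_to_id = {w: i for i, w in enumerate(vocab)}
    with open(vec_path, encoding="utf-8") as f:
        next(f)  # skip the "<word_count> <dim>" header line
        for line in f:
            parts = line.rstrip("\n").split(" ")
            if parts[0] in word_to_id and len(parts) == dim + 1:
                emb[word_to_id[parts[0]]] = np.asarray(parts[1:], dtype=np.float32)
    return emb
```

In TF 1.x this matrix can be passed through the initializer argument of tf.get_variable; the variable is still fine-tuned during training unless trainable=False is set.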

Also, I looked at the length distribution: the 50th percentile is already 600+. The code does not keep an unk token, so many out-of-vocabulary words get filtered out, which should shorten the effective seq_len; even so, it may be worth increasing max_seq_len.

count    49999.000000
mean       913.320506
std        930.094315
min          8.000000
25%        350.000000
50%        688.000000
75%       1154.000000
max      27467.000000
Name: text, dtype: float64
100%|██████████████████████████████████████████████████████████████████| 49999/49999 [00:09<00:00, 5113.81it/s]
count     4999.000000
mean       882.249050
std        863.752597
min         15.000000
25%        380.000000
50%        626.000000
75%       1072.000000
max      10919.000000
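The percentile tables above are the output of pandas' Series.describe(); a minimal reproduction on a toy corpus (the list of texts here is made up, the real numbers came from the dataset files):

```python
import pandas as pd

# Hypothetical stand-in for the real corpus: one string per document.
texts = ["短", "一段稍长的文档" * 10, "很长的文档" * 100]
lengths = pd.Series([len(t) for t in texts], name="text")
print(lengths.describe())  # count, mean, std, min, 25%, 50%, 75%, max
```

Running the same two lines over the train and validation files yields the count/mean/percentile tables shown above.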
