Word2Vec negative-sample ids are not mapped to word ids #982

Open
KiraYeetar opened this issue Feb 29, 2024 · 0 comments

Around line 116 of word2vec_reader.py:

  for i in range(self.neg_num):
      tmp.append(random.random())
  neg_array = self.cs.searchsorted(tmp)

  output.append(
      np.array([int(i)
                for i in neg_array]).astype('int64'))

  yield output

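The ids produced by negative sampling are index positions in the sampling list (self.cs), and they are emitted directly as output without being mapped to the corresponding word_id. This may make the model's negative-sampling logic completely wrong.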
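To make the concern concrete, here is a minimal sketch of the mapping step I would expect. It assumes `self.cs` is a cumulative-probability array and that there is a parallel array of word ids in the same order; the names `word_ids` and `sample_negative_ids` are hypothetical and do not come from word2vec_reader.py.

```python
import numpy as np

def sample_negative_ids(cs, word_ids, neg_num, rng):
    """Draw neg_num negative samples and return real word ids.

    cs       : cumulative probabilities; cs[k] is the upper bound of the
               probability interval that belongs to word_ids[k]
    word_ids : hypothetical array of word ids, in the same order as cs
    """
    draws = rng.random(neg_num)
    # searchsorted returns positions in cs, not word ids
    positions = np.searchsorted(cs, draws)
    # guard against cs[-1] being slightly below 1.0 due to rounding
    positions = np.minimum(positions, len(cs) - 1)
    # the step missing from the quoted snippet: position -> word id
    return np.asarray(word_ids, dtype="int64")[positions]

# Toy data: three words whose ids differ from their positions in cs
word_ids = [17, 4, 42]
freq = np.array([5.0, 3.0, 2.0])
cs = np.cumsum(freq / freq.sum())
rng = np.random.default_rng(0)
neg = sample_negative_ids(cs, word_ids, 5, rng)
```

If the reader builds self.cs in word_id order, so that position k always belongs to word_id k, the two are the same and the existing code is fine. That ordering is what I could not confirm from the code.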

Also, when picking the context words, why is window_size randomized? Is it so that the demo runs quickly?
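For reference, the behaviour I am asking about looks roughly like the sketch below. It assumes the effective window is drawn uniformly from 1 to window_size for each center word; the function name `context_words` is hypothetical, and the exact range used in word2vec_reader.py may differ.

```python
import random

def context_words(sentence_ids, center, window_size, rng=random):
    """Return the context ids around position `center`.

    The effective window is drawn per center word (assumed here to be
    uniform in 1..window_size), so it is often smaller than window_size.
    """
    effective = rng.randint(1, window_size)  # inclusive on both ends
    start = max(0, center - effective)
    end = min(len(sentence_ids), center + effective + 1)
    return [sentence_ids[j] for j in range(start, end) if j != center]

random.seed(0)
ctx = context_words(list(range(10)), center=5, window_size=3)
```

Under this assumption a word at distance d from the center is kept with probability (window_size - d + 1) / window_size, so nearer words are used as context more often and fewer training pairs are produced overall. I am not sure which of these effects is the intended one.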
