skip_grams #25

Open
brucexx opened this issue Aug 15, 2018 · 1 comment


brucexx commented Aug 15, 2018

I think there is a problem with this part of the logic:

words_count = Counter(words)
words = [w for w in words if words_count[w] > 50]
In [19]:

vocab = set(words)
vocab_to_int = {w: c for c, w in enumerate(vocab)}
int_to_vocab = {c: w for c, w in enumerate(vocab)}
In [20]:
print("total words: {}".format(len(words)))
print("unique words: {}".format(len(set(words))))
total words: 8623686
unique words: 6791
In [21]:

int_words = [vocab_to_int[w] for w in words]

As far as I can tell, vocab_to_int only maps each word to the position of its first occurrence.

t = 1e-5 # value of t
threshold = 0.9 # drop-probability threshold

And then this index is what gets used to compute the word frequencies?? Can anyone tell me what is going on here?

int_word_counts = Counter(int_words)
total_count = len(int_words)
word_freqs = {w: c/total_count for w, c in int_word_counts.items()}

prob_drop = {w: 1 - np.sqrt(t / word_freqs[w]) for w in int_word_counts}

Sampling the words:

train_words = [w for w in int_words if prob_drop[w] < threshold]
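To make the filter above concrete, here is a small worked sketch of what `prob_drop[w] < threshold` does with the quoted values `t = 1e-5` and `threshold = 0.9`. The helper name `prob_drop_for` and the sample frequencies are made up for illustration, and `math.sqrt` stands in for `np.sqrt` so the snippet has no dependencies. Note that the quoted code compares against a fixed threshold rather than drawing a random number, so a word is either always kept or always dropped.

```python
import math

t = 1e-5          # same value as in the quoted notebook
threshold = 0.9   # same value as in the quoted notebook


def prob_drop_for(freq, t=t):
    """Drop score used in the quoted code: 1 - sqrt(t / freq).

    Hypothetical helper, written only for this illustration.
    """
    return 1 - math.sqrt(t / freq)


# 1 - sqrt(t / f) >= threshold  is equivalent to  f >= t / (1 - threshold)**2,
# so every word whose relative frequency reaches this cutoff is removed.
cutoff_freq = t / (1 - threshold) ** 2

# Illustrative relative frequencies (not taken from the real corpus).
for freq in (1e-5, 1e-4, 5e-4, 2e-3, 1e-2):
    score = prob_drop_for(freq)
    decision = "kept" if score < threshold else "dropped"
    print("freq={:.0e}  prob_drop={:.3f}  {}".format(freq, score, decision))

print("cutoff frequency: {:.6f}".format(cutoff_freq))
```

Under these settings the cutoff works out to a relative frequency of about 0.001, so with the 8623686 total words reported above, any word occurring more than roughly 8600 times would be filtered out of `train_words`.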

@andrew-zzz

You didn't read the code carefully. vocab_to_int is built by taking set(words) and then using each word's index in that set as its one-hot identifier.
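A minimal sketch of the point made in this reply, using a made-up toy word list in place of the real corpus: because `vocab` is a set, `enumerate(vocab)` assigns exactly one integer ID per unique word, so counting IDs in `int_words` gives the same counts as counting the words themselves. The variable names follow the quoted notebook; `recovered` is an extra name introduced here.

```python
from collections import Counter

# Toy corpus standing in for the real `words` list (an assumption for this demo).
words = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]

# Same construction as in the quoted notebook: one ID per unique word.
vocab = set(words)
vocab_to_int = {w: c for c, w in enumerate(vocab)}
int_to_vocab = {c: w for c, w in enumerate(vocab)}

# Encode the corpus; repeated words map to the same ID every time.
int_words = [vocab_to_int[w] for w in words]

word_counts = Counter(words)
int_word_counts = Counter(int_words)

# Translate the ID counts back to words and compare with the direct count.
recovered = {int_to_vocab[i]: n for i, n in int_word_counts.items()}

print(recovered == dict(word_counts))
```

The IDs are arbitrary labels (their order depends on the set's iteration order), not positions in the text, which is why `Counter(int_words)` is a valid way to get word frequencies.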
