A Simple but Tough-to-Beat Baseline for Sentence Embeddings #157

Open

zachmayer opened this issue Nov 8, 2016 · 9 comments
@zachmayer

Paper: http://104.155.136.4:3000/pdf?id=SyK00v5xx
Blog post: http://www.offconvex.org/2016/02/14/word-embeddings-2/

Looks like an interesting idea.
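For context, the paper's recipe is a weighted average of word vectors with smoothed inverse frequency weights a / (a + p(w)), where p(w) is the unigram probability and a ≈ 1e-3, followed by removing the projection on the first singular vector. A rough sketch in R, assuming a document-term count matrix dtm and a word_vectors matrix with one row per word (names here are illustrative):

a = 1e-3
common_terms = intersect(colnames(dtm), rownames(word_vectors))
dtm = dtm[, common_terms]

# smoothed inverse frequency weights: a / (a + p(w))
p_w = Matrix::colSums(dtm) / sum(dtm)
sif = a / (a + p_w)

# weighted average of word vectors per sentence
dtm_w = text2vec::normalize(dtm %*% Matrix::Diagonal(x = sif), "l1")
sentence_vectors = as.matrix(dtm_w %*% word_vectors[common_terms, ])

# remove the common component: projection on the first singular vector
u = svd(sentence_vectors, nu = 0, nv = 1)$v
sentence_vectors = sentence_vectors - sentence_vectors %*% u %*% t(u)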

@dselivanov
Owner

Thx! I've been subscribed to the offconvex blog for quite some time :-)
Another thing I want to try: http://www.offconvex.org/2016/07/10/embeddingspolysemy/. I even created the rksvd repo to port the k-SVD algorithm, but I can't find time to finish it =(

@bob-rietveld

Hi, thanks for creating this super fast package. I use it a lot. I am trying to use the GloVe embeddings to create sentence representations. My first attempt is to simply average the word embeddings per sentence. I can figure it out using other packages like cleanNLP, whose tokenizer provides a sentence id, but I would prefer to stay within the text2vec-verse. Do you think it is possible to average the embeddings per sentence using the current functions in the package? Thanks for your help.

@dselivanov
Owner

@good-marketing, that's easy with a little bit of linear algebra :-) (though I will probably create a model for this).

Below I will suppose you already have dtm - a document-term matrix with word counts - and word_vectors - a matrix of word embeddings with one row per word.

common_terms = intersect(colnames(dtm), rownames(word_vectors))
# l1-normalize rows so each document's term counts sum to 1;
# the matrix product below then yields per-document averages
dtm_averaged = normalize(dtm[, common_terms], "l1")
# you can re-weight dtm above with tf-idf instead of the "l1" norm
sentence_vectors = dtm_averaged %*% word_vectors[common_terms, ]
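For example, the tf-idf variant could look like this (a sketch using text2vec's TfIdf model in place of the "l1" normalization):

tfidf = TfIdf$new()
dtm_tfidf = fit_transform(dtm[, common_terms], tfidf)
sentence_vectors = dtm_tfidf %*% word_vectors[common_terms, ]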

Let me know if the code above is not clear.

@bob-rietveld

Thanks for the prompt answer. I am able to run the code; now I'll try to figure out what to make of it ;-)

@bob-rietveld

Hi Dimitri,

I was looking at the results of the method you mentioned. The resulting sentence_vectors is a matrix of n documents × d averaged embedding dimensions. The problem I have is that I'd like a sentence representation, not a document representation. Or am I misinterpreting your solution?

One thought I had was to split the documents into sentences and then create a dtm. Each sentence is then effectively a document, and I can apply the algebra you posted. I guess the dtm will be a lot sparser; I'm not sure what the effect will be. Do you think this is a 'correct' approach? Thanks for your help.

@dselivanov
Owner

@good-marketing splitting documents into sentences is the way to go. We just change the level of granularity of the analysis. I think this approach is 100% correct; I would go the same way myself.

stringi::stri_split_* or stringr::str_split_* with a proper sentence-boundary specification can help with splitting into sentences, as sketched below.
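A minimal sketch, assuming docs is a character vector of documents (variable names are illustrative):

library(stringi)
library(text2vec)

# split each document on sentence boundaries (ICU break iterator)
sentences = unlist(stri_split_boundaries(docs, type = "sentence"))
sentences = stri_trim_both(sentences)
sentences = sentences[sentences != ""]

# build a sentence-level dtm: each sentence is now treated as a document,
# so the averaging code above applies unchanged
it = itoken(sentences, preprocessor = tolower, tokenizer = word_tokenizer)
vocab = create_vocabulary(it)
dtm_sentences = create_dtm(it, vocab_vectorizer(vocab))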

@bob-rietveld

Great, thanks for the super fast response. Would you recommend tokenize_sentences from the tokenizers package? Just wondering, since you're also a package author there ;-)
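I.e. something like this (assuming the same docs vector as above):

library(tokenizers)
sentences = unlist(tokenize_sentences(docs))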

@dselivanov
Owner

dselivanov commented Jul 10, 2017 via email

@sfohr

sfohr commented Aug 13, 2019

I'll take a shot at it next month, will keep you posted!
