
[odin] EmbeddingsResource should extend ExplicitWordEmbeddingMap #679

Open
myedibleenso opened this issue Nov 19, 2022 · 4 comments
@myedibleenso (Member) commented Nov 19, 2022

Odin's EmbeddingsResource extends the deprecated SanitizedWordEmbeddingMap.

Tangential, but have we given any thought to using an ANN index (e.g., annoy4s) for Odin instead?

@kwalcock (Member) commented

In order to avoid the deprecation, the code below can be used. However, since an InputStream is being used, nothing keeps track of whether this set of vectors has already been loaded for other purposes. To coordinate that, the OdinResourceManager needs to be interfaced with the WordEmbeddingMapPool, and someone would need to know the naming conventions used in both classes to do this.

package org.clulab.odin.impl

import org.clulab.embeddings.{ExplicitWordEmbeddingMap, WordEmbeddingMap}
import org.clulab.scala.WrappedArray._

import java.io.InputStream

trait OdinResource

// for distributional similarity comparisons
class EmbeddingsResource(is: InputStream) extends OdinResource {
  // Load vectors in the text (non-binary) format from the provided stream.
  val wordEmbeddingMap = ExplicitWordEmbeddingMap(is, binary = false)

  // Dot product of the two word vectors (cosine similarity if the stored
  // vectors are normalized); returns -1 when either word is out of vocabulary.
  def similarity(w1: String, w2: String): Double = {
    val scoreOpt = for {
      vec1 <- wordEmbeddingMap.get(w1)
      vec2 <- wordEmbeddingMap.get(w2)
    } yield WordEmbeddingMap.dotProduct(vec1, vec2).toDouble

    scoreOpt.getOrElse(-1d)
  }
}
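
For what it's worth, here is a sketch of what that coordination might look like, assuming the pool's getOrElseCreate interface; the resource name below is a hypothetical placeholder and would have to follow whatever conventions OdinResourceManager and the pool end up sharing:

import org.clulab.embeddings.{WordEmbeddingMap, WordEmbeddingMapPool}

// Sketch only: the pool caches maps by name, so repeated construction of
// EmbeddingsResource-like objects would reuse the same loaded vectors.
object PooledEmbeddings {
  // "glove.840B.300d.10f" is a placeholder name, not a committed convention.
  def get(name: String = "glove.840B.300d.10f"): WordEmbeddingMap =
    WordEmbeddingMapPool.getOrElseCreate(name, compact = true)
}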

@myedibleenso (Member, Author) commented Feb 23, 2023

Thanks for the snippet, @kwalcock.

Have you all talked about using an ANN index for a large set of embeddings? Since processors is still using static word embeddings, I am thinking n-gram embeddings could help to improve the relevance of multi-token matches.

@kwalcock (Member) commented
As in approximate nearest neighbor? Some were used for ConceptAlignment in alignment/indexer/knn/hnswlib; specifically, the hnswlib library was used. Only individual strings were added to the index, so I suppose that's unigram. Are you wanting to pair the words and concatenate their vectors? I haven't heard that mentioned in relation to processors.

@myedibleenso (Member, Author) commented

> Are you wanting to pair the words and concatenate their vectors?

No, I meant averaging: summing the vectors and dividing element-wise. It seems my memory was mistaken, though; we don't currently support this kind of thing: simScore(ave(embedding(<tok-1>), embedding(<tok-2>))) > 0.6
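
For concreteness, here is a sketch of that element-wise averaging; ave is a hypothetical helper, not an existing processors/Odin API:

object PhraseEmbeddings {
  // Element-wise average of word vectors, re-normalized to unit length so
  // that dot products against other unit vectors behave like cosine similarities.
  def ave(vectors: Seq[IndexedSeq[Float]]): Array[Float] = {
    require(vectors.nonEmpty, "need at least one vector")
    val dim = vectors.head.length
    val sum = new Array[Float](dim)
    for (vec <- vectors; i <- 0 until dim) sum(i) += vec(i)
    val avg = sum.map(_ / vectors.length)
    val len = math.sqrt(avg.foldLeft(0.0)((acc, x) => acc + x * x)).toFloat
    if (len > 0f) avg.map(_ / len) else avg
  }
}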

> As in approximate nearest neighbor?

Yes, an approximate nearest neighbors index.

Right now Odin token constraints support expressions like simScore("tiger") > 0.9, which retrieves the embedding for the token currently being examined and calculates its cosine similarity with the embedding for "tiger". Imagine you wanted to use this pattern with a phrase like "tax attorney". Including embeddings for bigrams in some kind of in-memory store isn't very practical; an ANN index is one possible solution.
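
As a sketch of that ANN route using the hnswlib library mentioned above (the builder parameters and item type here are assumptions, and exact signatures may vary by version):

import com.github.jelmerk.knn.{DistanceFunctions, Item}
import com.github.jelmerk.knn.hnsw.HnswIndex

import scala.jdk.CollectionConverters._

// Sketch: index phrase vectors (e.g., averaged n-gram embeddings) so that
// nearest-neighbor lookups don't require an exhaustive in-memory map.
case class PhraseItem(id: String, vector: Array[Float]) extends Item[String, Array[Float]] {
  override def dimensions(): Int = vector.length
}

object AnnSketch {
  def main(args: Array[String]): Unit = {
    val dim = 300
    val index = HnswIndex
      .newBuilder(dim, DistanceFunctions.FLOAT_COSINE_DISTANCE, 1000000)
      .build[String, PhraseItem]()

    // The vector here is a dummy; real entries would be averaged n-gram embeddings.
    index.add(PhraseItem("tax attorney", Array.fill(dim)(0.1f)))

    val neighbors = index.findNearest(Array.fill(dim)(0.1f), 10).asScala
    neighbors.foreach(result => println(s"${result.item.id} distance=${result.distance}"))
  }
}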

Larger context: I am thinking about extending Odin to support a new kind of embedding-based NER (just a sketch below):

- name: "embedding-ner"
  label: ActionStar
  type: embedding
  # will compare available embeddings for n-grams of the specified sizes
  phrases: [1, 2, 3]
  pattern: |
    ave("Sylvester Stallone", "Arnold Schwarzenegger") > .9
