
[odin] EmbeddingsResource should extend ExplicitWordEmbeddingMap #679

Open
myedibleenso opened this issue Nov 19, 2022 · 4 comments
@myedibleenso (Member) commented Nov 19, 2022

Odin's EmbeddingsResource extends the deprecated SanitizedWordEmbeddingMap.

Tangential, but have we given any thought to using an ANN index (e.g., annoy4s) for Odin instead?

@kwalcock (Member) commented

In order to avoid the deprecation, the code below can be used. However, since an InputStream is being used, nothing keeps track of whether this set of vectors has already been loaded for other purposes. To coordinate that, the OdinResourceManager needs to be interfaced with the WordEmbeddingMapPool, and someone would need to know the naming conventions used in both classes to do this.

package org.clulab.odin.impl

import org.clulab.embeddings.{ExplicitWordEmbeddingMap, WordEmbeddingMap}
import org.clulab.scala.WrappedArray._

import java.io.InputStream

trait OdinResource

// for distributional similarity comparisons
class EmbeddingsResource(is: InputStream) extends OdinResource {
  // Load vectors in the text (non-binary) format from the provided stream.
  val wordEmbeddingMap = ExplicitWordEmbeddingMap(is, binary = false)

  // Dot product of the two word vectors (cosine similarity if the stored
  // vectors are normalized); returns -1 when either word is out of vocabulary.
  def similarity(w1: String, w2: String): Double = {
    val scoreOpt = for {
      vec1 <- wordEmbeddingMap.get(w1)
      vec2 <- wordEmbeddingMap.get(w2)
    } yield WordEmbeddingMap.dotProduct(vec1, vec2).toDouble

    scoreOpt.getOrElse(-1d)
  }
}
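
For what it's worth, here is a sketch of what that coordination might look like, assuming the pool's getOrElseCreate interface; the resource name below is a hypothetical placeholder and would have to follow whatever conventions OdinResourceManager and the pool end up sharing:

import org.clulab.embeddings.{WordEmbeddingMap, WordEmbeddingMapPool}

// Sketch only: the pool caches maps by name, so repeated construction of
// EmbeddingsResource-like objects would reuse the same loaded vectors.
object PooledEmbeddings {
  // "glove.840B.300d.10f" is a placeholder name, not a committed convention.
  def get(name: String = "glove.840B.300d.10f"): WordEmbeddingMap =
    WordEmbeddingMapPool.getOrElseCreate(name, compact = true)
}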

@myedibleenso (Member, Author) commented Feb 23, 2023

Thanks for the snippet, @kwalcock.

Have you all talked about using an ANN index for a large set of embeddings? Since processors is still using static word embeddings, I am thinking n-gram embeddings could help to improve the relevance of multi-token matches.

@kwalcock (Member) commented
As in approximate nearest neighbor? Some were used for ConceptAlignment in alignment/indexer/knn/hnswlib; specifically, the hnswlib library was used. Only individual strings were added to the index, so I suppose that's unigram. Are you wanting to pair the words and concatenate their vectors? I haven't heard that mentioned in relation to processors.

@myedibleenso (Member, Author) commented

> Are you wanting to pair the words and concatenate their vectors?

No, I meant averaging: summing the vectors and dividing element-wise. It seems my memory was mistaken, though; we don't currently support this kind of thing: simScore(ave(embedding(<tok-1>), embedding(<tok-2>))) > 0.6
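
For concreteness, here is a sketch of that element-wise averaging; ave is a hypothetical helper, not an existing processors/Odin API:

object PhraseEmbeddings {
  // Element-wise average of word vectors, re-normalized to unit length so
  // that dot products against other unit vectors behave like cosine similarities.
  def ave(vectors: Seq[IndexedSeq[Float]]): Array[Float] = {
    require(vectors.nonEmpty, "need at least one vector")
    val dim = vectors.head.length
    val sum = new Array[Float](dim)
    for (vec <- vectors; i <- 0 until dim) sum(i) += vec(i)
    val avg = sum.map(_ / vectors.length)
    val len = math.sqrt(avg.foldLeft(0.0)((acc, x) => acc + x * x)).toFloat
    if (len > 0f) avg.map(_ / len) else avg
  }
}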

> As in approximate nearest neighbor?

Yes, an approximate nearest neighbors index.

Right now Odin token constraints support expressions like simScore("tiger") > 0.9, which retrieves the embedding for the token currently being examined and calculates its cosine similarity with the embedding for "tiger". Imagine you wanted to use this pattern with a phrase like "tax attorney". Including embeddings for bigrams in some kind of in-memory store isn't very practical; an ANN index is one possible solution.
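
As a sketch of that ANN route using the hnswlib library mentioned above (the builder parameters and item type here are assumptions, and exact signatures may vary by version):

import com.github.jelmerk.knn.{DistanceFunctions, Item}
import com.github.jelmerk.knn.hnsw.HnswIndex

import scala.jdk.CollectionConverters._

// Sketch: index phrase vectors (e.g., averaged n-gram embeddings) so that
// nearest-neighbor lookups don't require an exhaustive in-memory map.
case class PhraseItem(id: String, vector: Array[Float]) extends Item[String, Array[Float]] {
  override def dimensions(): Int = vector.length
}

object AnnSketch {
  def main(args: Array[String]): Unit = {
    val dim = 300
    val index = HnswIndex
      .newBuilder(dim, DistanceFunctions.FLOAT_COSINE_DISTANCE, 1000000)
      .build[String, PhraseItem]()

    // The vector here is a dummy; real entries would be averaged n-gram embeddings.
    index.add(PhraseItem("tax attorney", Array.fill(dim)(0.1f)))

    val neighbors = index.findNearest(Array.fill(dim)(0.1f), 10).asScala
    neighbors.foreach(result => println(s"${result.item.id} distance=${result.distance}"))
  }
}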

Larger context: I am thinking about extending Odin to support a new kind of embedding-based NER (just a sketch below):

- name: "embedding-ner"
  label: ActionStar
  type: embedding
  # will compare available embeddings for n-grams of the specified sizes
  phrases: [1, 2, 3]
  pattern: |
    ave("Sylvester Stallone", "Arnold Schwarzenegger") > .9
