Multithreading for embeddings extraction #81

Open
AFAgarap opened this issue Dec 3, 2020 · 0 comments

AFAgarap commented Dec 3, 2020

Hello. Is there a way to extract word embeddings using multiple cores?
Right now, I'm extracting word embeddings for the 20 Newsgroups dataset, and processing the whole dataset still takes a long time. Thank you.

For reference, this is my current function:

from typing import List, Union

import numpy as np
import pymagnitude


def extract_sentence_embeddings(
    texts: Union[str, List[str]], batch_size: int = 2048
) -> np.ndarray:
    """
    Returns the sentence embeddings for the input texts.

    Parameters
    ----------
    texts: str or List[str]
        The input text to vectorize.
    batch_size: int
        The mini-batch size to use for computation.

    Returns
    -------
    vectors: np.ndarray
        The sentence embeddings representation for the input texts.
    """
    vectorizer = pymagnitude.Magnitude("data/glove.840B.300d.magnitude")
    if isinstance(texts, str):
        # Single text: average the word vectors over the tokens.
        vectors = vectorizer.query(texts.split())
        vectors = np.mean(vectors, axis=0)
        return vectors
    elif isinstance(texts, list):
        vectors = []
        # Step by batch_size so the final, possibly smaller, batch is included.
        for offset in range(0, len(texts), batch_size):
            vector = vectorizer.query(
                [
                    # Empty texts get two empty tokens as a placeholder.
                    text.split() if text.split() else ["", ""]
                    for text in texts[offset : offset + batch_size]
                ]
            )
            # Average over the token axis to get one vector per text.
            vector = np.mean(vector, axis=1)
            vectors.append(vector)
        # Join the per-batch arrays into a single (num_texts, dim) array.
        return np.concatenate(vectors, axis=0)
    raise TypeError("texts must be a str or a list of str")

Since I'm using 300-dimensional vectors, memory is easily exhausted, which is why I batch the text data.
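
To give a rough sense of the memory pressure, here is a back-of-the-envelope sketch. It assumes the batched query returns a dense array padded to the longest text in the batch and stored as 4-byte floats; the helper name `padded_batch_bytes` and the 1,000-token figure are hypothetical, chosen only for illustration.

```python
def padded_batch_bytes(
    batch_size: int, max_tokens: int, dim: int = 300, bytes_per_value: int = 4
) -> int:
    """Bytes needed for one dense (batch_size, max_tokens, dim) array.

    Assumes every text in the batch is padded to max_tokens and each
    value takes bytes_per_value bytes (4 for 32-bit floats).
    """
    return batch_size * max_tokens * dim * bytes_per_value


# Hypothetical figures: 2048 texts per batch, longest text of 1,000 tokens.
size_gb = padded_batch_bytes(2048, 1000) / 1e9
print(f"{size_gb:.2f} GB")  # prints "2.46 GB"
```

Under these assumptions a single batch already needs a few gigabytes before averaging, so the batch size has to shrink as the longest text grows.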

Looking forward to your response! Thank you!
