Merge pull request #40 from RohanAlexander/patch-1
Fix minor typos
veekaybee committed Feb 11, 2024
2 parents a10883b + cde2f13 commit 40bd3b6
Showing 1 changed file with 6 additions and 6 deletions.
embeddings.tex: 12 changes (6 additions & 6 deletions)
@@ -257,7 +257,7 @@ \section{Introduction}
\begin{figure}[H]
\centering
\includegraphics[width=.7\linewidth]{figures/embeddings_1.png}
- \caption{Embeddings papers in Arxiv by month. It's interesting to note the decline in frequency of embeddings-specific papers, possibly in tandem with the rise of deep learning architectures like GPT \href{https://github.com/veekaybee/what_are_embeddings/blob/main/notebooks/fig_2_embeddings_papers.ipynb}{source}}
+ \caption{Embeddings papers in arXiv by month. It's interesting to note the decline in frequency of embeddings-specific papers, possibly in tandem with the rise of deep learning architectures like GPT \href{https://github.com/veekaybee/what_are_embeddings/blob/main/notebooks/fig_2_embeddings_papers.ipynb}{source}}
\end{figure}

Building and expanding on the concepts in Word2Vec, the Transformer \citep{vaswani2017attention} architecture, with its self-attention mechanism, a much more specialized case of calculating context around a given word, has become the de-facto way to learn representations of growing multimodal vocabularies, and its rise in popularity both in academia and in industry has caused embeddings to become a staple of deep learning workflows.
@@ -430,7 +430,7 @@ \section{Recommendation as a business problem}

This is the case for many content-based businesses, all of which have feed-like surface areas for recommendations, including Netflix, Pinterest, Spotify, and Reddit. It also covers e-commerce platforms, which must surface relevant items to the user, and information retrieval platforms like search engines, which must provide relevant answers to users upon keyword queries. There is a new category of hybrid applications involving question-and-answering in semantic search contexts that is arising as a result of work around the GPT series of models, but for the sake of simplicity, and because that landscape changes every week, we'll stick to understanding the fundamental underlying concepts.

- In subscription-based platforms\footnote{In ad-based services, the line between retention and revenue is a bit murkier, and we have often what's known as a multi-stakeholder problem, where the actual optimized function is a balance between meeting the needs of the user and meeting the needs of the advertiser \citep{zheng2017multi}. In real life, this can often result in a process of enshittifcation \citep{doctorow_2023} of the platform that leads to extremely suboptimal end-user experiences. So, when we create Flutter, we have to be very careful to balance these concerns, and we'll also assume for the sake of simplification that Flutter is a Good service that loves us as users and wants us to be happy.}, there is clear business objective that's tied directly to the bottom line, as outlined in this 2015 paper \citep{steck2021deep} about Netflix's recsys:
+ In subscription-based platforms\footnote{In ad-based services, the line between retention and revenue is a bit murkier, and we have often what's known as a multi-stakeholder problem, where the actual optimized function is a balance between meeting the needs of the user and meeting the needs of the advertiser \citep{zheng2017multi}. In real life, this can often result in a process of enshittification \citep{doctorow_2023} of the platform that leads to extremely suboptimal end-user experiences. So, when we create Flutter, we have to be very careful to balance these concerns, and we'll also assume for the sake of simplification that Flutter is a Good service that loves us as users and wants us to be happy.}, there is clear business objective that's tied directly to the bottom line, as outlined in this 2015 paper \citep{steck2021deep} about Netflix's recsys:

\begin{quote}

@@ -1090,7 +1090,7 @@ \subsubsection{TF-IDF}

To address the limitations of one-hot encoding, TF-IDF, or term frequency-inverse document frequency was developed. TF-IDF was introduced in the 1970s\footnote{By Karen Spärck Jones, whose paper, \href{https://blog.babbar.tech/who-is-karen-sparck-jones/}{"Synonymy and semantic classification} is fundamental to the field of NLP} as a way to create a vector representation of a document by averaging all the document's word weights. It worked really well for a long time and still does in many cases. For example, one of the most-used search functions, BM25, uses TF-IDF as a baseline \citep{schutze2008introduction} as a default search strategy in Elasticsearch/Opensearch \footnote{You can read about how Elasticsearch implements BM25 \href{https://www.elastic.co/blog/practical-bm25-part-1-how-shards-affect-relevance-scoring-in-elasticsearch}{here}}. It extends TF-IDF to develop a probability associated with the probability of relevance for each pair of words in a document and it is still being applied in neural search today \citep{svore2009machine}.

- TF-IDF will tell you how important a single word is in a corpus by assigning it a weight and, at the same time, down-weight common words like ,"a", "and", and "the". This calculated weight gives us a feature for a single word TF-IDF, and also the relevance of the features across the vocabulary.
+ TF-IDF will tell you how important a single word is in a corpus by assigning it a weight and, at the same time, down-weight common words like, "a", "and", and "the". This calculated weight gives us a feature for a single word TF-IDF, and also the relevance of the features across the vocabulary.

We take all of our input data that's structured in sentences and break it up into individual words, and perform counts on its values, generating the bag of words. TF is term frequency, or the number of times a term appears in a document relative to the other terms in the document.
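To make the term-frequency and inverse-document-frequency pieces concrete, here is a minimal from-scratch sketch; the toy corpus and the unsmoothed log formula are illustrative assumptions rather than the paper's own notebook code.

\begin{minted}{python}
import math
from collections import Counter

corpus = [
    "the bird flew over the house",
    "the bird sat in the tree",
    "i had many dreams about the sea",
]
docs = [doc.lower().split() for doc in corpus]  # naive whitespace tokenization

def tf(term, doc):
    # term frequency: the term's count relative to all terms in the document
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    # inverse document frequency: log of (total documents / documents containing the term)
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

for term in ["the", "bird", "dreams"]:
    print(term, [round(tf_idf(term, doc, docs), 3) for doc in docs])
    # "the" is down-weighted to 0 everywhere; the rarer "dreams" gets the largest weight
\end{minted}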

@@ -1209,7 +1209,7 @@ \subsubsection{TF-IDF}
\caption{Implementation of TF-IDF in scikit-learn \href{https://github.com/veekaybee/what_are_embeddings/blob/main/notebooks/fig_24_tf_idf_from_scratch.ipynb}{source}}
\end{figure}

- Given that inverse document frequency is a measure of whether the word is common or not across the documents, we can see that "dreams" is important because they are rare across the documents and therefore interesting to us more so than "bird."We see that the tf-idf for a given word, "dreams", is slightly different for each of these implementations, and that's because Scikit-learn normalizes the denominator and uses \href{https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer}{a slightly different formula.} You'll also note that in the first implementation we separate the corpus words ourselves, don't remove any stop words, and don't lowercase everything. Many of these steps are done automatically in scikit-learn or can be set as parameters into the processing pipeline. We'll see later that these are critical NLP steps that we perform each time we work with text.
+ Given that inverse document frequency is a measure of whether the word is common or not across the documents, we can see that "dreams" is important because they are rare across the documents and therefore interesting to us more so than "bird." We see that the tf-idf for a given word, "dreams", is slightly different for each of these implementations, and that's because Scikit-learn normalizes the denominator and uses \href{https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer}{a slightly different formula.} You'll also note that in the first implementation we separate the corpus words ourselves, don't remove any stop words, and don't lowercase everything. Many of these steps are done automatically in scikit-learn or can be set as parameters into the processing pipeline. We'll see later that these are critical NLP steps that we perform each time we work with text.
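For comparison, the same toy corpus run through scikit-learn's TfidfVectorizer; the corpus is again an assumption, and the comments note the default smoothing and normalization that account for the differing numbers described above.

\begin{minted}{python}
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the bird flew over the house",
    "the bird sat in the tree",
    "i had many dreams about the sea",
]

# Defaults: lowercasing, smoothed IDF (log((1 + n) / (1 + df)) + 1), and L2
# normalization of each document vector, so the weights differ from a
# from-scratch implementation.
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(corpus)

for word in ["dreams", "bird"]:
    col = vectorizer.vocabulary_[word]  # column index of the word in the matrix
    print(word, matrix[:, col].toarray().ravel().round(3))
\end{minted}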

TF-IDF enforces several important ordering rules on our text corpus:

@@ -1686,7 +1686,7 @@ \subsection{Word2Vec}
We activate the linear layer with a \textbf{ReLu} activation function, which decides whether a given weight is important or not. In this case, ReLu squashes all the negative values we initialize our embeddings layer with down to zero since we can't have inverse word relationships, and we perform linear regression by learning the weights of the model of the relationship of the words. Then, for each batch we examine the \textbf{loss}, the difference between the real word and the word that we predicted should be there given the context window - and we minimize it.
- At the end of each epoch, or pass through the model, we pass the weights, or \textbf{backpropagate} them, back to the linear layer,and then again, update the weights of each word, based on the probability. The probability is calculated through a \textbf{softmax} function, which converts a vector of real numbers into a probability distribution - that is, each number in the vector, i.e. the value of the probability of each words, is in the interval between 0 and 1 and all of the word numbers add up to one. The distance, as backpropogated to the embeddings table, should converge or shrink depending on the model understanding how close specific words are.
+ At the end of each epoch, or pass through the model, we pass the weights, or \textbf{backpropagate} them, back to the linear layer, and then again, update the weights of each word, based on the probability. The probability is calculated through a \textbf{softmax} function, which converts a vector of real numbers into a probability distribution - that is, each number in the vector, i.e. the value of the probability of each words, is in the interval between 0 and 1 and all of the word numbers add up to one. The distance, as backpropagated to the embeddings table, should converge or shrink depending on the model understanding how close specific words are.
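As a minimal numeric sketch of the softmax step described above, with made-up scores standing in for the model's outputs over a five-word vocabulary:

\begin{minted}{python}
import numpy as np

scores = np.array([2.0, 1.0, 0.1, -1.0, 0.5])  # made-up raw scores (logits)

def softmax(x):
    # subtract the max before exponentiating for numerical stability
    e = np.exp(x - np.max(x))
    return e / e.sum()

probs = softmax(scores)
print(probs.round(3), probs.sum())  # each value lies in (0, 1) and the total is 1.0
\end{minted}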
\begin{figure}[H]
\begin{minted}
@@ -2340,7 +2340,7 @@ \subsubsection{Storage and Retrieval}
\end{itemize}
\end{formal}
- The most complex and customizable software that handles many of these use-cases is a vector database, and somewhere in-between vector databases and in-memory storage are vector searchplugins for existing stores like Postgres and SQLite, and caches like Redis.
+ The most complex and customizable software that handles many of these use-cases is a vector database, and somewhere in-between vector databases and in-memory storage are vector search plugins for existing stores like Postgres and SQLite, and caches like Redis.
The most important operation we'd like to perform with embeddings is vector search, which allows us to find embeddings that are similar to a given embedding so we can return item similarity. If we want to search embeddings, we need a mechanism that is optimized to search through our matrix data structures and perform nearest-neighbor comparisons in the same way that a traditional relational database is optimized to search row-based relations. Relational databases use a b-tree structure to optimize reads by sorting items in ascending order within a hierarchy of nodes, built on top of an indexed column in the database. We can't perform columnar lookups on our vectors efficiently, so we need to create different structures for them. For example, many vector stores are based on inverted indices.
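A brute-force nearest-neighbor sketch over random vectors shows the core operation; real vector stores replace this linear scan with approximate index structures such as the inverted indices mentioned above. The embeddings, dimensions, and query here are arbitrary assumptions.

\begin{minted}{python}
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10_000, 64))  # stand-in item embeddings
query = rng.normal(size=64)                 # stand-in query embedding

# Cosine similarity is the dot product of L2-normalized vectors.
embeddings_norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
query_norm = query / np.linalg.norm(query)
similarities = embeddings_norm @ query_norm

top_k = np.argsort(-similarities)[:5]  # indices of the five most similar items
print(top_k, similarities[top_k].round(3))
\end{minted}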
