Merge overleaf-2023-06-07-1707 into main
veekaybee committed Jun 7, 2023
2 parents 76a6234 + 78211f5 commit b69d432
Showing 1 changed file with 29 additions and 22 deletions.
51 changes: 29 additions & 22 deletions embeddings.tex
@@ -946,10 +946,14 @@ \subsubsection{Indicator and one-hot encoding}
\centering
\caption{Our one-hot encoded data with labels}
\begin{tabular}{llll}
\hline
\rowcolor[HTML]{D5E7F7}
bird\_id & US & UK & NZ \\
\hline
012 & 1 & 0 & 0 \\
\hline
013 & 0 & 1 & 0 \\
\hline
056 & 0 & 0 & 1
\end{tabular}
\end{table}
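The one-hot table above can be reproduced in a few lines of pandas; this is a minimal sketch, where the `bird_id` and country values are simply copied from the table:

```python
import pandas as pd

# The bird_id / country pairs from the table above
df = pd.DataFrame({
    "bird_id": ["012", "013", "056"],
    "country": ["US", "UK", "NZ"],
})

# get_dummies expands the categorical column into indicator columns,
# one per country, with a 1 marking each bird's country
one_hot = pd.get_dummies(df, columns=["country"], prefix="", prefix_sep="")
print(one_hot)
```

Note that `get_dummies` orders the indicator columns alphabetically, so the column order may differ from the table.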
@@ -987,7 +991,9 @@ \subsubsection*{Embeddings as larger feature inputs}

How would we turn this into a machine learning problem that takes features as input and produces a prediction as output, given what we already know about how to do this? First, to build this matrix, we turn each word into a feature that becomes a column, while each user remains a row.

The best way to think of the difference between tabular and free-form representations as model inputs is that a row of tabular data looks like this, \mintinline{Python}{bird1 = [012,2,"US", 5]}, and a "row" or document of text data looks like this, \mintinline{Python}{bird1 = ["No bird soars too high if he soars with his own wings."] } In both cases, each of these are vectors, or a list of values that represents a single bird.
\begin{flushleft}
The best way to think of the difference between tabular and free-form representations as model inputs is that a row of tabular data looks like this, \mintinline{python}{[012, 2, "US", 5]}, and a "row" or document of text data looks like this, \mintinline{python}{["No bird soars too high if he soars with his own wings."]}. In both cases, each of these is a vector, a list of values that represents a single bird.
\end{flushleft}

In traditional machine learning, rows are our data about a single bird and columns are features of that bird. In recommendation systems, our rows hold the individual data about each user, and our columns represent the given data about each flit. If we can factor this matrix, that is, decompose it into two matrices ($Q$ and $P^T$) whose product is our original matrix ($R$), we can learn the "latent factors", or features, that allow us to group similar users and items together to recommend them.
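As a numerical sketch of what this factorization looks like, the snippet below uses a small, made-up ratings matrix $R$ and a truncated SVD as one (of several) ways to obtain $Q$ and $P^T$; the values and the choice of $k$ are illustrative assumptions, not the book's implementation:

```python
import numpy as np

# Hypothetical user-item interaction matrix R: 3 users x 4 flits
R = np.array([[5., 3., 0., 1.],
              [4., 0., 0., 1.],
              [1., 1., 0., 5.]])

k = 2  # number of latent factors to keep
U, s, Vt = np.linalg.svd(R, full_matrices=False)
Q = U[:, :k] * s[:k]   # user latent factors, shape (3, k)
P_T = Vt[:k, :]        # transposed item latent factors, shape (k, 4)

# Multiplying the factors back together gives a low-rank
# approximation of the original matrix R
R_hat = Q @ P_T
print(np.round(R_hat, 2))
```

Because $k$ is smaller than the rank of $R$, the reconstruction is approximate; the latent dimensions are what let us compare users and flits in a shared space.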

@@ -1045,6 +1051,7 @@ \subsubsection*{Embeddings as larger feature inputs}
\begin{minted}
[
frame=lines,
autogobble,
framesep=2mm,
baselinestretch=1.2,
fontsize=\footnotesize,
@@ -1069,17 +1076,17 @@ \subsubsection*{Embeddings as larger feature inputs}

print(term_document_matrix.drop(columns=['total_count']).head(10))

flit_1 flit_2 flit_3
an 0 0 1
answer 0 0 1
because 0 0 1
bird 1 1 1
broken 1 0 0
cannot 1 0 0
die 1 0 0
does 0 0 1
dreams 1 0 0
fast 1 0 0


\end{minted}
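A term-document matrix like the one printed above can be built with just the standard library and pandas; in this sketch the flit texts are hypothetical stand-ins for the documents used in the book's notebook:

```python
from collections import Counter

import pandas as pd

# Hypothetical flit texts standing in for the book's documents
flits = {
    "flit_1": "hold fast to dreams for if dreams die",
    "flit_2": "no bird soars too high",
    "flit_3": "a bird does not sing because it has an answer",
}

# Count each word per document, then line the terms up as rows
# and the documents up as columns
term_document_matrix = (
    pd.DataFrame({name: Counter(text.split()) for name, text in flits.items()})
    .fillna(0)
    .astype(int)
    .sort_index()
)
print(term_document_matrix)
```

Each column is now a count vector for one flit, which is exactly the representation the factorization machinery above can consume.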
@@ -1184,11 +1191,11 @@ \subsubsection{TF-IDF}
# Return the weight of each word in each document with respect to the total corpus
document_tfidf = pd.DataFrame([tfidf_a, tfidf_b])
document_tfidf.T
# doc 0 doc 1
a 0.018814 0.000000
dreams 0.037629 0.000000
No 0.000000 0.025086
Hold 0.018814 0.000000
\end{minted}
\caption{Truncated implementation of TF-IDF; see the full \href{https://github.com/veekaybee/what_are_embeddings/blob/main/notebooks/fig_24_tf_idf_from_scratch.ipynb}{source}}
\end{figure}
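Since the figure is truncated, the core of the computation can be sketched from scratch in a few functions; the two documents and the smoothed, sklearn-style IDF below are assumptions for illustration, not the book's exact implementation:

```python
import math

# Two hypothetical documents echoing the quotes used in the book
doc_a = "hold fast to dreams for if dreams die".split()
doc_b = "no bird soars too high".split()
corpus = [doc_a, doc_b]

def tf(term, doc):
    # Term frequency: how often the term appears in this document
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Inverse document frequency: terms in fewer documents get
    # higher weight; smoothed the way sklearn does it
    df = sum(term in doc for doc in corpus)
    return math.log((1 + len(corpus)) / (1 + df)) + 1

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

print(tfidf("dreams", doc_a, corpus))
```

A term absent from a document gets weight zero, and a term that appears more often in a document (like "dreams" here) outweighs one that appears once.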
@@ -1223,11 +1230,11 @@ \subsubsection{TF-IDF}
tfidf_df.T

# How common or unique a word is in a given document with respect to the vocabulary
dreams_langstonhughes quote_william_blake 00_Document Frequency
bird 0.172503 0.197242 2.0
broken 0.242447 0.000000 1.0
cannot 0.242447 0.000000 1.0
die 0.242447 0.000000 1.0
\end{minted}
\caption{Implementation of TF-IDF in scikit-learn \href{https://github.com/veekaybee/what_are_embeddings/blob/main/notebooks/fig_24_tf_idf_from_scratch.ipynb}{source}}
\end{figure}
