Merge pull request #17 from schicho/patch-1
veekaybee committed Jun 11, 2023
2 parents b536407 + 33ef4d3 commit 7797208
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion embeddings.tex
@@ -2084,7 +2084,7 @@ \subsection{BERT}
\caption{Encoder-only architecture}
\end{figure}
After the explosive success of "Attention Is All You Need", a variety of transformer architectures arose, and research on and implementations of this architecture exploded across deep learning. The next transformer architecture to be considered a significant step forward was \textbf{BERT}, released in 2018 by Google.
After the explosive success of "Attention Is All You Need", a variety of transformer architectures arose, and research on and implementations of this architecture exploded across deep learning. The next transformer architecture to be considered a significant step forward was \textbf{BERT}.
BERT stands for Bidirectional Encoder Representations from Transformers and was released in 2018 \citep{devlin2018bert}, in a paper written by Google researchers as a way to solve common natural language tasks like sentiment analysis, question answering, and text summarization. BERT is a transformer model, also based on the attention mechanism, but its architecture includes only the encoder piece. Its most prominent usage is in Google Search, where it is the algorithm powering the surfacing of relevant search results. In the blog post they released in 2019 on including BERT in search ranking, Google specifically cited adding context to queries, as a replacement for keyword-based methods, as the reason for the change.\footnote{\href{https://blog.google/products/search/search-language-understanding-bert/}{BERT search announcement}}
BERT works as a \textbf{masked language model}. Masking is simply what we did when we implemented Word2Vec: removing words and building a context window around them. When we created our representations with Word2Vec, we only looked at sliding windows moving forward. The B in BERT is for bi-directional, which means it pays attention to words in both directions through scaled dot-product attention. BERT has 12 transformer layers. It starts by using \textbf{WordPiece}, an algorithm that segments words into subword tokens. To train BERT, the goal is to predict a token given its context.
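As a minimal sketch of these two ideas, assuming the Hugging Face \texttt{transformers} library and the publicly available \texttt{bert-base-uncased} checkpoint (both assumptions of this sketch, not references from the text), the snippet below shows WordPiece splitting words into subword tokens and a pretrained BERT model filling in a masked token using context from both directions.
\begin{verbatim}
# A minimal sketch, assuming the Hugging Face transformers library
# is installed; the model name and example sentences are illustrative.
from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# WordPiece segments words into subword tokens; less common words
# are split into pieces, with continuations prefixed by "##".
print(tokenizer.tokenize("Embeddings are surprisingly useful."))

# Masked language modeling: predict the token hidden behind [MASK],
# attending to the words on both sides of it.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The capital of France is [MASK].")[:3]:
    print(prediction["token_str"], round(prediction["score"], 3))
\end{verbatim}
Each candidate token comes with a probability, which mirrors the training objective described above: predict the missing token given its bidirectional context.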