1990 static word embeddings
2003 neural language models
2008 multi-task learning
2015 attention
2017 transformer
2018 trainable contextual word embeddings + large scale pretraining
2019 prompt engineering

Motivating Attention

Given a sequence of embeddings x_1, x_2, …, x_{n}, the goal of attention is to produce, for each x_{i}, a new embedding a_{i} based on its dot-product similarity with all the words that come before it. Let's define:

score(x_{i}, x_{j}) = x_{i} · x_{j}
Which means that we can write:

a_{i} = Σ_{j ≤ i} α_{ij} x_{j}
where:

α_{ij} = softmax(score(x_{i}, x_{j}))  ∀ j ≤ i
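To make this concrete, here is a minimal NumPy sketch of this simplistic version of attention, computing the scores, the weights α_{ij}, and the weighted sum a_{i} exactly as defined above (the function names and the use of NumPy are illustrative choices, not part of the original):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a vector of scores.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def simple_attention(X):
    # X: (n, d) array, one row per input embedding x_j.
    # Returns A: (n, d) array, where row i is the attention output a_i.
    n, d = X.shape
    A = np.zeros_like(X)
    for i in range(n):
        # score(x_i, x_j) = x_i · x_j for every position j <= i
        scores = X[: i + 1] @ X[i]
        # alpha_ij: softmax-normalized weights over the preceding positions
        alpha = softmax(scores)
        # a_i = sum over j <= i of alpha_ij * x_j
        A[i] = alpha @ X[: i + 1]
    return A

X = np.random.randn(5, 8)   # e.g. 5 tokens with 8-dimensional embeddings
A = simple_attention(X)     # A[i] is a_i
```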
The resulting a_{i} is the output of our attention.

Attention

From the above, we call the input embeddings x_{j} the values, and we will create a separate embedding, called the key, with which we will measure the similarity. We call the word for which we want the new embedding the query (i.e. x_{i} from above).
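Here is a sketch of the same computation once queries, keys, and values are given separate roles. The projection matrices W_q, W_k, W_v are an assumption here: they are the standard transformer parameterization (learned matrices mapping each x into its query, key, and value roles), which this passage only begins to introduce:

```python
def qkv_attention(X, W_q, W_k, W_v):
    # Assumed learned projections (standard transformer parameterization):
    Q = X @ W_q  # queries: the word we want the new embedding for
    K = X @ W_k  # keys: the separate embeddings similarity is measured against
    V = X @ W_v  # values: the embeddings that get mixed into the output
    n = X.shape[0]
    A = np.zeros_like(V)
    for i in range(n):
        # Score the query q_i against every key k_j with j <= i.
        scores = K[: i + 1] @ Q[i]
        alpha = softmax(scores)
        # a_i is the alpha-weighted sum of the values v_j.
        A[i] = alpha @ V[: i + 1]
    return A
```

In the simplistic version above, the three roles coincide: the query is x_{i} itself and the values being mixed are the raw x_{j}, which is equivalent to taking W_q, W_k, W_v to be identity matrices.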