What does a transformer do?
For the purposes of this post, we'll be focusing on the transformer architecture as it is used in natural language processing, specifically for text generation (like GPT).
- At a high level, what it does is:
  * it takes in a sequence of tokens (words, characters, etc.) and outputs a sequence of tokens
  * it does this by attending to all the tokens in the input sequence, and then generating the output sequence
- If you give a model 100 tokens in a sequence, it predicts the next token for each prefix ⇒ it outputs 100 predictions.
- Causal attention mask: we mask out the future tokens so that the model can't cheat by looking ahead.
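A minimal sketch of what that mask looks like in practice, assuming a PyTorch-style implementation (all names and sizes here are illustrative):

```python
import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)  # raw attention scores (query x key)

# Lower-triangular matrix: position i may only attend to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Set future positions to -inf so softmax assigns them zero probability.
masked_scores = scores.masked_fill(~causal_mask, float("-inf"))
attn_pattern = torch.softmax(masked_scores, dim=-1)

print(attn_pattern[0])  # the first token can only attend to itself: [1, 0, 0, 0, 0]
```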
Tokens are the Inputs to Transformers
There are two steps to this process: converting words into numbers (tokenization), and then converting numbers into vectors (embedding).
- How do we convert words into numbers?
  * We use byte-pair encodings.
- How do we convert numbers into vectors?
  * We use embeddings.
  * One-hot encoding: each word is represented by a vector of size V, where V is the size of the vocabulary; the vector has a 1 in the kth position and 0 everywhere else (see the sketch after this list).
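As a concrete sketch of a one-hot vector (numpy; the vocabulary size and token id are toy values):

```python
import numpy as np

V = 8  # vocabulary size (toy value)
k = 3  # token id

one_hot = np.zeros(V)
one_hot[k] = 1.0
print(one_hot)  # [0. 0. 0. 1. 0. 0. 0. 0.]
```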
- We learn a vocabulary of tokens (sub-words).
- We (approximately) losslessly convert language to integers by tokenizing it.
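For instance, using the tiktoken library with the GPT-2 byte-pair-encoding vocabulary (the exact integers you get depend on the tokenizer):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # the GPT-2 BPE vocabulary

tokens = enc.encode("Transformers are neat!")
print(tokens)              # a short list of integers (ids into the vocabulary)
print(enc.decode(tokens))  # decodes back to the original string
```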
- We convert integers to vectors via a lookup table.
Note: the input to the transformer is a sequence of tokens (i.e. integers), not vectors.
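A minimal sketch of that lookup table, assuming PyTorch (the sizes are toy values):

```python
import torch
import torch.nn as nn

V, d_model = 1000, 64                 # vocab size, embedding width (toy values)
embed = nn.Embedding(V, d_model)      # the lookup table: a V x d_model matrix

token_ids = torch.tensor([5, 42, 7])  # what the transformer actually receives
vectors = embed(token_ids)            # shape (3, 64): one vector per token
```

Looking up row k this way is equivalent to multiplying the one-hot vector for k by the V x d_model weight matrix; the lookup is just the efficient way to compute it.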
Transformer Outputs are Logits
- We want a probability distribution over next tokens.
- We want to convert a vector to a probability distribution.
  * We do this by applying a softmax function to the output of the model.
- The output of the model is a vector of size V, where V is the size of the vocabulary. These raw, unnormalized scores are the logits.
The full pipeline:
1. Convert text to tokens
2. Map tokens to logits
3. Map logits to a probability distribution (using softmax)
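A minimal sketch of that last step, as a numerically stable softmax in numpy (the logits here are made up):

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; this doesn't change the result.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])  # one score per vocabulary entry
probs = softmax(logits)
print(probs, probs.sum())           # a valid distribution: sums to 1
```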
Implementation of a Transformer
High-level architecture of a transformer:
-
input tokens, integers
-
embeddings (lookup table that maps tokens to vectors)
-
series of n layers transformer blocks
* Attention: moves information from prior positions in the sequence to the current position.
* Do this for every token in parallel using the same parameters
* produces an attention pattern for each destination token
* the attention pattern is a distribution over the source tokens
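A sketch of a single attention head computing that pattern (PyTorch; all shapes and values are illustrative, and multi-head structure, projections, and batching are omitted):

```python
import torch
import torch.nn.functional as F

seq_len, d_head = 4, 16
q = torch.randn(seq_len, d_head)  # queries: one per destination token
k = torch.randn(seq_len, d_head)  # keys: one per source token
v = torch.randn(seq_len, d_head)  # values: the information to be moved

scores = q @ k.T / d_head ** 0.5  # (seq_len, seq_len) similarity scores
mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
scores = scores.masked_fill(~mask, float("-inf"))  # causal: no future sources

# The attention pattern: one distribution over source tokens per destination
# token (each row sums to 1).
pattern = F.softmax(scores, dim=-1)

out = pattern @ v  # each position gets a weighted mix of earlier positions' values
```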
In the original encoder-decoder transformer (as opposed to decoder-only models like GPT):
- Stack of encoders ⇒ Encoding component
- Stack of decoders ⇒ Decoding component
Each Encoder is broken down into two sub-layers:
- Multi-head self-attention mechanism
- Position-wise fully connected feed-forward network (sketched below)
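A sketch of the position-wise feed-forward sub-layer (PyTorch; the 4x inner expansion follows the original paper, the other sizes are toy values):

```python
import torch
import torch.nn as nn

d_model = 64  # toy value

# Two linear maps with a nonlinearity, applied to each position independently.
ffn = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),
    nn.ReLU(),
    nn.Linear(4 * d_model, d_model),
)

x = torch.randn(10, d_model)  # 10 positions
y = ffn(x)                    # same shape; every row is processed identically
```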
Each Decoder is broken down into three sub-layers (wired together in the sketch below):
- Masked multi-head self-attention mechanism
- Multi-head (encoder-decoder) attention mechanism
- Position-wise fully connected feed-forward network
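A rough sketch of how the three decoder sub-layers chain together, using PyTorch's nn.MultiheadAttention (residual connections, layer norm, and positional encodings are omitted for brevity; all sizes are toy values):

```python
import torch
import torch.nn as nn

d_model, n_heads, seq_len = 64, 4, 5
self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                    nn.Linear(4 * d_model, d_model))

x = torch.randn(1, seq_len, d_model)  # decoder input
enc_out = torch.randn(1, 7, d_model)  # encoder output (source length 7)

# 1. Masked self-attention: each position sees only earlier decoder positions
#    (True entries in the mask are positions we are NOT allowed to attend to).
causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
x, _ = self_attn(x, x, x, attn_mask=causal)

# 2. Encoder-decoder attention: queries come from the decoder,
#    keys and values come from the encoder output.
x, _ = cross_attn(x, enc_out, enc_out)

# 3. Position-wise feed-forward network.
x = ffn(x)
```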