Autoregressive Models (AR/ARM)
Motivation: Factorize the joint distribution into conditional 1D distributions to avoid modeling the (computationally expensive) full joint distribution directly.
For example, consider \(D\)-dimensional data where each dimension follows a categorical distribution with \(K\) categories. Modeling the joint distribution directly requires on the order of \(K^D\) parameters, whereas the conditional 1D distributions require only \(KD\) parameters in total (assuming each conditional is restricted to \(K\) parameters, e.g., as the output of a shared model).
Parameterization
Factorize the joint distribution to conditional 1D distributions by chain rule of probability: $$ p(\vx) = p(\vx_{1:D}) = p(x_1,\dots,x_D) = p(x_1)p(x_2|x_1)p(x_3|x_2,x_1)\dots = \prod_{i=1}^D p(x_i|\vx_{<i}) $$
where \(\vx_{<i} = \vx_{1:i-1}\).
Connection to RNNs
The condition of each term \(p(x_i|\vx_{<i})\) becomes more complex as \(i\) increases. This issue can be mitigated by:
- (first-order) Markov assumption: \(p(x_i|\vx_{<i}) = p(x_i|\vx_{i-1})\), or
- Hidden Markov model (HMM): \(p(x_i|\vx_{<i}) = p(x_i|\vz_{i-1})\), where \(\vz_{i-1}\) contains the compressed information of \(\vx_{<i}\).
- In the special case where \(\vz_i\) is a deterministic function (instead of a stochastic function) of \(\vz_{i-1}\) and \(x_i\), the model corresponds to RNNs.
The conditionals \(p(x_i|\vx_{<i})\) can be modeled by the following:
- Discrete distributions
- Categorical distributions: parameterized by a probability vector \(\vp_i\in[0,1]^K\) over \(K\) categories: $$ p(x_i|\vx_{<i}) = \prod_{k=1}^K p_{i,k}^{[x_i=k]} $$
- Continuous distributions
- (Mixture of) Gaussians: parameterized by \(K\) Gaussian components \([(\mu_{i,k}, \sigma_{i,k}, \alpha_{i,k})]_{k\in\{1,\dots,K\}}\) (see the sketch after this list): $$ p(x_i|\vx_{<i}) = \sum_{k=1}^K \alpha_{i,k}\ \gN(x_i | \mu_{i,k}, \sigma_{i,k}) $$
- Cumulative distributions: e.g., autoregressive implicit quantile networks (AIQN), etc.
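As a concrete illustration of the mixture-of-Gaussians conditional above, here is a minimal sketch of how an unconstrained network output can be split into mixture parameters and used to evaluate \(\log p(x_i|\vx_{<i})\). The function name `mog_log_prob` and the tensor shapes are assumptions for illustration, not taken from any referenced implementation.

```python
import torch
import torch.nn.functional as F

def mog_log_prob(x, raw_params, K):
    """Log-density of a K-component 1D Gaussian mixture.

    x:          (batch,) scalar values x_i
    raw_params: (batch, 3K) unconstrained network outputs, split into
                mixture logits, means, and log standard deviations
    """
    logits, mu, log_sigma = raw_params.split(K, dim=-1)
    log_alpha = F.log_softmax(logits, dim=-1)   # log mixture weights alpha_{i,k}
    sigma = log_sigma.exp()
    # Per-component Gaussian log-densities, evaluated at x (broadcast over K).
    comp = torch.distributions.Normal(mu, sigma).log_prob(x.unsqueeze(-1))
    # log sum_k alpha_k N(x | mu_k, sigma_k), computed stably in log space.
    return torch.logsumexp(log_alpha + comp, dim=-1)

# Example: batch of 4 scalars, 5 mixture components.
x = torch.randn(4)
raw = torch.randn(4, 3 * 5)
print(mog_log_prob(x, raw, K=5).shape)  # torch.Size([4])
```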
In practice, we only define the architecture of a single conditional \(p(x_i|\vx_{<i})\) in code; the full model can then be viewed as a deep architecture that stacks \(D\) such conditionals.
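To make this concrete, below is a minimal sketch of a full autoregressive model over \(D\) discrete dimensions, assuming categorical conditionals and using a GRU whose hidden state plays the role of the deterministic \(\vz_i\) from the RNN connection above. The class name `ARModel` and all internals are illustrative, not taken from the referenced implementations.

```python
import torch
import torch.nn as nn

class ARModel(nn.Module):
    """p(x) = prod_i p(x_i | x_<i), with categorical conditionals."""

    def __init__(self, D, K, hidden=128):
        super().__init__()
        self.D, self.K = D, K
        self.embed = nn.Embedding(K + 1, hidden)   # +1 for a <start> symbol
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, K)           # logits of p(x_i | x_<i)

    def conditional_logits(self, x):
        """x: (batch, D) integer tensor -> (batch, D, K) logits.

        Position i of the output only sees x_<i (teacher forcing):
        the inputs are shifted right and prefixed with a <start> token.
        """
        start = torch.full((x.shape[0], 1), self.K, dtype=torch.long, device=x.device)
        inp = torch.cat([start, x[:, :-1]], dim=1)
        h, _ = self.rnn(self.embed(inp))
        return self.head(h)
```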
Pointwise Evaluation
- Pointwise evaluation of \(\pt(\vx)\)?
  - Yes, as long as \(\pt(x_i|\vx_{<i})\) is modeled as a distribution that can be easily evaluated.
  - Simply compute \(\pt(\vx) = \prod_{i=1}^D \pt(x_i|\vx_{<i})\) (in practice, \(\log\pt(\vx) = \sum_{i=1}^D \log\pt(x_i|\vx_{<i})\) for numerical stability).
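Continuing the illustrative `ARModel` sketch above, pointwise evaluation reduces to one teacher-forced forward pass followed by a sum of conditional log-probabilities; the helper name `log_prob` is again an assumption.

```python
import torch.nn.functional as F

def log_prob(model, x):
    """log p(x) = sum_i log p(x_i | x_<i), for x of shape (batch, D)."""
    logits = model.conditional_logits(x)        # (batch, D, K)
    log_cond = F.log_softmax(logits, dim=-1)    # log p(x_i | x_<i) for all K values
    # Pick out the log-probability of the observed value at each position.
    picked = log_cond.gather(-1, x.unsqueeze(-1)).squeeze(-1)   # (batch, D)
    return picked.sum(dim=-1)                   # (batch,)
```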
Data Generation
- Generating new samples \(\rvx\sim\pt(\vx)\)?
  - Yes, as long as \(\pt(x_i|\vx_{<i})\) is modeled as a distribution that can be easily sampled from, and \(\pt\) doesn't overfit.
  - Sample each dimension autoregressively, \(\rvx_i\sim \pt(x_i|\vx_{<i})\), and concatenate the outputs after \(D\) forward passes (slow: one forward pass per conditional; see the sketch after this list).
- Conditional generation \(\rvx\sim\pt(\vx|\vc)\)?
  - Yes, simply define \(\vc=\vx_{1:C}\) (i.e., a prompt), where \(\vc\in\R^C\) and \(D_\mathrm{new}=D+C\); generation then proceeds as above, starting from the given prefix.
- Imputation \(\rvx_m\sim\pt(\vx_m|\vx_o)\)?
  - Only straightforward when the observed dimensions \(\vx_o\) form a prefix of the chosen ordering, in which case it reduces to conditional generation; otherwise, sampling from \(\pt(\vx_m|\vx_o)\) is not directly supported by the fixed factorization and requires approximate inference.
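As referenced in the generation item above, here is a minimal sketch of ancestral sampling with the illustrative `ARModel`; unconditional generation starts from an empty prefix, and conditional generation simply seeds the loop with the prompt \(\vc\). The function name and padding trick are assumptions for illustration.

```python
import torch

@torch.no_grad()
def sample(model, batch_size=1, prefix=None):
    """Ancestral sampling: one forward pass per dimension (D passes in total)."""
    device = next(model.parameters()).device
    x = torch.zeros(batch_size, 0, dtype=torch.long, device=device)
    if prefix is not None:                 # conditional generation: seed with a prompt
        x = prefix.to(device)
        batch_size = x.shape[0]
    for i in range(x.shape[1], model.D):
        # Pad to full length so conditional_logits can be reused as-is; the
        # logits at position i depend only on x_<i, so the padding is ignored.
        pad = torch.zeros(batch_size, model.D - x.shape[1], dtype=torch.long, device=device)
        logits = model.conditional_logits(torch.cat([x, pad], dim=1))[:, i]  # (batch, K)
        probs = torch.softmax(logits, dim=-1)            # p(x_i | x_<i)
        x = torch.cat([x, torch.multinomial(probs, num_samples=1)], dim=1)
    return x                                              # (batch, D)
```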
Training Target
We need to derive \(\nabla_\theta\log\pt(\rvx)\) to perform Maximum Likelihood training (as shown in Maximum Likelihood Estimation):
The Negative Log-Likelihood (NLL) loss function is defined as: $$ L(\theta) = \E_{\rvx\sim\ptrain(\vx)}[-\sum_{i=1}^D \log\pt(\rx_i|\rvx_{<i})] $$
which can be optimized based on its gradients: $$ \nabla_\theta L(\theta) = \E_{\rvx\sim\ptrain(\vx)}[-\sum_{i=1}^D \nabla_\theta\log\pt(\rx_i|\rvx_{<i})] $$
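For illustration, a minimal training step under the NLL objective above, reusing the hypothetical `ARModel` and `log_prob` sketches from earlier; the optimizer choice and hyperparameters are placeholders.

```python
import torch

def train_step(model, optimizer, batch):
    """One stochastic-gradient step on the NLL, L(theta) = E[-log p_theta(x)]."""
    optimizer.zero_grad()
    nll = -log_prob(model, batch).mean()   # Monte Carlo estimate of L(theta) over the minibatch
    nll.backward()                         # computes grad_theta L(theta)
    optimizer.step()
    return nll.item()

# Usage sketch (D=784 binarized pixels, K=2):
# model = ARModel(D=784, K=2)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# loss = train_step(model, optimizer, batch)   # batch: (B, 784) long tensor
```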
Examples
PixelCNNs
Visualization of conditional distributions, from Figure 2 of Oord et al., 2016.
- For images, \(\pt(\vx): \{0,\dots,255\}^{HWC} \to \R^{HWCK}\).
- The pixels of the image are ordered sequentially by concatenating the rows: \([\text{row 1}, \text{row 2}, \dots]\) (i.e., raster scan ordering).
- The color channels are ordered sequentially by concatenating the RGB values: \([r, g, b]\).
- Every \(\pt(x_i|\vx_{<i})\) is modeled by a categorical distribution with support \(\{0,\dots,255\}\) (i.e., \(K=256\)), where each category corresponds to the color intensity of the channel.
- Code implementation:
  - Output as categorical distribution: See the `low` and `high` parameters in the TensorFlow implementation, or the `PixelCNN(pl.LightningModule)` module in phlippe/uvadlc_notebooks.
    - The output of the model lies in \(\mathbb{R}^{HWCK}\), indicating a conditional categorical distribution represented in \(\mathbb{R}^K\) for each pixel color.
  - Output as binary categorical distribution: When the data only contains black-and-white images, the output shape can be the same as the input shape to save parameters, as in the Keras implementation or in EugenHotaj/pytorch-generative.
    - Please note that this technique only works under the special case where the data contains binary values.
  - Sampling: See the `def sample` function in phlippe/uvadlc_notebooks.
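The key architectural ingredient that makes the PixelCNN factorization autoregressive is the masked convolution: kernel weights that would let a pixel see itself or "future" pixels (in raster-scan order) are zeroed out. Below is a minimal single-channel sketch (the multi-channel R→G→B masking is more involved); the class name `MaskedConv2d` follows common usage, but the code is illustrative rather than any of the referenced implementations.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """2D convolution whose kernel is masked so that output pixel (r, c)
    only depends on input pixels that come before it in raster-scan order.

    mask_type 'A' also hides the centre pixel (first layer, so a pixel never
    sees its own value); 'B' allows the centre pixel (subsequent layers).
    """

    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        kh, kw = self.kernel_size
        mask = torch.ones_like(self.weight)                       # (out_ch, in_ch, kh, kw)
        mask[:, :, kh // 2, kw // 2 + (mask_type == "B"):] = 0    # centre row: centre/right
        mask[:, :, kh // 2 + 1:, :] = 0                           # all rows below centre
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask    # re-apply mask so masked weights stay zero
        return super().forward(x)

# Usage sketch: first layer uses mask 'A', later layers use mask 'B'.
# conv_a = MaskedConv2d("A", in_channels=1, out_channels=64, kernel_size=7, padding=3)
# conv_b = MaskedConv2d("B", in_channels=64, out_channels=64, kernel_size=3, padding=1)
```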
(Generative) Transformers
The Transformer model architecture, from Figure 1 of Vaswani et al., 2017.
- For natural language processing, \(\pt(\vx): \{1,\dots,K\}^{D} \to \R^{DK}\) (for simplicity, we assume the input and output share the same tokenizer and operate at the token level).
- Every \(\pt(x_i|\vx_{<i})\) is modeled by a categorical distribution with support \(\{1,\dots,\text{vocab size}\}\) (i.e., \(K = \text{vocab size}\)), where each category corresponds to the index of a token in the vocabulary.
- The vocabulary size depends on the tokenizer used. For example, the GPT-2 tokenizer has a vocabulary size of 50,257.
- Code implementation:
  - Dimensions of input \(\vx\): See the `Embeddings(nn.Module)` module in harvardnlp/annotated-transformer.
    - Please note that `nn.Transformer` in PyTorch works in embedding space. An extra `nn.Embedding` lookup operator followed by a `PositionalEncoding` is required to map the input to embedding space.
  - Categorical distribution: See the `Generator(nn.Module)` module in harvardnlp/annotated-transformer.
    - Please note that `nn.Transformer` in PyTorch works in embedding space. An extra `nn.Linear` layer and log-softmax operator are required to map the output from embedding space to the vocabulary distribution.
    - The output of the model lies in \(\mathbb{R}^{DK}\), indicating a conditional categorical distribution represented in \(\mathbb{R}^K\) for each token.
  - Sampling: See the `def greedy_decode` function in harvardnlp/annotated-transformer, or the `for i in range(args.words)` loop in pytorch/examples.
- The overall inference process from an application viewpoint involves the following steps:
- The input text (character sequence) is tokenized into tokens (character/byte sequences), which are then mapped to token IDs (integers). These token IDs are further transformed into a sequence of embeddings (vector sequence).
- After that, the vector sequence is input into the Transformer model, producing another sequence of embeddings (vector sequence), which in turn are transformed into logits (number sequence).
- Lastly, the logits are normalized into a sequence of probabilities (number sequence) by the softmax operator, which can be used to sample the next token ID (integer) of the output sequence. This sampling process can be repeated to generate a sequence of token IDs (integer sequence), which can be mapped back to tokens (character/byte sequences), ultimately yielding the final output text (character sequence).
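A minimal sketch of this text → token IDs → logits → probabilities → next token ID → text loop, using the Hugging Face `transformers` implementation of GPT-2 as an example backbone (the library itself is not referenced above, so treat this specific choice as an illustrative assumption):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast   # assumed dependency

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")          # vocab size 50,257
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# 1. Text -> token IDs.
input_ids = tokenizer("Autoregressive models", return_tensors="pt").input_ids

# 2./3. Repeatedly: token IDs -> logits -> probabilities -> sampled next token ID.
with torch.no_grad():
    for _ in range(20):
        logits = model(input_ids).logits                # (1, current_length, vocab_size)
        probs = torch.softmax(logits[:, -1], dim=-1)    # conditional p(x_i | x_<i) of the next token
        next_id = torch.multinomial(probs, num_samples=1)
        input_ids = torch.cat([input_ids, next_id], dim=1)

# Token IDs -> text.
print(tokenizer.decode(input_ids[0]))
```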
Community Resources
- Probabilistic Machine Learning: Advanced Topics (i.e., probml-book2)
- Chapter 22: Auto-regressive models