Johnson's Summary

Notation

This document defines the formal notation used on this site. The main goal is to resolve notational clashes between different areas of mathematics. For example, $X$ denotes a matrix in linear algebra but a random variable in probability theory.

We mostly follow the notation of the Deep Learning Book, which is based on this LaTeX file. Differences between our notation and theirs are highlighted in blue for clarity.

This notation distinguishes a scalar $x$ from a scalar random variable $\mathrm{x}$. The conventional random-variable notation $X$ is used only with subscripts, as in $X_{i, j}$, to denote an element of the matrix $\boldsymbol{X}$.

Numbers and Arrays

Notation	Description
$a$	A scalar (integer or real)
$\boldsymbol{a}$	A vector
$\boldsymbol{A}$	A matrix
$\boldsymbol{\mathsf{A}}$	A tensor
$\boldsymbol{I}_{n}$	Identity matrix with $n$ rows and $n$ columns
$\boldsymbol{I}$	Identity matrix with dimensionality implied by context
$\boldsymbol{e}^{(i)}$	Standard basis vector $[0, \dots, 0, 1, 0, \dots, 0]$ with a 1 at position $i$
$\operatorname{diag}(\boldsymbol{a})$	A square, diagonal matrix with diagonal entries given by $\boldsymbol{a}$
$\mathrm{a}$	A scalar random variable
$\mathbf{a}$	A vector-valued random variable
$\mathbf{A}$	A matrix-valued random variable
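
As a rough illustration of the array constructions in this table, the sketch below builds $I_{n}$, $e^{(i)}$ and $diag (a)$ as plain Python lists. The function names `identity`, `basis_vector` and `diag` are our own hypothetical helpers, not part of any library, and nested lists stand in for matrices.

```python
def identity(n):
    """Return the n x n identity matrix I_n as a list of rows."""
    return [[1 if r == c else 0 for c in range(n)] for r in range(n)]


def basis_vector(i, n):
    """Return the standard basis vector e^(i) of length n.

    The position i is 1-based, matching the indexing used in the notation.
    """
    return [1 if k == i else 0 for k in range(1, n + 1)]


def diag(a):
    """Return diag(a): a square matrix with the entries of a on its diagonal."""
    n = len(a)
    return [[a[r] if r == c else 0 for c in range(n)] for r in range(n)]


I3 = identity(3)
e2 = basis_vector(2, 3)
D = diag([4, 5, 6])
```

Note that row $i$ of the identity matrix coincides with the basis vector $e^{(i)}$, which is a quick sanity check on both helpers.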

Sets and Graphs

Notation	Description
$\mathbb{A}$	A set
$\mathbb{R}$	The set of real numbers
$\{0, 1\}$	The set containing 0 and 1
$\{0, 1, \dots, n\}$	The set of all integers between $0$ and $n$
$[a, b]$	The real interval including $a$ and $b$
$(a, b]$	The real interval excluding $a$ but including $b$
$\mathbb{A} \setminus \mathbb{B}$	Set subtraction, i.e., the set containing the elements of $\mathbb{A}$ that are not in $\mathbb{B}$
$\mathcal{G}$	A graph
$Pa_{\mathcal{G}}(\mathrm{x}_{i})$	The parents of $\mathrm{x}_{i}$ in $\mathcal{G}$
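
The set and graph notation above maps fairly directly onto Python's built-in `set` and `dict` types. The sketch below is only an illustration: the adjacency-dictionary representation of the graph and the helper names `in_closed`, `in_half_open` and `parents` are assumptions made for this example.

```python
# Set subtraction A \ B: the elements of A that are not in B.
A = {1, 2, 3, 4}
B = {3, 4, 5}
difference = A - B

# The set {0, 1, ..., n} of all integers between 0 and n.
n = 3
integers_up_to_n = set(range(0, n + 1))


def in_closed(x, a, b):
    """Membership test for the interval [a, b], which includes both endpoints."""
    return a <= x <= b


def in_half_open(x, a, b):
    """Membership test for the interval (a, b], which excludes a but includes b."""
    return a < x <= b


# A directed graph stored as node -> list of children (an assumed representation).
G = {"x1": ["x3"], "x2": ["x3"], "x3": []}


def parents(G, node):
    """Return the parents of node in G: every node with an edge into it."""
    return {u for u, children in G.items() if node in children}
```

For instance, `parents(G, "x3")` collects both `"x1"` and `"x2"`, while `parents(G, "x1")` is empty.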

Indexing

Notation	Description
$a_{i}$	Element $i$ of vector $\boldsymbol{a}$, with indexing starting at 1
$a_{-i}$	All elements of vector $\boldsymbol{a}$ except for element $i$
$A_{i, j}$	Element $i, j$ of matrix $\boldsymbol{A}$
$\boldsymbol{A}_{i, :}$	Row $i$ of matrix $\boldsymbol{A}$
$\boldsymbol{A}_{:, i}$	Column $i$ of matrix $\boldsymbol{A}$
$\mathsf{A}_{i, j, k}$	Element $(i, j, k)$ of a 3-D tensor $\boldsymbol{\mathsf{A}}$
$\boldsymbol{\mathsf{A}}_{:, :, i}$	2-D slice of a 3-D tensor
$\mathrm{a}_{i}$	Element $i$ of the random vector $\mathbf{a}$
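
Because the notation indexes from 1 while most programming languages index from 0, translating these expressions into code needs an offset. The following sketch shows one way to do that with plain Python lists; the helper names `element`, `all_except`, `row` and `column` are hypothetical and chosen only for this example.

```python
a = [10, 20, 30, 40]
A = [[1, 2, 3],
     [4, 5, 6]]


def element(a, i):
    """a_i: element i of the vector a, using 1-based indexing."""
    return a[i - 1]


def all_except(a, i):
    """a_{-i}: all elements of a except element i (1-based)."""
    return a[:i - 1] + a[i:]


def row(A, i):
    """A_{i,:}: row i of the matrix A (1-based)."""
    return A[i - 1]


def column(A, j):
    """A_{:,j}: column j of the matrix A (1-based)."""
    return [r[j - 1] for r in A]
```

With these definitions, `element(a, 1)` is the first entry of `a`, and `column(A, 3)` gathers the last entry of each row.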

Linear Algebra Operations

Notation	Description
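$\boldsymbol{A}^{\top}$	Transpose of matrix $\boldsymbol{A}$
$\boldsymbol{A}^{+}$	Moore-Penrose pseudoinverse of $\boldsymbol{A}$
$\boldsymbol{A} \odot \boldsymbol{B}$	Element-wise (Hadamard) product of $\boldsymbol{A}$ and $\boldsymbol{B}$
$\det(\boldsymbol{A})$	Determinant of $\boldsymbol{A}$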
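
To make three of these operations concrete, here is a minimal pure-Python sketch of the transpose, the Hadamard product and the determinant. It assumes matrices are stored as lists of rows, and it computes the determinant by cofactor expansion along the first row, which is fine for small examples but far too slow for real use. The pseudoinverse is omitted because it needs a proper numerical library.

```python
def transpose(A):
    """Return the transpose of A: rows become columns."""
    return [list(col) for col in zip(*A)]


def hadamard(A, B):
    """Return the element-wise (Hadamard) product of A and B.

    Assumes A and B have the same shape.
    """
    return [[x * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]


def det(A):
    """Return det(A) for a square matrix via cofactor expansion along row 1."""
    if len(A) == 1:
        return A[0][0]
    total = 0
    for j, pivot in enumerate(A[0]):
        # Minor: drop the first row and column j.
        minor = [r[:j] + r[j + 1:] for r in A[1:]]
        total += (-1) ** j * pivot * det(minor)
    return total


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
```

Here `hadamard(A, B)` multiplies matching entries, which differs from the ordinary matrix product of `A` and `B`.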

Calculus

Notation	Description
$\frac{d y}{d x}$	Derivative of $y$ with respect to $x$
$\frac{\partial y}{\partial x}$	Partial derivative of $y$ with respect to $x$
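$\nabla_{\boldsymbol{x}} y$	Gradient of $y$ with respect to $\boldsymbol{x}$
$\nabla_{\boldsymbol{X}} y$	Matrix derivatives of $y$ with respect to $\boldsymbol{X}$
$\nabla_{\boldsymbol{\mathsf{X}}} y$	Tensor containing derivatives of $y$ with respect to $\boldsymbol{\mathsf{X}}$
$\frac{\partial f}{\partial \boldsymbol{x}}$ or $\boldsymbol{J}(f)(\boldsymbol{x})$	Jacobian matrix $\boldsymbol{J} \in \mathbb{R}^{m \times n}$ of $f : \mathbb{R}^{n} \to \mathbb{R}^{m}$
$\nabla_{\boldsymbol{x}}^{2} f(\boldsymbol{x})$ or $\boldsymbol{H}(f)(\boldsymbol{x})$	The Hessian matrix $\boldsymbol{H} \in \mathbb{R}^{n \times n}$ of $f : \mathbb{R}^{n} \to \mathbb{R}$ at input point $\boldsymbol{x}$
$\int f(\boldsymbol{x}) \, d\boldsymbol{x}$	Definite integral over the entire domain of $\boldsymbol{x}$. This clashes with the notation for indefinite integrals, so we will strive to avoid it
$\int_{\mathbb{S}} f(\boldsymbol{x}) \, d\boldsymbol{x}$	Definite integral with respect to $\boldsymbol{x}$ over the set $\mathbb{S}$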
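
The shapes in this table are easy to mix up, so the sketch below approximates a gradient and a Jacobian numerically with central differences. This is an illustrative approximation under our own assumptions (the step size `h`, the helper names `gradient` and `jacobian`, and lists as vectors), not a substitute for analytic or automatic differentiation.

```python
def gradient(f, x, h=1e-6):
    """Approximate the gradient of f: R^n -> R at x by central differences."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad


def jacobian(f, x, h=1e-6):
    """Approximate the Jacobian J in R^{m x n} of f: R^n -> R^m at x.

    Entry J[i][j] approximates the partial derivative of f_i with respect to x_j.
    """
    columns = []
    for j in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += h
        x_minus[j] -= h
        f_plus, f_minus = f(x_plus), f(x_minus)
        columns.append([(p - m) / (2 * h) for p, m in zip(f_plus, f_minus)])
    # columns[j][i] holds d f_i / d x_j, so transpose to obtain m rows and n columns.
    return [list(r) for r in zip(*columns)]


def scalar_fn(x):
    """Example f: R^2 -> R."""
    return x[0] ** 2 + 3 * x[1]


def vector_fn(x):
    """Example f: R^2 -> R^3."""
    return [x[0] * x[1], x[0] + x[1], x[1] ** 2]


g = gradient(scalar_fn, [2.0, 1.0])
J = jacobian(vector_fn, [2.0, 3.0])
```

For `vector_fn`, which maps $R^{2}$ to $R^{3}$, the result `J` has 3 rows and 2 columns, matching the $m \times n$ convention in the table.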

Probability and Information Theory

Notation	Description
$\mathrm{a} \perp \mathrm{b}$	The random variables $\mathrm{a}$ and $\mathrm{b}$ are independent
$\mathrm{a} \perp \mathrm{b} \mid \mathrm{c}$	The random variables $\mathrm{a}$ and $\mathrm{b}$ are conditionally independent given $\mathrm{c}$
$P(\mathrm{a})$	A probability distribution over a discrete variable
$p(\mathrm{a})$	A probability distribution over a continuous variable, or over a variable whose type has not been specified
$\mathrm{a} \sim P$	Random variable $\mathrm{a}$ has distribution $P$
$\mathbb{E}_{\mathrm{x} \sim P}[f(x)]$ or $\mathbb{E} f(x)$	Expectation of $f(x)$ with respect to $P(\mathrm{x})$. The shorter form is an abuse of notation; we will strive to use $\mathbb{E}_{\mathrm{x} \sim P}[f(x)]$ instead
$\operatorname{Var}(f(x))$	Variance of $f(x)$ under $P(\mathrm{x})$. This is an abuse of notation; we will strive to use $\operatorname{Var}(f(\mathrm{x}))$ instead
$\operatorname{Cov}(f(x), g(x))$	Covariance of $f(x)$ and $g(x)$ under $P(\mathrm{x})$. This is an abuse of notation; we will strive to use $\operatorname{Cov}(f(\mathrm{x}), g(\mathrm{y}))$ instead
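$H(\mathrm{x})$	Shannon entropy of the random variable $\mathrm{x}$
$D_{\mathrm{KL}}(P \Vert Q)$	Kullback-Leibler divergence of $P$ and $Q$
$\mathcal{N}(\boldsymbol{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma})$	Gaussian distribution over $\boldsymbol{x}$ with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$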
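
For discrete distributions, several of the quantities above reduce to short sums. The sketch below assumes a distribution is stored as a dictionary from outcomes to probabilities, and that `Q` assigns positive probability wherever `P` does. The helper names are our own, and logarithms are natural, so entropy and KL divergence come out in nats.

```python
import math


def expectation(P, f):
    """E_{x ~ P}[f(x)] for a discrete distribution P given as {outcome: probability}."""
    return sum(p * f(x) for x, p in P.items())


def variance(P, f):
    """Var(f(x)) under P, computed as E[(f(x) - E[f(x)])^2]."""
    mean = expectation(P, f)
    return expectation(P, lambda x: (f(x) - mean) ** 2)


def entropy(P):
    """Shannon entropy of P in nats; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p) for p in P.values() if p > 0)


def kl_divergence(P, Q):
    """D_KL(P || Q) in nats; assumes Q[x] > 0 whenever P[x] > 0."""
    return sum(p * math.log(p / Q[x]) for x, p in P.items() if p > 0)


P = {0: 0.5, 1: 0.5}
Q = {0: 0.25, 1: 0.75}

mean_x = expectation(P, lambda x: x)
var_x = variance(P, lambda x: x)
```

The KL divergence is not symmetric, so `kl_divergence(P, Q)` and `kl_divergence(Q, P)` generally differ, and `kl_divergence(P, P)` is zero.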

Functions

Notation	Description
$f : \mathbb{A} \to \mathbb{B}$	The function $f$ with domain $\mathbb{A}$ and range $\mathbb{B}$
$f \circ g$	Composition of the functions $f$ and $g$
$f(\boldsymbol{x}; \boldsymbol{\theta})$	A function of $\boldsymbol{x}$ parametrized by $\boldsymbol{\theta}$. (Sometimes we write $f(\boldsymbol{x})$ and omit the argument $\boldsymbol{\theta}$ to lighten notation.)
$\log x$	Natural logarithm of $x$
$\sigma(x)$	Logistic sigmoid, $\frac{1}{1 + \exp(-x)}$
$\zeta(x)$	Softplus, $\log(1 + \exp(x))$
$\|\boldsymbol{x}\|_{p}$	$L^{p}$ norm of $\boldsymbol{x}$
$\|\boldsymbol{x}\|$	$L^{2}$ norm of $\boldsymbol{x}$
$x^{+}$	Positive part of $x$, i.e., $\max(0, x)$
$\mathbf{1}_{\mathrm{condition}}$ or $[\mathrm{condition}]$	Is 1 if the condition is true, 0 otherwise

Sometimes we use a function $f$ whose argument is a scalar but apply it to a vector, matrix, or tensor: $f(\boldsymbol{x})$, $f(\boldsymbol{X})$, or $f(\boldsymbol{\mathsf{X}})$. This denotes the application of $f$ to the array element-wise. For example, if $\boldsymbol{\mathsf{C}} = \sigma(\boldsymbol{\mathsf{X}})$, then $\mathsf{C}_{i, j, k} = \sigma(\mathsf{X}_{i, j, k})$ for all valid values of $i$, $j$ and $k$.
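
The element-wise convention can be sketched in a few lines of Python. The code below defines the scalar functions from the table and a hypothetical helper, `apply_elementwise`, that recurses through nested lists standing in for vectors, matrices and tensors. The sigmoid and softplus here follow the formulas literally and are not numerically robust for inputs of very large magnitude.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def softplus(x):
    """Softplus, log(1 + exp(x))."""
    return math.log(1.0 + math.exp(x))


def positive_part(x):
    """Positive part of x, max(0, x)."""
    return max(0, x)


def lp_norm(x, p=2):
    """L^p norm of the vector x; p defaults to 2."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


def apply_elementwise(f, X):
    """Apply the scalar function f to every entry of a nested list X."""
    if isinstance(X, list):
        return [apply_elementwise(f, item) for item in X]
    return f(X)


# A small 3-D tensor with shape 2 x 1 x 2, stored as nested lists.
X = [[[0.0, 1.0]], [[-1.0, 2.0]]]
C = apply_elementwise(sigmoid, X)
```

After this runs, `C[i][j][k]` equals `sigmoid(X[i][j][k])` for every valid index, which is the element-wise rule stated above (with 0-based indices in the code).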

Datasets and Distributions

Notation	Description
$p_{\mathrm{data}}$	The data generating distribution
$\hat{p}_{\mathrm{data}}$	The empirical distribution defined by the training set
$\mathbb{X}$	A set of training examples
$\boldsymbol{x}^{(i)}$	The $i$-th example (input) from a dataset
$y^{(i)}$ or $\boldsymbol{y}^{(i)}$	The target associated with $\boldsymbol{x}^{(i)}$ for supervised learning
$\boldsymbol{X}$	The $m \times n$ matrix with input example $\boldsymbol{x}^{(i)}$ in row $\boldsymbol{X}_{i, :}$
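
As a small illustration of the dataset notation, the sketch below stores a toy dataset as an $m \times n$ design matrix with one example per row and builds an empirical distribution over the targets by counting. The data values and the helper names `example` and `empirical_distribution` are made up for this example.

```python
from collections import Counter

# Toy dataset: m = 3 examples with n = 2 features each (values are made up).
examples = [[5.1, 3.5], [4.9, 3.0], [6.2, 2.9]]
targets = [0, 0, 1]

# Design matrix X: example x^(i) sits in row i.
X = [list(x) for x in examples]
m, n = len(X), len(X[0])


def example(X, i):
    """Return x^(i), the i-th example, using 1-based indexing."""
    return X[i - 1]


def empirical_distribution(samples):
    """Return the empirical distribution: the relative frequency of each value."""
    counts = Counter(samples)
    total = len(samples)
    return {value: count / total for value, count in counts.items()}


p_hat = empirical_distribution(targets)
```

Here `p_hat` puts probability mass on each observed target in proportion to how often it appears in the training set.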
