Notation

This document defines the formal notation used on this site. The main goal is to resolve notational clashes between different areas of mathematics. For example, $X$ denotes a matrix in linear algebra but a random variable in probability theory.

We mostly follow the notation of the Deep Learning Book, which is based on this LaTeX file. Differences between our notation and theirs are highlighted in blue for clarity.

This notation distinguishes a scalar $x$ from a scalar random variable $\mathrm{x}$ by typeface. The conventional random-variable notation $X$ is used only with subscripts, as in $X_{i,j}$, to denote an element of the matrix $\boldsymbol{X}$.

Numbers and Arrays

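| Notation | Description |
| --- | --- |
| $a$ | A scalar (integer or real) |
| $\boldsymbol{a}$ | A vector |
| $\boldsymbol{A}$ | A matrix |
| $\boldsymbol{\mathsf{A}}$ | A tensor |
| $\boldsymbol{I}_n$ | Identity matrix with $n$ rows and $n$ columns |
| $\boldsymbol{I}$ | Identity matrix with dimensionality implied by context |
| $\boldsymbol{e}^{(i)}$ | Standard basis vector $[0, \dots, 0, 1, 0, \dots, 0]$ with a 1 at position $i$ |
| $\operatorname{diag}(\boldsymbol{a})$ | A square, diagonal matrix with diagonal entries given by $\boldsymbol{a}$ |
| $\mathrm{a}$ | A scalar random variable |
| $\mathbf{a}$ | A vector-valued random variable |
| $\mathbf{A}$ | A matrix-valued random variable |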
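
The identity matrix, standard basis vector, and `diag` entries above can be made concrete with a minimal sketch. It uses plain Python lists; the helper names `identity`, `basis_vector`, and `diag` are illustrative assumptions, not functions defined elsewhere on this site.

```python
def identity(n):
    """Return the n x n identity matrix I_n as a list of rows."""
    return [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]


def basis_vector(i, n):
    """Return e^(i) in R^n, with the 1 at position i (1-based, as in the table)."""
    if not 1 <= i <= n:
        raise ValueError("i must satisfy 1 <= i <= n")
    return [1.0 if k == i else 0.0 for k in range(1, n + 1)]


def diag(a):
    """Return the square diagonal matrix whose diagonal entries are given by a."""
    n = len(a)
    return [[a[r] if r == c else 0.0 for c in range(n)] for r in range(n)]


I3 = identity(3)
e2 = basis_vector(2, 3)       # [0.0, 1.0, 0.0]
D = diag([1.0, 1.0, 1.0])     # same entries as I3
```

Row 2 of `I3` equals `e2`, which matches the usual reading of the identity matrix as a stack of standard basis vectors.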

Sets and Graphs

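| Notation | Description |
| --- | --- |
| $\mathbb{A}$ | A set |
| $\mathbb{R}$ | The set of real numbers |
| $\{0, 1\}$ | The set containing 0 and 1 |
| $\{0, 1, \dots, n\}$ | The set of all integers between $0$ and $n$ |
| $[a, b]$ | The real interval including $a$ and $b$ |
| $(a, b]$ | The real interval excluding $a$ but including $b$ |
| $\mathbb{A} \backslash \mathbb{B}$ | Set subtraction, i.e., the set containing the elements of $\mathbb{A}$ that are not in $\mathbb{B}$ |
| $\mathcal{G}$ | A graph |
| $Pa_{\mathcal{G}}(\mathrm{x}_i)$ | The parents of $\mathrm{x}_i$ in $\mathcal{G}$ |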
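
As a rough illustration of set subtraction, interval membership, and the parents of a node, here is a small sketch. The graph, its node names, and the helper functions are hypothetical, and the graph is assumed to be stored as a mapping from each node to its children.

```python
A = {1, 2, 3, 4}
B = {3, 4, 5}
difference = A - B                    # elements of A that are not in B: {1, 2}

n = 5
integers_0_to_n = set(range(0, n + 1))  # the set {0, 1, ..., n}


def in_closed_interval(x, a, b):
    """True if x lies in [a, b], which includes both endpoints."""
    return a <= x <= b


def in_half_open_interval(x, a, b):
    """True if x lies in (a, b], which excludes a but includes b."""
    return a < x <= b


# Hypothetical directed graph: each node maps to the list of its children.
graph = {"x1": ["x3"], "x2": ["x3"], "x3": []}


def parents(graph, node):
    """Return the set of nodes that have `node` as a child."""
    return {u for u, children in graph.items() if node in children}
```

With this graph, `parents(graph, "x3")` returns the set containing `"x1"` and `"x2"`.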

Indexing

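| Notation | Description |
| --- | --- |
| $a_i$ | Element $i$ of vector $\boldsymbol{a}$, with indexing starting at 1 |
| $a_{-i}$ | All elements of vector $\boldsymbol{a}$ except for element $i$ |
| $A_{i,j}$ | Element $i, j$ of matrix $\boldsymbol{A}$ |
| $\boldsymbol{A}_{i,:}$ | Row $i$ of matrix $\boldsymbol{A}$ |
| $\boldsymbol{A}_{:,i}$ | Column $i$ of matrix $\boldsymbol{A}$ |
| $\mathsf{A}_{i,j,k}$ | Element $(i, j, k)$ of a 3-D tensor $\boldsymbol{\mathsf{A}}$ |
| $\boldsymbol{\mathsf{A}}_{:,:,i}$ | 2-D slice of a 3-D tensor |
| $\mathrm{a}_i$ | Element $i$ of the random vector $\mathbf{a}$ |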
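
Because the notation indexes from 1 while most programming languages index from 0, translating a formula into code usually needs an offset. The sketch below shows one way to do this with plain Python lists; the helper names are illustrative assumptions.

```python
a = [10, 20, 30, 40]
A = [[1, 2, 3],
     [4, 5, 6]]


def element(vec, i):
    """a_i with 1-based i."""
    return vec[i - 1]


def all_but(vec, i):
    """a_{-i}: every element of the vector except element i (1-based)."""
    return vec[:i - 1] + vec[i:]


def row(mat, i):
    """A_{i,:}: row i of the matrix (1-based)."""
    return list(mat[i - 1])


def column(mat, j):
    """A_{:,j}: column j of the matrix (1-based)."""
    return [r[j - 1] for r in mat]
```

For example, `all_but(a, 2)` returns `[10, 30, 40]` and `column(A, 3)` returns `[3, 6]`.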

Linear Algebra Operations

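| Notation | Description |
| --- | --- |
| $\boldsymbol{A}^\top$ | Transpose of matrix $\boldsymbol{A}$ |
| $\boldsymbol{A}^+$ | Moore-Penrose pseudoinverse of $\boldsymbol{A}$ |
| $\boldsymbol{A} \odot \boldsymbol{B}$ | Element-wise (Hadamard) product of $\boldsymbol{A}$ and $\boldsymbol{B}$ |
| $\det(\boldsymbol{A})$ | Determinant of $\boldsymbol{A}$ |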
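
The transpose, Hadamard product, and determinant can be sketched with nested lists as follows. The helper names are assumptions made for illustration, and `det2` covers only the 2 x 2 case. The pseudoinverse is left out because it is normally computed with a numerical library.

```python
def transpose(A):
    """Return A^T: rows become columns."""
    return [list(col) for col in zip(*A)]


def hadamard(A, B):
    """Return the element-wise product of A and B; shapes must match."""
    if len(A) != len(B) or any(len(ra) != len(rb) for ra, rb in zip(A, B)):
        raise ValueError("A and B must have the same shape")
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]


def det2(A):
    """Determinant of a 2 x 2 matrix: ad - bc."""
    (a, b), (c, d) = A
    return a * d - b * c


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

At = transpose(A)      # [[1, 3], [2, 4]]
AB = hadamard(A, B)    # [[5, 12], [21, 32]]
dA = det2(A)           # -2
```

Note that the Hadamard product multiplies matching entries, so it differs from the ordinary matrix product of the same two matrices.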

Calculus

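| Notation | Description |
| --- | --- |
| $\frac{dy}{dx}$ | Derivative of $y$ with respect to $x$ |
| $\frac{\partial y}{\partial x}$ | Partial derivative of $y$ with respect to $x$ |
| $\nabla_{\boldsymbol{x}} y$ | Gradient of $y$ with respect to $\boldsymbol{x}$ |
| $\nabla_{\boldsymbol{X}} y$ | Matrix derivatives of $y$ with respect to $\boldsymbol{X}$ |
| $\nabla_{\boldsymbol{\mathsf{X}}} y$ | Tensor containing derivatives of $y$ with respect to $\boldsymbol{\mathsf{X}}$ |
| $\frac{\partial f}{\partial \boldsymbol{x}}$ or $\boldsymbol{J}(f)(\boldsymbol{x})$ | Jacobian matrix $\boldsymbol{J} \in \mathbb{R}^{m \times n}$ of $f: \mathbb{R}^n \to \mathbb{R}^m$ |
| $\nabla_{\boldsymbol{x}}^2 f(\boldsymbol{x})$ or $\boldsymbol{H}(f)(\boldsymbol{x})$ | The Hessian matrix $\boldsymbol{H} \in \mathbb{R}^{n \times n}$ of $f: \mathbb{R}^n \to \mathbb{R}$ at input point $\boldsymbol{x}$ |
| $\int f(\boldsymbol{x})\, d\boldsymbol{x}$ | Definite integral over the entire domain of $\boldsymbol{x}$. This clashes with the notation for indefinite integrals, so we strive to avoid it |
| $\int_{\mathbb{S}} f(\boldsymbol{x})\, d\boldsymbol{x}$ | Definite integral with respect to $\boldsymbol{x}$ over the set $\mathbb{S}$ |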
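
To make the shapes of the gradient and the Jacobian concrete, the sketch below approximates both with central finite differences. This is only a numerical illustration under the assumption that the functions are smooth; the functions `y` and `f` and the step size `h` are hypothetical choices.

```python
def gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient of f: R^n -> R at x."""
    grad = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        grad.append((f(xp) - f(xm)) / (2 * h))
    return grad


def jacobian(f, x, h=1e-6):
    """Central-difference m x n Jacobian of f: R^n -> R^m; entry (i, j) is df_i/dx_j."""
    m, n = len(f(x)), len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        fp, fm = f(xp), f(xm)
        for i in range(m):
            J[i][j] = (fp[i] - fm[i]) / (2 * h)
    return J


def y(x):
    """Hypothetical scalar function y = x_1^2 + 3 x_2."""
    return x[0] ** 2 + 3.0 * x[1]


def f(x):
    """Hypothetical vector function from R^2 to R^3."""
    return [x[0] * x[1], x[0] + x[1], x[0] ** 2]


grad_y = gradient(y, [2.0, 1.0])   # analytically [4, 3]
J = jacobian(f, [2.0, 3.0])        # analytically [[3, 2], [1, 1], [4, 0]]
```

The Jacobian here has 3 rows and 2 columns, in line with the $m \times n$ convention in the table for a function from $\mathbb{R}^n$ to $\mathbb{R}^m$.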

Probability and Information Theory

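| Notation | Description |
| --- | --- |
| $\mathrm{a} \perp \mathrm{b}$ | The random variables $\mathrm{a}$ and $\mathrm{b}$ are independent |
| $\mathrm{a} \perp \mathrm{b} \mid \mathrm{c}$ | They are conditionally independent given $\mathrm{c}$ |
| $P(\mathrm{a})$ | A probability distribution over a discrete variable |
| $p(\mathrm{a})$ | A probability distribution over a continuous variable, or over a variable whose type has not been specified |
| $\mathrm{a} \sim P$ | Random variable $\mathrm{a}$ has distribution $P$ |
| $\mathbb{E}_{\mathrm{x} \sim P}[f(x)]$ or $\mathbb{E} f(x)$ | Expectation of $f(x)$ with respect to $P(\mathrm{x})$. The short form is an abuse of notation; we strive to use $\mathbb{E}_{\mathrm{x} \sim P}[f(x)]$ instead |
| $\operatorname{Var}(f(x))$ | Variance of $f(x)$ under $P(\mathrm{x})$; abuse of notation, we strive to use $\operatorname{Var}(f(x))$ instead |
| $\operatorname{Cov}(f(x), g(x))$ | Covariance of $f(x)$ and $g(x)$ under $P(\mathrm{x})$; abuse of notation, we strive to use $\operatorname{Cov}(f(x), g(y))$ instead |
| $H(\mathrm{x})$ | Shannon entropy of the random variable $\mathrm{x}$ |
| $D_{\mathrm{KL}}(P \Vert Q)$ | Kullback-Leibler divergence of $P$ and $Q$ |
| $\mathcal{N}(\boldsymbol{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma})$ | Gaussian distribution over $\boldsymbol{x}$ with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$ |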
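
For a discrete distribution, the expectation, variance, entropy, and KL divergence reduce to finite sums. The sketch below assumes a distribution is stored as a dictionary from outcomes to probabilities, uses the natural logarithm so results are in nats, and assumes $Q(x) > 0$ wherever $P(x) > 0$. The distributions `P` and `Q` are hypothetical.

```python
import math


def expectation(P, f):
    """E_{x ~ P}[f(x)] for a discrete distribution P given as {outcome: probability}."""
    return sum(p * f(x) for x, p in P.items())


def variance(P, f):
    """Var(f(x)) under P, computed as E[(f(x) - E[f(x)])^2]."""
    mu = expectation(P, f)
    return expectation(P, lambda x: (f(x) - mu) ** 2)


def entropy(P):
    """Shannon entropy H(x) in nats; outcomes with zero probability contribute 0."""
    return -sum(p * math.log(p) for p in P.values() if p > 0)


def kl_divergence(P, Q):
    """D_KL(P || Q) in nats; assumes Q[x] > 0 wherever P[x] > 0."""
    return sum(p * math.log(p / Q[x]) for x, p in P.items() if p > 0)


P = {0: 0.5, 1: 0.5}       # hypothetical fair coin
Q = {0: 0.25, 1: 0.75}     # hypothetical biased coin

mean_P = expectation(P, lambda x: x)   # 0.5
var_P = variance(P, lambda x: x)       # 0.25
H_P = entropy(P)                       # log 2
kl_PQ = kl_divergence(P, Q)
kl_QP = kl_divergence(Q, P)
```

The two KL values differ, which illustrates that the divergence is not symmetric in its arguments.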

Functions

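| Notation | Description |
| --- | --- |
| $f: \mathbb{A} \to \mathbb{B}$ | The function $f$ with domain $\mathbb{A}$ and range $\mathbb{B}$ |
| $f \circ g$ | Composition of the functions $f$ and $g$ |
| $f(\boldsymbol{x}; \boldsymbol{\theta})$ | A function of $\boldsymbol{x}$ parametrized by $\boldsymbol{\theta}$. (Sometimes we write $f(\boldsymbol{x})$ and omit the argument $\boldsymbol{\theta}$ to lighten notation) |
| $\log x$ | Natural logarithm of $x$ |
| $\sigma(x)$ | Logistic sigmoid, $\frac{1}{1 + \exp(-x)}$ |
| $\zeta(x)$ | Softplus, $\log(1 + \exp(x))$ |
| $\lVert \boldsymbol{x} \rVert_p$ | $L^p$ norm of $\boldsymbol{x}$ |
| $\lVert \boldsymbol{x} \rVert$ | $L^2$ norm of $\boldsymbol{x}$ |
| $x^+$ | Positive part of $x$, i.e., $\max(0, x)$ |
| $\mathbf{1}_{\mathrm{condition}}$ or $[\mathrm{condition}]$ | Is 1 if the condition is true, 0 otherwise |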
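
The scalar functions in this table translate directly into code. The following sketch uses only the standard library; the function names are illustrative, and the sigmoid and softplus are written in algebraically equivalent forms that avoid overflow for large inputs.

```python
import math


def sigmoid(x):
    """Logistic sigmoid 1 / (1 + exp(-x)), branching on the sign of x to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def softplus(x):
    """Softplus log(1 + exp(x)), rewritten as max(x, 0) + log1p(exp(-|x|))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def lp_norm(x, p=2):
    """L^p norm of a vector given as a list; p = 2 gives the Euclidean norm."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


def positive_part(x):
    """x^+ = max(0, x)."""
    return max(0.0, x)


def indicator(condition):
    """1 if the condition is true, 0 otherwise."""
    return 1 if condition else 0
```

For instance, `sigmoid(0.0)` is `0.5`, `lp_norm([3, 4])` is the Euclidean length of the vector, and `indicator(3 > 2)` is `1`.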

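Sometimes we use a function $f$ whose argument is a scalar but apply it to a vector, matrix, or tensor: $f(\boldsymbol{x})$, $f(\boldsymbol{X})$, or $f(\boldsymbol{\mathsf{X}})$. This denotes the application of $f$ to the array element-wise. For example, if $\boldsymbol{\mathsf{C}} = \sigma(\boldsymbol{\mathsf{X}})$, then $\mathsf{C}_{i,j,k} = \sigma(\mathsf{X}_{i,j,k})$ for all valid values of $i$, $j$ and $k$.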
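
Element-wise application can be sketched as a recursive map over nested lists, so that the same helper handles vectors, matrices, and tensors. The helper name `apply_elementwise` and the example tensor `X` are hypothetical.

```python
import math


def apply_elementwise(f, array):
    """Apply the scalar function f to every entry of a nested-list array of any depth."""
    if isinstance(array, (list, tuple)):
        return [apply_elementwise(f, item) for item in array]
    return f(array)


def sigmoid(x):
    """Logistic sigmoid 1 / (1 + exp(-x)); fine for the moderate inputs used here."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical 2 x 2 x 2 tensor stored as nested lists.
X = [[[0.0, 1.0], [2.0, -1.0]],
     [[-2.0, 0.5], [3.0, 0.0]]]

C = apply_elementwise(sigmoid, X)
```

Each entry of `C` equals the sigmoid of the matching entry of `X`, and `C` has the same nested shape as `X`.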

Datasets and Distributions

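| Notation | Description |
| --- | --- |
| $p_{\mathrm{data}}$ | The data generating distribution |
| $\hat{p}_{\mathrm{data}}$ | The empirical distribution defined by the training set |
| $\mathbb{X}$ | A set of training examples |
| $\boldsymbol{x}^{(i)}$ | The $i$-th example (input) from a dataset |
| $y^{(i)}$ or $\boldsymbol{y}^{(i)}$ | The target associated with $\boldsymbol{x}^{(i)}$ for supervised learning |
| $\boldsymbol{X}$ | The $m \times n$ matrix with input example $\boldsymbol{x}^{(i)}$ in row $\boldsymbol{X}_{i,:}$ |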
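
A small sketch of how these objects might look in code follows. The examples and targets are made-up values, and the empirical distribution is computed over the discrete targets only, as one simple illustration of placing equal mass on each training point.

```python
from collections import Counter

# Hypothetical training set with m = 3 examples and n = 2 features each.
examples = [[5.1, 3.5], [4.9, 3.0], [6.2, 3.4]]   # x^(1), x^(2), x^(3)
targets = [0, 0, 1]                               # y^(1), y^(2), y^(3)

# Design matrix X: row i (1-based in the notation) holds example x^(i).
X = [list(x) for x in examples]
m, n = len(X), len(X[0])


def empirical_distribution(samples):
    """Relative frequency of each distinct value, i.e. mass 1/m per training point."""
    counts = Counter(samples)
    total = len(samples)
    return {value: count / total for value, count in counts.items()}


p_hat = empirical_distribution(targets)   # mass 2/3 on target 0 and 1/3 on target 1
```

In 0-based code, the row written $\boldsymbol{X}_{i,:}$ in the table corresponds to `X[i - 1]`.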
