Mathematics for Machine Learning

The enthusiastic practitioner who wants to learn more about the magic behind successful machine learning algorithms currently faces a daunting set of prerequisites:

Programming languages and data analysis tools.

Large-scale computation and the associated frameworks.

Mathematics and statistics, and how machine learning builds on them.

Common Symbols

Symbol	Name	Explanation / Examples
$\mathbb{R} \ \mathbf{R}$	real numbers	$ \pi \in \mathbb{R} $
$\mathbb{C} \ \mathbf{C}$	complex numbers	$ i \in \mathbb{C} $
$\sum$	summation	$ \sum_{k=1}^n a_{k} = a_1 + a_2 + \dots + a_n $
$\prod$	product	$ \prod_{k=1}^n a_{k} = a_1 \cdot a_2 \cdot \dots \cdot a_n $
$\propto$	proportionality	$y \propto x$ means that $y = kx$ for some constant $k$
$\forall$	universal quantification	$ \forall n \in \mathbb{N}, n^2 \ge n $
$\int$	integral	$ \int_a^b x^2 \, dx = \frac{b^3 - a^3}{3} $
$'$	derivative	If $f(x) := x^2$, then $f'(x) = 2x$.
$\partial$	partial derivative	If $f(x, y) := x^2 y$, then $ \frac{\partial f}{\partial x} = 2xy $.
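$\nabla$	gradient	$ \nabla f = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \frac{\partial f}{\partial z} \right) $; written with a dot, $ \nabla \cdot \overset{\rightharpoonup}{v} = \frac{\partial v_x}{\partial x} + \frac{\partial v_y}{\partial y} + \frac{\partial v_z}{\partial z} $ is the divergence of $\overset{\rightharpoonup}{v}$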
$\Delta$	delta	$ \frac{\Delta y}{\Delta x} $ is the gradient of a straight line.
$P(A \mid B)$	conditional probability	Probability of A given B
$\mathrm{E}$	expected value	$\mathrm{E}[X] = \sum_{i=1}^{\infty} x_i p_i$ for a discrete random variable $X$ that takes the value $x_i$ with probability $p_i$
$\lvert \ldots \rvert$	determinant	$ \det(u, v) = \begin{vmatrix} 1 & 2 \\ 2 & 9 \end{vmatrix} = 1 \times 9 - 2 \times 2 = 5 $
$\odot$	Hadamard product	$ \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix} \odot \begin{bmatrix} 1 & 2 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 1\cdot1 & 2\cdot2 \\ 2\cdot0 & 4\cdot1 \end{bmatrix} = \begin{bmatrix} 1 & 4 \\ 0 & 4 \end{bmatrix} $
$\hat a$	estimator 估计量	$\hat \theta$ is the estimator or the estimate for the parameter $\theta$
$\sigma$	selection	$ \sigma_{a \theta b} (R) = \{t : t \in R,\ t(a) \ \theta \ t(b)\} $
$\arg\max$	arguments of the maxima	The input(s) or argument(s) at which a function's output is as large as possible
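
Several of the symbols in the table map directly onto a few lines of code. The sketch below uses only the Python standard library; the list `a`, the function `f`, and the candidate grid are made-up illustrations, and the matrices are the ones from the Hadamard product row.

```python
import math

a = [3, 1, 4, 1, 5]  # hypothetical sample values

total = sum(a)          # summation: a_1 + a_2 + ... + a_n
product = math.prod(a)  # product:   a_1 * a_2 * ... * a_n


def f(x):
    """A hypothetical concave function whose maximum is at x = 2."""
    return -(x - 2) ** 2 + 3


# argmax over a finite set of candidates: the input that maximizes f
candidates = [0, 1, 2, 3, 4]
best = max(candidates, key=f)

# Hadamard (element-wise) product of two matrices of the same shape
A = [[1, 2], [2, 4]]
B = [[1, 2], [0, 1]]
hadamard = [[x * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]

print(total, product, best, hadamard)
```

Note that `max(..., key=f)` only searches the listed candidates; finding the argmax of a function over a continuous domain needs an optimizer instead.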

Common Terms

Name	Explanation / Examples
Covariance	$\mathrm{cov}(X, Y) = \mathrm{E}((X - \mu)(Y - \nu)) = \mathrm{E}(X \cdot Y) - \mu \nu$ { $\mathrm{E}(X) = \mu$, $\mathrm{E}(Y) = \nu$ }
Variance	$\mathrm{var}(X) = \mathrm{E}[(X - \mu)^2] = \mathrm{cov}(X, X)$
Standard Deviation	$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 }$ { $\mathrm{E}(x) = \mu$ }
Mean Absolute Error	$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| Y_i - \hat{Y}_i \right| $ { $Y_{i}$: observed value, $\hat{Y}_{i}$: predicted value }
Mean Squared Error	$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 $ { $Y_{i}$: observed value, $\hat{Y}_{i}$: predicted value }
Root Mean Square Error	$\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$
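
The definitions in this table can be checked numerically. The following is a minimal sketch in plain Python, assuming population statistics (division by $N$, as in the standard deviation row); the function names and sample data are illustrative and not taken from any particular library.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def covariance(xs, ys):
    """Population covariance: E[(X - mu)(Y - nu)]."""
    mu, nu = mean(xs), mean(ys)
    return mean([(x - mu) * (y - nu) for x, y in zip(xs, ys)])


def variance(xs):
    """var(X) = cov(X, X)."""
    return covariance(xs, xs)


def std_dev(xs):
    return math.sqrt(variance(xs))


def mae(y_true, y_pred):
    return mean([abs(y - y_hat) for y, y_hat in zip(y_true, y_pred)])


def mse(y_true, y_pred):
    return mean([(y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)])


def rmse(y_true, y_pred):
    return math.sqrt(mse(y_true, y_pred))


# Hypothetical observed and predicted values
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
print(mae(y_true, y_pred), mse(y_true, y_pred), rmse(y_true, y_pred))
```

Because RMSE is the square root of MSE, it is expressed in the same units as the observed values, which often makes it easier to interpret.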

Activation Functions

Sigmoid

$$
\begin{align}
S(x) = \frac{1}{1 + e^{-x}} = \frac{e^x}{e^x + 1}
\end{align}
$$
Hyperbolic tangent (tanh)

$$
\begin{align}
\tanh(x) = \frac{\sinh(x)}{\cosh(x)} = \frac {e^x - e^{-x}}{e^x + e^{-x}} = \frac{e^{2x} - 1}{e^{2x} + 1}
\end{align}
$$
Rectifier (ReLU, rectified linear unit)

$$
f(x) = x^+ = \max(0, x)
$$
Leaky ReLU

$$
f(x) =
\begin{cases}
x & x > 0, \\
0.01x & \text{otherwise.}
\end{cases}
$$
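
The four activation functions above can be sketched as scalar Python functions. This is a direct transcription of the formulas, using the standard logistic definition $1 / (1 + e^{-x})$ for the sigmoid; it is not numerically robust, since `math.exp` overflows for inputs of very large magnitude. The parameter name `slope` is an assumption made for this example.

```python
import math


def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Hyperbolic tangent via (e^(2x) - 1) / (e^(2x) + 1)."""
    e2x = math.exp(2.0 * x)
    return (e2x - 1.0) / (e2x + 1.0)


def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    """Leaky ReLU: x for x > 0, otherwise slope * x."""
    return x if x > 0 else slope * x


for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), tanh(x), relu(x), leaky_relu(x))
```

In practice, libraries apply these functions element-wise to whole arrays and use more careful formulations for extreme inputs.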

Calculus

Limit of a sequence

$$
\lim_{n \to \infty} x_n = a \iff \forall \epsilon > 0,\ \exists N \in \mathbb{N}^{+} \text{ such that } n > N \implies \left| x_n - a \right| < \epsilon .
$$
Two important limits

$$
\lim_{x \to 0} \frac{\sin x}{x} = 1
$$

$$
\lim_{x \to \infty} \left(1 + \frac{1}{x}\right)^x = e
$$
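
Both limits can be observed numerically by evaluating the expressions at inputs that approach the limit point. The sketch below is an illustration only, not a proof; the helper names and sample points are arbitrary choices.

```python
import math


def sin_over_x(x):
    """sin(x) / x, which approaches 1 as x approaches 0."""
    return math.sin(x) / x


def compound(x):
    """(1 + 1/x)^x, which approaches e as x grows without bound."""
    return (1.0 + 1.0 / x) ** x


for x in (1e-1, 1e-2, 1e-4):
    print(f"sin(x)/x at x={x:g}: {sin_over_x(x):.10f}")

for x in (1e2, 1e4, 1e6):
    print(f"(1 + 1/x)^x at x={x:g}: {compound(x):.10f}  (e = {math.e:.10f})")
```

For extremely large `x`, floating-point rounding in `1.0 + 1.0 / x` eventually dominates, so the numeric value stops improving even though the mathematical limit holds.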
Taylor's formula

$$
f(x) = \sum_{n = 0}^{\infty} \frac{f^{(n)}(a)}{n!} \cdot (x - a)^n
$$
Here $f^{(n)}(a)$ denotes the $n$-th derivative of $f$ at the point $a$. If $a = 0$, the series is called the Maclaurin series.
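
As a concrete case, every derivative of $e^x$ at $a = 0$ equals 1, so its Maclaurin series is $\sum_{n=0}^{\infty} x^n / n!$. The sketch below sums the first few terms; the function name `exp_maclaurin` and the parameter `terms` are made up for this example.

```python
import math


def exp_maclaurin(x, terms):
    """Partial sum of the Maclaurin series of e^x using `terms` terms."""
    return sum(x ** n / math.factorial(n) for n in range(terms))


for terms in (2, 5, 10, 20):
    approx = exp_maclaurin(1.0, terms)
    print(f"{terms:2d} terms: {approx:.12f}  error = {abs(approx - math.e):.2e}")
```

The error shrinks quickly here because the factorial in the denominator grows much faster than $x^n$ for moderate $x$.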
Euclidean distance

$$
d(p, q) = \sqrt{\sum_{i=1}^n (q_i - p_i)^2}
$$
Minkowski distance

$$
d_p(x, y) = \left( \sum_{i=1}^n \left| x_i - y_i \right|^p \right)^{\frac{1}{p}}
$$
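
The Euclidean distance is the Minkowski distance with $p = 2$, and $p = 1$ gives the Manhattan distance. A minimal sketch in plain Python, with illustrative function names and points:

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def euclidean(x, y):
    """Euclidean distance, i.e. the Minkowski distance with p = 2."""
    return minkowski(x, y, 2)


origin = (0.0, 0.0)
point = (3.0, 4.0)
print(minkowski(origin, point, 1))  # Manhattan distance
print(euclidean(origin, point))     # Euclidean distance
```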

Linear Algebra

Matrix

$$
A = [a_{ij}]_{m \times n}
$$
Transpose of a matrix

$$
A^T = [a_{ji}]_{n \times m}
$$
Rank

$$
A = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 3 & 4 \\ 4 & 3 & 2 & 1 \end{bmatrix} \sim \begin{bmatrix} 1 & 0 & -1 & -2 \\ 0 & 1 & 2 & 3 \\ 0 & 0 & 0 & 0 \end{bmatrix} \\
\operatorname{rank}(A) = \operatorname{rank}(A^T) = 2
$$
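
The rank in this example can be reproduced with Gaussian elimination: the rank is the number of pivots found. The sketch below uses exact `Fraction` arithmetic from the standard library to avoid floating-point rounding; the function name `rank` is chosen for this example and the approach is one of several possible ones.

```python
from fractions import Fraction


def rank(matrix):
    """Rank of a matrix (list of rows) via Gaussian elimination."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    r = 0  # number of pivots found so far
    for col in range(n_cols):
        # Find a row at or below position r with a nonzero entry in this column.
        pivot = next((i for i in range(r, n_rows) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        # Eliminate the entries below the pivot.
        for i in range(r + 1, n_rows):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
        if r == n_rows:
            break
    return r


A = [[1, 1, 1, 1],
     [1, 2, 3, 4],
     [4, 3, 2, 1]]
A_T = [list(col) for col in zip(*A)]  # transpose

print(rank(A), rank(A_T))
```

The third row of `A` is five times the first row minus the second, which is why elimination leaves only two pivots.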
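Null space

The null space $N(A)$ of a matrix $A$ is the set of all vectors $\overset{\rightharpoonup}{x}$ that satisfy $A \overset{\rightharpoonup}{x} = \overset{\rightharpoonup}{0}$:

$$
N(A) = \left\{ \overset{\rightharpoonup}{x} \mid A \overset{\rightharpoonup}{x} = \overset{\rightharpoonup}{0} \right\}
$$
The null space of a matrix contains only $\overset{\rightharpoonup}{0}$ if and only if all of its columns are linearly independent.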
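
For the matrix from the rank subsection, the reduced row echelon form has two free variables (the third and fourth), so the null space is two-dimensional. Setting each free variable to 1 in turn gives the candidate basis vectors used below. The sketch only verifies that they are mapped to the zero vector; the helper `matvec` is written for this example.

```python
A = [[1, 1, 1, 1],
     [1, 2, 3, 4],
     [4, 3, 2, 1]]


def matvec(matrix, x):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


# Candidate basis read off from the reduced row echelon form:
# x1 = x3 + 2*x4 and x2 = -2*x3 - 3*x4, with x3 and x4 free.
null_basis = [[1, -2, 1, 0],
              [2, -3, 0, 1]]

for v in null_basis:
    print(v, "->", matvec(A, v))

# Any linear combination of the basis vectors is also in the null space.
combo = [3 * a - 2 * b for a, b in zip(*null_basis)]
print(combo, "->", matvec(A, combo))
```

Since the columns of this matrix are linearly dependent, the null space contains more than the zero vector, in line with the criterion stated above.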
Left null space

The left null space of a matrix $A$ is the null space of its transpose:

$$
N(A^T) = \left\{ \overset{\rightharpoonup}{x} \mid A^T \overset{\rightharpoonup}{x} = \overset{\rightharpoonup}{0} \right\} = \left\{ \overset{\rightharpoonup}{x} \mid \overset{\rightharpoonup}{x}^T A = \overset{\rightharpoonup}{0}^T \right\}
$$
Column space (the space spanned by the column vectors)

$$
A_{m \times n} = \begin{bmatrix} \overset{\rightharpoonup}{v_1} & \overset{\rightharpoonup}{v_2} & \ldots & \overset{\rightharpoonup}{v_n} \end{bmatrix}
$$

$$
\therefore \ C(A) = \operatorname{span}(\overset{\rightharpoonup}{v_1}, \overset{\rightharpoonup}{v_2}, \ldots, \overset{\rightharpoonup}{v_n})
$$
Row space

$$
R(A) = C(A^T)
$$

Probability Theory

Conditional probability

$$
P(A \mid B) = \frac{P(A \cap B)}{P(B)}
$$
Bayes' theorem

$$
\begin{align}
P(A \mid B) = \frac{\frac{P(A \cap B)}{P(A)} \cdot P(A)}{P(B)} = \frac{P(B \mid A) \cdot P(A)}{P(B)}
\end{align}
$$
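
A small numeric example shows how the formula is applied. The numbers below are hypothetical and chosen only for illustration: $A$ is "has the condition" with a 1% prior, and $B$ is "test is positive" with a 95% true-positive rate and a 5% false-positive rate. $P(B)$ is obtained from the law of total probability.

```python
# Hypothetical inputs, not real data
p_a = 0.01              # P(A): prior probability of the condition
p_b_given_a = 0.95      # P(B | A): positive test given the condition
p_b_given_not_a = 0.05  # P(B | not A): positive test without the condition

# Law of total probability: P(B) = P(B|A) P(A) + P(B|not A) P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1.0 - p_a)

# Bayes' theorem: P(A | B) = P(B | A) P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b

print(f"P(B) = {p_b:.4f}")
print(f"P(A | B) = {p_a_given_b:.4f}")
```

Even with a fairly accurate test, the posterior stays well below 50% under these assumed numbers because the prior $P(A)$ is small.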
Chain rule

$$
P(F, G, H) = P(F) \, P(G, H \mid F) = P(F) \, P(G \mid F) \, P(H \mid F, G)
$$
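
The chain rule can be checked on a small joint distribution. The sketch below builds a made-up joint distribution over three binary variables, called `F`, `G` and `H` here (the third is named `H` to avoid clashing with the probability symbol $P$), and verifies the factorization exactly with `Fraction`. The weights are arbitrary.

```python
from fractions import Fraction
from itertools import product

# Hypothetical unnormalized weights for the 8 outcomes (f, g, h)
outcomes = list(product([0, 1], repeat=3))
weights = dict(zip(outcomes, [1, 2, 3, 4, 5, 6, 7, 8]))
total = sum(weights.values())
joint = {k: Fraction(w, total) for k, w in weights.items()}


def prob(f=None, g=None, h=None):
    """Marginal probability; arguments left as None are summed out."""
    return sum(p for (a, b, c), p in joint.items()
               if f in (None, a) and g in (None, b) and h in (None, c))


def chain(f, g, h):
    """P(F) * P(G | F) * P(H | F, G) computed from marginals."""
    p_f = prob(f=f)
    p_g_given_f = prob(f=f, g=g) / p_f
    p_h_given_fg = prob(f=f, g=g, h=h) / prob(f=f, g=g)
    return p_f * p_g_given_f * p_h_given_fg


chain_rule_holds = all(chain(*k) == joint[k] for k in outcomes)
print(chain_rule_holds)
```

The check succeeds for any joint distribution with nonzero marginals, because the intermediate terms cancel; the value of the factorization is that each conditional factor can be modeled separately.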