Matrix calculus is an important part of the mathematical foundations of machine learning. This article introduces the principles of matrix differentiation and common techniques for computing matrix derivatives, laying the groundwork for later topics.
This article draws on Mathematics for Machine Learning.
Review
Let us first review the differentiation of single-variable functions.
For a single-variable function $f(x)$, i.e.
\[\begin{aligned}f:\mathbb{R}&\to\mathbb{R}\\x&\mapsto f(x)\end{aligned}\]the derivative is defined as
\[\frac{\mathrm{d}f}{\mathrm{d}x}:=\lim_{h\to0}\frac{f(x+h)-f(x)}{h}.\]
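As a quick numerical sanity check, the difference quotient above can be evaluated directly for small $h$. Below is a minimal Python sketch; the test function $f(x)=x^2$ and the step size are my own illustrative choices, not from the text.

```python
def numerical_derivative(f, x, h=1e-6):
    """Approximate df/dx at x with the forward difference quotient."""
    return (f(x + h) - f(x)) / h

# Example: f(x) = x**2, whose exact derivative at x = 3 is 6.
f = lambda x: x ** 2
print(numerical_derivative(f, 3.0))  # approximately 6.000001
```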
For a multivariate function $f(\boldsymbol{x})$, i.e.
\[\begin{aligned}f:\mathbb{R}^n&\to\mathbb{R}\\\boldsymbol{x}&\mapsto f(\boldsymbol{x})\end{aligned}\]where $\boldsymbol{x}$ is an $n$-dimensional vector $[x_1,x_2,\cdots,x_n]^{\text{T}}$, the partial derivatives are defined as
\[\begin{aligned}\frac{\partial f}{\partial x_1}&=\lim_{h\to0}\frac{f(x_1+h,x_2,\ldots,x_n)-f(\boldsymbol{x})}{h}\\&\vdots\\\frac{\partial f}{\partial x_n}&=\lim_{h\to0}\frac{f(x_1,\ldots,x_{n-1},x_n+h)-f(\boldsymbol{x})}{h}\end{aligned}\]Collecting the partial derivatives of $f(\boldsymbol{x})$ with respect to each variable $x_i$ into a row vector yields the gradient of the function
\[\nabla_{\boldsymbol{x}}f=\mathrm{grad}f=\frac{\mathrm{d}f}{\mathrm{d}\boldsymbol{x}}=\left[\frac{\partial f(\boldsymbol{x})}{\partial x_1}\quad\frac{\partial f(\boldsymbol{x})}{\partial x_2}\quad\cdots\quad\frac{\partial f(\boldsymbol{x})}{\partial x_n}\right]\in\mathbb{R}^{1\times n}.\]
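The same row-vector gradient can be approximated numerically, one coordinate at a time. The following is a minimal NumPy sketch; the test function $f(\boldsymbol{x})=x_1^2+2x_2$ is an arbitrary choice for illustration.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Approximate the gradient of f at x as a 1 x n row vector."""
    grad = np.zeros((1, x.size))
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[0, i] = (f(x + e) - f(x)) / h
    return grad

# f(x) = x1**2 + 2*x2, so the exact gradient is [2*x1, 2].
f = lambda x: x[0] ** 2 + 2 * x[1]
print(numerical_gradient(f, np.array([3.0, 1.0])))  # approximately [[6., 2.]]
```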
When taking partial derivatives of composite multivariate functions, we must pay attention to the chain rule:
\[\frac\partial{\partial\boldsymbol{x}}(g\circ f)(\boldsymbol{x})=\frac\partial{\partial\boldsymbol{x}}(g(f(\boldsymbol{x})))=\frac{\partial g}{\partial f}\frac{\partial f}{\partial\boldsymbol{x}}\]Now let us consider a bivariate function
\[\begin{aligned}f:\mathbb{R}^2&\to\mathbb{R}\\\boldsymbol{x}=[x_1,x_2]^{\text{T}}&\mapsto f(\boldsymbol{x})\end{aligned}\]and suppose further that
\[\begin{cases}x_1=x_1(t)\\x_2=x_2(t)\end{cases}\]Computing the derivative of $f$ with respect to $t$ gives
\[\frac{\mathrm{d}f}{\mathrm{d}t}=\frac{\partial f}{\partial x_1}\frac{\partial x_1}{\partial t}+\frac{\partial f}{\partial x_2}\frac{\partial x_2}{\partial t}=\begin{bmatrix}\frac{\partial f}{\partial x_1}&\frac{\partial f}{\partial x_2}\end{bmatrix}\begin{bmatrix}\frac{\partial x_1(t)}{\partial t}\\\frac{\partial x_2(t)}{\partial t}\end{bmatrix}\]If instead we have
\[\begin{cases}x_1=x_1(s,t)\\x_2=x_2(s,t)\end{cases}\]then computing the partial derivatives of $f$ with respect to $s$ and $t$ gives
\[\begin{aligned}\frac{\partial f}{\partial\color{red}{s}}&=\frac{\partial f}{\partial\color{blue}{x_1}}\frac{\partial\color{blue}{x_1}}{\partial\color{red}{s}}+\frac{\partial f}{\partial\color{blue}{x_2}}\frac{\partial\color{blue}{x_2}}{\partial\color{red}{s}},\\\frac{\partial f}{\partial\color{red}{t}}&=\frac{\partial f}{\partial\color{blue}{x_1}}\frac{\partial\color{blue}{x_1}}{\partial\color{red}{t}}+\frac{\partial f}{\partial\color{blue}{x_2}}\frac{\partial\color{blue}{x_2}}{\partial\color{red}{t}},\end{aligned}\]and the gradient of $f$ with respect to $[s,t]^{\text{T}}$ is
\[\begin{aligned} \nabla_{[s,t]^{\text{T}}}f=\frac{\mathrm{d}f}{\mathrm{d}(s,t)}&=\left[\frac{\partial f}{\partial\color{blue}{x_1}}\frac{\partial\color{blue}{x_1}}{\partial\color{red}{s}}+\frac{\partial f}{\partial\color{blue}{x_2}}\frac{\partial\color{blue}{x_2}}{\partial\color{red}{s}}\quad\frac{\partial f}{\partial\color{blue}{x_1}}\frac{\partial\color{blue}{x_1}}{\partial\color{red}{t}}+\frac{\partial f}{\partial\color{blue}{x_2}}\frac{\partial\color{blue}{x_2}}{\partial\color{red}{t}}\right]\\ &=\frac{\partial f}{\partial\boldsymbol{x}}\frac{\partial\boldsymbol{x}}{\partial(s,t)}=\underbrace{\left[\frac{\partial f}{\partial x_1}\quad\frac{\partial f}{\partial x_2}\right]}_{\LARGE=\frac{\partial f}{\partial\boldsymbol{x}}}\underbrace{\begin{bmatrix}\frac{\partial x_1}{\partial s}&\frac{\partial x_1}{\partial t}\\\frac{\partial x_2}{\partial s}&\frac{\partial x_2}{\partial t}\end{bmatrix}}_{\LARGE=\frac{\partial\boldsymbol{x}}{\partial(s,t)}}. \end{aligned}\]These two examples show that once the gradient is written as a row vector, it can be expressed through matrix multiplication. In practice, however, the gradient is more commonly written as a column vector; under that convention, the matrices in the computation above must be transposed first.
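The factorization above is easy to check numerically. Here is a minimal NumPy sketch, with $f(x_1,x_2)=x_1^2x_2$, $x_1=s+t$, and $x_2=st$ chosen arbitrarily for illustration, comparing the matrix product $\frac{\partial f}{\partial\boldsymbol{x}}\frac{\partial\boldsymbol{x}}{\partial(s,t)}$ against direct finite differences of the composition.

```python
import numpy as np

h = 1e-6
f = lambda x: x[0] ** 2 * x[1]             # f(x1, x2) = x1^2 * x2
x = lambda s, t: np.array([s + t, s * t])  # x1 = s + t, x2 = s * t

s, t = 1.5, 2.0
x0 = x(s, t)

# df/dx = [2*x1*x2, x1^2] as a row vector.
df_dx = np.array([[2 * x0[0] * x0[1], x0[0] ** 2]])
# dx/d(s,t) = [[dx1/ds, dx1/dt], [dx2/ds, dx2/dt]] = [[1, 1], [t, s]].
dx_dst = np.array([[1.0, 1.0], [t, s]])

print(df_dx @ dx_dst)  # chain rule via matrix multiplication

# Direct finite-difference check of the composition g(s, t) = f(x(s, t)).
g = lambda s, t: f(x(s, t))
print([(g(s + h, t) - g(s, t)) / h, (g(s, t + h) - g(s, t)) / h])
```

Both prints agree up to finite-difference error, which is exactly what the matrix factorization of the chain rule promises.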
Vector-Valued Functions
Next, let us consider vector-valued functions
\[\begin{aligned}\boldsymbol{f}:\mathbb{R}^n&\to\mathbb{R}^m\\\boldsymbol{x}&\mapsto\boldsymbol{f}(\boldsymbol{x})\end{aligned}\]where $\boldsymbol{x}=[x_1,x_2,\cdots,x_n]^{\text{T}}\in\mathbb{R}^n$. Such a function has the form
\[\boldsymbol{f}(\boldsymbol{x})=\begin{bmatrix}f_1(\boldsymbol{x})\\\vdots\\f_m(\boldsymbol{x})\end{bmatrix}\in\mathbb{R}^m\]Writing a vector-valued function in this form lets us treat it as a collection of scalar functions: each $f_i(\boldsymbol{x}),\ i=1,2,\cdots,m$, obeys the differentiation rules introduced above.
Hence the partial derivative of a vector-valued function $\boldsymbol{f}(\boldsymbol{x})$ with respect to any variable $x_i,\ i=1,2,\cdots,n$, can be written as
\[\frac{\partial\boldsymbol{f}}{\partial x_i}=\begin{bmatrix}\frac{\partial f_1}{\partial x_i}\\\vdots\\\frac{\partial f_m}{\partial x_i}\end{bmatrix}=\begin{bmatrix}\lim\limits_{h\to0}\frac{f_1(x_1,\ldots,x_{i-1},x_i+h,x_{i+1},\ldots,x_n)-f_1(\boldsymbol{x})}{h}\\\vdots\\\lim\limits_{h\to0}\frac{f_m(x_1,\ldots,x_{i-1},x_i+h,x_{i+1},\ldots,x_n)-f_m(\boldsymbol{x})}{h}\end{bmatrix}\in\mathbb{R}^m.\]As above, the gradient is written as a row vector, while the partial derivative with respect to each variable is a column vector, so the gradient of a vector-valued function takes the form
\[\begin{aligned} \frac{\mathrm{d}\boldsymbol{f}(\boldsymbol{x})}{\mathrm{d}\boldsymbol{x}}&=\left[\frac{\partial\boldsymbol{f}(\boldsymbol{x})}{\partial x_1}\quad\cdots\quad\frac{\partial\boldsymbol{f}(\boldsymbol{x})}{\partial x_n}\right]\\&=\begin{bmatrix}\frac{\partial f_1(\boldsymbol{x})}{\partial x_1}&\cdots&\frac{\partial f_1(\boldsymbol{x})}{\partial x_n}\\\vdots&\ddots&\vdots\\\frac{\partial f_m(\boldsymbol{x})}{\partial x_1}&\cdots&\frac{\partial f_m(\boldsymbol{x})}{\partial x_n}\end{bmatrix}\in\mathbb{R}^{m\times n}. \end{aligned}\]The matrix of first-order partial derivatives of a vector-valued function $\boldsymbol{f}:\mathbb{R}^n\to\mathbb{R}^m$ is called the Jacobian matrix, and has the form
\[\begin{aligned} \boldsymbol{J}&=\nabla_{\boldsymbol{x}}\boldsymbol{f}=\frac{\mathrm{d}\boldsymbol{f}(\boldsymbol{x})}{\mathrm{d}\boldsymbol{x}}=\begin{bmatrix}\frac{\partial\boldsymbol{f}(\boldsymbol{x})}{\partial x_1}&\cdots&\frac{\partial\boldsymbol{f}(\boldsymbol{x})}{\partial x_n}\end{bmatrix}\\&=\begin{bmatrix}\frac{\partial f_1(\boldsymbol{x})}{\partial x_1}&\cdots&\frac{\partial f_1(\boldsymbol{x})}{\partial x_n}\\\vdots&&\vdots\\\frac{\partial f_m(\boldsymbol{x})}{\partial x_1}&\cdots&\frac{\partial f_m(\boldsymbol{x})}{\partial x_n}\end{bmatrix},\\ \boldsymbol{x}&=\begin{bmatrix}x_1\\\vdots\\x_n\end{bmatrix},\quad J(i,j)=\frac{\partial f_i}{\partial x_j}. \end{aligned}\]Compared with writing the gradient as a column vector, the row-vector convention makes the transition from scalar-valued functions to vector-valued functions more natural.
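As a sketch of the definition (the function $\boldsymbol{f}:\mathbb{R}^2\to\mathbb{R}^3$ below is an arbitrary example of mine), the Jacobian can be approximated column by column, perturbing one input coordinate at a time; its shape is $m\times n$ as claimed.

```python
import numpy as np

def numerical_jacobian(f, x, h=1e-6):
    """Approximate the m x n Jacobian of f at x, one column per input."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        J[:, j] = (f(x + e) - fx) / h
    return J

# f: R^2 -> R^3
f = lambda x: np.array([x[0] * x[1], x[0] ** 2, np.sin(x[1])])
J = numerical_jacobian(f, np.array([1.0, 2.0]))
print(J.shape)  # (3, 2), i.e. m x n
```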
The table below summarizes the dimensions of the Jacobian matrix for inputs and outputs of different dimensions.
Input / Output | $\mathbb{R}$ | $\mathbb{R}^m$ |
---|---|---|
$\mathbb{R}$ | $1\times 1$ | $m\times 1$ |
$\mathbb{R}^n$ | $1\times n$ | $m\times n$ |
Gradients of Matrices
Sometimes we need to compute the gradient of a matrix with respect to a vector, or of one matrix with respect to another; the result is then a multidimensional tensor. Consider the following example.
Suppose $\boldsymbol{f}=\boldsymbol{Ax}$, with $\boldsymbol{f}\in\mathbb{R}^M$, $\boldsymbol{A}\in\mathbb{R}^{M\times N}$, $\boldsymbol{x}\in\mathbb{R}^N$. Then
\[\frac{\mathrm{d}\boldsymbol{f}}{\mathrm{d}\boldsymbol{A}}=\begin{bmatrix}\frac{\partial f_1}{\partial\boldsymbol{A}}\\\vdots\\\frac{\partial f_M}{\partial\boldsymbol{A}}\end{bmatrix}\in\mathbb{R}^{M\times(M\times N)},\quad\frac{\partial f_i}{\partial\boldsymbol{A}}\in\mathbb{R}^{1\times(M\times N)}.\]Next we compute these partial derivatives. Since
\[f_i=\sum_{j=1}^N A_{ij}x_j,\quad i=1,\ldots,M,\]consider
\[\frac{\partial f_i}{\partial A_{kq}},\quad k=1,2,\cdots,M,\quad q=1,2,\cdots,N.\]When $k=i$,
\[\frac{\partial f_i}{\partial A_{iq}}=x_q,\quad q=1,2,\cdots,N,\]and when $k\neq i$,
\[\frac{\partial f_i}{\partial A_{kq}}=0,\quad q=1,2,\cdots,N.\]We therefore obtain
\[\begin{aligned} \frac{\partial f_i}{\partial A_{i,:}}&=\boldsymbol{x}^\top\in\mathbb{R}^{1\times1\times N},\\ \frac{\partial f_i}{\partial A_{k\neq i,:}}&=\boldsymbol{0}^\top\in\mathbb{R}^{1\times1\times N},\end{aligned}\]so
\[\frac{\partial f_i}{\partial\boldsymbol{A}}=\begin{bmatrix}\mathbf{0}^\top\\\vdots\\\mathbf{0}^\top\\\boldsymbol{x}^\top\\\mathbf{0}^\top\\\vdots\\\mathbf{0}^\top\end{bmatrix}\in\mathbb{R}^{1\times(M\times N)}.\]
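A small NumPy sketch confirms this structure numerically (the sizes $M$, $N$ and the values of $\boldsymbol{A}$, $\boldsymbol{x}$ are arbitrary test choices of mine): perturbing entry $A_{kq}$ changes $f_i$ only when $k=i$, and the nonzero row equals $\boldsymbol{x}^\top$.

```python
import numpy as np

M, N = 3, 4
rng = np.random.default_rng(0)
A = rng.standard_normal((M, N))
x = rng.standard_normal(N)
h = 1e-6

# dF[i, k, q] approximates the partial derivative of f_i w.r.t. A_{kq}.
dF = np.zeros((M, M, N))
for k in range(M):
    for q in range(N):
        E = np.zeros((M, N))
        E[k, q] = h
        dF[:, k, q] = ((A + E) @ x - A @ x) / h

# For each i, df_i/dA has x^T in row i and zeros everywhere else.
print(np.allclose(dF[0, 0, :], x, atol=1e-4))   # True
print(np.allclose(dF[0, 1:, :], 0.0, atol=1e-4))  # True
```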
Common Matrix Derivative Formulas
Below are some frequently used matrix derivative formulas; their derivations are omitted.
\[\begin{aligned} &\frac\partial{\partial\boldsymbol{X}}\boldsymbol{f}(\boldsymbol{X})^\top=\left(\frac{\partial\boldsymbol{f}(\boldsymbol{X})}{\partial\boldsymbol{X}}\right)^\top \\\\ &\frac{\partial\boldsymbol{x}^\top\boldsymbol{a}}{\partial\boldsymbol{x}}=\boldsymbol{a}^\top \\\\ &\frac{\partial\boldsymbol{a}^\top\boldsymbol{x}}{\partial\boldsymbol{x}}=\boldsymbol{a}^\top \\\\ &\frac{\partial\boldsymbol{a}^\top\boldsymbol{Xb}}{\partial\boldsymbol{X}}=\boldsymbol{ab}^\top \\\\ &\frac{\partial\boldsymbol{x}^\top\boldsymbol{Bx}}{\partial\boldsymbol{x}}=\boldsymbol{x}^\top(\boldsymbol{B}+\boldsymbol{B}^\top) \end{aligned}\]
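As a quick check of the last identity, here is a minimal NumPy sketch (the matrix $\boldsymbol{B}$ and vector $\boldsymbol{x}$ are random test values of my own choosing) comparing a finite-difference gradient with $\boldsymbol{x}^\top(\boldsymbol{B}+\boldsymbol{B}^\top)$.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((3, 3))
x = rng.standard_normal(3)
h = 1e-6

f = lambda x: x @ B @ x  # scalar-valued x^T B x

# Finite-difference gradient (as a row of partials) vs. x^T (B + B^T).
grad_fd = np.array([(f(x + h * np.eye(3)[i]) - f(x)) / h for i in range(3)])
print(np.allclose(grad_fd, x @ (B + B.T), atol=1e-4))  # True
```

The same pattern can be used to verify the other identities; note that all of them follow the row-vector gradient convention used throughout this article.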