Table of contents
(2024-02-14)
-
A derivative is the amount of change in a target quantity caused by a change in a variable.
-
A row of a matrix consists of the coefficients of each term in a linear equation. And based on the "sum rule" of derivatives ($(f+g)'=f'+g'$), the derivative of the linear equation w.r.t. a variable is the sum of the derivatives of each element in the row w.r.t. that variable.
d Ax
(2024-01-13)
Source video: Derivative of a Matrix : Data Science Basics - ritvikmath
Matrix 𝐀 stands for a linear transformation (a function), and only the derivative of the function 𝐀𝐱 makes sense.
- A matrix is a representation of a linear system.
The derivative of the linear transformation 𝐀𝐱 w.r.t. 𝐱 is 𝐀. This is analogous to the single-variable case $\frac{d(ax)}{dx} = a$.
A matrix $A$ by itself is a "constant": more concretely, a collection of scalars in a box.
Therefore, "the derivative of $A$" would be the derivative of a constant, which is 0. So, on its own it doesn't make any sense.
Thus, we are not calculating the derivative of a matrix, but the derivative of the linear transformation 𝐀𝐱 w.r.t. 𝐱.
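As a sanity check (not from the video; 𝐀 and 𝐱 are arbitrary example values I chose), a central-difference approximation confirms that the Jacobian of the map 𝐱 ↦ 𝐀𝐱 is 𝐀 itself:

```python
import numpy as np

# Example matrix; any values work since Ax is linear in x.
A = np.array([[2.0, 1.0],
              [0.0, 3.0]])

def f(x):
    return A @ x

def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian: J[i, j] = d f_i / d x_j."""
    m, n = f(x).size, x.size
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

x0 = np.array([1.0, -2.0])
print(np.allclose(numerical_jacobian(f, x0), A))  # True: d(Ax)/dx = A
```

Because 𝐀𝐱 is linear, the central difference is exact up to floating-point roundoff, at any 𝐱.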
d xᵀAx
$$ \begin{aligned} 𝐱ᵀ𝐀𝐱 &= \begin{bmatrix} x₁ & x₂ \end{bmatrix} \begin{bmatrix} a₁₁ & a₁₂ \\\ a₂₁ & a₂₂ \end{bmatrix} \begin{bmatrix} x₁ \\\ x₂ \end{bmatrix} \\\ &= \begin{bmatrix} x₁ & x₂ \end{bmatrix} \begin{bmatrix} a₁₁x₁+ a₁₂x₂ \\\ a₂₁x₁ + a₂₂x₂ \end{bmatrix} \\\ &= a₁₁x₁²+ a₁₂x₁x₂ + a₂₁x₁x₂ + a₂₂x₂² ⇒ f(x₁,x₂) \end{aligned} $$
Assume 𝐀 is a symmetric matrix, so a₁₂ = a₂₁. Then $𝐱ᵀ𝐀𝐱 = a₁₁x₁² + 2a₁₂x₁x₂ + a₂₂x₂² = f(x₁,x₂)$.
The derivative of the linear transformation 𝐱ᵀ𝐀𝐱:
$$ \begin{aligned} \frac{d𝐱ᵀ𝐀𝐱}{d𝐱} &= \begin{bmatrix} ∂f/∂x₁ \\\ ∂f/∂x₂ \end{bmatrix} \\\ &= \begin{bmatrix} 2a₁₁x₁+2a₁₂x₂ \\\ 2a₁₂x₁ + 2a₂₂x₂ \end{bmatrix} \\\ &= 2 \begin{bmatrix} a₁₁ & a₁₂ \\\ a₁₂ & a₂₂ \end{bmatrix} \begin{bmatrix} x₁ \\\ x₂ \end{bmatrix} \\\ &= 2𝐀𝐱 \end{aligned} $$
This is the matrix analog of the quadratic case $\frac{d(ax²)}{dx} = 2ax$.
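A quick numeric sketch of the identity above, assuming a symmetric 𝐀 (the values of 𝐀 and 𝐱 are arbitrary examples, not from the video):

```python
import numpy as np

# Symmetric example matrix: a12 = a21, as the derivation assumes.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

def quad(x):
    return x @ A @ x   # the scalar xᵀAx

def numerical_grad(f, x, eps=1e-6):
    """Central-difference gradient of a scalar-valued f."""
    g = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        g[j] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

x0 = np.array([1.0, -1.5])
print(np.allclose(numerical_grad(quad, x0), 2 * A @ x0))  # True: d(xᵀAx)/dx = 2Ax
```

For a non-symmetric 𝐀 the general result is $(𝐀 + 𝐀ᵀ)𝐱$, which reduces to $2𝐀𝐱$ in the symmetric case.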
3 cases
Source article: The derivative matrix - Math Insight
-
A matrix 𝐀 contains elements that are functions of a scalar x.
-
The $\frac{d𝐀}{dx}$ is a matrix of the same size as 𝐀.
Refer to Definition 5 in Matrix Differentiation - Department of Atmospheric Sciences
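A minimal sketch of this case with a made-up $𝐀(x)$ (my example, not from the article): $\frac{d𝐀}{dx}$ is obtained by differentiating each element, and a central difference confirms it.

```python
import numpy as np

# A(x): a matrix whose elements are functions of a scalar x.
def A(x):
    return np.array([[x**2,      np.sin(x)],
                     [np.exp(x), 1.0      ]])

# dA/dx: the same-size matrix of elementwise derivatives.
def dA_dx(x):
    return np.array([[2 * x,     np.cos(x)],
                     [np.exp(x), 0.0      ]])

x0, eps = 0.7, 1e-6
numeric = (A(x0 + eps) - A(x0 - eps)) / (2 * eps)
print(np.allclose(numeric, dA_dx(x0)))  # the elementwise rule holds
```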
-
-
The derivative of a multi-variable scalar-valued function $f$ is a row vector of the partial derivatives of $f$ with respect to each variable.
- Derivative of $f$ w.r.t. each coordinate axis.
- $\frac{df}{d𝐱} = [ \frac{∂f}{∂x₁}\ \frac{∂f}{∂x₂}\ ⋯ \ \frac{∂f}{∂xₙ} ]$
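A sketch of this case with a hypothetical $f$ (my example): the row of analytic partials matches a central-difference estimate.

```python
import numpy as np

# Scalar-valued f of two variables.
def f(x):
    return x[0]**2 + 3 * x[0] * x[1]

# Row vector of partials: [df/dx1, df/dx2].
def analytic_row(x):
    return np.array([2 * x[0] + 3 * x[1], 3 * x[0]])

def numeric_row(f, x, eps=1e-6):
    g = np.zeros(x.size)
    for j in range(x.size):
        e = np.zeros(x.size)
        e[j] = eps
        g[j] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

x0 = np.array([1.0, 2.0])
print(np.allclose(numeric_row(f, x0), analytic_row(x0)))
```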
-
A matrix 𝐀 contains elements that are functions of a vector 𝐱.
-
$𝐀(𝐱) = 𝐟(𝐱) = (f_1(𝐱),\ f_2(𝐱),\ ..., f_m(𝐱)) = \begin{bmatrix} f_1(𝐱) \\\ f_2(𝐱) \\\ ⋮ \\\ f_m(𝐱) \end{bmatrix}$
-
The $\frac{d𝐀}{d𝐱}$ is a matrix of size m×n:
$$ \frac{d𝐀}{d𝐱} = \begin{bmatrix} \frac{∂f_1}{∂x_1} & \frac{∂f_1}{∂x_2} & ⋯ & \frac{∂f_1}{∂xₙ} \\\ \frac{∂f_2}{∂x_1} & \frac{∂f_2}{∂x_2} & ⋯ & \frac{∂f_2}{∂xₙ} \\\ ⋮ & ⋮ & ⋱ & ⋮ \\\ \frac{∂f_m}{∂x_1} & \frac{∂f_m}{∂x_2} & ⋯ & \frac{∂f_m}{∂xₙ} \\\ \end{bmatrix} $$
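A sketch of this case with a made-up $𝐟: ℝ² → ℝ³$ (my example): the numeric Jacobian has shape m×n and matches the analytic partials.

```python
import numpy as np

# Vector-valued f: R^2 -> R^3, so the Jacobian is 3 x 2.
def f(x):
    return np.array([x[0] * x[1],
                     x[0]**2,
                     np.sin(x[1])])

def jacobian_numeric(f, x, eps=1e-6):
    """Central-difference Jacobian: J[i, j] = ∂f_i/∂x_j."""
    m, n = f(x).size, x.size
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

x0 = np.array([2.0, 0.5])
J_analytic = np.array([[x0[1],      x0[0]],
                       [2 * x0[0],  0.0],
                       [0.0,        np.cos(x0[1])]])
print(jacobian_numeric(f, x0).shape)                  # (3, 2): m x n
print(np.allclose(jacobian_numeric(f, x0), J_analytic))
```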
-
Matrix derivative
(2023-02-12)
A matrix derivative is taken with respect to the whole matrix at once, whereas partial derivatives of a matrix are taken with respect to each element individually.
Given a matrix $[^{a\ b}\_{d\ c}]$, the derivative of its inverse matrix $\frac{1}{ac-bd}[^{\ c\ -b}\_{-d\ a}]$ w.r.t. the original matrix is the "coefficient" in their relation:
$$ \underbrace{ \frac{1}{ac-bd} \begin{bmatrix} c & -b \\\ -d & a \end{bmatrix} \frac{1}{ac-bd} \begin{bmatrix} c & -b \\\ -d & a \end{bmatrix} }\_{\text{Coefficient}} \begin{bmatrix} a & b \\\ d & c \end{bmatrix} = \frac{1}{ac-bd} \begin{bmatrix} c & -b \\\ -d & a \end{bmatrix} $$
-
This transformation can be understood as follows: the original matrix is first multiplied by its inverse $\frac{1}{ac-bd}[^{\ c\ -b}\_{-d\ a}]$ to become the identity matrix $[^{1\ 0}_{0\ 1}]$, which is then multiplied by the inverse once more to yield the inverse matrix.
Therefore, the coefficient is:
$$ \frac{1}{(ac-bd)²} \begin{bmatrix} c & -b \\\ -d & a \end{bmatrix} \begin{bmatrix} c & -b \\\ -d & a \end{bmatrix} = \frac{1}{(ac-bd)²} \begin{bmatrix} c² + bd & -bc-ab \\\ -cd-ad & bd+a²\end{bmatrix} $$
In this case, is the optimizing objective the whole matrix $[^{a\ b}_{d\ c}]$, with its coefficient serving as the gradient?
On the other hand, the partial derivatives of the inverse matrix $\frac{1}{ac-bd}[^{\ c\ -b}\_{-d\ a}]$ with respect to each element $a,\ b,\ c,\ d$ can be conceptualized as:
how do changes in the 4 "variables" $a,\ b,\ c,\ d$ affect the matrix $\frac{1}{ac-bd}[^{\ c\ -b}\_{-d\ a}]$?
$$ \begin{aligned} \frac{ ∂\frac{1}{ac-bd} \begin{bmatrix} c & -b \\\ -d & a \end{bmatrix}}{∂a} &= \begin{bmatrix} \frac{∂}{∂a} (\frac{c}{ac-bd} ) & \frac{∂}{∂a} (\frac{-b}{ac-bd}) \\\ \frac{∂}{∂a} (\frac{-d}{ac-bd}) & \frac{∂}{∂a} (\frac{a}{ac-bd} ) \\\ \end{bmatrix} \\\ &= \begin{bmatrix} \frac{-c²}{(ac-bd)²} & \frac{bc}{(ac-bd)²} \\\ \frac{dc}{(ac-bd)²} & \frac{-bd}{(ac-bd)²} \\\ \end{bmatrix} \end{aligned} $$
The total change of the matrix (summed over all its entries) caused by moving $a$ by one unit would be:
$$\frac{∂ (\frac{1}{ac-bd} [^{\ c\ -b}\_{-d\ a}] )}{∂a} = \frac{-c² + bc + dc - bd}{(ac-bd)²} $$
- In particular, with this derivative, $a$ can be optimized via gradient descent.
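The hand-derived partials w.r.t. $a$ can be checked numerically (the values of $a, b, c, d$ below are arbitrary, chosen so the matrix is invertible):

```python
import numpy as np

a, b, c, d = 2.0, 1.0, 3.0, 1.5   # ac - bd = 4.5, so [[a, b], [d, c]] is invertible

def inv(a, b, c, d):
    """Inverse of [[a, b], [d, c]] via the adjugate formula."""
    det = a * c - b * d
    return np.array([[c, -b], [-d, a]]) / det

# Central-difference derivative of the inverse w.r.t. a.
eps = 1e-6
numeric = (inv(a + eps, b, c, d) - inv(a - eps, b, c, d)) / (2 * eps)

det = a * c - b * d
analytic = np.array([[-c**2, b * c],
                     [d * c, -b * d]]) / det**2
print(np.allclose(numeric, analytic))
# The scalar "total change" is the sum of these entries:
print(np.isclose(numeric.sum(), (-c**2 + b * c + d * c - b * d) / det**2))
```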
Similarly, the partial derivatives of the matrix w.r.t. $b,\ c,\ d$ are:
$$ \begin{aligned} \frac{∂ (\frac{1}{ac-bd} [^{\ c\ -b}\_{-d\ a}] )}{∂b} &= \frac{cd-ac-d²+ad}{(ac-bd)²} \\\ \frac{∂ (\frac{1}{ac-bd} [^{\ c\ -b}\_{-d\ a}] )}{∂c} &= \frac{-bd+ba+da-a²}{(ac-bd)²} \\\ \frac{∂ (\frac{1}{ac-bd} [^{\ c\ -b}\_{-d\ a}] )}{∂d} &= \frac{cb-b²-ac+ab}{(ac-bd)²} \\\ \end{aligned} $$
(2024-02-13)
Matrix Derivatives: What’s up with all those transposes? - David Levin
Gradient: Matrix form -> indices form -> matrix form
XᵀwX
(2024-04-06)
Decompose into: a vector-valued function + a multivariable function
A basis of a space can consist of polynomial functions or power functions, so a linear equation can represent nonlinear functions
Source video: 【微积分和线性代数碰撞的数学盛宴:最小二乘法公式推导!】 ("A mathematical feast where calculus and linear algebra collide: deriving the least-squares formula!") - 晓之车高山老师 - bilibili
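The least-squares derivation referenced above can be made concrete with a sketch (setup assumed from the standard least-squares problem, with random example data, not taken from the video): the gradient of $‖𝐗𝐰 − 𝐲‖²$ is $2𝐗ᵀ(𝐗𝐰 − 𝐲)$, and setting it to zero gives the normal equations.

```python
import numpy as np

# Random example data for the least-squares problem min_w ||Xw - y||^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = rng.normal(size=10)

def loss(w):
    r = X @ w - y
    return r @ r                      # scalar: squared error

def grad_analytic(w):
    return 2 * X.T @ (X @ w - y)      # d/dw ||Xw - y||^2

def grad_numeric(w, eps=1e-6):
    g = np.zeros_like(w)
    for j in range(w.size):
        e = np.zeros_like(w)
        e[j] = eps
        g[j] = (loss(w + e) - loss(w - e)) / (2 * eps)
    return g

w0 = rng.normal(size=3)
print(np.allclose(grad_numeric(w0), grad_analytic(w0), atol=1e-4))
# Setting the gradient to zero gives the normal equations (XᵀX)w = Xᵀy:
w_star = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(grad_analytic(w_star), 0, atol=1e-8))
```

This is exactly the "vector-valued function + multivariable function" decomposition: $𝐰 ↦ 𝐗𝐰 − 𝐲$ is vector-valued, and the squared norm is a multivariable scalar function of its output.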
(2024-05-15)
(2024-07-22)
Source video: 手推机器学习1⃣️—矩阵求导 ("Machine Learning by Hand 1: Matrix Differentiation") - S-WangZ (2024-05-24)
-
Scalar-valued function $f: \\R^n → \\R$
-
Defined with field and vector space: Scalar-valued function definition - SE (Searched by “scalar function” in DDG)
-
A field $k$ comprises one set $k$ and two operations, addition and multiplication: $k = (k, +, ⋅)$
-
A vector space $V$ comprises two sets, $k$ and $V$, and two operations, addition and scalar multiplication: $V = (V, +, k, ⋅)$
- An element in the set k is a scalar. An element in the set V is a vector.
- A scalar-valued function $f$ maps a vector space to its field of scalars: $f: V → k$
-
-
“Scalar function is a function with one-dimensional scalar output” Scalar Function, Definition of Scalar - Statistics How To
-
“A scalar-value function is a function that takes one or more values and returns a single value.” World Web Math: Vector Calculus: Scalar Valued Functions - MIT