HW 2
2.1
Let $L(x)=\|Ax\|_2^2-\lambda(\|x\|_2^2-1)$, with $A\in\mathbb{R}^{n\times n}$ and $x\in\mathbb{R}^n$. Find $\frac{\partial L}{\partial x}$.
$$
\begin{aligned}
\frac{\partial L}{\partial x} &= \frac{\partial\left[(Ax)^TAx-\lambda x^Tx\right]}{\partial x}\\
&= 2A^TAx-2\lambda x
\end{aligned}
$$
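As a sanity check, here is a minimal NumPy sketch (random $A$, $x$, $\lambda$; all names are illustrative) comparing this gradient with central finite differences:

```python
import numpy as np

n = 5
rng = np.random.default_rng(0)
A, x, lam = rng.normal(size=(n, n)), rng.normal(size=n), 0.7

L = lambda x: np.dot(A @ x, A @ x) - lam * (np.dot(x, x) - 1)
grad = 2 * A.T @ A @ x - 2 * lam * x                      # claimed analytic gradient

# central finite differences, coordinate by coordinate
eps = 1e-6
num = np.array([(L(x + eps * np.eye(n)[i]) - L(x - eps * np.eye(n)[i])) / (2 * eps)
                for i in range(n)])
print(np.allclose(grad, num, atol=1e-5))                  # True
```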
2.2
Let $L=\|V-WH\|_F^2$, with $V\in\mathbb{R}^{m\times n}$, $W\in\mathbb{R}^{m\times k}$, $H\in\mathbb{R}^{k\times n}$. Find $\frac{\partial L}{\partial W}$ and $\frac{\partial L}{\partial V}$.
$$
\begin{aligned}
\frac{\partial L}{\partial W} &= \frac{\partial\,\mathrm{tr}\!\left((V-WH)^T(V-WH)\right)}{\partial W}\\
&= -2VH^T+2WHH^T
\end{aligned}
$$
$$
\frac{\partial L}{\partial V} = 2(V-WH) = 2V-2WH
$$
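The same kind of finite-difference check works for $\partial L/\partial W$ (again a sketch with random matrices; shapes follow the problem statement):

```python
import numpy as np

m, n, k = 4, 6, 3
rng = np.random.default_rng(0)
V, W, H = rng.normal(size=(m, n)), rng.normal(size=(m, k)), rng.normal(size=(k, n))

L = lambda W: np.linalg.norm(V - W @ H, 'fro') ** 2
grad_W = -2 * (V - W @ H) @ H.T             # equals -2VH^T + 2WHH^T

eps = 1e-6
num = np.zeros_like(W)
for i in range(m):
    for j in range(k):
        E = np.zeros_like(W); E[i, j] = eps
        num[i, j] = (L(W + E) - L(W - E)) / (2 * eps)
print(np.allclose(grad_W, num, atol=1e-5))  # True
```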
2.3
Let $f=\mathrm{tr}(X^TAX)$.
$$
\begin{aligned}
df &= d\,\mathrm{tr}(X^TAX)\\
&= \mathrm{tr}(dX^T\,AX + X^TA\,dX)\\
&= \mathrm{tr}(X^TA^T\,dX + X^TA\,dX)\\
&= \mathrm{tr}\!\left(X^T(A+A^T)\,dX\right)
\end{aligned}
$$
Matching this with $df=\mathrm{tr}\!\left(\left(\frac{\partial f}{\partial X}\right)^T dX\right)$ gives
$$
\frac{\partial}{\partial X}\mathrm{tr}(X^TAX) = (A+A^T)X
$$
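A quick check of this result via a directional derivative (a sketch; $dX$ is a random direction and all names are illustrative):

```python
import numpy as np

n = 4
rng = np.random.default_rng(0)
A, X, dX = (rng.normal(size=(n, n)) for _ in range(3))

f = lambda X: np.trace(X.T @ A @ X)
G = (A + A.T) @ X                           # claimed gradient

# directional derivative along dX should equal tr(G^T dX)
t = 1e-6
lhs = (f(X + t * dX) - f(X - t * dX)) / (2 * t)
rhs = np.trace(G.T @ dX)
print(np.isclose(lhs, rhs, atol=1e-5))      # True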
HW 3
3.1 Logistic Regression for Multi-Class Classification
Parameter space: $\theta=\{w_k,b_k\}_{k=1}^{K}$; samples: $X=\{x_1,\ldots,x_n\}$; labels: $Y=\{y_1,\ldots,y_n\}$ with $y_i\in\{1,\ldots,K\}$, where $K$ is the number of classes.
$$
\ell(\theta) = \sum_{i=1}^n \log p(y_i\mid x_i;\theta),\qquad
p(y_i\mid x_i;\theta) = \prod_{j=1}^K p(y_i=j\mid x_i;\theta)^{\mathbb{I}(y_i=j)}
$$
Taking class $K$ as the reference class:
$$
p(y=j\mid x) = \frac{\exp(w_j^Tx+b_j)}{1+\sum_{k=1}^{K-1}\exp(w_k^Tx+b_k)}\quad (j=1,\ldots,K-1),\qquad
p(y=K\mid x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(w_k^Tx+b_k)}
$$
Plugging these in:
$$
\begin{aligned}
\ell(\theta) &= \sum_{i=1}^n\sum_{j=1}^K\mathbb{I}(y_i=j)\log p(y_i=j\mid x_i;\theta)\\
&= \sum_{i=1}^n\left(\sum_{j=1}^{K-1}\mathbb{I}(y_i=j)\log\frac{\exp(w_j^Tx_i+b_j)}{1+\sum_{k=1}^{K-1}\exp(w_k^Tx_i+b_k)}
+\mathbb{I}(y_i=K)\log\frac{1}{1+\sum_{k=1}^{K-1}\exp(w_k^Tx_i+b_k)}\right)
\end{aligned}
$$
Alternatively, using the softmax function:
$$
p(y=j\mid x)=\sigma_j(z) = \frac{\exp(z_j)}{\sum_{k=1}^{K}\exp(z_k)}
$$
$$
\begin{aligned}
\ell(\theta) &= \sum_{i=1}^n\sum_{j=1}^K\mathbb{I}(y_i=j)\log\frac{\exp(z_j^{(i)})}{\sum_{k=1}^{K}\exp(z_k^{(i)})}\\
&= \sum_{i=1}^n\sum_{j=1}^K\mathbb{I}(y_i=j)\log\frac{\exp(w_j^Tx_i+b_j)}{\sum_{k=1}^{K}\exp(w_k^Tx_i+b_k)}
\end{aligned}
$$
where $z_k^{(i)} = w_k^Tx_i+b_k$.
Derivatives: for a single sample with logits $z$, write $p_j = \sigma_j(z)$. The softmax Jacobian is
$$
\frac{\partial p_i}{\partial z_j} = \begin{cases} p_i(1-p_j) & i=j\\ -p_ip_j & i\neq j\end{cases}
\;=\; p_i\left(\mathbb{I}(i=j)-p_j\right)
$$
$$
\frac{\partial z_k}{\partial w_k} = x,\qquad \frac{\partial z_k}{\partial b_k} = 1
$$
Since sample $i$ enters $\ell$ only through its own logits $z^{(i)}$, write $p_k^{(i)}=\sigma_k(z^{(i)})$; then
$$
\begin{aligned}
\frac{\partial \ell}{\partial z_k^{(i)}}
&= \sum_{j=1}^K\mathbb{I}(y_i=j)\,\frac{1}{p_j^{(i)}}\,p_j^{(i)}\left(\mathbb{I}(j=k)-p_k^{(i)}\right)\\
&= \sum_{j=1}^K\mathbb{I}(y_i=j)\left(\mathbb{I}(j=k)-p_k^{(i)}\right)\\
&= \mathbb{I}(y_i=k)-p_k^{(i)}
\end{aligned}
$$
using $\sum_{j=1}^K\mathbb{I}(y_i=j)=1$.
$$
\nabla_{w_k}\ell = \sum_{i=1}^n\frac{\partial \ell}{\partial z_k^{(i)}}\frac{\partial z_k^{(i)}}{\partial w_k}
= \sum_{i=1}^n\left(\mathbb{I}(y_i=k)-p_k^{(i)}\right)x_i,\qquad
\nabla_{b_k}\ell = \sum_{i=1}^n\left(\mathbb{I}(y_i=k)-p_k^{(i)}\right)
$$
A NumPy sketch of the training procedure (samples as rows of `X`, one-hot labels `Y`):

```python
import numpy as np

def softmax(Z):
    # numerically stable softmax over the class axis
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def loss_function(X, Y, W, b):
    # average cross-entropy loss; Y is one-hot with shape (n, K)
    return -np.sum(Y * np.log(softmax(X @ W + b))) / len(X)

def compute_grad(X, Y, W, b):
    # gradient of the cross-entropy loss w.r.t. W and b
    P = softmax(X @ W + b)
    grad_W = X.T @ (P - Y) / len(X)
    grad_b = (P - Y).mean(axis=0)
    return grad_W, grad_b

def gradient_descent(X, Y, W, b, lr=0.1, eps=1e-6):
    stop = False
    while not stop:
        grad_W, grad_b = compute_grad(X, Y, W, b)
        W_new, b_new = W - lr * grad_W, b - lr * grad_b   # descent step
        stop = np.linalg.norm(W_new - W) + np.linalg.norm(b_new - b) < eps
        W, b = W_new, b_new
    return W, b

def predict(X, W, b):
    # index of the largest softmax probability for each sample
    return softmax(X @ W + b).argmax(axis=1)
```
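A finite-difference check of `compute_grad` (assuming the sketch above; random data and one-hot labels are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 8, 4, 3
X = rng.normal(size=(n, d))
Y = np.eye(K)[rng.integers(K, size=n)]                    # one-hot labels
W, b = rng.normal(size=(d, K)), rng.normal(size=K)

grad_W, _ = compute_grad(X, Y, W, b)                      # analytic gradient of the loss

eps = 1e-6
num = np.zeros_like(W)
for i in range(d):
    for j in range(K):
        E = np.zeros_like(W); E[i, j] = eps
        num[i, j] = (loss_function(X, Y, W + E, b) - loss_function(X, Y, W - E, b)) / (2 * eps)
print(np.allclose(grad_W, num, atol=1e-5))                # True
```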
3.2 Multi-Class LDA
Suppose a dataset $\{(\mathbf{x}_i,y_i)\}$ with $y_i\in\{1,\ldots,K\}$. Class $k$ has $N_k$ samples, collected in $X_k$.
Step 1: Pre-Computing
Mean Vectors:
$$
\mu_k = \frac{1}{N_k}\sum_{i=1}^{N_k}x_i
$$
where $N_k$ is the number of samples in class $k$.
$$
\mu = \frac{1}{N}\sum_{i=1}^{N}x_i
$$
where $N$ is the total number of samples.
Within-Class Scatter Matrix:
$$
S_k = \sum_{x\in X_k}(x-\mu_k)(x-\mu_k)^T
$$
$$
S_W=\sum_{k=1}^K S_k
$$
Between-Class Scatter Matrix:
$$
S_B = \sum_{k=1}^{K}N_k(\mu_k-\mu)(\mu_k-\mu)^T
$$
Step 2: Solve the Generalized Eigenvalue Problem
$$
S_W^{-1}S_B\,\mathbf{w}=\lambda\,\mathbf{w}
$$
Solve it with a standard numerical eigensolver.
Step 3
Select the top $d'\leq K-1$ eigenvectors to form the transformation matrix $W$, projecting the data to a lower-dimensional space.
$$
Z=W^TX
$$
Step 4
Apply any classifier that works well in the lower-dimensional space.
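A minimal NumPy sketch of the steps above (the function name `lda_fit` and the row-major data layout are my assumptions; with samples as rows, the projection reads $Z = XW$ rather than $W^TX$):

```python
import numpy as np

def lda_fit(X, y, d):
    # X: (N, D) samples as rows, y: (N,) integer class labels, d <= K-1
    classes = np.unique(y)
    mu = X.mean(axis=0)                                        # global mean
    S_W = np.zeros((X.shape[1], X.shape[1]))
    S_B = np.zeros_like(S_W)
    for k in classes:
        X_k = X[y == k]
        mu_k = X_k.mean(axis=0)
        S_W += (X_k - mu_k).T @ (X_k - mu_k)                   # within-class scatter
        S_B += len(X_k) * np.outer(mu_k - mu, mu_k - mu)       # between-class scatter
    # eigenvectors of S_W^{-1} S_B, largest eigenvalues first
    vals, vecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(-vals.real)
    return vecs[:, order[:d]].real                             # (D, d) projection matrix

# usage: Z = X @ lda_fit(X, y, d=2)
```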
3.3 In Multi-Class LDA, prove that $\mathrm{rank}(S_W^{-1}S_B)\leq K-1$
$$
\mathrm{rank}(S_W^{-1}S_B) \leq \min\{\mathrm{rank}(S_W^{-1}),\,\mathrm{rank}(S_B)\}
$$
For the between-class scatter matrix $S_B$:
$$
\begin{aligned}
\mathrm{rank}(S_B) &= \mathrm{rank}\left(\sum_{k=1}^{K}N_k(\mu_k-\mu)(\mu_k-\mu)^T\right)\\
&\leq \sum_{k=1}^{K}\mathrm{rank}\left((\mu_k-\mu)(\mu_k-\mu)^T\right)
\end{aligned}
$$
Notice that:
$$
\mathrm{rank}\left((\mu_k-\mu)(\mu_k-\mu)^T\right) \leq 1
$$
$$
\mu = \frac{1}{N}\sum_{k=1}^K N_k\,\mu_k
\qquad\Longrightarrow\qquad
\sum_{k=1}^K N_k(\mu_k-\mu)=0
$$
Since $\sum_{k=1}^K N_k(\mu_k-\mu)=0$ is a nontrivial linear dependence, at most $K-1$ of the vectors $\{\mu_k-\mu\}$ are linearly independent. The column space of $S_B$ is contained in $\mathrm{span}\{\mu_1-\mu,\ldots,\mu_K-\mu\}$, whose dimension is at most $K-1$. Therefore:
$$
\mathrm{rank}(S_W^{-1}S_B)\leq \mathrm{rank}(S_B) \leq K-1
$$
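A quick numerical illustration of the bound (random Gaussian clusters; the dimensions and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, N_k = 4, 10, 30
X = [rng.normal(loc=rng.normal(size=D), size=(N_k, D)) for _ in range(K)]   # K classes
mu = np.vstack(X).mean(axis=0)                                              # global mean

S_B = sum(len(Xk) * np.outer(Xk.mean(axis=0) - mu, Xk.mean(axis=0) - mu) for Xk in X)
print(np.linalg.matrix_rank(S_B))   # 3, i.e. K - 1, although S_B is 10 x 10
```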
3.4 Convexity of functions
Definition: $f$ is convex if for all $x_1,x_2\in D$ and $\lambda\in(0,1)$:
$$
f(\lambda x_1+(1-\lambda)x_2)\leq \lambda f(x_1)+(1-\lambda)f(x_2)
$$
First, consider the model output as a function of $w$:
$$
y=\frac{1}{1+\exp(w^Tx+b)}
$$
Writing $e := e^{w^Tx+b}$, the gradient is
$$
\nabla_w y = \frac{-x\,e}{(1+e)^2}
$$
$$
\begin{aligned}
\nabla_w^2 y &= -x\,\frac{e(1+e)^2 - 2e^2(1+e)}{(1+e)^4}\,x^T\\
&= \frac{e(e-1)}{(1+e)^3}\,xx^T
\end{aligned}
$$
The factor $e(e-1)$ is negative when $w^Tx+b<0$ and positive when $w^Tx+b>0$, so the Hessian can be negative semidefinite as well as positive semidefinite. Hence $y$ is neither convex nor concave in $w$ in general.
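A tiny check that the curvature really does change sign (using the Hessian expression above; the specific $x$, $w$, $b$ values are arbitrary):

```python
import numpy as np

x = np.array([1.0, 2.0])

def hess(w, b):
    e = np.exp(w @ x + b)                      # e = exp(w^T x + b)
    return e * (e - 1) / (1 + e) ** 3 * np.outer(x, x)

# the nonzero eigenvalue flips sign with w^T x + b, so y is neither convex nor concave in w
print(np.linalg.eigvalsh(hess(np.array([-1.0, 0.0]), 0.0)))   # one negative eigenvalue
print(np.linalg.eigvalsh(hess(np.array([ 1.0, 0.0]), 0.0)))   # one positive eigenvalue
```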
Next, consider the logistic regression negative log-likelihood, with $\hat x_i=(x_i;1)$ and $\beta=(w;b)$:
$$
\ell(\beta) = \sum_{i=1}^m \left(-y_i\beta^T\hat x_i + \ln\!\left(1+e^{\beta^T\hat x_i}\right)\right)
$$
Writing $e_i := e^{\beta^T\hat x_i}$:
$$
\nabla_\beta \ell = \sum_{i=1}^m \left(\frac{e_i}{1+e_i}-y_i\right)\hat x_i
$$
$$
\begin{aligned}
\nabla_\beta^2 \ell &= \sum_{i=1}^m \hat x_i\hat x_i^T\,\frac{e_i(1+e_i)-e_i^2}{(1+e_i)^2}\\
&= \sum_{i=1}^m \frac{e_i}{(1+e_i)^2}\,\hat x_i\hat x_i^T
\end{aligned}
$$
Each term is a nonnegative multiple of the positive semidefinite matrix $\hat x_i\hat x_i^T$, so the Hessian is positive semidefinite and $\ell(\beta)$ is convex.
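And a matching numerical check that the Hessian of $\ell(\beta)$ is positive semidefinite for random data (a sketch; $\hat x_i$ stacks $(x_i, 1)$ as rows):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 20, 3
Xh = np.hstack([rng.normal(size=(m, d)), np.ones((m, 1))])   # rows are \hat{x}_i = (x_i; 1)
beta = rng.normal(size=d + 1)

e = np.exp(Xh @ beta)
H = (Xh.T * (e / (1 + e) ** 2)) @ Xh             # sum_i e_i/(1+e_i)^2 * x_i x_i^T
print(np.linalg.eigvalsh(H).min() >= -1e-10)     # True: all eigenvalues nonnegative
```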
HW 4
4.1