Machine Learning Homework

HW 2

2.1

$L(x)=\|Ax\|_2^2-\lambda (\|x\|_2^2-1)$, with $A\in \mathbb{R}^{n\times n}$ and $x\in \mathbb{R}^n$. Find $\frac{\partial L}{\partial x}$.

$$\begin{aligned} \frac{\partial L}{\partial x} &= \frac{\partial \left[(Ax)^TAx-\lambda x^Tx\right]}{\partial x}\\ &= 2A^TAx-2\lambda x \end{aligned}$$
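A quick finite-difference sanity check of this gradient (a minimal sketch; `A`, `x`, and `lam` are random values chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)
lam = 0.7

L = lambda x: np.dot(A @ x, A @ x) - lam * (np.dot(x, x) - 1)

# analytic gradient: 2 A^T A x - 2 lambda x
grad = 2 * A.T @ A @ x - 2 * lam * x

# central finite differences along each coordinate
eps = 1e-6
num = np.array([(L(x + eps * e) - L(x - eps * e)) / (2 * eps) for e in np.eye(n)])

print(np.allclose(grad, num, atol=1e-5))  # expected: True
```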

2.2

$L(W,H)=\|V-WH\|_F^2$, with $V\in \mathbb{R}^{m\times n}$, $W\in \mathbb{R}^{m\times k}$, $H\in \mathbb{R}^{k\times n}$. Find $\frac{\partial L}{\partial W}$ and $\frac{\partial L}{\partial V}$.

$$\begin{aligned} \frac{\partial L}{\partial W} &=\frac{\partial \,\mathrm{tr}\!\left((V-WH)^T(V-WH)\right)}{\partial W}\\ &=-2VH^T+2WHH^T \end{aligned}$$

$$\frac{\partial L}{\partial V} = 2(V-WH) = 2V-2WH$$
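The same kind of finite-difference check works for the matrix gradients (again a sketch with arbitrary random shapes and values; only one entry of $W$ is perturbed):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 4, 6, 3
V = rng.standard_normal((m, n))
W = rng.standard_normal((m, k))
H = rng.standard_normal((k, n))

L = lambda V, W, H: np.linalg.norm(V - W @ H, 'fro') ** 2

grad_W = -2 * V @ H.T + 2 * W @ H @ H.T   # analytic dL/dW
grad_V = 2 * (V - W @ H)                  # analytic dL/dV

# perturb a single entry of W and compare with the analytic value
eps = 1e-6
E = np.zeros_like(W); E[1, 2] = eps
num = (L(V, W + E, H) - L(V, W - E, H)) / (2 * eps)
print(np.isclose(grad_W[1, 2], num))      # expected: True
```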

2.3

Let $f=\mathrm{tr}(X^TAX)$.

$$\begin{aligned} df &= d\,\mathrm{tr}(X^TAX)\\ &=\mathrm{tr}(dX^T\,AX + X^TA\,dX)\\ &=\mathrm{tr}(X^TA^T\,dX + X^TA\,dX)\\ &=\mathrm{tr}\!\left(X^T(A+A^T)\,dX\right) \end{aligned}$$

Under the convention $df=\mathrm{tr}\!\left(\frac{\partial f}{\partial X}\,dX\right)$, this gives

$$\frac{\partial}{\partial X}\mathrm{tr}(X^TAX)=X^T(A+A^T)$$
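The identity can be verified by comparing the actual change in $f$ against $\mathrm{tr}\!\left(X^T(A+A^T)\,dX\right)$ for a small random perturbation $dX$ (a sketch; sizes, seed, and tolerances are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
A = rng.standard_normal((n, n))
X = rng.standard_normal((n, n))
dX = 1e-6 * rng.standard_normal((n, n))              # small perturbation

f = lambda X: np.trace(X.T @ A @ X)

df_exact = f(X + dX) - f(X)                          # actual change in f
df_linear = np.trace(X.T @ (A + A.T) @ dX)           # tr(X^T (A + A^T) dX)

print(np.isclose(df_exact, df_linear, rtol=1e-4, atol=1e-12))  # expected: True
```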

HW 3

3.1 Logit Model for Multi-Class Classification

Parameter space: $\theta = \{w_j,b_j\}$, one pair per class; samples: $X=\{x_1,\ldots,x_n\}$; labels: $Y=\{y_1,\ldots,y_n\}$ with $y_i\in \{1,\ldots,K\}$, where $K$ is the number of classes.

$$\begin{gathered}\ell(\theta) = \sum_{i=1}^n \log p(y_i|x_i;\theta)\\ p(y_i|x_i;\theta) = \prod_{j=1}^K p(y_i=j|x_i)^{\mathbb I(y_i=j)}\\ p(y=j|x) = \frac{\exp(w_j^Tx+b_j)}{1+\sum_{k=1}^{K-1}\exp(w_k^Tx+b_k)}\qquad j=1,\ldots,K-1\\ p(y=K|x) = \frac{1}{1+\sum_{k=1}^{K-1}\exp(w_k^Tx+b_k)}\end{gathered}$$

Plugging these into $\ell(\theta)$:

$$\begin{aligned} \ell(\theta) &= \sum_{i=1}^n\sum_{j=1}^K\mathbb I(y_i=j) \log p(y_i = j|x_i;\theta)\\ &= \sum_{i=1}^n\left(\sum_{j=1}^{K-1}\mathbb I(y_i=j) \log \frac{\exp(w_j^Tx_i+b_j)}{1+\sum_{k=1}^{K-1}\exp(w_k^Tx_i+b_k)}+\mathbb I(y_i=K) \log \frac{1}{1+\sum_{k=1}^{K-1}\exp(w_k^Tx_i+b_k)}\right) \end{aligned}$$

Using the softmax function:

$$p(y=j|x)=\sigma_j (z) = \frac{\exp(z_j)}{\sum_{k=1}^{K}\exp(z_k)}$$

$$\begin{aligned} \ell(\theta) &= \sum_{i=1}^n\sum_{j=1}^K\mathbb I(y_i=j) \log \frac{\exp(z_j^{(i)})}{\sum_{k=1}^{K}\exp(z_k^{(i)})}\\ &= \sum_{i=1}^n\sum_{j=1}^K\mathbb I(y_i=j) \log \frac{\exp(w_j^Tx_i+b_j)}{\sum_{k=1}^{K}\exp(w_k^Tx_i+b_k)} \end{aligned}$$

Derivatives: consider a single sample $(x,y)$ and suppress the sample index, writing $z_k = w_k^Tx+b_k$ for the logits and $p_k=\sigma_k(z)$ for the softmax outputs. The softmax Jacobian is

$$\begin{aligned}\frac{\partial p_i}{\partial z_j} &=\begin{cases} p_i(1-p_j) & i=j\\ -p_ip_j &i\neq j\end{cases} \\ &=p_i(\mathbb I(i=j)-p_j)\end{aligned}$$
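This Jacobian can be checked numerically (a minimal sketch; `z` is a random logit vector and the tolerance is arbitrary):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(3)
z = rng.standard_normal(5)
p = softmax(z)

# analytic Jacobian: J[i, j] = p_i (1{i=j} - p_j)
J = np.diag(p) - np.outer(p, p)

# finite-difference Jacobian, one column per z_j
eps = 1e-6
J_num = np.column_stack([(softmax(z + eps * e) - softmax(z - eps * e)) / (2 * eps)
                         for e in np.eye(5)])

print(np.allclose(J, J_num, atol=1e-6))  # expected: True
```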

For the logits themselves:

$$\frac{\partial z_k}{\partial w_k} = x \qquad \frac{\partial z_k}{\partial b_k} = 1$$

The log-likelihood contribution of this sample is $\sum_{j=1}^K\mathbb I(y=j)\log p_j$, and its derivative with respect to $z_k$ is

$$\begin{aligned}\frac{\partial}{\partial z_k}\sum_{j=1}^K\mathbb I(y=j)\log p_j &=\sum_{j=1}^K\mathbb I(y=j)\,\frac{1}{p_j}\,p_j\left(\mathbb I(j=k)-p_k\right)\\ &=\sum_{j=1}^K\mathbb I(y=j)\left(\mathbb I(j=k)-p_k\right)\\ &=\mathbb I(y=k)-p_k \end{aligned}$$

Summing over the samples (restoring the sample index, with $p_k^{(i)}=\sigma_k(z^{(i)})$) and using $\frac{\partial z_k}{\partial w_k}=x$, $\frac{\partial z_k}{\partial b_k}=1$:

$$\nabla_{w_k}\ell = \sum_{i=1}^n\left(\mathbb I(y_i=k)-p_k^{(i)}\right)x_i\qquad \nabla_{b_k}\ell = \sum_{i=1}^n\left(\mathbb I(y_i=k)-p_k^{(i)}\right)$$

A NumPy implementation of the above:

```python
import numpy as np

def softmax(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))   # shift by row max for numerical stability
    return E / E.sum(axis=1, keepdims=True)

def loss_function(X, Y, W, b):
    P = softmax(X @ W + b)                          # Y is one-hot, shape (n, K)
    return -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))   # cross-entropy loss

def compute_grad(X, Y, W, b):
    P = softmax(X @ W + b)
    return X.T @ (P - Y) / len(X), (P - Y).mean(axis=0)       # grad_W, grad_b

def gradient_descent(X, Y, lr=0.1, eps=1e-6, max_iter=10000):
    W, b = np.zeros((X.shape[1], Y.shape[1])), np.zeros(Y.shape[1])
    for _ in range(max_iter):
        grad_W, grad_b = compute_grad(X, Y, W, b)
        W, b = W - lr * grad_W, b - lr * grad_b
        if lr * max(np.abs(grad_W).max(), np.abs(grad_b).max()) < eps:  # |new - old| < eps
            break
    return W, b

def predict(X, W, b):
    return np.argmax(softmax(X @ W + b), axis=1)    # class index with the largest probability
```
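For example, the functions above can be exercised on a small synthetic three-class dataset (the blob data below is made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# three Gaussian blobs in 2D (synthetic data)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in [(0, 0), (2, 0), (1, 2)]])
y = np.repeat([0, 1, 2], 50)
Y = np.eye(3)[y]                                   # one-hot labels, shape (150, 3)

W, b = gradient_descent(X, Y, lr=0.5)
print("training accuracy:", (predict(X, W, b) == y).mean())
```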

3.2 Multi-Class LDA

Suppose we have a dataset $\{(\mathbf x_i,y_i)\}$ with $y_i\in \{1,\ldots,K\}$. Class $k$ has $N_k$ samples, denoted $X_k$.
Step 1: Pre-Computing
Mean Vectors:

$$\mu_k = \frac{1}{N_k}\sum_{i=1}^{N_k}x_i$$

where $N_k$ is the number of samples in class $k$.

$$\mu = \frac{1}{N}\sum_{i=1}^{N}x_i$$

where $N$ is the total number of samples.

Within-Class Scatter Matrix:

$$S_k = \sum_{x\in X_k}(x-\mu_k)(x-\mu_k)^T$$

$$S_W=\sum_{k=1}^K S_k$$

Between-Class Scatter Matrix

$$S_B = \sum_{k=1}^{K}N_k(\mu_k-\mu)(\mu_k-\mu)^T$$

Step 2: Solve Generalized Eigenvalue Problem

$$S_W^{-1}S_B \mathbf w=\lambda \mathbf w$$

Solve it numerically (e.g., with a standard eigensolver).

Step 3: Dimensionality Reduction
Select the top $d'\leq K-1$ eigenvectors (by eigenvalue) to form the transformation matrix $W$, projecting the data into a lower-dimensional space.

$$Z=W^TX$$

Step 4: Classification
Apply any classification method suited to the lower-dimensional representation.
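A compact NumPy sketch of Steps 1-3 (the function name `lda_fit` and the row-per-sample layout are assumptions of this sketch; `n_components` plays the role of $d'$):

```python
import numpy as np

def lda_fit(X, y, n_components):
    """Multi-class LDA. X: (N, d) with one sample per row, y: integer labels 0..K-1."""
    mu = X.mean(axis=0)                                    # global mean
    d = X.shape[1]
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for k in np.unique(y):
        X_k = X[y == k]
        mu_k = X_k.mean(axis=0)
        S_W += (X_k - mu_k).T @ (X_k - mu_k)               # within-class scatter S_k
        diff = (mu_k - mu)[:, None]
        S_B += len(X_k) * (diff @ diff.T)                  # between-class scatter term
    # generalized eigenvalue problem  S_W^{-1} S_B w = lambda w
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(eigvals.real)[::-1]                 # largest eigenvalues first
    return eigvecs[:, order[:n_components]].real           # columns are the projection directions

# Projection (rows are samples): Z = X @ W_proj, equivalent to Z^T = W_proj^T X^T
# W_proj = lda_fit(X, y, n_components=K - 1)
```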

3.3 In Multi-Class LDA, prove that $\mathrm{rank}(S_W^{-1}S_B)\leq K-1$

$$\mathrm{rank}(S_W^{-1}S_B) \leq \min \{\mathrm{rank}(S_W^{-1}),\mathrm{rank}(S_B)\}$$

For the Between-Class Scatter Matrix $S_B$:

$$\begin{aligned} \mathrm{rank}(S_B) &= \mathrm{rank}\left(\sum_{k=1}^{K}N_k(\mu_k-\mu)(\mu_k-\mu)^T\right)\\ &\leq \sum_{k=1}^{K}\mathrm{rank}\left((\mu_k-\mu)(\mu_k-\mu)^T\right) \end{aligned}$$

notice that:

$$\mathrm{rank}\left((\mu_k-\mu)(\mu_k-\mu)^T\right) \leq 1$$

$$\mu = \frac{1}{N}\sum_{k=1}^K N_k\mu_k$$

$$\sum_{k=1}^K N_k(\mu-\mu_k)=0$$

Thus the vectors $\{\mu_k-\mu\}$ are linearly dependent, so at most $K-1$ of them are linearly independent. Since the column space of $S_B$ is contained in $\mathrm{span}\{\mu_1-\mu,\ldots,\mu_K-\mu\}$, it follows that $\mathrm{rank}(S_B)\leq K-1$. Therefore:

$$\mathrm{rank}(S_W^{-1}S_B)\leq \mathrm{rank}(S_B) \leq K-1$$
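A quick numerical illustration of this bound with randomly generated data ($K=4$ classes in $\mathbb{R}^{10}$; all numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, n_per_class = 4, 10, 30
mus = rng.standard_normal((K, d))                              # class means
X = np.vstack([rng.normal(mus[k], 1.0, size=(n_per_class, d)) for k in range(K)])
y = np.repeat(np.arange(K), n_per_class)

mu = X.mean(axis=0)
S_B = sum(n_per_class * np.outer(X[y == k].mean(axis=0) - mu,
                                 X[y == k].mean(axis=0) - mu) for k in range(K))

print(np.linalg.matrix_rank(S_B, tol=1e-8))   # 3, i.e. at most K - 1
```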

3.4 Convexity of functions

Definition: for all $x_1, x_2\in D$ and $\lambda \in (0,1)$, we have:

$$f(\lambda x_1+(1-\lambda) x_2)\leq \lambda f(x_1)+(1-\lambda) f(x_2)$$

$$y=\frac{1}{1+\exp(-(w^Tx+b))}$$

$$\nabla_w y = \frac{x\, e^{-(w^Tx+b)}}{\left(1+e^{-(w^Tx+b)}\right)^2}$$

Write $e = e^{-(w^Tx+b)}$. Then

$$\begin{aligned}\nabla_w^2 y &= \frac{-e(1+e)^2 +2e^2(1+e)}{(1+e)^4}\,xx^T\\ &=\frac{e(e-1)}{(1+e)^3}\,xx^T \end{aligned}$$

The coefficient $e(e-1)$ changes sign with $w$ (negative when $w^Tx+b>0$, positive when $w^Tx+b<0$), so the Hessian is negative semidefinite for some $w$ and positive semidefinite for others; hence $y$ is not convex in $w$.

$$\ell(\beta) = \sum_{i=1}^m \left( -y_i\beta^T\hat x_i + \ln \left(1+e^{\beta^T\hat x_i}\right)\right)$$

$$\nabla_\beta \ell = \sum_{i=1}^m \left(-y_i +\frac{e_i}{1+e_i}\right)\hat x_i$$

where $e_i=e^{\beta^T\hat x_i}$.

$$\begin{aligned}\nabla_\beta^2 \ell &= \sum_{i=1}^m \frac{e_i(1+e_i)-e_i^2}{(1+e_i)^2}\,\hat x_i\hat x_i^T\\ &=\sum_{i=1}^m \frac{e_i}{(1+e_i)^2}\,\hat x_i\hat x_i^T \end{aligned}$$

Each term is a nonnegative multiple of $\hat x_i\hat x_i^T$, so the Hessian is positive semidefinite for every $\beta$; hence $\ell(\beta)$ is convex.
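Both claims can also be probed numerically by inspecting Hessian eigenvalues; below is a sketch where `num_hessian` is a small finite-difference helper written only for this check, and the data are random:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 20, 3
X_hat = np.hstack([rng.standard_normal((m, d)), np.ones((m, 1))])   # rows are x_i^T with a bias term
y = rng.integers(0, 2, m)

sigmoid_y = lambda beta: 1.0 / (1.0 + np.exp(-X_hat[0] @ beta))      # y evaluated at a single sample
ell = lambda beta: np.sum(-y * (X_hat @ beta) + np.log1p(np.exp(X_hat @ beta)))  # l(beta) above

def num_hessian(f, beta, eps=1e-4):
    """Finite-difference Hessian (helper written just for this check)."""
    n = len(beta)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = eps * np.eye(n)[i], eps * np.eye(n)[j]
            H[i, j] = (f(beta + ei + ej) - f(beta + ei - ej)
                       - f(beta - ei + ej) + f(beta - ei - ej)) / (4 * eps ** 2)
    return H

beta = rng.standard_normal(d + 1)
print(np.linalg.eigvalsh(num_hessian(sigmoid_y, beta)))  # nonzero eigenvalue may be of either sign
print(np.linalg.eigvalsh(num_hessian(ell, beta)))        # all nonnegative (up to numerical noise)
```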

HW 4

4.1

