Machine Learning

Notes on Machine Learning (《机器学习》, Zhou Zhihua, Tsinghua University Press).

0 Maths

Matrix Calculus: 1, 2
$$\bold x \in \mathbb{R}^n,\; \bold y \in \mathbb{R}^m,\; X \in \mathbb{R}^{m\times n},\; \bold f : \mathbb{R}^n\to \mathbb{R}^m,\; A,B,C\in \mathbb{R}^{n\times n}$$

$$\frac{\partial\bold y}{\partial\bold x} \in \mathbb{R}^{m\times n}$$

$$d\bold f = \frac{\partial \bold f}{\partial \bold x}\,d\bold x$$

$$J=\nabla_{\bold x}\bold f = \frac{d\bold f}{d\bold x}\in \mathbb{R}^{m\times n}$$

$$d\,\mathrm{tr}(A)=\mathrm{tr}(dA)$$

$$d(AB)=d(A)\,B+A\,d(B)$$

$$dA^T = (dA)^T$$

$$df = \sum_i\sum_j \frac{\partial f}{\partial x_{ij}}\,dx_{ij} = \mathrm{tr}\!\left(\left(\frac{\partial f}{\partial X}\right)^T dX\right)\qquad (f \text{ scalar})$$

$$\frac{\partial\,\mathrm{tr}(f(x))}{\partial x} = \mathrm{tr}\!\left(\frac{\partial f(x)}{\partial x}\right)$$

$$\frac{\partial\,\mathrm{tr}(X^TX)}{\partial X}= 2X$$

$$\mathrm{tr}(ABC) = \mathrm{tr}(BCA)=\mathrm{tr}(CAB)$$

$$\mathrm{tr}(A^TB)=\sum_i\sum_j A_{ij}B_{ij}$$

Frobenius norm:

$$\|X\|_F = \left(\mathrm{tr}(X^TX)\right)^{1/2}=\left(\sum_{i=1}^m\sum_{j=1}^n X_{ij}^2\right)^{1/2}$$
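A quick numerical check of the last two identities (a sketch using NumPy; the matrices are arbitrary random examples):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((3, 4))
X = rng.standard_normal((3, 4))

# tr(A^T B) equals the elementwise sum of A_ij * B_ij
assert np.isclose(np.trace(A.T @ B), np.sum(A * B))

# Frobenius norm: sqrt(tr(X^T X)) equals the root of the sum of squared entries
assert np.isclose(np.sqrt(np.trace(X.T @ X)), np.linalg.norm(X, "fro"))
```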

Gradient of the least-squares loss in a linear model: $\bold y = \Phi \theta$, with $\theta\in \mathbb{R}^D$, $\Phi \in \mathbb{R}^{N\times D}$, $\bold y \in \mathbb{R}^N$.
Define

$$e(\theta) = \bold y - \Phi \theta\qquad L(e) = \|e\|^2$$

$$\frac{\partial L}{\partial \theta} = -2(\bold y -\Phi \theta)^T\Phi \in \mathbb{R}^{1\times D}$$
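A minimal sketch (NumPy, hypothetical random data) that checks this gradient against a finite-difference approximation, and then recovers the optimal $\theta$ by setting the gradient to zero (the normal equations):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 3
Phi = rng.standard_normal((N, D))
y = rng.standard_normal(N)
theta = rng.standard_normal(D)

def loss(t):
    e = y - Phi @ t
    return e @ e                      # L(e) = ||e||^2

# Analytic gradient: dL/dtheta = -2 (y - Phi theta)^T Phi
grad = -2 * (y - Phi @ theta) @ Phi

# Central finite-difference check, one component at a time
eps = 1e-6
num_grad = np.array([
    (loss(theta + eps * np.eye(D)[i]) - loss(theta - eps * np.eye(D)[i])) / (2 * eps)
    for i in range(D)
])
assert np.allclose(grad, num_grad, atol=1e-4)

# Setting the gradient to zero gives the normal equations: Phi^T Phi theta = Phi^T y
theta_star = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
```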

3 Linear Models

3.1

A linear model learns a weight vector $\bold w$ and a bias $b$:

$$f(\bold x)= \bold w^T\bold x+b$$

Linear models have good comprehensibility.

3.2 Linear Regression

Given a dataset $D = \{(\bold x_1,y_1),\ldots,(\bold x_m,y_m)\}$.
Minimizing the MSE is essentially solving a least-squares problem.
For the univariate case:

$$f(x_i)=wx_i+b\quad \text{such that}\; f(x_i)\simeq y_i$$

The residual is $y_i - f(x_i) = y_i-wx_i-b$.
Parameter estimation: find $w, b$ that minimize the MSE:

$$E_{(w,b)}=\sum_{i=1}^m(y_i-wx_i-b)^2$$

derivatives:

$$\frac{\partial E_{(w,b)}}{\partial w}=2\left(w\sum_{i=1}^mx_i^2-\sum_{i=1}^m(y_i-b)x_i\right)\\ \frac{\partial E_{(w,b)}}{\partial b}=2\left(mb-\sum_{i=1}^m(y_i-wx_i)\right)$$

Setting the derivatives to zero gives the closed-form solution:

$$w = \frac{\sum_{i=1}^m y_i(x_i-\bar x)}{\sum_{i=1}^m x_i^2-\frac{1}{m}\left(\sum_{i=1}^m x_i\right)^2}$$

$$b=\frac{1}{m}\sum_{i=1}^m (y_i-wx_i)$$
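A small sketch of these closed-form formulas on synthetic data (NumPy; the true parameters below are made up), cross-checked against np.polyfit:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
x = rng.uniform(0, 10, m)
y = 2.5 * x + 1.0 + rng.normal(0, 0.5, m)   # synthetic data with w=2.5, b=1.0

x_bar = x.mean()
w = np.sum(y * (x - x_bar)) / (np.sum(x ** 2) - np.sum(x) ** 2 / m)
b = np.mean(y - w * x)

# Cross-check against NumPy's degree-1 least-squares polynomial fit
w_ref, b_ref = np.polyfit(x, y, 1)
assert np.allclose([w, b], [w_ref, b_ref])
```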

For the multivariate case, see the book, page 56.
Generalized linear model: plug in a suitable link function $g$:

$$y=g^{-1}(\bold w^T\bold x+b)$$
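For example, choosing $g=\ln$ gives log-linear regression, which can be fit by ordinary least squares on $\ln y$. A minimal sketch (NumPy, synthetic positive targets):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 2, 100)
y = np.exp(1.5 * x + 0.3) * rng.lognormal(0, 0.05, 100)  # synthetic positive targets

# Log-linear regression: with g = ln, the model is ln(y) = w*x + b
w, b = np.polyfit(x, np.log(y), 1)
y_hat = np.exp(w * x + b)             # predictions go back through g^{-1} = exp
```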

3.3 Logit Regression

Used as a classifier; applies the sigmoid function.
Logit: the log odds, the logarithm of the ratio of the probability that a sample is positive to the probability that it is negative.
Example on a binary classification problem. Use the sigmoid function:

$$y=\frac{1}{1+e^{-z}}$$

Plug in the model's output $z = f(\bold x) = \bold w^T\bold x + b$:

$$y=\frac{1}{1+e^{-(\bold w^T\bold x + b)}}$$

Rearranging and taking the log:

$$\ln \frac{y}{1-y}=\bold w^T\bold x+b$$

The odds here are:

$$\frac{p(y=1|\bold x)}{p(y=0|\bold x)} = \frac{y}{1-y}$$

and the log odds (logit) is:

$$\ln\frac{p(y=1|\bold x)}{p(y=0|\bold x)}=\ln \frac{y}{1-y}=\bold w^T\bold x+b$$

Using maximum likelihood estimation:

$$L(\bold w, b) = \prod_{i=1}^m p(y_i|\bold x_i)$$

Taking the log and negating turns this into a minimization problem (minimizing the negative log-likelihood):

$$\ell(\bold w, b) = -\sum_{i=1}^m \ln p(y_i|\bold x_i)$$
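There is no closed-form solution, so the negative log-likelihood is minimized numerically. A minimal sketch using plain gradient descent (NumPy; the synthetic data, learning rate, and iteration count are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m, d = 200, 2
X = rng.standard_normal((m, d))
y = (X @ np.array([2.0, -1.0]) + 0.5 > 0).astype(float)   # synthetic 0/1 labels

w, b = np.zeros(d), 0.0
lr = 0.1
for _ in range(2000):
    p = sigmoid(X @ w + b)            # p(y=1 | x) under the current parameters
    # gradient of the negative log-likelihood
    w -= lr * (X.T @ (p - y)) / m
    b -= lr * np.sum(p - y) / m

p = np.clip(sigmoid(X @ w + b), 1e-12, 1 - 1e-12)
nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(w, b, nll)
```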

3.4 Linear Discriminant Analysis (LDA)

For a given training set, train the algorithm so that the projections of similar samples land close to each other and the projections of different samples land far apart. It projects samples onto a single line in the vector space.
Given a binary classification dataset $D=\{(\bold x_i,y_i)\}_{i=1}^m,\; y_i\in \{0,1\}$, let $X_i$, $\bold\mu_i$, $\bold\Sigma_i$ be the sample set, mean, and covariance matrix of class $i$, respectively.

Seems difficult, to be continued…

4 Decision Tree

4.1

A divide-and-conquer method

```
# input: training set D, attribute set A
TreeGenerate(D, A):
    generate node
    if all samples in D belong to the same class C:
        mark node as a class-C leaf; return node
    if A is empty or all samples in D take the same values on A:
        mark node as a leaf labelled with the majority class of D; return node
    a_star = the best splitting attribute chosen from A
    for each value a_star_v of a_star:
        create a branch of node; let D_v = samples in D taking value a_star_v on a_star
        if D_v is empty:
            mark the branch as a leaf labelled with the majority class of D
        else:
            attach TreeGenerate(D_v, A \ {a_star}) to the branch
    return node
```

4.2

How do we choose the best splitting attribute? Generally, we want the purity of each branch to be as high as possible, i.e. each branch should contain samples of the same class.
Information entropy quantifies the purity of a sample set. Suppose the proportion of the $k$-th class in dataset $D$ is $p_k$; the information entropy of $D$ is defined by:

$$\mathrm{Ent}(D)= - \sum_{k=1}^{|\mathcal Y|}p_k\log_2 p_k$$

The lower the entropy, the higher the purity.
Information gain:

$$\mathrm{Gain}(D,a) = \mathrm{Ent}(D)- \sum_{v=1}^V \frac{|D^v|}{|D|}\mathrm{Ent}(D^v)$$

The larger the gain, the greater the improvement in purity obtained by splitting on attribute $a$.
In practice, the gain is biased towards attributes with many possible values. To correct this unevenness, Quinlan introduced the gain ratio:

$$\mathrm{Gain\_Ratio}(D,a) = \frac{\mathrm{Gain}(D,a)}{\mathrm{IV}(a)}\qquad \mathrm{IV}(a)=-\sum_{v=1}^V\frac{|D^v|}{|D|}\log_2\frac{|D^v|}{|D|}$$

$\mathrm{IV}(a)$ is the intrinsic value of attribute $a$: the more possible values attribute $a$ has, the larger $\mathrm{IV}(a)$ will be.
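A small sketch of entropy, gain, and gain ratio (NumPy; the toy label and attribute arrays are hypothetical):

```python
import numpy as np

def entropy(labels):
    # Ent(D) = -sum_k p_k * log2(p_k)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain(labels, attr_values):
    # Gain(D, a) = Ent(D) - sum_v |D^v|/|D| * Ent(D^v)
    g = entropy(labels)
    for v in np.unique(attr_values):
        mask = attr_values == v
        g -= mask.mean() * entropy(labels[mask])
    return g

def gain_ratio(labels, attr_values):
    # IV(a) = -sum_v |D^v|/|D| * log2(|D^v|/|D|)
    _, counts = np.unique(attr_values, return_counts=True)
    p = counts / counts.sum()
    iv = -np.sum(p * np.log2(p))
    return gain(labels, attr_values) / iv

labels = np.array(["yes", "yes", "no", "no", "yes", "no"])
color = np.array(["green", "dark", "green", "dark", "dark", "green"])
print(gain(labels, color), gain_ratio(labels, color))
```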

The Classification and Regression Tree (CART) uses the Gini index to choose splitting attributes.

$$\mathrm{Gini}(D)= \sum_{k=1}^{|\mathcal Y|}\sum_{k'\neq k}p_k p_{k'}=1-\sum_{k=1}^{|\mathcal Y|}p_k^2$$

The lower the Gini value, the higher the purity.
For attribute $a$, the Gini index is:

$$\mathrm{Gini\_Index}(D,a)=\sum_{v=1}^V\frac{|D^v|}{|D|}\mathrm{Gini}(D^v)$$
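The Gini versions follow the same pattern (again a NumPy sketch with hypothetical toy arrays):

```python
import numpy as np

def gini(labels):
    # Gini(D) = 1 - sum_k p_k^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_index(labels, attr_values):
    # Gini_Index(D, a) = sum_v |D^v|/|D| * Gini(D^v)
    g = 0.0
    for v in np.unique(attr_values):
        mask = attr_values == v
        g += mask.mean() * gini(labels[mask])
    return g

labels = np.array(["yes", "yes", "no", "no", "yes", "no"])
color = np.array(["green", "dark", "green", "dark", "dark", "green"])
print(gini(labels), gini_index(labels, color))
```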

4.3 Pruning

Pruning is a method to counter overfitting. There are two approaches: pre-pruning and post-pruning.
Pre-pruning prunes while generating the tree: if a new branch has a negative effect on predictive performance, it is pruned. Post-pruning prunes after the tree is fully generated: if a branch does better when pruned, it gets pruned.

4.4 Continuous Properties and Missing Values

6 Support Vector Machine

6.1

Using a hyperplane to classify datasets.
Support vectors are the points that lie on the margin of the hyperplane; they are the points that are hardest to classify.
Hyperplane: $\bold w^T\bold x+b=0$
The distance of a point to the hyperplane is

$$r = \frac{|\bold w^T\bold x+b|}{\|\bold w\|}$$

The margin:

$$\gamma=\frac{2}{\|\bold w\|}$$

We want to maximize the margin, which amounts to minimizing $\|\bold w\|^2$. This is a convex quadratic programming problem:

$$\min_{\bold w,b}\ \frac{1}{2}\|\bold w\|^2\quad \text{s.t.}\ y_i(\bold w^T\bold x_i+b)\geq 1,\quad i=1,\ldots,m$$

6.2 KKT

Consider the problem: $\min_x f(x)\ \text{s.t.}\ g_i(x)\leq 0,\ i=1,\ldots,m;\quad h_j(x) = 0,\ j=1,\ldots,n$.
Using the Lagrangian:

$$L(x,\bold\mu,\bold\lambda) = f(x) + \sum_i \mu_i g_i(x)+ \sum_j\lambda_j h_j(x)$$

KKT conditions:

  1. Stationarity $$\nabla_x L=\nabla f(x^*) + \sum_{i=1}^{m} \mu_i \nabla g_i(x^*) + \sum_{j=1}^{n} \lambda_j \nabla h_j(x^*) = 0$$
  2. Primal feasibility $$g_i(x^*)\leq 0\;\forall i\qquad h_j(x^*) = 0\;\forall j$$
  3. Dual feasibility $$\mu_i\geq 0\;\forall i$$
  4. Complementary slackness $$\mu_i g_i(x^*)= 0\;\forall i$$

Using the Lagrangian and applying the KKT conditions, the SVM problem is transformed into its dual form:

$$\max_{\bold\alpha}\, \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m\alpha_i\alpha_j y_i y_j\bold x_i^T\bold x_j\\ \text{s.t.}\;\sum_{i=1}^m \alpha_i y_i=0,\quad \alpha_i\geq 0,\ i=1,\ldots,m$$

The stationarity conditions are applied during the derivation.
The dual problem is typically solved with the SMO (Sequential Minimal Optimization) algorithm.

6.3 Kernel Function

If the problem cannot be classified in a low-dimensional space, there always exists a function $\phi(\bold x)$ that maps the samples to a higher-dimensional space in which they can be separated by a hyperplane. Normally a Hilbert space is used. A kernel function computes the similarity (the inner product) between two data points in the higher-dimensional feature space without explicitly computing the coordinates of those points in that space.

$$\kappa(\bold x_i,\bold x_j) = \phi(\bold x_i)^T \phi(\bold x_j)$$
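For intuition: on 2-D inputs the degree-2 polynomial kernel $\kappa(\bold x,\bold z)=(\bold x^T\bold z)^2$ equals an explicit inner product with the feature map $\phi(\bold x)=(x_1^2,\sqrt 2\,x_1x_2,x_2^2)$, so the kernel never has to build $\phi$ explicitly. A quick NumPy check:

```python
import numpy as np

def phi(x):
    # explicit feature map for the degree-2 polynomial kernel in 2-D
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def kappa(x, z):
    # kernel value computed without ever forming phi
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
assert np.isclose(kappa(x, z), phi(x) @ phi(z))
```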

6.4 Soft Margins

We allow the SVM to make some mistakes.

$$\min_{\bold w, b, \bold\xi} \quad \frac{1}{2} \|\bold w\|^2 + C \sum_{i=1}^{m} \xi_i$$
Subject to:
$$y_i (\bold w^T \bold x_i + b) \geq 1 - \xi_i \quad \text{for all } i$$
$$\xi_i \geq 0 \quad \text{for all } i$$

where $\bold w$ is the weight vector, $b$ is the bias term, $C$ is the regularization parameter, and $\xi_i$ are the slack variables.
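A minimal soft-margin example using scikit-learn's SVC, which solves the dual with an SMO-style algorithm (the synthetic data and the value of C below are arbitrary):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Smaller C allows more slack (wider margin); larger C penalizes margin violations
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.coef_, clf.intercept_)      # w and b of the separating hyperplane
print(clf.support_vectors_.shape)     # support vectors found by the solver
```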

Given 3 coins A, B, C, whose probabilities of landing heads are $\pi, p, q$ respectively.
Test: flip A first, then flip
$$\begin{cases} B & \text{if } A \text{ is heads} \\ C & \text{if } A \text{ is tails} \end{cases}$$

Proj 7 proof, Chebyshev points: $$\|f-P_n\|\leq c\rho^{-n}$$ and $$\mathrm{Lebesgue}_n(X)\leq c\log n$$

