Set the derivatives to 0 to find the closed-form solution:
$$w = \frac{\sum_{i=1}^{m} y_i (x_i - \bar{x})}{\sum_{i=1}^{m} x_i^2 - \frac{1}{m}\left(\sum_{i=1}^{m} x_i\right)^2}$$
$$b = \frac{1}{m} \sum_{i=1}^{m} (y_i - w x_i)$$
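A minimal numpy sketch of these two closed-form expressions, assuming a small made-up 1-D dataset (the data values are purely illustrative):

```python
import numpy as np

# Hypothetical 1-D data (x_i, y_i); any small arrays work here.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
m = len(x)

x_bar = x.mean()
# w = sum(y_i * (x_i - x_bar)) / (sum(x_i^2) - (1/m) * (sum(x_i))^2)
w = np.sum(y * (x - x_bar)) / (np.sum(x**2) - (x.sum() ** 2) / m)
# b = (1/m) * sum(y_i - w * x_i)
b = np.mean(y - w * x)
print(w, b)
```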
See the multivariate case on book page 56.
Generalized linear model: plug in a suitable link function g (for example, g(·) = ln(·) gives log-linear regression):
$$y = g^{-1}(w^T x + b)$$
3.3 Logit Regression
Used as a classifier. Apply the sigmoid function. Logit: log odds, the logarithm of the ratio of the probability that a sample is positive to the probability that it is negative.
Example on a binary classification problem. Use the sigmoid function:
$$y = \frac{1}{1 + e^{-z}}$$
Plug in the linear model's output $z = f(x) = w^T x + b$:
$$y = \frac{1}{1 + e^{-(w^T x + b)}}$$
Take the log:
$$\ln \frac{y}{1-y} = w^T x + b$$
The odds here are:
$$\frac{p(y=1 \mid x)}{p(y=0 \mid x)} = \frac{y}{1-y}$$
The log odds (logit) is:
$$\ln \frac{p(y=1 \mid x)}{p(y=0 \mid x)} = \ln \frac{y}{1-y} = w^T x + b$$
Using Maximum Likelihood Estimation:
$$L(w, b) = \prod_{i=1}^{m} p(y_i \mid x_i)$$
Take the log and negate to obtain a minimization problem (minimizing the negative log-likelihood):
$$\ell(w, b) = -\sum_{i=1}^{m} \ln p(y_i \mid x_i)$$
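A small numpy sketch of the sigmoid and this negative log-likelihood, assuming labels $y_i \in \{0, 1\}$; the toy arrays are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(w, b, X, y):
    # p(y_i = 1 | x_i) = sigmoid(w^T x_i + b), for y_i in {0, 1}
    p = sigmoid(X @ w + b)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical toy data
X = np.array([[0.5, 1.0], [1.5, -0.5], [-1.0, -1.0], [-0.5, 0.5]])
y = np.array([1, 1, 0, 0])
print(neg_log_likelihood(np.zeros(2), 0.0, X, y))  # equals m * ln(2) at w = 0, b = 0
```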
3.4 Linear Discriminant Analysis (LDA)
For a given training set, train the algorithm so that the projections of samples from the same class land close to each other, while the projections of samples from different classes land far apart. It projects samples onto a single line in vector space.
Given a binary classification dataset $D = \{(x_i, y_i)\}_{i=1}^{m}$ with $y_i \in \{0, 1\}$, let $X_i$, $\mu_i$, $\Sigma_i$ denote the sample set, mean vector, and covariance matrix of the class indexed by $i$.
Seems difficult, to be continued…
4 Decision Tree
4.1
A divide-and-conquer method
```
Input: training set D, property set A
TreeGenerate(D, A):
    generate node
    if all samples in D are in the same class C:
        node.type = C; return                          # leaf
    if A is empty or samples in D have the same values on A:
        node.type = most common class in D; return     # leaf
    choose the best splitting property a_star from A
    for each value a_star_v of a_star:
        node.create_branch()
        let D_v = samples in D taking value a_star_v on a_star
        if D_v is empty:
            branch is a leaf; branch.type = most common class in D
        else:
            TreeGenerate(D_v, A \ {a_star})
```
4.2
How do we choose the best splitting property? Generally, we want the purity of each branch to be as high as possible, that is, each branch should contain samples of the same class. Information entropy quantifies the purity of a sample set. Suppose the $k$-th class accounts for a proportion $p_k$ of the samples in $D$; the information entropy of $D$ is defined by:
$$\mathrm{Ent}(D) = -\sum_{k=1}^{|\mathcal{Y}|} p_k \log_2 p_k$$
The lower the entropy, the higher the purity. Information gain:
$$\mathrm{Gain}(D, a) = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|} \mathrm{Ent}(D^v)$$
The higher the gain, the greater the improvement in purity obtained by splitting on property a.
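A short Python sketch of $\mathrm{Ent}(D)$ and $\mathrm{Gain}(D, a)$, assuming $D$ is just a list of class labels and the split by property a is given as a list of subsets (the toy data is made up):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    # Ent(D) = -sum_k p_k * log2(p_k)
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain(labels, subsets):
    # Gain(D, a) = Ent(D) - sum_v |D^v|/|D| * Ent(D^v)
    m = len(labels)
    return entropy(labels) - sum(len(s) / m * entropy(s) for s in subsets)

D = ["yes", "yes", "no", "no", "yes", "no"]
# Hypothetical split of D by some property a into two subsets D^1, D^2
split = [["yes", "yes", "no"], ["no", "yes", "no"]]
print(entropy(D), gain(D, split))
```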
In practice, information gain is biased toward properties with more possible values. To correct this bias, Quinlan introduced the gain ratio:
$$\mathrm{Gain\_ratio}(D, a) = \frac{\mathrm{Gain}(D, a)}{\mathrm{IV}(a)}, \qquad \mathrm{IV}(a) = -\sum_{v=1}^{V} \frac{|D^v|}{|D|} \log_2 \frac{|D^v|}{|D|}$$
$\mathrm{IV}(a)$ is the intrinsic value of property a. The more possible values property a has, the larger $\mathrm{IV}(a)$ will be.
Classification and Regression Tree (CART) uses the Gini index to choose dividing properties.
$$\mathrm{Gini}(D) = \sum_{k=1}^{|\mathcal{Y}|} \sum_{k' \neq k} p_k p_{k'} = 1 - \sum_{k=1}^{|\mathcal{Y}|} p_k^2$$
The lower the Gini value, the higher the purity.
For property a, the Gini Index is:
$$\mathrm{Gini\_index}(D, a) = \sum_{v=1}^{V} \frac{|D^v|}{|D|} \mathrm{Gini}(D^v)$$
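The Gini versions, in the same style as the sketch above (the toy split is again assumed, not from the book):

```python
from collections import Counter

def gini(labels):
    # Gini(D) = 1 - sum_k p_k^2
    m = len(labels)
    return 1.0 - sum((c / m) ** 2 for c in Counter(labels).values())

def gini_index(labels, subsets):
    # Gini_index(D, a) = sum_v |D^v|/|D| * Gini(D^v)
    m = len(labels)
    return sum(len(s) / m * gini(s) for s in subsets)

D = ["yes", "yes", "no", "no", "yes", "no"]
split = [["yes", "yes", "no"], ["no", "yes", "no"]]
print(gini(D), gini_index(D, split))
```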
4.3 Pruning
Pruning is a method to counter overfitting. There are two kinds: pre-pruning and post-pruning.
Pre-pruning prunes while the tree is being generated: if a new branch does not improve predictions, it is not expanded. Post-pruning prunes after the tree is fully generated: if the tree performs better with a branch removed, that branch gets pruned.
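A rough illustration only, using scikit-learn's built-in machinery rather than the validation-set procedure described above: capping depth or leaf size acts as a kind of pre-pruning, while cost-complexity pruning via `ccp_alpha` removes branches after the full tree is grown.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# "Pre-pruning": stop growing early by capping depth / leaf size.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X, y)

# "Post-pruning": grow fully, then apply cost-complexity pruning (ccp_alpha > 0).
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02).fit(X, y)

print(pre_pruned.get_depth(), post_pruned.get_depth())
```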
4.4 Continuous Properties and Missing Values
6 Support Vector Machine
6.1
Using a hyperplane to classify datasets.
Support vectors are the points that lie on the margin around the hyperplane; they are the points that are hardest to classify.
Hyperplane: $w^T x + b = 0$
The distance of a point to the hyperplane is
$$r = \frac{|w^T x + b|}{\|w\|}$$
The Margin :
$$\gamma = \frac{2}{\|w\|}$$
We want to maximize the margin, which is equivalent to minimizing $\|w\|^2$. This is a convex quadratic programming problem.
$$\min_{w, b} \ \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y_i (w^T x_i + b) \geq 1, \quad i = 1, \ldots, m$$
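A sketch of this QP on a tiny separable toy set, solved with a generic constrained optimizer (scipy's SLSQP) rather than a dedicated SVM solver; the data points are made up:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical linearly separable toy set; labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

def objective(theta):            # theta = [w1, w2, b]
    w = theta[:2]
    return 0.5 * np.dot(w, w)    # (1/2) ||w||^2

constraints = [
    {"type": "ineq", "fun": (lambda th, i=i: y[i] * (X[i] @ th[:2] + th[2]) - 1.0)}
    for i in range(len(y))       # y_i (w^T x_i + b) - 1 >= 0
]

res = minimize(objective, x0=np.zeros(3), constraints=constraints, method="SLSQP")
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b, "margin =", 2 / np.linalg.norm(w))
```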
6.2 KKT
Consider the problem:
$$\min_x f(x) \quad \text{s.t.} \quad g_i(x) \leq 0, \ i = 1, \ldots, m, \qquad h_j(x) = 0, \ j = 1, \ldots, n$$
Using the Lagrangian:
$$L(x, \mu, \lambda) = f(x) + \sum_{i=1}^{m} \mu_i g_i(x) + \sum_{j=1}^{n} \lambda_j h_j(x)$$
The stationarity (zero-gradient) conditions are applied during the process. The resulting dual problem is typically solved with the SMO algorithm.
6.3 Kernel Function
If the problem cannot be separated in a low-dimensional space, there always exists a function $\phi(x)$ that maps the samples into a higher-dimensional space in which they can be separated by a hyperplane. Normally we use a Hilbert space. A kernel function computes the similarity (or inner product) between two data points in that higher-dimensional feature space without explicitly computing the coordinates of the points in that space.
$$\kappa(x_i, x_j) = \phi(x_i)^T \phi(x_j)$$
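A common concrete choice is the Gaussian (RBF) kernel; a quick numpy sketch, with the bandwidth $\sigma$ assumed to be 1.0:

```python
import numpy as np

def rbf_kernel(x_i, x_j, sigma=1.0):
    # kappa(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 * sigma^2))
    # Equals phi(x_i)^T phi(x_j) for an (infinite-dimensional) feature map phi.
    diff = np.asarray(x_i) - np.asarray(x_j)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma**2))

print(rbf_kernel([1.0, 0.0], [0.0, 1.0]))  # similarity of two toy points
```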
6.4 Soft Margins
We allow the SVM to make some mistakes.
$$\min_{w, b, \xi} \ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi_i$$
Subject to:
$$y_i (w^T x_i + b) \geq 1 - \xi_i \quad \text{for all } i$$
$$\xi_i \geq 0 \quad \text{for all } i$$
Where:
$w$ is the weight vector,
$b$ is the bias term,
$C$ is the regularization parameter,
$\xi_i$ are the slack variables.
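A hedged sketch of this soft-margin formulation in practice, using scikit-learn's `SVC` (which solves this problem via its dual); the toy data is made up and the value of C is a free choice:

```python
import numpy as np
from sklearn.svm import SVC

# Toy, slightly overlapping 2-D data (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (20, 2)), rng.normal(1.0, 1.0, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

# Smaller C tolerates more slack (more "mistakes"); larger C approaches the hard margin.
clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.support_vectors_.shape, clf.score(X, y))
```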
Given three coins A, B, C, the probabilities of flipping heads are π, p, q respectively.
Test: first flip coin A, then flip the coin selected by A's outcome:
$$\text{next coin} = \begin{cases} B & \text{if } A \text{ is heads} \\ C & \text{if } A \text{ is tails} \end{cases}$$
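A minimal simulation sketch of this generative process; the values of π, p, q and the helper name `flip_once` are assumptions for illustration. It checks the marginal probability of heads, $P(\text{heads}) = \pi p + (1 - \pi) q$:

```python
import numpy as np

rng = np.random.default_rng(0)
pi, p, q = 0.4, 0.6, 0.7  # hypothetical head probabilities for coins A, B, C

def flip_once():
    # Flip A first; its outcome decides whether B or C produces the observed result.
    a_heads = rng.random() < pi
    head_prob = p if a_heads else q
    return int(rng.random() < head_prob)

samples = [flip_once() for _ in range(100_000)]
print(np.mean(samples), pi * p + (1 - pi) * q)  # empirical vs. analytic P(heads)
```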
Proj 7
Proof for Chebyshev points:
$$\|f - P_n\| \leq c \rho^{-n}$$
and
$$\Lambda_n(X) \leq c \log n$$
where $\Lambda_n(X)$ is the Lebesgue constant of the interpolation nodes $X$.
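Not a proof, but a numerical sanity check of the second bound: estimate $\Lambda_n(X)$ for Chebyshev points by interpolating the unit vectors (which gives the Lagrange basis functions) and maximizing the Lebesgue function on a fine grid; the grid size and the values of n are arbitrary choices.

```python
import numpy as np
from scipy.interpolate import BarycentricInterpolator

def lebesgue_constant(nodes, n_eval=2000):
    # Lebesgue function: sum_k |l_k(x)| over the Lagrange basis l_k.
    xs = np.linspace(-1.0, 1.0, n_eval)
    L = np.zeros_like(xs)
    for k in range(len(nodes)):
        e_k = np.zeros(len(nodes))
        e_k[k] = 1.0
        L += np.abs(BarycentricInterpolator(nodes, e_k)(xs))
    return L.max()

for n in (5, 10, 20, 40):
    # Chebyshev points of the first kind on [-1, 1]
    cheb = np.cos((2 * np.arange(n + 1) + 1) * np.pi / (2 * n + 2))
    print(n, lebesgue_constant(cheb), np.log(n))
```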