Data Analysis - Supervised Learning

PCA

Consistency between the SVD-based computation and the covariance-matrix computation.
In the classical implementation of PCA, we explicitly compute the covariance matrix of the data. In this task, however, the PCA method is based on the SVD, which avoids the need to compute the covariance matrix explicitly.
Proof:
$X \in R^{d \times n}$ is the (centered) data matrix, where $d$ is the number of features and $n$ is the number of samples. The covariance matrix $C$ can be computed as follows:
$C = \frac{1}{n} X X^{T}$
We perform the singular value decomposition of $X$ :
$X = U Σ V^{T}$
Where $U \in R^{d \times d}$ is the left singular vector matrix, $Σ \in R^{d \times n}$ is the diagonal matrix of singular values, and $V \in R^{n \times n}$ is the right singular vector matrix.
Derivation:
Expand C using the SVD of X:
$C = \frac{1}{n} X X^{T} = \frac{1}{n} U Σ V^{T} V Σ^{T} U^{T} = \frac{1}{n} U Σ Σ^{T} U^{T}$
where we used the fact that $V^{T} V = I$ , the identity matrix.
Now, observe that:
- $Σ Σ^{T}$ is a diagonal matrix with the squares of the singular values on the diagonal.
- Therefore, $C$ can be expressed as:
$C = U (\frac{1}{n} Σ^{2}) U^{T}$
where,
- The columns of U are the eigenvectors of C
- The eigenvalues are the scaled squared singular values $\frac{σ_{i}^{2}}{n}$ , which correspond to the variance captured along each principal component.
Therefore, SVD-based PCA is mathematically equivalent to the classical PCA approach via covariance matrix eigen-decomposition.
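Note: the unbiased sample covariance divides by $n - 1$ rather than $n$, so $C$ should be
$C = \frac{1}{n - 1} X X^{T} .$
This only rescales the eigenvalues to $\frac{σ_{i}^{2}}{n - 1}$ and leaves the eigenvectors unchanged. Hereafter, we will use the covariance matrix $C = \frac{1}{n - 1} X X^{T}$ for the PCA implementation.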
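
The equivalence can be checked numerically. The following is a minimal sketch, assuming NumPy and randomly generated data (the sizes `d`, `n` and the seed are arbitrary choices for illustration); it compares the eigen-decomposition of $\frac{1}{n - 1} X X^{T}$ with the SVD of the centered $X$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 200                              # features, samples (columns are samples)
X = rng.normal(size=(d, n))
X = X - X.mean(axis=1, keepdims=True)      # center each feature

# Classical route: eigen-decomposition of the covariance matrix.
C = X @ X.T / (n - 1)
eigvals, eigvecs = np.linalg.eigh(C)       # eigh returns ascending eigenvalues
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# SVD route: the covariance matrix is never formed.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
svd_eigvals = s**2 / (n - 1)

# Eigenvectors are defined only up to sign, so compare |<u_i, e_i>|.
alignment = np.abs(np.sum(U * eigvecs, axis=0))
```

Both routes should give the same eigenvalues, and each column of `U` should match the corresponding covariance eigenvector up to sign.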
Detailed derivation of Kernel PCA.
Kernel PCA combines PCA with the kernel trick. We are still interested in using the PCA process to find the features $Z \in R^{k \times n}$ . However, the transformation from $X \in R^{d \times n}$ to $Z \in R^{k \times n}$ now becomes non-linear: a non-linear map first transforms $X \in R^{d \times n}$ to $ϕ (X) \in R^{D \times n}$ in a higher-dimensional space with $D > d$ , and linear PCA is then performed to transform $ϕ (X) \in R^{D \times n}$ to $Z \in R^{k \times n}$ . This kernel PCA process brings a major advantage:
- Since the calculation of $Z$ can be non-linear, and the dimensionality of $Z$ is now $k \in [1, D)$ with $D > d$ , we can search for solutions in a new space (not limited by the original dimensionality $d$ ), and such solutions may be linear.
For example, for a linearly inseparable dataset $X \in R^{d \times n}$ with low dimensionality, e.g., $d = 2$ , kernel PCA may make it possible to solve the classification task with a linear solution in the new space.
However, we would like to avoid explicitly computing the high-dimensional $ϕ (X)$ for the PCA decomposition. This can be done by combining the kernel function $K (x_{i}, x_{j}) = \langle ϕ (x_{i}), ϕ (x_{j}) \rangle$ with plain PCA, which yields the kernel PCA solution. Two different kernel functions will be explored here:
1. Homogeneous Polynomial kernel: $K (x_{i}, x_{j}) = (< x_{i}, x_{j} >)^{p}$ , where $p > 0$ is the polynomial degree.
2. Radial Basis Function (RBF) kernel: $K (x_{i}, x_{j}) = e^{- γ | | x_{i} - x_{j} | |_{2}^{2}}$ , where $γ = \frac{1}{2 σ^{2}}$ and $σ$ is the width or scale of a Gaussian distribution centered at $x_{j}$ .
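
The two kernels can be evaluated for a whole dataset at once. The sketch below assumes NumPy and the column-per-sample layout $X \in R^{d \times n}$ used above; the function names `polynomial_kernel` and `rbf_kernel` and the toy data are illustrative choices, not part of the original notes.

```python
import numpy as np

def polynomial_kernel(X, p=2):
    """Homogeneous polynomial kernel (<x_i, x_j>)^p; X has shape (d, n)."""
    return (X.T @ X) ** p

def rbf_kernel(X, gamma=1.0):
    """RBF kernel exp(-gamma * ||x_i - x_j||^2); X has shape (d, n)."""
    sq_norms = np.sum(X**2, axis=0)
    # ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 <x_i, x_j>
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X.T @ X)
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))  # clip tiny negatives

X = np.array([[0.0, 1.0, 2.0],
              [0.0, 1.0, 0.0]])            # d = 2 features, n = 3 samples
K_poly = polynomial_kernel(X, p=2)
K_rbf = rbf_kernel(X, gamma=0.5)
```

Both Gram matrices are $n \times n$ and symmetric, and the RBF Gram matrix has ones on its diagonal because every point is at distance zero from itself.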
Mathematically prove how we can compute the PC matrix.
First, we fix the notation:
- $X \in R^{d \times N}$ is the input matrix, where $d$ and $N$ are the number of the features and samples, respectively.
- $x_{i} \in R^{d}$ is the $i$ -th column vector for $X$ . Therefore, $X = [x_{1}, x_{2}, \dots, x_{N}]$ .
- $ϕ (\cdot)$ is a nonlinear transformation. $ϕ (\cdot) : R^{d} \to F$ .
- $ϕ (X) \in R^{D \times N}$ is the mapped matrix in a higher- or infinite-dimensional feature space $F$ .
- $ϕ (x_{i}) \in R^{D}$ is the $i$ -th column vector for $ϕ (X)$ . Therefore, $ϕ (X) = [ϕ (x_{1}), ϕ (x_{2}), \dots, ϕ (x_{N})]$ .
- $K \in R^{N \times N}$ is the Gram matrix, whose element $k_{i j}$ is given by the kernel function $K (x_{i}, x_{j}) =< ϕ (x_{i}), ϕ (x_{j}) >$
The first thing is to center the mapped matrix $ϕ (X)$ in the feature space $F$ , which is defined as follows:
$\begin{aligned} \tilde{ϕ} (x_{i}) & = ϕ (x_{i}) - \frac{1}{N} \sum_{j = 1}^{N} ϕ (x_{j}) \\ & = ϕ (x_{i}) - \frac{1}{N} ϕ (X) 1_{N} \end{aligned}$
where $1_{N} \in R^{N}$ is the vector of ones.
Then, the centered mapped matrix $\tilde{ϕ} (X)$ can be written as:
$\begin{aligned} \tilde{ϕ} (X) & = [\tilde{ϕ} (x_{1}), \tilde{ϕ} (x_{2}), \dots, \tilde{ϕ} (x_{N})] \\ & = [ϕ (x_{1}) - \frac{1}{N} ϕ (X) 1_{N}, ϕ (x_{2}) - \frac{1}{N} ϕ (X) 1_{N}, \dots, ϕ (x_{N}) - \frac{1}{N} ϕ (X) 1_{N}] \\ & = ϕ (X) - \frac{1}{N} ϕ (X) 1_{N} 1_{N}^{T} . \end{aligned}$
As in linear PCA, we take the SVD
$\tilde{ϕ} (X) = U Σ V^{T},$
where:
- $U \in R^{D \times D}$ is the left singular vector (orthonormal) matrix, whose column vectors are eigenvectors of $\tilde{ϕ} (X) \tilde{ϕ} (X)^{T}$ ,
- $Σ \in R^{D \times N}$ is the diagonal matrix of singular values, whose elements are ordered from largest to smallest, i.e., $σ_{1} \geq σ_{2} \geq \dots \geq σ_{D}$ , and
- $V \in R^{N \times N}$ is the right singular vector (orthonormal) matrix, whose column vectors are eigenvectors of $\tilde{ϕ} (X)^{T} \tilde{ϕ} (X)$ .
Notice that, writing $n = N$ for the number of samples, the covariance matrix $C$ of $\tilde{ϕ} (X)$ and the Gram matrix $K$ of the centered mapped samples are:
$C = \frac{1}{n - 1} \sum_{i = 1}^{n} \tilde{ϕ} (x_{i}) \tilde{ϕ} (x_{i})^{T} = \frac{1}{n - 1} \tilde{ϕ} (X) \tilde{ϕ} (X)^{T},$
$\begin{aligned} K & = \begin{bmatrix} \langle \tilde{ϕ} (x_{1}), \tilde{ϕ} (x_{1}) \rangle & \langle \tilde{ϕ} (x_{1}), \tilde{ϕ} (x_{2}) \rangle & \dots & \langle \tilde{ϕ} (x_{1}), \tilde{ϕ} (x_{N}) \rangle \\ \langle \tilde{ϕ} (x_{2}), \tilde{ϕ} (x_{1}) \rangle & \langle \tilde{ϕ} (x_{2}), \tilde{ϕ} (x_{2}) \rangle & \dots & \langle \tilde{ϕ} (x_{2}), \tilde{ϕ} (x_{N}) \rangle \\ ⋮ & ⋮ & ⋱ & ⋮ \\ \langle \tilde{ϕ} (x_{N}), \tilde{ϕ} (x_{1}) \rangle & \langle \tilde{ϕ} (x_{N}), \tilde{ϕ} (x_{2}) \rangle & \dots & \langle \tilde{ϕ} (x_{N}), \tilde{ϕ} (x_{N}) \rangle \end{bmatrix} \\ & = \tilde{ϕ} (X)^{T} \tilde{ϕ} (X), \end{aligned}$
and both $C$ and $K$ are symmetric matrices.
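Therefore, substituting the SVD gives:
$C = \frac{1}{n - 1} U Σ Σ^{T} U^{T} = \frac{1}{n - 1} U Σ^{2} U^{T},$
and
$K = V Σ^{T} Σ V^{T} = V Σ^{2} V^{T},$
where $Σ^{2}$ is shorthand for the diagonal matrix of squared singular values of the appropriate size ( $D \times D$ for $C$ , $N \times N$ for $K$ ).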
If $ϕ (X)$ is centered, then $C$ and $K$ are both centered. In practice $ϕ (X)$ is never formed explicitly, so there is no need to center $X$ or $ϕ (X)$ explicitly: the centered Gram matrix $\tilde{K}$ can be computed from the uncentered Gram matrix alone (temporarily also written $K = ϕ (X)^{T} ϕ (X)$ ) as follows:
$\begin{aligned} \tilde{K} & = \tilde{ϕ} (X)^{T} \tilde{ϕ} (X) = {(ϕ (X) - \frac{1}{N} ϕ (X) 1_{N} 1_{N}^{T})}^{T} (ϕ (X) - \frac{1}{N} ϕ (X) 1_{N} 1_{N}^{T}) \\ & = ϕ (X)^{T} ϕ (X) - \frac{1}{N} ϕ (X)^{T} ϕ (X) 1_{N} 1_{N}^{T} - \frac{1}{N} 1_{N} 1_{N}^{T} ϕ (X)^{T} ϕ (X) + \frac{1}{N^{2}} 1_{N} 1_{N}^{T} ϕ (X)^{T} ϕ (X) 1_{N} 1_{N}^{T} \\ & = K - \frac{1}{N} K 1_{N} 1_{N}^{T} - \frac{1}{N} 1_{N} 1_{N}^{T} K + \frac{1}{N^{2}} 1_{N} 1_{N}^{T} K 1_{N} 1_{N}^{T} . \end{aligned}$
If we denote $J = \frac{1}{N} 1_{N} 1_{N}^{T}$ , which is an $N \times N$ matrix whose elements are all equal to $\frac{1}{N}$ , then we have:
$\tilde{K} = K - J K - K J + J K J$
From now on, we use $K$ to refer to the centered Gram matrix $\tilde{K}$ .
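
The centering identity can be verified in a case where the feature map is known. The sketch below assumes NumPy and uses the linear kernel, for which $ϕ (x) = x$ , so that explicit centering is possible and can be compared against centering the Gram matrix with the $N \times N$ matrix whose entries all equal $\frac{1}{N}$ . The helper name `center_gram` is an illustrative choice.

```python
import numpy as np

def center_gram(K):
    """Center a Gram matrix without access to the feature map."""
    N = K.shape[0]
    one_N = np.full((N, N), 1.0 / N)       # all entries equal to 1/N
    return K - one_N @ K - K @ one_N + one_N @ K @ one_N

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 6))                # d = 2, N = 6, columns are samples

K = X.T @ X                                # linear kernel: phi(x) = x
K_centered = center_gram(K)

# Reference: center the features explicitly, then form the Gram matrix.
Xc = X - X.mean(axis=1, keepdims=True)
K_explicit = Xc.T @ Xc
```

The two matrices agree, and every row and column of the centered Gram matrix sums to zero.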
We then still follow the method of finding the first principal component. We know that the PCs are the eigenvectors of $C$ . Notice that the column vectors of $U$ are eigenvectors of $C$ , therefore, we have:
$C u_{i} = λ_{i} u_{i} .$
Each eigenvector $u_{i}$ with $λ_{i} \neq 0$ lies in the span of the centered mapped samples, i.e., it is a linear combination of the columns of $\tilde{ϕ} (X)$ :
$u_{i} = \sum_{j = 1}^{n} α_{i j} \tilde{ϕ} (x_{j}) = \tilde{ϕ} (X) {\vec{α}}_{i},$
where ${\vec{α}}_{i} = [α_{i 1}, α_{i 2}, \dots, α_{i n}]^{T}$ is the vector of combination coefficients. We may constrain its magnitude to be 1, i.e., $‖ {\vec{α}}_{i} ‖ = 1$ .
Therefore,
$\begin{aligned} C u_{i} & = λ_{i} u_{i} \\ \frac{1}{n - 1} \tilde{ϕ} (X) \tilde{ϕ} (X)^{T} \tilde{ϕ} (X) {\vec{α}}_{i} & = λ_{i} \tilde{ϕ} (X) {\vec{α}}_{i} \\ \tilde{ϕ} (X) K {\vec{α}}_{i} & = (n - 1) λ_{i} \tilde{ϕ} (X) {\vec{α}}_{i} \\ K {\vec{α}}_{i} & = (n - 1) λ_{i} {\vec{α}}_{i}, \end{aligned}$
which means that the eigenvalues of $K$ are $(n - 1) λ_{i}$ , and the eigenvectors of $K$ are ${\vec{α}}_{i}$ .
Therefore, from the eigenvalue decomposition, we can derive that
$K = V Σ^{2} V^{T} \Rightarrow K V = V Σ^{2} .$
Thus, we know that
$V = [α_{1}, α_{2}, \dots, α_{n}],$
which is $v_{i} = {\vec{α}}_{i}$ , and
$Σ^{2} = (n - 1) Λ,$
which is $σ_{i} = \sqrt{(n - 1) λ_{i}}$ . Therefore, after normalizing $u_{i}$ to obtain a unit vector, we have:
$\begin{aligned} u_{i} & = \frac{\tilde{ϕ} (X) v_{i}}{‖ \tilde{ϕ} (X) v_{i} ‖} \\ & = \frac{\tilde{ϕ} (X) v_{i}}{\sqrt{v_{i}^{T} \tilde{ϕ} (X)^{T} \tilde{ϕ} (X) v_{i}}} \\ & = \frac{\tilde{ϕ} (X) v_{i}}{\sqrt{v_{i}^{T} K v_{i}}} \\ & = \frac{\tilde{ϕ} (X) v_{i}}{\sqrt{σ_{i}^{2}}} = \frac{\tilde{ϕ} (X) v_{i}}{σ_{i}} . \end{aligned}$
Therefore, the final Principal Components are given by:
$U = [\frac{\tilde{ϕ} (X) v_{1}}{σ_{1}}, \frac{\tilde{ϕ} (X) v_{2}}{σ_{2}}, \dots, \frac{\tilde{ϕ} (X) v_{k}}{σ_{k}}],$
where $k$ is the number of the PCs.
Mathematically prove how to compute the transformed dataset.
From the subtask 1, we obtain the PC matrix $U$ . We can then compute the transformed dataset $Z$ as follows:
$Z = U^{T} \tilde{ϕ} (X) = {[u_{1}, u_{2}, \dots, u_{k}]}^{T} \tilde{ϕ} (X) .$
Expanding $Z$ , we have:
$Z = \begin{bmatrix} u_{1}^{T} \tilde{ϕ} (x_{1}) & u_{1}^{T} \tilde{ϕ} (x_{2}) & \dots & u_{1}^{T} \tilde{ϕ} (x_{n}) \\ u_{2}^{T} \tilde{ϕ} (x_{1}) & u_{2}^{T} \tilde{ϕ} (x_{2}) & \dots & u_{2}^{T} \tilde{ϕ} (x_{n}) \\ ⋮ & ⋮ & ⋱ & ⋮ \\ u_{k}^{T} \tilde{ϕ} (x_{1}) & u_{k}^{T} \tilde{ϕ} (x_{2}) & \dots & u_{k}^{T} \tilde{ϕ} (x_{n}) \end{bmatrix} .$
For $i, j$ in $u_{i}^{T} \tilde{ϕ} (x_{j})$ , where $i \in [1, k]$ , and $j \in [1, n]$ , we have
$\begin{aligned} u_{i}^{T} \tilde{ϕ} (x_{j}) & = {[\frac{\tilde{ϕ} (X) v_{i}}{σ_{i}}]}^{T} \tilde{ϕ} (x_{j}) = \frac{1}{σ_{i}} v_{i}^{T} \tilde{ϕ} (X)^{T} \tilde{ϕ} (x_{j}) \\ & = \frac{1}{σ_{i}} v_{i}^{T} k_{:, j} \\ & = \frac{1}{\sqrt{(n - 1) λ_{i}}} v_{i}^{T} k_{:, j}, \end{aligned}$
where $k_{:, j}$ is the $j$ -th column of the Gram matrix $K$ .
Therefore,
$Z = \frac{1}{\sqrt{n - 1}} Diag (\frac{1}{\sqrt{λ_{1}}}, \frac{1}{\sqrt{λ_{2}}}, \dots, \frac{1}{\sqrt{λ_{k}}}) V_{k}^{T} K = \frac{1}{\sqrt{n - 1}} Λ_{k}^{- \frac{1}{2}} V_{k}^{T} K .$
where,
- $Λ$ is the diagonal matrix, comprised of $λ_{i}$ , the eigenvalues of $K$ .
- $Λ_{k}^{- \frac{1}{2}}$ is the $k \times k$ diagonal matrix, comprised of $\frac{1}{\sqrt{λ_{i}}}$ , the inverse of the square root of $λ_{i}$ , where it refers to the top k eigenvalues of $K$ .
- $V_{k}$ is the $n \times k$ matrix whose columns are the top $k$ eigenvectors of $K$ , so that $V_{k}^{T}$ is $k \times n$ .
- $K$ is the Gram matrix, where $k_{i j} = K (x_{i}, x_{j}) =< ϕ (x_{i}), ϕ (x_{j}) >$ , which is the kernel function.
Therefore, to obtain the transformed dataset $Z$ , we first compute the Gram matrix $K$ and center it, then use an eigenvalue decomposition to obtain $V$ and $Λ$ , and finally compute $Z$ with the equation above.
However, since this factor only rescales the components and does not change their directions, we can sometimes drop the $\frac{1}{\sqrt{n - 1}}$ term. This is how KernelPCA in the scikit-learn package works. The advantages are:
- Improved Numerical Stability: Omitting the factor prevents transformed coordinates from becoming extremely small, especially for large n. This avoids potential floating-point precision issues, underflow errors, and increased sensitivity to rounding errors in subsequent computations on the reduced-dimensional data.
- Direct Kernel Space Scaling: This scaling is arguably more natural within the kernel context and avoids an arbitrary dependency on n without losing the essential relative geometry between data points.
- Formula Conciseness: The transformation formula is simpler and more directly linked to the kernel matrix eigendecomposition.
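
The whole procedure (Gram matrix, centering, eigendecomposition, projection) can be sketched in a few lines. This is a minimal illustration assuming NumPy and an RBF kernel on random data; the function name `kernel_pca` is hypothetical. It works directly with the eigenvalues $σ_{i}^{2}$ of the centered Gram matrix and evaluates $u_{i}^{T} \tilde{ϕ} (x_{j}) = \frac{1}{σ_{i}} v_{i}^{T} k_{:, j}$ from the derivation above.

```python
import numpy as np

def kernel_pca(K, k):
    """Project onto the top-k kernel principal components given a Gram matrix K."""
    N = K.shape[0]
    one_N = np.full((N, N), 1.0 / N)
    Kc = K - one_N @ K - K @ one_N + one_N @ K @ one_N   # centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(Kc)                # ascending order
    order = np.argsort(eigvals)[::-1][:k]                # indices of the top-k
    sigma2, Vk = eigvals[order], eigvecs[:, order]       # sigma_i^2 and v_i
    Z = (Vk.T @ Kc) / np.sqrt(sigma2)[:, None]           # rows: (1/sigma_i) v_i^T K
    return Z, sigma2, Vk

rng = np.random.default_rng(1)
X = rng.normal(size=(2, 50))                             # d = 2, n = 50
sq = np.sum(X**2, axis=0)
K = np.exp(-0.5 * (sq[:, None] + sq[None, :] - 2.0 * X.T @ X))  # RBF, gamma = 0.5
Z, sigma2, Vk = kernel_pca(K, k=2)
```

Because $K v_{i} = σ_{i}^{2} v_{i}$ , each row of `Z` equals $σ_{i} v_{i}^{T}$ , so the projection can also be read off from the eigenvectors without another matrix product. The sketch assumes the top $k$ eigenvalues are strictly positive.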

K-Means

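In the convergence proof of K-Means: after the updating step, the sum of squared distances is also guaranteed not to increase (≤) if Euclidean distance is used to measure data similarity. Explain why.
In the K-means algorithm, after the update step (i.e., after the cluster centers are recomputed), the Sum of Squared Distances (SSD) is guaranteed not to increase ( $\leq$ ). This is because K-means is essentially an iterative optimization procedure that tries to reduce the SSD at every step.
Specifically, when Euclidean distance is used as the similarity measure, each iteration consists of two steps:
1. Assignment step
In this step, each data point $x_{i}$ is assigned to its nearest cluster center $c_{j}$ . This operation minimizes the squared distance from each data point to the center of the cluster it belongs to. Therefore, after the assignment step the SSD must decrease or stay the same: if a data point can be assigned to a closer cluster center, reassigning it to that center necessarily lowers the total SSD.
2. Update step
In this step, the center of each cluster is updated to the mean of all data points inside it. It can be shown that, for a given set of points, their mean is the point that minimizes the sum of squared distances from all points in the set.
Suppose a cluster $C_{k}$ contains the data points $x_{1}, x_{2}, \dots, x_{m}$ . We want to find a point $c_{k}$ that minimizes $\sum_{i = 1}^{m} ∥ x_{i} - c_{k} ∥^{2}$ . Taking the derivative with respect to $c_{k}$ and setting it to zero gives $c_{k} = \frac{1}{m} \sum_{i = 1}^{m} x_{i}$ , i.e., the mean of these points. Because this objective is convex in $c_{k}$ , the mean is its minimizer, so after the centers are updated the within-cluster sum of squared distances is as small as possible for the current assignment.
Combining the two steps, every iteration makes the total sum of squared distances decrease or stay the same.
- The assignment step ensures that each point is as close as possible to the center of its cluster (for the given centers).
- The update step ensures that each center is the optimal representative of the points in its cluster (minimizing the within-cluster squared distance).
Together, the two steps guarantee that the SSD forms a non-increasing sequence. Since the SSD is non-negative and decreases or stays the same in every iteration, the process eventually converges to a local optimum, at which the SSD no longer changes noticeably.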
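
The non-increasing behavior of the SSD can be observed directly. The following is a minimal sketch assuming NumPy and synthetic data with one sample per row (the data, seed, and number of iterations are arbitrary); it records the SSD after every assignment step and every update step.

```python
import numpy as np

def ssd(X, centers, labels):
    """Sum of squared Euclidean distances from points to their assigned centers."""
    return float(np.sum((X - centers[labels]) ** 2))

rng = np.random.default_rng(0)
# Rows are samples here: three blobs of 50 points in 2-D.
X = np.vstack([rng.normal(loc=m, scale=0.5, size=(50, 2)) for m in (0.0, 3.0, 6.0)])
centers = X[rng.choice(len(X), size=3, replace=False)].copy()

history = []
for _ in range(10):
    # Assignment step: each point goes to its nearest center.
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    history.append(ssd(X, centers, labels))
    # Update step: each center becomes the mean of its points
    # (an empty cluster keeps its previous center).
    for k in range(3):
        if np.any(labels == k):
            centers[k] = X[labels == k].mean(axis=0)
    history.append(ssd(X, centers, labels))
```

Every entry of `history` is less than or equal to the one before it, which is exactly the two-step argument above.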
K-Means++.
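K-Means++ is a method for choosing better initial centroids for the K-Means clustering algorithm.
Standard K-Means has a notable weakness: its clustering result and convergence speed are very sensitive to the choice of the initial cluster centers (also called centroids or means).
- If the initial centroids are chosen poorly (for example, all of them are crowded into one small region of the data), K-Means easily gets trapped in a local optimum, giving a poor final clustering that does not reflect the true distribution of the data.
- A poor initial choice also means the algorithm needs more iterations to converge, which reduces efficiency.
K-Means++ was proposed to address this problem. Its goal is to select a set of "better" initial centroids that are spread out as much as possible in the data space, improving both the convergence speed and the clustering quality of K-Means.
The core idea of K-Means++ is to make the next initial centroid likely to be far from the centroids already chosen. This helps the chosen centroids cover the whole data space and reduces the chance that they concentrate in one region.
Suppose we want to partition the dataset into $K$ clusters. K-Means++ selects the $K$ initial centroids as follows:
1. Choose the first centroid:
  - Select one point uniformly at random from all data points as the first cluster center $c_{1}$ .
2. Choose the subsequent centroids (core step):
  - For every data point $x_{i}$ , compute its distance to the nearest of the cluster centers chosen so far, denoted $D (x_{i})$ . For example, if $j$ centroids $c_{1}, c_{2}, \dots, c_{j}$ have been chosen, then $D (x_{i}) = \min_{k = 1, \dots, j} ∥ x_{i} - c_{k} ∥$ (Euclidean distance).
  - The next cluster center is then chosen not uniformly at random but by weighted random sampling. The probability that point $x_{i}$ is selected as the next centroid $c_{j + 1}$ is proportional to the squared distance $D (x_{i})^{2}$ : $P (x_{i}) = \frac{D (x_{i})^{2}}{\sum_{l = 1}^{n} D (x_{l})^{2}}$ , where the sum runs over all data points. This means that the farther a point is from the centroids already chosen, the more likely it is to be selected as the next centroid.
3. Repeat step 2:
  - Repeat step 2 until $K$ cluster centers have been selected.
4. Run standard K-Means:
  - Once the $K$ initial centroids have been selected, use them as the starting points of standard K-Means and continue with the usual iterations (assignment step and update step) until convergence.
Advantages of K-Means++
- Better clustering quality: by initializing the centroids more sensibly, K-Means++ noticeably lowers the risk that K-Means gets trapped in a poor local optimum, leading to better clustering results.
- Faster convergence: better initial centroids usually let the algorithm reach a stable clustering sooner, reducing the number of iterations.
- Simple to implement: although slightly more involved than random initialization, the logic of K-Means++ is intuitive and easy to implement.
K-Means++ is an improved initialization strategy for K-Means. It selects the initial cluster centers by random sampling that favors distant points, so that the centers are spread out in the data space, which improves the clustering performance and stability of K-Means.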
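
The seeding procedure can be sketched as follows. This is a minimal illustration assuming NumPy, data stored one sample per row, and sampling probabilities proportional to the squared Euclidean distance to the nearest chosen center; the function name `kmeans_pp_init` and the synthetic data are illustrative choices.

```python
import numpy as np

def kmeans_pp_init(X, K, rng):
    """Pick K initial centers from the rows of X with K-Means++ seeding."""
    n = X.shape[0]
    centers = [X[rng.integers(n)]]                 # first center: uniform choice
    for _ in range(1, K):
        chosen = np.array(centers)
        # Squared distance from every point to its nearest chosen center.
        d2 = ((X[:, None, :] - chosen[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        probs = d2 / d2.sum()                      # far points are more likely
        centers.append(X[rng.choice(n, p=probs)])
    return np.array(centers)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, scale=0.3, size=(30, 2)) for m in (0.0, 5.0, 10.0)])
centers = kmeans_pp_init(X, K=3, rng=rng)
```

A point that has already been chosen has distance zero to itself and therefore probability zero, so the same data point is never selected twice.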
Soft K-means.

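The core question in Soft K-means is how to handle the uncertainty about which cluster a data point belongs to. In standard K-means every data point is "hard"-assigned to exactly one cluster. Soft K-means instead introduces probabilities, allowing each data point to belong to several clusters with a certain "responsibility" or probability. Because of this uncertainty, the cluster membership of a data point becomes a latent variable.

Formulas:

$r_{i j} = \frac{\exp (- β ‖ x_{i} - μ_{j} ‖_{2}^{2})}{\sum_{k = 1}^{K} \exp (- β ‖ x_{i} - μ_{k} ‖_{2}^{2})}$

$μ_{j} = \frac{\sum_{i = 1}^{n} r_{i j} x_{i}}{\sum_{i = 1}^{n} r_{i j}}$

Introducing the latent variable
In Soft K-means we cannot directly observe which cluster each data point belongs to. For example, a data point may lie in the region between two clusters, and a hard assignment would lose information. It is therefore necessary to treat the cluster membership of $x_{i}$ as a latent variable $z_{i}$ . Our goal is to estimate the distribution of this latent variable together with the other model parameters (such as the cluster centers, cluster weights, and variances).
Applying the EM algorithm to Soft K-means
To estimate the parameters of a probabilistic model with latent variables, the Expectation-Maximization (EM) algorithm is the core optimization method of Soft K-means. EM indirectly maximizes the likelihood of the observed data by iterating two steps:
E-step (Expectation step): using the current model parameters (for example, the cluster centers and variances from the previous iteration), we compute the posterior probability that each data point $x_{i}$ belongs to each cluster $k$ . This posterior probability is the "responsibility" or "soft assignment probability", usually written $γ (z_{i k})$ or $p (z_{k} | x_{i})$ . It quantifies the "confidence" that $x_{i}$ belongs to cluster $k$ under the current parameters. This step estimates the expectation of the latent variable $z_{i}$ and provides the basis for the M-step.
M-step (Maximization step): once the E-step has computed the responsibility of every cluster for every data point, the M-step updates the model parameters to maximize the likelihood of the observed data under these responsibilities. For example, each new cluster center is the weighted average of all data points, where the weight is the responsibility of that cluster for the point. In essence, this step looks for the parameters under which the observed data are most probable, given the latent-variable estimate from the E-step.
How this reflects Maximum Likelihood Estimation (MLE)
Through the EM iterations, the whole Soft K-means algorithm embodies the idea of **Maximum Likelihood Estimation (MLE)**.
The goal of EM is to maximize the marginal likelihood $P (X | θ)$ of the observed data $X$ , where $θ$ denotes all model parameters. Because of the latent variable $Z$ , maximizing the marginal likelihood directly is difficult. EM instead maximizes the expected complete-data log-likelihood $E_{Z | X, θ_{old}} [\log P (X, Z | θ)]$ .
The E-step computes this expectation, i.e., it derives the posterior distribution of the latent variable (the responsibilities) from the current parameters and the observed data.
The M-step then chooses new parameters that maximize this expectation.
Each EM iteration guarantees that the likelihood does not decrease, so the algorithm converges to a local optimum. Soft K-means, which uses EM to handle the uncertain cluster assignment as a latent variable, is therefore a practical application of the maximum likelihood principle.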
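
One iteration of Soft K-means, following the two formulas for $r_{i j}$ and $μ_{j}$ above, can be sketched as below. The sketch assumes NumPy and data stored one sample per row; the function name `soft_kmeans_step` and the toy data are illustrative.

```python
import numpy as np

def soft_kmeans_step(X, mu, beta):
    """One Soft K-means iteration: responsibilities r (n, K) and new means (K, d)."""
    sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # ||x_i - mu_j||^2
    logits = -beta * sq
    logits -= logits.max(axis=1, keepdims=True)    # stabilizes exp; ratios unchanged
    r = np.exp(logits)
    r /= r.sum(axis=1, keepdims=True)              # each row sums to 1
    mu_new = (r.T @ X) / r.sum(axis=0)[:, None]    # responsibility-weighted means
    return r, mu_new

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
mu = np.array([[1.0, 0.0], [9.0, 0.0]])
r, mu_new = soft_kmeans_step(X, mu, beta=1.0)
```

With well-separated groups the responsibilities are almost hard assignments and the new means are close to the group averages; a smaller `beta` makes the assignments softer.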

DBSCAN

DBSCAN is short for Density-Based Spatial Clustering of Applications with Noise.

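DBSCAN is a density-based clustering algorithm. Unlike algorithms such as K-Means, which require the number of clusters (K) to be specified in advance, DBSCAN can discover clusters of arbitrary shape and can identify noise points (outliers).
The core idea of DBSCAN is that a cluster is a maximal set of density-connected points. If the density of points in a region is high enough, those points are considered to belong to the same cluster.
DBSCAN relies on two core parameters:
- $ϵ$ (epsilon) / eps: the radius. It defines the radius of the neighborhood around a point. For every point in the dataset, DBSCAN considers how many points lie within this radius.
- MinPts: the minimum number of points. If the $ϵ$ -neighborhood of a point contains at least MinPts points, the point is considered a core point.
Based on these two parameters, DBSCAN divides the data points into three types:
- Core point: a point whose $ϵ$ -neighborhood contains at least MinPts points (including itself). Core points are the "skeleton" of a cluster.
- Border point: a point whose $ϵ$ -neighborhood contains fewer than MinPts points, but which lies inside the $ϵ$ -neighborhood of some core point (i.e., it is directly density-reachable from a core point). Border points are the "edge" of a cluster.
- Noise point (outlier): a point that is neither a core point nor a border point (its $ϵ$ -neighborhood contains fewer than MinPts points, and it does not lie in the neighborhood of any core point). Noise points do not belong to any cluster.
The clustering procedure of DBSCAN is roughly as follows:
1. Pick an unvisited point from the dataset at random.
2. Examine the $ϵ$ -neighborhood of this point.
  - If the neighborhood contains fewer than MinPts points, mark the point as noise (provisionally; it may later turn out to be a border point).
  - If the neighborhood contains at least MinPts points, mark the point as a core point and create a new cluster.
3. Add all points in the neighborhood of this core point to the current cluster. For each newly added point that is itself a core point, recursively expand its neighborhood, adding more density-connected points to the current cluster.
4. Repeat the above until every point has been visited and labeled (as belonging to a cluster or as noise).
Advantages:
- No need to preset the number of clusters K: DBSCAN discovers the number of clusters automatically from the density of the data.
- Finds clusters of arbitrary shape: unlike K-Means, which is best at spherical clusters, DBSCAN can identify irregularly shaped clusters, such as L-shaped or S-shaped ones.
- Identifies noise points: DBSCAN explicitly separates out the points that are noise and does not force them into any cluster.
- Largely insensitive to the starting point: core points and noise points are determined by $ϵ$ and MinPts alone; only a border point that lies within reach of several clusters may be assigned differently depending on the order in which points are visited.
Disadvantages:
- Parameters are hard to choose: the choice of $ϵ$ and MinPts strongly affects the clustering result, and usually requires experience or repeated trials.
- Difficulty with clusters of very different densities: if the dataset contains clusters whose densities differ greatly (for example, one very dense cluster and one very sparse cluster), DBSCAN can hardly handle both well with a single set of parameters.
- Poor performance on high-dimensional data: in high-dimensional spaces distance measures become less informative, and choosing a suitable $ϵ$ becomes harder (the "curse of dimensionality").
In short, DBSCAN is a powerful and flexible clustering algorithm, particularly suited to datasets with clusters of complex shape and with noise.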
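
The procedure can be sketched compactly. The following is a minimal illustration assuming NumPy, Euclidean distance, a full pairwise distance matrix (so it only suits small datasets), and the convention that a point counts as its own neighbor; the function name `dbscan` and the label `-1` for noise are choices made for this sketch.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Label each row of X with a cluster index, or -1 for noise."""
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]  # includes i
    labels = np.full(n, -1)                  # -1 means noise / not yet assigned
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbors[i]) < min_pts:
            continue                         # provisional noise; may become border
        labels[i] = cluster                  # i is a core point: start a cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster          # core or border point joins the cluster
            if not visited[j]:
                visited[j] = True
                if len(neighbors[j]) >= min_pts:
                    queue.extend(neighbors[j])   # expand only through core points
        cluster += 1
    return labels

X = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1],
              [5.0, 5.0], [5.0, 5.1], [5.1, 5.0], [5.1, 5.1],
              [10.0, 0.0]])
labels = dbscan(X, eps=0.3, min_pts=3)
```

On this toy input the two dense groups become two clusters and the isolated point is left as noise. An explicit queue replaces the recursion described in the steps above, which avoids deep call stacks on large clusters.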

MLE to MAP

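Formula.
For Maximum Likelihood Estimation:
$\arg\max_{θ} P (D | θ)$
MLE finds the parameter value under which the observed sample is most probable, i.e., the value that fits this sample best.
For Maximum A Posteriori:
$\arg\max_{θ} P (θ | D)$
MAP takes the data as given and combines it with prior knowledge to obtain a posterior estimate. The estimate must both reflect the distribution of the data well and keep the parameter within a plausible range.
The two are connected simply by Bayes' theorem:
$P (θ | D) = \frac{P (D | θ) P (θ)}{P (D)},$
i.e., posterior = likelihood × prior / evidence. Since the evidence $P (D)$ does not depend on $θ$ , MAP maximizes $P (D | θ) P (θ)$ .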
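
A small worked example makes the difference concrete. The sketch below assumes a coin-flip (Bernoulli) likelihood with a Beta prior, which is an illustrative choice not made in the notes; the counts and the hyperparameters `a`, `b` are hypothetical. With a $Beta (a, b)$ prior the posterior is $Beta (heads + a, tails + b)$ , and the MAP estimate is its mode.

```python
heads, tails = 7, 3        # hypothetical observations
a, b = 2.0, 2.0            # assumed Beta prior, mildly favoring a fair coin

# MLE: maximize P(D | theta) -> the observed frequency of heads.
theta_mle = heads / (heads + tails)

# MAP: maximize P(D | theta) P(theta) -> mode of the Beta posterior.
theta_map = (heads + a - 1) / (heads + tails + a + b - 2)
```

Here the MLE is 0.7, while the MAP estimate is 8/12 ≈ 0.667: the prior pulls the estimate toward 0.5, and its influence shrinks as more data are observed.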

GMM

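In a Gaussian mixture model, if all of my data are described by a single multivariate Gaussian distribution, is the EM algorithm still needed?

Purpose of the EM algorithm: EM (Expectation-Maximization) is mainly used to estimate the parameters of probabilistic models with latent variables. In a Gaussian mixture model, the latent variable is the Gaussian component that each data point belongs to. With several components, EM is needed to iteratively estimate which component each data point belongs to (E-step) and then update the parameters of each component based on that estimate (M-step).
Parameter estimation for a single Gaussian: if the model has only one multivariate Gaussian, there is no "mixture" and no latent variable indicating which component a data point belongs to (there is only one). In this case the parameters can be obtained directly by **Maximum Likelihood Estimation (MLE)**:
- Mean: the sample mean of all data points.
- Covariance matrix: the sample covariance matrix of all data points (with the $\frac{1}{N}$ normalization in the MLE).
These parameters have closed-form solutions and require no iteration.
Summary:
- Gaussian mixture model (GMM): with several Gaussian components, the component membership of each data point is a latent variable, and EM is needed to estimate the parameters.
- Single multivariate Gaussian: there is no latent variable, and the parameters can be estimated directly by maximum likelihood (computing the sample mean and sample covariance), without EM.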

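Why does the GMM have no closed-form solution?

There is NO closed-form solution for the parameters. One obvious reason is the interdependence of $ϕ_{j}$ (the mixing weights, written $π_{j}$ in the formulas below) with $μ_{j}$ and $Σ_{j}$ . We need another way to compute $\{ϕ_{j}\}, \{μ_{j}\}, \{Σ_{j}\}$ from the perspective of maximizing the data log-likelihood $\log p (D | \{ϕ_{j}\}, \{μ_{j}\}, \{Σ_{j}\})$ .
The Expectation-Maximization algorithm provides a way to address this task.
It is very similar to the two-step process in k-means.
When we maximize the likelihood in a GMM, we encounter an expression with a sum inside the logarithm:
$L (θ) = \sum_{i = 1}^{N} \log (\sum_{k = 1}^{K} π_{k} N (x_{i} | μ_{k}, Σ_{k}))$
The sum inside the logarithm is the crux of the problem. It reflects the fact that we do not know which Gaussian component generated each data point $x_{i}$ (this is the latent variable). If the latent variable were known, the problem would decompose into several independent Gaussian estimation problems, each with a closed-form solution. Since it is unknown, we cannot simply set the derivative of the log-likelihood to zero and obtain an analytical solution.
The EM algorithm was designed for exactly this kind of latent-variable problem. It iterates: first "guess" the distribution of the latent variable (E-step), then update the model parameters based on this guess (M-step), gradually approaching a (local) optimum.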

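In the E-step of the GMM, why must the responsibilities $r_{i j}$ of all Gaussian components $j$ for each data point $x_{i}$ sum to 1, i.e., $\sum_{j = 1}^{K} r_{i j} = 1$ ? How does this property relate to the update of the mixing weights $π_{j}$ in the M-step?
1. Why is $\sum_{j = 1}^{K} r_{i j} = 1$ ?
This equality follows from the fact that probabilities over all possible outcomes sum to one.
- Definition of $r_{i j}$ : $r_{i j}$ is the posterior probability $p (z_{i j} = 1 | x_{i}, θ)$ that the data point $x_{i}$ comes from the $j$ -th Gaussian component, given $x_{i}$ and the current model parameters $θ$ .
- Nature of the latent variable: in a GMM we assume that every data point $x_{i}$ is generated by exactly one of the $K$ Gaussian components. That is, for each $x_{i}$ the corresponding latent variable $z_{i}$ (which component it belongs to) takes exactly one value in $1, 2, \dots, K$ .
- Normalization of probabilities: since $x_{i}$ belongs to one and only one component, its posterior probabilities over all possible components must sum to 1, just as the probabilities of any event over all possible outcomes always sum to 1.
Therefore,
$\sum_{j = 1}^{K} r_{i j} = \sum_{j = 1}^{K} p (z_{i j} = 1 | x_{i}, θ) = 1$
This ensures that the "responsibility" for each data point is fully distributed over the components, with nothing missing or counted twice.
2. Relation to the update of the mixing weights $π_{j}$ in the M-step
The property $\sum_{j = 1}^{K} r_{i j} = 1$ plays a key role when the mixing weights $π_{j}$ are updated in the M-step: it guarantees that the updated weights are valid and normalized.
In the M-step, the update formula for the mixing weight $π_{j}$ is:
$π_{j}^{n e w} = \frac{\sum_{i = 1}^{N} r_{i j}}{N}$
The relation shows up in the following points:
- Meaning of the numerator: $\sum_{i = 1}^{N} r_{i j}$ is the "total responsibility" that the $j$ -th Gaussian component takes for all data points in the dataset. It can be read as the effective number of data points "owned" by the $j$ -th cluster (because $r_{i j}$ is a soft assignment).
- Meaning of the denominator: $N$ is the total number of data points in the dataset.
- Meaning of $π_{j}$ : $π_{j}$ is the proportion, or prior probability, of the $j$ -th Gaussian component in the mixture model.
- Conservation of total responsibility: the total responsibility of all clusters equals the number of data points $N$ :
$\sum_{j = 1}^{K} (\sum_{i = 1}^{N} r_{i j}) = \sum_{i = 1}^{N} (\sum_{j = 1}^{K} r_{i j}) = \sum_{i = 1}^{N} 1 = N,$
where we used $\sum_{j = 1}^{K} r_{i j} = 1$ (the responsibilities for each data point sum to 1). In other words, the effective numbers of data points "shared" among the components add up exactly to the actual number of data points $N$ .
- Validity of $π_{j}$ : dividing the effective number of data points of the $j$ -th cluster, $\sum_{i = 1}^{N} r_{i j}$ , by the total number $N$ gives $π_{j}^{n e w}$ , which naturally represents the proportion of that cluster in the dataset. Because the effective counts of all clusters add up to $N$ , the updated mixing weights automatically sum to 1:
$\sum_{j = 1}^{K} π_{j}^{n e w} = \sum_{j = 1}^{K} \frac{\sum_{i = 1}^{N} r_{i j}}{N} = \frac{1}{N} \sum_{i = 1}^{N} (\sum_{j = 1}^{K} r_{i j}) = \frac{1}{N} \sum_{i = 1}^{N} 1 = \frac{1}{N} \cdot N = 1$
In summary, $\sum_{j = 1}^{K} r_{i j} = 1$ is a basic property of the posterior probabilities computed in the E-step; it ensures that the responsibility for every data point is fully distributed. The M-step relies on this property: the update formula for the mixing weights $π_{j}$ then reflects the proportion of data points "captured" by each cluster and automatically satisfies the requirement that the mixing weights sum to 1.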
EM formulas for the GMM:

E-step:
$r_{i j} = \frac{π_{j} N (x_{i} | μ_{j}, Σ_{j})}{\sum_{k = 1}^{K} π_{k} N (x_{i} | μ_{k}, Σ_{k})}$
M-step:
$μ_{j}^{n e w} = \frac{\sum_{i = 1}^{N} r_{i j} x_{i}}{\sum_{i = 1}^{N} r_{i j}}$
$Σ_{j}^{n e w} = \frac{\sum_{i = 1}^{N} r_{i j} (x_{i} - μ_{j}^{n e w}) (x_{i} - μ_{j}^{n e w})^{T}}{\sum_{i = 1}^{N} r_{i j}}$
$π_{j}^{n e w} = \frac{\sum_{i = 1}^{N} r_{i j}}{N}$
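
The E-step and M-step formulas translate almost line by line into code. The following is a minimal sketch assuming NumPy, data stored one sample per row, and full covariance matrices; the function names `gaussian_pdf` and `em_step`, the synthetic data, and the initial parameters are illustrative choices. No safeguards against singular covariances are included.

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Multivariate normal density N(x | mu, Sigma) for each row of X."""
    d = X.shape[1]
    diff = X - mu
    quad = np.sum(diff * np.linalg.solve(Sigma, diff.T).T, axis=1)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def em_step(X, pi, mu, Sigma):
    """One EM iteration; also returns the log-likelihood of the input parameters."""
    N, K = X.shape[0], len(pi)
    # E-step: r_ij is proportional to pi_j * N(x_i | mu_j, Sigma_j).
    weighted = np.column_stack(
        [pi[j] * gaussian_pdf(X, mu[j], Sigma[j]) for j in range(K)])
    r = weighted / weighted.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted means, covariances, and mixing weights.
    Nj = r.sum(axis=0)
    mu_new = (r.T @ X) / Nj[:, None]
    Sigma_new = np.empty_like(Sigma)
    for j in range(K):
        diff = X - mu_new[j]
        Sigma_new[j] = (r[:, j, None] * diff).T @ diff / Nj[j]
    pi_new = Nj / N
    log_lik = float(np.sum(np.log(weighted.sum(axis=1))))
    return r, pi_new, mu_new, Sigma_new, log_lik

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(4.0, 1.0, size=(100, 2))])
pi = np.array([0.5, 0.5])
mu = np.array([[0.0, 1.0], [3.0, 3.0]])
Sigma = np.stack([np.eye(2), np.eye(2)])

log_liks = []
for _ in range(5):
    r, pi, mu, Sigma, ll = em_step(X, pi, mu, Sigma)
    log_liks.append(ll)
```

Each row of `r` sums to 1, the updated `pi` sums to 1, and the recorded log-likelihood does not decrease from one iteration to the next.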

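How is K-means used to initialize a GMM?

1. Run K-means: first run the K-means algorithm on the data to obtain $K$ hard (crisply partitioned) clusters and their centroids.
2. Set the initial GMM parameters:
  - Mean ( $μ_{j}$ ): set the initial mean of each Gaussian component to the centroid of the corresponding K-means cluster.
  - Covariance ( $Σ_{j}$ ): set the initial covariance of each Gaussian component to the sample covariance of the data points in the corresponding K-means cluster.
  - Mixing coefficient ( $π_{j}$ ): set the initial mixing coefficient of each Gaussian component to the fraction of all data points that fall in the corresponding K-means cluster.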
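
Given hard cluster labels, the three initial parameters are simple per-cluster statistics. The sketch below assumes NumPy, data stored one sample per row, and that K-means has already been run; the `labels` array here is a hypothetical K-means output, and the function name `gmm_init_from_labels` is illustrative. Every cluster must contain at least two points for the sample covariance to be defined.

```python
import numpy as np

def gmm_init_from_labels(X, labels, K):
    """Initial GMM parameters (pi, mu, Sigma) from hard cluster labels."""
    pi = np.array([np.mean(labels == j) for j in range(K)])        # cluster fractions
    mu = np.array([X[labels == j].mean(axis=0) for j in range(K)]) # cluster centroids
    # np.cov with rowvar=False treats rows as observations (normalizes by m - 1).
    Sigma = np.array([np.cov(X[labels == j], rowvar=False) for j in range(K)])
    return pi, mu, Sigma

X = np.array([[0.0, 0.0], [0.0, 2.0], [2.0, 0.0], [2.0, 2.0],
              [10.0, 10.0], [10.0, 12.0]])
labels = np.array([0, 0, 0, 0, 1, 1])      # hypothetical K-means assignment
pi0, mu0, Sigma0 = gmm_init_from_labels(X, labels, K=2)
```

The resulting mixing coefficients sum to 1 by construction, since every point belongs to exactly one K-means cluster.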

Summary of the proof of the EM algorithm for GMMs.
- We start by writing down the data log-likelihood $\log p (D | θ)$ .
- Then we look at the per-sample log-likelihood $\log p (x_{i} | θ)$ .
- We introduce a latent variable $z_{i}$ and its distribution $q (z_{i})$ .
- We find an alternative way to approximate the MLE: maximizing the ELBO of the log-likelihood $\log p (x_{i} | θ)$ .
- The E-step makes the ELBO tight by setting $q (z_{i})$ equal to the posterior $p (z_{i} | x_{i}, θ)$ , i.e., the responsibility $r_{i j}$ .
- The M-step then maximizes this ELBO by setting its derivatives to zero, which yields the parameters $θ$ , namely $\{ϕ_{1}, μ_{1}, Σ_{1}\}, \{ϕ_{2}, μ_{2}, Σ_{2}\}, \dots, \{ϕ_{k}, μ_{k}, Σ_{k}\}$ .

Ensemble Learning

Prove that the training error is bounded by

$\Pr_{i \sim D_{1}} [H (x_{i}) \neq y_{i}] \leq \prod_{t = 1}^{T} \sqrt{1 - 4 γ_{t}^{2}} \leq \exp (- 2 \sum_{t = 1}^{T} γ_{t}^{2})$

where $γ_{t} = \frac{1}{2} - ϵ_{t}$ , and interpret the result.

Solution: if $\forall t : γ_{t} \geq γ > 0$ , then the training error is $\leq e^{- 2 γ^{2} T}$ .
The term $e^{- 2 γ^{2} T}$ depends on the number of boosting rounds $T$ , so:
- As training progresses, the upper bound on the training error decreases exponentially (fast training).
Convergence? Boosting stops in one of two cases:
- The training error goes to $0$ .
- Or $γ_{t} = 0$ , equivalently $ϵ_{t} = 0.5$ . Boosting gets stuck: the boosting weights on the training set are such that every weak learner has 50% error.
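
The exponential form of the bound rests on the elementary inequality $1 - x \leq e^{- x}$ . A short worked derivation, using only the definitions above and assuming $γ_{t} \geq γ > 0$ for all $t$ in the last step:

$\begin{aligned} \sqrt{1 - 4 γ_{t}^{2}} & \leq \sqrt{e^{- 4 γ_{t}^{2}}} = e^{- 2 γ_{t}^{2}}, \\ \prod_{t = 1}^{T} \sqrt{1 - 4 γ_{t}^{2}} & \leq \prod_{t = 1}^{T} e^{- 2 γ_{t}^{2}} = \exp (- 2 \sum_{t = 1}^{T} γ_{t}^{2}) \leq e^{- 2 γ^{2} T} . \end{aligned}$

For instance, with $γ = 0.1$ the bound is $e^{- 2} \approx 0.135$ after $T = 100$ rounds and $e^{- 10} \approx 4.5 \times 10^{- 5}$ after $T = 500$ rounds.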

What is the meaning of "Bootstrap" in Ensemble Learning?
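In ensemble learning, "Bootstrap" usually refers to bootstrap sampling, the core component of the ensemble method Bagging (Bootstrap Aggregating).
Bootstrap is a method of random sampling with replacement. Its basic idea is to repeatedly draw, with replacement, sample sets of the same (or approximately the same) size as the original dataset.
Specifically, for an original dataset $D$ containing $N$ samples:
1. Sample with replacement: draw one sample at random from $D$ and add it to the new sample set $D^{'}$ .
2. Repeat $N$ times: repeat step 1 a total of $N$ times.
3. Form the new dataset: the resulting $D^{'}$ is a bootstrap sample. $D^{'}$ has the same size as $D$ , but because sampling is with replacement, it may contain repeated samples from $D$ and may miss some samples of $D$ (on average about 36.8% of the original samples do not appear in $D^{'}$ , since $(1 - \frac{1}{N})^{N} \to e^{- 1} \approx 0.368$ ).
By repeating this process several times, we can generate multiple different bootstrap samples $D_{1}^{'}, D_{2}^{'}, \dots, D_{M}^{'}$ .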
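
The 36.8% figure is easy to check by simulation. The following is a minimal sketch assuming NumPy; the dataset is represented only by its indices, and the size `N` and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000

# One bootstrap sample: N draws with replacement from the indices 0..N-1.
indices = rng.integers(0, N, size=N)

# Fraction of original samples that never appear in the bootstrap sample.
oob_fraction = 1.0 - len(np.unique(indices)) / N
```

For large `N` the fraction of left-out samples is close to $e^{- 1} \approx 0.368$ ; in Bagging these left-out ("out-of-bag") samples are often used to estimate the error of the model trained on the corresponding bootstrap sample.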
