fancyboi999 · Jun 12, 2026
diff --git a/.sync-upstream-base b/.sync-upstream-base
@@ -1 +1 @@
-b963cf63ffbb574af35da8db301ebb1381515ed8
+148de3663f8bab4c90355292de8d0fac81dc2a86
diff --git a/phases/01-math-foundations/01-linear-algebra-intuition/docs/zh.md b/phases/01-math-foundations/01-linear-algebra-intuition/docs/zh.md
@@ -178,6 +178,10 @@ QR 分解内部就是这么干的。Q 是那组标准正交基，R 记录投影
 - 计算特征值（QR 算法）
 - 最小二乘回归（标准的数值方法）
 
+```figure
+eigen-directions
+```
+
 ## 动手构建
 
 ### 第 1 步：从零写向量（Python）

diff --git a/phases/01-math-foundations/02-vectors-matrices-operations/docs/zh.md b/phases/01-math-foundations/02-vectors-matrices-operations/docs/zh.md
@@ -113,6 +113,10 @@ Broadcasting stretches the vector across rows:
 
 每个现代框架都会自动做这件事。理解它能让你在形状看起来不对、代码却照跑不误时不犯迷糊。
 
+```figure
+vector-projection
+```
+
 ## 动手构建
 
 ### 第 1 步：Vector 类

diff --git a/phases/01-math-foundations/03-matrix-transformations/docs/zh.md b/phases/01-math-foundations/03-matrix-transformations/docs/zh.md
@@ -232,6 +232,10 @@ det = -1:  area preserved but orientation flipped (reflection)
 | det(Reflection) | = -1     (orientation flipped)
 ```
 
+```figure
+matrix-transform
+```
+
 ## 动手构建
 
 ### 第 1 步：从零写变换矩阵（Python）

diff --git a/phases/01-math-foundations/04-calculus-for-ml/docs/zh.md b/phases/01-math-foundations/04-calculus-for-ml/docs/zh.md
@@ -388,6 +388,10 @@ graph RL
 
 前向传播算出预测和损失。反向传播算出损失对每个权重的梯度。然后每个权重往下坡迈一小步。重复几百万步。这就是深度学习。
 
+```figure
+derivative-tangent
+```
+
 ## 动手构建
 
 ### 第 1 步：从零写数值导数

diff --git a/phases/01-math-foundations/05-chain-rule-and-autodiff/docs/zh.md b/phases/01-math-foundations/05-chain-rule-and-autodiff/docs/zh.md
@@ -163,6 +163,10 @@ PyTorch 内部：
 
 这个图是动态的（define-by-run）。每次前向传播都会构建一个新图。这就是为什么 PyTorch 支持在模型里写控制流（if/else、循环）。
 
+```figure
+chain-rule
+```
+
 ## 动手构建
 
 ### 第 1 步：Value 类

diff --git a/phases/01-math-foundations/06-probability-and-distributions/docs/zh.md b/phases/01-math-foundations/06-probability-and-distributions/docs/zh.md
@@ -254,6 +254,10 @@ Log-softmax 把 softmax 和 log 合在一起以保证数值稳定。PyTorch 在
 
 从任意分布采样需要逆变换采样、拒绝采样或重参数化技巧（VAE 里用）这类技术。
 
+```figure
+gaussian-pdf
+```
+
 ## 动手构建
 
 ### 第 1 步：概率基础

diff --git a/phases/01-math-foundations/07-bayes-theorem/docs/zh.md b/phases/01-math-foundations/07-bayes-theorem/docs/zh.md
@@ -195,6 +195,10 @@ MAP 在参数本身之上加了一个先验。如果你相信参数应该偏小
 
 **模型比较是贝叶斯的。** 贝叶斯信息准则（BIC）、边际似然和贝叶斯因子，全都用贝叶斯推理在模型之间选择而不过拟合。
 
+```figure
+bayes-update
+```
+
 ## 动手构建
 
 ### 第 1 步：贝叶斯定理函数

diff --git a/phases/01-math-foundations/08-optimization/docs/zh.md b/phases/01-math-foundations/08-optimization/docs/zh.md
@@ -195,6 +195,10 @@ graph TD
 
 尖锐的最小值泛化得差。平坦的最小值泛化得好。这是带动量的 SGD 在最终测试准确率上常常胜过 Adam 的原因之一：它的噪声防止落进尖锐的最小值。
 
+```figure
+gradient-descent
+```
+
 ## 动手构建
 
 ### 第 1 步：定义一个测试函数

diff --git a/phases/01-math-foundations/09-information-theory/docs/zh.md b/phases/01-math-foundations/09-information-theory/docs/zh.md
@@ -273,6 +273,10 @@ Perplexity = e^H(P,Q)   (if using nats)
 
 GPT-2 在常见基准上达到约 30 的困惑度。现代模型在表示充分的领域里能做到个位数。
 
+```figure
+entropy-kl
+```
+
 ## 动手构建
 
 ### 第 1 步：信息量和熵

diff --git a/phases/01-math-foundations/10-dimensionality-reduction/docs/zh.md b/phases/01-math-foundations/10-dimensionality-reduction/docs/zh.md
@@ -190,6 +190,10 @@ explained_ratio_k = eigenvalue_k / sum(all eigenvalues)
 
 重构误差不止用来选 k。你可以用它做异常检测：重构误差高的样本是不符合学到的子空间的离群点。这是生产系统里基于 PCA 的异常检测的基础。
 
+```figure
+pca-axes
+```
+
 ## 动手构建
 
 ### 第 1 步：从零写 PCA

diff --git a/phases/01-math-foundations/11-singular-value-decomposition/docs/zh.md b/phases/01-math-foundations/11-singular-value-decomposition/docs/zh.md
@@ -360,6 +360,10 @@ It is faster and more numerically stable.
 
 这意味着你在第 10 课学的关于降维的一切，引擎盖下都是 SVD。PCA 是 SVD 在机器学习里最常见的应用。
 
+```figure
+svd-rank-reconstruction
+```
+
 ## 动手构建
 
 ### 第 1 步：用幂迭代从零写 SVD

diff --git a/phases/01-math-foundations/12-tensor-operations/docs/zh.md b/phases/01-math-foundations/12-tensor-operations/docs/zh.md
@@ -101,6 +101,10 @@ graph LR
 
 关键模式：`i,i->`（点积）、`i,j->ij`（外积）、`ii->`（迹）、`ij->ji`（转置）、`bij,bjk->bik`（批量矩阵乘法）、`bhtd,bhsd->bhts`（注意力分数）。
 
+```figure
+tensor-broadcast
+```
+
 ## 动手构建
 
 代码在 `code/tensors.py` 里。每一步都引用那里的实现。

diff --git a/phases/01-math-foundations/13-numerical-stability/docs/zh.md b/phases/01-math-foundations/13-numerical-stability/docs/zh.md
@@ -388,6 +388,10 @@ LayerNorm(x) = (x - mean(x)) / (std(x) + epsilon) * gamma + beta
 原因：float16 表示不了低于 6e-8 的梯度幅度或高于 65,504 的激活值。
 修复：用带损失缩放的混合精度（AMP），或改用 bfloat16。
 
+```figure
+logsumexp-stability
+```
+
 ## 动手构建
 
 ### 第 1 步：演示浮点精度极限

diff --git a/phases/01-math-foundations/14-norms-and-distances/docs/zh.md b/phases/01-math-foundations/14-norms-and-distances/docs/zh.md
@@ -423,6 +423,10 @@ Product quant.    Compress vectors, search       FAISS (memory-constrained)
 
 HNSW（分层可导航小世界）是现代向量数据库里占主导的算法。它构建一个多层图，每个节点连到它的近似最近邻。搜索从顶层（稀疏、长跳）开始，下降到底层（密集、短跳）。
 
+```figure
+norm-unit-balls
+```
+
 ## 动手构建
 
 ### 第 1 步：所有范数和距离函数

diff --git a/phases/01-math-foundations/16-sampling-methods/docs/zh.md b/phases/01-math-foundations/16-sampling-methods/docs/zh.md
@@ -437,6 +437,10 @@ Reverse process (learned):
 
 整个图像生成过程就是迭代采样：从噪声出发，每一步以学到的去噪模型为条件，采样一个噪声稍微少一点的版本。
 
+```figure
+monte-carlo-pi
+```
+
 ## 动手构建
 
 ### 第 1 步：均匀采样和逆 CDF 采样

diff --git a/phases/01-math-foundations/17-linear-systems/docs/zh.md b/phases/01-math-foundations/17-linear-systems/docs/zh.md
@@ -386,6 +386,10 @@ CG 用于：
 
 **特征工程。** X^T X 的条件数告诉你特征是否共线。如果 kappa 大，就丢特征或加正则化。
 
+```figure
+linear-system-conditioning
+```
+
 ## 动手构建
 
 ### 第 1 步：带部分主元的高斯消元

diff --git a/phases/01-math-foundations/18-convex-optimization/docs/zh.md b/phases/01-math-foundations/18-convex-optimization/docs/zh.md
@@ -381,6 +381,10 @@ Replace x_i^T x_j with K(x_i, x_j) to get the kernel trick.
 | Adam | O(n) | O(n) | 深度学习默认 |
 | K-FAC | O(n) | 每层 O(n) | 研究、大批量训练 |
 
+```figure
+convex-vs-nonconvex
+```
+
 ## 动手构建
 
 ### 第 1 步：凸性检查器

diff --git a/phases/01-math-foundations/19-complex-numbers/docs/zh.md b/phases/01-math-foundations/19-complex-numbers/docs/zh.md
@@ -267,6 +267,10 @@ graph LR
     U1 --> A3
 ```
 
+```figure
+roots-of-unity
+```
+
 ## 动手构建
 
 ### 第 1 步：Complex 类

diff --git a/phases/01-math-foundations/20-fourier-transform/docs/zh.md b/phases/01-math-foundations/20-fourier-transform/docs/zh.md
@@ -275,6 +275,10 @@ Example:
 
 真正的频率分辨率只取决于观测时间 T = N / fs。要分辨相隔 delta_f 的两个频率，你至少需要 T = 1 / delta_f 秒的数据。再多补零也改变不了这个根本极限。
 
+```figure
+fourier-synthesis
+```
+
 ## 动手构建
 
 ### 第 1 步：从零写 DFT

diff --git a/phases/01-math-foundations/21-graph-theory/docs/zh.md b/phases/01-math-foundations/21-graph-theory/docs/zh.md
@@ -241,6 +241,10 @@ graph LR
 | 谱聚类 | 无监督节点分组 |
 | PageRank | 节点重要性、网页搜索 |
 
+```figure
+graph-degree-distribution
+```
+
 ## 动手构建
 
 ### 第 1 步：从零写图类

diff --git a/phases/01-math-foundations/22-stochastic-processes/docs/zh.md b/phases/01-math-foundations/22-stochastic-processes/docs/zh.md
@@ -228,6 +228,10 @@ graph LR
 | 马尔可夫决策过程 | 强化学习 |
 | Metropolis-Hastings | 贝叶斯推断、后验采样 |
 
+```figure
+random-walk-diffusion
+```
+
 ## 动手构建
 
 ### 第 1 步：随机游走模拟器

diff --git a/phases/02-ml-fundamentals/02-linear-regression/docs/zh.md b/phases/02-ml-fundamentals/02-linear-regression/docs/zh.md
@@ -147,6 +147,10 @@ Cost = MSE + lambda * sum(w_i^2)
 
 惩罚项抑制大权重。超参数 lambda 控制这个权衡：lambda 越大，权重越小、正则化越强。这会在后面的课里深入讲。现在你只需知道它存在，以及它为什么有用。
 
+```figure
+linear-regression-fit
+```
+
 ## 动手构建
 
 ### 第 1 步：生成示例数据

diff --git a/phases/02-ml-fundamentals/03-logistic-regression/docs/zh.md b/phases/02-ml-fundamentals/03-logistic-regression/docs/zh.md
@@ -169,6 +169,10 @@ F1 = 2 * (Precision * Recall) / (Precision + Recall)
 - **召回率**：当假负例代价高时（癌症筛查，你不想漏掉肿瘤）
 - **F1**：当你需要一个平衡的单一指标时
 
+```figure
+logistic-sigmoid
+```
+
 ## 动手构建
 
 ### 第 1 步：sigmoid 函数和数据生成

diff --git a/phases/02-ml-fundamentals/04-decision-trees/docs/zh.md b/phases/02-ml-fundamentals/04-decision-trees/docs/zh.md
@@ -189,6 +189,10 @@ importance(feature_j) = sum over all nodes where feature_j is used:
 
 当数据有空间或序列结构（图像、文本、音频）时，神经网络才赢。对于平铺的特征表格，树是默认选择。
 
+```figure
+decision-tree-depth
+```
+
 ## 动手构建
 
 ### 第 1 步：Gini 不纯度和熵

diff --git a/phases/02-ml-fundamentals/05-support-vector-machines/docs/zh.md b/phases/02-ml-fundamentals/05-support-vector-machines/docs/zh.md
@@ -221,6 +221,10 @@ SVM 在这些场景仍然赢：
 - 有清晰间隔结构的二分类
 - 异常检测（单类 SVM）
 
+```figure
+svm-margin
+```
+
 ## 动手构建
 
 ### 第 1 步：hinge loss 和梯度

diff --git a/phases/02-ml-fundamentals/06-knn-and-distances/docs/zh.md b/phases/02-ml-fundamentals/06-knn-and-distances/docs/zh.md
@@ -223,6 +223,10 @@ prediction = sum(w_i * y_i) / sum(w_i)
 
 KNN 回归产生分段常数（加权时分段平滑）的预测。它无法外推到训练数据范围之外。如果训练目标全在 0 到 100 之间，KNN 永远不会预测出 200。
 
+```figure
+knn-smoothness
+```
+
 ## 动手构建
 
 ### 第 1 步：距离函数

diff --git a/phases/02-ml-fundamentals/07-unsupervised-learning/docs/zh.md b/phases/02-ml-fundamentals/07-unsupervised-learning/docs/zh.md
@@ -122,6 +122,10 @@ GMM 能建模椭圆形簇（不像 K-Means 只能球形），并天然处理重
 - **DBSCAN**：噪声点按定义就是异常
 - **GMM**：在所有高斯下概率都低的点是异常
 
+```figure
+kmeans-step
+```
+
 ## 动手构建
 
 ### 第 1 步：从零实现 K-Means

diff --git a/phases/02-ml-fundamentals/08-feature-engineering/docs/zh.md b/phases/02-ml-fundamentals/08-feature-engineering/docs/zh.md
@@ -104,6 +104,10 @@ TF-IDF = TF * IDF
 
 **为什么选择重要：** 一个有 10 个好特征的模型，通常胜过一个有 10 个好特征加 90 个噪声特征的模型。噪声特征给了模型在不可泛化的训练数据规律上过拟合的机会。
 
+```figure
+feature-scaling
+```
+
 ## 动手构建
 
 ### 第 1 步：从零实现数值变换

diff --git a/phases/02-ml-fundamentals/09-model-evaluation/docs/zh.md b/phases/02-ml-fundamentals/09-model-evaluation/docs/zh.md
@@ -142,6 +142,10 @@ K=5 或 K=10 是标准选择。每个数据点恰好被用于验证一次。平
 
 **测试太频繁**：每次你看测试性能再调整，就在过拟合测试集。测试集是一次性的。
 
+```figure
+precision-recall-threshold
+```
+
 ## 动手构建
 
 ### 第 1 步：训练/验证/测试划分

diff --git a/phases/02-ml-fundamentals/10-bias-variance/docs/zh.md b/phases/02-ml-fundamentals/10-bias-variance/docs/zh.md
@@ -254,6 +254,10 @@ flowchart TD
     G --> H[试更复杂的模型]
 ```
 
+```figure
+bias-variance
+```
+
 ## 动手构建
 
 `code/bias_variance.py` 里的代码运行完整的偏差-方差分解实验。下面是逐步的做法。

diff --git a/phases/02-ml-fundamentals/12-hyperparameter-tuning/docs/zh.md b/phases/02-ml-fundamentals/12-hyperparameter-tuning/docs/zh.md
@@ -245,6 +245,10 @@ print(f"Nested CV MSE: {-outer_scores.mean():.4f} +/- {outer_scores.std():.4f}")
 
 **拿不准时：** 随机搜索，试验次数取超参数数量的 2 倍（比如 6 个超参数 = 至少 12+ 次试验）。你会惊讶于 50 次试验的随机搜索有多频繁地打败精心设计的网格搜索。
 
+```figure
+k-fold-cv
+```
+
 ## 动手构建
 
 ### 第 1 步：从零实现网格搜索

diff --git a/phases/02-ml-fundamentals/14-naive-bayes/docs/zh.md b/phases/02-ml-fundamentals/14-naive-bayes/docs/zh.md
@@ -225,6 +225,10 @@ flowchart LR
 log P(class | features) = log P(class) + sum_i log P(feature_i | class)
 ```
 
+```figure
+naive-bayes
+```
+
 ## 动手构建
 
 `code/naive_bayes.py` 里的代码从零实现了 MultinomialNB 和 GaussianNB。

diff --git a/phases/02-ml-fundamentals/17-imbalanced-data/docs/zh.md b/phases/02-ml-fundamentals/17-imbalanced-data/docs/zh.md
@@ -211,6 +211,10 @@ flowchart TD
     M -->|是| O[交付]
 ```
 
+```figure
+class-imbalance
+```
+
 ## 动手构建
 
 ### 第 1 步：生成一个不平衡数据集

diff --git a/phases/03-deep-learning-core/01-the-perceptron/docs/zh.md b/phases/03-deep-learning-core/01-the-perceptron/docs/zh.md
@@ -112,6 +112,10 @@ AND (separable):        XOR (not separable):
 
 解法：把感知机叠成多层。一个多层感知机可以把两个线性决策组合成一个非线性决策，从而解决 XOR。
 
+```figure
+perceptron-boundary
+```
+
 ## 动手构建
 
 ### 第 1 步：Perceptron 类

diff --git a/phases/03-deep-learning-core/02-multi-layer-networks/docs/zh.md b/phases/03-deep-learning-core/02-multi-layer-networks/docs/zh.md
@@ -146,6 +146,10 @@ graph LR
 
 神经网络是可组合的。你可以把它们叠起来、串起来、并行跑。一个 Whisper 模型用一个编码器网络处理音频，再用一个独立的解码器网络生成文本。现代 LLM 是仅解码器（decoder-only）的。BERT 是仅编码器（encoder-only）的。T5 是编码器-解码器的。架构的选择决定了模型能做什么。
 
+```figure
+mlp-forward
+```
+
 ## 动手构建
 
 纯 Python，不用 numpy。每个矩阵运算都从零手写。

diff --git a/phases/03-deep-learning-core/03-backpropagation/docs/zh.md b/phases/03-deep-learning-core/03-backpropagation/docs/zh.md
@@ -132,6 +132,10 @@ dL/db1 = dL/dz1
 
 每个梯度都是从损失反向追溯回来的一连串局部导数的乘积。反向传播就这么点东西。
 
+```figure
+backprop-vanishing
+```
+
 ## 动手构建
 
 ### 第 1 步：Value 节点