👉 List of all notes for this book. IMPORTANT UPDATE November 18, 2024: I've stopped taking detailed notes from the book and now only highlight and annotate directly in the PDF files/book. With so many books to read, I don't have time to type everything. In the future, if I make notes while reading a book, they'll contain only the most notable points (for me).
MSE cost function for a linear regression model
$$ \operatorname{MSE}\left(\mathbf{X}, h_{\boldsymbol{\theta}}\right)=\frac{1}{m} \sum_{i=1}^m\left(\boldsymbol{\theta}^{\top} \mathbf{x}^{(i)}-y^{(i)}\right)^2 $$
Closed-form solution: the solution is computed directly from a mathematical formula.
$$ \widehat{\boldsymbol{\theta}}=\left(\mathbf{X}^{\top} \mathbf{X}\right)^{-1} \mathbf{X}^{\top} \mathbf{y} $$
The @ operator performs matrix multiplication. It works with NumPy, TF, PyTorch, and JAX arrays, but not with pure Python lists.
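A minimal NumPy sketch of the Normal equation using the @ operator (the synthetic data and the 100-instance size are my own illustration, not the book's exact code):

```python
import numpy as np

# Synthetic linear data: y = 4 + 3x + Gaussian noise (illustrative values)
rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.standard_normal((m, 1))

X_b = np.c_[np.ones((m, 1)), X]                     # add x0 = 1 to each instance
theta_hat = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y  # (XᵀX)⁻¹ Xᵀ y
print(theta_hat)                                    # ≈ [[4], [3]]
```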
The Normal equation & SVD approaches are slow with respect to n (#features) but fairly fast with respect to m (#instances). ← with many features, GD is faster
Gradient Descent: a generic optimization method. The main idea is to tweak parameters iteratively to minimize the cost function. ← compute the local gradient and follow the downhill direction, from the top to the bottom (bottom = min loss)
With feature scaling → GD converges faster! ← StandardScaler
Batch gradient descent → uses the whole batch of training data at each step to compute the gradient
Learning rate: too low → training takes too long, too high → the algorithm may diverge
Figure 4-8. Gradient descent with various learning rates
Find good learning rate → use grid search.
How many #epochs? → set a high number, but with a condition (tolerance) to stop when the gradient becomes tiny. Smaller tolerance → takes longer to converge. ← minimal sketch below
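A minimal batch GD sketch with a tolerance-based stop (assumes X_b, y, m from the Normal-equation sketch above; eta, n_epochs, and tolerance are illustrative values):

```python
import numpy as np

# Assumes X_b (features with bias column), y, and m from the sketch above.
eta = 0.1          # learning rate (illustrative)
n_epochs = 10_000  # large cap on the number of iterations
tolerance = 1e-6   # stop once the gradient vector becomes tiny

theta = np.random.randn(2, 1)  # random initialization

for epoch in range(n_epochs):
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)  # gradient of the MSE
    theta -= eta * gradients
    if np.linalg.norm(gradients) < tolerance:      # tolerance-based stopping
        break
```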
Batch GD → whole training set → slow → the complete opposite is Stochastic GD ← picks one random instance at each step.
→ Adjust the learning rate during training (use a learning schedule): start large, then make it smaller and smaller. ← like the simulated annealing algorithm (metallurgy: molten metal is cooled down slowly)
partial_fit(): a method you can call to run a single round of training on one or more instances.
warm_start=True with fit(): training continues where it left off.
Mini-batch GD: at each step, compute gradients on small random sets of instances (mini-batches), somewhere between 1 instance (SGD) and the full training set (Batch GD). ← minimal sketch below
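A minimal sketch with scikit-learn's SGDRegressor showing partial_fit() and warm_start=True (hyperparameter values and the synthetic data are illustrative, not the book's exact settings):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = (4 + 3 * X + rng.standard_normal((100, 1))).ravel()  # 1D target

# Stochastic GD (uses an "invscaling" learning schedule by default)
sgd_reg = SGDRegressor(max_iter=1000, tol=1e-5, eta0=0.01, random_state=42)
sgd_reg.fit(X, y)

# One extra round of training on a few instances
sgd_reg.partial_fit(X[:10], y[:10])

# warm_start=True: repeated fit() calls continue where training left off
sgd_warm = SGDRegressor(max_iter=1, warm_start=True, tol=None, random_state=42)
for _ in range(5):
    sgd_warm.fit(X, y)
```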
Polynomial Regression (when the data isn't a straight line) ← PolynomialFeatures
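A minimal sketch (the synthetic quadratic data and degree=2 are my own illustration):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = 6 * rng.random((100, 1)) - 3
y = 0.5 * X**2 + X + 2 + rng.standard_normal((100, 1))  # quadratic + noise

poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)                  # adds the x² column

lin_reg = LinearRegression().fit(X_poly, y)
print(lin_reg.intercept_, lin_reg.coef_)                 # ≈ 2, [1, 0.5]
```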
How to tell if a model is underfitting or overfitting?
Recall: use cross-validation → does well on training but poorly on cross-validation → overfitting. Poor on both → underfitting.
Learning curves: plots of training error and validation error vs the training set size (or training iteration). ← learning_curve() (sketch after the figure notes below)
Figure 4-15. Learning curves ← no gap between the 2 curves ← underfitting → both training and validation errors are high
→ to fix: choose a more complex model or better features.
Figure 4-16. Learning curves for the 10th-degree polynomial model ← overfitting → training error is much lower than validation error (gap between the curves)
→ to fix: more data
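A minimal sketch of learning_curve() (plotting RMSE; the quadratic data, train_sizes, and cv=5 are illustrative choices; a linear model on quadratic data gives an underfitting curve like Figure 4-15):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(42)
X = 6 * rng.random((100, 1)) - 3
y = (0.5 * X**2 + X + 2 + rng.standard_normal((100, 1))).ravel()

train_sizes, train_scores, valid_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 10), cv=5,
    scoring="neg_root_mean_squared_error")

plt.plot(train_sizes, -train_scores.mean(axis=1), "r-+", label="train")
plt.plot(train_sizes, -valid_scores.mean(axis=1), "b-", label="valid")
plt.xlabel("training set size")
plt.ylabel("RMSE")
plt.legend()
plt.show()
```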
Bias / variance trade-off
→ It's called a trade-off because increasing the degree (model complexity) → increases variance but decreases bias, and vice versa.
A good way to reduce overfitting → regularization
Ridge regression (Ridge) ← uses $l_2$
$$ J(\boldsymbol{\theta})=\operatorname{MSE}(\boldsymbol{\theta})+\frac{\alpha}{m} \sum_{i=1}^n \theta_i{ }^2 $$
Lasso regression (Least Absolute Shrinkage and Selection Operator regression, Lasso) ← uses $l_1$
$$ J(\boldsymbol{\theta})=\operatorname{MSE}(\boldsymbol{\theta})+2 \alpha \sum_{i=1}^n\left|\theta_i\right| $$
An important characteristic of lasso regression is that it tends to eliminate the weights of the least important features (i.e., set them to zero).
Elastic net regression: a middle ground between ridge regression and lasso regression.
$$ J(\boldsymbol{\theta})=\operatorname{MSE}(\boldsymbol{\theta})+r\left(2 \alpha \sum_{i=1}^n\left|\theta_i\right|\right)+(1-r)\left(\frac{\alpha}{m} \sum_{i=1}^n \theta_i^2\right) $$
(source) Ridge regression can't zero out coefficients; you end up with either all coefficients in the model or none. In contrast, LASSO does both parameter shrinkage and variable selection. For highly correlated covariates, consider the Elastic Net instead of the LASSO.
It’s important to scale the data before regularization.
Which one to use? → Avoid using no regularization at all; if only a few features are useful → lasso/elastic net; when #features > #instances → use elastic net. ← minimal sketch below
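A minimal sketch of the three regularized models in scikit-learn, scaling first (the alpha and l1_ratio values and the synthetic data with one useless feature are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.5, 0.0, -2.0]) + rng.standard_normal(100)  # middle feature is useless

# Scale the data before regularizing (see the note above)
ridge   = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
lasso   = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
elastic = make_pipeline(StandardScaler(), ElasticNet(alpha=0.1, l1_ratio=0.5))

for model in (ridge, lasso, elastic):
    model.fit(X, y)
    print(model[-1].coef_)  # lasso/elastic tend to push the useless weight toward 0
```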
Early stopping: stop training as soon as the validation error reaches its minimum.
Figure 4-20. Early stopping regularization
copy.deepcopy() → copies both the model's hyperparameters & learned parameters, whereas sklearn.base.clone() only copies the hyperparameters. ← early-stopping sketch below
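A minimal early-stopping sketch using warm_start and deepcopy (the synthetic data, train/validation split, SGDRegressor settings, and 500-epoch cap are illustrative):

```python
from copy import deepcopy
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = 2 * rng.random((200, 1))
y = (4 + 3 * X + rng.standard_normal((200, 1))).ravel()
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=42)

sgd_reg = SGDRegressor(max_iter=1, warm_start=True, tol=None,
                       learning_rate="constant", eta0=0.002, random_state=42)

best_valid_error, best_model = float("inf"), None
for epoch in range(500):
    sgd_reg.fit(X_train, y_train)       # warm_start: continues where it left off
    valid_error = mean_squared_error(y_valid, sgd_reg.predict(X_valid))
    if valid_error < best_valid_error:  # keep the best model seen on validation
        best_valid_error = valid_error
        best_model = deepcopy(sgd_reg)  # copies learned params, not just hyperparams
```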
Logistic regression: we can use regression for classification (0/1) → estimate the probability that an instance belongs to a given class. ← called "regression" because it's an extension of linear regression (just apply the sigmoid function before the output)
Sigmoid → 2 classes ← can be generalized to support multiple classes
Softmax → multiple classes ← softmax regression / multinomial logistic regression
Read more about these activation functions in this note.
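For reference, the standard sigmoid and softmax definitions (standard notation, not copied from the book), where $s_k(\mathbf{x})$ is the score for class $k$ and $K$ is the number of classes:
$$ \sigma(t)=\frac{1}{1+e^{-t}}, \qquad \hat{p}_k=\frac{\exp \left(s_k(\mathbf{x})\right)}{\sum_{j=1}^K \exp \left(s_j(\mathbf{x})\right)} $$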
❓Still not quite clear to me how going from a 2-class dataset relates to probabilities, and then to logistic regression? ← watch Andrew's lecture video on this.
Iris dataset: contains the sepal and petal length and width of 150 iris flowers of three different species: Iris setosa, Iris versicolor, and Iris virginica.
Figure 4-22. Flowers of three iris plant species
The softmax regression classifier predicts only one class at a time (multiclass, not multioutput) ← cannot use it to recognize multiple people in one picture.
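A minimal sketch of softmax regression on the Iris dataset with scikit-learn's LogisticRegression (C=30 and the split are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# With more than 2 classes, LogisticRegression switches to softmax (multinomial)
# regression when the solver supports it (the default lbfgs solver does).
softmax_reg = LogisticRegression(C=30, max_iter=1000, random_state=42)
softmax_reg.fit(X_train, y_train)

print(softmax_reg.predict(X_test[:3]))        # one predicted class per instance
print(softmax_reg.predict_proba(X_test[:3]))  # estimated class probabilities
```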