👉 List of all notes for this book. IMPORTANT UPDATE November 18, 2024: I've stopped taking detailed notes from the book and now only highlight and annotate directly in the PDF files/book. With so many books to read, I don't have time to type everything. In the future, if I make notes while reading a book, they'll contain only the most notable points (for me).
The Perceptron: one of the simplest ANN architectures (ANN = artificial neural network)
Figure 10-4. TLU (threshold logic unit): an artificial neuron that computes a weighted sum of its inputs $w^Tx$, plus a bias term b, then applies a step function
The most common step function is the Heaviside step function; sometimes the sign function is used instead.
$$ \operatorname{heaviside}(z)= \begin{cases} 0 & \text{if } z<0 \\ 1 & \text{if } z \geq 0 \end{cases} $$
$$ \operatorname{sgn}(z)= \begin{cases} -1 & \text{if } z<0 \\ 0 & \text{if } z=0 \\ +1 & \text{if } z>0 \end{cases} $$
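A minimal NumPy sketch of a single TLU using the step functions above (my own illustrative code, not the book's; the input, weights, and bias are made up):

```python
import numpy as np

def heaviside(z):
    # 0 if z < 0, 1 if z >= 0
    return np.where(z < 0, 0, 1)

def sgn(z):
    # -1 if z < 0, 0 if z == 0, +1 if z > 0
    return np.sign(z)

def tlu(x, w, b, step=heaviside):
    # Weighted sum of the inputs plus the bias, then a step function
    z = w @ x + b
    return step(z)

x = np.array([2.0, -1.0, 0.5])   # made-up input
w = np.array([0.3, 0.8, -0.2])   # made-up weights
print(tlu(x, w, b=0.1))          # here z = -0.2, so heaviside gives 0
```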
How is a perceptron trained? → It follows a variant of Hebb’s rule, “cells that fire together, wire together” (the connection weight between two neurons tends to increase when they fire simultaneously); see the sketch below.
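A rough sketch of the resulting perceptron learning rule, $w \leftarrow w + \eta\,(y - \hat{y})\,x$, on a toy linearly separable problem (NumPy assumed; dataset and hyperparameters are illustrative):

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, n_epochs=20):
    """Perceptron learning rule: w += eta * (y - y_hat) * x for each sample."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_epochs):
        for xi, yi in zip(X, y):
            y_hat = 1 if (w @ xi + b) >= 0 else 0   # Heaviside step
            error = yi - y_hat
            w += eta * error * xi                    # strengthen helpful connections
            b += eta * error
    return w, b

# Toy linearly separable data: the logical AND function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print(w, b)   # a separating hyperplane for AND
```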
perceptrons have limitations (e.g., they cannot solve the XOR problem) → use a multilayer perceptron (MLP)
perceptrons do not output a class probability → use logistic regression instead.
When an ANN contains a deep stack of hidden layers → deep neural network (DNN)
Back in the day, computers were not powerful → training MLPs was a problem, even with gradient descent.
→ Backpropagation: an algorithm that efficiently computes the gradients of the MLP’s cost function with respect to all its weights, so gradient descent can minimize it.
→ Read this note (DL course 1).
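To make backprop concrete, here is a minimal NumPy sketch (not the book's code; the architecture, loss, and learning rate are my own choices): a tiny one-hidden-layer MLP with sigmoid activations trained on XOR, showing the forward pass, the chain-rule backward pass, and the gradient-descent update.

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny MLP: 2 inputs -> 4 hidden units -> 1 output, trained on XOR
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(scale=1.0, size=(2, 4))   # random init (breaks symmetry)
b1 = np.zeros((1, 4))
W2 = rng.normal(scale=1.0, size=(4, 1))
b2 = np.zeros((1, 1))
eta = 2.0                                  # learning rate (illustrative)

for epoch in range(10_000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)               # hidden activations
    y_hat = sigmoid(h @ W2 + b2)           # output

    # Backward pass (chain rule), MSE loss L = mean((y_hat - y)^2)
    d_out = (y_hat - y) * y_hat * (1 - y_hat)   # dL/dz2 (up to a constant)
    d_hid = (d_out @ W2.T) * h * (1 - h)        # dL/dz1

    # Gradient descent update
    W2 -= eta * h.T @ d_out / len(X)
    b2 -= eta * d_out.mean(axis=0, keepdims=True)
    W1 -= eta * X.T @ d_hid / len(X)
    b1 -= eta * d_hid.mean(axis=0, keepdims=True)

print(np.round(y_hat.ravel(), 2))   # predictions after training, typically close to [0, 1, 1, 0]
```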
<aside> ☝ From this, I've decided to browse additional materials to deepen my understanding of Deep Learning. I found that the book has become more generalized than I expected, so I'll explore other resources before returning to finish it.
</aside>
Watch more: Neural networks | 3Blue1Brown - YouTube
It’s important to initialize all the hidden layers’ connection weights randomly! Otherwise all neurons in a layer would compute the same output and receive the same gradient update, and the layer would behave as if it had a single neuron.
Replace the step function in MLPs with the sigmoid function, because the sigmoid has a well-defined nonzero derivative everywhere, which gradient descent needs to make progress!
The ReLU activation (rectified linear unit) is continuous but not differentiable at 0. In practice it works very well and is fast to compute, so it has become the default.
Some popular activations
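For reference, quick NumPy definitions of a few common ones (my own sketch, not the book's list):

```python
import numpy as np

def sigmoid(z):
    # Smooth, outputs in (0, 1), nonzero derivative everywhere
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Like sigmoid but outputs in (-1, 1) and is zero-centered
    return np.tanh(z)

def relu(z):
    # max(0, z): not differentiable at 0, but cheap and works very well
    return np.maximum(0.0, z)

z = np.linspace(-3, 3, 7)
print(sigmoid(z), tanh(z), relu(z), sep="\n")
```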
Conversely, a large enough DNN with nonlinear activations can theoretically approximate any continuous function (with only linear activations, the whole stack of layers would collapse into a single linear model).
Regression MLPs → use MLPs for regression tasks, e.g., Scikit-Learn’s MLPRegressor
gradient descent does not converge very well when the features have very different scales → scale the inputs first (see the sketch below)
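Putting the last two notes together, a minimal Scikit-Learn sketch (dataset and hyperparameters are illustrative, not the book's code): wrap MLPRegressor in a pipeline with StandardScaler so gradient descent sees features on comparable scales.

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, random_state=42)

# Scale the inputs, then train a small MLP for regression
mlp_reg = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(50, 50), max_iter=1000, random_state=42),
)
mlp_reg.fit(X_train, y_train)
print(mlp_reg.score(X_test, y_test))   # R^2 on the test set
```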