👉 List of all notes for this book. IMPORTANT UPDATE November 18, 2024: I've stopped taking detailed notes from the book and now only highlight and annotate directly in the PDF files/book. With so many books to read, I don't have time to type everything. In the future, if I make notes while reading a book, they'll contain only the most notable points (for me).
<aside>
📔 Jupyter notebook for this chapter: on GitHub, on Colab, on Kaggle.
</aside>
We use the MNIST dataset in this chapter = 70K small images of handwritten digits. ← the “Hello world” of ML.
Download from OpenML.org. ← use sklearn.datasets.fetch_openml
```python
from sklearn.datasets import fetch_openml

# data contains images -> a DataFrame isn't suitable, so as_frame=False
mnist = fetch_openml('mnist_784', as_frame=False)
X, y = mnist.data, mnist.target
X.shape  # (70000, 784)
```
sklearn.datasets contains 3 types of functions (a quick sketch follows this list):
- fetch_* functions such as fetch_openml() to download real-life datasets.
- load_* functions to load small toy datasets (no need to download).
- make_* functions to generate fake datasets.
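To make the three kinds concrete, a quick illustrative sketch (load_iris and make_classification are just example picks, not from these notes):

```python
from sklearn.datasets import fetch_openml, load_iris, make_classification

# fetch_*: downloads a real-life dataset (cached locally after the first call)
mnist = fetch_openml('mnist_784', as_frame=False)

# load_*: small toy dataset bundled with scikit-learn, no download needed
iris = load_iris()

# make_*: generates a fake dataset, e.g. for quick experiments
X_fake, y_fake = make_classification(n_samples=100, n_features=5, random_state=42)
```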
70K images, 784 features. Each image = 28×28 pixels.
Plot an image:
```python
import matplotlib.pyplot as plt

def plot_digit(image_data):
    image = image_data.reshape(28, 28)
    plt.imshow(image, cmap="binary")
    plt.axis("off")

some_digit = X[0]
plot_digit(some_digit)
plt.show()
```
```python
y[0]  # '5' (labels are strings)
```
MNIST from fetch_openml() is already split into a training set (first 60K images, already shuffled) and a test set (last 10K images).
```python
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
```
The training set is already shuffled ← good for cross-validation (all folds will be similar, with none missing some digits).
Let’s simplify the problem: “detect only the number 5” ← a binary classifier (2 classes, 5 or non-5). First we need target vectors for this task (True for all 5s, False for everything else) ← see the code below.
A good model to start with is the stochastic gradient descent (SGD, or stochastic GD) classifier ← SGDClassifier ← deals with training instances independently, one at a time ← handles large datasets efficiently and is well suited for online learning.
```python
from sklearn.linear_model import SGDClassifier

# target vectors for the 5-detector (labels are strings)
y_train_5 = (y_train == '5')
y_test_5 = (y_test == '5')

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)
sgd_clf.predict([some_digit])  # array([ True]) -> it guesses this image is a 5
```
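Since the notes highlight online learning: a minimal sketch of incremental training with partial_fit (the mini-batch loop and the batch size of 1,000 are my own illustration, not from the book):

```python
import numpy as np

sgd_online = SGDClassifier(random_state=42)
classes = np.array([False, True])  # partial_fit must see the full class list up front

# stream the training set in mini-batches instead of fitting everything at once
for start in range(0, len(X_train), 1000):
    X_batch = X_train[start:start + 1000]
    y_batch = y_train_5[start:start + 1000]
    sgd_online.partial_fit(X_batch, y_batch, classes=classes)
```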
Evaluating a classifier is often significantly trickier than evaluating a regressor!
Use cross_val_score() ← k-fold cross-validation.
```python
from sklearn.model_selection import cross_val_score

cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
# ~95% accuracy on each of the 3 folds
```
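Roughly what cross_val_score() is doing under the hood: a sketch with StratifiedKFold (the book walks through a similar manual loop for when you need more control; stratification keeps the 5/non-5 ratio the same in every fold):

```python
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

skfolds = StratifiedKFold(n_splits=3)  # add shuffle=True if the data weren't shuffled

for train_index, test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf)  # fresh, unfitted copy of the model for each fold
    X_fold_train, y_fold_train = X_train[train_index], y_train_5[train_index]
    X_fold_test, y_fold_test = X_train[test_index], y_train_5[test_index]

    clone_clf.fit(X_fold_train, y_fold_train)
    y_pred = clone_clf.predict(X_fold_test)
    print((y_pred == y_fold_test).mean())  # accuracy on this fold
```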
Wow, ~95% accuracy with SGD, but is it actually good? → Let’s try DummyClassifier ← it classifies every single image into the most frequent class (non-5), then use cross_val_score
→ 90% accuracy! Why? Because only about 10% of the images are 5s ← if you always guess that an image is not a 5, you’re right 90% of the time!
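A sketch of that baseline check (DummyClassifier’s default strategy predicts the most frequent class seen during training):

```python
from sklearn.dummy import DummyClassifier

dummy_clf = DummyClassifier()  # default strategy: always predict the most frequent class
dummy_clf.fit(X_train, y_train_5)
print(any(dummy_clf.predict(X_train)))  # False: it never predicts a 5

cross_val_score(dummy_clf, X_train, y_train_5, cv=3, scoring="accuracy")
# ~90% accuracy on each fold, despite never detecting a single 5
```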