[Chollet] Deep Learning with Python

This book is written for people with Python programming experience who want to get started with machine learning and deep learning. You don't need any prior experience in either field: the book covers all the necessary basics from the start.

Fundamentals

Artificial intelligence, machine learning, and deep learning

This adjustment is the task of the optimizer, which implements what is called the backpropagation algorithm: the central algorithm in deep learning. Although deep learning is a fairly old subfield of machine learning, it only gained prominence in the early 2010s.

Before deep learning

Together, these two properties have made deep learning far more successful than previous approaches to machine learning. A great way to get a feel for the current landscape of machine learning algorithms and tools is to check out machine learning competitions on Kaggle.

Figure 1.11 A decision tree: the parameters that are learned are the questions about the data

Why deep learning? Why now?

Machine learning - especially deep learning - has become central to the product strategy of these tech giants. In the early days, deep learning required significant C++ and CUDA expertise, which few people possessed.

A first look at a neural network

An optimizer—The mechanism by which the network will update itself based on the data it sees and its loss function. Two quantities are displayed during training: the loss of the network over the training data, and the accuracy of the network over the training data.

Data representations for neural networks

Data type (usually called dtype in Python libraries) - This is the type of the data contained in the tensor; for example, a tensor's type could be float32, uint8, float64, and so on. To make this more concrete, let's look back at the data we processed in the MNIST example.
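As a rough illustration of these attributes, the sketch below inspects a NumPy array. The array is a hypothetical stand-in for a small batch of 28x28 grayscale images, not the actual MNIST data, and its shape is an assumption made for the example.

```python
import numpy as np

# Hypothetical stand-in for a batch of grayscale images (assumed shape).
images = np.zeros((100, 28, 28), dtype=np.uint8)

print(images.ndim)   # number of axes of the tensor
print(images.shape)  # size of the tensor along each axis
print(images.dtype)  # type of the values stored in the tensor

# Casting to float32 and rescaling to [0, 1] is a common preprocessing step.
scaled = images.astype("float32") / 255
print(scaled.dtype)
```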

Figure 2.2 The fourth sample in our dataset

The gears of neural networks: tensor operations

1 Axes (called broadcast axes) are added to the smaller tensor to match the ndim of the larger tensor. 2 The smaller tensor is repeated alongside these new axes to match the full shape of the larger tensor.
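The two broadcasting steps can be spelled out by hand and compared with what NumPy does implicitly. This is a minimal sketch; the array values and variable names are chosen only for illustration.

```python
import numpy as np

x = np.arange(6, dtype="float32").reshape(2, 3)     # larger tensor, shape (2, 3)
y = np.array([10.0, 20.0, 30.0], dtype="float32")   # smaller tensor, shape (3,)

# Step 1: add a broadcast axis so that y has the same ndim as x.
y_expanded = y[np.newaxis, :]                        # shape (1, 3)

# Step 2: repeat y alongside the new axis to match the full shape of x.
y_repeated = np.repeat(y_expanded, x.shape[0], axis=0)  # shape (2, 3)

manual = x + y_repeated
automatic = x + y  # NumPy broadcasts y without materializing the repeats

print(np.array_equal(manual, automatic))
```

In practice no repeated copy is created; the explicit repetition here only serves as a mental model.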

Figure 2.5 Matrix dot-product box diagram

The engine of neural networks

You can then move the coefficients in the opposite direction of the gradient, reducing the loss. 3 Calculate the network loss on the batch, a measure of the mismatch between y_pred and y.
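To make the update rule concrete, here is a minimal sketch of gradient descent on a one-variable linear model with a mean squared error loss. The data, the learning rate, and the use of the whole dataset as a single batch are assumptions for the example; a real network would obtain its gradients through backpropagation rather than hand-written formulas.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 1))
y = 3.0 * x + 1.0  # targets generated from a known linear rule

w, b = 0.0, 0.0        # the coefficients to learn
learning_rate = 0.1    # assumed step size


def loss_fn(w, b):
    """Mean squared mismatch between y_pred and y."""
    y_pred = w * x + b
    return float(np.mean((y_pred - y) ** 2))


initial_loss = loss_fn(w, b)
for _ in range(200):
    y_pred = w * x + b
    error = y_pred - y
    # Gradient of the loss with respect to each coefficient.
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    # Move the coefficients in the opposite direction of the gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
final_loss = loss_fn(w, b)

print(initial_loss, final_loss, w, b)
```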

Figure 2.12 Gradient descent down a 2D loss surface (two learnable parameters)

Looking back at our first example

Two key concepts that you will see often in future chapters are loss and optimization. In this chapter, we will take a closer look at the core components of neural networks that we introduced in Chapter 2: layers, networks, objective functions, and optimization. By the end of this chapter, you will be able to use neural networks to solve simple machine learning problems such as classification and regression over vector data.

You will then be ready to begin building a more principled, theory-driven understanding of machine learning in Chapter 4.

Anatomy of a neural network

But as you move forward, you'll be exposed to a much wider variety of network topologies. What you will then be looking for is a good set of values for the weight tensors involved in these tensor operations. Just remember that all the neural networks you build will be equally ruthless in reducing their loss function—so choose your target wisely, or you'll have to deal with unwanted side effects.

In the next few chapters, we explicitly describe which loss functions should be chosen for a wide range of common tasks.

Introduction to Keras

Keras is a model-level library that provides high-level building blocks for developing deep learning models. In the future, it is likely that Keras will be expanded to work with even more deep learning execution engines. We recommend using the TensorFlow backend as the default for most of your deep learning needs as it is the most widely used, scalable, and production-ready solution.

On GPU, TensorFlow wraps a library of well-optimized deep learning operations, the NVIDIA CUDA Deep Neural Network library (cuDNN).

Setting up a deep-learning workstation

If you don't already have a GPU that you can use for deep learning (a recent, high-end NVIDIA GPU), then running deep learning experiments in the cloud is an easy and low-cost way to get started without having to buy any additional equipment. As of mid-2017, the cloud offering that makes it easiest to get started with deep learning is AWS EC2. As of mid-2017, we recommend the NVIDIA TITAN Xp as the best card on the market for deep learning.

There is no shortage of tutorials on installing Keras and common deep learning dependencies.

Classifying movie reviews

The input data are vectors and the labels are scalars (1s and 0s): this is the easiest setup you'll ever come across. Because you are facing a binary classification problem and the output of your network is a probability (you end your network with a single-unit layer with a sigmoid activation), it is best to use . You now train the model for 20 epochs (20 iterations over all samples in the x_train and y_train tensors), in minibatches of 512 samples.

At the same time, you monitor the loss and accuracy of the 10,000 samples you set aside.
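The two monitored quantities, loss and accuracy, can be sketched for a sigmoid output as follows. The passage above leaves the loss unnamed; this example assumes binary cross-entropy, a common pairing with a single sigmoid unit, purely for illustration. The function names, logits, and labels are hypothetical.

```python
import numpy as np


def sigmoid(z):
    """Squash raw scores into probabilities between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Assumed loss: mean binary cross-entropy over the samples."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)  # avoid log(0)
    losses = y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)
    return float(-np.mean(losses))


y_true = np.array([1.0, 0.0, 1.0, 0.0])    # scalar labels (1s and 0s)
logits = np.array([2.0, -1.0, 0.5, 0.3])   # hypothetical raw network outputs
y_prob = sigmoid(logits)

loss = binary_crossentropy(y_true, y_prob)
# A sample counts as correct when the thresholded probability matches its label.
accuracy = float(np.mean((y_prob > 0.5) == (y_true == 1.0)))

print(loss, accuracy)
```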

Figure 3.4 The rectified linear unit function

Classifying newswires

This means that the network will output a probability distribution over the 46 different output classes: for each input sample, the network will produce a 46-dimensional output vector, where output[i] is the probability that the sample belongs to class i. It measures the distance between two probability distributions: here, between the probability distribution output by the network and the true distribution of the labels. The network can squeeze most of the necessary information into these eight-dimensional representations, but not all of it.

It minimizes the distance between the probability distribution output by the network and the true distribution of targets.
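The sketch below shows one way to turn raw scores into a 46-way probability distribution and to measure its distance from the true labels. It assumes a softmax output and a categorical cross-entropy loss; the random scores and the chosen labels are placeholders, not real model outputs.

```python
import numpy as np


def softmax(logits):
    """Convert raw scores into a probability distribution per sample."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)


def categorical_crossentropy(y_true_onehot, y_prob, eps=1e-7):
    """Mean cross-entropy between the true and predicted distributions."""
    y_prob = np.clip(y_prob, eps, 1.0)  # avoid log(0)
    return float(-np.mean(np.sum(y_true_onehot * np.log(y_prob), axis=-1)))


num_classes = 46
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, num_classes))  # placeholder scores for 4 samples
labels = np.array([3, 10, 45, 0])           # hypothetical class indices

probs = softmax(logits)                      # probs[n, i]: probability of class i
one_hot = np.eye(num_classes)[labels]        # true label distributions
loss = categorical_crossentropy(one_hot, probs)

print(probs.shape, probs.sum(axis=1), loss)
```

The loss shrinks toward zero as the predicted distribution concentrates on the correct class.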

Predicting house prices: a regression example

Since there are so few samples available, you will use a very small network with two hidden layers of 64 units each. This is a typical setup for scalar regression (regression where you are trying to predict a single continuous value). Once you're done tuning the other model parameters (in addition to the number of epochs, you can also adjust the size of the hidden layers), you can train the final production model on all the training data with the best parameters, and then see how it performs on the test data.

When little data is available, K-fold validation is a good way to evaluate your model reliably.
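A minimal sketch of K-fold validation follows. The helper name `k_fold_indices`, the synthetic targets, and the baseline "model" that predicts the mean of the training targets are all assumptions for illustration; in practice each fold would train and evaluate a fresh copy of the real model.

```python
import numpy as np


def k_fold_indices(num_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    folds = np.array_split(np.arange(num_samples), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx


rng = np.random.default_rng(0)
targets = rng.normal(loc=20.0, scale=5.0, size=100)  # synthetic regression targets

scores = []
for train_idx, val_idx in k_fold_indices(len(targets), k=4):
    # Placeholder model: always predict the mean of the training targets.
    prediction = targets[train_idx].mean()
    # Mean absolute error on the held-out fold.
    mae = float(np.mean(np.abs(targets[val_idx] - prediction)))
    scores.append(mae)

# The reported score is the average over the k validation runs.
final_score = float(np.mean(scores))
print(scores, final_score)
```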

Four branches of machine learning

For example, autoencoders are a well-known instance of self-supervised learning, where the generated targets are the input itself, unmodified. Binary classification - A classification task where each input sample must be categorized into one of two exclusive categories. Multi-class classification - A classification task where each input sample must be categorized into one of more than two categories: for example, classifying handwritten digits.

Vector regression - A task where the target is a set of continuous values: for example, a continuous vector.

Evaluating machine-learning models

This is the simplest evaluation protocol and has one flaw: if little data is available, your validation and test sets may contain too few samples to be statistically representative of the existing data. This is easy to identify: if different random rounds of shuffling the data before splitting end up with very different model performance measures, then you have this problem. The final result is the average of the results obtained from each K-fold validation run.

Data representativeness – You want both your training set and test set to be representative of the available data.

Figure 4.1 Simple hold-out validation split

Data preprocessing, feature engineering, and feature learning

In many cases, it is not reasonable to expect a machine learning model to be able to learn from completely arbitrary data. If you choose to use the raw pixels of an image as input, you have a difficult machine learning problem on your hands. A simple machine learning algorithm can then learn to associate these coordinates with the corresponding time of day.

Does this mean you don't have to worry about feature engineering as long as you are using deep neural networks?

Figure 4.3 Feature engineering for reading the time on a clock

Overfitting and underfitting

The dots are the validation loss values of the smaller network, and the crosses are the initial network (remember, a lower validation loss signals a better model). The dots are the validation loss values of the bigger network, and the crosses are the initial network. L1 regularization - The added cost is proportional to the absolute value of the weight coefficients (the L1 norm of the weights).

L2 regularization - The added cost is proportional to the square of the value of the weight coefficients (the L2 norm of the weights).
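The two penalties can be sketched as extra terms added to the loss. The weight values, the base loss, and the regularization rate below are hypothetical; the squared variant sums the squared coefficients, which is one common way to implement the L2 cost.

```python
import numpy as np


def l1_penalty(weights, rate):
    """Added cost proportional to the absolute values of the weights."""
    return rate * float(np.sum(np.abs(weights)))


def l2_penalty(weights, rate):
    """Added cost proportional to the squared values of the weights."""
    return rate * float(np.sum(np.square(weights)))


weights = np.array([[0.5, -1.0], [2.0, 0.0]])  # hypothetical layer weights
base_loss = 0.30                               # hypothetical unregularized loss
rate = 0.001                                   # assumed regularization rate

total_with_l1 = base_loss + l1_penalty(weights, rate)
total_with_l2 = base_loss + l2_penalty(weights, rate)

print(total_with_l1, total_with_l2)
```

Because the penalty grows with the size of the weights, minimizing the total loss pushes the network toward smaller coefficients.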

Figure 4.5 shows how the bigger network fares compared to the reference network.

The universal workflow of machine learning

Once you know what you're aiming for, you need to determine how you'll measure your current progress. Once you obtain a model that has statistical power, the question becomes whether your model is sufficiently powerful. To figure out how big a model you need, you must first develop a model that overfits.

When you see the model's performance on the validation data start to degrade, you have achieved overfitting.

Table 4.1 Choosing the right last-layer activation and loss function for your model

Deep learning in practice

Introduction to convnets

This is what the term feature map means: each dimension on the depth axis is a feature (or filter), and the 2D tensor output[:, :, n] is the 2D spatial map of the response of this filter over the input. Output feature map depth - The number of filters computed by the convolution. Using stride 2 means that the width and height of the feature map are reduced by a factor of 2 (in addition to any changes caused by border effects).

In the convnet example, you may have noticed that the size of the feature maps is halved after each MaxPooling2D layer.
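The size arithmetic can be checked with a small helper. The function name `conv_output_size`, the 28-pixel input, the 3x3 convolution window, and the 2x2 pooling window with stride 2 are assumptions made for this sketch; the formula applies to one spatial dimension at a time.

```python
def conv_output_size(input_size, window, stride=1, padding=0):
    """Output width (or height) of a sliding-window operation."""
    return (input_size + 2 * padding - window) // stride + 1


size = 28                                            # assumed input width
after_conv = conv_output_size(size, window=3)        # border effect: 28 -> 26
after_pool = conv_output_size(after_conv, window=2, stride=2)  # halved: 26 -> 13

print(after_conv, after_pool)
```

With no padding, a 3x3 window removes one pixel on each border, and a 2x2 window moved with stride 2 halves what remains.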

Figure 5.1 Images can be broken into local patterns such as edges, textures, and so on.

Training a convnet from scratch on a small dataset

You'll use the same general structure again: the convnet will be a stack of alternating Conv2D (with relu activation) and MaxPooling2D layers. Since you are tackling a binary classification problem, you end the network with a single unit (a dense layer of size 1) and a sigmoid activation. Let's save the model; you will use it in section 5.4.

As a next step to improve your accuracy on this problem, use a pretrained model, which is the focus of the next two sections.

Figure 5.8 Samples from the Dogs vs. Cats dataset. Sizes weren’t modified: the samples are heterogeneous in size, appearance, and so on.

Using a pretrained convnet

Note that the level of generality (and therefore reusability) of the representations extracted by specific convolutional layers depends on the depth of the layer in the model. You extract features from these images by calling the predict method of the conv_base model. For the same reason, it is only possible to fine-tune the top layers of the convolutional base once the classifier on top has already been trained.

In the original Kaggle competition around this dataset, this would have been one of the top results.

Figure 5.14 Swapping classifiers while keeping the same convolutional base

Visualizing what convnets learn

Then you get an input image: a picture of a cat, not part of the images the network was trained on. Let's try plotting the fourth channel of the activation of the first layer of the original model (see Figure 5.25). Now you need a way to calculate the value of the loss tensor and the gradient tensor, given an input image.

The channel-wise average of the resulting feature map is the heatmap of the class activation.
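The final averaging step can be sketched with NumPy. This example assumes that each channel of the feature map is first weighted by the mean gradient of the class score for that channel, as in Grad-CAM-style methods; the random arrays and their shapes are placeholders for real activations and gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_map = rng.random((14, 14, 512))   # placeholder activations of a conv layer
grads = rng.normal(size=(14, 14, 512))    # placeholder gradients of the class score

# Assumed weighting: one importance value per channel, from the mean gradient.
pooled_grads = grads.mean(axis=(0, 1))    # shape (512,)
weighted = feature_map * pooled_grads     # broadcast over the spatial axes

# Channel-wise average of the weighted feature map gives the heatmap.
heatmap = weighted.mean(axis=-1)          # shape (14, 14)

# Keep positive influence only and rescale to [0, 1] for display.
heatmap = np.maximum(heatmap, 0)
max_value = heatmap.max()
if max_value > 0:
    heatmap = heatmap / max_value

print(heatmap.shape, heatmap.min(), heatmap.max())
```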

Figure 5.25 Fourth channel of the activation of the first layer on the test cat picture

Finally, you will use OpenCV to generate an image that overlays the original image on the heatmap you just obtained (see Figure 5.36). This chapter explores deep learning models that can process text (understood as sequences of words or sequences of characters), time series, and sequence data in general. The two basic deep learning algorithms for sequence processing are recurrent neural networks and 1D convolutions, the one-dimensional version of the 2D convolutions we covered in previous chapters.

Time series comparisons, such as estimating how closely related two documents or two stock tickers are.

Working with text data

One-hot encoding is the most common and most basic way to convert a token to a vector. While the vectors obtained by one-hot encoding are binary, sparse (consisting mostly of zeros), and very high-dimensional (of the same dimensionality as the number of words in the vocabulary), word embeddings are low-dimensional floating-point vectors (that is, dense vectors as opposed to sparse vectors); see Figure 6.2. Unlike word vectors obtained by one-hot encoding, word embeddings are learned from data.

On the other hand, one-hot encoding words typically leads to vectors that are 20,000-dimensional or larger (capturing a vocabulary of 20,000 tokens in this case).
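The contrast between the two representations can be sketched on a toy vocabulary. The five-word vocabulary, the embedding size of 3, and the randomly initialized embedding matrix are assumptions for illustration; real embeddings would be learned from data rather than drawn at random.

```python
import numpy as np

vocabulary = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}  # toy vocabulary
tokens = ["the", "cat", "sat"]
token_ids = np.array([vocabulary[t] for t in tokens])

# One-hot encoding: one sparse binary vector per token, as wide as the vocabulary.
one_hot = np.zeros((len(tokens), len(vocabulary)), dtype="float32")
one_hot[np.arange(len(tokens)), token_ids] = 1.0

# Word embeddings: one dense, low-dimensional float vector per token.
embedding_dim = 3
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocabulary), embedding_dim)).astype("float32")
embedded = embedding_matrix[token_ids]  # look up one row per token

print(one_hot.shape, embedded.shape)
```

Looking up a row of the embedding matrix gives the same result as multiplying the one-hot vector by that matrix, which is why the lookup is the cheaper way to express it.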