along with a paired sentence description: for example, “A Tabby cat is leaning on a wooden table, with one paw on a laser mouse and the other on a black laptop.”3
This captioning model has two connected halves. The first half of the model is a network that learns to generate “descriptive” numerical representations of the scene (Tabby cat, laser mouse, paw), which are then taken as input to the second half. That second half is a recurrent neural network that generates a coherent sentence by putting those numerical descriptions together. The two halves of the model are trained together on image-caption pairs.
The second half of the model is called recurrent because it generates its outputs (individual words) in subsequent forward passes, where the input to each forward pass includes the outputs of the previous forward pass. This generates a dependency of the next word on words that were generated earlier, as we would expect when dealing with sentences or, in general, with sequences.
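To make that structure concrete, here is a minimal, self-contained sketch of such an encoder-decoder captioner. It is not the NeuralTalk2 code; the layer sizes, vocabulary size, and greedy decoding loop are illustrative assumptions. What it does share with the model described above is the shape of the computation: a convolutional first half feeding a recurrent second half that consumes its own previous output at every step.

import torch
from torch import nn

class CaptionerSketch(nn.Module):
    def __init__(self, vocab_size=1000, feat_dim=256, hidden_dim=256):
        super().__init__()
        self.hidden_dim = hidden_dim
        # First half: turns the image into a "descriptive" feature vector
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        # Second half: a recurrent decoder that emits one word per step
        self.embed = nn.Embedding(vocab_size, feat_dim)
        self.rnn_cell = nn.LSTMCell(feat_dim, hidden_dim)
        self.to_vocab = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image, start_token=0, max_len=20):
        batch = image.size(0)
        feats = self.encoder(image)                     # (batch, feat_dim)
        h = torch.zeros(batch, self.hidden_dim)
        c = torch.zeros(batch, self.hidden_dim)
        h, c = self.rnn_cell(feats, (h, c))             # seed the state with the image
        word = torch.full((batch,), start_token, dtype=torch.long)
        words = []
        for _ in range(max_len):
            # each step takes the previously generated word as its input
            h, c = self.rnn_cell(self.embed(word), (h, c))
            word = self.to_vocab(h).argmax(dim=1)
            words.append(word)
        return torch.stack(words, dim=1)                # (batch, max_len) word ids

captioner = CaptionerSketch()
fake_image = torch.randn(1, 3, 224, 224)
print(captioner(fake_image).shape)                      # torch.Size([1, 20])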
2.3.1 NeuralTalk2
The NeuralTalk2 model can be found at https://github.com/deep-learning-with-pytorch/ImageCaptioning.pytorch. We can place a set of images in the data directory and run the following script:
python eval.py --model ./data/FC/fc-model.pth \
    --infos_path ./data/FC/fc-infos.pkl --image_folder ./data
Let’s try it with our horse.jpg image. It says, “A person riding a horse on a beach.”
Quite appropriate.
3 Andrej Karpathy and Li Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions,” https://cs.stanford.edu/people/karpathy/cvpr2015.pdf.
Now, just for fun, let’s see if our CycleGAN can also fool this NeuralTalk2 model.
Let’s add the zebra.jpg image in the data folder and rerun the model: “A group of zebras are standing in a field.” Well, it got the animal right, but it saw more than one zebra in the image. Certainly this is not a pose that the network has ever seen a zebra in, nor has it ever seen a rider on a zebra (with some spurious zebra patterns). In addition, it is very likely that zebras are depicted in groups in the training dataset, so there might be some bias that we could investigate. The captioning network hasn’t described the rider, either. Again, it’s probably for the same reason: the network wasn’t shown a rider on a zebra in the training dataset. In any case, this is an impressive feat: we generated a fake image with an impossible situation, and the captioning network was flexible enough to get the subject right.
We’d like to stress that something like this, which would have been extremely hard to achieve before the advent of deep learning, can be obtained with under a thousand lines of code, with a general-purpose architecture that knows nothing about horses or zebras, and a corpus of images and their descriptions (the MS COCO dataset, in this case). No hardcoded criterion or grammar—everything, including the sentence, emerges from patterns in the data.
The network architecture in this last case was, in a way, more complex than the ones we saw earlier, as it involves two networks, one of them recurrent. Still, both were built out of the same building blocks, all of which are provided by PyTorch.
At the time of this writing, models such as these exist more as applied research or novelty projects, rather than something that has a well-defined, concrete use. The results, while promising, just aren’t good enough to use … yet. With time (and additional training data), we should expect this class of models to be able to describe the world to people with vision impairment, transcribe scenes from video, and perform other similar tasks.
2.4 Torch Hub
Pretrained models have been published since the early days of deep learning, but until PyTorch 1.0, there was no way to ensure that users would have a uniform interface to get them. TorchVision was a good example of a clean interface, as we saw earlier in this chapter; but other authors, as we have seen for CycleGAN and NeuralTalk2, chose different designs.
PyTorch 1.0 saw the introduction of Torch Hub, which is a mechanism through which authors can publish a model on GitHub, with or without pretrained weights, and expose it through an interface that PyTorch understands. This makes loading a pretrained model from a third party as easy as loading a TorchVision model.
All it takes for an author to publish a model through the Torch Hub mechanism is to place a file named hubconf.py in the root directory of the GitHub repository. The file has a very simple structure:
dependencies = ['torch', 'math']    # optional list of modules the code depends on

# One or more functions to be exposed to users as entry points for the
# repository. These functions should initialize models according to the
# arguments and return them.
def some_entry_fn(*args, **kwargs):
    model = build_some_model(*args, **kwargs)
    return model

def another_entry_fn(*args, **kwargs):
    model = build_another_model(*args, **kwargs)
    return model
In our quest for interesting pretrained models, we can now search for GitHub repositories that include hubconf.py, and we’ll know right away that we can load them using the torch.hub module. Let’s see how this is done in practice. To do that, we’ll go back to TorchVision, because it provides a clean example of how to interact with Torch Hub.
Let’s visit https://github.com/pytorch/vision and notice that it contains a hubconf.py file. Great, that checks. The first thing to do is to look in that file to see the entry points for the repo—we’ll need to specify them later. In the case of TorchVision, there are two: resnet18 and resnet50. We already know what these do: they return an 18-layer and a 50-layer ResNet model, respectively. We also see that the entry-point functions include a pretrained keyword argument. If True, the returned models will be initialized with weights learned from ImageNet, as we saw earlier in the chapter.
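As an aside, opening hubconf.py by hand is not the only option: depending on the PyTorch version installed, the torch.hub module also provides list and help functions that query a repository’s entry points and their docstrings for us. A minimal sketch, assuming a recent enough PyTorch:

import torch

entry_points = torch.hub.list('pytorch/vision')        # names of the entry-point functions
print(entry_points)
print(torch.hub.help('pytorch/vision', 'resnet18'))    # docstring of one entry point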
Now we know the repo, the entry points, and one interesting keyword argument.
That’s about all we need to load the model using torch.hub, without even cloning the repo. That’s right, PyTorch will handle that for us:
import torch
from torch import hub

resnet18_model = hub.load('pytorch/vision:master',   # name and branch of the GitHub repo
                          'resnet18',                 # name of the entry-point function
                          pretrained=True)            # keyword argument
This downloads a snapshot of the master branch of the pytorch/vision repo, along with the weights, to a local directory (by default, .torch/hub in our home directory) and runs the resnet18 entry-point function, which returns the instantiated model. Depending on the environment, Python may complain that there’s a module missing, like PIL. Torch Hub won’t install missing dependencies, but it will report them to us so that we can take action.
At this point, we can invoke the returned model with proper arguments to run a forward pass on it, the same way we did earlier. The nice part is that now every model published through this mechanism will be accessible to us using the same modalities, well beyond vision.
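As a quick check, here is a minimal sketch of such a forward pass with the model we just loaded, reusing the preprocessing pipeline from earlier in the chapter. The image filename is a placeholder for any image we have at hand.

from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],    # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

img = Image.open('dog.jpg')                 # placeholder filename
batch = preprocess(img).unsqueeze(0)        # add the batch dimension

resnet18_model.eval()                       # switch to inference mode
with torch.no_grad():
    out = resnet18_model(batch)             # scores over the 1,000 ImageNet classes
print(out.argmax(dim=1))                    # index of the highest-scoring class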
Note that entry points are supposed to return models; but, strictly speaking, they are not forced to. For instance, we could have an entry point for transforming inputs and another one for turning the output probabilities into a text label. Or we could have an entry point for just the model, and another that includes the model along with the pre- and postprocessing steps. By leaving these options open, the PyTorch developers have provided the community with just enough standardization and a lot of flexibility. We’ll see what patterns will emerge from this opportunity.
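For instance, a purely hypothetical hubconf.py, with entry-point names invented here for illustration, could look like the following: one entry point returns the bare model, and another returns only the input preprocessing.

dependencies = ['torch', 'torchvision']

def model_only(pretrained=False, **kwargs):
    # entry point returning just the model
    from torchvision.models import resnet18
    return resnet18(pretrained=pretrained, **kwargs)

def preprocess():
    # entry point returning only the input transform, not a model
    from torchvision import transforms
    return transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])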
Torch Hub is quite new at the time of writing, and there are only a few models published this way. We can get at them by Googling “github.com hubconf.py.” Hopefully the list will grow in the future, as more authors share their models through this channel.
2.5 Conclusion
We hope this was a fun chapter. We took some time to play with models created with PyTorch, which were optimized to carry out specific tasks. In fact, the more enterprising of us could already put one of these models behind a web server and start a business, sharing the profits with the original authors!4 Once we learn how these models are built, we will also be able to use the knowledge we gained here to download a pretrained model and quickly fine-tune it on a slightly different task.
We will also see how building models that deal with different problems on different kinds of data can be done using the same building blocks. One thing that PyTorch does particularly right is providing those building blocks in the form of an essential toolset—PyTorch is not a very large library from an API perspective, especially when compared with other deep learning frameworks.
This book does not focus on going through the complete PyTorch API or reviewing deep learning architectures; rather, we will build hands-on knowledge of these building blocks. This way, you will be able to consume the excellent online documentation and repositories on top of a solid foundation.
Starting with the next chapter, we’ll embark on a journey that will enable us to teach our computer skills like those described in this chapter from scratch, using PyTorch. We’ll also learn that starting from a pretrained network and fine-tuning it on new data, without starting from scratch, is an effective way to solve problems when the data points we have are not particularly numerous. This is one further reason pretrained networks are an important tool for deep learning practitioners to have. Time to learn about the first fundamental building block: tensors.
4 Contact the publisher for franchise opportunities!
2.6 Exercises
1 Feed the image of the golden retriever into the horse-to-zebra model.
a What do you need to do to the image to prepare it?
b What does the output look like?
2 Search GitHub for projects that provide a hubconf.py file.
a How many repositories are returned?
b Find an interesting-looking project with a hubconf.py. Can you understand the purpose of the project from the documentation?
c Bookmark the project, and come back after you’ve finished this book. Can you understand the implementation?
2.7 Summary
A pretrained network is a model that has already been trained on a dataset.
Such networks can typically produce useful results immediately after loading the network parameters.
By knowing how to use a pretrained model, we can integrate a neural network into a project without having to design or train it.
AlexNet and ResNet are two deep convolutional networks that set new benchmarks for image recognition in the years they were released.
Generative adversarial networks (GANs) have two parts—the generator and the discriminator—that work together to produce output indistinguishable from authentic items.
CycleGAN uses an architecture that supports converting back and forth between two different classes of images.
NeuralTalk2 uses a hybrid model architecture to consume an image and pro- duce a text description of the image.
Torch Hub is a standardized way to load models and weights from any project with an appropriate hubconf.py file.
3 It starts with a tensor
In the previous chapter, we took a tour of some of the many applications that deep learning enables. They invariably consisted of taking data in some form, like images or text, and producing data in another form, like labels, numbers, or more images or text. Viewed from this angle, deep learning really consists of building a system that can transform data from one representation to another. This transformation is driven by extracting commonalities from a series of examples that demonstrate the desired mapping. For example, the system might note the general shape of a dog and the typical colors of a golden retriever. By combining the two image properties, the system can correctly map images with a given shape and color to the golden retriever label, instead of a black lab (or a tawny tomcat, for that matter). The resulting system can consume broad swaths of similar inputs and produce meaningful output for those inputs.
This chapter covers
Understanding tensors, the basic data structure in PyTorch
Indexing and operating on tensors
Interoperating with NumPy multidimensional arrays
Moving computations to the GPU for speed
The process begins by converting our input into floating-point numbers. We will cover converting image pixels to numbers, as we see in the first step of figure 3.1, in chapter 4 (along with many other types of data). But before we can get to that, in this chapter, we learn how to deal with all the floating-point numbers in PyTorch by using tensors.
3.1 The world as floating-point numbers
Since floating-point numbers are the way a network deals with information, we need a way to encode real-world data of the kind we want to process into something digestible by a network and then decode the output back to something we can understand and use for our purpose.
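As a tiny foretaste, consider the encoding and decoding at both ends of the image classifiers from the previous chapter, written with the tensors this chapter introduces. The numbers and label names here are made up for illustration.

import torch

image = torch.rand(3, 224, 224)                 # an RGB image encoded as floating-point numbers
scores = torch.tensor([0.1, 0.7, 0.2])          # made-up network output over three classes
labels = ['horse', 'zebra', 'cat']              # made-up label list used for decoding
print(labels[scores.argmax().item()])           # decode the numbers back into a label: 'zebra'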
A deep neural network typically learns the transformation from one form of data to another in stages, which means the partially transformed data between each stage can be thought of as a sequence of intermediate representations. For image recognition, early representations can be things such as edge detection or certain textures like fur.
Deeper representations can capture more complex structures like ears, noses, or eyes.
In general, such intermediate representations are collections of floating-point numbers that characterize the input and capture the data’s structure in a way that is instrumental for describing how inputs are mapped to the outputs of the neural network. Such characterization is specific to the task at hand and is learned from relevant