The Elements of Style
A spate of applications has cropped up in recent years with slogans like “Make Anything Art.” They purport to extract the style of one image and render the content of another image in that style. In the sets of images below, the small inset image is the source of the “style” that is transferred to the larger image. It’s an impressive trick, although I don’t know that it accurately represents what we mean by ‘style’.
Some prominent style-bending apps are:
- Deep Dream https://deepdreamgenerator.com/generator-style
- Pikazo http://www.pikazoapp.com/
- Dreamscope https://dreamscopeapp.com/
- Deepart https://deepart.io/latest/
- Prisma https://prisma-ai.com/
The techniques used in all of these applications are based on the ideas introduced in a 2015 paper by three researchers from the University of Tübingen: Leon A. Gatys, Alexander S. Ecker and Matthias Bethge. Entitled A Neural Algorithm of Artistic Style, the paper describes a method for extracting content information from one image and stylistic information from a second, and then rendering the derived content in the derived style.
The process employs a deep, convolutional neural network (a variation on the VGG network). This is a kind of network that is commonly used in image recognition, but applying it to image creation was something relatively rare at the time.
First it makes sense to define what we mean by some of these terms.
A Neural Network is a collection of nodes, called neurons or perceptrons, arranged in layers and linked together. Below is a diagram of a perceptron. It takes any number of input values, each of which is multiplied by a weight, and the results are summed together with a bias (which plays the role of a threshold). The sum is then passed through an activation function (depicted here as a sigmoid, although many functions are possible), and the result is used as one input value for the perceptrons in the next layer.
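Here is a minimal sketch of that single neuron in plain Python. The input values, weights and bias are made up for illustration, and the sigmoid is only one of many possible functions.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a sigmoid."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)

# Hypothetical example values, chosen only for illustration.
inputs = [0.5, -1.0, 2.0]
weights = [0.4, 0.3, -0.2]
bias = 0.1
activation = neuron_output(inputs, weights, bias)
print(activation)
```

The value printed is what would be handed on as one input to the neurons in the next layer.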
When the final layer produces its values, they are taken as the output of the network for a given input. The error in this “answer” is measured, and the weights of each perceptron are adjusted according to how much they contributed to that error. This is how the network ‘learns’.
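To make the idea of learning from error concrete, here is a rough sketch of one neuron being nudged toward a target output. The squared-error measure, the learning rate and the training values are my own assumptions for the example; a real network applies the same idea to every weight in every layer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(inputs, weights, bias, target, learning_rate=0.5):
    """One gradient-descent update for a single sigmoid neuron.

    The error is measured as 0.5 * (output - target) ** 2.
    """
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    output = sigmoid(z)
    error = output - target
    # Chain rule: the sigmoid's slope at z is output * (1 - output).
    delta = error * output * (1.0 - output)
    new_weights = [w - learning_rate * delta * x for w, x in zip(weights, inputs)]
    new_bias = bias - learning_rate * delta
    return new_weights, new_bias, 0.5 * error ** 2

# Made-up training example: inputs [1.0, 0.5] should produce 0.9.
weights, bias = [0.0, 0.0], 0.0
losses = []
for _ in range(200):
    weights, bias, loss = train_step([1.0, 0.5], weights, bias, target=0.9)
    losses.append(loss)

print(losses[0], losses[-1])  # the error shrinks as the weights adapt
```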
Deep Neural Network
Earlier neural networks were shallow, composed of one input layer and one output layer, with at most one hidden layer in between. A network with more than three layers (including input and output) qualifies as “deep” learning. So “deep” here has a concrete technical meaning: more than one hidden layer.
In deep-learning networks, each layer of nodes trains on a distinct set of features based on the previous layer’s output. The further you advance into the neural net, the more complex the features your nodes can recognize, since they aggregate and recombine features from the previous layer.
This is known as feature hierarchy, and it is a hierarchy of increasing complexity and abstraction.
These nets are capable of discovering latent structures within unlabeled, unstructured data: raw media such as pictures, text, video and audio recordings.
Deep-learning networks perform automatic feature extraction without human intervention, unlike most traditional machine-learning algorithms. Given that feature extraction is a task that can take teams of data scientists years to accomplish, deep learning is a way to circumvent the chokepoint of limited experts. It augments the powers of small data science teams, which by their nature do not scale.
Convolutional Neural Network
These networks use a special architecture which is particularly well-adapted to classify images. Using this architecture makes convolutional networks fast to train. This, in turn, helps us train deep, many-layer networks, which are very good at classifying images. Today, deep convolutional networks or some close variant are used in most neural networks for image recognition.
In convolutional neural networks, each layer stores information in an abstraction based on the previous layer. For example, the first layer may search for dark pixels in a line to represent an edge. The next layer may then look for two perpendicular edges to represent a corner. The last layer can then return a classification based on which of the features are present and how they are arranged.
If we pass an image through a convolutional network and record the activations of each layer, or what information was passed through, we can retrieve a general structure of the contents of an image.
This information changes based on which layer is used. Information from the first layer would contain what specific pixels were present, while higher layers might show where edges are present, but not what pattern of pixels are considered edges.
Convolutional neural networks use three basic ideas: local receptive fields, shared weights, and pooling.
Local Receptive Fields
Where more common forms of neural networks tend to be depicted as linear “layers”, the convolutional network can be better envisioned as a set of two-dimensional layers, with one perceptron per pixel in the input layer. (If the input image is 28×28 pixels, then the input layer of the network will be 28×28 perceptrons.)
As usual, the input pixels are connected to a layer of hidden neurons. But not every input pixel is connected to every hidden neuron. Instead, connections are made in small, localized regions of the input image.
To be more precise, each neuron in the first hidden layer will be connected to a small region of the input neurons, say, for example, a 5×5 region, corresponding to 25 input pixels. So, for a particular hidden neuron, there might be connections that look like this:
That region in the input image is called the local receptive field for the hidden neuron. It’s a little window on the input pixels. Each connection learns a weight. And the hidden neuron learns an overall bias as well. You can think of that particular hidden neuron as learning to analyze its particular local receptive field.
We then slide the local receptive field across the entire input image. For each local receptive field, there is a different hidden neuron in the first hidden layer. To illustrate this concretely, let’s start with a local receptive field in the top-left corner:
Then we slide the local receptive field over by one pixel to the right (i.e., by one neuron), to connect to a second hidden neuron:
And so on, building up the first hidden layer.
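The sliding window can be sketched in a few lines of plain Python. The checkerboard image and the averaging weights are made up, and the activation function is left out to keep things short. (Strictly speaking this is a cross-correlation, which is what most deep-learning libraries compute under the name "convolution".)

```python
def convolve_valid(image, kernel, bias=0.0):
    """Slide `kernel` over `image` one pixel at a time.

    Each output value is the weighted sum over one local receptive field,
    plus a bias. Only positions where the window fits entirely are used.
    """
    img_h, img_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    output = []
    for top in range(img_h - k_h + 1):
        row = []
        for left in range(img_w - k_w + 1):
            total = bias
            for i in range(k_h):
                for j in range(k_w):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output

# A 28x28 input and a 5x5 receptive field, as in the text.
image = [[(r + c) % 2 for c in range(28)] for r in range(28)]  # made-up checkerboard
kernel = [[1.0 / 25] * 5 for _ in range(5)]                    # made-up averaging weights
hidden = convolve_valid(image, kernel)
print(len(hidden), len(hidden[0]))  # prints: 24 24
```

A 5×5 window fits in 24 positions across and 24 positions down a 28×28 image, which is where the size of the first hidden layer comes from.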
Shared Weights And Biases
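Each hidden neuron has a bias and 5×5 weights connected to its local receptive field. The same weights and bias are used for every hidden neuron in the layer. With a 28×28 input and a 5×5 receptive field moved one pixel at a time, the window fits in 24 positions across and 24 down, so the layer has 24×24 hidden neurons.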
This means that all the neurons in the first hidden layer detect exactly the same feature, just at different locations in the input image. Suppose the weights and bias are such that the hidden neuron can pick out a vertical edge in a particular local receptive field. That ability is also likely to be useful at other places in the image. And so it is useful to apply the same feature detector everywhere in the image. Move a picture of a cat a little ways, and it’s still an image of a cat.
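A small sketch of why shared weights help: the same detector responds to the same feature wherever it appears. The 3×3 vertical-edge kernel here is hand-written for the example (a trained network would learn its own weights), and the 8×8 images are made up.

```python
def feature_map(image, kernel):
    """Apply one shared kernel at every valid position of a square image."""
    k = len(kernel)
    size = len(image) - k + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(k) for j in range(k))
             for c in range(size)]
            for r in range(size)]

def image_with_edge(edge_col, size=8):
    """Dark (0) left of `edge_col`, bright (1) from `edge_col` onward."""
    return [[1.0 if c >= edge_col else 0.0 for c in range(size)]
            for _ in range(size)]

# Hand-written vertical-edge detector: bright on the right, dark on the left.
vertical_edge = [[-1.0, 0.0, 1.0]] * 3

def strongest_column(fmap):
    """Column index of the first largest response in the top row."""
    return max(range(len(fmap[0])), key=lambda c: fmap[0][c])

original = strongest_column(feature_map(image_with_edge(3), vertical_edge))
shifted = strongest_column(feature_map(image_with_edge(5), vertical_edge))
print(original, shifted)  # prints: 1 3
```

Moving the edge two pixels to the right moves the strongest response two positions to the right, with no change to the weights.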
For this reason, the map from the input layer to the hidden layer is called a feature map. The weights defining the feature map are called the shared weights. And the bias defining the feature map in this way is the shared bias. The shared weights and bias are often said to define a kernel or filter.
The network structure described so far can detect just a single kind of localized feature. To do image recognition, more than one feature map is necessary, so a complete convolutional layer consists of several different feature maps:
In the example shown, there are 3 feature maps. Each feature map is defined by a set of 5×5 shared weights, and a single shared bias. The result is that the network can detect 3 different kinds of features, with each feature being detectable across the entire image.
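One consequence of sharing weights is how few numbers the layer has to learn. The arithmetic below is a sketch; the fully connected comparison is my own hypothetical, included only to give a sense of scale.

```python
def conv_layer_parameters(num_feature_maps, field_height, field_width):
    """Each feature map has one shared set of weights plus one shared bias."""
    return num_feature_maps * (field_height * field_width + 1)

def fully_connected_parameters(num_inputs, num_outputs):
    """Every output neuron has its own weight per input, plus its own bias."""
    return num_inputs * num_outputs + num_outputs

# 3 feature maps with 5x5 receptive fields, as in the example above.
conv_params = conv_layer_parameters(3, 5, 5)

# Hypothetical comparison: the same 28x28 input wired to the same number
# of hidden neurons (3 maps of 24x24) without any weight sharing.
dense_params = fully_connected_parameters(28 * 28, 3 * 24 * 24)

print(conv_params, dense_params)  # prints: 78 1356480
```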
Pooling Layers
In addition to the convolutional layers just described, convolutional neural networks also contain pooling layers. Pooling layers are usually used immediately after convolutional layers. What the pooling layers do is simplify the information in the output from the convolutional layer.
In detail, a pooling layer takes each feature map output from the convolutional layer and prepares a condensed feature map. For instance, each unit in the pooling layer may summarize a region of 2×2 neurons in the previous layer.
As mentioned above, the convolutional layer usually involves more than a single feature map. Pooling is applied to each feature map separately.
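As a sketch, here is one common way to summarize each 2×2 region: keep only its largest value (max-pooling). The text does not say which summary is used, so treat the choice of maximum as an assumption; the example values are made up.

```python
def max_pool_2x2(feature_map):
    """Condense a feature map by keeping the largest value in each 2x2 block."""
    pooled = []
    for r in range(0, len(feature_map) - 1, 2):
        row = []
        for c in range(0, len(feature_map[0]) - 1, 2):
            row.append(max(feature_map[r][c], feature_map[r][c + 1],
                           feature_map[r + 1][c], feature_map[r + 1][c + 1]))
        pooled.append(row)
    return pooled

def pool_all(feature_maps):
    """Pooling is applied to each feature map separately."""
    return [max_pool_2x2(fm) for fm in feature_maps]

example = [[1, 3, 2, 0],
           [4, 2, 1, 1],
           [0, 0, 5, 6],
           [1, 2, 7, 8]]
print(max_pool_2x2(example))  # prints: [[4, 2], [2, 8]]
```

Applied to a 24×24 feature map, this produces a 12×12 condensed map.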
One of the most popular benchmarks for image classification algorithms today is the ImageNet Large Scale Visual Recognition Challenge, where teams compete to create algorithms that classify objects contained within millions of images into one of 1,000 different categories. All winning architectures in recent years have been some form of convolutional neural network.
In 2014, one of the top entries in the ImageNet challenge was a network created by the Visual Geometry Group (VGG) at Oxford University, which placed second in the classification task with a top-5 error rate of roughly 7%. Gatys et al. use this network, which has been trained to be extremely effective at object recognition, as a basis for extracting content and style representations from images.
The 19-layer version of the network (VGG-19) consists of 16 convolutional layers, each followed by a ReLU non-linearity, separated by 5 pooling layers and ending in 3 fully connected layers.
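Written out as data, the layout looks roughly like this. It follows my reading of the 19-layer configuration in the VGG paper; the variable names are my own, and the numbers are output channel counts.

```python
# Convolutional part: each number is a 3x3 convolution + ReLU with that many
# output channels, and "pool" is a 2x2 max-pooling layer.
VGG19_LAYOUT = [
    64, 64, "pool",
    128, 128, "pool",
    256, 256, 256, 256, "pool",
    512, 512, 512, 512, "pool",
    512, 512, 512, 512, "pool",
]
# Sizes of the fully connected layers at the end (1000 = ImageNet categories).
FULLY_CONNECTED = [4096, 4096, 1000]

num_conv = sum(1 for layer in VGG19_LAYOUT if layer != "pool")
num_pool = VGG19_LAYOUT.count("pool")
print(num_conv, num_pool, len(FULLY_CONNECTED))  # prints: 16 5 3
```

The 16 convolutional layers plus the 3 fully connected layers give the network its 19 weight layers.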
The main building blocks of convolutional neural networks are the convolution layers. This is where a set of feature detectors is applied to an image to produce feature maps, each of which is essentially a filtered version of the image.
Networks that have been trained for the task of object recognition learn which features it is important to extract from an image in order to identify its content. The feature maps in the convolution layers of the network can be seen as the network’s internal representation of the image content. As we go deeper into the network these convolutional layers are able to represent much larger scale features and thus have a higher-level representation of the image content.
This can be demonstrated by constructing images whose feature maps at a chosen convolution layer match the corresponding feature maps of a given content image. We expect the two images to contain the same content – but not necessarily the same texture and style.
We can see that as we reconstruct the original image from deeper layers we still preserve the high-level content of the original but lose the exact pixel information.
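How well two images "match" at a layer can be scored with a content loss. The sketch below uses half the summed squared difference between feature maps, which is my understanding of the measure in the paper; the activation values are made up.

```python
def content_loss(generated_features, content_features):
    """Half the summed squared difference between two sets of feature maps.

    Both arguments are lists of feature maps from the same layer,
    each flattened to a plain list of activations.
    """
    return 0.5 * sum((g - p) ** 2
                     for g_map, p_map in zip(generated_features, content_features)
                     for g, p in zip(g_map, p_map))

# Made-up activations: 2 feature maps with 3 positions each.
content = [[1.0, 0.0, 2.0], [0.5, 0.5, 0.0]]
candidate = [[1.0, 1.0, 2.0], [0.5, 0.5, 2.0]]

print(content_loss(candidate, content))  # prints: 2.5
print(content_loss(content, content))    # prints: 0.0
```

A loss of zero means the two images are indistinguishable as far as that layer is concerned, even if their pixels differ.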
Unlike content representation, the style of an image is not well captured by simply looking at the values of a feature map in a convolutional neural network trained for object recognition.
However, Gatys et al. found that we can extract a representation of style by looking at the correlations between the different feature maps of a given layer. Mathematically, this is done by calculating the Gram matrix of the layer’s feature maps: each entry is the inner product of two flattened feature maps, so it measures how strongly two features tend to appear together, regardless of where in the image they occur.
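For the curious, the Gram matrix is less mysterious than it sounds. A minimal sketch, with made-up activations: flatten each feature map in a layer into a list, then take the dot product of every pair of maps.

```python
def gram_matrix(feature_maps):
    """Gram matrix of a layer: entry [i][j] is the dot product of flattened
    feature maps i and j, summed over all spatial positions."""
    return [[sum(a * b for a, b in zip(map_i, map_j))
             for map_j in feature_maps]
            for map_i in feature_maps]

# Made-up layer with 2 feature maps, each flattened to 4 positions.
features = [[1.0, 0.0, 2.0, 0.0],
            [0.0, 3.0, 1.0, 0.0]]
print(gram_matrix(features))  # prints: [[5.0, 2.0], [2.0, 10.0]]

# Shuffling the positions (the same way in every map) moves the content
# around but leaves the Gram matrix unchanged.
shuffled = [[fm[k] for k in (2, 0, 3, 1)] for fm in features]
print(gram_matrix(shuffled) == gram_matrix(features))  # prints: True
```

Because the sum runs over all positions, the matrix records which features occur together but throws away where they occur, which is why it behaves like a description of texture rather than content.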
As with the content representation, if we had two images whose feature maps at a given layer produced the same Gram matrix, we would expect both images to have the same style, but not necessarily the same content. Applying this to early layers in the network would capture some of the finer textures contained within the image, whereas applying this to deeper layers would capture higher-level elements of the image’s style. Gatys et al. found that the best results were achieved by taking a combination of shallow and deep layers as the style representation for an image.
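Combining layers can be sketched as a weighted sum of per-layer style losses. The normalization by 4·N²·M² (N feature maps, M positions each) follows my reading of the paper; the layer weights and activations below are made up.

```python
def gram_matrix(feature_maps):
    return [[sum(a * b for a, b in zip(mi, mj)) for mj in feature_maps]
            for mi in feature_maps]

def layer_style_loss(generated_maps, style_maps):
    """Normalized squared difference between the Gram matrices of one layer."""
    n = len(style_maps)      # number of feature maps in the layer
    m = len(style_maps[0])   # number of positions in each feature map
    g = gram_matrix(generated_maps)
    a = gram_matrix(style_maps)
    squared = sum((g[i][j] - a[i][j]) ** 2 for i in range(n) for j in range(n))
    return squared / (4.0 * n ** 2 * m ** 2)

def style_loss(generated_layers, style_layers, layer_weights):
    """Weighted sum of the per-layer losses, shallow and deep together."""
    return sum(w * layer_style_loss(g, s)
               for w, g, s in zip(layer_weights, generated_layers, style_layers))

# Made-up activations for a "shallow" layer (4 positions) and a "deep" one (2).
style_layers = [[[1.0, 0.0, 2.0, 0.0], [0.0, 3.0, 1.0, 0.0]],
                [[1.0, 1.0], [0.0, 2.0]]]
generated_layers = [[[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],
                    [[1.0, 1.0], [0.0, 2.0]]]

total = style_loss(generated_layers, style_layers, layer_weights=[0.5, 0.5])
print(total)
```

In this example the deep layer already matches, so the whole loss comes from the mismatch in the shallow layer.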
The diagram below shows images that have been constructed to match the style representation of Pablo Picasso’s ‘Portrait of Dora Maar’. Results are shown for combining an increasing number of layers to represent the image’s style.
We can see that the best results are achieved by a combination of many different layers from the network, which capture both the finer textures and the larger elements of the original image.
Styling an Image
To generate a styled image, we start with an image of random pixel values, sometimes known as white noise. We then iteratively improve the image.
To describe it simply, we first pass the image through the VGG network and calculate the total loss, a weighted sum of the style loss and the content loss. We then backpropagate this error through the network to determine the gradient of the loss function with respect to the input image; the weights of the network itself stay fixed. We can then make a small update to the input image in the negative direction of the gradient, which will cause our loss function to decrease in value (gradient descent). We repeat this process until the loss function is below a threshold we are happy with.
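The loop itself can be shown on a toy scale. This is only a sketch under heavy assumptions: a six-pixel, one-dimensional "image", two hand-written filters standing in for VGG, made-up loss weights and step size, and a numerical gradient standing in for backpropagation (which gives the same gradient far more efficiently).

```python
import random

KERNELS = [[-1.0, 1.0], [0.5, 0.5]]  # made-up stand-ins for trained filters

def features(image):
    """Toy 'network': one feature map per kernel, slid along a 1-D image."""
    return [[sum(w * image[k + i] for i, w in enumerate(kernel))
             for k in range(len(image) - len(kernel) + 1)]
            for kernel in KERNELS]

def gram(maps):
    return [[sum(a * b for a, b in zip(mi, mj)) for mj in maps] for mi in maps]

def total_loss(image, content_feats, style_gram,
               content_weight=1.0, style_weight=5.0):
    """Weighted sum of a content loss and a style loss."""
    maps = features(image)
    content = 0.5 * sum((f - p) ** 2
                        for fm, pm in zip(maps, content_feats)
                        for f, p in zip(fm, pm))
    n, m = len(maps), len(maps[0])
    g = gram(maps)
    style = sum((g[i][j] - style_gram[i][j]) ** 2
                for i in range(n) for j in range(n)) / (4.0 * n ** 2 * m ** 2)
    return content_weight * content + style_weight * style

def numerical_gradient(loss_fn, image, eps=1e-5):
    """Central differences; a real implementation would use backpropagation."""
    grad = []
    for k in range(len(image)):
        up = image[:k] + [image[k] + eps] + image[k + 1:]
        down = image[:k] + [image[k] - eps] + image[k + 1:]
        grad.append((loss_fn(up) - loss_fn(down)) / (2 * eps))
    return grad

content_image = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]
style_image = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
content_feats = features(content_image)
style_gram = gram(features(style_image))

def loss_fn(img):
    return total_loss(img, content_feats, style_gram)

rng = random.Random(0)
image = [rng.random() for _ in content_image]  # start from noise

losses = [loss_fn(image)]
for _ in range(300):
    grad = numerical_gradient(loss_fn, image)
    image = [x - 0.05 * g for x, g in zip(image, grad)]  # step against the gradient
    losses.append(loss_fn(image))

print(losses[0], losses[-1])
```

Only the image changes from one step to the next; the filters never do. Here the loop simply runs for a fixed number of steps rather than stopping at a threshold.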