In this note, we implement a variational autoencoder (VAE) using the neural network framework in Wolfram Language and train it on the CelebFaces Attributes (CelebA) dataset. New images can be generated by sampling from the learned latent space. We explore how the VAE captures and manipulates image features, particularly the concept of attractiveness.

Model

VAEs are a type of deep learning model that can learn latent representations of data. As illustrated in Figure 1, an encoder network takes an input image $x$ and compresses it into a latent variable $z$. This latent space captures the essence of the image. A decoder network takes the latent variable $z$ and reconstructs an image $x'$ that resembles the original input $x$. The decoder represents the conditional probability $p(x\vert z)$, and the encoder represents the probability $q(z\vert x)$, which approximates the distribution of $z$ in latent space. In this implementation, the distribution of $z$ is a multivariate Gaussian.

Figure 1. Diagram of the variational autoencoder.

The encoder and decoder are constructed with multiple convolutional layers and transposed convolution layers, respectively. The encoder uses convolutional layers to compress the image’s information progressively. The decoder performs the opposite task. It takes the latent variable, essentially a compressed representation of the image, and employs transposed convolutional layers to progressively upscale and rebuild the image.

Dataset

The CelebFaces Attributes (CelebA) Dataset provides $202599$ celebrity images, each labeled with $40$ facial attributes, including attractiveness.

Wolfram Language Implementation

The VAE is implemented with the neural network framework in the Wolfram Language. The network structure follows Generative Deep Learning by David Foster.

In this implementation, the VAE consists of an encoder and a decoder (Figure 2). The encoder has four convolutional layers; its structure is shown in Figures 3 and 4. The encoder outputs the mean and the logarithm of the variance of a multivariate Gaussian distribution with a dimension of $200$. We assume that the covariance matrix of the Gaussian distribution is diagonal, so the log-variance is a one-dimensional array. The decoder has five transposed convolutional layers (DeconvolutionLayer in Wolfram Language), as shown in Figure 5. It takes random samples from the multivariate Gaussian distribution and produces the decoded images as output.

Figure 2. Diagram of the variational autoencoder.
Figure 3. Diagram of the encoder.
Figure 4. The convolution layers in the encoder.
Figure 5. The decoder consists of five transposed convolution layers.
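
For concreteness, below is a minimal sketch of how an encoder and decoder with this structure could be assembled using NetChain. The kernel sizes, strides, and padding are assumptions chosen so that the array dimensions work out for $32\times32$ inputs and a $200$-dimensional latent space; the actual notebook follows Foster's architecture and may differ in these details. The sampling step and loss layers that connect the two pieces into the full vae net are omitted here.

encoder = NetChain[{
    ConvolutionLayer[128, 3, "Stride" -> 2, "PaddingSize" -> 1], Ramp, (* 32x32 -> 16x16 *)
    ConvolutionLayer[128, 3, "Stride" -> 2, "PaddingSize" -> 1], Ramp, (* 16x16 -> 8x8 *)
    ConvolutionLayer[128, 3, "Stride" -> 2, "PaddingSize" -> 1], Ramp, (* 8x8 -> 4x4 *)
    ConvolutionLayer[128, 3, "Stride" -> 2, "PaddingSize" -> 1], Ramp, (* 4x4 -> 2x2 *)
    FlattenLayer[],
    LinearLayer[400] (* 200 means followed by 200 log-variances *)
  }, "Input" -> {3, 32, 32}];

decoder = NetChain[{
    LinearLayer[128*2*2], ReshapeLayer[{128, 2, 2}],
    DeconvolutionLayer[128, 4, "Stride" -> 2, "PaddingSize" -> 1], Ramp, (* 2x2 -> 4x4 *)
    DeconvolutionLayer[128, 4, "Stride" -> 2, "PaddingSize" -> 1], Ramp, (* 4x4 -> 8x8 *)
    DeconvolutionLayer[128, 4, "Stride" -> 2, "PaddingSize" -> 1], Ramp, (* 8x8 -> 16x16 *)
    DeconvolutionLayer[128, 4, "Stride" -> 2, "PaddingSize" -> 1], Ramp, (* 16x16 -> 32x32 *)
    DeconvolutionLayer[3, 3, "Stride" -> 1, "PaddingSize" -> 1], LogisticSigmoid (* 3-channel image *)
  }, "Input" -> 200];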

The loss function used in this model consists of two parts: the mean squared error between the input image and the decoded image, and the Kullback-Leibler (KL) divergence between the multivariate Gaussian distribution produced by the encoder and a standard multivariate normal distribution. The KL divergence is calculated using the formula below:

\[D_{KL} = \frac{1}{2}\sum_i \left(\mu_i^2+\exp(\log\sigma_i^2)-\log\sigma_i^2-1\right). \notag\]

Here, $\mu_i$ and $\sigma_i^2$ represent the mean and variance of the multivariate Gaussian distribution that is generated by the encoder.
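
A minimal sketch of these two terms, written as ordinary functions of the encoder's mean and log-variance vectors (in the trained network the same computation is built into the net as loss layers):

(* KL divergence between N(mean, diag(Exp[logVar])) and the standard normal *)
klDivergence[mean_, logVar_] := 0.5 Total[mean^2 + Exp[logVar] - logVar - 1]

(* reconstruction term: mean squared error between input and decoded image data *)
reconstructionLoss[x_, xDecoded_] := Mean[Flatten[(x - xDecoded)^2]]

totalLoss[x_, xDecoded_, mean_, logVar_] := reconstructionLoss[x, xDecoded] + klDivergence[mean, logVar]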

The original images, with size $178\times218$, were reduced to $32\times32$ for faster computation.
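
In the Wolfram neural network framework, this rescaling can be done on the fly by attaching an image NetEncoder to the network's input port, so that the File objects used for training are decoded straight to $32\times32$ RGB arrays. This is a sketch; the port name "Input" and the net name vae match the training code below but are otherwise assumptions.

imageEncoder = NetEncoder[{"Image", {32, 32}, ColorSpace -> "RGB"}];
vae = NetReplacePart[vae, "Input" -> imageEncoder]; (* attach the encoder to the VAE's input port *)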

Training

The VAE is trained on all 202599 images in the CelebA dataset using a GPU, for ten training rounds. The loss plot in Figure 6 indicates that ten rounds of training are sufficient.

(* gather the training images as File objects *)
files = File /@ FileNames[All, "data/archive/img_align_celeba/img_align_celeba"];

(* train the VAE for ten rounds on the GPU *)
result = NetTrain[vae, files, All, LossFunction -> "Loss", BatchSize -> 500, TargetDevice -> {"GPU", All}, MaxTrainingRounds -> 10];
Figure 6. The loss plot of the training.

Result

Image Generation

In Figure 7, the top row shows input images fed into the trained VAE model, and the bottom row shows the corresponding outputs. These outputs were upscaled from $32\times 32$ to $96\times 96$ pixels using the Very Deep Net for Super-Resolution, a VGG-inspired architecture; the pretrained network from the Wolfram Neural Net Repository was used for this step.

Figure 7. Top row: inputs to the VAE; Bottom row: corresponding outputs.
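
A rough sketch of this upscaling step; the resource name follows the reference below, but the exact pre- and post-processing expected by the model (for example, whether it operates on the luminance channel only) should be taken from the repository page rather than from this sketch.

(* load the VGG-inspired super-resolution net from the Wolfram Neural Net Repository *)
vdsr = NetModel["Very Deep Net for Super-Resolution"];

(* bicubic-resize a decoded 32x32 face to 96x96, then let the net sharpen the result *)
upscale[img_Image] := vdsr[ImageResize[img, {96, 96}, Resampling -> "Bicubic"]]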

Once the VAE was trained, it became capable of generating entirely new, realistic-looking images. The VAE achieves this by sampling random points from the learned latent space and feeding them to the decoder. The decoder, having learned the relationships between latent variables and image features, can produce novel images that reside within the distribution of the training data. Figure 8 shows $40$ random images generated this way.

Figure 8. 40 generated images.
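
A sketch of this sampling step; the layer name "decoder" used with NetExtract and the assumption that the decoder carries an image NetDecoder on its output are hypothetical and depend on how the vae net is named internally.

(* extract the trained decoder from the training result *)
decoder = NetExtract[result["TrainedNet"], "decoder"];

(* draw 40 latent vectors from the 200-dimensional standard normal prior and decode them *)
latentSamples = RandomVariate[NormalDistribution[], {40, 200}];
generatedImages = decoder[latentSamples]; (* one generated face per latent vector *)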

Visualizing Convolutional Layers

To understand how the VAE processes images, we examine the outputs of each convolutional layer in the encoder and decoder. For this purpose, we input a single image (Figure 9) into the VAE, and then observe and display the resulting outputs of the convolutional layers.

Figure 9. Input to the encoder. Left: original image (size $178\times218$); Right: scaled image fed to the encoder (size $32\times32$).
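
One way to obtain these intermediate outputs is to evaluate the encoder while requesting the output port of an interior layer and then render each channel as an image. The layer position used below is an assumption; Information[encoder] shows how the layers are actually numbered or named.

(* activations after the first convolution layer for a single input image img *)
activations = encoder[img, NetPort[{1, "Output"}]];

(* render each channel as a normalized grayscale image and tile them *)
channelImages = ImageAdjust[Image[#]] & /@ Normal[activations];
ImageCollage[channelImages]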

Convolutional Layers in the Encoder

Figures 10-13 show how the encoder progressively extracts higher-level features from the input image during the encoding phase.

Figure 10. The patterns of the 128 channels after the first convolution layer.
Figure 11. The patterns of the 128 channels after the second convolution layer.
Figure 12. The patterns of the 128 channels after the third convolution layer.
Figure 13. The patterns of the 128 channels after the fourth convolution layer.

Transposed Convolution Layers in the Decoder

Figures 14-18 show how the decoder gradually builds the image back up from the latent representation, reintroducing more detail layer by layer.

Figure 14. The patterns of the 128 channels after the first transposed convolution layer.
Figure 15. The patterns of the 128 channels after the second transposed convolution layer.
Figure 16. The patterns of the 128 channels after the third transposed convolution layer.
Figure 17. The patterns of the 128 channels after the fourth transposed convolution layer.
Figure 18. The patterns of the 3 channels after the fifth and final transposed convolution layer.

Morphing Attractiveness in Latent Space

The encoder maps each training image to a 200-dimensional multivariate normal distribution in the latent space, so each image can be represented by its mean, a 200-dimensional vector. To visually explore the image distribution in the latent space, we use the UMAP dimensionality reduction method, which condenses the 200-dimensional vectors into a 2-dimensional space.
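
A sketch of this step, assuming the encoder exposes a "Mean" output port and that UMAP is available as a DimensionReduce method in the Wolfram Language version in use:

(* one 200-dimensional latent mean vector per training image *)
latentMeans = encoder[images, NetPort["Mean"]];

(* condense the means to two dimensions with UMAP and plot them *)
embedding = DimensionReduce[latentMeans, 2, Method -> "UMAP"];
ListPlot[embedding]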

Additionally, each image carries a label indicating attractiveness ($\mathrm{attractive}=1$) or unattractiveness ($\mathrm{attractive}=-1$). Figure 19 shows distinct distributions for attractive and unattractive images in the latent space.

Figure 19. Distribution of the images in latent space. Each dot represents one image. The two disks are the means of the attractive and unattractive images. The color represents attractive (blue) and unattractive (red) labels.

Let's define $\Delta$ as the difference between the means of the attractive and unattractive images in the latent space. An image can then be altered by adding multiples of $\Delta$ to its latent space representation:

\[z' = z + n\Delta, \notag\]

where $z$ is the latent space representation of an image, $n$ is a real multiplier, and $z’$ denotes the modified representation.
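
A sketch of this operation, reusing the hypothetical latentMeans and decoder from above together with an assumed list labels of attractiveness labels ($+1$ or $-1$) aligned with the images:

(* difference between the mean latent vectors of attractive and unattractive images *)
delta = Mean[Pick[latentMeans, labels, 1]] - Mean[Pick[latentMeans, labels, -1]];

(* shift a latent representation z by n copies of delta and decode the result *)
morph[z_, n_] := decoder[z + n delta]

(* one row of Figure 20: the same image morphed from more attractive to less attractive *)
morphRow[z_] := morph[z, #] & /@ Range[3, -3, -1]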

Figure 20 illustrates four images transitioning from more attractive to less attractive. This transition suggests that more attractive individuals tend to have longer hair rather than being bald, exhibit plumper features, and possess more feminine characteristics.

Figure 20. Altering the attractiveness of an image. Each row displays an image transitioning from more attractive to less attractive. The number represents the multiplier of $\Delta$ added to the latent space representation of the images.

Conclusion

The variational autoencoder efficiently converts images into probability distributions within latent space, capturing not just objective features like hair length but also subjective attributes such as attractiveness. Implementing this deep network is straightforward using the Neural Network Framework available in the Wolfram Language.

References

Foster, David (2023). Generative Deep Learning: Teaching Machines To Paint, Write, Compose, and Play (2nd ed.). O’Reilly Media, Inc.

CelebFaces Attributes (CelebA) Dataset

Very Deep Net for Super-Resolution

UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction