Encoding Rotated Images with an Autoencoder
In this study, we explore the application of PyTorch-based autoencoders with convolutional layers to encode images from the Fashion-MNIST dataset. Our autoencoder encodes and decodes the original images effectively. However, a notable challenge arises when we feed rotated images into the model: they are often decoded incorrectly, producing reconstructions that resemble other categories. We address this issue by training the model on a combination of original and randomly rotated images, enabling it to decode rotated input correctly.
Data
We use a subset of the Fashion-MNIST dataset for our experiments, focusing on four of the ten available classes: Trouser, Dress, Sandal, and Bag. Below is the code for loading the dataset and a sample of the images; a sketch of how the four-class subset and data loaders can be built follows the snippet.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision import datasets
from torchvision.transforms import ToTensor, RandomRotation
dataset_train = datasets.FashionMNIST(
    root='data',
    train=True,
    transform=ToTensor(),
    download=True)
dataset_test = datasets.FashionMNIST(
    root='data',
    train=False,
    transform=ToTensor(),
    download=True)
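The article states that only four classes are used, but does not show the subsetting code. The sketch below is one way to build the training and test loaders used later (dl_train and dl_test); the class indices follow the standard Fashion-MNIST labeling (Trouser = 1, Dress = 3, Sandal = 5, Bag = 8), and the batch size of 64 is an assumption.

# Keep only the four classes used in this study (sketch; indices assume the
# standard Fashion-MNIST labels: Trouser=1, Dress=3, Sandal=5, Bag=8).
classes_kept = [1, 3, 5, 8]

def filter_classes(dataset, classes):
    # dataset.data: uint8 tensor [N, 28, 28]; dataset.targets: int64 tensor [N]
    mask = torch.isin(dataset.targets, torch.tensor(classes))
    x = dataset.data[mask].unsqueeze(1).float() / 255.0   # [N, 1, 28, 28] in [0, 1]
    y = dataset.targets[mask]
    return TensorDataset(x, y)

dl_train = DataLoader(filter_classes(dataset_train, classes_kept), batch_size=64, shuffle=True)
dl_test = DataLoader(filter_classes(dataset_test, classes_kept), batch_size=64, shuffle=False)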
Autoencoder Model
The autoencoder comprises an encoder and decoder, both constructed using 2D convolutional layers. The encoded representation of an image consists of a pair of real numbers $(x, y)$.
class autoencoder(nn.Module):
    def __init__(self, input_channel):
        super().__init__()
        PAD = 1
        # Encoder: 28x28 -> 14x14 -> 7x7 -> 4x4 feature maps, then a 2-D embedding
        self.encoder = nn.Sequential(
            nn.Conv2d(input_channel, 32, (3, 3), stride=2, padding=PAD),
            nn.ReLU(),
            nn.Conv2d(32, 64, (3, 3), stride=2, padding=PAD),
            nn.ReLU(),
            nn.Conv2d(64, 128, (3, 3), stride=2, padding=PAD),
            nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(2)
        )
        # Decoder: 2-D embedding -> 128x4x4 (= 2048) -> 7x7 -> 14x14 -> 28x28
        self.decoder = nn.Sequential(
            nn.LazyLinear(2048),
            nn.Unflatten(1, (128, 4, 4)),
            nn.ConvTranspose2d(128, 64, kernel_size=(3, 3), stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, kernel_size=(3, 3), stride=2, padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(32, input_channel, kernel_size=(3, 3), stride=2, padding=1, output_padding=1),
            nn.Sigmoid()
        )

    def decode(self, x):
        return self.decoder(x)

    def encode(self, x):
        return self.encoder(x)

    def forward(self, x):
        return self.decoder(self.encoder(x))
A summary of the model structure and parameter counts is produced by the code below.
from torchsummary import summary
device = 'cuda' if torch.cuda.is_available() else 'cpu'
summary(autoencoder(1).to(device), (1, 28, 28), device=device)
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 32, 14, 14] 320
ReLU-2 [-1, 32, 14, 14] 0
Conv2d-3 [-1, 64, 7, 7] 18,496
ReLU-4 [-1, 64, 7, 7] 0
Conv2d-5 [-1, 128, 4, 4] 73,856
ReLU-6 [-1, 128, 4, 4] 0
Flatten-7 [-1, 2048] 0
Linear-8 [-1, 2] 4,098
Linear-9 [-1, 2048] 6,144
Unflatten-10 [-1, 128, 4, 4] 0
ConvTranspose2d-11 [-1, 64, 7, 7] 73,792
ReLU-12 [-1, 64, 7, 7] 0
ConvTranspose2d-13 [-1, 32, 14, 14] 18,464
ReLU-14 [-1, 32, 14, 14] 0
ConvTranspose2d-15 [-1, 1, 28, 28] 289
Sigmoid-16 [-1, 1, 28, 28] 0
================================================================
Total params: 195,459
Trainable params: 195,459
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.38
Params size (MB): 0.75
Estimated Total Size (MB): 1.13
----------------------------------------------------------------
Training with Original Images
We initially train the model using only the original images. The training process involves minimizing the Mean Squared Error (MSE) loss between the input and the reconstructed output.
model = autoencoder(1)
model.to(device)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

n_epochs = 51
n_print = 10
loss_history = []
for epoch in range(n_epochs):
    loss_train = 0
    count_train = 0
    loss_test = 0
    count_test = 0
    # training
    model.train()
    for ds, _ in dl_train:
        x = ds.to(device)
        pred = model(x)
        loss = loss_fn(pred, x)
        loss_train += loss.item() * len(ds)
        count_train += len(ds)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    # evaluation on the test set
    model.eval()
    with torch.no_grad():
        for ds, _ in dl_test:
            x = ds.to(device)
            pred = model(x)
            loss_test += loss_fn(pred, x).item() * len(ds)
            count_test += len(ds)
    if epoch % n_print == 0:
        print(f"epoch={epoch}, train loss={loss_train/count_train}, test loss={loss_test/count_test}")
    loss_history.append([epoch, loss_train/count_train, loss_test/count_test,
                         loss_train, count_train, loss_test, count_test])
Each image is encoded into a pair of real numbers denoted as $(x,y)$. Figure 2 illustrates the distribution of encoded images in the training dataset.
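The plotting code is not included in the article; a minimal sketch of how such a figure could be produced is given below, assuming matplotlib and the dl_train loader defined earlier.

import matplotlib.pyplot as plt

# Collect the 2-D embedding of every training image (sketch)
model.eval()
points, labels = [], []
with torch.no_grad():
    for ds, ys in dl_train:
        points.append(model.encode(ds.to(device)).cpu())
        labels.append(ys)
points = torch.cat(points)
labels = torch.cat(labels)

# Scatter plot of the (x, y) embeddings, colored by class label
plt.scatter(points[:, 0], points[:, 1], c=labels, s=2, cmap='tab10')
plt.xlabel('x')
plt.ylabel('y')
plt.show()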
Figure 3 displays sample images both before and after passing through the trained model. The autoencoder filters out specific image details during this process, preserving only the underlying general features.
Incorrect Decoding of Randomly Rotated Images
When we feed randomly rotated images into the autoencoder, it often fails to decode them accurately: the reconstructions resemble images from other categories. Figure 4 presents examples of this phenomenon for each of the four classes. For instance, a rotated Trouser image may be decoded as a Bag, Dress, or Sandal, and a rotated Dress may be decoded as a Bag. Rotated Sandal and Bag images, however, are decoded correctly, likely due to their distinct structural characteristics.
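A minimal sketch of how one such case can be reproduced is shown below; the rotation angle and the choice of test image are arbitrary and not taken from the article.

import torchvision.transforms.functional as TF

# Decode an original test image and a rotated copy of it (sketch)
x, _ = next(iter(dl_test))
x = x[:1].to(device)              # a single image, shape [1, 1, 28, 28]
x_rot = TF.rotate(x, angle=90.0)  # arbitrary rotation angle

model.eval()
with torch.no_grad():
    recon = model(x)              # reconstructed correctly
    recon_rot = model(x_rot)      # often resembles a different class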
Training Model with Randomly Rotated Images
To address the incorrect decoding, we retrain the model with randomly rotated images. For each image in the training dataset, five randomly rotated versions are added as additional inputs; the target for both the original and the rotated versions remains the original, unrotated image.
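The augmentation code itself is not listed in the article; the sketch below is one possible implementation, assuming a full ±180° rotation range (the actual range is not stated) and reusing the dl_train loader and batch size from above.

# Build an augmented training set: each batch of images plus five randomly
# rotated copies, all paired with the original images as reconstruction
# targets (sketch; everything is kept in memory for simplicity).
rotate = RandomRotation(degrees=180)   # assumed rotation range

inputs, targets = [], []
for x, _ in dl_train:
    inputs.append(x)                   # original image as input ...
    targets.append(x)                  # ... reconstructed to itself
    for _ in range(5):                 # five rotated copies per image
        inputs.append(rotate(x))       # rotated input (note: one angle per batch here) ...
        targets.append(x)              # ... but the target stays the original image

ds_aug = TensorDataset(torch.cat(inputs), torch.cat(targets))
dl_train_aug = DataLoader(ds_aug, batch_size=64, shuffle=True)

With this augmented loader, the training loop above stays the same except that each batch now yields an (input, target) pair, so the reconstruction loss is computed as loss_fn(model(x), target) rather than against the input itself.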
After this training, the autoencoder can accurately decode rotated images in the test dataset, as shown in Figure 5.
Figure 6 provides an analysis of the embeddings generated by the encoder. Each class is represented by one image and nine randomly rotated versions. It is evident that the classes are separated in the embedding space, and within each class, there is variability due to the different rotations.
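A sketch of how these embeddings could be computed is given below; taking the first test image of each class as the representative and using a ±180° rotation range are assumptions.

# Embed one image per class together with nine randomly rotated copies (sketch)
rotate = RandomRotation(degrees=180)

model.eval()
embeddings = {}
with torch.no_grad():
    for cls in classes_kept:
        idx = (dataset_test.targets == cls).nonzero()[0].item()   # first image of this class
        x = dataset_test.data[idx].unsqueeze(0).unsqueeze(0).float() / 255.0
        versions = torch.cat([x] + [rotate(x) for _ in range(9)]).to(device)
        embeddings[cls] = model.encode(versions).cpu()             # ten (x, y) points per class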
Summary
In conclusion, our convolutional autoencoder successfully encodes images into a two-dimensional representation. Initially, it struggles to decode randomly rotated images correctly. However, after training with rotated images, the autoencoder decodes them accurately. Each decoded image then represents an entire distribution of random rotations, indicating that the model has learned a degree of rotation invariance.