For an old fan of TensorFlow 2 it is somewhat satisfying to notice that some TF2 problems also exist in analogous form in a PyTorch environment.
Anyone who has worked with visual data knows that one needs to modify, augment and transform image data and then load them from some storage, under CPU control, into the GPU's VRAM during network training. On systems with only a few GB of VRAM we are sometimes forced to provide images one batch after another to keep the VRAM consumption on the GPU within acceptable limits. This is what data pipelines do for us.
Some of the TF2-related pipeline tools gave me a headache, because under certain conditions they were painfully slow compared to what happened on the GPU, in particular for relatively small NN models and small batches of data. This potential mismatch between CPU and GPU capabilities also came up with PyTorch on my (relatively old) Linux systems. And some readers may have experienced similar problems on Google's Colab service, too.
With this and the next two posts I want to describe some recipes for loading data faster into the GPU, recipes which I found spread over Internet forums. In particular, I want to have a look at the extreme case of loading all image data to the GPU ahead of any training operations.
I am just a beginner with PyTorch, so experienced users may look at these three posts with a pitying eye. But I hope these posts may help other PyTorch beginners … To keep things simple I take the MNIST and FashionMNIST datasets as examples, although they have some special characteristics.
In this first post we look a bit closer at some properties of typical torchvision datasets. This will help us later to understand what a dataloader object does and how we can transfer all image data of such a set directly to the GPU. I assume that the reader is familiar with the fact that data handling on a GPU requires that we provide the data in the form of tensors, i.e. array-like objects in a special format suited for GPU operations.
The data property of Dataset objects for MNIST and FashionMNIST
Let us first look at the interface for downloading and using an available image dataset. Below is some code for FashionMNIST:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from torchvision.transforms import ToTensor, Normalize, Compose
import matplotlib.pyplot as plt
from PIL import Image
# -------------------------------
# Path to FashionMNIST data - this is where the dataset gets stored
root = '/mnt_ramdisk/FashionMNIST_data/'

# Load the data (if necessary)
train_data = datasets.FashionMNIST(root=root,
                                   train=True,
                                   download=True,
                                   transform=Compose([
                                       ToTensor()
                                   ])
                                   )
print(len(train_data))
print()
print()

test_data = datasets.FashionMNIST(root=root,
                                  train=False,
                                  download=True,
                                  transform=Compose([
                                      ToTensor(),
                                  ])
                                  )
print(len(test_data))
Note that datasets are provided by the package “torchvision“.
Obviously, one can define the path where the data get stored. And we can define a chain of transformation operations which shall be applied to the data. In the case above we are not astonished that ToTensor() appears as an element in this chain. We could have added normalization or augmentation operations; I left such steps out for the sake of simplicity.
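Just to illustrate the idea, a minimal sketch of such an extended chain. Note that the mean/std values are assumptions on my side: 0.286 and 0.353 are approximate values often quoted for FashionMNIST; for other data you would compute such statistics yourself.

trans_ext = Compose([
    ToTensor(),                      # PIL image => float tensor with values in [0.0, 1.0]
    Normalize((0.286,), (0.353,))    # assumed per-channel mean/std for FashionMNIST
])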
The object definition looks promising. Most introductory documentations now turn directly to the application of a DataLoader() to the dataset object. A dataloader object will handle further processing of the data for us. But such a straightforward line of argumentation leaves the interested ML developer with some open points. There are two points which I did not understand by just following the usual path of the PyTorch documentation:
- Datasets may well have their own policies regarding what kind of data really get downloaded and saved in one of the target folders on your Linux system. Even the standard documentation of the basic and specific classes for torchvision datasets (e.g. for the MNIST dataset) focuses on the class methods. The rules for certain obligatory functions of a standard Dataset class must, of course, be fulfilled by the classes for specific datasets; e.g., a method __getitem__() is always required. But what the data look like in their original downloaded form is something else.
- Another point which remains somewhat obscure is the question of when and how the transformations prescribed by the "transform" settings in the Dataset interface are applied.
Both points become much clearer, however, when one looks at the source code. And the information there opens up a controlled option for loading some datasets completely to the GPU.
For MNIST you find the source code here. First, note the central statement in the __init__() function there.
self.data, self.targets = self._load_data()
So, in these cases we will find some data in the properties "data" and "targets" of a concrete object instance of this class. Reading a bit in the code makes it clear that "data" holds the image data, while "targets" holds the corresponding labels. With the parameter value train=True we get the training split of the dataset, with train=False the test split.
Ok, let us look at what kind of data format we actually get after having run the above statements:
print("ds shape = ", train_data.data.shape)
ds shape = torch.Size([60000, 28, 28])
print(train_data.data[1])
tensor([[ 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 41, 188, 103, 54,
48, 43, 87, 168, 133, 16, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 1, 0, 0, 0, 49, 136, 219, 216, 228, 236, 255,
255, 255, 255, 217, 215, 254, 231, 160, 45, 0, 0, 0, 0, 0],
...
[ 0, 0, 0, 0, 0, 1, 0, 0, 139, 146, 130, 135, 135, 137,
125, 124, 125, 121, 119, 114, 130, 76, 0, 0, 0, 0, 0, 0]],
dtype=torch.uint8)
We note that the data residing in train_data.data are already tensor data. So, if they fulfilled the dimension and type expectations of a NN model, we could load them directly into the GPU. However, these raw data do not fulfill those requirements.
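Just to make this point concrete, a minimal sketch of my own (assuming a CUDA device is available) of how one could manually bring the raw tensor into a GPU-ready form; it anticipates for the whole array what ToTensor() does per image:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
raw = train_data.data                      # uint8 tensor of shape [60000, 28, 28]
imgs = raw.unsqueeze(1).float() / 255.0    # add a channel dim, scale to [0.0, 1.0]
imgs = imgs.to(device)                     # shape [60000, 1, 28, 28], dtype float32
print(imgs.shape, imgs.dtype, imgs.device)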
PIL image data vs dataset.data
Now, you may answer that we actually do get image data from a dataset, as the following example code from a PyTorch tutorial shows:
labels_map = {
    0: "T-Shirt", 1: "Trouser", 2: "Pullover", 3: "Dress",
    4: "Coat", 5: "Sandal", 6: "Shirt", 7: "Sneaker",
    8: "Bag", 9: "Ankle Boot",
}
figure = plt.figure(figsize=(6, 6))
cols, rows = 3, 3
for i in range(1, cols * rows + 1):
    sample_idx = torch.randint(len(train_data), size=(1,)).item()
    img, label = train_data[sample_idx]
    figure.add_subplot(rows, cols, i)
    plt.title(labels_map[label])
    plt.axis("off")
    plt.imshow(img.squeeze(), cmap="gray")
plt.show()
The result looks like this:

[Figure: a 3×3 grid of gray-scale FashionMNIST sample images, each titled with its class label]
How is this possible? What happens under the hood of our dataset-object?
One question that directly comes up: Why is the call of the squeeze() method required when handing the image over to Matplotlib's imshow()?
Another more fundamental point is the statement
img, label = train_data[sample_idx]
We find the basis for this in the code of the __getitem__() function of the dataset class, which actually provides "img" and "label" in the above code:
def __getitem__(self, index: int) -> Tuple[Any, Any]:
    """
    Args:
        index (int): Index

    Returns:
        tuple: (image, target) where target is index of the target class.
    """
    img, target = self.data[index], int(self.targets[index])

    # doing this so that it is consistent with all other datasets
    # to return a PIL Image
    img = Image.fromarray(img.numpy(), mode="L")

    if self.transform is not None:
        img = self.transform(img)

    if self.target_transform is not None:
        target = self.target_transform(target)

    return img, target
Can we check this in more detail? Well, we just have to repeat the approach of __getitem__():
img = train_data.data[1]
img2 = Image.fromarray(img.numpy(), mode="L")
print("Info for img2 :")
print(img2)
print()
trans = transforms.Compose([ ToTensor(), ])
img3 = trans(img2)
print("Info for img3 :")
print("Shape imgg3 : ", img3.shape)
print(img3)
The output of this code snippet is:
Info for img2 :
<PIL.Image.Image image mode=L size=28x28 at 0x2FE587A3F1D0>

Info for img3 :
Shape img3 :  torch.Size([1, 28, 28])
tensor([[[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0039, 0.0000, 0.0000,
          0.0000, 0.0000, 0.1608, 0.7373, 0.4039, 0.2118, 0.1882, 0.1686,
          0.3412, 0.6588, 0.5216, 0.0627, 0.0000, 0.0000, 0.0000, 0.0000,
          0.0000, 0.0000, 0.0000, 0.0000],
         ...
         [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0039, 0.0000, 0.0000,
          0.5451, 0.5725, 0.5098, 0.5294, 0.5294, 0.5373, 0.4902, 0.4863,
          0.4902, 0.4745, 0.4667, 0.4471, 0.5098, 0.2980, 0.0000, 0.0000,
          0.0000, 0.0000, 0.0000, 0.0000]]])
This shows what the dataset object internally does, when we retrieve an indexed tuple from it:
- In a first step it takes the provided original tensor data (uint8), turns them into a Numpy array and transforms that array into a PIL image. Such a PIL image of mode "L" still works with integer pixel values in the range [0, 255].
- In a second step the requested transformation chain, here ToTensor(), turns the PIL image into an image data tensor of the form [C, H, W] (with C: color channels, H: height, W: width), now with float pixel values in the range [0.0, 1.0]. For our gray-scale FashionMNIST images we get a shape of [1, 28, 28].
The second step explains why we must use the squeeze() method before calling imshow(). Note that __getitem__() provides a PIL image if no transformation to a tensor is performed. Otherwise we get a tensor which must be squeezed to become an understandable 28×28 array input for plt.imshow(). Note: If you had not requested a transformation to a tensor when you defined the dataset, you could have handed the PIL image data directly to plt.imshow().
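We can verify this directly; a minimal check (the index 0 is an arbitrary pick of mine):

img_t, label = train_data[0]      # ToTensor() applied => tensor of shape [1, 28, 28]
print(img_t.shape)                # torch.Size([1, 28, 28])
print(img_t.squeeze().shape)      # torch.Size([28, 28]) - the 2D array imshow() accepts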
The important point for further analysis is that a DataLoader object iterates over the data of a Dataset by using the method __getitem__() of the Dataset object.
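To see this mechanism in action, here is a minimal sketch (batch_size=64 is an arbitrary choice of mine): per batch the loader calls __getitem__() for a set of indices and stacks the returned tuples into batch tensors.

train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
imgs, labels = next(iter(train_loader))    # one batch, assembled via __getitem__()
print(imgs.shape)      # torch.Size([64, 1, 28, 28]) - stacked transformed images
print(labels.shape)    # torch.Size([64]) - the integer labels, collated into a tensor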
Conclusion
In this post we have seen that at least some available torchvision datasets deliver their data already in a basic PyTorch tensor format. We have to refer to the "data" property of the instantiated dataset object to access the respective tensor array. These basic image tensors do, however, not fit the standard expectations of NN models for image tensors. Defining a transformation chain containing ToTensor() for the dataset's parameter "transform" solves this problem. Such transformations are applied when we access an indexed element of the Dataset; such an element actually is a tuple of tensor image data and a label, where the label is a plain integer unless an additional target_transform turns it into something else.
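As a short consolidating check (a sketch of mine, not from the tutorial), compare the raw storage with the transformed access path:

raw = train_data.data[0]       # raw storage: uint8, shape [28, 28], no transform applied
img, label = train_data[0]     # __getitem__(): PIL conversion plus ToTensor()
print(raw.dtype, raw.shape)    # torch.uint8 torch.Size([28, 28])
print(img.dtype, img.shape)    # torch.float32 torch.Size([1, 28, 28])
print(type(label))             # <class 'int'> - no target_transform was requested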
A visualization of the tensor image data of a requested dataset element requires an application of the squeeze() function to get a format which can be handled by Matplotlib.
Links
Code for the MNIST dataset: https://pytorch.org/vision/0.21/_modules/torchvision/datasets/mnist.html#MNIST
Tutorial for datasets and dataloaders: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html
Addendum, 14.03.2025 – How to get the value of a label?
A reader has asked how to get the value of a label from the downloaded tensors of a dataset. You find the labels of the images, which you may need e.g. for the training of a discriminator NN, in the property "dataset.targets" - also in tensor form.
In the case of MNIST / FashionMNIST an indexed element of this tensor has a dimension of zero; it actually represents a single number.
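A minimal check of this (the index 15 matches the example below):

t = train_data.targets[15]
print(t.dim())     # 0 - a zero-dimensional tensor
print(t.shape)     # torch.Size([])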
To use such a label as a plain Python number, e.g. outside of GPU-bound tensor operations, you have to call the item() method of the tensor element. See the following example, which produces a plot of a certain image (with index 15):
labels_map = {
    0: "T-Shirt", 1: "Trouser", 2: "Pullover", 3: "Dress",
    4: "Coat", 5: "Sandal", 6: "Shirt", 7: "Sneaker",
    8: "Bag", 9: "Ankle Boot",
}

idx = 15
img = train_data.data[idx]
img2 = Image.fromarray(img.numpy(), mode="L")
label = train_data.targets[idx].item()
# Plot the image
figure = plt.figure(figsize=(4, 4))
cols, rows = 1, 1
for i in range(1, cols * rows + 1):
    figure.add_subplot(rows, cols, i)
    plt.title(labels_map[label])
    plt.axis("off")
    plt.imshow(img2, cmap="gray")
plt.show()