PyTorch / datasets / dataloader / data transfer to GPU – II – dataloader too slow on CPU?

In the last post of this mini-series we saw that some PyTorch torchvision datasets have a directly accessible property "data". There we find image data in a dataset-specific format. In the case of MNIST and FashionMNIST (and of many other sets) these data are already torch tensors. However, because these tensors are squeezed, they do not fit the format a NN requires for image tensors: they lack a dimension for indexing the usual color channels. Thus, we cannot use these tensors directly as input for a NN-model.

For standard usage of an image dataset, we therefore define a PyTorch dataset object with an attached chain of transformation operations. A standard operation is e.g. ToTensor(), which brings the data into the format a NN expects. All defined transformation operations are invoked whenever we access an indexed element of the dataset object. Without transform operations, such an element typically provides a tuple of image data (in PIL format) and label data (for MNIST/FashionMNIST an integer).

A defined ToTensor() transform operation eventually delivers usable PyTorch tensors. The resulting image data follow the required [C, H, W] format. An obligatory __getitem__() function of the dataset class controls how the data of a dataset object are handled internally.
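As a quick illustration (a minimal sketch; the root path below is just a placeholder), we can compare an indexed element of a raw MNIST dataset with one delivered through a ToTensor() transform:

from torchvision import datasets
from torchvision.transforms import ToTensor

# Hypothetical path - adapt it to your system
ds_raw  = datasets.MNIST(root='/tmp/MNIST_data', train=True, download=True)
ds_tens = datasets.MNIST(root='/tmp/MNIST_data', train=True, download=True,
                         transform=ToTensor())

img_pil, label  = ds_raw[0]       # PIL.Image.Image, int
img_t,   label2 = ds_tens[0]      # float tensor of shape [1, 28, 28]
print(type(img_pil), label)
print(img_t.shape, img_t.dtype)   # torch.Size([1, 28, 28]) torch.float32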

In this post we look a bit at the setup of a PyTorch “dataloader” and what it does with dataset elements on the CPU of our Linux-based ML-system. A dataloader’s actions can become a performance bottleneck, in particular during the training of relatively small NNs. We will see that for small networks and small batch-sizes the GPU may wait and wait for data provision from the CPU. This is an effect that may hit ML-beginners on standard PC systems with limited CPU power and older RAM devices. Some measures may help to overcome this bottleneck – in parts. Below, I provide code snippets to enable the reader to perform his/her own test runs.

Test data and test NN: To test the functionality of a PyTorch dataloader we first define a MNIST dataset object. Then we create a dataloader object which uses our dataset. Afterward we pass the data to a simple CNN with 4 successive Conv2D layers – most of them working with a stride of 2. The results of 256 maps are fed into 2 fully connected layers producing an output via log_softmax. I will not go into the details and disadvantages of this extremely simple NN as they are not of interest here. We are only interested in CPU and GPU performance and their relative load. This is also the reason why we regard training runs only – and no intermediate evaluation of test data or pure inference runs.

Plan of work: We vary parameters of the dataloader object in successive training runs of our NN. During the runs we grossly measure the CPU and GPU load as well as the turnaround time for 40 training epochs. We will find that the parameters for the number of parallel dataloader workers (num_workers) and for the batch size are decisive for performance optimization. However, our tests will also reveal that the back-and-forth operations which first turn the delivered image data into PIL images and afterward back into tensors, as well as additional normalization operations, limit the transfer rate of data to the GPU. In addition, such basic operations actually are a waste of CPU time, since in a standard dataloader setup they are repeated epoch for epoch.

Hint: The first sections of this post contain somewhat standard preparation steps.

Preparation and code snippets

I should say that I did all of my tests in a Jupyterlab environment. I assume that the reader is familiar with such a browser-based environment. I further assume that you know how to restart the notebook kernel and how to rerun a notebook after substantial changes to a test-run’s setup.

We first load some Python modules. Afterward, we define parameters for the setup of the training and the dataloader. We also check for a CUDA device, in my case an available Nvidia RTX 4060 Ti.

You may sometimes have to scroll to see all of the code.


import os, sys, time, math 
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from torchvision.transforms import ToTensor, Normalize, Compose 
import matplotlib.pyplot as plt
from PIL import Image

Some of the following control parameters will become clearer later on. I will later also indicate which parameters have to be changed for certain runs.


num_samples = 60000  # MNIST & FashionMNIST 

# Parameters for NN training 
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Learning rate 
LR = 1.e-4

# Batch Size 
BATCH_SIZE_TRAIN = 32

# Number of epochs
NUM_EPOCHS = 41 
NUM_BATCHES = math.ceil(num_samples / BATCH_SIZE_TRAIN) 
print("Num_BATCHES = ", NUM_BATCHES) 
# Interval (in batches) for intermediate print output 
NP_Batches = int(num_samples / BATCH_SIZE_TRAIN / 100.) * 20
#print(NP_Batches)

# Preload all data to the GPU?
b_PRELOAD = False

# Parameters for setting up a dataloader 
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# if True => Tensor data set with all transforms done once 
b_TENSOR_DATASET = False
if b_PRELOAD: 
    b_TENSOR_DATASET = True

# Dataloader parameters
b_STANDARD_PARAMS = True
NUM_WORKERS = 0
b_PIN_MEMORY = False 
b_PERSISTENT_WORKERS = False

Note: We will not use the options controlled by the "b_PRELOAD" and "b_TENSOR_DATASET" parameters in this post.

The parameter “b_TENSOR_DATASET” is a decisive one, although we will not use it in this post. However, in the next post of this series, we will perform basic data transformations ahead of the NN-training. Then we will create datasets built on tensors. The parameter “b_PRELOAD” controls whether the elements of a tensor dataset will be preloaded to the GPU ahead of a training run.
We will not set these two parameters to True in the present post, i.e. we will not use the related options in the code.

We now test for CUDA:


# Checking for cuda-device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
if device.type == 'cuda':
    print(torch.cuda.get_device_name(0)) 
    # Print the current GPU memory usage (cf. the output below)
    print('Memory Usage:')
    print('Allocated:', round(torch.cuda.memory_allocated(0)/1024**3, 1), 'GB')
    print('Cached:   ', round(torch.cuda.memory_reserved(0)/1024**3, 1), 'GB')

-------------------------
Using device: cuda
NVIDIA GeForce RTX 4060 Ti
Memory Usage:
Allocated: 0.0 GB
Cached:    0.0 GB

This is a standard procedure for PyTorch. I use CUDA 12.8, cudnn 9.7 and Nvidia drivers 570.127.06.

Definition and download of a dataset (for the MNIST example)

We need to download the MNIST data, of course. The following statements perform a download automatically if the dataset is not found at a defined path. We define a dataset object for the MNIST training images and labels:


# Path to MNIST data
ds_path = '/mnt_ramdisk/MNIST_data/'

# Definition of a dataset with standard parameters
# and a chain of basic transformation operations 
train_data = datasets.MNIST(root=ds_path, 
                               train=True, 
                               download=True, 
                               transform=Compose([
                                   ToTensor(),
                                   Normalize( (0.1302,),
                                              (0.3069,) )
                               ])
)
print(train_data.__len__)

-------------------------------------------
<bound method MNIST.__len__ of Dataset MNIST
    Number of datapoints: 60000
    Root location: /mnt_ramdisk/MNIST_data/
    Split: Train
    StandardTransform
Transform: Compose(
               ToTensor()
               Normalize(mean=(0.1302,), std=(0.3069,))
           )>

-----------------------

Typically, I have a RAM disk mounted on /mnt_ramdisk. So, reading the data fast enough certainly is not a problem for our tests. Even if we used a standard hard disk, the Linux system would buffer all data access on disks in a sufficiently large RAM, anyway.

Note: In contrast to the example in the first post, I have added a standardization of the data to the transform chain. I.e., in addition to its internal data handling (conversion to PIL format), the dataloader will create tensors in the right format for us and also standardize the data – on the fly during the training.

If you wonder where the mean and the standard deviation values provided to the Normalize-function come from: We can get these data directly from the downloaded tensors of the MNIST image data:


pics = train_data.data.to(torch.float32).unsqueeze(1) 
pics2 = pics / 256.
print("mean = ",  pics2.mean(), " :: stddev = ", pics2.std() )

The 1st statement above should be clear after the first post of this series. This snippet gives us an output of :

mean = tensor(0.1302) :: stddev = tensor(0.3069).

Readers of the first post in this mini-series may now ask where the division by 256 is done when we later use the dataset with a dataloader. Well, have a look at what we found for the function __getitem__() in the last post. This function includes a transformation of the tensors to Numpy arrays and to the PIL format. The scaling to the [0, 1] range then happens when ToTensor() converts the PIL image back into a float tensor during the standard use of a dataloader.
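To make this "back and forth" a bit more tangible, here is a simplified sketch that re-implements the essential steps of MNIST's __getitem__() as a standalone function. It is not a verbatim copy of the torchvision source and details differ between versions:

from PIL import Image

def mnist_getitem_sketch(dataset, index):
    # Roughly what MNIST.__getitem__() does for each indexed element
    img, target = dataset.data[index], int(dataset.targets[index])
    # uint8 tensor -> Numpy array -> PIL image ...
    img = Image.fromarray(img.numpy(), mode="L")
    if dataset.transform is not None:
        # ... and via ToTensor() back to a float tensor scaled to [0, 1]
        # (plus Normalize(), if defined in the transform chain)
        img = dataset.transform(img)
    return img, target

E.g. mnist_getitem_sketch(train_data, 0) would reproduce what train_data[0] delivers.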

Creation of a so-called tensor dataset for runs with prepared tensors

The next sequence of statements prepares another dataset composed of fully normalized tensors. We will use such a “tensor dataset” in runs of the next post in this series. For now, note that we must use a special class “TensorDataset” to create a dataset object based on image tensors of e.g. dtype=torch.float32.

We first simply use the pics2-tensor from above to get yet another tensor “pics3” of normalized data.


b_print = False
if b_TENSOR_DATASET: 
    # Label tensor of the MNIST training data - needed below 
    # for the TensorDataset and for preloading to the GPU 
    labels = train_data.targets
    trans = transforms.Compose([ Normalize( (0.1302,),
                                            (0.3069,) )
                              ])
    # Perform the model training 
    start_time = time.perf_counter()    
    pics3 = trans(pics2)
    end_time = time.perf_counter()
    cpu_time = end_time - start_time
    print("CPU time for normalization: ", cpu_time)     
    print()
    print("Shape images = ", pics3.shape)
    print()
    if b_print: 
        print("Shape of img tensor : ",pics3[1,0].shape)
        print()
        print(pics3[1,0])


# !!!!! 
# When preloading do not forget that you use 
# standard parameters in the dataloader 
#    Multiple Workers and memory_pinning is not possible for data on the GPU!
if b_PRELOAD: 

    # Label tensors to GPU 
    labels_pre = labels.to(device)
    print (labels_pre.shape)
     
    # Normalized Img data to the GPU  
    picts_pre  = pics3.to(device)
    print (picts_pre.shape)

------------------------
CPU time for normalization:  0.04189275699991413
Shape images =  torch.Size([60000, 1, 28, 28])
torch.Size([60000])
torch.Size([60000, 1, 28, 28])

We then create a “tensor dataset” via the class “TensorDataset“.


if b_TENSOR_DATASET:
    if not b_PRELOAD: 
        tens_train_ds = torch.utils.data.TensorDataset(pics3, labels)
    else: 
        tens_train_ds = torch.utils.data.TensorDataset(picts_pre, labels_pre)
        print("Preload - dataset formed by loaded tensors")

As said: These variants will only become important in a forthcoming post.

Definition and a first usage of a dataloader

How do we define a dataloader? It is easy.


if not b_PRELOAD: 
    if not b_TENSOR_DATASET: 
        if b_STANDARD_PARAMS:
            train_dataloader = DataLoader(train_data, batch_size=BATCH_SIZE_TRAIN, shuffle=True) 
        else:                                  
            train_dataloader = DataLoader(train_data, batch_size=BATCH_SIZE_TRAIN, shuffle=True, 
                                          num_workers=NUM_WORKERS, pin_memory=b_PIN_MEMORY, 
                                          persistent_workers=b_PERSISTENT_WORKERS  )
        #test_dataloader  = DataLoader(test_data,  batch_size=BATCH_SIZE_TEST,  shuffle=False)
    else:
        if b_STANDARD_PARAMS:
            train_tens_loader = DataLoader(tens_train_ds, batch_size=BATCH_SIZE_TRAIN, shuffle=True) 
        else: 
            train_tens_loader = DataLoader(tens_train_ds, batch_size=BATCH_SIZE_TRAIN, shuffle = True,
                                      num_workers=NUM_WORKERS, pin_memory=b_PIN_MEMORY, 
                                      persistent_workers=b_PERSISTENT_WORKERS  )
            #test_loader  = DataLoader(tens_train_ds,  batch_size=BATCH_SIZE_TEST,  shuffle=False)

if b_PRELOAD:
    # We absolutely need standard parameters !!!
    train_tens_loader = DataLoader(tens_train_ds, batch_size=BATCH_SIZE_TRAIN, shuffle=True)

The first simple variant “DataLoader(train_data, batch_size=BATCH_SIZE_TRAIN, shuffle=True)” works with standard parameters and performs a shuffling of the data during training. There are, however, more available parameters; see here. I will come back to some of them when we activate the commented lines.

The complicated conditions regarding a dataloader for fully prepared tensor data will not be used in this post, but become central in the next one. Note, however, that a dataloader for tensors preloaded to the GPU requires standard parameters! Things like memory pinning and multiple workers would not work in this case.

Statements for separate dataloaders of test data are not used in this post. Note, however, that shuffling would not be required for the test data.

What is a dataloader?
I just quote from the PyTorch documentation: “It represents a Python iterable over a dataset”.

How can we use a dataloader?
This is easy, too: We just have to deal with the iterable. We use the enumerate function for this purpose. See the following simple example:


# Example for using the dataloader 
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
b_run_loader = True 
if b_run_loader and not b_PRELOAD: 
    if not b_TENSOR_DATASET:
        d_loader = train_dataloader
    else: 
        d_loader = train_tens_loader
        print("using tensor dataset")
        
    for batch_i, (X,y) in enumerate(d_loader):
        mean = X.mean()
        stddev = X.std() 
        # Print mean/stddev statistics every NP_Batches batches 
        if batch_i % NP_Batches == 0: 
            print("batch = ", "{:4.0f}".format(batch_i), 
                  " :: mean : ", "{:10.5f}".format(mean),
                  " :: stddev : ", "{:10.5f}".format(stddev)
                 ) 

------------------------------------------------------
batch =     0  :: mean :    -0.00167  :: stddev :     1.00048
batch =    50  :: mean :    -0.00630  :: stddev :     0.99772
batch =   100  :: mean :    -0.00196  :: stddev :     1.00246
batch =   150  :: mean :    -0.02309  :: stddev :     0.97656
batch =   200  :: mean :    -0.00836  :: stddev :     0.99679
batch =   250  :: mean :    -0.00524  :: stddev :     0.99781
batch =   300  :: mean :     0.00660  :: stddev :     1.01225
batch =   350  :: mean :    -0.00551  :: stddev :     0.99595
batch =   400  :: mean :    -0.00455  :: stddev :     0.99664
batch =   450  :: mean :     0.01409  :: stddev :     1.01712

The result data shown were produced for the standard dataset based on the non-normalized original MNIST data, i.e. we used the “train_dataloader” object, which in turn invokes the “train_data” dataset object.

Note that the iteration is based on batches. The size of the batches was defined via the parameter “batch_size” when we instantiated the dataloader object.
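You can verify the batch structure directly by drawing a single batch from the iterable; for the settings above (BATCH_SIZE_TRAIN = 32 and MNIST images) the expected shapes are:

X, y = next(iter(train_dataloader))
print(X.shape, y.shape)   # torch.Size([32, 1, 28, 28]) torch.Size([32])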

By looking closely at the data we learn that the defined transformation operations for our basic “train_dataloader/train_data” – combination indeed have been performed:

After a standardization of the data, the mean-value of a batch and its standard deviation should be close to 0 and 1, respectively. This is obviously the case.

As we have learned in the last post, the function which triggers the transformations is the __getitem__() function of the dataset class. This function is called during the iteration over the dataset; it provides each element (a tuple) of the batches. (In the example above we did not use the labels [y].)

Cooperation of the dataloader with a NN-model

Let us assume that we have defined a NN-model, an optimizer (e.g. RMSprop) and a loss function (e.g. nn.NLLLoss). My simple test-model defined by a class “ConvNet()” can be described as follows:


from torchsummary import summary
class ConvNet(nn.Module):
    def __init__(self):
        super(ConvNet, self).__init__()
        self.cn1 = nn.Conv2d(1,32,3,1,padding=1); self.cn2 = nn.Conv2d(32,64,3,2,padding=1)
        self.cn3 = nn.Conv2d(64,128,3,2, padding=1); self.cn4 = nn.Conv2d(128,256,3,2)
        self.fl1 = nn.Flatten()
        self.fc1 = nn.Linear(2304, 64); self.fc2 = nn.Linear(64, 10)

    def forward(self,x):
        x=self.cn1(x); x=F.leaky_relu(x)
        x=self.cn2(x); x=F.leaky_relu(x)
        x=self.cn3(x); x=F.leaky_relu(x)
        x=self.cn4(x); x=F.leaky_relu(x)
        x=self.fl1(x); x=self.fc1(x); x=F.leaky_relu(x)
        x=self.fc2(x) # Here no RELU
        # Output via softmax 
        mod_out=F.log_softmax(x, dim=1)
        return mod_out

model = ConvNet()
# print the summary of the model
summary(model, input_size=(1, 28, 28), batch_size=-1)

# Load the model to the cuda device
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 
model.to(device)

# Loss and optimizer
# ~~~~~~~~~~~~~~~~~~~~ 
loss_fn = nn.NLLLoss() 
optimizer=optim.RMSprop(model.parameters(), lr=LR)

--------------------------------------------------------
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1           [-1, 32, 28, 28]             320
            Conv2d-2           [-1, 64, 14, 14]          18,496
            Conv2d-3            [-1, 128, 7, 7]          73,856
            Conv2d-4            [-1, 256, 3, 3]         295,168
           Flatten-5                 [-1, 2304]               0
            Linear-6                   [-1, 64]         147,520
            Linear-7                   [-1, 10]             650
================================================================
Total params: 536,010
Trainable params: 536,010
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.37
Params size (MB): 2.04
Estimated Total Size (MB): 2.42
----------------------------------------------------------------

This model was assigned to a variable “model“.

Note: The model must be loaded to the GPU for using respective CUDA acceleration. This can be achieved by using “model.to(device)”.

Loss function and optimizer: You may well change the loss function and the optimizer according to your own preferences for MNIST.

Training loop: The interaction between the dataloader and the NN-model can be defined via a function train_loop() that controls the handling of a batch of data – in our case on the GPU:


def train_loop(model, device, train_dataloader, optim, loss_fn, NP_Batches, epoch, b_preload=False):
    size = len(train_dataloader.dataset)
    print("epoch = ", epoch) 
    # We iterate over all batches each with batch_size, but the last X may be shorter 
    for batch_i, (X,y) in enumerate(train_dataloader):
        if not b_preload: 
            X, y = X.to(device), y.to(device)
        
        # Standard steps for a batch
        # (Reset gradient, prediction, loss determination) 
        optim.zero_grad()
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        loss.backward()
        optim.step()     

        # Some output 
        # The current position is given by all batches so far + len(X) 
        if epoch%5 == 0 and batch_i % NP_Batches == 0: 
            loss, current = loss.item(), batch_i * BATCH_SIZE_TRAIN + len(X)
            print(f"loss: {loss:>7f} [{current:>5d} / {size:>5d}]")

Readers who have worked with a NN-model in PyTorch know these kinds of steps, of course.

Note, by the way, that we do not load the X,y-data to the GPU in cases where we have already preloaded the tensors.

Performing a training run

The following statements initiate a training run over the defined number of epochs.



# Set the model into train mode
model.train()

# Do we work with a tensor dataset (possibly preloaded)? 
if not b_TENSOR_DATASET: 
    loader = train_dataloader
else:
    print("load via TENSOR loader") 
    loader = train_tens_loader

# Perform the model training 
start_time = time.perf_counter()
for epoch in range(NUM_EPOCHS):
    #train_loop(model, device, train_dataloader, optimizer, loss_fn, NP_Batches, epoch) 
    train_loop(model, device, loader, optimizer, loss_fn, 
               NP_Batches, epoch, b_preload=b_PRELOAD) 
     
end_time = time.perf_counter()
cpu_time = end_time - start_time
print()
print(" Calc time: ", cpu_time) 

The somewhat convoluted conditions regarding the call of the right loader allow for a potential change to tensor datasets, respective dataloaders and a preload of data to the GPU. As said, we will, however, neither use tensor datasets nor a preloading in this post.

Training run with standard parameters of the dataloader and all transformation operations

During a training run I measure an average CPU load via Linux tools. I pick up an average GPU load via "watch -n1.0 nvidia-smi" on a console terminal.
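If you prefer to sample such load values from Python instead of external tools, a rough sketch could look like the following. It assumes that the packages psutil and pynvml are installed and would be run in a separate process or notebook while the training is going on; the sampling interval and duration are just illustrative:

import psutil     # CPU load sampling (assumed to be installed)
import pynvml     # NVML bindings for GPU load and power (assumed to be installed)

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

cpu_vals, gpu_vals, pwr_vals = [], [], []
for _ in range(60):                                    # sample for ~60 seconds
    cpu_vals.append(psutil.cpu_percent(interval=1.0))  # system-wide CPU load in %
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    gpu_vals.append(util.gpu)                          # GPU utilization in %
    pwr_vals.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.)  # Watts

print("Avg CPU %: ", sum(cpu_vals) / len(cpu_vals))
print("Avg GPU %: ", sum(gpu_vals) / len(gpu_vals))
print("Avg GPU W: ", sum(pwr_vals) / len(pwr_vals))
pynvml.nvmlShutdown()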

Background activity: I keep a PA server streaming music from a player active in the background – just to have some background load. Additionally, many open Firefox tabs create further background load. The real run is done in a Jupyterlab tab of Firefox. Due to these background activities the basic CPU load in my KDE environment (stretched over three screens, each with 2560×1440) varies between 2% and 9% ahead of the training run. The GPU load varies between 2% and 12% – depending on my interactions with application windows on the KDE desktop. These data mark a variation imposed on my measured run data. I have tried to gain average values over the full training period.

We start with a run for which we use a standard dataset and standard parameters of the dataloader. All defined transformation operations, i.e. ToTensor() and Normalize(), are performed for each dataset element. Note that this basically follows the prescriptions of a standard tutorial on PyTorch datasets.

The following table shows parameter settings and performance data for such a kind of training run. Note that the batch size has a value of

BATCH_SIZE_TRAIN = 32 .

This is not an unusual value.

| Parameter | Batch Size | Run time | Avg. CPU load | Avg. GPU load | Avg. GPU Watts |
|---|---|---|---|---|---|
| Standard, Shuffle | 32 | 486 secs | 17% | 13% | 36 |

A run over 40 epochs takes 486 secs – which is a disaster. The GPU and CPU obviously work far from their capacity limits.

epoch =  0
initial loss: 2.316442
....
epoch =  40
avg loss =  0.0005348916165530682
Calc time:  468.9557462849998

Model convergence is good, however. We started with an averaged loss of 2.35 to 2.5 over the very first batch of the first epoch. The final average loss of epoch 40 varies statistically between 0.0005 and 0.002. Of course, 40 epochs with a LR of 1.e-4 is much too much effort for MNIST, but this is not the topic here. Actually, we are running into massive overfitting after the first 15 epochs. But, we are interested in other things.

Changing parameters of the dataloader

Looking around on the Internet you may find the hint that you should set the parameter pin_memory to True. Let us try this. From now on we set our control parameter for using standard parameters to False:

b_STANDARD_PARAMS = False
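Expressed in the control parameters defined at the beginning of the notebook, this run would correspond to settings like the following (just a sketch; NUM_WORKERS simply keeps its default of 0 for the moment):

b_STANDARD_PARAMS    = False
b_PIN_MEMORY         = True
NUM_WORKERS          = 0      # still a single loader process for this run
b_PERSISTENT_WORKERS = False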

Further parameters are listed in the leftmost column of the following table.

| Parameter | Batch Size | Run time | Avg. CPU load | Avg. GPU load | Avg. GPU Watts |
|---|---|---|---|---|---|
| Shuffle, pin_memory=True | 32 | 475 secs | 17% | 13% | 36 |

epoch =  0
initial loss: 2.316442
....
epoch =  40
avg loss =  0.0007310810033231974
Calc time:  475.5468245379998

Well, that did not help on my system. The reason might be that the latest PyTorch versions determine automatically whether pinned memory is used. More important, however, is that this parameter mainly pays off when multiple workers are active.

So, let us in addition try different values of the parameter num_workers. It enables a kind of parallelization: instead of one dataloader process we can start multiple ones, which will use available CPU cores in parallel. While this may improve the performance of the data transformations, the data transfer to the GPU may still become a bottleneck. Here both the physical architecture of mainboard, CPU and GPU and the number of physically available transfer lanes to the GPU may play a role.
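A side remark: pinned (page-locked) host memory mainly pays off when the copy to the GPU is issued as an asynchronous, non-blocking transfer. Our train_loop() above uses a plain .to(device); a hedged sketch of how the copy could be made non-blocking with pin_memory=True would look like this (not part of the measured runs):

for batch_i, (X, y) in enumerate(train_dataloader):
    # With pin_memory=True the host-side batch tensors live in page-locked
    # memory, so the copy to the GPU can be issued asynchronously
    X = X.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    # ... forward / backward / optimizer steps as in train_loop() above ...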

| Parameter | Batch Size | Run time | Avg. CPU load | Avg. GPU load | Avg. GPU Watts |
|---|---|---|---|---|---|
| Shuffle, pin_memory=True, num_workers=1 | 32 | 445 secs | 24% | 14% | 36 |
| Shuffle, pin_memory=True, num_workers=2 | 32 | 266 secs | 42% | 22% | 42 |
| Shuffle, pin_memory=True, num_workers=3 | 32 | 220 secs | 52% | 27% | 48 |
| Shuffle, pin_memory=True, num_workers=4 | 32 | 218 secs | 52% | 27% | 48 |
| Shuffle, pin_memory=True, num_workers=4, persistent_workers=True | 32 | 215 secs | 52% | 27% | 48 |
| Shuffle, pin_memory=True, num_workers=6, persistent_workers=True | 32 | 212 secs | 53% | 27% | 48 |

Note that I have allowed for 7 threads to run in the startup script for Jupyterlab (export OPENBLAS_NUM_THREADS=7; export OMP_NUM_THREADS=7).

With one worker there was only a very small improvement. This had to be expected. But up to num_workers = 4 we see substantial improvements. For more workers, however, we see no further major gain; the small remaining improvements may even lie within the statistical variation.

What could be reasons for this finding?

  1. Well, I have just 4 real CPU cores and due to hyperthreading 8 CPU threads. The hyperthreads may not come with much of an advantage in our situation.
  2. Another point is the number of tensors to be transferred. For a small batch size [BS] like BS=32 there are so many batch tensors to transfer to the GPU that the loader workers may have to wait for one another to finish. The data below for larger batch sizes actually support this point.

Regarding point 2, the latency and performance of the RAM may also play a role. Anyway, further workers appear to be pretty useless under my present conditions – and the GPU load won't go up. We obviously have to change other parameters in addition.

The impact of the Batch Size

Let us change the batch size to BATCH_SIZE_TRAIN=64 and further up to BATCH_SIZE_TRAIN=256. Note that with such a change we reduce the number of batch tensors which have to be transferred to the GPU.
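Just to quantify this: a quick calculation shows how the number of batches (and thus batch transfers to the GPU) per epoch shrinks with a growing batch size:

import math
for bs in (32, 64, 128, 256):
    print(bs, "=>", math.ceil(60000 / bs), "batches per epoch")
# 32 => 1875, 64 => 938, 128 => 469, 256 => 235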

| Parameter | Batch Size | Run time | Avg. CPU load | Avg. GPU load | Avg. GPU Watts |
|---|---|---|---|---|---|
| Shuffle, pin_memory=True, num_workers=3, persistent_workers=True | 64 | 160 secs | 55% | 28% | 50 |
| Shuffle, pin_memory=True, num_workers=4, persistent_workers=True | 64 | 129 secs | 73% | 34% | 53 |
| Shuffle, pin_memory=True, num_workers=6, persistent_workers=True | 64 | 120 secs | 74% | 36% | 57 |
| Shuffle, pin_memory=True, num_workers=4, persistent_workers=True | 128 | 116 secs | 64% | 33% | 53 |
| Shuffle, pin_memory=True, num_workers=6, persistent_workers=True | 128 | 87 secs | 90% | 44% | 72 |
| Shuffle, pin_memory=True, num_workers=4, persistent_workers=True | 256 | 107 secs | 62% | 34% | 62 |
| Shuffle, pin_memory=True, num_workers=6, persistent_workers=True | 256 | 81 secs | 88% | 45% | 76 |

In this case I used the best value of three runs for each parameter set. We see that the batch size plays an important role. By making the batch size larger, we raise the GPU load a bit.

Summary of data

The following graphics summarizes some of the runtime data – and their dependence on the Number of Workers [NW] and the Batch Size [BS].

(Figure: Runtime vs. number of dataloader workers and batch size)

The data groups from left to right belong to BS=32, 64, 128, 256, respectively. Note the plateau for a small batch size of 32 and a number of workers NW ≥ 4. Such a distinctive plateau is not seen for larger batch sizes. So, for larger batch sizes BS ≥ 128, additional workers NW ≥ 6 actually do help! In my case however, we then reach a CPU load close to 90%.

We see that we get relatively optimal data for 6 workers and a batch size of 128. Going to a larger batch size does not come with a substantial improvement. I think this is something typical for the Nvidia RTX 4060 Ti; I have seen it for other datasets, too, when I worked with TensorFlow 2.

Let us face it: We have approached a kind of limit – and my GPU still does not work at its capacity level for the MNIST case. Its average load got stuck at only 45%.

Conclusion

In this post we have learned that the transformation operations ToTensor() and Normalize(), which we define as a minimum for image data, really do cost a noticeable amount of CPU time when we use a PyTorch dataloader object to manipulate the data. All transformation operations are based on internal steps creating PIL data (from tensor data!), manipulating them and going back to usable tensor data. In addition these operations are superfluously repeated during each epoch. Nevertheless, this is the approach described in standard PyTorch tutorials on the topic.

The result for small datasets, small batch sizes and small NN models is a mess. The GPU waits for data and works far below its limits.

The sum of the transform processes (plus the transfer to the GPU?) obviously can become a major bottleneck on systems with a relatively modern consumer GPU. On many systems a mismatch between CPU and GPU capabilities might therefore hinder a PyTorch beginner from working efficiently with the usual learning examples such as MNIST/FashionMNIST and relatively simple NN-models.

However, we still have to investigate how big the impact of the pure transfer of our eventual tensors to the GPU is. So far, we have only seen the combined effect of CPU-processing and tensor transfer. In the next post we will find that shuffling and the pure tensor transfer organized by a Torch dataloader are indeed part of the causes of our problem.

Editorial hint (correction): I corrected the last statement on the 18th of March 2025, after some new tests.

We saw at least that some parameters controlling a PyTorch dataloader object may help to improve the overall performance. In particular the number of workers and the batch size are decisive. We got down from around 480 secs to roughly 80 secs for the MNIST example, a simple CNN and 40 epochs. This is already a factor of 6. However, we still experienced a big gap between CPU/RAM and GPU capabilities.

In the next post we will see that working with prepared and preloaded tensor data will give us another acceleration factor of 2 to 2.4 (on a 4060 TI). But such an approach may not help in cases when statistical image augmentation is required during training.

A preliminary lesson is: Do what you can regarding data transformations ahead of a training run – and try to delegate as many augmentation operations as possible to the GPU.
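As a hedged preview of that idea (just a sketch under the assumptions of this post, not the full approach of the next one): a basic operation like the normalization could be applied to a whole batch after it has been moved to the GPU, e.g.:

# Sketch: normalize a raw uint8 batch of shape [B, 1, 28, 28] on the GPU 
# instead of the CPU; mean/std are the values determined earlier in this post 
mean = torch.tensor(0.1302, device=device)
std  = torch.tensor(0.3069, device=device)

def normalize_on_gpu(raw_batch):
    x = raw_batch.to(device, non_blocking=True)   # move the raw batch to the GPU
    x = x.to(torch.float32) / 256.                # same scaling as used above
    return (x - mean) / std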

Stay tuned …