Not all tasks in Machine Learning [ML] require big LLMs or LLM-based interfaces. Actually, many interesting ML tasks can be solved with neural networks [NNs] that fit well into the VRAM of a modern GPU or TPU affordable even for private individuals. This statement holds in particular for image processing. There are groups of people who either learn to work with ML on their own Linux systems or prepare developments for customers in reduced form on such systems. Questions of performance, of protecting the GPU against damage from permanent maximum load and, last but not least, of saving energy during the training of relatively small NN-models are therefore interesting not only for SMEs, but also for the named groups of private people working on ML solutions.
Regarding performance during the training of small NN-models, we have seen in the first post of this mini-series that (1) PyTorch excels with full float32 precision and that (2) Keras3/TF2 [with TF2 meaning Tensorflow 2] excels with "mixed precision". "Mixed precision" with Keras3/TF2 turned out to be a big time and energy saver.
Now, you may be curious what Keras3 with the Torch backend has to offer with regard to performance and energy saving. This is the topic of this post. Unfortunately, it is going to be a rather sobering experience: As we will see, the performance in parts drops drastically below that of either a pure PyTorch approach or a Keras3/TF2 combination. Certain options which helped us before, e.g. the jit-compilation of a model, can (presently?) not even be used with the Keras3/Torch combination.
In this post I first discuss which sorts of data types can be combined with an NN-model built with the help of Keras3 tools. We will also discuss whether we can use a Keras-built model in a classical PyTorch training loop. Afterwards, I present the results of test runs for a small CNN and MNIST data on my Nvidia 4060 TI.
Why would we use Keras at all to work with neural networks [NNs]?
From a user perspective the main reason is the relatively simple and safe way of setting up (complicated) NN-models and/or using available (pre-trained) standard networks. To get them running you just have to follow three steps: (1) define the model or pick an available one, (2) "compile" the model with a loss function and an optimizer, (3) train or refine it via model.fit(), or predict results for new input via model.predict().
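As a minimal sketch of these three steps (the small CNN and the layer sizes below are illustrative, not the exact test model of this series):

```python
import keras
from keras import layers

# (1) Define the model, here via the functional interface
inputs = keras.Input(shape=(28, 28, 1))
x = layers.Conv2D(32, 3, activation="relu")(inputs)
x = layers.MaxPooling2D()(x)
x = layers.Flatten()(x)
outputs = layers.Dense(10, activation="softmax")(x)
model = keras.Model(inputs, outputs)

# (2) Compile with a loss function and an optimizer
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# (3) Train via model.fit(), or predict via model.predict()
# model.fit(x_train, y_train, batch_size=256, epochs=10)
# preds = model.predict(x_new)
```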
Remark on flexibility: Keras has often been criticized for not providing enough flexibility and dynamic graph changes on the fly. This is only partially true, as some flexibility in setting up and modifying models at runtime can be gained after a smooth learning curve and by referring to Tensorflow interfaces for model creation. See e.g. here and here for flexible input shapes. Even pre-trained models can get different input shapes (within limits). What appears static can be made relatively flexible and dynamic with some preparation and forethought in an eager execution environment such as TF2.
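As one hedged example of such flexibility (a sketch via the functional interface; the layer choices are illustrative): height and width of the input can be left undetermined, so the same CNN accepts varying image sizes.

```python
import keras
from keras import layers

inputs = keras.Input(shape=(None, None, 3))   # flexible H, W; fixed color depth
x = layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
x = layers.GlobalAveragePooling2D()(x)        # removes the variable spatial dims
outputs = layers.Dense(10, activation="softmax")(x)
model = keras.Model(inputs, outputs)
```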
Aside from dynamics and flexibility aspects, the basic model setup is relatively easy and safe, in particular with the so-called "functional interface". Therefore, when combining Keras with Torch,
- model construction with Keras (in a more or less dynamic way)
- and a combination with surrounding Torch elements for data handling and/or the training loop
would be the main objectives.
Torch backend integration with Keras3 – options and some deficits (?)
A short series of tests shows: The flexibility of combining models created with Keras3 with PyTorch elements is quite convincing. You can, for example, set up a neural network via the Keras methods and then
- combine the Keras-based NN-model with a Torch tensor-dataset, but use the Keras optimizers, loss functions and the standard model.compile() plus model.fit() methods,
- feed Torch tensors directly into the Keras-based NN-model via model.fit(),
- combine the Keras-based NN-model with a Torch tensor-dataset, Torch optimizers and loss functions, and put all of it into a classical Torch training loop. But see the hint below!
Important hint regarding the application of a Torch loss function:
The torch.nn.CrossEntropyLoss function differs from Keras' CategoricalCrossentropy function! The latter requires the output (in array form) to be scaled by a softmax activation, while the Torch-based loss does not. This has to be taken care of during model creation. In addition, the order of the arguments provided to the loss function differs: the Torch loss takes (prediction, target), the Keras loss (target, prediction)!
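A minimal sketch of both points, assuming the environment variable KERAS_BACKEND is set to "torch" and a 10-class problem; the tensor names are illustrative:

```python
import os
os.environ["KERAS_BACKEND"] = "torch"    # must be set before importing keras
import torch
import keras

logits = torch.randn(256, 10)            # raw network output, no softmax
targets = torch.randint(0, 10, (256,))   # integer class labels
one_hot = torch.nn.functional.one_hot(targets, 10).float()

# Torch loss: takes raw logits, argument order (prediction, target)
l_torch = torch.nn.CrossEntropyLoss()(logits, targets)

# Keras loss: takes softmax-scaled output (and one-hot targets for the
# categorical variant), argument order (target, prediction)
probs = torch.softmax(logits, dim=1)
l_keras = keras.losses.CategoricalCrossentropy()(one_hot, probs)
```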
However, some other things which I used regularly with Keras/TF2 do not seem to work. Unfortunately, they are relevant for performance:
- Parameter steps_per_execution = 1, only: You cannot use the full spectrum of the model.compile() options in Keras3 for the Torch backend. Most importantly, you cannot assign the "steps_per_execution" parameter any value other than 1. Note that this parameter had a huge impact on performance in case of the Tensorflow 2 [TF2] backend. Obviously, the Keras developers have not yet found a simple method to translate this into the PyTorch world. This problem alone lets us expect a drop in performance.
- jit-compilation impossible: You may get runtime errors – even for my simple test CNN – when you try to activate the "jit_compile" option of Keras' model.compile() functionality with the Torch backend. For my simple CNN I could not get rid of this error.
- No use of a Torch optimizer with model.compile(): You cannot use Torch optimizers (from the torch.optim module) together with Keras' model.compile() and model.fit().
- No Keras optimizer in a Torch training loop: You cannot use a Keras optimizer in a Torch training loop. Even using a Keras loss function in a Torch loop requires some preparatory steps, e.g. a reversal of the arguments to the function and an application of softmax in the last model layer.
- No direct use of a Torch dataset: You cannot directly pass a Torch dataset to the model.fit() function, only a Torch dataloader. See the sketch after this list.
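A minimal sketch of the last point, assuming `model` is a compiled Keras3 model (Torch backend) with a sparse categorical loss; the tensors are dummy MNIST-shaped data:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

x = torch.randn(60000, 28, 28, 1)      # dummy data in NBxHxWxC format
y = torch.randint(0, 10, (60000,))     # dummy integer class labels
ds = TensorDataset(x, y)
dl = DataLoader(ds, batch_size=256, shuffle=True)

# model.fit(ds, epochs=10)   # fails: a raw torch dataset is not accepted
model.fit(dl, epochs=10)     # works: batching and shuffling done by the loader
```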
I tried to cover the most important of the supported variants named above with test runs.
Restrictions affecting all test runs
- I was forced to set the model.compile() parameter "steps_per_execution" to 1.
- jit-compilation caused either an error (plain Keras 3) or a warning that it was set to false (with the old keras-core libs).
- Pre-loading of torch tensors to the GPU combined with num_workers > 0 led to an error.
- Tensor preparation was done such that the format was (NB x H x W x C), with NB being the batch dimension, H the height dimension of the images, W the width, and C the color depth (= number of color layers).
Remarks on the data format and the older keras-core modules:
I did test runs for an (NBxCxHxW) format, too, by switching the format expected by the Keras model via the function keras.config.set_image_data_format(). The results were practically the same as those presented below. I also did test runs with the old "keras-core" libraries. The results were worse than with the present keras modules and are not given below. For SW versions see the previous post.
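A short sketch of this format switch; the permute() call shows the matching Torch-side transformation of tensors prepared in the (NB x H x W x C) format:

```python
import torch
import keras

# Let the Keras model expect the (NB x C x H x W) format instead
keras.config.set_image_data_format("channels_first")

x_hwc = torch.randn(256, 28, 28, 1)               # tensors prepared as NB x H x W x C
x_chw = x_hwc.permute(0, 3, 1, 2).contiguous()    # reordered to NB x C x H x W
```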
Test runs A: Model, optimizer, loss, model.compile and model.fit from Keras3 libraries + Torch tensors
Test runs with data provision via a torch (tensor) dataset/dataloader – with and without “mixed precision”
The runs below summarize results created with a Keras model and the standard model.compile() and model.fit() functions. The data, however, were provided to model.fit() by a torch dataloader. This dataloader was built upon a torch tensor-dataset whose tensors had the right dimensions (NBxHxWxC) for the Keras model. The creation of batches and the shuffling of the data were done by the Torch dataloader. The batch size was chosen to be BS=256, as in the comparable test runs discussed in the previous post for plain PyTorch and plain Keras3/TF2 solutions. In the tables of this post, NW = num_workers of the dataloader, SPE = steps_per_execution, and EC GPU = the energy consumption of the GPU.
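Before the numbers, a sketch of the setup under stated assumptions: `model` is a Keras3 model (Torch backend) as sketched earlier, the data tensors are dummies; num_workers, persistent_workers and pin_memory correspond to the NW configurations in the table below.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

x = torch.randn(60000, 28, 28, 1)       # dummy data in NBxHxWxC format
y = torch.randint(0, 10, (60000,))
dl = DataLoader(TensorDataset(x, y), batch_size=256, shuffle=True,
                num_workers=1, persistent_workers=True, pin_memory=True)

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              steps_per_execution=1)     # the only value accepted with the Torch backend
model.fit(dl, epochs=10)
```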
| Config | Input data format | NW | SPE | time async [sec] | time preload [sec] | GPU load [%] | EC GPU [Watt] | Remarks |
|---|---|---|---|---|---|---|---|---|
| Torch tensor-dataset, standard params | NBxHxWxC | – | – | 87.9 | 87.2 | 54-58 | 80-84 | CPU load varies: 18-24% |
| Torch tensor-dataset, NW=0, non-persistent workers, pinned memory | NBxHxWxC | 0 | – | 92.0 | – | 50-56 | 78-82 | CPU load varies: 55-63% |
| Torch tensor-dataset, NW=1, persistent_workers, pinned memory | NBxHxWxC | 1 | – | 83.8 | – | 55-61 | 84-87 | CPU load varies: 25-31% |
| Torch tensor-dataset, NW=4, persistent_workers, pinned memory | NBxHxWxC | 4 | – | 83.8 | – | 55-61 | 82-87 | CPU load varies: 25-31% |
| Torch tensor-dataset, NW=6, persistent_workers, pinned memory | NBxHxWxC | 6 | – | 83.6 | – | 55-61 | 82-87 | CPU load varies: 25-31% |
| Torch tensor-dataset, standard params, mixed precision | NBxHxWxC | – | – | 93.4 | 92.5 | 23-31 | 42-46 | CPU load varies: 18-24% |
| Torch tensor-dataset, NW=6, persistent_workers, pinned memory, mixed precision | NBxHxWxC | 6 | – | 92.9 | – | 27-30 | 43-47 | CPU load varies: 25-31% |
These numbers are all significantly worse than the numbers we got in the previous post, both for a pure PyTorch and for a pure Keras3/TF2 approach.
The best results regarding turnaround time could be achieved with
- a tensor-dataset of prepared data tensors not preloaded to the GPU,
- just one activated worker process for the transfer of the data to the GPU.
This is interesting, as we had a different experience in the previous post, where we got the best results for num_workers=0 and preloaded tensors. The corresponding standard settings, however, gave worse results here. In addition: As soon as a worker process was activated via num_workers ≥ 1, pre-loading caused an error.
Surprisingly, the differences are particularly bad for "mixed precision". Regarding the overall turnaround time, mixed precision seems to be a rather bad idea for this combination of Keras3 with a Torch backend. However, we do get a lower GPU load and save some energy with mixed precision.
But also without "mixed precision" the turnaround times of Keras/Torch with tensor datasets/dataloaders are worse than those of a pure PyTorch approach (without jit-compilation) by a factor bigger than 2.4. What a shock!
Test runs with provision of Torch tensors directly to model.fit()
So, feeding a Keras3-based NN-model with data from a Torch dataloader was disappointing in comparison to both plain Keras3/TF2 and plain PyTorch solutions. What about delivering Torch tensors directly to the NN-model via the model.fit() interface?
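A sketch of the two variants, again assuming `model` is the compiled Keras3 model and a CUDA device is available; batching and shuffling are then handled by model.fit() itself:

```python
import torch

x = torch.randn(60000, 28, 28, 1)     # dummy tensors in NBxHxWxC format
y = torch.randint(0, 10, (60000,))

# Variant 1: tensors stay on the CPU, transfer happens during training
model.fit(x, y, batch_size=256, shuffle=True, epochs=10)

# Variant 2: tensors preloaded to the GPU's VRAM
model.fit(x.to("cuda"), y.to("cuda"), batch_size=256, shuffle=True, epochs=10)
```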
| Config | Input data format | NW | SPE | time async [sec] | time preload [sec] | GPU load [%] | EC GPU [Watt] | Remarks |
|---|---|---|---|---|---|---|---|---|
| Torch tensors to model.fit(), batches/shuffling via model.fit(), tensors not preloaded to GPU | NBxHxWxC | – | – | 77.6 | – | 59-65 | 88-93 | CPU load varies: 56-67% |
| Torch tensors to model.fit(), batches/shuffling via model.fit(), tensors preloaded to GPU | NBxHxWxC | – | – | – | 73.4 | 59-65 | 88-93 | CPU load varies: 19-24% |
| Torch tensors to model.fit(), batches/shuffling via model.fit(), tensors not preloaded, mixed precision | NBxHxWxC | – | – | 85.3 | – | 27-33 | 46-49 | CPU load varies: 56-64% |
| Torch tensors to model.fit(), batches/shuffling via model.fit(), tensors preloaded, mixed precision | NBxHxWxC | – | – | – | 80.1 | 27-32 | 47-51 | CPU load varies: 18-23% |
We get a noteworthy improvement! Still, the results are worse than the data of comparable runs with plain PyTorch or plain Keras3/TF2. At least we get close to the results of pure Keras3/TF2 runs with TF datasets or Numpy arrays.
Note the high CPU load for tensors not preloaded to the GPU. From looking at the load of the individual CPU cores/threads, I got the impression that the backend in this case uses the maximum number of allowed threads to move data to the GPU.
Test runs B: Model from Keras3 libraries + Torch tensors, Torch loss/optimizer and Torch training loop
An explanation for the rather bad performance seen above would refer to the enforced setting steps_per_execution=1 and the lack of jit-compilation. That the mixed precision runs only reduce the energy consumption, but do not improve the turnaround times, may also depend on these factors (in particular on the missing jit-compilation) and, in addition, on required format transformations.
Regarding test runs with Torch tensor-datasets/dataloaders, a Keras-based model and a Torch training loop, we may expect a better transfer of data to the GPU – but not a real breakthrough for mixed precision; compare with the results in the previous post.
Note that we can enforce "mixed precision" via settings for Keras and/or by using the torch.autocast function in the PyTorch training loop. This explains the multiple respective lines in the following table.
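Before we get to the numbers, here is a sketch of such a training loop under stated assumptions: `model` is a Keras3-built network without a final softmax (as required by torch.nn.CrossEntropyLoss), `dl` is a Torch dataloader as before; with the Torch backend a Keras model behaves like a torch.nn.Module, so model.parameters() works. The autocast/GradScaler lines realize "mixed precision via Torch".

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()     # loss scaling for float16 stability

model.to("cuda")
for epoch in range(10):
    for xb, yb in dl:
        xb, yb = xb.to("cuda"), yb.to("cuda")
        optimizer.zero_grad()
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            preds = model(xb)
            loss = loss_fn(preds, yb)    # Torch argument order: (prediction, target)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
```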
| Config | Input data format | NW | SPE | time async [sec] | time preload [sec] | GPU load [%] | EC GPU [Watt] | Remarks |
|---|---|---|---|---|---|---|---|---|
| Torch tensor-dataset, standard params, tensors not preloaded to GPU | NBxHxWxC | – | 1 | 51.4 | 52.6 | 78-92 | 112-121 | CPU load varies: 16-25% |
| Torch tensor-dataset, standard params, tensors not preloaded, mixed precision via Keras | NBxHxWxC | – | 1 | 57.1 | – | 33-37 | 51-53 | CPU load varies: 18-27% |
| Torch tensor-dataset, standard params, tensors not preloaded, mixed precision via Torch | NBxHxWxC | – | 1 | 55.8 | – | 57-63 | 76-78 | CPU load varies: 18-30% |
| Torch tensor-dataset, standard params, tensors not preloaded, mixed precision via Keras and Torch | NBxHxWxC | – | 1 | 58.4 | – | 31-38 | 52-54 | CPU load varies: 18-26% |
| Torch tensor-dataset, NW=1, persistent_workers, pinned memory, tensors not preloaded | NBxHxWxC | 1 | 1 | 46.2 | – | 91-100 | 122-132 | CPU load varies: 27-33% |
| Torch tensor-dataset, NW=6, persistent_workers, pinned memory, tensors not preloaded | NBxHxWxC | 6 | 1 | 46.8 | – | 92-100 | 124-130 | CPU load varies: 25-41% |
| Torch tensor-dataset, NW=1, persistent_workers, pinned memory, mixed precision via Keras | NBxHxWxC | 1 | 1 | 52.1 | – | 34-41 | 56-62 | CPU load varies: 24-33% |
| Torch tensor-dataset, NW=1, persistent_workers, pinned memory, mixed precision via Torch | NBxHxWxC | 1 | 1 | 50.8 | – | 62-72 | 76-81 | CPU load varies: 27-35% |
Once again we see that for the Keras3/Torch combination and Torch dataloaders it is best not to use the standard parameters – including num_workers=0 – when defining a Torch dataloader, but to use num_workers=1. Also, the Torch tensors should not be preloaded to the GPU's VRAM.
"Mixed precision" via Keras settings helps again with energy consumption and GPU load. However, it does not improve the total turnaround time compared to full precision runs, but makes it worse.
Regarding absolute values of the turnaround time, we now got much better values than for a pure Keras3/TF2 run with steps_per_execution=1, tf.data tensor-datasets, no jit-compilation and no mixed precision. We even get close to the optimal values for pure Keras3/TF2 without mixed precision.
However, with the best turnaround time of 46.2 secs we are still around 15 secs away from the best value of a pure PyTorch approach, which reached 35 to 38 secs.
Conclusion
We get a somewhat depressing performance result for small NN-models and the Keras3/Torch combination:
If we only cared about turnaround times, a Keras3/Torch combination offers no performance advantages in comparison to either a pure PyTorch or a pure Keras3/TF2 approach. The results are worse than for either of the pure approaches. And for small NN-models the advantage of setting them up with Keras is not so big that one would absolutely need Keras for a simpler programming logic.
A real advantage of using a Keras model, however, shows up with respect to "mixed precision": When we choose mixed precision settings for Keras [via "mixed_precision.set_global_policy"], both GPU load and energy consumption are reduced much more for Keras models than by using the torch.autocast function in the PyTorch training loop.
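For reference, the two routes in code (both calls are standard APIs of the respective libraries):

```python
import keras
from keras import mixed_precision

# Keras route: set the global policy before creating the model
mixed_precision.set_global_policy("mixed_float16")

# Torch route: wrap the forward pass in the training loop, e.g.
# with torch.autocast(device_type="cuda", dtype=torch.float16):
#     preds = model(xb)
```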
But: Taking the results of the first post into account, I would stick to a pure PyTorch approach in case your model runs into trouble with "mixed precision". If your models behave well with "mixed precision", however, then the real big saver of turnaround time and energy consumption is the Keras3/TF2 combination with "mixed precision".
The present situation may change dramatically as soon as the Keras developers manage to allow setting the performance-relevant parameter "steps_per_execution" to values close to the batch size in the model.compile() function, and at the same time make jit-compilation error-free.