Machine Learning on a Linux system is no fun without a GPU and its parallel processing capabilities. On a system with an Nvidia card you need the basic Nvidia drivers plus additional libraries for optimal support of Deep Neural Networks and Linear Algebra operations on the GPU sub-processors. E.g., Keras and Tensorflow 2 [TF2] use the CUDA and cuDNN libraries on your Nvidia GPU. Basic information can be found here:
This means that you must not only perform an installation of (proprietary) Nvidia drivers, but also of CUDA and cuDNN on your Linux system. As I have started to work with ResNet-110v2 and ResNet-164v2 variants lately I was interested whether I could get a combination of
TF 2.15 with Keras
the latest of Nvidia GPU drivers 545.29.06 – but see the Addendum at the end of the post and the warnings therein.
the latest CUDA-toolkit version 12.3
and cuDNN version 8.9.7
to work on an Opensuse Leap 15.5 system. This experiment ended successfully, although the present compatibility matrices on the Nvidia web pages do not yet include the named combination. While system wide installations of the CUDA-toolkit and cuDNN are no major problems, some additional settings of environment variables are required to make the libraries available for Python notebooks in Jupyterlab or classic Jupyter Notebooks (i.e. IPython based environments). These settings are not self-evident.
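A minimal sketch of how one could check such a setting from Python before starting a notebook server. Note the assumptions: the variable name LD_LIBRARY_PATH and the directory /usr/local/cuda/lib64 are typical for a default system-wide toolkit installation, but they are not taken from this post, and the helper name missing_dirs is purely illustrative. Adapt both to your own installation.

```python
import os


def missing_dirs(required, env_value):
    """Return the entries of `required` that are absent from a ':'-separated path list."""
    present = {os.path.normpath(p) for p in env_value.split(":") if p}
    return [d for d in required if os.path.normpath(d) not in present]


# Assumed default location of the CUDA libraries; adjust to your system.
ASSUMED_CUDA_DIRS = ["/usr/local/cuda/lib64"]

current = os.environ.get("LD_LIBRARY_PATH", "")
for d in missing_dirs(ASSUMED_CUDA_DIRS, current):
    print(f"not in LD_LIBRARY_PATH: {d}")
```

The check only compares path strings. It does not verify that the libraries in the directory can actually be loaded.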
This post summarizes the most important steps of a standard system-wide installation of CUDA and cuDNN on an Opensuse Leap 15.5 system. I do not install TensorRT in this post. As long as you do not work with (pre-trained) LLMs you do not really need TensorRT.
Level of this post: Active ML user – advanced. You should know how RPM and tar-based installations work on a Leap system. You should also have a working Python3 installation (in a virtual environment) and a Jupyter Notebook or (better) a Jupyterlab-installation on your system to be able to perform ML-tests based on Keras. I do not discuss a Jupyter and Python installation in this post.
Limitations and requirements
GPU capabilities: You need a fairly new Nvidia graphics card to make optimal use of the latest CUDA features. In my case I tested with an Nvidia 4060 TI. Normally the drivers and libraries should detect the capabilities of older cards and adapt to them. But I have not tested with older graphics cards.
Disk space: CUDA and cuDNN require a substantial amount of disk space (almost 7 GiB) when you install the full CUDA-toolkit as it is recommended by NVIDIA.
Remark regarding warnings: Installing CUDA 12.3 and using it with Tensorflow 2.15 will presently (Jan. 2024) lead to warnings in your Python 3 notebooks. However, in my experience these warnings have no impact on the performance. My 4060 TI did its job in test calculations with convolutional Autoencoders and ResNets as expected. For ResNets it was even about 5% faster than with CUDA 11.2.
Alternative installation methods: You may find information about purely Python-based installations, including CUDA, via pip. See e.g. here: https://blog.tensorflow.org/2023/11/whats-new-in-tensorflow-2-15.html. While this potentially makes local user-specific installations easier, the disadvantage for multiple virtual Python environments is the resulting consumption of disk space. So, I still prefer a system-wide installation. It also seems that one should not mix the two ways of installation, system-wide and virtual-environment specific. For example, I tried to install TensorRT via pip after a system-wide standard CUDA installation. The latter itself had worked. But after the additional TensorRT installation with pip my GPU could no longer be used by Keras/TF2-based ML code started from Jupyterlab notebooks.
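Whether Keras/TF2 can still use the GPU after an installation step can be checked quickly from a notebook cell. The following is a small sketch, not a complete diagnosis: it relies on tf.config.list_physical_devices() from TensorFlow 2 and returns None when TensorFlow cannot be imported at all. The function name visible_gpus is my own choice.

```python
def visible_gpus():
    """Return the names of GPUs TensorFlow can see, or None if TensorFlow is not importable."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return [dev.name for dev in tf.config.list_physical_devices("GPU")]


gpus = visible_gpus()
if gpus is None:
    print("TensorFlow is not installed in this environment")
elif not gpus:
    print("TensorFlow found no GPU; check the driver, CUDA and cuDNN libraries")
else:
    print("GPUs visible to TensorFlow:", gpus)
```

An empty list after a previously working setup is a hint that a later installation step has changed which libraries are found.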
Installation of basic Nvidia drivers
The Nvidia graphics card must already be supported for regular X or Wayland services on a Linux system. CUDA and cuDNN come on top.
Note: You need a fairly new Nvidia driver for CUDA 12.3! To get the latest drivers for an Opensuse system I install the proprietary Nvidia drivers from Nvidia's repository for Opensuse:
Nvidia Repository address for Leap 15.5: https://download.nvidia.com/opensuse/leap/15.5
Note that presently YaST2 has a bug (see here). You may need to use zypper on the command line to add this repository to your package manager. See the man pages for zypper for the right syntax. In the end you should see the Nvidia repository in YaST2:
This post provides some preliminary impressions and measured performance factors comparing the RTX 4060 TI to a previously used Nvidia Geforce 960 GTX. The factors were derived from training runs of Convolutional Neural Networks and Autoencoders used for object classification on images and generative tasks. These ML-runs were also used to measure the maximum temperature level and subjectively compare the fan noise of the 4060 TI vs. the GTX 960. So, the difference to what you find on other sites comparing GPUs is that I focused on ML-related tests and not on video games or game specific benchmarks.
A compromise regarding the value for money – some theoretical values for 40XX-cards
Nvidia, in my opinion, exploits its monopoly regarding ML-capable cards maximally; so ML-capable cards still are very expensive. Finding an affordable compromise requires comparing the specifications of GPU variants. The basic specification data for the RTX 4060 TI 16GB (and other variants of the Ada Lovelace architecture) can be found here. Some data can also be seen in the following picture:
A 4060 TI can not provide the GPU performance of a 4070, 4080 or 4090. In the following comparisons of a few specifications I leave out the RTX 4090 as the top model with a price tag above 1700 € presently in Germany. The 4090 is a card for ML-professionals or rich enthusiasts, but not for a normal private consumer as myself.
VRAM: The RTX 4060 TI is one of the 40XX-cards which provides 16 GB VRAM. The RTX 4080 also comes with 16 GB VRAM. So the RTX 4080 is a direct competitor of the RTX 4060 TI 16GB regarding value for money. Note that there is also a 4060 TI variant available which only provides 8 GB. The RTX 4070 TI has much in common with a 4080, but a lower amount of VRAM, namely 12 GB. All in all, Nvidia's product policy for the RTX 4070 (TI) is a bit questionable as it does not really address the requirements of ML-people. But the lack of VRAM on the RTX 4070 (TI) was criticized by the gamer community, too.
Price: The price tag of the 4060 TI is presently (Oct. 2023) around or below 470 € in Germany. I.e. a RTX 4060 TI costs roughly 38% of what you have to pay for a RTX 4080 (1250 €) and roughly 53% of the price of a RTX 4070 TI (880 €). I took the prices from the German Amazon site.
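The price relations follow directly from the three quoted prices. A few lines of Python make the arithmetic reproducible; only the prices named above are used.

```python
# Prices in euros as quoted above (German Amazon site, Oct. 2023).
prices = {"RTX 4060 TI 16GB": 470, "RTX 4070 TI": 880, "RTX 4080": 1250}

base = prices["RTX 4060 TI 16GB"]
ratios = {
    name: base / price
    for name, price in prices.items()
    if name != "RTX 4060 TI 16GB"
}

for name, ratio in ratios.items():
    print(f"4060 TI price relative to {name}: {ratio:.1%}")
```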
TDP: A RTX 4080 can draw up to 320 Watt in power consumption, a RTX 4070 TI up to 285 Watt and a RTX 4090 up to 450 Watt. The RTX 4060 TI, in contrast, requests only up to 165 Watt (nominal TDP).
Speed/Performance: According to published consumer and game-based benchmarks the RTX 4080 GPU is roughly a factor of 2 to 2.2 faster than the “RTX 4060 TI 16GB”. Consistently, the RTX 4080 has roughly 2.25 times as many cores / tensor cores. The memory bus width of the RTX 4080 is 256 Bit vs. only 128 Bit for the 4060 TI. Note also that the PCIe link width for the 4060 TI is only x8, instead of x16 for a RTX 4070 TI or a RTX 4080. However, the memory clock speed of the RTX 4060 TI is more than twice as high as for the RTX 4080. A RTX 4070 TI appears to be around a factor of 1.6 faster than a RTX 4060 TI.
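To relate these rough speed factors to price and nominal TDP, one can normalize them. This is a sketch based only on the numbers quoted in this post. The value 2.1 for the RTX 4080 is an assumed midpoint of the range 2.0 to 2.2, and game-based benchmark factors need not carry over to ML workloads.

```python
# (relative speed, price in EUR, nominal TDP in W); 4060 TI speed = 1.0.
# Speeds are the rough benchmark factors quoted above; 2.1 for the RTX 4080
# is an assumed midpoint of the quoted range 2.0 to 2.2.
cards = {
    "RTX 4060 TI 16GB": (1.0, 470, 165),
    "RTX 4070 TI": (1.6, 880, 285),
    "RTX 4080": (2.1, 1250, 320),
}

speed_per_1000_eur = {n: s / p * 1000 for n, (s, p, w) in cards.items()}
speed_per_100_watt = {n: s / w * 100 for n, (s, p, w) in cards.items()}

for name in cards:
    print(f"{name}: {speed_per_1000_eur[name]:.2f} per 1000 EUR, "
          f"{speed_per_100_watt[name]:.2f} per 100 W")
```

With these inputs the speed per euro is highest for the 4060 TI, while the three cards lie much closer together in speed per nominal watt.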
A 4060 TI is effectively only around 20% faster than its older counterpart, the RTX 3060 TI. But the RTX 3060 TI has a significantly higher power consumption (up to 200 Watt) and only 8 GB VRAM, and it is behind the 4060 TI regarding supported standards and operations.
Summary: Regarding specifications the RTX 4060 TI in comparison to a RTX 4070 TI and a RTX 4080 certainly is a compromise regarding performance vs. price tag. But:
Even with two 4060 TI you would be well below the price level of one RTX 4080. But with two 4060 TI cards you would get 32 GB of VRAM in total – which is a decisive factor for some ML experiments. As VRAM is a critical factor, two 4060 TI would also almost certainly be a better deal than one 4070 TI. So, if you are just starting with ML, think carefully. The option to extend your experiments onto a combination of 2 RTX 4060 TI is a relevant future option.
Regarding low power consumption and related heat and noise levels a single 4060 TI in theory is without match. For me as a budding ML addict the question of power consumption is a decisive one – I do not want to care too much about cooling and my energy bill when the GPU is under load for some hours.
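The argument about two 4060 TI cards can be put into numbers with the prices and VRAM sizes quoted in this post. Keep in mind that the 32 GB are split across two devices, so a single model can only use them when the training is distributed over both GPUs.

```python
# (total price in EUR, total VRAM in GB), from the prices and specs quoted above.
# The 32 GB of the first option are two separate 16 GB pools, not one.
options = {
    "2 x RTX 4060 TI 16GB": (2 * 470, 2 * 16),
    "1 x RTX 4070 TI": (880, 12),
    "1 x RTX 4080": (1250, 16),
}

for name, (price, vram) in options.items():
    print(f"{name}: {price} EUR, {vram} GB VRAM, {price / vram:.1f} EUR per GB")
```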
Performance of Machine Learning runs on a RTX 4060 TI vs. a GTX 960
My main interest in doing some preliminary tests myself was to find out what I would gain in comparison to my old GTX 960 (vendor Gigabyte) in training runs for neural networks, more specifically CNNs with 9 million up to 38 million parameters. My old GTX 960 had only 4 GB VRAM. So, all experiments included image data transfers from the RAM to the GPU’s VRAM via an ImageDataGenerator()-batch-pipeline. An important factor for the performance, therefore, is that the RAM is big enough to contain all relevant image tensors – in my test cases between 60,000 and 200,000. I did not read any data from disk, but had them preloaded in the RAM.
Due to its small amount of VRAM the GTX 960 certainly is no reasonable card these days for really deep ML-networks and respective algorithms. It is also too slow for many kinds of experiments with transformers. However, the GTX 960 has a low TDP of 120 Watt.
Expectations from standard benchmarks: Regarding expectations for the difference in performance between a 960 GTX and a RTX 4060 TI you may have a look at test results at tomshardware.com: see here. From the numbers we find there one may expect the RTX 4060 TI to show a factor of 4.4 in performance gain vs. a GTX 960.
However, ML-tests involve different operations than video games, namely more complicated tensor operations. The ML-performance, therefore, depends on many factors, e.g. on the tensor framework used, which in my case was Tensorflow 2 with Keras as a frontend. The performance of Nvidia cards also depends on the CUDA and cuDNN libraries (including optimized Linear Algebra libraries for Deep Neural Networks). While the current CUDA version is 12.2, I did my tests with the older version CUDA 11.2. The proprietary Nvidia driver version was 535.113.01. Data were measured via the nvidia-settings app and “watch -n1.0 nvidia-smi” on a terminal. The system was run under Opensuse Leap 15.4.
As heavy ML tests also involve intermittent data transfers between the GPU and the standard RAM, the system’s PCIe-environment, the CPU and the RAM’s clock frequency also have an impact. So the numbers given below, which were evaluated on a system with a Z170 board and an i7-6700K processor, may not indicate the achievable optimum.
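Instead of watching the nvidia-smi output on a terminal, the same values can be collected from Python. The sketch below is not what I used for the measurements in this post. It assumes that nvidia-smi is on the PATH and uses its --query-gpu option with CSV output. The function names and the sample line are illustrative only.

```python
import shutil
import subprocess

FIELDS = ["power.draw", "temperature.gpu", "utilization.gpu", "memory.used"]


def parse_smi_line(line):
    """Turn one CSV line of nvidia-smi output into a dict of floats keyed by field name.

    Raises ValueError if a field is reported as not available (e.g. "[N/A]").
    """
    values = [float(v) for v in line.split(",")]
    return dict(zip(FIELDS, values))


def query_gpu():
    """Query the first GPU once; returns None if nvidia-smi is not on the PATH."""
    if shutil.which("nvidia-smi") is None:
        return None
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={','.join(FIELDS)}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_smi_line(out.splitlines()[0])


# Hypothetical sample line (power in W, temperature in °C, load in %, memory in MiB).
sample = parse_smi_line("118.52, 71, 95, 3890")
print(sample)
```

Calling query_gpu() once per second in a loop during a training run gives a simple log of power draw, temperature, GPU load and VRAM usage.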
Performance for ML test cases
I used around 5 different test cases either directly based on CNNs or using CNN-based Autoencoders to be trained for different ML tasks regarding image analysis and image classification as well as generative tasks, respectively. Different layer structures including normalization and drop-out layers were used. The numbers of parameters to be optimized were between 9.3 and 38 million. I used between 60,000 and 210,000 images per run. The batch size during training was limited to 128 image tensors first. The color images had a resolution of 96×96 px. Two of the test cases used almost 4.0 GB of the available VRAM on the GTX 960 at this batch size. For some of the tests I raised the batch size later to 256. This, of course, increased the VRAM usage by at least a factor of 2.
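For orientation, the raw size of the image tensors can be estimated from the numbers above. This is a back-of-the-envelope sketch. The data type is an assumption (float32 with 4 bytes per value, or uint8 with 1 byte per value), as it is not stated in this post.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3


def tensor_bytes(n_images, height=96, width=96, channels=3, bytes_per_value=4):
    """Raw size in bytes of n_images color images stored as one dense tensor."""
    return n_images * height * width * channels * bytes_per_value


print(f"one batch of 128 images (float32): {tensor_bytes(128) / MIB:.1f} MiB")
print(f"210,000 images (float32): {tensor_bytes(210_000) / GIB:.1f} GiB")
print(f"210,000 images (uint8): {tensor_bytes(210_000, bytes_per_value=1) / GIB:.1f} GiB")
print("steps per epoch for 60,000 images at batch size 128:", math.ceil(60_000 / 128))
```

A single input batch of 128 images amounts to only about 13.5 MiB as float32, so the input batch itself accounts for a small part of the VRAM figures quoted above. The full data set in RAM is a different matter: with float32 it requires more than 20 GiB for 210,000 images.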
In all tests I found that the percentage of GPU usage rose above 93 % permanently on both cards.
The relevant turnaround times for my selected training and evaluation runs differed by a factor of roughly 3.5 to 3.8 between the 960 GTX and the RTX 4060 TI, i.e. the RTX 4060 TI is on average a factor of about 3.7 faster than a 960 GTX.
Raising the batch size to 256 tensors, which are transferred via an ImageDataGenerator()-pipeline from the RAM to the GPU’s VRAM and handled there as a step unit during my training epochs, gave an additional rise in the performance of the 4060 TI: It became a factor of roughly 4.0 to 4.2 faster than the 960 GTX. This is not a world of difference, but certainly significant.
Interestingly, the VRAM consumption rose not only by a factor of 2, but by 2.5 in some test cases due to changing the batch size by a factor of 2. So, to gain a factor of 4 in performance in comparison to a GTX 960 you may have to use much more VRAM on the RTX 4060 TI than on the GTX 960.
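What this means for the VRAM budget can be estimated from the figures above. This is a simple sketch: it takes the "almost 4.0 GB" at batch size 128 as 4.0 GB and applies the two growth factors mentioned.

```python
# Reported above: almost 4.0 GB VRAM at batch size 128 (taken as 4.0 here).
base_usage_gb = 4.0
available_gb = 16.0  # RTX 4060 TI 16GB

# Growth when doubling the batch size: 2.0 (naive expectation), 2.5 (seen in some test cases).
estimates = {growth: base_usage_gb * growth for growth in (2.0, 2.5)}

for growth, needed in estimates.items():
    print(f"factor {growth}: about {needed:.1f} GB needed, "
          f"{available_gb - needed:.1f} GB left of {available_gb:.0f} GB")
```

With the factor 2.5 the estimate reaches about 10 GB, which would no longer fit on a card with 8 GB VRAM.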
So regarding reasonable ML-tests you may not gain more than a factor of 3.6 to 4.1 in effective performance by replacing a GTX 960 with a RTX 4060 TI.
Power consumption, GPU temperatures and fan noise
Energy consumption during KDE desktop usage: Regarding power consumption under normal Linux conditions I have very positive news: I work with a Linux based KDE desktop stretched across 3 screens (two with 2560×1440 px and one with 1920×1200 px resolution). The power consumption of the 4060 TI was only around 12.4 to 14.0 Watt during standard desktop operations. This is significantly less than for the 960 GTX, which used 30 to 32 Watt.
Energy consumption under full load: During my ML-experiments the GTX 960 consumed 108 Watt on average whereas the 4060 TI consumed between 117 Watt and 125 Watt. A factor of 1.16 in power consumption during load phases for a gain in speed of more than 3.6 and for a 4 times bigger VRAM is more than acceptable.
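Because energy is power multiplied by time, the higher power draw and the shorter run time can be combined into one number. The sketch uses the upper value of 125 Watt and the speed factor 3.6 from above. It only covers the GPU, not the rest of the system.

```python
power_gtx960_w = 108.0    # average under load, from the measurements above
power_4060ti_w = 125.0    # upper value under load, from the measurements above
speedup = 3.6             # lower bound of the measured speed gain

power_ratio = power_4060ti_w / power_gtx960_w
# Same workload: run time shrinks by `speedup`, so the energy scales with power_ratio / speedup.
energy_ratio = power_ratio / speedup

print(f"power ratio: {power_ratio:.2f}")
print(f"GPU energy per run relative to the GTX 960: {energy_ratio:.2f}")
```

Under these assumptions the 4060 TI needs roughly one third of the GPU energy of the GTX 960 for the same training run.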
Level 4 was only reached temporarily during the ML-runs.
GPU temperature: The temperature of the 4060 TI never exceeded a maximum level of 71° Celsius under full load. This was slightly lower than for the GTX 960 (73° Celsius). Note, however, that these temperatures require free space below the GPU fans. I.e. the next PCIe slot should not be used by large cards which produce much heat themselves. See a section below for more information.
Fan noise: As you see from the picture above the two fans in the MSI Ventus model do a proper job. The RPM never rose beyond 1875 – at room temperatures around 23° Celsius. The adaptive fan control works with a slight delay regarding the temperature rise vs. time. This appears to be reasonable as the temperature could drop again during a time slice. In Nov. 2023 new TI models with Torx fans will appear which may be even more effective.
Coming from a phase of low load the fans only start rotating when the GPU temperature reaches more than 60° Celsius. This guarantees a zero noise level under standard operation conditions.
Under standard usage conditions (KDE desktop, 3 attached screens), and depending on the temperature inside the case, the GPU temperature without any active fan sometimes rose from 45° Celsius to 51° Celsius, but not higher. With a bit of fan rotation at 1200 RPM the temperature at once went down to 32° Celsius. This is acceptable as my Alpenföhn CPU-cooler stretches down to 2 cm above the graphics card and my case is relatively densely packed with multiple HDs, SSDs, sound cards and a Raid controller.
Under stress conditions the GPU fans were only audible outside the PC case when I deliberately concentrated on hearing them. I could not hear any coil whine from the graphics card.
Handling, size and other aspects
Regarding height the graphics card occupies the space of two PCIe slots. However, its length of roughly 20 cm is much less than that of the 960 GTX. The width is around 12 cm.
Important Warning: The compact size of the card comes with a disadvantage: Another card in the next PCIe slot below may cover a lot of the GPU’s fan area, which is not good for an efficient cooling of the GPU. I had to move other PCIe cards (in particular a heat producing raid controller) to other slots to free some space below the graphics card. I noticed a drop in GPU temperature under stress conditions of up to 7° Celsius afterward. So, this is important for ML interested users.
The RTX 4060 TI with 16GB VRAM is not a dream card for ML-focused users. But it is a reasonable compromise and offers a lot of value for your money. The 16 GB VRAM are especially valuable to extend the range of ML-experiments beyond those which can be done on cards with only 4 GB or 8 GB VRAM. So it is even interesting for users of a RTX 3060 TI or a good old Geforce 1080 (TI) – although the improvement in performance will then be less than a factor of 1.2 and 1.4, respectively. Users of old cards like a Geforce 960 GTX will experience a performance jump by at least a factor of 3.5 up to 4.2 regarding ML-tasks.
The relatively low power consumption of around 125 Watt and its silent operation – even under heavy load – are major plus points of the RTX 4060 TI. And there is the option to add a second 4060 TI card to your ML-system when prices drop.