

ResNet basics – I – problems of standard CNNs

Convolutional Neural Networks [CNNs] do a good job regarding the analysis of image or video data. They extract correlation patterns hidden in the numeric data of our media objects. Thereby, they gain indirect access to (visual) properties of the displayed physical objects – e.g. industry tools, vehicles, human faces, …

But there are also problems with standard CNNs. They have a tendency to eliminate some small-scale patterns; visually, this leads to smoothing or smear-out effects. Due to interference of the applied filters, artificial patterns can appear when CNN-based Autoencoders are used to recreate or generate images after training. In addition, the number of sequential layers we can use within a CNN on a certain level of resolution is limited. On the one hand this is due to the number of parameters, which rises quickly with the number of layers. On the other hand, and equally important, vanishing gradients can occur during error back-propagation and cause convergence problems for the gradient descent method usually applied during training.

A significant improvement came with so called Deep Residual Neural Networks [ResNets]. In this post series I discuss the most important differences in comparison to standard CNNs. I start the series with a short presentation of some important elements of CNNs. In a second post I will directly turn to the structure of the so called ResNet V2 architecture [2].

To get a more consistent overview of the historical development I recommend reading the series of original papers [1], [2], [3] and, in addition, a chapter in the book of R. Atienza [4]. This post series only summarizes and comments on the ideas in the named resources in a rather personal way. For me it serves as a preparation and overall documentation for Python programs. But I hope the posts will help some other readers to start working with ResNets, too. In a third post I will also look at the math discussed in some of the named papers on ResNets.

Level of this post: Advanced. You should be familiar with the concepts of CNNs and have some practical experience with this type of Artificial Neural Network. You should also be familiar with the Keras framework and standard layer-classes provided by this framework.

Elements of simple CNNs

A short repetition of a CNN’s core elements will later help us to better understand some important properties of ResNets. ResNets are in my opinion a natural extension of CNNs and will allow us to build really deep networks based on convolutional layers. The discussion in this post focuses on simple standard CNNs for image analysis. Note, however, that the application spectrum is much broader: 1D-CNNs can, for example, be used to detect patterns in sequential data flows such as texts.

We can use CNNs to detect patterns in images that depict objects belonging to certain object-classes. Objects of a class have some common properties. For real world tasks the spectrum of classes must of course be limited.

The idea is that there are detectable patterns which are characteristic of the object classes. Some people speak of characteristic features. The identification of such patterns or features then helps to classify objects in new images a trained CNN gets confronted with. Or a combination of patterns could help to recreate realistic object images. A CNN must therefore provide not only some mechanism to detect patterns, but also a mechanism for an internal pattern representation which can e.g. be used as basic information for a classification task.

We can safely assume that the patterns for objects of a certain class will show specific structures on different length scales. To cover a reasonable set of length scales we need to look at images at different levels of resolution. This is one task which a CNN must solve; certain elements of its layer architecture must ensure a systematic change in resolution and of the 2-dimensional length scales we look at.

Pattern detection itself is done by applying filters on sub-scales of the spatial dimensions covered by a certain level of resolution. The filtering is done by so called “Convolutional Layers“. A filter tests the overlap of a given object’s 2-dimensional structures with some filter-related periodic pattern on smaller scales. Relevant filter-parameters for optimal patterns are determined during the training of a CNN. The word “optimal” refers to the task the CNN shall eventually solve.

The basic structure of a CNN (e.g. to analyze the MNIST dataset) looks like this:

The sketched simple CNN consists of only three “Convolutional Layers”. Technically, the Keras framework provides a convolutional layer suited for 2-dimensional tensor data via the Python class “Conv2D“. I use this term below to indicate convolutional layers.

Each of our CNN’s Conv2D-layers comprises a series of rectangular arrays of artificial neurons. These arrays are called “maps” or sometimes also “feature maps“.

All maps of a Conv2D-layer have the same dimensions. The output signals of the neurons in a map together represent a filtered view on the original image data. The deeper a Conv2D-layer resides inside the CNN’s network the more filters had an impact on the input and output signals of the layer’s maps. (More precisely: of the neurons of the layer’s maps).

Resolution reduction (i.e. a shift to larger length scales) is done explicitly in the depicted CNN by intermittent pooling-layers. (An alternative would be that the Conv2D-layers themselves work with a stride parameter s = 2; see below.) The output of the innermost convolutional layer is flattened into a 1-dimensional array, which then is analyzed by some suitable sub-network (e.g. a tiny MLP).
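A minimal Keras sketch of such a CNN may help to fix the ideas. The layer structure follows the description above; the concrete filter numbers (32, 64, 128) and kernel sizes are illustrative assumptions only, not a reference implementation:

from tensorflow.keras import layers, models

# A minimal CNN sketch for MNIST-like data (28x28 px grayscale images)
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),                    # resolution reduction
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),                    # resolution reduction
    layers.Conv2D(128, (3, 3), activation='relu'),  # innermost Conv2D-layer
    layers.Flatten(),                               # 1-dimensional output array
    layers.Dense(64, activation='relu'),            # tiny MLP ...
    layers.Dense(10, activation='softmax')          # ... for classification
])
model.summary()   # the innermost maps here have dimension [3x3]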

Filters and kernels

Convolution in general corresponds to applying a (sub-scale) filter-function to another function. Mathematically we describe this by so called convolution integrals of the functions’ product (with a certain way of linking their arguments). A convolution integral measures the degree of overlap of a (multidimensional) function with a (multidimensional) filter-function. See here for an illustration.
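For two 1-dimensional functions f (the signal) and g (the filter-function) the convolution integral reads, in standard textbook notation:

(f ∗ g)(x) = ∫ f(t) · g(x − t) dt

I.e., for each shift x we measure how strongly f overlaps with a mirrored and shifted copy of the filter-function g.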

As we are working with signals of distinct artificial neurons, our filters must be applied to arrays of discrete signal values. The relevant arrays our filters work upon are the neural maps of a (previous) Conv2D-layer. A sub-scale filter operates sequentially on coherent and fitting sub-arrays of neurons of such a map. It defines an averaged signal of such a sub-array which is fed into a neuron of a map located in the following Conv2D-layer. By sub-scale filtering I mean that the dimensions of the filter-array are significantly smaller than the dimensions of the tested map. See the illustration of these points below.

The sub-scale filter of a Conv2D-layer is technically realized by an array of fixed parameter-values, a so called kernel. A filter’s kernel parameters determine how the signals of the neurons located in a covered sub-array of a map are to be modified before adding them up and feeding them into a target neuron. The parameters of a kernel are also called a filter’s weights.

Geometrically you can imagine the kernel as a (k x k)-array systematically moved across an array of [n x n] neurons of a map (with n > k). The kernel’s convolution operation consists of multiplying each filter-parameter with the signal of the underlying neuron and afterward adding the results up. See the illustration below.
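In the discrete case each kernel position thus amounts to a weighted sum (as usual in CNNs, without mirroring the kernel). With kernel weights w[i, j], signals a[., .] of the tested map and a stride s (see below), the value fed to the target neuron at map position (x, y) is, in standard notation:

o[x, y] = Σ_(i=0..k−1) Σ_(j=0..k−1) w[i, j] · a[s·x + i, s·y + j]

For padding=”valid” the resulting target map has the dimension [m × m] with m = (n − k)/s + 1 (rounded down).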

For each combination of a map M[N, i] of a layer LN with a map M[N+1, m] of the next layer L(N+1) there exists a specific kernel, which sequentially tests fitting sub-arrays of map M[N, i] . The filter is moved across map M[N, i] with a constant shift-distance called stride [s]. When the end of a row is reached the filter-array is moved vertically down to another row at distance s.

Note on the difference of kernel and map dimensions: The illustration shows that we have to distinguish between the dimensions of the kernel and the dimensions of the resulting maps. Throughout this post series we will denote kernel dimensions in round brackets, e.g. (5×5), while we refer to map dimensions with numbers in box brackets, e.g. [11×11].

In the image above map M[N, i] has a dimension of [6×6]. The filter is based on a (3×3) kernel-array. The target maps M[N+1, m] all have a dimension of [4×4], corresponding to a stride s=1 (and padding=”valid” as the kernel-array fits 4 times into the map). For details of strides and paddings please see [5] and [6].
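A tiny NumPy sketch (with made-up random numbers, purely for illustration) reproduces these dimensions:

import numpy as np

# Illustrative only: a [6x6] map, a (3x3) kernel, stride s=1, padding="valid"
rng = np.random.default_rng(42)
map_in = rng.standard_normal((6, 6))   # signals of a map M[N, i]
kernel = rng.standard_normal((3, 3))   # a (3x3) kernel with made-up weights

k, s = 3, 1
m_dim = (map_in.shape[0] - k) // s + 1              # (6 - 3)//1 + 1 = 4
out = np.zeros((m_dim, m_dim))
for x in range(m_dim):
    for y in range(m_dim):
        sub = map_in[x*s : x*s + k, y*s : y*s + k]  # fitting sub-array
        out[x, y] = np.sum(kernel * sub)            # multiply and add up

print(out.shape)   # (4, 4) - the dimension of a target map M[N+1, m]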

Whilst moving with its assigned stride across a map M[N, i] the filter’s “kernel” mathematically invokes a (discrete) convolutional operation at each step. The resulting number is added to the results of other filters working on other maps M[N, j]. The sum is fed into a specific neuron of a target map M[N+1, m] (see the illustration above).

Thus, the output of a Conv2D-layer’s map is the result of filtered input coming from previous maps. The strength of the remaining average signal of a map indicates whether the input is consistent with a distinct pattern in the original input data. After the input data have passed all filters up to the length scale of the innermost Conv2D-layer, each innermost map reacts selectively and strongly to a specific pattern, which can be rather complex (see the pattern examples below).

Note that a filter is not something fixed a priori. Instead the weights of the filters (convolution kernels) are determined during a CNN’s training and weight optimization. Loss optimization dictates which filter weights are established during training and later used at inference, i.e. for the analysis of new images.

Note also that a filter (or its mathematical kernel) represents a repetitive sub-scale pattern. This leads to the fact that patterns detected on a specific length scale very often show a certain translation and a limited rotation invariance. This in turn is a basic requirement for a good generalization of a CNN-based algorithm.

A filter feeds neurons located in a map of a following Conv2D-layer. If a layer LN has p maps and the following layer L(N+1) has q maps, then each neuron of a map M[N+1, m] receives the superposition of the outcomes of p different filters – one per source map. The layer transition as a whole comprises (p*q) different filters (and respective kernels).
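You can check this combinatorics directly in Keras. A Conv2D-layer connecting p source maps with q target maps via (k x k) kernels holds (k · k · p) · q weights plus q bias values. A sketch with assumed numbers p=32, q=64, k=3:

from tensorflow.keras import layers, models

# Assumed numbers for illustration: p=32 source maps, q=64 target maps
m = models.Sequential([
    layers.Conv2D(64, (3, 3), input_shape=(28, 28, 32))
])
m.summary()  # 3*3*32*64 + 64 = 18496 parameters, i.e. p*q = 2048 distinct kernels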

Patterns and features

Patterns which fit some filters of course appear on different length scales and thus at all Conv2D-layers. We first filter for small-scale patterns, then for (overlaid) patterns on larger scales. A certain combination of patterns on all length scales investigated so far is represented by the output of the innermost maps.

All in all the neural activation of the maps at the innermost layers results from (surviving) signals which have passed a sequence of non-linearly interacting filters. (Non-linear due to the non-linearity of the neurons’ activation function.) A strong overall activation of an innermost map corresponds to a unique and characteristic pattern in the input image which “survived” the chain of filters on all investigated scales.

Therefore a map is sometimes also called a “feature map”. A feature refers to a distinct (and maybe not directly visible) pattern in the input image data to which an inner map reacts strongly.

Increasing number of maps with lower resolution

When reducing the length scales we typically open up space for more significant pattern combinations; the number of maps therefore increases with each Conv2D-layer that reduces the resolution (with a stride s=2 or after a pooling layer). This is a very natural procedure due to filter combinatorics; see the sketch below.
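In Keras this doubling of the map number at each resolution step might look like the following sketch (illustrative filter numbers; here the Conv2D-layers themselves reduce the resolution via strides=2 instead of pooling layers):

from tensorflow.keras import layers, models

# Illustrative: the map number doubles whenever strides=2 halves the resolution
model = models.Sequential([
    layers.Conv2D(32, (3, 3), strides=2, padding='same',
                  activation='relu', input_shape=(96, 96, 3)),  # [48x48], 32 maps
    layers.Conv2D(64, (3, 3), strides=2, padding='same',
                  activation='relu'),                           # [24x24], 64 maps
    layers.Conv2D(128, (3, 3), strides=2, padding='same',
                  activation='relu')                            # [12x12], 128 maps
])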

Examples of patterns detected for MNIST images

A CNN in the end detects and establishes patterns (inherent in the image data) which are relevant for solving a certain problem (e.g. classification or generative reconstruction). A funny thing is that these “feature patterns” can be visualized.

The next image shows the activation pattern (signal strengths) of the neurons of the 128 innermost maps (each of dimension [3×3]) of a CNN that had been trained on the MNIST data and was confronted with an image displaying the digit “6”.

The other image shows some “featured” patterns to which six selected innermost maps react very sensitively and with a large averaged output after the CNN had been trained on MNIST digit images.


These patterns obviously result from translations and some rotation of more elementary patterns. The third pattern seems to be useful for detecting “9”s at different positions on an image, the fourth for the detection of images of “2”s. It is somewhat amusing what kind of patterns a CNN considers interesting to distinguish between digits!

If you are interested in how to create images of the patterns to which the maps of the innermost Conv2D-layer react, see the book of F. Chollet on “Deep Learning with Python” [5]. See also a post of the physicist F. Graetz, “How to visualize convolutional features in 40 lines of code”, at “towardsdatascience.com”. For MNIST see my own posts on the visualization of filter specific patterns in my linux-blog. I intend to describe and apply the required methods for layers of ResNets somewhere else in the present ML-blog.
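As a hedged sketch of the gradient-ascent idea described in [5]: we start from a random input image and iteratively change it such that the mean activation of one selected innermost map gets maximized. The variable ‘model’ and the layer name ‘conv2d_2’ are assumptions referring to the CNN sketch further above – adapt them to your own network:

import tensorflow as tf

feature_extractor = tf.keras.Model(
    inputs=model.inputs,
    outputs=model.get_layer('conv2d_2').output)       # innermost Conv2D-layer

img = tf.Variable(tf.random.uniform((1, 28, 28, 1)))  # start from noise
map_index = 0                                         # the map to visualize

for _ in range(30):
    with tf.GradientTape() as tape:
        activation = feature_extractor(img)
        loss = tf.reduce_mean(activation[:, :, :, map_index])
    grads = tape.gradient(loss, img)
    grads = grads / (tf.norm(grads) + 1e-8)   # normalized gradient
    img.assign_add(0.9 * grads)               # gradient ascent step
# img now approximates a pattern the selected map reacts strongly to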

Deep CNNs

The CNN depicted above is very elementary and not a really deep network. Anyone who has experimented with CNNs will probably have tried to use groups of Conv2D-layers on the same level of resolution and map-dimensions, and he/she probably will also have tried to stack such groups to get deeper networks. VGG-Net (see the literature, e.g. [2], [5], [6]) is a typical example of a deeper architecture: In a VGG-Net we have a group of sequential Conv2D-layers on each level of resolution – each with the same number of maps, as sketched below.
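Such a group could be written with the Keras functional API roughly like this (a sketch with assumed parameters, not the original VGG configuration):

from tensorflow.keras import layers

def vgg_group(x, n_maps, n_layers=2):
    # a group of sequential Conv2D-layers on one level of resolution,
    # each with the same number of maps
    for _ in range(n_layers):
        x = layers.Conv2D(n_maps, (3, 3), padding='same', activation='relu')(x)
    return layers.MaxPooling2D((2, 2))(x)  # shift to the next lower resolution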

BUT: Such simple deep networks do not always give good results regarding error rates, convergence and computational time. The number of parameters rises quickly without a really satisfactory reward.

Problems of deep CNNs

In a standard CNN each inner convolutional layer works with data that were already filtered and modified by previous layers.

An inner filter cannot adapt to original data, but only to filtered information. And any filter eliminates some of the originally present information … This also occurs at transitions to layers working on larger dimensions, i.e. with maps of reduced resolution: The first filter working on a larger length scale (lower resolution) eliminates information coming from a pooling layer (or a Conv2D-layer with stride s=2). The original averaged data are not available to further layers working on the same new length scale.

Therefore, a standard CNN deals with a problem of rather fast information reduction. Furthermore, the maps in a group have no common point of reference – except the overall loss optimization. Each filter can eventually only become optimal if the previous filtering layer has already found a reasonable solution; an individual layer cannot learn something by itself. This in turn means: During training the net must adapt as a whole, i.e. as a unit. This strong coupling of all layers can increase the number of required training epochs and also create convergence problems.

How could we work against these trends? Can we somehow support an adaption of each layer to unfiltered data – i.e. support some learning which does not completely depend on previous layers? This is the topic of the next post in this series.

Conclusion

CNNs were important building blocks in the history of Machine Learning for image analysis. Their basic elements, such as Conv2D-layers, filters and respective neural maps on different length scales (or resolutions), work well in networks whose depth is limited, i.e. when the total number of Conv2D-layers is small (3 to 10). The number of parameters rises quickly with a network’s depth and one encounters convergence problems. Experience shows that building really deep networks with Conv2D-layers requires additional architectural elements and layer-combinations. Such elements are cornerstones of Residual Networks. I will discuss them in the next post of this series. See

ResNet basics – II – ResNet V2 architecture

Literature

[1] K. He, X. Zhang, S. Ren, J. Sun, “Deep Residual Learning for Image Recognition”, 2015, arXiv:1512.03385v1
[2] K. He, X. Zhang, S. Ren, J. Sun, “Identity Mappings in Deep Residual Networks”, 2016, version 2, arXiv:1603.05027v2
[3] K. He, X. Zhang, S. Ren, J. Sun, “Identity Mappings in Deep Residual Networks”, 2016, version 3, arXiv:1603.05027v3
[4] R. Atienza, “Advanced Deep Learning with TensorFlow 2 and Keras”, 2nd edition, 2020, Packt Publishing Ltd., Birmingham, UK (see chapter 2)
[5] F. Chollet, “Deep Learning with Python”, 2017, Manning Publications, USA
[6] A. Geron, “Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow”, 3rd ed., 2023, O’Reilly Media Inc., Sebastopol, CA, USA


Preliminary test of a Nvidia RTX 4060 TI 16GB with neural networks

Recently I had the opportunity to test a Nvidia RTX 4060 TI (vendor: MSI, model: Ventus) on my Linux system against a Geforce GTX 960.

For private consumers like me who are not interested in gaming, but in Machine Learning [ML], this type of card can be interesting. I name three reasons:

  • the price level
  • the amount of available VRAM (16 GB)
  • the power consumption (which again has to do with a price, namely the one you have to pay for energy).

Some readers of this blog may miss a criterion like “performance“. The reason is that I regard the VRAM criterion as more important than raw GPU power. I have already commented on this in another post. See Criteria for ML capable graphic cards: Amount of VRAM or raw GPU power?

This post provides some preliminary impressions and measured performance factors comparing the RTX 4060 TI to a previously used Nvidia Geforce GTX 960. The factors were derived from training runs of Convolutional Neural Networks and Autoencoders used for object classification on images and generative tasks. These ML-runs were also used to measure the maximum temperature level and to subjectively compare the fan noise of the 4060 TI with that of the GTX 960. So, the difference to what you find on other sites comparing GPUs is that I focused on ML-related tests and not on video games or game-specific benchmarks.

A compromise regarding the value for money – some theoretical values for 40XX-cards

Nvidia, in my opinion, exploits its monopoly regarding ML-capable cards to the maximum; so ML-capable cards are still very expensive. Finding an affordable compromise requires comparing the specifications of GPU variants. The basic specification data for the RTX 4060 TI 16GB (and other variants of the Ada Lovelace architecture) can be found here. Some data can also be seen in the following picture:

A 4060 TI cannot provide the GPU performance of a 4070, 4080 or 4090. In the following comparison of a few specifications I leave out the RTX 4090, as this top model presently carries a price tag above 1700 € in Germany. The 4090 is a card for ML-professionals or rich enthusiasts, but not for a normal private consumer like myself.

VRAM: The RTX 4060 TI is one of the 40XX-cards which provide 16 GB VRAM. The RTX 4080 also comes with 16 GB VRAM; so the RTX 4080 is a direct competitor for the RTX 4060 TI 16GB regarding value for money. Note that there is also a 4060 TI variant available which provides only 8 GB. The RTX 4070 TI has much in common with a 4080, but a lower amount of VRAM, namely 12 GB. All in all Nvidia’s policy for the RTX 4070 (TI) is a bit questionable, as it does not really address the requirements of ML-people. But the lack of VRAM on the RTX 4070 (TI) was criticized by the gamer community, too.

Price: The price tag of the 4060 TI is presently (Oct. 2023) at or slightly below 470 € (in Germany). I.e. a RTX 4060 TI costs roughly 37% and 55% of what you have to pay for a RTX 4080 (1250 €) and a RTX 4070 TI (880 €), respectively. I took the prices from the German Amazon site.

TDP: A RTX 4080 can draw up to 320 Watt in power consumption, a RTX 4070 TI up to 285 Watt and a RTX 4090 up to 450 Watt. The RTX 4060 TI, in contrast, requests only up to 165 Watt (nominal TDP).

Speed/Performance: According to published consumer and game-based benchmarks the RTX 4080 GPU is roughly a factor of 2 to 2.2 faster than the “RTX 4060 TI 16GB”. Consistently, the RTX 4080 has roughly a factor of 2.25 more cores / tensor cores. The memory bus width of the RTX 4080 is 256 bit vs. only 128 bit for the 4060 TI. Note also that the PCIe link width for the 4060 TI is only x8, instead of x16 for a RTX 4070 TI or a RTX 4080. However, the nominal memory clock of the RTX 4060 TI is higher than that of the RTX 4080. A RTX 4070 TI appears to be around a factor of 1.6 faster than a RTX 4060 TI.

A 4060 TI is only around 20% effectively faster than its older counterpart, the RTX 3060 TI. But the RTX 3060 TI has a significantly higher power consumption (up to 200 Watt), only 8 GB VRAM, and regarding supported standards and operations it is behind the 4060 TI.

Summary: Regarding its specifications, the RTX 4060 TI certainly is a compromise between performance and price tag in comparison to a RTX 4070 TI and a RTX 4080. But:

  • Even with two 4060 TI cards you would be well below the price level of one RTX 4080. And with two 4060 TI cards you would get 32 GB of VRAM in total – which is a decisive factor for some ML experiments. As VRAM is a critical factor, two 4060 TI would almost certainly also be a better deal than one 4070 TI. So, if you are in a position where you start with ML, think carefully: The option to extend your experiments onto a combination of two RTX 4060 TI is a relevant future option.
  • Regarding low power consumption and related heat and noise levels, a single 4060 TI is in theory without match. For me as a budding ML addict the question of power consumption is a decisive one – I do not want to care too much about cooling and my energy bill when the GPU is under load for some hours.

Performance of Machine Learning runs on a RTX 4060 TI vs. a GTX 960

My main interest in doing some preliminary tests myself was what I would gain in comparison to my old GTX 960 (vendor: Gigabyte) in training runs for neural networks, more specifically CNNs with 9 million up to 38 million parameters. My old GTX 960 had 4 GB VRAM only. So, all experiments included image data transfers from the RAM to the GPU’s VRAM via an ImageDataGenerator()-batch-pipeline. An important factor for the performance, therefore, is that the RAM is big enough to contain all relevant image tensors – in my test cases between 60,000 and 200,000. I did not read any data from disk, but had them preloaded in the RAM.
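Such a pipeline looked roughly like the following sketch (hypothetical variable names; the number of images is reduced here for illustration):

import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Hypothetical example: image tensors preloaded in the RAM (96x96 px, 3 channels)
x_train = np.random.random((10000, 96, 96, 3)).astype('float32')
y_train = np.random.randint(0, 10, 10000)

datagen = ImageDataGenerator()    # no augmentation, pure batch feeding
flow = datagen.flow(x_train, y_train, batch_size=128)  # batches: RAM -> VRAM

# model.fit(flow, epochs=10)      # training consumes the batches step by step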

Due to its small amount of VRAM the GTX 960 is certainly no reasonable card these days for really deep ML-networks and respective algorithms. It is also too slow for many kinds of experiments with transformers. However, the GTX 960 has a low TDP of 120 Watt.

Expectations from standard benchmarks: Regarding expectations for the difference in performance between a 960 GTX and a RTX 4060 TI you may have a look at test results at tomshardware.com: see here. From the numbers we find there one may expect the RTX 4060 TI to show a factor of 4.4 in performance gain vs. a GTX 960.

However, ML-tests involve different operations than video games, namely more complicated tensor operations. The ML-performance, therefore, depends on many factors – e.g. on the tensor framework used, which in my case was Tensorflow2 with Keras as a frontend. The performance of Nvidia cards also depends on the CUDA and cuDNN drivers (including optimized linear algebra libraries for Deep Neural Networks). While CUDA has a current version of 12.2, I have done my tests with the older version CUDA 11.2. The proprietary Nvidia driver version was 535.113.01. Data were measured via the nvidia-settings app and “watch -n1.0 nvidia-smi” on a terminal. The system was run under Opensuse Leap 15.4.
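Whether Tensorflow2 actually sees the GPU and which CUDA version it was built against can be checked with standard TF2 calls:

import tensorflow as tf

print(tf.__version__)
print(tf.config.list_physical_devices('GPU'))   # should list the RTX 4060 TI
# the build info dictionary contains e.g. the keys 'cuda_version', 'cudnn_version'
print(tf.sysconfig.get_build_info()['cuda_version'])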

As heavy ML tests also involve intermittent data transfers between the GPU and the standard RAM, the system’s PCIe-environment, the CPU and the RAM’s clock frequency also have an impact. So the numbers given below, evaluated on a system with a Z170 board and an i7-6700K processor, may not indicate the achievable optimum.

Performance for ML test cases

I used around 5 different test cases, either directly based on CNNs or using CNN-based Autoencoders, to be trained for different ML tasks regarding image analysis and image classification as well as generative tasks. Different layer structures including normalization and drop-out layers were used. The numbers of parameters to be optimized were between 9.3 and 38 million. I used between 60,000 and 210,000 images per run. The batch size during training was first limited to 128 image tensors. The color images had a resolution of 96×96 px. Two of the test cases used almost 4.0 GB of the available VRAM on the GTX 960 at this batch size. For some of the tests I later raised the batch size to 256. This, of course, increased the VRAM usage by at least a factor of 2.

In all tests I found that the GPU usage stayed permanently above 93 % on both cards.

The relevant turnaround times for my selected training and evaluation runs differed by a factor of roughly 3.5 to 3.8 between the GTX 960 and the RTX 4060 TI – i.e. the RTX 4060 TI is on average by a factor of 3.7 faster than a GTX 960.

Raising the batch size to 256 tensors, transferred via an ImageDataGenerator()-pipeline from the RAM to the GPU’s VRAM and handled there as a step unit during my training epochs, gave an additional rise in the performance of the 4060 TI: It became faster than the GTX 960 by a factor of roughly 4.0 to 4.2. This is not a huge difference, but certainly significant.

Interestingly, the VRAM consumption rose not only by a factor of 2, but by 2.5 in some test cases when changing the batch size by a factor of 2. So, to gain a factor of 4 in performance in comparison to a GTX 960 you may have to use much more VRAM on the RTX 4060 TI than on the GTX 960.

So regarding reasonable ML-tests you may not gain more than a factor of 3.6 to 4.1 in effective performance by replacing a GTX 960 with a RTX 4060 TI.

Power consumption, GPU temperatures and fan noise

Energy consumption during KDE desktop usage: Regarding power consumption under normal Linux conditions I have very positive news: I work with a Linux based KDE desktop stretched across 3 screens (two with 2560×1440 px and one with 1920×1200 px resolution). The power consumption of the 4060 TI was only around 12.4 to 14.0 Watt during standard desktop operations. This is significantly less than for the 960 GTX, which used 30 to 32 Watt.

Energy consumption under full load: During my ML-experiments the GTX 960 consumed 108 Watt on average, whereas the 4060 TI consumed between 117 Watt and 125 Watt. A factor of 1.16 in energy consumption during load phases for a gain in speed of more than 3.6 and a 4 times bigger VRAM is more than acceptable.

The GPU’s highest performance level (level 4 in the nvidia-settings output) was only reached temporarily during the ML-runs.

GPU temperature: The temperature of the 4060 TI never exceeded a maximum level of 71° Celsius under full load. This was slightly lower than for the GTX 960 (73° Celsius).
Note, however, that these temperatures require free space below the GPU fans. I.e. the next PCIe slot should not be used by cards with major dimensions which produce much heat themselves. See a section below for more information.

Fan noise: As you can see from the picture above, the two fans of the MSI Ventus model do a proper job. The RPM never rose beyond 1875 – at room temperatures around 23° Celsius. The adaptive fan control works with a slight delay regarding the temperature rise vs. time. This appears to be reasonable, as the temperature could drop again during a time slice. In Nov. 2023 new TI models with Torx fans will appear, which may be even more effective.

Coming from a phase of low load the fans only start rotating when the GPU temperature reaches more than 60° Celsius. This guarantees a zero noise level under standard operation conditions.

Under standard usage conditions (KDE desktop, 3 attached screens) and dependent on the temperature in the case, the GPU temperature without any active fan sometimes rose from 45° Celsius to 51° Celsius – but not more. With a bit of fan rotation at 1200 RPM the temperature quickly went down to 32° Celsius. This is acceptable, as my Alpenföhn CPU-cooler stretches down to 2 cm above the graphics card and my case is relatively densely packed with multiple HDs, SSDs, sound cards and a Raid controller.

Under stress conditions the GPU fans were only audible outside the PC case when I deliberately concentrated on hearing them. I could not hear any coil whine from the graphics card.


Handling, size and other aspects

Regarding height, the graphics card occupies the space of two PCIe slots. However, its length of roughly 20 cm is much less than that of the GTX 960. The width is around 12 cm.

Important Warning: The compact size of the card comes with a disadvantage:
Another card in the next PCIe slot below may cover a lot of the GPU’s fan area, which is not good for an efficient cooling of the GPU. I had to move other PCIe cards (in particular a heat producing Raid controller) to other slots to free some space below the graphics card. Afterward I noticed a drop in GPU temperature under stress conditions of up to 7° Celsius. So, this is important for ML-interested users.

Conclusion

The RTX 4060 TI with 16GB VRAM is not a dream card for ML-focused users. But it is a reasonable compromise and offers a lot of value for your money. The 16 GB of VRAM are especially valuable to extend the range of ML-experiments beyond those which can be done on cards with only 4 GB or 8 GB VRAM. So it is even interesting for users of a RTX 3060 TI or a good old Geforce 1080 (TI) – although the improvement in performance will then be less than a factor of 1.2 and 1.4, respectively. Users of old cards like a Geforce GTX 960 will experience a performance jump by at least a factor of 3.5 up to 4.2 regarding ML-tasks.

The relatively low power consumption of around 125 Watt and its silent operation – even under heavy load – are major plus points of the RTX 4060 TI. And there is the option to add a second 4060 TI card to your ML-system when prices drop.

Links

https://gpu.userbenchmark.com/Compare/Nvidia-RTX-4080-vs-Nvidia-RTX-4060-Ti/4138vs4149

https://versus.com/de/nvidia-geforce-rtx-4060-ti-16gb-vs-nvidia-geforce-rtx-4080-16gb