
General ML-topics

Some general comments and opinions regarding ML

The Meaning of Object Features in different ML-Contexts

When I gave a few introductory courses on basic Machine Learning [ML] algorithms in 2022, I sometimes ran into a discussion about “features”. The discussions were triggered not only by my personal definition, but also by some introductory books on ML the attendees had read. Across such textbooks, and even within a single book on ML, authors tend to use the term “features” in different contexts of ML-algorithms and in particular of Artificial Neural Networks [ANN]. Unfortunately, the meaning of the term differs somewhat between these contexts. This can lead to misunderstandings.

With this post I want to specify the most important contexts in which the term “feature” appears, comment on the differences and suggest some measures to distinguish a bit better.

Level of the post: Advanced. You should already be familiar with ML and ANNs, pattern detection and the respective variable spaces.

Features in different contexts

In general a feature addresses some property of an object. One would think that an object of interest for some ML application can be described precisely enough by quantifying its relevant properties. How then can it be that object properties get a different meaning in different contexts? The following considerations help to understand why.

We need numeric object data as input for ML-algorithms. But do we always get direct information about physical properties of an object? Or is the information about an important feature only indirectly accessible? In this context media may play a role. We must also take into account that the processes of a trained ML-algorithm typically map an object’s input data to a point in some abstract multidimensional space which is spanned by internal and abstract variables of the algorithm. These variables could also be regarded as (abstract) “features” of an object. In addition, ML-algorithms detect and extract (sometimes hidden) patterns in the input data of objects. Such a pattern is also often called a “feature” characterizing a whole class of objects.

Guided by these thoughts I distinguish the following four main contexts regarding different meanings of the term “feature”:

Context 1 – input and training data based on selected and quantified object properties
The first relevant context concerns the representation of an object in a useful way for a numerical ML-algorithm. A “feature” is a quantifiable property of a class of objects to which we want to apply an algorithm. We define a single object by an ordered array (= tensor) providing numeric values for a set of selected, relevant properties. Such an array represents our object numerically and can be used as input to a computer program, which realizes an ML-algorithm. If numeric values of the properties are available for a whole bunch of objects we can use them as training data for our algorithm.

Mathematically, we interpret a property as a variable which takes a specific value for a selected single object. Thus the numerical representation of an object requires a set of multiple variables. Therefore, we often present the available original training data of our objects as data points in a multidimensional space with a Euclidean coordinate system [ECS]. Each axis of the ECS represents one of the feature variables by which we describe our objects. Sometimes this space is called the (original) “feature space” of the objects. Actually, it is a space to represent the numeric training data available for our objects.
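As a minimal sketch of context 1, the following NumPy snippet represents three objects by four selected properties. The property names and all numeric values are made up for illustration; they are not measured data.

```python
import numpy as np

# Hypothetical property names (the axes of the variable space of context 1).
feature_names = ["length_m", "height_m", "weight_kg", "ear_area_m2"]

# One row per object, one column per property. All values are invented.
X = np.array([
    [6.5, 3.2, 5400.0, 1.9],
    [6.0, 2.7, 4100.0, 0.9],
    [7.0, 3.3, 6000.0, 2.1],
])

n_objects, n_features = X.shape

# A single object is one data point (vector) in the 4-dimensional variable space.
first_object = X[0]

# The values of one property (variable) across all objects form one column.
weights = X[:, feature_names.index("weight_kg")]
```

Each row is the ordered array describing one object, and each column corresponds to one axis of the coordinate system.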

Context 2 – object information embedded in the data of some medium
Which set of “properties” is used to define quantified input data of objects often depends on the way or form in which we register information about our objects. During information gathering, media (such as images, videos, sound recordings, …) can play a decisive role.

Let us take an example: We may want to train an ML-algorithm to distinguish between classes of elephants, e.g. to distinguish an African from an Indian elephant. But relevant data of a bunch of elephants may be available only in the form of pictures, with one image for each of the elephants. We may not have any direct numeric data for an elephant’s properties like its length, height, weight, ear size, … The data of relevant physical properties of elephants would in our case be indirectly embedded in media data.

In such a case we would probably use pixel values as our training data. I.e., the “features” our ML-algorithm gets confronted with would be provided as arrays of pixel values – corresponding to one variable for each of the image’s color pixels. Yet, the objects we really are interested in would be the photographed elephants. Our algorithm should distinguish between (depicted) elephants just by analyzing a respective image. The distinctive features must then be evaluated indirectly.
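A small sketch may illustrate how an image turns into input variables. The random array below only stands in for a real photo, and the size of 32 x 32 pixels with 3 color channels is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Stand-in for one photo: height x width x color channels, 8-bit values.
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)

# Flattening yields one input variable per color pixel value.
x = image.reshape(-1).astype(np.float32) / 255.0  # scaled to [0, 1]

n_input_variables = x.size  # 32 * 32 * 3 = 3072
```

None of the 3072 variables is a property of an elephant. Each one is a property of the image, i.e. of the media object.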

Such a situation opens room for misunderstandings regarding the objects the ML-algorithm really deals with (see the discussion below).

Context 3 – patterns extracted from object data
The term “feature” is also used to denote a pattern which an ML-algorithm may somehow have detected in and extracted from some original training data (by some tricky mathematical methods).

Such pattern-based “features” summarize correlations in the original training data. The detected patterns can be abstract ones or they may correspond to physical properties of the objects. These features may not have been directly referenced by the training data presented to the ML-algorithm, but could have been detected during the training process, e.g. by the evaluation of correlations.

In such a case these features were hidden in the training data. Think again of images of elephant faces for which the training data were chosen to be pixel values: A pattern-based “feature” a capable algorithm detects may then be something like an elephant’s “nose” or “trunk”. More precisely: a nose-like pattern of positional correlations of certain pixel values.
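To make the idea of a “pattern of positional correlations of pixel values” a bit more concrete, here is a toy sketch with a hand-made filter. Both the tiny image and the filter are invented for illustration; in a trained CNN the filter values would be the result of training and would not be written by hand.

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Slide `kernel` over `image` (no padding) and return the response map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    response = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            response[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return response

# Toy 6x6 gray-scale "image" containing one vertical bar of bright pixels.
image = np.zeros((6, 6))
image[1:4, 4] = 1.0

# Hand-made 3x3 filter that responds to a bright vertical bar in its center column.
kernel = np.array([
    [-1.0, 2.0, -1.0],
    [-1.0, 2.0, -1.0],
    [-1.0, 2.0, -1.0],
])

feature_map = cross_correlate2d(image, kernel)

# Position (top-left corner of the 3x3 window) with the strongest response.
row, col = np.unravel_index(np.argmax(feature_map), feature_map.shape)
```

The response map is largest where the window covers the whole bar. The “feature” in the sense of context 3 is the bar-like arrangement of pixel values, not any single pixel.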

But in other cases the detected pattern-based features may relate to correlations between data which correspond to no concrete single physical property, but to more or less abstract relations between properties. E.g., there could be a relation between the size of an elephant and its date of birth, because after some date the food was changed or a genetic modification spread through a group of elephants.

Context 4 – features as abstract variables of latent representation spaces of objects
The internal processes of many ML-algorithms, especially neural networks, map the data points (or vectors) representing objects in the variable space of the input data to data points (or vectors) in an internal or latent representation space. An ML-algorithm, e.g. an ANN, can be regarded as a complicated function mapping a vector of a high-dimensional vector space to a vector of another vector space with a typically lower number of dimensions.
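The following sketch shows such a mapping in its simplest form. The layer sizes are arbitrary assumptions and the weights are random and untrained, so the resulting latent vectors carry no meaning. The point is only the change of the vector space.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

n_input, n_hidden, n_latent = 3072, 64, 8  # arbitrary illustrative sizes

# Random, untrained weights; a real network would obtain them by training.
W1 = rng.normal(scale=0.01, size=(n_input, n_hidden))
W2 = rng.normal(scale=0.1, size=(n_hidden, n_latent))

def encode(x):
    """Map input vectors (last axis: n_input) to latent vectors (last axis: n_latent)."""
    hidden = np.maximum(x @ W1, 0.0)  # linear map followed by a ReLU
    return hidden @ W2

x = rng.random(n_input)  # stand-in for one flattened image
z = encode(x)            # the object's data point in the latent space
```

The 8 components of `z` are the abstract variables meant in context 4.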

In the case of ANNs these internal representation spaces relate to vectorized data which are produced by the neurons of a special (flat) layer. Such a layer typically follows a sequence of other analyzing and processing layers and in a way summarizes their results. The output of each of the neurons in this special inner layer can be regarded as a variable for a vector component. The processed data for a specific object thus lead to specific values corresponding to a data point in an abstract multidimensional space. If such data are externalized and not directly subject to further internal and classifying networks then we speak of an accessible latent space.

The variables that span an internal or latent object representation space are abstract ones – but they can sometimes also measure the overlap with patterns in physical properties of the objects. In the case of Convolutional Neural Networks [CNNs] an internal or latent representation space condenses information about detected patterns and the degree of overlap of a given object with such a pattern. In this sense internal or latent representation (vector) spaces may also represent secondary, pattern based object features in the sense of context 3.

An internal representation space for objects is in some ANN-contexts (especially regarding Natural Language Processing by ANNs) also called an “embedding space”. The difference in my understanding lies in the way the mapping of training data into a representational space is done: In the case of an embedding space the mapping is done by neuron layers close to the input side of a neural network. I.e. the input data are first mapped to an internal representation space and are afterward processed by other network layers. The relevant network parameters for the initial mapping (= embedding) are determined during training via some parameter optimization. In the case of a latent or inner representation space we instead use data produced by neurons which are members of some special inner layer following major processing layers (as e.g. convolutional or residual layers).
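For text data an embedding on the input side can be pictured as a lookup in a parameter matrix, which is equivalent to applying a linear layer to one-hot encoded tokens. The sketch below assumes a made-up five-word vocabulary and random values; in a real network the matrix entries would be optimized during training.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Hypothetical toy vocabulary; real vocabularies hold many thousands of tokens.
vocabulary = {"the": 0, "elephant": 1, "has": 2, "a": 3, "trunk": 4}
embedding_dim = 3

# One row per token. Random here; in a real network these are trainable parameters.
embedding_matrix = rng.normal(size=(len(vocabulary), embedding_dim))

def embed(tokens):
    """Look up the embedding vector of each token (first step on the input side)."""
    ids = np.array([vocabulary[t] for t in tokens])
    return embedding_matrix[ids]

vectors = embed(["the", "elephant", "has", "a", "trunk"])
```

The rows of `vectors` are points in the embedding space. Further network layers would process them afterward.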

See a Wikipedia article about latent spaces which distinguishes between the “feature space” of context 1 and the “latent feature space” of context 4.

A topic for confusion

The example of image data of elephants makes it clear why we often must define precisely what we mean when we speak about “features” of “objects”. In particular, we must be careful to distinguish between media objects and objects indirectly presented by our media objects. We also must address patterns as particular features and internal object representations. Key questions are:

Do we speak of quantified physical and abstract features of the objects we are interested in? Or do media objects play a role whose features encapsulate the data of the really relevant objects? Or do we speak of patterns? Or do we refer to variables of internal or latent feature spaces?

One widespread source of confusion is that we mix up a media object and the object encoded in the media data. We speak of “elephants” when the real objects an ML-algorithm is confronted with are the images of elephants. An algorithm classifying elephants on the basis of image data does not really distinguish between different classes of elephants (or other photographed objects). Instead it actually distinguishes between images with different kinds of pixel correlations. If we are lucky the detected pixel correlation patterns reflect some information about a single feature or a combination of multiple (physical) features of elephants (or other imaged objects).

Note that the interpretation of the input data and the latent data of an ML-algorithm would change substantially if we had not used images of elephants and respective pixel values as training data, but data directly quantifying physical properties of an elephant – as e.g. the length of its trunk – to define our “objects”.

But an ML-algorithm may also detect patterns which the human brain cannot even see in pictures of objects. Then the algorithm would work with features in contexts 2, 3 and 4 for which we may not even have a name. The features at least in contexts 3 and 4 are in the end always abstract – and chosen by the algorithm under optimization criteria.

The interesting thing is that the feature variables chosen to be our training data may totally obscure the really relevant features and respective data of the described objects. If we gave a human being a series of pixel values and did not show the respective image in the usual 2-dimensional and colored way, he or she would have enormous difficulty extracting patterns of the photographed elephants. This is exactly the situation an artificial neural network is confronted with.

Be more precise when describing what you mean by a feature

We can resolve some of the confusion about features by specifying more precisely what we talk about. Personally, I would like to completely drop the term “feature space” for the variable space of training and input data to an ML-algorithm. Regarding the training data the terms “input or training variables” and “variable space of training data” seem much more appropriate. If required we should at least speak of “training data features” or “input data features”.

Concerning context 2 we must clarify what the primary objects whose feature data we feed into an algorithm are – and what the secondary objects are and how their features are indirectly encoded in the primary objects. We must also say which kind of objects we are interested in. Such a clarifying distinction is a must in the context of media data.

Context 3 related features, i.e. patterns, are in my opinion a helpful construction, in particular for describing aspects of CNNs. But such features must clearly be characterized as detected (correlation) patterns in the original input data. It should also be said in which way such a pattern-based feature impacts the output of the algorithm. In the case of CNNs, referring to “patterns of feature maps” could be helpful to indicate that certain (sub-) layers of a CNN react strongly to a certain type of input pattern.

Regarding “features” in context 4 I think that the differences between internal and latent data representation or between “embedded” or “latent” representation spaces are not really decisive. We can in general speak of a “latent space” when we mean a multidimensional space to which some operational processes of a trained ML-algorithm or ANN map input data of objects. Regarding the variables defining the respective vector space I would prefer to talk of “related latent variables” and a respective “latent variable space”. If we absolutely must discuss “features”, we should at least use the term “latent features”.

Conclusion

Referring to features during a discussion of ML-algorithms, their input, output and internal or latent object representation may cause trouble if the term is not explained precisely. There are at least four contexts in which the term “feature” has a different meaning. Sometimes it appears to be better to avoid the term altogether and instead refer to the relevant mathematical variables. Careful use is of particular importance if we describe our objects of interest via media such as images.

 

Criteria for ML capable graphic cards: Amount of VRAM or raw GPU power?

Some of my readers may be interested in having a private environment to study Machine Learning [ML] techniques and perform experiments with complex Neural Network algorithms. I am not talking about AI professionals, but about people (like myself) who are students or privately interested in ML-techniques, and about people who have a limited budget for their AI and ML interests.

Even if you are not a professional you sooner or later may find that a new and better suited graphics card is required for your ML studies. As the prices for graphics cards of the monopolist in this market segment, namely Nvidia, still are extremely high the question may arise what your most important criterion for choosing a certain type of card should be.

In my opinion the most relevant criteria one has to consider and weigh during a purchase decision are:

  1. The price level (I intentionally avoid adjectives such as “reasonable” or “relatively moderate”, as Nvidia in my opinion uses its monopoly position to make a maximum profit.)
  2. The amount of available VRAM.
  3. Raw GPU power and performance in terms of characteristic HW parameters as e.g. the GPU frequency. (But note that the performance of a certain ML algorithm may depend on many more parameters and should always be evaluated with the help of well defined test cases. VRAM and total turnaround performance of many ML algorithms may show a strong correlation.)
  4. Energy consumption (which again has to do with a secondary price tag, namely that of running energy costs).

For private persons, who may have a very limited budget for their ML hobby, criterion 1 will always be dominant. But the variety of graphics cards available for a certain chip generation and the respective variation of HW properties and price tags is big. Most people would like to see criteria 2 and 3 fulfilled at the same time. But you may find respective cards to be unaffordable. Criterion 4 is, in my opinion, often totally underestimated.

With this post I want to briefly discuss criteria 2 to 4 and give you a recommendation regarding their relative weight.

Power consumption – the underestimated criterion

When you start performing training runs for modern Artificial Neural Networks [ANNs] you will soon learn that the GPU usage rises to above 90%. I have sometimes seen a permanent GPU load of 95%. When you watch the power consumption of your graphics card during such runs you may find that it also reaches above 85% to 90% of the nominal maximum power consumption value. Without any overclocking.

In 2020 I used a lot of my free time to work with different types of Deep Learning networks on a modest card, namely a GTX 960 (4GB). Although the power consumption of such a card is limited to around 160 Watts, I was really astonished by my energy expenses at the end of the year: My ML interest increased my expenses in 2020 by about 30%, which is more than significant. Taking into account that more powerful cards of each chip-generation may consume up to 300 Watts, I would like to warn private ML enthusiasts:
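A rough estimate of the running costs can be sketched as follows. The daily training time and the price per kWh are assumptions chosen for illustration, not my actual figures, and the calculation covers the graphics card only, not the rest of the system.

```python
def yearly_energy_cost(power_watts, hours_per_day, price_per_kwh, days=365):
    """Return (energy in kWh, cost) for a device running at constant power."""
    energy_kwh = power_watts / 1000.0 * hours_per_day * days
    return energy_kwh, energy_kwh * price_per_kwh

# Assumed inputs: 160 W draw, 6 h of training per day, 0.30 currency units per kWh.
energy, cost = yearly_energy_cost(power_watts=160, hours_per_day=6, price_per_kwh=0.30)
print(f"{energy:.1f} kWh per year, cost {cost:.2f}")
```

With these assumed inputs the card alone accounts for roughly 350 kWh per year. A 300 Watt card under the same assumptions would come close to doubling that.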

Do not underestimate the energy consumption ML experiments may cause even on moderate graphics cards. ML is an expensive hobby for private addicts. Especially in countries like Germany where the price tag for electrical energy is higher than anywhere else in Europe.

Another important aspect is the rise in GPU core temperature. I have experienced peak values of the GPU temperature of more than 75° up to 80° Celsius, combined with high fan rotation rates and some respective noise. The more powerful a graphics card of a certain chip-generation is, the more relevant the cooling and the associated noise problems become. So, if having a quiet, relatively cool system matters to you, a compromise regarding the performance level of a new GPU card may be appropriate from the very beginning.

VRAM vs. GPU performance

An RTX 4090 card with 24 GB VRAM may be something you dream of as a Linux PC user, but something you cannot afford. Looking at the model range of the RTX 40 series, a serious question may then arise: Should you focus on a cheaper model with less GPU power but more VRAM – or the other way round?

My advice is: It depends on your type of experiments, but in most cases and for the main purpose of studying various types of modern ANNs, GANs and Transformers the size of the available VRAM is more important.

Why? Well, you may be able to await the result of an overnight calculation. But for really deep ANNs like some variants of CNNs, RNNs or other networks with many layers a lack of VRAM may render your planned experiments impossible. Even if you load your data during training and/or evaluation runs in really small batches. In any case you must have enough VRAM to keep the ANN’s model parameters and two or more batches within the available VRAM. Similar arguments hold for (transformer) networks handling texts and respective vector models. Even some steps for the preparation of texts may require a significant amount of VRAM.
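The requirement to hold the model parameters and a few batches in VRAM can be turned into a very rough lower bound. The sketch below assumes 32-bit floats, one gradient value per parameter and an Adam-like optimizer with two extra values per parameter. It ignores the activations of the layers, which often dominate in practice, so real requirements will be higher. The function name and all numbers are illustrative assumptions.

```python
def estimate_training_vram_gb(n_params, batch_size, values_per_sample,
                              bytes_per_value=4, optimizer_states=2,
                              n_batches_in_vram=2):
    """Very rough lower bound for training memory in GB (activations are ignored)."""
    # Weights + gradients + optimizer states (e.g. 2 extra values per weight for Adam).
    model_bytes = n_params * bytes_per_value * (2 + optimizer_states)
    # Input batches kept on the GPU at the same time.
    batch_bytes = n_batches_in_vram * batch_size * values_per_sample * bytes_per_value
    return (model_bytes + batch_bytes) / 1024**3

# Assumed example: 100 million parameters, batches of 64 RGB images of 224x224 pixels.
gb = estimate_training_vram_gb(n_params=100_000_000, batch_size=64,
                               values_per_sample=224 * 224 * 3)
```

Even this optimistic estimate shows that the parameter count alone sets a floor which no reduction of the batch size can remove.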

In addition, VRAM and the total turnaround performance of many ML algorithms are not at all independent of each other. The more data you can keep in VRAM during your runs the better. Data transfer from and to the RAM is costly in terms of total turnaround time.

Note that there typically are two batch sizes which may become relevant: One determines how many data vectors are handled before updating your model parameters during training runs. The VRAM organization of a concrete tensor algorithm has some degrees of freedom, but in general a larger batch size of this kind will raise VRAM consumption. The other relevant batch size is that of packets during batched data transfer from the RAM (or disks) to the GPU’s VRAM. Depending on your PCIe bus width and the graphics card, larger batches may have an additional impact on performance. Efficiently transferring data from the RAM to the GPU and back often requires a delicate balance between system capabilities and the chosen transfer batch size. The latter will also raise the VRAM requirements.
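The interplay of the two batch sizes can be sketched with a plain Python generator. It only produces index ranges; the comments mark where a real program would transfer data or update parameters. The function name and the sizes are hypothetical.

```python
def iterate_batches(n_samples, transfer_size, update_size):
    """Yield (transfer_chunk, mini_batch) index ranges for a two-level batching scheme."""
    for t_start in range(0, n_samples, transfer_size):
        t_end = min(t_start + transfer_size, n_samples)
        # Here a real program would copy samples t_start..t_end from RAM to VRAM.
        for b_start in range(t_start, t_end, update_size):
            b_end = min(b_start + update_size, t_end)
            # Here a real program would run one parameter update on this mini-batch.
            yield (t_start, t_end), (b_start, b_end)

steps = list(iterate_batches(n_samples=10, transfer_size=4, update_size=2))
```

Larger values of `transfer_size` mean fewer transfers but more data resident in VRAM at once, which is the balance described above.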

So VRAM is at least as important as the raw GPU performance in terms of GPU core and VRAM frequencies. In most cases VRAM is even more important. For certain types of deep neural networks the amount of available VRAM may become the dominant criterion, because it must be fulfilled before any kind of experiment is possible.

Reasonable VRAM sizes for a start

I did my first ML experiments on a card with only 4GB of VRAM. You can do a lot with such a card. But the more you play around with deep and relatively modern ANNs the more painful it gets and the more time you must invest in programming tricks. Still, I would say: For a start, graphics cards with 8GB and a GPU above the GTX 960 level are sufficient. If you really plan to study generative ML algorithms, really deep neural networks or NLP algorithms, at least 16 GB of VRAM are a must.

Regarding price vs. VRAM: It may be more reasonable to buy two cards, each with 16GB VRAM and a less powerful GPU, than a most advanced GPU with 24GB VRAM.

Conclusion

Most often VRAM is more important than pure GPU performance – at least for people who want to study basic ML algorithms and ANN properties. Choosing less GPU power may also be consistent with reducing your system’s overall power consumption and its heat as well as its noise level.

A blog on experiences with Machine Learning experiments

This blog covers some basics, experiments and related math in the field of Machine Learning [ML]. It is a personal blog and not an ordered book. The content comes with numerical experiments I had some fun with.

I write in general about experiments which one can perform on a moderately equipped Linux PC. Meaning: This blog will mainly cover conventional experiments which can e.g. be done with Scikit-Learn and Neural Networks with a rather limited number of layers. Still, I think that one can learn quite a lot of interesting things from such limited experiments.

Besides the fun factor: One can prepare oneself via studying some basics for bigger and more professional tasks.

For the time being this blog is not yet about GPT and other advanced transformer based neural networks. The reason is simply that I need a new graphics card to perform related experiments. I will order one soon.

Who is this blog for?

I expect this blog to be interesting for people who have already started with private ML projects, but are not experts yet. There is a variety of standard experiments one typically starts with. You will sooner or later find such experiments with variations here in this blog. But I also intend to cover some experiments which you may not find in introductory textbooks. So, the posts will cover topics both for beginners and advanced users of Python, Numpy, Scikit-Learn, Keras and Tensorflow. I will try to point out what level of knowledge may be required to understand a post or a post series.

You are invited to ask questions, write comments and exchange experiences. However, I expect that you open an account on this blog and let me check your comments before publishing them.

Equipment to do your own experiments

If you want to do projects similar to those discussed here you should be prepared to have some 32 GB RAM and an Nvidia card with at least 4 to 8 GB of VRAM. My personal programming environments are Jupyter Lab (for Python) and Eclipse with PyDev. I strongly advise you not to work with Jupyter only. Instead you should systematically gather and reorder your work with neural networks within classes and reusable methods. And you should collect your classes in suitable Python modules. An Eclipse/PyDev environment is in my opinion much more suitable for such tasks than Jupyter.

I do all my ML experiments on Linux systems. Please, do not expect me to answer questions regarding PyDev and Jupyter installations on Windows.

Some math

What may distinguish this blog from others is that I will sometimes write about mathematical aspects which I stumble across during my experiments and which I find interesting. I will try to confine such posts to a separate main category.

Most of the mathematical subjects I have so far looked into deal with linear algebra (matrix operations), some features of statistical multivariate normal distributions, ellipsoids and ellipses.
Further topics will follow.

The role of my linux-blog

Some people may know me from my linux-blog hosted at anracom.com. In the linux-blog I wrote about Linux- and LAMP-related topics during the first years (up to 2014). During the last 10 years, however, the linux-blog has become a container for all kinds of IT-topics.

Among other things it got a growing section for Machine Learning. As some readers of the linux-blog have recently complained about an overload of only partially Linux-related topics I have opened this new blog. I intend to transfer selected ML-related posts from the linux-blog to this new blog.