When I gave a few introductory courses on basic Machine Learning [ML] algorithms in 2022, I sometimes ran into a discussion about “features”. The discussions were not only triggered by my personal definition, but also by some introductory books on ML the attendees had read. Across such textbooks, and even within a single book on ML, authors tend to use the term “features” in different contexts of ML-algorithms and in particular of Artificial Neural Networks [ANN]. Unfortunately, the meaning of the term differs somewhat between these contexts. This can lead to misunderstandings.
With this post I want to specify the most important contexts in which the term “feature” appears, comment on the differences and suggest some measures for distinguishing the meanings more clearly.
Level of the post: Advanced. You should already be familiar with ML and ANNs, pattern detection and the respective variable spaces.
Features in different contexts
In general a feature addresses some property of an object. One would think that an object of interest for some ML application can be described precisely enough by quantifying its relevant properties. How, then, can it be that object properties take on different meanings in different contexts? The following considerations help to understand why.
We need numeric object data as input for ML-algorithms. But do we always get direct information about the physical properties of an object? Or is this information about an important feature only indirectly accessible? In this context media may play a role. We also must take into account that the processes of a trained ML-algorithm typically map an object’s input data to a point in some abstract multidimensional space which is spanned by internal and abstract variables of the algorithm. These variables could also be regarded as (abstract) “features” of an object. In addition, ML-algorithms detect and extract (sometimes hidden) patterns in the input data of objects. Such a pattern is also often called a “feature” characterizing a whole class of objects.
Guided by these thoughts I distinguish the following four main contexts regarding different meanings of the term “feature”:
Context 1 – input and training data based on selected and quantified object properties
The first relevant context concerns the representation of an object in a useful way for a numerical ML-algorithm. A “feature” is a quantifiable property of a class of objects to which we want to apply an algorithm. We define a single object by an ordered array (= tensor) providing numeric values for a set of selected, relevant properties. Such an array represents our object numerically and can be used as input to a computer program, which realizes an ML-algorithm. If numeric values of the properties are available for a whole bunch of objects we can use them as training data for our algorithm.
Mathematically, we interpret a property as a variable which takes a specific value for a selected single object. Thus the numerical representation of an object requires a set of multiple variables. Therefore, we often present the available original training data of our objects as data points in a multidimensional space with a Euclidean coordinate system [ECS]. Each axis of the ECS represents one of the feature variables by which we describe our objects. Sometimes this space is called the (original) “feature space” of the objects. Actually, it is a space to represent the numeric training data available for our objects.
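To make context 1 concrete, here is a minimal sketch (numpy only; the chosen properties and their values are purely hypothetical) of how a few objects become rows of a training data matrix, i.e. points in a 3-dimensional variable space of training data:

```python
# Minimal sketch: three elephants described by hypothetical, directly
# quantified properties. Each row represents one object, each column one
# feature variable; the rows are points in a 3-dimensional space.
import numpy as np

feature_names = ["shoulder_height_m", "weight_kg", "ear_width_m"]  # hypothetical choice
X_train = np.array([
    [3.2, 6000.0, 1.8],   # elephant 1
    [2.7, 4000.0, 1.1],   # elephant 2
    [3.0, 5500.0, 1.7],   # elephant 3
])
print(X_train.shape)      # (3, 3): 3 objects, 3 feature variables
```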
Context 2 – object information embedded in the data of some medium
What set of “properties” is used to define quantified input data of objects often depends on the way or form in which we register information about our objects. During information gathering, media (such as images, videos, sound recordings, …) can play a decisive role.
Let us take an example: We may want to train an ML-algorithm to distinguish between classes of elephants, e.g. to distinguish an African from an Indian elephant. But relevant data of a bunch of elephants may be available in the form of pictures only, with one image per elephant. We may not have any direct numeric data for an elephant’s properties like its length, height, weight, ear size, … The data of relevant physical properties of the elephants would in this case be only indirectly embedded in media data.
In such a case we would probably use pixel values as our training data. I.e., the “features” our ML-algorithm gets confronted with would be provided as arrays of pixel values, corresponding to one variable for each of the image’s color pixels. Yet, the objects we really are interested in would be the photographed elephants. Our algorithm should distinguish between (depicted) elephants just from analyzing a respective image. The distinctive features must then be evaluated indirectly.
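The following minimal sketch (the image is simulated with random values, standing in for a real photograph) shows what the algorithm actually receives in such a setting: a long vector of pixel values, not the physical properties of the depicted animal.

```python
# Minimal sketch: a (simulated) 64x64 grayscale image of an elephant.
# The "features" the algorithm sees are the 4096 pixel values.
import numpy as np

image = np.random.rand(64, 64)      # stands in for a real photograph
pixel_features = image.flatten()    # one input variable per pixel
print(pixel_features.shape)         # (4096,)
```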
Such a situation opens room for misunderstandings regarding the objects the ML-algorithm really deals with (see the discussion below).
Context 3 – patterns extracted from object data
The term “feature” is also used to refer to a pattern which an ML-algorithm may somehow have detected in and extracted from some original training data (by some tricky mathematical methods).
Such pattern-based “features” summarize correlations in the original training data. The detected patterns can be abstract ones or they may correspond to physical properties of the objects. These features may not have been directly referenced by the training data presented to the ML-algorithm, but could have been detected during the training process, e.g. by the evaluation of correlations.
In such a case these features were hidden in the training data. Think again of images of elephant faces for which the training data were chosen to be pixel values: A pattern-based “feature” a capable algorithm detects may then be something like an elephant’s “nose” or “trunk”. More precisely: a nose-like pattern of positional correlations of certain pixel values.
But in other cases the detected pattern-based features may relate to correlations between data which correspond to no concrete single physical property, but to more or less abstract relations between properties. E.g., there could be a relation between the size of an elephant and its date of birth, because after some date the food was changed or a genetic modification spread within a group of elephants.
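Here is a minimal sketch of how such a hidden relation can surface in the data; the values are invented, and the simple correlation matrix only stands in for the more involved methods a real algorithm would use. The point is that a strong correlation (here between size and birth year) appears although no single input variable names this relation explicitly:

```python
# Minimal sketch with invented data: a hidden correlation between two
# feature columns that no individual input variable describes directly.
import numpy as np

rng = np.random.default_rng(0)
size = rng.normal(3.0, 0.3, 100)                            # e.g. elephant size
birth_year = 2000 + (size - 3.0) * 10 + rng.normal(0, 0.5, 100)
ear_width = rng.normal(1.5, 0.2, 100)                       # uncorrelated property

X = np.column_stack([size, ear_width, birth_year])
print(np.round(np.corrcoef(X, rowvar=False), 2))            # strong entry for
                                                            # (size, birth_year)
```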
Context 4 – features as abstract variables of latent representation spaces of objects
The internal processes of many ML-algorithms, especially neural networks, map the data points (or vectors) representing objects in the variable space of the input data to data points (or vectors) in an internal or latent representation space. An ML-algorithm, e.g. an ANN, can be regarded as a complicated function mapping a vector of a high-dimensional vector space to a vector of another vector space which typically has a lower number of dimensions.
In the case of ANNs these internal representation spaces relate to vectorized data which are produced by the neurons of a special (flat) layer. Such a layer typically follows a sequence of other analyzing and processing layers and in a way summarizes their results. The output of each of the neurons in this special inner layer can be regarded as a variable for a vector component. The processed data for a specific object thus lead to specific values corresponding to a data point in an abstract multidimensional space. If such data are externalized and not directly subject to further internal processing or classifying networks, then we speak of an accessible latent space.
The variables that span an internal or latent object representation space are abstract ones – but they can sometimes also measure the overlap with patterns in physical properties of the objects. In the case of Convolutional Neural Networks [CNNs] an internal or latent representation space condenses information about detected patterns and the degree of overlap of a given object with such a pattern. In this sense internal or latent representation (vector) spaces may also represent secondary, pattern-based object features in the sense of context 3.
An internal representation space for objects is in some ANN-contexts (especially regarding Natural Language Processing by ANNs) also called an “embedding space”. The difference in my understanding lies in the way the mapping of training data into a representational space is done: In the case of an embedding space the mapping is done by neuron layers close to the input side of a neural network. I.e. the input data are first mapped to an internal representation space and are afterwards processed by other network layers. The relevant network parameters for the initial mapping (= embedding) are determined during training via some parameter optimization. In the case of a latent or inner representation space we instead use data produced by neurons which are members of some special inner layer following major processing layers (such as convolutional or residual layers).
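As an illustration of context 4, here is a minimal, untrained toy example (assuming TensorFlow/Keras is available; layer sizes and the layer name “latent” are arbitrary choices) in which a flat inner Dense layer provides the latent representation and a sub-model maps input images to points in that latent space:

```python
# Minimal sketch (assuming TensorFlow/Keras): a small CNN whose flat inner
# layer ("latent") provides the latent representation discussed above.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(64, 64, 3))            # e.g. small RGB images
x = layers.Conv2D(16, 3, activation="relu")(inputs)   # pattern-detecting layers
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(32, 3, activation="relu")(x)
x = layers.Flatten()(x)
latent = layers.Dense(8, activation="relu", name="latent")(x)  # flat inner layer
outputs = layers.Dense(2, activation="softmax")(latent)        # e.g. 2 classes

model = tf.keras.Model(inputs, outputs)

# A sub-model maps input data to points in the (here 8-dimensional) latent space.
encoder = tf.keras.Model(inputs, model.get_layer("latent").output)
latent_points = encoder(np.random.rand(4, 64, 64, 3).astype("float32"))
print(latent_points.shape)   # (4, 8): 4 objects, 8 latent variables
```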
See a Wikipedia article about latent spaces which distinguishes between the “feature space” of context 1 and the “latent feature space” of context 4.
A topic for confusion
The example of image data of elephants makes it clear why we often must define precisely what we mean when we speak about “features” of “objects”. In particular, we must be careful to distinguish between media objects and the objects indirectly presented by our media objects. We also must address patterns as particular features, as well as internal object representations. Key questions are:
Do we speak of quantified physical and abstract features of the objects we are interested in? Or do media objects play a role whose features encapsulate the data of the really relevant objects? Or do we speak of patterns? Or do we refer to variables of internal or latent feature spaces?
One widespread source of confusion is mixing up a media object and the object encoded in the media data. We speak of “elephants” when the real objects an ML-algorithm is confronted with are the images of elephants. Then an algorithm classifying elephants on the basis of image data does not really distinguish between different classes of elephants (or other photographed objects). Instead it actually distinguishes between images with different kinds of pixel correlations. If we are lucky the detected pixel correlation patterns reflect some information about a single feature or the combination of multiple (physical) features of the elephants (or other imaged objects).
Note that the interpretation of the input data and the latent data of an ML-algorithm would change substantially if we had not used images of elephants and respective pixel values as training data, but data directly quantifying physical properties of an elephant – e.g. the length of its trunk – to define our “objects”.
But an ML-algorithm may also detect patterns which the human brain cannot even see in pictures of objects. Then the algorithm would work with features in contexts 2, 3 and 4 for which we may not even have a name. The features, at least in contexts 3 and 4, are in the end always abstract – and chosen by the algorithm under optimization criteria.
The interesting thing is that the feature variables chosen as our training data may totally obscure the really relevant features and respective data of the described objects. If we gave a human being a series of pixel value data and did not show the respective image in the usual 2-dimensional and colored way, he or she would have enormous difficulties extracting patterns of the photographed elephants. This is exactly the situation an artificial neural network is confronted with.
Be more precise when describing what you mean by a feature
We can resolve some of the confusion about features by specifying more precisely what we talk about. Personally, I would like to completely drop the term “feature space” for the variable space of training and input data of an ML-algorithm. Regarding the training data, the terms “input or training variables” and “variable space of training data” seem much more appropriate. If required we should at least speak of “training data features” or “input data features”.
Concerning context 2 we must clarify what the primary objects are whose feature data we feed into an algorithm – and what the secondary objects are and how their features are indirectly encoded in the primary objects. We must also say which kind of objects we are really interested in. Such a clarifying distinction is a must in the context of media data.
Context-3 related features, i.e. patterns, are in my opinion a helpful construction, in particular for describing aspects of CNNs. But such features must clearly be characterized as detected (correlation) patterns in the original input data. It should also be said in which way such a pattern-based feature impacts the output of the algorithm. In the case of CNNs, referring to “patterns of feature maps” could be helpful to indicate that certain (sub-)layers of a CNN react strongly to a certain type of input pattern.
Regarding “features” in context 4 I think that the differences between internal and latent data representations, or between “embedded” and “latent” representation spaces, are not really decisive. We can in general speak of a “latent space” when we mean a multidimensional space to which some operational processes of a trained ML-algorithm or ANN map the input data of objects. Regarding the variables defining the respective vector space I would prefer to talk of “latent variables” and a respective “latent variable space”. If we absolutely must discuss “features”, we should at least use the term “latent features”.
Conclusion
Referring to features during a discussion of ML-algorithms, their input, output and internal or latent object representations may cause trouble if the term is not defined precisely. There are at least four contexts in which the term “feature” has a different meaning. Sometimes it appears to be better to avoid the term altogether and instead refer to the relevant mathematical variables. Careful use is of particular importance if we describe our objects of interest via media such as images.