
Jupyterlab, Python3, asyncio – asynchronous tasks in a notebook background thread

Jupyterlab and IPython are always good for some surprises. Things that work in a standard Python script in Eclipse or at the prompt of a Linux shell may not work in a Python notebook within a Jupyterlab environment. One example where things behave a bit differently in Jupyterlab is asynchronous tasks.

This post is about starting and stopping asynchronous tasks via the Python3 package “asyncio” in a Jupyterlab notebook. In addition we do not want long or even infinitely running asyncio-tasks to block the use of other notebook cells.

To achieve this we have to use nested asyncio-loops and to start them as background jobs. In addition we also want to stop such loops from the foreground, i.e. from other cells in the notebook.

Being able to do such things is helpful in many Machine-Learning contexts, e.g. when you want to move multiple concurrent training and evaluation tasks into the background. It may also help you to control the update of separately started Qt5- or Gtk3/Gtk4-windows on the Linux desktop with new data.

Level of this post: Advanced. You should have some experience with Jupyterlab and the packages asyncio and IPython.lib.backgroundjobs.

Warnings and clarifications

Experimenting with asyncio and threads requires some knowledge. One reason to be careful: The asyncio-package has changed rapidly with Python3 versions. You have to test thoroughly what works in your virtual Python3 environment and what does not (or no longer) work.

1) Asynchronous jobs are not threads

Just to avoid confusion: When you start asynchronous tasks via asyncio, no new Python threads are opened. Instead, asyncio tasks are coroutines which run concurrently, but under the control of one and the same loop (in one and the same Python thread, most often the main thread). Concurrency is something different from threading or multiprocessing. It is an efficient way to intermittently distribute work between jobs of which at least one has to wait for events. I recommend spending a few minutes on the nice introduction to asyncio given by Brad Solomon [2].

2) Warning: There is already a loop running in a Jupyterlab Python notebook

Those of you who have already tried to work with asyncio-jobs in Jupyterlab notebooks may have come across unexpected errors. My own experience was that some of such errors are probably due to the fact that the notebook itself already has an asyncio-loop running. The command asyncio.get_event_loop() will point to this basic control loop. As a consequence, new tasks started via asyncio.get_event_loop().run_until_complete(task) will lead to an error. And any job which tries to stop the running notebook loop to end additionally assigned tasks [via get_event_loop().create_task(function())] will in the end crash the notebook’s kernel.

3) Warning: Asynchronous tasks are blocking the notebook cell from which they are started

There is a consequence of 1 and 2: Adding a new task to the running loop of the IPython notebook via
asyncio.get_event_loop().create_task(your_function())
has a cell-blocking effect. I.e. you have to wait until your asynchronous task has finished before you can use notebook cells other than the one you used to start the task. So, please, do not start infinitely running asyncio tasks before you know you have complete control.

4) Consequences

We need a nesting of asyncio-loops. I.e. we need a method to start our own loops under the control of the notebook’s main loop. And: We must transfer our new loop and its assigned tasks into a background thread. In the following example I will therefore demonstrate four things:

  1. Defining and starting a new, nested asyncio-loop to avoid conflicts with the running loop of the notebook.
  2. Putting all predefined asynchronous actions into a background thread.
  3. Stopping a running asyncio-loop in the background thread.
  4. Cancelling asyncio-tasks in the background thread.

Example – main code cells and explanations

The following code example illustrates the basic steps listed above. It can also be used as a basis for your own experiments. I ran the code in Jupyterlab 4.0.8, with Python version 3.9.6, Ipython 8.5.0, notebook 7.0.6 and other packages, which all were updated to their present versions (of 11/22/23).

Cell 1 Imports

import os
import time
import asyncio
import nest_asyncio
import matplotlib.backends
import matplotlib.pyplot as plt
from IPython.lib import backgroundjobs as bg

The only thing which may surprise you is the package “nest_asyncio“. It is required to work with nested asyncio-loops. We need it in particular to be able to stop new asyncio-loops which have been started under the control of the notebook’s main loop.

Cell 2 – Activate nested asyncio

nest_asyncio.apply()

This is a super-important statement! Do not forget it! Otherwise you will not get full control.

Cell 3 – Functions for asynchronous tasks

async def sprinter(num_sprinter=200, b_print=False):
    if b_print: 
        print("sprinter: num_sprinter: ", num_sprinter)
    i=0
    print('sprinter :', i)
    while i < num_sprinter:
        i += 1
        if i%20 == 0:
            print('sprinter :', i)
        await asyncio.sleep(0.1)
    print("sprint finished: ", i)

async def stopper(stop_event, num_stopper=21, b_print=False):
    if b_print:
        print("stopper: num_stopper: ", num_stopper)
    for i in range(num_stopper):
        if i%20 == 0: 
            print("stopper : ", i)
        if stop_event.is_set():
            break
        await asyncio.sleep(0.1)
    stop_event.set()
    print("finished")
    await asyncio.sleep(0.01)  

There are two functions, "sprinter()" and "stopper(stop_event)". They are rather simple; both only do some printing and intermittent sleeping. Note that both functions are defined with the keyword "async". This is required because these functions shall later run asynchronously under the control of an asyncio-loop.

sprinter() is just a long running job. To give you a real world example: It could be a job which redraws the canvas of an external plot with high frequency to adapt the plot figure to new data (e.g. update a Gtk-window on a KDE desktop periodically).

The function stopper() is more interesting. We will use it a bit later to stop the controlling asyncio-loop. It gets an asyncio-"Event"-object as one of its arguments. stopper() sets this event itself at the end of its internal for-loop, but within the loop it also checks whether the event has already been set from elsewhere and breaks out of the loop in that case. As we will see, the finishing of stopper() gives us a condition for ending the asyncio-loop it runs in.

Cell 4 - a job to set up a new asyncio event-loop

def run_loop(num_sprinter=400, num_stopper=21, b_print=True):
    if b_print: 
        print("run_loop: num_sprinter: ", num_sprinter)
        print("run_loop: num_stopper: ", num_stopper)
        print()

    async_loop = asyncio.new_event_loop()
    run_loop.loop = async_loop
    asyncio.set_event_loop(async_loop)

    run_loop.task1 = async_loop.create_task(sprinter(num_sprinter, b_print))
    stop_event = asyncio.Event()
    run_loop.stopx = stop_event
    run_loop.task2 = async_loop.create_task(stopper(stop_event, num_stopper, b_print))
    async_loop.run_until_complete(run_loop.task2)

This function does not need the async-keyword. It will not become an asyncio-task. Instead it creates our own new asyncio-loop by asyncio.new_event_loop() and assigns tasks to this loop.

After picking up some external parameters we set up a new asyncio-loop by asyncio.new_event_loop() and get a reference to this loop which we name "async_loop".

As we later want to access and stop the loop from outside the function run_loop() we create and set a new attribute of the function. (Remember, a function in Python is an object; see [9].) This attribute will only be accessible from outside after the function has been called once and the attribute has actually been set. This is no major problem in our context. But it may take a little time when we call the function in a new Python thread; see below.
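
As a minimal illustration of this function-attribute mechanism (independent of asyncio; the function and attribute names here are purely hypothetical):

def worker():
    # attach a result to the function object itself
    worker.result = 42

# print(worker.result)   # would raise an AttributeError - nothing set yet
worker()                  # the first call sets the attribute
print(worker.result)      # => 42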

After the new asyncio-loop has been defined, we set it as the current loop within the present thread-context [3]. This thread will be a new one besides the notebook's main thread; see below.

Then we set up a first task - sprinter() - under the control of our "async_loop" via
async_loop.create_task( sprinter(num_sprinter, b_print) ).
Afterward we define an asyncio-Event-object which we supply to a second task based on stopper().

Note that so far these are all just definitions. Our loop "async_loop" is not yet running and our tasks have not yet been started.

We start our new loop via async_loop.run_until_complete( run_loop.task2 ). Only afterward the two tasks run and do their jobs. This would be very different if we had used asyncio.get_event_loop().create_task(). Had we done that we would have started a task directly in the asyncio-loop of the notebook!

Note that with starting the event loop with run_until_complete() we defined a condition for the loop's existence:
The loop will stop in a natural way as soon as task2 is finished.

So even if task1, i.e. sprinter(), had run infinitely, it would be stopped as soon as stopper() finishes. stopper() is a kind of emergency tool - we will later extend task1 significantly. stopper() always offers us a clean way to stop the whole asyncio-loop correctly. See also [4], [5].

Cell 5 - Starting a background job

# Numbers of internal iterations of 
# sprinter() and stopper()
num_sprinter = 400  
num_stopper = 41     

a = run_loop
jobs = bg.BackgroundJobManager()
out = jobs.new(a, num_sprinter, num_stopper)
print()
print(out)
print()

We set the number of loop-iterations for our two tasks first. Note that the number for stopper() is chosen to be much smaller than that for sprinter(). As we use the same sleep interval asyncio.sleep(0.1) in both task1 and task2, task2 should finish long before task1. So, in a first test we should see that the asyncio-loop stops after the 41 internal iterations of stopper() (i = 0 to 40) have completed.

In the second part of the cell we set a reference "a = run_loop" to our function. Then we set up an object "jobs" which allows us to control background jobs from an IPython environment like Jupyterlab. bg refers to the related library IPython.lib.backgroundjobs (see cell 1 and [6]; in particular the section on classes in "lib.backgroundjobs").

We use the jobs.new() function to create a new Python thread and to start the function run_loop() within this thread. run_loop() in turn creates the desired asyncio-loop in the new thread. (Note that we could have started more loops in this thread.) The positional arguments for the callable "a" are provided directly and comma-separated after it.

The output problem of background jobs of Jupyterlab

The output of the background job "run_loop()" will be directed to whatever cell we are presently using in the notebook. As the tasks run as a background job in another thread than the main notebook thread, we (hopefully) can work with other notebook cells after having started the job. But the output area of those cells will potentially be cluttered by messages from the background job and its asyncio-tasks. This can become very confusing.

A very simple solution on Linux would be to write the output of the background tasks into some file whose changing contents we follow by "tail -f -n100 your_filepath". Another solution, which is based on the creation of a separate HTML-window, is outlined in [10] (for a Pandas output, but we can adapt the code to our needs).
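
A minimal sketch of this file-based approach could look as follows. The function name and the path "/tmp/bg_tasks.log" are just examples, and the imports of cell 1 are assumed:

async def sprinter_to_file(num_sprinter=200, logpath="/tmp/bg_tasks.log"):
    # a sprinter-variant which appends its messages to a log file
    # instead of printing into the presently used notebook cell
    i = 0
    while i < num_sprinter:
        i += 1
        if i % 20 == 0:
            with open(logpath, "a") as f:
                f.write("sprinter : " + str(i) + "\n")
        await asyncio.sleep(0.1)

# follow the output from a terminal via:  tail -f -n100 /tmp/bg_tasks.log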

For those who like PyQt5, probably the best solution is to open a separate, native Qt5-window containing a QTextEdit-widget. You can write to such a window in a thread-safe way via a queue. I will demonstrate this in a forthcoming post.

First test

Let us try all this out and run all of the cells defined so far. The output of cell 5 indeed looks like:

run_loop: num_sprinter:  400
run_loop: num_stopper:  41

sprinter: num_sprinter:  400
sprinter : 0
stopper: num_stopper:  41
stopper :  0

sprinter : 20
stopper :  20
sprinter : 40
stopper :  40
finished

Exactly what we had hoped for! 🙂

Other helping cells to control the status of the background jobs

Cell 6 - Checking the status of the background job

jobs.status()

After we have run the cells up to cell 5 with the iteration numbers given above and have awaited the finalization of the asyncio-loop, this statement produces something like

Completed jobs:
0 : 

Cell 9 - Removing a finalized job from the BackgroundJobManager() control object

When we see something like "Completed jobs" or "Dead jobs" we can use the shown number (in the above case 0) to remove the job from the list of jobs controlled by our object "jobs=bg.BackgroundJobManager()".

jobs.status()
jobs.remove(0)
jobs.status()

Afterward, we should not get any output from the last jobs.status() (if we had not started other jobs in further threads).

How to stop the asyncio-loop in the background thread

We have multiple options to do so. Let us assume that we have set num_sprinter=5000 and num_stopper=2000. This would give us enough time to move to other cells of the notebook and test stopping options there. We prepare the following cell:

Cell 8 - stopping the asyncio-loop in the background

b_stop = 0 
if b_stop == 0:
    a.stopx.set()
elif b_stop == 1: 
    a.loop.stop()
else: 
    a.task2.cancel()
    a.task1.cancel()    

This cell allows for 3 methods, which you can test independently by the following steps:

  • check that jobs.status() has no output (cell 6)  =>  change b_stop (cell 8)  =>  define run_loop again (cell 4)  =>  restart run_loop (cell 5)  =>  wait (!!) for two outputs from both tasks  =>  stop run_loop by running cell 8  =>  check the status of jobs (cell 6)  =>  remove the dead job from the list (cell 9)

Waiting is required as the jobs have to start and because asyncio.sleep(0.1) must have run at least once in both tasks. Otherwise you will get errors referring to pending tasks. But even if this happens, the loop will nevertheless be dead. Our options are:

b_stop = 0: A very clean method which uses our condition for the loop, which should only run until task2 has finished. Can we trigger a stop of this task before its internal for-loop finalizes? Yes, we can.
We have prepared an attribute "stopx" of our function "run_loop()", which directly points to the asyncio-Event-object used in task2 (stopper). We can trigger the event by its set()-method, and we can do this from any cell in our Jupyterlab notebook. stopper() in turn checks the status of the event within its for-loop and breaks the loop when the event has been set. Then stopper() finalizes and the asyncio-loop is stopped and removed.

b_stop = 1: This method uses the function's attribute "run_loop.loop", which points to the asyncio-loop, to stop the loop directly. (Note that this method in our case requires waiting for some output.)

b_stop = 2: This method uses the function's attributes "run_loop.task1" and "run_loop.task2", which point to the asyncio-loop's tasks. Via a.task2.cancel() and a.task1.cancel() the tasks can be cancelled. The asyncio-loop stops automatically afterward. (Note that this method in our case requires waiting for some output.)

Restarting the job "run_loop" after having it removed

If you want to be on the safe side, redefine "run_loop()" by running cell 4 again.

Can we start multiple background jobs - each with its own asyncio loops and tasks?

Yes, we can. You can try this out by preparing two functions with different names "run_loop1" and "run_loop2". You, of course, have to adapt other statements, too.

But note: It is not wise to start our job "run_loop" twice by simply running cell 5 two times. The reason is that both background jobs would then use one and the same function object, which may lead to unclear or overwritten attribute assignments.

Can we kill a thread (started by Ipython's backgroundjobs) that runs wild?

Unfortunately, we cannot kill a thread, which we have started from a cell of a Jupyterlab notebook, by some clever commands of the backgroundjobs-package in another cell of the same notebook. We have to wait until the thread terminates regularly.

However and of course, we could eliminate such a process from a Linux terminal or by issuing the Linux kill-command via the os-package from a Jupyterlab cell. We first have to find the process PID of the notebook kernel. (It would probably be the last kernel-process if you had started the notebook as the last one in Jupyterlab.) "ps aux | grep kernel" will help. Afterward you can show the running threads of this process via "ps -T -p <pid>". The SPID given there can be used to kill such a process with the "kill -9 <(S)PID>" command.

Further experiments?

You could experiment with tasks that use internal loops with different task-dependent intervals of "await asyncio.sleep(interval)". Also think about realistic examples where one task produces data which another task uses to evaluate and to plot or print.

Conclusion

With this post I have shown that we can start and stop asyncio-tasks in the background of a Jupyterlab Python notebook. Important ingredients were the use of nest_asyncio.apply() and the creation of a new asyncio-loop in the background of the notebook.

This appears to be useful in many situations. One example would be concurrent jobs of which at least one waits for events or data from other tasks. This waiting time can be used by other functions, while event-conditions are checked intermittently by task-internal loops with sleep-intervals. Another example could be a job that creates information while another job plots this information in parallel.

In my simple example presented above the tasks had some relatively long intermittent await-intervals of 0.1 sec. In real world examples, in particular when plot-updates are involved, we would probably use significantly shorter intervals. In addition we would use different wait-intervals per task (fitting the production times of each of the concurrent tasks).

A natural advantage of running in the background, i.e. in another thread, is that we do not block working with other notebook cells. We can use code in the foreground while concurrent jobs meanwhile do their work in the background. We have to take care of separating the outputs of background jobs from the outputs of the concurrently used notebook cells. But this is no fundamental problem.

In a forthcoming post in this blog I will use the results above for controlling GTK3- windows which presently do not cooperate correctly with Matplotlib's interactive mode (ion()) in Jupyterlab. We will allow for canvas redraw-actions while at the same time awaiting a window close event.

Literature

asyncio documentation
[1] https://docs.python.org/3/library/asyncio.html
[2] https://realpython.com/async-io-python/

asyncio Event Loop documentation
[3] https://docs.python.org/3/library/asyncio-eventloop.html

Not working in IPython - but basically interesting for stopping asyncio loops
[4] https://stackoverflow.com/questions/64757209/how-to-stop-asyncio-loop-with-multiple-tasks
[5] https://superfastpython.com/asyncio-cancel-task/

Documentation on backgroundjobs
[6] https://ipython.readthedocs.io/en/stable/api/generated/IPython.lib.backgroundjobs.html
[7] https://ipython.org/ipython-doc/rel-0.10.2/html/api/generated/IPython.background_jobs.html
[8] https://stackoverflow.com/questions/36895256/coroutine-as-background-job-in-jupyter-notebook

Set attributes of functions
[9] https://stackoverflow.com/questions/47056059/best-way-to-add-attributes-to-a-python-function

External output window via HTML
[10] https://stackoverflow.com/questions/40554839/pop-out-expand-jupyter-cell-to-new-browser-window

Jupyterlab, matplotlib, dynamic plots – I – relevant backends

When we work with Deep Neural Networks on limited HW-resources we must get an overview over CPU- and VRAM-consumption, training progress, change of metrical variables of our network models, etc. Most of us will probably want to see the development of our system- and model-related variables in a graphical way. All of this requires dynamic plots, which are updated periodically and thus display monitored data live.

As non-professionals we probably use Python code in standalone Jupyter notebooks or (multiple) Python notebooks in a common Jupyterlab environment. All within a browser.

The Jupyterlab interface resembles a typical IDE. Its structure and code are more complicated than those of pure Jupyter Notebooks. Jupyterlab comes with more configuration options, but also with more SW-problems. But Jupyterlab has some notable advantages. Once you have turned to it you probably won’t go back to isolated Jupyter notebooks.

In this post series I want to discuss how you can create, update and organize multiple dynamic plots with Jupyterlab 4 (4.0.8 in a Python 3.9 environment), Python 3 and Matplotlib. There are various options and – depending on which way you want to go – one must also overcome some obstacles. I will describe options only for a Linux KDE environment. But things should work on Gnome and other Linux GUIs, too. I do not care about Microbesoft.

In this first post I show you how to get an overview over Matplotlib’s relevant graphics backends. Further posts will then describe whether and how the backends “Qt5Agg”, “TkAgg”, “Gtk3Agg”, “WebAgg” and “ipympl” work. Only two of these backends automatically and completely update plot figures after each of a series of multiple plot commands in a notebook cell. The Gtk3-backend poses a specific problem – but this gives me a welcome opportunity to discuss a method to trigger periodic plot updates with the help of user-controlled asynchronous background tasks.

Addendum and major changes, 11/25/23 and 11/28/23: This post was updated and partially rewritten according to new insights. I have also added an objective regarding plot updates from background jobs.

Objectives – updates of multiple plot frames with live data

We need dynamic, i.e. live plots, whose frames are created once in various Jupyter cells – and which later can be updated based on changed and extended data values.

  1. We will organize multiple plot figures at some central place on our desktop screen – within Jupyterlab or outside Jupyterlab on the Linux desktop GUI.
  2. We will perform updates of multiple plots with changed data – first one after the other with code in separate and independent Jupyter cells (and with the help of respective functions).
  3. We will perform live updates of multiple plots in parallel and continuously by one common loop that gathers new data and updates related figures.
  4. We will perform continuous updates of our plot figures from background jobs.

Please regard the following scope and limitations:

  • Regarding the 1st point: The way of preparing dynamic plot frames is a matter of both code and work organization. We can use independent windows on our Linux desktop-GUI or sub-areas of the Jupyterlab interface. While Qt5-, Tk- and Gtk3-windows give us a lot of freedom here, we have to be more careful regarding a standard web-interface (WebAgg) supported by Matplotlib. WebAgg forces us to visualize all of our figures at a certain common point in time if we do not want to clutter our browser with many tabs and multiple superfluous instances of a web-page. Jupyterlab with its ipympl-backend offers us the option to show the output of cells in separate sub-windows of the Jupyterlab-interface.
  • Regarding the 2nd point: We will pick up some new data in a notebook cell and just individually update plot figures which we have created before. I will discuss what we have to do when the update of the figures’ canvasses is not done automatically.
  • Regarding the 3rd point: We focus on a situation where we periodically receive new data for all plots and update them in parallel. We will control both data gathering and plot updates by one central loop. (This indirectly covers the case of Keras-based ANN-models where we would use callbacks provided to the fit()-function of the training loop. These callbacks would pick up new data and trigger the redrawing of matplotlib-figures and axes-subplot frames; see the sketch after this list.) Note that such an approach blocks the use of further cells in our notebooks until the loop has finished.
  • Regarding the 4th point: In the present post series we will not care much about details of how the changing data are produced or gathered from multiple and different resources. We will just produce artificial new data on the fly. But in real world ML-scenarios we may want to follow data creation of programs that were started independently of the plot figures. And we would like to work with other cells in our notebooks while the plots are updated. To cover such a non-blocking situation I will discuss both data and plot updates from Python threads running in the background of Jupyterlab.
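
Regarding the Keras-callback variant mentioned in the third point, a minimal sketch could look like the code below. The class name "LivePlotCallback" and the usage lines are purely illustrative; the details of drawing and flushing for the different backends are discussed in the later posts of this series:

import tensorflow as tf
import matplotlib.pyplot as plt

class LivePlotCallback(tf.keras.callbacks.Callback):
    # appends the epoch loss to a line plot after every epoch of fit()
    def __init__(self, fig, ax):
        super().__init__()
        self.fig, self.ax = fig, ax
        self.losses = []

    def on_epoch_end(self, epoch, logs=None):
        self.losses.append(logs["loss"])
        self.ax.clear()
        self.ax.plot(self.losses)
        self.fig.canvas.draw_idle()
        self.fig.canvas.flush_events()   # required for some backends; see below

# usage - assuming an already compiled Keras "model" and training data:
# fig, ax = plt.subplots()
# model.fit(X_train, y_train, epochs=20, callbacks=[LivePlotCallback(fig, ax)])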

Dynamic plots in Jupyterlab – interactive mode

All of the points named above require that we can produce dynamic live plots at all with Jupyterlab. What does dynamic mean? It basically means that we create a plot figure (with sub-plots) once and then only update the contents of the canvas region (and related tick-marks on the axes). We would expect that Matplotlib supports an automatic update of the figures’ contents. If this were true we would only have to execute plot commands and re-draw the canvas.

So, is there a mode for notebooks and Matplotlib which provides an automatic update of plot figures when we use plotting commands for already defined figures? Yes, this is Matplotlib’s interactive mode.

From my experience with Jupyter Notebook’s plot backends I had expected that starting interactive plot mode via pyplot.ion() would be sufficient in Jupyterlab to guarantee automatic figure updates as soon as we re-draw the figures’ canvasses. This works directly and well for only 2 backends. Other backends require additional commands.
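
To illustrate the kind of update pattern meant here, a minimal sketch (assuming one of the directly working backends, e.g. ipympl activated via %matplotlib widget in a previous cell):

import time
import numpy as np
import matplotlib.pyplot as plt

plt.ion()                        # switch on interactive mode
fig, ax = plt.subplots()         # create figure and axes only once
line, = ax.plot([], [])          # an initially empty line object

x = np.linspace(0.0, 2.0 * np.pi, 200)
for k in range(1, 6):
    line.set_data(x, np.sin(k * x))   # only the canvas contents change
    ax.relim()
    ax.autoscale_view()
    fig.canvas.draw_idle()            # request a redraw of the canvas
    time.sleep(0.5)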

Addendum 11/28/2023: I am tempted meanwhile to accept this as a feature. Wait for a discussion in the next post.

In addition, at least on my test system (Leap 15.4, KDE 5.90, Qt 5.15, Python 3.9.6, Matplotlib 3.8.2, Jupyterlab 4.0.8), Gtk3 caused further problems which require an additional trick. I will show for the problematic cases how we can take over control by our own periodic update loops in Jupyterlab’s background.

Note that there is also another meaning of “dynamic” and “interactive”: We may also mean the ability to use control buttons for a plot frame which enable us to interactively move, zoom or save plot contents. We will see in the context of a Gtk3-related graphics backend that this kind of interactivity is closely related to the canvas update problem.

Theoretical properties of the interactive mode [ triggered by ion() ]

According to matplotlib’s documentation we would expect 3 things in interactive mode; I quote:

  • newly created figures will be shown immediately;
  • figures will automatically redraw on change;
  • pyplot.show() will not block by default.

The second point leaves some room for interpretation. A hard interpretation would be that any plotting action would automatically be displayed on figures and their canvasses under the control of a Matplotlib-backend as soon as we request a re-draw of the figure’s canvas. A softer interpretation would say: Well, produce real output only when you explicitly provide your basic render results to the output engine of your GUI-backend. And by thinking a bit more about the consequences for the interplay between a command UI and e.g. a desktop GUI, we may find that the “right” way of doing it is a matter of implementation reasoning and even efficiency. We will dig a bit deeper in the next post.

To make a long test story short for the time being:
In Jupyterlab the so-called Jupyterlab-specific “inline”-backend does not display changes automatically. The backend “notebook”, which often is used in stand-alone Jupyter Notebooks, does not work at all with Jupyterlab. Most of the so-called “interactive backends” do not automatically update the canvas regions in their (external) plot-windows when we call the draw()-command for rendering. In forthcoming posts we will see that there are two sides to this. And Gtk3 – at least on my system – is not working as expected.

Overview over Matplotlib backends

Matplotlib works together with a variety of backends. A present list of supported standard backends is

‘GTK3Agg’, ‘GTK3Cairo’, ‘GTK4Agg’, ‘GTK4Cairo’, ‘MacOSX’, ‘nbAgg’, ‘QtAgg’, ‘QtCairo’, ‘Qt5Agg’, ‘Qt5Cairo’, ‘TkAgg’, ‘TkCairo’, ‘WebAgg’, ‘WX’, ‘WXAgg’, ‘WXCairo’, ‘agg’, ‘cairo’, ‘pdf’, ‘pgf’, ‘ps’, ‘svg’, ‘template’

See the Python code cells below for how to get such a list. You find the list also in the Matplotlib-documentation. Which of these standard backends really are available depends on your system, its installed Linux packages and on installed Python modules. On my system

[‘agg’, ‘gtk3cairo’, ‘gtk3agg’, ‘svg’, ‘tkcairo’, ‘qt5agg’, ‘template’, ‘cairo’, ‘ps’, ‘nbagg’, ‘pgf’, ‘qtagg’, ‘pdf’, ‘tkagg’, ‘webagg’, ‘qtcairo’, ‘qt5cairo’]

were found to be valid backends. Of these only some are so-called “interactive backends“, which are required to perform dynamic updates.

Standard backends

  • Static standard backends: The backends {‘pdf’, ‘pgf’, ‘ps’, ‘svg’, ‘template’, ‘cairo’} are static ones and not interactive. They do not work in our dynamic context.
  • Interactive standard backends: All of the other backends are theoretically interactive ones. Note that the Cairo-variants of these backends work similarly to their basic counterparts. On my system “GTK4” and “WX” are not fully installed, so I could not test them.

Other Jupyter– and Jupyterlab-related backends

There are some IPython/Jupyter-specific backends. These backends are implemented in Jupyter or Jupyterlab:

  • inline – standard plot-backend in Jupyterlab. It does not support interactive mode and cannot be used in our context.
  • ipympl – the only Ipython-specific plot-backend which supports matplotlib’s interactive mode within Jupyterlab notebooks,
  • notebook [nb, equivalent to nbAgg] – does not work in Jupyterlab at all (error message).

In standalone Jupyter Notebooks the “notebook”-backend not only works; there it also supports ion(). In Jupyterlab, however, “notebook” will lead to a warning like:

notebook backend: Javascript Error: IPython is not defined.

How to activate a backend

You can select and set a specific standard backend to be used in a notebook via the commands:

  • matplotlib.use(‘TkAgg’) or pyplot.switch_backend(‘TkAgg’).

The backend strings are evaluated case-insensitively.

Note: If you want to change a backend within a running notebook you will almost always get an error message in Jupyterlab. To test a new backend you normally have to restart the notebook’s kernel.

The Jupyter-specific backends can be invoked in Jupyter notebooks by so-called magic cell commands, e.g.

  • Jupyterlab: %matplotlib inline, %matplotlib ipympl (or equivalently %matplotlib widget) .

This should, of course, be done before performing figure definitions.

Python code to get lists of supported matplotlib backends

Code to get a list of basically supported standard backends of Matplotlib

import os.path
import matplotlib.backends
import matplotlib.pyplot as plt

def is_backend_module(fname):
    """Identifies if a filename is a matplotlib backend module"""
    return fname.startswith('backend_') and fname.endswith('.py')
def backend_fname_formatter(fname): 
    """Removes the extension of the given filename, then takes away the leading 'backend_'."""
    return os.path.splitext(fname)[0][8:]

# get the directory where the backends live
backends_dir = os.path.dirname(matplotlib.backends.__file__)

# filter all files in that directory to identify all files which provide a backend
backend_fnames = filter(is_backend_module, os.listdir(backends_dir))
backends = [backend_fname_formatter(fname) for fname in backend_fnames]
print("Supported standard backends: \n" + str(backends))
print()

Code to get a list of valid standard Matplotlib backends on your Linux system

# validate backends
backends_valid = []
for b in backends:
    try:
        plt.switch_backend(b)
        backends_valid += [b]
    except:
        continue
print("Valid standard backends on my system: \n" + str(backends_valid))
print()
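
If you only want to know which backends Matplotlib itself classifies as interactive ones, there is also a list kept in the rcsetup-module (still present at least up to Matplotlib 3.8; verify this for your version):

import matplotlib.rcsetup

print("Interactive backends: \n" + str(matplotlib.rcsetup.interactive_bk))
print("Non-interactive backends: \n" + str(matplotlib.rcsetup.non_interactive_bk))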

Backends working directly with interactive mode in Jupyterlab Python notebooks

Presently the only Matplotlib-backends which truly respect ion() in Jupyterlab (in the sense of our hard interpretation; see above) are

  • “WebAgg” (browser-related backend based on the tornado-server).
  • “ipympl” (the Jupyter-specific backend)

Both react directly to figure.canvas.draw()-requests.

Backends that need flushing of events in Jupyterlab to work dynamically

The following backends do not support multiple automatic updates of figure canvasses when these updates are requested within a loop executed in the same notebook cell:

    • “Qt5Agg” (supports Qt5 windows on the desktop outside Jupyterlab)
    • “TkAgg” (supports Tk windows on the desktop outside Jupyterlab)
    • “Gtk3Agg” (supports Gtk3 windows on the desktop outside Jupyterlab, but requires a workaround regarding an initial error)

The canvas of a figure will only be updated on the GUI when the last Python code command of a cell has finished. All these backends require a so-called “flushing” of canvas-related draw()-requests and respective events to the GUI. I will explain in the next post in more detail how we enforce the output production of accumulated render-results which typically are gathered in a queue for a figure’s canvas at any given time.
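
To indicate already here what such a flushing means in terms of code: The canvas objects of Matplotlib figures provide a flush_events()-method which pushes pending draw-requests and GUI-events to the backend's window. A minimal sketch (update_plot_data() is just a placeholder for your own plot commands):

# assuming a figure "fig" created under e.g. the Qt5Agg-backend
for step in range(100):
    update_plot_data(fig)         # placeholder: change the plot contents
    fig.canvas.draw_idle()        # request a re-rendering of the canvas
    fig.canvas.flush_events()     # push the pending events to the GUI now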

We will see that automatic periodic updates can be performed by asynchronous processes in the background of Jupyterlab. This will lead to the construction of a workaround for all of the named “problematic” backends – if and when we really want a plot window to directly react to draw()-requests in Jupyterlab notebooks.

Workaround for Gtk3-backend error in Jupyterlab?

The Gtk3Agg-backend does not only fail to work with ion() in Jupyterlab. It also creates an additional error, both on KDE and Gnome, if ion() is not started ahead of setting the backend to be used. However, when we do start ion() ahead, even the interactive control elements (buttons) do not work afterward. But we will see that the workaround scheme which we will build for problematic backends does solve this problem, too.

Conclusion

There are several Matplotlib backends which we can potentially use to dynamically update the canvas contents of already created figures with new data. Unfortunately (or by design?), not all backends support direct or continuous automatic canvas updates requested by multiple draw-commands in one and the same notebook cell. If we want to display the results of certain plot commands directly and dynamically we either need to issue special commands or develop a specific background job in a Jupyterlab notebook to cover such a situation for critical backends like Qt5Agg.

In addition, for real-world scenarios we do not want our plot production to block on the level of notebook cells: We want to be able to perform such updates from any cell in our notebook, from a common loop (with and without blocking the respective notebook cell) and in particular from background jobs. To avoid cell blocking we will necessarily have to include background execution into our workaround.

The development of workaround solutions will be the topic of the next post. I will demonstrate different approaches first for the example of the Qt5-backend and figure windows external to Jupyterlab. In further posts I will discuss the use of Tk- and Gtk3-windows. Afterward, I will demonstrate the usage of the Web-backend of Matplotlib. A final post will cover IPython’s ipympl-backend for dynamic plot figures within the Jupyterlab interface.

 

ResNet basics – II – ResNet V2 architecture

In the 1st post of this series

ResNet basics – I – problems of standard CNNs

I gave an overview over building blocks of standard Convolutional Neural Networks [CNN]. I also briefly discussed some problems that come up when we try to build really deep networks with multiple stacks of convolutional layers [e.g. Conv1D- or Conv2D-layers of the Keras framework].

In this 2nd post I discuss the core elements of so-called deep Residual Networks [ResNets]. ResNets have been published in multiple versions. Most of the differences are due to different types of so-called Residual Units [RUs], which can be regarded as elementary bricks of a ResNet architecture. A RU can have a complex inner structure composed of basic layers such as Conv2D-, Activation- and Normalization-layers.

I only cover basic elements of the ResNet V2 architecture. I will do this in a personal introductory and summarizing way. The original research papers of He, Zhang, Ren and Sun (see [1] to [3]) will give you much more details and information. Concerning a solid approach to programming ResNets with the help of Keras and Tensorflow 2 you will find more information in [4] and in future posts of this blog. I strongly recommend to have a look at the named literature.

Level of this post: Advanced. You should be familiar with CNNs both theoretically and regarding numerical experiments.

Basic idea: Transfer of unfiltered data in parallel to filtering

At the end of the previous post I pointed out that an inner Conv2D-layer of a standard CNN adapts its filters to already filtered data coming from previous layers. Thus the knowledge stored in the Conv2D-maps is built on filters of previous layers. Consequences are:

  • An inner Conv2D-layer of a CNN cannot learn something from original input information.
  • All of a CNN’s layers must together, i.e. as a unit, find an optimal solution of their combined filters during training.

What could an extended approach look like? How could filters of an inner group of layers adapt to unfiltered data? At least partially? Your spontaneous answer will probably suggest the following:

To enable a partial adaption of layers to unfiltered data one must somehow enable a propagation of unfiltered information throughout the network.

One of the basic ideas of ResNets is that the filters of a group of convolutional layers should adapt to patterns in the difference between its last map’s output and the input data presented to the group. I.e. the maps of such a filtering group should, during training, adapt to relevant patterns in the difference of filtered (= convolved) data minus the original input data. (“Relevant” means important for solving a defined task, e.g. classification.) For the information transport between filtering units this means that we must actually propagate tensors of filtered data plus a tensor of original data. For some math see the next post in this series.

Thus we can make our simple answer consistent with a basic ResNet idea by adding original input tensors to the output of basic filtering units. Transporting original tensors alongside a filtering unit means that these tensors must be mapped to a new location by a simple identity function.

For the purpose of illustration let us assume that we have a group of sequential Conv2D-layers which we regard as a filtering unit. Then our idea means that we must add the original input tensor to the unit’s output tensor before we transfer the sum as input to the next filtering unit.
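
In a notation close to that of the original papers ([1], [2]) this superposition can be written as

\[ y \,=\, F(x, \{W_i\}) \,+\, x \, , \]

where x denotes the input tensor of the filtering unit, F(x, {W_i}) the residual mapping realized by the unit's weighted (convolutional) layers, and y the tensor handed over as input to the next unit.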

Residual Units and Residual Layers

To use the ResNet vocabulary we say that our group of Conv2D-layers forms a “Residual Unit” [RU]. (“Residual” refers to the difference with results of an identity transformation.) Residual Units in turn can be stacked to form larger entities (Residual Stacks [RS]) of a ResNet-architecture; see below. Some people use the word “Stages” instead of “Stacks”.

Inside a RU we typically find an arrangement of some sub-layers, so-called Residual Layers [RLs]. In a ResNet V2 we find just an ordered sequence of RLs with different properties. An RL can itself have a complex substructure comprising an arrangement of standard layers which apply convolutional, normalization and activation operations. I will come back to this substructure for ResNet V2 in a separate section below.

Stacks of Residual Units

How would we get a well-defined stack of RUs? We would not vary all properties of the comprised RUs wildly. In principle a new stack in a ResNet most often comes with a new level of resolution. But within a stack the resolution of the output-maps of each RU is kept constant. For reasons of simplicity we request the following:

Both the number of filters and the dimensions of the output-maps of all RUs in a well-defined stack RS (of a ResNet V2) do not change.

You have to be careful here: We only refer to the number and dimensions of the output-maps of the RUs. We will see later that both the dimensions of the filters and the number of filters and maps can change from RL to RL within a RU. In addition, the stride changes at the 1st RL of the 1st RU of a stack.

Within a stack of RUs, however, neither the number nF of filters/maps nor the dimensions of the output-maps of the last Conv2D-layer in the RUs change. Actually, we use this property as part of the definition of a residual stack collecting the same type of RUs.

The following images show the principle of the information flow, first for one stack RS and then for one RU within a defined stack:

In the drawing above the transport of original information is shown as a (gray) side-track on the left. A potential sub-structure of different standard layers in a RL is indicated by varying colors. The maps belong to a Conv2D-layer (light orange) of each RL. The maps of the last RL are the output-maps of the RU.

Despite changing kernel-dimensions used for the Conv2D-layers in the different RLs, the respective dimensions of the resulting maps inside a RU can be kept constant via setting padding = “same”. So, the dimensions of the maps remain constant for all RUs/RLs of a certain stack (with the exception of the 1st RL of a stack’s 1st RU; see below).

Regarding the number of maps p, q, nF: Referring to the drawing we typically can use p = nF / 2, q = p (see [4] and below). But also a reduction in map-depth to p = nF / 4 has been used. For respective kernel dimensions and other details see the section on the RL-sub-structure of ResNet V2 networks below.

A really important point in the drawing for a RU is that the RU’s output is a pure superposition of the original signal with filtered data. Thus the original data can propagate through the whole net during the training phase of a ResNet V2. As the weights in the filtering part typically remain small for a while after initialization, all layers will initially have a chance to adapt their filters to the original input information. But also in later phases of the training the original information can spread itself as a major contribution to the input of all RUs in a RS stack.

The next plot condenses the information transport a bit by placing the transfer path of the original data inside a Residual Unit. This kind of presentation does not give us any more information, but is a bit more helpful for a later programming of RUs with the help of Keras (or PyTorch).

We speak of a “Shortcut Connection” or “Skip Connection” regarding the transfer of the original information: It bridges the sequence of RL-sublayers of a RU. Such a RU is regarded as a kind of basic unit for information processing in ResNets.

Regarding the difference of ResNet V1 vs. ResNet V2:
The plain addition of the signals at the output-side of a RU came with version V2 of ResNets in [2]. It marked a major difference in comparison to [1] and ResNet V1 architectures. In contrast to previous architectures, in a ResNet V2 architecture we do not apply an activation function to the sum at the output side of a RU. This enables a propagation of unfiltered data throughout the whole net.

Resolution reduction and special shortcut connections in the 1st RU of a stack

Also ResNets must realize a basic principle of CNNs: Patterns should be detected on different length scales or spatial dimensions. I.e., we need stacks of RUs and RLs working on averaged data of lower resolution.

In CNNs we mark a point of a transition to lower resolution either by a pooling layer or by a convolutional layer with a stride s > 1, e.g. s = 2. In a ResNet V2 we use special RUs that contain a first RL (and related Conv2D) having a stride s = 2. Such a RU appears as the 1st RU of a new stack of RUs working on a lower resolution level, i.e. with maps of smaller dimensions.

The drawings above make it clear that such a (1st) RU [K/1] of a stack RS [K] must do something special along its shortcut-connection to keep the resolution of the transported unfiltered input data aligned with the dimensions of the RU’s (filtered) output-maps.

The simplest way to achieve equal dimensions of the tensors is to employ a Conv2D-layer with the same stride s = 2 along the shortcut. Such a special shortcut-connection is shown in the next graphics:

There is a consequence of this pattern which we cannot avoid:

The original image information is not propagated unchanged throughout the whole ResNet. Instead lower resolution variants of the original input data are propagated as contributions through the RUs of each stack.

Architectural Hierarchy – Stacks of RUs, Residual Units, Residual Layers, Sub-Layers

Putting the elements discussed above together we arrive at the following architectural hierarchy of a deep ResNet V2:

  • A first Conv2D-layer that initially scans the input data with a filter of suitable dimensions. This Conv2D-layer is accompanied by some standard sub-layers such as an Input-, a BatchNormalization-, an Activation- and sometimes an additional second Conv2D-layer reducing resolution.
  • Stacks of Residual Units with the same number of convolutional filters nF. nF varies from stack to stack. All maps of convolutional layers within a stack shall have the same resolution.
  • Residual Unit [RU] (comprising a fixed number of so called Residual Layers)
  • Residual Layer [RL] (comprising some standard sub-layers; the sub-structure of RLs can vary between architecture versions)
  • Standard Sub-Layers of a RL – including BatchNormalization-, Activation- and Conv2D-Layers in a certain order
  • A classifying MLP-like network of fully-connected [FC] layers (classification) or a specific dense FC-layer to fill a latent space (Encoders).

The first layers filter, normalize, transform and sometimes average input data (training data) to an intermediate target resolution. Afterward we have multiple stacks of Residual Units [RUs], each with a shortcut connection. The depth of a network, i.e. the number of analyzing Conv2D– or Dense-layers depends mainly on the number of RUs in the distinguished stacks.

An example of a relatively simple ResNet V2 network could, on the level of RUs, look like this:

This example comprises 4 stacks of RUs (distinguished by different colors in the drawing). All RUs within a stack have a (fixed) number of RLs (not displayed). The number of RUs changes from nRU = 3 for RS 1 to nRU = 6 for RS 2, nRU = 6 for RS 3 and eventually to nRU = 3 for RS 4. Each RL comprises a Conv2D-layer (not displayed above). Within each stack the Conv2D-output-layers of the RUs all have the same number of filters nF and of respective maps (64, 128, 256, 512 for the four stacks).

As usual we raise the number of maps with shrinking resolution. All output maps of the RUs in a certain stack have the same dimensions. The central filter-kernel in the example is chosen to be k=(3×3) throughout all stacks, RUs and RLs (for the meaning of “central” see the section on a RL’s sub-structure below).

The blue curved lines symbolize the RUs’ shortcut connections. The orange lines instead symbolize special shortcut connections with an extra Conv2D-Layer with kernel (1×1) and stride=2. As discussed above, these special shortcuts compensate for the reduction of resolution and map-dimensions.

Again: The advantage of such a structure is that inner layers can already start learning even if the filter-values in the “residual parts” of the first layers are still small.

Bottleneck Residual Units – and the sub-structure of their Residual Layers

A RU consists of Residual Layers [RLs]. Each RL in turn consists of a sequence of standard layers. The authors of [2] and [3] have investigated the effects of a large variety of different RU/RL sub-structures on convergence during training and on the error-rate of classification ResNets after training. The eventually optimal sub-structure of a RU (for ResNet V2) can be described as follows:

  • A RU (of a ResNet V2) consists of 3 sub-layers, i.e. RLs (Residual Layers) in our terminology.
  • Each of these Residual Layers comprises three standard sub-layers in the following sequential order
         BatchNormalization Layer => Activation Layer (with the Relu function) => Conv2D Layer.
         The only exception is the first RL of the first RU of the first stack (see below).
  • The number of output-maps of the RU is the nF defined for the stack.
  • The 1st RL uses a Conv2D-layer with a (1×1)-kernel, a stride s=1 (s=2 only for the first RL of the first RU in a stack) and padding = “same”. The number of maps (depth) of this layer is reduced to nF/2 or nF/4.
  • The 2nd RL uses a Conv2D-layer with a (3×3)-kernel, a stride s=1 and padding=”same”. The number of maps (depth) of this layer is reduced to nF/2 or nF/4.
  • The 3rd RL uses a Conv2D-layer with a (1×1)-kernel, a stride s=1 and padding=”same”. The number of maps is nF.

“Optimal” refers to the number of parameters, the complexity and the level of accuracy. This structure is called a “bottleneck“-architecture with full pre-activation. See the graphics below:

The term “pre-activation” is used because the activation occurs ahead of the convolution. This is a remarkable deviation from the previously used “wisdom” of performing the activation after the convolution.

The term “bottleneck” itself refers to a reduction of the number of maps used in the Conv2D-layers in RL [K,N,1] and RL[K,N,2] within the RL-sequence:

The first RL with k=(1×1) reduces the number of maps, i.e. the Conv2D-layer’s depth, to some fraction (< 1) of the target number of output-maps nF valid for all RUs of the stack. The 2nd RL also works with this reduced number of maps. The 3rd RL, however, restores the number of maps to the target number nF.

Keep in mind when programming:

The depth-reduction of a bottleneck structure of RUs refers to the number of maps, not to the dimensions of the kernels and not to the dimensions of the maps. This is often misunderstood.

Why do we perform such an intermittent depth-reduction of the Conv2D-layers at all? Well, an important argument is efficiency: We want to keep the number of weights, i.e. connections between the maps of different layers, as small as possible. I think the approach of R. Atienza in [4] to use a reduction factor of 1/2, i.e. q = 1/2 * nF, is a very reasonable one, as this is just the number of maps of the previous stack.
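
To make the described sub-structure concrete, here is a minimal Keras sketch of such a pre-activation bottleneck RU. It only illustrates the scheme discussed above and is not a reproduction of the exact code of [1] to [4]; the function name, the reduction factor 1/2 and the placement of the projection shortcut after the first BN/ReLU pair are choices which differ between implementations:

from tensorflow.keras import layers

def residual_unit_v2(x, nF, stride=1):
    q = nF // 2                                  # bottleneck depth (factor 1/2)

    # RL 1: BN => ReLU => Conv2D (1x1); stride=2 only in the 1st RU of a stack
    y = layers.BatchNormalization()(x)
    y = layers.Activation("relu")(y)

    shortcut = x
    if stride > 1 or x.shape[-1] != nF:
        # special shortcut: a (1x1)-Conv2D aligns resolution and number of maps
        shortcut = layers.Conv2D(nF, 1, strides=stride, padding="same")(y)

    y = layers.Conv2D(q, 1, strides=stride, padding="same")(y)

    # RL 2: BN => ReLU => Conv2D (3x3)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(q, 3, strides=1, padding="same")(y)

    # RL 3: BN => ReLU => Conv2D (1x1) restoring the full number nF of maps
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(nF, 1, strides=1, padding="same")(y)

    # plain addition at the output side - no activation afterward (ResNet V2)
    return layers.Add()([shortcut, y])

A stack would then consist of one call with stride=2 (its 1st RU, reducing the resolution) followed by several calls with stride=1.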

First layers of a ResNet V2

We must work with the input data before we feed them into the first stack of Residual Units and split the data flow into a regular path through the sub-layers of RUs and residual shortcut connections. Reasons to prepare the data are:

  • We may want to reduce resolution and dimensions to some target values.
  • We may want to normalize the data.
  • We may want to apply a first activation to get positive weights.

The handling depends both on the exact requirements for an adaptation of the resolution of the images and on efficiency conditions. Regarding a concrete approach the original research papers [1] to [3] show a tendency

  • to first use a convolutional layer with a (3×3) or (7×7) kernel,
  • to apply BatchNormalization afterward,
  • and to perform a first activation (with Relu).

In this order. This has a consequence for the very 1st RL in the 1st RU of the 1st stack: We can just omit the usual Activation- and BatchNormalization-layers there. This kind of approach seems to support convergence and generalization during network training.
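
A correspondingly reduced Keras sketch of these first layers could look as follows. Again, this is only an illustration; the kernel size, the filter number and a possible additional strided convolution or pooling step depend on the input data and the target resolution:

from tensorflow.keras import layers

def resnet_v2_stem(inputs, nF=64, kernel_size=3):
    # Conv2D first, then BatchNormalization, then a first Relu-activation
    x = layers.Conv2D(nF, kernel_size, strides=1, padding="same")(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    return x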

Different architectures and number of stacks/stages for the analysis of concrete objects

Regarding the number of stacks, the number of filters per stack and the number of RUs per stack, the papers [1] to [3] leave the reader a bit confused after a first reading. The authors obviously preferred different setups for different kinds of image classes – e.g. for the ImageNet-, the CIFAR10-, the CIFAR100- and the MS COCO-datasets.

The original strategy in [1] for CIFAR10/100 classification tasks was to use a plain structure of 3 stacks, each with an equal number nRU of up to 18 RUs per stack and numbers of feature maps nF ∈ {16, 32, 64}. For nRU = 9 this results in a ResNetV2-56 with 56 layers, for nRU = 18 in a ResNetV2-164 with 164 layers, respectively. R. Atienza in [4] uses filter numbers nF ∈ {64, 128, 256}. I find it also interesting that R. Atienza did not try out a different setup with 4 stacks for CIFAR10.

For more complex images other strategies may be more helpful. Geron in his book [6] discusses a setup of 4 stacks for the ImageNet dataset with the numbers of filters as {64, 128, 256, 512}, but with numbers of RUs as nRU ∈ {3, 4, 6, 3} for a ResNet-34.

This all means: We should prepare our code to be flexible enough to cover up to four or five stacks with different numbers of RUs.

As private enthusiasts we have very limited HW-resources, so we can only afford to train a deep ResNet V2 on modest data sets like CIFAR10, Fashion MNIST or CelebA. Still, we need to experiment a bit. In particular we should invest some time in finding out what depth reduction is possible in the bottleneck layers.

An open question

Personally, the study of [1] to [4] left me with an open question:

Why not additionally bridge a whole stack of RUs with a shortcut connection?

This would be an intermediate step in the direction of “Densely Connected Convolutional” networks [DenseNets]. Maybe this is overkill, but let us keep it as an idea to investigate further.

Conclusion

ResNet V2 networks are somewhat more complex than standard CNNs. However, the recipes given in [1] to [4] are rather clear. Changing the number of stacks, the number of RUs in the stacks and the parameters of the Residual Bottleneck-Layers leaves more than enough room for experiments and adaptations to specific input data sets. We just have to implement the respective parameters and controls in our Python programs. In the next post of this series I will look a bit at the mathematical analysis of a sequence of RUs.

Literature

[1] K. He, X. Zhang, S. Ren , J. Sun, “Deep Residual Learning for Image Recognition”, 2015, arXiv:1512.03385v1
[2] K. He, X. Zhang, S. Ren , J. Sun, “Identity Mappings in Deep Residual Networks”, 2016, version 2 arXiv:1603.05027v2
[3] K. He, X. Zhang, S. Ren , J. Sun, “Identity Mappings in Deep Residual Networks”, 2016, version 3, arXiv:1603.05027v3
[4] R. Atienza, “Advanced Deep Learning with TensorFlow 2 and Keras”, 2nd edition, 2020, Packt Publishing Ltd., Birmingham, UK (see chapter 2)
[5] F. Chollet, “Deep Learning with Python”, 2017, Manning Publications, USA
[6] A. Geron, “Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow”, 3rd ed., 2023, O’Reilly Media Inc., Sebastopol, CA, USA


ResNet basics – I – problems of standard CNNs

Convolutional Neural Networks [CNNs] do a good job regarding the analysis of image or video data. They extract correlation patterns hidden in the numeric data of our media objects, e.g. images or videos. Thereby, they get an indirect access to (visual) properties of displayed physical objects – like e.g. industry tools, vehicles, human faces, ….

But there are also problems with standard CNNs. They have a tendency to eliminate some small-scale patterns. Visually this leads to smoothing or smear-out effects. Due to an interference of applied filters, artificial patterns can appear when CNN-based Autoencoders shall recreate or generate images after training. In addition, the number of sequential layers we can use within a CNN on a certain level of resolution is limited. This is, on the one hand, due to the number of parameters which rises quickly with the number of layers. On the other hand, and equally important, vanishing gradients can occur during error back-propagation and cause convergence problems for the usually applied gradient descent method during training.

A significant improvement came with so-called Deep Residual Neural Networks [ResNets]. In this post series I discuss the most important differences in comparison to standard CNNs. I start the series with a short presentation of some important elements of CNNs. In a second post I will directly turn to the structure of the so-called ResNet V2 architecture [2].

To get a more consistent overview of the historical development I recommend reading the original papers [1], [2], [3] and, in addition, a chapter in the book of R. Atienza [4]. This post series only summarizes and comments on the ideas in the named resources in a rather personal way. For me it serves as a preparation and overall documentation for Python programs. But I hope the posts will help some other readers to start working with ResNets, too. In a third post I will also look at the math discussed in some of the named papers on ResNets.

Level of this post: Advanced. You should be familiar with the concepts of CNNs and have some practical experience with this type of Artificial Neural Network. You should also be familiar with the Keras framework and standard layer-classes provided by this framework.

Elements of simple CNNs

A short repetition of a CNN’s core elements will later help us to better understand some important properties of ResNets. ResNets are in my opinion a natural extension of CNNs and will allow us to build really deep networks based on convolutional layers. The discussion in this post focuses on simple standard CNNs for image analysis. Note, however, that the application spectrum is much broader: 1D-CNNs can for example be used to detect patterns in sequential data flows such as texts.

We can use CNNs to detect patterns in images that depict objects belonging to certain object-classes. Objects of a class have some common properties. For real world tasks the spectrum of classes must of course be limited.

The idea is that there are detectable patterns which are characteristic of the object classes. Some people speak of characteristic features. The identification of such patterns or features would then help to classify objects in new images a trained CNN gets confronted with. Or a combination of patterns could help to recreate realistic object images. A CNN must therefore provide not only some mechanism to detect patterns, but also a mechanism for an internal pattern representation which can e.g. be used as basic information for a classification task.

We can safely assume that the patterns for objects of a certain class will show specific structures on different length scales. To cover a reasonable set of length scales we need to look at images at different levels of resolution. This is one task which a CNN must solve; certain elements of its layer architecture must ensure a systematic change in resolution and of the 2-dimensional length scales we look at.

Pattern detection itself is done by applying filters on sub-scales of the spatial dimensions covered by a certain level of resolution. The filtering is done by so-called “Convolutional Layers”. A filter tests the overlap of a given object’s 2-dimensional structures with some filter-related periodic pattern on smaller scales. Relevant filter-parameters for optimal patterns are determined during the training of a CNN. The word “optimal” refers to the task the CNN shall eventually solve.

The basic structure of a CNN (e.g. to analyze the MNIST dataset) looks like this:

The sketched simple CNN consists of only three “Convolutional Layers”. Technically, the Keras framework provides a convolutional layer suited for 2-dimensional tensor data via the Python class “Conv2D”. I use this term below to indicate convolutional layers.

Each of our CNN’s Conv2D-layers comprises a series of rectangular arrays of artificial neurons. These arrays are called “maps” or sometimes also “feature maps“.

All maps of a Conv2D-layer have the same dimensions. The output signals of the neurons in a map together represent a filtered view on the original image data. The deeper a Conv2D-layer resides inside the CNN’s network the more filters had an impact on the input and output signals of the layer’s maps. (More precisely: of the neurons of the layer’s maps).

Resolution reduction (i.e. a shift to larger length scales) is in the depicted CNN explicitly done by intermittent pooling-layers. (An alternative would be that the Conv2D-layers themselves work with a stride parameter s = 2; see below.) The output of the innermost convolutional layer is flattened into a 1-dimensional array, which then is analyzed by some suitable sub-network (e.g. a tiny MLP).
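To make the described layer sequence concrete, here is a minimal Keras sketch of such a simple CNN for MNIST. The numbers of maps and the kernel sizes are my own assumptions; with three Conv2D-layers and two pooling steps they lead to 128 innermost maps of dimension [3×3]:

# Minimal Keras sketch of the described simple CNN for MNIST.
# Numbers of maps and kernel sizes are assumptions for illustration only.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    # 1st Conv2D-layer with 32 maps, followed by a resolution reduction
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    # 2nd Conv2D-layer with 64 maps
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    # 3rd (innermost) Conv2D-layer with 128 maps of dimension [3x3]
    layers.Conv2D(128, (3, 3), activation="relu"),
    # flatten the innermost maps into a 1-dimensional array ...
    layers.Flatten(),
    # ... and analyze it with a tiny MLP-like sub-network
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
model.summary()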

Filters and kernels

Convolution in general corresponds to applying a (sub-scale) filter-function to another function. Mathematically we describe this by so called convolution integrals of the functions’ product (with a certain way of linking their arguments). A convolution integral measures the degree of overlap of a (multidimensional) function with a (multidimensional) filter-function. See here for an illustration.
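For reference, and in my own compact notation: the convolution of a function f with a filter-function g can be written as

\[ (f * g)(t) \,=\, \int f(\tau)\, g(t - \tau)\, d\tau \]

while the discrete operation a Conv2D-layer actually applies to a map M with a (k×k) kernel K and stride s is, strictly speaking, a cross-correlation:

\[ O[i,\,j] \,=\, \sum_{u=0}^{k-1} \sum_{v=0}^{k-1} K[u,\,v]\; M[i \cdot s + u,\; j \cdot s + v] \]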

As we are working with signals of distinct artificial neurons our filters must be applied to arrays of discrete signal values. The relevant arrays our filters work upon are the neural maps of a (previous) Conv2D-layer. A sub-scale filter operates sequentially on coherent and fitting sub-arrays of neurons of such a map. It defines an averaged signal of such a sub-array which is fed into a neuron of a map located in the following Conv2D-layer. By sub-scale filtering I mean that the dimensions of the filter-array are significantly smaller than the dimensions of the tested map. See the illustration of these points below.

The sub-scale filter of a Conv2D-layer is technically realized by an array of fixed parameter-values, a so called kernel. A filter’s kernel parameters determine how the signals of the neurons located in a covered sub-array of a map are to be modified before adding them up and feeding them into a target neuron. The parameters of a kernel are also called a filter’s weights.

Geometrically you can imagine the kernel as a (k x k)-array systematically moved across an array of [n x n] neurons of a map (with n > k). The kernel’s convolution operation consists of multiplying each filter-parameter with the signal of the underlying neuron and afterwards adding the results up. See the illustration below.

For each combination of a map M[N, i] of a layer LN with a map M[N+1, m] of the next layer L(N+1) there exists a specific kernel, which sequentially tests fitting sub-arrays of map M[N, i]. The filter is moved across map M[N, i] with a constant shift-distance called stride [s]. When the end of a row is reached the filter-array is moved vertically down to another row at distance s.

Note on the difference of kernel and map dimensions: The illustration shows that we have to distinguish between the dimensions of the kernel and the dimensions of the resulting maps. Throughout this post series we will denote kernel dimensions in round brackets, e.g. (5×5), while we refer to map dimensions with numbers in box brackets, e.g. [11×11].

In the image above map M[N, i] has a dimension of [6×6]. The filter is based on a (3×3) kernel-array. The target maps M[N+1, m] all have a dimension of [4×4], corresponding to a stride s=1 (and padding=”valid” as the kernel-array fits 4 times into the map). For details of strides and paddings please see [5] and [6].
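To make the kernel operation tangible, here is a small NumPy sketch (my own illustration, not code behind the figures) of a (3×3) kernel moved with stride s=1 across a [6×6] map with “valid” padding, yielding a [4×4] output:

# Discrete convolution of a (3x3) kernel over a [6x6] map with stride 1.
# Note: CNN frameworks technically implement a cross-correlation, i.e. the
# kernel is not flipped; we follow that convention here.
import numpy as np

def conv2d_valid(map_in, kernel, stride=1):
    k = kernel.shape[0]
    n = map_in.shape[0]
    out_dim = (n - k) // stride + 1
    out = np.zeros((out_dim, out_dim))
    for i in range(out_dim):
        for j in range(out_dim):
            sub = map_in[i*stride : i*stride + k, j*stride : j*stride + k]
            # multiply each kernel weight with the underlying neuron signal
            # and add the results up
            out[i, j] = np.sum(sub * kernel)
    return out

map_MNi = np.random.rand(6, 6)               # map M[N, i] of dimension [6x6]
kernel  = np.random.rand(3, 3)               # (3x3) kernel-array
print(conv2d_valid(map_MNi, kernel).shape)   # -> (4, 4)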

Whilst moving with its assigned stride across a map M[N, i] the filter’s “kernel” mathematically invokes a (discrete) convolutional operation at each step. The resulting number is added to the results of other filters working on other maps M[N, j]. The sum is fed into a specific neuron of a target map M[N+1, m] (see the illustration above).

Thus, the output of a Conv2D-layer’s map is the result of filtered input coming from previous maps. The strength of the remaining average signal of a map indicates whether the input is consistent with a distinct pattern in the original input data. After having passed all previous filters up to the length scale of the innermost Conv2D-layer each map reacts selectively and strongly to a specific pattern, which can be rather complex (see pattern examples below).

Note that a filter is not something fixed a priori. Instead the weights of the filters (convolution kernels) are determined during a CNN’s training and weight optimization. Loss optimization dictates which filter weights are established during training and later used at inference, i.e. for the analysis of new images.

Note also that a filter (or its mathematical kernel) represents a repetitive sub-scale pattern. This leads to the fact that patterns detected on a specific length scale very often show a certain translation invariance and a limited rotation invariance. This in turn is a basic requirement for a good generalization of a CNN-based algorithm.

A filter feeds neurons located in a map of a following Conv2D-layer. If a layer LN has p maps and the following layer L(N+1) has q maps, then each neuron of a map M[N+1, m] receives the superposition of the outcomes of p different filters (one per map of layer LN). The transition between the two layers as a whole therefore involves (p*q) different filters (and respective kernels).
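This combinatorics can be checked directly via the parameter count of a Keras Conv2D-layer: for p input maps, q target maps and (k x k) kernels we expect p*q*k*k weights plus q bias values. The concrete numbers below are arbitrary:

# Check the filter combinatorics via the parameter count of a Conv2D-layer.
from tensorflow.keras import layers, models

p, q, k = 16, 32, 3
model = models.Sequential([
    layers.Input(shape=(11, 11, p)),    # a layer with p maps of dimension [11x11]
    layers.Conv2D(q, (k, k)),           # q target maps, (3x3) kernels
])
print(model.count_params())             # -> 16 * 32 * 3 * 3 + 32 = 4640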

Patterns and features

Patterns which fit some filters appear, of course, on different length scales and thus at all Conv2D-layers. We first filter for small-scale patterns, then for (overlayed) patterns on larger scales. A certain combination of patterns on all length scales investigated so far is represented by the output of the innermost maps.

All in all the neural activation of the maps at the innermost layers result from (surviving) signals which have passed a sequence of non-linearly interacting filters. (Non-linear due to the non-linearity of the neurons’ activation function.) A strong overall activation of an innermost map corresponds to a unique and characteristic pattern in the input image which “survived” the chain of filters on all investigated scales.

Therefore a map is sometimes also called a “feature map”. A feature refers to a distinct (and maybe not directly visible) pattern in the input image data to which an inner map reacts strongly.

Increasing number of maps with lower resolution

When reducing the resolution (i.e. when moving to larger length scales) we typically open up space for more significant pattern combinations; the number of maps increases with each Conv2D-layer that follows a resolution reduction (by a stride s=2 or a pooling layer). This is a very natural procedure due to filter combinatorics.

Examples of patterns detected for MNIST images

A CNN in the end detects and establishes patterns (inherent in the image data) which are relevant for solving a certain problem (e.g. classification or generative reconstruction). A funny thing is that these “feature patterns” can be visualized.

The next image shows the activation pattern (signal strengths) of the neurons of the 128 innermost (3×3) maps of a CNN that had been trained on the MNIST data and was confronted with an image displaying the digit “6”.

The other image shows some “feature” patterns to which six selected innermost maps react very sensitively and with a large averaged output after the CNN had been trained on MNIST digit images.

 

These patterns obviously result from translations and some rotation of more elementary patterns. The third pattern seems to be useful for detecting “9”s at different positions on an image, the fourth one for the detection of “2”s. It is somewhat amusing what kind of patterns a CNN deems interesting for distinguishing between digits!

If you are interested in how to create images of the patterns to which the maps of the innermost Conv2D-layer react, see the book of F. Chollet on “Deep Learning with Python” [5]. See also the post of the physicist F. Graetz, “How to visualize convolutional features in 40 lines of code”, at “towardsdatascience.com”. For MNIST see my own posts on the visualization of filter-specific patterns in my linux-blog. I intend to describe and apply the required methods for layers of ResNets elsewhere in the present ML-blog.
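For readers who just want to get a feeling for the basic mechanism: such pattern images are typically created by gradient ascent in input space, i.e. by iteratively changing an input image such that the mean activation of a chosen map becomes large. The following is only a rough TensorFlow sketch under my own assumptions (layer name, map index, step size and iteration count are arbitrary); for a worked-out version see Chollet [5]:

# Rough sketch of pattern visualization by gradient ascent in input space.
# Layer name, map index, step size and iteration count are arbitrary assumptions.
import tensorflow as tf

def visualize_map_pattern(model, layer_name, map_index,
                          img_size=(28, 28, 1), steps=60, step_size=1.0):
    # sub-model which returns the output of the chosen Conv2D-layer
    layer = model.get_layer(layer_name)
    feature_extractor = tf.keras.Model(inputs=model.inputs, outputs=layer.output)

    # start from a slightly noisy gray image
    img = tf.Variable(tf.random.uniform((1,) + img_size) * 0.2 + 0.4)

    for _ in range(steps):
        with tf.GradientTape() as tape:
            activation = feature_extractor(img)
            # mean activation of the selected map
            loss = tf.reduce_mean(activation[..., map_index])
        grads = tape.gradient(loss, img)
        # normalize the gradient and take a small ascent step
        grads = grads / (tf.math.reduce_std(grads) + 1e-8)
        img.assign_add(step_size * grads)

    return img.numpy()[0]

The returned array can then be plotted (e.g. with matplotlib) to display the pattern the selected map reacts to most strongly.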

Deep CNNs

The CNN depicted above is very elementary and not a really deep network. Anyone who has experimented with CNNs will probably have tried to use groups of Conv2D-layers on the same level of resolution and map-dimensions. And he/she probably will also have tried to stack such groups to get deeper networks. VGG-Net (see the literature, e.g. [2, 5, 6]) is a typical example of a deeper architecture. In a VGG-Net we have a group of sequential Conv2D-layers on each level of resolution – each with the same number of maps.
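As a small sketch (with arbitrary filter numbers, not the original VGG configuration), such a group-wise construction could look like this in Keras:

# Sketch of VGG-like groups: several Conv2D-layers with the same number of maps
# on one resolution level, followed by a pooling step. Filter numbers are arbitrary.
from tensorflow.keras import layers, models

def conv_group(x, n_filters, n_layers=2):
    # a group of Conv2D-layers on one resolution level plus a pooling step
    for _ in range(n_layers):
        x = layers.Conv2D(n_filters, (3, 3), padding="same", activation="relu")(x)
    return layers.MaxPooling2D((2, 2))(x)

inputs = layers.Input(shape=(64, 64, 3))
x = conv_group(inputs, 64)
x = conv_group(x, 128)
x = conv_group(x, 256, n_layers=3)
x = layers.Flatten()(x)
outputs = layers.Dense(10, activation="softmax")(x)
model = models.Model(inputs, outputs)
model.summary()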

BUT: Such simple deep networks do not always give good results regarding error rates, convergence and computational time. The number of parameters rises quickly without a really satisfactory reward.

Problems of deep CNNs

In a standard CNN each inner convolutional layer works with data that were already filtered and modified by previous layers.

An inner filter cannot adapt to the original data, but only to already filtered information. And any filter eliminates some of the originally present information … This also occurs at transitions to layers working on larger dimensions, i.e. with maps of reduced resolution: The first filter working on a larger length scale (lower resolution) eliminates information which originally came out of a pooling layer (or a Conv2D-layer with stride s=2). The original averaged data are not available to further layers working on the same new length scale.

Therefore, a standard CNN deals with a problem of rather fast information reduction. Furthermore, the maps in a group have no common point of reference – except overall loss optimization. Each filter can eventually only become an optimal one if the previous filtering layer has already found a reasonable solution. An individual layer cannot learn something by itself. This in turn means: During training the net must adapt as a whole, i.e. as a unit. This strong coupling of all layers can increase the number of required training epochs and also create convergence problems.

How could we work against these trends? Can we somehow support an adaptation of each layer to unfiltered data – i.e. support some learning which does not completely depend on previous layers? This is the topic of the next post in this series.

Conclusion

CNNs were important building blocks in the history of Machine Learning for image analysis. Their basic elements such as Conv2D-layers, filters and the respective neural maps on different length scales (or resolutions) work well in networks whose depth is limited, i.e. when the total number of Conv2D-layers is small (3 to 10). The number of parameters rises quickly with a network’s depth and one encounters convergence problems. Experience shows that building really deep networks with Conv2D-layers requires additional architectural elements and layer-combinations. Such elements are cornerstones of Residual Networks. I will discuss them in the next post of this series. See

ResNet basics – II – ResNet V2 architecture

Literature

[1] K. He, X. Zhang, S. Ren, J. Sun, “Deep Residual Learning for Image Recognition”, 2015, arXiv:1512.03385v1
[2] K. He, X. Zhang, S. Ren, J. Sun, “Identity Mappings in Deep Residual Networks”, 2016, version 2, arXiv:1603.05027v2
[3] K. He, X. Zhang, S. Ren, J. Sun, “Identity Mappings in Deep Residual Networks”, 2016, version 3, arXiv:1603.05027v3
[4] R. Atienza, “Advanced Deep Learning with TensorFlow 2 and Keras”, 2nd edition, 2020, Packt Publishing Ltd., Birmingham, UK (see chapter 2)
[5] F. Chollet, “Deep Learning with Python”, 2017, Manning Publications, USA
[6] A. Geron, “Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow”, 3rd ed., 2023, O’Reilly Media Inc., Sebastopol, CA, USA