Posit AI Blog: Upside Down, A Cat’s Still a Cat: Evolving Image Recognition Using Geometric Deep Learning

This is the first in a series of posts on group-equivariant convolutional neural networks (GCNNs). Today, we keep it short, high-level, and conceptual; examples and implementations will follow. In looking at GCNNs, we are resuming a topic we first wrote about in 2021: Geometric Deep Learning, a principled, mathematically driven approach to network design that has only grown in scope and impact since then.

From Alchemy to Science: Geometric Deep Learning in Two Minutes

In a nutshell, Geometric Deep Learning is about deriving the structure of a network from two things: the domain and the task. The posts will go into a lot of detail, but let me give a quick preview here:

  • By domain I mean the underlying physical space and the way it is represented in the input data. For example, images are usually encoded as a two-dimensional grid with values indicating pixel intensity.
  • The task is what we are training the network to do: classification, say, or segmentation. Tasks may vary in different phases of the architecture. At each stage, the task will have a say in what the layer design should look like.

Take MNIST, for example. The dataset consists of images of the ten digits, 0 to 9, all in grayscale. The challenge – unsurprisingly – is to match each image with the digit it represents.

First, consider the domain. A \(7\) is a \(7\) wherever it appears on the grid. So we need an operation that is translation-equivariant: it flexibly adapts to shifts (translations) in its input. More precisely, in our context, equivariant operations are able to detect an object’s properties even if that object has been moved, vertically and/or horizontally, to another location. Convolution, ubiquitous not just in deep learning, is just such a shift-equivariant operation.
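
If you like to see such claims verified numerically, here is a minimal sketch. It is written in Python with PyTorch purely for illustration (the image, the filter, and the little helper functions are all made up for this example; the actual implementations will follow in later posts). The check: convolving a shifted image gives the same result as shifting the convolved image. To keep the check exact on a finite grid, both the shift and the convolution wrap around the borders.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

img = torch.randn(1, 1, 28, 28)    # a toy grayscale "image" (batch, channel, height, width)
kernel = torch.randn(1, 1, 3, 3)   # one random 3x3 convolution filter

def conv(x):
    # convolution with circular padding, so the operation lives on a "wrap-around" grid
    return F.conv2d(F.pad(x, (1, 1, 1, 1), mode="circular"), kernel)

def shift(x, dy, dx):
    # cyclic shift of the spatial dimensions: nothing falls off the grid
    return torch.roll(x, shifts=(dy, dx), dims=(2, 3))

# Shift-equivariance: convolving the shifted image equals shifting the convolved image.
lhs = conv(shift(img, 3, 5))
rhs = shift(conv(img), 3, 5)
print(torch.allclose(lhs, rhs))  # True
```

With ordinary zero padding instead of the wrap-around, the same identity holds everywhere except in a narrow border region.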

Let me draw particular attention to the fact that, in equivariance, this “flexible adaptation” is essential. Translation-equivariant operations do care about the object’s new position; they record a feature not abstractly, but at the object’s new location. To see why this matters, consider the network as a whole. When we stack convolutions, we build a hierarchy of feature detectors. That hierarchy should work no matter where in the image an object appears. In addition, it has to be consistent: positional information must be preserved between layers.

Hence, in terms of terminology, it is important to distinguish equivariance from invariance. An invariant operation, in our context, would still be able to spot a feature wherever it occurs; however, it would happily forget where that feature happened to be. Clearly, then, to build up a hierarchy of features, invariance is not enough.

What we’ve just done is derive a requirement from the domain, the input grid. What about the task? If, in the end, all we have to do is name the digit, location suddenly doesn’t matter anymore. In other words, once the hierarchy exists, invariance is enough. In neural networks, pooling is an operation that forgets about (spatial) detail. It only cares about the mean, say, or the maximum value itself. This is what makes it suitable for “summarizing” information about a region, or the complete image, if at the end we only care about returning a class label.
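
Here is the complementary check, in the same illustrative Python/PyTorch style as above (again, all names are made up for the sketch): once we pool globally over the feature map, with a maximum, say, the summary no longer changes when the input is shifted. The features are still detected; where they were has been forgotten.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

img = torch.randn(1, 1, 28, 28)
kernel = torch.randn(1, 1, 3, 3)

def features(x):
    # same circular convolution as before: an equivariant feature detector
    return F.conv2d(F.pad(x, (1, 1, 1, 1), mode="circular"), kernel)

def summarize(x):
    # global max pooling: keeps "was the feature present?", forgets "where?"
    return x.amax(dim=(2, 3))

shifted = torch.roll(img, shifts=(7, -4), dims=(2, 3))

# Invariance: the pooled summary is identical for the original and the shifted image.
print(torch.allclose(summarize(features(img)), summarize(features(shifted))))  # True
```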

In short, we were able to formulate a design wishlist based on (1) what we were given and (2) what we were tasked with.

After this high-level outline of Geometric Deep Learning, we’ll zoom in on the intended topic of this post series: group-equivariant convolutional neural networks.

By now, the “equivariant” part shouldn’t be too much of a puzzle anymore. But what about the “group” prefix?

“Group” in group equivariance

As you may have guessed from the introduction, when we talk about “principled” and “driven by mathematics”, this really is about groups in the mathematical sense. Depending on your background, the last time you heard about groups may have been in school, with no hint as to why they matter. I’m certainly not capable of summarizing the entire wealth of what they’re good for, but I hope that by the end of this post, their importance in deep learning will make intuitive sense.

Symmetry groups

Here is a square.

Now close your eyes.

Now look again. Did something happen in the square?

You can’t tell. Maybe it was rotated; maybe it wasn’t. On the other hand, what if the vertices were numbered?

Now you would know.

Without the numbering, could I have rotated the square any way I wanted? Obviously not. This would not go unnoticed:

A square, rotated counterclockwise a few degrees.

There are exactly three ways I could have turned the square without arousing suspicion. These ways can be referred to in different manners; one simple way is by degree of rotation: 90, 180, or 270 degrees. Why not more? Any further addition of 90 degrees would result in a configuration we have already seen.

The image above shows the rotations just listed and, on the left, the configuration I took as the default. How do we account for the latter? It could be reached by rotating 360 degrees (or twice that, or three times, or…). But the way this is handled in mathematics is to treat it as a kind of “zero rotation”, analogous to how \(0\) acts in addition, \(1\) in multiplication, or the identity matrix in linear algebra.

In total, then, we have four actions that can be performed on the square (an un-numbered square!) that leave it as it is, or invariant. These are called the symmetries of the square. A symmetry, in mathematics/physics, is a quantity that remains the same no matter what happens as time evolves. And this is where groups come in. Groups – specifically, their elements – carry out actions like rotation.

Before I explain how, I’ll give another example. Take this sphere.

How many symmetries does a sphere have? Infinitely many. This implies that whatever group we choose to act on the square, it will not do a very good job of representing the symmetries of the sphere.

Viewing groups through the lens of actions

After these examples, let me generalize. Here is a typical definition.

A group \(G\) is a finite or infinite set of elements together with a binary operation (called the group operation) that together satisfy the four fundamental properties of closure, associativity, the identity property, and the inverse property. The operation with respect to which a group is defined is often called the “group operation”, and the set is said to be a group “under” this operation. Elements \(A\), \(B\), \(C\), … with a binary operation between \(A\) and \(B\) denoted \(AB\) form a group if

  1. Closure: If \(A\) and \(B\) are two elements in \(G\), then the product \(AB\) is also in \(G\).
  2. Associativity: The defined multiplication is associative, i.e., for all \(A\), \(B\), \(C\) in \(G\), \((AB)C=A(BC)\).
  3. Identity: There is an identity element \(I\) (a.k.a. \(1\), \(E\), or \(e\)) such that \(IA=AI=A\) for every element \(A\) in \(G\).
  4. Inverse: There must be an inverse (a.k.a. reciprocal) of each element. Therefore, for each element \(A\) of \(G\), the set contains an element \(B=A^{-1}\) such that \(AA^{-1}=A^{-1}A=I\).

In the language of actions, the elements of a group determine the actions allowed; or, more precisely, those that are distinguishable from each other. Two actions can be composed; this is a “binary operation”. The requirements now make intuitive sense:

  1. A combination of two actions – say two rotations – is still an action of the same type (rotation).
  2. If we have three such actions, it doesn’t matter how we group them. (However, the order of their applications must remain the same.)
  3. One of the possible actions is always “zero action”. (Just like in life.) As for “doing nothing,” it doesn’t matter if it happens before or after “something”; that “something” is always the end result.
  4. Every action must have an “undo button”. In the square example, if I rotate by 180 degrees, and then by 180 degrees again, I am back in the original state. It is as if I had done nothing.
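
If you prefer definitions you can execute, here is a tiny sketch (again Python, and again purely illustrative) of the rotation group of the square, commonly written \(C_4\): its four elements are encoded as numbers of quarter-turns, composition is addition modulo four, and the four properties above can be checked by brute force.

```python
from itertools import product

# The four symmetries of the (un-numbered) square, encoded as quarter-turns:
# 0 = do nothing, 1 = 90 degrees, 2 = 180 degrees, 3 = 270 degrees.
elements = [0, 1, 2, 3]

def compose(a, b):
    # performing rotation a, then rotation b, is again a rotation: add quarter-turns, wrap at 4
    return (a + b) % 4

# 1. Closure: composing any two rotations yields one of the four rotations.
assert all(compose(a, b) in elements for a, b in product(elements, repeat=2))

# 2. Associativity: how we group three rotations does not matter.
assert all(compose(compose(a, b), c) == compose(a, compose(b, c))
           for a, b, c in product(elements, repeat=3))

# 3. Identity: the "zero rotation" changes nothing, whether applied before or after.
assert all(compose(0, a) == a and compose(a, 0) == a for a in elements)

# 4. Inverse: every rotation has an "undo" (90 degrees is undone by 270, 180 by itself).
assert all(any(compose(a, b) == 0 for b in elements) for a in elements)

print("The quarter-turns form a group under composition.")
```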

Summing up, in a more “bird’s eye view” fashion, what we have seen so far is the definition of a group in terms of how its elements act on each other. But if groups are to matter “in the real world”, they need to act on something outside (neural network components, for example). How this works is the topic of subsequent posts, but I’ll briefly outline the intuition here.

Outlook: Group-equivariant CNNs

We noted above that, to classify an image, a translation-equivariant operation (like convolution) is needed: a \(1\) is a \(1\) whether it is moved horizontally, vertically, both ways, or not at all. But what about rotation? Standing on its head, a digit is still what it is. Conventional convolution does not support this type of action.

We can extend our architectural wishlist by specifying a symmetry group. Which group? If we wanted to detect squares aligned to the axes, a suitable group would be \(C_4\), the cyclic group of order four. (We saw above that we need four elements, and that we can cycle through them.) If, on the other hand, we don’t care about alignment, we’d want any position to count. In principle, we should then end up in the same situation as with the sphere. However, images live on discrete grids; in practice, there will not be an unlimited number of rotations.
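
To see the gap we would like to close, here is one more illustrative Python/PyTorch check in the spirit of the earlier ones: with a generic filter, plain convolution is not rotation-equivariant; rotating the input and then convolving does not match convolving and then rotating. (Roughly speaking, a group-equivariant convolution will fix this by also taking rotated copies of its filters into account; the details are the topic of the next posts.)

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

img = torch.randn(1, 1, 28, 28)
kernel = torch.randn(1, 1, 3, 3)   # a generic, non-symmetric filter

def conv(x):
    return F.conv2d(x, kernel, padding=1)

def rotate(x):
    # rotate the spatial grid by 90 degrees
    return torch.rot90(x, k=1, dims=(2, 3))

# Rotation does not commute with plain convolution (unlike translation above):
print(torch.allclose(conv(rotate(img)), rotate(conv(img))))  # False for a generic filter
```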

With more realistic applications, we need to think more carefully. Take digits. When is a number “the same”? For one thing, it depends on the context. Were it a handwritten address on an envelope, would we accept a \(7\) as such had it been rotated by 90 degrees? Maybe. (Though we might wonder what would make someone change ballpoint-pen position for just a single digit.) And what about a \(7\) standing on its head? On top of similar psychological considerations, we should be seriously unsure about the intended message, and, at the very least, down-weight the data point were it part of our training set.

Importantly, it also matters which digit we’re talking about. A \(6\), upside down, is a \(9\).

Zooming in on neural networks, there is room for yet more complexity. We know that CNNs build up a hierarchy of features, starting with simple ones like edges and corners. Even if we may not want rotation equivariance in later layers, we would still like to have it in the initial set of layers. (The output layer – as we have already hinted – is to be considered separately in any case, since its requirements result from the specifics of what we are tasked with.)

That’s all for today. I hope I have managed to shed some light on why we would want group-equivariant neural networks. The question remains: how do we get them? That is what the next posts in this series will be about.

Until then and thanks for reading!

Photo by Ihor OINUA on Unsplash
