Basic Image Classification and Its Limitations

Guymonahan
5 min read · Jun 28, 2021

The ability of computers to analyze images is incredible, but at the same time it is limited, in some of the same ways that human classification is. To understand these limits, we first need to know what is happening when we ask a computer to “look” at an image, and then we need to look at the different ways this process can fail.
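To make the "look at an image" step concrete, here is a minimal sketch of basic image classification: an ImageNet-pretrained ResNet-50 from torchvision assigns one of 1,000 labels to a single photo. The file name `example.jpg` is a placeholder, and the label lookup is simplified to a class index.

```python
# Minimal sketch: classify one image with a pretrained ResNet-50.
import torch
from torchvision import models, transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = models.resnet50(pretrained=True)
model.eval()

image = Image.open("example.jpg").convert("RGB")   # hypothetical input file
batch = preprocess(image).unsqueeze(0)             # shape: (1, 3, 224, 224)

with torch.no_grad():
    probs = torch.softmax(model(batch), dim=1)
    top_prob, top_class = probs.max(dim=1)

print(f"predicted ImageNet class index {top_class.item()} "
      f"with confidence {top_prob.item():.2f}")
```

Everything that follows in this post is about the ways this seemingly simple pipeline breaks down.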

Firstly, deep learning nearly always requires a large amount of annotated data. This biases vision researchers to work on tasks where annotation is easy instead of tasks that are important.

There are methods that reduce the need for supervision, including transfer learning, few-shot learning, unsupervised learning, and weakly supervised learning. But so far their achievements have not been as impressive as those of supervised learning.
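As a hedged illustration of one of these methods, transfer learning typically reuses an ImageNet-pretrained backbone and retrains only a small new head on the limited labelled data available. The number of target classes and the training data below are assumptions for the sketch.

```python
# Sketch of transfer learning: freeze a pretrained backbone,
# train only a new final layer on a small labelled dataset.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)

# Freeze every pretrained parameter so only the new head is learned.
for param in model.parameters():
    param.requires_grad = False

num_classes = 5  # assumed size of the small target task
model.fc = nn.Linear(model.fc.in_features, num_classes)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One optimisation step on a mini-batch from the small dataset."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```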

Secondly, Deep Nets perform well on benchmark datasets but can fail badly on real-world images outside the dataset. All datasets have biases. These biases were particularly blatant in the early vision datasets, and researchers rapidly learned to exploit them, for example by relying on background context (e.g., detecting fish in Caltech101 was easy because they were the only objects whose backgrounds were water). These problems are reduced, but still remain, despite the use of big datasets and Deep Nets. For example, as shown in Figure 2, a Deep Net trained to detect sofas on ImageNet can fail to detect them when shown from viewpoints that were underrepresented in the training dataset. In particular, Deep Nets are biased against “rare events” which occur infrequently in the datasets. But in real-world applications these biases are particularly problematic, since they may correspond to situations where failures of a vision system lead to terrible consequences. Datasets used to train autonomous vehicles almost never contain babies sitting in the road.
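One simple way to surface this kind of bias is to report accuracy per subgroup (for example, per viewpoint) rather than a single average, so that failures on under-represented cases are not hidden. The sketch below assumes `predictions`, `labels`, and `viewpoints` are parallel lists produced by some evaluation loop; the toy values are invented for illustration.

```python
# Sketch: per-subgroup accuracy exposes failures on rare cases
# that a single average accuracy would hide.
from collections import defaultdict

def per_group_accuracy(predictions, labels, groups):
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, label, group in zip(predictions, labels, groups):
        total[group] += 1
        correct[group] += int(pred == label)
    return {g: correct[g] / total[g] for g in total}

# Toy example: fine on the common viewpoint, broken on the rare one.
predictions = ["sofa", "sofa", "sofa", "chair"]
labels      = ["sofa", "sofa", "sofa", "sofa"]
viewpoints  = ["front", "front", "front", "top-down"]

print(per_group_accuracy(predictions, labels, viewpoints))
# {'front': 1.0, 'top-down': 0.0}
```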

Thirdly, Deep Nets are overly sensitive to changes in the image that would not fool a human observer. They are not only vulnerable to standard adversarial attacks, which cause imperceptible changes to the image, but are also over-sensitive to changes in context. Figure 3 shows the effect of photoshopping a guitar into a picture of a monkey in the jungle. This causes the Deep Net to misidentify the monkey as a human and also misinterpret the guitar as a bird, presumably because monkeys are less likely than humans to carry a guitar, and birds are more likely than guitars to be in a jungle near a monkey. Recent work gives many examples of the over-sensitivity of Deep Nets to context, such as putting an elephant in a room.
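For readers who have not seen one, here is a minimal sketch of a standard adversarial attack, the fast gradient sign method (FGSM): a perturbation that a human barely notices can flip a Deep Net's prediction. The `model`, `image`, and `true_label` arguments are assumed to come from elsewhere, and `epsilon` controls how visible the perturbation is.

```python
# Sketch of FGSM: perturb the image in the gradient direction
# that increases the classification loss.
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, true_label, epsilon=0.01):
    """Return an adversarially perturbed copy of `image` (shape 1xCxHxW, values in [0,1])."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), true_label)
    loss.backward()
    # Step in the direction that increases the loss, then clip to valid pixel values.
    perturbed = image + epsilon * image.grad.sign()
    return perturbed.clamp(0.0, 1.0).detach()
```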

One alternative is compositional models, which build objects and scenes out of reusable parts combined according to grammar-like rules. Several of the conceptual strengths of compositional models have been demonstrated on visual problems, such as the ability to perform several tasks with the same underlying model and to recognize CAPTCHAs.

Other non-visual examples illustrate the same points. Attempts to train Deep Nets to do IQ tests have not been successful. In this task the goal is to predict the missing image in a 3x3 grid, where the other 8 images are given and the underlying rules are compositional (and distractors can be present). Conversely, for some natural language applications Neural Module Networks, whose dynamic architecture seems flexible enough to capture some meaningful compositions, outperform traditional deep learning networks. In fact, we recently verified that the individual modules indeed perform their intended compositional functionalities (e.g., AND, OR, FILTER(RED), etc.) after joint training.
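To give a feel for what these modules do, here is a toy, heavily simplified illustration (not the actual Neural Module Networks implementation): each module operates on soft attention maps over image regions, and modules can be chained to answer structured queries such as "red AND ball". The 2x2 scene, colors, and shapes are invented for the example.

```python
# Toy sketch of compositional modules acting on attention maps.
import numpy as np

def filter_color(attention, color_map, color):
    """Keep attention only on regions matching `color`."""
    return attention * (color_map == color).astype(float)

def module_and(attention_a, attention_b):
    """Intersection of two attention maps."""
    return np.minimum(attention_a, attention_b)

def module_or(attention_a, attention_b):
    """Union of two attention maps."""
    return np.maximum(attention_a, attention_b)

# Toy scene: a 2x2 grid of regions with colours and shapes.
color_map = np.array([["red", "blue"], ["red", "green"]])
shape_map = np.array([["ball", "ball"], ["cube", "cube"]])

ones  = np.ones((2, 2))
red   = filter_color(ones, color_map, "red")
balls = (shape_map == "ball").astype(float)

# "red AND ball" should attend only to the top-left region.
print(module_and(red, balls))
```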

Compositional models have many desirable theoretical properties, such as being interpretable and being able to generate samples. This makes errors easier to diagnose, and hence compositional models are harder to fool than black-box methods like Deep Nets. But learning compositional models is hard, because it requires learning the building blocks and the grammars (and even the nature of the grammars is debatable). Also, in order to perform analysis by synthesis, they need generative models of objects and scene structures, and putting distributions on images is challenging, with a few exceptions like faces, letters, and regular textures.

More fundamentally, dealing with the combinatorial explosion requires learning causal models of the 3D world and of how these generate images. Studies of human infants suggest that they learn by building causal models that predict the structure of their environment, including naive physics. This causal understanding enables learning from limited amounts of data and true generalization to novel situations. It is analogous to contrasting Newton’s Laws, which gave causal understanding with a minimal number of free parameters, with the Ptolemaic model of the solar system, which gave very accurate predictions but required a large amount of data to determine its details (i.e., the epicycles).

Testing on Combinatorial Data

One potential challenge with testing vision algorithms on the combinatorial complexity of the real world is that we can only test on finite data. Game theory deals with this by focusing on the worst cases instead of the average cases. As we argued earlier, average case results on finite sized datasets may not be meaningful if the dataset does not capture the combinatorial complexity of the problem. Clearly paying attention to the worst cases also makes sense if the goal is to develop visual algorithms for self-driving cars, or diagnosing cancer in medical images, where failures of the algorithms can have severe consequences.

If the failure modes can be captured in a low-dimensional space, such as the hazard factors for stereo, then we can study them using computer graphics and grid search.
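A sketch of what that might look like: enumerate combinations of a few assumed hazard factors (lighting, occlusion fraction, viewpoint), evaluate the model under each, and report the minimum accuracy rather than the average. The `render_and_evaluate` function below is a hypothetical stand-in for a graphics-based test harness; it returns a toy number only so the sketch runs end to end.

```python
# Sketch: worst-case accuracy over a grid of hazard factors.
from itertools import product

lighting   = [0.2, 0.5, 1.0]   # assumed hazard-factor values
occlusion  = [0.0, 0.3, 0.6]
viewpoints = ["front", "side", "top"]

def render_and_evaluate(model, light, occ, view):
    """Stand-in for rendering test images under these conditions and
    measuring accuracy; here it just returns a toy value."""
    return 0.9 - 0.4 * occ - 0.1 * (view == "top")

def worst_case_accuracy(model):
    results = {}
    for light, occ, view in product(lighting, occlusion, viewpoints):
        results[(light, occ, view)] = render_and_evaluate(model, light, occ, view)
    worst_setting = min(results, key=results.get)
    return worst_setting, results[worst_setting]

print(worst_case_accuracy(model=None))
```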

But for most visual tasks, particularly those involving combinatorial data, it will be very hard to identify a small number of hazard factors that can be isolated and tested. One strategy is to extend the notion of standard adversarial attacks to include non-local structure, by allowing complex operations that change the image or scene, e.g., by adding occlusion or changing the physical properties of the objects being viewed, but without significantly impacting human perception. Extending this strategy to vision algorithms that deal with combinatorial data remains very challenging. But if algorithms are designed with compositionality in mind, their explicit structures may make it possible to diagnose them and determine their failure modes.
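As a small example of one such non-local perturbation, the sketch below slides a grey occluder patch over an image and records every position at which the model's prediction changes, even though a human would still recognise the object. The `model` and `image` (a 1xCxHxW tensor in [0,1]) are assumed to come from elsewhere; patch size and stride are arbitrary choices.

```python
# Sketch: occlusion stress test for a classifier.
import torch

def occlusion_failures(model, image, patch=40, stride=20):
    model.eval()
    with torch.no_grad():
        original = model(image).argmax(dim=1)
    failures = []
    _, _, height, width = image.shape
    for top in range(0, height - patch + 1, stride):
        for left in range(0, width - patch + 1, stride):
            occluded = image.clone()
            occluded[:, :, top:top + patch, left:left + patch] = 0.5  # grey patch
            with torch.no_grad():
                prediction = model(occluded).argmax(dim=1)
            if prediction != original:
                failures.append((top, left))
    return failures
```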

Conclusion

A few years ago Aude Oliva and Alan Yuille (the first author) co-organized an NSF-sponsored workshop on the Frontiers of Computer Vision (MIT CSAIL 2011). The meeting encouraged frank exchanges of opinion, and in particular there was enormous disagreement about the potential of Deep Nets for computer vision. Yann LeCun boldly predicted that everyone would soon use Deep Nets. He was right. Their successes have been extraordinary: they have made vision far more popular, dramatically increased the interaction between academia and industry, led to applications of vision techniques in a large range of disciplines, and had many other important consequences. But despite their successes, there remain enormous challenges that must be overcome before we reach the goal of general-purpose artificial intelligence and an understanding of biological vision systems. Several of our concerns parallel those raised in recent critiques of Deep Nets.

Arguably the most serious challenge is how to develop algorithms that can deal with the combinatorial explosion as researchers address increasingly complex visual tasks in increasingly realistic conditions. Although Deep Nets will surely be one part of the solution, we believe that we will also need complementary approaches involving compositional principles and causal models that capture the underlying structures of the data. Moreover, faced with the combinatorial explosion, we will need to rethink how we train and evaluate vision algorithms.
