Fortunately, there are ways of dealing with the dependence of grey-level values on prevailing lighting conditions. If you examine figures 8.1 and 8.2, one thing they have in common is the set of image contours that define the characteristic shape of the London Underground sign, along which the intensity changes abruptly. Such image contours are known as edges. Some of these edges are caused by the use of paints of different colours, while others are just the silhouette of the sign against its background, but all are characterized by changes in grey-level between adjacent pixels of the digitized image. Since the edges in figures 8.1 and 8.2 correspond to the same features in the world, there is a strong basis for comparison and we are no longer subject to the vagaries of variable lighting.
In perhaps the most widely known computational account of visual perception, Marr (1982) suggests that intensity changes are detected and described as the first step in interpreting an image. This view is supported by biological studies of various animal visual systems, most notably in the work of Hubel and Wiesel (1968).
We have observed that edges may be useful for matching an input image with a stored template, but how do we detect edges in a methodical fashion? There has been a long search for the ideal edge-finding procedure -- that is, one which responds only to the meaningful edges in an image. All edge finders essentially look for step changes in intensity, but they differ in how such changes are characterized formally and in how they trade efficiency against correctness.
The search for edges normally proceeds by examining groups of neighbouring pixels, looking for the smallest pieces of edge, known as edge elements, and then grouping these into chains to form complete edges.
We consider a simple procedure for detecting edge elements, which produces an array of the same size as the digitized image, with a 1 at each location for which there is an edge element in the vicinity of the corresponding pixel and 0's elsewhere. This array is called an edge map (strictly speaking, it should be called an edge element map). The procedure works by repeating an identical operation at each pixel in the image. Consider one such pixel -- the current pixel -- with value a, and two of its neighbours, situated above and to the left, with values b and c:
    | b |
| c | a |
If there is a horizontal edge in the region of the current pixel, there should be a significant difference between a and b, while there may be very little difference between a and c. Similarly, if there is a vertical edge in the region of the current pixel, there should be a significant difference between a and c, while there may be little difference between a and b.
The difference between two numbers, a and b, expressed as a positive number, is called the 'absolute difference'; this is written |a-b|. We can test for both horizontal and vertical edges at once by adding together the absolute difference between a and b and the absolute difference between a and c, and testing whether the sum exceeds some pre-determined threshold value t. Formally, we can write this condition as:

|a-b| + |a-c| > t
When this condition is satisfied, 1 is recorded in the edge map at the position corresponding to the current pixel; otherwise 0 is recorded.
This works fine for vertical and horizontal edges, but what about diagonal edges? In this case, each of the absolute differences will be smaller than for an optimal horizontal or vertical edge, but when added together they have a similar chance of exceeding the threshold t.
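The detector is easily expressed in code. The following Python sketch is one possible rendering of the procedure just described; the function name, the list-of-lists image representation, and the decision to leave the border pixels (which lack a neighbour above or to the left) as 0 are illustrative assumptions rather than details fixed by the text.

    def edge_map(image, t):
        """Return a map with 1 wherever |a-b| + |a-c| > t, and 0 elsewhere."""
        rows, cols = len(image), len(image[0])
        edges = [[0] * cols for _ in range(rows)]
        # Start at row 1, column 1 so that every current pixel has a
        # neighbour above (b) and a neighbour to the left (c).
        for i in range(1, rows):
            for j in range(1, cols):
                a = image[i][j]
                b = image[i - 1][j]   # neighbour above
                c = image[i][j - 1]   # neighbour to the left
                if abs(a - b) + abs(a - c) > t:
                    edges[i][j] = 1
        return edges

Applied to an array of grey-levels with t set to 10, a procedure of this kind yields edge maps like those shown below.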
Figure 8.7 shows the edge map produced from the grey-levels shown in figure 8.3 corresponding to part of figure 8.1.
[Figure 8.7: The edge map produced from the Underground sign of figure 8.1.]
The threshold value t was 10. For clarity, small squares are shown surrounding the pixels where edge elements are present (i.e., 1's in the edge map).
Figure 8.8 shows the edge map produced from the template with the array of grey-levels shown in figure 8.6. Again, a threshold value of 10 was used.
[Figure 8.8: The edge map produced from the template of figure 8.6.]
By substituting edge maps for the input image and the template, we have eliminated the worst effects of variable lighting and so are now in a much better position to make a comparison. Again, an easy way to compare the two edge maps is to slide one over the other, looking for a position in which edges from the two line up with one another. A good match is one in which most 1's in the first edge map lie directly above 1's in the other edge map. We will accept some failed matches between 1's, owing to extra markings on the physical sign and to errors in the assumed viewing angle and image size used to generate the template.
Now if we place the top left-hand corner of the edge map derived from the template over the 10th column and the 12th row of the edge map derived from the input image, there is a good match between the two maps. Indeed over 60% of the edges in the first edge map are found to be in the second edge map.
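This comparison, too, can be sketched in a few lines. In the Python sketch below, the template's edge map is slid over the image's edge map and, at each offset, we score the fraction of the template's 1's that lie directly above 1's in the image; the function names and the exhaustive scan over offsets are assumptions made for illustration.

    def match_score(image_edges, template_edges, row, col):
        # Fraction of the template's edge elements that line up with
        # edge elements in the image when the template's top left-hand
        # corner is placed at (row, col).
        matched = total = 0
        for i, template_row in enumerate(template_edges):
            for j, value in enumerate(template_row):
                if value == 1:
                    total += 1
                    if image_edges[row + i][col + j] == 1:
                        matched += 1
        return matched / total if total else 0.0

    def best_match(image_edges, template_edges):
        # Try every placement of the template within the image and
        # return the best score together with its position.
        t_rows, t_cols = len(template_edges), len(template_edges[0])
        best = (0.0, 0, 0)
        for row in range(len(image_edges) - t_rows + 1):
            for col in range(len(image_edges[0]) - t_cols + 1):
                score = match_score(image_edges, template_edges, row, col)
                if score > best[0]:
                    best = (score, row, col)
        return best

A score above some acceptance level -- the 60% figure quoted above, say -- would then count as a successful match.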
Given the size of the physical sign and information about the camera optics, we could in principle compute the 3-dimensional position of the Underground sign depicted in figure 8.1 from its image location and size. This position could be output by the vision sub-system and used to guide the MTG towards the entrance to the Underground station (assuming, as is the case here, that the entrance to the station is roughly underneath the sign!).
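To make the geometry concrete: under the standard pinhole camera model the computation reduces to similar triangles, so an object of physical size S whose image spans s pixels, seen through a camera of focal length f (measured in pixels), lies at a distance of roughly f × S / s. The Python sketch below illustrates this; the focal length and sign size are invented values, not figures taken from the text.

    # Pinhole-model distance estimate: Z = f * S / s. The constants
    # below are assumed for illustration only.
    FOCAL_LENGTH_PX = 500.0   # focal length, in pixel units (assumed)
    SIGN_DIAMETER_M = 1.0     # physical diameter of the sign (assumed)

    def sign_distance(image_diameter_px):
        # Estimate the distance to the sign from its apparent image size.
        return FOCAL_LENGTH_PX * SIGN_DIAMETER_M / image_diameter_px

    print(sign_distance(50))  # a sign 50 pixels across is about 10 m away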
A major problem with this method is the need to generate and scan through a catalogue of templates. This can take up a great deal of computer time and is suited only to situations in which the position of a target object is known to be restricted to a narrow range. This might be the case, for example, in applications within production engineering. The problem would be alleviated if some means could be found for focusing on a restricted range of viewing angles, image locations, and sizes using fast but reliable heuristics.
Another problem arises when we deal with objects which are non-planar (i.e., fully 3-dimensional, unlike the Underground sign). For planar objects it is straightforward to predict how they will appear from different viewing angles, but this is not the case for non-planar objects. One solution is to develop an idealized 3-D model of the non-planar object (similar to those used in the design of motor cars) from which its appearance from different viewpoints can be predicted. Again, this method has been shown to work for a variety of everyday objects.
In conclusion, method 1 is highly predictive, in that no attempt is made to examine the image prior to matching with a catalogue of stored templates. It seems implausible that the human visual system could operate such a scheme to recognize familiar objects, if only because it would be overloaded by the vast number of comparisons between objects and templates. By way of contrast, we now examine a different approach to recognizing Underground signs which does try to make some sense of the image before attempting to find the sign.