III-C. Neural Networks and Eigenfaces for Finding and Analyzing Faces
Alexander Pentland and Terrence Sejnowski
Abstract: Animal communication through facial expressions is often thought to require very high-level, almost cognitive processing, yet pigeons can be trained to recognize faces and babies can do so as early as 3 hours after birth (Johnson & Morton, 1991). This talk surveys solutions attempted by computer vision researchers, placing them in the context of the biological literature. Linear principal components analysis (eigenfaces) and nonlinear artificial neural networks have been successful in pilot studies for finding and recognizing faces, lip reading, sex classification, and expression recognition. The representation of faces in these networks has global properties similar to those of neurons in regions of the primate visual cortex that respond selectively to faces. This approach depends on the availability of a large number of labelled exemplars of faces that can be used for training networks and extracting statistical properties from the sample population.
Presenters: T. J. Sejnowski (Neural Networks) and A. Pentland (Finding, Organizing, and Interpreting Faces)
There is a long history of research into face recognition and interpretation. Much of the work in computer recognition of faces has focused on detecting individual features such as the eyes, nose, mouth, and head outline, and defining a face model by the position, size, and relationships among these features. Beginning with Bledsoe's (1966) and Kanade's (1973, 1977) early systems, a number of automated or semi-automated face recognition strategies have modeled and classified faces based on normalized distances and ratios among feature points such as eye corners, mouth corners, nose tip, and chin point (e.g. Goldstein et al., 1971; Kaya & Kobayashi, 1972; Cannon et al., 1986; Craw et al., 1987). Recently this general approach has been continued and improved by the work of Yuille and his colleagues (Yuille, 1991). Their strategy is based on "deformable templates", which are parameterized models of the face and its features in which the parameter values are determined by interactions with the image.
Such approaches have proven difficult to extend to multiple views, and have often been quite fragile, requiring a good initial guess to guide them. In contrast, humans have remarkable abilities to recognize familiar faces under a wide range of conditions, including the ravages of aging. Research in human strategies of face recognition, moreover, has shown that individual features and their immediate relationships comprise an insufficient representation to account for the performance of adult human face identification (Carey & Diamond, 1977). Nonetheless, this approach to face recognition remains the most popular one in the computer vision literature.
In contrast, recent approaches to face identification seek to capture the configurational, or gestalt-like nature of the task. These more global methods, including many neural network systems, have proven much more successful and robust (Cottrell & Fleming, 1990; Golomb et al., 1991; Brunelli & Poggio, 1991; O'Toole et al., 1988, 1991; Turk & Pentland, 1989, 1991). For instance, the eigenface (Turk & Pentland, 1991) technique has been successfully applied to "mugshot" databases as large as 8,000 face images (3,000 people), with recognition rates that are well in excess of 90% (Pentland, 1992), and neural networks have performed as well as humans on the problem of identifying sex from faces (Golomb et al., 1991).
The problem of recognizing and interpreting faces comprises four main subproblem areas, taken up in turn below: detecting and locating faces in an image, tracking the head and face over time, recognizing identity, and interpreting facial actions such as expressions and visible speech.
There are three basic approaches that have been taken to address these problems:
View-based approaches. This class of methods attempts to recognize features, faces, and so forth, based on their 2D appearance without attempting to recover the 3D geometry of the scene. Such methods have the advantage that they are typically fast and simple, and can be trained directly from the image data. They have the disadvantage that they can become unreliable and unwieldy when there are many different views that must be considered.
Volume-based approaches. This class of methods attempts to interpret the image in terms of the underlying 3D geometry before attempting interpretation or recognition. These techniques have the advantage that they can be extremely accurate, but have the disadvantage that they are often slow, fragile, and usually must be trained by hand.
Dynamic approaches. These techniques derive from speech and robotics research, where it is necessary to deal with complex, rapidly evolving phenomena. As a consequence, methods such as hidden Markov models and Kalman filtering are applied in order to allow integration of sensor evidence over time, thus making possible reliable, real-time estimation.
Biological Foundation of Face Processing
Before summarizing the methods that have been devised for recognizing and interpreting faces in computer science, it is worthwhile to explore our current understanding of biological visual systems. Face recognition has important biological significance for primates, and facial expressions convey important social information. The scientific problem of how nature solves these problems is central to this report, and we may learn something from biological solutions that will help us in designing machines that attempt to solve the same problem (Churchland & Sejnowski, 1992).
Neurons in early stages of visual processing in visual cortex respond to specific visual features, such as the orientation of an edge, within a limited region of the visual field (Hubel & Wiesel, 1962). The response of a neuron in the primary visual cortex is broadly tuned to parameters such as orientation and color. The information about the shape of an object is distributed over many neurons. In intermediate stages of visual processing, neurons respond over larger regions of the visual field and they respond with greater specificity.
In monkeys, neurons have been found in temporal cortex and the amygdala that respond selectively to images of faces (Desimone, 1991). Such cells respond vigorously to particular faces but little, if at all, to other simple and complex geometric objects. No single feature of a face, such as the eyes or mouth, is necessary or sufficient to produce the response. Such neurons are good candidates for coding the properties of individual faces because their response is invariant over a wide range of spatial locations, a range of sizes of the face, and, to a more limited extent, the orientation and pose of the face. Recent statistical analysis of neurons in the temporal cortex suggests that they may be coding the physical dimensions of faces, whereas the responses of neurons in the amygdala may be better characterized by the social significance or emotional value of the face (Young & Yamane, 1992). Other areas of visual cortex are involved in representing facial expressions.
It is likely that specialized areas exist in the visual cortex of humans that are similar to those found in other primates. Humans with bilateral lesions in the mesial aspect of the occipitotemporal junction have selective deficits in recognizing faces of even familiar individuals (Damasio et al., 1982; Tranel et al., 1988). Lesions in other areas lead to deficits in recognizing facial expressions, but not recognition of identity. Although these data are indirect, they are consistent with the hypothesis that primates have special purpose cortical areas for processing faces.
Detection and location of faces in images encounter two major problems: scale and pose. The scale problem is usually solved by forming a multi-resolution representation of the input image and performing the same detection procedure at different resolutions (Burt, 1988b; Perry & Carney, 1990; Viennet & Fogelman-Soulie, 1992). Pose is a more difficult problem, and currently methods employing representations at several orientations are being investigated, with promising early results (Bichsel & Pentland, 1992).
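As an illustration of the multi-resolution idea (not of any particular cited system), the sketch below builds a crude image pyramid by block averaging and runs the same fixed-size detector at every level; the detect_at_scale callback and the factor-of-two downsampling are assumptions made for the example.

```python
import numpy as np

def downsample(image, factor=2):
    """Crude pyramid step: reduce resolution by averaging non-overlapping blocks."""
    h, w = image.shape
    h, w = h - h % factor, w - w % factor
    blocks = image[:h, :w].reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))

def build_pyramid(image, levels=4):
    """Return a list of progressively lower-resolution copies of the input image."""
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid

def detect_faces_multiscale(image, detect_at_scale, levels=4):
    """Run the same fixed-size detector at every pyramid level.

    detect_at_scale(img) is assumed to return a list of (row, col) candidates;
    coordinates are mapped back to the original resolution.
    """
    candidates = []
    for level, img in enumerate(build_pyramid(image, levels)):
        scale = 2 ** level
        for (r, c) in detect_at_scale(img):
            candidates.append((r * scale, c * scale, scale))
    return candidates
```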
Strategies in face detection vary considerably, depending on the type of input image. Posed portraits of faces with a uniform background constitute the majority of current applications. In this very simple situation, face detection can be accomplished by simple methods. For instance, face edges can be found (Brunelli, 1991; Cannon et al., 1986; Wong et al., 1989), or eyes, nose, mouth, etc., can be found using deformable templates (Yuille, 1991) or sub-templates (Sakai et al., 1969). Even the simplest histogramming methods (summing the image intensity along rows or columns) have been successfully used in this simple situation (Kanade, 1973; Brunelli, 1990).
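The histogramming idea can be stated in a few lines. The following sketch assumes a dark head against a uniformly bright background, sums gray levels along rows and columns, and takes the span of sufficiently dark rows and columns as a rough bounding box; the threshold is an arbitrary illustrative choice rather than a detail of Kanade's or Brunelli's systems.

```python
import numpy as np

def face_bounding_box(gray, dark_fraction=0.9):
    """Crude face localization by row/column intensity projections.

    Assumes a dark head against a uniformly bright background (a posed portrait).
    Rows/columns whose summed intensity falls below dark_fraction of the maximum
    projection value are taken to contain the head.
    """
    row_profile = gray.sum(axis=1)          # one value per image row
    col_profile = gray.sum(axis=0)          # one value per image column
    row_thresh = dark_fraction * row_profile.max()
    col_thresh = dark_fraction * col_profile.max()
    rows = np.where(row_profile < row_thresh)[0]
    cols = np.where(col_profile < col_thresh)[0]
    if rows.size == 0 or cols.size == 0:
        return None                          # nothing dark enough was found
    return rows[0], rows[-1], cols[0], cols[-1]   # top, bottom, left, right
```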
However, such simple methods may have difficulties when facial expressions vary. For example, a winking eye or a laughing mouth can pose a serious problem. Moreover, they are simply not adequate to deal with more complex situations.
Detection of faces in images with complex backgrounds requires a strategy of a different kind. Often it is a good idea to start with simple cues such as color (Satoh et al., 1990) or motion (Turk & Pentland, 1991) to locate the potential face targets for further verification. Such initial coarse detection has the effect of greatly reducing processing expense.
These initial quick detection methods must then be followed by more precise and reliable methods. This problem has been approached in three ways: feature-based templates, intensity-based templates, and neural networks. In the feature-based template approach, features such as the left-side, right-side, and hair/top contours are extracted, grouped, and matched to a template face (Govindaraju, 1992). In the intensity-based template approach, principal components of face images are used to locate the potential face regions (Turk & Pentland, 1991). In the neural network approach, face examples and background examples are used to train the neural network, which is then used to locate candidate faces (Viennet & Fogelman-Soulie, 1992). These methods may be combined with optical preprocessing to obtain very fast face detection (see Tutorial on Hardware, pages 29-31; Wang & George, 1991).
Tracking head motion has been studied mostly under the assumption that the background is either stationary or uniform. Taking the difference between successive frames will thus locate the moving head or body (Turk & Pentland, 1991). Features on the head, such as the hair and face, are then extracted from the motion-segmented image region (Mase et al., 1990). Alternatively, one can put marks, say blue dots, on several key points on the face, and the system can then track the head motion by extracting the marks (Ohmura et al., 1988).
Several systems require interactive extraction of facial features on the first frame of an image sequence. The systems can then track the motion of these features using a model of the human head or body (Huang et al., 1991; Yamamoto & Koshikawa, 1991).
Human body motions are highly constrained and therefore can be modeled well by a few parameters. Systems that track body motion by constrained motion (O'Rourke & Badler, 1980), Kalman filtering (Pentland & Sclaroff, 1991; Azarbayejani et al., 1992), and hidden Markov models (Yamato et al., 1992) have been demonstrated. The most precise tracking reported had a standard deviation error of approximately 1 centimeter in translation and 4 degrees in rotation (Azarbayejani et al., 1992), obtained using a Kalman filter approach.
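To give a flavor of the Kalman filtering approach (this is not the formulation of Azarbayejani et al., 1992), the sketch below filters one translational coordinate of the head under a constant-velocity model; the noise covariances are illustrative assumptions.

```python
import numpy as np

def track_constant_velocity(measurements, dt=1.0, meas_var=4.0, accel_var=0.25):
    """Kalman filter for one head coordinate under a constant-velocity model.

    measurements : sequence of noisy positions (e.g., pixel column of the head)
    Returns the filtered position estimates. Noise variances are illustrative.
    """
    F = np.array([[1.0, dt], [0.0, 1.0]])            # state transition (position, velocity)
    H = np.array([[1.0, 0.0]])                       # only position is observed
    Q = accel_var * np.array([[dt**4 / 4, dt**3 / 2],
                              [dt**3 / 2, dt**2]])   # process noise covariance
    R = np.array([[meas_var]])                       # measurement noise covariance
    x = np.array([[measurements[0]], [0.0]])         # initial state estimate
    P = np.eye(2) * 10.0                             # initial state uncertainty
    estimates = []
    for z in measurements:
        # predict ahead one time step
        x = F @ x
        P = F @ P @ F.T + Q
        # update with the new measurement
        y = np.array([[z]]) - H @ x
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ y
        P = (np.eye(2) - K @ H) @ P
        estimates.append(float(x[0, 0]))
    return estimates
```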
Finally, face or head tracking can be done by performing fast face detection on each frame. Multi-resolution template matching (Burt, 1988b; Bichsel & Pentland, 1992) and optical transforms (see Tutorial on Hardware, pages 29-31) are two such examples. The most precise tracking reported using this approach had a standard deviation error of approximately 2 centimeters in translation (Bichsel & Pentland, 1992).
One relatively successful approach to face recognition (detection of identity from a set of possibilities) is one that extracts global features from 2D, static images. The techniques use principal components analysis of the images, whether directly or via a neural network implementation (Cottrell & Fleming, 1990; Golomb et al., 1991; O'Toole et al., 1988, 1991, 1993; Turk & Pentland, 1989, 1991). For recognition, the projections of new faces onto these principal components are compared with stored projections of the training faces, either by correlation or more nonlinearly with a neural net. The extracted features have been called eigenfaces or holons. These "features" look like ghostly faces, and the method can be thought of as a weighted template matching approach using multiple templates extracted from the data.
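A minimal sketch of the eigenface idea in the spirit of Turk and Pentland (1991): flattened training faces are centered on their mean, a principal-component basis is obtained from the singular value decomposition, and a new face is assigned the identity of the nearest stored projection. The number of components and the nearest-neighbor matching rule here are generic textbook choices, not a reproduction of any specific system.

```python
import numpy as np

def train_eigenfaces(train_images, n_components=20):
    """train_images: array of shape (n_faces, height*width), one flattened face per row."""
    mean_face = train_images.mean(axis=0)
    centered = train_images - mean_face
    # rows of Vt are the principal components ("eigenfaces")
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    eigenfaces = Vt[:n_components]
    train_weights = centered @ eigenfaces.T        # projection of each training face
    return mean_face, eigenfaces, train_weights

def recognize(face, mean_face, eigenfaces, train_weights, train_labels):
    """Project a new flattened face onto the eigenfaces and return the closest identity."""
    weights = (face - mean_face) @ eigenfaces.T
    distances = np.linalg.norm(train_weights - weights, axis=1)
    return train_labels[int(np.argmin(distances))]
```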
These approaches have been tried on carefully controlled databases with about 16 to 20 subjects, yielding recognition rates of approximately 95 to 100% (Cottrell & Fleming, 1990; Turk & Pentland, 1991). More recently, they have been successfully applied to "mugshot" databases as large as 8,000 face images (3,000 people), with recognition rates that appear to be well above 90% (Pentland, 1992).
In controlled tests, these approaches have been found to be insensitive to facial expression and lighting direction (but not to shadowing) (Turk & Pentland, 1991). However, they are sensitive to orientation and scale changes, with scale being the more important of the two. The scale problem may be solved by scaling the face to fit the templates in advance or, equivalently, by storing templates at multiple scales (Burt, 1988b; Bichsel & Pentland, 1992). Orientation appears to require multiple templates for this approach to work (Bichsel & Pentland, 1992). It is important to determine how the number of required templates scales as the number of subjects is increased.
On the other hand, these approaches appear quite robust to occlusions, and this simple technique may be capable of approaching human levels of performance, depending on the storage available for the various templates representing the conditions. They have been applied with limited success to expression recognition (Cottrell & Metcalfe, 1991; Turk, 1991); the templates can easily be used to detect the location of faces in the image (Turk & Pentland, 1991) (see previous section); and finally, templates of parts of the face, such as "eigeneyes", may be used to verify the match or to detect important features such as gaze angle or blink rate (Turk, 1991).
The idea of global analysis using an eigenvector basis has been extended to 3D by Pentland and Sclaroff (1991). The major problem in this approach is to relate 2D and 3D information back to some canonical 3D representation. Classically, this can be solved by the technique of Galerkin projection, which is the basis of the well-known finite element method. In Pentland's method, a set of "eigenshapes," analogous to the 2D eigenfaces or holons discussed above, was created using a finite element model of compact, head-like shapes. In this approach shape is described as some base shape, e.g., a sphere, that has been deformed by a linear superposition of an orthogonal set of deformations such as stretching, shearing, bending, etc. These orthogonal deformations are the eigenshapes, and they form a canonical representation of the 3D shape.
To describe a shape, the 2D or 3D data is projected onto these eigenshapes, to determine how much of each deformation is required to describe the shape. The coefficients obtained describe the object uniquely, and may be used to compare the object's shape to that of known objects.
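Schematically, if the columns of Phi hold an orthonormal set of deformation modes and x0 is the base shape, describing a new shape reduces to a projection; the sketch below illustrates the linear-superposition idea only, not the finite element formulation of Pentland and Sclaroff.

```python
import numpy as np

def eigenshape_coefficients(x, x0, Phi):
    """Project a shape onto an orthonormal basis of deformation modes.

    x   : measured shape, flattened to a vector (e.g., stacked vertex coordinates)
    x0  : base shape (e.g., a sphere sampled at the same vertices)
    Phi : (len(x), n_modes) matrix whose orthonormal columns are the deformation modes
    Returns the modal coefficients; x0 + Phi @ coeffs reconstructs the shape.
    """
    return Phi.T @ (x - x0)
```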
Experiments using this approach to recognition have involved eight to sixteen people, and have used either silhouettes or range data as input. Recognition accuracies of approximately 95% have been achieved (Pentland & Horowitz, 1991). One of the most interesting aspects of this approach is that this accuracy seems to be independent of orientation, scale, and illumination.
Deformable templates (Fischler & Elschlager, 1973; Yuille et al., 1989; Yuille, 1991; Buhmann et al., 1989) are another approach that appears very promising. Yuille constructs analytic templates of face features and parameterizes them. The parameters are then used to define a Lyapunov function which is minimized when a match is found. The system thus does gradient descent in the parameters of the templates to detect the features. By ordering the weightings of the parameters in successive minimizations, a nice sequential behavior results in which first the eye is located, then the template is oriented, and finally the fine matching of features is performed. This approach is subject to local minima in the Lyapunov function, but a more sophisticated matching strategy avoids this problem (Hallinan, 1991). Robust matching methods (McKendall & Mintz, 1989) may be used to give some ability to deal with occlusions (Yuille & Hallinan, 1992). A disadvantage of this kind of method is the amount of computation required to detect the features (about five minutes of processing time on a Sun workstation). An advantage is that careful choice of the parameters makes the approach insensitive to scale.
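The flavor of the deformable-template search can be conveyed with a toy example: here the "template" is just a circle parameterized by center and radius, its energy is the mean image intensity sampled along the circle (low when the circle lies over a dark feature such as an iris), and plain gradient descent with numerical derivatives adjusts the parameters. Yuille's actual templates and energy terms are considerably richer than this.

```python
import numpy as np

def circle_energy(image, cx, cy, r, n_samples=64):
    """Mean intensity sampled along a circle; low when the circle lies on a dark feature."""
    theta = np.linspace(0, 2 * np.pi, n_samples, endpoint=False)
    xs = np.clip((cx + r * np.cos(theta)).astype(int), 0, image.shape[1] - 1)
    ys = np.clip((cy + r * np.sin(theta)).astype(int), 0, image.shape[0] - 1)
    return image[ys, xs].mean()

def fit_circle_template(image, params, step=1.0, lr=2.0, n_iter=200):
    """Gradient descent on (cx, cy, r) using crude central-difference derivatives."""
    params = np.asarray(params, dtype=float)
    for _ in range(n_iter):
        grad = np.zeros(3)
        for i in range(3):
            bumped = params.copy(); bumped[i] += step
            dropped = params.copy(); dropped[i] -= step
            grad[i] = (circle_energy(image, *bumped) -
                       circle_energy(image, *dropped)) / (2 * step)
        params -= lr * grad        # move downhill in the energy landscape
    return params
```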
Buhmann and colleagues (Buhmann et al., 1989, 1991), for instance, use a deformable template approach with global templates (see also Fischler & Elschlager, 1973). A grid is laid over an example face, and Gabor jets (a set of coefficients of Gabor filters of various orientations, resolutions and frequencies) are extracted at each grid point. So again, this is an unsupervised technique for extracting features from the data. The features are global at the level of Gabor jets because nearly the whole face can be regenerated from one of them.
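A minimal sketch of extracting a jet of Gabor coefficients at one grid point follows; the particular wavelengths, orientations, and envelope width are arbitrary choices for illustration, not those used by Buhmann and colleagues.

```python
import numpy as np

def gabor_kernel(size, wavelength, orientation, sigma):
    """Real (cosine-phase) Gabor patch of shape (size, size)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(orientation) + y * np.sin(orientation)
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * xr / wavelength)

def gabor_jet(image, row, col, size=31,
              wavelengths=(4, 8, 16),
              orientations=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    """Vector of Gabor responses at a single grid point (one entry per filter).

    Assumes the grid point lies at least size//2 pixels inside the image.
    """
    half = size // 2
    patch = image[row - half:row + half + 1, col - half:col + half + 1]
    jet = []
    for lam in wavelengths:
        for theta in orientations:
            kernel = gabor_kernel(size, lam, theta, sigma=lam)
            jet.append(float((patch * kernel).sum()))
    return np.array(jet)
```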
Given a new face, the grid is deformed using an energy minimization approach (similar to Yuille's technique) until the best match is found. This gives the system the ability to deal with orientation changes by producing the best match with the deformed template, and only one "training example" is necessary for each person. The disadvantage is that the system must potentially check the match against every stored template (one per known face), although it is likely that efficient data structures could be designed to store similar faces together. In current work, they have achieved an 88% recognition rate from a gallery of 100 faces (v.d. Malsburg, personal communication).
Burt (1988a, 1988b) uses a resolution hierarchy approach with specialized hardware to locate the face in the image at low resolution, and then proceeds with matching at higher resolutions to identify the face. At each stage, progressively more detailed templates are used in the matching process. This approach is promising because the efficient use of the pyramidal image representation and hardware allows near real-time face identification. Recent work by Bichsel and Pentland (1992) has extended this approach to include orientation (by using whole-face templates), and has achieved matching rates of up to 10 frames per second on an unaided Sun 4 processor.
Face motion produces optical flow in the image. Although noisy, averaged optical flow can be reliably used to track facial motion. The optical flow approach (Mase & Pentland, 1990a, 1991; Mase, 1991) to describing face motion has the advantage of not requiring a feature detection stage of processing. Dense flow information is available throughout the entire facial area, regardless of the existence of facial features, even on the cheeks and forehead (Mase, 1991). Because optical flow is the visible result of movement and is expressed in terms of velocity, it is a direct representation of facial actions (Mase & Pentland, 1990a, 1991). Thus, optical flow analysis provides a good basis for further interpretation of facial action. Even qualitative measurement of optical flow can be useful; for instance, we can focus on the areas where nonzero flow is observed for further processing, and we can detect stopping and/or the reversal of motion of facial expressions by observing when the flow becomes zero (Mase & Pentland, 1990a, 1991).
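As a generic illustration of flow estimation (not the specific procedure of Mase and Pentland), the sketch below computes a single average flow vector for a facial patch by a least-squares fit of the brightness-constancy equation over the patch, using spatial and temporal gradients from two consecutive frames.

```python
import numpy as np

def average_flow(frame0, frame1, region):
    """Least-squares estimate of the mean optical flow (u, v) inside a region.

    frame0, frame1 : consecutive gray-level frames as 2D float arrays
    region         : (top, bottom, left, right) bounds of the facial patch
    Solves [Ix Iy] [u v]^T = -It in the least-squares sense over the patch.
    """
    top, bottom, left, right = region
    Iy, Ix = np.gradient(frame0)                 # spatial gradients (rows, columns)
    It = frame1 - frame0                         # temporal gradient
    ix = Ix[top:bottom, left:right].ravel()
    iy = Iy[top:bottom, left:right].ravel()
    it = It[top:bottom, left:right].ravel()
    A = np.stack([ix, iy], axis=1)
    flow, *_ = np.linalg.lstsq(A, -it, rcond=None)
    return flow                                  # (u, v): mean motion of the patch
```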
Visible speech signals can supplement acoustic speech signals, especially in a noisy environment or for the hearing impaired (Sumby & Pollack, 1954). The face, and in particular the region around the lips, contains significant phonemic and articulatory information. However, the vocal tract is not visible and some phonemes, such as [p], [b] and [m], cannot be distinguished.
Petajan (1984), in his doctoral dissertation, developed a pattern matching recognition approach using the oral cavity shadow of a single speaker. His system measured the height, area, width, and perimeter of the oral cavity. Brooke and Petajan (1986) used a radial measure of the lip's motion to distinguish between phonemes and to synthesize animation of speech. Petajan improved his system at Bell Laboratories (Petajan et al., 1988), using only the easily computable area feature and a set of acoustic rules, to achieve near-real-time performance.
Nishida (1986), of the MIT Media Laboratory, used optical information from the oral cavity to find word boundaries for an acoustic automatic speech recognizer. Nishida's work was the first to use dynamic features of the optical signal. Nishida found that the derivative of the oral cavity (dark) area exceeded a given threshold at word boundaries, since changes in the dark area are abrupt when the pace of speech articulation is interrupted at a word boundary.
Pentland and Mase (1989) and Mase and Pentland (1991), also at the MIT Media Laboratory, were the first to use a velocity or motion-based analysis of speech. Their technique used optical flow analysis, followed by eigenvector analysis and dynamic time warping, to do automatic lipreading. They were able to achieve roughly 80% accuracy for continuously-spoken digits across four speakers, and 90% accuracy when voicing information was available. Perhaps the most interesting element of this research was the finding that the observed "eigenmotions" corresponded to elements of the FACS model of face motions. Mase (see "Tracking Faces" above and page 29) was later able to extend this approach to recognizing a wide range of facial motions and expressions.
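Dynamic time warping, used above to align lip-motion feature sequences of different durations, can be stated compactly; the Euclidean local distance in the sketch is a standard choice and not necessarily the one used in the lipreading work.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two feature sequences.

    seq_a, seq_b : arrays of shape (time, n_features), e.g. per-frame lip-motion
    coefficients. Returns the cost of the best monotonic alignment.
    """
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])   # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]
```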
Stephen Smith (1989) reported using optical information from the derivatives of the area and height features to distinguish among the four words that an acoustic automatic speech recognizer confused. Using the two derivatives, Smith could distinguish perfectly among the four acoustically confused words.
Garcia and Goldschen, using a synchronized optical/acoustic database of 450 single-speaker TIMIT sentences developed by Petajan, have analyzed -- by means of correlation and principal component analysis -- the features that are most important for continuous speech recognition (Garcia et al., 1992). The unit of optical speech recognition is the "viseme," a term coined by Fisher (1968), which denotes the distinguishable sequence of oral cavity movements (shapes) that corresponds to a phoneme. The novelty of their approach is that the feature extraction techniques pointed to some unexpected groupings of correlated features, and demonstrated the need to put particular emphasis on the dynamic aspects of some features.
Psychoacoustic experiments on humans strongly suggest that the visual and acoustic speech signals are combined before phonemic segmentation. Ben Yuhas designed a neural network to map normalized images of the mouth into acoustic spectra for nine vowels (Yuhas et al., 1989). The goal of his research was to combine the optical information with the acoustic information to improve the signal-to-noise ratio before phonemic recognition. As acoustic recognition degraded with noise, the optical information maintained the overall recognition performance. For small vocabularies (ten utterances), Stork, Wolff and Levine (1992) have demonstrated that a speaker-independent time-delay neural network using both optical and acoustic signals is more robust than a purely acoustic recognizer.
Neural networks and the eigenface approach hold promise for producing computationally efficient solutions to the problem of recognizing facial expressions. In this section we present an introduction to feedforward neural networks. Networks provide nonlinear generalizations of many useful statistical techniques such as clustering and principal components analysis. Preliminary results of the application of neural networks to sex recognition and facial expressions are encouraging. They can also be implemented on parallel computers and special purpose optical and electronic devices (see Tutorial on Hardware, pages 29-31), so that relatively cost-effective solutions to the real-time analysis of faces are in the offing (Sejnowski & Churchland, 1992).
Neural networks are algorithms, inspired more or less by the types of computational structures found in the brain, that enable computers to learn from experience. Such networks comprise processing elements, known as "units", which are analogous to neurons. These are classed as input units, hidden units, and output units. A connection from one unit to another means that the activity of the first directly influences the activity of the second; the propensity of activity in one unit to induce or inhibit activity in the other is called the "weight" of the connection between these units. Networks learn by modifying these connection strengths or "weights."
Input units, akin to sensory receptors in the nervous system, receive information from outside the network. In the nervous system, a sensory receptor must transduce a signal such as light intensity or pressure into the strength of a signal; in neural networks the strength of the input signals is determined by the nature of the problem. In the case of vision, the inputs might be a gray-level array corresponding to the image being classified, or a processed version of the inputs, which might include extracted features. (If the feature extraction is a nonlinear process then important information may be removed from the input, and the performance may be degraded. For an example of this see the section below on sex recognition.) The input signal is relayed from the input unit to a hidden unit. (Input units may alternatively send their signals directly to output units, but the class of problems which can be solved using this technique is more limited; it corresponds to problems termed "linearly separable", meaning that if the input points were graphed with relevant axes, it would be possible to draw a line separating the points in each output class from those in the others -- or if the points are in a space of n dimensions, a hyperplane of n-1 dimensions could be made to separate the output classes. Input units can, however, be made to send their signals both to hidden and output units, leading to more complex network architectures.) Hidden units, analogous to interneurons, serve solely as intermediate processors; they receive signals from input units and send signals either to further layers of hidden units or to output units. Their job is to perform a nonlinear transformation of the inputs, making a three-layer network more powerful than a two-layer network without hidden units. Output units deliver the outcome of the processing. Thus, they can express anything from how strongly a muscle fiber should twitch (motor neurons in biology are classic output units) to the recognition of the expression on a face.
The "architecture" of a network comprises the details of how many layers are present, how many units invest each layer, and which units are connected to which others. In a standard three-layer network, there is one input layer, one hidden layer, and one output layer. In a fully connected feedforward network, all input units connect to all hidden units, and all hidden units connect to all output units; however many variations on this theme exist. This class of networks is called feedforward because activity in a unit only influences the activity of units in later layers, not earlier ones; feedback or recurrent networks can also be constructed and are discussed in the next subsection. For each connection between two units, a "weight", akin to a synaptic efficacy, characterizes the "strength" of a connection -- the propensity of one neuron to cause the neuron to which it feeds to become active. Learning by the network requires the selective modification of these weights, and different strategies have been devised to accomplish this. Again, networks "learn" by successively modifying the strengths of the connections between units, in a direction to reduce the error at the output.
Backpropagation of Errors
Backpropagation is one widely used strategy for computing the gradient of the overall error with respect to the weights (Rumelhart et al., 1986). The weights, initially set to small random values, are iteratively changed to reduce the error of the output units for each input pattern. For a given input pattern, the activity of each hidden unit, and later of each output unit, is calculated from its weighted, summed input. The output function (chosen by the network architect) determines the activity of a unit based on this weighted, summed input, and is usually taken to be a nonlinear sigmoid function. With this choice the output cannot increase without bound as the incoming signals and weights increase. In statistics, fitting input data with sigmoid functions leads to nonlinear logistic regression. Many other nonlinear output functions can also be used, and the choice is dictated by practical issues such as speed of training, accuracy, and amount of available data.
For each training example (which serves as an "input" to the network, i.e. a full set of activities for the input units), the actual network output (the collective activities or outputs of all "output" units) is compared to the desired output and an error is calculated. A summed (squared) error across all training examples is obtained, and by taking the derivative of this error with respect to a given weight, one can determine the direction to modify the weight in order to minimize the output error. The weights for hidden-to-output units are modified by a small amount in the direction to reduce the output error, and the "chain rule" from elementary calculus is invoked to extend or "back propagate" this differentiation in order to modify by a small amount the weights at the earlier input-to-hidden level. (The amount by which weights are modified is given by the "learning rate", a variable parameter.) The whole process is repeated, again giving each training example as input, calculating the output error, and incrementally and iteratively modifying the weights until the error begins to asymptote (or until "cross validation" techniques, which involve testing the network with untrained examples, suggest that further training will yield "learning" which is not generalizable to examples outside the training set).
To make this process concrete and germane, a simple compression-style network for sex classification can be considered (a minimal sketch is given below). A set of face images serves as input; normalized gray level values for each point on a 30x30 pixel image provide values for each of 900 input units (each receiving information from one of the 900 points on the image, in analogy to photoreceptors in the retina). These activities are passed through initially random weights to 40 hidden units, which, in turn, are connected by initially random weights to a single output unit, which is meant ultimately to give a value of 0 if the input face is female and 1 if male. With the initially random weights, the actual output will bear no resemblance to the desired output, and the output error will be calculated. A second training face will be shown, and its output error calculated. After all the training faces have been presented (reserving some faces which the network has not trained on for testing the network later), a summed error across all faces will be calculated and the weights will be slightly modified to make the network do less badly with the next round of presentations. This process will be repeated until the network is doing as well as it seems likely to, at which point "test" faces, which the network has never trained on, can be presented to evaluate the network's performance on the task. The activities of the 40 hidden units form a low-dimensional representation of the face, and the weights to the hidden units look like ghostly faces when viewed as images.
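The following sketch implements the training loop just described: 900 gray-level inputs, 40 sigmoid hidden units, and one sigmoid output pushed toward 0 (female) or 1 (male) by backpropagation of the summed squared error. The learning rate, number of passes, and batch update scheme are illustrative choices, not those of any published study.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_sex_net(images, labels, n_hidden=40, lr=0.1, n_epochs=200, seed=0):
    """Backpropagation on a 900-40-1 network for sex classification.

    images : (n_faces, 900) normalized gray levels of 30x30 face images
    labels : (n_faces,) desired outputs, 0.0 for female and 1.0 for male
    Weights start as small random values and are nudged after each full
    presentation of the training set (batch updates).
    """
    rng = np.random.default_rng(seed)
    n_in = images.shape[1]
    W1 = rng.normal(0, 0.01, (n_hidden, n_in)); b1 = np.zeros(n_hidden)
    w2 = rng.normal(0, 0.01, n_hidden);         b2 = 0.0
    for _ in range(n_epochs):
        gW1 = np.zeros_like(W1); gb1 = np.zeros_like(b1)
        gw2 = np.zeros_like(w2); gb2 = 0.0
        for x, t in zip(images, labels):
            h = sigmoid(W1 @ x + b1)                       # hidden activities
            y = sigmoid(w2 @ h + b2)                       # network output
            delta_out = (y - t) * y * (1.0 - y)            # output error signal
            delta_hid = (w2 * delta_out) * h * (1.0 - h)   # back-propagated to hidden layer
            gw2 += delta_out * h;           gb2 += delta_out
            gW1 += np.outer(delta_hid, x);  gb1 += delta_hid
        W1 -= lr * gW1; b1 -= lr * gb1                     # small step against the gradient
        w2 -= lr * gw2; b2 -= lr * gb2
    return W1, b1, w2, b2
```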
Many of the mysteries regarding the mathematical properties of feedforward networks and backpropagation have yielded to analysis over the last few years. We now know that they are universal approximators in the sense that feedforward networks can approximate well-behaved functions to arbitrary accuracy. The complexity of feedforward network models is also well understood in the sense that the bounds on the number of training examples needed to constrain the weights in the network have been established. Networks with linear hidden units perform principal components analysis, and nonlinear hidden units provide a nonlinear generalization of this technique. These simple networks have proved their worth in diverse tasks ranging from determining whether patients presenting to an emergency room with chest pain are having a heart attack, to making currency trading decisions, to deciding whether cells under a microscope are likely to be cancerous. The key to success in all of these examples is an adequate database of training data.
The feedforward architecture used in many neural network applications is quite versatile. A variety of functions have been used for the nodes in the hidden layer in place of the sigmoid function. For example, radial basis functions have been used effectively and have several advantages for some problems, including faster training (Poggio, 1990). For many problems, such as speech recognition, where the information is spread out over time, the temporal pattern can be mapped into a spatial array -- an arrangement called a time-delay neural network (Sejnowski & Rosenberg, 1987; Waibel et al., 1989). More advanced neural network architectures incorporate dynamical properties, such as temporal processing in the nodes (time constants or short-term memory) and feedback connections (Pearlmutter, 1989). These neural network architectures have also been used for solving control problems, such as controlling a robot arm (Jordan, 1992) or tracking moving objects with the eyes (Lisberger & Sejnowski, 1992).
Humans are competent at recognizing the sex of an individual from his or her face, though in real life the task may be facilitated by non-featural cues of facial hair or male-pattern baldness, by social cues of hairstyle, makeup, and jewelry, and by non-facial biological and social cues of size, body morphology, voice, dress style, and mannerisms. The task of distinguishing sex from static facial cues alone, in the absence of hairstyle, makeup and other disambiguating cues, is more difficult, though humans still perform quite well. Performance comparable to humans has been reported using neural networks (Cottrell & Metcalfe, 1991; Golomb et al., 1991; Brunelli & Poggio, 1991). These networks rely on preprocessing the image by normalizing it (for instance, re-sizing and centering it), and either extracting features or "compressing" the image using autoencoding (as described above). In autoencoding, a method that is equivalent to "eigenfaces" when the hidden units are linear, a network is asked to reproduce each input image as output after forcing it through a "bottleneck" of hidden units. The hidden layer "recodes" each image with many fewer units. This more parsimonious representation of each face, given by the activities of the hidden units of the "autoencoder" for that face, can substitute for the face -- for instance as input to subsequent networks.
Brunelli and Poggio (1991), using radial basis functions in the hidden layer, found that the position of the eyebrow was more important than the position of the mouth for gender classification on their data set; however, the practice of brow tweezing among women suggests their network may have tapped into an artificial sign (akin to makeup and hairstyle) rather than a static or slow sign of identity (see page 9). However, comparable performance could be achieved using the gray-level representation of the face directly (Golomb et al., 1991). Recently, it has been shown that compression is not needed and good performance is possible from a two-layer network with direct connections from a normalized gray level image to a single output unit (Gray et al., 1993). This surprising result indicates that extraction of sex information from faces is less complex than had been assumed. Examination of the weights in this model reveals that information about the sex of a person is distributed over many regions of the face.
Our present experience with network architectures for expression recognition is limited. Most work has involved frontal images of static faces under good illumination (Cottrell & Metcalfe, 1991). In one unpublished, preliminary study using expressions from a single person, Golomb trained a network to recognize eight distinct facial actions. The network was trained on 9 examples (of variable intensities) of each facial action, and was tested on a tenth, different, example of that facial action; this process was iterated (using a technique termed "jackknifing"), reserving a different face for testing each time and training from scratch on the remaining nine. The data from the ten independent networks were compiled for statistical analysis. The facial actions employed corresponded roughly, in lay terms, to smile, frown, brow-raise, sneer, squint, pucker-lips, purse-lips, and neutral expression. As expected, the two most similar facial expressions (purse-lips and pucker-lips), which were difficult for human observers to distinguish in some instances, took longer for the network to learn than more dissimilar expressions. These similar expressions were selected for the purpose of assessing how well the network would do when challenged with subtle distinctions in facial expression. However, the "neutral" expression, though never misclassified as any other expression, took longest for the network to learn, though ultimately test cases of all eight expressions were correctly categorized by the network in almost all instances. (Interestingly, human observers also had difficulty classifying neutral faces as such, though they would not classify them among any of the other available options.)
Optical flow can also be used to extract dynamical muscle actions from sequences of images. Facial actions, forming a fifteen-dimensional feature vector, were used to categorize four expressions using a nearest-neighbor technique (Mase & Pentland, 1991). The eigenface approach has also been used to successfully classify expressions for a single person (Turk, 1991). More recently, a variation on the eigenface approach has successfully classified the six basic emotional states across a database of eleven people (Pentland et al., 1992).
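The nearest-neighbor step in the expression experiment is straightforward to state; the sketch below classifies a fifteen-dimensional flow-derived feature vector by the label of the closest training vector, leaving the feature extraction itself outside the sketch.

```python
import numpy as np

def nearest_neighbor_expression(query, train_features, train_labels):
    """Classify a feature vector (e.g., 15 optical-flow-derived facial actions)
    by the label of its nearest training example."""
    distances = np.linalg.norm(train_features - query, axis=1)
    return train_labels[int(np.argmin(distances))]
```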
These preliminary results are encouraging and provide evidence that the recognition part of the problem of automating the classification of expressions may be solvable with existing methods.
NOTE: This section was based on tutorials given by Alexander Pentland and Terrence Sejnowski at the Workshop. Beatrice Golomb assisted with writing the section on neural networks.