IV-B. Breakout Group on Sensing and Processing
Participants: A. Yuille, A. Pentland, T. S. Huang, P. Burt, G. Cottrell, O. Garcia, H. Lee, K. Mase, T. Vetter, Z. Zhang
Sensing and Processing
Alan Yuille, Alexander Pentland, Peter Burt, Gary Cottrell, Oscar Garcia, Hsien-che Lee, Kenji Mase, and Thomas Vetter
Introduction and Overview
The charter of Group 2 was to investigate sensing and processing techniques for automatically extracting representations of faces and facial features from real images. These representations should be transformed into descriptions that the psychologists in Group 1 (Basic Science) consider necessary for understanding facial expressions. The goal would be to develop a fully automated system.
There has been very little work done on this problem. This report attempts to summarize the state of the art and to describe promising directions to pursue. Many of the techniques described were developed for the related, and far more studied, problem of face recognition. There now exist face recognition techniques that will reliably recognize faces under restricted viewing conditions.
Lip reading is also a closely related problem. It involves extracting and describing the motion of the lips during speech, which can be considered a form of facial feature understanding.
We organize this section of the report as follows. First, we describe the issues of sensing and environment. Next, we investigate methods for detecting the presence of faces in the image. After that, we consider how to detect facial features first from static images and then from motion sequences assuming that the head has been located. Finally, we consider how to interpret expressions.
There are some reliable techniques for locating faces in images. There is, however, only preliminary work on detecting facial features -- though several directions seem promising. Work on interpreting expressions is at an even more preliminary stage. The difficulty of all these problems depends strongly on the viewing conditions, the orientation of the head, and the speed of head movement. Controlling these factors, when possible, should considerably simplify the problems.
One major difficulty in building a fully automatic system is that many of the basic science questions have not yet been addressed, such as what the relevant patterns of facial movement are, what the combinations of these movements signify, and so forth. Input from the psychologists in Group 1 (Basic Science) and the computer animators in Group 3 (Modeling and Database) would be very useful for determining the quantitative descriptors of facial expressions, including their time dependent behavior.
There are many potential applications for a fully automatic facial feature understanding system (e.g., performance monitoring, communications, teleconferencing, lip reading, medical diagnosis, security/intelligence, content-based image processing, human-computer interaction, virtual reality, expression tracking, animation, multi-media). The use of passive vision is desirable since it is non-invasive and could work in a large variety of environments. In certain situations it might be supplemented by active processes (see pages 39-40).
Of particular interest are human/machine interfaces. The availability of reliable head tracking, face recognition, and expression recognition systems would allow major improvements in human/computer interactive systems. The issue of what intermediate systems are worthwhile is discussed in the Group on Basic Science report.
Sensing and Environments
Various factors must be considered when selecting and arranging the sensors for monitoring facial expression. The essential parameters are quite simple: the spatial and temporal resolution of the video images obtained, and the camera's field of view. The sensor must provide sufficient detail to discriminate expressions of interest, and it must provide a sufficiently wide field of view to ensure that the face stays in view. In general, however, meeting these basic requirements with a single fixed camera can be difficult. Research challenges in the area of sensing and environments relate to strategies for controlling camera gaze and zoom, in order to effectively extend the field of regard while maintaining high resolution.
While it may be desirable to use cameras with both as high a resolution and as wide a field of view as possible, this can place an undue burden on the computer that must analyze the resulting data. The sensor data rate is the product of field of view, spatial resolution (samples per unit angle), and temporal resolution (frame rate). Required rates depend on the application, but can easily exceed practical limits on computing devices. Strategies that allow a small field of view camera to survey a wide field of regard also make the most effective use of limited computing resources.
A standard NTSC video camera provides an image that, when digitized, measures 768 by 480 pixels. For a typical face monitoring task it may be necessary to arrange the camera so that there are at least 50 pixels across the width of a subject's face. The field of view can then be about ten times the width of the face. This camera should be sufficient for applications in which the subject is seated but otherwise is free to move his head. On the other hand it may not be sufficient for applications in which the subject is free to walk in front of and approach or move away from the camera. (Note that behavioral scientists often try to fill the frame with the face to make FACS scoring easier.)
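To make the resolution and data-rate tradeoff concrete, the following sketch (in Python) works through the arithmetic for an assumed monochrome NTSC setup; the numbers are taken from the discussion above or are illustrative assumptions, not requirements:

    # Back-of-the-envelope sensor calculations.  All numbers are assumptions
    # drawn from the discussion above, not measured requirements.
    frame_width_px  = 768    # digitized NTSC width
    frame_height_px = 480    # digitized NTSC height
    frames_per_sec  = 30     # full NTSC frame rate
    bytes_per_pixel = 1      # 8-bit monochrome assumed

    min_face_px = 50         # minimum pixels across the subject's face

    # Upper bound on how many face-widths the field of view can span while
    # keeping at least min_face_px pixels across the face.
    face_widths_in_view = frame_width_px / min_face_px
    print("field of view <= %.0f face widths" % face_widths_in_view)

    # Raw sensor data rate = samples per frame x frame rate.
    data_rate = frame_width_px * frame_height_px * frames_per_sec * bytes_per_pixel
    print("raw data rate ~ %.1f MB/s" % (data_rate / 1e6))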
The temporal frame rate required for monitoring facial expressions depends on the types of expressions that are of interest. Some expressions, such as a smile or frown, may persist for several seconds. A frame rate as low as one frame per second may suffice if one needs only to determine presence as opposed to temporal information. Monitoring more subtle or fleeting expressions may require ten or more frames per second. Lip reading almost certainly requires full NTSC frame rates (30 frames or 60 fields per second).
Large format cameras are available. Kodak, for example, markets a camera that measures 1320 by 1035 pixels, and another that measures 2048 by 2048. However these cameras provide only 10 frames and 5 frames per second, respectively. High definition cameras are roughly 1440 by 860 pixels, and operate at full frame rates, but these are very expensive. A less expensive alternative, if a wide field of view is required, is simply to use several NTSC cameras each covering a portion of the scene.
The task of automatically tracking faces or facial features can be simplified considerably through the addition of marks on the face. While it is desirable to monitor faces as unobtrusively as possible, the use of facial markings may be expedient in the near term for research applications, such as the study of facial expression and the development of computer interfaces that monitor the user's face.
TV cameras can be augmented with other sensors to obtain additional information about a face. For example, sensors have been developed that provide 3D range data. Commercially available range systems are too slow to be used in monitoring expressions. But new devices are being built that have the potential of providing an updated range map at frame rate, 30 frames per second.
Key research challenges
As noted above, a single camera can pan and zoom under computer control to follow faces as they move. This strategy can, in effect, provide a very wide field of view at high resolution, while keeping data rates and computation loads low. But use of a controlled camera introduces other complications. A special camera mount with drive motors is required, and fast image analysis is required to determine where to orient the camera on a moment-by-moment basis. The development of sensors and analysis techniques with these capabilities is the subject of research in the field of "active vision".
In general terms the objective of active camera control is to focus sensing resources on relatively small regions of the scene that contain critical information. However, a vision system often must also observe the scene with a wide field of view camera (at the low resolution) in order to determine where to direct the high resolution observations. This is analogous to foveal vision in humans: the fovea provides resolution needed for discriminating patterns of interest, while the periphery provides broad area monitoring for alerting and gaze control.
A foveal strategy that allocates some sensing resources to broad area monitoring, and some to region-of-interest observation can reduce the actual data that needs to be provided by a sensor, and processed by the vision system, by a factor of 1000 or more. This can easily mean the difference between a system that is too large to be considered for any application and one that is sufficiently small to be generally used.
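A minimal sketch of this foveal strategy, with illustrative (assumed) sizes: keep a full-resolution window around the region of interest, subsample the rest coarsely, and compare the resulting pixel count with that of the full high-resolution frame:

    import numpy as np

    def foveate(frame, roi_center, roi_size=64, periphery_step=64):
        """Return a full-resolution patch around roi_center plus a coarsely
        subsampled copy of the whole frame (the 'periphery')."""
        r, c = roi_center
        half = roi_size // 2
        r0, r1 = max(0, r - half), min(frame.shape[0], r + half)
        c0, c1 = max(0, c - half), min(frame.shape[1], c + half)
        fovea = frame[r0:r1, c0:c1]                            # high-resolution ROI
        periphery = frame[::periphery_step, ::periphery_step]  # coarse overview
        return fovea, periphery

    # Hypothetical 4096 x 4096 coverage of a wide field of regard.
    frame = np.zeros((4096, 4096), dtype=np.uint8)
    fovea, periphery = foveate(frame, roi_center=(2000, 2400))
    reduction = frame.size // (fovea.size + periphery.size)
    print("data reduction factor ~", reduction)   # on the order of 1000 or more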
There are two primary areas of research in the area of active vision. The first is in the development of fast, intelligent, control processes to direct the camera. The second is the development of special sensors for foveal vision. Technology for real time control is only beginning to be developed, since real time hardware has been available for only a few years. This work needs to be extended for face location and tracking for the specific application of facial expression recognition. Experimental sensors with foveal organization have recently been built as well. Current devices are too limited in resolution to be considered for practical applications. An alternative is to obtain images at full resolution with a standard camera, then reduce data and resolution electronically to obtain an equivalent foveal sensor. This approach is possible with current technology.
In addition to this work on sensor control, research should be directed to the use of new sensors, such as those that provide range data. Range data has been used in face recognition. The usefulness of such data for recognizing facial expression should be a topic for further study.
It is likely that current sensor technology will suffice for the immediate needs of the research community. New 3D sensors could prove very effective. Advanced sensor technology, particularly to control the sensors, will be essential for practical systems for use in medical, computer interface, communication, or other commercial applications.
Detection of Faces
Discerning the existence and location of a face, and tracking its movements, are perceptual abilities that have received little attention in their own right in the behavioral science literature, yet duplicating these innate, automatic functions computationally is not trivial. This task is a precursor to determining the information that the face provides. The strategies that have provided some success in locating faces are described in the Tutorial on Neural Networks and Eigenfaces (pages 19 to 21).
Key research challenges
A robust way to locate the faces in images, insensitive to scale, pose, style (with or without eyeglasses or hair), facial expression, and lighting condition, is still the key research challenge, especially in complex environments with multiple moving objects.
It seems that image segmentation based on the combined use of color, texture, shape (geometry and shading), and model knowledge could provide better performance than most existing algorithms.
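As one concrete, if simplified, illustration of such combined-cue segmentation, the sketch below uses color alone: it thresholds normalized red-green chromaticity to find skin-like pixels and keeps connected components whose size and aspect ratio are plausible for a face. The thresholds and size limits are assumptions that would need tuning, and a full system would add texture, shape, and model knowledge as argued above:

    import numpy as np
    from scipy import ndimage

    def face_candidates(rgb, rg_box=((0.35, 0.55), (0.25, 0.35)), min_area=400):
        """Rough face-candidate detector based on skin chromaticity.
        rgb: H x W x 3 float array with values in [0, 1].  The chromaticity
        box and minimum area are illustrative assumptions."""
        s = rgb.sum(axis=2) + 1e-6
        r, g = rgb[..., 0] / s, rgb[..., 1] / s            # normalized chromaticity
        (r_lo, r_hi), (g_lo, g_hi) = rg_box
        skin = (r > r_lo) & (r < r_hi) & (g > g_lo) & (g < g_hi)

        labels, _ = ndimage.label(skin)                    # connected skin regions
        boxes = []
        for sl in ndimage.find_objects(labels):
            h = sl[0].stop - sl[0].start
            w = sl[1].stop - sl[1].start
            if h * w >= min_area and 0.5 < w / h < 1.5:    # roughly face-shaped
                boxes.append(sl)
        return boxes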
For applications that allow careful control of lighting and background, some effort should be directed at designing entire systems and environments for face location and tracking.
Face detection and tracking are the first steps in face recognition and facial expression understanding. Without knowing where the faces are, most feature extraction algorithms will produce many false targets and thus make themselves less useful. When faces are properly located and tracked, our knowledge about spatial features of a face can be used very effectively.
Feature Extraction from Static Images
Feature extraction may be divided into at least three dimensions represented in the figure below. The first consideration is static versus dynamic features: Is temporal information (a sequence of images) used or not? The second is the grain of the features: These may be divided into global features, spanning roughly the whole object being analyzed at one extreme, and analytic or part-based features, spanning only subparts of the image. The third is view-based versus volume-based, or 2D versus 3D features. 3D features can be extracted using special sensors or active sensing.
Given this nomenclature, most of computer vision for the last thirty years has been directed towards static, analytic, 2D feature extraction. Optic flow is in the dynamic, analytic, 2D corner. It is of interest here to consider what has been accomplished in these traditional corners, and what might be accomplished in some of the others.
Key research challenges
Most current systems have been applied to small databases; only the eigenface approach has been applied to a large database. It is important to assess and compare these techniques on large databases.
The techniques used in recognizing identity should be applied to expression recognition. Some researchers (Cottrell & Metcalfe, 1991; Kohonen et al., 1977; Turk & Pentland, 1991) have looked at expression recognition. Mase (see the next section) has looked at optic flow for FACS detection. There is a need to extend these techniques, possibly by principal component analysis, and to evaluate them. In order to do this, large, labeled databases are required.
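A minimal sketch of the principal-component idea, assuming a set of aligned, same-size face images: build an orthonormal eigenface-style basis with a singular value decomposition and project new images onto it to obtain low-dimensional feature vectors that an expression classifier can use. The array shapes and the number of components are assumptions:

    import numpy as np

    def fit_eigenbasis(train_images, k=20):
        """train_images: N x H x W array of aligned face images (assumed).
        Returns the mean image and the top-k principal components."""
        X = train_images.reshape(len(train_images), -1).astype(float)
        mean = X.mean(axis=0)
        # Rows of Vt are orthonormal principal directions of the centered data.
        _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
        return mean, Vt[:k]

    def project(image, mean, components):
        """Feature vector: coefficients of the image in the eigenbasis."""
        return components @ (image.reshape(-1).astype(float) - mean)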
Better techniques are needed to achieve scale invariance and deal with background noise. Combinations of the above techniques could be useful, e.g., one could use eigentemplates for features combined with springs in an energy function approach.
Implementation on parallel hardware could speed up and simplify the algorithms.
These techniques could be used to detect boredom, attention wandering, or sleepiness, but the behavioral scientists have not specified the features required for such performance monitoring. Again, temporal, labeled databases validated by physiological measures are necessary for attacking the problem.
New research should explore different parts of the feature cube diagrammed above. Corners not explored thus far include temporal global features, and the use of 3D part-based features. Can optic flow be used for real-time monitoring? Can active vision techniques reduce the computational requirements of optic flow?
In all of the eigenface/holon approaches, the features extracted were linear. Nonlinear feature extraction through new dimensionality reduction techniques could give lower dimensional representations, and compact parameterizations of expression and face space.
Extracting features from faces is the first step in automating a facial expression understanding system.
Feature Extraction from Image Sequences
Basic features: Current status
Changes in the shapes of facial features, their relative positions, and the optical flow in facial areas are parametric features suitable for describing facial expressions. Moreover, they are extractable by computer vision techniques.
Static parametric features have been used in person identification systems (Sakai et al., 1972; Kanade, 1973), and the same algorithms for feature extraction may be worth trying (see pages 15-17, 23-24, and 41).
The face is a good subject for computer vision research, because the shape of facial features and their relative arrangement are universal regardless of age, gender, and race. Consequently we have a priori knowledge, and perhaps even a facial model, that can be used to help extract information-bearing features. For instance, standard edge extraction algorithms can, in well illuminated images, detect the eyebrows, eyes, nose (nose wing and nostril) and mouth (upper lip and mouth opening). Yet the lower lip contour and the line of the chin are often not detectable without a priori assumptions about shape and location. Active contour models, such as snakes (Kass et al., 1987; Waite & Welsh, 1990) and deformable templates (Yuille et al., 1989), are one way to supply such assumptions. After the features are extracted, descriptors such as centroid, extreme points, shape, and angle are used to analyze and/or represent the expression (Choi et al., 1990; Terzopoulos & Waters, 1990a).
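To make the active contour idea concrete, here is a hedged, greedy-snake sketch: each control point moves to the neighboring pixel that minimizes a smoothness term plus an image term (negative gradient magnitude), so a contour initialized near the mouth settles onto nearby edges such as the lip outline. The weights, neighborhood, and initialization are assumptions, not a reconstruction of the cited systems:

    import numpy as np

    def greedy_snake(image, points, alpha=1.0, beta=2.0, iterations=50):
        """Greedy active contour ('snake') sketch.
        image:  2-D grayscale array.
        points: N x 2 array of (row, col) control points, assumed to be
                initialized near the target contour (e.g. around the mouth).
        alpha weights contour smoothness; beta weights edge attraction."""
        gy, gx = np.gradient(image.astype(float))
        edge = -np.hypot(gx, gy)                # low values on strong edges
        pts = points.astype(int).copy()
        n = len(pts)
        for _ in range(iterations):
            for i in range(n):
                prev_pt, next_pt = pts[i - 1], pts[(i + 1) % n]
                best, best_cost = pts[i].copy(), np.inf
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        r, c = pts[i, 0] + dr, pts[i, 1] + dc
                        if not (0 <= r < image.shape[0] and 0 <= c < image.shape[1]):
                            continue
                        bend = np.sum((prev_pt - 2 * np.array([r, c]) + next_pt) ** 2)
                        cost = alpha * bend + beta * edge[r, c]
                        if cost < best_cost:
                            best, best_cost = np.array([r, c]), cost
                pts[i] = best
        return pts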
In contrast, the optical flow approach (Mase & Pentland, 1990a, 1991; Mase, 1991) to describing face motion has the advantage of not requiring a feature detection stage of processing (see pages 24-25).
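A hedged sketch of one standard way to compute such a flow field: a block-wise least-squares (Lucas-Kanade style) estimate between two grayscale frames. It illustrates the kind of dense motion measurement referred to above; it is not a reconstruction of the cited work, and the block size is an assumption:

    import numpy as np

    def block_flow(prev, curr, block=16):
        """Least-squares (Lucas-Kanade style) optical flow on non-overlapping
        blocks of two grayscale frames.  Returns per-block (dy, dx) estimates."""
        prev = prev.astype(float)
        curr = curr.astype(float)
        Iy, Ix = np.gradient(prev)           # spatial gradients
        It = curr - prev                     # temporal difference
        H, W = prev.shape
        flow = np.zeros((H // block, W // block, 2))
        for bi in range(H // block):
            for bj in range(W // block):
                sl = (slice(bi * block, (bi + 1) * block),
                      slice(bj * block, (bj + 1) * block))
                ix, iy, it = Ix[sl].ravel(), Iy[sl].ravel(), It[sl].ravel()
                A = np.stack([ix, iy], axis=1)
                ATA = A.T @ A
                if np.linalg.matrix_rank(ATA) == 2:          # enough texture
                    dx, dy = np.linalg.solve(ATA, -A.T @ it)
                    flow[bi, bj] = (dy, dx)
        return flow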
Robustness to temporal noise. All of the above parameters except optical flow are theoretically computable from static imagery. Such extraction has, however, proven to be sensitive to noise and illumination. Optical flow information is also affected by these problems; however, spatial averaging over facial action groups seems to offer some hope of robust estimation.
Skin deformations. An important but often unavoidable source of noise during expressions is the appearance of wrinkles and dimples. They are confusing for feature extraction techniques, and violate the constant-patch assumption of the optical flow computation. It is necessary for all algorithms to be able to deal with these "irrelevant" features by, for instance, separating these features in terms of their temporal stability.
Head motion. Before one can interpret the detailed motion of the face, it is first necessary to very accurately track the head. This subject is discussed in more detail in previous sections (see pages 21-22 and 39-41).
Temporal segmentation. Temporal segmentation is necessary for a system to pull out each separate expression from within a long sequence of facial actions; this is particularly true if we are to understand dialogs between people. In lip reading, zeros of the velocity of the facial motion parameters were found to be useful for the temporal segmentation (Mase & Pentland, 1990b). This finding may be useful in attempting to segment more complex facial actions.
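One illustrative way to look for such segmentation points, under the assumption that a scalar motion parameter (e.g. lip-opening velocity) has already been tracked over frames: lightly smooth the trajectory, differentiate it, and mark frames where the velocity changes sign or falls near zero. The smoothing width and threshold are placeholders:

    import numpy as np

    def segment_boundaries(param, smooth=5, eps=1e-3):
        """Candidate temporal segmentation points of a 1-D facial motion
        parameter: frames where its (smoothed) velocity is approximately zero.
        smooth and eps are illustrative choices."""
        kernel = np.ones(smooth) / smooth
        p = np.convolve(param, kernel, mode="same")    # light temporal smoothing
        v = np.gradient(p)                             # frame-to-frame velocity
        sign_change = np.where(np.diff(np.sign(v)) != 0)[0]
        near_zero = np.where(np.abs(v) < eps)[0]
        return np.union1d(sign_change, near_zero)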
Co-articulation. It is well known in speech recognition that adjacent sounds are often co-articulated; that is, the temporal context of a phoneme changes its sound. Facial expression seems to be similar. Analysis of the mechanisms of co-articulation and compensating for them in the recognition stage is a major challenge (Garcia et al., 1992).
Higher level feature extraction. The basic information, such as shape deformation, position change, and optical flow, may be integrated spatially to obtain higher level descriptions, such as muscle actions and Action Unit (AU) descriptions. The use of anatomical knowledge is necessary in this task; however, careful statistical analysis has also proven to be useful (Mase & Pentland, 1990b; Mase, 1991).
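A minimal illustration of such spatial integration: average a flow field over a few facial regions (brows, mouth) to obtain a small feature vector that can then be related to muscle or Action Unit activity. The region placements below are placeholders, not anatomically validated; a real system would use anatomical knowledge as noted above:

    import numpy as np

    # Assumed regions of interest in a normalized face image, given as
    # fractions of (rows, cols): top, bottom, left, right.  Illustrative only.
    REGIONS = {
        "left_brow":  (0.15, 0.30, 0.10, 0.45),
        "right_brow": (0.15, 0.30, 0.55, 0.90),
        "mouth":      (0.65, 0.90, 0.25, 0.75),
    }

    def region_motion_features(flow):
        """flow: H x W x 2 array of (dy, dx) estimates over the face region.
        Returns mean vertical and horizontal motion for each region."""
        H, W = flow.shape[:2]
        feats = []
        for top, bottom, left, right in REGIONS.values():
            patch = flow[int(top * H):int(bottom * H), int(left * W):int(right * W)]
            feats.extend(patch.reshape(-1, 2).mean(axis=0))
        return np.array(feats)     # 3 regions x 2 components = 6 numbers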
Feature extraction and the computation of facial changes are likely to be the basis for accurate expression description. Description of facial expression based on well-defined and well-segmented information will lead to reliable recognition of expressions.
Computerized lipreading: Background
A main objective in trying to elicit spoken language from optical observations of the oral cavity and facial articulatory movements is to supplement the acoustic perception of speech, which is particularly important in noisy environments. Also, understanding the relations between speech and observable articulation is useful in teaching the art of lipreading to the hearing-impaired and in animating synthesized speaking faces.
Automating the recognition of speech from video signals - also called Optical Automatic Speech Recognition or OASR - is expected to significantly enhance the robustness of acoustic automatic speech recognition, because the visual signal complements the often ambiguous phonetic signal. In the case of human lipreading, the experimental gains are on the order of a 10 to 12 dB improvement in the signal-to-noise ratio, according to Brooke (1989). Brooke also suggests that non-articulatory measurements of facial motions (head, eyebrows, etc.) may further augment acoustic recognition in non-ideal circumstances by providing nonverbal cues.
The earliest reported attempt to mechanically automate lipreading is patent number 3192321, issued in 1965 to Ernie Nassimbene of IBM, for a device consisting of an array of photocells that captured light reflected from the oral cavity region. Subsequent research by Petajan, Brooke, Nishida, Pentland, Mase, Smith, Yuhas and others is described in the Tutorial on Neural Networks and Eigenfaces on pages 25 to 26.
Mouth region features for lipreading
The work of Garcia and Goldschen (Garcia et al., 1992) in analyzing the features that are most important for continuous speech recognition using TIMIT sentences is described on page 25. Previous work by Montgomery and Jackson (1983) had sought to identify important features for lipreading vowels in cases where /h/ precedes each vowel and /g/ follows it. The features examined were the height, width, area, and spreading (width/height) of the oral cavity, video duration (number of frames for vowel articulation), and audio duration (articulation time measured with an oscilloscope). These features were taken from a single image frame chosen by experienced lip-readers to characterize the vowel. They concluded that the spreading and the area surrounded by the lips are important features for vowel recognition. More significantly, they found no absolute, fixed, and observable lip and tongue positions corresponding to specific vowels across different speakers. Lip-readers adjust to different speakers through spatial normalization; that is, constant relative differences among some oral cavity features in the visual perception of the vowel utterances exist for all speakers.
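For reference, the geometric features above are straightforward to compute once the oral cavity has been segmented. The sketch below assumes a binary mask of the open mouth is already available (how the mask is obtained is not addressed here):

    import numpy as np

    def oral_cavity_features(mask):
        """mask: boolean H x W array marking the open oral cavity (obtaining
        the mask is assumed).  Returns the static features discussed above:
        height, width, area, and spreading (width/height)."""
        rows, cols = np.nonzero(mask)
        if rows.size == 0:
            return dict(height=0, width=0, area=0, spreading=0.0)
        height = int(rows.max() - rows.min() + 1)
        width = int(cols.max() - cols.min() + 1)
        return dict(height=height, width=width, area=int(mask.sum()),
                    spreading=width / height)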
Kathleen Finn (1986) investigated appropriate oral cavity features for possible automatic optical recognition of consonant phonemes. She examined the recognition of consonants preceded and followed by the vowel /a/. Her analysis considered only the middle optical image frame of each consonant utterance, thereby taking into account only central static features for viseme discrimination. Finn determined that the five most important static features for consonant discrimination are the height and width of the oral cavity opening, the vertical spreading of the upper and lower lips (each taken separately), and the "cornering" of the lips.
Research issues in OASR
One of the most important issues in OASR is how to automate the recognition process in real time. The use of powerful workstations makes possible real time acoustic recognition. As continuous progress on real-time computer vision systems is anticipated, we expect supplementary optical speech recognition also to make comparable progress in real-time computational environments with large vocabularies. A few commercial systems for acoustic speech recognition are available at the present time, and other academic research systems have even shown speaker-independent performance (Lee, 1989). One of the challenges involving human/computer interfacing is how to improve the robustness of recognition using a multi-modal approach. The approach taken by Garcia and Goldschen is to use a Hidden Markov Model (HMM) that parallels the acoustic model, which can therefore be augmented with the additional features obtained from a video camera.
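For orientation, the core decoding step of any such HMM is the Viterbi algorithm, sketched below for discrete observations. The states, observation alphabet, and probabilities are placeholders; in a combined acoustic-optical system the observations would be (quantized) feature vectors drawn from both channels, and nothing here should be read as Garcia and Goldschen's actual model:

    import numpy as np

    def viterbi(obs, start_p, trans_p, emit_p):
        """Most likely hidden-state sequence for a discrete HMM.
        obs:     sequence of observation indices (e.g. quantized acoustic and
                 visual feature vectors; the quantization is assumed)
        start_p: initial state probabilities, shape (S,)
        trans_p: state transition probabilities, shape (S, S)
        emit_p:  emission probabilities, shape (S, V)"""
        S, T = len(start_p), len(obs)
        logp = np.full((T, S), -np.inf)
        back = np.zeros((T, S), dtype=int)
        logp[0] = np.log(start_p) + np.log(emit_p[:, obs[0]])
        for t in range(1, T):
            for s in range(S):
                scores = logp[t - 1] + np.log(trans_p[:, s])
                back[t, s] = int(np.argmax(scores))
                logp[t, s] = scores[back[t, s]] + np.log(emit_p[s, obs[t]])
        path = [int(np.argmax(logp[-1]))]
        for t in range(T - 1, 0, -1):
            path.append(back[t, path[-1]])
        return path[::-1]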
An alternative approach is to use time-dependent neural networks that would allow complex utterances to be recognized beyond the phonetic boundaries of a single phone, thereby capturing co-articulation information. Given the difficulty of training and recognizing a large number of different context-dependent phones, it seems that combining neural-network-based optical and acoustic recognition for large vocabularies or continuous speech must await further research developments. For small vocabularies, Stork, Wolff and Levine (1992) showed that recognition combining optical and acoustic signals is more robust than a purely acoustic recognizer.
The problems of phonetic context dependency, which plague the acoustic recognition of speech, also appear in the optical channel. The mapping between phonemes and visemes has remained open to debate, largely because context dependency obscures the relation between the actual phones and the observed sequence of optical images. Solving the co-articulation problem is a likely prerequisite for continuous optical recognition, as contrasted with isolated word recognition.
Another, more fundamental, issue is the extent to which optical features can be considered speaker independent, and whether the training techniques for speaker-independent acoustic speech recognition are also applicable to optical speech recognition.
The inverse problem to analyzing facial movements in the mouth area for speech recognition is synthesizing the facial motions that accompany given spoken utterances. This problem has obvious commercial implications for computer animation of "talking heads."
A final automated expression recognition system must translate the automatically extracted features into a description of facial expression. It is usually expected that this automatic description should be identical to, or at least very close to, a human's description of a facial expression. In general, however, the requirements of expression recognition will depend on the application; for example, isolating action units requires much finer resolution than a simple classification such as sad versus happy.
Our present experience with expression recognition is still limited. Usually the work is restricted to frontal images of faces under good illumination. Optical flow has been used to extract dynamical muscle actions. These actions formed a fifteen-dimensional feature vector, which was categorized into four expressions using a nearest neighbor technique (Mase, 1991). The eigenface approach has also been used to successfully classify expressions for a single person (Turk & Pentland, 1991). More recently, a variation on the eigenface approach has successfully classified the six basic emotional states across a database of eleven people (Pentland et al., 1992).
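For concreteness, the nearest-neighbor rule mentioned above amounts to the following sketch; the feature dimensionality is taken from the description in the text, but this is an illustration of the classification rule, not a reconstruction of the cited system:

    import numpy as np

    def nearest_neighbor_label(x, train_X, train_y):
        """Classify a feature vector x (e.g. a fifteen-dimensional vector of
        muscle-action estimates) by the label of its closest training example.
        train_X: N x D array of training feature vectors; train_y: N labels."""
        dists = np.linalg.norm(train_X - x, axis=1)
        return train_y[int(np.argmin(dists))]

    # Hypothetical usage: train_y might hold four expression labels such as
    # happiness, anger, surprise, and disgust.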
More work has been done on face recognition, including gender classification. Many face recognition systems skip the feature extraction step completely and solve the recognition problem by template matching of the test image against all target images; in other words, each face is its own feature. To reduce the amount of computation, a smaller set of images is often used as templates. The features are then the coefficients of the linear representation of the test image in terms of these templates.
Other, nonlinear approximation techniques, such as hyper-basis functions, may help to give a better understanding of the importance and role of certain features for expression recognition. Using this method, it was found that the position of the eyebrow is more important than the position of the mouth for gender classification (Brunelli & Poggio, 1991). However, Golomb et al. (1991) and Gray et al. (1993) have found that the shading around the philtrum and mouth area provides significant information about gender.
Key research challenges
The main problem in this area is generalizing from given examples of expressions to ones the recognition system has never seen before. It is important that the given examples be described not only as static images of faces but also as dynamical sequences of images, since in many cases the expression of a face is determined by the temporal changes in the face as much as by the final shape. We expect to obtain this precise description from psychology and physiology.
Generalization, or learning from examples, is equivalent to function approximation in higher dimensional spaces (Poggio, 1990). We want to find the functional dependence between the input space (the space of extracted features) and the output space (the space of different expressions). The success of such learning depends on the quality of the extracted features. If, for example, the features are already viewpoint or illumination independent, the number of necessary examples will decrease.
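A small radial-basis-function sketch of this function-approximation view, in the spirit of the hyper-basis-function work cited earlier: fit a weighted sum of Gaussian basis functions centered on the training examples, mapping feature vectors to expression scores. The kernel width and regularization value are assumptions:

    import numpy as np

    def rbf_fit(X, Y, sigma=1.0, reg=1e-6):
        """Fit f(x) = sum_i w_i exp(-||x - x_i||^2 / (2 sigma^2)).
        X: N x D feature vectors; Y: N x K targets (e.g. one-hot expression
        labels).  sigma and reg are illustrative choices."""
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        G = np.exp(-d2 / (2 * sigma ** 2))
        return np.linalg.solve(G + reg * np.eye(len(X)), Y)

    def rbf_predict(x, X, W, sigma=1.0):
        g = np.exp(-((X - x) ** 2).sum(-1) / (2 * sigma ** 2))
        return g @ W      # K expression scores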
Unsupervised learning is also possible. In this case, one would try to find characteristic differences in the extracted feature sets and use these differences afterwards for classification. Most techniques in this area fit into the projection pursuit framework (Friedman & Stuetzle, 1981). The development of these classes is not guided by human understanding of expression, so the success of this method -- that is, whether the discovered classes correlate with our understanding of different expressions -- will again depend on the type of extracted features.
NOTE: This section was prepared by A. Yuille and A. Pentland, from contributions of their own and P. Burt, G. Cottrell, O. Garcia, H. Lee, K. Mase, and T. Vetter, and edited by T. S. Huang.