
'Distracters' and 'Supporters' Keep Viterbi Video Analyzer on Target

Institute for Robotics and Intelligent Systems' Context Tracker system outperforms rivals in tests
By: Eric Mankin
November 11, 2011 —

Take a video of a crowd scene, numerous unfamiliar faces moving around a screen, or a city street filled with multiple vehicles and pedestrians. The task is to capture and then follow one face or one car or one other defined object occupying only a small surface area of the screen through time, through disappearances and reappearances, and through shrinkages and expansions.

Context Tracker: To follow its target accurately, the system keeps a looser, less computationally demanding track of false-positive "distracter" areas (light blue squares), while creating a "supporter" definition of the real target area (green square) tied to points around it (dark blue lines).
A computer program called ‘Context Tracker,’ developed by researchers at the USC Viterbi School of Engineering, can now accomplish this task more quickly and economically than existing tools, using off-the-shelf computing and visual hardware. It adapts to problems like a handheld camera abruptly changing angle, or turning on and off, creating gaps in the video record.

Applications include surveillance for security and even housekeeping. In tests, the researchers say, the new system outperforms existing alternatives.

The trick is to identify things that look somewhat like the real target and track them loosely, while maintaining a separate file to identify the real thing. By establishing a limited universe of false positives, the computer can economically run checks that quickly eliminate similar-looking areas while reliably identifying the real target. It uses a two-step process: first scanning the whole scene to eliminate obvious non-matches, and then zeroing in using two categories of image description: ‘distracters’ and ‘supporters.’

The Context Tracker system continually looks for possible false positives called "distracters" (light blue boxes), and checks them against the much more closely imaged "supporter" (green box), which is kept connected to contextual clues (blue lines).

According to recent Ph.D. graduate Thang Dinh, the lead author of a study presented at the 2011 Computer Vision and Pattern Recognition Conference, and his faculty advisor, Professor Gérard Medioni, Context Tracker quickly analyzes the scene it is focused on after an operator selects a target in that scene, such as a face.

The program then looks for "distracters." These are areas in the image that look enough like the target to create potential mistakes. For example, if one of five faces in a scene is chosen as the target, the other four faces will be tagged and identified as distracters.

The next step is to quickly analyze the target and find the "supporters," local key-points around the target that consistently move along with it -- the body, for example, if the target is a face. These must have time-consistent, testable differences from the distracters -- that is, ways to accurately distinguish the real thing from any and all of the distracters.

Substantially more data is used to create the supporter definitions, but the whole system remains more compact because so much of the image — non-distracters, non-supporters — does not have to be analyzed.
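The two-step logic described above can be sketched in toy form. This is not the authors' implementation: it stands in for real image features with simple scalar "position" and precomputed "appearance similarity" values, and every name and threshold here is an illustrative assumption.

```python
def track_frame(candidates, keypoints, supporter_offsets,
                appearance_thresh=0.8, supporter_thresh=2.0):
    """Pick the candidate whose surrounding key-points best fit the
    supporter geometry; similar-looking candidates that fail the
    supporter check are kept as 'distracters' for cheap re-checking.

    candidates: dicts with a "center" position and a "similarity" score
        (in a real tracker, similarity would come from comparing the
        candidate patch against the target's appearance model).
    keypoints: positions of the supporter key-points in this frame.
    supporter_offsets: where each key-point should sit relative to the
        true target's center.
    """
    distracters, best, best_err = [], None, float("inf")
    for cand in candidates:
        # Step 1: cheap whole-scene appearance scan -- discard obvious
        # non-matches without any further work.
        if cand["similarity"] < appearance_thresh:
            continue
        # Step 2: supporter check -- do the key-points sit where the
        # supporters say they should, relative to this candidate?
        err = sum(
            abs((kp - cand["center"]) - off)
            for kp, off in zip(keypoints, supporter_offsets)
        ) / len(supporter_offsets)
        if err < supporter_thresh and err < best_err:
            best, best_err = cand, err
        else:
            distracters.append(cand)  # looks like the target, isn't it
    return best, distracters
```

Note that distracters are kept rather than thrown away: later frames can cheaply re-check this short list instead of rescanning the whole image, which is the economy the article describes.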

More information about the Context Tracker system, including video demonstrations, is available online.
The USC group is not the only one working on this problem. Dinh and the team tested Context Tracker against a number of competing systems, including those developed by Technion - Israel Institute of Technology; UC San Diego Computer Vision Laboratory; University of Surrey Centre for Vision, Speech and Signal Processing; Seoul National University Computer Vision Lab; and a collaboration between Nanjing University Media Interaction, HP Labs, and UC Santa Cruz.

All the systems tried to follow targets in videos with names like “Motocross,” “Car Chase,” “ETH Pedestrian,” “Multifaces Girl,” “Liquor,” “Scale,” and “Vehicle.”

The numerical comparisons were based on how closely each program followed the target -- that is, how far, in pixels, the system's estimate of the target center strayed from the actual target center -- and on whether the system was able to avoid losing the target during the trials. Context Tracker performed significantly better than the other systems analyzed.
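The metric just described can be sketched as follows: per-frame center-location error in pixels, plus a count of frames where the target was effectively lost. The 20-pixel "lost" threshold is an assumption for illustration, not a figure from the study.

```python
import math

def center_errors(predicted, ground_truth):
    """Per-frame Euclidean pixel distance between the tracker's
    predicted target center and the ground-truth center."""
    return [math.dist(p, g) for p, g in zip(predicted, ground_truth)]

def frames_lost(errors, lost_thresh=20.0):
    """Count frames where the tracker drifted past a distance
    threshold, i.e. effectively lost the target."""
    return sum(e > lost_thresh for e in errors)
```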

Medioni is the director of the USC Institute of Robotics and Intelligent Systems (IRIS) lab, where the research was done. Dinh is now a Viterbi School graduate student completing his Ph.D. Co-author Nam Vo is from the Faculty of Information Technology, University of Science, in Ho Chi Minh City, Vietnam.

Guys in the hood: by loosely tracking false positives (yellow), the real thing (green) can be much more easily identified even if it leaves the image and then returns.