
Spotting Audio-Visual Inconsistencies (SAVI)

SRI is finding new ways to detect altered and tampered video.

As powerful video editing software tools become more commonplace, the ability to tamper with videos has increased significantly. Rapidly developing consumer applications have made it possible for almost anyone to create synthesized speech and synthesized video of a person talking. Now, SRI researchers are working with the University of Amsterdam and Idiap Research Institute to develop new techniques for detecting videos that have been altered.

SRI’s Spotting Audio-Visual Inconsistencies (SAVI) techniques detect tampered videos by identifying discrepancies between the audio and visual tracks. For example, the system can detect when lip synchronization is slightly off or when there is an unexplained visual “jerk” in the video. It can also flag a video as possibly tampered if the visual scene is outdoors but analysis of the reverberation properties of the audio track indicates the recording was made in a small room.
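The scene/acoustics check can be illustrated with a minimal sketch. This is not SRI's implementation: the scene labels, the use of an RT60 reverberation estimate, and the thresholds are all hypothetical, chosen only to show the idea of cross-checking the visual environment against the acoustics.

```python
# Illustrative sketch (not SAVI's code): flag a clip when the visual
# scene label disagrees with the acoustic environment implied by the
# audio track's estimated reverberation time (RT60, in seconds).
# Labels and thresholds are hypothetical.

def acoustic_scene_consistent(scene_label: str, rt60_seconds: float) -> bool:
    """Return False when the visuals and the audio's reverberation
    suggest different recording environments."""
    # Outdoor recordings typically show very little reverberation,
    # while small rooms produce short but clearly measurable decay.
    if scene_label == "outdoor":
        return rt60_seconds < 0.15   # near-anechoic audio expected
    if scene_label == "small_room":
        return 0.2 <= rt60_seconds <= 0.8
    return True  # unknown scene: no acoustic evidence either way

# Room-like reverberation on an outdoor scene is suspicious:
print(acoustic_scene_consistent("outdoor", 0.45))  # False
print(acoustic_scene_consistent("outdoor", 0.05))  # True
```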

This video shows how the SAVI system detects speaker inconsistencies. First, the system detects the person’s face, tracks it throughout the video clip, and verifies that it is the same person for the entire clip. It then detects when she is likely to be speaking by tracking when her mouth is moving appropriately. The system also analyzes the audio track, segmenting it by speaker. As shown in the image below, the system detects two speakers: one represented by the dark blue horizontal line and one by the light blue line. Since there are two audible speakers but only one visible person, the system flags the segments associated with the second speaker as potentially tampered – represented by the horizontal red line.
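The core of that comparison can be sketched as a simple interval check: any audio speaker whose diarized segments never overlap visible mouth motion is flagged. This is a toy illustration under stated assumptions (the segment and interval formats are hypothetical), not SAVI's implementation.

```python
# Sketch, not the SAVI code: given audio speaker-diarization segments and
# the intervals where the on-screen person is visibly speaking, flag the
# audio segments whose speaker never coincides with visible mouth motion.

def flag_extra_speakers(diarization, visible_speaking):
    """diarization: list of (start, end, speaker_id) from the audio track.
    visible_speaking: list of (start, end) intervals with mouth motion.
    Returns the segments attributed to speakers never seen speaking."""
    def overlaps(seg, iv):
        return seg[0] < iv[1] and iv[0] < seg[1]

    # Speakers whose segments ever overlap visible mouth motion.
    seen = {spk for (s, e, spk) in diarization
            if any(overlaps((s, e), iv) for iv in visible_speaking)}
    return [(s, e, spk) for (s, e, spk) in diarization if spk not in seen]

# Two audible speakers but one visible person: speaker "B" is flagged.
segments = [(0.0, 4.0, "A"), (4.0, 6.5, "B"), (6.5, 10.0, "A")]
visible = [(0.5, 3.8), (6.8, 9.9)]
print(flag_extra_speakers(segments, visible))  # [(4.0, 6.5, 'B')]
```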


The system also detects lip-sync inconsistencies by comparing visual motion features with audio features. It computes the visual features by detecting the person’s face, using OpenPose to detect and track face landmarks, and computing a spatiotemporal characterization of mouth motion, as seen below.
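A minimal version of such a mouth-motion feature might look like the following. The article names OpenPose as the landmark source; the landmark grouping, the "openness" reduction, and the toy frame data here are assumptions for illustration only.

```python
# Sketch of a spatiotemporal mouth-motion feature. Per frame, the mouth
# landmarks are reduced to a single "openness" value (vertical lip gap);
# the frame-to-frame change in openness then characterizes mouth motion.
# Landmark layout and values are hypothetical, not OpenPose's format.

def mouth_openness(upper_lip, lower_lip):
    """Vertical gap between averaged upper- and lower-lip landmarks.
    Each argument is a list of (x, y) points for one frame."""
    upper_y = sum(y for _, y in upper_lip) / len(upper_lip)
    lower_y = sum(y for _, y in lower_lip) / len(lower_lip)
    return abs(lower_y - upper_y)

def motion_feature(openness_per_frame):
    """Frame-to-frame change in openness: large values mean the mouth
    is moving, near-zero values mean it is still."""
    return [abs(b - a) for a, b in zip(openness_per_frame, openness_per_frame[1:])]

# Toy 4-frame sequence: the mouth opens, then closes again.
frames = [
    ([(10, 50), (12, 50)], [(10, 52), (12, 52)]),  # nearly closed
    ([(10, 49), (12, 49)], [(10, 57), (12, 57)]),  # open
    ([(10, 50), (12, 50)], [(10, 53), (12, 53)]),  # closing
    ([(10, 50), (12, 50)], [(10, 52), (12, 52)]),  # closed
]
openness = [mouth_openness(u, l) for u, l in frames]
print(openness)                  # [2.0, 8.0, 3.0, 2.0]
print(motion_feature(openness))  # [6.0, 5.0, 1.0]
```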


The SAVI system combines these findings with Mel-frequency cepstral coefficient (MFCC) features of the audio track to classify 2-second video clips as having either good or bad lip synchronization, based on a large training set of audiovisual feature vectors. The system marks inconsistencies in red and consistencies in green along the horizontal line at the bottom of the image.
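The windowed scoring step can be sketched as follows. In place of SAVI's trained classifier over audiovisual feature vectors, this toy version scores each 2-second window with a simple audio/visual correlation threshold; the frame rate, threshold, and per-frame feature choice are all assumptions.

```python
# Sketch of the 2-second-clip classification step (not SAVI's trained
# model): split aligned per-frame audio and visual feature tracks into
# 2-second windows and label each one "good" or "bad" sync, using a toy
# correlation score instead of a learned classifier.

def correlation(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a) ** 0.5
    vb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (va * vb) if va and vb else 0.0

def classify_clips(audio_feat, mouth_motion, fps=25, clip_seconds=2, threshold=0.5):
    """Label each 2-second window 'good' (green) or 'bad' (red) sync."""
    win = fps * clip_seconds
    labels = []
    for i in range(0, len(audio_feat) - win + 1, win):
        r = correlation(audio_feat[i:i + win], mouth_motion[i:i + win])
        labels.append("good" if r >= threshold else "bad")
    return labels

# Synthetic tracks at 25 fps: the first 2 s are in sync (correlated),
# the second 2 s are anti-correlated, as in a dubbed segment.
audio = [i % 10 for i in range(100)]
mouth = [i % 10 for i in range(50)] + [9 - (i % 10) for i in range(50)]
print(classify_clips(audio, mouth))  # ['good', 'bad']
```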


This project is funded by DARPA (contract and funding through AFRL) under DARPA’s Media Forensics (MediFor) Program, Contract #FA8750-16-C-0170.