Video Indexing and Summarization

Very large databases of images and videos depend on efficient algorithms for fast browsing of, and access to, the information sought. In the case of videos, in particular, much of the visual data is simply redundant, and a way must be found to retain only the information strictly needed for functional browsing, querying and access. This calls for automatic video analysis tools that can support human operators, alleviating and ultimately doing away with the slow, labor-intensive tasks of manual annotation and analysis that must usually be performed in video archives.

System Overview

The Figure above shows the general framework of a video indexing application for video retrieval, highlighting the central role of video analysis. Automatic video analysis, as can be seen, is a complex process that involves solving several problems:

  • Representation of the video contents (feature extraction);
  • Segmentation of the video into elementary units (structural analysis);
  • Video summarization (video abstraction);
  • Post-processing of results (presentation of visual information).

The video analysis process can be divided into three main modules: shot boundary detection, key frame extraction, and summary post-processing. Since no module can be assumed to be fully reliable in performing its task, each module should cope with any problems introduced by the modules before it. A video retrieval application can take advantage of all the low-level data obtained by video analysis, using them to index the video contents. Low-level data can also be exploited in high-level processing to extract semantic information, enriching, for example, the information conveyed by the visual table of contents.

Shot Boundary Detection

The basic structural unit of a video (apart from the frames themselves) is the shot. A shot consists of an uninterrupted sequence of frames taken with a single camera in a single location. Above the shot level are the scenes. A scene is a continuous sequence that is temporally and spatially cohesive in the real world, but not necessarily cohesive in the video. Scenes describe a story line in the video and are formed by one or more shots, which are not necessarily contiguous. A shot boundary detection algorithm is based on recognizing the editing effects (such as cuts, fades and dissolves) that mark the boundaries of the shots. These effects must be characterized and suitable algorithms developed. Specialized detection algorithms, one for each effect that must be located, may be needed to maximize the performance of shot detection.
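
As an illustration of the simplest case, abrupt cuts, the sketch below compares gray-level histograms of consecutive frames and declares a cut when they differ sharply. It is a minimal single-feature example written for this page, not the multi-feature algorithm from the publications listed further down; the frame representation (a flat sequence of 8-bit gray levels), the number of bins, the threshold and all function names are assumptions. Gradual effects such as fades and dissolves would need their own detectors.

```python
from typing import List, Sequence

Frame = Sequence[int]  # assumption: a frame is a flat sequence of 8-bit gray levels


def gray_histogram(frame: Frame, bins: int = 16) -> List[float]:
    """Return a normalized histogram of 8-bit gray levels."""
    hist = [0.0] * bins
    for value in frame:
        hist[min(value * bins // 256, bins - 1)] += 1.0
    total = float(len(frame)) or 1.0  # avoid dividing by zero on empty frames
    return [count / total for count in hist]


def histogram_distance(h1: List[float], h2: List[float]) -> float:
    """L1 distance between two normalized histograms, scaled to [0, 1]."""
    return 0.5 * sum(abs(a - b) for a, b in zip(h1, h2))


def detect_cuts(frames: Sequence[Frame], threshold: float = 0.5) -> List[int]:
    """Return every index i where a cut is declared between frame i-1 and frame i."""
    cuts = []
    previous = None
    for index, frame in enumerate(frames):
        current = gray_histogram(frame)
        if previous is not None and histogram_distance(previous, current) > threshold:
            cuts.append(index)
        previous = current
    return cuts


def split_into_shots(n_frames: int, cuts: List[int]) -> List[range]:
    """Turn cut positions into shots expressed as ranges of frame indices."""
    starts = [0] + cuts
    ends = cuts + [n_frames]
    return [range(start, end) for start, end in zip(starts, ends) if start < end]
```

A fixed global threshold is the weakest part of this sketch: camera flashes or fast motion can exceed it without any editing effect being present, which is one reason to combine several features.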

Key Frame Extraction

Generally, a video summary is a sequence of still or moving images, with or without audio. These images must preserve the overall contents of the video with a minimum of data. Still images arranged chronologically form a pictorial summary that can be regarded as the equivalent of a video storyboard. Summarization using moving images (and at times a corresponding audio abstract) is called video skimming; the product is similar to a video trailer or clip. Both approaches must present a summary of the important events recorded in the video. Although video skimming conveys pictorial, motion and, where used, audio information, still images (called key frames) can summarize the video contents in a more rapid and compact way: users can grasp the overall contents more quickly from key frames than by watching a set of video sequences (even when brief). The extraction of key frames must be automatic and content-based, so that the key frames maintain the salient contents of the video while avoiding all redundancy.

The algorithm developed determines the complexity of the sequence in terms of changes in the visual contents expressed by different low-level frame descriptors. The algorithm is able to dynamically and rapidly select key frames within each shot (the number of key frames varies depending on the video complexity). The key frames are extracted as soon as a shot has been detected.
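
To illustrate how the number of key frames can adapt to the complexity of a shot, here is a minimal sketch. It accumulates the change between consecutive frame descriptors and emits a key frame whenever the accumulated change reaches a fixed amount. This is a simplified stand-in, not the selection rule of the algorithm described above; the descriptor format, the `step` parameter and the function names are assumptions.

```python
from typing import List, Sequence


def frame_difference(a: Sequence[float], b: Sequence[float]) -> float:
    """L1 distance between two frame descriptors of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b))


def select_key_frames(descriptors: Sequence[Sequence[float]], step: float = 1.0) -> List[int]:
    """Pick key-frame indices within one shot.

    A new key frame is emitted every time the change accumulated since the
    last key frame reaches `step`, so a static shot yields a single key
    frame and a visually complex shot yields several.
    """
    if not descriptors:
        return []
    keys = [0]  # always represent the shot with at least its first frame
    accumulated = 0.0
    for i in range(1, len(descriptors)):
        accumulated += frame_difference(descriptors[i - 1], descriptors[i])
        if accumulated >= step:
            keys.append(i)
            accumulated = 0.0
    return keys
```

Because the loop only looks at the previous frame, it can run as soon as a shot has been delimited, without a second pass over the video.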

Besides providing video browsing capability and content description, key frames act as video “bookmarks” that designate the interesting events captured, supplying direct access to video sub-sequences. Key frames, which visually represent the video contents, can also be used in the indexing process: the same indexing and retrieval strategies developed for image retrieval can be applied to retrieve video sequences. Low-level visual features can be used to index the key frames and thus the video sequences to which they belong.
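
The idea of key frames as bookmarks into the video can be sketched as a small index in which every entry stores a key frame's feature vector together with the sub-sequence it belongs to. A query by visual similarity then returns sub-sequences rather than isolated images. The record layout, the L1 distance and all names below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class KeyFrameEntry:
    """One indexed key frame plus the sub-sequence it bookmarks (hypothetical layout)."""
    video_id: str
    frame_index: int
    shot_start: int
    shot_end: int
    features: Tuple[float, ...]


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """L1 distance between two feature vectors of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b))


def search(index: List[KeyFrameEntry], query: Sequence[float], top_k: int = 3) -> List[KeyFrameEntry]:
    """Return the top_k entries whose features are closest to the query."""
    return sorted(index, key=lambda entry: l1_distance(entry.features, query))[:top_k]
```

Each hit carries `shot_start` and `shot_end`, so a player can jump directly to the matching sub-sequence.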

Summary Post-Processing

The Figure below shows the three steps of the post-processing algorithm, which has been designed to present users with easily accessible visual summaries that are exhaustive, but not redundant. The first step removes meaningless key frames. It uses a supervised classification strategy, performed by a neural network, on the basis of pictorial features derived directly from the frames together with others derived from processing the frames with a visual attention model. The second step groups similar key frames into clusters to provide a multi-level summary presentation and remove redundant key frames. To perform the task, it uses both low-level and high-level features. The third step identifies the default summary level that is shown to the users: starting from this set of key frames, the users can then browse the video content at different levels of detail.
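
The second step, grouping key frames to obtain summaries at several levels of detail, can be sketched with a plain agglomerative clustering. The text does not say which clustering method is used, so the centroid-linkage merging, the Euclidean distance, the choice of representative and all names here are assumptions made for illustration. Each merge produces a coarser level, and each level keeps one representative key frame per cluster.

```python
from typing import List, Sequence

Vector = Sequence[float]


def distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def centroid(members: List[int], features: List[Vector]) -> List[float]:
    """Mean feature vector of the key frames in one cluster."""
    columns = zip(*(features[i] for i in members))
    return [sum(column) / len(members) for column in columns]


def representative(members: List[int], features: List[Vector]) -> int:
    """Key frame of the cluster that lies closest to the cluster centroid."""
    center = centroid(members, features)
    return min(members, key=lambda i: distance(features[i], center))


def build_summary_levels(features: List[Vector]) -> List[List[int]]:
    """Return summaries from finest (every key frame) to coarsest (one key frame).

    Each summary is a sorted list of key-frame indices, one per cluster.
    """
    clusters = [[i] for i in range(len(features))]
    levels = []
    while clusters:
        levels.append(sorted(representative(c, features) for c in clusters))
        if len(clusters) == 1:
            break
        # Merge the two clusters whose centroids are closest.
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(
            pairs,
            key=lambda p: distance(
                centroid(clusters[p[0]], features), centroid(clusters[p[1]], features)
            ),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return levels
```

A default summary level, the third step, would then be one entry of the returned list chosen by some criterion, for example the first level whose closest clusters are farther apart than a tolerance; that criterion is left out of this sketch.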

The post-processing algorithm does not use previous knowledge about the video contents, nor does it make any assumption about the input data. Because the approaches used throughout the post-processing pipeline are generic, it can easily be specialized to support domain-specific applications by taking into account the appropriate pictorial and semantic properties. For example, the key frame removal stage could be extended with more pictorial quality features (both low level and high level) in order to better cover the many factors that can cause a user to reject a frame (e.g. wrong skin tone, half faces, etc.).

The Video Tool



A robust multi-feature cut detection algorithm for video segmentation
(Gianluigi Ciocca) In Electronic Letters on Computer Vision and Image Analysis, volume 9, number 1, CVC Press, 2010. ISSN 1577-5097. PDF: /download/ciocca2010robust-multi-feature.pdf

Hierarchical Browsing of Video Key Frames
(Gianluigi Ciocca, Raimondo Schettini) In Advances in Information Retrieval, volume 4425 of Lecture Notes in Computer Science, pp. 691-694, Springer Berlin / Heidelberg, 2007. ISBN 978-3-540-71494-1.

An innovative algorithm for key frame extraction in video summarization
(Gianluigi Ciocca, Raimondo Schettini) In Journal of Real-Time Image Processing, volume 1, pp. 69-88, Springer Berlin / Heidelberg, 2006. DOI 10.1007/s11554-006-0001-1. ISSN 1861-8200. PDF: /download/ciocca2006innovative-algorithm.pdf

Dynamic key-frame extraction for video summarization
(Gianluigi Ciocca, Raimondo Schettini) In Internet Imaging VI, volume 5670, pp. 137-142, SPIE, 2005. DOI 10.1117/12.586777.

Video summarization using a neurodynamical model of visual attention
(Silvia Corchs, Gianluigi Ciocca, Raimondo Schettini) In Multimedia Signal Processing, 2004 IEEE 6th Workshop on, pp. 71-74, IEEE, 2004. DOI 10.1109/MMSP.2004.1436419.