MediaMixer Winter School and MultiMedia Modelling 2013

The COCo project (Olivier Aubert) attended the 20th MultiMedia Modelling conference, which took place in Dublin at the beginning of January. Of special interest was the first MediaMixer Winter School, which featured a number of talks and discussions on media annotation, media analysis and media rights management. Here are some highlights of the winter school.


Lyndon Nixon, the MediaMixer project coordinator, first gave an overview of the activities of the MediaMixer consortium. The objective of MediaMixer is to set up and sustain a community of video producers, hosters and redistributors around the use of semantic multimedia and media fragment technologies. It aims to build bridges between the research community and industry. Concretely, this involves funding for collaborative research projects such as AXES, ForgetIT or HBBNext.

Lyndon highlighted some issues around media asset re-use, stressing that opening up to user communities is most often a good move. User-Generated Content (UGC) offers great potential when it is built around shared interests, as in NewZulu (a citizen journalism platform) or EyeEm (which offers users a way to monetize their own photos). Media remixing can also open up opportunities, as illustrated by the viral Harlem Shake videos. YouTube's founders recently launched MixBit, a platform that encourages users to remix and redistribute videos.

In summary, online digital media is growing enormously. Professional content owners are looking for new revenue opportunities (through online redistribution and reselling), while non-professional content creators are becoming able to participate in online media value chains.

This growth highlights some limitations of current media technologies, the first being the semantic gap. Effective retrieval of multimedia assets requires appropriate metadata, and users need assistance to formulate good queries, through controlled vocabularies and term normalisation, query suggestion, or drill-down search through search learning. Named Entity Recognition (NER) extracts distinct entities from natural language text, enabling disambiguation and classification and offering a path towards globally unique identification (linked data).
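To make the link between NER and linked data concrete, here is a deliberately naive, dictionary-based sketch: it spots known surface forms in a text and maps each one to a Linked Data URI. The gazetteer and its DBpedia URIs are hand-picked assumptions for this illustration; real NER systems rely on statistical models and context-based disambiguation rather than a fixed lookup table.

```python
import re

# Hypothetical gazetteer: surface form -> Linked Data (DBpedia) URI.
# A real system would learn entity mentions and disambiguate them in context.
GAZETTEER = {
    "Dublin": "http://dbpedia.org/resource/Dublin",
    "BBC": "http://dbpedia.org/resource/BBC",
    "W3C": "http://dbpedia.org/resource/World_Wide_Web_Consortium",
}


def recognise_entities(text: str) -> list[tuple[str, int, str]]:
    """Return (surface form, character offset, URI) for each known entity, in text order."""
    found = []
    for surface, uri in GAZETTEER.items():
        # \b word boundaries avoid matching inside longer words ("Dubliners").
        for match in re.finditer(rf"\b{re.escape(surface)}\b", text):
            found.append((surface, match.start(), uri))
    return sorted(found, key=lambda entity: entity[1])


if __name__ == "__main__":
    sentence = "The BBC provided videos for a workshop held in Dublin."
    for surface, offset, uri in recognise_entities(sentence):
        print(f"{offset:3d} {surface:7s} -> {uri}")
```

Once a mention is tied to a URI rather than a string, any other resource using the same URI refers to the same entity, which is what makes the identification global.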

Media fragmentation and annotation

Vasileios Mezaris talked about the automatic media fragmentation and annotation technologies that MediaMixer promotes. Some methods already give quite good results, such as a shot detection technique that works in the uncompressed domain and achieves high overall accuracy (90%) in near real time. Challenges remain in detecting gradual transitions and handling intense motion.
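As an illustration of shot detection in the uncompressed domain, here is a classic baseline (not the technique presented in the talk): compare the colour histograms of consecutive decoded frames and declare a hard cut when their difference exceeds a threshold. Frames are assumed to be flat lists of grey-level pixel values, and the bin count and threshold are arbitrary choices for this sketch.

```python
def histogram(frame: list[int], bins: int = 8) -> list[float]:
    """Normalised grey-level histogram of a frame given as a flat list of 0-255 pixel values."""
    counts = [0] * bins
    for pixel in frame:
        counts[pixel * bins // 256] += 1
    return [count / len(frame) for count in counts]


def detect_cuts(frames: list[list[int]], threshold: float = 0.5, bins: int = 8) -> list[int]:
    """Return every index i such that a hard cut is detected between frames i-1 and i."""
    if not frames:
        return []
    cuts = []
    previous = histogram(frames[0], bins)
    for i in range(1, len(frames)):
        current = histogram(frames[i], bins)
        # L1 distance between two normalised histograms lies in [0, 2].
        distance = sum(abs(a - b) for a, b in zip(previous, current))
        if distance > threshold:
            cuts.append(i)
        previous = current
    return cuts


if __name__ == "__main__":
    dark, bright = [20] * 100, [230] * 100
    print(detect_cuts([dark] * 3 + [bright] * 3))  # → [3]
```

This also shows why gradual transitions are hard: a dissolve spreads the change over many frames, so each frame-to-frame difference can stay below the threshold, while intense motion can push the difference above it without any actual cut.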

Scene detection is an important prerequisite for summarization, indexing and video browsing. A scene is a high-level temporal video segment that is elementary in terms of semantic content, covering either a single event or multiple related events taking place in parallel. Different approaches exist (uni-modal vs multi-modal, domain-specific vs domain-independent), and their precision depends on the nature of the video (documentaries vs fiction movies, for instance). Scene detection is less accurate than shot segmentation, but good enough to improve access to meaningful fragments in various applications (retrieval, video hyperlinking).
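A simple uni-modal way to group shots into scenes is to link each shot to any visually similar shot among the few preceding ones, and to keep all shots spanned by such a link in the same scene; this tolerates the shot/reverse-shot alternation of a dialogue. The sketch below is a simplified take on this link-based grouping, assuming each shot is summarised by one feature vector (for instance a keyframe colour histogram); the similarity threshold and window size are hypothetical.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_shots_into_scenes(shot_features, threshold=0.8, window=3):
    """Group consecutive shots into scenes; returns a list of lists of shot indices."""
    n = len(shot_features)
    # linked[k] is True when shots k-1 and k must stay in the same scene.
    linked = [False] * n
    for i in range(n):
        for j in range(max(0, i - window), i):
            if cosine(shot_features[i], shot_features[j]) >= threshold:
                # A link from shot j to shot i keeps every shot in between together.
                for k in range(j + 1, i + 1):
                    linked[k] = True
    scenes = []
    for i in range(n):
        if i > 0 and linked[i]:
            scenes[-1].append(i)
        else:
            scenes.append([i])
    return scenes


if __name__ == "__main__":
    a, b, c = [1, 0, 0], [0, 1, 0], [0, 0, 1]
    # A dialogue (a/b alternating) followed by a different setting (c).
    print(group_shots_into_scenes([a, b, a, b, c, c]))  # → [[0, 1, 2, 3], [4, 5]]
```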

Visual concept and event detection has progressed a lot, but its results remain far from perfect, albeit already useful in some applications (retrieval, further fragment analysis).

Object re-detection is a particular case of image matching, where the system has to find instances of a specific object within a single video or a collection of videos. Current approaches achieve good results (99% precision with around 90% recall) with very reasonable processing times (10 times faster than real time), which makes interactive applications conceivable: for instance, finding and linking related videos or fragments of them, or supporting other analysis tasks such as scene detection.
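Image matching of this kind usually relies on local descriptors (such as SIFT or SURF) extracted from the object and from each frame. The sketch below assumes the descriptors are already extracted and given as plain vectors, and uses the well-known nearest-neighbour ratio test to keep only unambiguous matches; the ratio and minimum match count are hypothetical, and a real system would add a geometric verification step.

```python
import math


def ratio_test_matches(query, candidates, ratio=0.8):
    """Count query descriptors whose nearest candidate is clearly closer than the second nearest."""
    if len(candidates) < 2:
        return 0
    good = 0
    for q in query:
        distances = sorted(math.dist(q, c) for c in candidates)
        if distances[0] < ratio * distances[1]:
            good += 1
    return good


def redetect(object_descriptors, frames, min_matches=3, ratio=0.8):
    """Return indices of frames (each a list of descriptors) that likely contain the object."""
    return [
        index
        for index, frame_descriptors in enumerate(frames)
        if ratio_test_matches(object_descriptors, frame_descriptors, ratio) >= min_matches
    ]


if __name__ == "__main__":
    obj = [(0, 0), (10, 0), (0, 10)]
    frame_with = [(0.1, 0), (10, 0.1), (0, 10.1), (50, 50)]
    frame_without = [(5, 5), (5, 6), (6, 5)]
    print(redetect(obj, [frame_with, frame_without]))  # → [0]
```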

In conclusion, a number of techniques exist, and the right one must be picked for the problem at hand. To achieve the best results, taking into account the volume, value and variability of the data is essential.

MediaFragments and deep linking

Raphaël Troncy talked about Media Fragments and deep linking, starting with some history of the W3C Video activity. One of its topics was the definition of identifiers for specifying spatial and temporal clips, which led to the creation of the Media Fragments Working Group. The resulting specification makes it possible to address media fragments along 4 dimensions: temporal, spatial, named media fragments and track media fragments. In addition, access to media fragments can be optimized at the server level with a two-way handshake (which requires the server to support time units in range requests). The Media Fragments URI specification is now a W3C Recommendation, and several implementations exist.
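The four dimensions are expressed as name=value pairs in the URI fragment, for example `#t=10,20` (seconds 10 to 20), `#xywh=percent:25,25,50,50` (a region), `#track=audio` or `#id=chapter-1`. Here is a minimal parsing sketch; it is simplified on purpose (temporal values are plain seconds only, with no hh:mm:ss, SMPTE or wall-clock formats, and invalid values are not rejected as the specification requires), and the example URIs are hypothetical.

```python
from urllib.parse import unquote, urlsplit


def parse_media_fragment(uri: str) -> dict:
    """Parse the fragment of a Media Fragments URI into its temporal, spatial, track and id parts."""
    result = {}
    fragment = urlsplit(uri).fragment
    for pair in fragment.split("&"):
        if "=" not in pair:
            continue
        name, value = (unquote(part) for part in pair.split("=", 1))
        if name == "t":
            # Simplification: only plain NPT seconds, with an optional "npt:" prefix.
            if value.startswith("npt:"):
                value = value[len("npt:"):]
            start, _, end = value.partition(",")
            # A missing start means 0; a missing end means "until the end of the media".
            result["t"] = (float(start) if start else 0.0, float(end) if end else None)
        elif name == "xywh":
            unit = "pixel"  # default unit when no prefix is given
            if ":" in value:
                unit, value = value.split(":", 1)
            x, y, w, h = (int(v) for v in value.split(","))
            result["xywh"] = (unit, x, y, w, h)
        elif name in ("track", "id"):
            result[name] = value
    return result


if __name__ == "__main__":
    print(parse_media_fragment("http://example.org/video.mp4#t=10,20&xywh=percent:25,25,50,50"))
```

Because the fragment is handled on the client side, a browser can already seek to the right time on its own; the server-side handshake mentioned above only optimizes the transfer by sending just the needed bytes.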

Using Media Fragments, it becomes possible to create semantic video annotations, which should follow these principles:
– use things (DBpedia objects), not strings;
– use knowledge bases (Linked Open Data);
– use common vocabularies (Linked Open Vocabularies);
– follow the 4 Linked Data principles.

The Open Annotation data model proposes an interoperable model for representing annotations; notably, it captures how the nature of an annotation changes according to the user's intention (its motivation).
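Putting these pieces together, an annotation can link a temporal media fragment (the target) to a DBpedia resource (the body), with an explicit motivation. The sketch below builds such an annotation as a JSON-LD dictionary using the Open Annotation namespace; the video URI is hypothetical, and for simplicity the fragment URI is used directly as the target, whereas the model also offers selectors (such as `oa:FragmentSelector`) for this purpose.

```python
import json

OA = "http://www.w3.org/ns/oa#"


def make_annotation(video_uri, start, end, entity_uri, motivation="oa:identifying"):
    """Build an Open Annotation (JSON-LD) linking a temporal fragment to a Linked Data resource."""
    return {
        "@context": {"oa": OA},
        "@type": "oa:Annotation",
        # The motivation records why the annotation was made (identifying, tagging, commenting...).
        "oa:motivatedBy": {"@id": motivation},
        # Body: a thing (a DBpedia resource), not a string.
        "oa:hasBody": {"@id": entity_uri},
        # Target: a Media Fragments URI addressing seconds start..end of the video.
        "oa:hasTarget": {"@id": f"{video_uri}#t={start},{end}"},
    }


if __name__ == "__main__":
    annotation = make_annotation(
        "http://example.org/video.mp4", 10, 20, "http://dbpedia.org/resource/Dublin"
    )
    print(json.dumps(annotation, indent=2))
```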

Some examples of media annotation tools were cited, among which:
– MapHub, a media-fragment-based annotation system for ancient maps;
– LinkedTV, which is developing innovative semantic-based interfaces for interactive TV consumption.

Some useful tools were presented, such as NERD (Named Entity Recognition and Disambiguation) from Eurecom, which takes the results of Stanford CoreNLP and combines them with those of other NER web APIs (Alchemy, Calais, Zemanta…) to provide better results. On the interface side, MediaFragment Enricher adds semantic metadata (especially named entities) to video fragments.
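To give an idea of why combining several extractors can help, here is a simple majority-vote sketch. It is a hypothetical illustration, not NERD's actual combination strategy: it assumes each extractor's output has already been normalised to (start, end, type) character spans, keeps spans reported by enough extractors, and assigns the most frequent type. The extractor names are placeholders.

```python
from collections import Counter


def combine_extractors(outputs, min_votes=2):
    """Merge entity spans from several NER extractors by majority vote.

    `outputs` maps an extractor name to a list of (start, end, type) spans.
    A span is kept when at least `min_votes` extractors report it; its type is
    the most frequent one among those extractors.
    """
    span_votes = Counter()
    type_votes = {}
    for spans in outputs.values():
        seen = set()
        for start, end, entity_type in spans:
            if (start, end) in seen:
                continue  # one vote per extractor per span
            seen.add((start, end))
            span_votes[(start, end)] += 1
            type_votes.setdefault((start, end), Counter())[entity_type] += 1
    return sorted(
        (start, end, type_votes[(start, end)].most_common(1)[0][0])
        for (start, end), votes in span_votes.items()
        if votes >= min_votes
    )


if __name__ == "__main__":
    outputs = {
        "extractor_a": [(0, 6, "Place"), (10, 13, "Organization")],
        "extractor_b": [(0, 6, "Place"), (20, 25, "Person")],
        "extractor_c": [(0, 6, "Organization"), (10, 13, "Organization")],
    }
    print(combine_extractors(outputs))  # → [(0, 6, 'Place'), (10, 13, 'Organization')]
```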

Semantic Multimedia Remixing

Benoit Huet presented recent advances in automatic search and hyperlinking algorithms, in the light of the Search and Hyperlinking task of MediaEval 2013.

The task consisted of searching for information in a video dataset provided by the BBC and retrieving media fragments. The corpus featured 2323 BBC videos of different genres, enriched with 2 types of ASR transcripts, manual subtitles and some additional metadata (such as shot boundaries and keyframes). Two tasks were proposed:
– search: find a known segment in the collection from a text query;
– hyperlinking: find related segments.

The lessons learned from this experiment were that using scenes (pre-constructed segments) gave better performance than clustering results on the fly, and that text-based search and linking (using transcripts) worked better than using visual concepts.
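Both lessons can be combined in a simple baseline: index pre-constructed segments (for instance scenes) by their transcript text, rank them against a text query for search, and rank them against an anchor segment for hyperlinking. The sketch below uses TF-IDF weighting with cosine similarity, a standard information retrieval baseline; it is not one of the systems evaluated at MediaEval, and the segment texts are made up.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


class SegmentIndex:
    """TF-IDF index over pre-built segments (e.g. scenes), each with its transcript text."""

    def __init__(self, segments):
        # segments: dict mapping a segment id to its transcript text
        self.ids = list(segments)
        counts = {sid: Counter(tokenize(text)) for sid, text in segments.items()}
        doc_freq = Counter(term for c in counts.values() for term in c)
        n = len(segments)
        # Smoothed inverse document frequency: rarer terms weigh more.
        self.idf = {term: math.log((1 + n) / (1 + df)) + 1 for term, df in doc_freq.items()}
        self.vectors = {sid: self._weigh(c) for sid, c in counts.items()}

    def _weigh(self, counts):
        # Terms unknown to the index get weight 0; vectors are L2-normalised.
        vector = {t: tf * self.idf.get(t, 0.0) for t, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in vector.values()))
        return {t: w / norm for t, w in vector.items()} if norm else {}

    @staticmethod
    def _cosine(u, v):
        return sum(w * v.get(t, 0.0) for t, w in u.items())

    def search(self, query, k=3):
        """Search: rank segments against a text query."""
        q = self._weigh(Counter(tokenize(query)))
        ranked = sorted(self.ids, key=lambda sid: self._cosine(q, self.vectors[sid]), reverse=True)
        return ranked[:k]

    def hyperlink(self, anchor_id, k=3):
        """Hyperlinking: rank the other segments by similarity to an anchor segment."""
        anchor = self.vectors[anchor_id]
        others = [sid for sid in self.ids if sid != anchor_id]
        return sorted(others, key=lambda sid: self._cosine(anchor, self.vectors[sid]), reverse=True)[:k]
```

For example, with segments whose transcripts mention "weather" and "dublin", `search("dublin weather")` ranks those segments first, and `hyperlink` on one of them returns the other.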

Experience reports

Several stakeholders shared their experience. RTÉ Archives, Ireland's largest archive collection, presented their initiative to open up the RTÉ archives through collaboration with industry and research, and to build a digital repository for Humanities and Social Sciences data.

Noel E. O’Connor described the new opportunities offered by the sensor web in the domain of media mixing. Both domains share concerns about temporality and could potentially enrich each other.

Tinne Tuytelaars presented some outcomes of the AXES project, which aims to use AV content analysis, based on weakly supervised methods, to provide new engaging ways to interact with AV libraries (browse, explore, experience). One original aspect of the project is the use of Google Images to produce an initial result set from a textual query; this initial set is then used to train a classifier and perform visual retrieval on a private archive. She stressed the importance of a good user interface, and of properly educating users in how to use such systems.
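The on-the-fly training idea can be sketched as follows, under stated assumptions: the positive examples are feature vectors computed from the images returned by a web image search for the text query, the negatives come from a fixed generic pool, and feature extraction is not shown. For brevity this sketch swaps in a nearest-centroid classifier where a linear SVM would be a more typical choice; it is not the AXES implementation.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(values) / len(vectors) for values in zip(*vectors)]


def train_on_the_fly(positives, negatives):
    """Nearest-centroid classifier: returns a scoring function (higher = more likely relevant)."""
    pos, neg = centroid(positives), centroid(negatives)

    def score(vector):
        # Positive when the vector is closer to the positive centroid than to the negative one.
        return math.dist(vector, neg) - math.dist(vector, pos)

    return score


def rank_archive(archive, positives, negatives, k=5):
    """Rank archive items (id -> feature vector) using a classifier trained from web images."""
    score = train_on_the_fly(positives, negatives)
    return sorted(archive, key=lambda item_id: score(archive[item_id]), reverse=True)[:k]


if __name__ == "__main__":
    web_positives = [[1.0, 1.0], [1.2, 0.9]]
    generic_negatives = [[-1.0, -1.0], [-0.8, -1.1]]
    archive = {"a": [1.0, 1.0], "b": [-1.0, -1.0], "c": [0.2, 0.1]}
    print(rank_archive(archive, web_positives, generic_negatives, k=3))  # → ['a', 'c', 'b']
```

The key point is that no manual labelling of the private archive is needed: the web results act as weak supervision, which is also why their quality, and the interface for correcting them, matters.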

Content preservation in the multimedia era

Claudia Niederée (L3S Research Center) shared her vision of forgetful technologies, developed in the FP7 ForgetIT project.

Facing the current abundance of content (of any media type), what kind of data conservation policy should be adopted in the long run? Beyond technical preservation issues, the loss of knowledge and context is crucial. Long-term preservation must therefore protect information of enduring value for access by present and future generations, preserving not only the information itself but also its long-term understandability.

Paradoxically, the world’s (digital) data is now in greater danger of being lost: there is a real risk of a Digital Dark Age.

The principle of managed forgetting is inspired by the central role that forgetting plays in human memory. It does not rely on automatic deletion, but rather proposes storage on different levels, with suggestions for deletion, aggregation and summarization/annotation.
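A toy model can make the idea of storage levels and suggestions concrete. In the hypothetical sketch below, each item's importance decays exponentially with the time since its last access; the resulting score selects a storage level and a suggestion for the user, and nothing is ever deleted automatically. The half-life, thresholds and level names are all assumptions for illustration, not the ForgetIT model.

```python
from dataclasses import dataclass


@dataclass
class Item:
    name: str
    importance: float  # 0..1, e.g. set by the user or estimated (hypothetical)
    days_since_access: float


def retention_score(item: Item, half_life_days: float = 90.0) -> float:
    """Importance decayed exponentially with time since last access (hypothetical model)."""
    return item.importance * 0.5 ** (item.days_since_access / half_life_days)


def suggest(item: Item) -> tuple[str, str]:
    """Map an item to a storage level and a suggestion; nothing is deleted automatically."""
    score = retention_score(item)
    if score >= 0.5:
        return "active storage", "keep as is"
    if score >= 0.2:
        return "archive storage", "aggregate or summarise"
    return "cold storage", "suggest deletion to the user"


if __name__ == "__main__":
    for item in (Item("wedding photos", 0.9, 0), Item("trip videos", 0.8, 90), Item("blurry shots", 0.3, 365)):
        print(item.name, "->", suggest(item))
```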

How media fragments and their remixing can enable new experiences for e-learners

Gaber Cerle (JSI), the project lead, presented the past, present and future of the VideoLectures project.

Started within the PASCAL Network of Excellence, it is now an OER repository of almost 18,000 educational videos. It is going to benefit from many enhancements from other projects, such as MediaMixer (first results at …) and transLectures (for automatic transcription).