Authors:
(1) Pinelopi Papalampidi, Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh;
(2) Frank Keller, Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh;
(3) Mirella Lapata, Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh.
Table of Links
- Abstract and Intro
- Related Work
- Problem Formulation
- Experimental Setup
- Results and Analysis
- Conclusions and References
- A. Model Details
- B. Implementation Details
- C. Results: Ablation Studies
4. Experimental Setup
Datasets Our model was trained on TRIPODL, an expanded version of the TRIPOD dataset [41, 42] which contains 122 screenplays with silver-standard TP annotations (scene-level)[3] and the corresponding videos.[4] For each movie, we further collected as many trailers as possible from YouTube, including official trailers, (serious) fan-made ones, and modern trailers for older movies. To evaluate the trailers produced by our algorithm, we also collected a new held-out set of 41 movies. These movies were selected from the Moviescope dataset[5] [11], which contains official movie trailers. The held-out set does not contain any additional information, such as screenplays or TP annotations. The statistics of TRIPODL are presented in Table 1.
Movie and Trailer Processing The modeling approach put forward in previous sections assumes that we know the correspondence between screenplay scenes and movie shots. We obtain this mapping by automatically aligning the dialogue in screenplays with subtitles using Dynamic Time Warping (DTW; [36, 42]). We first segment the video into scenes based on this mapping, and then segment each scene into shots using PySceneDetect[6]. Shots with fewer than 100 frames are too short both to process and to display as part of a trailer, and are therefore discarded.
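As a rough illustration, the shot-segmentation and filtering step could look like the sketch below, which assumes PySceneDetect's 0.6-style `detect()` convenience API with its default content detector; the video path is a placeholder and the detector threshold is not taken from the paper.

```python
# Minimal sketch: segment a scene-level video clip into shots with PySceneDetect
# and drop shots shorter than 100 frames, as described above.
from scenedetect import detect, ContentDetector

MIN_FRAMES = 100  # shots below this length are discarded

def segment_scene_into_shots(video_path: str):
    # detect() returns a list of (start, end) FrameTimecode pairs, one per shot
    shot_list = detect(video_path, ContentDetector())
    kept = []
    for start, end in shot_list:
        n_frames = end.get_frames() - start.get_frames()
        if n_frames >= MIN_FRAMES:
            kept.append((start, end, n_frames))
    return kept

if __name__ == "__main__":
    shots = segment_scene_into_shots("scene_0042.mp4")  # illustrative path
    print(f"kept {len(shots)} shots of sufficient length")
```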
Moreover, for each shot we extract visual and audio features. We consider three different types of visual features:
(1) We sample one key frame per shot and extract features using ResNeXt-101 [56] pre-trained for object recognition on ImageNet [14]. (2) We sample one out of every ten frames (increasing this interval for longer shots to avoid memory issues) and extract motion features using the two-stream I3D network pre-trained on Kinetics [10]. (3) We use Faster R-CNN [18] implemented in Detectron2 [54] to detect person instances in every key frame and keep the four highest-confidence bounding boxes per shot along with the respective regional representations. We first project all individual representations to the same lower dimension and L2-normalize them. Next, we compute the visual shot representation as the sum of the individual vectors. For the audio modality, we use YAMNet pre-trained on the AudioSet-YouTube corpus [16], which classifies audio segments into 521 audio classes (e.g., tools, music, explosion); for each audio segment contained in the scene, we extract features from the penultimate layer. Finally, we extract textual features [42] from subtitles and screenplay scenes using the Universal Sentence Encoder (USE; [12]).
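The fusion of the visual features can be sketched in PyTorch as below: each feature type is projected to a shared dimension, L2-normalized, and summed into a single shot vector. The input dimensions, module names, and the way the four person-region vectors enter the sum are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualShotEncoder(nn.Module):
    """Sketch: project heterogeneous per-shot visual features to a shared
    dimension, L2-normalize them, and sum them into one shot representation.
    Feature sizes are assumed (e.g., 2048-d ResNeXt, 1024-d I3D, 2048-d regions)."""

    def __init__(self, dim_obj=2048, dim_motion=1024, dim_person=2048, dim_out=512):
        super().__init__()
        self.proj_obj = nn.Linear(dim_obj, dim_out)
        self.proj_motion = nn.Linear(dim_motion, dim_out)
        self.proj_person = nn.Linear(dim_person, dim_out)

    def forward(self, obj_feat, motion_feat, person_feats):
        # obj_feat:     (B, dim_obj)        key-frame object features
        # motion_feat:  (B, dim_motion)     I3D motion features
        # person_feats: (B, 4, dim_person)  top-4 person regions per shot
        obj_proj = F.normalize(self.proj_obj(obj_feat), p=2, dim=-1)
        motion_proj = F.normalize(self.proj_motion(motion_feat), p=2, dim=-1)
        person_proj = F.normalize(self.proj_person(person_feats), p=2, dim=-1)
        # sum of the individual (normalized) vectors -> (B, dim_out)
        return obj_proj + motion_proj + person_proj.sum(dim=1)
```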
For evaluation purposes, we need to know which shots in the movie are trailer-worthy. We determine this by segmenting the corresponding trailer into shots and computing each trailer shot's visual similarity with all shots in the movie. Movie shots with the highest similarity values receive positive labels (i.e., they should be in the trailer). However, since trailers also contain shots that are not in the movie (e.g., black screens with text, or simply material that did not make it into the final movie), we also set a threshold below which we do not map trailer shots to movie shots. In this way, we create silver-standard binary labels for movie shots.
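A minimal sketch of this labeling step is shown below, assuming cosine similarity over per-shot visual features and an illustrative threshold value; the paper does not specify the exact similarity measure or threshold.

```python
import numpy as np

SIM_THRESHOLD = 0.85  # illustrative value; below this a trailer shot is left unmatched

def label_movie_shots(trailer_feats: np.ndarray, movie_feats: np.ndarray) -> np.ndarray:
    """Sketch of the silver-label construction: for every trailer shot, find its
    most visually similar movie shot (cosine similarity here); if the best match
    exceeds the threshold, mark that movie shot as trailer-worthy (label 1)."""
    t = trailer_feats / np.linalg.norm(trailer_feats, axis=1, keepdims=True)
    m = movie_feats / np.linalg.norm(movie_feats, axis=1, keepdims=True)
    sims = t @ m.T                       # (num_trailer_shots, num_movie_shots)
    labels = np.zeros(m.shape[0], dtype=int)
    best = sims.argmax(axis=1)           # best-matching movie shot per trailer shot
    best_sim = sims.max(axis=1)
    for idx, s in zip(best, best_sim):
        if s >= SIM_THRESHOLD:           # skip trailer-only material (e.g., title cards)
            labels[idx] = 1
    return labels
```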
Sentiment Labels Since TRIPOD does not contain sentiment annotations, we instead obtain silver-standard labels via COSMIC [17], a commonsense-guided framework with state-of-the-art performance for sentiment and emotion classification in natural language conversations. Specifically, we train COSMIC on MELD [43], which contains dialogues from episodes of the TV series Friends and is more suited to our domain than other sentiment classification datasets (e.g., [9, 29]). After training, we use COSMIC to produce sentence-level sentiment predictions for the TRIPOD screenplays. The sentiment of a scene corresponds to the majority sentiment of its sentences. We project scene-based sentiment labels onto shots using the same one-to-many mapping employed for TPs.
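The label construction itself is simple; the sketch below shows majority voting over sentence-level predictions and the scene-to-shot projection, with hypothetical scene and shot identifiers (ties are broken arbitrarily here, which the paper does not specify).

```python
from collections import Counter

def scene_sentiment(sentence_sentiments):
    """Majority sentiment over a scene's sentence-level COSMIC predictions."""
    return Counter(sentence_sentiments).most_common(1)[0][0]

def project_to_shots(scene_labels, scene_to_shots):
    """Copy each scene's sentiment label to all shots aligned with that scene,
    using the same one-to-many scene-to-shot mapping as for the TP labels."""
    shot_labels = {}
    for scene_id, label in scene_labels.items():
        for shot_id in scene_to_shots.get(scene_id, []):
            shot_labels[shot_id] = label
    return shot_labels

# Hypothetical usage:
# scene_labels = {0: scene_sentiment(["neutral", "negative", "negative"])}
# shot_labels = project_to_shots(scene_labels, {0: [12, 13, 14]})
```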
This paper is available on arxiv under CC BY-SA 4.0 DEED license.
[3] https://github.com/ppapalampidi/TRIPOD
[4] https://datashare.ed.ac.uk/handle/10283/3819
[5] http://www.cs.virginia.edu/~pc9za/research/moviescope.html
[6] https://github.com/Breakthrough/PySceneDetect