A Summarize-then-Search Method for Long Video Question Answering: Method

26 May 2024

This paper is available on arXiv under a CC 4.0 license.

Authors:

(1) Jiwan Chung, MIR Lab Yonsei University (https://jiwanchung.github.io/);

(2) Youngjae Yu, MIR Lab Yonsei University (https://jiwanchung.github.io/).

2. Method

Figure 2: A qualitative result from our proposed Long Story Short (LSS) model, which generates and retrieves the index of raw video footage. When the model predicts the final answer from (i) the generated summary and (ii) the retrieved text context, CLIPCheck validates each candidate answer to revise the final answer to the question.

2.1. Plot Generation

2.2. Narrative Search

Given the summarized narrative and the question, we wish to retrieve the relatively short clip relevant to the question from the long video. Language models generate open-ended text, which is irregular and often noisy, so matching free-form output back to the video is unreliable. To retrieve the exact part of the video, we instead have the model output indices of the plot rather than free-form text.
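The index-based retrieval described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the plot is a list of short text pieces, each aligned with one video clip, and that the language model is available as a callable `generate` that maps a prompt string to a completion string. The prompt wording and the names `build_search_prompt`, `parse_indices`, and `narrative_search` are hypothetical, introduced only for illustration.

```python
import re
from typing import Callable, List


def build_search_prompt(plot: List[str], question: str) -> str:
    """Number each plot piece so the model can answer with an index.

    The prompt wording is an assumption, not taken from the paper.
    """
    numbered = "\n".join(
        f"({i}) {piece}" for i, piece in enumerate(plot, start=1)
    )
    return (
        "Plot:\n" + numbered + "\n\n"
        f"Question: {question}\n"
        "Which plot piece is most relevant to the question? "
        "Answer with the index only.\nIndex:"
    )


def parse_indices(output: str, num_pieces: int) -> List[int]:
    """Extract valid 1-based plot indices from possibly noisy model output.

    Out-of-range numbers and repeats are dropped; order of first
    appearance is preserved.
    """
    indices: List[int] = []
    for token in re.findall(r"\d+", output):
        idx = int(token)
        if 1 <= idx <= num_pieces and idx not in indices:
            indices.append(idx)
    return indices


def narrative_search(
    plot: List[str],
    question: str,
    generate: Callable[[str], str],  # hypothetical LM interface
) -> List[int]:
    """Return the plot indices the model selects for the question."""
    prompt = build_search_prompt(plot, question)
    return parse_indices(generate(prompt), len(plot))
```

Because each returned index points at a plot piece, and each plot piece is assumed to be aligned with a clip, the result can be used to look up the corresponding segment of the raw video directly. Parsing only in-range integers is one simple way to tolerate the irregular output mentioned above; any `generate` function with the same signature, including a stub for testing, can be passed in.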