Video Scene Detection Using Transformer Encoding Linker Network (TELNet)

Shu Ming Tseng*, Zhi Ting Yeh, Chia Yang Wu, Jia Bin Chang, Mehdi Norouzi

*Corresponding author for this work

Research output: Contribution to journal › Journal Article › peer-review


Abstract

This paper introduces the transformer encoding linker network (TELNet), which automatically identifies scene boundaries in videos without prior knowledge of their structure. Videos consist of sequences of semantically related shots or chapters, and recognizing scene boundaries is crucial for video processing tasks such as video summarization. TELNet scans through video shots with a rolling window, encoding shot features extracted by a fine-tuned 3D CNN model (the transformer encoder). It then establishes links between shots based on these encoded features (the linker) and identifies scene boundaries where consecutive shots lack links. TELNet was trained on multiple video scene detection datasets and achieved results comparable to other state-of-the-art models in standard settings. Notably, in cross-dataset evaluations, TELNet achieved significantly higher F-scores. Furthermore, its computational complexity grows linearly with the number of shots, making it highly efficient for processing long videos.
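The abstract outlines the core mechanism: encode 3D-CNN shot features with a transformer over a rolling window, score links between shots, and place a boundary wherever no link bridges a cut between consecutive shots. Below is a minimal sketch of that idea, assuming PyTorch; the names (`TELNetSketch`, `scene_boundaries`), the non-overlapping window, the dot-product linker head, and the threshold are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TELNetSketch(nn.Module):
    """Rolling-window transformer encoder plus linker head (illustrative)."""

    def __init__(self, feat_dim=512, n_heads=8, n_layers=2, window=10):
        super().__init__()
        self.window = window
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Linker head: scores how strongly shot i links to shot j.
        self.q = nn.Linear(feat_dim, feat_dim)
        self.k = nn.Linear(feat_dim, feat_dim)

    def forward(self, shot_feats):
        # shot_feats: (num_shots, feat_dim) pooled 3D-CNN shot features.
        n = shot_feats.size(0)
        link = shot_feats.new_zeros((n, n))
        # Non-overlapping windows for simplicity; the paper's rolling
        # window may overlap. The cost per window is constant, so the
        # total cost grows linearly with the number of shots.
        for start in range(0, n, self.window):
            w = shot_feats[start:start + self.window].unsqueeze(0)
            enc = self.encoder(w).squeeze(0)        # encoded shot features
            scores = self.q(enc) @ self.k(enc).T    # pairwise link scores
            end = start + enc.size(0)
            link[start:end, start:end] = scores
        return link

def scene_boundaries(link, thresh=0.0):
    """Place a boundary after shot i when no link bridges the cut
    between shots i and i+1 (an assumed rule based on the abstract)."""
    n = link.size(0)
    return [i + 1 for i in range(n - 1)
            if not (link[: i + 1, i + 1:] > thresh).any()]
```

Because the window size is fixed, each window incurs a constant attention cost, so the total cost is linear in the number of shots, consistent with the complexity claim in the abstract.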

Original language: English
Article number: 7050
Journal: Sensors
Volume: 23
Issue number: 16
DOIs
State: Published - 09 08 2023
Externally published: Yes

Bibliographical note

Publisher Copyright:
© 2023 by the authors.

Keywords

  • video chaptering
  • video scene detection
  • video structure analysis
  • video summarization
  • video temporal segmentation
