Abstract
The rapid growth of multimedia data and improvements in deep learning technology have allowed high-accuracy models to be trained in various fields. Video understanding tools such as video classification, temporal action detection, and video summarization are now available. In daily life, many social incidents start with a small conflict event. If conflicts and the dangers that follow can be detected from a video, social incidents can be prevented early on. This research presents a video and audio reasoning network that infers possible conflict events from video and audio features. To make the model more generalizable to other tasks, we also add a predictive network that predicts the risk of conflict events. We use multitask learning to make the video and audio features more generalizable to other similar tasks. We also propose several methods for integrating video and audio features, which improve the reasoning performance of the model. The proposed model, called the Video and Audio Reasoning Network (VARN), is more accurate than other models. Compared with RandomNet, it achieves 2.9 times greater accuracy.
Original language | English |
---|---|
Pages (from-to) | 6435-6455 |
Number of pages | 21 |
Journal | Journal of Supercomputing |
Volume | 77 |
Issue number | 6 |
State | Published - Jun 2021 |
Externally published | Yes |
Bibliographical note
Publisher Copyright: © 2021, Springer Science+Business Media, LLC, part of Springer Nature.
Keywords
- Computer vision
- Deep learning
- Multitask learning
- Video reasoning