Recognizing Semi-Natural and Spontaneous Speech Emotions Using Deep Neural Networks

Ammar Amjad, Lal Khan, Noman Ashraf, Muhammad Bilal Mahmood, Hsien Tsung Chang*

*Corresponding author for this work

Research output: Contribution to journal › Journal Article › peer-review

5 Scopus citations

Abstract

Identifying emotions in spontaneous speech is a novel and challenging research problem that requires learning deep emotional features from audio signals. Several convolutional neural network (CNN) models were used to learn deep segment-level auditory representations of augmented Mel spectrograms. The proposed study introduces a novel technique for recognizing semi-natural and spontaneous speech emotions based on 1D (Model A) and 2D (Model B) deep convolutional neural networks (DCNNs) with two layers of long short-term memory (LSTM). Both models use raw speech data and augmented (mid, left, right, and side) segment-level Mel spectrograms to learn local and global features. The architecture of both models consists of five local feature learning blocks (LFLBs), two LSTM layers, and a fully connected layer (FCL). Each LFLB comprises two convolutional layers and a max-pooling layer, which learn local correlations and extract hierarchical features; the LSTM layers then learn long-term dependencies from these local features. Experiments show that the proposed systems outperform conventional methods. Model A achieved an average identification accuracy of 94.78% in the speaker-dependent (SD) setting on the raw SAVEE dataset, and 73.15% in the SD setting with raw IEMOCAP audio. With augmented Mel spectrograms, Model A obtained SD identification accuracies of 97.19%, 94.09%, and 53.98% on the SAVEE, IEMOCAP, and BAUM-1s databases, respectively. In contrast, Model B achieved identification accuracies of 96.85%, 88.80%, and 48.67% on the SAVEE, IEMOCAP, and BAUM-1s databases in the speaker-independent (SI) setting with augmented Mel spectrograms, respectively.
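The abstract describes each LFLB as two convolutional layers followed by a max-pooling layer. The minimal numpy sketch below illustrates that block shape for the 1D case (Model A) on a raw-speech segment. Kernel sizes, the ReLU activations, and the input length are illustrative assumptions; the abstract does not specify these hyperparameters, and the full model stacks five such blocks before the LSTM layers.

```python
import numpy as np

def conv1d(x, kernel, stride=1):
    # Valid (no-padding) single-channel 1D convolution.
    k = len(kernel)
    n = (len(x) - k) // stride + 1
    return np.array([np.dot(x[i * stride:i * stride + k], kernel)
                     for i in range(n)])

def max_pool1d(x, size=2):
    # Non-overlapping max-pooling; truncates any remainder.
    n = len(x) // size
    return np.array([x[i * size:(i + 1) * size].max() for i in range(n)])

def lflb(x, k1, k2, pool=2):
    # Local feature learning block: two conv layers + max-pooling.
    # ReLU after each conv is a common choice, assumed here.
    h = np.maximum(conv1d(x, k1), 0)
    h = np.maximum(conv1d(h, k2), 0)
    return max_pool1d(h, pool)

# Hypothetical raw-speech segment and filter kernels.
rng = np.random.default_rng(0)
x = rng.standard_normal(64)    # 64-sample segment (illustrative)
k1 = rng.standard_normal(3)    # 3-tap kernels (illustrative)
k2 = rng.standard_normal(3)
out = lflb(x, k1, k2)
print(out.shape)               # (30,): (64-2) -> (60) -> pooled by 2
```

Each block halves the temporal resolution via pooling while the convolutions capture local correlations, so after five LFLBs the LSTM layers receive a much shorter sequence of higher-level features.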

Original language: English
Pages (from-to): 37149-37163
Number of pages: 15
Journal: IEEE Access
Volume: 10
DOIs
State: Published - 2022

Bibliographical note

Publisher Copyright:
© 2013 IEEE.

Keywords

  • Speech emotion recognition
  • convolutional neural network
  • data augmentation
  • long-short-term memory
  • spontaneous speech database
