An overview of high-resource automatic speech recognition methods and their empirical evaluation in low-resource environments

Fatehi, Kavan; Torres Torres, Mercedes; Kucukyilmaz, Ayse

doi:10.1016/j.specom.2024.103151

Authors

Kavan Fatehi

Mercedes Torres Torres

Dr Ayse Kucukyilmaz (ayse.kucukyilmaz@nottingham.ac.uk)
Associate Professor

Abstract

Deep learning methods for Automatic Speech Recognition (ASR) often rely on large-scale training datasets, which are typically unavailable in low-resource environments (LREs). This lack of sufficient and representative training data poses a significant challenge for applying ASR systems in specific domains categorized as LREs. In this paper, we provide a comprehensive overview and empirical analysis of state-of-the-art deep learning techniques for ASR, which are primarily designed for high-resource environments (HREs). Our aim is to explore their potential effectiveness in LRE settings. We focus on identifying key factors that influence the adaptation of HRE models to LRE tasks. To this end, we survey advanced deep learning models and conduct a comparative evaluation of their performance in LRE contexts. Additionally, we propose that pre-training ASR models on HRE datasets, followed by domain-specific fine-tuning on LRE data, can significantly enhance performance in data-scarce settings. Using LibriSpeech and WSJ as our HRE datasets, we evaluate these models on two LRE datasets: UASpeech for dysarthric speech and iCUBE, our novel human–robot interaction dataset. Our systematic experiments, involving varying dataset sizes for pre-training, demonstrate the efficacy of combining pre-training and fine-tuning strategies to improve recognition accuracy in LREs.

Citation

Fatehi, K., Torres Torres, M., & Kucukyilmaz, A. (2025). An overview of high-resource automatic speech recognition methods and their empirical evaluation in low-resource environments. Speech Communication, 167, Article 103151. https://doi.org/10.1016/j.specom.2024.103151

Journal Article Type	Article
Acceptance Date	Nov 22, 2024
Online Publication Date	Dec 10, 2024
Publication Date	2025-02
Deposit Date	Dec 17, 2024
Publicly Available Date	Dec 20, 2024
Journal	Speech Communication
Print ISSN	0167-6393
Publisher	Elsevier
Peer Reviewed	Peer Reviewed
Volume	167
Article Number	103151
DOI	https://doi.org/10.1016/j.specom.2024.103151
Keywords	Automatic speech recognition, End-to-end model, Deep learning models, Low-resource environment
Public URL	https://nottingham-repository.worktribe.com/output/42835237
Publisher URL	https://www.sciencedirect.com/science/article/pii/S0167639324001225?via%3Dihub