ScoutWav: Two-Step Fine-Tuning on Self-Supervised Automatic Speech Recognition for Low-Resource Environments

Fatehi, Kavan; Torres, Mercedes Torres; Kucukyilmaz, Ayse

doi:10.21437/Interspeech.2022-10270

Authors

Kavan Fatehi

Mercedes Torres Torres

Dr Ayse Kucukyilmaz ayse.kucukyilmaz@nottingham.ac.uk
Associate Professor

Abstract

Recent Automatic Speech Recognition (ASR) systems achieve extraordinary results. However, in some domains training data is either limited or not representative enough; these are known as Low-Resource Environments (LRE). In this paper, we present ScoutWav, a network that integrates context-based word boundaries with the self-supervised model wav2vec 2.0 to build a low-resource ASR model. First, we pre-train a model on High-Resource Environment (HRE) datasets and then fine-tune it on the LRE datasets to obtain context-based word boundaries. The resulting word boundaries are used to fine-tune a pre-trained and iteratively refined wav2vec 2.0 so that it learns appropriate representations for the downstream ASR task. Our refinement strategy for wav2vec 2.0 uses canonical correlation analysis (CCA) to detect which layers need updating. This dynamic refinement allows wav2vec 2.0 to learn more descriptive LRE-based representations. Finally, the representations learned by the two-step fine-tuned wav2vec 2.0 framework are fed back to the Scout Network for the downstream task. We carried out experiments with two LRE datasets: I-CUBE and UASpeech. Our experiments demonstrate that, by using target-domain word boundaries after pre-training together with automatic layer analysis, ScoutWav achieves up to a 12% relative word error rate (WER) reduction on the low-resource data.
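
The abstract states that CCA is used to detect which wav2vec 2.0 layers need updating, but does not spell out the selection rule. The following is a minimal sketch of one way such a check could look, not the authors' implementation. It assumes that per-layer activations (frames x features) are available from two versions of the model, computes the mean canonical correlation per layer, and flags layers whose similarity falls below a threshold. The function names, the comparison between a reference and a fine-tuned model, and the threshold value are all hypothetical choices made for illustration.

```python
import numpy as np


def mean_cca_similarity(x, y):
    """Mean canonical correlation between two activation matrices.

    x and y have shape (frames, features) and are assumed to have more
    frames than features and full column rank.
    """
    # Centre each feature so that correlations are measured around the mean.
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    # Orthonormal bases for the two column spaces.
    qx, _ = np.linalg.qr(x)
    qy, _ = np.linalg.qr(y)
    # Singular values of qx^T qy are the canonical correlations.
    rho = np.linalg.svd(qx.T @ qy, compute_uv=False)
    return float(np.clip(rho, 0.0, 1.0).mean())


def layers_to_update(ref_acts, new_acts, threshold=0.9):
    """Indices of layers whose CCA similarity is below `threshold`.

    Assumption for this sketch: a low similarity between reference and
    new activations marks a layer as a candidate for further updating.
    """
    return [
        i
        for i, (a, b) in enumerate(zip(ref_acts, new_acts))
        if mean_cca_similarity(a, b) < threshold
    ]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layer0 = rng.standard_normal((500, 8))
    layer1 = rng.standard_normal((500, 8))
    # Layer 0 is only linearly re-mixed; layer 1 is replaced by unrelated data.
    mixed = layer0 @ (rng.standard_normal((8, 8)) + 3.0 * np.eye(8))
    unrelated = rng.standard_normal((500, 8))
    print(layers_to_update([layer0, layer1], [mixed, unrelated]))
```

CCA is invariant to invertible linear transformations of the features, so a layer whose activations are merely re-mixed scores close to 1, while a layer whose activations have changed substantially scores much lower.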

Citation

Fatehi, K., Torres, M. T., & Kucukyilmaz, A. (2022, September). ScoutWav: Two-Step Fine-Tuning on Self-Supervised Automatic Speech Recognition for Low-Resource Environments. Presented at Interspeech 2022, Incheon, Korea.

Presentation Conference Type	Edited Proceedings
Conference Name	Interspeech 2022
Start Date	Sep 18, 2022
End Date	Sep 22, 2022
Acceptance Date	Jun 15, 2022
Online Publication Date	Sep 22, 2022
Publication Date	Sep 22, 2022
Deposit Date	Jul 29, 2022
Publicly Available Date	Sep 22, 2022
Peer Reviewed	Peer Reviewed
Volume	2022-September
Pages	3523-3527
Series Title	Interspeech
Book Title	Proceedings of Interspeech 2022
DOI	https://doi.org/10.21437/Interspeech.2022-10270
Keywords	Speech Recognition, Deep Learning
Public URL	https://nottingham-repository.worktribe.com/output/9409043
Publisher URL	https://www.isca-speech.org/archive/interspeech_2022/fatehi22_interspeech.html
Related Public URLs	https://interspeech2022.org/

Files

Fatehi-InterSpeech22-ScoutWav (851 Kb)
PDF

Licence
https://creativecommons.org/licenses/by/4.0/
