Zero-Shot Keyword Spotting for Visual Speech Recognition In-the-wild

Tzimiropoulos, Yorgos; Stafylakis, Themos

doi:10.1007/978-3-030-01225-0_32

Authors

Yorgos Tzimiropoulos

Themos Stafylakis

Abstract

Visual keyword spotting (KWS) is the problem of estimating whether a text query occurs in a given recording using only video information. This paper focuses on visual KWS for words unseen during training, a practical, real-world setting that has so far received no attention from the community. To this end, we devise an end-to-end architecture comprising (a) a state-of-the-art visual feature extractor based on spatiotemporal Residual Networks, (b) a grapheme-to-phoneme model based on sequence-to-sequence neural networks, and (c) a stack of recurrent neural networks which learn how to correlate visual features with the keyword representation. Unlike prior work on KWS, which tries to learn word representations merely from sequences of graphemes (i.e. letters), we propose the use of a grapheme-to-phoneme encoder-decoder model which learns how to map words to their pronunciation. We demonstrate that our system obtains very promising visual-only KWS results on the challenging LRS2 database for keywords unseen during training. We also show that our system outperforms a baseline which addresses KWS via automatic speech recognition (ASR), and that it improves drastically over other recently proposed ASR-free KWS methods.
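The contrast the abstract draws between grapheme-based and pronunciation-based keyword representations can be illustrated with a toy example. This is only a sketch and not the paper's model: `TOY_LEXICON`, `graphemes`, and `phonemes` are hypothetical names, and the hand-written ARPAbet-style lexicon stands in for the learned grapheme-to-phoneme encoder-decoder, which would predict a pronunciation for any word, including ones unseen during training.

```python
# Toy illustration only. The lexicon is a hand-written, hypothetical stand-in
# for a learned grapheme-to-phoneme (G2P) model; it is not taken from the paper.
TOY_LEXICON = {
    "night": ["N", "AY", "T"],
    "knight": ["N", "AY", "T"],
    "right": ["R", "AY", "T"],
    "write": ["R", "AY", "T"],
}


def graphemes(word):
    """Represent a keyword as its sequence of letters."""
    return list(word.lower())


def phonemes(word):
    """Look up a pronunciation in the toy lexicon.

    A real G2P model would predict this sequence instead of looking it up.
    """
    return TOY_LEXICON[word.lower()]


if __name__ == "__main__":
    for a, b in [("night", "knight"), ("right", "write")]:
        print(a, b)
        print("  same graphemes:", graphemes(a) == graphemes(b))
        print("  same phonemes: ", phonemes(a) == phonemes(b))
```

In this toy lexicon, "night" and "knight" have different letter sequences but identical phoneme sequences, so a pronunciation-based representation treats them as the same query while a letter-based one does not.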

Citation

Tzimiropoulos, Y., & Stafylakis, T. (2018, September). Zero-Shot Keyword Spotting for Visual Speech Recognition In-the-wild. Presented at European Conference on Computer Vision, Munich, Germany

Conference Name	European Conference on Computer Vision
Start Date	Sep 8, 2018
End Date	Sep 14, 2018
Acceptance Date	Jul 3, 2018
Online Publication Date	Oct 6, 2018
Publication Date	Oct 9, 2018
Deposit Date	Oct 15, 2018
Publicly Available Date	Oct 7, 2019
Publisher	Springer Nature
Volume	11208 LNCS
Pages	536-552
Series Title	Lecture notes in computer science
Series Number	11208
Series ISSN	1611-3349
Book Title	Computer Vision – ECCV 2018
ISBN	978-3-030-01224-3
DOI	https://doi.org/10.1007/978-3-030-01225-0_32
Keywords	Visual keyword spotting; Visual speech recognition; Zero-shot learning
Public URL	https://nottingham-repository.worktribe.com/output/1164454
Publisher URL	https://link.springer.com/chapter/10.1007/978-3-030-01225-0_32
Contract Date	Oct 15, 2018

Files

Themos Stafylakis Zero-shot Keyword Search ECCV 2018 Paper (PDF, 471 KB)