Combining residual networks with LSTMs for lipreading

Stafylakis, Themos; Tzimiropoulos, Georgios

Authors

Themos Stafylakis

Georgios Tzimiropoulos



Abstract

We propose an end-to-end deep learning architecture for word-level visual speech recognition. The system combines spatiotemporal convolutional, residual, and bidirectional Long Short-Term Memory networks. We train and evaluate it on the Lipreading In-The-Wild benchmark, a challenging database of 500 target words consisting of 1.28-second video excerpts from BBC TV broadcasts. The proposed network attains a word accuracy of 83.0%, a 6.8% absolute improvement over the current state of the art, without using information about word boundaries during training or testing.
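The reported figures can be put in context with a little arithmetic. The sketch below uses only the two numbers stated in the abstract (83.0% word accuracy and a 6.8% absolute improvement) to derive the implied accuracy of the previous best system and the corresponding relative reduction in word error rate. The variable names are illustrative, and the derived values are a reader's calculation rather than figures quoted from the paper.

```python
# Figures taken from the abstract (percentages).
proposed_accuracy = 83.0     # word accuracy of the proposed network
absolute_improvement = 6.8   # absolute gain over the prior state of the art

# Implied accuracy of the previous best system (derived, not quoted).
baseline_accuracy = proposed_accuracy - absolute_improvement

# Word error rates are the complements of the accuracies.
proposed_error = 100.0 - proposed_accuracy
baseline_error = 100.0 - baseline_accuracy

# Relative error reduction: share of the baseline's errors that were eliminated.
relative_error_reduction = 100.0 * (baseline_error - proposed_error) / baseline_error

print(f"implied baseline accuracy: {baseline_accuracy:.1f}%")
print(f"word error rate: {baseline_error:.1f}% -> {proposed_error:.1f}%")
print(f"relative error reduction: {relative_error_reduction:.1f}%")
```

Under these numbers the implied baseline accuracy is 76.2%, so the word error rate falls from 23.8% to 17.0%, which is roughly a 28.6% relative reduction.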

Citation

Stafylakis, T., & Tzimiropoulos, G. (in press). Combining residual networks with LSTMs for lipreading. In Proc. Interspeech 2017 (3652-3656). https://doi.org/10.21437/Interspeech.2017-85

Conference Name Interspeech 2017
Conference Location Stockholm, Sweden
Start Date Aug 20, 2017
End Date Aug 24, 2017
Acceptance Date May 22, 2017
Deposit Date Aug 10, 2017
Publicly Available Date Mar 28, 2024
Peer Reviewed Peer Reviewed
Pages 3652-3656
Book Title Proc. Interspeech 2017
DOI https://doi.org/10.21437/Interspeech.2017-85
Keywords visual speech recognition, lipreading, deep learning
Public URL https://nottingham-repository.worktribe.com/output/861527
Publisher URL http://www.isca-speech.org/archive/Interspeech_2017/abstracts/0085.html
Related Public URLs http://www.interspeech2017.org/
http://www.isca-speech.org/iscaweb/index.php/archive/online-archive
http://www.isca-speech.org/archive/Interspeech_2017/pdfs/0085.PDF
http://www.interspeech2017.org/calls/papers/
Additional Information Paper available on http://www.isca-speech.org/iscaweb/index.php/archive/online-archive. pp. 3652-3656. doi:10.21437/Interspeech.2017-85
