Combining residual networks with LSTMs for lipreading
Stafylakis, Themos; Tzimiropoulos, Georgios
Authors
Themos Stafylakis
Georgios Tzimiropoulos
Abstract
We propose an end-to-end deep learning architecture for word-level visual speech recognition. The system combines spatiotemporal convolutional, residual, and bidirectional Long Short-Term Memory networks. We trained and evaluated it on the Lipreading In-The-Wild benchmark, a challenging database with a 500-word vocabulary consisting of video excerpts from BBC TV broadcasts. The proposed network attains a word accuracy of 83.0%, a 6.8% absolute improvement over the current state of the art.
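As a reading aid for the reported figures, the short sketch below works through what a 6.8% absolute improvement at 83.0% word accuracy implies. The function names are hypothetical, and the baseline accuracy is derived here by subtraction rather than quoted from the abstract.

```python
def implied_baseline(accuracy_pct: float, absolute_gain_pct: float) -> float:
    """Accuracy of the previous best system implied by an absolute gain."""
    return accuracy_pct - absolute_gain_pct


def relative_error_reduction(accuracy_pct: float, baseline_pct: float) -> float:
    """Fraction of the baseline's word errors removed by the new system."""
    baseline_error = 100.0 - baseline_pct
    new_error = 100.0 - accuracy_pct
    return (baseline_error - new_error) / baseline_error


# Figures taken from the abstract.
reported_accuracy = 83.0  # word accuracy, percent
reported_gain = 6.8       # absolute improvement, percentage points

# Derived quantities (not stated in the abstract).
baseline = implied_baseline(reported_accuracy, reported_gain)
reduction = relative_error_reduction(reported_accuracy, baseline)

print(f"implied baseline accuracy: {baseline:.1f}%")
print(f"relative error reduction: {reduction:.1%}")
```

Under these figures the implied prior accuracy is about 76.2%, so the absolute gain of 6.8 points corresponds to removing roughly 29% of the earlier system's word errors.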
Citation
Stafylakis, T., & Tzimiropoulos, G. (in press). Combining residual networks with LSTMs for lipreading. https://doi.org/10.21437/Interspeech.2017-85
Conference Name | Interspeech 2017 |
---|---|
End Date | Aug 24, 2017 |
Acceptance Date | May 22, 2017 |
Deposit Date | Aug 10, 2017 |
Publicly Available Date | Dec 31, 2017 |
Peer Reviewed | Peer Reviewed |
DOI | https://doi.org/10.21437/Interspeech.2017-85 |
Keywords | visual speech recognition, lipreading, deep learning |
Public URL | https://nottingham-repository.worktribe.com/output/861527 |
Publisher URL | http://www.isca-speech.org/archive/Interspeech_2017/abstracts/0085.html |
Related Public URLs | http://www.interspeech2017.org/ http://www.isca-speech.org/iscaweb/index.php/archive/online-archive http://www.isca-speech.org/archive/Interspeech_2017/pdfs/0085.PDF http://www.interspeech2017.org/calls/papers/ |
Additional Information | Paper available on http://www.isca-speech.org/iscaweb/index.php/archive/online-archive. pp. 3652-3656. doi:10.21437/Interspeech.2017-85 |
Files
1703.04105.pdf (1.4 Mb, PDF)