
Combining residual networks with LSTMs for lipreading

Stafylakis, Themos; Tzimiropoulos, Georgios


We propose an end-to-end deep learning architecture for word-level visual speech recognition. The system combines spatiotemporal convolutional, residual, and bidirectional Long Short-Term Memory networks. We trained and evaluated it on the Lipreading In-The-Wild benchmark, a challenging database with a 500-word vocabulary consisting of video excerpts from BBC TV broadcasts. The proposed network attains a word accuracy of 83.0%, a 6.8% absolute improvement over the current state of the art.
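The pipeline described in the abstract (spatiotemporal convolution, a residual network applied per frame, and a bidirectional LSTM back-end) can be sketched as follows. This is a minimal illustrative assumption of the design, not the paper's exact configuration: the layer sizes, the single residual block (the paper uses a full ResNet), and the mean-pooling over time are placeholders.

```python
# Hedged sketch of a spatiotemporal-conv + residual + BiLSTM lipreading
# network. All hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn

class LipreadingSketch(nn.Module):
    def __init__(self, num_words=500):
        super().__init__()
        # Spatiotemporal (3D) convolutional front-end over the mouth-region video.
        self.front3d = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
        )
        # One 2D residual block applied per frame (stand-in for a full ResNet).
        self.res_block = nn.Sequential(
            nn.Conv2d(64, 64, 3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1),
            nn.BatchNorm2d(64),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Bidirectional LSTM back-end over the per-frame feature sequence.
        self.blstm = nn.LSTM(64, 128, num_layers=2, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * 128, num_words)

    def forward(self, x):
        # x: (batch, 1, time, height, width) grayscale mouth crops
        b = x.size(0)
        x = self.front3d(x)                    # (b, 64, T, H', W')
        t = x.size(2)
        x = x.transpose(1, 2).reshape(b * t, 64, x.size(3), x.size(4))
        x = torch.relu(x + self.res_block(x))  # residual (skip) connection
        x = self.pool(x).flatten(1).view(b, t, 64)
        x, _ = self.blstm(x)                   # (b, T, 256)
        return self.classifier(x.mean(dim=1))  # per-word logits

model = LipreadingSketch()
model.eval()
with torch.no_grad():
    logits = model(torch.randn(2, 1, 10, 112, 112))
print(logits.shape)  # torch.Size([2, 500])
```

The 3D front-end captures short-range lip motion across frames, while the recurrent back-end models the word-level temporal dynamics; mapping the time axis into the batch dimension lets the 2D residual block process every frame independently.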


Stafylakis, T., & Tzimiropoulos, G. (2017). Combining residual networks with LSTMs for lipreading. Interspeech 2017, 3652-3656. doi:10.21437/Interspeech.2017-85

Conference Name: Interspeech 2017
End Date: Aug 24, 2017
Acceptance Date: May 22, 2017
Deposit Date: Aug 10, 2017
Publicly Available Date: Dec 31, 2017
Peer Reviewed: Peer Reviewed
Keywords: visual speech recognition, lipreading, deep
Additional Information: Paper available on pp. 3652-3656. doi:10.21437/Interspeech.2017-85

