Streaming automatic speech recognition (ASR) systems consist of several separate components: an acoustic model (AM), a pronunciation model (PM), a language model (LM), and an endpointer (EP). Traditionally, these components are trained independently on different datasets, with a number of independence assumptions made for tractability.
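As a rough illustration of this decomposition, the sketch below composes the four stages at decoding time; every function name and signature here is a hypothetical stub for exposition, not any real toolkit's API.

```python
# A sketch of the conventional modular pipeline; all stubs are illustrative.
from typing import Dict, List

def acoustic_model(frames: List[List[float]]) -> List[Dict[str, float]]:
    """AM: per-frame posteriors over phonetic units (stub)."""
    return [{"ah": 0.6, "eh": 0.4} for _ in frames]

def pronunciation_model(posteriors: List[Dict[str, float]]) -> List[str]:
    """PM: maps phone hypotheses to words via a lexicon (stub)."""
    return ["hello"]

def language_model(words: List[str]) -> float:
    """LM: scores word sequences; trained separately on text-only data (stub)."""
    return -2.3

def endpointer(frames: List[List[float]]) -> bool:
    """EP: decides when the utterance has ended (stub)."""
    return len(frames) > 100

def decode(frames: List[List[float]]):
    # Stages communicate only through their intermediate outputs, which is
    # where the independence assumptions enter.
    words = pronunciation_model(acoustic_model(frames))
    return words, language_model(words), endpointer(frames)

print(decode([[0.0] * 80 for _ in range(120)]))
```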
Over the last several years, there has been growing interest in developing end-to-end systems, which attempt to learn these separate components jointly in a single neural network. Given input acoustic frames, such a model directly outputs a probability distribution over grapheme or word hypotheses. End-to-end models have been shown to surpass the performance of conventional ASR systems.
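For contrast, here is a minimal sketch of a single network mapping acoustic frames directly to a per-frame grapheme distribution; the dimensions, grapheme inventory, and CTC-style framing are assumptions for illustration, not the specific models discussed in the talk.

```python
# A minimal end-to-end sketch: one network replaces the separate AM/PM/LM.
# All sizes and the grapheme inventory below are illustrative assumptions.
import torch
import torch.nn as nn

GRAPHEMES = ["<blank>"] + list("abcdefghijklmnopqrstuvwxyz' ")  # assumed set

class TinyEndToEndASR(nn.Module):
    def __init__(self, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, len(GRAPHEMES))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, n_mels) log-mel features
        enc, _ = self.encoder(frames)
        # One distribution over graphemes per acoustic frame (CTC-style).
        return self.proj(enc).log_softmax(dim=-1)

# One second of 10 ms frames -> (1, 100, len(GRAPHEMES)) log-probabilities.
model = TinyEndToEndASR()
print(model(torch.randn(1, 100, 80)).shape)
```

A unidirectional encoder is used in this sketch so that it remains compatible with the streaming setting described above.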
In this talk, we will present a number of recently introduced innovations that have significantly improved the performance of end-to-end models. We will also discuss some of their remaining shortcomings and the ongoing efforts to address these challenges.