
Towards End-To-End Automatic Speech Recognition

Streaming automatic speech recognition (ASR) systems consist of a set of separate components, namely an acoustic model (AM), a pronunciation model (PM), a language model (LM), and an endpointer (EP). Traditionally, these components are trained independently on different datasets, with a number of independence assumptions made for tractability.

Over the last several years, there has been growing interest in developing end-to-end systems, which attempt to learn these separate components jointly in a single neural network. Given input acoustic frames, such a network directly outputs a probability distribution over grapheme or word hypotheses. These end-to-end models have been shown to surpass the performance of conventional ASR systems.
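To make the idea concrete, here is a minimal, hypothetical sketch of the output stage such a network might have: a projection from encoded acoustic frames to a per-frame probability distribution over graphemes. All names, shapes, and the toy grapheme inventory are illustrative assumptions, not the architecture discussed in the talk; real end-to-end models (attention-based or RNN-T systems) are far more involved.

```python
import numpy as np

# Toy grapheme inventory (illustrative assumption, not a real vocabulary).
GRAPHEMES = list("abc") + ["<blank>"]

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grapheme_distribution(frames, weights, bias):
    """Map acoustic frames (T, F) to grapheme probabilities (T, V)."""
    logits = frames @ weights + bias  # one logit per grapheme per frame
    return softmax(logits, axis=-1)

rng = np.random.default_rng(0)
T, F, V = 5, 8, len(GRAPHEMES)  # frames, feature dim, vocabulary size
frames = rng.standard_normal((T, F))    # stand-in for encoded audio
W = rng.standard_normal((F, V)) * 0.1   # hypothetical learned weights
b = np.zeros(V)

probs = grapheme_distribution(frames, W, b)
print(probs.shape)                            # (5, 4): one distribution per frame
print(np.allclose(probs.sum(axis=-1), 1.0))   # each row is a valid distribution
```

The point of the sketch is only that a single network consumes acoustic frames and emits a normalized distribution over output symbols directly, with no separate AM, PM, or LM stages.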

In this talk, we will present a number of recently introduced innovations that have significantly improved the performance of end-to-end models. We will also discuss some of the shortcomings and ongoing efforts to address these challenges.

Science vs. COVID, conversations at scale

The explosive growth of scientific research about the novel coronavirus is one of the truly inspiring and hope-filled stories of this crisis...