Speech-to-Retrieval (S2R): A new approach to voice search

Voice Search is now powered by our new Speech-to-Retrieval engine, which gets answers straight from your spoken query without having to convert it to text first, resulting in a faster, more reliable search for everyone.

Voice-based web search has been around a long time and continues to be used by many people, with the underlying technology evolving rapidly to allow for expanded use cases. Google’s initial voice search solution used automatic speech recognition (ASR) to turn the voice input into a text query, and then searched for documents matching that text query. However, a challenge with this cascade modeling approach is that any slight errors in the speech recognition phase can significantly alter the meaning of the query, producing the wrong results.

For example, imagine someone does a voice-based web search for the famous painting, “The Scream”, by Edvard Munch. The search engine uses the typical approach of cascade modeling, first converting the voice query to text via ASR before passing the text to the search system. Ideally, the ASR transcribes the query perfectly. The search system then receives the correct text — “the Scream painting” — and provides relevant results, like the painting’s history, its meaning, and where it’s displayed. However, what if the ASR system mistakes the “m” of “scream” for an “n”? It misinterprets the query as “screen painting” and returns irrelevant results about screen painting techniques instead of details about Munch’s masterpiece.