When you make a voice search or any other voice input on Android, a surprisingly complex process runs behind the scenes. Your voice is recorded, transmitted to Google's servers, analyzed and converted into a text string, then either passed on to the relevant web service (like Google Search) or sent back to your device. The round trip is usually almost instantaneous on a decent Internet connection, but therein lies its one weakness: it needs that connection to work at all. The rudimentary offline fallback (in Android since Jelly Bean) relies on a smaller vocabulary and a less sophisticated detection system, so it's slower and less capable than the connected version.

Google may be working on a solution for that. A recent research paper published by a team of Google software engineers outlines a system that crunches the hardware requirements for faster and more robust voice recognition down to something that makes sense for relatively low-power smartphones and tablets. The paper is short, just four pages, but it's probably somewhat less than clear unless you happen to be an expert in this very specialized field. For example:

In this paper we extend previous work that used quantized deep neural networks (DNNs) and on-the-fly language model rescoring to achieve real-time performance on modern smartphones [1]. We demonstrate that given similar size and computation constraints, we achieve large improvements in word error rate (WER) performance and latency by employing Long Short-Term Memory (LSTM) recurrent neural networks (RNNs), trained with connectionist temporal classification (CTC) [2] and state-level minimum Bayes risk (sMBR) [3] techniques. LSTMs are made small and fast enough for embedded speech recognition by quantizing parameters to 8 bits, by using context independent (CI) phone outputs instead of more numerous context dependent (CD) phone outputs, and by using Singular Value Decomposition (SVD) compression [4, 5].
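
To make that jargon a little more concrete, here is a minimal numpy sketch of two of the compression tricks the quote names, 8-bit weight quantization and SVD-based low-rank factorization, applied to a single made-up weight matrix. The matrix size, the rank, and the scaling scheme are illustrative assumptions, not figures from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for one acoustic-model weight matrix (real LSTMs have many).
W = rng.standard_normal((512, 512)).astype(np.float32)

# --- 8-bit quantization: store weights as int8 plus one float scale. ---
scale = np.abs(W).max() / 127.0
W_q = np.round(W / scale).astype(np.int8)   # 4x smaller than float32
W_dequant = W_q.astype(np.float32) * scale  # approximate reconstruction at runtime

# --- SVD compression: keep only the top-k singular values. ---
k = 64                                      # illustrative rank, not from the paper
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :k] * s[:k]                        # 512 x 64
B = Vt[:k, :]                               # 64 x 512
W_lowrank = A @ B                           # two thin matrices replace one big one

print("max quantization error:", np.abs(W - W_dequant).max())
print("params before SVD:", W.size, "after:", A.size + B.size)
```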

The gist is that the team thinks it's more than possible to build a voice recognition system that runs locally on smartphones, using a comparatively tiny amount of processing power and memory while retaining important features like error detection and voice customization. The acoustic model in particular was reduced in size by a factor of ten for the Nexus 5 test device. The system isn't unlimited: it still relies on a basic vocabulary database, and the current implementation has a word error rate of 13.5%, meaning roughly one word in seven is transcribed incorrectly. Even so, it's promising, and it wouldn't be at all surprising to see a more capable offline voice recognition system in a future version of Android.
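
For reference, word error rate is just the word-level edit distance (substitutions, deletions, and insertions) between the recognizer's output and a reference transcript, divided by the length of the reference. Here's a quick self-contained sketch, with a made-up utterance:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over words, divided by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Made-up example: one substitution in eight reference words -> 12.5% WER.
print(word_error_rate("set an alarm for seven thirty tomorrow morning",
                      "set an alarm for seven forty tomorrow morning"))
```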

Source: arXiv
Via: ZDNet