Google has been improving Gboard with the same type of tools it uses for speech recognition: machine learning. The budding technology is rapidly becoming a ubiquitous method for improving results and performance. If a network can be trained to accomplish something accurately and quickly, odds are you'll see it introduced to any product it can be applied to. Gboard and text input as a whole are no different, and we are reaping the benefits of improved corrections and predictions every time we swipe out a low-accuracy message to a friend. But how do these improvements work?

This is going to be a pretty technical post, and I'm going to do my best to simplify things to a level where everyone can enjoy the subject. All of this is based on a recent Google Research blog post. If you are really into machine learning and you already know what every term I might use means, the source link is probably the best place for you to learn about the cool new stuff in Gboard. If, however, you think you might want a few of the concepts or terms explained, then this article might be the place for you.

Google's latest research blog post is all about one thing: Gboard. The authors talk about some of the new features that were shown off at I/O, like the transliteration tools for typing in a language you might only have a phonetic familiarity with. But the biggest things discussed in the article relate to how input is determined, and the models and math that determine which swipes equate to which words, and how to most accurately correct mistakes. The two solutions Google has found revolve around its neural spatial models for processing input, and finite-state transducers for state progression to accurately determine content. These sound like complex ideas, but their cleverness can be explained.



Neural spatial models are just a fancy way of saying machine learning applied to a physical space, in this case the Gboard keyboard layout itself. When we type, we sometimes type wrong. And when we swipe, we sometimes swipe wrong. I might have meant to drag over "P" for this word, but I only made it to "O." That's where these neural spatial models come in handy. They are able to determine probabilities for which letters you meant to hit based on the physical proximity of your input, as well as predictive models trained on past corrections.

Originally, Gboard used a Gaussian model, combined with a set of simple rules. That's a fancy way of saying it used physical proximity in a sort of bell curve out from a given key, combined with a basic understanding of certain common errors. The developers have replaced the older Gaussian and rule-based model with an LSTM, or long short-term memory model, trained with CTC, or connectionist temporal classification. Big, somewhat scary acronyms, but they're actually pretty simple ideas.
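To make the bell-curve idea concrete, here is a minimal Python sketch of a Gaussian key model. It is not Gboard's code: the key coordinates, the `sigma` spread, and the `key_probabilities` function are all assumptions made up for illustration, and it leaves out the rule-based part entirely.

```python
import math

# Hypothetical key centers on a simple grid (x, y). A real layout differs.
KEY_CENTERS = {
    "u": (6.0, 0.0),
    "i": (7.0, 0.0),
    "o": (8.0, 0.0),
    "p": (9.0, 0.0),
    "k": (7.5, 1.0),
    "l": (8.5, 1.0),
}


def key_probabilities(touch, sigma=0.6):
    """Score each key with a bell curve centered on the key, then normalize."""
    tx, ty = touch
    scores = {}
    for key, (kx, ky) in KEY_CENTERS.items():
        dist_sq = (tx - kx) ** 2 + (ty - ky) ** 2
        # Unnormalized Gaussian: the score falls off with squared distance.
        scores[key] = math.exp(-dist_sq / (2 * sigma ** 2))
    total = sum(scores.values())
    return {key: score / total for key, score in scores.items()}


# A touch that lands between "o" and "p", slightly closer to "o".
probs = key_probabilities((8.3, 0.1))
best = max(probs, key=probs.get)
print(best, probs)
```

The closest key wins, but its neighbors still get some probability, which is what lets a later stage decide that you really meant "P" after all.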

A long short-term memory network is a particular kind of recurrent neural net that can hold on to information across long or unpredictable gaps between the relevant events in a sequence, and Google has previously used these for speech recognition, the Google Assistant, and Google Translate. Connectionist temporal classification is a specific method of training such a network. After all, a neural network is nothing without data to derive its weights from. In this case, the input is a sequence of touch points over time, and nobody has labeled which point belongs to which letter. CTC lets the network learn from just the swipe and the intended word, working out for itself how the two line up. To further simplify, each piece of input is judged in the context of what came before and after it rather than in isolation, increasing the value of context in the network's training.
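One small piece of CTC is easy to show in code: how a string of per-moment guesses gets boiled down to a word. In the standard CTC formulation, the network can output a special "blank" symbol at any time step, repeated labels are merged, and blanks are then dropped. The sketch below only illustrates that collapsing rule. The `ctc_collapse` name and the choice of "-" for the blank are my own assumptions, and none of this comes from Google's post.

```python
from itertools import groupby

BLANK = "-"  # assumed stand-in for the CTC "blank" symbol


def ctc_collapse(frame_labels):
    """Merge consecutive repeated labels, then drop the blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != BLANK)


# One guess per time step along a swipe, with blanks between letters.
print(ctc_collapse("cc-oo-uuu-l-dd"))  # could
```

The blank is what keeps double letters intact: "hel-lo" collapses to "hello," while "hello" with no blank between the two l's collapses to "helo."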

To pull data for training, Google used data sets composed of information collected from users who opt in to share snippets. Corrections and suggestions that users reverted were treated as negative signals, and those that users accepted were treated as positive ones.


To help visualize how it might work, we have this image. In it, the left side represents individual data points from the word "could" as swiped on Gboard. The right side shows normalized temporal data for the same swiped input. By being able to refer to the probable center of an input, in temporal context with the inputs that came before and after, we can tell that the overall path means "could" even if we might not have hit a couple of letters quite right. Google was further able to adapt some of its speech recognition tools and data and apply them to corrections and suggestions in Gboard, optimizing things over iterations with a heavy dose of TensorFlow to increase the speed of analysis and output and decrease the number of errors.

All that is well and good, but Google can also leverage the rules of a language to enhance predictions. As a basic example, consider: If you tapped O-T instead of I-T, Gboard can look at its dictionary and determine that "OT" isn't something you are likely to input, but that the similar "IT" is a word. Furthermore, "I" is physically close to "O," so you probably meant "IT." Easy. Well, Google is able to do a similar but improved sort of thing via what is called a finite-state transducer.
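The simple O-T versus I-T check can be sketched in a few lines of Python. Everything here is a toy assumption: the `NEIGHBORS` table covers only three keys, the `WORDS` set stands in for a real dictionary, and `suggest` is a made-up helper rather than anything Gboard actually exposes.

```python
# Hypothetical neighbor table: which keys sit next to each key.
NEIGHBORS = {
    "o": "iklp",
    "i": "ujko",
    "t": "rfgy",
}

# Tiny stand-in for a dictionary.
WORDS = {"i", "if", "in", "is", "it"}


def suggest(typed):
    """Return dictionary words reachable by swapping one key for a neighbor."""
    typed = typed.lower()
    if typed in WORDS:
        return [typed]  # already a word, nothing to fix
    candidates = set()
    for pos, char in enumerate(typed):
        for swap in NEIGHBORS.get(char, ""):
            candidate = typed[:pos] + swap + typed[pos + 1:]
            if candidate in WORDS:
                candidates.add(candidate)
    return sorted(candidates)


print(suggest("OT"))  # ['it']
```

With a real dictionary, several candidates would usually come back, and something would have to rank them. That ranking is where the statistical side comes in.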



To vastly over-simplify, a finite-state transducer, in this case, means an order of operations for processing input. The operational flow above shows the various stages of potential input for the words "I," "I've," and "If." In this image, input at each stage comes before the ":" (with ε representing nothing as either input or output), and potential outputs are shown after the ":" as they each become logical. So we start by inputting the letter "i." From the top path we can see that if there is no additional input, then "I" is the correct output. If an apostrophe or space follows, and then the letter "v," then "I've" is the likely output. If the letter "f" follows, then "If" is the likely output. The final two flows at the bottom return to the start. You can think of it as just a series of rules for input and output.
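For the curious, the flow just described can be written as a small lookup table in Python. This is a loose reconstruction from the description, not a copy of Google's figure: the state names, the `transduce` function, and the choice to treat a trailing "e" as optional are all my own assumptions.

```python
EPS = ""  # epsilon: nothing is emitted on this step

# (current state, input symbol) -> (next state, output)
TRANSITIONS = {
    ("start", "i"): ("saw_i", EPS),
    ("saw_i", "f"): ("done", "If"),
    ("saw_i", "'"): ("saw_apostrophe", EPS),
    ("saw_i", " "): ("saw_apostrophe", EPS),
    ("saw_apostrophe", "v"): ("saw_v", "I've"),
    ("saw_v", "e"): ("done", EPS),
}

# States where input may end, and what is emitted if it ends there.
FINAL_OUTPUT = {"saw_i": "I", "saw_v": EPS, "done": EPS}


def transduce(text):
    """Walk the transducer; return the output, or None if the input is rejected."""
    state = "start"
    output = []
    for symbol in text.lower():
        key = (state, symbol)
        if key not in TRANSITIONS:
            return None  # no rule for this input in this state
        state, emitted = TRANSITIONS[key]
        output.append(emitted)
    if state not in FINAL_OUTPUT:
        return None  # input stopped partway through a word
    output.append(FINAL_OUTPUT[state])
    return "".join(output)


for typed in ["i", "if", "i've", "i ve", "ix"]:
    print(repr(typed), "->", transduce(typed))
```

Note that both the apostrophe and the space lead to the same state, which is how a sloppy "i ve" still comes out as "I've."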


Google is able to use these same tools at other scales. For instance, developers can create these logical chains for whole words and their contexts, combined with the spatial models above and statistical likelihoods for possible inputs. These tools also apply to the transliteration features that were shown off at I/O. By applying those same workflows and networks to phonetically spelled input, Gboard can guess what characters or words in another language you might mean. For anyone who might speak a language but not know how to spell it, that's pretty cool. Unfortunately, if you can't read the language, you aren't really in a position to know if what it's typing is right or wrong, but it's still a useful tool.


Gboard and machine learning are both fantastically interesting and complex subjects. I hope this simplification of the tools behind them has helped you understand a bit more about how it all works. It's useful to know how the things we use on a daily basis work, not just in an abstract sense for intellectual fulfillment, but to give us a base for troubleshooting when things go wrong. Now when Gboard makes seemingly odd recommendations, you might have a small idea as to why.