We were pretty impressed with how the Pixel 2's portrait mode worked, and Google's deep dive into the technical details was easily one of my favorite reads of last year. The company must have recognized what a hit it was, as a new explanation of the improvements behind the Pixel 3's depth sensing was just published. As always nowadays, a big chunk of the changes are a result of neural network magic, but Google also has a "Frankenphone" five-Pixel case to thank for the Pixel 3's portrait mode progress.
While the PDAF (phase-detection autofocus) used in the Pixel 2 was an excellent way to gather basic depth data via parallax between the two PDAF apertures used to focus, it wasn't infallible. Certain types of scene geometry, like an abundance of horizontal lines, can confuse the system that makes those PDAF image comparisons to determine depth. Basically, visual elements that line up with the linear configuration of the depth-sensing PDAF apertures can produce a false reading, where that content is judged to be closer than it actually is based on that quick comparison.
Note the difference in depth in lines on the right in the "Stereo Depth" mode vs. "Learned Depth."
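For a rough feel of why lines that run along the baseline trip up a parallax comparison, here is a toy sketch in Python. It is not Google's algorithm: the brute-force matching, the stripe patterns, and every name in it (`shifted_view`, `matching_costs`, `PATTERN`) are assumptions made up for illustration. It compares two views that differ only by a horizontal shift and scores each candidate shift.

```python
# Toy illustration only: brute-force matching along a horizontal baseline.
# All names and values here are hypothetical, not taken from Google's write-up.

def shifted_view(image, shift):
    """Second viewpoint: the same scene displaced `shift` pixels horizontally (wraps around)."""
    return [row[shift:] + row[:shift] for row in image]


def matching_costs(left, right, max_shift):
    """Sum of absolute differences between the two views for each candidate shift."""
    height, width = len(left), len(left[0])
    costs = []
    for shift in range(max_shift + 1):
        total = 0
        for y in range(height):
            for x in range(width - max_shift):  # stay inside the image
                total += abs(left[y][x + shift] - right[y][x])
        costs.append(total)
    return costs


PATTERN = [0, 0, 255, 0, 255, 255, 0, 0, 0, 255, 255, 255, 0, 255, 0, 0]
TRUE_SHIFT = 3
MAX_SHIFT = 5

# Vertical stripes: brightness changes along the baseline.
vertical_stripes = [list(PATTERN) for _ in range(8)]
# Horizontal stripes: brightness changes only across the baseline.
horizontal_stripes = [[value] * 16 for value in PATTERN[:8]]

vertical_costs = matching_costs(
    vertical_stripes, shifted_view(vertical_stripes, TRUE_SHIFT), MAX_SHIFT
)
horizontal_costs = matching_costs(
    horizontal_stripes, shifted_view(horizontal_stripes, TRUE_SHIFT), MAX_SHIFT
)

print("vertical stripes:  ", vertical_costs)    # one clear minimum at the true shift
print("horizontal stripes:", horizontal_costs)  # every candidate shift ties
```

With vertical stripes, only the true shift produces a perfect match. With horizontal stripes, sliding the image sideways changes nothing, so every shift looks equally good and the comparison has no way to pick a depth.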
To mitigate that issue, depth information on the Pixel 3 is augmented with additional data, using what Google calls semantic and defocus cues: the sort of thing we humans easily pick up on when looking at a photo by noticing the relative sizes of objects and how blurry they are. What's easy for people can be hard for computers, though, which makes a neural network one of the best choices for interpreting those cues automatically. Neural networks don't just "work," however. They need to be trained on good-quality data sets, so Google needed some.
Google's 5-Pixel monstrosity for capturing data (left), the resulting captured photos animated (center), and the depth data taken from those images (right).
As one does, Google solved that particular problem by building a 5-Pixel "Frankenphone" abomination that was able to capture photos from five angles at once. Capturing the training images on the same type of hardware the model would ultimately run on makes it more accurate, and those extra dimensions of parallax supply even more depth data while eliminating the issue PDAF runs into with lines that run in the same direction as its apertures. This didn't actually solve all of Google's problems detecting depth (insufficient "texture," focus distance, and other problems can hinder efforts), but it was enough to determine relative depth with reasonable accuracy.
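To see why viewpoints in more than one direction help, here is a minimal Python sketch, again a toy rather than anything from Google's pipeline. It assumes a flat scene of horizontal stripes, only two neighbouring cameras instead of the rig's five, and hypothetical helpers (`view_from`, `cost`). A horizontally offset camera cannot tell the candidate disparities apart, but adding a vertically offset one settles it.

```python
# Toy multi-baseline matching. Names, pattern, and camera layout are assumptions
# for illustration; the real rig used five phones and a far more capable method.

def view_from(scene, baseline, disparity):
    """Simulated camera offset by baseline (bx, by): a flat scene shifts by disparity * baseline."""
    bx, by = baseline
    h, w = len(scene), len(scene[0])
    return [
        [scene[(y + disparity * by) % h][(x + disparity * bx) % w] for x in range(w)]
        for y in range(h)
    ]


def cost(reference, view, baseline, candidate):
    """Sum of absolute differences between `view` and what `candidate` disparity would predict."""
    predicted = view_from(reference, baseline, candidate)
    return sum(
        abs(p - v)
        for predicted_row, view_row in zip(predicted, view)
        for p, v in zip(predicted_row, view_row)
    )


ROWS = [0, 0, 255, 0, 255, 255, 0, 0, 0, 255, 255, 255, 0, 255, 0, 0]
scene = [[value] * 8 for value in ROWS]  # horizontal stripes only
TRUE_DISPARITY = 3
CANDIDATES = range(6)
BASELINES = [(1, 0), (0, 1)]  # one horizontal neighbour, one vertical neighbour

views = {b: view_from(scene, b, TRUE_DISPARITY) for b in BASELINES}
per_baseline = {
    b: [cost(scene, views[b], b, d) for d in CANDIDATES] for b in BASELINES
}
# Adding up the costs lets the informative baseline break the tie.
combined = [sum(per_baseline[b][d] for b in BASELINES) for d in CANDIDATES]
best = combined.index(min(combined))

print("horizontal baseline:", per_baseline[(1, 0)])  # all candidates tie
print("vertical baseline:  ", per_baseline[(0, 1)])
print("best combined disparity:", best)
```

Summing the costs is the simplest possible way to merge the views. It is used here only to show that a second direction of parallax removes the ambiguity that a single baseline leaves behind.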
With photos/data captured by the monstrous assembly, Google was able to train the neural network to pick up on those more subtle cues, combining them with PDAF parallax data and the previous "people pixel" detection for even greater accuracy in generating portrait depth maps.
Left: "Stereo" PDAF Portrait results. Right: New machine learning depth results. Depth maps alone for each below.
That depth data is also saved with photos taken in Portrait mode, which could open up a ton of additional functionality for developers in the future — you can already extract it from the resulting images yourself.
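If you want to poke at a Portrait photo yourself, a reasonable first step is to look at the metadata embedded in the JPEG. The sketch below is a starting point under stated assumptions: it assumes the depth information travels in the file's XMP metadata, which this article does not spell out, and it only locates standard XMP packets without decoding any depth map. The function name `xmp_packets` is hypothetical.

```python
import struct

# Signature that prefixes a standard XMP packet inside a JPEG APP1 segment.
XMP_HEADER = b"http://ns.adobe.com/xap/1.0/\x00"


def xmp_packets(jpeg_bytes):
    """Return the payload of every standard XMP APP1 segment found before the image data."""
    if jpeg_bytes[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing start-of-image marker")
    packets = []
    pos = 2
    while pos + 4 <= len(jpeg_bytes):
        if jpeg_bytes[pos] != 0xFF:
            break  # lost sync with the segment structure; stop rather than guess
        marker = jpeg_bytes[pos + 1]
        if marker in (0xD9, 0xDA):  # end of image or start of scan: no more metadata
            break
        # The 2-byte big-endian length counts itself but not the marker bytes.
        (length,) = struct.unpack(">H", jpeg_bytes[pos + 2:pos + 4])
        payload = jpeg_bytes[pos + 4:pos + 2 + length]
        if marker == 0xE1 and payload.startswith(XMP_HEADER):
            packets.append(payload[len(XMP_HEADER):])
        pos += 2 + length
    return packets


def _segment(marker, payload):
    """Build one JPEG segment (helper for the synthetic example below)."""
    return b"\xff" + bytes([marker]) + struct.pack(">H", len(payload) + 2) + payload


# Synthetic stand-in for a photo, so the sketch runs without a real file.
fake_jpeg = (
    b"\xff\xd8"
    + _segment(0xE0, b"JFIF\x00")
    + _segment(0xE1, XMP_HEADER + b"<x:xmpmeta>placeholder</x:xmpmeta>")
    + b"\xff\xd9"
)
found = xmp_packets(fake_jpeg)
print(len(found), "XMP packet(s) found")
```

On a real photo you would read the file in binary mode and pass its bytes to `xmp_packets`. Anything past that point, such as which properties hold the depth map and how it is encoded, would need to be checked against Google's own format documentation.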
For more cool details to geek out over, check out the album of examples put together by the folks over at Google. Some of the changes are pretty subtle, but they're all a major improvement over the stereo PDAF data alone, and we have Google's ridiculous 5-phone case to thank for it.
Header image via Google.
Source: Google AI Blog