Machine learning, particularly in the form of deep convolutional neural networks, is now unquestionably one of the most powerful techniques available to us as software engineers. Within the last decade, nearly every problem we once tackled with heuristics and hand-engineered features has instead moved to learning representations from the large volumes of data that ubiquitous sensing has made available to us. This approach is not only more flexible and more general; it is massively more effective.
As an example, consider image classification. As of 2011, the standard benchmark was ImageNet; the winning entry in the 2010 ImageNet challenge, built on hand-engineered features, had a top-five classification error rate of roughly 28%. The paper that launched the modern deep learning era introduced AlexNet, a deep convolutional neural network, which on entering the challenge in 2012 immediately achieved a top-five error of 15.3%, roughly 10% (absolute) better than any previous entry. That paper combined two key new ideas: using deep convolutional neural networks for image classification and, of equal importance, using GPUs as hardware accelerators to train them. These were not incremental improvements: they redefined the entire field overnight.
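To make the core operation concrete, here is a minimal, illustrative sketch of the 2D convolution that gives these networks their name, written in plain numpy. This is a toy filter applied to a toy image, not anything resembling AlexNet itself; real networks stack many such layers with learned kernels and run them on GPUs.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation: slide the kernel over the
    image and take a dot product at each position."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge detector applied to a tiny image with one bright column.
image = np.zeros((5, 5))
image[:, 2] = 1.0
kernel = np.array([[1.0, 0.0, -1.0]] * 3)  # 3x3 vertical-edge filter
response = conv2d(image, kernel)
print(response.shape)  # (3, 3): strong responses flank the bright column
```

Hand-engineered features amounted to fixing kernels like this one in advance; the deep learning shift was to learn the kernel values themselves from data.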
Subsequent progress has been truly impressive. The ImageNet challenge is itself no longer hard enough to drive progress: top-five classification error rates are now well under 5%, comparable to human performance. In many other fields, deep neural network approaches are now either competitive or superhuman, and they have gone from lab-bench curiosity to ubiquity within five years.
Classification is only one problem, though, and not even the most interesting one when you are building systems for creating virtual reality and augmented reality (XR) content. Related techniques have enabled enormous progress in semantic segmentation (dividing a scene into the objects within it and identifying what they are), predicting depth maps from RGB images, estimating human skeletal pose, and predicting visual saliency (where observers are likely to look in a scene). These, amongst others, are critical problems for XR, and the three pillars of deep convolutional neural networks, hardware acceleration, and high-quality, high-volume data have brought them all from impossibility to tractability within the last few years.
As an XR company, Jaunt cares deeply about all of these problems, and we are uniquely placed to invent, design, and build end-to-end systems that solve them: we simultaneously have state-of-the-art capture systems, a library of high-quality footage, and rich datasets on how people behave in XR environments. There is limitless opportunity; the hardest question is where to start.
This is the third post in a series of blogs from the Jaunt R&D team that will share more about the problems we are solving and how we’re working to help build the future of media. Stay tuned for our next update on the Jaunt Blog.
Interested in joining us? Explore job opportunities with our R&D team here: https://www.jauntvr.com/careers/