Posted by Andrew Walkingshaw, Principal Data Scientist/Software Engineer and Simon Venshtain, Head of Research
Cameras on mobile phones let us capture special moments in our lives at a quality, richness and depth that were unimaginable even fifteen years ago. Whether it’s your child’s first steps, a family portrait or a long-planned vacation, you (and your Instagram account) probably have a photo or video of it.
However, these are just static photographs or 2D recordings. Imagine having captures of these moments in 3D—to be able to walk around and re-live them from any angle. Furthermore, imagine being able to create your own avatar (a 3D model of yourself) and see yourself doing things you might not be able to manage—dance, dunk a basketball, do a backflip—and then being able to share that with your friends and family on social media.
Around two years ago, we realised that all of these possibilities hinge on one particular enabling technology: creating realistic 3D avatars of people at high speed and low cost. At Jaunt, we decided to invest and recruit the experts we needed to make it happen.
Creating volumetric videos and 3D models of humans isn’t just about cameras and traditional computer vision (though those play a large part in it, for sure); it’s about understanding the shape of people, their anatomical structure and how they move. That’s the domain of machine learning: building models which simultaneously detect the major joints in a person’s skeleton and how each pixel in a video frame maps to a part of their body, both in 2D and 3D space. We can then feed this semantic understanding of a person into a skeletally-aware non-rigid reconstruction pipeline; this is the core of XR Cast, our highly portable volumetric capture solution designed for real-time, on-location capture.
There are real engineering challenges too; these models need to run fast on off-the-shelf hardware—no-one wants to wait hours for their avatar—and the videos and models we produce must be small enough to distribute over the air and easy to play back on mobile phones of all types. There’s a lot to this, so let’s start by drilling into the machine learning problem first.
Each time one of our cameras captures a person, we want to simultaneously infer three things:
- the position of a number of skeletal keypoints (principal joints, facial landmarks, etc.);
- which body part, if any, each pixel in the image belongs to;
- and where there are self-occlusions (where one part of your body might be blocking another part).
What’s more, we want to be able to obtain all these outputs, at camera frame rate, for multiple cameras simultaneously, on a single GPU. To achieve this, we have designed and trained a family of multi-task networks which estimate all of these outputs simultaneously to state-of-the-art precision—and where the fastest members of the family of networks can be evaluated at north of 200 frames per second on an off-the-shelf GPU.
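To make the multi-task idea concrete, here is a minimal numerical sketch of a shared feature map feeding three task-specific heads. This is not Jaunt’s actual architecture—the layer names, channel counts and shapes are all illustrative—but it shows why one pass through a shared backbone can yield all three outputs at once:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_stub(x, out_channels):
    """Stand-in for a trained convolutional head: a random 1x1 projection
    of the channel dimension."""
    w = rng.normal(size=(x.shape[0], out_channels))
    return np.einsum("chw,co->ohw", x, w)

def multi_task_heads(features, n_keypoints=17, n_parts=24):
    """One shared feature map, three task-specific heads.

    Returns per-pixel outputs for:
      - keypoint heatmaps (one channel per joint),
      - body-part segmentation (one channel per part, plus background),
      - a self-occlusion mask (a single channel).
    """
    heatmaps  = conv_stub(features, n_keypoints)   # (K, H, W)
    parts     = conv_stub(features, n_parts + 1)   # (P+1, H, W)
    occlusion = conv_stub(features, 1)             # (1, H, W)
    # Per-pixel part label: argmax over the class dimension.
    part_labels = parts.argmax(axis=0)             # (H, W)
    return heatmaps, part_labels, occlusion

# Hypothetical backbone output: (channels, height, width).
features = rng.normal(size=(64, 32, 32))
heatmaps, part_labels, occlusion = multi_task_heads(features)
```

In a real network the heads would be trained convolutions rather than random projections, but the design point carries over: because the expensive backbone is shared, evaluating all three outputs costs barely more than evaluating one—which is what makes camera-frame-rate, multi-camera inference on a single GPU feasible.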
Any network, however, is limited by the availability of training data. Large public annotated datasets are available and we make use of them, but this is also a place where Jaunt’s highly-portable calibrated camera setup is a huge advantage. It makes it much cheaper to capture multi-camera synchronized footage of diverse people in both common and uncommon poses. And we have built workflows to annotate these captures at remarkably low cost (which we will write about in an upcoming post!). This way, we’ve built a feedback loop where we can systematically improve the accuracy and robustness of our models.
With this information on hand, our systems can automatically understand the human body. In addition to improving image fidelity, reducing file size and decreasing the number of cameras required, this allows us to create special effects which intelligently interact with our captures. For example, take a look at the video below, in which the butterflies follow the capture’s right hand, no matter where it goes.
You can only do this by making use of our technology’s understanding of the human body. To create our final avatars, however, these network outputs still need to be combined with depth-sensor and RGB data. That’s where our non-rigid reconstruction pipeline comes in.
3D reconstruction of non-rigid objects is a challenging problem and an active research topic. While capturing a rigid object (say a chair) is relatively simple, capturing a highly articulated object such as the human body is significantly more difficult. Our bodies have multiple joints that allow for a practically infinite combination of movements—and tracking each of these poses several challenges for both distribution and capture.
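To see why articulation makes this hard, consider a toy planar kinematic chain (purely illustrative—the bone lengths and angles are made up): every joint angle is a free, continuous parameter, so each joint multiplies the space of poses a tracker has to cover.

```python
import numpy as np

def rot(theta):
    """2D rotation matrix for angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def forward_kinematics(bone_lengths, joint_angles):
    """Joint positions of a planar kinematic chain.

    Each joint rotates relative to its parent, so even this toy "arm"
    has a continuum of reachable poses per joint.
    """
    pos = np.zeros(2)
    heading = np.eye(2)
    positions = [pos.copy()]
    for length, angle in zip(bone_lengths, joint_angles):
        heading = heading @ rot(angle)        # accumulate parent rotations
        pos = pos + heading @ np.array([length, 0.0])
        positions.append(pos.copy())
    return np.array(positions)

# A three-bone chain (think upper arm, forearm, hand), angles in radians.
joints = forward_kinematics([0.3, 0.25, 0.1], [np.pi / 4, -np.pi / 6, 0.0])
```

A rigid chair has six degrees of freedom in total; a human body has six *plus* dozens of joint angles like these, every one of which the reconstruction has to track through time.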
We’re not the only ones working on this. However, traditional approaches have focused on producing high-quality assets with very large file sizes—and they work only when footage is captured with many high-quality cameras set up in a dedicated studio. With these approaches, any reduction in file size or number of cameras drastically reduces quality.
Recent research has shown that using a fast non-linear solver to track the geometry over time is a promising way to address all of these problems.
The non-rigid motion problem is under-constrained given the noisy set of inputs—so the key challenge in making this approach work is to add regularizers and further constrain the solution space of the problem. One way to constrain the solution is to add context about what is being captured and how it is allowed to deform. Only then can this approach transition from working 20% of the time to 95% of the time. This is where all the semantic information about the human subject comes into play, taking this approach out of the research lab and into a product.
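A toy least-squares analogue shows what a regularizer buys you (this is only a schematic stand-in for a real non-rigid solver; the matrices here are random): with fewer measurements than unknowns, the data term alone is singular, and adding a smoothness prior on the deformation parameters makes the solve well-posed.

```python
import numpy as np

rng = np.random.default_rng(1)

# Data term: 5 noisy measurements of 10 unknown deformation parameters.
A = rng.normal(size=(5, 10))
b = rng.normal(size=5)

# Smoothness regularizer: penalize differences between neighboring
# parameters (a crude stand-in for "nearby surface points move alike").
L = np.diff(np.eye(10), axis=0)   # (9, 10) finite-difference operator

lam = 1.0  # regularization weight

# A alone is rank-deficient (5 equations, 10 unknowns): the normal
# matrix A.T @ A is singular and the unregularized solve is ill-posed.
# Adding lam * L.T @ L shrinks the null space and makes it invertible.
H = A.T @ A + lam * L.T @ L
x = np.linalg.solve(H, A.T @ b)
```

In the real problem the "context" is much richer than a difference operator—body-part labels and skeletal structure tell the solver *which* points should deform together—but the mechanism is the same: priors fill in the constraints the noisy data can’t supply.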
Jaunt has built an end-to-end system combining state-of-the-art machine learning and computational geometry to create full-3D animated models of people at unprecedentedly high speed and low cost. There are no shortcuts here; you have to do everything the hard way. We have, and we’re very proud of it.