Posted by: Andrew Walkingshaw, Principal Data Scientist/Software Engineer; Adam Gaige, Head of Engineering; and Charles Le Pere, Director of Program Management
As we wrote about previously, Jaunt is in the business of human understanding. We design and train neural networks for human pose estimation and body part segmentation: identifying key joints and labelling each pixel in an image with the body part it belongs to, in real time. Clever network design has its limits, however.
A model is, in a sense, a distillation of the data it sees during training. Designing better models helps somewhat, in that they make more efficient use of data; but in practice, the limiting factor is often the quantity and diversity of the training data itself. In order to ship the best volumetric capture system out there, we knew that obtaining better training data at scale would be key. That’s why we developed a bespoke 3D human pose annotation tool.
Training data are labelled examples of a diverse group of people, of all shapes and sizes, in a widely varying range of poses. Typically, labelling is done by hand: someone marks up where all the joints are for each person in each camera view. This is, as you can imagine, very labor-intensive, and several companies have sprung up to manage teams of workers to whom you can outsource the job. The process is unavoidably expensive and, to make matters worse, training high-precision vision models is likely to require many thousands of examples.
That naturally leads you to look for an alternative. At Jaunt, we already have XR Cast, which gives us several significant advantages here. First, a calibrated capture environment: an eight-camera stage where we know the lens model of each camera and the cameras' positions with respect to each other to within millimeters. Second, pre-existing models, trained on public datasets, which give us initial solutions for the skeleton and segmentation problems. And finally, animation and editing tools we can leverage to annotate captures.
Using these, and building on top of our XR Cast client, we developed a bespoke 3D human pose annotation tool. Here’s how it works: when we capture an individual, we label the skeleton once in 3D. Using the calibration data, we can then project those annotations back to each of the eight cameras in the stage. In other words, we get eight annotated frames for the price of one. (In fact, we get sixteen: annotations can be projected from the RGB cameras to their paired depth cameras, producing an annotated dataset on depth captures at the same time.)
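To make the projection step concrete, here is a minimal sketch in Python with NumPy. It is an illustration rather than our production code: it assumes an ideal pinhole camera with no lens distortion (a real calibration would also apply each camera's lens model), and the intrinsics `K`, the ring-shaped rig built by `ring_camera`, and the three joint positions are all made up for the example.

```python
import numpy as np

def project_points(points_3d, K, R, t):
    """Project (N, 3) world points into one camera; returns (N, 2) pixels."""
    cam = points_3d @ R.T + t          # world -> camera coordinates
    uvw = cam @ K.T                    # camera -> homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]    # perspective divide

# Hypothetical intrinsics, shared by every camera in this toy rig.
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])

def ring_camera(angle, radius=3.0):
    """Extrinsics for a camera on a ring, looking at the stage centre."""
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, 0.0, s],
                  [0.0, 1.0, 0.0],
                  [-s, 0.0, c]])
    t = np.array([0.0, 0.0, radius])   # stage centre sits `radius` in front
    return R, t

cameras = [ring_camera(2 * np.pi * i / 8) for i in range(8)]

# One 3D annotation: three joints in metres (illustrative values only).
skeleton_3d = np.array([[0.20, 0.40, 0.00],
                        [0.30, 0.10, 0.05],
                        [0.35, -0.15, 0.10]])

# The single 3D label becomes one 2D label set per camera.
labels_2d = np.stack([project_points(skeleton_3d, K, R, t) for R, t in cameras])
print(labels_2d.shape)  # (8, 3, 2): cameras x joints x (u, v)
```

The annotator only ever edits `skeleton_3d`; every per-camera label is derived from it, which is why the labels agree across views by construction.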
What’s more, using the models we already have, we can infer good initial estimates for all of the joint positions in 3D! This turns the arduous task of labelling eight images by hand in 2D into the much more straightforward task of correcting approximate 3D joint positions once to get the same volume of training examples.
Finally, we leveraged and extended our editing, animation and timeline controls to annotate across ranges of frames in an efficient rough-to-fine workflow. Take a look at how it works below:
As a result, our annotation is both more precise (the training data is, by construction, consistent across all eight cameras) and substantially less laborious. It’s also much cheaper: the cost per example is around a fifth of the lowest quote from major annotation outsourcers.
This makes building training sets for new, specialized models economically feasible in a way it wasn’t before. And, what’s more, it provides a valuable data set for a wide range of use cases outside of volumetric capture. From robotics to autonomous vehicles to retail—we believe there are many industries that will benefit from a deep understanding of the human body and how it moves. Moreover, we believe these training sets will be particularly useful for difficult settings like certain sports, dance or yoga, where there’s a lot of self-occlusion (parts of the body hide other parts of the body).
But the work doesn’t stop there: we’re designing our system to learn from unlabelled as well as labelled data. Warning: the following is a little more technical than the rest of this article, so feel free to skip ahead to the end, but the central idea is straightforward. As we mentioned above, the predictions made by the network for each camera must be self-consistent. What does that mean? Intuitively, it means that they have to line up with each other: if predictions from one camera imply, for instance, that your arm is straight, predictions from another camera facing you from a different angle must imply the same thing. If they don’t, at least one of these sets of predictions must be wrong! After all, your arm can’t be both bent and straight at the same time.
How do you express this mathematically? Consider, say, elbows: your left elbow lives at one, and only one, location in 3D space. We can use the predicted 2D position of the elbow in each camera to estimate that 3D position, and we can then use each camera's projection equations to project the 3D estimate back into that camera, giving a second 2D point per camera.
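The round trip from 2D predictions to a 3D estimate and back can be sketched as follows. This is a hedged illustration, not a description of our pipeline: it uses standard linear (DLT) triangulation, assumes distortion-free pinhole cameras, and the rig, the "true" elbow position and the simulated 2-pixel prediction noise are invented for the example.

```python
import numpy as np

def projection_matrix(K, R, t):
    """3x4 matrix taking homogeneous world points to homogeneous pixels."""
    return K @ np.hstack([R, t.reshape(3, 1)])

def project(point_3d, P):
    """Project one 3D point through projection matrix P; returns (u, v)."""
    x = P @ np.append(point_3d, 1.0)
    return x[:2] / x[2]

def triangulate(points_2d, proj_mats):
    """Linear least-squares (DLT) estimate of one 3D point from N views."""
    rows = []
    for (u, v), P in zip(points_2d, proj_mats):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(np.stack(rows))
    X = vt[-1]
    return X[:3] / X[3]

# Hypothetical rig: eight identical pinhole cameras on a 3 m ring.
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
proj_mats = []
for i in range(8):
    a = 2 * np.pi * i / 8
    c, s = np.cos(a), np.sin(a)
    R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    proj_mats.append(projection_matrix(K, R, np.array([0.0, 0.0, 3.0])))

elbow_true = np.array([0.3, 0.1, 0.05])
clean_2d = np.array([project(elbow_true, P) for P in proj_mats])

# Stand-in for network output: the true pixels plus about 2 px of noise.
rng = np.random.default_rng(0)
predicted_2d = clean_2d + rng.normal(scale=2.0, size=clean_2d.shape)

elbow_3d = triangulate(predicted_2d, proj_mats)
reprojected_2d = np.array([project(elbow_3d, P) for P in proj_mats])

# Per-camera pixel distance between each prediction and its reprojection.
print(np.linalg.norm(reprojected_2d - predicted_2d, axis=1))
```

With noise-free inputs the printed distances would all be zero; the noisier or less consistent the per-camera predictions, the larger they become.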
If your network makes perfect, and perfectly consistent, predictions for every camera, the original 2D location and the new one reprojected from the 3D estimate will coincide, so the distance between those points will be zero; the less consistent the predictions, the larger the distance. This gives a workable definition of error (or loss) for training: the (squared) distance between these two points.
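Written out, one plausible form of this loss is shown below. The notation is introduced here for illustration and is an assumption: \hat{p}_{c,j} is the network's predicted 2D position of joint j in camera c, \hat{X}_j is the 3D position estimated from those predictions across all C cameras, and \pi_c is the projection function of camera c. Summing over joints and cameras, rather than averaging or weighting by confidence, is also an assumption.

```latex
\mathcal{L}_{\text{consistency}}
  = \sum_{j=1}^{J} \sum_{c=1}^{C}
    \left\lVert \pi_c\!\left(\hat{X}_j\right) - \hat{p}_{c,j} \right\rVert_2^2
```

Each term is zero exactly when the prediction in camera c agrees with what the other cameras jointly imply, and no hand-made label appears anywhere in the expression.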
So minimizing this error corresponds to improving the model. That gives you an approach for learning from recordings which haven’t been annotated. This is less efficient, of course, than learning from annotated data, but unannotated data is much cheaper to obtain at scale.
Of course, you can use both approaches at once. When you pair up this source of signal with the annotated data—an approach called semi-supervised learning—you can stretch your annotation budget even further (approaches of this kind have been successful in other fields, notably machine translation).
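As a rough sketch of how the two signals might be combined, the snippet below adds a weighted consistency term to an ordinary supervised term. The function names, the `weight` hyperparameter and the toy pixel coordinates are all hypothetical; in practice the relative weighting would have to be tuned.

```python
import numpy as np

def mean_squared_distance(a, b):
    """Mean over points of the squared Euclidean distance between a and b."""
    return float(np.mean(np.sum((np.asarray(a) - np.asarray(b)) ** 2, axis=-1)))

def semi_supervised_loss(pred_labelled, labels, pred_unlabelled, reprojected,
                         weight=0.5):
    """Supervised error plus a weighted multi-view consistency error.

    `weight` is a hypothetical hyperparameter balancing the two terms.
    """
    supervised = mean_squared_distance(pred_labelled, labels)
    consistency = mean_squared_distance(pred_unlabelled, reprojected)
    return supervised + weight * consistency

# Toy pixel coordinates, made up for illustration.
pred_labelled = [[100.0, 200.0], [150.0, 250.0]]   # predictions on annotated frames
labels = [[101.0, 200.0], [150.0, 252.0]]          # their annotations
pred_unlabelled = [[300.0, 400.0]]                 # prediction on an unannotated frame
reprojected = [[303.0, 404.0]]                     # its multi-view reprojection

total = semi_supervised_loss(pred_labelled, labels, pred_unlabelled, reprojected)
print(total)  # 2.5 + 0.5 * 25.0 = 15.0
```

The annotated frames anchor the model to the right answer, while the unannotated frames only penalize disagreement between cameras.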
All of this, taken together, enables us to build state-of-the-art models for 2D and 3D human pose estimation which have the prospect of working in a wider range of domains, for a wider range of people, at a lower cost than has previously been achievable. And that’s why we’re really excited by the new possibilities this opens up.