Current bottom-up approaches for 2D multi-person pose estimation (MPPE) detect joints collectively without distinguishing between individuals. Associating the joints to individuals is done independently of the learning algorithm, therefore requires formulating a separate problem in a post-processing step that relies on relaxations or sophisticated heuristics. We propose a differentiable learning-based model that performs part detection and association jointly, thereby eliminating the need for further post-processing. The approach introduces a recurrent neural network (RNN), which takes dense low-level features as input and predicts the heatmaps of a single person joints in each iteration, then refines them using a feedback loop. In addition, the network learns a stopping criterion in order to halt once it has identified all individuals in an image, allowing it to output any number of poses. Furthermore, we introduce an efficient implementation that allows training on memory-constrained machines. The approach is generic and can be combined with any bottom-up approach. The approach is evaluated on the challenging MSCOCO and OCHuman datasets and achieves a substantial improvement over the baseline. On OCHuman, which contains severe occlusions, we achieve state-of-the-art results even compared to top-down approaches. Our results demonstrate the advantage of a learning-based detection and association framework, and bottom-up approaches over top-down approaches in challenging scenarios.
Authors: R. Briq, A. Doering, J. Gall