GenHMR: A Generative Approach for Body Mesh Recovery from Monocular Images | Turtles AI

GenHMR: A Generative Approach for Body Mesh Recovery from Monocular Images
A probabilistic method to improve 3D reconstruction by addressing image uncertainties and occlusions
Isabella V, 2 February 2025

 

GenHMR is a generative framework for recovering 3D human meshes from monocular images. It aims to combine the strengths of probabilistic and deterministic methods: it explicitly models the uncertainties and occlusions that complicate the 2D-to-3D mapping, then progressively narrows them down to a single 3D pose prediction that is more accurate and robust in complex scenes.

Key Points:

  • GenHMR improves 3D human mesh recovery from monocular images.
  • It introduces “uncertainty-guided sampling” to reduce reconstruction uncertainties.
  • The model improves prediction accuracy through 2D pose-guided refinement.
  • GenHMR outperforms earlier methods even on complex poses.

In computer vision, 3D human mesh recovery (HMR) from monocular images is an important task, with applications ranging from healthcare to the entertainment industry. To date, most methods have been deterministic, producing a single prediction for each 2D image. A single image, however, leaves depth ambiguous and often hides parts of the body, so several different 3D poses can explain the same picture, which makes accurate reconstruction difficult. Probabilistic approaches were developed to address this by generating multiple plausible 3D reconstructions, but so far they have been less accurate than deterministic ones.

GenHMR is a new framework that reformulates HMR as an image-conditioned generative task, in which the uncertainties of the 2D-to-3D mapping are explicitly modeled and reduced. The approach rests on two main components: the “Pose Tokenizer” and the “Image-Conditional Masked Transformer”. The former converts a 3D human pose into a sequence of discrete tokens in a latent space, while the latter learns the probability distributions of these tokens, conditioned on the input image and on a partially masked token sequence. During inference, the model samples pose tokens iteratively, keeping the high-confidence ones first, which progressively reduces the uncertainty of the 3D reconstruction. To further improve reconstruction quality, GenHMR adds a refinement step that uses the 2D pose as a guide, adjusting the pose tokens so that the 3D mesh is more consistent with the 2D pose cues observed in the image.
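The iterative, confidence-first sampling described above can be illustrated with a small toy. This is only a sketch under stated assumptions, not GenHMR's actual implementation: `toy_logits` is a hypothetical stand-in for the image-conditioned masked transformer, the cosine unmasking schedule is borrowed from masked generative decoding in general, and all names, sizes and iteration counts are made up for illustration.

```python
import math
import random

MASK = -1  # placeholder id for a pose token that has not been decided yet


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_logits(tokens, num_tokens, codebook_size):
    """Hypothetical stand-in for the image-conditioned masked transformer.

    A real model would use image features and the already decided tokens;
    this toy ignores both and returns fixed pseudo-random scores.
    """
    rng = random.Random(0)
    return [[rng.gauss(0.0, 1.0) for _ in range(codebook_size)]
            for _ in range(num_tokens)]


def uncertainty_guided_decode(predict_logits, num_tokens, codebook_size,
                              num_iters=4, seed=0):
    """Start fully masked; at each step commit the most confident samples."""
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens
    for step in range(1, num_iters + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        logits = predict_logits(tokens, num_tokens, codebook_size)
        # Sample a candidate for every masked position and record the
        # probability the model assigned to it (its confidence).
        candidates = {}
        for i in masked:
            probs = softmax(logits[i])
            tok = rng.choices(range(codebook_size), weights=probs, k=1)[0]
            candidates[i] = (tok, probs[tok])
        # Assumed cosine schedule: how many tokens stay masked after this step.
        if step == num_iters:
            keep_masked = 0
        else:
            keep_masked = int(num_tokens * math.cos(math.pi / 2 * step / num_iters))
        num_commit = max(1, len(masked) - keep_masked)
        # Commit the highest-confidence candidates; the rest stay masked
        # and are re-sampled in the next iteration.
        ranked = sorted(masked, key=lambda i: candidates[i][1], reverse=True)
        for i in ranked[:num_commit]:
            tokens[i] = candidates[i][0]
    return tokens


pose_tokens = uncertainty_guided_decode(toy_logits, num_tokens=8,
                                        codebook_size=16, num_iters=4)
```

With these toy settings only a few tokens are fixed in the early iterations and the remainder in the later ones, so low-confidence positions are decided last, once more context is available.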

According to experiments on benchmark datasets, GenHMR achieves a significant performance improvement over state-of-the-art approaches such as HMR2.0 and TokenHMR. These earlier methods, although based on vision transformers, struggle with complex or ambiguous poses and often produce significant reconstruction errors. GenHMR, in contrast, models the uncertainty of the mapping process directly, and so obtains more accurate and robust 3D pose reconstructions even in complex scenarios.

The core of this advance is the "Uncertainty-Guided Sampling" (UGS) process, in which the model samples pose tokens with high confidence first, gradually reducing the uncertainty of the reconstruction. It is complemented by "2D Pose-Guided Refinement", which optimizes the pose tokens to align better with the 2D pose information and so further improves the accuracy of the 3D mesh. Each refinement iteration reduces the error a little more, significantly improving the quality of the reconstruction. This is particularly useful for difficult poses, where occlusions and depth ambiguities cause serious problems for earlier methods.
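The idea behind 2D pose-guided refinement can be sketched as gradient descent on a reprojection error. The example below is a deliberately simplified toy and not the authors' code: the linear `decode` function stands in for the real pose decoder, the projection is orthographic, and the latent vector `z`, the `base` pose, the `basis` directions, the step count and the learning rate are all hypothetical.

```python
def decode(z, base, basis):
    """Hypothetical linear decoder: latent vector -> list of 3D joints."""
    joints = [list(p) for p in base]
    for zk, bk in zip(z, basis):
        for j, (dx, dy, dz) in enumerate(bk):
            joints[j][0] += zk * dx
            joints[j][1] += zk * dy
            joints[j][2] += zk * dz
    return joints


def project(joints):
    """Orthographic projection (an assumption): simply drop the depth."""
    return [(x, y) for x, y, _ in joints]


def reprojection_loss(z, base, basis, keypoints_2d):
    """Sum of squared distances between projected joints and 2D keypoints."""
    proj = project(decode(z, base, basis))
    return sum((px - kx) ** 2 + (py - ky) ** 2
               for (px, py), (kx, ky) in zip(proj, keypoints_2d))


def refine(z, base, basis, keypoints_2d, steps=50, lr=0.1):
    """Plain gradient descent on the reprojection loss with respect to z."""
    z = list(z)
    for _ in range(steps):
        proj = project(decode(z, base, basis))
        grad = []
        for bk in basis:
            g = 0.0
            # Analytic gradient of the squared error for a linear decoder.
            for (px, py), (kx, ky), (dx, dy, _dz) in zip(proj, keypoints_2d, bk):
                g += 2.0 * ((px - kx) * dx + (py - ky) * dy)
            grad.append(g)
        z = [zk - lr * gk for zk, gk in zip(z, grad)]
    return z


# Toy setup: two joints, two latent dimensions.
base = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
basis = [
    [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)],  # moves joint 0 along x
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.5)],  # moves joint 1 along x and depth
]
z_true = [0.3, -0.2]
keypoints_2d = project(decode(z_true, base, basis))  # simulated 2D detections

z_init = [0.0, 0.0]
z_refined = refine(z_init, base, basis, keypoints_2d)
```

In this toy the 2D keypoints say nothing about depth directly: the depth of joint 1 changes only because it is tied to the same latent dimension as its x-coordinate. That is a small-scale picture of why a 2D pose can guide a 3D estimate but cannot resolve every ambiguity on its own.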

GenHMR represents a significant step forward in 3D human mesh recovery from monocular images, addressing the main issues related to uncertainties and occlusions.

Its approach, based on uncertainty-guided sampling and 2D pose-guided refinement, raises the bar for accuracy and reliability in this field.