Authors & Affiliations
Haozhe Qi, Chen Zhao, Mathieu Salzmann, Alexander Mathis
Abstract
Hands are highly articulated and versatile instruments for object manipulation. Many domains in neuroscience, from motor control to translational neuroscience, benefit greatly from assessments of hand function. Estimating hand poses from a single camera view allows for a simple experimental setup without the need for calibration and synchronization. However, frequent occlusions between the digits and the objects they interact with challenge state-of-the-art pose estimation methods. Thus, instead of focusing on the hand pose on its own, we leverage the mutual constraints of the hand and the object it interacts with to estimate their 3D poses jointly. To provide 3D interaction shape constraints in a global context, we developed HOISDF, a Signed Distance Field (SDF) guided hand-object pose estimation network, which jointly exploits hand and object SDFs to provide an implicit representation over the complete reconstruction volume. Specifically, the role of the SDFs is threefold: to equip the visual encoder with implicit shape information, to help encode hand-object interactions, and to guide the hand and object pose regression via SDF-based sampling and by augmenting the feature representations. We show that HOISDF achieves state-of-the-art results on the relevant hand-object pose estimation benchmarks (DexYCB and HO3Dv2). At inference time, HOISDF retains its efficient design and does not need access to any SDF data, achieving high inference speeds (10.6 ms for image feature extraction, 11.5 ms for query point sampling, and 10.9 ms for pose attention and regression). We think HOISDF opens up novel avenues for assessing hand function, particularly during interactions with objects captured by a single cell phone camera.
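For readers unfamiliar with signed distance fields, the following is a minimal illustrative sketch of the general idea behind SDF-based query point sampling. It is not the HOISDF implementation: the network described above learns hand and object SDFs from images, whereas this sketch assumes two analytic spheres as hypothetical stand-ins for the hand and the object. All function names, the sampling volume, and the distance threshold are assumptions chosen for illustration.

```python
import math
import random


def sphere_sdf(point, center, radius):
    """Signed distance to a sphere: negative inside, zero on the surface, positive outside."""
    return math.dist(point, center) - radius


def hand_sdf(p):
    # Hypothetical stand-in for a hand SDF (a real hand shape is far more complex).
    return sphere_sdf(p, (-0.3, 0.0, 0.0), 0.4)


def object_sdf(p):
    # Hypothetical stand-in for an object SDF.
    return sphere_sdf(p, (0.3, 0.0, 0.0), 0.3)


def sample_near_surface(sdf_fns, n_candidates, threshold, seed=0):
    """Draw random points in the cube [-1, 1]^3 and keep those lying within
    `threshold` of at least one surface, together with their signed distances."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_candidates):
        p = tuple(rng.uniform(-1.0, 1.0) for _ in range(3))
        dists = [f(p) for f in sdf_fns]
        if min(abs(d) for d in dists) < threshold:
            kept.append((p, dists))
    return kept


queries = sample_near_surface([hand_sdf, object_sdf], n_candidates=2000, threshold=0.05)
print(f"kept {len(queries)} of 2000 candidate points near a surface")
```

Each retained query point carries its signed distance to both shapes, so points that are close to the hand and the object at the same time mark likely contact regions. This illustrates why a joint field over the whole volume can provide interaction constraints even where one surface is occluded in the image.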