GestureLSM

arXiv 2024

We present GestureLSM, a method that explicitly models body part interactions to achieve smooth overall gesture motion while remaining capable of real-time generation through shortcut sampling.

Abstract

Controlling human gestures based on speech signals presents a significant challenge in computer vision. While existing works have made preliminary studies of generating holistic co-speech gestures from speech, the spatial interactions among body regions during speech remain barely explored, which leads to implausible body part interactions given the speech signal. Furthermore, slow generation speed limits the construction of real-world digital avatars. To resolve these problems, we propose GestureLSM, a latent shortcut based approach for co-speech gesture generation with spatial-temporal modeling. We tokenize various body regions and explicitly model their interactions with spatial and temporal attention. To achieve real-time gesture generation, we examine the denoising patterns and design an effective time distribution that speeds up sampling while improving the generation quality of the shortcut model. Extensive quantitative and qualitative experiments demonstrate the effectiveness of GestureLSM, showcasing its potential for various applications in the development of digital humans and embodied agents.
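As a rough illustration of the real-time claim: a shortcut model conditions the denoiser on the step size, so a few large sampling steps can stand in for many small ones. Below is a minimal sampling-loop sketch, not the authors' code, under assumed interfaces: `model(x, t, d, cond)` predicts a velocity valid over a step of size `d`, and `decoder` maps gesture latents back to motion; both names are hypothetical.

```python
import torch

@torch.no_grad()
def sample_gestures(model, decoder, speech_cond, latent_shape,
                    num_steps=4, device="cuda"):
    """Generate gesture latents in a few shortcut steps, then decode.
    `model` and `decoder` are assumed interfaces, not the paper's API."""
    x = torch.randn(latent_shape, device=device)   # start from Gaussian noise
    d = 1.0 / num_steps                            # constant shortcut step size
    for i in range(num_steps):
        t = torch.full((latent_shape[0],), i * d, device=device)
        step = torch.full_like(t, d)
        v = model(x, t, step, speech_cond)         # step-size-conditioned velocity
        x = x + d * v                              # one Euler-style shortcut step
    return decoder(x)                              # latents -> gesture motion
```

With `num_steps=1` this degenerates to one-step generation; the paper's time-distribution design concerns where training places the (t, d) pairs, which this sketch does not cover.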

Method

Left: gesture representations for different body regions are encoded with residual vector quantization (RVQ). Right: spatial-temporal attention with positional encoding learns the interactions among body regions given the speech input. A minimal code sketch of both components follows below.
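To make the two panels concrete, here is a minimal PyTorch sketch, not the authors' implementation: a residual quantizer in the spirit of the left panel and a factorized spatial-temporal attention block in the spirit of the right panel. All dimensions, the number of regions, and the learned positional encodings are illustrative assumptions; straight-through gradients and training losses are omitted.

```python
import torch
import torch.nn as nn

class ResidualVQ(nn.Module):
    """Residual vector quantization: each codebook quantizes the residual
    left over by the previous one. Sizes are illustrative assumptions."""
    def __init__(self, num_quantizers=4, codebook_size=512, dim=128):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(num_quantizers)
        )

    def forward(self, z):                          # z: (batch, frames, dim), one body region
        residual, quantized = z, torch.zeros_like(z)
        for cb in self.codebooks:
            # squared distance to every codebook entry: (batch, frames, codebook_size)
            dists = (residual.unsqueeze(-2) - cb.weight).pow(2).sum(-1)
            q = cb(dists.argmin(dim=-1))           # nearest entry per token
            quantized = quantized + q
            residual = residual - q                # next codebook sees what is left
        return quantized

class SpatialTemporalBlock(nn.Module):
    """Factorized attention: spatial attention mixes body-region tokens within
    each frame; temporal attention mixes frames within each region."""
    def __init__(self, dim=128, heads=4, regions=4, max_frames=256):
        super().__init__()
        self.region_pos = nn.Parameter(torch.zeros(regions, dim))    # spatial positions
        self.frame_pos = nn.Parameter(torch.zeros(max_frames, dim))  # temporal positions
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                          # x: (batch, frames, regions, dim)
        b, t, r, d = x.shape
        x = x + self.region_pos + self.frame_pos[:t, None, :]
        s = x.reshape(b * t, r, d)                 # attend across regions per frame
        x = x + self.spatial(s, s, s)[0].reshape(b, t, r, d)
        m = x.transpose(1, 2).reshape(b * r, t, d) # attend across frames per region
        x = x + self.temporal(m, m, m)[0].reshape(b, r, t, d).transpose(1, 2)
        return x
```

In this reading, one quantizer per body region (e.g., upper body, hands, lower body) produces the region tokens, and a stack of such attention blocks mixes them conditioned on speech features; the conditioning pathway is omitted here for brevity.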

BibTeX

